## Class 9 - Part 2 
### Advanced Machine Learning Concepts
In this practical session, we will delve into several advanced machine learning concepts crucial for developing robust predictive models. Here's an overview of the key topics we will explore:

1. **Curse of Dimensionality:** We will explore how high-dimensional spaces increase the sparsity of data, making it difficult to effectively train and optimize machine learning models due to increased computational complexity and data requirements.
2. **Linear Discriminant Analysis (LDA):** This topic covers LDA, a dimensionality reduction technique that is also used for classification. It seeks to maximize the separability among known categories.
3. **Hyperparameter Tuning:** We'll discuss methods to select the best set of hyperparameters for a given machine learning model, enhancing its performance on unseen data.
   - **Grid Search:** This is a technique for hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
4. **Metrics in Machine Learning:** We will review various metrics used to evaluate the performance of machine learning models, such as accuracy, precision, recall, F1-score, and AUC-ROC among others.
   - **Confusion Matrix:** An important tool to summarize the performance of a classification algorithm. It provides insights into the types of errors made by the model.
5. **k-Fold Cross-Validation:** This topic covers the technique of dividing the dataset into k subsets (folds) and using each in turn to test a model trained on the remaining k-1 subsets. It is used to estimate how well the model will perform on new data.

This session will provide hands-on experience in implementing these concepts in Python, using various libraries to help illustrate these advanced techniques in action.
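
To make the curse of dimensionality from the overview concrete, here is a minimal sketch under stated assumptions: points are drawn uniformly from the unit hypercube, and the helper name `relative_contrast`, the sample size, and the seed are illustrative choices, not part of the original session. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance. As the number of dimensions grows, this contrast tends to shrink, which is one reason distance-based methods such as KNN can struggle in high dimensions.

```python
import numpy as np


def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for distances from one random query
    point to n_points points drawn uniformly from the unit hypercube."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


rng = np.random.default_rng(0)  # Fixed seed so the sketch is reproducible.
dims = [2, 10, 100, 1000]
contrasts = {d: relative_contrast(1000, d, rng) for d in dims}

for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

With the contrast close to zero, the nearest and farthest neighbors are almost equally far away, so "nearest" carries little information.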

In [0]:
# Import necessary libraries and modules for data manipulation, machine learning, and metric evaluation:
import numpy as np  # Import NumPy, a library for numerical operations like matrix manipulations and advanced mathematical functions.
import pandas as pd  # Import pandas, a library for data manipulation and analysis, particularly useful for handling structured data like data frames.
import seaborn as sns  # Import seaborn, a Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics.
import matplotlib.pyplot as plt  # Import pyplot from matplotlib to create figures and plots, essential for data visualization in Python.

# Import specific functions and classes from scikit-learn (sklearn):
from sklearn.datasets import load_iris  # Import the function to load the Iris dataset, a classic dataset in pattern recognition.
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold  # Import functions and classes for splitting data into train/test sets, cross-validating scores, and creating stratified folds for cross-validation.
from sklearn.neighbors import KNeighborsClassifier  # Import the k-Nearest Neighbors classifier, a simple yet effective classification algorithm.
from sklearn.svm import SVC  # Import the Support Vector Machine classifier, powerful for medium-sized datasets and complex problems.
from sklearn.naive_bayes import GaussianNB  # Import the Gaussian Naive Bayes classifier, effective for classification based on applying Bayes' theorem with strong (naive) feature independence assumptions.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report  # Import various performance metrics to evaluate the accuracy, precision, recall, F1 score, and more comprehensive classification reports.

In [0]:
# Load the Iris dataset from sklearn's dataset library. The Iris dataset is commonly used for testing machine learning algorithms.
iris = load_iris()

# Assign the feature data to 'X' and the target labels to 'y'. 
# 'X' contains the attributes (sepal length, sepal width, petal length, petal width), 
# while 'y' contains the class labels (species of iris flowers).
X = iris.data
y = iris.target

# Split the dataset into training and testing sets. 
# 'test_size=0.3' configures 30% of the dataset to be used as the test set. 
# 'random_state=22' ensures that the splits are reproducible and consistent across different runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)
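
The split above does not pass the 'stratify' argument, so the class proportions in the training and test sets can differ slightly from those of the full dataset. The following sketch, assuming we simply want to compare the two options, counts the samples per class in each case. The variable names with the suffix `_s` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random split, with the same settings as in the cell above.
_, _, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

# Stratified split: class proportions are preserved in both subsets.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=22, stratify=y
)

print("Random split, test counts per class:    ", np.bincount(y_test, minlength=3))
print("Stratified split, test counts per class:", np.bincount(y_test_s, minlength=3))
```

For a small, balanced dataset like Iris the difference is usually minor, but stratification can matter more when classes are imbalanced.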

### Data Description:
The Iris dataset is one of the most well-known datasets in the field of machine learning. It was introduced by British statistician and biologist Ronald Fisher in 1936 as an example of linear discriminant analysis. This dataset is often used for classification tasks and testing machine learning algorithms.

The Iris dataset consists of 150 instances of iris flowers (50 per species), each with four attributes: sepal length, sepal width, petal length, and petal width. All measurements are in centimeters. The task is to classify each instance into one of three species, encoded in the target labels as:
 - Iris-setosa (0)
 - Iris-versicolor (1)
 - Iris-virginica (2)

![Iris](https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg)
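
Since the dataset was introduced as an example of linear discriminant analysis, a short sketch of LDA on Iris may be useful here. It is a minimal illustration, not the session's official solution: it fits scikit-learn's `LinearDiscriminantAnalysis` on the full dataset and projects the four features onto two discriminant axes (with three classes, LDA can produce at most two). The variable names are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With 3 classes, LDA yields at most 3 - 1 = 2 discriminant components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # LDA is supervised, so it needs the labels.

print("Projected shape:", X_lda.shape)
print("Explained variance ratio per component:", lda.explained_variance_ratio_)

# LDA also works as a classifier; this is accuracy on the data it was fitted on,
# so it is an optimistic estimate.
train_accuracy = lda.score(X, y)
print("Training accuracy:", train_accuracy)
```

The projected data in `X_lda` could then be plotted or passed to another classifier in place of the original four features.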

In [0]:
# Define classifiers with their respective hyperparameters:
# Initialize the K-Nearest Neighbors (KNN) classifier with 3 neighbors. 
# This setting specifies that the label of a new point is predicted based on the majority vote of its three nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)

# Initialize the Support Vector Machine (SVM) with a linear kernel. 
# The linear kernel is chosen for its simplicity and effectiveness in linearly separable data sets.
svm = SVC(kernel='linear')

# Initialize the Gaussian Naive Bayes classifier, 
# which applies Naive Bayes classification methods assuming that features follow a Gaussian distribution.
nb = GaussianNB()

# Fit models to the training data:
# Train the KNN model using the training data and labels.
knn.fit(X_train, y_train)

# Train the SVM model using the training data and labels. This involves finding the maximum-margin hyperplanes that separate the classes (SVC handles the three classes with a one-vs-one scheme).
svm.fit(X_train, y_train)

# Train the Gaussian Naive Bayes model using the training data and labels. 
# This model estimates the mean and variance of each feature within each class and combines these Gaussian likelihoods with the class priors to predict the most probable class.
nb.fit(X_train, y_train)
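
To illustrate the majority vote that KNN performs, here is a minimal NumPy sketch on a toy dataset. The function `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical names introduced for this example; scikit-learn's `KNeighborsClassifier` is more efficient and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]  # Indices of the k closest points.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters, one per class.
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]))
pred_b = knn_predict(X_toy, y_toy, np.array([4.9, 5.0]))
print(pred_a, pred_b)  # prints "0 1"
```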

In [0]:
# Generate predictions using the fitted models:
pred_knn = knn.predict(X_test)  # Use the KNN model to predict the labels for the test data.
pred_svm = svm.predict(X_test)  # Use the SVM model to predict the labels for the test data.
pred_nb = nb.predict(X_test)    # Use the Naive Bayes model to predict the labels for the test data.

# Create confusion matrices for each model to evaluate their performance:
cm_knn = confusion_matrix(y_test, pred_knn)  # Confusion matrix for KNN predictions.
cm_svm = confusion_matrix(y_test, pred_svm)  # Confusion matrix for SVM predictions.
cm_nb = confusion_matrix(y_test, pred_nb)    # Confusion matrix for Naive Bayes predictions.

# Display the confusion matrices as seaborn heatmaps (rows = true labels, columns = predicted labels):
fig, ax = plt.subplots(1, 3, figsize=(18, 6))  # Create a figure with three subplots.
labels = list(iris.target_names)  # Species names used as tick labels.
matrices = [(cm_knn, 'KNN'), (cm_svm, 'SVM'), (cm_nb, 'Naive Bayes')]
for axis, (cm, name) in zip(ax, matrices):
    sns.heatmap(cm, annot=True, ax=axis, cmap='Blues', fmt='g',
                xticklabels=labels, yticklabels=labels)  # Draw one confusion matrix per subplot.
    axis.set_title(f'{name} Confusion Matrix')  # Set the title for this model's plot.
    axis.set_xlabel('Predicted label')
    axis.set_ylabel('True label')
plt.show()  # Render the figure.

# Print classification reports to provide a comprehensive overview of the performance of each classifier:
print("KNN Classification Report:\n", classification_report(y_test, pred_knn))  # Print KNN classification report.
print("SVM Classification Report:\n", classification_report(y_test, pred_svm))  # Print SVM classification report.
print("Naive Bayes Classification Report:\n", classification_report(y_test, pred_nb))  # Print Naive Bayes classification report.
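
The figures in a classification report can be derived directly from a confusion matrix. The sketch below shows the arithmetic on a hypothetical 3-class matrix; the numbers are illustrative and are not results from this notebook. It follows scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm = np.array([[15, 0, 0],
               [0, 13, 2],
               [0, 1, 14]])

true_positives = np.diag(cm)

# Precision: of everything predicted as class c, how much really is class c?
precision = true_positives / cm.sum(axis=0)

# Recall: of everything that really is class c, how much was found?
recall = true_positives / cm.sum(axis=1)

# F1-score: harmonic mean of precision and recall, per class.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: fraction of all samples on the diagonal.
accuracy = true_positives.sum() / cm.sum()

print("Precision per class:", np.round(precision, 3))
print("Recall per class:   ", np.round(recall, 3))
print("F1 per class:       ", np.round(f1, 3))
print("Accuracy:           ", round(accuracy, 3))
```

The same values can be obtained from the imported `precision_score`, `recall_score`, and `f1_score` functions by passing `average=None` together with the true and predicted labels.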

In [0]:
# Initialize the k-Fold validation:
# Create an instance of StratifiedKFold, which ensures that each fold of the dataset has the same proportion of examples in each class,
# using 5 splits, which is a common choice for k-fold cross-validation.
kfold = StratifiedKFold(n_splits=5)

# Calculate cross-validation scores for each classifier:
# Compute cross-validation scores for the KNN classifier. This function evaluates the model using the k-fold cross-validation method,
# returning the accuracy for each fold, ensuring a robust estimate of the model's performance.
scores_knn = cross_val_score(knn, X, y, cv=kfold)

# Compute cross-validation scores for the SVM classifier. Similarly, this evaluates the SVM model across each fold defined by StratifiedKFold, providing accuracy scores that help assess the overall effectiveness of the model.
scores_svm = cross_val_score(svm, X, y, cv=kfold)

# Compute cross-validation scores for the Naive Bayes classifier. This uses the same k-fold cross-validation to evaluate how the GaussianNB model performs on different subsets of the dataset, giving a clear picture of its generalization capability.
scores_nb = cross_val_score(nb, X, y, cv=kfold)

# Print the cross-validation scores for each model to see their performance stability and effectiveness:
print("KNN Cross-Validation Scores:", scores_knn)  # Output the array of scores for KNN from each fold.
print("SVM Cross-Validation Scores:", scores_svm)  # Output the array of scores for SVM from each fold.
print("Naive Bayes Cross-Validation Scores:", scores_nb)  # Output the array of scores for Naive Bayes from each fold.
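
Per-fold scores are easier to compare when summarized. A common convention, sketched here for the KNN model only, is to report the mean and the standard deviation across folds. The snippet reloads the data so that it runs on its own; the name `scores` is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Same setup as above: 5 stratified folds and a 3-neighbor KNN classifier.
kfold = StratifiedKFold(n_splits=5)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kfold)

# The mean summarizes typical performance; the standard deviation shows
# how much the score varies from fold to fold.
print(f"KNN accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```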

Cross-validation scores show how a model performs on several different held-out segments of the data. Comparing the mean and the spread of the per-fold scores supports model selection and hyperparameter tuning, and it gives a more reliable estimate of how well the model generalizes to new data than a single train/test split.
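
Grid search, listed in the overview, builds directly on cross-validation: every combination of hyperparameters in a grid is scored with k-fold cross-validation and the best one is kept. The following is a minimal sketch using scikit-learn's `GridSearchCV` with KNN. The grid values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

# Illustrative grid: 5 values of k times 2 weighting schemes = 10 candidates.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
# Tune on the training set only, so the test set stays untouched.
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Best mean cross-validation accuracy:", search.best_score_)
print("Accuracy on the held-out test set:", search.score(X_test, y_test))
```

Keeping the test set out of the search means the final score is measured on data that played no part in choosing the hyperparameters.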

---------------------------------------------------------------------------------------------------------------------------

Author: <b>Julio Iglesias</b>