## Class 9 - Part 1 
### Advanced Machine Learning Concepts
In this practical session, we will delve into several advanced machine learning concepts crucial for developing robust predictive models. Here's an overview of the key topics we will explore:

1. **Curse of Dimensionality:** We will explore how high-dimensional spaces increase the sparsity of data, making it difficult to effectively train and optimize machine learning models due to increased computational complexity and data requirements.
2. **Linear Discriminant Analysis (LDA):** This topic covers LDA, a dimensionality reduction technique that is also used for classification. It seeks to maximize the separability among known categories.
3. **Hyperparameter Tuning:** We'll discuss methods to select the best set of hyperparameters for a given machine learning model, enhancing its performance on unseen data.
   - **Grid Search:** This is a technique for hyperparameter tuning that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
4. **Metrics in Machine Learning:** We will review various metrics used to evaluate the performance of machine learning models, such as accuracy, precision, recall, F1-score, and AUC-ROC among others.
   - **Confusion Matrix:** An important tool to summarize the performance of a classification algorithm. It provides insights into the types of errors made by the model.
5. **k-Fold Cross-Validation:** This topic covers the technique of dividing the dataset into k subsets (folds) and using each fold in turn to test a model trained on the remaining k-1 folds. It is used to estimate how well the model will perform on new data.

This session will provide hands-on experience in implementing these concepts in Python, using various libraries to help illustrate these advanced techniques in action.
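
As a first illustration of the curse of dimensionality, the sketch below (synthetic uniform data, not the dataset used in this session) measures how much the nearest and farthest neighbors of a query point differ as the number of dimensions grows. The helper name `relative_contrast` and the chosen dimensions are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def relative_contrast(n_dims, n_points=500):
    """Return (max - min) / min of the distances from one query point to random points."""
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)                # one random query point
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Hypothetical dimensions, chosen only to show the trend.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dimensions: relative contrast = {c:.2f}")
```

When the contrast shrinks, "near" and "far" points become hard to tell apart, which is one reason distance-based models such as KNN tend to suffer in high-dimensional spaces.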

In [0]:
# Import necessary libraries and modules for data manipulation, machine learning, and metric evaluation:
import numpy as np  # NumPy is used for numerical operations and handling arrays.
import matplotlib.pyplot as plt  # Matplotlib is used for creating static, interactive, and animated visualizations in Python.
import seaborn as sns  # Seaborn is a Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
import pandas as pd  # Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.

# Import specific functions and classes from scikit-learn (sklearn):
from sklearn.datasets import fetch_lfw_people  # Function to load the Labeled Faces in the Wild (LFW) people dataset.
from sklearn.model_selection import train_test_split  # Function to split datasets into training and testing subsets.
from sklearn.preprocessing import StandardScaler  # Class for standardizing features by removing the mean and scaling to unit variance.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # A classifier with a linear decision boundary.
from sklearn.neighbors import KNeighborsClassifier  # Classifier implementing the k-nearest neighbors vote.
from sklearn.svm import SVC  # Support Vector Machine classifier (C-Support Vector Classification).
from sklearn.naive_bayes import GaussianNB  # Naive Bayes classifier that assumes Gaussian-distributed features.
from sklearn.metrics import accuracy_score  # Function that computes accuracy: the fraction of predicted labels that match the true labels.
from sklearn.model_selection import GridSearchCV # GridSearchCV is a tool that helps in tuning hyperparameters.

In [0]:
# Fetch the Labeled Faces in the Wild (LFW) dataset, specifying that only those people should be included
# who have at least 70 images in the dataset. This helps ensure a sufficiently large number of samples
# for each class, which is important for training stable machine learning models. The images are resized
# to 0.4 of their original size to reduce computational cost and improve processing speed.
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Extract feature data from the dataset; this typically includes pixel values of the images, 
# which serve as the input features for machine learning models.
X = lfw_people.data

# Extract the target labels corresponding to the identities of the people in the images.
# These labels are used as the output or target variable for supervised learning models.
y = lfw_people.target

### Data Description:
The Labeled Faces in the Wild (LFW) People dataset from scikit-learn is designed for studying face recognition problems. It's preprocessed to facilitate direct usage in machine learning models, making it a popular choice for image-based classification tasks.
- **Features:** Each image is converted to grayscale, resized, and flattened into a one-dimensional array, so every pixel intensity becomes one feature. The number of features per sample therefore depends on the image size chosen at load time: with `resize=0.4` each image is 50 x 37 pixels, which gives 1,850 features per sample, far fewer than at full size but still a high-dimensional problem.
- **Target:** The target variable is the identity of the person in the photograph. It's a multiclass classification problem, with each class representing a different individual.
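
The relationship between a flattened feature row and the image it came from can be sketched with synthetic arrays. The 50 x 37 size and all variable names below are assumptions for illustration; in the notebook the real dimensions can be read from `lfw_people.images.shape`.

```python
import numpy as np

# Assumed image size for illustration only.
height, width = 50, 37
rng = np.random.default_rng(0)
images = rng.random((4, height, width))      # 4 synthetic grayscale "images"

# Flatten each image into one row, giving a (n_samples, n_features) matrix.
X_flat = images.reshape(len(images), -1)
print("flattened shape:", X_flat.shape)

# Reshape one row back to 2-D to recover the original pixel layout.
restored = X_flat[0].reshape(height, width)
print("restored shape:", restored.shape)
```

The reshaped row can be passed to `plt.imshow(restored, cmap='gray')` to view a sample as an image.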

In [0]:
# Retrieve the names corresponding to the target labels in the LFW dataset, which are essentially the names
# of the individuals whose images are included in the dataset. These names are useful for understanding
# and interpreting the target labels.
target_names = lfw_people.target_names

# Print some details about the dataset to understand its composition and structure:
print("- Number of samples: ", X.shape[0])  # The total number of images in the dataset.
print("- Number of features per sample (flattened image size): ", X.shape[1])  # The number of pixels in each image, representing features.
print("- Number of classes (individuals): ", len(target_names))  # The number of different individuals, i.e., the class labels.
print("- Feature names (not applicable as features are pixel values): ", "Pixel values from resized images")  # In this dataset, features are raw pixel values, so there are no 'feature names' as such.
print("- Target names (individuals): ", target_names)  # Names of the individuals, which correspond to the class labels in the dataset.

In [0]:
# Split the raw pixel features and the corresponding labels into training and testing sets before scaling,
# so that no statistics from the test images leak into the preprocessing step.
# 'X' contains the pixel values of the images, and 'y' contains the target labels (identities of individuals).
# The data is split with 30% reserved for the test set to evaluate the model's performance on unseen data.
# The 'random_state' parameter makes the split reproducible: the same training and testing sets are generated
# each time the code is run, which is important for debugging and for comparing models across runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [0]:
# Initialize an instance of StandardScaler, which standardizes features by removing the mean and scaling to unit variance.
# This is an important preprocessing step for algorithms that are sensitive to the scale of the features, such as SVM and KNN.
scaler = StandardScaler()

# Fit the scaler on the training data only (it computes the mean and standard deviation of each feature),
# then apply the same transformation (subtract the mean, divide by the standard deviation) to both sets.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
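
To make the standardization step concrete, here is a minimal sketch on synthetic data (all names and values are assumptions for this example). It estimates the mean and standard deviation from the training rows only, reuses them for the test rows, and checks the result against the formula `(x - mean) / std` computed by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic features
y_demo = rng.integers(0, 2, size=200)                      # synthetic binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_scaler = StandardScaler().fit(X_tr)     # statistics come from the training rows only
X_tr_std = demo_scaler.transform(X_tr)
X_te_std = demo_scaler.transform(X_te)       # test rows reuse the training mean and std

# Same result by hand: subtract the training mean, divide by the training std.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print("train mean after scaling:", X_tr_std.mean(axis=0).round(3))
print("train std after scaling: ", X_tr_std.std(axis=0).round(3))
print("manual formula matches:  ", np.allclose(manual, X_te_std))
```

The test set will generally not have exactly zero mean and unit variance after this transformation, which is expected: it is scaled with statistics it did not contribute to.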

In [0]:
# Define a dictionary of classifiers for easy reference and batch processing. 
# This dictionary associates each classifier type with its corresponding instance:
classifiers = {
    'KNN': KNeighborsClassifier(n_neighbors=3),  # k-Nearest Neighbors classifier using the 3 nearest neighbors.
                                                 # Can capture non-linear decision boundaries.
    'SVM': SVC(kernel='linear'),                 # Support Vector Machine classifier with a linear kernel.
                                                 # Good for linearly separable data.
    'Naive Bayes': GaussianNB()                  # Naive Bayes classifier assuming Gaussian-distributed features.
                                                 # Fast to train, even with many features.
}

In [0]:
# Initialize an empty dictionary to store the accuracy results of each classifier when trained on scaled data (Original).
results_scaled = {}

# Loop through each classifier defined in the 'classifiers' dictionary.
for name, clf in classifiers.items():
    # Train the classifier on the scaled training data.
    clf.fit(X_train_scaled, y_train)
    
    # Predict the labels of the test dataset using the trained classifier.
    predictions_scaled = clf.predict(X_test_scaled)
    
    # Calculate the accuracy by comparing the predicted labels to the actual labels of the test data.
    accuracy_scaled = accuracy_score(y_test, predictions_scaled)
    
    # Store the accuracy of each classifier in the 'results_scaled' dictionary, keyed by the classifier's name.
    results_scaled[name] = accuracy_scaled

In [0]:
# Print the accuracy of each classifier trained on the scaled data (Original).
print("Accuracy using Scaled Data (Original):")
for name, acc in results_scaled.items():
    # Print each classifier's name and its accuracy, formatted to two decimal places.
    print(f"{name}: {acc:.2f}")

In [0]:
# Initialize an instance of Linear Discriminant Analysis (LDA). LDA is a supervised dimensionality reduction
# technique that is also commonly used as a linear classifier. It projects the data onto at most
# (number of classes - 1) dimensions, chosen to maximize the ratio of between-class variance to within-class variance.
lda = LinearDiscriminantAnalysis()

# Fit the LDA model to the scaled training data and transform the training data to reduce its dimensionality.
# This step involves finding the axes that maximize the separation between multiple classes and using these axes 
# to project the data into a space with fewer dimensions.
X_train_lda = lda.fit_transform(X_train_scaled, y_train)

# Transform the scaled test data using the same LDA model. Note that we only transform the test data 
# (without fitting) because the transformation must use the model parameters learned from the training data.
# This ensures that the test data is projected in the same way as the training data.
X_test_lda = lda.transform(X_test_scaled)
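
The size of the LDA projection is bounded by the number of classes: at most `n_classes - 1` components are produced, however many input features there are. A minimal sketch on scikit-learn's built-in wine dataset (a stand-in chosen so the example runs without downloading anything; it is not part of this notebook's pipeline) shows the effect.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_wine, y_wine = load_wine(return_X_y=True)     # 178 samples, 13 features, 3 classes

lda_demo = LinearDiscriminantAnalysis()
X_wine_lda = lda_demo.fit_transform(X_wine, y_wine)   # supervised: the labels are required

print("original shape: ", X_wine.shape)
print("projected shape:", X_wine_lda.shape)           # 3 classes -> 2 discriminant components
print("explained variance ratio:", lda_demo.explained_variance_ratio_.round(3))
```

In the notebook, `X_train_lda.shape` can be inspected in the same way to see how many components the face data is reduced to.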

In [0]:
# Initialize an empty dictionary to store the accuracy results of each classifier when trained and tested on LDA-transformed data.
results_lda = {}

# Loop through the classifiers defined in the 'classifiers' dictionary.
for name, clf in classifiers.items():
    # Train each classifier on the LDA-transformed training data.
    clf.fit(X_train_lda, y_train)
    
    # Use the trained classifier to make predictions on the LDA-transformed test data.
    predictions_lda = clf.predict(X_test_lda)
    
    # Calculate the accuracy of the predictions by comparing them to the true labels of the test data.
    accuracy_lda = accuracy_score(y_test, predictions_lda)
    
    # Store the accuracy of each classifier in the 'results_lda' dictionary, keyed by the classifier name.
    results_lda[name] = accuracy_lda

In [0]:
# Print the accuracy of each classifier trained on the scaled data (Original).
print("Accuracy using Scaled Data (Original):")
for name, acc in results_scaled.items():
    # Print each classifier's name and its accuracy, formatted to two decimal places.
    print(f"{name}: {acc:.2f}")
# Print the accuracy of each classifier trained on the LDA-transformed data.
print("\nAccuracy using LDA-Transformed Data:")
for name, acc in results_lda.items():
    # Print each classifier's name and its accuracy, formatted to two decimal places.
    print(f"{name}: {acc:.2f}")

### Analysis of the Results
- **K-Nearest Neighbors (KNN):** The accuracy significantly improves when using LDA. This suggests that LDA is effective in reducing the dimensions while enhancing the separability of the classes for KNN, which struggles with high-dimensional data due to the curse of dimensionality.
- **Support Vector Machine (SVM):** SVM performs better on the original scaled data than on the LDA-transformed data. SVM is generally effective in high-dimensional spaces, especially with an appropriate kernel, and it can lose useful information when the data is projected onto only a few dimensions and some discriminative features are discarded.
- **Naive Bayes:** There is a substantial increase in accuracy with LDA. Gaussian Naive Bayes treats the features as independent given the class, an assumption that neighboring pixels clearly violate. LDA replaces the pixels with a few discriminant components that are much less correlated than raw pixels and are chosen to maximize class separability, so the model's assumptions fit the projected data better.

### Grid Search:
Grid search is used here to fine-tune the hyperparameters of the K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Naive Bayes classifiers, working only with the LDA-transformed data. `GridSearchCV` from scikit-learn takes a grid of candidate hyperparameter values and finds the best-scoring combination through cross-validation.
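
Conceptually, a grid search is a loop over every combination in the grid, with each combination scored by cross-validation. The sketch below writes that loop by hand and compares it with `GridSearchCV`. It uses the built-in wine dataset and a deliberately small, hypothetical grid so that it runs quickly; neither is part of this notebook's pipeline.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_wine, y_wine = load_wine(return_X_y=True)

# Hypothetical small grid, chosen only to keep the sketch fast.
demo_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}

# Manual version: score every combination with 5-fold cross-validation.
manual_scores = []
for params in ParameterGrid(demo_grid):
    scores = cross_val_score(KNeighborsClassifier(**params), X_wine, y_wine,
                             cv=5, scoring='accuracy')
    manual_scores.append((scores.mean(), params))

best_manual_score = max(score for score, _ in manual_scores)

# GridSearchCV performs the same enumeration and keeps the best candidate.
search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=5, scoring='accuracy')
search.fit(X_wine, y_wine)

print("candidates evaluated:", len(manual_scores))   # 3 x 2 combinations
print(f"best manual score:    {best_manual_score:.3f}")
print(f"GridSearchCV best:    {search.best_score_:.3f}")
```

Because every combination is fitted once per fold, the cost grows multiplicatively with each added hyperparameter, so it pays to keep grids free of values that cannot change the result.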

In [0]:
# Define the parameter grid for tuning the KNN classifier. This dictionary lists the hyperparameter values to explore:
knn_params = {
    'n_neighbors': range(10, 30),  # Candidate neighbor counts from 10 to 29 (range excludes the upper bound).
    'weights': ['uniform', 'distance'],  # Both weighting strategies: equal votes or votes weighted by inverse distance.
    'metric': ['euclidean', 'manhattan', 'chebyshev']  # Three distance metrics to compare.
    # 'p' is omitted: it only affects the 'minkowski' metric, and 'manhattan' and 'euclidean'
    # already correspond to p=1 and p=2, so including it would only duplicate every candidate.
}

# Initialize GridSearchCV with the KNN classifier. Set the defined parameter grid, use 5-fold cross-validation,
# set scoring to 'accuracy' to evaluate model performance, and enable verbose output for more detailed logging during execution.
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, scoring='accuracy', verbose=1)

# Fit the grid search model to the LDA-transformed training data to find the best hyperparameters.
knn_grid.fit(X_train_lda, y_train)

# Print the best hyperparameters found during the grid search.
print(f"Best parameters for KNN: {knn_grid.best_params_}")

# Print the highest cross-validated accuracy achieved with the best hyperparameters.
print(f"Best cross-validated accuracy for KNN: {knn_grid.best_score_:.2f}")

In [0]:
# Define the parameter grid for tuning the SVM classifier:
svm_params = {
    'C': [0.01, 0.1, 1],   # Regularization values to compare; a smaller C means stronger regularization.
    'kernel': ['linear']   # Focusing on the linear kernel, as used in the earlier comparison.
    # 'gamma' and 'degree' are omitted: the linear kernel ignores both ('gamma' applies to the
    # 'rbf', 'poly' and 'sigmoid' kernels, 'degree' only to 'poly'), so listing them would multiply
    # the number of fits without changing any score. Neither is required by the API.
}

# Initialize GridSearchCV with the SVM classifier. Specify the parameter grid, use 5-fold cross-validation,
# set scoring to 'accuracy' to evaluate model performance, and enable verbose output for progress updates during execution.
svm_grid = GridSearchCV(SVC(), svm_params, cv=5, scoring='accuracy', verbose=1)

# Fit the grid search model to the LDA-transformed training data to find the best hyperparameters.
svm_grid.fit(X_train_lda, y_train)

# Print the best hyperparameters that grid search found for the SVM classifier.
print(f"Best parameters for SVM: {svm_grid.best_params_}")

# Print the highest cross-validated accuracy achieved with these best hyperparameters.
print(f"Best cross-validated accuracy for SVM: {svm_grid.best_score_:.2f}")

In [0]:
# Define the parameter grid for tuning the Naive Bayes classifier. The grid focuses on 'var_smoothing':
# the fraction of the largest feature variance that is added to every variance for numerical stability,
# which also smooths the fitted Gaussians. Here, 50 linearly spaced values between 0.045 and 0.055 are tested.
nb_params = {
    'var_smoothing': np.linspace(0.045, 0.055, 50)
}

# Initialize GridSearchCV with GaussianNB as the classifier model, specifying the parameter grid,
# number of folds for cross-validation (cv=5), the scoring metric as 'accuracy', and verbosity level 1
# for progress updates during the model fitting process.
nb_grid = GridSearchCV(GaussianNB(), nb_params, cv=5, scoring='accuracy', verbose=1)

# Fit the GridSearchCV object to the LDA-transformed training data along with the target labels.
nb_grid.fit(X_train_lda, y_train)

# After the grid search completes, print the best hyperparameters found and the highest
# cross-validated accuracy achieved with those parameters.
print(f"Best parameters for Naive Bayes: {nb_grid.best_params_}")
print(f"Best cross-validated accuracy for Naive Bayes: {nb_grid.best_score_:.2f}")
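
The `best_score_` values printed above are averages over the cross-validation folds. To show what happens inside one such evaluation, here is a minimal sketch of 5-fold cross-validation written as an explicit loop. It uses the built-in wine dataset as a stand-in and stratified folds, which is what scikit-learn uses for classifiers when `cv` is given as an integer.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X_wine, y_wine = load_wine(return_X_y=True)
folds = StratifiedKFold(n_splits=5)   # keeps class proportions similar in every fold

fold_scores = []
for train_idx, val_idx in folds.split(X_wine, y_wine):
    model = GaussianNB()                               # fresh model for every fold
    model.fit(X_wine[train_idx], y_wine[train_idx])    # train on the other k-1 folds
    preds = model.predict(X_wine[val_idx])             # validate on the held-out fold
    fold_scores.append(accuracy_score(y_wine[val_idx], preds))

print("per-fold accuracy:", np.round(fold_scores, 3))
print(f"mean accuracy: {np.mean(fold_scores):.3f}")
```

The spread of the per-fold scores gives a rough sense of how much the estimate depends on which samples end up in the validation fold.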


In [0]:
# Print out the accuracy results for LDA-Transformed data.
print("Accuracy using LDA-Transformed Data:")
for name, acc in results_lda.items():
    # Print each classifier's name and its accuracy, formatted to two decimal places.
    print(f"{name}: {acc:.2f}")

# Retrain KNN with the best hyperparameters found
knn_optimized = KNeighborsClassifier(**knn_grid.best_params_)
knn_optimized.fit(X_train_lda, y_train)
knn_optimized_accuracy = accuracy_score(y_test, knn_optimized.predict(X_test_lda))
print(f"\nOptimized KNN Accuracy: {knn_optimized_accuracy:.2f}")

# Retrain SVM with the best hyperparameters found
svm_optimized = SVC(**svm_grid.best_params_)
svm_optimized.fit(X_train_lda, y_train)
svm_optimized_accuracy = accuracy_score(y_test, svm_optimized.predict(X_test_lda))
print(f"Optimized SVM Accuracy: {svm_optimized_accuracy:.2f}")

# Retrain Naive Bayes with the best hyperparameters found
nb_optimized = GaussianNB(**nb_grid.best_params_)
nb_optimized.fit(X_train_lda, y_train)
nb_optimized_accuracy = accuracy_score(y_test, nb_optimized.predict(X_test_lda))
print(f"Optimized Naive Bayes Accuracy: {nb_optimized_accuracy:.2f}")
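
Accuracy alone does not show which classes are confused with one another. As a sketch of the confusion matrix and per-class metrics mentioned in the overview, the example below uses the built-in wine dataset and a KNN model with an arbitrary `n_neighbors=5`; both are assumptions for illustration and not results from this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_wine, y_wine = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_wine, y_wine, test_size=0.3, random_state=42)

demo_clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_pred = demo_clf.predict(X_te)

# Rows are true classes, columns are predicted classes; the diagonal counts correct predictions.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1-score for each class, plus overall accuracy.
print(classification_report(y_te, y_pred))
```

In the notebook, the same two calls could be applied to `y_test` and, for example, `knn_optimized.predict(X_test_lda)`, passing `target_names=target_names` to `classification_report` so that the rows are labeled with the individuals' names.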

---------------------------------------------------------------------------------------------------------------------------

Author: <b>Julio Iglesias</b>