# Exercise: Principal Component Analysis (PCA) and Classification Performance

In this exercise, you will apply PCA to reduce the dimensionality of the Olivetti Faces Dataset.

You will then train and test various classifiers on the PCA-reduced data.

The goal is to explore how different classifiers perform with different amounts of explained variance in PCA.

1) Load the Olivetti Faces Dataset.
2) Apply PCA with varying numbers of components to capture different levels of explained variance (e.g., 80%, 90%, and 95%).
3) For each level of explained variance, reduce the dataset's dimensionality using PCA and train the following classifiers on the transformed data:
- k-Nearest Neighbors (k-NN)
- Parzen Window Classifier (use Gaussian kernel density estimate)
- Logistic Regression

4) Compare the classification accuracy of each classifier across the different levels of explained variance.

5) Analyze and discuss:
5.1) Which classifier performs best with fewer PCA components and why?
5.2) How does the number of components (variance retained) affect each classifier's performance?
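
Step 2 depends on turning a variance target into a number of components. As a rough sketch of that selection rule, the cumulative explained-variance ratio can be scanned for the first point that exceeds the target. The helper name `n_components_for_variance` and the toy matrix `X_demo` are assumptions made for illustration (neither is part of scikit-learn or the exercise data), and the sketch uses plain NumPy rather than the `PCA` class:

```python
import numpy as np

def n_components_for_variance(X, target):
    """Hypothetical helper: fewest principal components whose cumulative
    explained-variance ratio exceeds `target` (a fraction in (0, 1))."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data; their squares are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    # Position of the first cumulative value strictly above the target, plus one
    return int(np.searchsorted(cumulative, target, side="right") + 1)

# Toy data (an assumption for illustration): three orthogonal directions
# carrying 18, 8 and 2 units of squared deviation respectively.
X_demo = np.array([
    [3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
    [0.0, 2.0, 0.0], [0.0, -2.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
])
for target in (0.50, 0.90, 0.95):
    print(target, n_components_for_variance(X_demo, target))
```

When working with scikit-learn directly, the fitted `pca.n_components_` attribute reports how many components were actually kept, which is useful to print alongside the accuracies when answering question 5.2.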

In [None]:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity, KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

# Load dataset
faces = fetch_olivetti_faces()
X, y = faces.data, faces.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define variance levels for PCA and create dictionaries to store results
variance_levels = [0.50, 0.70, 0.90, 0.95]
results = {var: {} for var in variance_levels}  # Initialize results dictionary

# Apply PCA and perform classification at each variance level
for var in variance_levels:
    # A float in (0, 1) tells PCA to keep the fewest components whose
    # cumulative explained variance exceeds that fraction
    pca = PCA(n_components=var)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    # k-Nearest Neighbors (k=5)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train_pca, y_train)
    y_pred_knn = knn.predict(X_test_pca)
    results[var]['k-NN'] = accuracy_score(y_test, y_pred_knn)

    # Parzen Window Classifier with Kernel Density Estimation
    bandwidth = 1.0  # You can experiment with different bandwidths
    classes = np.unique(y_train)
    log_densities = []

    for cls in classes:
        # Fit one KDE per class on that class's training samples (class-conditional density)
        X_train_class = X_train_pca[y_train == cls]
        kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
        kde.fit(X_train_class)

        # Score samples in the test set (log-density for each class)
        log_density = kde.score_samples(X_test_pca)
        log_densities.append(log_density)

    # Stack the per-class log-densities and pick the class with the highest value
    # for each test sample (a maximum-likelihood rule, i.e. equal class priors)
    log_densities = np.array(log_densities).T  # Shape: [n_samples, n_classes]
    y_pred_parzen = classes[np.argmax(log_densities, axis=1)]
    results[var]['Parzen Window'] = accuracy_score(y_test, y_pred_parzen)

    # Logistic Regression
    logistic = LogisticRegression(max_iter=1000)
    logistic.fit(X_train_pca, y_train)
    y_pred_logistic = logistic.predict(X_test_pca)
    results[var]['Logistic Regression'] = accuracy_score(y_test, y_pred_logistic)

# Display the results in a DataFrame for easy viewing
results_df = pd.DataFrame(results).T
print("Classification Accuracy for Different PCA Variance Levels and Classifiers:\n")
print(results_df)


Classification Accuracy for Different PCA Variance Levels and Classifiers:

          k-NN  Parzen Window  Logistic Regression
0.50  0.491667       0.725000             0.625000
0.70  0.633333       0.833333             0.841667
0.90  0.775000       0.866667             0.941667
0.95  0.783333       0.875000             0.941667
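
The Parzen window rule used in the loop above can also be written out by hand, which makes it clearer what the log-densities represent. The following is a minimal sketch in plain NumPy, assuming an isotropic Gaussian kernel with a single bandwidth and equal class priors. The function names `parzen_log_density` and `parzen_predict` and the toy arrays are hypothetical and are not part of scikit-learn or of the exercise.

```python
import numpy as np

def parzen_log_density(X_class, X_query, bandwidth=1.0):
    """Log of the Gaussian Parzen-window density estimate at each query point."""
    d = X_class.shape[1]
    # Squared Euclidean distances, shape (n_query, n_class_samples)
    diff = X_query[:, None, :] - X_class[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    log_kernel = -0.5 * sq_dist / bandwidth ** 2
    # Log normalising constant of an isotropic d-dimensional Gaussian
    log_norm = -0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    # Numerically stable log of the mean kernel value over the class samples
    m = log_kernel.max(axis=1, keepdims=True)
    log_mean = m[:, 0] + np.log(np.mean(np.exp(log_kernel - m), axis=1))
    return log_norm + log_mean

def parzen_predict(X_train, y_train, X_query, bandwidth=1.0):
    """Assign each query point to the class with the highest log-density."""
    classes = np.unique(y_train)
    scores = np.column_stack([
        parzen_log_density(X_train[y_train == c], X_query, bandwidth)
        for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]

# Toy data (an assumption for illustration): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y_toy = np.array([0, 0, 1, 1])
X_query = np.array([[0.2, 0.3], [9.8, 10.5]])
y_toy_pred = parzen_predict(X_toy, y_toy, X_query)
print(y_toy_pred)
```

Because the kernel width is fixed, the bandwidth interacts with the scale and dimensionality of the PCA features, so it may be worth trying several values before comparing the Parzen column against the other classifiers.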
