# Week 6: Post-Module Exercises - Optimal Component Selection

### Introduction

This notebook focuses on a critical aspect of Principal Component Analysis (PCA): determining the optimal number of components to use. We'll compare two selection strategies, the elbow method and variance thresholds, and then visualize the first three principal components in an interactive 3D plot.

Let's begin by importing the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px

# Configure plotting
mpl.rcParams["axes.spines.right"] = False
mpl.rcParams["axes.spines.top"] = False

# Machine learning libraries
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Let's load and prepare the breast cancer dataset again:

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

## 1. Selecting the Optimal Number of PCA Components

In the main module, we briefly touched on how explained variance can help determine the number of components to keep. Here, we'll explore this in more depth using the **elbow method** and **variance thresholds**.

### How to think about the Explained Variance Ratio
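Every dataset has some total amount of variance, and each feature contributes a share of it. PCA redistributes that same total across the principal components. The **explained variance ratio** of a component is the fraction of the total variance that lies along that component, so the ratios of all components sum to 1.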
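
As a minimal sketch of where these ratios come from, the cell below computes them by hand from the eigenvalues of the covariance matrix: each eigenvalue $\lambda_i$ is the variance along one principal component, and its ratio is $\lambda_i / \sum_j \lambda_j$. The synthetic data and the names `X_demo` and `explained_variance_ratio_demo` are assumptions made for illustration only.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features
# with deliberately different spreads.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Center the data, then take the eigenvalues of the sample covariance matrix.
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; reverse it

# Each eigenvalue is the variance along one principal component;
# dividing by the total gives that component's explained variance ratio.
explained_variance_ratio_demo = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio_demo)
```

The sum of the eigenvalues equals the sum of the per-feature variances, which is why the ratios describe shares of one fixed total.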

### What is Cumulative Explained Variance
PCA orders its components from the largest explained variance ratio to the smallest. Summing the ratios of the first $k$ components gives the **Cumulative Explained Variance** up to component $k$. When choosing how many components to keep, whether for visualization or for modeling, the goal is to retain as much cumulative explained variance as possible with as few dimensions as possible.
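
The running total can be sketched on a small hand-made example. The ratios in `ratios_demo` are hypothetical values chosen for illustration, not results from the breast cancer dataset.

```python
import numpy as np

# Hypothetical explained variance ratios, already sorted from largest to smallest.
ratios_demo = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

# Running total: entry k-1 is the share of variance captured by the first k components.
cumulative_demo = np.cumsum(ratios_demo)

for k, (ratio, total) in enumerate(zip(ratios_demo, cumulative_demo), start=1):
    print(f"first {k} component(s): +{ratio:.2f} -> {total:.2f} cumulative")
```

Each additional component adds less than the one before it, so the cumulative curve rises quickly at first and then flattens.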

In [None]:
# Perform PCA without limiting the number of components
pca_full = PCA()
pca_full.fit(X_scaled)

# Get explained variance ratio
explained_variance = pca_full.explained_variance_ratio_

# Plot the explained variance for each component
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.5, align='center', 
        label='Individual explained variance')
plt.step(range(1, len(explained_variance) + 1), np.cumsum(explained_variance), where='mid', 
        label='Cumulative explained variance')
plt.axhline(y=0.95, color='r', linestyle='-', label='95% Threshold')
plt.axhline(y=0.9, color='g', linestyle='-', label='90% Threshold')
plt.axhline(y=0.8, color='y', linestyle='-', label='80% Threshold')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

### The Elbow Method for PCA

The elbow method involves looking for the "elbow" point in the scree plot (the plot of explained variance), where the rate of decrease in explained variance slows down significantly. This point represents a good trade-off between dimensionality reduction and information preservation.

One way to locate the elbow is to look for the point where the slope before it is steep and the slope after it is much flatter. Numerically, this means finding where the slope changes the most, which corresponds to the largest second difference (a discrete approximation of the second derivative) of the explained variance ratio.

<img src="DataClustering_ElbowCriterion.JPG" alt="Drawing" style="width: 650px;"/>

In the example plot above, the largest change between adjacent slopes occurs at the 4th point: the curve drops steeply before it and flattens after it. That point is the elbow, so we would pick 4 as the number of components.
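
The slope comparison described above can be traced by hand on a toy example. The values in `ratios_toy` are hypothetical and were chosen so that the bend sits at the 4th component. Note the index arithmetic: the second difference at position `i` compares the slopes on either side of 0-indexed position `i + 1`, which is component number `i + 2` when counting from 1.

```python
import numpy as np

# Hypothetical per-component explained variance ratios with a clear bend
# at the 4th component (values chosen for illustration only).
ratios_toy = np.array([0.40, 0.28, 0.16, 0.04, 0.03, 0.03, 0.02, 0.02, 0.01, 0.01])

# Slope between each pair of neighbouring components.
slopes = np.diff(ratios_toy)

# Change in slope at each interior component.
slope_changes = np.diff(slopes)

# Position of the largest change, converted to a 1-based component number.
i = int(np.argmax(np.abs(slope_changes)))
elbow_component = i + 2
print(f"Largest slope change at component {elbow_component}")
```

Here the slope is about -0.12 before the 4th component and about -0.01 after it, so the sketch reports component 4 as the elbow.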

In [None]:
# Elbow method using second derivative
def find_n_components_elbow(explained_variance_ratio):
    # Calculate first differences (approximation of first derivative)
    diffs = np.diff(explained_variance_ratio)
    
    # Calculate second differences (approximation of second derivative)
    second_diffs = np.diff(diffs)
    
    # Find the largest change in slope (largest absolute second difference).
    # second_diffs[i] compares the slopes on either side of 0-indexed position
    # i + 1, which is component number i + 2 when counting from 1.
    return int(np.argmax(np.abs(second_diffs))) + 2

# Threshold method
def find_n_components_variance(explained_variance_ratio, threshold=0.9):
    cumulative_variance = np.cumsum(explained_variance_ratio)
    n_components = np.argmax(cumulative_variance >= threshold) + 1
    return n_components

# Apply both methods
n_components_80 = find_n_components_variance(explained_variance, 0.8)
n_components_90 = find_n_components_variance(explained_variance, 0.9)
n_components_95 = find_n_components_variance(explained_variance, 0.95)
n_components_elbow = find_n_components_elbow(explained_variance)

print(f"Number of components for 80% variance: {n_components_80}")
print(f"Number of components for 90% variance: {n_components_90}")
print(f"Number of components for 95% variance: {n_components_95}")
print(f"Number of components using elbow method: {n_components_elbow}")
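
As a cross-check on the threshold method, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps the fewest components whose cumulative explained variance exceeds that fraction. The sketch below assumes the same standardized breast cancer data as above; the names `X_demo`, `pca_90`, and `n_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the breast cancer features, as in the cells above.
X_demo = StandardScaler().fit_transform(load_breast_cancer().data)

# A float in (0, 1) asks PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction.
pca_90 = PCA(n_components=0.90, svd_solver="full")
pca_90.fit(X_demo)

# Manual count from the full decomposition, for comparison.
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.90)) + 1

print(f"PCA(n_components=0.90) kept {pca_90.n_components_} components")
print(f"Manual cumulative-sum count: {n_manual} components")
```

Both counts should agree, which makes the float form a convenient shortcut once a variance threshold has been chosen.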

Let's visualize the "elbow" in the scree plot to better understand this method:

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, 'o-', linewidth=2, color='blue')
plt.axvline(x=n_components_elbow, color='r', linestyle='--', 
           label=f'Elbow point: {n_components_elbow} components')
plt.axvline(x=n_components_80, color='g', linestyle='--', 
           label=f'80% variance: {n_components_80} components')
plt.axvline(x=n_components_90, color='b', linestyle='--', 
           label=f'90% variance: {n_components_90} components')
plt.axvline(x=n_components_95, color='purple', linestyle='--', 
           label=f'95% variance: {n_components_95} components')
plt.title('Scree Plot with Elbow Point')
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.legend(loc='best')
plt.grid(True)
plt.show()

## 2. Visualizing in 3D

Papers typically plot only the first two principal components to show how samples group, but some authors add the third component and use a **3D plot** instead. The cell below shows how this can be done.

In [None]:
# 3D PCA
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)

df_pca = pd.DataFrame({
    'PC1': X_pca_3d[:, 0],
    'PC2': X_pca_3d[:, 1],
    'PC3': X_pca_3d[:, 2],
    'Class': data.target_names[y]  # class names, so Plotly uses discrete colors
})

# Create interactive 3D scatter plot
fig = px.scatter_3d(
    df_pca,
    x='PC1',
    y='PC2',
    z='PC3',
    color='Class',
    title='3D PCA Projection',
)

fig.update_layout(
    scene=dict(
        xaxis_title='PC1',
        yaxis_title='PC2',
        zaxis_title='PC3'
    ),
    width=800,
    height=600
)

# Show the interactive plot
fig.show()

# Print explained variance for PCA
print(f"PCA explained variance ratio: {pca_3d.explained_variance_ratio_}")
print(f"Total variance explained by 3D PCA: {sum(pca_3d.explained_variance_ratio_):.3f} or {sum(pca_3d.explained_variance_ratio_)*100:.1f}%")

## Conclusion

In this post-module notebook, we've explored methods for determining the optimal number of PCA components, specifically focusing on the elbow method and variance thresholds. We've also seen how to visualize the first three principal components in an interactive 3D plot.

Key takeaways:
- The elbow method is a heuristic that uses the change in slope of the scree plot to suggest a number of components
  - The same idea is used with clustering algorithms as a heuristic for choosing the number of clusters
- Variance thresholds (e.g., 80%, 90%, 95%) provide practical cutoffs for dimensionality reduction
- The optimal number of components depends on your specific application and requirements
- Depending on your use case, you can generate 3D plots of the first three principal components

These insights will help you apply PCA more effectively in your data analysis and machine learning projects.