# Module 13: Dimensionality Reduction - PCA and t-SNE

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 75 minutes  
**Prerequisites**: [Module 02 - Data Preparation](02_data_preparation_train_test_split.ipynb), [Module 12 - Clustering](12_clustering_kmeans_dbscan.ipynb)

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand why dimensionality reduction is important
2. Explain the curse of dimensionality and its effects
3. Apply Principal Component Analysis (PCA) for dimensionality reduction
4. Understand eigenvalues, eigenvectors, and explained variance
5. Choose the optimal number of principal components
6. Use PCA for data visualization and preprocessing
7. Apply t-SNE for 2D/3D visualization of high-dimensional data
8. Understand when to use PCA vs t-SNE
9. Use PCA to improve machine learning model performance

## 1. Introduction: Why Reduce Dimensions?

### What is Dimensionality Reduction?

**Dimensionality reduction** transforms high-dimensional data into lower dimensions while preserving important information.

**Example**: 100 features → 10 features (keeping 95% of information)

### The Curse of Dimensionality (Revisited)

As dimensions increase:
- Data becomes **sparse** (points are far apart)
- **Distances become meaningless** (all points roughly equidistant)
- **Data requirements and computational cost explode** (exponentially more samples are needed to cover the space)
- **Visualization becomes impossible** (can't plot 100D data!)
- **Overfitting risk increases** (more parameters to learn)
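
The "all points roughly equidistant" effect can be checked numerically. The sketch below is an illustration on uniformly random points (an assumption, not data from this notebook); the helper name `distance_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=42):
    """Relative gap between the farthest and nearest point from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))   # random points in the unit cube
    query = rng.uniform(size=n_dims)                # one random query point
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances to the query
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows
contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

A contrast near zero means the nearest and farthest neighbors are almost the same distance away, which is why distance-based methods such as KNN tend to struggle in very high dimensions.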

### Benefits of Dimensionality Reduction

**1. Visualization** 🎨
- Humans can only see 2D or 3D
- Project high-D data to 2D/3D for exploration
- Discover patterns, clusters, outliers

**2. Speed** ⚡
- Fewer features → faster training
- Fewer features → faster predictions
- Critical for large datasets

**3. Remove Redundancy** 🔄
- Many features are correlated
- Height in cm vs height in inches (redundant!)
- Combine correlated features

**4. Avoid Overfitting** 📉
- Fewer dimensions → simpler models
- Less risk of fitting noise
- Better generalization

**5. Storage** 💾
- Smaller data = less storage
- Faster I/O operations

### Two Main Approaches

**Feature Selection**:
- Choose subset of original features
- Keep: [feature_1, feature_5, feature_10]
- Discard others

**Feature Extraction** (PCA, t-SNE):
- Create new features from combinations of original features
- PC1 = 0.5×feature_1 + 0.3×feature_2 + ...
- More powerful but less interpretable
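
To make the contrast concrete, here is a small sketch of both approaches side by side. It assumes scikit-learn's built-in `load_breast_cancer` dataset as a stand-in for the CSV used later in this notebook, and uses `SelectKBest` with `f_classif` as one possible feature-selection method (the notebook itself does not prescribe one).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

# Feature selection: keep 3 of the original 30 columns (their names survive)
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_selected = selector.transform(X)
selected_names = data.feature_names[selector.get_support()]

# Feature extraction: build 3 new columns, each a weighted sum of all 30 features
pca = PCA(n_components=3).fit(X)
X_extracted = pca.transform(X)

print("Selected features:", list(selected_names))
print("PC1 weights on the first 3 original features:", pca.components_[0, :3].round(3))
print("Shapes:", X_selected.shape, X_extracted.shape)
```

Both outputs have three columns, but the selected columns are original measurements while each extracted column mixes every original feature.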

## 2. Setup and Data Loading

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import warnings

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)
%matplotlib inline

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')

print('✓ All libraries imported successfully!')
print('✓ Random seed set to 42 for reproducibility')

In [None]:
# Load datasets for dimensionality reduction

# Dataset 1: Breast Cancer (30 features - good for PCA)
bc_df = pd.read_csv('data/sample/breast_cancer.csv')
print("Breast Cancer Dataset (High-dimensional):")
print(f"Shape: {bc_df.shape}")
print(f"Features: {bc_df.shape[1] - 1} (excluding target)")

# Dataset 2: Digits (64 features - pixel intensities)
digits_df = pd.read_csv('data/sample/digits.csv')
print("\nDigits Dataset (8x8 pixel images):")
print(f"Shape: {digits_df.shape}")
print(f"Features: {digits_df.shape[1] - 1} (pixel values)")

# Dataset 3: Wine (13 features)
wine_df = pd.read_csv('data/sample/wine.csv')
print("\nWine Dataset (Chemical properties):")
print(f"Shape: {wine_df.shape}")
print(f"Features: {wine_df.shape[1] - 1}")

## 3. Principal Component Analysis (PCA): The Concept

### What is PCA?

**PCA** = Principal Component Analysis

**Goal**: Find new axes (principal components) that capture maximum variance in data

**Key idea**: Data often varies more in some directions than others
- **PC1** (1st principal component): Direction of maximum variance
- **PC2** (2nd principal component): Direction of 2nd-most variance (perpendicular to PC1)
- **PC3**, **PC4**, ... : Continue finding perpendicular directions

### Simple Analogy: Shadow on a Wall

Imagine a 3D object casting a shadow on a 2D wall:
- Original: 3D object (high-dimensional data)
- Shadow: 2D projection (reduced dimensions)
- PCA finds the **best angle** to orient the wall so the shadow shows the most detail

### How PCA Works (Simplified)

**Step 1: Standardize data**
- Center data (mean = 0)
- Scale to unit variance
- Important: PCA is sensitive to scale!

**Step 2: Compute covariance matrix**
- Shows how features vary together
- High covariance = features are correlated

**Step 3: Find eigenvectors and eigenvalues**
- **Eigenvectors**: Directions of principal components
- **Eigenvalues**: Amount of variance in each direction
- Sort by eigenvalue (largest first)

**Step 4: Project data**
- Keep top K eigenvectors (K = desired dimensions)
- Transform original data using these vectors
- Result: Data in K dimensions
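
The four steps above can be sketched directly in NumPy and compared against scikit-learn. This is a minimal illustration on hypothetical random data, not how `sklearn.decomposition.PCA` is implemented internally (it relies on SVD-based or related solvers). Component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 samples, 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Step 1: standardize (mean 0, unit variance per feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (features are columns)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen-decomposition, sorted by eigenvalue (largest first)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: project onto the top K eigenvectors
K = 2
X_manual = X_std @ eigenvectors[:, :K]

# Compare with scikit-learn (each component may be flipped in sign)
pca = PCA(n_components=K)
X_sklearn = pca.fit_transform(X_std)
same_projection = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
print("Projections match up to sign:", same_projection)
print("Top eigenvalues:         ", eigenvalues[:K].round(4))
print("sklearn explained_variance_:", pca.explained_variance_.round(4))
```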

### Eigenvalues and Eigenvectors (Don't Panic!)

**Eigenvector**: A direction in space
- Think: Arrow pointing in a specific direction
- For PCA: Direction where data varies

**Eigenvalue**: Magnitude of variance in that direction
- Large eigenvalue = data spreads a lot in this direction
- Small eigenvalue = data doesn't vary much in this direction

**In PCA terms**:
- Eigenvector with largest eigenvalue = PC1 (most important)
- Eigenvector with 2nd largest eigenvalue = PC2
- And so on...
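
As a quick check of the claim that an eigenvalue is the variance along its eigenvector, the sketch below uses hypothetical 2D data stretched along a diagonal. The data and variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D data: the second feature mostly follows the first
x = rng.normal(size=1000)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=1000)])
data -= data.mean(axis=0)  # center the data

# eigh returns eigenvalues in ascending order, so the last one is the largest
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigenvectors[:, -1]
pc2 = eigenvectors[:, 0]

# Variance of the data projected onto each direction
var_along_pc1 = np.var(data @ pc1, ddof=1)
var_along_pc2 = np.var(data @ pc2, ddof=1)

print(f"Largest eigenvalue:  {eigenvalues[-1]:.4f} | variance along PC1: {var_along_pc1:.4f}")
print(f"Smallest eigenvalue: {eigenvalues[0]:.4f} | variance along PC2: {var_along_pc2:.4f}")
print(f"PC1 . PC2 = {pc1 @ pc2:.2e} (perpendicular directions)")
```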

## 4. Applying PCA: Step by Step

In [None]:
# Prepare breast cancer data
X_bc = bc_df.drop('target', axis=1).values
y_bc = bc_df['target'].values

print(f"Original data shape: {X_bc.shape}")
print(f"Number of features: {X_bc.shape[1]}")
print(f"\nFeature ranges (before scaling):")
print(f"Min: {X_bc.min():.2f}")
print(f"Max: {X_bc.max():.2f}")
print(f"\nThese features have very different scales - scaling is critical!")

In [None]:
# Step 1: Standardize the data
# CRITICAL: PCA is very sensitive to feature scales!
scaler = StandardScaler()
X_bc_scaled = scaler.fit_transform(X_bc)

print("After standardization:")
print(f"Mean of each feature: ~0 (largest absolute mean: {np.abs(X_bc_scaled.mean(axis=0)).max():.10f})")
print(f"Std of each feature: ~1 (range: {X_bc_scaled.std(axis=0).min():.10f} to {X_bc_scaled.std(axis=0).max():.10f})")
print("\n✓ Data is now ready for PCA!")

In [None]:
# Step 2: Apply PCA (keep all components initially to analyze)
pca_full = PCA()
X_bc_pca_full = pca_full.fit_transform(X_bc_scaled)

print("PCA Results:")
print("=" * 60)
print(f"Original dimensions: {X_bc_scaled.shape[1]}")
print(f"Number of components: {pca_full.n_components_}")
print(f"Transformed data shape: {X_bc_pca_full.shape}")
print("\nExplained variance ratio per component:")
print("(What % of variance each PC captures)\n")

for i, var_ratio in enumerate(pca_full.explained_variance_ratio_[:10], 1):
    print(f"PC{i:2d}: {var_ratio:.4f} ({var_ratio*100:.2f}%)")
    
print("\n✓ PC1 alone captures {:.1f}% of total variance!".format(
    pca_full.explained_variance_ratio_[0] * 100))

## 5. Explained Variance: How Much Information Do We Keep?

### Explained Variance Ratio

**Measures**: What proportion of total variance each PC captures

**Interpretation**:
- PC1 = 45% → First component captures 45% of data's variance
- PC2 = 20% → Second component captures 20% more
- Cumulative = 65% → Together they capture 65% of variance

### The 90% Rule

**Common practice**: Keep enough PCs to capture 90-95% of variance
- Preserves most information
- Reduces dimensions significantly
- Good balance between compression and information retention
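
scikit-learn can apply a variance threshold directly: passing a float between 0 and 1 as `n_components` keeps just enough components to exceed that fraction of variance. The sketch below assumes the built-in `load_breast_cancer` dataset as a stand-in for the notebook's CSV, and sets `svd_solver="full"` explicitly, since that is the solver the documentation ties to fractional `n_components`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Let PCA choose the number of components that retains at least 90% of variance
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X)

# Same answer computed by hand from the cumulative explained variance
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
manual_k = int(np.argmax(cumulative >= 0.90)) + 1

print(f"Components chosen by PCA(n_components=0.90): {pca_90.n_components_}")
print(f"Components found manually: {manual_k}")
print(f"Variance retained: {pca_90.explained_variance_ratio_.sum():.1%}")
```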

In [None]:
# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

print("Cumulative Explained Variance:")
print("=" * 60)
print(f"{'Components':<15} {'Individual':<15} {'Cumulative':<15}")
print("=" * 60)

for i in range(min(15, len(cumulative_variance))):
    print(f"PC1-PC{i+1:<10} {pca_full.explained_variance_ratio_[i]:>10.4f} {cumulative_variance[i]:>15.4f}")

# Find number of components for 90% variance
n_components_90 = np.argmax(cumulative_variance >= 0.90) + 1
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1

print("\n" + "=" * 60)
print("\n📊 Key Findings:")
print(f"   - {n_components_90} components needed for 90% variance")
print(f"   - {n_components_95} components needed for 95% variance")
print(f"   - Original: {X_bc_scaled.shape[1]} dimensions")
print(f"   - Reduction: {X_bc_scaled.shape[1]} → {n_components_90} dimensions (90% info retained)")
print(f"   - Compression ratio: {(1 - n_components_90/X_bc_scaled.shape[1])*100:.1f}% reduction!")

In [None]:
# Visualize explained variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Individual explained variance (Scree plot)
axes[0].bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
           pca_full.explained_variance_ratio_,
           alpha=0.7)
axes[0].set_xlabel('Principal Component', fontsize=12)
axes[0].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[0].set_title('Scree Plot: Individual Explained Variance', fontsize=14, fontweight='bold')
axes[0].set_xlim(0, 20)
axes[0].grid(True, alpha=0.3, axis='y')

# Plot 2: Cumulative explained variance
axes[1].plot(range(1, len(cumulative_variance) + 1),
            cumulative_variance,
            'o-',
            linewidth=2,
            markersize=6)
axes[1].axhline(0.90, color='red', linestyle='--', label='90% threshold', linewidth=2)
axes[1].axhline(0.95, color='orange', linestyle='--', label='95% threshold', linewidth=2)
axes[1].axvline(n_components_90, color='red', linestyle=':', alpha=0.5)
axes[1].axvline(n_components_95, color='orange', linestyle=':', alpha=0.5)
axes[1].set_xlabel('Number of Components', fontsize=12)
axes[1].set_ylabel('Cumulative Explained Variance', fontsize=12)
axes[1].set_title('Cumulative Explained Variance', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 20)

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("- Left: First few PCs capture most variance (elbow around PC5-7)")
print("- Right: Sharp rise initially, then plateau")
print(f"- {n_components_90} PCs give 90% of information (instead of {X_bc_scaled.shape[1]})!")

## 6. PCA for Visualization: 2D and 3D Projections

In [None]:
# Apply PCA with just 2 components for visualization
pca_2d = PCA(n_components=2)
X_bc_2d = pca_2d.fit_transform(X_bc_scaled)

print("PCA 2D Projection:")
print("=" * 50)
print(f"Original dimensions: {X_bc_scaled.shape[1]}")
print(f"Reduced to: 2 dimensions")
print(f"Variance preserved: {pca_2d.explained_variance_ratio_.sum():.1%}")
print(f"\nPC1 explains: {pca_2d.explained_variance_ratio_[0]:.1%}")
print(f"PC2 explains: {pca_2d.explained_variance_ratio_[1]:.1%}")

In [None]:
# Visualize 2D PCA projection
plt.figure(figsize=(10, 7))

# Plot points colored by class
scatter = plt.scatter(
    X_bc_2d[:, 0],
    X_bc_2d[:, 1],
    c=y_bc,
    cmap='coolwarm',
    alpha=0.6,
    s=50,
    edgecolors='black',
    linewidth=0.5
)

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)', fontsize=12)
plt.title('PCA 2D Projection of Breast Cancer Data\n(30D → 2D)', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Class (0=Malignant, 1=Benign)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Success! We can now visualize 30D data in 2D!")
print("   Notice: Classes are reasonably well-separated even in just 2D.")

In [None]:
# 3D visualization
from mpl_toolkits.mplot3d import Axes3D

pca_3d = PCA(n_components=3)
X_bc_3d = pca_3d.fit_transform(X_bc_scaled)

fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(
    X_bc_3d[:, 0],
    X_bc_3d[:, 1],
    X_bc_3d[:, 2],
    c=y_bc,
    cmap='coolwarm',
    alpha=0.6,
    s=50,
    edgecolors='black',
    linewidth=0.5
)

ax.set_xlabel(f'PC1 ({pca_3d.explained_variance_ratio_[0]:.1%})', fontsize=11)
ax.set_ylabel(f'PC2 ({pca_3d.explained_variance_ratio_[1]:.1%})', fontsize=11)
ax.set_zlabel(f'PC3 ({pca_3d.explained_variance_ratio_[2]:.1%})', fontsize=11)
ax.set_title('PCA 3D Projection\n(30D → 3D)', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Class', shrink=0.5)
plt.tight_layout()
plt.show()

print(f"\n3D projection preserves {pca_3d.explained_variance_ratio_.sum():.1%} of variance.")
print("Even more separation visible with third dimension!")

## 7. PCA for Preprocessing: Improving Model Performance

### Why Use PCA Before Training?

**Benefits:**
1. **Faster training**: Fewer features = faster computation
2. **Reduced overfitting**: Simpler model, less noise
3. **Remove multicollinearity**: Correlated features combined
4. **Handle curse of dimensionality**: Especially for KNN, Naive Bayes

**Workflow:**
1. Split data into train/test
2. Fit PCA on training data only
3. Transform both train and test using fitted PCA
4. Train model on reduced features
5. Evaluate on test set
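
One way to follow this workflow without leaking test data is to chain the steps in a scikit-learn `Pipeline`, so the scaler and PCA are refit on the training portion of every split. The sketch below assumes the built-in `load_breast_cancer` dataset and an arbitrary choice of 10 components, purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Split first, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 2-4. Scaler, PCA and classifier are all fitted on training data only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000),
)

# Cross-validation refits the whole pipeline inside each fold
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# 5. Evaluate on the held-out test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the pipeline exposes a single `fit`/`score` interface, the same object can be passed to cross-validation or a grid search without manual bookkeeping.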

In [None]:
# Compare model performance: Original vs PCA features

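# Split the raw data first, then fit the scaler on the training split only
# (scaling before the split would leak test-set statistics into training)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42, stratify=y_bc
)
split_scaler = StandardScaler()
X_train = split_scaler.fit_transform(X_train_raw)
X_test = split_scaler.transform(X_test_raw)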

print("Comparing Models: Original Features vs PCA Features\n")
print("=" * 70)

# Test with different numbers of components
n_components_list = [2, 5, 10, 15, 20, X_bc_scaled.shape[1]]
results = []

for n_comp in n_components_list:
    if n_comp < X_bc_scaled.shape[1]:
        # Apply PCA
        pca = PCA(n_components=n_comp)
        X_train_pca = pca.fit_transform(X_train)
        X_test_pca = pca.transform(X_test)
        var_retained = pca.explained_variance_ratio_.sum()
    else:
        # Use all original features
        X_train_pca = X_train
        X_test_pca = X_test
        var_retained = 1.0
    
    # Train logistic regression
    lr = LogisticRegression(random_state=42, max_iter=1000)
    lr.fit(X_train_pca, y_train)
    
    # Evaluate
    train_score = lr.score(X_train_pca, y_train)
    test_score = lr.score(X_test_pca, y_test)
    
    # Cross-validation
    cv_scores = cross_val_score(lr, X_train_pca, y_train, cv=5)
    cv_mean = cv_scores.mean()
    
    results.append({
        'n_components': n_comp,
        'variance': var_retained,
        'train': train_score,
        'test': test_score,
        'cv': cv_mean
    })
    
    print(f"Components: {n_comp:2d} | Variance: {var_retained:.3f} | "
          f"Train: {train_score:.3f} | Test: {test_score:.3f} | CV: {cv_mean:.3f}")

print("=" * 70)

In [None]:
# Visualize results
results_df = pd.DataFrame(results)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(results_df['n_components'], results_df['train'], 'o-', label='Train', linewidth=2)
plt.plot(results_df['n_components'], results_df['test'], 's-', label='Test', linewidth=2)
plt.plot(results_df['n_components'], results_df['cv'], '^-', label='CV', linewidth=2)
plt.xlabel('Number of Components', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Model Performance vs Number of PCA Components', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(results_df['variance'], results_df['test'], 'o-', linewidth=2, markersize=8)
plt.xlabel('Variance Retained', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Test Accuracy vs Variance Retained', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find best number of components
best_idx = results_df['test'].idxmax()
best_result = results_df.loc[best_idx]

print("\n📊 Key Findings:")
print(f"   Best performance: {best_result['n_components']:.0f} components")
print(f"   Test accuracy: {best_result['test']:.3f}")
print(f"   Variance retained: {best_result['variance']:.1%}")
print("\n✅ Using PCA for preprocessing can match or exceed full-feature performance!")
print(f"   Benefit: {X_bc_scaled.shape[1] - best_result['n_components']:.0f} fewer features = faster training!")

## 8. t-SNE: Advanced Visualization

### What is t-SNE?

**t-SNE** = t-Distributed Stochastic Neighbor Embedding

**Purpose**: Visualize high-dimensional data in 2D or 3D

### How t-SNE Differs from PCA

**PCA**:
- Linear transformation
- Preserves large distances (global structure)
- Fast and deterministic
- Good for feature extraction

**t-SNE**:
- Non-linear transformation
- Preserves small distances (local structure)
- Slower and stochastic (results differ between runs unless `random_state` is fixed)
- **ONLY for visualization** (not for preprocessing!)
- Better at separating clusters visually

### t-SNE Parameters

**perplexity**:
- Balances local vs global structure
- Think: "How many neighbors to consider?"
- Range: 5-50 (typically 30)
- Larger dataset → larger perplexity

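**learning_rate**:
- Controls optimization speed
- Range: 10-1000 (200 was the long-standing scikit-learn default; since version 1.2 the default is `'auto'`)
- Too low: slow convergence
- Too high: unstable, poor results

**n_iter** (renamed `max_iter` in scikit-learn 1.5):
- Number of optimization iterations
- Default is 1000 and should be at least 250 (typically 1000-5000)
- More iterations = better quality until the optimization converges (but slower)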

### Important t-SNE Caveats

⚠️ **Don't use t-SNE for preprocessing!**
- Only for visualization
- Non-deterministic (changes each run unless `random_state` is fixed)
- No `transform` for unseen data and no "inverse transform" to original space

⚠️ **Don't interpret distances directly**
- Cluster sizes don't mean anything
- Distances between clusters are not reliable
- Only local neighborhoods (which points end up together) are meaningful

⚠️ **Reduce dimensions with PCA first**
- t-SNE is slow on high dimensions
- PCA to 50D, then t-SNE to 2D (common workflow)
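
The preprocessing caveat shows up directly in the scikit-learn API. The short check below inspects which methods each estimator offers; it only examines the objects and does not fit anything.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

capabilities = {}
for name, estimator in [("PCA", PCA(n_components=2)), ("t-SNE", TSNE(n_components=2))]:
    # Record which of the three methods this estimator exposes
    capabilities[name] = {
        method: hasattr(estimator, method)
        for method in ("fit_transform", "transform", "inverse_transform")
    }
    print(name, capabilities[name])
```

Since `TSNE` offers only `fit_transform`, an embedding learned on training data cannot be applied to a separate test set, which is the practical reason it does not fit into a train/test preprocessing workflow.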

In [None]:
# Apply t-SNE to digits dataset
# First, prepare the data
X_digits = digits_df.drop('target', axis=1).values
y_digits = digits_df['target'].values

# Scale features
scaler_digits = StandardScaler()
X_digits_scaled = scaler_digits.fit_transform(X_digits)

# Reduce to 50D with PCA first (speeds up t-SNE)
pca_50 = PCA(n_components=50)
X_digits_pca = pca_50.fit_transform(X_digits_scaled)

print(f"Original digits data: {X_digits_scaled.shape}")
print(f"After PCA: {X_digits_pca.shape}")
print(f"PCA variance retained: {pca_50.explained_variance_ratio_.sum():.1%}")
print("\nApplying t-SNE (this may take a minute)...")

In [None]:
# Apply t-SNE
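# scikit-learn 1.5 renamed `n_iter` to `max_iter`; use whichever name this version accepts
import inspect
iter_param = 'max_iter' if 'max_iter' in inspect.signature(TSNE).parameters else 'n_iter'

tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200,
    random_state=42,
    verbose=0,
    **{iter_param: 1000}
)

X_digits_tsne = tsne.fit_transform(X_digits_pca)

print("✓ t-SNE complete!")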
print(f"Final shape: {X_digits_tsne.shape}")
print(f"KL divergence (lower is better): {tsne.kl_divergence_:.2f}")

In [None]:
# Compare PCA vs t-SNE visualization
# First get PCA 2D projection
pca_2d_digits = PCA(n_components=2)
X_digits_pca_2d = pca_2d_digits.fit_transform(X_digits_scaled)

# Create comparison plot
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# PCA visualization
scatter1 = axes[0].scatter(
    X_digits_pca_2d[:, 0],
    X_digits_pca_2d[:, 1],
    c=y_digits,
    cmap='tab10',
    alpha=0.6,
    s=30,
    edgecolors='black',
    linewidth=0.3
)
axes[0].set_xlabel('PC1', fontsize=12)
axes[0].set_ylabel('PC2', fontsize=12)
axes[0].set_title('PCA Projection\n(Linear, preserves global structure)', 
                 fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Digit Class')

# t-SNE visualization
scatter2 = axes[1].scatter(
    X_digits_tsne[:, 0],
    X_digits_tsne[:, 1],
    c=y_digits,
    cmap='tab10',
    alpha=0.6,
    s=30,
    edgecolors='black',
    linewidth=0.3
)
axes[1].set_xlabel('t-SNE 1', fontsize=12)
axes[1].set_ylabel('t-SNE 2', fontsize=12)
axes[1].set_title('t-SNE Projection\n(Non-linear, preserves local structure)',
                 fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label='Digit Class')

plt.tight_layout()
plt.show()

print("\n📊 Comparison:")
print("- PCA (left): Clusters overlap, some separation")
print("- t-SNE (right): Much better cluster separation!")
print("- t-SNE creates distinct 'islands' for each digit class")
print("\n✅ t-SNE is excellent for visualizing cluster structure!")

## 9. PCA vs t-SNE: When to Use Each

### Use PCA When:

✅ **Feature extraction for ML models**
   - Preprocessing step before training
   - Reduces overfitting
   - Speeds up training

✅ **Need interpretable components**
   - Can examine loadings (feature contributions)
   - Understand what each PC represents

✅ **Need deterministic results**
   - Same input → same output
   - Important for reproducibility

✅ **Large datasets**
   - PCA is very fast
   - Scales well

✅ **Need inverse transform**
   - Can go back to original space
   - Useful for reconstruction
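
Reconstruction with `inverse_transform` can be sketched as follows. The example assumes scikit-learn's built-in `load_digits` data (unscaled, for simplicity) and measures how the mean squared reconstruction error shrinks as more components are kept.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 1797 images, 64 pixel features each

errors = {}
for k in (2, 10, 30, 64):
    pca = PCA(n_components=k).fit(X)
    # Compress to k dimensions, then map back to the original 64
    X_back = pca.inverse_transform(pca.transform(X))
    errors[k] = float(np.mean((X - X_back) ** 2))
    print(f"{k:>2} components: reconstruction MSE = {errors[k]:.4f}")
```

With all 64 components the round trip is lossless up to floating-point error; with fewer components the error reflects the variance that was discarded.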

### Use t-SNE When:

✅ **Only want visualization**
   - Not for preprocessing!
   - Final step for exploration

✅ **Complex cluster structures**
   - Non-linear relationships
   - Intertwined clusters

✅ **Want better visual separation**
   - Creates more distinct clusters
   - Easier to interpret visually

✅ **Small to medium datasets**
   - t-SNE is slow (O(n²) for the exact method; the Barnes-Hut approximation that scikit-learn uses by default is O(n log n) but still far slower than PCA)
   - Not practical for millions of points

### Common Workflow

**Best practice**: Use both!
1. **PCA to 50D** (fast dimensionality reduction)
2. **Train ML models** on PCA features
3. **t-SNE to 2D** (for visualization only)
4. **Plot results** to understand data

## Exercises

Now it's your turn to practice! Complete these exercises to reinforce your understanding.

### Exercise 1: PCA on Wine Dataset

Apply PCA to the wine dataset and determine optimal number of components.

**Tasks:**
1. Load `wine.csv` and separate features from target
2. Standardize features
3. Apply PCA with all components
4. Create a scree plot showing explained variance
5. How many components needed for 90% variance?
6. Create 2D PCA visualization colored by wine class
7. Are the wine classes well-separated in 2D?

In [None]:
# Your code here
# Hint: Follow the PCA workflow from sections 4-6



### Exercise 2: PCA vs Original Features Performance

Compare classification performance using original features vs PCA features.

**Tasks:**
1. Use breast cancer dataset
2. Split into train/test (70/30)
3. Train Random Forest classifier with original 30 features
4. Train Random Forest with PCA features (5, 10, 15, 20 components)
5. Compare: Training time, Test accuracy, Number of features
6. Create a plot showing accuracy vs number of PCA components
7. What's the optimal number of components? Why?

In [None]:
# Your code here
# Hint: import time; start = time.time(); ... ; elapsed = time.time() - start



### Exercise 3: t-SNE Parameter Exploration

Experiment with different t-SNE parameters to see their effects.

**Tasks:**
1. Use the digits dataset (reduce to 50D with PCA first)
2. Apply t-SNE with perplexity values: [5, 30, 50]
3. For each perplexity, create a 2D scatter plot colored by digit class
4. Which perplexity value gives the best separation?
5. What happens with perplexity=5 (too small)?
6. What happens with perplexity=50 (large)?
7. Why is it important to reduce dimensions with PCA before t-SNE?

In [None]:
# Your code here
# Hint: Create a 1x3 subplot to compare different perplexity values



### Exercise 4: Feature Contribution Analysis

Analyze which original features contribute most to principal components.

**Tasks:**
1. Load breast cancer dataset and apply PCA with 5 components
2. Get the PCA components (loadings) - these show feature contributions
3. For PC1 and PC2, identify the top 5 most influential features
4. Create a heatmap showing all feature contributions to first 5 PCs
5. Which features are most important for PC1?
6. Are any features consistently important across multiple PCs?
7. What does this tell you about the data?

In [None]:
# Your code here
# Hint: pca.components_ contains the loadings
# Use sns.heatmap() for visualization



## Summary

### Key Concepts Learned

1. **Why Dimensionality Reduction?**
   - Visualization (2D/3D from high-D)
   - Speed (faster training and prediction)
   - Avoid overfitting (simpler models)
   - Remove redundancy (correlated features)
   - Combat curse of dimensionality

2. **Principal Component Analysis (PCA)**
   - Linear transformation to new axes
   - Finds directions of maximum variance
   - Orthogonal components (uncorrelated)
   - Fast and deterministic
   - Good for preprocessing and visualization

3. **Explained Variance**
   - Measures information retained
   - Common threshold: 90-95%
   - Scree plot shows variance per component
   - Cumulative plot guides component selection

4. **PCA Workflow**
   - Standardize data (critical!)
   - Fit PCA on training data
   - Transform train and test data
   - Train model on reduced features
   - Benefit: Speed + performance

5. **t-SNE (t-Distributed Stochastic Neighbor Embedding)**
   - Non-linear dimensionality reduction
   - Preserves local structure (nearby points)
   - Excellent for visualization
   - **ONLY for visualization** (not preprocessing!)
   - Stochastic (different results each run)

6. **PCA vs t-SNE**
   - **PCA**: Feature extraction, fast, linear, deterministic
   - **t-SNE**: Visualization only, slow, non-linear, stochastic
   - **Workflow**: PCA → train models, t-SNE → visualize

7. **Best Practices**
   - Always standardize before PCA
   - Use PCA to ~50D before t-SNE (speed)
   - Choose components retaining 90-95% variance
   - Visualize with scree plots
   - Validate with cross-validation

### Best Practices

- **Always scale features** before PCA (mean=0, std=1)
- **Fit PCA on training data only** (avoid data leakage)
- **Use scree plot** to choose number of components
- **Start with 90% variance** as reasonable default
- **Validate performance** with cross-validation
- **Evaluate PCA together with the downstream model**, not in isolation (retained variance alone does not guarantee accuracy)
- **Reduce to 50D before t-SNE** for speed
- **Don't use t-SNE for ML** (only visualization)

### Common Pitfalls to Avoid

- ❌ Forgetting to scale features before PCA
- ❌ Fitting PCA on entire dataset (including test data)
- ❌ Using t-SNE for preprocessing (it's non-invertible!)
- ❌ Interpreting t-SNE distances/sizes literally
- ❌ Choosing too few components (losing important information)
- ❌ Running t-SNE on high-dimensional data directly
- ❌ Expecting PCA to work on categorical data

### What's Next

In **Module 14: Final Project - End-to-End ML Pipeline**, you'll:
- Apply everything you've learned in a complete project
- Build a full machine learning pipeline from scratch
- Handle real-world data with all its challenges
- Compare multiple algorithms systematically
- Perform hyperparameter tuning
- Create production-ready models
- Learn ML best practices and deployment considerations

### Additional Resources

**Videos:**
- [StatQuest: PCA Step-by-Step](https://www.youtube.com/watch?v=FgakZw6K1QQ)
- [StatQuest: PCA in Python](https://www.youtube.com/watch?v=Lsue2gEM9D0)
- [t-SNE Explained](https://www.youtube.com/watch?v=NEaUSP4YerM)

**Documentation:**
- [scikit-learn PCA Guide](https://scikit-learn.org/stable/modules/decomposition.html#pca)
- [PCA API](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
- [t-SNE API](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)

**Articles:**
- [PCA for Machine Learning](https://machinelearningmastery.com/principal-component-analysis-for-dimensionality-reduction/)
- [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)
- [PCA vs t-SNE](https://towardsdatascience.com/pca-vs-t-sne-17bcd882bf3d)