# Principal Component Analysis (PCA) - Complete Guide

## From Linear Algebra to Dimensionality Reduction

PCA is one of the most important **dimensionality reduction techniques** in machine learning. It finds the directions of maximum variance in high-dimensional data and projects it onto a lower-dimensional subspace.

### What You'll Learn
1. What PCA is and why we need dimensionality reduction
2. The mathematical foundation (covariance, eigenvalues, eigenvectors)
3. Step-by-step PCA algorithm
4. Implementation from scratch
5. Scikit-learn implementation
6. Choosing the number of components
7. Visualization and interpretation
8. Real-world applications
9. Limitations and alternatives

---


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_breast_cancer, make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("Principal Component Analysis (PCA) - Complete Guide")
print("=" * 60)


## 1. What is Dimensionality Reduction?

### The Curse of Dimensionality

As the number of features increases, the volume of the feature space grows exponentially. This causes:
- **Data sparsity**: A fixed number of samples covers the space ever more thinly, so points end up far apart
- **Overfitting**: Models memorize noise
- **Computational cost**: More features = slower training
- **Visualization difficulty**: Can't visualize >3D
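
One way to get a feel for the sparsity effect is to measure how much the nearest and farthest neighbors of a point differ as the dimension grows. The sketch below is illustrative only: the helper `relative_contrast`, the uniform sampling, and the chosen dimensions are assumptions made for this example, not part of the analysis that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max - min) / min of the distances from one point to all others."""
    X = rng.random((n_points, n_dims))            # uniform in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to the first point
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:>4}: relative contrast = {c:.2f}")
```

The contrast shrinks as the dimension grows: in high dimensions every point is roughly equally far from every other point, which is why distance-based reasoning becomes less reliable.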

### Why Reduce Dimensions?

- ✅ **Visualization**: Project high-D data to 2D/3D
- ✅ **Noise reduction**: Remove less informative dimensions
- ✅ **Faster training**: Fewer features = faster models
- ✅ **Prevent overfitting**: Reduce model complexity
- ✅ **Feature extraction**: Create better features

### Types of Dimensionality Reduction

1. **Linear**: PCA, Factor Analysis
2. **Non-linear**: t-SNE, UMAP, Autoencoders


In [None]:
# Visualize the concept of dimensionality reduction
np.random.seed(42)

# Create 3D data
n_samples = 100
X_3d = np.random.randn(n_samples, 3)
X_3d[:, 2] = 0.5 * X_3d[:, 0] + 0.5 * X_3d[:, 1] + 0.1 * np.random.randn(n_samples)

# Apply PCA to reduce to 2D
pca_vis = PCA(n_components=2)
X_2d = pca_vis.fit_transform(X_3d)

fig = plt.figure(figsize=(15, 5))

# Original 3D data (3D scatter plot)
ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], alpha=0.6, s=50)
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_zlabel('Feature 3')
ax1.set_title('Original 3D Data', fontweight='bold')

# 2D projection (first two features)
ax2 = fig.add_subplot(132)
ax2.scatter(X_3d[:, 0], X_3d[:, 1], alpha=0.6, s=50)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.set_title('Simple 2D Projection\n(Feature 3 discarded)', fontweight='bold', color='orange')
ax2.grid(True, alpha=0.3)

# PCA 2D projection
ax3 = fig.add_subplot(133)
ax3.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.6, s=50)
ax3.set_xlabel('PC1 (Principal Component 1)')
ax3.set_ylabel('PC2 (Principal Component 2)')
ax3.set_title('PCA 2D Projection\n(Maximum variance preserved)', fontweight='bold', color='green')
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

variance_explained = pca_vis.explained_variance_ratio_
print(f"Variance explained by PC1: {variance_explained[0]:.1%}")
print(f"Variance explained by PC2: {variance_explained[1]:.1%}")
print(f"Total variance preserved: {variance_explained.sum():.1%}")


## 2. What is Principal Component Analysis?

### Core Idea

**PCA** finds the directions (principal components) in which the data varies the most. It projects data onto these directions to reduce dimensionality while preserving maximum variance.

### Key Concepts

1. **Principal Components (PCs)**: New orthogonal directions in feature space
2. **Variance**: Measure of data spread in each direction
3. **Eigenvectors**: Directions of principal components
4. **Eigenvalues**: Amount of variance along each component

### Mathematical Goal

Find a lower-dimensional representation that:
- Maximizes variance in the projected space
- Minimizes reconstruction error
- Produces new features (the components) that are uncorrelated with each other
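
The first two goals are two views of the same thing: the variance that is thrown away is exactly the reconstruction error. The following is a minimal numerical check on made-up correlated data (the toy data and variable names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Xc = X - X.mean(axis=0)                                   # center

C = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(C)                      # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

errors, discarded = [], []
for k in range(1, 6):
    V_k = eigvecs[:, :k]
    X_hat = Xc @ V_k @ V_k.T                              # project, then map back
    errors.append(np.sum((Xc - X_hat) ** 2) / (len(Xc) - 1))
    discarded.append(eigvals[k:].sum())                   # variance left out
    print(f"k={k}: reconstruction error={errors[-1]:.4f}, "
          f"discarded variance={discarded[-1]:.4f}")
```

For every `k` the two printed numbers agree, so keeping the directions of largest variance is the same as minimizing the squared reconstruction error.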


## 3. Mathematical Foundation

### Step-by-Step Mathematics

**Step 1: Standardize the Data**

Center the data (subtract mean):
$$X_{centered} = X - \bar{X}$$

Scale the data (divide by the standard deviation; needed when features are on different scales):
$$X_{scaled} = \frac{X_{centered}}{\sigma}$$
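
To see why scaling matters, consider two unrelated features measured in very different units. The sketch below uses hypothetical features (`height_m` and `income` are invented for this example) and a small helper `first_pc` that is not part of the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
height_m = rng.normal(1.7, 0.1, n)        # hypothetical feature, small numeric range
income = rng.normal(50_000, 15_000, n)    # hypothetical feature, large numeric range
X = np.column_stack([height_m, income])

def first_pc(X):
    """Return the leading eigenvector and its share of the total variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()

pc_raw, ratio_raw = first_pc(X)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pc_scaled, ratio_scaled = first_pc(X_scaled)

print("Raw    PC1:", np.round(pc_raw, 3), f"ratio={ratio_raw:.3f}")
print("Scaled PC1:", np.round(pc_scaled, 3), f"ratio={ratio_scaled:.3f}")
```

Without scaling, PC1 points almost entirely along `income` simply because its numbers are larger. After scaling, both features contribute equally and PC1 explains only about half of the variance, which is what one would expect from two unrelated features.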

**Step 2: Compute Covariance Matrix**

$$C = \frac{1}{n-1} X_{centered}^T X_{centered}$$

where $n$ is the number of samples (use $X_{scaled}$ in place of $X_{centered}$ if the data was scaled).
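
The covariance formula only holds when the data matrix has been centered as in Step 1. A small check on made-up data (the offset of 5.0 is an arbitrary choice for illustration) compares the manual formula with `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, size=(200, 3))     # toy data with a non-zero mean
n = X.shape[0]

X_centered = X - X.mean(axis=0)
C_manual = X_centered.T @ X_centered / (n - 1)
C_numpy = np.cov(X, rowvar=False)          # rowvar=False: columns are features

C_uncentered = X.T @ X / (n - 1)           # same formula, but centering skipped

print("Manual matches np.cov:    ", np.allclose(C_manual, C_numpy))
print("Uncentered matches np.cov:", np.allclose(C_uncentered, C_numpy))
```

The first comparison succeeds and the second fails, which shows that skipping the centering step gives a different matrix and therefore different principal components.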

**Step 3: Eigenvalue Decomposition**

Find eigenvalues $\lambda_i$ and eigenvectors $v_i$ of $C$:
$$C v_i = \lambda_i v_i$$
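
As a small worked example, the eigenpairs of a 2×2 covariance matrix can be computed and checked against the defining equation. The sketch uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order; the matrix values are chosen for illustration.

```python
import numpy as np

C = np.array([[3.0, 1.5],
              [1.5, 1.0]])

eigvals, eigvecs = np.linalg.eigh(C)   # columns of eigvecs are the eigenvectors

for lam, v in zip(eigvals, eigvecs.T):
    # Check the defining equation C v = lambda v
    print(f"lambda={lam:.4f}, v={np.round(v, 4)}, "
          f"C v == lambda v: {np.allclose(C @ v, lam * v)}")

# The eigenvalues sum to the trace of C, i.e. the total variance
print("Sum of eigenvalues:", eigvals.sum(), " trace:", np.trace(C))
```

The eigenvalues come out as roughly 0.197 and 3.803. Their sum equals the trace (4.0), so the larger direction carries about 95% of the total variance.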

**Step 4: Select Principal Components**

- Eigenvalues: Amount of variance along each direction
- Eigenvectors: Directions of principal components
- Sort by eigenvalues (largest first)

**Step 5: Project Data**

Transform data to new space:
$$Y = X_{centered} V_k$$

where $V_k$ contains the top $k$ eigenvectors as columns.
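
The projection step is a single matrix product. The sketch below, on made-up 3D data, shows the shapes involved and checks that the projected columns are uncorrelated with variances equal to the corresponding eigenvalues (the data and variable names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
true_cov = [[4, 2, 0], [2, 3, 0], [0, 0, 0.5]]
X = rng.multivariate_normal([0, 0, 0], true_cov, size=1000)
X_centered = X - X.mean(axis=0)

C = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
V_k = eigvecs[:, :k]                       # shape (3, 2): top-k eigenvectors as columns
Y = X_centered @ V_k                       # shape (1000, 2): projected data

C_Y = np.cov(Y, rowvar=False)
print("V_k shape:", V_k.shape, " Y shape:", Y.shape)
print("Covariance of Y:\n", np.round(C_Y, 4))
print("Top-k eigenvalues:", np.round(eigvals[:k], 4))
```

The covariance of `Y` is diagonal, with the top eigenvalues on the diagonal.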


In [None]:
# Visualize PCA mathematically - 2D example
np.random.seed(42)
mean = [0, 0]
cov = [[3, 1.5], [1.5, 1]]  # Covariance matrix
X_data = np.random.multivariate_normal(mean, cov, 200)

# Center the data
X_centered = X_data - X_data.mean(axis=0)

# Compute covariance matrix
cov_matrix = np.cov(X_centered.T)
print("Covariance Matrix:")
print(cov_matrix)

# Eigenvalue decomposition (eigh is meant for symmetric matrices and returns real values)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"\nEigenvalues (variances): {eigenvalues}")
print(f"\nEigenvector 1 (PC1): {eigenvectors[:, 0]}")
print(f"Eigenvector 2 (PC2): {eigenvectors[:, 1]}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Original data with principal components
ax = axes[0]
ax.scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.6, s=50)

# Plot principal components
origin = X_centered.mean(axis=0)
for i, (eigenval, eigenvec) in enumerate(zip(eigenvalues, eigenvectors.T)):
    ax.arrow(origin[0], origin[1], eigenvec[0] * 3 * np.sqrt(eigenval), 
             eigenvec[1] * 3 * np.sqrt(eigenval),
             head_width=0.3, head_length=0.2, fc=f'C{i}', ec=f'C{i}', 
             linewidth=2, label=f'PC{i+1} (λ={eigenval:.2f})')

ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('Original Data with Principal Components', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

# Projected data
X_projected = X_centered @ eigenvectors
ax = axes[1]
ax.scatter(X_projected[:, 0], X_projected[:, 1], alpha=0.6, s=50, c='green')
ax.set_xlabel('PC1 (Principal Component 1)', fontsize=12)
ax.set_ylabel('PC2 (Principal Component 2)', fontsize=12)
ax.set_title('Projected Data (Rotated to PC space)', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.axvline(x=0, color='k', linestyle='--', alpha=0.3)
ax.set_aspect('equal')

plt.tight_layout()
plt.show()

variance_ratio = eigenvalues / eigenvalues.sum()
print(f"\nVariance explained by PC1: {variance_ratio[0]:.1%}")
print(f"Variance explained by PC2: {variance_ratio[1]:.1%}")


## 4. Implementation from Scratch

Let's implement PCA step-by-step to understand every detail!


In [None]:
class PCAScratch:
    """Principal Component Analysis from Scratch"""
    
    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = None
        self.mean_ = None
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
    
    def fit(self, X):
        """Fit PCA to data"""
        # Step 1: Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_
        
        # Step 2: Compute covariance matrix
        n_samples = X.shape[0]
        cov_matrix = np.dot(X_centered.T, X_centered) / (n_samples - 1)
        
        # Step 3: Eigenvalue decomposition (eigh: symmetric input, real eigenvalues)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        
        # Step 4: Sort by eigenvalue (descending)
        idx = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        
        # Step 5: Select number of components
        if self.n_components is None:
            self.n_components = min(X.shape[0], X.shape[1])
        
        self.components_ = eigenvectors[:, :self.n_components].T
        
        # Store explained variance
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = self.explained_variance_ / eigenvalues.sum()
        
        return self
    
    def transform(self, X):
        """Transform data to principal component space"""
        if self.components_ is None:
            raise ValueError("Model must be fitted first")
        X_centered = X - self.mean_
        return np.dot(X_centered, self.components_.T)
    
    def fit_transform(self, X):
        """Fit and transform in one step"""
        return self.fit(X).transform(X)

# Test our implementation
X_test = np.random.randn(100, 4)
X_test[:, 3] = 0.5 * X_test[:, 0] + 0.5 * X_test[:, 1] + 0.1 * np.random.randn(100)

pca_scratch = PCAScratch(n_components=2)
X_transformed = pca_scratch.fit_transform(X_test)

print("PCA from Scratch Results:")
print("=" * 50)
print(f"Original shape: {X_test.shape}")
print(f"Transformed shape: {X_transformed.shape}")
print(f"\nExplained variance ratio:")
print(f"  PC1: {pca_scratch.explained_variance_ratio_[0]:.1%}")
print(f"  PC2: {pca_scratch.explained_variance_ratio_[1]:.1%}")
print(f"  Total: {pca_scratch.explained_variance_ratio_.sum():.1%}")
print(f"\nFirst Principal Component:")
print(f"  {pca_scratch.components_[0]}")
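
A useful sanity check for an eigendecomposition-based PCA is to compare it with scikit-learn. The sketch below repeats the same steps inline on hypothetical toy data so that it is self-contained. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data with clearly different variances per column (illustrative choice)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Eigendecomposition route: center, covariance, eigh, sort, project
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1][:2]      # indices of the two largest eigenvalues
components = eigvecs[:, order].T           # shape (2, 4), one component per row
scores = Xc @ components.T

# scikit-learn route
sk = PCA(n_components=2).fit(X)
sk_scores = sk.transform(X)

same_variance = np.allclose(eigvals[order], sk.explained_variance_)
same_scores = np.allclose(np.abs(scores), np.abs(sk_scores))
print("Explained variance matches:", same_variance)
print("Scores match up to sign:   ", same_scores)
```

If either check fails on your own implementation, the usual suspects are a missing centering step, dividing by `n` instead of `n - 1`, or eigenvalues left unsorted.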


## 5. Scikit-learn Implementation

### Using scikit-learn PCA

Scikit-learn provides an optimized, production-ready PCA implementation.


In [None]:
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

print("Iris Dataset PCA Analysis")
print("=" * 50)
print(f"Original shape: {X.shape}")
print(f"Features: {feature_names}")

# Standardize the data (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"\nAfter PCA shape: {X_pca.shape}")
print(f"\nExplained variance ratio:")
for i, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {ratio:.1%}")
print(f"  Total: {pca.explained_variance_ratio_.sum():.1%}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Original data (first 2 features)
ax = axes[0]
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7, s=60)
ax.set_xlabel(feature_names[0], fontsize=11)
ax.set_ylabel(feature_names[1], fontsize=11)
ax.set_title('Original Data (First 2 Features)', fontweight='bold')
ax.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax, label='Species')

# PCA projection
ax = axes[1]
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7, s=60)
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=11)
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=11)
ax.set_title('PCA Projection (2D)', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.colorbar(scatter, ax=ax, label='Species')

plt.tight_layout()
plt.show()

# Show component loadings
print("\nComponent Loadings (How features contribute to PCs):")
components_df = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(2)],
    index=feature_names
)
print(components_df)


## 6. Choosing the Number of Components

### How Many Components Should We Keep?

We need to balance:
- **Information retention**: Keep enough variance
- **Dimensionality reduction**: Reduce dimensions significantly

### Methods for Choosing Components

1. **Variance Explained**: Keep components that explain X% of variance (e.g., 95%)
2. **Elbow Method**: Plot cumulative variance, look for elbow
3. **Scree Plot**: Plot eigenvalues, look for drop-off
4. **Kaiser Criterion**: Keep components with eigenvalue > 1 (meaningful only for standardized data, where each original feature has variance 1)
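
Two of these rules are easy to turn into code. The helper below is a sketch: `choose_n_components` is a hypothetical name, and the eigenvalues are invented for illustration rather than taken from a dataset.

```python
import numpy as np

def choose_n_components(eigenvalues, variance_threshold=0.95):
    """Suggest k by the cumulative-variance rule and by the Kaiser criterion."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First position where the cumulative ratio reaches the threshold
    k_threshold = int(np.searchsorted(cumulative, variance_threshold)) + 1
    k_threshold = min(k_threshold, len(eigenvalues))   # guard against rounding
    # Kaiser: count eigenvalues above 1 (assumes standardized data)
    k_kaiser = int(np.sum(eigenvalues > 1.0))
    return {"variance_threshold": k_threshold, "kaiser": k_kaiser}

example_eigenvalues = [2.8, 0.9, 0.2, 0.1]   # made-up values summing to 4
result = choose_n_components(example_eigenvalues)
print(result)   # 3 components by the 95% rule, 1 by the Kaiser criterion
```

The two rules can disagree substantially, as they do here, so it is worth looking at more than one before settling on a value.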


In [None]:
# Analyze variance explained for different numbers of components
iris = load_iris()
X = iris.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)

# Calculate cumulative variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Variance explained per component
ax = axes[0]
ax.bar(range(1, 5), pca_full.explained_variance_ratio_, alpha=0.7, color='steelblue')
ax.axhline(y=0.1, color='r', linestyle='--', alpha=0.7, label='10% threshold')
ax.set_xlabel('Principal Component', fontsize=11)
ax.set_ylabel('Variance Explained Ratio', fontsize=11)
ax.set_title('Variance Explained per Component', fontweight='bold')
ax.set_xticks(range(1, 5))
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Cumulative variance
ax = axes[1]
ax.plot(range(1, 5), cumulative_variance, 'o-', linewidth=2, markersize=8, color='green')
ax.axhline(y=0.95, color='r', linestyle='--', alpha=0.7, label='95% threshold')
ax.axhline(y=0.99, color='orange', linestyle='--', alpha=0.7, label='99% threshold')
ax.fill_between(range(1, 5), cumulative_variance, alpha=0.3, color='green')
ax.set_xlabel('Number of Components', fontsize=11)
ax.set_ylabel('Cumulative Variance Explained', fontsize=11)
ax.set_title('Cumulative Variance Explained', fontweight='bold')
ax.set_xticks(range(1, 5))
ax.set_ylim([0, 1.05])
ax.legend()
ax.grid(True, alpha=0.3)

# Scree plot (eigenvalues)
ax = axes[2]
ax.plot(range(1, 5), pca_full.explained_variance_, 's-', linewidth=2, markersize=8, color='red')
ax.axhline(y=1, color='orange', linestyle='--', alpha=0.7, label='Kaiser criterion (λ=1)')
ax.set_xlabel('Principal Component', fontsize=11)
ax.set_ylabel('Eigenvalue (Variance)', fontsize=11)
ax.set_title('Scree Plot (Eigenvalues)', fontweight='bold')
ax.set_xticks(range(1, 5))
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Component Selection Analysis:")
print("=" * 50)
for i, (var, cum_var) in enumerate(zip(pca_full.explained_variance_ratio_, cumulative_variance), 1):
    print(f"PC{i}: {var:.1%} variance (Cumulative: {cum_var:.1%})")

# Find number of components for 95% variance
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {n_components_95}")
print(f"Components needed for 99% variance: {np.argmax(cumulative_variance >= 0.99) + 1}")


## 7. Real-World Application: Feature Reduction for Classification

Use PCA to reduce dimensions before training a classifier.


In [None]:
# Load Breast Cancer dataset (30 features)
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

print("Breast Cancer Dataset - PCA for Classification")
print("=" * 60)
print(f"Original shape: {X.shape}")
print(f"Number of features: {X.shape[1]}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier on original data
lr_original = LogisticRegression(random_state=42, max_iter=1000)
lr_original.fit(X_train_scaled, y_train)
acc_original = accuracy_score(y_test, lr_original.predict(X_test_scaled))

# Apply PCA and train classifier
pca = PCA(n_components=0.95)  # Keep 95% variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"\nAfter PCA (95% variance):")
print(f"  Shape: {X_train_pca.shape}")
print(f"  Components: {pca.n_components_}")
print(f"  Variance explained: {pca.explained_variance_ratio_.sum():.1%}")

lr_pca = LogisticRegression(random_state=42, max_iter=1000)
lr_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, lr_pca.predict(X_test_pca))

print(f"\nClassification Accuracy:")
print(f"  Original ({X_train_scaled.shape[1]} features): {acc_original:.3f}")
print(f"  After PCA ({pca.n_components_} components): {acc_pca:.3f}")
print(f"\n  Reduction: {X_train_scaled.shape[1] - pca.n_components_} dimensions removed")
print(f"  Accuracy change: {acc_pca - acc_original:+.3f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Variance explained
ax = axes[0]
ax.plot(range(1, min(15, pca.n_components_) + 1), 
        pca.explained_variance_ratio_[:15], 'o-', linewidth=2)
ax.set_xlabel('Principal Component', fontsize=11)
ax.set_ylabel('Variance Explained Ratio', fontsize=11)
ax.set_title('Variance Explained by Each Component', fontweight='bold')
ax.grid(True, alpha=0.3)

# Accuracy comparison
ax = axes[1]
ax.bar(['Original', 'PCA'], [acc_original, acc_pca], alpha=0.7, color=['steelblue', 'green'])
ax.set_ylabel('Accuracy', fontsize=11)
ax.set_title('Classification Accuracy Comparison', fontweight='bold')
ax.set_ylim([0.9, 1.02])  # leave headroom for the value labels above the bars
ax.grid(True, alpha=0.3, axis='y')
for i, acc in enumerate([acc_original, acc_pca]):
    ax.text(i, acc + 0.005, f'{acc:.3f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()


## 8. Limitations of PCA

### When PCA Works Well:
- ✅ **Linear relationships** between features
- ✅ **High-dimensional data** with correlations
- ✅ **Noise reduction** needed
- ✅ **Fast computation** required

### When PCA Fails:
- ❌ **Non-linear relationships** between features
- ❌ **All features are independent** (no correlation)
- ❌ **Interpretability** is critical (PCs are linear combinations)
- ❌ **Outliers** present (sensitive to outliers)
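
The outlier sensitivity is easy to reproduce. In the sketch below (toy data and the helper `first_pc` are assumptions for illustration), a single extreme point is enough to turn the first principal component by roughly 90 degrees.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points stretched along the x-axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

def first_pc(X):
    """Leading eigenvector of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending eigenvalues
    return eigvecs[:, -1]

pc_clean = first_pc(X)

# Add one extreme point far away along the y-axis
X_outlier = np.vstack([X, [0.0, 100.0]])
pc_outlier = first_pc(X_outlier)

print("PC1 without outlier:", np.round(pc_clean, 3))
print("PC1 with outlier:   ", np.round(pc_outlier, 3))
```

Without the outlier, PC1 follows the x-axis. With it, PC1 follows the y-axis, because variance is computed from squared deviations and one large deviation dominates the sum.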

### Alternatives:
- **Kernel PCA**: Non-linear PCA using kernel trick
- **t-SNE / UMAP**: Non-linear dimensionality reduction for visualization
- **Autoencoders**: Neural network-based dimensionality reduction
- **ICA**: Independent Component Analysis (for independent sources)
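
As a brief illustration of the first alternative, the sketch below compares linear PCA and Kernel PCA on concentric circles, a classic case where no linear projection separates the classes. The RBF kernel and `gamma=10` are assumptions picked for this toy dataset and would need tuning elsewhere. Accuracy is measured on the training data only, as a rough indicator of linear separability.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: the class depends on the radius, not on a direction
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

def linear_fit_accuracy(Z, y):
    """Training accuracy of a linear classifier on the projected data."""
    return LogisticRegression(max_iter=1000).fit(Z, y).score(Z, y)

acc_lin = linear_fit_accuracy(X_lin, y)
acc_rbf = linear_fit_accuracy(X_rbf, y)
print(f"Linear classifier on PCA features:        {acc_lin:.2f}")
print(f"Linear classifier on Kernel PCA features: {acc_rbf:.2f}")
```

Linear PCA only rotates the data, so the circles stay concentric, while the kernel projection spreads the two rings apart enough for a linear model to tell them apart.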


## 9. Practice Problems

### Problem 1: Face Recognition with PCA

Use PCA to reduce dimensions of face images while preserving maximum variance.

### Problem 2: Feature Contribution Analysis

Apply PCA to a high-dimensional dataset and analyze which original features contribute most to each principal component.

### Problem 3: Optimal Component Selection

Given a dataset, find the optimal number of components using variance explained and visualization quality.


## 10. Summary & Key Takeaways

### Key Concepts:

1. **PCA Algorithm**:
   - Center and standardize data
   - Compute covariance matrix
   - Find eigenvalues and eigenvectors
   - Select top k components
   - Transform data

2. **Principal Components**:
   - Orthogonal directions of maximum variance
   - Ordered by variance explained
   - Uncorrelated with each other

3. **When to Use**:
   - High-dimensional data
   - Visualization needs (2D/3D projection)
   - Noise reduction
   - Feature extraction

4. **Component Selection**:
   - Variance explained threshold (e.g., 95%)
   - Elbow method
   - Scree plot
   - Kaiser criterion (eigenvalue > 1)

### Important Notes:

- ✅ **Always standardize** before PCA (unless all features are on the same scale)
- ⚠️ **Interpretability**: PCs are linear combinations of the original features, so they are harder to interpret
- ✅ **No labels needed**: PCA is unsupervised
- ⚠️ **Linear only**: Assumes linear relationships
- ⚠️ **Outlier sensitive**: Outliers can dominate principal components

### Time Complexity:
- Covariance matrix: O(n × d²) where n=samples, d=features
- Eigenvalue decomposition: O(d³)
- Transformation: O(n × d × k) where k=components

### Next Steps:
1. **Kernel PCA**: Non-linear dimensionality reduction
2. **t-SNE**: Non-linear visualization technique
3. **Factor Analysis**: Related dimensionality reduction method
4. **Autoencoders**: Deep learning approach to dimensionality reduction

---

**Resources:**
- Scikit-learn Documentation: https://scikit-learn.org/stable/modules/decomposition.html#pca
- "The Elements of Statistical Learning" by Hastie, Tibshirani, Friedman
- "Pattern Recognition and Machine Learning" by Christopher Bishop

