# PCA - Complete Guide
## Principal Component Analysis for Dimensionality Reduction

## 📚 Learning Objectives
- Understand PCA and dimensionality reduction
- Learn how PCA finds principal components
- Determine optimal number of components
- Apply PCA for visualization and feature reduction
- Understand variance explained and reconstruction error
- Use PCA for noise reduction and data compression

## 🎯 What is PCA?

**Principal Component Analysis (PCA)** is an unsupervised dimensionality reduction technique. It rotates the data into a new orthogonal coordinate system in which the first axis (the first principal component) captures the largest share of the variance, the second axis the next largest, and so on.

### Key Concepts:

1. **Principal Components**: New orthogonal axes that capture maximum variance
2. **Variance Explained**: The fraction of the total variance captured by each component
3. **Dimensionality Reduction**: Reduce features while preserving information
4. **Linear Transformation**: PCA finds linear combinations of original features
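
These concepts can be written compactly. The notation below ($X_c$, $\Sigma$, $w_i$, $\lambda_i$, $W_k$) is introduced here for illustration and is not used elsewhere in this guide. With $X_c$ the centered (and usually standardized) $n \times d$ data matrix:

$$
\Sigma = \frac{1}{n-1} X_c^\top X_c, \qquad \Sigma w_i = \lambda_i w_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
$$

$$
z = W_k^\top (x - \mu), \qquad \text{variance explained by PC}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}
$$

Here $\mu$ is the feature-wise mean, $W_k = [w_1, \dots, w_k]$ stacks the top $k$ orthonormal eigenvectors as columns, and $z$ is the $k$-dimensional representation of a sample $x$.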

### How PCA Works:
1. **Standardize** the data (mean=0, std=1)
2. **Compute covariance matrix** of features
3. **Find eigenvectors and eigenvalues** of covariance matrix
4. **Sort eigenvectors** by eigenvalues (descending)
5. **Project data** onto top k eigenvectors
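
The five steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally (it uses an SVD); the function name `pca_from_scratch` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # 1. Standardize each feature to mean 0 and std 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition (eigh is meant for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by eigenvalue, largest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project the data onto the top-k eigenvectors
    return X_std @ eigvecs[:, :k], eigvals / eigvals.sum()

# Toy data (hypothetical): the third feature nearly duplicates the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([base[:, 0], base[:, 1],
                          base[:, 0] + 0.1 * rng.normal(size=200)])

Z, ratio = pca_from_scratch(X_demo, k=2)
print(Z.shape)
print(ratio.round(3))
```

Because one feature is almost redundant, nearly all of the variance should land on the first two components, and the projected columns are uncorrelated with each other.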

### When to Use PCA:
✅ **High-dimensional data** (many features)  
✅ **Visualization** (reduce to 2D/3D)  
✅ **Feature reduction** (remove redundancy)  
✅ **Noise reduction** (keep signal, remove noise)  
✅ **Speed up algorithms** (fewer features = faster)  
✅ **Multicollinearity** (correlated features)  

### Advantages:
✅ Removes correlation between features  
✅ Can speed up downstream algorithms  
✅ Can reduce overfitting by discarding low-variance directions  
✅ Enables visualization  
✅ Few hyperparameters to tune (mainly n_components)  

### Disadvantages:
❌ Loss of interpretability (new features are combinations)  
❌ Assumes linear relationships  
❌ Sensitive to feature scaling  
❌ May not work well for non-linear data  
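
The scaling sensitivity is easy to see on synthetic data. The sketch below uses two made-up, independent features on very different scales (the names `height_m` and `income` are hypothetical) and compares PCA with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
height_m = rng.normal(1.7, 0.1, size=500)        # spread of about 0.1
income = rng.normal(50_000, 15_000, size=500)    # spread of about 15,000
X_raw = np.column_stack([height_m, income])

# Without scaling, the large-scale feature dominates the first component
pca_raw = PCA(n_components=2).fit(X_raw)
# With scaling, both features contribute on equal footing
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X_raw))

print("Raw data, PC1 share:", pca_raw.explained_variance_ratio_[0])
print("Standardized, PC1 share:", pca_std.explained_variance_ratio_[0])
```

On the raw data PC1 is essentially the income axis, so its variance share is close to 1. After standardization the two independent features share the variance roughly evenly.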

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris, load_wine, load_breast_cancer, load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")

## Part 1: Understanding PCA with Iris Dataset
### 1️⃣ Load and Explore Data

In [None]:
# Load iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

print(f"Dataset shape: {X.shape}")
print(f"Features: {list(X.columns)}")
print(f"\nClasses: {iris.target_names}")

# Check correlations
print(f"\n📊 Feature Correlations:")
corr_matrix = X.corr()
print(corr_matrix)

# Visualize correlations
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
           square=True, linewidths=1, cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n💡 High correlations indicate redundancy - PCA can help!")

### 2️⃣ Apply PCA - Step by Step

In [None]:
# Step 1: Standardize the data (CRITICAL for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Step 1: Data Standardized ✅")
print(f"Mean: {X_scaled.mean(axis=0).round(10)}")
print(f"Std: {X_scaled.std(axis=0).round(2)}")

# Step 2: Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

print(f"\nStep 2: PCA Applied ✅")
print(f"Original dimensions: {X.shape[1]}")
print(f"Principal components: {pca.n_components_}")

# Step 3: Analyze variance explained
variance_explained = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance_explained)

print(f"\n📊 Variance Explained by Each Component:")
for i, (var, cum_var) in enumerate(zip(variance_explained, cumulative_variance)):
    print(f"PC{i+1}: {var:.4f} ({var*100:.2f}%) | Cumulative: {cum_var:.4f} ({cum_var*100:.2f}%)")

# Visualize variance explained
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Individual variance
axes[0].bar(range(1, len(variance_explained)+1), variance_explained, 
           color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Principal Component', fontsize=12)
axes[0].set_ylabel('Variance Explained', fontsize=12)
axes[0].set_title('Variance Explained by Each Component', fontsize=14, fontweight='bold')
axes[0].set_xticks(range(1, len(variance_explained)+1))
axes[0].grid(True, alpha=0.3, axis='y')

# Cumulative variance
axes[1].plot(range(1, len(cumulative_variance)+1), cumulative_variance, 
            marker='o', linewidth=2, markersize=8, color='green')
axes[1].axhline(y=0.95, color='red', linestyle='--', linewidth=2, label='95% Variance')
axes[1].axhline(y=0.90, color='orange', linestyle='--', linewidth=2, label='90% Variance')
axes[1].set_xlabel('Number of Components', fontsize=12)
axes[1].set_ylabel('Cumulative Variance Explained', fontsize=12)
axes[1].set_title('Cumulative Variance Explained', fontsize=14, fontweight='bold')
axes[1].set_xticks(range(1, len(cumulative_variance)+1))
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n💡 First 2 components explain {cumulative_variance[1]*100:.2f}% of variance!")

### 3️⃣ Visualize Data in Principal Component Space

In [None]:
# Reduce to 2D for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)

# Create DataFrame for easy plotting
df_pca = pd.DataFrame({
    'PC1': X_pca_2d[:, 0],
    'PC2': X_pca_2d[:, 1],
    'Species': [iris.target_names[i] for i in y]
})

# Visualize
plt.figure(figsize=(12, 8))

for species in iris.target_names:
    mask = df_pca['Species'] == species
    plt.scatter(df_pca.loc[mask, 'PC1'], 
               df_pca.loc[mask, 'PC2'],
               label=species, s=100, alpha=0.7, edgecolors='black')

plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)
plt.title('Iris Dataset in Principal Component Space (2D)', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='k', linestyle='--', linewidth=0.5)
plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

print("\n💡 Setosa separates cleanly in PC space; versicolor and virginica overlap slightly")
print(f"💡 We reduced from {X.shape[1]} to 2 dimensions while keeping {pca_2d.explained_variance_ratio_.sum()*100:.1f}% of the variance")

### 4️⃣ Understanding Principal Components (Loadings)

In [None]:
# Get component loadings (how original features contribute to PCs)
loadings = pd.DataFrame(
    pca_2d.components_.T,
    columns=['PC1', 'PC2'],
    index=X.columns
)

print("📊 Principal Component Loadings:")
print(loadings)
print("\n💡 Loadings show how much each original feature contributes to each PC")

# Visualize loadings
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
sns.heatmap(loadings.T, annot=True, fmt='.3f', cmap='RdBu_r', 
           center=0, ax=axes[0], cbar_kws={'label': 'Loading'})
axes[0].set_title('Component Loadings Heatmap', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Original Features', fontsize=12)
axes[0].set_ylabel('Principal Components', fontsize=12)

# Bar plot for PC1
loadings['PC1'].plot(kind='barh', ax=axes[1], color='steelblue', edgecolor='black')
axes[1].set_title('PC1 Loadings (Most Important Component)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Loading Value', fontsize=12)
axes[1].axvline(x=0, color='black', linestyle='-', linewidth=1)
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("- Positive loading: Feature increases with PC")
print("- Negative loading: Feature decreases with PC")
print("- Large absolute value: Feature is important for this PC")
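
A note on terminology: `components_` holds unit-length eigenvectors. Some texts reserve "loadings" for the eigenvectors scaled by the square root of their eigenvalues, which on standardized data are (up to a small sample-size factor) the correlations between each feature and each PC. The sketch below checks that relationship on Iris; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X_std)
scores = pca.transform(X_std)

# Unit-length eigenvectors, shape (n_features, n_components)
eigvecs = pca.components_.T
# Scaled loadings: eigenvector * sqrt(eigenvalue)
scaled_loadings = eigvecs * np.sqrt(pca.explained_variance_)

# Direct feature-vs-score correlations for comparison
n_samples, n_features = X_std.shape
corr = np.array([
    [np.corrcoef(X_std[:, j], scores[:, i])[0, 1] for i in range(2)]
    for j in range(n_features)
])

# StandardScaler divides by the population std (ddof=0) while PCA's
# eigenvalues use ddof=1, hence the sqrt((n-1)/n) correction factor
print(np.allclose(corr, scaled_loadings * np.sqrt((n_samples - 1) / n_samples)))
```

Scaled loadings are often easier to read than raw eigenvectors because their magnitude is bounded by roughly 1 and reflects how strongly a feature tracks the component.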

## Part 2: PCA for High-Dimensional Data
### 5️⃣ Handwritten Digits (64 dimensions → 2D)

In [None]:
# Load digits dataset (8x8 images = 64 features)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Digits dataset shape: {X_digits.shape}")
print(f"Number of features: {X_digits.shape[1]}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Show sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(digits.images[i], cmap='gray')
    axes[i].set_title(f'Digit: {digits.target[i]}', fontsize=11)
    axes[i].axis('off')

plt.suptitle('Sample Handwritten Digits (8x8 pixels)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Apply PCA
scaler_digits = StandardScaler()
X_digits_scaled = scaler_digits.fit_transform(X_digits)

pca_digits = PCA()
X_digits_pca = pca_digits.fit_transform(X_digits_scaled)

# Analyze variance
cumvar_digits = np.cumsum(pca_digits.explained_variance_ratio_)

# Find number of components for 95% variance
n_components_95 = np.argmax(cumvar_digits >= 0.95) + 1
n_components_90 = np.argmax(cumvar_digits >= 0.90) + 1

print(f"\n📊 Dimensionality Reduction Results:")
print(f"Original dimensions: {X_digits.shape[1]}")
print(f"Components for 90% variance: {n_components_90} ({(1 - n_components_90/X_digits.shape[1])*100:.1f}% reduction)")
print(f"Components for 95% variance: {n_components_95} ({(1 - n_components_95/X_digits.shape[1])*100:.1f}% reduction)")

# Plot cumulative variance
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(cumvar_digits)+1), cumvar_digits, linewidth=2, color='blue')
plt.axhline(y=0.95, color='red', linestyle='--', linewidth=2, label='95% Variance')
plt.axhline(y=0.90, color='orange', linestyle='--', linewidth=2, label='90% Variance')
plt.axvline(x=n_components_95, color='red', linestyle=':', linewidth=2, alpha=0.5)
plt.axvline(x=n_components_90, color='orange', linestyle=':', linewidth=2, alpha=0.5)
plt.xlabel('Number of Components', fontsize=12)
plt.ylabel('Cumulative Variance Explained', fontsize=12)
plt.title('Cumulative Variance Explained - Digits Dataset', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 6️⃣ Visualize High-Dimensional Data in 2D

In [None]:
# Reduce to 2D
pca_2d_digits = PCA(n_components=2)
X_digits_2d = pca_2d_digits.fit_transform(X_digits_scaled)

# Visualize
plt.figure(figsize=(14, 10))

scatter = plt.scatter(X_digits_2d[:, 0], X_digits_2d[:, 1], 
                     c=y_digits, cmap='tab10', 
                     s=50, alpha=0.6, edgecolors='black', linewidths=0.5)

plt.xlabel(f'PC1 ({pca_2d_digits.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca_2d_digits.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)
plt.title('Handwritten Digits in 2D Principal Component Space\n(64D → 2D)',
         fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Digit', ticks=range(10))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n💡 Reduced from {X_digits.shape[1]} to 2 dimensions!")
print(f"💡 Retained {pca_2d_digits.explained_variance_ratio_.sum()*100:.1f}% of variance")

## Part 3: PCA for Machine Learning
### 7️⃣ Impact on Model Performance

In [None]:
# Load breast cancer dataset (30 features)
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

print(f"Breast Cancer dataset: {X_cancer.shape}")
print(f"Features: {X_cancer.shape[1]}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

# Scale data
scaler_cancer = StandardScaler()
X_train_scaled = scaler_cancer.fit_transform(X_train)
X_test_scaled = scaler_cancer.transform(X_test)

# Test different numbers of components
n_components_list = [2, 5, 10, 15, 20, 25, 30]
results = []

for n_comp in n_components_list:
    # Apply PCA
    pca_temp = PCA(n_components=n_comp)
    X_train_pca = pca_temp.fit_transform(X_train_scaled)
    X_test_pca = pca_temp.transform(X_test_scaled)
    
    # Train Logistic Regression
    lr = LogisticRegression(random_state=42, max_iter=10000)
    lr.fit(X_train_pca, y_train)
    lr_acc = accuracy_score(y_test, lr.predict(X_test_pca))
    
    # Train Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train_pca, y_train)
    rf_acc = accuracy_score(y_test, rf.predict(X_test_pca))
    
    # Variance explained
    var_explained = pca_temp.explained_variance_ratio_.sum()
    
    results.append({
        'n_components': n_comp,
        'variance_explained': var_explained,
        'logistic_regression': lr_acc,
        'random_forest': rf_acc
    })

results_df = pd.DataFrame(results)
print("\n📊 Model Performance vs Number of Components:")
print(results_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Accuracy vs components
axes[0].plot(results_df['n_components'], results_df['logistic_regression'], 
            marker='o', linewidth=2, label='Logistic Regression', color='blue')
axes[0].plot(results_df['n_components'], results_df['random_forest'],
            marker='s', linewidth=2, label='Random Forest', color='green')
axes[0].set_xlabel('Number of Components', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Model Accuracy vs Number of PCA Components', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Variance vs components
axes[1].plot(results_df['n_components'], results_df['variance_explained'],
            marker='o', linewidth=2, color='purple')
axes[1].axhline(y=0.95, color='red', linestyle='--', linewidth=2, label='95% Variance')
axes[1].set_xlabel('Number of Components', fontsize=12)
axes[1].set_ylabel('Cumulative Variance Explained', fontsize=12)
axes[1].set_title('Variance Explained vs Number of Components', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
row = results_df.iloc[2]  # the 10-component run
n_comp_shown = int(row['n_components'])  # the row is float-typed, so cast for display
print(f"- With just {n_comp_shown} components ({row['variance_explained']*100:.1f}% variance):")
print(f"  Logistic Regression: {row['logistic_regression']:.4f}")
print(f"  Random Forest: {row['random_forest']:.4f}")
print(f"- Original features: {X_cancer.shape[1]}")
print(f"- Dimensionality reduction: {(1 - n_comp_shown/X_cancer.shape[1])*100:.1f}%")

### 8️⃣ Image Reconstruction and Compression

In [None]:
# Select a digit image
sample_idx = 0
original_image = digits.images[sample_idx]
original_flat = digits.data[sample_idx]

# Standardize
original_scaled = scaler_digits.transform([original_flat])

# Reconstruct with different numbers of components
n_components_test = [2, 5, 10, 20, 40, 64]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

# Original
axes[0].imshow(original_image, cmap='gray')
axes[0].set_title('Original\n(64 features)', fontsize=11, fontweight='bold')
axes[0].axis('off')

# Reconstructions
for idx, n_comp in enumerate(n_components_test[:-1], 1):
    # Apply PCA
    pca_temp = PCA(n_components=n_comp)
    pca_temp.fit(X_digits_scaled)
    
    # Transform and inverse transform
    transformed = pca_temp.transform(original_scaled)
    reconstructed = pca_temp.inverse_transform(transformed)
    
    # Undo the standardization so the image is shown in original pixel units
    reconstructed_image = scaler_digits.inverse_transform(reconstructed).reshape(8, 8)
    
    # Calculate reconstruction error
    mse = np.mean((original_scaled - reconstructed)**2)
    var_explained = pca_temp.explained_variance_ratio_.sum()
    
    # Plot
    axes[idx].imshow(reconstructed_image, cmap='gray')
    axes[idx].set_title(f'{n_comp} components\n({var_explained*100:.1f}% var, MSE={mse:.4f})', 
                       fontsize=10)
    axes[idx].axis('off')

# Hide the unused subplots (original + 5 reconstructions fill the first 6)
for ax in axes[len(n_components_test):]:
    ax.axis('off')

plt.suptitle('Image Reconstruction with Different Numbers of PCA Components', 
            fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n💡 PCA can be used for image compression!")
print("💡 Trade-off: Fewer components = more compression but lower quality")
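
The trade-off can be quantified over the whole dataset: the total squared reconstruction error equals the variance carried by the discarded components. The sketch below checks this numerically on the standardized digits; it refits its own scaler and PCA so that it runs on its own, and `errors` is a name introduced just for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)
n_samples = X.shape[0]

# Fit once with all components, then truncate to the first k
pca = PCA(svd_solver="full").fit(X)
Z = pca.transform(X)

errors = {}
for k in (2, 10, 20, 40):
    # Reconstruct from the first k scores only
    X_rec = Z[:, :k] @ pca.components_[:k] + pca.mean_
    # Squared error summed over features, normalized like a sample variance
    err = np.sum((X - X_rec) ** 2) / (n_samples - 1)
    discarded = pca.explained_variance_[k:].sum()
    errors[k] = (err, discarded)
    print(f"k={k:2d}  reconstruction error={err:.3f}  discarded variance={discarded:.3f}")
```

The two columns agree for every `k`, which is why the cumulative variance curve doubles as a reconstruction-quality curve.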

## 📊 Key Takeaways

### PCA Advantages:
✅ **Reduces dimensionality** - Fewer features, faster algorithms  
✅ **Removes multicollinearity** - Decorrelates features  
✅ **Noise reduction** - Keeps signal, removes noise  
✅ **Visualization** - Enables 2D/3D plotting of high-D data  
✅ **Few hyperparameters** - Mainly need to choose n_components  
✅ **Interpretable variance** - Clear metric of information retained  

### PCA Disadvantages:
❌ **Loss of interpretability** - PCs are combinations of original features  
❌ **Linear only** - Assumes linear relationships  
❌ **Sensitive to scaling** - Must standardize data first  
❌ **Information loss** - Discarding components loses information  
❌ **Not for categorical data** - Works best with continuous features  

### How to Choose Number of Components:

**Method 1: Variance Threshold**
- Keep components that explain 90-95% of variance
- Common in practice

**Method 2: Elbow Method**
- Plot cumulative variance
- Look for "elbow" where curve flattens

**Method 3: Cross-Validation**
- Test different n_components
- Choose based on downstream task performance

**Method 4: Kaiser Criterion**
- Keep components with eigenvalue > 1 (computed on standardized data, where each original feature has variance 1)
- Less common in practice
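
As an illustration of Method 4, the Kaiser criterion can be applied to the eigenvalues of the correlation matrix, which is the covariance matrix of standardized features. This is a sketch on the Wine dataset; the names `eigvals` and `n_kaiser` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine().data

# Correlation matrix = covariance of standardized features
corr = np.corrcoef(X, rowvar=False)

# eigvalsh returns ascending eigenvalues for symmetric matrices; reverse them
eigvals = np.linalg.eigvalsh(corr)[::-1]

# Kaiser criterion: keep components that carry more variance than
# a single standardized feature (eigenvalue > 1)
n_kaiser = int(np.sum(eigvals > 1.0))

print("Eigenvalues:", eigvals.round(2))
print("Components kept by Kaiser criterion:", n_kaiser)
```

The eigenvalues of a correlation matrix sum to the number of features, so the threshold of 1 is simply the average eigenvalue.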

### Best Practices:

1. **Always standardize** - PCA is sensitive to feature scales
   ```python
   scaler = StandardScaler()
   X_scaled = scaler.fit_transform(X)
   ```

2. **Check variance explained** - Ensure you keep enough information
   ```python
   cumsum = np.cumsum(pca.explained_variance_ratio_)
   n_components = np.argmax(cumsum >= 0.95) + 1
   ```

3. **Visualize results** - Plot data in PC space

4. **Validate with downstream task** - Check if PCA improves model performance

5. **Consider alternatives** - For non-linear data, try t-SNE, UMAP, or kernel PCA
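
To illustrate point 5, the sketch below compares linear PCA with scikit-learn's `KernelPCA` on two concentric circles, a standard non-linear toy dataset. The `gamma=10` value and the helper `linear_fit_accuracy` are illustrative assumptions, not tuned or recommended settings.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: no straight line separates the classes
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the data, so the circles stay nested
X_lin = PCA(n_components=2).fit_transform(X_circ)
# RBF kernel PCA applies a non-linear mapping before extracting components
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X_circ)

def linear_fit_accuracy(features, labels):
    """Training accuracy of a linear classifier (a rough separability check)."""
    return LogisticRegression().fit(features, labels).score(features, labels)

acc_lin = linear_fit_accuracy(X_lin, y_circ)
acc_rbf = linear_fit_accuracy(X_rbf, y_circ)
print(f"Linear PCA features: {acc_lin:.2f}")
print(f"Kernel PCA features: {acc_rbf:.2f}")
```

Training accuracy is used here only as a quick proxy for linear separability; a real comparison would evaluate on held-out data.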

### When to Use PCA:
✅ High-dimensional data (many features)  
✅ Correlated features (multicollinearity)  
✅ Need visualization (reduce to 2D/3D)  
✅ Speed up algorithms (fewer features)  
✅ Noise reduction  
✅ Data compression  

### When NOT to Use PCA:
❌ Features are already uncorrelated  
❌ Interpretability is critical  
❌ Non-linear relationships (use kernel PCA, t-SNE)  
❌ Categorical data  
❌ Small number of features  

### Real-World Applications:
1. **Image Compression** - Reduce storage size
2. **Face Recognition** - Eigenfaces
3. **Genomics** - Gene expression analysis
4. **Finance** - Portfolio optimization
5. **Recommender Systems** - Latent factor models
6. **Anomaly Detection** - Reconstruction error
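
The anomaly detection use case (item 6) can be sketched as follows: fit PCA on normal data, then flag points whose reconstruction error is unusually large. The data here is synthetic, and the 99th-percentile threshold and the helper `reconstruction_error` are assumptions made for the example. Standardization is skipped because the synthetic features share a similar scale.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# "Normal" data lies close to a 2D plane embedded in 5D space
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Anomalies: points scattered in all 5 directions, mostly off the plane
X_outliers = rng.normal(scale=3.0, size=(5, 5))

pca = PCA(n_components=2).fit(X_normal)

def reconstruction_error(model, X):
    """Squared distance between each row and its PCA reconstruction."""
    X_rec = model.inverse_transform(model.transform(X))
    return np.sum((X - X_rec) ** 2, axis=1)

# Flag anything worse than the 99th percentile of the training errors
threshold = np.percentile(reconstruction_error(pca, X_normal), 99)
flags = reconstruction_error(pca, X_outliers) > threshold
print("Outliers flagged:", int(flags.sum()), "of", len(flags))
```

Points that follow the learned low-dimensional structure reconstruct well; points that violate it do not, which is what makes the error usable as an anomaly score.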

### PCA Workflow:
```python
# 1. Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fit PCA
pca = PCA(n_components=0.95)  # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)

# 3. Check variance
print(f"Components: {pca.n_components_}")
print(f"Variance: {pca.explained_variance_ratio_.sum()}")

# 4. Use transformed data
model.fit(X_pca, y)
```
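
When PCA feeds a supervised model, one option is to wrap the workflow in a scikit-learn `Pipeline`, so that the scaler and PCA are refit on the training portion of each cross-validation fold, and to let `GridSearchCV` pick `n_components` (Method 3 above). The sketch below uses the breast cancer dataset; the candidate grid is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling and PCA live inside the pipeline, so no test-fold
# statistics leak into the fitted transformers
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=10000)),
])

# Candidate values for n_components (illustrative grid)
param_grid = {"pca__n_components": [2, 5, 10, 15]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best n_components:", search.best_params_["pca__n_components"])
print(f"Mean CV accuracy: {search.best_score_:.4f}")
```

The fitted `search.best_estimator_` is itself a pipeline, so it can be applied to new raw data without repeating the scaling and projection by hand.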

### Comparison with Other Methods:

| Method | Type | Linear | Structure Preserved | Speed |
|--------|------|--------|---------------------|-------|
| **PCA** | Unsupervised | Yes | Global | Fast |
| **t-SNE** | Unsupervised | No | Local | Slow |
| **UMAP** | Unsupervised | No | Local | Medium |
| **LDA** | Supervised | Yes | Global | Fast |
| **Autoencoder** | Unsupervised | No | Learned | Medium |

### Next Steps:
1. Try kernel PCA for non-linear data
2. Compare with t-SNE for visualization
3. Use PCA as preprocessing for ML models
4. Apply to your own high-dimensional datasets
5. Experiment with incremental PCA for large datasets
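
As a starting point for step 5, scikit-learn's `IncrementalPCA` can be fitted batch by batch with `partial_fit`, which helps when the data does not fit in memory. The sketch below simulates batches by splitting the digits data into 10 chunks; in practice the batches would be read from disk or a stream.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

ipca = IncrementalPCA(n_components=10)

# Feed the data in chunks; each batch needs at least n_components rows
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)
print(f"Variance retained: {ipca.explained_variance_ratio_.sum():.3f}")
```

The result approximates what a full `PCA` would give, at the cost of some accuracy that depends on batch size.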