# 15. Principal Component Analysis (PCA)

**Purpose:** Learn and revise **PCA** for dimensionality reduction in Scikit-learn.

---

## What is PCA?

**Principal Component Analysis** finds **linear combinations** of the original features that capture the most **variance**. The first principal component (PC1) is the direction of maximum variance; PC2 is the direction of maximum remaining variance that is orthogonal to PC1; and so on.

- **Eigen decomposition** of the covariance matrix (or SVD of centered data) gives the principal directions and explained variance.
- You choose the number of components \( k \) (or a target fraction of variance, e.g. 0.95). Transformed data has \( k \) dimensions.

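The two routes in the list above can be compared in a minimal NumPy sketch. The data and variable names (`X_toy`, `Xc`, `Z`) are made up for illustration and are not part of the notebook. Principal directions are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))                    # hypothetical toy data
X_toy[:, 1] = 0.8 * X_toy[:, 0] + 0.2 * X_toy[:, 1]  # correlate two features

# Center each feature (PCA is defined on centered data).
Xc = X_toy - X_toy.mean(axis=0)
n_samples = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = S**2 / (n_samples - 1)        # variances along the rows of Vt

# Fraction of total variance per component.
explained_ratio = eigvals / eigvals.sum()

# Project onto the first k directions to get k-dimensional data.
k = 2
Z = Xc @ eigvecs[:, :k]

print("Variances agree:", np.allclose(eigvals, svd_vars))
print("Directions agree up to sign:", np.allclose(np.abs(eigvecs.T), np.abs(Vt), atol=1e-6))
print("Projected shape:", Z.shape)
```
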
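**Key idea:** PCA needs **centered** data, and usually **scaled** data too. Scikit-learn's PCA centers the data itself but does not scale it, so standardize features that are on different scales first. Use PCA for visualization (e.g. 2D), noise reduction, or as preprocessing before another model.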

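To see why scaling matters, here is a small sketch on hypothetical data (`X_demo` is invented for this example) with three independent features, one of them on a much larger scale. It assumes `make_pipeline` from `sklearn.pipeline` as one convenient way to chain `StandardScaler` and `PCA`; the notebook itself scales in a separate step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent features; the third is ~100x larger in scale (made-up data).
X_demo = rng.normal(size=(300, 3)) * np.array([1.0, 1.0, 100.0])

# PCA on raw data: the large-scale feature dominates the first component.
pca_raw = PCA(n_components=2).fit(X_demo)

# PCA after standardization: every feature contributes on equal footing.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)

raw_ratio = pca_raw.explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print("Without scaling:", raw_ratio)
print("With scaling:   ", scaled_ratio)
print("PC1 loadings without scaling:", pca_raw.components_[0])
```

Without scaling, PC1 is essentially the large-scale feature on its own, which rarely reflects real structure in the data.
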
## Concepts to Remember

| Concept | Description |
|--------|-------------|
| **n_components** | Number of components to keep (int), or a float between 0 and 1 (e.g. 0.95) for the minimum fraction of variance to retain. |
| **Explained variance** | How much variance each PC captures; **explained_variance_ratio_** sums to 1 only when all components are kept, and to less than 1 otherwise. |
| **Centering** | PCA always centers the data but does not scale it; use **StandardScaler** first if features have different scales. |
| **When to use** | Dimensionality reduction, visualization, decorrelation, noise reduction. |

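The `n_components` and explained-variance rows can be illustrated with a short sketch. The data here is hypothetical (10 features generated from 3 latent factors plus a little noise), and `svd_solver="full"` is passed explicitly as a conservative choice when `n_components` is a float. No particular output values are claimed; run it to see how many components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: 10 observed features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))
X_demo_s = StandardScaler().fit_transform(X_demo)

# Keep every component: the ratios sum to 1.
pca_full = PCA().fit(X_demo_s)

# Keep just enough components to explain at least 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo_s)

print("Sum of ratios, all components:", pca_full.explained_variance_ratio_.sum())
print("Components kept for 95%:", pca_95.n_components_)
print("Sum of ratios, kept components:", pca_95.explained_variance_ratio_.sum())
```
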
In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
np.random.seed(42)
X = np.random.randn(100, 5)  # 5 features
X[:, 1] = 0.9 * X[:, 0] + 0.1 * np.random.randn(100)  # correlate with feature 0

scaler = StandardScaler()
X_s = scaler.fit_transform(X)

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_s)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
print("Components shape:", pca.components_.shape)

In [None]:
plt.figure(figsize=(6, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Data in first 2 principal components")
plt.tight_layout()
plt.show()

## Key Takeaways

- Create **PCA(n_components=k)** and call **fit_transform** on scaled data (PCA centers it for you); the rows of **components_** are the principal directions.
- **explained_variance_ratio_** tells you how much variance each PC captures; use it to choose k.
- **inverse_transform** reconstructs data in original space (with loss if k < full rank).
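
As a rough illustration of the last point, the sketch below reconstructs hypothetical random data (`X_demo`, not the notebook's `X_s`) with a reduced and a full number of components and compares the mean squared reconstruction error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # made-up data with 5 features

errors = {}
for k in (2, 5):
    pca_k = PCA(n_components=k).fit(X_demo)
    # Project to k dimensions, then map back to the original 5-dimensional space.
    X_rec = pca_k.inverse_transform(pca_k.transform(X_demo))
    errors[k] = np.mean((X_demo - X_rec) ** 2)
    print(f"k={k}: reconstruction MSE = {errors[k]:.6f}")
```

With all 5 components the reconstruction is exact up to floating-point error; with 2 components the discarded variance shows up as reconstruction error.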