# Dimensionality Reduction

Dimensionality reduction maps data with many features onto a smaller set of features or derived components while preserving as much of the relevant structure as possible.

### Why Dimensionality Reduction?
- Simplify high-dimensional data
- Reduce noise & redundancy
- Speed up training
- Make data plottable in 2D/3D

### Popular Techniques:
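- **PCA (Principal Component Analysis)**: Unsupervised, linear; projects data onto the orthogonal directions of greatest variance.
- **LDA (Linear Discriminant Analysis)**: Supervised, linear; finds directions that maximize separation between classes relative to the spread within each class.
- **t-SNE / UMAP**: Non-linear methods that preserve local neighborhoods; mainly used to visualize cluster structure.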


In [1]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE

In [2]:
# Load the digits dataset: 1,797 8x8 grayscale images, each flattened to 64 features
digits = load_digits()
X = digits.data
y = digits.target

print("Original shape:", X.shape)

In [3]:
# Standardize each feature to zero mean and unit variance (PCA is sensitive to feature scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
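
To make the standardization step concrete, the sketch below reproduces it by hand with NumPy and compares the result against `StandardScaler` under its default settings. The names `mean`, `std`, `safe_std`, and `X_manual` are illustrative and not part of the notebook. Pixels that never change across the dataset have zero standard deviation, so the sketch divides those columns by 1, which is assumed to match how scikit-learn treats constant features.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data

# Per-feature mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Constant features have std == 0; divide by 1 instead to avoid NaNs
safe_std = np.where(std == 0, 1.0, std)
X_manual = (X - mean) / safe_std

X_scaled = StandardScaler().fit_transform(X)
print("Matches StandardScaler:", np.allclose(X_manual, X_scaled))
print("Constant features:", int((std == 0).sum()))
```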

## 1. Principal Component Analysis (PCA)

In [4]:
# Project the standardized data onto the two directions of greatest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=15)
plt.legend(*scatter.legend_elements(), title="Digits")
plt.title("PCA (2 Components)")
plt.show()

print("Explained variance ratio:", pca.explained_variance_ratio_)
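
Two components are convenient for plotting but usually keep only part of the variance. The sketch below is one way to check how many components a given variance target needs, using the cumulative sum of `explained_variance_ratio_`. The 95% target and the names `pca_full`, `cumulative`, `n_needed`, and `pca_95` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

# Fit PCA with all components to inspect the full variance spectrum
pca_full = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95  # assumed threshold, chosen only for illustration
n_needed = int(np.argmax(cumulative >= target)) + 1
print("Variance kept by 2 components:", cumulative[1])
print("Components needed for 95%:", n_needed)

# Shortcut: a float n_components asks PCA to pick the count itself
pca_95 = PCA(n_components=target, svd_solver="full").fit(X_scaled)
print("PCA(n_components=0.95) keeps:", pca_95.n_components_)
```

Passing a float between 0 and 1 as `n_components` is the more compact option; the explicit cumulative sum is mainly useful when you want to plot or inspect the whole curve.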

## 2. Linear Discriminant Analysis (LDA)
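LDA is supervised: it uses the class labels to find projections that best separate the classes. It yields at most `n_classes - 1` components (9 for the ten digit classes).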

In [5]:
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', s=15)
plt.legend(*scatter.legend_elements(), title="Digits")
plt.title("LDA (2 Components)")
plt.show()
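
The number of LDA components is capped by the number of classes. The sketch below illustrates that limit on the digits data: nine components are accepted, while ten are rejected. It assumes a recent scikit-learn release, where an oversized `n_components` raises a `ValueError` at fit time; the names `lda_max`, `X_lda_max`, and `too_many_rejected` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
y = digits.target

# With 10 classes, LDA can produce at most 10 - 1 = 9 components
lda_max = LinearDiscriminantAnalysis(n_components=9)
X_lda_max = lda_max.fit_transform(X_scaled, y)
print("Shape with 9 components:", X_lda_max.shape)
print("Explained variance ratio:", lda_max.explained_variance_ratio_)

# Asking for more than n_classes - 1 components is rejected at fit time
try:
    LinearDiscriminantAnalysis(n_components=10).fit(X_scaled, y)
    too_many_rejected = False
except ValueError as err:
    too_many_rejected = True
    print("ValueError:", err)
```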

## 3. t-SNE (t-Distributed Stochastic Neighbor Embedding)
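t-SNE is a non-linear method that tries to keep neighboring points close together; it is mainly used for visualization in 2D/3D. Cluster sizes and the distances between clusters in the plot should not be read as faithful to the original space.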

In [6]:
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled[:1000])  # first 1,000 samples only (t-SNE gets slow as sample count grows)
y_subset = y[:1000]

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset, cmap='tab10', s=15)
plt.legend(*scatter.legend_elements(), title="Digits")
plt.title("t-SNE (2 Components)")
plt.show()
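
A common way to speed up t-SNE is to compress the data with PCA first and run t-SNE on the result. The sketch below shows that two-step approach on a small subset; the subset size (300), the 30 intermediate PCA components, and the names `X_small`, `X_reduced`, and `X_embedded` are assumptions for illustration, not settings from the notebook. It also checks that scikit-learn's `TSNE` exposes no `transform()` method, so new points cannot be projected into an existing embedding.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small subset (assumed size) so the sketch runs quickly
X_small = StandardScaler().fit_transform(load_digits().data)[:300]

# Step 1: compress 64 features to 30 with PCA to cut noise and runtime
X_reduced = PCA(n_components=30, random_state=42).fit_transform(X_small)

# Step 2: embed the PCA output in 2D with t-SNE
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_embedded = tsne.fit_transform(X_reduced)
print("Embedded shape:", X_embedded.shape)
print("Final KL divergence:", tsne.kl_divergence_)

# t-SNE learns no reusable mapping: the estimator has no transform() method
print("Has transform():", hasattr(tsne, "transform"))
```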

### Key Takeaways:
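- **PCA**: Unsupervised, fast, linear; keeps the directions of greatest variance.
- **LDA**: Supervised; maximizes class separation, limited to `n_classes - 1` components.
- **t-SNE**: Non-linear, great for visualization, but slow and unable to project new data.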

👉 Use PCA or LDA as a preprocessing step before ML models (fit them on the training data only), and reserve t-SNE for plotting high-dimensional data.
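
As a rough illustration of the preprocessing use case, the sketch below places PCA and LDA in front of a classifier inside a scikit-learn pipeline and scores each variant on a held-out split. The classifier (`KNeighborsClassifier`), the split size, the 20 PCA components, and the names `reducers` and `scores` are all assumptions chosen for the example; it is not a benchmark.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25,
    stratify=digits.target, random_state=42,
)

# Each pipeline fits the scaler and reducer on the training split only,
# then applies the learned transforms to the test split at score time
reducers = {
    "PCA (20 components)": PCA(n_components=20),
    "LDA (9 components)": LinearDiscriminantAnalysis(n_components=9),
}
scores = {}
for name, reducer in reducers.items():
    model = make_pipeline(StandardScaler(), reducer, KNeighborsClassifier())
    model.fit(X_train, y_train)  # labels reach LDA; PCA ignores them
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Wrapping the reducer in a pipeline keeps the test data out of the fitting step, which is the main reason to prefer it over calling `fit_transform` on the full dataset before splitting.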