# Day 45: Advanced Dimensionality Reduction - UMAP and Comparisons

Welcome to Day 45, the final lesson of Week 9! In this lesson, we explore advanced dimensionality reduction techniques, focusing on UMAP (Uniform Manifold Approximation and Projection) and comparing it with traditional methods like PCA and t-SNE.

## Definition

**Dimensionality Reduction** is the transformation of data from a high-dimensional space into a low-dimensional space while retaining meaningful properties of the original data. This is critical in modern machine learning where datasets often have hundreds or thousands of features.

**UMAP (Uniform Manifold Approximation and Projection)** is a manifold learning technique for dimensionality reduction, introduced in 2018 and grounded in Riemannian geometry and algebraic topology. It aims to preserve local neighborhoods while retaining more of the global layout than t-SNE typically does.

## Importance

Dimensionality reduction is essential for several reasons:

1. **Visualization**: Humans can only visualize 2D or 3D data effectively
2. **Computational Efficiency**: Fewer dimensions mean faster algorithms
3. **Curse of Dimensionality**: Many ML algorithms struggle in high dimensions
4. **Noise Reduction**: Removing irrelevant features can improve model performance
5. **Feature Engineering**: Creating better representations for downstream tasks

## Applications and Examples

1. **Genomics**: Analyzing expression patterns across thousands of genes
2. **Computer Vision**: Extracting features from images with millions of pixels
3. **NLP**: Visualizing word embeddings and document similarities
4. **Anomaly Detection**: Identifying outliers in high-dimensional sensor data
5. **Recommendation Systems**: Understanding user preferences across many items

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, make_swiss_roll
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
import warnings
import time
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set plotting style
try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    # Older Matplotlib versions do not ship this style name
    plt.style.use('default')

sns.set_palette("husl")

print("✓ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Mathematical Foundations

## 1. Principal Component Analysis (PCA)

PCA is a linear technique that finds orthogonal directions of maximum variance.

**Mathematical Formulation:**

For data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$:

1. Center data: $\mathbf{X}_c = \mathbf{X} - \bar{\mathbf{X}}$
2. Compute covariance: $\mathbf{C} = \frac{1}{n-1}\mathbf{X}_c^T\mathbf{X}_c$
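3. Find eigenvectors $\mathbf{v}_i$ and eigenvalues $\lambda_i$ of $\mathbf{C}$, sorted so that $\lambda_1 \geq \lambda_2 \geq \dots$
4. Project: $\mathbf{Y} = \mathbf{X}_c \mathbf{W}$, where the columns of $\mathbf{W}$ are the top $k$ eigenvectors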

**Variance explained:**

$$\text{Variance}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$

**Advantages**: Fast, deterministic, interpretable, preserves global structure

**Limitations**: Only linear, may miss complex patterns
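
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements `PCA` internally; the function name `pca_eig` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, n_components=2):
    """PCA via eigendecomposition of the sample covariance matrix."""
    X_c = X - X.mean(axis=0)                    # step 1: center each feature
    C = (X_c.T @ X_c) / (X.shape[0] - 1)        # step 2: covariance matrix (p x p)
    eigvals, eigvecs = np.linalg.eigh(C)        # step 3: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :n_components]               # top-k eigenvectors as columns
    Y = X_c @ W                                 # step 4: project
    ratio = eigvals / eigvals.sum()             # variance explained per component
    return Y, W, ratio

# Hypothetical correlated data: 200 samples, 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Y_demo, W_demo, ratio_demo = pca_eig(X_demo, n_components=2)
print("Projected shape:", Y_demo.shape)
print("Variance explained by each component:", ratio_demo.round(3))
```

The projected columns are uncorrelated by construction, which is why the explained-variance ratios can simply be added up.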

## 2. t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE minimizes KL divergence between high and low-dimensional probability distributions.

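**High-dimensional similarity:** Each point $x_i$ gets its own bandwidth $\sigma_i$, chosen so that the conditional distribution matches the requested perplexity. The conditional similarities are then symmetrized:

$$p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$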

**Low-dimensional similarity (t-distribution):**

$$q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l}(1 + ||y_k - y_l||^2)^{-1}}$$

**Objective:**

$$C = KL(P||Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}}$$

**Advantages**: Excellent visualization, preserves local structure

**Limitations**: Slow, non-deterministic, poor global structure preservation
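
To make the objective concrete, here is a minimal NumPy sketch of the two similarity matrices and the KL cost. It assumes a single fixed bandwidth `sigma` for every point, which is a simplification: real t-SNE searches for a per-point bandwidth that matches the perplexity. The helper names and toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.maximum(D, 0.0)                     # clip tiny negatives from rounding

def tsne_p(X, sigma=1.0):
    """Symmetrized Gaussian similarities (ASSUMPTION: one shared sigma)."""
    n = X.shape[0]
    P_cond = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)                 # a point is not its own neighbor
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # each row is a conditional distribution
    return (P_cond + P_cond.T) / (2 * n)          # symmetrize; entries sum to 1

def tsne_q(Y):
    """Student-t (one degree of freedom) similarities in the embedding."""
    num = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), skipping entries where P is zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Toy data: two Gaussian blobs in 5 dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(20, 5)),
                    rng.normal(3, 1, size=(20, 5))])
P = tsne_p(X_demo, sigma=1.0)

Y_first_two = X_demo[:, :2]                # candidate embedding: first two coordinates
Y_random = rng.normal(size=(40, 2))        # candidate embedding: unrelated noise

print("KL for first-two-coordinates embedding:", kl_divergence(P, tsne_q(Y_first_two)))
print("KL for random embedding:               ", kl_divergence(P, tsne_q(Y_random)))
```

t-SNE itself starts from an embedding like `Y_random` and follows the gradient of this cost, so the printed values are only a way to compare candidate layouts.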

## 3. UMAP (Uniform Manifold Approximation and Projection)

UMAP uses topological data analysis and Riemannian geometry.

**Fuzzy set membership (high-dimensional):**

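$$w_{j|i} = \exp\left(-\frac{\max(0,\, d(x_i, x_j) - \rho_i)}{\sigma_i}\right)$$

where $\rho_i$ is the distance from $x_i$ to its nearest neighbor and $\sigma_i$ is a per-point scale chosen so that the weights over the $k$ nearest neighbors (`n_neighbors`) sum to $\log_2 k$. The directed weights are then symmetrized: $w_{ij} = w_{j|i} + w_{i|j} - w_{j|i}\, w_{i|j}$.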

**Low-dimensional embedding:**

$$v_{ij} = \frac{1}{1 + a \cdot d(y_i, y_j)^{2b}}$$

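**Optimization via cross-entropy**: UMAP minimizes the fuzzy-set cross-entropy between the two graphs, $\sum_{i \neq j} \left[ w_{ij} \log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log\frac{1 - w_{ij}}{1 - v_{ij}} \right]$. The first term pulls neighbors together, much like the t-SNE objective; the second term adds an explicit repulsion between non-neighbors.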

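**Advantages**: Fast, scalable, tends to preserve local structure well while retaining more global structure than t-SNE

**Limitations**: Stochastic (results vary with the random seed), sensitive to `n_neighbors` and `min_dist`, and distances between clusters should still be interpreted with caution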
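
The high-dimensional membership weights can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: it uses brute-force distances instead of approximate nearest neighbors, excludes each point from its own neighbor list, and finds $\sigma_i$ by plain bisection. The function name `umap_fuzzy_graph` is hypothetical and is not part of umap-learn.

```python
import numpy as np

def umap_fuzzy_graph(X, n_neighbors=5, n_iter=64):
    """Directed and symmetrized fuzzy membership weights (simplified sketch)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(D, np.inf)                         # exclude self from neighbor search

    knn_idx = np.argsort(D, axis=1)[:, :n_neighbors]    # indices of k nearest neighbors
    knn_dist = np.take_along_axis(D, knn_idx, axis=1)   # their distances, ascending
    rho = knn_dist[:, 0]                                # distance to the nearest neighbor
    target = np.log2(n_neighbors)

    W_dir = np.zeros((n, n))
    for i in range(n):
        # Distances beyond rho_i, clipped at zero so the nearest neighbor gets weight 1
        excess = np.maximum(knn_dist[i] - rho[i], 0.0)
        lo, hi = 1e-6, 1e6
        for _ in range(n_iter):                         # bisection on sigma_i
            sigma = 0.5 * (lo + hi)
            if np.exp(-excess / sigma).sum() > target:
                hi = sigma                              # weights too large: shrink sigma
            else:
                lo = sigma
        W_dir[i, knn_idx[i]] = np.exp(-excess / sigma)

    W_sym = W_dir + W_dir.T - W_dir * W_dir.T           # fuzzy union of directed edges
    return W_sym, W_dir

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
W_sym, W_dir = umap_fuzzy_graph(X_demo, n_neighbors=5)

print("Directed row sums (target log2(5) = %.3f):" % np.log2(5), W_dir.sum(axis=1)[:3].round(3))
print("Symmetric:", np.allclose(W_sym, W_sym.T))
```

Subtracting $\rho_i$ guarantees every point is connected to at least one neighbor with full weight, which keeps the graph from falling apart in sparse regions.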

In [None]:
# Load digits dataset (8x8=64 dimensional)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print("="*60)
print("DIGITS DATASET")
print("="*60)
print(f"Shape: {X_digits.shape}")
print(f"Features: {X_digits.shape[1]} (8x8 pixel images)")
print(f"Samples: {X_digits.shape[0]}")
print(f"Classes: {len(np.unique(y_digits))} (digits 0-9)")
print("="*60)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_digits)

print(f"\nStandardized: mean={X_scaled.mean():.2e}, std={X_scaled.std():.2f}")

# Visualize samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_digits[i].reshape(8, 8), cmap='gray')
    ax.set_title(f"Digit: {y_digits[i]}")
    ax.axis('off')
plt.suptitle('Sample Handwritten Digits (64 features each)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Apply PCA
print("\nApplying PCA...")
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.2%}")

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_digits, cmap='tab10',
                     alpha=0.6, edgecolors='k', linewidth=0.5, s=50)
plt.colorbar(scatter, label='Digit', ticks=range(10))
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('PCA: Digits in 2D', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nNote: PCA captures {pca.explained_variance_ratio_.sum():.1%} of variance")
print("Some digit classes overlap - PCA is linear!")

In [None]:
# Apply t-SNE
print("\nApplying t-SNE (may take 30-60 seconds)...")
start = time.time()

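# Iteration count is left at the default (1000); the argument is named `n_iter`
# in older scikit-learn releases and `max_iter` from version 1.5 onward
tsne = TSNE(n_components=2, random_state=42, perplexity=30, verbose=0)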
X_tsne = tsne.fit_transform(X_scaled)

print(f"Completed in {time.time()-start:.2f}s")
print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_digits, cmap='tab10',
                     alpha=0.6, edgecolors='k', linewidth=0.5, s=50)
plt.colorbar(scatter, label='Digit', ticks=range(10))
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE: Digits in 2D', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nExcellent cluster separation!")

In [None]:
# Install and apply UMAP
import subprocess
import sys

try:
    import umap
    print("✓ UMAP already installed")
except ImportError:
    print("Installing UMAP...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "umap-learn"])
    import umap
    print("✓ UMAP installed!")

# Apply UMAP
print("\nApplying UMAP...")
start = time.time()

reducer = umap.UMAP(n_components=2, random_state=42, n_neighbors=15, min_dist=0.1)
X_umap = reducer.fit_transform(X_scaled)

print(f"Completed in {time.time()-start:.2f}s")

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y_digits, cmap='tab10',
                     alpha=0.6, edgecolors='k', linewidth=0.5, s=50)
plt.colorbar(scatter, label='Digit', ticks=range(10))
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.title('UMAP: Digits in 2D', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nClear clusters. Compare the runtime with t-SNE above (UMAP's first call includes JIT compilation).")

In [None]:
# Side-by-side comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for (emb, name, ax) in [(X_pca, 'PCA', axes[0]),
                         (X_tsne, 't-SNE', axes[1]),
                         (X_umap, 'UMAP', axes[2])]:
    scatter = ax.scatter(emb[:, 0], emb[:, 1], c=y_digits, cmap='tab10',
                        alpha=0.6, edgecolors='k', linewidth=0.5, s=40)
    ax.set_title(name, fontsize=14, fontweight='bold')
    ax.set_xlabel(f'{name} 1')
    ax.set_ylabel(f'{name} 2')
    ax.grid(True, alpha=0.3)

plt.colorbar(scatter, ax=axes, label='Digit', ticks=range(10), pad=0.01)
plt.suptitle('Comparison: PCA vs t-SNE vs UMAP', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nObservations:")
print("  PCA: Fast, linear, some overlap")
print("  t-SNE: Slower, excellent clusters, distances between clusters not meaningful")
print("  UMAP: Usually fast, good clusters, retains more global structure than t-SNE")

# Understanding UMAP Parameters

UMAP has two key hyperparameters:

## 1. `n_neighbors` (default=15)

Controls local vs global structure balance:

- **Small (5-10)**: Very local focus, tight clusters
- **Large (30-50)**: More global structure preservation

## 2. `min_dist` (default=0.1)

Controls point packing in low-dimensional space:

- **Small (0.0-0.1)**: Tight clusters
- **Large (0.5-0.99)**: Spread out points

**Recommended Settings:**
- Clustering: `min_dist=0.0`, `n_neighbors=15`
- Visualization: `min_dist=0.1`, `n_neighbors=15-30`
- Topology: `min_dist=0.1-0.3`, `n_neighbors=30-50`
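
To see why `min_dist` controls packing, it helps to look at how it shapes the low-dimensional similarity curve $1 / (1 + a\,d^{2b})$. The constants $a$ and $b$ are fitted so that the curve approximates a target that equals 1 up to `min_dist` and decays exponentially afterwards. umap-learn does this fit with a nonlinear least-squares routine; the coarse grid search below is only a stand-in, and the function names are hypothetical.

```python
import numpy as np

def target_curve(d, min_dist, spread=1.0):
    """Flat at 1 up to min_dist, then exponential decay."""
    return np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))

def fit_ab_grid(min_dist, spread=1.0):
    """Crude grid-search fit of a and b (ASSUMPTION: stand-in for a proper least-squares fit)."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    y = target_curve(d, min_dist, spread)
    best_err, best_a, best_b = np.inf, None, None
    for a in np.linspace(0.1, 3.0, 59):          # grid step 0.05
        for b in np.linspace(0.3, 2.0, 35):      # grid step 0.05
            err = np.sum((1.0 / (1.0 + a * d ** (2 * b)) - y) ** 2)
            if err < best_err:
                best_err, best_a, best_b = err, a, b
    return best_a, best_b

for md in (0.0, 0.1, 0.5):
    a, b = fit_ab_grid(md)
    print(f"min_dist={md:.1f} -> a≈{a:.2f}, b≈{b:.2f}")
```

A larger `min_dist` widens the plateau where points already count as fully similar, so the optimizer has no incentive to pull them closer and clusters appear more spread out.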

In [None]:
# Parameter exploration
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
fig.suptitle('UMAP Parameter Effects', fontsize=16, fontweight='bold')

params = [
    {'n_neighbors': 5, 'min_dist': 0.0, 'title': 'Local + Tight\n(n=5, d=0.0)'},
    {'n_neighbors': 15, 'min_dist': 0.1, 'title': 'Balanced (DEFAULT)\n(n=15, d=0.1)'},
    {'n_neighbors': 50, 'min_dist': 0.1, 'title': 'Global\n(n=50, d=0.1)'},
    {'n_neighbors': 15, 'min_dist': 0.9, 'title': 'Balanced + Loose\n(n=15, d=0.9)'},
]

for ax, param in zip(axes.flat, params):
    print(f"Computing: {param['title'].split(chr(10))[0]}...")
    reducer = umap.UMAP(n_components=2, n_neighbors=param['n_neighbors'],
                       min_dist=param['min_dist'], random_state=42)
    emb = reducer.fit_transform(X_scaled)
    scatter = ax.scatter(emb[:, 0], emb[:, 1], c=y_digits, cmap='tab10',
                        alpha=0.6, s=20, edgecolors='w', linewidth=0.3)
    ax.set_title(param['title'], fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.colorbar(scatter, ax=axes, label='Digit', pad=0.01)
plt.tight_layout()
plt.show()

print("\nLower n_neighbors → more local, tighter")
print("Higher n_neighbors → more global, looser")
print("Lower min_dist → tighter packing")
print("Higher min_dist → more spread out")

# The Swiss Roll Test

The Swiss Roll is a classic test for manifold learning - a 2D surface rolled in 3D.

**Why it matters**: Success means the algorithm can "unroll" non-linear manifolds while preserving structure.

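- **PCA**: A linear projection cannot flatten the roll, so layers end up on top of each other
- **t-SNE**: Tends to break the sheet into fragments because it emphasizes very local neighborhoods
- **UMAP**: Often recovers a continuous strip, though the result depends on `n_neighbors` and the sheet can still tear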
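
One hedged way to put a number on "unrolling" is to check whether some embedding axis is monotone in the position along the roll. The sketch below builds a noise-free roll with the same parameterization `make_swiss_roll` uses, then scores an ideal flattening and a PCA projection. The helpers `spearman` and `unroll_score` are hypothetical, and the score only detects axis-aligned unrolling, so a correctly unrolled but rotated embedding would score lower.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (assumes no ties, which holds for continuous data)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def unroll_score(embedding, position):
    """Largest absolute rank correlation between any embedding axis and roll position."""
    return max(abs(spearman(embedding[:, k], position))
               for k in range(embedding.shape[1]))

# Noise-free Swiss roll: t is the position along the spiral
rng = np.random.default_rng(42)
t = 1.5 * np.pi * (1 + 2 * rng.random(1000))
height = 21 * rng.random(1000)
X_roll = np.column_stack([t * np.cos(t), height, t * np.sin(t)])

# Ideal flattening: the two intrinsic coordinates of the sheet
ideal = np.column_stack([t, height])

# PCA projection to 2D via SVD of the centered data
X_c = X_roll - X_roll.mean(axis=0)
_, _, Vt = np.linalg.svd(X_c, full_matrices=False)
pca_2d = X_c @ Vt[:2].T

score_ideal = unroll_score(ideal, t)
score_pca = unroll_score(pca_2d, t)
print(f"Ideal flattening score: {score_ideal:.3f}")
print(f"PCA projection score:   {score_pca:.3f}")
```

The same `unroll_score` can be applied to t-SNE or UMAP embeddings, passing the `color` array returned by `make_swiss_roll` as the position.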

In [None]:
# Generate Swiss Roll
X_swiss, color = make_swiss_roll(n_samples=1500, noise=0.1, random_state=42)

print(f"Swiss Roll shape: {X_swiss.shape} (3D)")
print(f"Color range: [{color.min():.1f}, {color.max():.1f}]")

# Visualize 3D
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(14, 6))

ax1 = fig.add_subplot(121, projection='3d')
scatter1 = ax1.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2],
                      c=color, cmap='viridis', s=20, alpha=0.6)
ax1.set_title('Swiss Roll in 3D', fontweight='bold')
ax1.view_init(elev=12, azim=-66)
fig.colorbar(scatter1, ax=ax1, shrink=0.8, label='Position')

ax2 = fig.add_subplot(122, projection='3d')
scatter2 = ax2.scatter(X_swiss[:, 0], X_swiss[:, 1], X_swiss[:, 2],
                      c=color, cmap='viridis', s=20, alpha=0.6)
ax2.set_title('Swiss Roll (Another Angle)', fontweight='bold')
ax2.view_init(elev=20, azim=45)
fig.colorbar(scatter2, ax=ax2, shrink=0.8, label='Position')

plt.suptitle('A 2D Manifold Embedded in 3D Space', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Apply all methods to Swiss Roll
X_swiss_scaled = StandardScaler().fit_transform(X_swiss)

print("Unrolling the Swiss Roll...")
X_sw_pca = PCA(n_components=2, random_state=42).fit_transform(X_swiss_scaled)
X_sw_tsne = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(X_swiss_scaled)
X_sw_umap = umap.UMAP(n_components=2, random_state=42, n_neighbors=30, min_dist=0.0).fit_transform(X_swiss_scaled)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Neutral panel titles: judge each result from the plot itself
axes[0].scatter(X_sw_pca[:, 0], X_sw_pca[:, 1], c=color, cmap='viridis', alpha=0.6, s=20)
axes[0].set_title('PCA\n(linear projection)', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].scatter(X_sw_tsne[:, 0], X_sw_tsne[:, 1], c=color, cmap='viridis', alpha=0.6, s=20)
axes[1].set_title('t-SNE\n(perplexity=30)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

scatter = axes[2].scatter(X_sw_umap[:, 0], X_sw_umap[:, 1], c=color, cmap='viridis', alpha=0.6, s=20)
axes[2].set_title('UMAP\n(n_neighbors=30, min_dist=0.0)', fontsize=13, fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.colorbar(scatter, ax=axes, label='Position Along Roll', pad=0.01)
plt.suptitle('Which Method Successfully Unrolls the Swiss Roll?', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

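print("\n" + "="*60)
print("What to look for (results vary with seed and parameters):")
print("  PCA:   layers stay folded on top of each other (linear projection)")
print("  t-SNE: gradient is locally smooth, but the sheet often splits into pieces")
print("  UMAP:  often a continuous strip with a smooth color gradient")
print("="*60)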

In [None]:
# Quantitative evaluation
print("\n" + "="*60)
print("TRUSTWORTHINESS METRICS")
print("="*60)
print("Measures whether each point's k nearest neighbors in low-D are also neighbors in high-D")
print("Higher is better (0 to 1)")
print("="*60)

results = []
for k in [5, 10, 20]:
    t_pca = trustworthiness(X_scaled, X_pca, n_neighbors=k)
    t_tsne = trustworthiness(X_scaled, X_tsne, n_neighbors=k)
    t_umap = trustworthiness(X_scaled, X_umap, n_neighbors=k)
    results.append({'k': k, 'PCA': f'{t_pca:.4f}', 't-SNE': f'{t_tsne:.4f}', 'UMAP': f'{t_umap:.4f}'})

df = pd.DataFrame(results)
print("\nDigits Dataset:")
print(df.to_string(index=False))

# Timing
print("\n" + "="*60)
print("COMPUTATIONAL EFFICIENCY (1000 samples)")
print("="*60)

X_sample = X_scaled[:1000]
timings = {}

start = time.time()
_ = PCA(n_components=2, random_state=42).fit_transform(X_sample)
timings['PCA'] = time.time() - start

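# The iteration argument is `max_iter` in scikit-learn >= 1.5 and `n_iter` in older releases
import inspect
iter_kw = 'max_iter' if 'max_iter' in inspect.signature(TSNE).parameters else 'n_iter'

start = time.time()
_ = TSNE(n_components=2, random_state=42, verbose=0, **{iter_kw: 300}).fit_transform(X_sample)
timings['t-SNE'] = time.time() - start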

start = time.time()
_ = umap.UMAP(n_components=2, random_state=42).fit_transform(X_sample)
timings['UMAP'] = time.time() - start

for method, t in timings.items():
    print(f"{method:8s}: {t:6.3f} seconds")

print(f"\nSpeedup vs t-SNE:")
print(f"  PCA:  {timings['t-SNE']/timings['PCA']:.1f}x faster")
print(f"  UMAP: {timings['t-SNE']/timings['UMAP']:.1f}x faster")
print("="*60)

# Decision Guide

## Use PCA when:
✅ Speed is critical  
✅ Linear relationships suffice  
✅ Need interpretable components  
✅ Want global structure  
✅ Need to reverse transformation  
✅ Preprocessing for ML  

**Examples**: Image compression, noise reduction, initial EDA

## Use t-SNE when:
✅ Visualization is primary goal  
✅ Local structure critical  
✅ < 10,000 samples  
✅ Time not constrained  
✅ Publication-quality figures  

**Examples**: Cluster visualization, exploring embeddings, understanding separability

## Use UMAP when:
✅ Need both local AND global structure  
✅ Large datasets (> 10,000)  
✅ Embedding for downstream tasks  
✅ Need to transform new data  
✅ Efficiency matters  
✅ Complex manifolds  

**Examples**: Single-cell genomics, document clustering, anomaly detection, large-scale viz

## Quick Reference

| Criterion | PCA | t-SNE | UMAP |
|-----------|-----|-------|------|
| **Speed** | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ |
| **Local Structure** | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Global Structure** | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ |
| **Scalability** | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐ |
| **Deterministic** | ✅ | ❌ (seed-dependent) | ❌ (seed-dependent) |
| **New Data** | ✅ | ❌ | ✅ |
| **Best For** | Preprocessing | Visualization | General Purpose |
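
The "New Data" row comes down to whether a fitted model can project unseen samples. A minimal sketch with scikit-learn's `PCA`, using synthetic data as an assumption: fit on one set, project held-out points with `transform`, and map them back with `inverse_transform`. umap-learn's `UMAP` exposes the same `fit` / `transform` pattern, whereas scikit-learn's `TSNE` only offers `fit_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated data: 200 training samples, 100 unseen samples
rng = np.random.default_rng(0)
X_all = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))
X_train, X_new = X_all[:200], X_all[200:]

errors = {}
for k in (2, 5, 10):
    pca_k = PCA(n_components=k).fit(X_train)     # learn components from training data only
    Z_new = pca_k.transform(X_new)               # project unseen samples
    X_back = pca_k.inverse_transform(Z_new)      # map back to the original feature space
    errors[k] = float(np.mean((X_new - X_back) ** 2))
    print(f"k={k:2d}: embedded shape {Z_new.shape}, reconstruction MSE {errors[k]:.4f}")
```

Keeping all 10 components makes the round trip lossless up to floating-point error, which is the sense in which PCA is reversible.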

## Exercise For The Reader

Apply these techniques to the Wine dataset!

**Your Tasks:**
1. Load the Wine dataset from sklearn
2. Standardize features
3. Apply PCA, t-SNE, and UMAP
4. Create side-by-side visualizations
5. Compute trustworthiness metrics
6. Answer: Which method best separates wine classes?
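
For task 5, it can help to see what trustworthiness actually computes. It penalizes points that appear among the $k$ nearest neighbors in the embedding without being among the $k$ nearest in the original space, weighted by how far down the original ranking they sit: $T(k) = 1 - \frac{2}{nk(2n - 3k - 1)} \sum_i \sum_{j \in U_i^k} (r(i, j) - k)$. The NumPy version below is a brute-force sketch for small datasets, with the hypothetical name `trustworthiness_np`; in practice `sklearn.manifold.trustworthiness` is the function to use.

```python
import numpy as np

def trustworthiness_np(X, X_emb, k=5):
    """Brute-force trustworthiness for small datasets (requires k < n / 2)."""
    n = X.shape[0]
    D_hi = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D_lo = np.linalg.norm(X_emb[:, None, :] - X_emb[None, :, :], axis=-1)
    np.fill_diagonal(D_hi, np.inf)               # a point is never its own neighbor
    np.fill_diagonal(D_lo, np.inf)

    # rank_hi[i, j] = rank of j among i's neighbors in the original space (1 = nearest)
    order_hi = np.argsort(D_hi, axis=1)
    rows = np.arange(n)[:, None]
    rank_hi = np.empty((n, n), dtype=int)
    rank_hi[rows, order_hi] = np.arange(1, n + 1)[None, :]

    nn_lo = np.argsort(D_lo, axis=1)[:, :k]      # k nearest neighbors in the embedding
    ranks = rank_hi[rows, nn_lo]                 # their ranks in the original space
    penalty = np.maximum(ranks - k, 0).sum()     # only intruders (rank > k) are penalized
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))

t_same = trustworthiness_np(X_demo, X_demo, k=5)                      # identical geometry
t_rand = trustworthiness_np(X_demo, rng.normal(size=(100, 2)), k=5)   # unrelated embedding
print(f"Identical embedding: {t_same:.3f}")
print(f"Random embedding:    {t_rand:.3f}")
```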

In [None]:
# Exercise: Wine Dataset
from sklearn.datasets import load_wine

wine = load_wine()
X_wine = wine.data
y_wine = wine.target

print(f"Wine dataset: {X_wine.shape}")
print(f"Classes: {wine.target_names}")

# Standardize
X_wine_scaled = StandardScaler().fit_transform(X_wine)

# Apply methods
X_wine_pca = PCA(n_components=2, random_state=42).fit_transform(X_wine_scaled)
X_wine_tsne = TSNE(n_components=2, random_state=42, verbose=0).fit_transform(X_wine_scaled)
X_wine_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_wine_scaled)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
colors = ['red', 'green', 'blue']

for i, (emb, name, ax) in enumerate([(X_wine_pca, 'PCA', axes[0]),
                                       (X_wine_tsne, 't-SNE', axes[1]),
                                       (X_wine_umap, 'UMAP', axes[2])]):
    for j, wine_name in enumerate(wine.target_names):
        mask = y_wine == j
        ax.scatter(emb[mask, 0], emb[mask, 1], c=colors[j],
                  label=wine_name, alpha=0.7, edgecolors='k', s=60)
    ax.set_title(name, fontsize=14, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Wine Dataset Comparison', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Metrics
print("\nTrustworthiness (k=10):")
for emb, name in [(X_wine_pca, 'PCA'), (X_wine_tsne, 't-SNE'), (X_wine_umap, 'UMAP')]:
    t = trustworthiness(X_wine_scaled, emb, n_neighbors=10)
    print(f"  {name:8s}: {t:.4f}")

print("\n✓ Exercise complete!")

# Key Takeaways

Congratulations on completing Day 45 and Week 9!

## What You've Learned

1. **Three Major Techniques**
   - PCA: Linear, fast, interpretable
   - t-SNE: Excellent visualization, local focus
   - UMAP: Balanced, modern, scalable

2. **Mathematical Foundations**
   - PCA: Eigendecomposition, variance maximization
   - t-SNE: KL divergence, probability distributions
   - UMAP: Manifold learning, topology preservation

3. **Practical Skills**
   - Apply and compare methods
   - Choose appropriate technique
   - Tune UMAP parameters
   - Evaluate with trustworthiness
   - Visualize effectively

4. **Key Principles**
   - No single best method
   - Use quantitative metrics
   - Consider downstream tasks
   - UMAP is often a good balance
   - t-SNE remains a strong choice for visualization
   - PCA essential for preprocessing

## What's Next?

- Apply to your datasets
- Explore supervised UMAP
- Learn about autoencoders
- Investigate other manifold methods (Isomap, LLE)
- Use in ML pipelines

**Your dimensionality reduction journey is complete!** 🎉

## Further Resources

### Papers
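1. **UMAP** - McInnes, Healy & Melville (2018), "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction" - https://arxiv.org/abs/1802.03426
2. **t-SNE** - van der Maaten & Hinton (2008), "Visualizing Data using t-SNE" - JMLR
3. **PCA** - Jolliffe (2002), "Principal Component Analysis" (2nd Ed) - Springer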

### Documentation
4. **UMAP Docs** - https://umap-learn.readthedocs.io/
5. **Scikit-learn Manifold** - https://scikit-learn.org/stable/modules/manifold.html
6. **Distill.pub t-SNE** - https://distill.pub/2016/misread-tsne/

### Interactive
7. **Understanding UMAP** - https://pair-code.github.io/understanding-umap/
8. **PCA Explained Visually** - http://setosa.io/ev/principal-component-analysis/

### Books
9. **Elements of Statistical Learning** - Hastie, Tibshirani, Friedman
10. **Hands-On Machine Learning** - Aurélien Géron (3rd Ed)

### Practice
11. **Kaggle Datasets** - https://www.kaggle.com/datasets
12. **UCI ML Repository** - https://archive.ics.uci.edu/ml/

**Happy Learning! 🚀**