# Lab Extra: PCA vs t-SNE on Handwritten Digits

PCA is a **linear** method — it finds directions of maximum variance using a fixed matrix multiplication. But what if the data has **nonlinear structure** that can't be captured by straight lines?

### The Dataset

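We'll explore this using the **UCI Optical Recognition of Handwritten Digits** dataset as bundled with scikit-learn (a copy of the UCI test set): 1,797 samples of 8×8 pixel grayscale images (64 dimensions), with integer pixel intensities from 0 to 16.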

**Sources:**

- [sklearn.datasets.load_digits](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits)

### Why handwritten digits have nonlinear structure

Consider the digit "3" written in different styles: upright, slanted, thick strokes, thin strokes, rounded, angular. Each variation changes many pixels simultaneously, but not in a simple linear way. The relationship between "slant angle" and pixel values involves trigonometry, not just scaling.

This means the collection of all "3"s doesn't form a simple blob or ellipse in 64D space; it lies on a **curved manifold**. The same is true for each digit class. PCA can only project onto a flat subspace (a plane, when reducing to 2D), and it never looks at the labels, so these curved clusters tend to overlap and blur together in the projection.
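
To make the "curved, not straight" claim concrete, here is a minimal sketch on synthetic 8×8 images rather than the real digits. The helper `slanted_stroke` and the slant range are hypothetical choices made for illustration only. It draws a single stroke at different slants, then checks two things: averaging a left-slanted and a right-slanted stroke does not give an upright stroke, and one principal component cannot explain the variation even though only one parameter (the slant) changes.

```python
import numpy as np
from sklearn.decomposition import PCA


def slanted_stroke(slant, size=8):
    """Hypothetical helper: draw a one-pixel-wide stroke whose column
    shifts linearly with the row, controlled by a single slant value."""
    img = np.zeros((size, size))
    center = (size - 1) / 2
    for row in range(size):
        col = int(round(center + slant * (row - center)))
        if 0 <= col < size:
            img[row, col] = 1.0
    return img


# One underlying parameter (slant), 64 pixel features per image.
slants = np.linspace(-0.8, 0.8, 41)
X_strokes = np.array([slanted_stroke(s).ravel() for s in slants])

# If the images lay on a straight line, the midpoint of the two extreme
# slants would be the upright stroke. It is not.
midpoint = 0.5 * (X_strokes[0] + X_strokes[-1])
upright = slanted_stroke(0.0).ravel()
print("Midpoint equals upright stroke:", np.allclose(midpoint, upright))

# A single linear direction captures only part of the variance.
pca = PCA().fit(X_strokes)
print("Variance explained by PC1:", round(pca.explained_variance_ratio_[0], 3))
```

The midpoint image is a faint "X" shape (two half-intensity strokes), not a stroke at the middle slant, which is the sense in which the set of slanted images is curved in pixel space.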

### t-SNE (t-distributed Stochastic Neighbor Embedding)

**t-SNE** is a **nonlinear** dimensionality reduction technique designed for visualization. Unlike PCA, it focuses on preserving **local neighborhood structure** rather than global variance.

**Why is t-SNE nonlinear?**

- **PCA**: Projects data using a fixed matrix: $\mathbf{Z} = \mathbf{X} \cdot \mathbf{W}$ (with $\mathbf{X}$ mean-centered). The same linear transformation applies to every point.
- **t-SNE**: Uses iterative optimization that can bend and stretch different regions differently. There's no single formula: the algorithm directly adjusts each point's low-dimensional coordinates to keep neighbors together, even if that requires warping the geometry.
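
The "fixed matrix" description of PCA can be checked directly. The sketch below uses synthetic data (the names `X_demo`, `pca_demo`, and `X_new` are illustrative, not part of the lab) and compares scikit-learn's `transform` against an explicit centering step followed by a matrix product with the fitted components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated data: 200 samples, 5 features (illustrative only).
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca_demo = PCA(n_components=2).fit(X_demo)

# W holds the principal directions as columns: shape (5, 2).
W = pca_demo.components_.T

# PCA's projection is "subtract the training mean, then multiply by W".
Z_manual = (X_demo - pca_demo.mean_) @ W
Z_sklearn = pca_demo.transform(X_demo)
print("Manual projection matches transform():", np.allclose(Z_manual, Z_sklearn))

# The same mean and the same W apply to points that were not used for fitting.
X_new = rng.normal(size=(3, 5))
Z_new = (X_new - pca_demo.mean_) @ W
print("Also matches on new points:", np.allclose(Z_new, pca_demo.transform(X_new)))
```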

**How it works:**

1. **Compute pairwise similarities** in high-dimensional space: For each point, calculate the probability that it would pick each other point as a neighbor (using a Gaussian distribution)

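2. **Initialize points** in low-dimensional space (typically 2D or 3D), either randomly or from a PCA projection (scikit-learn's `TSNE` supports both through its `init` parameter)

3. **Optimize**: Iteratively move the low-dimensional points so that their pairwise similarities (measured with a heavy-tailed Student t-distribution, hence the name) match the neighborhood structure from the original space, by minimizing the KL divergence between the two sets of similarities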

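The result: points that were neighbors in 64D tend to remain neighbors in the output space. Distances between well-separated clusters, and the apparent sizes of clusters, are much less reliable.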
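
The quantity being minimized can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes one shared Gaussian bandwidth `sigma` for all points, whereas real t-SNE tunes a bandwidth per point to match the requested perplexity, and it only evaluates the objective rather than running the optimization. All function names here are hypothetical.

```python
import numpy as np


def squared_distances(points):
    """Pairwise squared Euclidean distances, shape (n, n)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def gaussian_affinities(X, sigma=1.0):
    """High-dimensional similarities P (assumption: one shared bandwidth)."""
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)      # row i holds p(j | i)
    return (cond + cond.T) / (2.0 * len(X))      # symmetrize; entries sum to 1


def student_t_affinities(Y):
    """Low-dimensional similarities Q from a Student t kernel (1 d.o.f.)."""
    weights = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that t-SNE minimizes over the positions Y."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


rng = np.random.default_rng(0)

# Two well-separated clusters in 10 dimensions (synthetic, illustrative).
X_toy = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 10)),
    rng.normal(5.0, 0.5, size=(20, 10)),
])
P = gaussian_affinities(X_toy)

# A layout that keeps the clusters intact versus one that scrambles them.
Y_good = X_toy[:, :2]
Y_bad = rng.permutation(Y_good)

kl_good = kl_divergence(P, student_t_affinities(Y_good))
kl_bad = kl_divergence(P, student_t_affinities(Y_bad))
print(f"KL for neighbor-preserving layout: {kl_good:.3f}")
print(f"KL for scrambled layout:           {kl_bad:.3f}")
```

The layout that keeps high-dimensional neighbors close together gets the lower cost, which is what gradient descent on the point positions exploits in step 3.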

**Key parameters:**

- `n_components`: Output dimensionality (usually 2 for visualization)
- `perplexity`: Controls the effective number of neighbors considered (typically 5–50)
- `random_state`: Set for reproducibility, since t-SNE's initialization is randomized and different seeds can produce different layouts
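
The phrase "effective number of neighbors" has a precise meaning: the perplexity of a probability distribution is 2 raised to its Shannon entropy in bits. The small sketch below (the function name `perplexity` and the two example distributions are illustrative) shows that a point spreading its neighbor probability evenly over 10 others has perplexity 10, while a point that concentrates almost all probability on one neighbor has a much lower value.

```python
import numpy as np


def perplexity(p):
    """Perplexity of a discrete distribution: 2 ** (entropy in bits)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # treat 0 * log(0) as 0
    entropy_bits = -np.sum(p * np.log2(p))
    return float(2.0 ** entropy_bits)


# Probability spread evenly over 10 neighbors.
uniform_over_10 = np.full(10, 0.1)

# Almost all probability on a single neighbor.
peaked = np.array([0.91] + [0.01] * 9)

print("Uniform over 10 neighbors:", round(perplexity(uniform_over_10), 3))
print("Peaked on one neighbor:   ", round(perplexity(peaked), 3))
```

Setting `perplexity` in t-SNE asks the algorithm to choose each point's Gaussian width so that its neighbor distribution has roughly this effective size, which is why small values emphasize very local structure and larger values take in broader neighborhoods.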


---

## Setup


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

---

## Load the Dataset


In [None]:
# Load UCI Handwritten Digits dataset (8x8 images)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Dataset shape: {X_digits.shape}")
print(f"Each sample is an 8x8 image flattened to {X_digits.shape[1]} features")

# Show a few example digits
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap="gray")
    ax.set_title(f"Label: {y_digits[i]}")
    ax.axis("off")
plt.suptitle("Sample Handwritten Digits (8×8)")
plt.tight_layout()
plt.show()

---

### Task 2.1 PCA on Digits

- Task: Apply PCA to reduce the digits data from 64D to 2D and visualize the result.
- Points: 2.5
- Expectations: Complete the TODO lines to reduce the data to 2D using PCA. The plotting code is provided.


In [None]:
# TODO: pca_digits = ...
# TODO: X_pca = ...

# Plot PCA result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    X_pca[:, 0], X_pca[:, 1], c=y_digits, cmap="tab10", s=5, alpha=0.7
)
plt.colorbar(scatter, label="Digit")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA on Handwritten Digits")
plt.show()

### Task 2.2 t-SNE on Digits

- Task: Apply t-SNE to reduce the digits data to 2D and visualize the result.
- Points: 2.5
- Expectations: Complete the TODO lines to reduce the data to 2D using t-SNE. The plotting code is provided. Note: t-SNE can be slow (~30 seconds).


In [None]:
# TODO: tsne = ... (Hint: experiment with the perplexity value)
# TODO: X_tsne = ...

# Plot t-SNE result
plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    X_tsne[:, 0], X_tsne[:, 1], c=y_digits, cmap="tab10", s=5, alpha=0.7
)
plt.colorbar(scatter, label="Digit")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("t-SNE on Handwritten Digits")
plt.show()

### Task 2.3 PCA vs t-SNE separation

- Task: In your t-SNE plot, you should see distinct clusters. Why doesn't PCA produce the same clear separation, even though both methods reduce to 2D?
- Points: 5
- Expectations: A written response (1-2 paragraphs).


#### Answer


### Task 2.4 Real-world application

- Task: Give one example of a dataset from your field (signals, IT, cyber, etc.) where t-SNE would be useful for exploration. What patterns or clusters might you hope to discover?
- Points: 5
- Expectations: A written response (2-4 paragraphs).


#### Answer


### Task 2.5 t-SNE for dimensionality reduction

- Task: With PCA, we reduced 64D digits to 2D and could easily apply the same transformation to new data. Could you do the same with t-SNE — use it as a preprocessing step for a classifier? What fundamental problem would you encounter when trying to classify new, unseen data?
- Points: 5
- Expectations: A written response (2-3 paragraphs).


#### Answer
