# Chapter 7. Dimensionality Reduction


Many machine learning problems involve thousands or even millions
of features for each training instance.


Not only do all these features make training extremely slow,
but they can also make it much harder to find a good solution.


This problem is often referred to as the **curse of dimensionality**.


Fortunately, in real-world problems, it is often possible
to reduce the number of features considerably,
turning an intractable problem into a tractable one.


For example, consider the MNIST images.


Pixels on the image borders are almost always white,
so they can be dropped without losing much information.


Additionally, neighboring pixels are often highly correlated.


If two neighboring pixels are merged into one
(for example, by averaging their intensities),
little information is lost while redundancy is reduced.


Removing redundancy can sometimes reduce noise
and improve model performance.


### ⚠️ WARNING


Reducing dimensionality can also drop useful information.


This is similar to compressing an image to JPEG:
the file becomes smaller, but quality may degrade.


If dimensionality is reduced too aggressively,
model performance may suffer.


Moreover, some models—such as neural networks—
can handle high-dimensional data efficiently.


These models can often learn useful representations
and reduce dimensionality internally.


As a result, dimensionality reduction is not always helpful
and should be applied thoughtfully.


Dimensionality reduction is also extremely useful
for data visualization.


Reducing data to two or three dimensions
makes it possible to plot high-dimensional datasets.


Visualization often reveals patterns such as clusters,
outliers, or nonlinear structures.


This is especially important when communicating results
to non–data scientists and decision makers.


# The Curse of Dimensionality


We are so used to living in three dimensions
that our intuition fails us in high-dimensional spaces.


Even a simple 4D hypercube is extremely difficult to visualize,
let alone a 200-dimensional ellipsoid in a 1,000-dimensional space.


Figure 7-1 illustrates hypercubes from 0D to 4D:
a point, a segment, a square, a cube, and a tesseract.


Many properties of space behave very differently
as dimensionality increases.


Consider a random point inside a unit square (1 × 1).


There is only about a 0.4% chance that the point
lies within 0.001 of the border.


In other words, points in low-dimensional space
are rarely "extreme" along any dimension.


Now consider a 10,000-dimensional unit hypercube.


The probability that a random point lies within 0.001
of the border exceeds 99.999999%.


In high dimensions, most points lie very close to the border.
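
Both probabilities follow from a short calculation: a point avoids the border only if every coordinate falls in [0.001, 0.999], which happens with probability 0.998 per dimension. A minimal sketch, assuming independent uniform coordinates (the function name `prob_near_border` is illustrative):

```python
def prob_near_border(n_dims: int, margin: float = 0.001) -> float:
    """Probability that a uniform random point in the unit hypercube
    lies within `margin` of the border along at least one dimension."""
    # A point is "interior" only if every coordinate is in [margin, 1 - margin].
    p_interior = (1 - 2 * margin) ** n_dims
    return 1 - p_interior

p_square = prob_near_border(2)
p_hypercube = prob_near_border(10_000)
print(f"2D:       {p_square:.4%}")
print(f"10,000D:  {p_hypercube:.8%}")
```

The interior probability shrinks geometrically with the number of dimensions, which is why almost every point ends up near some face of the hypercube.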


### Distances Grow with Dimensionality


Distances between random points also behave strangely
in high-dimensional space.


In a unit square, the average distance between two random points
is approximately 0.52.


In a 3D unit cube, the average distance increases to about 0.66.


But in a 1,000,000-dimensional unit hypercube,
the average distance is about 408.25.


This seems impossible at first:
how can points be so far apart inside a unit hypercube?


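The answer is that there is plenty of room in high dimensions:
each coordinate contributes a little to the squared distance,
and across a million coordinates those contributions add up.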
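
These averages can be checked numerically. The sketch below estimates the mean distance by sampling random pairs in 2D and 3D. For the million-dimensional case it uses the approximation √(d/6), which follows from the expected squared difference between two uniform coordinates being 1/6. The helper name, sample count, and seed are arbitrary choices.

```python
import numpy as np

def mean_distance(n_dims: int, n_pairs: int = 20_000, seed: int = 42) -> float:
    """Monte Carlo estimate of the mean Euclidean distance between
    two uniform random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return float(np.linalg.norm(a - b, axis=1).mean())

d2 = mean_distance(2)
d3 = mean_distance(3)
# For large d, the distance concentrates around sqrt(d / 6).
d_million = float(np.sqrt(1_000_000 / 6))
print(round(d2, 2), round(d3, 2), round(d_million, 2))
```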


### Consequences for Machine Learning


High-dimensional datasets are typically very sparse.


Most training instances are far away from each other,
which makes distance-based methods less effective.


Algorithms such as k-nearest neighbors
suffer significantly from this effect.


Some models scale poorly with dimensionality
and may become impractical.


Examples include support vector machines
and dense neural networks.


New instances are also likely to be far from all training points,
making predictions less reliable.


Patterns become harder to detect,
so models are more likely to fit noise.


As dimensionality increases,
regularization becomes increasingly important.


Models also become more difficult to interpret.


### Why More Data Is Not a Practical Fix


In theory, these problems could be mitigated
by increasing the size of the training set.


However, the number of required training instances
grows exponentially with dimensionality.


With just 100 features ranging from 0 to 1,
you would need more training instances
than atoms in the observable universe
for the instances to lie within 0.1 of each other on average,
assuming they were spread out uniformly.


This makes dimensionality reduction
a practical necessity in many real-world problems.
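
A rough way to see the exponential growth: covering the unit hypercube with a regular grid of spacing 0.1 takes 10 points per axis, so 10ⁿ points in n dimensions. The grid is a simplifying assumption made for this sketch, and the figure of about 10⁸⁰ atoms in the observable universe is a commonly cited order-of-magnitude estimate.

```python
def grid_points(n_features: int, spacing: float = 0.1) -> int:
    """Number of grid points needed to cover the unit hypercube
    at the given spacing (exact integer arithmetic)."""
    points_per_axis = round(1 / spacing)
    return points_per_axis ** n_features

ATOMS_IN_UNIVERSE = 10 ** 80  # rough order-of-magnitude estimate (assumption)

for n in (1, 2, 3, 10, 100):
    print(f"{n:>3} features -> {grid_points(n):.3e} instances")

exceeds_atoms = grid_points(100) > ATOMS_IN_UNIVERSE
print("More instances than atoms:", exceeds_atoms)
```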


# Main Approaches for Dimensionality Reduction


Before diving into specific dimensionality reduction algorithms,
it is useful to understand the two main approaches:
projection and manifold learning.


### Projection


In most real-world problems, training instances
are not spread uniformly across all dimensions.


Many features are nearly constant,
while others are highly correlated.


As a result, the data often lies within
(or very close to) a lower-dimensional subspace
of the higher-dimensional space.


Although this may sound abstract,
it becomes clearer with an example.


Consider a 3D dataset represented by points in space,
as shown in Figure 7-2.


All training instances lie close to a plane,
which is a 2D subspace of the 3D space.


If we project each instance perpendicularly
onto this plane,
we obtain a new 2D dataset.


This projection is illustrated by dashed lines
connecting the original points to the plane.


Figure 7-3 shows the resulting 2D dataset
after projection.


The dataset’s dimensionality has been reduced
from 3D to 2D.


The new axes correspond to new features,
denoted z₁ and z₂.


These features are simply the coordinates
of the projected points on the plane.


### Manifold Learning


Projection is fast and often effective,
but it is not always the best approach.


In some datasets, the lower-dimensional structure
twists and turns through the higher-dimensional space.


A classic example is the Swiss roll dataset,
shown in Figure 7-4.


This dataset consists of 3D points
arranged in the shape of a rolled sheet.


Simply projecting the data onto a plane
would squash different layers of the Swiss roll together.


This effect is shown on the left side of Figure 7-5.


What we actually want is to "unroll" the Swiss roll
into a flat 2D representation.


The correct unrolled result
is shown on the right side of Figure 7-5.


The Swiss roll is an example of a 2D manifold.


A 2D manifold is a surface that can be bent and twisted
within a higher-dimensional space.


More generally, a d-dimensional manifold
is a subset of an n-dimensional space (where d < n)
that locally resembles a d-dimensional hyperplane.


In the Swiss roll example,
d = 2 and n = 3.


Many dimensionality reduction algorithms
are based on modeling this manifold structure.


This approach is called manifold learning.


Examples include LLE, Isomap, t-SNE, and UMAP.


Manifold learning relies on the manifold assumption,
also known as the manifold hypothesis.


This hypothesis states that most real-world
high-dimensional datasets lie close to
a much lower-dimensional manifold.


This assumption often holds in practice.


Consider the MNIST dataset as an example.


Handwritten digits share strong structural constraints:
connected strokes, white borders, and centered content.


Only a tiny fraction of all possible images
resemble handwritten digits.


This dramatically reduces the effective degrees of freedom,
compressing the data into a lower-dimensional manifold.


The manifold assumption is often paired
with another implicit assumption.


The learning task may become simpler
when expressed in the manifold’s lower-dimensional space.


In Figure 7-6 (top row),
a complex decision boundary in 3D
becomes a straight line in the unrolled 2D space.


However, this simplification does not always occur.


In the bottom row of Figure 7-6,
a simple decision boundary in 3D
becomes more complex in the manifold space.


This shows that dimensionality reduction
does not guarantee simpler decision boundaries.


In practice, dimensionality reduction usually speeds up training,
but it may not always improve performance.


It is most effective when:
- the dataset is small relative to the number of features
- the data is noisy
- many features are highly correlated


If domain knowledge suggests
the data-generating process is simple,
the manifold assumption is likely valid.


In such cases,
dimensionality reduction can be very beneficial.


The remainder of this chapter
covers popular dimensionality reduction algorithms
that exploit these ideas.


# PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is the most popular dimensionality reduction algorithm.  
It identifies the hyperplane that lies closest to the data and then projects the data onto it.

The main goal of PCA is to **preserve as much variance as possible** while reducing the number of dimensions.


## Preserving the Variance

When projecting data onto a lower-dimensional space, PCA selects the axes (principal components)
that maximize the variance of the projected data.

Preserving variance usually means preserving information.
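
To make this concrete, the sketch below builds a synthetic 2D cloud stretched along the 45° diagonal and measures the variance of its projection onto a few candidate axes. The dataset, the angles, and the helper `projected_variance` are all illustrative choices, not part of PCA itself.

```python
import numpy as np

rng = np.random.default_rng(42)
# Elongated 2D cloud: most of the spread lies along the 45-degree diagonal.
t = rng.normal(scale=3.0, size=500)
noise = rng.normal(scale=0.5, size=500)
X_cloud = np.column_stack([t + noise, t - noise]) / np.sqrt(2)
X_cloud -= X_cloud.mean(axis=0)

def projected_variance(X, angle_deg):
    """Variance of X after projecting onto the unit vector at angle_deg."""
    theta = np.deg2rad(angle_deg)
    u = np.array([np.cos(theta), np.sin(theta)])
    return float(np.var(X @ u))

variances = {a: projected_variance(X_cloud, a) for a in (0, 45, 90, 135)}
for angle, var in variances.items():
    print(f"{angle:>3} deg: variance = {var:.2f}")
```

The 45° axis keeps most of the spread, while the 135° axis keeps almost none, so projecting onto the former loses far less information.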


In [None]:
import numpy as np


## Principal Components via SVD

PCA can be computed using **Singular Value Decomposition (SVD)**.
Before applying SVD, the data must be **centered** around the origin.


In [None]:
# Example 3D dataset
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.7],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 1.9]
])

# Center the data
X_centered = X - X.mean(axis=0)

# Perform SVD
U, s, Vt = np.linalg.svd(X_centered)

# First two principal components
c1 = Vt[0]
c2 = Vt[1]

c1, c2


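Each principal component is a **unit vector**. The first one points in the direction of maximum variance,
and each subsequent one points in the direction of maximum remaining variance, orthogonal to all previous ones.

Note:
- The sign of each principal component is arbitrary: `c1` and `-c1` define the same axis, so the sign may flip between runs or solvers.
- The components are specific to the data they were computed from, so PCA must be refit if the dataset changes.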


## Projecting Down to d Dimensions

Once the principal components are found, we can project the data
onto the first `d` components to reduce dimensionality.


In [None]:
# Project onto the first 2 principal components
W2 = Vt[:2].T
X_2D = X_centered @ W2

X_2D


## Using Scikit-Learn PCA

Scikit-Learn provides a PCA implementation that:
- Automatically centers the data
- Uses SVD internally


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2D_sklearn = pca.fit_transform(X)

X_2D_sklearn


The `components_` attribute stores the principal components.
Each row corresponds to one principal component.
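
As a sanity check, the rows of `components_` should match the rows of `Vt` from NumPy's SVD, up to a possible sign flip. A minimal sketch on the same five-point dataset used earlier (the variable names are chosen for this example only):

```python
import numpy as np
from sklearn.decomposition import PCA

X_small = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.7],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 1.9],
])

# Principal components from NumPy's SVD on the centered data
_, _, Vt_small = np.linalg.svd(X_small - X_small.mean(axis=0))

pca_small = PCA(n_components=2).fit(X_small)

# |cosine| between matching components: 1.0 means same axis, sign aside
cosines = np.abs(np.sum(pca_small.components_ * Vt_small[:2], axis=1))
norms = np.linalg.norm(pca_small.components_, axis=1)
print("abs cosines:", cosines)
print("norms:      ", norms)
```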


## Explained Variance Ratio

The explained variance ratio tells us how much variance
each principal component captures.


In [None]:
pca.explained_variance_ratio_
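
The ratio can also be derived by hand from the singular values: the variance along each component is proportional to the squared singular value. The sketch below compares the two on a synthetic dataset (the data and variable names are illustrative).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with deliberately unequal spread per axis
X_toy = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

X_toy_centered = X_toy - X_toy.mean(axis=0)
_, s_toy, _ = np.linalg.svd(X_toy_centered, full_matrices=False)

# Variance along each component is s**2 / (m - 1); the ratio drops the constant
manual_ratio = s_toy**2 / np.sum(s_toy**2)
sklearn_ratio = PCA().fit(X_toy).explained_variance_ratio_
print("manual: ", manual_ratio)
print("sklearn:", sklearn_ratio)
```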


## Choosing the Right Number of Dimensions

A common approach is to keep enough components to preserve
a target percentage (e.g., 95%) of the variance.


In [None]:
from sklearn.datasets import fetch_openml

# Load MNIST
mnist = fetch_openml('mnist_784', as_frame=False)
X_train = mnist.data[:60_000]
y_train = mnist.target[:60_000]

# Fit PCA without dimensionality reduction
pca = PCA()
pca.fit(X_train)

# Cumulative explained variance
cumsum = np.cumsum(pca.explained_variance_ratio_)

# Minimum dimensions to preserve 95% variance
d = np.argmax(cumsum >= 0.95) + 1
d


Instead of manually selecting `d`, you can directly specify
the variance ratio to preserve.


In [None]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

pca.n_components_
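
The two approaches should agree. The sketch below checks this on scikit-learn's small bundled digits dataset (`load_digits`, 8×8 images), used here instead of MNIST only because it needs no download.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # 1,797 images of 8x8 pixels (64 features)

# Option 1: fit a full PCA, then find the smallest d reaching 95%
cumsum_digits = np.cumsum(PCA().fit(X_digits).explained_variance_ratio_)
d_manual = int(np.argmax(cumsum_digits >= 0.95)) + 1

# Option 2: let PCA pick d from the target ratio
d_auto = int(PCA(n_components=0.95).fit(X_digits).n_components_)
print(d_manual, d_auto)
```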


## PCA as a Preprocessing Step (Pipeline)

PCA is often combined with supervised models using a pipeline.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV

clf = make_pipeline(
    PCA(random_state=42),
    RandomForestClassifier(random_state=42)
)

param_distrib = {
    "pca__n_components": np.arange(10, 80),
    "randomforestclassifier__n_estimators": np.arange(50, 500)
}

rnd_search = RandomizedSearchCV(
    clf,
    param_distrib,
    n_iter=10,
    cv=3,
    random_state=42
)

rnd_search.fit(X_train[:1000], y_train[:1000])
rnd_search.best_params_


## PCA for Compression

PCA can compress data by reducing dimensionality and later
reconstructing it using the inverse transformation.


In [None]:
# Reconstruct the data
X_recovered = pca.inverse_transform(X_reduced)
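
The reconstruction is lossy, and the loss can be measured as the mean squared distance between the original and reconstructed instances. A minimal sketch, again using the small `load_digits` dataset as a stand-in for MNIST:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_img = load_digits().data

pca_img = PCA(n_components=0.95)
X_img_reduced = pca_img.fit_transform(X_img)
X_img_recovered = pca_img.inverse_transform(X_img_reduced)

# Mean squared distance between the originals and their reconstructions
reconstruction_error = np.mean(np.sum((X_img - X_img_recovered) ** 2, axis=1))
# Same quantity for the trivial "reconstruction" that returns the mean image
total_spread = np.mean(np.sum((X_img - X_img.mean(axis=0)) ** 2, axis=1))

size_ratio = X_img_reduced.shape[1] / X_img.shape[1]
print(f"kept {size_ratio:.0%} of the features")
print(f"lost {reconstruction_error / total_spread:.1%} of the variance")
```

Because 95% of the variance was preserved, the lost fraction printed at the end should not exceed 5%.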


## Incremental PCA

Incremental PCA allows PCA to be trained on large datasets
using mini-batches.


In [None]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)

for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced_inc = inc_pca.transform(X_train)


# Random Projection

Random Projection reduces dimensionality by projecting data onto a
randomly generated lower-dimensional subspace.

Surprisingly, this technique preserves distances fairly well with high probability,
as guaranteed by the **Johnson–Lindenstrauss lemma**.



In [None]:
import numpy as np


## Choosing the Target Dimensionality

Scikit-Learn provides a helper function that computes the minimum number
of dimensions required to preserve distances within a tolerance ε.


In [None]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim

m = 5_000     # number of samples
epsilon = 0.1  # maximum distortion

d = johnson_lindenstrauss_min_dim(m, eps=epsilon)
d


For example, with 5,000 samples and ε = 0.1, we only need about 7,300 dimensions,
even if the original dataset has tens of thousands of features.
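
The helper implements the Johnson–Lindenstrauss bound d ≥ 4 ln(m) / (ε²/2 − ε³/3). The sketch below evaluates that formula by hand and compares it with the library function; `jl_min_dim` is a name made up for this example.

```python
import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

def jl_min_dim(n_samples: int, eps: float) -> float:
    """Johnson-Lindenstrauss bound: d >= 4 ln(m) / (eps^2 / 2 - eps^3 / 3)."""
    return 4 * np.log(n_samples) / (eps**2 / 2 - eps**3 / 3)

d_formula = jl_min_dim(5_000, 0.1)
d_sklearn = johnson_lindenstrauss_min_dim(5_000, eps=0.1)
print(d_formula, d_sklearn)

# The bound depends on the number of samples and on eps,
# not on the original number of features.
for eps_value in (0.5, 0.2, 0.1, 0.05):
    print(eps_value, johnson_lindenstrauss_min_dim(5_000, eps=eps_value))
```

Note how quickly the required dimensionality grows as ε shrinks: halving the tolerance roughly quadruples `d`.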


In [None]:
n = 20_000  # original number of features
rng = np.random.default_rng(seed=42)

# Random projection matrix
P = rng.standard_normal((d, n)) / np.sqrt(d)

# Fake high-dimensional dataset
X = rng.standard_normal((m, n))

# Project data
X_reduced = X @ P.T

X_reduced.shape


Random projection is extremely fast:
- No training is required: fitting only generates the random matrix
- The matrix depends only on the shape of the data, not on its values
- Very memory efficient compared to PCA

This makes it ideal for text, genomics, and other very high-dimensional data.
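
How well distances survive can be checked empirically. The sketch below uses a much smaller synthetic dataset than the one above (50 samples, 5,000 features, ε = 0.3, all arbitrary choices made to keep it quick) and compares squared pairwise distances before and after projection.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import johnson_lindenstrauss_min_dim

m_demo, n_demo, eps_demo = 50, 5_000, 0.3
d_demo = int(johnson_lindenstrauss_min_dim(m_demo, eps=eps_demo))

rng = np.random.default_rng(42)
X_hd = rng.standard_normal((m_demo, n_demo))
P_hd = rng.standard_normal((d_demo, n_demo)) / np.sqrt(d_demo)
X_hd_reduced = X_hd @ P_hd.T

# Ratio of squared pairwise distances after vs. before projection
ratios = pdist(X_hd_reduced, "sqeuclidean") / pdist(X_hd, "sqeuclidean")
frac_within = np.mean(np.abs(ratios - 1) <= eps_demo)
print(f"d = {d_demo}, ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
print(f"fraction of pairs within tolerance: {frac_within:.1%}")
```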


## Gaussian Random Projection (Scikit-Learn)

Scikit-Learn provides a built-in transformer that performs
Gaussian random projection automatically.


In [None]:
from sklearn.random_projection import GaussianRandomProjection

gaussian_rnd_proj = GaussianRandomProjection(
    eps=epsilon,
    random_state=42
)

X_reduced_sklearn = gaussian_rnd_proj.fit_transform(X)
X_reduced_sklearn.shape


The random projection matrix is stored in the `components_` attribute.


In [None]:
gaussian_rnd_proj.components_.shape


## Sparse Random Projection

Sparse Random Projection uses a sparse random matrix:
- Much lower memory usage
- Faster computation
- Preserves sparsity of input data

It usually performs just as well as Gaussian random projection.


In [None]:
from sklearn.random_projection import SparseRandomProjection

sparse_rnd_proj = SparseRandomProjection(
    eps=epsilon,
    random_state=42
)

X_reduced_sparse = sparse_rnd_proj.fit_transform(X)
X_reduced_sparse.shape


### Why Sparse Random Projection?

- Uses far less memory
- Faster to compute
- Ideal for large or sparse datasets (e.g., text data)

By default, only about 1 in √n entries in the projection matrix is nonzero.
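
This default can be verified by inspecting a fitted transformer. The sketch below fits on an all-zeros array, since only the input shape matters, and compares the theoretical density 1/√n with the actual fraction of nonzero entries. The sizes are arbitrary.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

n_features = 10_000
proj = SparseRandomProjection(n_components=100, random_state=42)
# fit() only needs the input shape to generate the random matrix
proj.fit(np.zeros((10, n_features)))

expected_density = 1 / np.sqrt(n_features)
actual_density = proj.components_.nnz / np.prod(proj.components_.shape)
print(expected_density, proj.density_, actual_density)
```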


## Inverse Transformation (Approximate Reconstruction)

Random projection does not directly support inverse transformation.
However, we can approximate it using the pseudoinverse of the components matrix.


In [None]:
# Compute pseudoinverse of projection matrix
components_pinv = np.linalg.pinv(gaussian_rnd_proj.components_)

# Reconstruct the data
X_recovered = X_reduced_sklearn @ components_pinv.T

X_recovered.shape


⚠️ Warning:

Computing the pseudoinverse is expensive:
- Complexity is O(dn²) or O(nd²)
- Not recommended for very large projection matrices

Random projection is mainly used for speed and scalability,
not reconstruction accuracy.


## Summary

Random Projection is:
- Simple
- Extremely fast
- Memory efficient
- Surprisingly good at preserving distances

It is especially useful for:
- Very high-dimensional data
- Sparse datasets
- Situations where PCA is too slow or too expensive


# Locally Linear Embedding (LLE)

Locally Linear Embedding (LLE) is a **nonlinear dimensionality reduction**
technique based on **manifold learning**.

Unlike PCA or Random Projection, LLE does **not use projections**.
Instead, it preserves **local neighborhood relationships**, making it
especially effective at unrolling twisted manifolds.


## Key Idea Behind LLE

LLE works in two main steps:

1. For each data point, find its *k* nearest neighbors and compute
   weights that best reconstruct the point as a linear combination
   of its neighbors.

2. Find a low-dimensional representation that preserves these
   local linear relationships as closely as possible.


In [None]:
import numpy as np


## Creating a Swiss Roll Dataset

The Swiss roll is a classic example of a nonlinear manifold.
Although it lives in 3D space, it can be unrolled into 2D.


In [None]:
from sklearn.datasets import make_swiss_roll

X_swiss, t = make_swiss_roll(
    n_samples=1000,
    noise=0.2,
    random_state=42
)

X_swiss.shape


- `X_swiss` contains the 3D coordinates of each point
- `t` represents the position along the rolled dimension

We will **not** use `t` here, but it is useful for visualization
or regression tasks.


## Applying Locally Linear Embedding

We now use Scikit-Learn’s `LocallyLinearEmbedding` class to unroll
the Swiss roll into 2 dimensions.


In [None]:
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(
    n_components=2,
    n_neighbors=10,
    random_state=42
)

X_unrolled = lle.fit_transform(X_swiss)
X_unrolled.shape


The Swiss roll has now been unrolled into a 2D representation.

LLE preserves **local distances** very well, even though
**global distances may be distorted**.


## How LLE Works (Step 1: Local Linear Reconstruction)

For each data point x(i):

- Find its k nearest neighbors
- Compute weights wᵢⱼ that best reconstruct x(i)
  as a linear combination of its neighbors
- Weights sum to 1
- Non-neighbors have weight 0

This step captures the **local geometry** of the manifold.
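
Step 1 can be sketched for a single point. The code below uses the regularized closed-form solution commonly described for LLE (solve a small linear system on the local Gram matrix, then normalize). It is an illustration under that assumption, not a copy of Scikit-Learn's internals, and the regularization constant 1e-3 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

X_roll, _ = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)
k = 10

# Ask for k + 1 neighbors because each point is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_roll)
_, idx = nn.kneighbors(X_roll[:1])
neighbors = X_roll[idx[0, 1:]]          # shape (k, 3)

# Minimize ||x - sum_j w_j * neighbor_j||^2 subject to sum_j w_j = 1
diffs = neighbors - X_roll[0]           # neighbors relative to x
gram = diffs @ diffs.T                  # local Gram matrix, shape (k, k)
gram += 1e-3 * np.trace(gram) * np.eye(k)  # regularize: k > 3 makes it singular
w = np.linalg.solve(gram, np.ones(k))
w /= w.sum()                            # enforce the sum-to-one constraint

reconstruction = w @ neighbors
error = np.linalg.norm(X_roll[0] - reconstruction)
print("weights:", w.round(3))
print("reconstruction error:", error)
```

Repeating this for every point yields the full weight matrix, with zeros for all non-neighbors.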


## How LLE Works (Step 2: Low-Dimensional Embedding)

Using the fixed weights from step 1:

- Find low-dimensional points z(i)
- Ensure that each z(i) is reconstructed from its neighbors
  using the same weights

This step produces the final low-dimensional embedding.


## Why LLE Works Well

LLE excels when:
- The data lies on a **nonlinear manifold**
- Noise is relatively low
- The dataset is small to medium in size

It is particularly good at **unrolling twisted manifolds**.


## Computational Complexity

Scikit-Learn’s LLE implementation has the following costs:

- Nearest neighbors search: O(m log(m) · n log(k))
- Weight optimization: O(m n k³)
- Low-dimensional embedding: O(d m²)

⚠️ The **m² term** makes LLE scale poorly for very large datasets.


## Limitations of LLE

- Does not scale well to large datasets
- Sensitive to noise
- Global distances are not preserved
- Choosing the number of neighbors `k` is critical

Despite this, LLE often produces excellent results
for clean, nonlinear manifolds.


## Summary

Locally Linear Embedding (LLE):

- Is a **nonlinear** dimensionality reduction technique
- Preserves **local neighborhood relationships**
- Excels at unrolling nonlinear manifolds
- Does **not** rely on projections
- Is best suited for small to medium datasets


# Other Dimensionality Reduction Techniques

In addition to PCA, Random Projection, and LLE, Scikit-Learn provides
several other powerful dimensionality reduction techniques.

In this section, we explore:
- Multidimensional Scaling (MDS)
- Isomap
- t-SNE
- Linear Discriminant Analysis (LDA)


In [None]:
import numpy as np


## Dataset: Swiss Roll

We will reuse the Swiss roll dataset to compare different
dimensionality reduction techniques.


In [None]:
from sklearn.datasets import make_swiss_roll

X_swiss, t = make_swiss_roll(
    n_samples=1000,
    noise=0.2,
    random_state=42
)

X_swiss.shape


## Multidimensional Scaling (MDS)

Multidimensional Scaling (MDS) reduces dimensionality while trying
to preserve **pairwise distances** between instances.

It works well for **low-dimensional data**, but does not scale well
to very large datasets.


In [None]:
from sklearn.manifold import MDS

mds = MDS(
    n_components=2,
    random_state=42
)

X_mds = mds.fit_transform(X_swiss)
X_mds.shape


MDS flattens the Swiss roll while preserving **global distance structure**.
It tends to keep the overall curvature of the data.


## Isomap

Isomap builds a graph connecting each point to its nearest neighbors,
then computes **geodesic distances** (shortest paths on the graph).

It works best when the data lies on a smooth, low-dimensional manifold
with a single global structure.
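
The geodesic distances can be sketched directly with a nearest-neighbor graph and a shortest-path search. This only illustrates the idea, using `kneighbors_graph` and SciPy's `shortest_path`; it is not meant to reproduce the `Isomap` class step by step.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X_roll, _ = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

# Sparse graph linking each point to its 10 nearest neighbors,
# with edges weighted by Euclidean distance
graph = kneighbors_graph(X_roll, n_neighbors=10, mode="distance")

# Geodesic distance = length of the shortest path along graph edges
geodesic = shortest_path(graph, method="D", directed=False)
euclidean = cdist(X_roll, X_roll)

reachable = np.isfinite(geodesic)
stretch = geodesic[reachable].sum() / euclidean[reachable].sum()
print(f"geodesic distances are {stretch:.2f}x the straight-line ones on average")
```

Points on adjacent layers of the roll are close in a straight line but far apart along the graph, which is exactly the information Isomap tries to preserve.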


In [None]:
from sklearn.manifold import Isomap

isomap = Isomap(
    n_components=2,
    n_neighbors=10
)

X_isomap = isomap.fit_transform(X_swiss)
X_isomap.shape


Isomap often **completely unrolls** the Swiss roll, removing its curvature.
This is great for some tasks, but can destroy useful global structure.


## t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE focuses on keeping **similar instances close together**
and **dissimilar ones far apart**.

It is mainly used for **visualization**, not as a preprocessing step
for machine learning models.


In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    random_state=42
)

X_tsne = tsne.fit_transform(X_swiss)
X_tsne.shape


t-SNE:
- Preserves **local neighborhoods**
- Amplifies **clusters**
- Often distorts global structure

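It is excellent for visualization but rarely a good fit for downstream modeling,
in part because Scikit-Learn's `TSNE` has no `transform()` method for new data.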


## Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a **supervised** dimensionality
reduction technique.

It learns axes that **maximize class separability**, making it useful
before classification models.


⚠️ LDA requires **class labels**, so it cannot be applied directly
to the Swiss roll dataset.

Below is a minimal example using synthetic labeled data.


In [None]:
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_cls, y_cls = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=3,
    n_informative=10,
    random_state=42
)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_cls, y_cls)

X_lda.shape


LDA is especially effective when:
- Classes are well separated
- You want to reduce dimensionality **and** classify
- A linear decision boundary is appropriate
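
One practical constraint worth knowing: LDA can produce at most `n_classes - 1` axes (and no more than the number of features). The sketch below shows that with three classes, two components work while three are rejected. The dataset parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_classes=3,
    n_informative=10, random_state=42
)

# With 3 classes, at most 3 - 1 = 2 discriminant axes exist
X_demo_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_demo, y_demo)
print(X_demo_lda.shape)

try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_demo, y_demo)
    too_many_ok = True
except ValueError as exc:
    too_many_ok = False
    print("n_components=3 rejected:", exc)
```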


## UMAP (Not in Scikit-Learn)

UMAP (Uniform Manifold Approximation and Projection) is a popular
dimensionality reduction technique for visualization.

- Preserves both local and global structure
- Scales better than t-SNE
- Not included in Scikit-Learn

Available via the `umap-learn` package.


## Summary of Techniques

| Method | Type | Preserves | Typical Use |
|------|-----|----------|------------|
| MDS | Unsupervised | Pairwise distances | Small datasets |
| Isomap | Unsupervised | Geodesic distances | Smooth manifolds |
| t-SNE | Unsupervised | Local neighborhoods | Visualization |
| LDA | Supervised | Class separation | Classification |
| UMAP | Unsupervised | Local + global | Visualization |


## Final Takeaway

There is **no single best dimensionality reduction technique**.

The right choice depends on:
- Dataset size
- Linearity vs nonlinearity
- Supervised vs unsupervised
- Visualization vs modeling

Understanding the trade-offs is key.
