**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I** 

**The Fundamentals of Machine Learning** 

---

**CHAPTER 8 - Dimensionality Reduction** 

---

Chapter 8 explores Dimensionality Reduction, a crucial technique for handling high-dimensional datasets. Many ML problems involve thousands or millions of features, making training extremely slow and finding good solutions harder—a problem called the curse of dimensionality. This chapter covers projection and Manifold Learning approaches, and three popular techniques: PCA, Kernel PCA, and LLE. 

---

**The Curse of Dimensionality**   
We are used to living in three dimensions, so our intuition fails when imagining high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture, let alone a 200-dimensional ellipsoid bent in 1,000-dimensional space. 

**Figure 8-1. Point, segment, square, cube, and tesseract (0D to 4D hypercubes)**   
![Figure8-1.jpg](./08.Chapter-08/Figure8-1.jpg) 

Many things behave differently in high-dimensional space. If you pick a random point in a unit square, it has only about 0.4% chance of being located less than 0.001 from a border. But in a 10,000-dimensional unit hypercube, this probability is greater than 99.999999%—most points are very close to the border. 
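
Both figures can be checked with a short calculation. A point avoids every border only if each of its d coordinates falls in the interval [ε, 1 − ε], which happens with probability (1 − 2ε)^d, so the chance of being near a border is 1 − (1 − 2ε)^d. The sketch below evaluates this for ε = 0.001; the function name is illustrative, not from the chapter.

```python
import math

def prob_near_border(n_dims: int, eps: float = 0.001) -> float:
    """Probability that a uniform random point in the unit hypercube
    lies within `eps` of at least one border."""
    # 1 - (1 - 2*eps)**n_dims, written with log1p/expm1 so that precision
    # is kept when the result is extremely close to 1
    return -math.expm1(n_dims * math.log1p(-2 * eps))

p_square = prob_near_border(2)       # about 0.004, i.e. roughly 0.4%
p_10k = prob_near_border(10_000)     # exceeds 0.99999999
```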

More troublesome: if you pick two points randomly in a unit square, the distance between them will be roughly 0.52 on average. In a unit 3D cube, it's roughly 0.66. But in a 1,000,000-dimensional hypercube, the average distance is about 408.25! This is counterintuitive, but there's plenty of space in high dimensions. High-dimensional datasets are at risk of being very **sparse**: most training instances are likely far away from each other. This means new instances will likely be far from any training instance, making predictions less reliable. In short, the more dimensions, the greater the risk of overfitting. 
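
These distances can be estimated with a small simulation. The sketch below samples random pairs of points for the 2D and 3D cases. For the million-dimensional case it relies on a standard approximation rather than sampling: each coordinate contributes an expected squared difference of 1/6, so the distance concentrates around √(d/6). The helper name and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_pairwise_distance(n_dims, n_pairs=20_000):
    """Monte Carlo estimate of the mean distance between two uniform
    random points in the unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

dist_2d = mean_pairwise_distance(2)    # close to 0.52
dist_3d = mean_pairwise_distance(3)    # close to 0.66
approx_1m = np.sqrt(1_000_000 / 6)     # about 408.25
```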

In theory, one solution could be to increase the size of the training set until it reaches a sufficient density of training instances. Unfortunately, the number of training instances required grows exponentially with the number of dimensions. With just 100 features, you'd need more training instances than there are atoms in the observable universe for instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.

---

**Main Approaches for Dimensionality Reduction**   
Before diving into specific algorithms, let's examine two main approaches: projection and Manifold Learning.

**Projection**   
In most real-world problems, training instances aren't spread uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space.

**Figure 8-2. A 3D dataset lying close to a 2D subspace**   
![Figure8-2.jpg](./08.Chapter-08/Figure8-2.jpg) 

If we project every training instance perpendicularly onto this subspace, we get a new 2D dataset, reducing dimensionality from 3D to 2D. 

**Figure 8-3. The new 2D dataset after projection**   
![Figure8-3.jpg](./08.Chapter-08/Figure8-3.jpg) 

However, projection isn't always the best approach. In many cases, the subspace may twist and turn, like the famous Swiss roll toy dataset. 

**Figure 8-4. Swiss roll dataset**   
![Figure8-4.jpg](./08.Chapter-08/Figure8-4.jpg) 

Simply projecting onto a plane (e.g., dropping x3) would squash different layers together. What you really want is to unroll the Swiss roll. 

**Figure 8-5. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right)**   
![Figure8-5.jpg](./08.Chapter-08/Figure8-5.jpg) 

**Manifold Learning**   
The Swiss roll is an example of a **2D manifold**: a 2D shape that can be bent and twisted in higher-dimensional space. More generally, a d-dimensional manifold is part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. For the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane but is rolled in the third dimension. 
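
To make the d = 2, n = 3 case concrete, the sketch below generates a Swiss roll with Scikit-Learn's make_swiss_roll(), which returns both the 3D points and the position t of each point along the spiral. It then builds two 2D versions: a naive projection that drops x3, and an "unrolled" version that uses t together with the roll's width axis. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# noise=0.0 keeps the points exactly on the manifold
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

# Naive projection: keep x1 and x2, drop x3 (layers get squashed together)
X_squashed = X_roll[:, :2]

# Unrolled coordinates: position along the spiral (t) and the width axis (x2)
X_unrolled = np.column_stack([t, X_roll[:, 1]])
```

Two points that are far apart along the spiral can share nearly the same x1 value, which is why they overlap in X_squashed but stay separated in X_unrolled.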

Many dimensionality reduction algorithms work by modeling the manifold on which training instances lie—called **Manifold Learning**. It relies on the **manifold assumption** (or manifold hypothesis), which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed. 

The manifold assumption is often accompanied by another implicit assumption: the task (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space. 

**Figure 8-6. The decision boundary may not always be simpler with lower dimensions**   
![Figure8-6.jpg](./08.Chapter-08/Figure8-6.jpg) 

The top row of Figure 8-6 shows the Swiss roll split into two classes: in 3D space the decision boundary is fairly complex, but in the 2D unrolled manifold space it's a straight line. The bottom row shows the opposite case: the decision boundary at x1 = 5 is simple in 3D (a vertical plane) but more complex in the unrolled manifold (a collection of four independent line segments).

Reducing dimensionality before training usually speeds up training but may not always lead to better or simpler solutions—it depends on the dataset. 

---

**PCA**   
**Principal Component Analysis (PCA)** is by far the most popular dimensionality reduction algorithm. It identifies the hyperplane that lies closest to the data, then projects the data onto it.

**Preserving the Variance**   
Before projecting onto a lower-dimensional hyperplane, you must choose the right hyperplane. 

**Figure 8-7. Selecting the subspace to project on**   
![Figure8-7.jpg](./08.Chapter-08/Figure8-7.jpg) 

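The left side of Figure 8-7 shows a simple 2D dataset along with three different axes (i.e., 1D hyperplanes); the right side shows the result of projecting the dataset onto each axis. The projection onto the solid line preserves the maximum variance, the projection onto the dotted line preserves very little, and the projection onto the dashed line preserves an intermediate amount.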

It's reasonable to select the axis preserving maximum variance, as it will most likely lose less information. Another justification: it's the axis that minimizes mean squared distance between the original dataset and its projection. This is the simple idea behind PCA.
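
The equivalence of the two justifications can be checked numerically. The sketch below builds a toy 2D dataset (an assumption, not the data behind Figure 8-7), tries 180 candidate axes through the origin, and measures for each both the variance of the projection and the mean squared distance between the points and their projections.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy elongated 2D cloud, rotated by 30 degrees and centered on the origin
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T
X = X - X.mean(axis=0)

# Candidate axes: one unit vector per row, covering half a turn
angles = np.linspace(0, np.pi, 180, endpoint=False)
axes = np.column_stack([np.cos(angles), np.sin(angles)])

proj = X @ axes.T                     # coordinate of each point along each axis
variance = proj.var(axis=0)           # variance preserved by each axis
# Pythagoras: squared distance to the axis = squared norm - squared coordinate
sq_dist = (X ** 2).sum(axis=1, keepdims=True) - proj ** 2
mse = sq_dist.mean(axis=0)

best_by_variance = angles[np.argmax(variance)]
best_by_distance = angles[np.argmin(mse)]
```

For centered data the two quantities always add up to the same constant (the mean squared norm of the points), so maximizing one is the same as minimizing the other.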

**Principal Components**   
PCA identifies the axis accounting for the largest variance in the training set. In Figure 8-7, it's the solid line. It also finds a second axis, orthogonal to the first, accounting for the largest remaining variance. For higher-dimensional datasets, PCA finds a third axis, fourth, fifth, and so on—as many as the number of dimensions. 

The **ith axis** is called the **ith principal component (PC)**. For each principal component, PCA finds a zero-centered unit vector pointing in the PC's direction. 

To find the principal components, use **Singular Value Decomposition (SVD)**, a standard matrix factorization technique that decomposes the training set matrix X into the matrix product of three matrices U Σ V⊺, where V contains the unit vectors that define all the principal components.

**Equation 8-1. Principal components matrix**   
![Eq8-1.jpg](./08.Chapter-08/Eq8-1.jpg) 

The following code uses NumPy's svd() function to obtain all principal components, then extracts the two unit vectors defining the first two PCs:

In [None]:
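import numpy as np

# Center the data first: PCA assumes the dataset is centered on the origin
X_centered = X - X.mean(axis=0)
# The rows of Vt (i.e., the columns of V) are the principal components
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component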

**Note**: PCA assumes the dataset is centered around the origin. Scikit-Learn's PCA classes take care of centering for you. If implementing PCA yourself or using other libraries, don't forget to center the data first.
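
The properties claimed for the decomposition are easy to verify on a toy dataset. The sketch below uses made-up 3D data lying close to a 2D plane, runs the SVD on the centered data, and rebuilds the centered matrix from the three factors.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy 3D dataset close to a 2D plane, deliberately not centered
m = 200
latent = rng.normal(size=(m, 2)) * [3.0, 1.0]
basis = np.array([[1.0, 0.5, 0.2],
                  [0.0, 1.0, 0.3]])
X = latent @ basis + 0.05 * rng.normal(size=(m, 3)) + [5.0, -2.0, 1.0]

X_centered = X - X.mean(axis=0)
# full_matrices=False returns the compact factors: U is (m, 3), Vt is (3, 3)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
c1, c2 = Vt[0], Vt[1]          # same vectors as Vt.T[:, 0] and Vt.T[:, 1]

# Rebuild the centered data from U, the singular values, and Vt
X_rebuilt = U @ np.diag(s) @ Vt
```

The principal components come out as orthogonal unit vectors, and the singular values in s are sorted in decreasing order, which is why the first rows of Vt correspond to the directions of largest variance.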

**Projecting Down to d Dimensions**   
Once you've identified all principal components, reduce dimensionality to d dimensions by projecting onto the hyperplane defined by the first d principal components. This ensures the projection preserves as much variance as possible. 

To project the training set onto the hyperplane and obtain reduced dataset X_d-proj of dimensionality d, compute the matrix multiplication of training set matrix X by matrix W_d, defined as the matrix containing the first d columns of V. 

**Equation 8-2. Projecting the training set down to d dimensions**   
![Eq8-2.jpg](./08.Chapter-08/Eq8-2.jpg) 

The following code projects the training set onto the plane defined by the first two principal components:

In [None]:
W2 = Vt.T[:, :2]
X2D = X_centered.dot(W2)

**Using Scikit-Learn**   
Scikit-Learn's PCA class uses SVD decomposition, automatically taking care of centering the data:

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

After fitting, the components_ attribute holds the transpose of W_d (e.g., the first PC unit vector equals pca.components_.T[:, 0]).
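
As a sanity check, the manual SVD route and the PCA class should produce the same 2D dataset, possibly with flipped signs, since a principal component and its opposite define the same axis. The sketch below compares them on made-up correlated data; svd_solver="full" is set explicitly so that both routes rely on a full SVD.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # toy correlated data

# Manual route: center, decompose, project onto the first two PCs
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X2D_manual = X_centered @ Vt[:2].T

# Scikit-Learn route (centering is handled internally)
pca = PCA(n_components=2, svd_solver="full")
X2D_sklearn = pca.fit_transform(X)

# Compare absolute values to ignore a possible sign flip per component
same_up_to_sign = np.allclose(np.abs(X2D_manual), np.abs(X2D_sklearn))
```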

**Explained Variance Ratio**   
Another useful piece of information is the **explained variance ratio** of each principal component, available via explained_variance_ratio_. The ratio indicates the proportion of the dataset's variance lying along each PC:

In [None]:
>>> pca.explained_variance_ratio_
array([0.84248607, 0.14631839])

This tells you 84.2% of the dataset's variance lies along the first PC, and 14.6% along the second PC. This leaves less than 1.2% for the third PC, so it probably carries little information.
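
The ratios follow directly from the singular values: the variance along the ith PC is the squared ith singular value divided by m − 1, and each ratio is that variance divided by the total. The sketch below computes them with NumPy on made-up data, so the numbers differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3)) * [4.0, 1.5, 0.3]   # toy data, unequal spread
X_centered = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(X_centered, full_matrices=False)

explained_variance = s ** 2 / (len(X) - 1)        # variance along each PC
explained_variance_ratio = explained_variance / explained_variance.sum()
```

The variances along the PCs add up to the total variance of the original features, so the ratios always sum to 1.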

**Choosing the Right Number of Dimensions**   
Instead of arbitrarily choosing the number of dimensions to reduce down to, choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). The exception is data visualization, where you'll generally want to reduce to 2 or 3 dimensions.

The following code performs PCA without reducing dimensionality, then computes the minimum dimensions required to preserve 95% of variance:

In [None]:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

A better option: instead of specifying the number of principal components, set n_components to a float between 0.0 and 1.0, indicating the variance ratio to preserve:

In [None]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
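
Both approaches should select the same number of dimensions. The sketch below checks this on a synthetic training set (20 features with decaying spread, not MNIST); after fitting, the number of components actually kept is available in the n_components_ attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Toy training set: the spread of the 20 features decays from 10 to 0.1
X_train = rng.normal(size=(500, 20)) * np.geomspace(10.0, 0.1, 20)

# Route 1: full PCA, then count the components needed to reach 95%
cumsum = np.cumsum(PCA().fit(X_train).explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1

# Route 2: let PCA choose the number of components itself
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_train)
```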

Another option is plotting explained variance as a function of dimensions. 

**Figure 8-8. Explained variance as a function of the number of dimensions**   
![Figure8-8.jpg](./08.Chapter-08/Figure8-8.jpg) 

Figure 8-8 shows the resulting curve, which usually has an elbow where the explained variance stops growing fast. In this case, reducing to about 100 dimensions wouldn't lose too much explained variance.

**PCA for Compression**   
After dimensionality reduction, the training set takes much less space. Applying PCA to MNIST while preserving 95% variance results in just over 150 features instead of 784. The dataset is now less than 20% of its original size—a reasonable compression ratio that can tremendously speed up classification algorithms. 

It's possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation. This won't give back the original data since projection lost information (within the 5% variance dropped), but it will likely be close. The mean squared distance between original and reconstructed data is called the **reconstruction error**. 

The following code compresses MNIST down to 154 dimensions, then uses inverse_transform() to decompress back to 784 dimensions:

In [None]:
pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)

**Figure 8-9. MNIST compression that preserves 95% of the variance**   
![Figure8-9.jpg](./08.Chapter-08/Figure8-9.jpg) 

Figure 8-9 shows a few digits from the original training set (left) and the corresponding digits after compression and decompression (right). There's a slight loss of image quality, but the digits are still mostly intact.

**Equation 8-3. PCA inverse transformation, back to the original number of dimensions**   
![Eq8-3.jpg](./08.Chapter-08/Eq8-3.jpg)
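
The inverse transformation amounts to multiplying the reduced dataset by the transpose of W_d. The sketch below does this by hand on made-up data (10 features reduced to 4), adding the mean back because the data was centered manually. It also checks a known property of PCA: the reconstruction error equals the variance carried by the dropped components, here the sum of the squared dropped singular values divided by m.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10)) * np.linspace(5.0, 0.1, 10)   # toy data
mean = X.mean(axis=0)
X_centered = X - mean
_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 4
W_d = Vt[:d].T                           # first d principal components as columns
X_reduced = X_centered @ W_d             # compress: (500, 10) -> (500, 4)
X_recovered = X_reduced @ W_d.T + mean   # decompress: back to (500, 10)

# Mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
dropped = np.sum(s[d:] ** 2) / len(X)
```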

**Randomized PCA**   
If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called **Randomized PCA** that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach, so it's dramatically faster than full SVD when d is much smaller than n:

In [None]:
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)

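By default, svd_solver is set to "auto": Scikit-Learn automatically uses randomized PCA if m or n is greater than 500 and d is less than 80% of m or n, else it uses full SVD. This heuristic has changed in more recent Scikit-Learn releases, so check the PCA documentation for your installed version. To force full SVD, set svd_solver to "full".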
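
To get a feel for the quality of the approximation, the sketch below fits both solvers on synthetic data with five dominant directions (an assumption chosen to keep the example small) and compares the explained variance they report. The random_state argument makes the stochastic solver reproducible.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Toy data: 5 strong latent directions in 50 features, plus slight noise
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))
X += 0.01 * rng.normal(size=X.shape)

full_pca = PCA(n_components=5, svd_solver="full").fit(X)
rnd_pca = PCA(n_components=5, svd_solver="randomized", random_state=42).fit(X)

# Largest relative gap between the variances found by the two solvers
max_rel_diff = np.max(
    np.abs(full_pca.explained_variance_ - rnd_pca.explained_variance_)
    / full_pca.explained_variance_
)
```

On data like this, with a clear gap between the kept and dropped components, the two solvers should agree very closely; the approximation can be looser when the spectrum decays slowly.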

**Incremental PCA**   
One problem with the preceding PCA implementations is that they require the whole training set to fit in memory. Fortunately, **Incremental PCA (IPCA)** algorithms have been developed. They allow splitting the training set into mini-batches and feeding an IPCA algorithm one mini-batch at a time. This is useful for large training sets and for applying PCA online (on the fly, as new instances arrive).

The following code splits MNIST into 100 mini-batches and feeds them to Scikit-Learn's IncrementalPCA class to reduce dimensionality to 154 dimensions:

In [None]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
X_reduced = inc_pca.transform(X_train)

Note: you must call partial_fit() with each mini-batch, rather than fit() with the whole training set. 

Alternatively, use NumPy's memmap class, which allows manipulating a large array stored in a binary file on disk as if it were entirely in memory:

In [None]:
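# Assumes filename points to an existing float32 binary file holding an (m, n) array
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
# n_batches is reused from the previous example
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)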
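
The snippet above assumes the binary file already exists. A self-contained sketch of the whole round trip follows, using a small synthetic array and a temporary file; the file name, array sizes, and number of components are placeholders rather than the MNIST settings.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

m, n = 1000, 20                       # toy sizes; real use cases are far larger
rng = np.random.default_rng(42)
X_train = rng.normal(size=(m, n)).astype("float32")

tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "toy_data.dat")   # hypothetical file name

# Step 1: write the array to a binary file on disk
X_mm = np.memmap(filename, dtype="float32", mode="w+", shape=(m, n))
X_mm[:] = X_train
X_mm.flush()
del X_mm                              # release the writable map

# Step 2: reopen it read-only and fit IncrementalPCA batch by batch
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
n_batches = 10
inc_pca = IncrementalPCA(n_components=5, batch_size=m // n_batches)
inc_pca.fit(X_mm)
X_reduced = inc_pca.transform(X_mm)
```

The dtype and shape passed when reopening must match what was written, because the raw binary file stores no metadata.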

---

**Kernel PCA**   
In Chapter 5 we discussed the **kernel trick**, a mathematical technique that implicitly maps instances into a very high-dimensional space (called feature space), enabling nonlinear classification and regression with SVMs. A linear decision boundary in high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. 

The same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called **Kernel PCA (kPCA)**. It's often good at preserving clusters after projection, or sometimes even unrolling datasets lying close to twisted manifolds. 

The following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel:

In [None]:
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

**Figure 8-10. Swiss roll reduced to 2D using kPCA with various kernels**   
![Figure8-10.jpg](./08.Chapter-08/Figure8-10.jpg) 

Figure 8-10 shows the Swiss roll reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel.
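
The equivalence between a linear kernel and plain PCA can be verified numerically. The sketch below applies both to a small Swiss roll and compares the results up to a sign flip per component. The eigen_solver="dense" and svd_solver="full" settings are choices made here so that both sides use exact decompositions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_swiss_roll(n_samples=300, noise=0.1, random_state=42)

X_pca = PCA(n_components=2, svd_solver="full").fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="linear",
                   eigen_solver="dense").fit_transform(X)

# Compare absolute values: each component is only defined up to its sign
max_abs_diff = np.max(np.abs(np.abs(X_pca) - np.abs(X_kpca)))
```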

**Selecting a Kernel and Tuning Hyperparameters**   
As kPCA is unsupervised, there's no obvious performance measure to help select the best kernel and hyperparameters. However, dimensionality reduction is often a preparation step for supervised learning, so you can use grid search to select the kernel and hyperparameters leading to best performance on that task. 

The following code creates a two-step pipeline: first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression. It uses GridSearchCV to find the best kernel and gamma value for kPCA:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression())
    ])
param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"]
    }]
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

The best kernel and hyperparameters are available through best_params_:

In [None]:
>>> print(grid_search.best_params_)
{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}

Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. Note that reconstruction isn't as easy as with linear PCA.

**Figure 8-11. Kernel PCA and the reconstruction pre-image error**   
![Figure8-11.jpg](./08.Chapter-08/Figure8-11.jpg) 

Figure 8-11 shows the original Swiss roll 3D dataset (top left) and the resulting 2D dataset after kPCA is applied using an RBF kernel (top right). Thanks to the kernel trick, this transformation is mathematically equivalent to using the feature map φ to map the training set to an infinite-dimensional feature space (bottom right), then projecting the transformed training set down to 2D using linear PCA.

If we could invert the linear PCA step for a given instance, the reconstructed point would lie in feature space, not original space. Since feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore cannot compute true reconstruction error. Fortunately, it's possible to find a point in original space that would map close to the reconstructed point—called the **reconstruction pre-image**. Once you have this pre-image, you can measure its squared distance to the original instance. Then select the kernel and hyperparameters minimizing this reconstruction pre-image error. 

One solution is training a supervised regression model, with projected instances as the training set and original instances as targets. Scikit-Learn will do this automatically if you set fit_inverse_transform=True:

In [None]:
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

**Note**: By default, fit_inverse_transform=False and KernelPCA has no inverse_transform() method. This method only gets created when you set fit_inverse_transform=True. 

You can then compute the reconstruction pre-image error:

In [None]:
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132

Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this error. 
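
One possible way to run such a search is sketched below. It is a simplification: it uses a single train/validation split instead of full cross-validation, and it only varies gamma for the RBF kernel to keep the example short. The Swiss roll size, noise level, and gamma range are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, _ = make_swiss_roll(n_samples=400, noise=0.2, random_state=42)
X_train, X_val = train_test_split(X, test_size=0.25, random_state=42)

errors = {}
for gamma in np.linspace(0.03, 0.05, 5):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    kpca.fit(X_train)
    # Project held-out instances, map them back, and measure the error
    X_val_preimage = kpca.inverse_transform(kpca.transform(X_val))
    errors[float(gamma)] = mean_squared_error(X_val, X_val_preimage)

best_gamma = min(errors, key=errors.get)
```

Measuring the error on held-out instances rather than on the training instances makes it less likely that the search simply rewards the setting that memorizes the training set best.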

---

**LLE**   
**Locally Linear Embedding (LLE)** is another powerful nonlinear dimensionality reduction (NLDR) technique. It's a Manifold Learning technique that doesn't rely on projections. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors, then looking for a low-dimensional representation where these local relationships are best preserved. This approach makes it particularly good at unrolling twisted manifolds, especially when there isn't too much noise. 

The following code uses Scikit-Learn's LocallyLinearEmbedding class to unroll the Swiss roll:

In [None]:
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

**Figure 8-12. Unrolled Swiss roll using LLE**   
![Figure8-12.jpg](./08.Chapter-08/Figure8-12.jpg) 

Figure 8-12 shows the resulting 2D dataset. The Swiss roll is completely unrolled, and distances between instances are locally well preserved. However, distances aren't preserved on a larger scale: the left part of the unrolled Swiss roll is stretched, while the right part is squeezed. Nevertheless, LLE did a pretty good job of modeling the manifold.

Here's how LLE works: for each training instance x(i), the algorithm identifies its k closest neighbors (in the code k = 10), then tries to reconstruct x(i) as a linear function of these neighbors. More specifically, it finds weights w_i,j such that the squared distance between x(i) and Σw_i,j x(j) is as small as possible, assuming w_i,j = 0 if x(j) isn't one of the k closest neighbors of x(i). 

**Equation 8-4. LLE step one: linearly modeling local relationships**   
![Eq8-4.jpg](./08.Chapter-08/Eq8-4.jpg) 
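
For a single instance, this first step reduces to a small linear system. In the standard LLE formulation the weights of each instance are also constrained to sum to 1, and the optimal weights are then proportional to the solution of G w = 1, where G is the Gram matrix of the offsets between the instance and its neighbors. The sketch below implements this for one point; the helper name and the regularization term are assumptions added for illustration, and the neighbors are given by hand rather than found by a nearest-neighbor search.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights that best reconstruct `x` as a combination of `neighbors`,
    constrained to sum to 1 (hypothetical helper, for illustration)."""
    diffs = x - neighbors                  # (k, n) offsets from each neighbor
    gram = diffs @ diffs.T                 # (k, k) local Gram matrix
    # Small ridge term: keeps the system solvable when k exceeds n
    gram += reg * np.trace(gram) * np.eye(len(neighbors))
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()                     # enforce the sum-to-one constraint

neighbors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.25, 0.25])
w = lle_weights(x, neighbors)              # close to [0.5, 0.25, 0.25]
x_rebuilt = w @ neighbors                  # close to x
```

Repeating this for every training instance, with all other weights set to 0, fills in one row of the weight matrix at a time.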

After this step, weight matrix W (containing weights w_i,j) encodes local linear relationships between training instances. The second step maps training instances into d-dimensional space (where d < n) while preserving these local relationships as much as possible. If z(i) is the image of x(i) in this d-dimensional space, we want the squared distance between z(i) and Σw_i,j z(j) to be as small as possible. 

**Equation 8-5. LLE step two: reducing dimensionality while preserving relationships**   
![Eq8-5.jpg](./08.Chapter-08/Eq8-5.jpg) 

Scikit-Learn's LLE implementation has the following computational complexity: O(m log(m)n log(k)) for finding the k nearest neighbors, O(mnk³) for optimizing the weights, and O(dm²) for constructing the low-dimensional representations. Unfortunately, the m² in the last term makes this algorithm scale poorly to very large datasets.

---

**Other Dimensionality Reduction Techniques**   
There are many other dimensionality reduction techniques available in Scikit-Learn:

**Random Projections** - Projects data to lower-dimensional space using random linear projection. Such random projection is actually very likely to preserve distances well, as demonstrated mathematically by William B. Johnson and Joram Lindenstrauss in a famous lemma. Quality depends on the number of instances and target dimensionality, but surprisingly not on initial dimensionality. 

**Multidimensional Scaling (MDS)** - Reduces dimensionality while trying to preserve distances between instances. 

**Isomap** - Creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve geodesic distances (number of nodes on the shortest path) between instances. 

**t-Distributed Stochastic Neighbor Embedding (t-SNE)** - Reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It's mostly used for visualization, particularly to visualize clusters of instances in high-dimensional space (e.g., visualizing MNIST images in 2D). 

**Linear Discriminant Analysis (LDA)** - A classification algorithm that during training learns the most discriminative axes between classes. These axes can define a hyperplane to project the data. The benefit is the projection keeps classes as far apart as possible, so LDA is a good technique for reducing dimensionality before running another classification algorithm like an SVM classifier. 
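
Returning to the first technique in this list, the claim that the initial dimensionality does not matter can be illustrated with NumPy alone. The sketch below uses the usual form of the Johnson-Lindenstrauss bound, d ≥ 4 ln(m) / (ε²/2 − ε³/3), which depends only on the number of instances m and the tolerated distortion ε of squared distances. It then applies a Gaussian random projection to made-up data and measures the distortion actually observed. The helper names and sizes are illustrative.

```python
import numpy as np

def jl_min_dim(n_samples: int, eps: float) -> int:
    """Johnson-Lindenstrauss bound on the target dimensionality."""
    return int(4 * np.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3))

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

rng = np.random.default_rng(42)
m, n, eps = 50, 5000, 0.3
d = jl_min_dim(m, eps)                       # computed without using n

X = rng.normal(size=(m, n))                  # toy high-dimensional data
P = rng.normal(size=(n, d)) / np.sqrt(d)     # Gaussian random projection matrix
X_proj = X @ P

# Ratio of projected to original squared distance for each pair of instances
iu = np.triu_indices(m, k=1)
ratios = pairwise_sq_dists(X_proj)[iu] / pairwise_sq_dists(X)[iu]
max_distortion = np.max(np.abs(ratios - 1))
```

Here 5,000 features are reduced to a few hundred while the pairwise squared distances stay within a modest relative error; the projection matrix is drawn at random without looking at the data.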

**Figure 8-13. Using various techniques to reduce the Swiss roll to 2D**   
![Figure8-13.jpg](./08.Chapter-08/Figure8-13.jpg) 

Figure 8-13 shows the results of several of these techniques applied to the Swiss roll.