# **Chapter 8 - Dimensionality Reduction**
Many machine learning problems involve thousands or even millions of features for each training instance. This makes training slow and gives rise to **the curse of dimensionality** (explained later in the chapter). Dimensionality reduction is a solution to this problem, and it has two main advantages:
- It speeds up training.
- It helps with data visualization.

# The Curse of Dimensionality
- A machine learning model can get confused by the noise in a dataset.
    - Most of the time this is due to a high number of features.
- A high number of features also slows down training.
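
The effect of noisy extra features can be sketched with a small experiment. This is only an illustration under assumptions that are not from the chapter: a k-nearest-neighbors classifier, the moons dataset with a little noise, and 200 made-up pure-noise features appended to the 2 informative ones.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two informative features (assumed toy dataset)
X_clean, y_clean = make_moons(n_samples=500, noise=0.1, random_state=0)

# Append 200 features of pure Gaussian noise (hypothetical, for illustration)
noise_features = rng.normal(size=(500, 200))
X_noisy = np.hstack([X_clean, noise_features])

knn = KNeighborsClassifier(n_neighbors=5)
acc_clean = cross_val_score(knn, X_clean, y_clean, cv=5).mean()
acc_noisy = cross_val_score(knn, X_noisy, y_clean, cv=5).mean()

print(f"2 features:   accuracy {acc_clean:.2f}")
print(f"202 features: accuracy {acc_noisy:.2f}")
```

The same model sees the same two useful features in both runs. The only difference is the amount of irrelevant input it has to cope with.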

# Main Approaches of Dimensionality Reduction
There are two main approaches to dimensionality reduction in machine learning:

## Projection
In most real-world problems the dataset is not uniformly distributed across all dimensions. Many features are almost constant.
So in projection we find the plane (more generally, the lower-dimensional subspace) in which most of the data points lie, and then project the whole dataset onto it.
However, sometimes no such plane exists, as in the famous Swiss roll dataset, where you have to unroll the data to get an accurate representation.
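
A minimal sketch of projection, assuming a synthetic 3D dataset whose points lie close to a known plane. The basis vectors `u` and `v`, the noise level, and all variable names are illustrative choices, not something from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthonormal vectors spanning a plane in 3D (assumed for illustration)
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
basis = np.vstack([u, v]).T          # shape (3, 2)

# 200 points on the plane, plus a little off-plane noise
coords = rng.normal(size=(200, 2))
X3d = coords @ basis.T + 0.01 * rng.normal(size=(200, 3))

# Projection: express each point with 2 coordinates in the plane
X2d = X3d @ basis                    # shape (200, 2)

# Map back to 3D to see how much the projection lost
X_back = X2d @ basis.T
max_err = np.abs(X3d - X_back).max()
print(X2d.shape, max_err)
```

Because the points lie almost exactly on the plane, the 2D coordinates describe them nearly as well as the original 3D ones.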

## Manifold Learning
Manifold learning is a family of dimensionality reduction techniques that model the lower-dimensional shape (the manifold) on which the data lies, and then map the dataset onto a lower-dimensional space, reducing its dimension.
- e.g. unrolling the Swiss roll dataset.
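
As a rough sketch of unrolling the Swiss roll, the example below uses Scikit-Learn's `LocallyLinearEmbedding` (LLE). LLE is only one manifold learning algorithm, chosen here as an assumption. The sample size, noise level, and `n_neighbors=10` are illustrative values.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3D Swiss roll; t is each point's position along the roll
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# LLE tries to preserve local neighborhoods while reducing to 2D
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)

print(X_swiss.shape, "->", X_unrolled.shape)
```

Plotting `X_unrolled` colored by `t` is a simple way to check visually whether the roll was actually flattened.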

# PCA (Principal Component Analysis)
- This is the most popular dimensionality reduction algorithm.
- First it identifies the hyperplane that lies closest to the data.
- Then it projects the data onto that hyperplane.

## Preserving the Variance
- Before you project the data onto a hyperplane, you need to select the right hyperplane.
- The best way to select it is to choose the hyperplane that preserves the maximum variance.
    - Preserving more variance most likely means losing less information.
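
The idea can be checked numerically. The sketch below assumes a synthetic elongated 2D cloud rotated by 30 degrees (an illustrative choice), measures the variance of the data projected onto many candidate axes, and compares the best one with the first principal component obtained from SVD.

```python
import numpy as np

rng = np.random.default_rng(42)

# Elongated cloud: std 3.0 along one axis, 0.5 along the other, rotated by 30 degrees
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X_demo = (rng.normal(size=(500, 2)) * [3.0, 0.5]) @ rotation.T
X_demo = X_demo - X_demo.mean(axis=0)

# Variance of the projection onto unit vectors at 1-degree steps
angles = np.deg2rad(np.arange(180))
directions = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (180, 2)
variances = (X_demo @ directions.T).var(axis=0)
best_angle_deg = int(np.rad2deg(angles[variances.argmax()]).round())

# Variance along the first principal component found by SVD
_, _, Vt_demo = np.linalg.svd(X_demo, full_matrices=False)
pc1_variance = (X_demo @ Vt_demo[0]).var()

print("best candidate axis (degrees):", best_angle_deg)
print("max candidate variance:", variances.max())
print("variance along first PC:", pc1_variance)
```

The axis with the largest projected variance should sit close to the 30 degree direction along which the cloud was stretched, and no candidate axis can beat the first principal component.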

## Principal Component
- PCA identifies the axis that accounts for the largest amount of variance in the training set, then a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, and so on.
- The $i^{th}$ axis is called the $i^{th}$ *principal component* (PC).
- We find the principal components of the dataset with a standard matrix factorization technique called **Singular Value Decomposition (SVD)**. Here is an implementation in Python using NumPy:

In [1]:
from sklearn.datasets import make_moons
X, y = make_moons()

In [2]:
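import numpy as np

# Center the data: this SVD-based approach assumes the dataset is centered on the origin
X_centered = X - X.mean(axis=0)
# np.linalg.svd returns V transposed, so each row of Vt is a principal component
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component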

**V** contains all of the principal components that we are looking for:
$$
V = (c_1, c_2, \dots, c_n)
$$
- Here the vectors $c_1$ to $c_n$ are the principal components, ordered by the amount of variance they preserve, so $c_1$ is the first principal component.
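
Two properties of the SVD output are worth checking: the principal components are orthonormal, and the variance along each one can be read from the singular values as $s_i^2 / (m - 1)$ for $m$ instances. The sketch below verifies both with NumPy on the moons dataset. The `random_state` value and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=100, random_state=42)
X_c = X_moons - X_moons.mean(axis=0)

_, s, Vt = np.linalg.svd(X_c, full_matrices=False)
V = Vt.T  # columns are the principal components c_1, c_2

# The principal components form an orthonormal set
is_orthonormal = np.allclose(V.T @ V, np.eye(V.shape[1]))

# Variance along each component, computed two ways
var_from_s = s**2 / (len(X_c) - 1)
var_from_projection = (X_c @ V).var(axis=0, ddof=1)

# Share of the total variance carried by each component
ratio = var_from_s / var_from_s.sum()
print(is_orthonormal, var_from_s, ratio)
```

Since `np.linalg.svd` returns the singular values in descending order, the first column of `V` always carries the most variance.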

## Projecting Down to d Dimensions
Now that we have identified the principal components, we can reduce the dataset to $d$ dimensions by projecting it onto the hyperplane defined by the first $d$ principal components.
- To project the training set onto the hyperplane and obtain a reduced dataset $X_{d-proj}$, multiply the (centered) training set matrix $X$ by the matrix $W_d$, which contains the first $d$ columns of $V$.

$$
X_{d-proj} = XW_d
$$

In [3]:
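# W2 holds the first two principal components as columns
W2 = Vt.T[:, :2]
# Project the centered data onto those two components
X2D = X_centered.dot(W2)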

## Using Scikit-Learn
*Scikit-Learn* provides its own implementation of **PCA**, which uses SVD internally and centers the data for you:

In [4]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

In [5]:
pca.explained_variance_ratio_

array([0.81968748, 0.18031252])

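This tells you that the first PC carries about 82% of the dataset's variance and the second PC about 18%. Because the moons dataset has only two features, keeping both components preserves all of the variance. Keeping only the first PC would preserve about 82%.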
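
A quick way to convince yourself that the two routes agree is to run both on the same data. The sketch below assumes the moons dataset with an arbitrary `random_state`. The comparison uses absolute values because the sign of each principal component is arbitrary and can differ between the two implementations.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA

X_moons, _ = make_moons(n_samples=100, random_state=42)

# Manual route: center the data, run SVD, project onto the first two components
X_c = X_moons - X_moons.mean(axis=0)
_, _, Vt = np.linalg.svd(X_c, full_matrices=False)
X2D_manual = X_c @ Vt.T[:, :2]

# Scikit-Learn route: no manual centering needed
pca = PCA(n_components=2)
X2D_sklearn = pca.fit_transform(X_moons)

# Compare up to the sign of each component
same_axes = np.allclose(np.abs(pca.components_), np.abs(Vt[:2]))
same_projection = np.allclose(np.abs(X2D_sklearn), np.abs(X2D_manual))
total_ratio = pca.explained_variance_ratio_.sum()

print(same_axes, same_projection, total_ratio)
```

Note that `pca.components_` stores one principal component per row, so it corresponds to $W_d^T$ rather than $W_d$.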

## Choosing the Right Number of Dimensions
Instead of guessing the number of dimensions, we can set the **n_components** hyperparameter to a float between 0.0 and 1.0, indicating the fraction of variance we want to preserve. Let's apply it to the MNIST dataset.

In [7]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1)
X, y = mnist["data"], mnist["target"]

pca = PCA(n_components=0.95) # keep enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X)
X_reduced

array([[ 122.25525533, -316.23384391,  -51.13183087, ...,   34.71703473,
         -14.22575676,   21.38272145],
       [1010.49400346, -289.96362059,  576.1207452 , ...,   23.87884359,
          -6.54283564,  -24.90277545],
       [ -58.99594719,  393.69744499, -161.99818411, ...,   -5.36282742,
          55.00020853,  -96.73397123],
       ...,
       [-271.50701323,  590.07850009,  341.36886918, ...,  -43.7571469 ,
          35.78216024,   49.96612771],
       [-310.22482291, -116.72715081,  635.71999693, ...,  -21.86345345,
          20.40152778,  -42.68277473],
       [1058.86212574,  -83.39253843,  731.34218396, ...,   41.22834049,
         -20.05206663,  -49.92361814]])

In [8]:
X_reduced.shape

(70000, 154)
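
What the float value of `n_components` does can be reproduced by hand: fit PCA with all components, take the cumulative sum of the explained variance ratios, and pick the smallest number of dimensions that reaches the target. The sketch below uses Scikit-Learn's small built-in digits dataset (8x8 images, 64 features) instead of MNIST so that it runs quickly. That substitution and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)  # 1797 instances, 64 features

# Fit with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_digits)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of dimensions that preserves at least 95% of the variance
d = int(np.argmax(cumsum >= 0.95)) + 1

# Let Scikit-Learn choose the number of components for the same target
pca_95 = PCA(n_components=0.95).fit(X_digits)

print("manual d:", d)
print("n_components_ chosen by PCA:", pca_95.n_components_)
```

Plotting `cumsum` against the number of dimensions is another common way to choose: look for the point where the curve stops growing quickly.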

## PCA for Compression
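After applying PCA to MNIST while preserving 95% of the variance, each instance has 154 features instead of 784, so the dataset is now less than 20% of its original size. We can also map the reduced data back to 784 dimensions. The result will not be identical to the original, because the projection discarded about 5% of the variance, but it will be close.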

In [9]:
X_recovered = pca.inverse_transform(X_reduced)
X_recovered.shape

(70000, 784)

$$
X_{recovered} = X_{d-proj} W_d^T
$$
**Equation 8-3: PCA inverse transformation**
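
The inverse transformation and the information it loses can be checked with a short sketch. It again assumes the small digits dataset in place of MNIST to keep it fast. One detail to be aware of: Equation 8-3 is written for centered data, while Scikit-Learn's `inverse_transform` also adds back the mean that PCA subtracted during fitting.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_digits)
X_recovered = pca.inverse_transform(X_reduced)

# Manual inverse: multiply by W_d^T (the rows of components_), then add the mean back
X_manual = X_reduced @ pca.components_ + pca.mean_
matches_manual = np.allclose(X_recovered, X_manual)

# Fraction of the total variance lost in the round trip
lost = np.sum((X_digits - X_recovered) ** 2) / np.sum((X_digits - pca.mean_) ** 2)
kept = pca.explained_variance_ratio_.sum()

print(f"{X_digits.shape[1]} -> {X_reduced.shape[1]} features")
print(f"variance kept: {kept:.3f}, reconstruction loss: {lost:.3f}")
```

The reconstruction loss and the preserved variance add up to 1, which is why the recovered images look close to the originals without being identical.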