# Dimensionality Reduction
Many machine learning problems involve thousands or even millions of features for each training instance. Not only does this make training really slow, it can also make it much harder to find a good solution. This problem is often referred to as the _curse of dimensionality_.

Fortunately for us, it's often possible to reduce the number of dimensions considerably! For example, the pixels on the borders of MNIST images are almost always white, so they could be dropped without losing much information. Likewise, two neighboring pixels are often highly correlated, so merging them into a single pixel (e.g. by averaging them) loses little. Keep in mind that reducing dimensionality does cause some information loss, which may lead to slightly worse performance. That aside, dimensionality reduction is also very useful for data visualization (AKA _DataViz_). Here we will look at three techniques: PCA, Kernel PCA, and LLE.

## The Curse of Dimensionality
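High-dimensional spaces behave in ways that defy our low-dimensional intuition; see the book for the details. Keep in mind that the more dimensions the training set has, the greater the risk of overfitting it! One theoretical solution would be to increase the training set to a ridiculous size... however, in practice this is not possible, since the number of instances required grows exponentially with the number of dimensions.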
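
One way to get a feel for this is to estimate how far apart two random points are as the number of dimensions grows. The sketch below is only an illustration: the helper `mean_pairwise_distance`, the sample sizes, and the choice of dimensions are assumptions made here, not something from the book.

```python
import numpy as np

def mean_pairwise_distance(n_dims, n_pairs=1000, seed=0):
    """Estimate the mean Euclidean distance between two points drawn
    uniformly at random from the unit hypercube [0, 1]^n_dims."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

# The average distance keeps growing with the number of dimensions,
# so a fixed-size training set becomes sparser and sparser.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(mean_pairwise_distance(n_dims), 2))
```

Since random points drift apart as dimensions are added, a new instance will likely be far from any training instance, which is one reason predictions get less reliable.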

## Main Approaches for Dimensionality Reduction
Before we look at specific dimensionality reduction algos, let's look at the two main approaches: projection and Manifold Learning.

### Projection
In most real-world problems, training instances are _not_ spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated, so the training instances actually lie within (or close to) a much lower-dimensional _subspace_ of the high-dimensional space. Projection is not always the best approach, however. Take a look at the famous _Swiss roll_ toy dataset: simply projecting it onto a plane squashes the layers of the roll together. What you really want is to unroll it!
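
The Swiss roll problem can be checked numerically. The following is a rough sketch, assuming Scikit-Learn's `make_swiss_roll` (where the spiral lives in the first and third coordinates); the helper `mean_neighbor_gap` is a made-up name for this illustration. It measures how far apart along the roll (in `t`) each point's nearest 2D neighbor really is.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

# t is each point's position along the rolled-up sheet (the "unrolled" coordinate).
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

def mean_neighbor_gap(points_2d, t):
    """Mean difference in t between each point and its nearest neighbor
    in the given 2D representation."""
    nn = NearestNeighbors(n_neighbors=2).fit(points_2d)
    _, idx = nn.kneighbors(points_2d)
    return np.abs(t - t[idx[:, 1]]).mean()  # idx[:, 0] is the point itself

X_projected = X_roll[:, :2]          # naive projection: drop the third axis
X_unrolled = np.c_[t, X_roll[:, 1]]  # ideal unrolling, using the known t

print("projected:", mean_neighbor_gap(X_projected, t))
print("unrolled: ", mean_neighbor_gap(X_unrolled, t))
```

A large gap for the projected version means that points from different layers of the roll end up as neighbors, which is exactly the squashing described above.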

### Manifold Learning
Many dimensionality reduction algos work by modeling the _manifold_ on which the training instances lie. This is called _Manifold Learning_. It relies on the _manifold assumption_, also called the _manifold hypothesis_, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.

Again, think about the MNIST set: all handwritten digits have some similarities. They are made of connected lines, their borders are white, and they are more or less centered. If we generated images at random instead, only a tiny fraction of them would look like handwritten digits. In other words, the constraints on digit images tend to squeeze the dataset onto a lower-dimensional manifold, which is why its dimensionality can be reduced.

The manifold assumption is often accompanied by another, implicit assumption: that the task at hand will be simpler if expressed in the lower-dimensional space of the manifold. This does not always hold. In short, it all depends on the dataset!
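
To make the idea of learning a manifold a bit more concrete, here is a minimal sketch that unrolls the Swiss roll with LLE, one of the techniques listed in the introduction. The hyperparameters (`n_neighbors=10`, 1,000 samples) are arbitrary choices for this illustration, not tuned values.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3D Swiss roll: a 2D sheet rolled up in 3D space.
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# LLE tries to preserve each instance's relationship to its nearest
# neighbors while mapping the data down to 2 dimensions.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_roll_unrolled = lle.fit_transform(X_roll)

print(X_roll_unrolled.shape)  # (1000, 2)
```

Plotting `X_roll_unrolled` colored by `t` is a quick way to judge how well the roll was flattened.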

## PCA
_Principal Component Analysis_ (PCA) is the most popular dimensionality reduction algo. It identifies the hyperplane that lies closest to the data, then projects the data onto it.

### Preserving the Variance
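Before projecting the training set onto a lower-dimensional hyperplane, you need to choose the right hyperplane. Picking the axis that preserves the maximum amount of variance makes sense, since it will most likely lose less information than the other projections. Equivalently, it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the idea behind PCA.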
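
The equivalence between "maximum variance" and "minimum mean squared distance" can be checked on toy data. This is a small sketch under stated assumptions: the 2D dataset `X_toy` and the helper `variance_and_mse` are invented for the illustration, and candidate axes are simply scanned on a grid of angles.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D toy data, stretched along one direction.
X_toy = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X_toy_centered = X_toy - X_toy.mean(axis=0)

def variance_and_mse(angle):
    """Project the centered data onto the unit vector at `angle` (radians).
    Returns (variance along the axis, mean squared distance to the axis)."""
    u = np.array([np.cos(angle), np.sin(angle)])
    z = X_toy_centered @ u                      # 1D coordinates on the axis
    residual = X_toy_centered - np.outer(z, u)  # what the projection loses
    return z.var(), (residual ** 2).sum(axis=1).mean()

angles = np.linspace(0, np.pi, 180, endpoint=False)
results = np.array([variance_and_mse(a) for a in angles])

best_var = angles[results[:, 0].argmax()]  # axis preserving the most variance
best_mse = angles[results[:, 1].argmin()]  # axis closest to the data
print(best_var, best_mse)
```

For every axis, the preserved variance and the mean squared distance add up to the same total (Pythagoras), so maximizing one is the same as minimizing the other.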

### Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, and so on for as many axes as the dataset has dimensions. The unit vector that defines the ith axis is called the ith _principal component_ (PC). So how do you find the PCs of the training set? You can use a standard matrix factorization technique called _Singular Value Decomposition_ (SVD), which decomposes the training set matrix X into the product of three matrices, U Σ V^T, where V contains all of the principal components. Like the following...

In [None]:
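import numpy as np

# X must already be defined: a NumPy array of shape (n_instances, n_features).
# PCA assumes the data is centered around the origin, so center it first.
X_centered = X - X.mean(axis=0)

# Note: np.linalg.svd returns V already transposed, so V here is really V^T
# and its rows are the principal components.
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]  # first principal component
c2 = V.T[:, 1]  # second principal component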

### Projecting Down to d Dimensions
Once all the principal components have been identified, you can reduce the dimensionality of the dataset down to _d_ dimensions by projecting it onto the hyperplane defined by the first _d_ principal components. To do so, simply compute the matrix product of the training set matrix X and the matrix W_d, whose columns are the first _d_ principal components. The following code does just that!

In [None]:
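# W2 holds the first two principal components as columns.
W2 = V.T[:, :2]
# Project the centered training set onto the plane they define.
X2D = X_centered.dot(W2)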

Of course, there's the Scikit-Learn way of doing things, which also takes care of centering the data for you...

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
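
As a sanity check, the manual SVD route and the Scikit-Learn route should agree up to the sign of each component, since the direction of a principal component is arbitrary. The sketch below uses random stand-in data (`X_demo` is a hypothetical dataset, not the notebook's `X`).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical stand-in data: 200 instances, 5 correlated features.
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Manual route: center, decompose, project onto the first two PCs.
X_demo_centered = X_demo - X_demo.mean(axis=0)
_, _, Vt = np.linalg.svd(X_demo_centered)
X2D_svd = X_demo_centered @ Vt[:2].T

# Scikit-Learn route: PCA centers the data itself.
pca_demo = PCA(n_components=2)
X2D_sklearn = pca_demo.fit_transform(X_demo)

# Signs may be flipped, so compare absolute values.
print(np.allclose(np.abs(X2D_svd), np.abs(X2D_sklearn), atol=1e-6))
```

After fitting, the principal components themselves are available as the rows of `pca_demo.components_`.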

### Explained Variance Ratio
Another useful piece of information is the _explained variance ratio_ of each principal component, available via the `explained_variance_ratio_` variable. It indicates the proportion of the dataset's variance that lies along the axis of each principal component. Here's a sample...

In [None]:
print(pca.explained_variance_ratio_)
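
The same ratios can be derived by hand from the singular values, which may help demystify the attribute. This is a sketch on made-up data (`X_demo` is a hypothetical dataset whose features have very different scales), not the notebook's `X`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 4 features with very different spreads.
X_demo = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.1])

X_demo_centered = X_demo - X_demo.mean(axis=0)
_, s, _ = np.linalg.svd(X_demo_centered, full_matrices=False)

# The variance along each principal component is proportional to the
# square of its singular value, so the ratios are s**2 normalized to 1.
ratio_manual = s**2 / np.sum(s**2)

ratio_sklearn = PCA().fit(X_demo).explained_variance_ratio_
print(ratio_manual)
print(ratio_sklearn)
```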

### Choosing the Right Number of Dimensions
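Instead of arbitrarily choosing the number of dimensions to reduce down to, it's generally best to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g. 95%)... unless it's for data visualization, in which case you will generally want to reduce down to 2 or 3 dimensions.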

The following finds the minimum number of dimensions _d_ required to preserve 95% of the training set's variance.

In [None]:
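# Fit PCA without reducing dimensionality, to get every component's ratio.
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
# argmax returns the first index where the condition is True; +1 turns it into a count.
d = np.argmax(cumsum >= 0.95) + 1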

Of course, there's a more direct way: set `n_components` to a float between 0.0 and 1.0, indicating the ratio of variance you want to preserve...

In [None]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
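
The two approaches should land on the same number of dimensions. Here is a quick check, assuming Scikit-Learn's bundled `load_digits` dataset (8x8 digit images) as a small stand-in, since the notebook's `X` is not defined here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # 64 features per image

# Approach 1: fit a full PCA, then pick d from the cumulative sum.
cumsum = np.cumsum(PCA().fit(X_digits).explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

# Approach 2: let PCA choose the number of components itself.
pca_95 = PCA(n_components=0.95)
X_digits_reduced = pca_95.fit_transform(X_digits)

print(d, pca_95.n_components_)                 # number of dimensions kept
print(pca_95.explained_variance_ratio_.sum())  # should be at least 0.95
```

After fitting, `n_components_` holds the number of components that were actually kept.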

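Another way to find the optimal number of dimensions is to plot the explained variance as a function of the number of dimensions (i.e. plot `cumsum`) and look for the elbow in the curve, where the explained variance stops growing fast.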
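
A minimal plotting sketch, assuming Matplotlib is available and again using the bundled `load_digits` dataset as a stand-in for the training set; the variable names and styling are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data
cumsum_digits = np.cumsum(PCA().fit(X_digits).explained_variance_ratio_)
dims = np.arange(1, len(cumsum_digits) + 1)

fig, ax = plt.subplots()
ax.plot(dims, cumsum_digits)
ax.axhline(0.95, linestyle="--", color="gray")  # the 95% target
ax.set_xlabel("Number of dimensions")
ax.set_ylabel("Cumulative explained variance")
```

The point where the curve crosses the dashed line is the _d_ computed earlier; the elbow may suggest an even smaller number of dimensions if losing a bit more variance is acceptable.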

### PCA for Compression
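By applying PCA to the dataset, you can achieve reasonable compression. For example, applying PCA to the MNIST dataset while preserving 95% of its variance leaves 154 features per instance instead of the original 784, so the dataset shrinks to less than 20% of its original size! It's possible to decompress the reduced dataset back to the original number of dimensions by applying the inverse transformation, but there will be some information loss, since the projection dropped part of the variance. The following is an example of applying PCA with 154 components, then recovering the original data.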

In [None]:
pca = PCA(n_components=154)
X_mnist_reduced = pca.fit_transform(X_mnist)
X_mnist_recovered = pca.inverse_transform(X_mnist_reduced)
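
How much is lost can be measured as the mean squared distance between the original data and the recovered data (the reconstruction error). The sketch below assumes the small bundled `load_digits` dataset in place of MNIST, so the numbers it prints are not the MNIST figures quoted above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data  # stand-in for MNIST: 64 features per image

# Compress while preserving 95% of the variance, then decompress.
pca_c = PCA(n_components=0.95)
X_digits_reduced = pca_c.fit_transform(X_digits)
X_digits_recovered = pca_c.inverse_transform(X_digits_reduced)

# Fraction of the original number of features that is kept.
size_ratio = X_digits_reduced.shape[1] / X_digits.shape[1]

# Reconstruction error: mean squared distance between original and recovered.
reconstruction_error = np.mean(
    np.sum((X_digits - X_digits_recovered) ** 2, axis=1)
)

print(size_ratio, reconstruction_error)
```

Dividing the reconstruction error by the total variance of the dataset gives the share of variance that was dropped, which here should be at most 5%.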

### Incremental PCA
