# Dimensionality Reduction

Many machine learning problems involve thousands or even millions of features per training instance (a single 256x256 RGB image already has 196,608 features!). That many features slows training down and can also hinder the algorithm's ability to generalize. This problem is commonly known as the **curse of dimensionality**.

Fortunately, dimensionality reduction methods can speed up training considerably. The reduction is lossy, though, so some information is always discarded. In some cases, dropping noise and unnecessary detail can actually result in better performance.

"Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization... Reducing the number of dimensions down to two (or three) makes it possible to plot a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters."


The issue with high-dimensional space is that it behaves very differently from what our intuition suggests. If you pick a random point in a 10,000-dimensional unit hypercube, there is a greater than 99.9999% chance that it will be located within 0.001 of a border. By analogy, if you consider enough "dimensions" about a person (e.g., how much sugar they put in their coffee, how many shirts they own, etc.), almost everyone turns out to sit at the extreme of at least one of them.
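
The border claim can be checked with a one-line formula. The sketch below assumes that "close to the border" means within a margin of 0.001 of any face of the unit hypercube; both the margin and the helper name `prob_near_border` are illustrative choices, not part of the original notes.

```python
def prob_near_border(n_dims: int, margin: float = 0.001) -> float:
    """Probability that a uniform random point in the unit hypercube
    lies within `margin` of at least one face."""
    # The point avoids every border only if each coordinate falls in
    # [margin, 1 - margin], an interval of length 1 - 2 * margin.
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims


for n_dims in (2, 100, 10_000):
    print(f"{n_dims:>6} dimensions: {prob_near_border(n_dims):.10f}")
```

In two dimensions the probability is about 0.4%, while in 10,000 dimensions it is indistinguishable from 1 at six decimal places.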

"Here is a more troublesome difference: if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52." However, if you pick two random points in a 1,000,000-dimensional hypercube, the average distance between the two points will be around 408.25! "This fact implies that high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other... In short, the more dimensions the training set has, the greater the risk of overfitting it."

"In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions. With just 100 features (much less than in the MNIST problem), you would need more training instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions."

# Main Approaches for Dimensionality Reduction

"Let's take a look at the two main approaches to reducing dimensionality: projection and Manifold Learning."

### Projection

The concept of projection is actually really simple. "In most real-world problems, training instances are *not* spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space."

In terms of 3D, this is as simple as finding a 2D plane that closely fits most of the dataset, and then doing a projection onto that plane. Bam, 3D just became 2D!
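
As a minimal sketch of that 3D-to-2D step, the code below builds synthetic points near a known plane and projects them onto it. The data, the noise level, and the plane itself are made up for illustration; with real data the plane is unknown and has to be found first (PCA does exactly that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Orthonormal basis (3x2) of a randomly tilted plane through the origin.
basis = np.linalg.qr(rng.normal(size=(3, 2)))[0]
normal = np.cross(basis[:, 0], basis[:, 1])          # unit normal of the plane

# Hypothetical data: 200 points in the plane plus a little off-plane noise.
coords_2d = rng.normal(size=(200, 2)) * [3.0, 1.0]
points_3d = coords_2d @ basis.T + 0.05 * rng.normal(size=(200, 1)) * normal

# Projecting means taking the dot product with each basis vector of the plane.
projected_2d = points_3d @ basis                     # shape (200, 2)

# What the projection throws away is the component along the normal.
residual = points_3d @ normal
print(projected_2d.shape, np.abs(residual).max())
```

Because the off-plane noise is orthogonal to the basis vectors, the projected coordinates match the in-plane coordinates the data was generated from.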

### Manifold Learning

In the case of the toy "Swiss roll" dataset, a simple projection onto a plane would squash the layers of the Swiss roll together, so the data would no longer be linearly separable. "The Swiss roll is an example of a 2D **manifold**. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but it is rolled in the third dimension."
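
The squashing effect can be measured rather than plotted. The sketch below builds a noise-free Swiss roll by hand (the construction and sizes are assumptions for illustration) and asks how far apart nearest neighbours are along the roll, measured by the spiral parameter `t`, in three representations: the original 3D data, a projection that drops the z axis, and the unrolled (t, y) coordinates.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# t is the position along the spiral, y the position across the roll.
t = 1.5 * np.pi * (1 + 2 * rng.random(n))
y = 21 * rng.random(n)
roll_3d = np.column_stack([t * np.cos(t), y, t * np.sin(t)])


def mean_neighbor_gap(points: np.ndarray) -> float:
    """Mean difference in t between each point and its nearest neighbour."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    nearest = dist.argmin(axis=1)
    return float(np.abs(t - t[nearest]).mean())


gap_3d = mean_neighbor_gap(roll_3d)                          # on the roll itself
gap_projected = mean_neighbor_gap(roll_3d[:, :2])            # drop z: flat projection
gap_unrolled = mean_neighbor_gap(np.column_stack([t, y]))    # unrolled manifold
print(gap_3d, gap_projected, gap_unrolled)
```

A larger gap after projection means that points from different layers of the roll have been pushed next to each other, which is what unrolling avoids.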

"Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called **Manifold Learning**. It relies on the **manifold assumption**, ... which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifodl. This assumption is very often empirically observed."

"Hopefully, you now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. The rest of this chapter will go through some of the most popular algorithms."

### PCA

**Principal Component Analysis (PCA)** is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.

#### Preserving the Variance

"Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane." If you have a simple 2D dataset, the hyperplane is going to be a simple line. The line that fits best and preserves the most variance will be the one that is most "parallel" to the data. "Another way to justify this... is that it... minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA."

#### Principal Components

"PCA identifies the axis that accounts for the largest amount of variance in the training set... It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In [a] 2D example, there is no choice: it is the dotted line. If it were a higher-dimensional data-set, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on-- as many axes as the number of dimensions in the dataset."

"The unit vector that defines the *i*th axis is called the *i*th **principal component** (PC)." To find the prinicipal components of a training set, you can simply use the **Singular Value Decomposition** (SVD) standard matrix factorization technique. In Numpy, simply use the linalg.svd() function.

**Note:** "PCA assumes that the dataset is centered around the origin. As we will see, Scikit-Learn's PCA classes take care of centering the data for you. However, if you implement PCA yourself, ... don't forget to center the data first."
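
A small sketch of both points: extracting principal components with `np.linalg.svd()`, and what goes wrong without centering. The dataset is synthetic and deliberately placed far from the origin; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical dataset: 300 samples, 3 features, far from the origin.
x = rng.normal(size=(300, 3)) * [5.0, 2.0, 0.5] + [100.0, -50.0, 10.0]

x_centered = x - x.mean(axis=0)                  # PCA assumes zero-mean data
U, s, Vt = np.linalg.svd(x_centered, full_matrices=False)

c1, c2 = Vt[0], Vt[1]                            # principal components are the rows of Vt
print("c1:", c1)
print("c2:", c2)
print("orthogonal:", np.isclose(c1 @ c2, 0.0))

# Without centering, the first singular vector mostly points at the mean instead.
_, _, Vt_raw = np.linalg.svd(x, full_matrices=False)
mean_direction = x.mean(axis=0) / np.linalg.norm(x.mean(axis=0))
alignment = abs(Vt_raw[0] @ mean_direction)
print("uncentered first vector vs. mean direction:", alignment)
```

An alignment close to 1 shows that the uncentered decomposition is describing where the data sits, not how it varies.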

#### Projecting Down to *d* Dimensions

"Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to *d* dimensions by projecting it onto the hyperplane defined by the first *d* principal components. Slecting this hyperplane ensures that the projection will preserve as much variance as possible."

"To project the training set onto the hyperplane, you can simply compute the matrix multiplication of the training set matrix X by the matrix Wd, defined as ... the matrix composed of the first *d* elements of V."

In [None]:
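import numpy as np

# x is the training set as a NumPy array of shape (m, n)
x_centered = x - x.mean(axis=0)       # center the data about the origin
U, s, Vt = np.linalg.svd(x_centered)
W2 = Vt.T[:, :2]                      # W2 holds the first 2 columns of V
x2d = x_centered.dot(W2)              # project onto the first 2 principal components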

#### Using Scikit-Learn

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
x2D = pca.fit_transform(x)

As you can see, Scikit-Learn handles data centering about the origin for us.

"After fitting the PCA transformer to the dataset, you can access the principal components using the `components_` variable."

#### The Explained Variance Ratio

"Another very useful piece of information is the **explained variance ratio** of each principal component, available via the `explained_variance_ratio_` variable. It indicates the proportion of the dataset's variance that lies along the axis of each principal component."

In [None]:
pca.explained_variance_ratio_
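
The same ratios can be derived by hand from the singular values, which may help demystify the attribute. This is a sketch on synthetic data: the variance along the *i*th principal component is proportional to the squared *i*th singular value, so the ratio is each squared singular value divided by their sum.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(400, 3)) * [3.0, 1.0, 0.2]         # hypothetical data
x_centered = x - x.mean(axis=0)

_, s, _ = np.linalg.svd(x_centered, full_matrices=False)

# Variance along component i is s[i]**2 / (m - 1); the constant cancels in the ratio.
explained_variance_ratio = s ** 2 / np.sum(s ** 2)
print(explained_variance_ratio)
```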

#### Choosing the Right Number of Dimensions

"Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization-- in that case you will generally want to reduce the dimensionality down to a 2 or 3."

"The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:

In [1]:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(x_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1  # + 1 because indices start at 0

"You could then set `n_components=d` and run PCA again. However, there is a much better option. instead of specifying the number of principal components you want to preserve, you can set `n_components` to be a float between 0 and 1, indicating the ratio of variance you wish to preserve:"

In [None]:
pca = PCA(n_components=0.95)
x_reduced = pca.fit_transform(x_train)

"Yet another option is to plot the explained variance as a function of the number of dimensions (simply plot `cumsum`). There will usually be an elbow in the curve, where the explained variance stops growing fast. You can think of this as the intrinsic dimensionality of the dataset."

#### Incremental PCA

"One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Fortunately, **Incremental PCA** (PCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time."

"The following code splits the MNIST dataset into 100 mini-batches (using NumPy's `array_split()` function) and feeds them to Scikit-Learn's `IncrementalPCA` class to reduce the dimensionality of the MNIST dataset down to 154 dimensions... Note that you must call the `partial_fit()` method with each mini-batch rather than the `fit()` method with the whole training set:

In [None]:
import numpy as np
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for x_batch in np.array_split(x_train, n_batches):
    inc_pca.partial_fit(x_batch)  # update the model with one mini-batch at a time

x_reduced = inc_pca.transform(x_train)

"Alternatively, you can use Numpy's `memmap` class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs in memory, when it needs it. Since the `IncrementalPCA` class uses only a small part of the array at any given time, the memory usuage remains under control. This makes it possible to call the usual `fit()` method, as you can see in the following code:"

In [None]:
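import numpy as np
from sklearn.decomposition import IncrementalPCA

# filename points to an existing float32 binary file holding an (m, n) array
x_mm = np.memmap(filename, dtype='float32', mode='readonly', shape=(m, n))

batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(x_mm)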
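
For a version that runs end to end, the sketch below first writes a small synthetic array to a temporary file and then reopens it as a read-only memory map. The sizes, the file name, and the number of components are hypothetical and kept small so the example finishes quickly.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

m, n, n_batches = 1000, 20, 10          # hypothetical sizes
rng = np.random.default_rng(0)

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "x_train.dat")    # hypothetical file name

    # Write the data to disk once (mode="w+" creates the file).
    x_write = np.memmap(filename, dtype="float32", mode="w+", shape=(m, n))
    x_write[:] = rng.normal(size=(m, n))
    x_write.flush()
    del x_write

    # Reopen read-only; the array is backed by the file on disk.
    x_mm = np.memmap(filename, dtype="float32", mode="r", shape=(m, n))

    inc_pca = IncrementalPCA(n_components=5, batch_size=m // n_batches)
    inc_pca.fit(x_mm)
    x_reduced = inc_pca.transform(x_mm)
    del x_mm                            # release the file before the directory is removed

print(x_reduced.shape)
```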

#### Randomized PCA

Scikit-Learn also offers Randomized PCA, a "stochastic algorithm that quickly finds an approximation of the first d principal components. Its computational complexity is O(m x **d**^2) + O(**d**^3), instead of O(m x **n**^2) + O(**n**^3), so it is dramatically faster than the previous algorithms when d is much smaller than n."

In [None]:
rnd_pca = PCA(n_components=154, svd_solver='randomized')
x_reduced = rnd_pca.fit_transform(x_train)
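
Since the randomized solver only approximates the leading components, it can be reassuring to compare it with the exact solver. The sketch below does this on Scikit-Learn's bundled digits dataset (a stand-in for `x_train`); the choice of 10 components and the fixed `random_state` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

x_digits = load_digits().data

full_pca = PCA(n_components=10, svd_solver="full").fit(x_digits)
rnd_pca = PCA(n_components=10, svd_solver="randomized", random_state=42).fit(x_digits)

# Total share of variance captured by the first 10 components under each solver.
full_ratio = full_pca.explained_variance_ratio_.sum()
rnd_ratio = rnd_pca.explained_variance_ratio_.sum()
print(full_ratio, rnd_ratio)
```

On a dataset this small the speed difference is negligible; the point is only that the two solvers capture nearly the same amount of variance.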

#### Kernel PCA

"In Chapter 5, we discussed the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the **feature space**)... It turns out that same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called **Kernel PCA** (kPCA). It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold."

"The following code uses Scikit-Learn's `KernelPCA` class to perform kPCA with an RBF kernel:"

In [None]:
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel='rbf', gamma=0.04)
x_reduced = rbf_pca.fit_transform(x)
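
A self-contained variant, assuming the data is a Swiss roll generated with `make_swiss_roll` (the sample count, noise level, and the extra linear kernel for comparison are illustrative choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# x has shape (500, 3); t is the position of each point along the roll.
x, t = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

results = {}
for name, params in {
    "linear": dict(kernel="linear"),             # equivalent to ordinary PCA
    "rbf": dict(kernel="rbf", gamma=0.04),
}.items():
    results[name] = KernelPCA(n_components=2, **params).fit_transform(x)
    print(name, results[name].shape)
```

Plotting each 2D result colored by `t` is a quick way to judge how well a kernel keeps the layers of the roll apart.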

#### Selecting a Kernel and Tuning Hyperparameters

"As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyperprarameters that lead to the best performance on that task. For example, the following code creates a two-step pipeline, first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression for classification. Then it uses `GridSearchCV` to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline:"

In [None]:
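import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('kpca', KernelPCA(n_components=2)),
    ('log_reg', LogisticRegression())
])

# Pipeline parameters are named <step>__<parameter> (double underscore)
param_grid = [{
    'kpca__gamma': np.linspace(0.03, 0.05, 10),
    'kpca__kernel': ['rbf', 'sigmoid']
}]

# x holds the features and y the class labels (assumed to be defined earlier);
# cv=3 is an arbitrary choice for the number of cross-validation folds
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(x, y)

grid_search.best_params_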
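
For a run that works from a clean session, the sketch below applies the same kind of search to a Swiss roll. The classification task (whether a point lies beyond `t = 6.9` along the roll), the smaller gamma grid, and `cv=3` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical task: classify Swiss roll points by their position along the roll.
x, t = make_swiss_roll(n_samples=300, noise=0.2, random_state=42)
y = t > 6.9

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 5),
    "kpca__kernel": ["rbf", "sigmoid"],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(x, y)
print(grid_search.best_params_)
```

The best kernel and gamma depend on the data and the labels, so the printed values should not be read as general recommendations.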