# PCA
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. It works in two steps:

1. It identifies the hyperplane that lies closest to the data.
2. It projects the data onto that hyperplane.

## Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane.

<img src='img_7.png'>

 For example, a simple 2D dataset is represented on the left of Figure 8-7, along with three different axes (i.e., one-dimensional hyperplanes). 
 
 On the right is the result of the projection of the dataset onto each of these axes. As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance, and the projection onto the dashed line preserves an intermediate amount of variance.
 
 

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections.

Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.
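The equivalence between "maximum preserved variance" and "minimum mean squared distance" can be checked numerically. The sketch below is only an illustration: the dataset `X_demo` is a made-up, centered 2D point cloud, and the candidate axes are sampled one degree apart. For every axis, the preserved variance plus the mean squared distance to the projection equals the total variance, so the axis that maximizes one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2D dataset (not from the text), centered on the origin
X_demo = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X_demo = X_demo - X_demo.mean(axis=0)

# Total variance of the centered data (mean squared norm of the instances)
total_variance = np.mean(np.sum(X_demo ** 2, axis=1))

results = []
for angle in np.linspace(0, np.pi, 180, endpoint=False):
    u = np.array([np.cos(angle), np.sin(angle)])  # unit vector defining the axis
    z = X_demo @ u                                # 1D coordinates along the axis
    projected = np.outer(z, u)                    # projected points, back in 2D
    variance = np.mean(z ** 2)                    # variance preserved by the projection
    msd = np.mean(np.sum((X_demo - projected) ** 2, axis=1))  # mean squared distance
    results.append((angle, variance, msd))

results = np.array(results)
best_by_variance = np.argmax(results[:, 1])
best_by_distance = np.argmin(results[:, 2])
print("same axis selected:", best_by_variance == best_by_distance)
print("variance + distance is constant:",
      np.allclose(results[:, 1] + results[:, 2], total_variance))
```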



## Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set.

In Figure 8-7, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In this 2D example there is no choice: it is the dotted line. 

If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on: as many axes as the number of dimensions in the dataset.

The unit vector that defines the $i^{th}$ axis is called the $i^{th}$ principal component (PC).

In Figure 8-7, the 1st PC is $c_1$ and the 2nd PC is $c_2$.

`The direction of the principal components is not stable: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the opposite direction of the original PCs. However, they will generally still lie on the same axes. In some cases, a pair of PCs may even rotate or swap, but the plane they define will generally remain the same.`
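One way to compare PCs across two runs despite this sign ambiguity is to look at the absolute value of the cosine between matching components. The following is a minimal sketch on made-up data (`X_a` and its slightly perturbed copy `X_b` are assumptions for illustration, as is the helper `principal_components`): values close to 1 mean the two PCs lie on the same axis, whether or not they point in the same direction.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 3D dataset with clearly different variances along each axis
X_a = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])
# Slightly perturbed copy of the same dataset
X_b = X_a + rng.normal(scale=0.01, size=X_a.shape)

def principal_components(X):
    """Return the PCs of X as the rows of a matrix (illustrative helper)."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt

pcs_a = principal_components(X_a)
pcs_b = principal_components(X_b)

# |cosine| between matching unit vectors: 1 means same axis, sign ignored
alignment = np.abs(np.sum(pcs_a * pcs_b, axis=1))
print(alignment)
```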

So how can you find the principal components of a training set? Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the dot product of three matrices $U \cdot \Sigma \cdot V^{T}$, where V contains all the principal components that we are looking for.

<img src='img_8.png'>

In [1]:
import numpy as np
X = np.arange(0, 100)
X = X.reshape(-1, 2)

In [2]:
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]

In [3]:
c1

array([0.70710678, 0.70710678])

In [4]:
c2

array([ 0.70710678, -0.70710678])

`PCA assumes that the dataset is centered around the origin. As we will see, Scikit-Learn’s PCA classes take care of centering the data for you. However, if you implement PCA yourself, or if you use other libraries, don’t forget to center the data first.`
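To see why centering matters, here is a small sketch on a made-up dataset (`X_off` is an assumption for illustration): almost all of its variance lies along the x axis, but the whole cloud sits far from the origin along y. Without centering, the first right singular vector mostly points toward the mean of the data rather than along the direction of largest variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: large spread along x, small spread along y, offset by 50 along y
X_off = rng.normal(size=(500, 2)) * np.array([3.0, 0.3]) + np.array([0.0, 50.0])

# SVD on the raw data versus SVD on the centered data
_, _, Vt_raw = np.linalg.svd(X_off, full_matrices=False)
_, _, Vt_centered = np.linalg.svd(X_off - X_off.mean(axis=0), full_matrices=False)

print("first vector without centering:", Vt_raw[0])       # roughly along the y axis
print("first vector with centering:   ", Vt_centered[0])  # roughly along the x axis
```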

## Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible.

To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix X by the matrix $W_d$, defined as the matrix containing the first d principal components (i.e., the matrix composed of the first d columns of V).

<img src='img_9.png'>

In [5]:
W2 = Vt.T[:, :2]
X2D = X_centered.dot(W2)

In [6]:
W2

array([[ 0.70710678,  0.70710678],
       [ 0.70710678, -0.70710678]])

## Using Scikit-Learn


In [7]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X2D_sk = pca.fit_transform(X)


In [8]:
pca.components_.T[:,0]

array([-0.70710678, -0.70710678])
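Note that the first component returned by Scikit-Learn here has the opposite sign of the `c1` obtained with NumPy's SVD. Both describe the same axis. A quick way to confirm that the two approaches agree is to compare absolute values, as in this sketch, which rebuilds the same toy dataset so that it can run on its own:

```python
import numpy as np
from sklearn.decomposition import PCA

# Same toy dataset as in the cells above
X = np.arange(0, 100).reshape(-1, 2)

# Manual PCA: center, decompose, project onto the principal components
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered)
X2D = X_centered @ Vt.T[:, :2]

# Scikit-Learn PCA (centers the data internally)
pca = PCA(n_components=2)
X2D_sk = pca.fit_transform(X)

# Comparing absolute values ignores a possible sign flip of any component
print(np.allclose(np.abs(X2D), np.abs(X2D_sk)))
print(np.allclose(np.abs(Vt[0]), np.abs(pca.components_[0])))
```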

### Explained Variance Ratio

Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute of the fitted PCA object. It indicates the proportion of the dataset’s variance that lies along the axis of each principal component.

In [9]:
pca.explained_variance_ratio_

array([1.00000000e+00, 1.22347058e-33])
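The same ratios can be recovered from the singular values of the centered data: the variance along the i-th PC is proportional to the square of the i-th singular value. The sketch below uses a made-up 3D dataset (`X_demo` is an assumption, chosen so that no component is degenerate) and compares the manual computation with Scikit-Learn's result.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical dataset with different spreads along each feature
X_demo = rng.normal(size=(100, 3)) * np.array([4.0, 2.0, 1.0])

X_centered = X_demo - X_demo.mean(axis=0)
_, s, _ = np.linalg.svd(X_centered, full_matrices=False)

# Variance along each PC is s_i**2 / (m - 1); the (m - 1) factor cancels in the ratio
manual_ratio = s ** 2 / np.sum(s ** 2)

pca = PCA().fit(X_demo)
print(manual_ratio)
print(pca.explained_variance_ratio_)
```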

### Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%).

Unless, of course, you are reducing dimensionality for data visualization: in that case you will generally want to reduce the dimensionality down to 2 or 3.




The following code computes PCA without reducing dimensionality, then computes
the minimum number of dimensions required to preserve 95% of the training set’s
variance:

In [11]:
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

In [12]:
print(d.shape)
d

()


1

You could then set n_components=d and run PCA again. However, there is a much
better option: instead of specifying the number of principal components you want to
preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the
ratio of variance you wish to preserve:

In [15]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)

(50, 1)


<img src='img_10.png'>

### PCA for Compression
Obviously, after dimensionality reduction the training set takes up much less space. For example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how this can speed up a classification algorithm (such as an SVM classifier) tremendously.

It is also possible to decompress the reduced dataset back to 784 dimensions by
applying the inverse transformation of the PCA projection. Of course this won’t give
you back the original data, since the projection lost a bit of information (within the
5% variance that was dropped), but it will likely be quite close to the original data.
The mean squared distance between the original data and the reconstructed data
(compressed and then decompressed) is called the reconstruction error. 

In [30]:
pca = PCA(n_components = 2)
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)
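As a rough illustration of compression and reconstruction error, the sketch below uses Scikit-Learn's small bundled digits dataset (8×8 images, 64 features) as a stand-in for MNIST, which would need to be downloaded. This substitution is an assumption made only to keep the example self-contained, so the numbers will differ from the MNIST figures quoted above. The key point is that `inverse_transform()` is applied to the reduced data, not to the original data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in for MNIST: 8x8 digit images bundled with Scikit-Learn (64 features)
X_digits = load_digits().data

pca = PCA(n_components=0.95)                    # keep 95% of the variance
X_reduced = pca.fit_transform(X_digits)         # compress
X_recovered = pca.inverse_transform(X_reduced)  # decompress back to 64 features

# Mean squared distance between each original instance and its reconstruction
reconstruction_error = np.mean(np.sum((X_digits - X_recovered) ** 2, axis=1))
print(X_digits.shape, "->", X_reduced.shape, "->", X_recovered.shape)
print("reconstruction error:", reconstruction_error)
```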

<img src='img_11.png'>

### Incremental PCA

One problem with the preceding implementation of PCA is that it requires the whole
training set to fit in memory in order for the SVD algorithm to run. 

Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful for large training sets, and also to apply PCA online (i.e., on the fly, as new instances arrive).





In [32]:
from sklearn.decomposition import IncrementalPCA
from sklearn.datasets import load_diabetes
n_batches = 9
inc_pca = IncrementalPCA(n_components=9)

In [33]:
data = load_diabetes()
data.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.7)

In [35]:
import numpy as np
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)

In [37]:
X_train.shape, X_reduced.shape

((309, 10), (309, 9))

Alternatively, you can use NumPy’s memmap class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs in memory, when it needs it.
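A minimal sketch of the memmap approach follows. The file name, array shape, and random data are all assumptions for illustration; in practice the binary file would already exist on disk. The array is opened with `np.memmap` and fed to `IncrementalPCA` slice by slice, so only one batch needs to be loaded at a time.

```python
import os
import tempfile
import numpy as np
from sklearn.decomposition import IncrementalPCA

m, n = 1000, 20                      # hypothetical dataset size
rng = np.random.default_rng(0)

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "train.dat")  # hypothetical file name

    # Write some data to a binary file on disk (normally this file already exists)
    X_disk = np.memmap(filename, dtype="float32", mode="w+", shape=(m, n))
    X_disk[:] = rng.normal(size=(m, n))
    X_disk.flush()
    del X_disk

    # Re-open read-only: rows are read from disk only when they are accessed
    X_mm = np.memmap(filename, dtype="float32", mode="r", shape=(m, n))

    batch_size = 100
    inc_pca = IncrementalPCA(n_components=5)
    for start in range(0, m, batch_size):
        inc_pca.partial_fit(X_mm[start:start + batch_size])

    X_reduced_mm = inc_pca.transform(X_mm[:100])
    del X_mm                         # release the file before the directory is removed

print(X_reduced_mm.shape)
```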

## Randomized PCA
Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic algorithm that quickly finds an approximation of the first d principal components.

Its computational complexity is $O(m \times d^2) + O(d^3)$, instead of $O(m \times n^2) + O(n^3)$, where m is the number of instances and n the number of features, so it is dramatically faster than the previous algorithms when d is much smaller than n.


In [39]:
from sklearn.decomposition import PCA

In [40]:
rnd_pca = PCA(n_components = 152, svd_solver = 'randomized')
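To get a feel for how close the randomized approximation can be, the following sketch fits both solvers on a made-up dataset (`X_wide` is an assumption: low-rank data plus a little noise, which is the favorable case where d is much smaller than n) and compares the variance explained by the first 10 components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical dataset: 500 instances, 200 features, roughly rank 10 plus noise
X_wide = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 200))
X_wide += 0.01 * rng.normal(size=X_wide.shape)

full_pca = PCA(n_components=10, svd_solver="full").fit(X_wide)
rnd_pca = PCA(n_components=10, svd_solver="randomized", random_state=42).fit(X_wide)

full_total = full_pca.explained_variance_ratio_.sum()
rnd_total = rnd_pca.explained_variance_ratio_.sum()
print("full SVD:  ", full_total)
print("randomized:", rnd_total)
```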

## Kernel PCA

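Kernel PCA (kPCA) applies the kernel trick to PCA, which makes it possible to perform complex nonlinear projections for dimensionality reduction. It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold.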



In [42]:
from sklearn.decomposition import KernelPCA
kernel_pca = KernelPCA(n_components=4, kernel='rbf', gamma=0.04)
X_reduced = kernel_pca.fit_transform(X_train)

<img src='img_13.png'>
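For a dataset that actually lies on a twisted manifold, the Swiss roll discussed later in this section is a natural test case. The sketch below generates one with `make_swiss_roll` and reduces it to 2D with an RBF kernel. The sample size, noise level, and `gamma` value are illustrative choices, not tuned settings.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# 3D Swiss roll; t is the position of each point along the roll (useful for coloring plots)
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_roll_reduced = rbf_pca.fit_transform(X_roll)

print(X_roll.shape, "->", X_roll_reduced.shape)
```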

## Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values.

However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyperparameters that lead to the best performance on that task.



In [47]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [51]:
clf = Pipeline([('kpca', KernelPCA(n_components=2)),
               ('logistic', LogisticRegression())])
params = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]

grid_search = GridSearchCV(clf, params, cv=3)

In [52]:
grid_search.fit(X_train, y_train)




In [53]:
print(grid_search.best_params_)

{'kpca__gamma': 0.03, 'kpca__kernel': 'rbf'}


Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA.

<img src='img_14.png'>

Here’s why. Figure 8-11 shows the original Swiss roll 3D dataset (top left), and the resulting 2D dataset after kPCA is applied using an RBF kernel (top right). Thanks to the kernel trick, this is mathematically equivalent to mapping the training set to an infinite-dimensional feature space (bottom right) using the feature map φ, then projecting the transformed training set down to 2D using linear PCA.

Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space (e.g., like the one represented by an x in the diagram).

Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error. Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.

Scikit-Learn will compute the pre-image automatically (by training a regression model that maps the projected instances back to the original instances) if you set fit_inverse_transform=True.

In [56]:
rbf_pca = KernelPCA(n_components=2, kernel='rbf', gamma=0.03, fit_inverse_transform=True)

In [57]:
X_reduced = rbf_pca.fit_transform(X_train)
X_preimage = rbf_pca.inverse_transform(X_reduced)


In [58]:
from sklearn.metrics import mean_squared_error
mean_squared_error(X_train, X_preimage)

0.0022399911335444812

Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.
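One simple way to sketch such a search is a plain loop over candidate values with a held-out validation set, rather than a full `GridSearchCV` setup. The example below is only an illustration under several assumptions: it uses a Swiss roll generated with `make_swiss_roll`, a single train/validation split instead of cross-validation, and it searches over `gamma` for the RBF kernel only. Other kernels could be added as an outer loop.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_roll, _ = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)
X_tr, X_val = train_test_split(X_roll, test_size=0.3, random_state=42)

errors = {}
for gamma in np.linspace(0.03, 0.05, 5):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    kpca.fit(X_tr)
    # Project the validation instances, then map them back to the original space
    X_val_preimage = kpca.inverse_transform(kpca.transform(X_val))
    errors[gamma] = mean_squared_error(X_val, X_val_preimage)

# Candidate with the lowest pre-image reconstruction error on the validation set
best_gamma = min(errors, key=errors.get)
print("best gamma:", best_gamma, "error:", errors[best_gamma])
```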