# Introduction to PCA

PCA is used for dimensionality reduction. This notebook is based on [In Depth: Principal Component Analysis](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html), with extensions to show what PCA means in practice.

PCA is a relatively advanced technique for feature engineering. It is used when there is a lot of collinearity between the features, but it is difficult to decide which features to drop.

PCA takes a set of $n$ features and generates a centered and rotated set of $p$ features, where $p \le n$. For dimensionality reduction we choose $p < n$.
The features derived by PCA are called the _Principal Components_ of the original features.
For example, size and weight might be heavily correlated (highly collinear) features. We could drop one or the other. PCA derives a new feature which has elements of both `size` _and_ `weight`, and which might be a better predictor than either `size` or `weight` on their own.
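
The `size` and `weight` idea can be sketched with synthetic data. The numbers below are invented purely for illustration (they do not come from any real dataset), and the features are left unstandardized to keep the example short; in practice you would usually scale features before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical, synthetic data: weight is roughly proportional to size plus noise.
rng = np.random.RandomState(0)
size = rng.normal(loc=50.0, scale=10.0, size=300)
weight = 2.0 * size + rng.normal(scale=5.0, size=300)
features = np.column_stack([size, weight])

# Correlation between the two original features (expected to be high).
print(np.corrcoef(size, weight)[0, 1])

# Keep a single principal component that mixes both original features.
pca_sw = PCA(n_components=1)
combined = pca_sw.fit_transform(features)

print(pca_sw.components_)                # weights given to size and weight
print(pca_sw.explained_variance_ratio_)  # share of total variance retained
```

Both entries of `components_` are non-zero, so the single derived feature carries information from `size` and from `weight`.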

We will cover PCA (and how it works) in class. For the moment, it is enough just to see it in action in this notebook.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

We create some functions for later use. `draw_vector` allows us to show the principal component axes on a plot. `abline` allows us to draw a line, given its slope and intercept, analogous to R's `abline` function.

In [None]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

In [None]:
# See https://stackoverflow.com/a/43811762
def abline(slope, intercept):
    """Plot a line from slope and intercept"""
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '--')

We generate some 2-D random data points. The point cloud has an elongated shape by design. Therefore the first principal component is expected to align with the main axis of the data, and the second principal component would be orthogonal to the first.

In [None]:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
print(X[0:4,:])

#X = np.random.randint(2, size=(100,30))
#print(X[0:4,:])

Now that we have the 2-D point cloud, we derive its first and second principal components and display how much variance is explained by the first and second. As can be seen, most of the variance is associated with the first.

In [None]:
pca = PCA(n_components=2)
pca.fit(X)
explainedVariance = pca.explained_variance_
v1=explainedVariance[0]
v2=explainedVariance[1]
pca.explained_variance_

In other words, the first PC accounts for a variance of {{explainedVariance[0]}}, leaving a variance of {{explainedVariance[1]}} for the second (and last) PC.
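
Raw variances are easier to interpret as fractions of the total. The sketch below regenerates the same point cloud under new names (`X_demo` and `pca_demo` are illustrative names, not part of the notebook) and compares a hand-computed ratio with scikit-learn's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same construction as the notebook's point cloud, under a separate name.
rng = np.random.RandomState(1)
X_demo = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca_demo = PCA(n_components=2).fit(X_demo)

# With all components kept, the variances sum to the total variance,
# so dividing by the sum gives each component's share.
variances = pca_demo.explained_variance_
ratios = variances / variances.sum()

print(ratios)
print(pca_demo.explained_variance_ratio_)
```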

In [None]:
e = pca.components_
pca.components_

The PC components above are the unit-length eigenvectors that define the directions of the first and second PCs. After translating the data so its mean is at the origin, the coordinate of a point along the first PC is `-0.944x_1 - 0.329x_2`, and its coordinate along the second PC, which is perpendicular to the first, is `-0.329x_1 + 0.944x_2`.
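
The link between `components_` and eigenvectors can be checked numerically. This is a minimal sketch, assuming the same point cloud as above (regenerated as `X_eig`, an illustrative name); it compares an eigendecomposition of the sample covariance matrix with the fitted PCA. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_eig = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca_eig = PCA(n_components=2).fit(X_eig)

# Sample covariance matrix (features are in columns, hence rowvar=False).
cov = np.cov(X_eig, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvalues should match explained_variance_.
same_variances = np.allclose(eigvals, pca_eig.explained_variance_)
# Eigenvectors (columns) should match the rows of components_ up to sign.
same_directions = np.allclose(np.abs(eigvecs.T), np.abs(pca_eig.components_))

# Coordinates along the PCs: centre the data, then project onto each direction.
scores = (X_eig - pca_eig.mean_) @ pca_eig.components_.T
same_scores = np.allclose(scores, pca_eig.transform(X_eig))

print(same_variances, same_directions, same_scores)
```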

We can now plot the data (in its original position) and overlay `PC_1`, which runs along the main axis of the data, and `PC_2`, which is perpendicular to it. Note that each arrow is scaled by the spread of the data along that component: its length is three standard deviations, i.e. three times the square root of the explained variance.

In [None]:
# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.1)
plt.axis('equal')
plt.xlabel('x')
plt.ylabel('y')
plt.title('input')
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)

In [None]:
# plot principal components
X_pca = pca.transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2)
draw_vector([0, 0], [0, 3])
draw_vector([0, 0], [3, 0])
plt.axis('equal')
plt.xlabel('component 1 (PC1)')
plt.ylabel('component 2 (PC2)')
plt.title('Data is rotated to align with the principal components')
# Call xlim/ylim as functions; assigning to them would overwrite them
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()

We now rerun the PCA operation, but this time keep only as many leading principal components as are needed to explain at least 95% of the variance. The second principal component can then be ignored, and we obtain a lower-dimensional representation `X_trans` of the original data (1-D instead of 2-D).

In [None]:
clf = PCA(0.95) # keep 95% of variance
X_trans = clf.fit_transform(X)
print(X.shape)
print(X_trans.shape)
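
The number of components kept for a variance threshold can also be worked out by hand from the cumulative explained variance ratio. The sketch below regenerates the same data as `X_thr` (an illustrative name) and compares the hand count with the fitted `n_components_` attribute; it is a simplified check, not a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_thr = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

# Fit with all components to see how the explained variance accumulates.
full = PCA().fit(X_thr)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%.
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let scikit-learn choose the number of components for the same threshold.
thresholded = PCA(0.95).fit(X_thr)

print(cumulative)
print(n_needed, thresholded.n_components_)
```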

We can compare the reduced representation to the original `X` in the following plot. Because `X_trans` is 1-D, we first map it back into the original 2-D space with `inverse_transform`, giving `X_new`. As you can see, the original `X` points have been projected onto the `PC_1` line, so each point is described by a single value: its position along that line.

In [None]:
X_new = clf.inverse_transform(X_trans)
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.1)
plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)
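ab1 = clf.components_[0]
slope = ab1[1] / ab1[0]
# The PC_1 line passes through the data mean, which is close to but not exactly the origin
abline(slope=slope, intercept=clf.mean_[1] - slope * clf.mean_[0])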
plt.axis('equal')
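
What `inverse_transform` does for a 1-D PCA can be written out directly: each point is rebuilt as the mean plus its score times the `PC_1` direction. This is a minimal sketch under the default `whiten=False` setting, using regenerated data (`X_rec` and `pca_1d` are illustrative names).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_rec = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca_1d = PCA(n_components=1).fit(X_rec)
scores_1d = pca_1d.transform(X_rec)  # shape (200, 1): position along PC_1

# Manual reconstruction: mean + score * direction.
manual = pca_1d.mean_ + scores_1d @ pca_1d.components_
matches = np.allclose(manual, pca_1d.inverse_transform(scores_1d))

# What was discarded is perpendicular to PC_1, so its dot product with
# the PC_1 direction should be numerically zero.
residual = X_rec - manual
max_along_pc1 = np.abs(residual @ pca_1d.components_[0]).max()

print(matches, max_along_pc1)
```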

We can make the 1-D nature of `X_trans` more obvious by rotating both the original `X` and the transformed `X_trans` so that their `PC_1` is horizontal.

In [None]:
# Derive the rotation parameters (Cosine and Sine) of the rotation angle
C = -ab1[0]
S = ab1[1]
Q = np.vstack([[C, S], [-S, C]])
# Xrot is the points rotated so they are aligned with the first principal component direction
Xrot = np.matmul(X,Q)
X_newrot = np.matmul(X_new,Q)

In [None]:
plt.plot(Xrot[:, 0], Xrot[:, 1], 'ob', alpha=0.1)
plt.plot(X_newrot[:, 0], X_newrot[:, 1], 'ob', alpha=0.8)
plt.axis('equal')
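# The projected points share one y value, set by the (small) offset of the data mean
abline(slope=0, intercept=X_newrot[:, 1].mean())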
plt.title('Data is projected onto the first principal component direction')
plt.show()