## Dimensionality reduction and PCA

Let's revisit the handwritten digits dataset that ships with scikit-learn (1,797 grayscale images of 8x8 pixels, a small cousin of MNIST).

Notebook objectives: 

* Learn sklearn syntax for PCA 
* Look at an example of using PCA for visualization
* Learn how to use a scree plot to explore how many principal components to keep 

In [None]:
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.decomposition import PCA

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import svm

from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
# load the digits dataset
digits = datasets.load_digits()
print(digits.data.shape) 

In [None]:
# This is what one digit looks like in numbers
digits.data[166].reshape(8,8)

In [None]:
# Take the shading for each pixel and plot it as color
plt.gray()
plt.matshow(digits.images[166])
plt.show()

In [None]:
# Center each feature (pixel) on 0 by subtracting its column mean.
# PCA assumes centered data; sklearn's PCA also centers internally, so this step is explicit rather than required.
# Centering does not put features on a common scale; standardize as well if their scales differ a lot.

X_centered = digits.data - digits.data.mean(axis=0)
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X_centered, y, test_size=0.5, random_state=42)

print(X_train.shape)
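
Centering for PCA means subtracting each feature's own mean, not one overall mean. The sketch below compares the two on the digits data and checks that `PCA` gives the same explained variance ratios whether or not the input was centered beforehand. It reloads the data under its own variable names (`X_col_centered`, `X_global_centered`, and so on are illustrative, not part of the notebook).

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

# Per-feature centering: subtract each pixel's own mean (one mean per column)
X_col_centered = X - X.mean(axis=0)

# Global centering: subtract a single scalar, so column means stay nonzero
X_global_centered = X - X.mean()

max_col_mean_per_feature = np.abs(X_col_centered.mean(axis=0)).max()
max_col_mean_global = np.abs(X_global_centered.mean(axis=0)).max()
print(max_col_mean_per_feature, max_col_mean_global)

# PCA subtracts the per-feature mean itself, so raw and centered inputs
# should yield the same explained variance ratios
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
ratio_centered = PCA(n_components=2).fit(X_col_centered).explained_variance_ratio_
same_ratios = np.allclose(ratio_raw, ratio_centered)
print(same_ratios)
```

Because `PCA` centers internally, the choice mostly matters when the centered matrix is used outside of PCA.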

## Let's do some PCA!

In [None]:
# Project the training data onto its first 2 principal components
pca = PCA(n_components=2)
pca.fit(X_train)
pcafeatures_train = pca.transform(X_train)

In [None]:
# Create a plot of the PCA results
from itertools import cycle

def plot_PCA_2D(data, target, target_names):
    """Scatter the first two principal components, one color per class."""
    # ten distinct colors that stay visible on a white background
    colors = cycle(['r','g','b','c','m','y','orange','k','aqua','purple'])
    target_ids = range(len(target_names))
    plt.figure(figsize=(10,10))
    for i, c, label in zip(target_ids, colors, target_names):
        plt.scatter(data[target == i, 0], data[target == i, 1],
                   c=c, label=label, edgecolors='gray')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend()

In [None]:
# plot of all the numbers
plot_PCA_2D(pcafeatures_train, target=y_train, target_names=digits.target_names)

## Transforming our input matrix X for use in classification/clustering 

Here we did PCA for visualization. But we can also use the new N x k matrix (where k = number of principal components) as input to regression, classification, clustering, etc.

In [None]:
X_transf = pca.transform(X_train)
print("shape of original X_train:", X_train.shape)
print("shape of X_train using 2 principal components:", X_transf.shape, "\n")
print(X_transf)
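
As a rough illustration of feeding the reduced matrix into a classifier, the sketch below fits PCA on the training half, projects both halves, and trains an SVM on the projected features. The choice of 15 components, the default `SVC` settings, and the names `pca_k`, `Z_tr`, `Z_te` are assumptions made for this example, not tuned values.

```python
from sklearn import datasets, svm
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=42)

# Fit PCA on the training half only, then apply the same projection to the test half
pca_k = PCA(n_components=15)
Z_tr = pca_k.fit_transform(X_tr)
Z_te = pca_k.transform(X_te)

clf = svm.SVC()  # default RBF kernel; hyperparameters are untuned
clf.fit(Z_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(Z_te))

print(Z_tr.shape, Z_te.shape)
print("test accuracy with 15 components:", acc)
```

Fitting the projection on the training data alone keeps information from the test set out of the features.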

In [None]:
pca.explained_variance_ratio_

In [None]:
# to understand how much each original variable (pixel) contributes to each PC, look at the loadings:

pd.DataFrame(pca.components_, index = ['PC1','PC2'])

# remember, the overall sign of a component is arbitrary; only the axis it defines matters
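
To make the point about signs concrete, here is a small sketch showing that the rows of `components_` are orthonormal and that flipping the sign of a component leaves the reconstruction of the data unchanged. It refits a 2-component PCA on the full digits data under its own names (`pca_2`, `W`, `W_flipped` are illustrative).

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data
pca_2 = PCA(n_components=2).fit(X)
W = pca_2.components_  # shape (2, 64): one row per component, one column per pixel

# Rows are orthonormal: unit length and mutually perpendicular
orthonormal = np.allclose(W @ W.T, np.eye(2))

# Flipping the sign of PC1 flips its scores but leaves the reconstruction unchanged
Xc = X - pca_2.mean_
recon = (Xc @ W.T) @ W
W_flipped = W * np.array([[-1.0], [1.0]])
recon_flipped = (Xc @ W_flipped.T) @ W_flipped
same_reconstruction = np.allclose(recon, recon_flipped)

# Pixel with the largest absolute loading on PC1, as (row, col) in the 8x8 image
top_pixel = np.unravel_index(np.abs(W[0]).argmax(), (8, 8))
print(orthonormal, same_reconstruction, top_pixel)
```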

In [None]:
pca.singular_values_
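
The singular values and the explained variance carry the same information on different scales. A minimal sketch of the relationship, assuming scikit-learn's convention of dividing by `n_samples - 1` (the names `pca_sv` and `var_from_sv` are illustrative):

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data
pca_sv = PCA(n_components=2).fit(X)

n_samples = X.shape[0]
# Variance along each component, recovered from the singular values
var_from_sv = pca_sv.singular_values_ ** 2 / (n_samples - 1)
matches = np.allclose(var_from_sv, pca_sv.explained_variance_)

# Dividing by the total variance of the data gives the explained variance ratio
total_var = X.var(axis=0, ddof=1).sum()
ratio_matches = np.allclose(var_from_sv / total_var, pca_sv.explained_variance_ratio_)
print(matches, ratio_matches)
```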

## Choosing number of components with a scree plot

Choosing two or three components makes sense if we're using PCA for visualization. But what if we're doing feature extraction and want to use the components as input for a classification/clustering task? Then we might use a scree plot to choose the number of components.

In [None]:
pca2 = PCA(n_components=15)
pca2.fit(X_train)
pcafeatures_train2 = pca2.transform(X_train)

In [None]:
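# x-axis starts at 1 so that it reads as the component number
plt.plot(range(1, len(pca2.explained_variance_ratio_) + 1), pca2.explained_variance_ratio_)
plt.xlabel('principal component')
plt.ylabel('explained variance ratio')
plt.title('Scree plot for digits dataset');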

In [None]:
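# x-axis starts at 1 so that it reads as the number of components kept
plt.plot(range(1, len(pca2.explained_variance_ratio_) + 1), np.cumsum(pca2.explained_variance_ratio_))
plt.xlabel('# components')
plt.ylabel('cumulative explained variance ratio')
plt.title('Cumulative explained variance by PCA for digits');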
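
One common way to turn the cumulative curve into a number is to pick the smallest count of components that reaches a chosen share of the variance. The sketch below does this by hand and then with scikit-learn's float form of `n_components`. The 90% threshold is an arbitrary illustration, and `pca_full`, `pca_auto`, and `k` are names introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data

# Fit with all components so the cumulative curve reaches 1.0
pca_full = PCA().fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90  # illustrative cutoff, not a universal rule
# First index where the curve reaches the threshold, +1 to convert to a count
k = int(np.argmax(cum_var >= threshold)) + 1
print("components needed for 90% of variance:", k)

# scikit-learn applies the same kind of rule when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print("n_components_ chosen by sklearn:", pca_auto.n_components_)
```

A threshold is only a heuristic; if the components feed a classifier, cross-validating over the number of components is a reasonable alternative.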