# Principal Component Analysis

PCA is a widely used tool for data analysis.  It can reduce the storage and computation requirements for a dataset via dimensionality reduction.  This involves a coordinate transformation from the original feature space to a space defined by the principal components, which are linear combinations of those features.  It can be quite powerful: in favorable cases, a data set with thousands of features can be reduced to a handful of components that still reproduce ~99% of the variance in the sample.

Another powerful use of PCA is for data mining and explanatory data models.  Using the principal components, we can understand the relationships in the data that describe the largest variations away from the mean behavior.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas releases
from sklearn.decomposition import PCA
import seaborn as sns

%matplotlib notebook

The data we will use is a sample of ~1000 randomly generated ice cream sundaes.  We have different flavors of Amy's Ice Creams, various amounts of toppings, and the corresponding calories per sundae.  

In [None]:
# Import the data.
sundae_df = pd.read_csv('IceCreamSundaes.csv',index_col=0)

In [None]:
sundae_df.head()

### Relationships in the data
Before we get started, let's have a look at the data.  What relationships do you notice?  Take a moment to share them on Slack.

In [None]:
# A scatter matrix is a grid of every feature vs. every other feature.  Along the diagonal, where the feature would be
# plotted against itself, a kernel density estimate is plotted, showing the distribution of values for that feature.

scatter_matrix(sundae_df, alpha=0.2, figsize=(10, 10), diagonal='kde')

If you haven't already, note the range of values for each feature.  When PCA looks for the main driver of variation in the data, what do you think it will find?  (I.e., what will be responsible for the biggest numerical difference from one sundae to the next?)
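To see why the range of each feature matters, here is a minimal sketch on synthetic data (not the sundae data; the two features and their scales are made up for illustration).  It compares PC1 on raw data with the variance split after dividing by the standard deviation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two independent, made-up features on very different scales.
small_range = rng.normal(loc=2.0, scale=1.0, size=500)
large_range = rng.normal(loc=800.0, scale=200.0, size=500)
demo_raw = np.column_stack([small_range, large_range])

# On the raw data, PC1 points almost entirely along the large-range feature.
pc1_raw = PCA().fit(demo_raw).components_[0]

# After dividing by the standard deviation, neither feature wins on scale alone.
demo_scaled = demo_raw / demo_raw.std(axis=0)
ratio_scaled = PCA().fit(demo_scaled).explained_variance_ratio_

print("PC1 loadings on raw data:      ", np.round(pc1_raw, 3))
print("Variance ratios on scaled data:", np.round(ratio_scaled, 3))
```

In this sketch the raw PC1 is dominated by whichever feature has the largest numerical spread, which is the motivation for scaling the features before running PCA.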

In [None]:
#prep data for PCA

del sundae_df['flavor'] #remove text column
X = sundae_df / sundae_df.std(ddof=0)  # divide each feature by its (population) standard deviation so all features share a common scale

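In order to find the largest variations, PCA needs to know how the data behaves on average.  Under the hood, it calculates the mean of each feature and subtracts it from the data, then finds the directions of largest variance.  scikit-learn does this with a singular value decomposition of the centered data, which gives the same result as diagonalizing the covariance matrix.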
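The centering and covariance steps can be sketched by hand with NumPy and compared against scikit-learn.  This is an illustration on synthetic data, with names (`demo`, `pca_check`) chosen so they do not clash with the notebook's variables; it is not how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data with correlated columns (a stand-in for X).
demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# 1. Subtract the mean of each feature.
centered = demo - demo.mean(axis=0)

# 2. Covariance matrix of the centered data (features are columns).
cov = np.cov(centered, rowvar=False)

# 3. Eigendecomposition.  eigh returns eigenvalues in ascending order,
#    so reorder to put the largest variance first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Compare with scikit-learn.  Each PC is only defined up to an overall sign,
# so compare absolute values of the components.
pca_check = PCA().fit(demo)
print(np.allclose(eigvals, pca_check.explained_variance_))
print(np.allclose(np.abs(eigvecs.T), np.abs(pca_check.components_), atol=1e-6))
```

The eigenvalues correspond to `explained_variance_` and the eigenvectors (as rows) to `components_`.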

In [None]:
# Applying a PCA is really easy.

pca = PCA() # Initialize the class.
pca.fit(X) # Find the PCs.

# Done.

### Reducing Dimensions

Ok, we've run a PCA.  How do we know how many principal components (PCs) to keep?

In [None]:
# Look at the amount of the variance in the data set represented in each PC.
print(pca.explained_variance_ratio_) 

In [None]:
fig1 = plt.figure(figsize=(6,4))
ax1 = fig1.add_subplot(111)

PCnums = [x+1 for x in range(len(pca.explained_variance_ratio_))]
sns.barplot(x=PCnums, y=np.cumsum(pca.explained_variance_ratio_), ax=ax1)  # newer seaborn versions require x= and y= keywords

ax1.set_title('Cumulative Sum of Variance Explained by PCs')
ax1.set_xlabel('Number of PCs')

#### How many PCs do you think we should keep and why?
Talk with your group about how many PCs you think are necessary.  Share on Slack when you decide.  
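One common rule of thumb, offered here only as a sketch, is to keep the smallest number of PCs whose cumulative explained variance reaches a chosen threshold.  The 95% cutoff and the synthetic data below are arbitrary assumptions, not a recommendation for the sundae data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic data: 3 underlying directions spread over 8 features, plus a little noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
demo = latent @ mixing + 0.1 * rng.normal(size=(300, 8))

cumulative = np.cumsum(PCA().fit(demo).explained_variance_ratio_)

threshold = 0.95  # arbitrary cutoff chosen for illustration
# Index of the first PC at which the cumulative sum reaches the threshold,
# plus one to turn a zero-based index into a count.
n_keep = int(np.argmax(cumulative >= threshold)) + 1

# scikit-learn can make the same choice when n_components is a fraction.
sklearn_pick = PCA(n_components=threshold, svd_solver="full").fit(demo).n_components_

print("Cumulative variance:", np.round(cumulative, 3))
print("PCs kept by hand:", n_keep, "| by scikit-learn:", sklearn_pick)
```

A threshold is only one consideration; a visible "elbow" in the plot or the needs of a downstream model may suggest a different number.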

In [None]:
#Enter the number of PCs you want to keep below as numPCs.
numPCs = 

### Transforming Coordinates
How do we take the PCs to reduce the size of our data set?

First, let's take a look at the PCs themselves.

In [None]:
# The pca class has several attributes associated with it, which end in '_'.  
# pca.components_ yields the contribution of each feature to each PC.
# For PC1 (row 0), there is a -0.155670 contribution from ice_cream, etc.  
# We will evaluate what these numbers mean shortly.  

# For now, we can use them to transform our data and reduce our dimensions.

pca_df = pd.DataFrame(pca.components_, columns = sundae_df.columns)
pca_df.head(3)

In [None]:
# Let's recreate one of our sundaes.  Say sundae #5.
X.iloc[5,:]

In [None]:
# We need to know how much of each PC to add to the mean to recreate Sundae #5.  These are called the weights.
weights = pca.transform(X)
weights[5]

In [None]:
# If you know you're going to use PCA to transform your data, you can 
# run the PCA and the transformation all in one step:
#weights = pca.fit_transform(X)

In [None]:
mean = pd.DataFrame([pca.mean_],columns = sundae_df.columns)
meanPCs = pd.concat([mean,pca_df[0:numPCs]])
meanPCs = meanPCs.reset_index(drop=True)
meanPCs

In [None]:
# Let's visualize what sundae #5 looks like.
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)

# plot sundae #5
X.iloc[5,:].plot(ax=ax,kind='bar',alpha=0.4)

# Uncomment the next line and run this cell to additionally plot mean values for dataset
#meanPCs.iloc[0].plot(ax=ax,kind='bar',alpha=0.2,color='g')

ax.set_ylabel('Values')
ax.set_title('Sundae #5')
ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=25)

Describe sundae #5 on Slack.  Would you eat it?

In [None]:
# Now let's compare sundae #5 with its reconstruction from the mean and the PCs.
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)

# plot sundae #5
X.iloc[5,:].plot(ax=ax,kind='bar',alpha=0.4)

# Now we need to add weight*PC for each PC you want to include.  Try adding one PC at a time.
# First, we'll start with just the mean + PC1.
sundae5 = meanPCs.iloc[0] + weights[5][0]*meanPCs.iloc[1]

sundae5.plot(ax=ax,kind='bar',alpha=0.2,color='orange')

ax.set_ylabel('Values')
ax.set_title('Sundae #5')
ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=25)

Go back and revise the previous plot.  Continue to add weights x PCs to your sundae #5 reconstruction and see how the plot evolves.  

#### Does the reconstruction get closer to the data with additional PCs?
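A minimal sketch of the same experiment on synthetic data, assuming the reconstruction is the mean plus the sum of weight times PC over the first k components.  The names `demo`, `pca_demo`, and `scores` are illustrative and separate from the notebook's `X`, `pca`, and `weights`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Synthetic data with correlated columns (a stand-in for X).
demo = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

pca_demo = PCA().fit(demo)
scores = pca_demo.transform(demo)  # one row of weights per sample

row = 5
errors = []
for k in range(demo.shape[1] + 1):
    # Mean plus the first k weighted PCs (k = 0 gives just the mean).
    approx = pca_demo.mean_ + scores[row, :k] @ pca_demo.components_[:k]
    errors.append(np.linalg.norm(demo[row] - approx))

for k, err in enumerate(errors):
    print(f"{k} PCs: reconstruction error = {err:.4f}")
```

Because the PCs are orthogonal, each added component can only remove error, and using all of them recovers the row up to floating-point precision.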

#### Remember where we came from:
You should notice that you come very close to recreating sundae #5 from the X data set that we fed into the PCA.  Recall that we divided the original data, sundae_df, by the standard deviation of each feature.  So to truly recover the original data, you would need to multiply each feature of your reconstructed sundae #5 by that feature's standard deviation.
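As a sketch of that last step, the snippet below scales synthetic data by its standard deviation, reconstructs it from all PCs, and multiplies by the standard deviation again to get back to the original units.  The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Synthetic "original" data: three features with very different spreads.
original = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0], size=(200, 3))

std = original.std(axis=0)
scaled = original / std  # same normalization idea as X = sundae_df / std

pca_demo = PCA().fit(scaled)
scores = pca_demo.transform(scaled)

# Reconstruct in the scaled space: mean + weights times PCs.
recon_scaled = pca_demo.mean_ + scores @ pca_demo.components_

# Undo the normalization to return to the original units.
recon_original = recon_scaled * std

print("Max abs difference from original:", np.abs(recon_original - original).max())
```

With fewer PCs the reconstruction in original units would be approximate rather than exact, but the rescaling step is the same.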

### Interpreting PCA

Coming back to PCA as a tool for data mining and explanatory models, let's investigate our PCs in more detail.

In [None]:
# let's look again at our DF of the sample means + PCs
meanPCs

In [None]:
#Plot just PC1
figure = plt.figure(figsize=(12,6))
ax = figure.add_subplot(111)

#plot PC1
meanPCs.iloc[1].plot(ax=ax,kind='bar',alpha=0.4,color='g')

#add horizontal dashed line at 0
xzero = range(-1,10)
zero = [0 for xi in xzero]
ax.plot(xzero,zero,'m--')

ax.set_title('Principal Component 1')
ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=25)

By evaluating how different features contribute to the PCs, we can say something about important relationships in our data.  Within a given PC, features whose contributions have the same sign (positive or negative) vary together along that component, i.e. they are positively correlated in that pattern.  Features with opposite signs are anticorrelated.  The magnitude of each feature's contribution tells you how strongly it takes part in that relationship.  PC1 captures the correlations that describe the most variation in your data.
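A small sketch on synthetic data can make the sign convention concrete.  Here two made-up features rise with a shared hidden driver and a third falls with it.  Note that the overall sign of a PC is arbitrary, so only the relative signs within a PC carry meaning.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

driver = rng.normal(size=500)  # hidden quantity behind all three features

feat_a = driver + 0.3 * rng.normal(size=500)   # rises with the driver
feat_b = driver + 0.3 * rng.normal(size=500)   # rises with the driver
feat_c = -driver + 0.3 * rng.normal(size=500)  # falls as the driver rises

demo = np.column_stack([feat_a, feat_b, feat_c])
demo = demo / demo.std(axis=0)  # same normalization idea as for X

pc1 = PCA().fit(demo).components_[0]
print("PC1 loadings (a, b, c):", np.round(pc1, 3))

same_sign_ab = np.sign(pc1[0]) == np.sign(pc1[1])
opposite_sign_ac = np.sign(pc1[0]) != np.sign(pc1[2])
print("a and b share a sign:", same_sign_ab)
print("a and c have opposite signs:", opposite_sign_ac)
```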

For ice cream sundaes in this data set, for example, one of the most important relationships is a strong correlation between the amount of hot fudge and marshmallow fluff.

What are the most important relationships that differentiate one sundae from another?  (I.e. ALL the relationships represented in PC1.)  Talk with your group then post your answer on Slack.

In [None]:
fig = plt.figure(figsize=(12,18))
for i in range(numPCs):
    ax = fig.add_subplot(numPCs,1,i+1)
    meanPCs.iloc[i+1].T.plot(sharex=True,ax=ax,kind='bar',alpha=0.75)
    zero = [0 for x in pca_df.columns]
    xzero = range(len(zero))
    ax.plot(xzero,zero,'k--')
    ax.set_title('PC'+str(i+1))

ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=25)

### Predictive modeling

If we were to use this data set for predictive modeling, we would likely want to use the ingredient features to predict the calories.  Given that, can you glean any further information from the PC plots?  Would you revise the number of components you want to keep?  What else might you consider?
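One thing to consider is that the PCA above included calories as a feature.  For prediction, one possible setup (a sketch, not the only approach) runs PCA on the ingredient features alone and then regresses calories on the retained PCs.  The data, the per-unit calorie values, and all names below are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Made-up ingredient amounts and calories per unit of each ingredient.
ingredients = rng.uniform(0, 3, size=(400, 4))
calories_per_unit = np.array([150.0, 90.0, 60.0, 40.0])
calories = ingredients @ calories_per_unit + rng.normal(scale=10.0, size=400)

def fit_r2(n_components):
    """Scale the features, keep n_components PCs, fit a linear model, return in-sample R^2."""
    model = make_pipeline(StandardScaler(), PCA(n_components=n_components), LinearRegression())
    model.fit(ingredients, calories)  # the target is never passed through PCA
    return model.score(ingredients, calories)

r2_all = fit_r2(4)
r2_few = fit_r2(2)
print(f"R^2 with 4 PCs: {r2_all:.3f}, with 2 PCs: {r2_few:.3f}")
```

PCA ranks components by variance in the features, not by usefulness for predicting the target, so dropping low-variance PCs can discard predictive information.  Evaluating on held-out data would give a fairer picture than the in-sample score used here.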