# Principal Component Analysis
Here we find the principal components in the data and visualize the results.

Principal Component Analysis is a popular method of visualizing variance in gene expression data and has many uses in statistics and machine learning. If you want to know more about PCA, this [excerpt from the Python Data Science Handbook by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html) has a nice explanation.

In [None]:
import json
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition

## 1. Load Data and Experimental Groups

In [None]:
present_transcripts_df = pd.read_csv('../../data/expression_by_probe.csv', index_col=0)
with open('../../data/experimental_groups.json') as f:
    experimental_groups = json.load(f)
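
Before running PCA, it can help to confirm that every sample named in the groups file is actually a column of the expression table. The sketch below assumes the JSON maps each group name to a list of sample names, which is how the plotting cells later use it. The toy data, the sample names, and the helper `find_missing_samples` are all hypothetical and only serve as illustration.

```python
import pandas as pd

# Toy stand-ins for the real inputs (hypothetical probe and sample names).
expression_df = pd.DataFrame(
    {"s1": [1.0, 2.0], "s2": [1.5, 2.5], "s3": [0.5, 3.0]},
    index=["probe_a", "probe_b"],
)
# Assumed layout of the groups file: group name -> list of sample names.
groups = {"control": ["s1", "s2"], "treated": ["s3", "s4"]}


def find_missing_samples(df, groups):
    """Return {group: [samples listed in the group but absent from df.columns]}."""
    columns = set(df.columns)
    missing = {}
    for name, samples in groups.items():
        absent = [s for s in samples if s not in columns]
        if absent:
            missing[name] = absent
    return missing


missing = find_missing_samples(expression_df, groups)
print(missing)  # {'treated': ['s4']}
```

An empty dictionary means every grouped sample can be looked up later with `.loc`; anything else points to a naming mismatch between the two files.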

## 2. Find the Components Using `sklearn`
We want the first few components to explain a large share of the total variance. If the combined explained variance ratio of the first 2-3 components is low, a plot of those components hides most of the structure in the data, and we should choose a different method.

In [None]:
pca = decomposition.PCA(n_components=3)
components = pca.fit_transform(present_transcripts_df.T.values)
components_df = pd.DataFrame(components, index=present_transcripts_df.T.index)
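
The transpose matters because `sklearn` treats each row as one observation. The table is transposed before fitting, which suggests that it stores probes as rows and samples as columns, so `.T` makes each sample a row. The sketch below reproduces the step on a toy matrix; the random data, the probe and sample names, and the `PC1`-`PC3` column labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

rng = np.random.default_rng(0)

# Toy matrix: 50 probes (rows) x 6 samples (columns), hypothetical names.
toy_df = pd.DataFrame(
    rng.normal(size=(50, 6)),
    index=[f"probe_{i}" for i in range(50)],
    columns=[f"sample_{i}" for i in range(6)],
)

toy_pca = decomposition.PCA(n_components=3)
# Transpose so that each sample becomes one row (one observation).
toy_components = toy_pca.fit_transform(toy_df.T.values)

# One row per sample, one column per component.
toy_components_df = pd.DataFrame(
    toy_components,
    index=toy_df.columns,
    columns=["PC1", "PC2", "PC3"],
)
print(toy_components_df.shape)  # (6, 3)
```

The notebook itself keeps the default integer column labels `0`, `1`, `2`, and the plotting cells index by those integers, so named columns like the ones here would require matching changes there.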

In [None]:
print(pca.explained_variance_ratio_)
pca.explained_variance_ratio_.sum()
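
One way to judge whether 2-3 components are enough is to fit PCA with all components and look at the cumulative explained variance. The sketch below does this on random toy data; the 0.8 threshold is an arbitrary, hypothetical cut-off, not a value taken from this analysis.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)

# Toy data: 8 samples (rows) x 100 features (columns).
toy_data = rng.normal(size=(8, 100))

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
full_pca = decomposition.PCA()
full_pca.fit(toy_data)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

threshold = 0.8  # hypothetical cut-off, chosen only for illustration
# Index of the first entry that reaches the threshold, converted to a count.
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print(f"{n_needed} components reach {threshold:.0%} of the variance")
```

On pure noise like this toy matrix the variance is spread across many components, which is the situation where a 2D or 3D PCA plot is not very informative.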

In [None]:
components_df

## 3. Visualize Principal Components Using a 3D Scatterplot
With 3 components, we can plot them either in three dimensions or as simple 2D scatterplots that use a different pair of components as the axes each time.

In [None]:
# Draw a 3D scatterplot of the 3 PCA components, with one color per experimental group.
# The figure is displayed and saved to disk; nothing is returned.
fig = plt.figure(figsize=(10,10), dpi=100)
# Recent Matplotlib versions no longer accept keyword arguments in fig.gca(),
# so the 3D axes are created with add_subplot instead.
ax = fig.add_subplot(projection='3d')

for i, j in zip(experimental_groups, 'rgyb'):
    xs = components_df.loc[experimental_groups[i], 0].values
    ys = components_df.loc[experimental_groups[i], 1].values
    zs = components_df.loc[experimental_groups[i], 2].values
    # Label each group so the legend entries match the plotted points
    ax.scatter(xs, ys, zs, c=j, s=400, label=i)

ax.legend()
# Save before showing, so the file is written while the figure is still open
fig.savefig('../../results/3d_pca.png')
plt.show()

## 4. Alternatively, Visualize Principal Components Using Multiple 2D Scatterplots

In [None]:
# Make one 2D scatterplot per pair of components instead of a single 3D scatterplot
for combo in [(0,1), (0,2), (1,2)]:
    
    fig=plt.figure(figsize=(5,5), dpi=100)
    ax=fig.gca()
    
    #Go through each experimental group and plot in a different color
    for i, j in zip(experimental_groups, 'rgyb'):
        
        xs=components_df.loc[experimental_groups[i], combo[0]].values
        ys=components_df.loc[experimental_groups[i], combo[1]].values
    
        ax.scatter(xs, ys, c=j, s=400, label=i)

    #Label the axes and save once per figure, after every group has been drawn
    ax.set_xlabel(f'Component {combo[0]}')
    ax.set_ylabel(f'Component {combo[1]}')
    ax.legend()
    fig.savefig(f'../../results/2d_pca_{combo}.png')