# **PCA Notebook Description**

Description: This notebook gives an example of running a principal component analysis (PCA) on the smoothed and normalized data so that it can be visualized in a lower-dimensional space.

## **Package Imports**

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import altair as alt

## **Read in the data**

**This is the data that was subset to -30 to 40 seconds w.r.t. sample detect time, normalized (scaled between 0 and 1), and smoothed by convolution with a Bartlett window of length 50.**

In [None]:
ecd_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/ecd_smooth.csv')
syn_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/syn_smooth.csv')
cont_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/cont_smooth.csv')
un_ts = pd.read_csv('../Data/PreprocessedData/TimeSeries/un_smooth.csv')
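
The preprocessing that produced these files is not part of this notebook. As a rough illustration only, the sketch below shows one way a single series could be scaled to [0, 1] and then smoothed with a Bartlett window of length 50. The function name `normalize_and_smooth`, the order of the two steps, the unit-sum window, and the `mode="same"` edge handling are all assumptions, not the original pipeline.

```python
import numpy as np

def normalize_and_smooth(signal, window_len=50):
    """Min-max scale a 1-D signal to [0, 1], then smooth it with a Bartlett window.

    Hypothetical helper: the real preprocessing may differ in order and edge handling.
    """
    signal = np.asarray(signal, dtype=float)
    span = signal.max() - signal.min()
    # Guard against a flat signal, which would otherwise divide by zero.
    scaled = (signal - signal.min()) / span if span > 0 else np.zeros_like(signal)
    window = np.bartlett(window_len)
    window = window / window.sum()  # unit-sum weights keep values on the same scale
    # mode="same" returns an output as long as the input; values near the edges
    # are pulled toward 0 because the signal is implicitly zero-padded.
    return np.convolve(scaled, window, mode="same")

rng = np.random.default_rng(0)
raw = np.sin(np.linspace(0, 6, 350)) + 0.1 * rng.normal(size=350)  # synthetic series
smoothed = normalize_and_smooth(raw)
print(smoothed.shape, smoothed.min(), smoothed.max())
```

Because the window weights are non-negative and sum to 1, the smoothed values stay inside the [0, 1] range of the scaled input.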

Save the ids for each category in case we need them later. 

In [None]:
ecd_ids = ecd_ts['TestId']
un_ids = un_ts['TestId']
cont_ids = cont_ts['TestId']
syn_ids = syn_ts['TestId']

Make a new data frame with all of the time series.

In [None]:
all_ts = pd.concat([un_ts, ecd_ts, cont_ts, syn_ts]).drop('TestId', axis = 1)

Create labels for the ECD errors vs. unsuccessful readings: rows from `un_ts` are labelled `un` and all remaining rows are labelled `pc`. This will allow colour coding in the plot.

In [None]:
cmap = {"pc":"red", "un":"blue"}
pc_lab = pd.Series(['pc'])
un_lab = pd.Series(['un'])
x = un_lab.repeat(len(un_ts))
y = pc_lab.repeat(len(ecd_ts) + len(cont_ts) + len(syn_ts))
labs = pd.concat([x, y])
all_ts.reset_index(drop = True, inplace = True)
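
As an aside, the same label vector can be built in one step. The sketch below uses hypothetical group sizes in place of `len(un_ts)`, `len(ecd_ts)`, `len(cont_ts)` and `len(syn_ts)`. The one thing that matters is that the label order matches the order used in `pd.concat` above (`un` rows first).

```python
import numpy as np
import pandas as pd

# Hypothetical group sizes standing in for the lengths of the four data frames.
n_un, n_ecd, n_cont, n_syn = 4, 2, 1, 3

# np.repeat with per-element counts builds the whole label vector at once,
# and wrapping it in a Series gives a clean 0..n-1 index with no reset needed.
labels = pd.Series(np.repeat(["un", "pc"], [n_un, n_ecd + n_cont + n_syn]), name="label")

print(labels.value_counts().to_dict())
```

A quick check such as `len(labels) == len(all_ts)` before attaching the labels helps catch a mismatch between the label order and the row order.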

## **Run the PCA**

In [None]:
# Keep enough components to account for 95% of the variation in the data. 
pca = PCA(n_components=0.95)

# Run the PCA on standardized data so that no particular time point with a lot of variation has too much sway.
principalComponents = pca.fit_transform(StandardScaler().fit_transform(all_ts))

# Make a data frame with the resulting components. 
principalDf = pd.DataFrame(data = principalComponents, columns = ['Component '+ str(i+1) for i in range(pca.n_components_)])

print(pca.explained_variance_ratio_)

Here we can see that two components are sufficient to explain 95% of the variance in our time series. This is convenient as we can easily visualize things in a two-dimensional space. 
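
To make the 95% threshold explicit, the cumulative sum of `explained_variance_ratio_` can be inspected. The sketch below runs on synthetic data (a stand-in for `all_ts`, driven by two made-up latent factors), so its numbers say nothing about the real time series. It only illustrates how `n_components=0.95` selects the number of components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for all_ts: 100 series of 50 time points from two latent factors.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 50))

pca = PCA(n_components=0.95)
scores = pca.fit_transform(StandardScaler().fit_transform(X))

# Running total of explained variance; the last entry is the first to pass 0.95.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```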

In [None]:
# Add the predefined data labels into the PCA data frame. 
principalDf['label'] = labs.reset_index(drop=True)
principalDf

## **Visualize the PCA**

In [None]:
finalDf = principalDf
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['un', 'pc']
colors = ['b', 'r']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['label'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'Component 1']
               , finalDf.loc[indicesToKeep, 'Component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

Here we can see that most of the ECDs score close to 0 on the second principal component.
This is interesting as it provides some indication that there might be a way to pull them out. Next, let's look at the eigenvectors to see the coefficients for different time points in the different components. This will help us see which time points explain most of the variation in each component.

In [None]:
# The matrix of variable loadings (i.e., the matrix whose columns contain the eigenvectors).
# The eigenvectors provide the coefficients for the linear combination. This will tell us which time points are
# most influential.
rotation = pd.DataFrame(pca.components_, columns = all_ts.columns).T
rotation.columns = [f"PC_{i}" for i in range(1, len(pca.components_) + 1)]
rotation

# Take a look at the first eigenvector (coefficients for the first principal component),
# showing the 10 time points with the largest absolute coefficients.
rotation.sort_values(by=['PC_1'], key = abs, ascending = False)[['PC_1']].head(10)
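
To make "coefficients for the linear combination" concrete, the sketch below checks on synthetic data (a stand-in for `all_ts`) that the component scores are the centered data projected onto the rows of `pca.components_`. The variable names `Z` and `manual_scores` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))           # synthetic stand-in for all_ts
Z = StandardScaler().fit_transform(X)  # standardized columns, as in the notebook

pca = PCA(n_components=2)
scores = pca.fit_transform(Z)

# Each row of pca.components_ is one eigenvector. Projecting the centered data
# onto it gives that component's score for every observation.
manual_scores = (Z - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual_scores))

# The eigenvectors have unit length, so coefficient sizes are comparable across time points.
print(np.linalg.norm(pca.components_, axis=1))
```

A time point with a large absolute coefficient therefore contributes strongly to that component's score, which is why sorting by absolute value is a reasonable way to rank time points.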

Interestingly, the predictors with the largest coefficients for principal component 1 are all between 11-15 seconds, around what is considered to be the post window. Now let's take a look at the second eigenvector. 

In [None]:
rotation.sort_values(by=['PC_2'], key = abs, ascending = False)[['PC_2']].head(10)

The predictors with the largest coefficients for principal component 2 are all between 35-40 seconds, around what is considered to be the sample window. Now let's visualize these two eigenvectors to get a clearer idea of what's going on and what these eigenvectors might correspond to. 

In [None]:
# Time axis: one point every 0.2 s from -30 s up to (but not including) 40 s.
time = np.arange(-30, 40, 0.2)
plt.plot(time, rotation['PC_1'], label = 'eigenvector 1')
plt.plot(time, rotation['PC_2'], label = 'eigenvector 2')
plt.xlabel('time w.r.t sample detect time (secs)')
plt.ylabel('coefficient')
plt.legend()
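
The plotting cell above only works if the time axis has exactly as many points as there are time-point columns. With a non-integer step, `np.arange` can occasionally be off by one element because of floating-point rounding, so a slightly more defensive option is to derive the axis from the column count with `np.linspace`. The value `n_points = 350` below is an assumption (70 seconds at 0.2 s spacing). In the notebook it would come from `all_ts.shape[1]`.

```python
import numpy as np

n_points = 350  # assumed number of time-point columns, e.g. all_ts.shape[1]

# endpoint=False mirrors np.arange: start at -30 s and stop just before 40 s.
time = np.linspace(-30, 40, n_points, endpoint=False)
step = time[1] - time[0]
print(len(time), step)
```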

We can see that the coefficients increase after sample detect for principal component 2 but not for principal component 1. This means that readings that score close to 0 on principal component 2 (ECDs) are less explained by what happens after sample detect time. This is very interesting given that we expect ECD errors to remain around 0 (unchanged relative to calibration) after sample detection.
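
One hedged way to back up a visual impression like this is to summarize the component 2 scores per label. The sketch below uses a synthetic stand-in for `principalDf`, with the `pc` group deliberately given a tighter spread, so the printed numbers are illustrative and not results from the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stand-in for principalDf; the "pc" spread is made small on purpose.
demo = pd.DataFrame({
    "Component 2": np.concatenate([rng.normal(0, 5, 40), rng.normal(0, 0.5, 30)]),
    "label": ["un"] * 40 + ["pc"] * 30,
})

# Count, median and spread of the second-component scores for each label.
summary = demo.groupby("label")["Component 2"].agg(["count", "median", "std"])
print(summary)
```

On the real data, the same `groupby` call applied to `principalDf` would show whether one group's scores are in fact concentrated near 0.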