# Lecture 2
## Introduction to Sklearn
### PCA in sklearn

<ol>
<li> Data used: simulated data (generated in this notebook).</li>
<li> Notebook goal: learn how to apply PCA in sklearn.</li>
<li> Extra exercise: none.</li>
</ol>

![SegmentLocal](../Pictures/PCAgif.gif "segment")


In [1]:
#Necessary Imports

import pandas as pd
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.offline as ofl
from plotly.offline import iplot
ofl.init_notebook_mode()

We start by sampling bivariate normal data with a strong correlation between the two dimensions. We then apply PCA to this data and assess the dimension reduction by the proportion of variance it explains.

In [2]:
import numpy as np
means = [0.5, 14]
cov_matrix = [[0.22, 0.88], [0.88, 4]]
X = np.random.multivariate_normal(means, cov_matrix, 9000)

fig = go.Figure()
fig.add_traces(go.Scatter(x=X[:,0], y=X[:,1],mode='markers'))
fig.update_layout(title='Randomly Generated Bivariate Normal Data', title_x = 0.5)
fig.update_layout(width=750, height=500, autosize=False)
iplot(fig)
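
How strong the correlation is follows directly from `cov_matrix`: it is the covariance divided by the product of the two standard deviations. A minimal sketch of that calculation, using only NumPy (the names `std_devs` and `rho` are introduced here for illustration):

```python
import numpy as np

# Same population covariance matrix as in the cell above.
cov_matrix = np.array([[0.22, 0.88], [0.88, 4.0]])

# Standard deviations are the square roots of the diagonal entries.
std_devs = np.sqrt(np.diag(cov_matrix))

# Correlation = covariance / (std of dimension 1 * std of dimension 2).
rho = cov_matrix[0, 1] / (std_devs[0] * std_devs[1])
print(f"{rho:.3f}")  # 0.938
```

This is the population value; the correlation of a particular sample of 9000 points will differ slightly.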


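Recall that it is customary to standardize your data before applying PCA, so that every variable has mean 0 and unit variance and no variable dominates simply because of its scale. Luckily, we have already seen how to do this with `StandardScaler`!
Fitting the PCA afterwards takes only a few lines. Notice once again how short and clean the code is!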

In [3]:
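# Standardize each column to mean 0 and unit variance
StS = StandardScaler()
X = StS.fit_transform(X)
# With no arguments, PCA keeps all components (two for this data)
pca = PCA()
pca.fit(X)
# Each row of components_ is one principal component
pca.components_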

array([[-0.70710678, -0.70710678],
       [ 0.70710678, -0.70710678]])

Each row of `components_` is one principal component. Hence, the two components are the unit vectors

$$
\vec{v}_1 = \begin{pmatrix} -0.707 \\ -0.707 \end{pmatrix}, \quad  \vec{v}_2 = \begin{pmatrix} 0.707 \\ -0.707 \end{pmatrix}.
$$
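
The components can be cross-checked against an eigen-decomposition of the sample covariance matrix of the standardized data. The sketch below is one way to do that, not part of the original notebook: it draws its own sample with a seeded generator (an assumption, made only for repeatability) and uses the illustrative names `X_demo` and `pca_demo`. Eigenvectors are only defined up to sign, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: a seeded generator, used only to make this sketch repeatable.
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([0.5, 14], [[0.22, 0.88], [0.88, 4]], size=9000)
X_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA().fit(X_demo)

# Eigen-decomposition of the sample covariance matrix (columns are variables).
eigvals, eigvecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(eigvals)[::-1]  # eigh sorts ascending; put largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows of components_ should match the eigenvectors (columns), up to sign.
same_directions = np.allclose(np.abs(pca_demo.components_), np.abs(eigvecs.T))
# The eigenvalues should match the variance explained by each component.
same_variances = np.allclose(pca_demo.explained_variance_, eigvals)
print(same_directions, same_variances)
```

For two standardized variables the eigenvectors always lie along the diagonals, which is why every entry has absolute value 0.707, that is, 1/√2.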

We can obtain the coordinates of the observations in terms of this new basis using the transform method.

In [4]:
pca.transform(X)

array([[ 1.02304317,  0.03675687],
       [-1.36247929, -0.153627  ],
       [-1.43181346,  0.14550569],
       ...,
       [ 1.71162641, -0.02380029],
       [-1.56784116, -0.04495779],
       [-0.63310781,  0.00573982]])
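
Under the hood, `transform` centres the data with the mean learned during `fit` and takes the dot product with each component. A small sketch of that equivalence, assuming the default `whiten=False`; the data and the names `X_demo`, `pca_demo` and `scores_manual` are illustrative and not taken from the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative correlated data, seeded for repeatability (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
pca_demo = PCA().fit(X_demo)

# Centre with the fitted mean, then project onto the rows of components_.
scores_manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

# Compare with the coordinates returned by transform.
matches = np.allclose(scores_manual, pca_demo.transform(X_demo))
print(matches)
```

Column `k` of the result holds the coordinates of all observations along component `k`.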

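The projection of each observation onto a principal component is its coordinate along that component multiplied by the component vector: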

In [5]:
# Coordinates of every observation in the principal-component basis
scores = pca.transform(X)
# Projection onto a component = coordinate along it times the component vector
pca_projection = np.outer(scores[:, 0], pca.components_[0])
pca_projection2 = np.outer(scores[:, 1], pca.components_[1])

fig = go.Figure()
fig.add_traces(go.Scatter(x=X[:,0], y=X[:,1],mode='markers', name='Original data'))
fig.add_traces(go.Scatter(x=pca_projection[:,0], y=pca_projection[:,1],mode='markers',name='First PC'))
fig.add_traces(go.Scatter(x=pca_projection2[:,0], y=pca_projection2[:,1],mode='markers',name='Second PC'))

fig.update_layout(title='Standardized Data and Projections onto the Principal Components', title_x = 0.5)

fig.update_layout(width=500, height=500, autosize=False)
iplot(fig)
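
Two properties of these projections are worth checking. First, when all components are kept, the projections (plus the fitted mean) add up to the original data. Second, the projection onto the first component alone is the best one-dimensional reconstruction, which is what a one-component PCA returns from `inverse_transform`. The following is a hedged sketch on illustrative data; `X_demo`, `pca_demo`, `pca_1` and the seed are assumptions and not part of the notebook. In the notebook the mean is zero after standardization, so it does not need to be added back there.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative correlated data, seeded for repeatability (an assumption).
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
pca_demo = PCA().fit(X_demo)
scores = pca_demo.transform(X_demo)

# Projection onto component k: coordinate along k times the unit vector of k.
proj_1 = np.outer(scores[:, 0], pca_demo.components_[0])
proj_2 = np.outer(scores[:, 1], pca_demo.components_[1])

# With all components kept, mean + projections recovers the data.
recovered = np.allclose(pca_demo.mean_ + proj_1 + proj_2, X_demo)

# Keeping only the first component matches a one-component PCA round trip.
pca_1 = PCA(n_components=1).fit(X_demo)
rank_1 = pca_1.inverse_transform(pca_1.transform(X_demo))
same_rank_1 = np.allclose(pca_demo.mean_ + proj_1, rank_1)
print(recovered, same_rank_1)
```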

The proportion of the total variance explained by each component is stored in the `explained_variance_ratio_` attribute:

In [6]:
pca.explained_variance_ratio_

array([0.97017056, 0.02982944])

We see that the first component alone explains around 97% of the variance of the data, so reducing the data from two dimensions to one loses only about 3% of the variance.
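
This figure can be anticipated from the covariance matrix used to simulate the data. As a rough worked check, with $\rho$ denoting the population correlation and $R$ the correlation matrix of the standardized variables (symbols introduced here for illustration):

$$
\rho = \frac{0.88}{\sqrt{0.22 \cdot 4}} \approx 0.938, \qquad
R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad
\lambda_{1,2} = 1 \pm \rho, \qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + \rho}{2} \approx 0.969.
$$

The eigenvalues $\lambda_1$ and $\lambda_2$ of $R$ are the variances along the two components, so the expected share of the first component is about 96.9%. The sample value of 97.0% differs slightly because it is computed from a finite random sample.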