# Dimensionality Reduction 

- General Overview
- Examples with data:
    - Iris Dataset
    - S&P 500

# Setup

In [None]:
import pandas as pd
pd.options.display.max_rows = 10

import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (17, 5)
plt.rcParams.update({'font.size': 14})

import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# improve figure resolution
# comment out this line if it errors on your machine/screen
%config InlineBackend.figure_format ='retina'

The code in this introductory example is adapted from Jake VanderPlas's [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html).

### General Concepts

In [None]:
# generate data and look at the relationship
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

We can tell that there is a linear relationship between the two variables.

#### Carry Out PCA

In [None]:
# Use PCA from sklearn
pca = PCA(n_components=2)
pca.fit(X)

The fit learns some quantities from the data, most importantly the "components" and "explained variance":

In [None]:
# PCs
print(pca.components_)

In [None]:
# Variance Explained
print(pca.explained_variance_)

#### Visualizing PCA

To see what these numbers mean, let's visualize them:
- as vectors over the input data
    - using the "components" to define the direction of the vector
    - and the "explained variance" to define the squared-length of the vector:

In [None]:
# define the draw_vector function
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, 
                      shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.4)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

- The vectors represent the **principal axes** of the data
- The length of each vector indicates how "important" that axis is in describing the distribution of the data. More precisely:
    - it is a measure of the variance of the data when projected onto that axis
- The projections of each data point onto the principal axes are the "principal components" of the data.
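
To make the last two points concrete, here is a minimal sketch that reuses the synthetic data from above. It assumes the default `PCA` settings (no whitening); the names `scores_manual` and `scores_sklearn` are introduced only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# same synthetic data as in the cells above
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA(n_components=2).fit(X)

# principal components (scores): center the data, then project onto each axis
scores_manual = (X - pca.mean_) @ pca.components_.T
scores_sklearn = pca.transform(X)
print(np.allclose(scores_manual, scores_sklearn))

# the sample variance of the scores along each axis should match
# the explained variance reported by the fit
print(scores_manual.var(axis=0, ddof=1))
print(pca.explained_variance_)
```

The second comparison uses `ddof=1` because `explained_variance_` is computed with an `n - 1` denominator.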

# PCA: Iris Dataset

- Dataset
- PCA
- Visualize

This example comes from plotly, but has been reproduced in many places, as the iris dataset is quite famous: https://plot.ly/ipython-notebooks/principal-component-analysis/

### The Dataset

In [None]:
# read dataset in
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                 header=None)

# change column names and drop empty line
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
df['species'] = df['class'].replace(to_replace = "Iris-", value = "", regex=True)

print(df.shape)

df

In [None]:
# boxplot across variables
df.boxplot(by='species');

- A) setosa and versicolor
- B) setosa and virginica
- C) versicolor and virginica
- D) all quite distinct
- E) all equally similar


In [None]:
# plot two variables
sns.scatterplot(x='sepal_len', y='sepal_wid', 
                s=100, data=df);

In [None]:
# color by species
sns.scatterplot(x='sepal_len', y='sepal_wid', hue='species', 
                s=100, data=df)
plt.legend(loc='upper left');


- A) setosa and versicolor
- B) setosa and virginica
- C) versicolor and virginica
- D) all quite distinct
- E) all equally similar


![iris](img/iris.png)

### PCA: Iris

- define predictors
- standardize data

In [None]:
# split data table into predictors and outcome (class labels - species)
iris_predictors = df.iloc[:,0:4].values
iris_species = df.iloc[:,4].values
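
The outline above lists standardizing the data, although the cells in this section run PCA on the raw measurements. If standardization is wanted, one possible sketch uses `StandardScaler`, which is already imported in the setup. The small array `demo_predictors` is a hypothetical stand-in for `iris_predictors`, so the block runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for iris_predictors:
# rows are observations, columns are the four measurements
demo_predictors = np.array([[5.1, 3.5, 1.4, 0.2],
                            [7.0, 3.2, 4.7, 1.4],
                            [6.3, 3.3, 6.0, 2.5],
                            [5.8, 2.7, 5.1, 1.9]])

# rescale each column to mean 0 and unit standard deviation
demo_scaled = StandardScaler().fit_transform(demo_predictors)

print(demo_scaled.mean(axis=0))  # column means after scaling
print(demo_scaled.std(axis=0))   # column standard deviations after scaling
```

Running PCA on standardized predictors gives every variable equal weight, so the variance proportions would generally differ from those of the unstandardized fit used here.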

#### Calculating PCs

There are many ways to calculate PCs. You can calculate the covariance matrix and run an eigendecomposition on it, or you can use `SVD` (singular value decomposition) from `numpy`. The simplest way, however, is to use `PCA` from `sklearn` (as seen above).
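
As a rough illustration of the first two routes, the sketch below runs both on synthetic data and checks that they agree. The array `X_demo` and its column scales are assumptions made up for this example, not the iris data.

```python
import numpy as np

# hypothetical data: 150 observations, 4 variables with different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X_centered = X_demo - X_demo.mean(axis=0)
n_samples = X_centered.shape[0]

# Route 1: eigendecomposition of the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_vars = S**2 / (n_samples - 1)        # singular values -> variances

# same variances; same axes up to a sign flip
print(np.allclose(eigvals, svd_vars))
print(np.allclose(np.abs(eigvecs.T), np.abs(Vt)))
```

The absolute values in the last comparison are there because the direction of each axis is only defined up to sign.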

In [None]:
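# calculate PCs
pca = PCA(n_components=4)
# fit_transform fits the model and returns the projected data in one step
iris_PCs = pca.fit_transform(iris_predictors)
# keep a handle on the fitted estimator (fit returns the same object)
iris_pca_fit = pca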

In [None]:
# See PCs
print(iris_pca_fit.components_)

#### Selecting PCs

The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data; those are the ones that can be dropped.

To do so, the common approach is to rank the eigenvalues from highest to lowest and keep the top 𝑘 eigenvectors.
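
One way this ranking could be sketched in code is shown below. The eigenvalues and the 95% cutoff are hypothetical values chosen for illustration; they are not taken from the iris fit.

```python
import numpy as np

# hypothetical, unsorted eigenvalues (not from the iris data)
eigvals = np.array([0.5, 4.0, 0.1, 2.0])
threshold = 0.95  # assumed cutoff for cumulative variance explained

# rank eigenvalues from highest to lowest
sorted_vals = np.sort(eigvals)[::-1]

# cumulative share of total variance captured by the top components
cumulative = np.cumsum(sorted_vals) / sorted_vals.sum()

# smallest k whose cumulative share reaches the threshold
k = int(np.argmax(cumulative >= threshold)) + 1

print(sorted_vals)
print(cumulative)
print(k)
```

A fixed cutoff is only one convention; reading the elbow off a bar plot of the variance explained is another.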

In [None]:
print(iris_pca_fit.explained_variance_)

In [None]:
print(iris_pca_fit.explained_variance_ratio_)

In [None]:
var_exp = pd.DataFrame(iris_pca_fit.explained_variance_ratio_,
                       ['PC1', 'PC2','PC3','PC4'])
var_exp.plot.bar(rot=0)
plt.ylabel('variance explained');

PC1 explains 92% of the variance.
PC2 contains important information (5% of the variance).

PCs 3 and 4 can safely be dropped without losing too much information.

Together, the first two PCs contain about 97% of the information.
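
One way to check what "information" means here is to keep only two PCs, map the data back to the original four variables, and measure how much of the total variation survives. This sketch assumes sklearn's bundled copy of iris (`load_iris`) rather than the UCI file read above, so the numbers may differ slightly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# assumption: sklearn's bundled iris data stands in for the UCI download
X_iris = load_iris().data

# keep only the first two PCs, then map back to the original 4 variables
pca2 = PCA(n_components=2).fit(X_iris)
X_reduced = pca2.transform(X_iris)
X_restored = pca2.inverse_transform(X_reduced)

# share of the total (centered) sum of squares that survives the round trip
total_ss = ((X_iris - X_iris.mean(axis=0)) ** 2).sum()
lost_ss = ((X_iris - X_restored) ** 2).sum()
retained = 1 - lost_ss / total_ss

print(retained)
print(pca2.explained_variance_ratio_.sum())
```

The two printed values should agree, since the variance lost in reconstruction is exactly the variance carried by the dropped PCs.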



#### Visualization

In [None]:
# get dataframe with PCs and label
pca_out = pd.DataFrame(iris_PCs, 
                       columns=['PC1','PC2','PC3','PC4'])
pca_out['species'] = df['species']
pca_out

In [None]:
# plot PC1 and PC2
# color by species 
sns.scatterplot(x='PC1', y='PC2', hue='species', 
                s=100, data=pca_out);


- A) setosa and versicolor
- B) setosa and virginica
- C) versicolor and virginica
- D) all quite distinct
- E) all equally similar


### Iris: Summary

While we're only looking at 4 variables here, and PCA can handle many more, PCA quickly gives us an understanding of the overall structure of the iris dataset across all four quantitative variables.

## S&P 500 Example 

In [None]:
# read data in
sp = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/sp500_data.csv')
sp.shape

In [None]:
sp.head()

#### Specify subset

In [None]:
funds = ['AAPL', 'AXP', 'COP', 'COST', 'CSCO', 'CVX', 'HD', 
        'INTC', 'JPM', 'MSFT', 'SLB', 'TGT', 'USB', 
        'WFC', 'WMT', 'XOM']

sp = sp.loc[:,funds].transpose()
sp.head()

#### Calculate PCs

In [None]:
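# calculate PCs
pca = PCA(n_components=5)
# fit_transform fits the model and returns the projected data in one step
sp_PCs = pca.fit_transform(sp)
# keep a handle on the fitted estimator (fit returns the same object)
sp_pca_fit = pca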

#### Screeplot & Variance Explained

In [None]:
# screeplot for first 5 PCs
var_exp = pd.DataFrame(sp_pca_fit.explained_variance_ratio_,
                       ['PC1', 'PC2','PC3','PC4', 'PC5'])
var_exp.plot.bar(rot=0)
plt.ylabel('variance explained');

#### Visualization

In [None]:
# get dataframe with PCs and label
sp_out = pd.DataFrame(sp_PCs,
                      columns=['PC1','PC2','PC3','PC4','PC5'],
                      index=sp.index)

sns.scatterplot(x = 'PC1', y ='PC2', 
                s = 300, data = sp_out);

In [None]:
sp_out[sp_out['PC1'] > 10]

- COP : ConocoPhillips (crude oil & natural gas)
- CVX : Chevron
- SLB : Schlumberger (technology for oil drilling)
- XOM : Exxon Mobil

In [None]:
# rebuild the dataframe of PCs, then plot every pair of PCs against each other
sp_out = pd.DataFrame(sp_PCs,
                      columns=['PC1','PC2','PC3','PC4','PC5'],
                      index=sp.index)

pd.plotting.scatter_matrix(sp_out);

### PCA Summary

- Helpful for understanding _many_ quantitative variables at a time
- Helpful for understanding 'global' structure within a dataset
    - Identifies outliers and groups (which also means the PCs can be driven by outliers)
- Useful in EDA, modeling, & prediction
- `PCA` from `sklearn` is the most straightforward approach to computation