<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# PCA: Extra Practice

_Authors: Joseph Nelson (DC), Matt Brems (DC)_

---

In this lab, we will practice the PCA process:

- Load the data.
- Standardize it (center each feature at 0 and scale it to unit variance).
- Compute the covariance matrix of the standardized data.
- Compute the eigenvalues and eigenvectors of that covariance matrix.
- Decide how much explained variance you want in your final model, and select the number of principal components needed to explain that amount of variance.
- Keep the eigenvectors that correspond to the largest eigenvalues, as many as are needed to reach that explained variance.
- Go back and multiply the standardized data by the selected eigenvectors to project it onto the principal components.

PCA is most useful when the features are correlated with one another, because it summarizes their shared variation in a smaller number of components.  
A dataset of entirely uncorrelated features will not show much benefit from PCA.
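
To make the point about correlation concrete, here is a minimal sketch on synthetic data (not the voting data). The helper name `explained_variance_ratio`, the seed, and the noise level are all assumptions chosen for illustration. It compares how much variance the first component captures for independent features versus features that are noisy copies of one hidden signal.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return each eigenvalue's share of the total variance, largest first."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    cov = np.cov(Xs, rowvar=False)               # covariance of standardized data
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 1000

# Three independent features: no shared structure for PCA to exploit.
uncorrelated = rng.normal(size=(n, 3))

# Three noisy copies of one hidden signal: strongly correlated features.
signal = rng.normal(size=(n, 1))
correlated = signal + 0.1 * rng.normal(size=(n, 3))

ratio_uncorrelated = explained_variance_ratio(uncorrelated)
ratio_correlated = explained_variance_ratio(correlated)
print("uncorrelated:", ratio_uncorrelated.round(3))
print("correlated:  ", ratio_correlated.round(3))
```

In the uncorrelated case each component should hold roughly a third of the variance, so dropping any of them loses information. In the correlated case the first component should hold nearly all of it.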

# Congressional Voting Data

You're working for a political watchdog that wants to track and analyze the voting behavior of various politicians. Specifically, the watchdog wants to understand how the political affiliation of a member of the House of Representatives relates to their voting record. You're given a dataset with party affiliations as well as voting records for a variety of key bills.

Your task is to perform PCA to determine the principal components of this dataset so that your data science team can perform a clustering analysis to learn how political affiliation is related to voting.

[Congressional Voting Dataset](./datasets/votes.csv)

Bill Index|Bill (vote options)
----------|----
V1.  |handicapped-infants: 2 (y,n)
V2.  |water-project-cost-sharing: 2 (y,n)
V3.  |adoption-of-the-budget-resolution: 2 (y,n)
V4.  |physician-fee-freeze: 2 (y,n)
V5.  |el-salvador-aid: 2 (y,n)
V6.  |religious-groups-in-schools: 2 (y,n)
V7.  |anti-satellite-test-ban: 2 (y,n)
V8.  |aid-to-nicaraguan-contras: 2 (y,n)
V9.  |mx-missile: 2 (y,n)
V10. |immigration: 2 (y,n)
V11. |synfuels-corporation-cutback: 2 (y,n)
V12. |education-spending: 2 (y,n)
V13. |superfund-right-to-sue: 2 (y,n)
V14. |crime: 2 (y,n)
V15. |duty-free-exports: 2 (y,n)
V16. |export-administration-act-south-africa: 2 (y,n)

### 1. Load Packages

In [18]:
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
import math
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

votes_file = '/Users/Indraja/Documents/Dsi/8.2.3_pca-extra-practice-lab/datasets/votes.csv'

### 2. Preprocess Data

After you've downloaded the data from the repository, go ahead and load it with Pandas and handle any preprocessing that it may need.

- Convert all columns to numeric values
- Decide what to do with NaN values
- Standardize numeric values
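
The three preprocessing steps above can be sketched on a tiny, hypothetical stand-in for the votes table (the `toy` frame and its values are made up for illustration). Filling missing votes with the most common vote on each bill is only one possible choice, shown here as an assumption rather than as the lab's answer. Dropping rows or coding abstentions as their own value are equally defensible.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the votes table; the real file has V1..V16.
toy = pd.DataFrame({
    "Class": ["republican", "democrat", "democrat", "republican"],
    "V1": ["n", "y", np.nan, "n"],
    "V2": ["y", np.nan, "y", "n"],
})
vote_cols = ["V1", "V2"]

# Map only the known labels; anything else (including NaN) stays NaN,
# so missing votes remain visible instead of being silently recoded.
numeric = toy[vote_cols].apply(lambda col: col.map({"n": 0, "y": 1}))

# Assumed imputation strategy: fill each missing vote with the most
# common vote on that bill (the column mode).
filled = numeric.fillna(numeric.mode().iloc[0])

# Standardize each column to mean 0 and unit variance.
scaled = StandardScaler().fit_transform(filled)
print(filled)
```

Mapping with an explicit dictionary keeps the "what to do with NaN" decision separate from the encoding step, which makes it easier to try a different strategy later.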

In [19]:
votes = pd.read_csv(votes_file)  # path defined in the first cell
votes.head()

Unnamed: 0.1,Unnamed: 0,Class,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
0,1,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,2,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,3,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,4,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,5,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [20]:
votes.Class.unique()

array(['republican', 'democrat'], dtype=object)

In [21]:
votes.shape
             

(435, 18)

In [25]:
votes.isnull().sum()


Unnamed: 0      0
Class           0
V1             12
V2             48
V3             11
V4             11
V5             15
V6             11
V7             14
V8             15
V9             22
V10             7
V11            21
V12            31
V13            25
V14            17
V15            28
V16           104
dtype: int64

In [37]:
# Encode each vote column as a number: 'n' -> 0, anything else -> 1.
# Caution: "anything else" includes NaN, so missing votes are coded as 1 ('y') here.
vote_cols = ['V%d' % i for i in range(1, 17)]  # 'V1' ... 'V16'
for col in vote_cols:
    votes[col] = votes[col].map(lambda x: 0 if x == 'n' else 1)

In [38]:
votes.head()

Unnamed: 0.1,Unnamed: 0,Class,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
0,1,republican,1,1,0,1,1,1,0,0,0,1,1,1,1,1,0,1
1,2,republican,1,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
2,3,democrat,1,1,1,1,1,1,0,0,0,0,1,0,1,1,0,0
3,4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,0,0,1
4,5,democrat,1,1,1,0,1,1,0,0,0,0,1,1,1,1,1,1


In [42]:
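# axis=1 drops any *column* that still contains NaN. The encoding above
# leaves no NaN behind, so the shape is unchanged.
votes = votes.dropna(axis=1)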

In [43]:
votes.shape

(435, 18)

In [45]:
# Standardize the vote columns
event_names = ['V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13','V14','V15','V16']
target_name = votes['Class']  # party labels, kept aside from the features
x = votes[event_names]

ss = StandardScaler()  # already imported in the first cell
xn = ss.fit_transform(x)

### 3. Compute eigenpairs

- Compute the covariance matrix
- Compute the eigenvectors and eigenvalues using `np.linalg`
- Sort by descending eigenvalue to find the principal components
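
As a hedged illustration of these three steps, the sketch below runs them on a small random matrix that stands in for the standardized votes (the data, seed, and `k = 2` are assumptions for the example). It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order, so they are re-sorted explicitly.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Stand-in for the standardized vote matrix: 200 rows, 4 features.
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]                              # induce some correlation
Xs = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each column

# Covariance matrix of the standardized data (columns are features).
cov = np.cov(Xs, rowvar=False)

# eigh returns ascending eigenvalues; eigenvectors are the COLUMNS of eigvecs.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort eigenpairs by descending eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Project the data onto the first k principal components.
k = 2
projected = Xs @ eigvecs[:, :k]
print(eigvals.round(3))
print(projected.shape)
```

Note that the projection multiplies the data by the eigenvectors, not by the eigenvalues. The eigenvalues only tell you how much variance each direction carries.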

In [46]:
votes.corr()

Unnamed: 0.1,Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16
Unnamed: 0,1.0,,-0.125572,-0.010157,0.04357,0.053397,0.141029,0.004862,0.058753,-0.034355,0.102588,0.044976,-0.011379,0.024822,0.019172,-0.044269,-0.004661
V1,,,,,,,,,,,,,,,,,
V2,-0.125572,,1.0,-0.042424,0.093263,0.131543,0.15449,-0.181424,-0.071236,-0.166197,-0.125711,0.165643,0.001472,0.22549,-0.009815,-0.07279,-0.031326
V3,-0.010157,,-0.042424,1.0,-0.684978,-0.619645,-0.402272,0.55772,0.670319,0.6129,0.025066,0.223751,-0.60013,-0.481583,-0.567444,0.456818,0.439262
V4,0.04357,,0.093263,-0.684978,1.0,0.71435,0.464186,-0.567661,-0.642464,-0.64102,0.04292,-0.236604,0.648536,0.594492,0.632191,-0.495958,-0.387671
V5,0.053397,,0.131543,-0.619645,0.71435,1.0,0.62083,-0.662616,-0.77792,-0.76042,0.015007,-0.115274,0.605228,0.617444,0.676383,-0.520545,-0.363937
V6,0.141029,,0.15449,-0.402272,0.464186,0.62083,1.0,-0.5043,-0.511857,-0.550247,0.086061,0.037034,0.47936,0.539304,0.569157,-0.410064,-0.243625
V7,0.004862,,-0.181424,0.55772,-0.567661,-0.662616,-0.5043,1.0,0.706362,0.651582,0.040101,0.052903,-0.490428,-0.543047,-0.488305,0.481363,0.4407
V8,0.058753,,-0.071236,0.670319,-0.642464,-0.77792,-0.511857,0.706362,1.0,0.727637,0.030406,0.152891,-0.556321,-0.546248,-0.580329,0.531149,0.449767
V9,-0.034355,,-0.166197,0.6129,-0.64102,-0.76042,-0.550247,0.651582,0.727637,1.0,0.042414,0.056362,-0.547814,-0.509639,-0.542627,0.47691,0.377178


In [None]:
pca = PCA(n_components=5)
pca.fit(xn)  # xn is the standardized vote matrix from the previous step

### 4. Understand the principal components

#### 4.A. Calculate the explained variance. 

> The explained variance of a component is its eigenvalue divided by the sum of all eigenvalues.
  **These should sum to 1!**

#### 4.B. Calculate the explained variance and the cumulative explained variance (see `np.cumsum`)

#### 4.C. Suppose we require 90% explained variance. How many eigenvectors should we keep? 

- Hint: Use the cumulative sum
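
Here is a small worked example of 4.A through 4.C. The eigenvalues are made up so that the arithmetic is easy to follow by hand; they are not the eigenvalues of the voting data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigvals = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

# 4.A: each eigenvalue's share of the total variance.
explained = eigvals / eigvals.sum()
# -> [0.5, 0.25, 0.125, 0.0625, 0.0625]

# 4.B: running total of explained variance.
cumulative = np.cumsum(explained)
# -> [0.5, 0.75, 0.875, 0.9375, 1.0]

# 4.C: index of the first component where the running total reaches 90%,
# plus one to turn the zero-based index into a count.
n_keep = int(np.argmax(cumulative >= 0.90)) + 1
print(n_keep)  # -> 4
```

Three components reach only 87.5%, so a fourth is needed to cross the 90% threshold.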

### 5. Now, repeat the process with sklearn.

[sklearn.decomposition.PCA documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
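
A minimal sketch of the sklearn version, again on synthetic data rather than the votes (the latent-factor setup and seed are assumptions for the example). Passing a float between 0 and 1 as `n_components` asks `PCA` to keep just enough components to reach that share of explained variance. The last lines check that sklearn's `explained_variance_` agrees with the eigenvalues of the covariance matrix computed by hand.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)  # arbitrary seed for reproducibility

# Synthetic data: 6 observed features driven by 2 hidden factors plus noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(300, 6))
Xs = StandardScaler().fit_transform(X)

# Fit with all components to inspect the full variance profile.
pca_full = PCA().fit(Xs)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))

# Keep just enough components to explain at least 90% of the variance.
pca_90 = PCA(n_components=0.90, svd_solver="full")
scores = pca_90.fit_transform(Xs)
print(pca_90.n_components_, scores.shape)

# Cross-check against the manual eigendecomposition (np.cov also uses n - 1).
manual = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]
print(np.allclose(pca_full.explained_variance_, manual))
```

The eigenvectors themselves are available as the rows of `pca_full.components_`. Their signs may differ from those returned by `np.linalg.eigh`, which does not change the subspace they describe.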