1. Using `food-texture.csv`, build a PCA model with 5 components, then determine the variables that should be kept, with a 95% tolerance.
2. Using `room-temperature.csv`, generate a PCA model with 4 components, and provide the covariance matrix and the explained variance. Based on the explained variance, what data columns should we keep if our tolerance is 90%? Refit the model with the components we need.
3. Using `silicon-wafer-thickness.csv`, build a PCA model using all the components of the data, and determine the number of components needed to explain 90% of the variance and 95% of the variance. Refit the model for 90% and 95%.

In [13]:
# display is exposed by IPython.display; importing it from IPython.core.display is deprecated
from IPython.display import display
import numpy as np
from numpy import linalg as LA
import math
import pandas as pd
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.preprocessing import StandardScaler
import csv

## Problem 1 :

In [14]:
food = pd.read_csv("food-texture.csv")

display(food.tail())

Unnamed: 0.1,Unnamed: 0,Oil,Density,Crispy,Fracture,Hardness
45,B907,16.6,2865,11,25,120
46,B911,16.4,2995,12,20,165
47,B923,15.1,2925,10,29,118
48,B971,21.1,2700,13,16,116
49,B998,16.3,2845,10,26,75


In [24]:
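# .ix was removed from pandas; .iloc does the same positional slice (columns 1..5: Oil to Hardness)
data_set = food.iloc[:, 1:6].values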
#display(data_set)
data_set_std = StandardScaler().fit_transform(data_set)
sklearn_pca = sklearnPCA(n_components=5)
sklearn_pca.fit(data_set_std)

# covariance matrix
covariance_matrix = sklearn_pca.get_covariance()
#print("Covariance matrix:\n%s\n" %covariance_matrix)
# eigenvectors
eigenvectors = sklearn_pca.components_.T
#print("Eigenvectors:\n%s\n" %eigenvectors)
# eigenvalues (already sorted highest to lowest)
eigenvalues_sum = sum(sklearn_pca.explained_variance_)
explained_variance = sklearn_pca.explained_variance_/eigenvalues_sum
print("Explained Variance:\n%s\n" %explained_variance)
print(sum(explained_variance[0:4]))

# explained variance shows we really only need the eigenvectors associated with the first four eigenvalues
sklearn_pca = sklearnPCA(n_components=4)
sklearn_pca.fit(data_set_std)
# apply the feature vector to the data set
pca_data_set = pd.DataFrame(sklearn_pca.transform(data_set_std))
print("Dimensionally-reduced data set (tail):")
display(pca_data_set.tail())

Explained Variance:
[ 0.60624263  0.25914115  0.06200987  0.04838402  0.02422233]

0.975777670877
Dimensionally-reduced data set (tail):


Unnamed: 0,0,1,2,3
45,-0.787704,-0.278429,0.087182,0.393916
46,-0.357685,1.636842,0.132277,-0.364841
47,-2.141993,-0.080405,-0.06128,0.739172
48,2.584544,-1.325324,0.778196,-0.615027
49,-1.417243,-1.594231,-0.553364,0.322129


To explain at least 95% of the variance, we need to keep the first four principal components: together they explain about 97.6%, while the first three reach only about 92.7%.
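The component count can also be read off programmatically rather than by summing slices by hand. The following is a minimal sketch, assuming the explained-variance ratios printed in the output above; the names `tolerance`, `cumulative`, and `n_keep` are illustrative and not part of the original notebook.

```python
import numpy as np

# Explained-variance ratios copied from the Problem 1 output.
explained_variance = np.array(
    [0.60624263, 0.25914115, 0.06200987, 0.04838402, 0.02422233]
)
tolerance = 0.95

# Running total of variance explained by the first k components.
cumulative = np.cumsum(explained_variance)

# Index of the first cumulative value that reaches the tolerance,
# plus one to turn a zero-based index into a component count.
n_keep = int(np.searchsorted(cumulative, tolerance) + 1)

print(cumulative)
print(n_keep)
```

Because the ratios are already sorted from largest to smallest, the cumulative sum is non-decreasing, which is what makes `np.searchsorted` applicable here.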

## Problem 2 :

In [25]:
temp = pd.read_csv("room-temperature.csv")

display(temp.tail())

Unnamed: 0,Date,FrontLeft,FrontRight,BackLeft,BackRight
139,4/14/2010 9:00,295.8,294.6,294.8,295.7
140,4/14/2010 9:30,294.8,295.5,294.7,295.6
141,4/14/2010 10:00,295.9,295.8,295.5,295.2
142,4/14/2010 10:30,295.1,296.2,296.0,296.1
143,4/14/2010 11:00,296.2,297.2,296.6,296.0


In [28]:
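# .ix was removed from pandas; .iloc selects the four temperature columns by position (Date is skipped)
data_set = temp.iloc[:, 1:5].values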
#display(data_set)
data_set_std = StandardScaler().fit_transform(data_set)
sklearn_pca = sklearnPCA(n_components=4)
sklearn_pca.fit(data_set_std)

# covariance matrix
covariance_matrix = sklearn_pca.get_covariance()
#print("Covariance matrix:\n%s\n" %covariance_matrix)
# eigenvectors
eigenvectors = sklearn_pca.components_.T
#print("Eigenvectors:\n%s\n" %eigenvectors)
# eigenvalues (already sorted highest to lowest)
eigenvalues_sum = sum(sklearn_pca.explained_variance_)
explained_variance = sklearn_pca.explained_variance_/eigenvalues_sum
print("Explained Variance:\n%s\n" %explained_variance)
print(sum(explained_variance[0:2]))

# explained variance shows we really only need the eigenvectors associated with the first two eigenvalues
sklearn_pca = sklearnPCA(n_components=2)
sklearn_pca.fit(data_set_std)
# apply the feature vector to the data set
pca_data_set = pd.DataFrame(sklearn_pca.transform(data_set_std))
print("Dimensionally-reduced data set (tail):")
display(pca_data_set.tail())

Explained Variance:
[ 0.7660801   0.16896714  0.0366861   0.02826666]

0.935047236322
Dimensionally-reduced data set (tail):


Unnamed: 0,0,1
139,-0.522176,-0.301549
140,-0.42437,-0.287943
141,-0.967853,-0.56429
142,-1.282329,0.002331
143,-2.062368,-0.453206


With a 90% tolerance, we need to keep the first two components, which together explain about 93.5% of the variance.
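To make the link between the covariance matrix and the explained variance concrete, the sketch below computes both with NumPy alone. It is an illustration under stated assumptions, not the notebook's method: the data is a synthetic stand-in for four correlated sensors (the real `room-temperature.csv` is not read here), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for four correlated temperature columns (assumption).
base = rng.normal(size=(144, 1))                 # signal shared by all columns
data = base + 0.5 * rng.normal(size=(144, 4))    # plus column-specific noise

# Standardize each column to zero mean and unit variance.
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Sample covariance matrix of the standardized columns (4 x 4).
cov = np.cov(data_std, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each eigenvalue's share of the total is the explained-variance ratio.
explained_variance = eigenvalues / eigenvalues.sum()

print(cov)
print(explained_variance)
print(np.cumsum(explained_variance))
```

The ratios obtained this way play the same role as `explained_variance` in the cell above: each one is an eigenvalue of the covariance matrix divided by the sum of all eigenvalues.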

## Problem 3 :

In [29]:
wafer = pd.read_csv("silicon-wafer-thickness.csv")

display(wafer.tail())

Unnamed: 0,G1,G2,G3,G4,G5,G6,G7,G8,G9
179,0.535,0.524,0.649,0.475,0.486,0.657,0.941,0.527,0.494
180,0.041,0.056,0.194,0.234,-0.003,-0.31,0.267,-0.449,-0.432
181,0.507,0.563,0.539,0.634,0.471,0.578,0.686,0.763,0.576
182,-0.033,-0.025,0.118,0.148,-0.076,-0.403,-0.345,0.084,-0.473
183,0.47,0.52,0.526,0.553,0.65,0.92,0.466,0.608,0.55


In [35]:
# .ix was removed from pandas; .iloc takes the same positional column slice
data_set = wafer.iloc[:, 0:10].values
#display(data_set)
data_set_std = StandardScaler().fit_transform(data_set)
sklearn_pca = sklearnPCA(n_components=9)
sklearn_pca.fit(data_set_std)

# covariance matrix
covariance_matrix = sklearn_pca.get_covariance()
#print("Covariance matrix:\n%s\n" %covariance_matrix)
# eigenvectors
eigenvectors = sklearn_pca.components_.T
#print("Eigenvectors:\n%s\n" %eigenvectors)
# eigenvalues (already sorted highest to lowest)
eigenvalues_sum = sum(sklearn_pca.explained_variance_)
explained_variance = sklearn_pca.explained_variance_/eigenvalues_sum
print("Explained Variance:\n%s\n" %explained_variance)
print(sum(explained_variance[0:1]))

# explained variance shows we really only need the eigenvectors associated with the first eigenvalue for 90% 
sklearn_pca = sklearnPCA(n_components=1)
sklearn_pca.fit(data_set_std)
# apply the feature vector to the data set
pca_data_set = pd.DataFrame(sklearn_pca.transform(data_set_std))
print("Dimensionally-reduced data set (tail):")
display(pca_data_set.tail())

Explained Variance:
[ 0.90837038  0.02822623  0.02240953  0.01960468  0.00768546  0.00624863
  0.0033764   0.0029256   0.00115308]

0.908370382982
Dimensionally-reduced data set (tail):


Unnamed: 0,0
179,1.250929
180,-1.063471
181,1.257413
182,-1.347982
183,1.222109


Only one component is needed to explain 90% of the variance; on its own it explains about 90.8%.

In [36]:
# .ix was removed from pandas; .iloc takes the same positional column slice
data_set = wafer.iloc[:, 0:10].values
#display(data_set)
data_set_std = StandardScaler().fit_transform(data_set)
sklearn_pca = sklearnPCA(n_components=9)
sklearn_pca.fit(data_set_std)

# covariance matrix
covariance_matrix = sklearn_pca.get_covariance()
#print("Covariance matrix:\n%s\n" %covariance_matrix)
# eigenvectors
eigenvectors = sklearn_pca.components_.T
#print("Eigenvectors:\n%s\n" %eigenvectors)
# eigenvalues (already sorted highest to lowest)
eigenvalues_sum = sum(sklearn_pca.explained_variance_)
explained_variance = sklearn_pca.explained_variance_/eigenvalues_sum
print("Explained Variance:\n%s\n" %explained_variance)
print(sum(explained_variance[0:3]))

# explained variance shows we really only need the eigenvectors associated with the first three eigenvalues for 95% 
sklearn_pca = sklearnPCA(n_components=3)
sklearn_pca.fit(data_set_std)
# apply the feature vector to the data set
pca_data_set = pd.DataFrame(sklearn_pca.transform(data_set_std))
print("Dimensionally-reduced data set (tail):")
display(pca_data_set.tail())

Explained Variance:
[ 0.90837038  0.02822623  0.02240953  0.01960468  0.00768546  0.00624863
  0.0033764   0.0029256   0.00115308]

0.95900614233
Dimensionally-reduced data set (tail):


Unnamed: 0,0,1,2
179,1.250929,-0.385745,0.059087
180,-1.063471,0.230204,0.302837
181,1.257413,-0.298749,-0.152288
182,-1.347982,0.292728,-0.153568
183,1.222109,-0.453504,0.183556


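Three components are needed to explain 95% of the variance; together they explain about 95.9%, while the first two reach only about 93.7%.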
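The two nearly identical cells above could be folded into one loop by letting scikit-learn choose the component count. The sketch below relies on `PCA` accepting a float between 0 and 1 for `n_components` (with `svd_solver="full"`), in which case it keeps just enough components to explain that fraction of the variance. The data is a synthetic stand-in with 184 rows and 9 correlated columns (an assumption made so the example is self-contained), so the counts it prints will not necessarily match the wafer results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the wafer data (assumption: 184 rows, 9 columns).
common = rng.normal(size=(184, 1))                  # signal shared by all columns
data = common + 0.3 * rng.normal(size=(184, 9))     # plus column-specific noise
data_std = StandardScaler().fit_transform(data)

results = {}
for tolerance in (0.90, 0.95):
    # A float in (0, 1) asks PCA to keep the smallest number of components
    # whose cumulative explained variance exceeds that fraction.
    pca = PCA(n_components=tolerance, svd_solver="full")
    scores = pca.fit_transform(data_std)
    results[tolerance] = (pca.n_components_, pca.explained_variance_ratio_.sum())
    print(tolerance, pca.n_components_, scores.shape)
```

With the real data, replacing `data` by the values read from `silicon-wafer-thickness.csv` would be expected to reproduce the one-component and three-component fits found above, though that has not been run here.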