# Principal Component Analysis

PCA is used mainly for dimensionality reduction. As the number of predictors grows, multicollinearity among them becomes more likely. PCA transforms correlated variables into uncorrelated ones: it re-expresses the feature vectors as components along a set of orthogonal basis vectors, chosen so that the first has the maximum possible variance and each later one has the maximum variance that remains.
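
To make the "maximum possible variance" statement concrete, here is a brief sketch of the standard derivation. The symbols $S$ (the covariance matrix of the features), $w$ (a unit-length direction) and $\lambda$ (a Lagrange multiplier) are notation introduced here for illustration. The first component is the direction that solves

$$
\max_{w}\; w^{\top} S\, w \quad \text{subject to} \quad w^{\top} w = 1 .
$$

Setting the gradient of the Lagrangian $w^{\top} S\, w - \lambda\,(w^{\top} w - 1)$ to zero gives

$$
S\, w = \lambda\, w , \qquad \text{and hence} \qquad w^{\top} S\, w = \lambda .
$$

So every candidate direction is an eigenvector of $S$, and the variance along it equals its eigenvalue. The first component is the eigenvector with the largest eigenvalue. The later components are the remaining eigenvectors in order of decreasing eigenvalue, each orthogonal to the ones before it.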

The objective here is to reduce the dimension of the wine data set in sklearn using PCA.

In [1]:
from sklearn.datasets import load_wine
import pandas as pd
import numpy as np

In [2]:
# as_frame=True already returns the features as a pandas DataFrame
wine_data = load_wine(as_frame=True)['data']

In [3]:
wine_data.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


The data set contains 13 features and 178 observations, which are divided into 3 classes.

Step 1: Normalize the dataset

In [4]:
# Column-wise mean and population standard deviation (ddof=0, i.e. divide by n)
mu = wine_data.mean(axis=0)
sig = wine_data.std(axis=0, ddof=0)

In [5]:
norm_wine_data = (wine_data-mu)/sig

In [6]:
norm_wine_data.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,1.518613,-0.56225,0.232053,-1.169593,1.913905,0.808997,1.034819,-0.659563,1.224884,0.251717,0.362177,1.84792,1.013009
1,0.24629,-0.499413,-0.827996,-2.490847,0.018145,0.568648,0.733629,-0.820719,-0.544721,-0.293321,0.406051,1.113449,0.965242
2,0.196879,0.021231,1.109334,-0.268738,0.088358,0.808997,1.215533,-0.498407,2.135968,0.26902,0.318304,0.788587,1.395148
3,1.69155,-0.346811,0.487926,-0.809251,0.930918,2.491446,1.466525,-0.981875,1.032155,1.186068,-0.427544,1.184071,2.334574
4,0.2957,0.227694,1.840403,0.451946,1.281985,0.808997,0.663351,0.226796,0.401404,-0.319276,0.362177,0.449601,-0.037874
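
As a cross-check on Step 1, the sketch below compares the manual z-score with scikit-learn's `StandardScaler`, which also divides by the population standard deviation. The variable names `manual` and `scaled` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine(as_frame=True)["data"]

# Manual z-score using the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# StandardScaler centres each column and scales it by the same ddof=0 deviation.
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual.to_numpy(), scaled))        # do the two routes agree?
print(np.allclose(manual.mean(axis=0), 0.0))         # is every column centred?
print(np.allclose(manual.std(axis=0, ddof=0), 1.0))  # does every column have unit spread?
```

Either route should give the same standardized matrix, so the rest of the analysis does not depend on which one is used.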


Step 2: Compute the covariance matrix of the features

In [7]:
S = pd.DataFrame(data = np.cov(norm_wine_data,rowvar=False),columns=norm_wine_data.columns,index=norm_wine_data.columns)
S

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
alcohol,1.00565,0.09493,0.21274,-0.311988,0.272328,0.290734,0.238153,-0.15681,0.13747,0.549451,-0.072153,0.072752,0.647357
malic_acid,0.09493,1.00565,0.164972,0.29013,-0.054883,-0.337061,-0.413329,0.294632,-0.221993,0.250392,-0.564467,-0.370794,-0.193095
ash,0.21274,0.164972,1.00565,0.445872,0.288206,0.129708,0.115727,0.187283,0.009706,0.26035,-0.075089,0.003933,0.22489
alcalinity_of_ash,-0.311988,0.29013,0.445872,1.00565,-0.083804,-0.322928,-0.353355,0.363966,-0.198442,0.018838,-0.275503,-0.278332,-0.443086
magnesium,0.272328,-0.054883,0.288206,-0.083804,1.00565,0.215613,0.19689,-0.257742,0.237776,0.20108,0.055711,0.066377,0.395573
total_phenols,0.290734,-0.337061,0.129708,-0.322928,0.215613,1.00565,0.869448,-0.452477,0.615873,-0.055448,0.436132,0.703904,0.500929
flavanoids,0.238153,-0.413329,0.115727,-0.353355,0.19689,0.869448,1.00565,-0.540939,0.656379,-0.173353,0.546549,0.791641,0.496985
nonflavanoid_phenols,-0.15681,0.294632,0.187283,0.363966,-0.257742,-0.452477,-0.540939,1.00565,-0.367912,0.139843,-0.264123,-0.506113,-0.313144
proanthocyanins,0.13747,-0.221993,0.009706,-0.198442,0.237776,0.615873,0.656379,-0.367912,1.00565,-0.025393,0.297214,0.522,0.332283
color_intensity,0.549451,0.250392,0.26035,0.018838,0.20108,-0.055448,-0.173353,0.139843,-0.025393,1.00565,-0.524761,-0.431238,0.317886


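Since the data is standardized, every feature has the same variance. The diagonal shows 1.00565 (= 178/177) rather than exactly 1 because Step 1 used the population standard deviation (dividing by n), while np.cov divides by n - 1 by default.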
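
The diagonal value of 1.00565 can be reproduced with a short check. This is a sketch that rebuilds the standardized matrix from scratch; the names `S_unbiased` and `S_population` are illustrative. It also shows that, with matching denominators, the covariance of the standardized data is simply the correlation matrix of the raw data.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine(as_frame=True)["data"].to_numpy()
n = X.shape[0]

# Standardize with the population standard deviation, as in Step 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

S_unbiased = np.cov(Z, rowvar=False)            # NumPy default: divide by n - 1
S_population = np.cov(Z, rowvar=False, ddof=0)  # divide by n instead

print(np.allclose(np.diag(S_unbiased), n / (n - 1)))            # diagonal equals n/(n-1)
print(np.allclose(np.diag(S_population), 1.0))                  # diagonal equals 1
print(np.allclose(S_population, np.corrcoef(X, rowvar=False)))  # same as the correlation matrix
```

The choice of denominator only rescales every eigenvalue by the same factor, so the eigenvectors and the proportions of variance explained are unaffected.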

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix

In [8]:
# E[0] holds the eigenvalues; the columns of E[1] are the matching eigenvectors
E = np.linalg.eig(S)

In [9]:
E[0]

array([4.73243698, 2.51108093, 1.45424187, 0.92416587, 0.85804868,
       0.64528221, 0.55414147, 0.10396199, 0.35046627, 0.16972374,
       0.29051203, 0.22706428, 0.25232001])

In [10]:
B = pd.DataFrame(data=E[1],index =norm_wine_data.columns,columns=['PC '+str(i) for i in range(1,14)] )
B

,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13
alcohol,-0.144329,0.483652,-0.207383,0.017856,-0.265664,0.213539,0.056396,-0.01497,0.396139,-0.266286,-0.508619,-0.225917,0.211605
malic_acid,0.245188,0.224931,0.089013,-0.53689,0.035214,0.536814,-0.420524,-0.025964,0.065827,0.121696,0.075283,0.076486,-0.30908
ash,0.002051,0.316069,0.626224,0.214176,-0.143025,0.154475,0.149171,0.141218,-0.17026,-0.049622,0.307694,-0.498691,-0.027125
alcalinity_of_ash,0.23932,-0.010591,0.61208,-0.060859,0.066103,-0.100825,0.286969,-0.091683,0.42797,-0.055743,-0.200449,0.479314,0.052799
magnesium,-0.141992,0.299634,0.130757,0.351797,0.727049,0.038144,-0.322883,-0.056774,-0.156361,0.06222,-0.271403,0.071289,0.06787
total_phenols,-0.394661,0.06504,0.146179,-0.198068,-0.149318,-0.084122,0.027925,0.463908,-0.405934,-0.303882,-0.286035,0.304341,-0.320131
flavanoids,-0.422934,-0.00336,0.150682,-0.152295,-0.109026,-0.01892,0.060685,-0.832257,-0.187245,-0.042899,-0.049578,-0.025694,-0.163151
nonflavanoid_phenols,0.298533,0.028779,0.170368,0.203301,-0.500703,-0.258594,-0.595447,-0.11404,-0.233285,0.042352,-0.195501,0.116896,0.215535
proanthocyanins,-0.313429,0.039302,0.149454,-0.399057,0.13686,-0.533795,-0.372139,0.116917,0.368227,-0.095553,0.209145,-0.237363,0.134184
color_intensity,0.088617,0.529996,-0.137306,-0.065926,-0.076437,-0.418644,0.227712,0.011993,-0.033797,0.604222,-0.056218,0.031839,-0.290775


Checking the eigenvalues shows that they are not in descending order, which makes the components harder to interpret. The next step rearranges the eigenvectors in order of decreasing eigenvalue.

In [11]:
order_E = list(reversed(np.argsort(E[0])))

In [14]:
eig_val = np.sort(E[0])[::-1]  # eigenvalues as an array, largest first
# Reorder the eigenvector columns to match, keeping the PC 1..PC 13 labels
eig_vec = B.iloc[:, order_E].copy()
eig_vec.columns = B.columns

eig_vec

,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13
alcohol,-0.144329,0.483652,-0.207383,0.017856,-0.265664,0.213539,0.056396,0.396139,-0.508619,0.211605,-0.225917,-0.266286,-0.01497
malic_acid,0.245188,0.224931,0.089013,-0.53689,0.035214,0.536814,-0.420524,0.065827,0.075283,-0.30908,0.076486,0.121696,-0.025964
ash,0.002051,0.316069,0.626224,0.214176,-0.143025,0.154475,0.149171,-0.17026,0.307694,-0.027125,-0.498691,-0.049622,0.141218
alcalinity_of_ash,0.23932,-0.010591,0.61208,-0.060859,0.066103,-0.100825,0.286969,0.42797,-0.200449,0.052799,0.479314,-0.055743,-0.091683
magnesium,-0.141992,0.299634,0.130757,0.351797,0.727049,0.038144,-0.322883,-0.156361,-0.271403,0.06787,0.071289,0.06222,-0.056774
total_phenols,-0.394661,0.06504,0.146179,-0.198068,-0.149318,-0.084122,0.027925,-0.405934,-0.286035,-0.320131,0.304341,-0.303882,0.463908
flavanoids,-0.422934,-0.00336,0.150682,-0.152295,-0.109026,-0.01892,0.060685,-0.187245,-0.049578,-0.163151,-0.025694,-0.042899,-0.832257
nonflavanoid_phenols,0.298533,0.028779,0.170368,0.203301,-0.500703,-0.258594,-0.595447,-0.233285,-0.195501,0.215535,0.116896,0.042352,-0.11404
proanthocyanins,-0.313429,0.039302,0.149454,-0.399057,0.13686,-0.533795,-0.372139,0.368227,0.209145,0.134184,-0.237363,-0.095553,0.116917
color_intensity,0.088617,0.529996,-0.137306,-0.065926,-0.076437,-0.418644,0.227712,-0.033797,-0.056218,-0.290775,0.031839,0.604222,0.011993
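
A possible alternative to sorting by hand is `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order. The sketch below is not the notebook's method. It rebuilds the covariance matrix itself, and the names `vals`, `vecs` and `vals_eig` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine(as_frame=True)["data"].to_numpy()
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
S = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order, with eigenvectors as columns.
vals, vecs = np.linalg.eigh(S)

# Reverse both so the largest eigenvalue and its eigenvector come first.
vals = vals[::-1]
vecs = vecs[:, ::-1]

# The spectrum should match np.linalg.eig once that output is sorted.
vals_eig = np.sort(np.linalg.eig(S)[0].real)[::-1]
print(np.allclose(vals, vals_eig))
```

Individual eigenvectors may differ in sign between the two routines. That is expected, because an eigenvector multiplied by -1 is still an eigenvector.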


The table above gives the loadings of each feature on each component. Each principal component is a linear combination of the original features, with the loadings as weights. The columns are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue. Because the covariance matrix is symmetric and its eigenvalues here are distinct, the eigenvectors are mutually orthogonal, and NumPy returns them scaled to unit length, so together they are orthonormal.

In [15]:
np.dot(eig_vec['PC 1'],eig_vec['PC 1'])

1.0

In [16]:
np.round(np.dot(eig_vec['PC 1'],eig_vec['PC 5']),5)

0.0
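
The two spot checks above can be extended to every pair of eigenvectors at once: if the columns are orthonormal, the matrix of all pairwise dot products is the identity. The sketch below recomputes the eigenvectors with `np.linalg.eigh` instead of reusing `eig_vec`, and the names `V` and `gram` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine(as_frame=True)["data"].to_numpy()
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
S = np.cov(Z, rowvar=False)

_, V = np.linalg.eigh(S)  # columns of V are unit-length eigenvectors

# Entry (i, j) of V^T V is the dot product of eigenvectors i and j.
gram = V.T @ V
print(np.allclose(gram, np.eye(V.shape[1])))
```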

The eigenvectors can be used to transform the observations into coordinates in the eigenbasis. The variance of each component along its basis vector equals the corresponding eigenvalue, and the covariance between different components is zero.

In [17]:
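# Project every observation onto the eigenvectors: rows of z are components, columns are observations
z = np.matmul(eig_vec.transpose().values, norm_wine_data.transpose().values)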

In [18]:
z1 = pd.DataFrame(data=z.transpose(),index=norm_wine_data.index,columns=eig_vec.columns)

In [19]:
z1.head()

,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13
0,-3.316751,1.443463,-0.165739,0.215631,0.693043,0.22388,-0.596427,-0.065139,-0.641443,1.020956,-0.451563,0.54081,0.066239
1,-2.209465,-0.333393,-2.026457,0.291358,-0.257655,0.92712,-0.053776,-1.024416,0.308847,0.159701,-0.142657,0.388238,-0.003637
2,-2.51674,1.031151,0.982819,-0.724902,-0.251033,-0.549276,-0.424205,0.344216,1.177834,0.113361,-0.286673,0.000584,-0.021717
3,-3.757066,2.756372,-0.176192,-0.567983,-0.311842,-0.114431,0.383337,-0.643593,-0.052544,0.239413,0.759584,-0.24202,0.369484
4,-1.008908,0.869831,2.026688,0.409766,0.298458,0.40652,-0.444074,-0.4167,-0.326819,-0.078366,-0.525945,-0.216664,0.079364


This is the data set with every observation expressed in principal components. Next, check its covariance matrix.

In [20]:
pd.DataFrame(np.round(np.cov(z1,rowvar=False),2))

,0,1,2,3,4,5,6,7,8,9,10,11,12
0,4.73,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0
1,0.0,2.51,0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,0.0
2,-0.0,0.0,1.45,-0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,0.0
3,-0.0,0.0,-0.0,0.92,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0
4,0.0,0.0,0.0,0.0,0.86,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0
5,-0.0,-0.0,0.0,-0.0,-0.0,0.65,0.0,-0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.55,-0.0,0.0,-0.0,0.0,0.0,-0.0
7,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.35,-0.0,-0.0,-0.0,0.0,0.0
8,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.29,-0.0,0.0,0.0,0.0
9,-0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,0.25,-0.0,0.0,0.0


The components are uncorrelated, and each has a variance equal to its eigenvalue. These components can now be used as features to predict outcomes.
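
The manual steps can be compared with scikit-learn's `PCA` class. The sketch below assumes the same standardization as Step 1 and rebuilds the manual projection with `np.linalg.eigh`. The names `scores_manual` and `scores_sklearn` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine(as_frame=True)["data"]
Z = StandardScaler().fit_transform(X)

# Manual route: eigendecomposition of the covariance matrix, largest eigenvalue first.
S = np.cov(Z, rowvar=False)
vals, vecs = np.linalg.eigh(S)
vals, vecs = vals[::-1], vecs[:, ::-1]
scores_manual = Z @ vecs

# Library route: keep all 13 components so the two results are comparable.
pca = PCA(n_components=Z.shape[1])
scores_sklearn = pca.fit_transform(Z)

# The explained variances should equal the eigenvalues.
print(np.allclose(pca.explained_variance_, vals))
# Eigenvectors are only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(scores_sklearn), np.abs(scores_manual), atol=1e-6))
```

If both checks print `True`, the hand-built components and the library's components describe the same projection, apart from possible sign flips of individual columns.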

The components are ordered by the share of the original variance they explain. The trailing components, which explain little variance, can therefore be dropped: the number of dimensions falls while little information is lost.

In [21]:
p1 = pd.Series(eig_val/sum(eig_val),index= z1.columns)
pd.DataFrame(data=(np.round(p1,2),np.round(np.cumsum(p1),2)),index=['Variance Explained','Cumulative'])

,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13
Variance Explained,0.36,0.19,0.11,0.07,0.07,0.05,0.04,0.03,0.02,0.02,0.02,0.01,0.01
Cumulative,0.36,0.55,0.67,0.74,0.8,0.85,0.89,0.92,0.94,0.96,0.98,0.99,1.0
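
The cumulative row can be turned into a simple selection rule: keep the smallest number of components whose cumulative share reaches a chosen threshold. The sketch below uses a 90% threshold purely as an example. The names `threshold` and `k` are illustrative, and the second route relies on scikit-learn's documented behaviour for a fractional `n_components` with `svd_solver="full"`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Z = StandardScaler().fit_transform(load_wine(as_frame=True)["data"])

threshold = 0.90  # example target share of variance, not a recommendation

# Route 1: find the first position where the cumulative ratio reaches the threshold.
cumulative = np.cumsum(PCA().fit(Z).explained_variance_ratio_)
k = int(np.argmax(cumulative >= threshold)) + 1

# Route 2: pass the threshold as a fraction and let PCA choose the count.
pca = PCA(n_components=threshold, svd_solver="full").fit(Z)

print(k, pca.n_components_)
```

For this data set both routes select 8 components, since the first 7 reach only about 89% of the variance.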


From the table above, the first 7 components explain about 89% of the variation in the data set, and the first 8 explain 92%. The number of components to keep depends on the context. Another approach is to choose the number that minimizes prediction error in a regression or classification setting. Using fewer predictors may improve out-of-sample accuracy because of the bias-variance trade-off.
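
Choosing the number of components by prediction error can be sketched with cross-validation. Everything in the example below is an assumption made for illustration, not part of the original analysis: the logistic regression classifier, the 5-fold split and the step names `scale`, `pca` and `clf`. Scaling and PCA sit inside the pipeline so that they are refit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Standardize, project onto the leading components, then classify.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Try every possible number of components and score each by 5-fold accuracy.
search = GridSearchCV(pipe, {"pca__n_components": list(range(1, 14))}, cv=5)
search.fit(X, y)

best_k = search.best_params_["pca__n_components"]
print(best_k, round(search.best_score_, 3))
```

The selected count depends on the model and the scoring metric, so it need not match the count suggested by a variance threshold.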