## Principal Components

Algebraically, principal components (PCs) are particular linear combinations of the $p$ random variables $X_1$, $X_2$, ... , $X_p$. Geometrically, these linear combinations represent a new coordinate system obtained by rotating the original system.

- PCs depend solely on the covariance matrix $\Sigma$.
- No multivariate normal assumption is required. However, under normality the PCs have a useful interpretation in terms of constant-density ellipsoids.

Steps:

1. Calculate the covariance matrix $\Sigma$.
2. Compute the eigendecomposition of $\Sigma$ to get its eigenvalues and eigenvectors.
3. Compute $Y=AX$, where $Y$ holds the scores and $A$ holds the loadings (the rows of $A$ are the eigenvectors).
4. The PCs are the uncorrelated linear combinations $Y_1$, $Y_2$, ... , $Y_p$.
5. $Y_1$ is the first PC: it maximizes $Var(a_1'X)$ s.t. $a_1'a_1=1$.
6. $Y_2$ is the second PC: it maximizes $Var(a_2'X)$ s.t. $a_2'a_2=1$ and $Cov(a_1'X, a_2'X)=0$, and so on.
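
The steps above can be sketched end to end on synthetic data. This is only a minimal illustration, not the notebook's own code: the random data, the seed, and the names `X`, `S`, `A`, and `Y` are assumptions, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

# Synthetic data (assumption): 500 observations of p = 3 correlated variables.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.3, 0.2, 0.5]])
X = rng.normal(size=(500, 3)) @ mix

# Step 1: sample covariance matrix (columns are variables).
S = np.cov(X, rowvar=False)

# Step 2: eigendecomposition, then reorder so the largest eigenvalue comes first.
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Step 3: loadings A (rows are eigenvectors) and scores Y = A X,
# computed on mean-centered data, one row of Y per observation.
A = eigvec.T
Y = (X - X.mean(axis=0)) @ A.T

# Steps 4-6: the scores are uncorrelated and Var(Y_i) equals the i-th eigenvalue,
# so the covariance matrix of Y is diagonal with the eigenvalues on the diagonal.
cov_Y = np.cov(Y, rowvar=False)
print(np.round(cov_Y, 6))
print(np.round(eigval, 6))
```

The diagonal of `cov_Y` should match `eigval`, and the off-diagonal entries should be zero up to floating-point error.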

If $X = (X_1, X_2, ..., X_p)'$ has covariance matrix $\Sigma$ with eigenvalue-eigenvector pairs ($\lambda_1, e_1$), ($\lambda_2, e_2$), ..., ($\lambda_p, e_p$), where $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p \ge 0$, then the $i$th PC is $Y_i = e_i'X = e_{i1}X_1 + e_{i2}X_2 + ... + e_{ip}X_p$
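
As a short check on why the eigenvectors give the PCs, assume they are normalized so that $e_i'e_i = 1$ and $e_i'e_k = 0$ for $i \neq k$ (always possible for a symmetric $\Sigma$). Then, using $\Sigma e_i = \lambda_i e_i$, each PC has variance equal to its eigenvalue and distinct PCs are uncorrelated:

$$
Var(Y_i) = e_i'\Sigma e_i = \lambda_i e_i'e_i = \lambda_i, \qquad
Cov(Y_i, Y_k) = e_i'\Sigma e_k = \lambda_k e_i'e_k = 0 \quad (i \neq k)
$$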

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm
import sklearn.metrics as metrics

import warnings
warnings.filterwarnings("ignore")

In [27]:
df = pd.read_csv("datasets/lr_data.csv")
df = df.drop(columns="Unnamed: 0")
df.head()

Unnamed: 0,is_booked,tmv,demand_supply_ratio,price,category_grouped,month
0,0,14569,1.902318,30.0,1,5
1,1,4201,14.622831,39.0,0,5
2,1,5724,8.659708,24.0,1,4
3,1,39102,13.57039,121.0,2,12
4,0,9666,1.297453,42.0,1,8


In [28]:
# 1. Get Covariance matrix
cov_mat = np.cov(df[['tmv', 'demand_supply_ratio', 'price', 'category_grouped', 'month']], rowvar=False)
pd.DataFrame(cov_mat)

Unnamed: 0,0,1,2,3,4
0,87247040.0,-494.698385,162660.955028,2604.270513,3129.826554
1,-494.6984,15.613508,-4.173764,-0.900113,-1.55765
2,162661.0,-4.173764,492.550758,10.185813,9.327519
3,2604.271,-0.900113,10.185813,0.968767,0.476619
4,3129.827,-1.55765,9.327519,0.476619,12.828027


In [29]:
# 2. Get eigen decomposition
eigval, eigvec = np.linalg.eig(cov_mat)
print(eigval)
pd.DataFrame(eigvec) # Each column is a principal component direction

[8.72473410e+07 1.89572385e+02 6.94428867e-01 1.62114050e+01
 1.20287938e+01]


Unnamed: 0,0,1,2,3,4
0,0.999998,-0.001865,-2.1e-05,2.9e-05,-1e-05
1,-6e-06,-0.018997,-0.051566,-0.921015,0.385632
2,0.001864,0.999216,0.027001,-0.026718,-0.010978
3,3e-05,0.028359,-0.998153,0.053555,-0.004167
4,3.6e-05,0.019958,0.017368,0.384903,0.922578
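
The eigenvalues printed above are not in descending order (the third one, about 0.69, is the smallest), and `np.linalg.eig` does not guarantee any ordering. A possible way to reorder the pairs is sketched below; the helper name `sort_eigenpairs` and the small matrix `S` are hypothetical and used only for illustration.

```python
import numpy as np

def sort_eigenpairs(eigval, eigvec):
    """Return eigenvalues in descending order with matching eigenvector columns."""
    order = np.argsort(eigval)[::-1]
    return eigval[order], eigvec[:, order]

# Small symmetric matrix used only for illustration (assumption).
S = np.array([[2.0, 0.5, 0.0],
              [0.5, 9.0, 0.3],
              [0.0, 0.3, 4.0]])

eigval, eigvec = np.linalg.eig(S)
eigval, eigvec = sort_eigenpairs(eigval, eigvec)

print(eigval)  # largest eigenvalue first

# Each reordered pair still satisfies S e_i = lambda_i e_i.
print(np.allclose(S @ eigvec, eigvec * eigval))
```

Applied to `eigval` and `eigvec` from the cell above, the same helper would make the first column the direction of largest variance.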


In [38]:
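# 3. Get PC scores (despite the variable names, these are scores, not loadings)
# .as_matrix() was removed in pandas 1.0; .to_numpy() replaces it.
# The data are not mean-centered here, so each score is offset by a constant.
features = ['tmv', 'demand_supply_ratio', 'price', 'category_grouped', 'month']
pc_loadings = np.matmul(np.transpose(eigvec), np.transpose(df[features].to_numpy()))
df_loadings = pd.DataFrame(np.transpose(pc_loadings))  # keep all rows; head() is for display only
df_loadings.columns = ['pc1', 'pc2', 'pc3', 'pc4', 'pc5']
df_loadings.head()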

Unnamed: 0,pc1,pc2,pc3,pc4,pc5
0,14568.97045,-0.658583,29.79131,-29.423056,6.079428
1,4200.96454,-0.384526,23.441762,-38.408733,5.732779
2,5723.973372,-0.813036,19.901994,-23.502356,4.87021
3,39101.904034,-3.933448,89.542453,-119.167438,15.61579
4,9665.979816,-0.081187,20.336979,-41.576901,8.867616


**Properties of PCA**

In [43]:
# print(np.var(df_loadings['pc1']))
# print(np.cov(df_loadings['pc1']))

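# Total variance is preserved: the trace of the covariance matrix
# equals the sum of the eigenvalues.
print(sum(np.diag(cov_mat)))
print(sum(eigval))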

87247559.49757595
87247559.49757594
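
Because the total variance equals the sum of the eigenvalues, each eigenvalue divided by that sum is the proportion of variance explained by its PC. The sketch below computes it from the eigenvalues printed earlier, copied by hand at the printed precision; the names `explained` and `cumulative` are assumptions.

```python
import numpy as np

# Eigenvalues copied from the printed output above (rounded as displayed).
eigval = np.array([8.72473410e+07, 1.89572385e+02, 6.94428867e-01,
                   1.62114050e+01, 1.20287938e+01])

# Sort descending, then divide by the total variance (the sum of eigenvalues).
eigval_sorted = np.sort(eigval)[::-1]
explained = eigval_sorted / eigval_sorted.sum()
cumulative = np.cumsum(explained)

for i, (ev, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {ev:.8f} (cumulative {cum:.8f})")
```

With these values the first component accounts for more than 99.999% of the total variance, which reflects the much larger scale of `tmv` rather than any real structure in the data.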


**Although PCA itself does not require normalization, the variables here are on very different scales (the variance of `tmv` dominates the total), so I should standardize them before running PCA.**
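
A minimal sketch of what standardizing changes, on synthetic data rather than the notebook's dataset: the two variables, their scales, and the seed are assumptions. Standardizing each column makes the covariance matrix of the result equal to the correlation matrix of the original data, so no single variable dominates purely because of its units.

```python
import numpy as np

# Synthetic data (assumption): two related variables on very different scales.
rng = np.random.default_rng(0)
big = rng.normal(loc=10_000, scale=9_000, size=1_000)
small = 0.0002 * big + rng.normal(scale=3.0, size=1_000)
X = np.column_stack([big, small])

# Standardize each column: subtract the mean, divide by the sample std (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance matrix of standardized data is the correlation matrix of X.
cov_Z = np.cov(Z, rowvar=False)
corr_X = np.corrcoef(X, rowvar=False)
print(np.allclose(cov_Z, corr_X))

# Share of total variance carried by the leading eigenvalue, before and after.
raw_val = np.linalg.eigvalsh(np.cov(X, rowvar=False))
std_val = np.linalg.eigvalsh(cov_Z)
print(raw_val.max() / raw_val.sum())
print(std_val.max() / std_val.sum())
```

On the raw data the leading share is close to 1 simply because of the large-scale column; after standardizing it drops to a value that reflects the correlation between the two variables.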