## README
To run the file, upload the data.csv file. In my case, I have uploaded it to sample_data folder in session storage. You can upload it elsewhere but remember to change the path mentioned in code afterwards.

# Principal Component Analysis
Principal Component Analysis (PCA) is an algorithm that reduces the dimensionality of a data set to a lower-dimensional linear subspace by linear projection
in such a way that the reconstruction error made by the linear projection is as
low as possible.

In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [4]:
data = pd.read_csv('data.csv')
data.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,3,6,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,10,3,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,10,7,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,3,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,3,1,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [5]:
# X should be transposed to convert rows to features and columns to data points
# 12 x 517 size matrix where there are 12 features and 517 data points
X = np.transpose(data.iloc[:, :-1].values)

label = data.iloc[:, -1].values

# X = StandardScaler().fit_transform(X) #standardize the data
# X.shape

In [6]:
mean_vector = np.mean(X, axis=1)
mean_matrix = np.tile(mean_vector, (X.shape[1],1))
X = X - mean_matrix.T

In [7]:
Cov = (np.dot(X, X.T))/(X.shape[1])
# Cov = np.cov(X)

n = Cov.shape[0] 

q = 5
# Cov.shape

# Reconstruction

In [8]:
# Corresponding Eigenvalues and Eigenvectors for Cov.
eigenvalue, eigenvector = np.linalg.eig(Cov)

eigval_eigvec = list(zip(*sorted(zip(eigenvalue,eigenvector), reverse=True)))

principal_components = np.asarray(list(eigval_eigvec[1][:])).T

P = principal_components[:q] # q x 12 matrix with q leading principal components in each row.

Y = np.dot(P, X) 
# Y.shape

In [9]:
variance_loss = sum(eigval_eigvec[0][q:]) 

total_variance = sum(eigval_eigvec[0][:]) 

print("PCA Loss : ", (variance_loss/total_variance)*100, "%") 

PCA Loss :  0.037985574963925295 %


In [10]:
from sklearn.linear_model import LinearRegression
import datetime

start_time = datetime.datetime.now()
reg_original_data = LinearRegression().fit(X.T,label) 

print("The R2 coefficient is : ", reg_original_data.score(X.T, label)) 

end_time = datetime.datetime.now()
print("Total elapsed time : ", end_time - start_time)

The R2 coefficient is :  0.02535067134925728
Total elapsed time :  0:00:00.007069


In [11]:
start_time = datetime.datetime.now()
reg_pca_data = LinearRegression().fit(Y.T, label) 

print("The R2 coefficient is : ", reg_pca_data.score(Y.T, label))  

end_time = datetime.datetime.now()
print("Total elapsed time : ", end_time - start_time)

The R2 coefficient is :  0.012739904147630488
Total elapsed time :  0:00:00.002679


# Observation
It can be seen that even though the reconstructed data points has a lower R2 coefficient, it saves some time
during the modelling of the linear regression. The time saved might be small for this example, but it makes a lot
of difference when it comes to higher dimensional dataset and more complex modelling techniques. Linear Regression
was just used for simplicity sake to show the working principles of PCA. That is, even though the dimension
from the original data points has been reduced drastically, the reconstructed data points can still provide
linear regression enough information to model the relationships between the dependent and independent variables.

In [12]:
from sklearn.decomposition import PCA
pca = PCA(n_components=q)
pca.fit(X.T)
Y_2 = np.transpose(pca.transform(X.T))
Y_2.shape

(5, 517)

In [13]:
diff = sum(sum(pca.components_) - sum(P)) 
print(diff)

0.5970902555675164


## Differences
The difference between the principal components are not that much. Sklearn's PCA uses normalizations on the data, hence provides a slightly different result than my implementation.