## README
To run the notebook, first upload the data.csv file. In my case, I uploaded it to the sample_data folder in session storage. You can upload it elsewhere, but remember to change the path in the code accordingly.

# Principal Component Analysis
Principal Component Analysis (PCA) is an algorithm that reduces the dimensionality of a data set by projecting it linearly onto a lower-dimensional subspace, chosen so that the reconstruction error of the projection is as low as possible.
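
One common way to state this formally, assuming centered data points $x_1, \dots, x_N \in \mathbb{R}^n$ and a $q \times n$ matrix $P$ with orthonormal rows (this notation is introduced here for illustration and is not taken from the code below):

$$
\min_{P P^\top = I_q} \; \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - P^\top P x_i \rVert^2
$$

The minimum is attained when the rows of $P$ are the eigenvectors of the covariance matrix $\frac{1}{N} \sum_{i} x_i x_i^\top$ belonging to its $q$ largest eigenvalues, and the minimal error equals the sum of the remaining $n - q$ eigenvalues.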

In [18]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [None]:
data = pd.read_csv('/content/sample_data/data.csv')
data.head()

In [20]:
# X should be transposed to convert rows to features and columns to data points
# 12 x 517 size matrix where there are 12 features and 517 data points
X = np.transpose(data.iloc[:, :-1].values)

# Get the last column of the matrix
label = data.iloc[:, -1].values

# X = StandardScaler().fit_transform(X) #standardize the data
# X.shape

In [21]:
# Center the data: subtract each feature's mean over all data points
mean_vector = np.mean(X, axis=1)
mean_matrix = np.tile(mean_vector, (X.shape[1],1))
X = X - mean_matrix.T

In [22]:
# Covariance matrix of X
Cov = (np.dot(X, X.T))/(X.shape[1])
# Cov = np.cov(X)

# Get the dimension of the covariance matrix
n = Cov.shape[0] 

q = 5 #dimension of the lower-dimensional subspace
# Cov.shape
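
The manual covariance above divides by the number of data points, whereas `np.cov` divides by N - 1 unless `bias=True` is passed. A minimal sketch of that relationship on synthetic data (the `_demo` names and the random matrix are assumptions for illustration, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 50
X_demo = rng.normal(size=(4, n_points))                # 4 features x 50 data points
X_demo = X_demo - X_demo.mean(axis=1, keepdims=True)   # center each feature (row)

cov_manual = X_demo @ X_demo.T / n_points   # same formula as Cov above, divides by N
cov_biased = np.cov(X_demo, bias=True)      # rows are variables; divides by N
cov_unbiased = np.cov(X_demo)               # default: divides by N - 1

print(np.allclose(cov_manual, cov_biased))
print(np.allclose(cov_manual * n_points / (n_points - 1), cov_unbiased))
```

The two conventions differ only by a constant factor, so they yield the same eigenvectors and the same percentage loss.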

# Reconstruction

In [23]:
# Eigenvalues and eigenvectors of Cov. Cov is symmetric, so eigh applies;
# it returns the eigenvectors as the columns of `eigenvector`.
eigenvalue, eigenvector = np.linalg.eigh(Cov)

# Sort by decreasing eigenvalue, reordering the eigenvector columns to match
order = np.argsort(eigenvalue)[::-1]

# Sorted eigenvalues, and the matching eigenvectors in rows instead of columns
eigval_eigvec = (eigenvalue[order], eigenvector[:, order].T)
principal_components = eigval_eigvec[1]

P = principal_components[:q] # q x 12 matrix with the q leading principal components in its rows.

# Project X onto the q leading principal components (q x 517 matrix)
Y = np.dot(P, X) 
# Y.shape
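
Strictly speaking, `Y` holds the coordinates of each data point in the q-dimensional subspace; mapping them back to the original space takes one more multiplication by `P.T`. The following is a minimal sketch on synthetic data (the `_demo` names, sizes, and random data are assumptions, not part of the notebook). It also checks that the mean squared reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_points, q_demo = 6, 200, 2

# Synthetic, correlated data: features in rows, data points in columns
X_demo = rng.normal(size=(n_features, n_features)) @ rng.normal(size=(n_features, n_points))
X_demo = X_demo - X_demo.mean(axis=1, keepdims=True)

cov = X_demo @ X_demo.T / n_points
eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues, eigenvectors in columns
order = np.argsort(eigvals)[::-1]               # indices for decreasing eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

P_demo = eigvecs[:, :q_demo].T                  # q x n matrix, leading eigenvectors in rows
Y_demo = P_demo @ X_demo                        # projected coordinates, q x N
X_hat = P_demo.T @ Y_demo                       # reconstruction in the original space, n x N

recon_error = np.sum((X_demo - X_hat) ** 2) / n_points
discarded = eigvals[q_demo:].sum()
print(recon_error, discarded)
```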

In [24]:
# Total loss: the variance discarded, i.e. the sum of the eigenvalues beyond the first q
variance_loss = sum(eigval_eigvec[0][q:]) 

# Total variance
total_variance = sum(eigval_eigvec[0][:]) 

# Percentage loss
print("PCA Loss : ", (variance_loss/total_variance)*100, "%") 

PCA Loss :  0.037985574963925774 %
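
The same ratio can be turned around to choose `q` for a target loss. A minimal sketch, assuming a hypothetical array of eigenvalues already sorted in decreasing order (the values and the 10 % threshold are made up for illustration):

```python
import numpy as np

# Hypothetical eigenvalues, sorted in decreasing order
eigenvalues_demo = np.array([8.0, 4.0, 2.0, 1.0, 1.0])

# Fraction of the total variance kept by the first k components, k = 1..n
explained = np.cumsum(eigenvalues_demo) / eigenvalues_demo.sum()

# Smallest q whose kept variance reaches 90 % (loss of at most 10 %)
threshold = 0.9
q_min = int(np.searchsorted(explained, threshold)) + 1

loss_percent = (1.0 - explained[q_min - 1]) * 100
print(q_min, loss_percent)
```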


In [None]:
# Compare the linear regression prediction on the original data points
# and on the projected (q-dimensional) data points.
from sklearn.linear_model import LinearRegression
import datetime

# Fit the original data with the label
start_time = datetime.datetime.now()
reg_original_data = LinearRegression().fit(X.T,label) 

# Calculate the R2 coefficient on the training data itself
print("The R2 coefficient is : ", reg_original_data.score(X.T, label)) 

# Calculate total elapsed time
end_time = datetime.datetime.now()
print("Total elapsed time : ", end_time - start_time)

In [None]:
# Fit the projected (q-dimensional) data points with the label
start_time = datetime.datetime.now()
reg_pca_data = LinearRegression().fit(Y.T, label) 

# Calculate the R2 coefficient with the training data itself
print("The R2 coefficient is : ", reg_pca_data.score(Y.T, label))  

# Calculate total elapsed time
end_time = datetime.datetime.now()
print("Total elapsed time : ", end_time - start_time)

# Observation
It can be seen that even though the projected data points give a lower R2 coefficient, they save some time
when fitting the linear regression. The time saved might be small for this example, but it makes a lot
of difference for higher-dimensional datasets and more complex modelling techniques. Linear regression
was used just for simplicity's sake to show the working principles of PCA: even though the dimension
of the original data points has been reduced from 12 to 5, the projected data points still give
linear regression enough information to model the relationship between the dependent and independent variables.
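
A single timed run is noisy, so a comparison like this is more reliable when it times `fit()` alone and repeats it. The sketch below does that on synthetic data. The data, the `time_fit` helper, and the use of an SVD to get the principal directions are assumptions for illustration and differ from the cells above:

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 12))        # 500 data points x 12 features (synthetic)
y_demo = X_demo @ rng.normal(size=12) + rng.normal(scale=0.1, size=500)

# Project onto the 5 leading principal directions (rows of Vt from the SVD of centered data)
X_centered = X_demo - X_demo.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
Y_demo = X_centered @ Vt[:5].T

def time_fit(features, target, repeats=20):
    """Hypothetical helper: fitted model and best-of-`repeats` time for fit() alone."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        model = LinearRegression().fit(features, target)
        best = min(best, time.perf_counter() - start)
    return model, best

model_full, t_full = time_fit(X_centered, y_demo)
model_pca, t_pca = time_fit(Y_demo, y_demo)

r2_full = model_full.score(X_centered, y_demo)
r2_pca = model_pca.score(Y_demo, y_demo)
print("full:", r2_full, t_full)
print("pca :", r2_pca, t_pca)
```

Because the projected features are linear combinations of the original ones, the training R2 of the reduced model can never exceed that of the full model; how much time is saved depends on the data size and the machine.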

In [None]:
from sklearn.decomposition import PCA #using PCA from the sklearn 
pca = PCA(n_components=q)
pca.fit(X.T)
Y_2 = np.transpose(pca.transform(X.T))
Y_2.shape

In [None]:
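# Largest absolute difference between the principal components (my implementation)
# and Sklearn's implementation. Components are only defined up to sign, so compare
# absolute values; summing signed entries would let differences cancel out.
diff = np.max(np.abs(np.abs(pca.components_) - np.abs(P)))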
print(diff)

## Differences
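The principal components differ only slightly. Sklearn's PCA centers the data but does not scale it, and it typically computes the components through a singular value decomposition rather than an eigendecomposition of the covariance matrix, so its components can differ from mine in sign and by small floating-point errors.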
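
Since each principal component is only defined up to sign, a fair comparison aligns the signs first. A minimal sketch on synthetic data (the `_demo` names and the random data are assumptions, not the notebook's variables):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 5)) @ rng.normal(size=(5, 300))   # 5 features x 300 data points
X_demo = X_demo - X_demo.mean(axis=1, keepdims=True)

# Eigendecomposition route: leading 2 eigenvectors of the covariance matrix, in rows
eigvals, eigvecs = np.linalg.eigh(X_demo @ X_demo.T / X_demo.shape[1])
P_demo = eigvecs[:, np.argsort(eigvals)[::-1][:2]].T

# Sklearn route: expects data points in rows, hence the transpose
components = PCA(n_components=2).fit(X_demo.T).components_

# Flip each sklearn component so that it points the same way as the matching row of P_demo
signs = np.sign(np.sum(P_demo * components, axis=1, keepdims=True))
max_abs_diff = np.max(np.abs(P_demo - signs * components))
print(max_abs_diff)
```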