In [2]:
!sudo apt-get install unzip
!unzip archive.zip -d .

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
unzip is already the newest version (6.0-26ubuntu3.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Archive:  archive.zip
  inflating: ./Student_Performance.csv  


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the Dataset
The dataset has 6 predictor variables (5 features plus 1 constant column), and the target variable is Performance Index.

In [4]:
df = pd.read_csv("Student_Performance.csv")
df['Extracurricular Activities'] = df['Extracurricular Activities'].apply(lambda x: 1 if x=='Yes' else 0)

In [5]:
df['constant'] = 1
df['train'] = 1

## Principal Component Analysis

Essentially, we find the directions in which the sample variance of the data is highest.

In [6]:
df_pcr = df

In [7]:
# Center the predictors and the target by subtracting their column means
X_raw = df_pcr.drop(["Performance Index"], axis=1)
X_train_c = np.array(X_raw) - np.array(X_raw.mean(axis=0))
y_train_c = np.array(df_pcr["Performance Index"]) - df_pcr["Performance Index"].mean()
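
A small sketch of what centering does, using a made-up matrix `X_toy` that is not part of the dataset. Note in particular that a column holding a single constant value (like `constant` or `train` above) becomes all zeros once its mean is subtracted, so it contributes no variance to the later decomposition.

```python
import numpy as np

# Made-up design matrix: two informative columns plus a constant column.
X_toy = np.array([[1.0, 10.0, 1.0],
                  [2.0, 20.0, 1.0],
                  [3.0, 60.0, 1.0]])

# Subtract the column means, mirroring the centering step.
X_toy_c = X_toy - X_toy.mean(axis=0)

print(X_toy_c.mean(axis=0))  # column means after centering
print(X_toy_c[:, 2])         # the formerly constant column
```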

### Eigen Decomposition
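We decompose the scatter matrix $X^T X$ of the centered data (which is proportional to the sample covariance matrix) into its eigenvalues and eigenvectors:

$$ X^T X = V \Sigma V^T$$

The columns of $V$ are the principal component directions of the data, and the diagonal entries of $\Sigma$ are the corresponding eigenvalues.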

In [8]:
lamb,v = np.linalg.eig(X_train_c.T@X_train_c)
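
As a sanity check of the factorization, the sketch below rebuilds a scatter matrix from its eigenvalues and eigenvectors on randomly generated data (the names `A`, `S`, `lamb_toy`, and `v_toy` are illustrative only). It uses `np.linalg.eigh`, the NumPy routine intended for symmetric matrices, as an assumption about what one might prefer here; the cell above uses `np.linalg.eig`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))      # made-up data, 50 rows and 3 columns
A_c = A - A.mean(axis=0)          # center the columns
S = A_c.T @ A_c                   # symmetric scatter matrix

# eigh returns real eigenvalues in ascending order and orthonormal eigenvectors
lamb_toy, v_toy = np.linalg.eigh(S)

# Rebuild S from V, the diagonal eigenvalue matrix, and V^T
recon = v_toy @ np.diag(lamb_toy) @ v_toy.T

print(np.allclose(S, recon))                     # does the factorization hold?
print(np.allclose(v_toy.T @ v_toy, np.eye(3)))   # are the eigenvectors orthonormal?
```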

We sort the principal components in descending order of their eigenvalues

In [9]:
idx = lamb.argsort()[::-1]
lamb= lamb[idx]
v = v[:,idx]

We then calculate the cumulative explained variance ratio by summing the eigenvalues in descending order and dividing by their total

In [10]:
evar = np.cumsum(lamb) / lamb.sum()

Here we notice that the first 3 components already account for about 99% of the variance (the first one alone accounts for about 94%)

In [11]:
evar

array([0.94338659, 0.96920666, 0.99019654, 0.99921658, 1.        ,
       1.        , 1.        ])
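
The same calculation can be followed by hand on a tiny example. The sketch below uses made-up eigenvalues and a hypothetical `threshold` to pick the smallest number of components whose cumulative share reaches it; the notebook itself chooses 3 by inspection.

```python
import numpy as np

# Made-up eigenvalues, already sorted in descending order
lamb_toy = np.array([6.0, 3.0, 1.0])

# Cumulative share of the total: 6/10, 9/10, 10/10
evar_toy = np.cumsum(lamb_toy) / lamb_toy.sum()

# Smallest k whose cumulative share is at least the (hypothetical) threshold
threshold = 0.8
k = int(np.searchsorted(evar_toy, threshold)) + 1

print(evar_toy)
print(k)
```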

We keep only the first 3 components and project X onto the space they span. We then estimate the regression parameters using these 3 components. As the data is centered, no constant term is needed.

In [12]:
comps = v[:,:3]

In [13]:
X_pcr = X_train_c@comps
Y_pcr = y_train_c

In [14]:
X_train_pcr,Y_train_pcr = X_pcr[:8000], Y_pcr[:8000]
X_test_pcr,Y_test_pcr = X_pcr[8000:], Y_pcr[8000:]

In [15]:
m = np.linalg.inv(X_train_pcr.T@X_train_pcr)@X_train_pcr.T@Y_train_pcr

In [16]:
y_pred = X_test_pcr@m.T
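
The estimate above solves the normal equations with an explicit matrix inverse. As a hedged alternative, the sketch below shows that `np.linalg.lstsq` gives the same coefficients on made-up data (`Z`, `beta_true`, and `y` are stand-ins, not the notebook's variables); a least-squares solver is generally the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 3))             # stand-in for the projected predictors
beta_true = np.array([2.0, -1.0, 0.5])    # made-up coefficients
y = Z @ beta_true + 0.1 * rng.normal(size=100)

# Normal equations with an explicit inverse, as in the cell above
m_normal = np.linalg.inv(Z.T @ Z) @ Z.T @ y

# Same least-squares problem solved without forming the inverse
m_lstsq, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(np.allclose(m_normal, m_lstsq))
```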

In [17]:
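# Explained and residual sums of squares on the test rows
ESS = np.square(y_pred-Y_test_pcr.mean()).sum()
RSS = np.square(y_pred-Y_test_pcr).sum()
R2 = ESS / (ESS + RSS)
# Scale by (n - p) / (n - 1), with n = 2000 test rows and p = 3 components
adjR2 = ESS*(2000-3) / ((RSS+ESS)*(2000-1))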

We find that, with only the 3 principal components, the adjusted $R^2$ remains very high

In [18]:
adjR2

0.9852917965590035
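
For comparison, the textbook definitions are $R^2 = 1 - RSS/TSS$ and adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$. The sketch below implements them in a hypothetical helper `r2_scores` and runs it on made-up numbers. It is not the computation used for the value above, and the `n - p - 1` term assumes a model with an intercept.

```python
import numpy as np

def r2_scores(y_true, y_hat, n_predictors):
    """Return (R^2, adjusted R^2) using the textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y_true.shape[0]
    rss = np.square(y_true - y_hat).sum()          # residual sum of squares
    tss = np.square(y_true - y_true.mean()).sum()  # total sum of squares
    r2 = 1.0 - rss / tss
    # Assumes an intercept, hence n - p - 1 degrees of freedom
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

# Made-up targets and predictions
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat_toy = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

r2_toy, adj_r2_toy = r2_scores(y_true_toy, y_hat_toy, n_predictors=1)
print(r2_toy, adj_r2_toy)
```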

## Partial Least Squares
The main drawback of PCA is that the derived input directions do not take the variation of the dependent variable into account. It can therefore happen that, along some of the principal component directions, the variation of Y is very small. Partial Least Squares addresses this by choosing directions that also account for Y.
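
To make the contrast concrete, the sketch below compares the first principal direction with the first PLS weight vector, which for centered data is proportional to $X^T y$. The data are made up so that the high-variance column is unrelated to the target; all names are illustrative, and the column standardization that PLS is often preceded by is omitted here.

```python
import numpy as np

# Made-up, already centered data: x1 has high variance but is orthogonal to y,
# x2 has low variance but fully determines y.
x1 = np.array([3.0, -3.0, 3.0, -3.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0])
X_toy = np.column_stack([x1, x2])
y_toy = 2.0 * x2

# First principal direction: eigenvector with the largest eigenvalue of X^T X
lamb_toy, v_toy = np.linalg.eigh(X_toy.T @ X_toy)
pc1 = v_toy[:, -1]                 # eigh sorts eigenvalues in ascending order

# First PLS weight vector: proportional to X^T y, normalized to unit length
w1 = X_toy.T @ y_toy
w1 = w1 / np.linalg.norm(w1)

print(np.abs(pc1))   # PCA points along the high-variance column x1
print(w1)            # PLS points along x2, the column that explains y
```

In this example, projecting onto `pc1` yields a score that is uncorrelated with `y_toy`, while projecting onto `w1` recovers it exactly.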

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the Dataset
The dataset has 6 predictor variables (5 features plus 1 constant column), and the target variable is Performance Index.

In [22]:
df = pd.read_csv("Student_Performance.csv")
df['Extracurricular Activities'] = df['Extracurricular Activities'].apply(lambda x: 1 if x=='Yes' else 0)

In [25]:
X,Y  = np.array(df.drop("Performance Index",axis=1)), np.array(df['Performance Index'])

In [30]:
X_red,Y_red = X[:2],Y[:2]

In [31]:
np.linalg.eig(X_red.T @ X_red)

EigResult(eigenvalues=array([ 1.66828564e+04,  2.42038025e-14,  1.01435867e+01, -1.12531735e-15,
       -8.05066484e-17]), eigenvectors=array([[-0.06151446, -0.90094683,  0.42954718, -0.03071734, -0.01246139],
       [-0.99525548,  0.0277053 , -0.0844182 , -0.03794534,  0.01408377],
       [-0.00597283,  0.09565673,  0.19977832, -0.01792556,  0.96720087],
       [-0.07346011,  0.40031004,  0.82910382,  0.33890602, -0.24596464],
       [-0.01582516, -0.13464354, -0.28467222,  0.9393817 , -0.06058242]]))