PCA overview: https://www.youtube.com/watch?v=5vgP05YpKdE

Mathematics behind PCA: https://www.youtube.com/watch?v=FD4DeN81ODY


## Principal Component Analysis
PCA is a statistical technique used for reducing the dimensionality of data while preserving its key features. It's particularly useful when dealing with datasets with a large number of variables, as it simplifies the complexity by transforming the original variables into a new set of uncorrelated variables called principal components.
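
The transformation described above can be sketched with plain NumPy. This is a minimal illustration on hypothetical toy data (the names `X_toy`, `scores`, and `explained_ratio` are assumptions for the example, not part of this notebook): it centers the data, takes an SVD, and checks that the resulting components are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, two of them strongly correlated
latent = rng.normal(size=(200, 1))
X_toy = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Center each feature at zero
X_centered = X_toy - X_toy.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project the data onto the principal directions
scores = X_centered @ Vt.T

# 4. Share of total variance carried by each component
explained_variance = S**2 / (len(X_toy) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# The covariance matrix of the scores is diagonal: the components are uncorrelated
score_cov = np.cov(scores, rowvar=False)
print(np.round(explained_ratio, 3))
print(np.round(score_cov, 6))
```

The diagonal of `score_cov` holds the variance of each component, in decreasing order.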

## Applications:
1. Data Visualization: project high-dimensional data into a lower-dimensional space (typically 2D or 3D) so that it can be plotted.

2. Feature Extraction: extract the most important features from high-dimensional datasets while minimizing information loss.

3. Noise Reduction: by keeping only the principal components with the highest variance, PCA can filter out noise present in the data.

4. Data Compression: reduce the memory footprint by representing the data with a smaller number of principal components while retaining most of its variability.
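
The compression and noise-reduction uses can be illustrated with a small sketch. It assumes hypothetical synthetic data (`X_noisy`, generated from 3 latent factors plus a little noise) and uses scikit-learn's `PCA.inverse_transform` to map the compressed representation back to the original feature space.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 20 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 20))
X_noisy = latent @ mixing + 0.05 * rng.normal(size=(100, 20))

# Keep only 3 components: 20 numbers per sample become 3
pca_small = PCA(n_components=3)
X_compressed = pca_small.fit_transform(X_noisy)

# Map the compressed representation back to the original 20 features
X_restored = pca_small.inverse_transform(X_compressed)

reconstruction_mse = np.mean((X_noisy - X_restored) ** 2)
print(X_noisy.shape, "->", X_compressed.shape)
print("Reconstruction MSE:", reconstruction_mse)
```

Because the discarded components carry mostly noise in this toy setup, the reconstruction error stays small even though each sample is stored with far fewer values.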

Some interesting discussions on PCA that help in understanding the concept in more depth:
https://www.kaggle.com/discussions/general/21449


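Before applying PCA:
- The features require standardization/scaling, because PCA is driven by variance and features on a large scale would otherwise dominate the components.
- Categorical values require encoding as numbers.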
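
Both preprocessing requirements can be sketched on a tiny example. The frame `toy` and its columns are hypothetical and only serve as an illustration; the sketch one-hot encodes the categorical column with `pd.get_dummies` and then standardizes every column with `StandardScaler`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with two numeric columns and one categorical column
toy = pd.DataFrame({
    "income": [30000.0, 52000.0, 75000.0, 41000.0],
    "age": [25, 38, 47, 31],
    "city": ["A", "B", "A", "C"],
})

# One-hot encode the categorical column so every column is numeric
encoded = pd.get_dummies(toy, columns=["city"], dtype=float)

# Standardize every column to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(encoded),
    columns=encoded.columns,
)
print(scaled.round(2))
```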

## Import Libraries and Dataset

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [72]:
data = pd.read_csv('/content/PCA_train.csv')

In [73]:
data.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [74]:
data.shape

(429, 4993)

## EDA & Preprocessing

In [75]:
data.isnull().sum()

ID           0
target       0
48df886f9    0
0deb4b6a8    0
34b15f335    0
            ..
71b203550    1
137efaa80    1
fb36b89d9    1
7e293fbaf    1
9fc776466    1
Length: 4993, dtype: int64

In [76]:
data.fillna(data.mode().iloc[0], inplace=True)

In [77]:
data.isnull().sum()

ID           0
target       0
48df886f9    0
0deb4b6a8    0
34b15f335    0
            ..
71b203550    0
137efaa80    0
fb36b89d9    0
7e293fbaf    0
9fc776466    0
Length: 4993, dtype: int64

In [78]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429 entries, 0 to 428
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(3482), int64(1510), object(1)
memory usage: 16.3+ MB


In [79]:
data.drop(['ID'], axis=1, inplace=True)

In [80]:
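scaler = StandardScaler()
# Note: the scaled array is only displayed, not assigned, so `data` itself is left unchanged.
# The target column is also still part of `data` here and is scaled along with the features.
scaler.fit_transform(data)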

array([[ 3.88974891,  0.        , -0.04833682, ..., -0.08327945,
        -0.07736   , -0.15155035],
       [-0.70803918,  0.        , -0.04833682, ..., -0.08327945,
        -0.07736   , -0.15155035],
       [ 0.44755462,  0.        , -0.04833682, ..., -0.08327945,
        -0.07736   , -0.15155035],
       ...,
       [ 2.90626483,  0.        , -0.04833682, ..., -0.08327945,
        -0.07736   , -0.15155035],
       [-0.6696833 ,  0.        , -0.04833682, ..., -0.08327945,
        -0.07736   , -0.15155035],
       [-0.63316028,  0.        , -0.04833682, ..., -0.08327945,
        -0.07736   , -0.15155035]])

## Train & Test split

In [81]:
y = data['target']
X = data.drop(['target'], axis=1)

In [82]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
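
One common way to combine the split with scaling is to fit the scaler on the training rows only and reuse its statistics on the test rows. The sketch below is not what this notebook does; it uses hypothetical stand-in arrays (`X_demo`, `y_demo`) and a fixed `random_state` so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the feature matrix and target
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

feature_scaler = StandardScaler()
X_tr_scaled = feature_scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = feature_scaler.transform(X_te)      # reuse the training statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```

The test-set means are close to, but not exactly, zero, which is expected: the test rows are scaled with statistics they did not contribute to.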

## Model Training and testing without PCA

In [83]:
model = LinearRegression()
model.fit(X_train, y_train)

In [84]:
y_pred = model.predict(X_test)

In [89]:
mse = mean_squared_error(y_test, y_pred)  # mean squared error (no square root is taken)
r2 = r2_score(y_test, y_pred)

In [90]:
print("Mean Squared Error: ", mse)
print("R2 Score: ", r2)

Mean Squared Error:  1.6404051775991124e+16
R2 Score:  -232.24116937158334
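
To help read these metrics, here is a small sketch on hypothetical numbers (`y_true`, `y_good`, `y_bad` are made up for illustration). It shows that the root mean squared error is the square root of what `mean_squared_error` returns, and that R2 becomes negative when predictions are worse than always predicting the mean of the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and two sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 7.0, 9.5])
y_bad = np.array([9.0, 3.0, 12.0, 1.0])

mse_good = mean_squared_error(y_true, y_good)
rmse_good = np.sqrt(mse_good)  # RMSE is in the same units as the target

r2_good = r2_score(y_true, y_good)
r2_bad = r2_score(y_true, y_bad)  # negative: worse than predicting the mean

print("MSE:", mse_good, "RMSE:", rmse_good)
print("R2 (good):", r2_good, "R2 (bad):", r2_bad)
```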


## PCA

In [103]:
pca = PCA(n_components=0.99, random_state=0).fit(X)

In [109]:
var = np.cumsum(np.round(a=pca.explained_variance_ratio_,decimals=3) * 100 ) #cumulative_explained_variance
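
The cumulative curve can also be used to pick the number of components programmatically. The sketch below assumes a hypothetical array `ratios` standing in for `pca.explained_variance_ratio_` and an assumed target of 95% explained variance.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_
ratios = np.array([0.50, 0.25, 0.12, 0.06, 0.04, 0.02, 0.01])

cumulative = np.cumsum(ratios)
threshold = 0.95  # assumed target share of variance to retain

# Index of the first component at which the cumulative share reaches the threshold
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print("Components needed:", n_keep)
```

Passing a float such as `n_components=0.95` to scikit-learn's `PCA` applies the same kind of rule internally.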

In [105]:
import plotly.graph_objs as go

In [110]:
fig= go.Figure()
fig.add_trace(trace =go.Scatter(x=list(range(1, len(var) + 1)),  # one x value per retained component
                              y=var,
                              name="Cumulative Explained Variance ",
                              mode='lines+markers',
                              line=dict(color='royalblue', width=2),
                              marker=dict(color='darkorange', size=5)
                               ))
#layout with cosmetics
fig.update_layout(height=500,
                width=1000,
                title_text ='PCA Analysis',
                title_x=0.5,  # center the title horizontally
                xaxis_title='Number of components',
                yaxis_title='Explained Variance %'
                 )
fig.update_traces(showlegend=True)
fig.show()

In [113]:
pca=PCA(n_components=150, random_state=0)
X_pca=pca.fit_transform(X)
print(f'Before PCA \t :{X.shape}')
print(f'After PCA \t :{X_pca.shape}')

Before PCA 	 :(429, 4991)
After PCA 	 :(429, 150)


In [114]:
X_pca_train,X_pca_test, y_pca_train,y_pca_test=train_test_split(X_pca, y,test_size=0.30)

In [115]:
model.fit(X_pca_train, y_pca_train)
y_pca_pred = model.predict(X_pca_test)

In [116]:
mse = mean_squared_error(y_pca_test, y_pca_pred)  # mean squared error (no square root is taken)
r2 = r2_score(y_pca_test, y_pca_pred)

In [117]:
print("Mean Squared Error: ", mse)
print("R2 Score: ", r2)

Mean Squared Error:  1.598101754845351e+16
R2 Score:  -266.88044167453506
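
In the cells above, PCA is fitted on all rows before splitting, and the two models are evaluated on different random splits (test sizes 0.2 and 0.30), which makes the scores hard to compare. One possible alternative, sketched here on hypothetical synthetic data (`X_wide`, `y_wide`), is to chain scaling, PCA, and regression in a scikit-learn pipeline so that every step is fitted on the training rows only. The choice of 5 components is an assumption that matches how the toy data is generated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical wide data: 120 samples, 300 features, all driven by 5 latent factors
latent = rng.normal(size=(120, 5))
X_wide = latent @ rng.normal(size=(5, 300)) + 0.1 * rng.normal(size=(120, 300))
y_wide = latent @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + 0.1 * rng.normal(size=120)

# A fixed random_state keeps the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wide, y_wide, test_size=0.2, random_state=0
)

# Scaling, PCA, and regression are all fitted on the training rows only
pca_model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pca_model.fit(X_tr, y_tr)

r2_pca = r2_score(y_te, pca_model.predict(X_te))
print("R2 with scaling + PCA + regression:", r2_pca)
```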
