# Dimensionality Reduction using PCA

# PCA (Principal Component Analysis)

**Definition:** An unsupervised dimensionality-reduction technique that transforms correlated features into **orthogonal principal components (PCs)**, ordered so that each one captures the **maximum remaining variance**.

---

**Purpose:** Reduce dimensionality, remove correlation, denoise data, aid visualization.

---

**Key Facts:**  
- PCs are **linear combinations** of original features  
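- PC directions are **orthogonal**, so the projected data (scores) are **uncorrelated**  
- PC1 has **maximum variance**; each subsequent PC captures the most remaining variance while staying orthogonal to the earlier ones  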
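These facts can be checked numerically. The snippet below is a minimal sketch, assuming scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in dataset; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data is used as a stand-in dataset.
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=4).fit(X)
W = pca.components_          # each row is one PC direction (a linear combination of features)
scores = pca.transform(X)    # data projected onto the PC directions

# Orthogonal directions: W @ W.T should be the identity matrix.
directions_orthonormal = np.allclose(W @ W.T, np.eye(4))

# Uncorrelated scores: off-diagonal covariances should be numerically zero.
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))
scores_uncorrelated = np.allclose(off_diagonal, 0.0, atol=1e-10)

# Variance is sorted in decreasing order, so PC1 carries the most.
variances_sorted = bool(np.all(np.diff(pca.explained_variance_) <= 0))

print(directions_orthonormal, scores_uncorrelated, variances_sorted)
```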

---

**Mathematics**
$$
X \in \mathbb{R}^{n \times p} \quad \text{(mean-centered data matrix)}
$$

$$
\Sigma = \frac{1}{n} X^T X \quad \text{(covariance matrix)}
$$

$$
\Sigma v_i = \lambda_i v_i, \qquad
v_i = \text{direction of the } i\text{-th principal component}, \quad
\lambda_i = \text{variance captured by PC}_i
$$

$$
\text{Explained Variance Ratio (PC}_i) = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}
$$

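$$
X = U S V^T \quad \Rightarrow \quad V = \text{matrix of principal components (SVD)}, \qquad \lambda_i = \frac{s_i^2}{n}
$$

Here $S$ is the diagonal matrix of singular values $s_i$ (distinct from the covariance matrix $\Sigma$).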
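The two routes (eigendecomposition of the covariance matrix and SVD of the centered data) should give the same components up to sign. The following is a minimal NumPy sketch on synthetic data; the data, the mixing matrix, and all variable names are assumptions made for illustration, and the covariance uses the $\frac{1}{n}$ convention from the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: 200 samples, 3 features.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
X = X - X.mean(axis=0)              # mean-center each feature
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix (1/n convention).
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_variances = s**2 / n                    # lambda_i = s_i^2 / n

# Explained variance ratio per component.
explained_ratio = eigvals / eigvals.sum()

# Eigenvectors are only defined up to sign, so compare absolute values.
same_variances = np.allclose(eigvals, svd_variances)
same_directions = np.allclose(np.abs(eigvecs.T), np.abs(Vt))

print(same_variances, same_directions)
print(explained_ratio)
```

Note that scikit-learn's `PCA` reports `explained_variance_` with an $n - 1$ denominator, so its values differ from the $\frac{1}{n}$ version by a constant factor; the ratios are unaffected.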

---



## Setup

In [1]:
# Import Libraries
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn import metrics

In [2]:
# Import Data
iris_df = sns.load_dataset('iris')
iris_df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
display(
iris_df.describe(),
iris_df.shape,
iris_df.isnull().sum()
)


| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


(150, 5)

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

No null values were found in any column.

---

## Calculating accuracy with PCA

### Against Raw data

In [4]:
# Setting features
x = iris_df.drop(columns=['species'])
y = iris_df['species']

In [5]:
# Train-Test Split
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.2, random_state=67)

In [6]:
# Standardize data
s = StandardScaler()

# Fit the scaler on the training data only, then apply it to the test data.
# Fitting on the full dataset would leak test-set statistics (mean, std) into training and bias the evaluation.
x_train_s = s.fit_transform(x_train)
x_test_s = s.transform(x_test)
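One way to make the fit-on-train-only rule hard to break is to bundle the scaler and the classifier in a scikit-learn `Pipeline`. This is a minimal sketch, not the notebook's own code: it assumes scikit-learn's bundled Iris data (`load_iris`) in place of the seaborn DataFrame, and the names `pipe` and `fitted_scaler` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris_df.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=67
)

# The pipeline fits the scaler on whatever data is passed to fit(),
# so test rows never influence the scaling parameters.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=67),
)
pipe.fit(X_train, y_train)

# The learned means come from the training split alone.
fitted_scaler = pipe.named_steps["standardscaler"]
means_from_train_only = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))

accuracy = pipe.score(X_test, y_test)   # scales X_test with the train statistics
print(means_from_train_only, accuracy)
```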

In [7]:
# Train the classifier
clf = RandomForestClassifier(n_estimators=100, random_state=67)
# fit the model
clf.fit(x_train_s, y_train)
# generate predictions
y_pred = clf.predict(x_test_s)

print("Accuracy on Raw Data: ", metrics.accuracy_score(y_test, y_pred))

Accuracy on Raw Data:  0.9666666666666667


### Against PCA 

In [12]:
# Applying PCA
pca = PCA(n_components = 2)
x_train_pca = pca.fit_transform(x_train_s)
x_test_pca = pca.transform(x_test_s)

# Train classifier against PCA data
clf_pca = RandomForestClassifier(n_estimators=100, random_state=67)
clf_pca.fit(x_train_pca, y_train)
y_pred_pca2 = clf_pca.predict(x_test_pca)

print("Accuracy on PCA Data: ", metrics.accuracy_score(y_test, y_pred_pca2))

Accuracy on PCA Data:  0.8333333333333334


Two components give noticeably lower accuracy than the full feature set, so let's try again with 3 components.
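Rather than choosing the component count by trial and error alone, it can help to look at how much variance each component explains. The snippet below is a minimal sketch under stated assumptions: it uses scikit-learn's bundled Iris data (`load_iris`) instead of the seaborn DataFrame, and the 95% threshold is an arbitrary illustrative choice, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris_df.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=67
)
X_train_s = StandardScaler().fit_transform(X_train)

# Fit PCA with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X_train_s)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # illustrative value only
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{k}: ratio={r:.3f}, cumulative={c:.3f}")
print("components needed:", n_needed)
```

Explained variance measures how well the components reconstruct the features, not how well they separate the classes, so it is a guide for the classifier experiment rather than a substitute for it.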


In [None]:
# Applying PCA
pca = PCA(n_components = 3)
x_train_pca = pca.fit_transform(x_train_s)
x_test_pca = pca.transform(x_test_s)

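# Train classifier against PCA data
# Note: this run uses n_estimators=1000, while the other runs use 100, so the comparison is not strictly like-for-like.
clf_pca = RandomForestClassifier(n_estimators=1000, random_state=67)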
clf_pca.fit(x_train_pca, y_train)
y_pred_pca3 = clf_pca.predict(x_test_pca)

print("Accuracy on PCA Data: ", metrics.accuracy_score(y_test, y_pred_pca3))

Accuracy on PCA Data:  0.9


In [14]:
# Applying PCA
pca = PCA(n_components = 4)
x_train_pca = pca.fit_transform(x_train_s)
x_test_pca = pca.transform(x_test_s)

# Train classifier against PCA data
clf_pca = RandomForestClassifier(n_estimators=100, random_state=67)
clf_pca.fit(x_train_pca, y_train)
y_pred_pca4 = clf_pca.predict(x_test_pca)

print("Accuracy on PCA Data: ", metrics.accuracy_score(y_test, y_pred_pca4))

Accuracy on PCA Data:  0.9
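With `test_size=0.2` on 150 rows, the test set holds only 30 samples, so a single misclassification shifts accuracy by about 3.3 percentage points. A cross-validated comparison with identical settings for every component count gives a steadier picture. The following is a minimal sketch under stated assumptions: it uses scikit-learn's bundled Iris data (`load_iris`), 5-fold cross-validation, and a hypothetical `results` dictionary; none of these come from the runs above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris_df.
X, y = load_iris(return_X_y=True)

results = {}
for k in range(1, 5):
    # Same scaler, same forest settings, only the component count changes.
    pipe = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        RandomForestClassifier(n_estimators=100, random_state=67),
    )
    # Scaler and PCA are refit inside each fold, so no fold leaks into another.
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    results[k] = (scores.mean(), scores.std())

for k, (mean_acc, std_acc) in results.items():
    print(f"n_components={k}: accuracy={mean_acc:.3f} +/- {std_acc:.3f}")
```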
