Introduction

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset with many features into a smaller set of uncorrelated features (principal components) while retaining most of the information (variance).

Why PCA?

Reduce dimensions → faster computation

Remove redundancy (correlated features)

Visualize high-dimensional data

Can improve model performance in ML by reducing noise and the risk of overfitting

In [1]:
# PCA Notebook Example
# Author: Ananda Rimal
# Date: 2026-02-11

# -------------------------------
# Step 0: Import Libraries
# -------------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# -------------------------------
# Step 1: Create Example Dataset
# -------------------------------
data = {
    'Price': [50, 65, 80, 48, 95],
    'Rooms': [2, 3, 4, 2, 5],
    'Bathroom': [30, 35, 40, 28, 50]
}

df = pd.DataFrame(data)
df

   Price  Rooms  Bathroom
0     50      2        30
1     65      3        35
2     80      4        40
3     48      2        28
4     95      5        50


In [2]:
# -------------------------------
# Step 2: Standardize the Data
# -------------------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
print("\nStandardized Data:\n", X_scaled)


Standardized Data:
 [[-0.98227501 -1.02899151 -0.83658321]
 [-0.14510881 -0.17149859 -0.20280805]
 [ 0.69205739  0.68599434  0.43096711]
 [-1.09389717 -1.02899151 -1.09009327]
 [ 1.52922359  1.54348727  1.69851742]]
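
As a sanity check on the standardized values above, the same numbers can be reproduced by hand. The sketch below is NumPy-only and rebuilds the example data so it runs on its own; the names `X` and `X_manual` are illustrative and not part of the notebook. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Same example data as in Step 1 (columns: Price, Rooms, Bathroom).
X = np.array([
    [50, 2, 30],
    [65, 3, 35],
    [80, 4, 40],
    [48, 2, 28],
    [95, 5, 50],
], dtype=float)

# z-score per column: subtract the column mean, divide by the
# population standard deviation (ddof=0 is NumPy's default for std).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_manual)
print("column means:", X_manual.mean(axis=0).round(12))
print("column stds: ", X_manual.std(axis=0))
```

After scaling, every column should have mean 0 and standard deviation 1, so no single feature dominates the covariance matrix because of its units.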


In [3]:
# Step 3: Compute Covariance Matrix
# -------------------------------
cov_matrix = np.cov(X_scaled.T)
print("\nCovariance Matrix:\n", cov_matrix)


Covariance Matrix:
 [[1.25       1.24908352 1.23482521]
 [1.24908352 1.25       1.23364901]
 [1.23482521 1.23364901 1.25      ]]
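
The diagonal of the covariance matrix above is 1.25 rather than 1 because the two steps use different denominators: `StandardScaler` divides by n, while `np.cov` divides by n - 1 by default, giving n / (n - 1) = 5 / 4. A minimal sketch of that relationship follows; the variable names are illustrative, and the standardization is redone with plain NumPy so the block is self-contained.

```python
import numpy as np

# Same example data as in Step 1 (columns: Price, Rooms, Bathroom).
X = np.array([
    [50, 2, 30],
    [65, 3, 35],
    [80, 4, 40],
    [48, 2, 28],
    [95, 5, 50],
], dtype=float)

# Standardize with the population standard deviation, as StandardScaler does.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
n = X_scaled.shape[0]

cov_sample = np.cov(X_scaled.T)                 # divides by n - 1 (default)
cov_population = np.cov(X_scaled.T, bias=True)  # divides by n
corr = np.corrcoef(X.T)                         # correlation of the raw data

print("diagonal (n - 1):", np.diag(cov_sample))
print("diagonal (n):    ", np.diag(cov_population))
print("scaled by n/(n-1):", np.allclose(cov_sample, cov_population * n / (n - 1)))
print("population cov equals correlation:", np.allclose(cov_population, corr))
```

The constant factor only rescales the eigenvalues; the eigenvectors, and therefore the principal directions, are the same either way.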


In [4]:
# -------------------------------
# Step 4: Compute Eigenvalues and Eigenvectors
# -------------------------------
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("\nEigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)


Eigenvalues:
 [3.72838510e+00 8.93068586e-04 2.07218276e-02]

Eigenvectors:
 [[-0.57821069 -0.72058696 -0.38265758]
 [-0.57802899  0.6927938  -0.43118352]
 [-0.57580803  0.02812774  0.81710094]]
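
Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it is designed for symmetric matrices, always returns real values, and returns the eigenvalues in ascending order. The sketch below is not the notebook's method; it re-enters the covariance matrix printed above (rounded to 8 decimals) so it can run on its own.

```python
import numpy as np

# Covariance matrix copied from the Step 3 output (rounded values).
cov_matrix = np.array([
    [1.25,       1.24908352, 1.23482521],
    [1.24908352, 1.25,       1.23364901],
    [1.23482521, 1.23364901, 1.25      ],
])

# eigh returns eigenvalues in ascending order; columns of the second
# result are the matching eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder to descending so the first column is the first principal direction.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each pair should satisfy cov_matrix @ v == lambda * v.
pairs_ok = all(
    np.allclose(cov_matrix @ v, lam * v)
    for lam, v in zip(eigenvalues, eigenvectors.T)
)

print("eigenvalues (descending):", eigenvalues)
print("all pairs satisfy A v = lambda v:", pairs_ok)
```

The eigenvectors may differ in sign from the `np.linalg.eig` output; an eigenvector and its negation describe the same direction.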


In [6]:
# Step 5: Sort Eigenvalue/Eigenvector Pairs (Descending Eigenvalue)
# -------------------------------
eig_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)
eig_pairs

[(np.float64(3.728385103829724),
  array([-0.57821069, -0.57802899, -0.57580803])),
 (np.float64(0.02072182758466134),
  array([-0.38265758, -0.43118352,  0.81710094])),
 (np.float64(0.0008930685856138446),
  array([-0.72058696,  0.6927938 ,  0.02812774]))]
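
The sorted eigenvalues indicate how much variance each principal component carries. A short sketch of the usual explained-variance calculation, using the eigenvalues printed in Step 4 (the names `explained_ratio` and `cumulative` are illustrative):

```python
import numpy as np

# Eigenvalues copied from the Step 4 output (unsorted, as printed).
eigenvalues = np.array([3.72838510e+00, 8.93068586e-04, 2.07218276e-02])

# Sort descending, then express each eigenvalue as a share of the total.
sorted_vals = np.sort(eigenvalues)[::-1]
explained_ratio = sorted_vals / sorted_vals.sum()
cumulative = np.cumsum(explained_ratio)

for i, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{i}: {ratio:.4%} of variance, cumulative {total:.4%}")
```

For this dataset the first component alone accounts for roughly 99.4% of the variance, which reflects how strongly the three features are correlated.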

In [7]:
top_k = 2  # reduce to 2 principal components
W = np.column_stack([eig_pairs[i][1] for i in range(top_k)])
print("\nProjection Matrix W (Top 2 Eigenvectors):\n", W)


Projection Matrix W (Top 2 Eigenvectors):
 [[-0.57821069 -0.38265758]
 [-0.57802899 -0.43118352]
 [-0.57580803  0.81710094]]
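
One property worth checking on the projection matrix is that its columns are orthonormal, so that `W.T @ W` is approximately the 2x2 identity. The sketch below uses the rounded values of `W` printed above, so the result is only exact to about six decimals.

```python
import numpy as np

# Projection matrix copied from the output above (rounded values).
W = np.array([
    [-0.57821069, -0.38265758],
    [-0.57802899, -0.43118352],
    [-0.57580803,  0.81710094],
])

# For orthonormal columns, the Gram matrix W^T W is the identity.
gram = W.T @ W
print(np.round(gram, 6))
print("orthonormal columns:", np.allclose(gram, np.eye(2), atol=1e-5))
```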


In [8]:
# -------------------------------
# Step 6: Project Data onto Principal Components
# -------------------------------
Z = X_scaled.dot(W)
df_pca = pd.DataFrame(Z, columns=['PC1', 'PC2'])
print("\nProjected Data (2 Principal Components):\n", df_pca)


Projected Data (2 Principal Components):
         PC1       PC2
0  1.644460  0.135986
1  0.299813 -0.036240
2 -1.044834 -0.208467
3  1.854974 -0.028444
4 -2.754414  0.137165
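
The manual projection can be cross-checked against scikit-learn's `PCA` class. This is a sketch under a few assumptions: the manual side uses `np.linalg.eigh` rather than the notebook's `np.linalg.eig`, the standardization is redone in NumPy so the block is self-contained, and results are compared in absolute value because the sign of each principal component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same example data as in Step 1 (columns: Price, Rooms, Bathroom).
X = np.array([
    [50, 2, 30],
    [65, 3, 35],
    [80, 4, 40],
    [48, 2, 28],
    [95, 5, 50],
], dtype=float)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Manual PCA: eigendecomposition of the sample covariance matrix.
vals, vecs = np.linalg.eigh(np.cov(X_scaled.T))
order = np.argsort(vals)[::-1][:2]   # indices of the two largest eigenvalues
W = vecs[:, order]
Z_manual = X_scaled @ W

# scikit-learn PCA on the same standardized data.
pca = PCA(n_components=2)
Z_sklearn = pca.fit_transform(X_scaled)

# Compare up to sign; each component may be flipped independently.
print("projections match:", np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))
print("eigenvalues match:", np.allclose(vals[order], pca.explained_variance_))
print("explained variance ratio:", pca.explained_variance_ratio_)
```

`explained_variance_` uses the n - 1 denominator, so it lines up with the eigenvalues of `np.cov` rather than with those of the correlation matrix.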
