# Principal Component Analysis (PCA) — My Ultimate Notebook

Hi — welcome to my **PCA Notebook**.  
I created this to learn PCA deeply myself, and to help anyone who wants a **complete, hands-on PCA reference**.  

This notebook will take you from **zero to hero** on PCA: we’ll cover intuition, math, code, visualizations, and real-world applications — step by step.  

High-dimensional datasets come with unique challenges, often called the **curse of dimensionality**:  
- Distances between points become less meaningful.  
- Data becomes sparse, making patterns hard to detect.  
- Models are prone to overfitting when features outnumber observations.  

PCA helps tackle these issues by projecting data to a lower-dimensional space while preserving most of the variance. In other words, PCA reduces dimensionality **intelligently**, making data easier to visualize, store, and analyze, and often improving machine learning performance.

---

# 📑 Table of Contents

1. [Introduction & Why PCA](#intro)  
2. [Prerequisites & Intuition](#prereq)  
3. [Mathematical Derivation](#math)  
   - Variance & Covariance  
   - Eigenvalues & Eigenvectors  
   - Optimization View of PCA  
4. [PCA from Scratch (NumPy Implementation)](#scratch)  
5. [PCA with Scikit-Learn](#sklearn)  
6. [Applications](#apps)  
   - Visualization  
   - Compression  
   - De-noising  
   - Preprocessing for ML  
7. [Limitations & Pitfalls](#pitfalls)  
8. [Advanced Notes](#advanced)  
   - Kernel PCA  
   - Sparse PCA  
   - Incremental PCA  
9. [Hands-on Exercises](#exercises)  
10. [Summary & Final Thoughts](#summary)


---

<a id="intro"></a>
# 1️. Introduction & Why PCA

Principal Component Analysis (PCA) is one of the most widely used techniques in data analysis and machine learning. At its core, PCA is a **dimensionality reduction method**: it helps us take data with many features (sometimes hundreds or thousands) and represent it with fewer variables, while keeping as much information as possible.

I like to think of PCA as a way to **compress the essence of data**.  
When I have a dataset with lots of features (columns), many of them might be redundant or noisy. PCA helps to distill that information into a smaller set of new features (principal components) that capture what truly matters.

---

## The Motivation

Why do we need dimensionality reduction in the first place?

- **High-dimensional data is tricky**  
  As the number of features grows, data becomes sparse and distances between points lose meaning. This is known as the **curse of dimensionality**. It makes visualization harder, models prone to overfitting, and computations expensive.

- **Redundancy in features**  
  Many real-world datasets have correlated features (e.g., height and arm span, or pixels in nearby locations of an image). This means the dataset has fewer “true” degrees of freedom than it appears.

- **Noise in data**  
  Not all variance in data is useful. Some directions of variation are mostly noise, and removing them can make downstream models more robust.

PCA provides a principled way to address all of these issues by finding new axes (principal components) that:  
1. Capture the directions of **maximum variance** in the data  
2. Are **orthogonal** to each other, so the resulting component scores are uncorrelated  
3. Allow us to **rank** components by importance (explained variance)

---

## What PCA Does in a Nutshell

PCA finds new axes (principal components) such that if we project our data onto the first few, we keep most of the **spread/variance** of the dataset, but with fewer dimensions.

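Formally, the first principal component is the unit vector $\mathbf{u}$ that solves:

$$
\max_{\mathbf{u}} \; \mathrm{Var}(X\mathbf{u})
\quad \text{subject to } \mathbf{u}^\top \mathbf{u} = 1
$$

Here $X$ is the centered data matrix (rows = samples), and $X\mathbf{u}$ holds each sample's coordinate along $\mathbf{u}$.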

This leads to an eigenvalue problem of the covariance matrix, which we’ll derive in detail later.

---

## Why PCA Matters in Practice

- **Visualization**  
  Project high-dimensional data into 2D or 3D to explore clusters, separability, and patterns.  

- **Compression**  
  Represent images or datasets with fewer numbers, saving storage and memory.  

- **De-noising**  
  By discarding low-variance directions, PCA can remove noise from signals and images.  

- **Preprocessing for ML**  
  PCA decorrelates features and reduces collinearity, making models like linear regression or logistic regression more stable.  

- **Speed**  
  With fewer features, many ML algorithms run significantly faster without much accuracy loss.  

---

## Real-World Examples We’ll Cover

- **Synthetic 2D datasets** — to build intuition with visual plots  
- **Iris / Wine datasets** — small, interpretable datasets for demos  
- **Digits / MNIST images** — to see compression, reconstruction, and denoising  
- **Practical ML pipelines** — where PCA helps vs. where it can hurt interpretability  

---

In short: **PCA is about finding the “true axes of variation” in data.**  
It is both a mathematical tool and a practical workhorse for visualization, compression, noise reduction, and preprocessing.

<a id="prereq"></a>
# 2️. Prerequisites & Intuition

Before diving into PCA, it’s important to understand some **basic concepts from statistics and linear algebra**. PCA builds directly on these ideas, so a clear understanding will make the derivation and intuition much easier. Let’s go **step by step**, with simple examples.

---

## 1. Variance: How much a feature spreads

Variance measures how much the values of a single feature vary around their mean. It gives us a sense of the “spread” of the data along one axis.

For a feature $X \in \mathbb{R}^{n \times 1}$ (a single column of $n$ observations):

$$
\text{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

- $\bar{x}$ = mean of $X$  
- $x_i$ = individual data point

**Example:**  
Suppose $X = [2, 4, 6, 8]$. The mean is $\bar{x} = 5$.  

$$
\text{Var}(X) = \frac{(2-5)^2 + (4-5)^2 + (6-5)^2 + (8-5)^2}{3} = \frac{9 + 1 + 1 + 9}{3} = 6.67
$$

Intuition: variance tells us **how “wide” the data is along this feature**. PCA looks for directions where variance is largest.
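
A quick way to check the worked example is NumPy. This is only a small sketch: `ddof=1` selects the $n-1$ denominator used in the formula above, while NumPy's default (`ddof=0`) divides by $n$.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Sample variance (divide by n - 1), matching the formula above
var_sample = np.var(x, ddof=1)

# The same quantity written out term by term
var_manual = np.sum((x - x.mean()) ** 2) / (len(x) - 1)

print(round(var_sample, 2), round(var_manual, 2))
```

Both values come out to about 6.67, the same as the hand calculation.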

---

## 2. Covariance: How two features vary together

Covariance measures whether **two features move together or in opposite directions**.

For features $X, Y \in \mathbb{R}^n$:

$$
\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

- Positive covariance → both increase together  
- Negative covariance → one increases while the other decreases  
- Zero → features are uncorrelated

**Example:**  
Let $X=[1,2,3]$, $Y=[2,4,6]$. Then $\bar{x}=2$, $\bar{y}=4$.

$$
\text{Cov}(X,Y) = \frac{(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)}{2} = 2
$$

The covariance matrix generalizes this to multiple features:

$$
\Sigma =
\begin{bmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$

PCA uses this matrix to find directions of maximum variance considering **all features together**.
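
To connect the covariance example with the matrix form, here is a small NumPy sketch. It assumes the two example features $X=[1,2,3]$ and $Y=[2,4,6]$ from above; `np.cov` treats each row as one variable by default and also uses the $n-1$ denominator.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Covariance of the two features, written out with the n - 1 denominator
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Full 2x2 covariance matrix: variances on the diagonal, covariance off-diagonal
Sigma = np.cov(np.vstack([x, y]))

print(cov_xy)
print(Sigma)
```

The off-diagonal entries of `Sigma` equal `cov_xy`, and the diagonal holds the two variances.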

---

## 3. Eigenvectors & Eigenvalues: Directions and importance

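Eigenvectors represent **directions in space** that a matrix only stretches or shrinks without rotating, and eigenvalues are the **stretch factors**. When the matrix is a covariance matrix, each eigenvalue equals the **variance along its eigenvector**.  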

For a square matrix $A$:

$$
A \mathbf{v} = \lambda \mathbf{v}
$$

- $\mathbf{v}$ = eigenvector (direction)  
- $\lambda$ = eigenvalue (length/scaling along that direction)

**Example:**  

$$
A = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix},
\mathbf{v}_1 = \begin{bmatrix}1\\0\end{bmatrix},
\mathbf{v}_2 = \begin{bmatrix}0\\1\end{bmatrix}
$$

Then:

$$
A \mathbf{v}_1 = 2 \mathbf{v}_1, \quad A \mathbf{v}_2 = 3 \mathbf{v}_2
$$

Intuition: **PCA finds eigenvectors of the covariance matrix**. The eigenvector with the largest eigenvalue is the **first principal component** — the direction with maximum variance.
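
The diagonal example can be checked numerically. This sketch uses `np.linalg.eigh`, which is intended for symmetric matrices (covariance matrices are symmetric) and returns eigenvalues in ascending order with eigenvectors as columns.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)

# Check A v = lambda v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# If A were a covariance matrix, this would be the first principal component
top = eigvecs[:, np.argmax(eigvals)]
print(eigvals, top)
```

Note that an eigenvector's sign is arbitrary: `v` and `-v` describe the same direction.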

---

## 4. Orthogonality: Directions at right angles

Two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^p$ are **orthogonal** if their dot product is zero:

$$
\mathbf{u}^\top \mathbf{v} = 0
$$

- Orthogonal (nonzero) vectors are linearly independent: neither has any component along the other.  
- In PCA, all principal components are **orthogonal**, so each new component captures variance **not captured by previous components**.

**Example:**  
$\mathbf{u} = [1,0]$, $\mathbf{v} = [0,1]$ → $\mathbf{u}^\top \mathbf{v} = 0$

---

## 5. Dot Product & Projection: Measuring alignment

The dot product measures how much one vector aligns with another:

$$
\mathbf{u}^\top \mathbf{x} = \|\mathbf{u}\|\|\mathbf{x}\|\cos\theta
$$

- $\theta$ = angle between $\mathbf{u}$ and $\mathbf{x}$  
- Projection of $\mathbf{x}$ onto a **unit vector** $\mathbf{u}$ (that is, $\|\mathbf{u}\| = 1$):

$$
\text{Proj}_{\mathbf{u}}(\mathbf{x}) = (\mathbf{u}^\top \mathbf{x}) \mathbf{u}
$$

PCA projects data onto eigenvectors to get **principal component scores**.
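
Here is a minimal sketch of projection onto a direction. The point `x` and the direction `u` are made-up values chosen only for illustration; the formula assumes `u` has been normalized to unit length.

```python
import numpy as np

x = np.array([3.0, 4.0])           # hypothetical data point
u = np.array([1.0, 1.0])           # hypothetical direction
u = u / np.linalg.norm(u)          # normalize so that ||u|| = 1

score = u @ x                      # scalar coordinate of x along u
proj = score * u                   # projection of x onto the line spanned by u
residual = x - proj                # the part of x the projection leaves out

print(score, proj, residual @ u)
```

The residual is orthogonal to `u` (its dot product with `u` is zero up to rounding), which is exactly the information PCA drops when a component is discarded.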

---

## 6. Matrix Multiplication & Transpose: Combining features

- Covariance: $\Sigma = \frac{1}{n-1} X^\top X$ (for centered $X$)  
- Transpose flips rows ↔ columns  
- Matrix multiplication allows **linear combinations of features**, which is how PCA rotates the original axes

---

## Putting it all together: PCA intuition

1. Center the data and compute the **covariance matrix** → captures spread and correlations  
2. Solve **eigenvalue problem** → find directions (eigenvectors) with largest variance (eigenvalues)  
3. **Project data onto top eigenvectors** → lower-dimensional representation retaining most information  

Intuition: PCA finds the **true axes of variation** and compresses the essence of the data, while discarding noise and redundant dimensions.
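
The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function name `pca_sketch` and the synthetic data are my own assumptions, and it uses the $n-1$ denominator from this section.

```python
import numpy as np

def pca_sketch(X, k):
    """Return (scores, components, explained_variance) for the top-k components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    Sigma = Xc.T @ Xc / (X.shape[0] - 1)       # covariance matrix (n - 1 denominator)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    components = eigvecs[:, order]             # columns = principal directions
    scores = Xc @ components                   # project the data onto them
    return scores, components, eigvals[order]

# Hypothetical data: two strongly correlated features plus one low-variance noisy feature
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), 0.1 * rng.normal(size=200)])

scores, components, explained = pca_sketch(X, k=2)
print(scores.shape, explained)
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why eigenvalues can be read as "explained variance".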


<a id="math"></a>
# 3️. Mathematical Derivation of PCA — Step by Step (With Examples)

Now that we understand variance, covariance, eigenvectors, and projection, let's derive PCA **step by step** using a small example. This way, I can really *see why PCA works*, not just run code.

---

## Example Dataset

Suppose I have 2 features and 3 data points:

$$
X =
\begin{bmatrix}
2 & 0 \\
0 & 1 \\
3 & 2
\end{bmatrix}
$$

- 3 rows = 3 samples  
- 2 columns = 2 features  

I want to reduce this 2D data to 1D (the most informative direction).

---

## Step 1: Center the Data

PCA requires **centered data** (mean of each column = 0):

1. Compute column means:

$$
\mu =
\begin{bmatrix}
\bar{x}_1 \\
\bar{x}_2
\end{bmatrix} =
\begin{bmatrix}
\frac{2+0+3}{3} \\
\frac{0+1+2}{3}
\end{bmatrix} =
\begin{bmatrix}
1.667 \\
1
\end{bmatrix}
$$

2. Subtract mean from each column:

$$
X_{\text{centered}} = X - \mathbf{1}\mu^\top =
\begin{bmatrix}
2-1.667 & 0-1 \\
0-1.667 & 1-1 \\
3-1.667 & 2-1
\end{bmatrix} =
\begin{bmatrix}
0.333 & -1 \\
-1.667 & 0 \\
1.333 & 1
\end{bmatrix}
$$

✅ Now the data is centered. PCA will focus on **variance around the mean**, not absolute values.
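
The centering step is a one-liner in NumPy thanks to broadcasting. A small sketch using the example matrix from above:

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [3.0, 2.0]])

mu = X.mean(axis=0)      # column means
Xc = X - mu              # broadcasting subtracts mu from every row

print(mu)
print(Xc)
print(Xc.mean(axis=0))   # zero up to floating-point error
```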

---

## Step 2: Projection onto a Direction

Suppose I want a direction $\mathbf{u} = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ (unit vector) to project the data onto.  

- Projection of a point $\mathbf{x}_i$:

$$
z_i = \mathbf{u}^\top \mathbf{x}_i = u_1 x_{i1} + u_2 x_{i2}
$$

- For all points:

$$
\mathbf{z} = X_{\text{centered}} \mathbf{u}
$$

- Intuition: $z_i$ is the **coordinate along the direction $\mathbf{u}$**.

---

## Step 3: Maximize Variance Along the Direction

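We want the **direction with maximum variance**. Because $X_{\text{centered}}$ has zero column means, $\mathbf{z}$ also has zero mean, so its variance is:

$$
\text{Var}(\mathbf{z}) = \frac{1}{n}\mathbf{z}^\top \mathbf{z} = \frac{1}{n}\mathbf{u}^\top X_{\text{centered}}^\top X_{\text{centered}} \mathbf{u} = \mathbf{u}^\top \Sigma \mathbf{u}, \quad \text{where } \Sigma = \frac{1}{n} X_{\text{centered}}^\top X_{\text{centered}}
$$

Note: this step uses the $\frac{1}{n}$ normalization, whereas Section 2 used $\frac{1}{n-1}$. The choice only rescales $\Sigma$ and its eigenvalues by a constant factor; the eigenvectors, and therefore the principal directions, are the same either way.

- Compute covariance matrix for our example:

$$
\Sigma = \frac{1}{3}
\begin{bmatrix}
0.333 & -1 \\
-1.667 & 0 \\
1.333 & 1
\end{bmatrix}^\top
\begin{bmatrix}
0.333 & -1 \\
-1.667 & 0 \\
1.333 & 1
\end{bmatrix}
= \frac{1}{3}
\begin{bmatrix}
4.667 & 1 \\
1 & 2
\end{bmatrix}
=
\begin{bmatrix}
1.556 & 0.333 \\
0.333 & 0.667
\end{bmatrix}
$$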

- Goal: maximize $\mathbf{u}^\top \Sigma \mathbf{u}$  
- Constraint: $\mathbf{u}^\top \mathbf{u} = 1$
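
Before solving this analytically, the constrained problem can be explored by brute force. The sketch below assumes the example data and the $\frac{1}{n}$ normalization of this step; it evaluates $\mathbf{u}^\top \Sigma \mathbf{u}$ for many unit vectors $\mathbf{u} = (\cos\theta, \sin\theta)$ and keeps the best one.

```python
import numpy as np

X = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 2.0]])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

Sigma = Xc.T @ Xc / n                          # 1/n normalization, as in this step
assert np.allclose(Sigma, np.cov(Xc, rowvar=False, bias=True))

# Half a circle is enough: u and -u give the same variance
thetas = np.linspace(0.0, np.pi, 1801)
U = np.column_stack([np.cos(thetas), np.sin(thetas)])   # each row is a unit vector

# u^T Sigma u for every row u of U
variances = np.einsum("ij,jk,ik->i", U, Sigma, U)

best = U[np.argmax(variances)]
print(Sigma)
print(best, variances.max())
```

The grid search is only for intuition. The eigenvalue problem in the next step gives the same direction exactly and scales to many dimensions, where a grid over directions is not feasible.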

---

## Step 4: Solve Eigenvalue Problem

Introduce Lagrange multiplier $\lambda$:

$$
\mathcal{L}(\mathbf{u}, \lambda) = \mathbf{u}^\top \Sigma \mathbf{u} - \lambda (\mathbf{u}^\top \mathbf{u} - 1)
$$

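Set the gradient with respect to $\mathbf{u}$ to zero (using the fact that $\Sigma$ is symmetric):

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{u}} = 2\Sigma \mathbf{u} - 2\lambda \mathbf{u} = 0
\quad\Longrightarrow\quad
\Sigma \mathbf{u} = \lambda \mathbf{u}
$$

Multiplying on the left by $\mathbf{u}^\top$ gives $\mathbf{u}^\top \Sigma \mathbf{u} = \lambda$, so the variance along an eigenvector equals its eigenvalue, and the maximum is reached at the largest eigenvalue.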

- Solve for eigenvalues and eigenvectors:

$$
\lambda_1 \approx 1.667, \quad \lambda_2 \approx 0.556
$$

$$
\mathbf{u}_1 \approx
\begin{bmatrix}0.949 \\ 0.316\end{bmatrix}, \quad
\mathbf{u}_2 \approx
\begin{bmatrix}-0.316 \\ 0.949\end{bmatrix}
$$

✅ Interpretation:  
- $\mathbf{u}_1$ = direction of **maximum variance** (first principal component)  
- $\mathbf{u}_2$ = orthogonal direction (second component, smaller variance)

---

## Step 5: Project Data onto Top Component

We pick **top 1 eigenvector** $\mathbf{u}_1$ and project:

$$
Z = X_{\text{centered}} \mathbf{u}_1 =
\begin{bmatrix}
0.333 & -1 \\
-1.667 & 0 \\
1.333 & 1
\end{bmatrix}
\begin{bmatrix}0.949 \\ 0.316\end{bmatrix}
\approx
\begin{bmatrix}0 \\ -1.581 \\ 1.581 \end{bmatrix}
$$

- $Z$ = 1D representation of the 2D data  
- ✅ This axis keeps $\lambda_1 / (\lambda_1 + \lambda_2) \approx 75\%$ of the **original variance**

---

## Step 6: Approximate Reconstruction

If I want to reconstruct an approximation of the (centered) data:

$$
\hat{X}_{\text{centered}} = Z \mathbf{u}_1^\top =
\begin{bmatrix}0 \\ -1.581 \\ 1.581 \end{bmatrix}
\begin{bmatrix}0.949 & 0.316\end{bmatrix}
\approx
\begin{bmatrix}0 & 0 \\
-1.5 & -0.5 \\
1.5 & 0.5
\end{bmatrix}
$$

Adding the column means $\mu^\top$ back to every row gives the approximation in the original units.

- Reconstruction = **projection back to original space** using top component  
- Not exact (we dropped the second component), but captures most variance
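
Steps 4 to 6 can be checked numerically. This is a sketch under the same assumptions as the derivation (example data, $\frac{1}{n}$ normalization); the sign flip is only there because eigenvector signs are arbitrary and solvers may return either orientation.

```python
import numpy as np

X = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 2.0]])
mu = X.mean(axis=0)
Xc = X - mu
Sigma = Xc.T @ Xc / Xc.shape[0]            # 1/n normalization

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order
u1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
if u1[0] < 0:                              # fix the arbitrary sign for readability
    u1 = -u1

Z = Xc @ u1                                # Step 5: 1D scores
Xc_hat = np.outer(Z, u1)                   # Step 6: reconstruction in centered coordinates
X_hat = Xc_hat + mu                        # add the mean back to return to original units

print(eigvals[::-1])                       # largest eigenvalue first
print(u1, Z)
print(X_hat)
print(eigvals[-1] / eigvals.sum())         # fraction of variance kept by one component
```

The last line is the explained variance ratio of the first component, the same quantity scikit-learn reports as `explained_variance_ratio_`.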

---

## Step 7: PCA via SVD

Alternatively, do **SVD on centered $X$**:

$$
X_{\text{centered}} = U S V^\top
$$

- Columns of $V$ = eigenvectors of $\Sigma$ → principal components  
- $U S$ = projected data (scores)  
- $\lambda_i = S_i^2 / n$ → variance along each component  
- ✅ Numerically stable, especially when **features > samples**
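
The relationships listed above can be verified on the example data. A minimal sketch, assuming the $\frac{1}{n}$ convention of this section; `np.linalg.svd` returns $V^\top$ (here `Vt`), so the principal directions are its rows.

```python
import numpy as np

X = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 2.0]])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Thin SVD: U is (n, 2), S holds singular values in descending order, Vt is V transposed
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

eigvals_from_svd = S**2 / n                # variances along the components
scores = U * S                             # equals Xc @ Vt.T

# Compare against the eigendecomposition of the covariance matrix
eigvals = np.linalg.eigh(Xc.T @ Xc / n)[0][::-1]

print(eigvals_from_svd, eigvals)
```

Both routes give the same variances; the SVD route never forms $X^\top X$ explicitly.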

---

## Key Takeaways From My POV

1. PCA = **find directions of maximum variance**  
2. Eigenvectors = principal axes, eigenvalues = variance along them  
3. Top components capture **most information**  
4. Projection reduces dimensionality; reconstruction approximates original data  
5. SVD = robust way to compute PCA

💡 Intuition: PCA is like **rotating the axes** to align with the directions where the data “stretches” the most, letting us **compress the essence of data** while discarding less informative directions.


<a id="sklearn"></a>
# 5️. PCA with Scikit-Learn

Now that we understand PCA conceptually, let’s **use scikit-learn** to apply PCA on a dataset with **many features**.  

> We’ll use the **Wine dataset** (13 features) to show how PCA can reduce dimensions while keeping most information.