# 📌 Principal Component Analysis (PCA) as Maximizing Variance

**Machine Learning Foundations — Lecture 5, Week 6**

---

## 🔍 Recap: Two Perspectives on PCA

Principal Component Analysis (PCA) is a fundamental technique in dimensionality reduction. There are **two equivalent perspectives**:

1. **Minimizing Reconstruction Error**:

   * Choose a lower-dimensional subspace that preserves the structure of the original data.
   * Minimize the squared distance between original data and its projection onto the subspace.

2. **Maximizing Projected Variance**:

   * Find a subspace such that the **variance of the data after projection is maximized**.
   * Variance measures how spread out the data is; under this view, a direction with more spread is taken to retain more of the structure of the data.
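
The link between the two perspectives can be checked numerically. The sketch below is illustrative only: the synthetic data, the random direction `u`, and all variable names are assumptions, not part of the lecture. For centered data and any unit vector, the total variance splits into the projected variance plus the mean squared reconstruction error, so maximizing one term is the same as minimizing the other.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 points in R^3.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mix
Xc = X - X.mean(axis=0)              # center the data

# Any unit direction works for this identity, not only the optimal one.
u = rng.normal(size=3)
u /= np.linalg.norm(u)

z = Xc @ u                           # scalar projections x_i^T u
X_hat = np.outer(z, u)               # projected points (x_i^T u) u

projected_variance = np.mean(z ** 2)
reconstruction_error = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
total_variance = np.mean(np.sum(Xc ** 2, axis=1))

print(np.isclose(projected_variance + reconstruction_error, total_variance))
```

Because `total_variance` does not depend on `u`, the direction with the largest projected variance is also the one with the smallest reconstruction error.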

---

## 🧠 Intuition Behind Maximizing Variance

Let’s assume we have a dataset:

$$
\mathcal{D} = \{x_1, x_2, \dots, x_n\}, \quad x_i \in \mathbb{R}^d
$$

We want to **project** this data onto a 1D line (or m-dimensional subspace), such that the **variance of the projected data is maximized**.

### 📐 Projection Onto a Line

Let $u \in \mathbb{R}^d$ be a **unit vector** (i.e., $u^T u = 1$) representing the direction onto which we project data.

The projection of a data point $x_i$ onto $u$ is:

$$
\text{Projection of } x_i \text{ on } u = (x_i^T u)\, u
$$

But for variance, we are interested in scalar projections:

$$
\text{Projected scalar value} = x_i^T u
$$
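
A small numeric sketch of the two quantities above, using a hypothetical point `x` and direction `u` chosen only for illustration:

```python
import numpy as np

x = np.array([3.0, 1.0])             # hypothetical data point
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)            # normalize so that u^T u = 1

scalar_proj = x @ u                  # x^T u, a single number
vector_proj = scalar_proj * u        # (x^T u) u, a point on the line
residual = x - vector_proj           # the part of x orthogonal to u

print(scalar_proj, vector_proj, residual)
```

The residual is orthogonal to `u`, which is why the scalar projection alone is enough to describe the position of the point along the line.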

---

## 📊 Step-by-Step Derivation: Maximizing Variance

---

### 1️⃣ Define the Mean

Let $\bar{x}$ be the **mean** of the data:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

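We center the data by subtracting the mean from every point. The projected variance below keeps $\bar{x}$ explicit, so the derivation holds whether or not the data has already been centered.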

---

### 2️⃣ Projected Variance

We define the **projected variance** as:

$$
\frac{1}{n} \sum_{i=1}^{n} \left( x_i^T u - \bar{x}^T u \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( (x_i - \bar{x})^T u \right)^2
$$

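Since $(a^T u)^2 = u^T a a^T u$ for any vector $a$, and $u$ does not depend on $i$, this can be rewritten as a quadratic form:

$$
= \frac{1}{n} \sum_{i=1}^{n} u^T (x_i - \bar{x})(x_i - \bar{x})^T u = u^T \left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T \right) u = u^T C u
$$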

Where $C$ is the **covariance matrix**:

$$
C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T
$$
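
The identity between the projected variance and $u^T C u$ can be checked directly. This is a minimal sketch on random data; the data and variable names are assumptions. Note that `np.var` uses the $\frac{1}{n}$ normalization by default, matching the definition of $C$ here, while `np.cov` needs `bias=True` to do the same.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))        # n = 100 points in R^4 (illustrative)
n = X.shape[0]

x_bar = X.mean(axis=0)
Xc = X - x_bar
C = (Xc.T @ Xc) / n                  # covariance with 1/n normalization

u = rng.normal(size=4)
u /= np.linalg.norm(u)               # unit direction

var_direct = np.var(X @ u)           # variance of the scalar projections
var_quadratic = u @ C @ u            # quadratic form u^T C u

print(np.isclose(var_direct, var_quadratic))
```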

---

### 3️⃣ Objective: Maximize Projected Variance

The goal is to **maximize the projected variance**:

$$
\max_{u} u^T C u \quad \text{subject to} \quad u^T u = 1
$$

This is a **constrained optimization** problem.

---

### 4️⃣ Solution Using Lagrangian

We form the **Lagrangian**:

$$
\mathcal{L}(u, \lambda) = u^T C u - \lambda (u^T u - 1)
$$

Taking the gradient and setting it to zero:

$$
\nabla_u \mathcal{L} = 2Cu - 2\lambda u = 0 \Rightarrow Cu = \lambda u
$$

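Thus, **u must be an eigenvector of C** with eigenvalue $\lambda$. Left-multiplying $Cu = \lambda u$ by $u^T$ and using $u^T u = 1$ gives the **projected variance**:

$$
u^T C u = \lambda u^T u = \lambda
$$

Hence, to **maximize** the variance, choose **the unit eigenvector corresponding to the largest eigenvalue** of $C$.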
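
As a sanity check of this result, the sketch below compares the top eigenvector against many random unit directions. The synthetic data and names are assumptions; `np.linalg.eigh` is used because $C$ is symmetric, and it returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(500, 3)) @ A    # correlated synthetic data
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(C) # ascending eigenvalues
lam_max = eigvals[-1]
u_star = eigvecs[:, -1]              # eigenvector of the largest eigenvalue

# Projected variance along 1000 random unit directions.
U = rng.normal(size=(1000, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)
variances = np.einsum("ij,jk,ik->i", U, C, U)   # u_i^T C u_i for each row

print(np.isclose(u_star @ C @ u_star, lam_max))
print(variances.max() <= lam_max + 1e-12)
```

No random direction should beat `u_star`, and the variance along `u_star` should match `lam_max` up to floating-point error.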

---

### 5️⃣ Alternative: Calculus Without Lagrangian

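To avoid the Lagrangian, note that the **Rayleigh quotient** below is unchanged when $u$ is rescaled, so maximizing it over all nonzero $u$ is equivalent to maximizing $u^T C u$ over unit vectors: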

$$
R(u) = \frac{u^T C u}{u^T u}
$$

Differentiating this ratio with the quotient rule and setting the gradient to zero again gives:

$$
Cu = \lambda u \quad \text{where} \quad \lambda = \frac{u^T C u}{u^T u}
$$
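
The quotient-rule step can be sketched as follows, assuming $C$ is symmetric (true for any covariance matrix) and $u \neq 0$, so that $\nabla_u (u^T C u) = 2Cu$ and $\nabla_u (u^T u) = 2u$:

$$
\nabla_u R(u) = \frac{2Cu \,(u^T u) - (u^T C u)\, 2u}{(u^T u)^2} = \frac{2}{u^T u} \left( Cu - R(u)\, u \right)
$$

Setting this to zero gives $Cu = R(u)\, u$, so every stationary point is an eigenvector of $C$ and the value of the quotient there is the corresponding eigenvalue.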

---

## 📏 Generalization to m-Dimensional Subspace

To reduce to **m dimensions** rather than just one, we look for **m orthonormal vectors $u_1, u_2, \dots, u_m$** that:

* Maximize the **total projected variance**:

$$
\sum_{j=1}^m u_j^T C u_j
$$

* Subject to:

$$
u_i^T u_j = \delta_{ij} \quad (\text{orthonormality})
$$

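The solution is given by the **top m eigenvectors** of the covariance matrix $C$, i.e. those corresponding to the **m largest eigenvalues**. The maximum total projected variance is then $\lambda_1 + \lambda_2 + \dots + \lambda_m$.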
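
A minimal sketch of the m-dimensional case, assuming the $\frac{1}{n}$ covariance used throughout these notes. The helper `top_m_directions` is a hypothetical name introduced for illustration, not a library function.

```python
import numpy as np

def top_m_directions(X, m):
    """Return the m largest eigenvalues of the covariance of X and their eigenvectors."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending order
    order = np.argsort(eigvals)[::-1][:m]       # indices of the m largest
    return eigvals[order], eigvecs[:, order]

# Synthetic data with very different spread per coordinate (illustrative).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

lams, U = top_m_directions(X, m=2)
Z = (X - X.mean(axis=0)) @ U                    # n x m matrix of projected values
total_projected_variance = np.sum(np.var(Z, axis=0))

print(np.allclose(U.T @ U, np.eye(2)))          # columns are orthonormal
print(np.isclose(total_projected_variance, lams.sum()))
```

Each column of `Z` holds the projected values along one direction, and the variances of those columns add up to the sum of the selected eigenvalues.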

---

## 📌 Terminology

* **Principal Directions** $\rightarrow$ Eigenvectors $u_1, u_2, ..., u_m$ of $C$
* **Principal Components** $\rightarrow$ Projected values $(x_i - \bar{x})^T u_j$, which reduce to $x_i^T u_j$ for centered data
* The **projected data** lies in an m-dimensional subspace of $\mathbb{R}^d$

---

## ✅ Example: 2D Dataset

Given data points:

$$
x_1 = [-1, -1]^T,\quad x_2 = [0, 0]^T,\quad x_3 = [1, 1]^T
$$

### Step 1: Compute the mean

$$
\bar{x} = \frac{1}{3}(x_1 + x_2 + x_3) = [0, 0]^T
$$

### Step 2: Compute Covariance Matrix

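Because $\bar{x} = 0$, the data is already centered and the covariance is the average of the outer products $x_i x_i^T$:

$$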
C = \frac{1}{3} \sum_{i=1}^{3} x_i x_i^T =
\frac{1}{3} \left( \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \right)
= \begin{bmatrix} \frac{2}{3} & \frac{2}{3} \\ \frac{2}{3} & \frac{2}{3} \end{bmatrix}
$$

### Step 3: Eigenvalues and Eigenvectors

* Eigenvalues: $\lambda_1 = \frac{4}{3}, \lambda_2 = 0$
* Corresponding eigenvectors:
  $u_1 = \frac{1}{\sqrt{2}}[1, 1]^T$,
  $u_2 = \frac{1}{\sqrt{2}}[1, -1]^T$

### Step 4: Projection onto $u_1$

Compute projected values (principal components):

$$
\begin{aligned}
x_1^T u_1 &= [-1, -1] \cdot \frac{1}{\sqrt{2}}[1, 1]^T = -\sqrt{2} \\
x_2^T u_1 &= [0, 0] \cdot \frac{1}{\sqrt{2}}[1, 1]^T = 0 \\
x_3^T u_1 &= [1, 1] \cdot \frac{1}{\sqrt{2}}[1, 1]^T = \sqrt{2}
\end{aligned}
$$

### Step 5: Compute Projected Variance

$$
\text{Projected Variance} = \frac{1}{3} \left[ (-\sqrt{2})^2 + 0^2 + (\sqrt{2})^2 \right] = \frac{1}{3}(2 + 0 + 2) = \frac{4}{3}
$$

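This equals the **largest eigenvalue of C**, $\lambda_1 = \frac{4}{3}$, as the derivation predicts. The second eigenvalue $\lambda_2 = 0$ reflects that all three points lie on the line spanned by $u_1$, so there is no variance along $u_2$.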
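
The worked example can be reproduced with NumPy. This is a sketch under the same $\frac{1}{n}$ convention; the sign flip on `u1` is needed only because an eigenvector is determined up to sign, and the variable names are assumptions.

```python
import numpy as np

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])   # rows are x_1, x_2, x_3

x_bar = X.mean(axis=0)
Xc = X - x_bar
C = Xc.T @ Xc / len(X)                  # every entry is 2/3

eigvals, eigvecs = np.linalg.eigh(C)    # ascending: approximately [0, 4/3]
u1 = eigvecs[:, -1]                     # direction of the largest eigenvalue
if u1[0] < 0:                           # fix the sign to match the notes
    u1 = -u1

z = Xc @ u1                             # principal components along u1
projected_variance = np.mean(z ** 2)

print(C)
print(eigvals)
print(z, projected_variance)
```

Up to floating-point error, `z` should match $[-\sqrt{2}, 0, \sqrt{2}]$ and `projected_variance` should match $\frac{4}{3}$.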

---

## 🧾 Summary: PCA as Maximizing Variance

| Concept                  | Summary                                                                                 |
| ------------------------ | --------------------------------------------------------------------------------------- |
| **Goal**                 | Find the directions (principal directions) in which the data has maximum variance       |
| **How**                  | Maximize $u^T C u$ s.t. $u^T u = 1$                                                     |
| **Solution**             | Take eigenvectors of the covariance matrix $C$ corresponding to the largest eigenvalues |
| **Principal Directions** | Eigenvectors $u_1, u_2, ..., u_m$                                                       |
| **Principal Components** | Projections $x_i^T u_j$                                                                 |
| **Projected Variance**   | Variance along $u_j$ equals $\lambda_j$; the total over m directions is $\lambda_1 + \dots + \lambda_m$ |
| **Algorithm Output**     | Subspace of dimension m, variance-maximizing projection                                 |

---

## 🧠 Learning Outcomes

* ✅ Understood PCA as a method to maximize projected variance.
* ✅ Learned that the solution involves computing the **eigenvectors of the covariance matrix**.
* ✅ Understood that PCA outputs the **principal components** by projecting data onto **principal directions**.
* ✅ Verified the result through a **worked-out example**.
* ✅ Understood that this method is equivalent to **minimizing reconstruction error**, hence the same algorithm serves **two objectives**.