# 🧩 PCA (Original, Intuitive Version)

PCA looks for a set of **directions** (let’s call them \( v_1, v_2, \dots \)) such that
projecting the original data \( X \) onto the **line** spanned by each \( v \)
captures **as much variance** as possible.

---

### 🧠 Step-by-step intuition

1️⃣ **Assume there exists some direction \( v \)**
   such that projecting \( X \) onto it gives the **maximum variance** possible.
   This projection is just:
   $$
   y = Xv
   $$
   where \( X \in \mathbb{R}^{n \times d} \) (n samples, d features),
   and \( v \in \mathbb{R}^d \) is the direction vector we’re testing.

---

2️⃣ **Compute the variance along that direction**

   The variance of those projected values is:
   $$
   \text{Var}(Xv) = v^\top \Sigma v
   $$
   where
   $$
   \Sigma = \frac{1}{n-1} X_c^\top X_c
   $$
   is the covariance matrix of the centered data \( X_c \).

   Intuitively, \( v^\top \Sigma v \) tells us **how spread out the data looks** when we view it from direction \( v \).
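
The identity above is easy to check numerically. The sketch below uses made-up toy data and an arbitrary test direction (both are assumptions for illustration) and compares the variance of the projected values with \( v^\top \Sigma v \):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features with different spreads (assumed values).
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

X_c = X - X.mean(axis=0)                  # center each feature
Sigma = X_c.T @ X_c / (X.shape[0] - 1)    # sample covariance, d x d

v = np.array([1.0, 1.0, 0.0])
v /= np.linalg.norm(v)                    # unit-length test direction

var_projected = np.var(X @ v, ddof=1)     # variance of the 1-D projection
var_quadratic = v @ Sigma @ v             # same quantity via v^T Sigma v
print(var_projected, var_quadratic)
```

Note that `np.var` subtracts the mean of the projected values itself, which is why projecting the uncentered `X` still agrees with the covariance of the centered data.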

---

3️⃣ **Add a constraint so we’re only comparing directions, not magnitudes**

   We only care about **direction**, not scale,
   so we force the vector to have **unit length**:
   $$
   \|v\| = 1
   $$

   Then PCA becomes an optimization problem:
   $$
   \max_{\|v\| = 1} v^\top \Sigma v
   $$

---

4️⃣ **Solve that optimization → eigen decomposition**

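   Introduce a Lagrange multiplier \( \lambda \) for the constraint \( v^\top v = 1 \). Setting the gradient of \( v^\top \Sigma v - \lambda (v^\top v - 1) \) to zero gives \( 2\Sigma v = 2\lambda v \), so the candidate solutions are the **eigenvectors** and **eigenvalues** of \( \Sigma \):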

   $$
   \Sigma v = \lambda v
   $$

   - \( v \): direction (principal component)
   - \( \lambda \): how much variance the data has along that direction (since \( v^\top \Sigma v = \lambda\, v^\top v = \lambda \))

   The direction with the **largest eigenvalue** gives the **largest variance** when projecting data onto it.
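
As a rough sanity check of this claim, the sketch below (toy data, assumed for illustration) computes the eigendecomposition with `np.linalg.eigh` and compares the top eigenvalue with the variance along many random unit directions. None of the random directions should beat it:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
X_c = X - X.mean(axis=0)
Sigma = X_c.T @ X_c / (X.shape[0] - 1)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]          # re-sort to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]                         # first principal direction
best_random = 0.0
for _ in range(1000):
    v = rng.normal(size=4)
    v /= np.linalg.norm(v)                 # random unit direction
    best_random = max(best_random, v @ Sigma @ v)

print("lambda_1:", eigvals[0])
print("variance along v1:", v1 @ Sigma @ v1)
print("best of 1000 random unit directions:", best_random)
```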

---

5️⃣ **Get multiple principal directions**

   After the first one (\( v_1 \)),
   we find more directions (\( v_2, v_3, \dots \)) that:
   - are **orthogonal** to the earlier ones, and
   - capture the next highest variances.

   So overall:
   - \( v_1 \): 1st principal component (max variance)
   - \( v_2 \): 2nd principal component (next max variance)
   - \( \dots \)

---

6️⃣ **Projection and reconstruction**

   Once we have the top \( k \) eigenvectors \( V_k = [v_1, v_2, \dots, v_k] \),
   we can project the data into this lower-dimensional space:
   $$
   X_{\text{proj}} = X_c V_k
   $$
   Each column of \( X_{\text{proj}} \) is how the data looks along one principal axis.
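
A minimal sketch of projection and reconstruction, assuming toy data and \( k = 2 \) (both chosen only for illustration). The reconstruction adds the mean back and is lossy whenever \( k < d \):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
mu = X.mean(axis=0)
X_c = X - mu

Sigma = X_c.T @ X_c / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
V_k = eigvecs[:, :k]              # d x k, top-k principal directions
X_proj = X_c @ V_k                # n x k, coordinates along each axis
X_hat = X_proj @ V_k.T + mu       # back to d dimensions (lossy when k < d)

explained = eigvals[:k].sum() / eigvals.sum()
print("shape of X_proj:", X_proj.shape)
print("fraction of variance kept:", explained)
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```

The variance of each column of `X_proj` equals the corresponding eigenvalue, and the squared reconstruction error is exactly the variance carried by the discarded directions.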

---

### 🔍 TL;DR

So in short:

- We’re finding directions \( v \) that make the projection \( Xv \) have **maximum variance**.
- That variance is measured by \( v^\top \Sigma v \).
- With \( \|v\| = 1 \), maximizing it leads to the eigenvalue problem \( \Sigma v = \lambda v \).
- The top eigenvalues correspond to directions with the **largest data spread**.
- Those directions (eigenvectors) form the **principal components** of the data.


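If \( X \) isn’t centered, \( \frac{1}{n-1} X^\top X \) is no longer the covariance: it equals \( \Sigma + \frac{n}{n-1}\,\mu\mu^\top \), where \( \mu \) is the vector of feature means, so it mixes spread with mean position.
That means a large mean will dominate, and PCA would report a “big variance” just because the data is far from zero, even if it’s tightly clustered.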
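
The effect is easy to reproduce. The sketch below uses an assumed toy cluster (standard deviation 0.1, centered at (100, 100)) and compares the top eigenvalue with and without centering:

```python
import numpy as np

rng = np.random.default_rng(3)
# Tight cluster (std 0.1) located far from the origin, at (100, 100).
X = rng.normal(loc=100.0, scale=0.1, size=(500, 2))
n = X.shape[0]

second_moment = X.T @ X / (n - 1)          # what you get if you skip centering
X_c = X - X.mean(axis=0)
covariance = X_c.T @ X_c / (n - 1)         # the actual covariance

top_uncentered = np.linalg.eigvalsh(second_moment)[-1]
top_centered = np.linalg.eigvalsh(covariance)[-1]
# Top eigenvector of the uncentered matrix (last column, since eigh sorts ascending).
mean_direction = np.linalg.eigh(second_moment)[1][:, -1]

print("largest eigenvalue without centering:", top_uncentered)
print("largest eigenvalue with centering:   ", top_centered)
print("top direction without centering:", mean_direction)
```

Without centering, the leading "component" simply points from the origin toward the mean of the cluster, which says nothing about how the points vary around that mean.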

# 🎲 Probabilistic PCA

### 🧠 Idea

Probabilistic PCA (PPCA) views the data as being generated from a **Gaussian model** instead of just geometric projection.

Each data point \( x_i \in \mathbb{R}^d \) is modeled as:

$$
x_i \sim \mathcal{N}(\mu,\; C)
\quad \text{where} \quad
C = W W^\top + \sigma^2 I
$$

- \( W \): principal component loading matrix
- \( \sigma^2 I \): isotropic Gaussian noise
- \( \mu \): mean of the data

So data is assumed to come from a **low-dimensional subspace** (spanned by \( W \))
plus small Gaussian noise in all directions.

---

### 📊 Log-likelihood form

For each sample \( x_i \):

$$
\log p(x_i) = -\frac{1}{2}
\Big[
(x_i-\mu)^\top C^{-1} (x_i-\mu)
+ \log |C|
+ d \log (2\pi)
\Big]
$$

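This measures **how well** the sample fits the Gaussian model: points close to the subspace spanned by \( W \) (relative to the noise level \( \sigma^2 \)) get a higher value.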
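
The formula translates almost line by line into NumPy. In the sketch below, `ppca_log_likelihood` is a hypothetical helper name, and the values of `W`, `mu` and `sigma2` are made up rather than fitted:

```python
import numpy as np

def ppca_log_likelihood(X, mu, W, sigma2):
    """Per-sample log N(x | mu, W W^T + sigma2 * I) for the rows of X."""
    d = X.shape[1]
    C = W @ W.T + sigma2 * np.eye(d)           # model covariance, d x d
    diff = X - mu
    # Solve C z = diff^T instead of forming C^{-1} explicitly.
    maha = np.sum(diff * np.linalg.solve(C, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(C)           # log |C|, numerically stable
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

rng = np.random.default_rng(4)
d, k = 5, 2
W = rng.normal(size=(d, k))                    # hypothetical loadings
mu = np.zeros(d)
sigma2 = 0.1

# Draw samples from the model itself: x = W z + mu + noise.
Z = rng.normal(size=(1000, k))
X = Z @ W.T + mu + np.sqrt(sigma2) * rng.normal(size=(1000, d))

ll = ppca_log_likelihood(X, mu, W, sigma2)
print("average log-likelihood:", ll.mean())
```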

---

### ⚙️ In scikit-learn

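`PCA.score(X)` returns this log-likelihood **averaged over the samples** in `X` (`PCA.score_samples(X)` gives the per-sample values).
You can use it to compare models with different numbers of components in **cross-validation**:
a higher score on held-out data means the Gaussian model with that many components fits better. The score has to be computed on data the model was not fitted on (a train/test split or CV folds), because the training-set likelihood does not get worse as components are added.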
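
A minimal sketch of this model-selection loop, assuming scikit-learn is installed and using synthetic data with three latent factors (the data setup is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n, d, true_k = 500, 10, 3
# Synthetic data: 3 latent factors plus isotropic noise (assumed setup).
W = rng.normal(size=(d, true_k))
X = rng.normal(size=(n, true_k)) @ W.T + 0.5 * rng.normal(size=(n, d))

scores = {}
for k in range(1, d):
    # With no `scoring` argument, cross_val_score falls back to the estimator's
    # own score method, i.e. PCA.score on each held-out fold.
    scores[k] = cross_val_score(PCA(n_components=k), X, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best number of components:", best_k)
```

Because each fold is scored on samples the model never saw, adding components beyond the true structure should stop helping, which is what makes the comparison meaningful.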

---

### 💡 Intuition

- Regular PCA: *“Which directions explain variance best?”*
- Probabilistic PCA: *“Given those directions, how likely is this data point under the Gaussian model of that subspace?”*
- Each **row** (data point) is treated as one multivariate Gaussian sample —
  features are **jointly** modeled via the covariance \( C \),
  not multiplied per column.


# 🔗 Connection between PCA and SVD

### 🧩 1️⃣ Recall what PCA computes
PCA works on the **covariance matrix** of centered data \( X_c \in \mathbb{R}^{n \times d} \):

$$
\Sigma = \frac{1}{n-1} X_c^\top X_c
$$

and finds its eigenvectors and eigenvalues by solving:

$$
\Sigma v = \lambda v
$$

---

### ⚙️ 2️⃣ Now, decompose \( X_c \) using SVD

We can factorize the same centered data as:

$$
X_c = U S V^\top
$$

where
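- \( U \): left singular vectors (\( n \times n \), orthonormal columns)
- \( S \): \( n \times d \) matrix with the singular values \( s_1 \ge s_2 \ge \dots \ge 0 \) on its diagonal and zeros elsewhere
- \( V \): right singular vectors (\( d \times d \), orthonormal columns)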

---

### 🔗 3️⃣ Substitute into the covariance

$$
\Sigma = \frac{1}{n-1} X_c^\top X_c
       = \frac{1}{n-1} (V S^\top U^\top)(U S V^\top)
       = \frac{1}{n-1} V S^2 V^\top
$$

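From this (using \( U^\top U = I \), with \( S^2 \) as shorthand for the \( d \times d \) diagonal matrix \( S^\top S \)), we see: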
- **Eigenvectors of \( \Sigma \)** = columns of \( V \)
- **Eigenvalues of \( \Sigma \)** = \( s_i^2 / (n-1) \)
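
Both routes can be compared directly. The sketch below uses assumed toy data, runs the eigendecomposition and the thin SVD side by side, and checks the three correspondences (eigenvalues, directions up to sign, projected data):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]
X_c = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
Sigma = X_c.T @ X_c / (n - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Route 2: thin SVD of the centered data; singular values arrive in descending order.
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)

same_eigenvalues = np.allclose(eigvals, s**2 / (n - 1))
# Eigenvectors are only defined up to sign, so compare absolute values.
same_directions = np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-6)
same_projection = np.allclose(X_c @ Vt.T, U * s)     # X_c V = U S
print(same_eigenvalues, same_directions, same_projection)
```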

---

### 🧠 4️⃣ Interpretation

So PCA can be done directly via SVD:
- The **right singular vectors** \( V \) give the **principal directions** (PCA axes).
- The **singular values** \( s_i \) encode how much variance each component explains.

This is why implementations of PCA (e.g., in scikit-learn) typically use **SVD** under the hood:
it is numerically more stable, because it avoids explicitly forming \( X_c^\top X_c \), which squares the condition number of the data.

---

### ⚖️ 5️⃣ Summary

| Concept | PCA term | SVD term |
|:--|:--|:--|
| Directions (principal components) | Eigenvectors of \( \Sigma \) | Right singular vectors \( V \) |
| Variance explained | Eigenvalues \( \lambda_i \) | \( s_i^2 / (n-1) \) |
| Projected data | \( X_c V_k \) | \( U_k S_k \) |

In short:
> **PCA = SVD of the centered data**, just seen through the lens of variance instead of matrix geometry.


# ⚙️ Incremental PCA (IPCA)

### 🧩 1️⃣ Motivation

Regular PCA requires building the full covariance matrix:

$$
\Sigma = \frac{1}{n-1} X_c^\top X_c
$$

which needs *all data in memory*.
**Incremental PCA (IPCA)** instead updates PCA **batch by batch**, reusing the existing subspace instead of recomputing from scratch.

---

### 🧠 2️⃣ Key insight

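PCA learns a set of **orthonormal directions** \(V_{\text{old}} \in \mathbb{R}^{d \times k}\) that capture variance.
Because the columns are orthonormal:

$$
V_{\text{old}}^\top V_{\text{old}} = I_k
$$

When \(k < d\), \(V_{\text{old}}\) is not square and has no inverse; \(V_{\text{old}} V_{\text{old}}^\top\) is the orthogonal **projector** onto the PCA subspace, not the identity.
So projecting and then reconstructing recovers only the part of the data that lies in that subspace:

- **Projection (compress)**: \(Y = X V_{\text{old}}\)
- **Reconstruction (expand)**: \(\hat{X} = Y V_{\text{old}}^\top = X V_{\text{old}} V_{\text{old}}^\top\)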
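
To see what projecting and reconstructing do for a truncated basis (\(k < d\)), here is a small sketch. The basis `V` is an arbitrary orthonormal matrix obtained from a QR factorization, standing in for a real PCA basis:

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 5, 2
# Any d x k matrix with orthonormal columns will do; QR gives one (toy example).
V, _ = np.linalg.qr(rng.normal(size=(d, k)))

P = V @ V.T                                # d x d projector onto span(V)
print(np.allclose(V.T @ V, np.eye(k)))     # columns are orthonormal
print(np.allclose(P, np.eye(d)))           # P is not the identity when k < d
print(np.allclose(P @ P, P))               # projecting twice changes nothing

x_inside = V @ np.array([2.0, -1.0])       # a point that lies in the subspace
x_generic = rng.normal(size=d)             # a generic point
residual_inside = np.linalg.norm(x_inside - x_inside @ V @ V.T)
residual_generic = np.linalg.norm(x_generic - x_generic @ V @ V.T)
print(residual_inside, residual_generic)
```

Points inside the subspace come back unchanged, while a generic point loses whatever part of it is orthogonal to the subspace. That lost part is what the residual in the update steps measures.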

---

### ⚙️ 3️⃣ Step-by-step update

1️⃣ **Center the new batch**

$$
\tilde{X}_{\text{new}} = X_{\text{new}} - \mu_{\text{old}}
$$

Update mean:

$$
\mu_{\text{new}} = \frac{n_{\text{old}}\mu_{\text{old}} + m\,\bar{X}_{\text{new}}}{n_{\text{old}} + m}
$$
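
The mean update is a weighted average of the old mean and the batch mean. A minimal sketch, with `update_mean` as a hypothetical helper name and toy batches:

```python
import numpy as np

def update_mean(mu_old, n_old, X_new):
    """Combine a running mean over n_old samples with a new batch of m rows."""
    m = X_new.shape[0]
    mu_new = (n_old * mu_old + m * X_new.mean(axis=0)) / (n_old + m)
    return mu_new, n_old + m

rng = np.random.default_rng(8)
X_a = rng.normal(loc=1.0, size=(40, 3))
X_b = rng.normal(loc=3.0, size=(10, 3))

mu, n_seen = update_mean(X_a.mean(axis=0), X_a.shape[0], X_b)
print(mu)
print(np.vstack([X_a, X_b]).mean(axis=0))   # mean over all rows, for comparison
```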

---

2️⃣ **Project into old PCA space**

Compute how much of the new data is explained by the existing PCA subspace:

$$
Y = \tilde{X}_{\text{new}} V_{\text{old}}
$$

---

3️⃣ **Reconstruct explained part & find residuals**

Reconstruction (what old PCA can explain):

$$
\hat{X}_{\text{new}} = Y V_{\text{old}}^\top
$$

Residual (what’s new):

$$
R = \tilde{X}_{\text{new}} - \hat{X}_{\text{new}}
$$

So \(R\) captures **new variance directions** that weren’t in the previous PCA basis.
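
Steps 1 to 3 can be sketched in a few lines. The old basis and old mean below are random stand-ins (assumptions), not the result of a real fit:

```python
import numpy as np

rng = np.random.default_rng(9)
d, k, m = 6, 2, 20
V_old, _ = np.linalg.qr(rng.normal(size=(d, k)))   # stand-in for an old basis
mu_old = rng.normal(size=d)                        # stand-in for the old mean
X_new = rng.normal(size=(m, d)) + mu_old           # new batch of m samples

X_tilde = X_new - mu_old          # center with the old mean
Y = X_tilde @ V_old               # m x k coordinates in the old subspace
X_hat = Y @ V_old.T               # part the old basis can explain
R = X_tilde - X_hat               # part it cannot explain

print(np.allclose(R @ V_old, 0))  # residual is orthogonal to every old direction
# Squared norms split cleanly between the explained part and the residual.
print(np.sum(X_tilde**2), np.sum(X_hat**2) + np.sum(R**2))
```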

---

4️⃣ **Combine old information and new variance**

We now build a compact matrix \(M\) that merges:
- The **old PCA subspace**, scaled by how strong each direction was (\(S_{\text{old}}\))
- The **new residual variance** \(R\)

$$
M =
\begin{bmatrix}
S_{\text{old}} V_{\text{old}}^\top \\
R
\end{bmatrix}
$$

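💡 Here’s the key intuition:

> We are **stacking the old subspace (with scale) and the residuals**
> into one matrix,
> and we’ll find **new principal directions** from that stacked matrix.

Note that \(R\) alone only carries what the old basis missed. Practical implementations such as scikit-learn’s `IncrementalPCA` stack the whole centered batch (plus a mean-correction row) in place of \(R\), so that new variance along the old directions is counted as well.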

---

5️⃣ **Find new principal directions**

Run a **small SVD** on \(M\):

$$
M = U' S' V'^\top
$$

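and keep the top \(k\) components. Since \(M\) has \(d\) columns, the right singular vectors in \(V'\) already live in the original feature space, so the new basis is simply the first \(k\) columns of \(V'\):

$$
V_{\text{new}} = V'_k, \quad S_{\text{new}} = S'_k
$$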

This step re-orthogonalizes the space — it finds the dominant eigenvectors of the *combined variance structure* (old + new).
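
Below is a sketch of one full update step. It is not the residual-only matrix described above: it follows the stacking variant used, to my understanding, by scikit-learn's `IncrementalPCA`, which stacks the scaled old basis, the batch centered on its own mean, and one mean-correction row. The function name `ipca_update` and the toy data are assumptions. Keeping all \(d\) directions makes the update exact, so it can be checked against ordinary PCA on the combined data:

```python
import numpy as np

def ipca_update(V_old, s_old, mu_old, n_old, X_new, k):
    """One incremental step. V_old: d x k_old, s_old: k_old singular values."""
    m = X_new.shape[0]
    mu_batch = X_new.mean(axis=0)
    n_new = n_old + m
    mu_new = (n_old * mu_old + m * mu_batch) / n_new

    # Extra row accounting for the shift between the old mean and the batch mean.
    correction = np.sqrt(n_old * m / n_new) * (mu_old - mu_batch)
    M = np.vstack([
        s_old[:, None] * V_old.T,    # old subspace, scaled by its singular values
        X_new - mu_batch,            # new batch, centered on its own mean
        correction[None, :],
    ])
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k].T, s[:k], mu_new, n_new

rng = np.random.default_rng(10)
d = 4
X_a = rng.normal(size=(60, d)) @ rng.normal(size=(d, d))
X_b = rng.normal(size=(25, d)) @ rng.normal(size=(d, d)) + 2.0

# Initial fit on the first batch, keeping all d directions so the update is exact.
mu_a = X_a.mean(axis=0)
_, s_a, Vt_a = np.linalg.svd(X_a - mu_a, full_matrices=False)
V, s, mu, n = ipca_update(Vt_a.T, s_a, mu_a, X_a.shape[0], X_b, k=d)

# Reference: ordinary PCA via SVD on both batches at once.
X_all = np.vstack([X_a, X_b])
_, s_ref, Vt_ref = np.linalg.svd(X_all - X_all.mean(axis=0), full_matrices=False)
print(np.allclose(s, s_ref))
print(np.allclose(np.abs(V), np.abs(Vt_ref.T), atol=1e-6))
```

With \(k < d\) the old basis no longer carries all of the old variance, so the result becomes an approximation rather than an exact match.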

---

### 🧩 4️⃣ Intuition summary

- \(V_{\text{old}}\): old orthonormal PCA basis (so \(V_{\text{old}}^\top V_{\text{old}} = I\))
- \(S_{\text{old}}\): how strong each old direction was
- \(R\): new directions unexplained by old PCA

We merge \([S_{\text{old}}V_{\text{old}}^\top; R]\),
then find **new eigenvectors** in that merged space —
those represent the updated global directions of maximum variance.

---

### ⚖️ 5️⃣ Big picture

| Concept | Regular PCA | Incremental PCA |
|:--|:--|:--|
| Data | All at once | Mini-batches |
| Computation | Full eigen/SVD | Small local SVD updates |
| Basis | Fixed | Evolving |
| Main operation | Eigendecomposition of \(X_c^\top X_c\) | Combine \([S_{\text{old}}V_{\text{old}}^\top; R]\), re-SVD |

**In short:**
Incremental PCA continuously merges *old knowledge (scaled basis)* with *new information (residual variance)*,
then finds **new eigenvectors in that combined space**, keeping the PCA representation up to date.


# ⚡ Randomized PCA (RPCA)

### 🧩 1️⃣ Motivation
Full PCA or SVD on large, high-dimensional data (\(X \in \mathbb{R}^{n \times d}\)) is expensive.
**Randomized PCA** provides a *fast approximation* by using **random projections** to find approximately the same top-\(k\) subspace with much less computation.

---

### 🧠 2️⃣ Core idea
We generate a random matrix:

$$
\Omega \in \mathbb{R}^{d \times (k+p)}, \quad \Omega_{ij} \sim \mathcal{N}(0,1)
$$

and project the data:

$$
Y = X \Omega
$$

Each column of \(Y\) is a **random linear combination** of the features in \(X\).
These random combinations act like “random directions” in the feature space.

---

### 💡 3️⃣ Why it works
- High-variance directions in \(X\) dominate any random projection.
- With high probability, the subspace spanned by \(Y\) captures nearly the same variance as the top \(k\) PCA components.
- We then orthogonalize \(Y\):
  $$
  Q = \text{orth}(Y)
  $$
  and perform a small SVD on \(B = Q^\top X\).

This gives the approximate decomposition:
$$
X \approx Q (\tilde{U} S V^\top)
$$
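
The whole procedure fits in a few lines of NumPy. This is a basic sketch without refinements such as power iterations; `randomized_svd` is a hypothetical helper name, the oversampling default `p=5` is an assumed setting, and `np.linalg.qr` plays the role of \(\text{orth}(Y)\). The test data is exactly rank 5, so the approximation should be essentially exact:

```python
import numpy as np

def randomized_svd(X, k, p=5, seed=0):
    """Approximate top-k SVD of X using a Gaussian test matrix (k + p columns)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(size=(d, k + p))      # random directions in feature space
    Y = X @ Omega                            # n x (k+p) sketch of the column space
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis, plays the role of orth(Y)
    B = Q.T @ X                              # small (k+p) x d matrix
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                          # lift left vectors back to n dimensions
    return U[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(11)
# Exactly rank-5 data, so 5 components capture everything after centering.
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 50))
X_c = X - X.mean(axis=0)

U, s, Vt = randomized_svd(X_c, k=5)
s_exact = np.linalg.svd(X_c, compute_uv=False)[:5]
print(np.max(np.abs(s - s_exact)))
print(np.linalg.norm(X_c - (U * s) @ Vt))
```

On real data whose singular values decay slowly, the match is only approximate, and the oversampling parameter `p` trades a little extra work for accuracy.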

---

### ⚙️ 4️⃣ Intuition
Randomized PCA ≈
> “Look at the data from a few **random directions**,
> orthogonalize what you see,
> and perform SVD only in that smaller space.”

The projection is **genuinely random**: the weights in \(\Omega\) are sampled from a Gaussian distribution.
It still works because a random direction almost surely has a non-zero component along each top variance direction, and multiplying by \(X\) amplifies those components the most.

---

### ⚖️ 5️⃣ Summary

| Step | What happens | Why |
|:--|:--|:--|
| \( \Omega \) | Random weights | Choose random directions |
| \( Y = X\Omega \) | Project data | Compress while preserving structure |
| \( Q = \text{orth}(Y) \) | Orthonormal basis | Approximate main subspace |
| SVD on \(Q^\top X\) | Compute PCA in small space | Fast and accurate |

So **Randomized PCA** finds nearly the same principal directions as full PCA —
but by using *random projections* to drastically cut computation time.


# 🌿 Sparse PCA (SparsePCA & MiniBatchSparsePCA)

### 🧩 1️⃣ Motivation
Regular PCA components are **dense** — every feature contributes to every component.
That captures variance well but makes interpretation hard.

**Sparse PCA** adds a sparsity constraint so that each component uses only a **small subset of features**,
making them easier to interpret.

---

### 🧠 2️⃣ Core idea
Sparse PCA modifies the PCA objective by adding an ℓ₁ penalty (like Lasso):

$$
\max_{V} \; \text{Tr}(V^\top \Sigma V) - \alpha \|V\|_1 \quad \text{s.t.} \quad \|v_j\|_2 \le 1 \ \text{for each column } v_j
$$

- The first term keeps variance large (same as PCA).
- The ℓ₁ term forces many small coefficients in \(V\) to **zero** → sparse components.

So each component depends only on a few important features.
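
One way to see how an ℓ₁ penalty produces exact zeros is the soft-thresholding operator, the proximal operator of \(\alpha\|\cdot\|_1\) that Lasso-type solvers apply repeatedly. The sketch below only illustrates that mechanism on a made-up loading vector; it is not scikit-learn's `SparsePCA` solver:

```python
import numpy as np

def soft_threshold(v, alpha):
    """Shrink each entry toward zero by alpha; entries with |v_i| <= alpha become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

dense_loading = np.array([0.80, -0.05, 0.02, -0.55, 0.10, 0.01])  # made-up values
sparse_loading = soft_threshold(dense_loading, alpha=0.1)
print(sparse_loading)
print("non-zero features:", np.count_nonzero(sparse_loading))
```

Small coefficients are set exactly to zero and large ones shrink slightly, which is the "trade a bit of variance for clarity" effect in miniature.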

---

### ⚙️ 3️⃣ Implementation
It can also be seen as minimizing the reconstruction error with sparsity:

$$
\min_{U,V} \|X - UV^\top\|_F^2 + \alpha \|V\|_1 \quad \text{s.t.} \quad \|u_j\|_2 \le 1 \ \text{for each column } u_j \text{ of } U
$$

- \(V\): sparse loading vectors (principal directions)
- \(U\): projections of samples onto those directions

Two variants:
- **SparsePCA** → full batch (solved with LARS or coordinate descent in scikit-learn)
- **MiniBatchSparsePCA** → faster version using random mini-batches

---

### 💡 4️⃣ Summary

| Method | Feature usage | Interpretation | Use case |
|:--|:--|:--|:--|
| PCA | All features | Hard | Compression, visualization |
| Sparse PCA | Few features | Easy | Feature selection, interpretability |
| MiniBatchSparsePCA | Few features | Easy + Fast | Large datasets |

> Sparse PCA = PCA + Lasso
> Adds sparsity for interpretability, trading off a bit of variance for clarity.


# 🌈 Kernel PCA — Training & Prediction (Algebra Only)

## 1) Prediction (projection in terms of kernels)

Goal: express the projection of a point onto a principal direction **using only kernels**.

**Start (feature space projection):**
$$
z_i \;=\; v^\top \phi(x_i)
$$

**Represent the principal axis as a combo of training samples:**
$$
v \;=\; \sum_{j=1}^n \alpha_j \,\phi(x_j)
$$

**Substitute and linearity:**
$$
z_i \;=\; \Big(\sum_{j=1}^n \alpha_j \phi(x_j)\Big)^\top \phi(x_i)
\;=\; \sum_{j=1}^n \alpha_j \,\langle \phi(x_j), \phi(x_i)\rangle
$$

**Replace by kernel values:**
$$
\boxed{\,z_i \;=\; \sum_{j=1}^n \alpha_j\, K(x_j, x_i)\,}
$$
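
The boxed formula needs nothing but kernel evaluations. In the sketch below the coefficients `alpha` are random placeholders rather than a fitted model, and `gamma=0.5` is an assumed setting. With a linear kernel, \(\phi(x) = x\), so the result can be checked against an explicit \(v^\top x\):

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two vectors; gamma is an assumed setting."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def linear_kernel(a, b):
    """Plain inner product, i.e. phi(x) = x."""
    return float(a @ b)

def project(x, X_train, alpha, kernel):
    """z = sum_j alpha_j * K(x_j, x), using kernel evaluations only."""
    return sum(a_j * kernel(x_j, x) for a_j, x_j in zip(alpha, X_train))

rng = np.random.default_rng(12)
X_train = rng.normal(size=(8, 3))
alpha = rng.normal(size=8)          # placeholder coefficients, not a fitted model
x = rng.normal(size=3)

# With a linear kernel, v can be formed explicitly as a check.
v = alpha @ X_train                 # v = sum_j alpha_j x_j
z_linear = project(x, X_train, alpha, linear_kernel)
print(z_linear, v @ x)
# With an RBF kernel, v lives in an implicit feature space and is never formed.
z_rbf = project(x, X_train, alpha, rbf_kernel)
print(z_rbf)
```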


## 2) Training (derivation to \(K\alpha = n\lambda \alpha\))

We want eigenpairs of the **feature-space covariance**:
$$
C_\phi \;=\; \frac{1}{n}\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top,
\qquad C_\phi v \;=\; \lambda v.
$$

**(a) Expand \(C_\phi v\) explicitly:**
$$
\Big(\tfrac{1}{n}\sum_{i=1}^n \phi(x_i)\phi(x_i)^\top\Big) v
\;=\; \tfrac{1}{n}\sum_{i=1}^n \phi(x_i)\big(\phi(x_i)^\top v\big)
\;=\; \lambda v.
$$

**(b) Express \(v\) in the span of training samples:**
$$
v \;=\; \sum_{j=1}^n \alpha_j\, \phi(x_j).
$$

**(c) Plug into (a); use inner products:**
$$
\tfrac{1}{n}\sum_{i=1}^n \phi(x_i)\Big(\phi(x_i)^\top \sum_{j=1}^n \alpha_j \phi(x_j)\Big)
\;=\; \lambda \sum_{j=1}^n \alpha_j \phi(x_j).
$$

**(d) Pull the sum out and define the kernel:**
$$
\tfrac{1}{n}\sum_{i=1}^n \sum_{j=1}^n \alpha_j\, \phi(x_i)\,\langle \phi(x_i),\phi(x_j)\rangle
\;=\; \lambda \sum_{j=1}^n \alpha_j \phi(x_j).
$$

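Let \(K_{ij} = \langle \phi(x_i),\phi(x_j)\rangle\) and note both sides are linear combos of \(\{\phi(x_\ell)\}\).
Matching coefficients along each \(\phi(x_\ell)\) is sufficient: any \(\alpha\) that makes the coefficients agree also satisfies the equation above, even if the \(\phi(x_\ell)\) are not linearly independent.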

For the coefficient of \(\phi(x_\ell)\) on the LHS:
$$
\tfrac{1}{n} \sum_{j=1}^n \alpha_j\, K_{\ell j}.
$$

Set LHS = RHS coefficients for each \(\ell\):
$$
\tfrac{1}{n} \sum_{j=1}^n K_{\ell j} \alpha_j \;=\; \lambda \alpha_\ell
\quad \Longleftrightarrow \quad
\frac{1}{n}(K\alpha)_\ell \;=\; \lambda \alpha_\ell.
$$

**Vector form:**
$$
\boxed{\,K \alpha \;=\; n \lambda \alpha\,}
$$

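(After solving, rescale each eigenvector so that \(v\) has unit norm, i.e. \(\|v\|^2 = \alpha^\top K \alpha = 1\); in practice one uses the **centered** kernel \(K_c\) in place of \(K\) throughout, since the covariance \(C_\phi\) assumes the \(\phi(x_i)\) have zero mean.)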
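
Putting training together: center the kernel, solve the eigenproblem, rescale \(\alpha\). In the sketch below `kernel_pca_fit` is a hypothetical helper name. As a sanity check it uses a linear kernel on toy data, where kernel PCA should reproduce ordinary PCA projections up to sign:

```python
import numpy as np

def kernel_pca_fit(K, k):
    """Return coefficients alpha (n x k) and training projections Z (n x k)."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Center the kernel: equivalent to subtracting the mean of phi(x) in feature space.
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    evals, evecs = np.linalg.eigh(K_c)                  # ascending eigenvalues of K_c
    evals, evecs = evals[::-1][:k], evecs[:, ::-1][:, :k]
    # Eigenvalues of K_c equal n * lambda. Rescale so that v has unit norm,
    # i.e. alpha^T K_c alpha = 1.
    alpha = evecs / np.sqrt(evals)
    Z = K_c @ alpha                                     # z_i = sum_j alpha_j K_c(x_j, x_i)
    return alpha, Z

rng = np.random.default_rng(13)
X = rng.normal(size=(30, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Sanity check with a linear kernel, where kernel PCA should match ordinary PCA.
alpha, Z = kernel_pca_fit(X @ X.T, k=2)
X_c = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
print(np.allclose(np.abs(Z), np.abs(U[:, :2] * s[:2]), atol=1e-6))
```

Swapping `X @ X.T` for any other positive semi-definite kernel matrix gives the nonlinear version; only the construction of `K` changes.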
