# 2.5.4 Dictionary Learning — Short Notes

## Idea
Represent each sample using **few** learned basis vectors (atoms). For a data matrix \(X \in \mathbb{R}^{m \times n}\) whose columns \(x_i\) are the samples, learn a dictionary \(D \in \mathbb{R}^{m \times k}\) (columns \(d_j\) are the atoms) and sparse codes \(A \in \mathbb{R}^{k \times n}\) (columns \(\alpha_i\)) so that:
$$
x_i \approx D\alpha_i, \quad \text{with } \alpha_i \text{ sparse.}
$$

## Objective
Minimize reconstruction error + sparsity:
$$
\min_{D, A}\ \tfrac{1}{2}\|X - DA\|_F^2 + \lambda \|A\|_1
\quad \text{s.t. } \|d_j\|_2 = 1 \ \forall j.
$$

- \( \|A\|_1 = \sum_{i,j}|A_{ij}| \) promotes sparsity.
- Unit-norm atoms avoid arbitrary scaling of \(D\) vs \(A\).

## Training (Alternating Minimization)

### 1) Sparse Coding (fix \(D\), solve \(A\))
For each column \(x_i\):
$$
\alpha_i \leftarrow \arg\min_\alpha \ \tfrac{1}{2}\|x_i - D\alpha\|_2^2 + \lambda \|\alpha\|_1.
$$
Solvers: coordinate descent (Lasso), ISTA/FISTA (proximal gradient), OMP (greedy, \(\ell_0\)-style).

**What this step does:** finds a **sparse** set of active atoms that reconstruct \(x_i\) well.
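
The sparse coding step can be sketched with ISTA, one of the solvers listed above. This is a minimal illustration rather than a tuned implementation: the function names, the fixed iteration count, and the toy data are assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam, n_iter=200):
    """Approximately solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term's gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)       # gradient of 0.5*||x - D a||^2
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

def lasso_objective(D, x, alpha, lam):
    return 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.sum(np.abs(alpha))

# Toy problem (illustrative): 50 unit-norm atoms in 20 dimensions, 3 active atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = D @ alpha_true
alpha_hat = ista(D, x, lam=0.1)
print("nonzeros:", np.count_nonzero(alpha_hat))
```

With step size \(1/L\), each ISTA iteration does not increase the objective, so the code is never worse than the all-zero starting point.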

### 2) Dictionary Update (fix \(A\), solve \(D\))
Solve:
$$
D \leftarrow \arg\min_D \ \|X - DA\|_F^2 \ \ \text{s.t. }\|d_j\|_2=1.
$$

Update one atom at a time using residuals. Let \(a_j^\top\) be row \(j\) of \(A\) (the coefficients with which each sample uses atom \(d_j\)), and
$$
R_j = X - \sum_{\ell\neq j} d_\ell a_\ell^\top.
$$
Then
$$
d_j \leftarrow \frac{R_j a_j}{\|R_j a_j\|_2}, \quad
\text{(normalize to unit norm)}.
$$

**What this step does:** refines each atom to best explain the **current** set of samples that use it.
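
One sweep of the atom-wise update might look like the sketch below. The function name, the `eps` guard, and the choice to leave an unused atom unchanged are assumptions for illustration; the notes do not specify how unused atoms are handled.

```python
import numpy as np

def update_dictionary(X, D, A, eps=1e-12):
    """One sweep of d_j <- R_j a_j / ||R_j a_j||_2 over all atoms."""
    D = D.copy()
    for j in range(D.shape[1]):
        a_j = A[j, :]                                  # row j of A
        # R_j = X - sum_{l != j} d_l a_l^T, computed by adding atom j's term back
        R_j = X - D @ A + np.outer(D[:, j], a_j)
        v = R_j @ a_j
        norm = np.linalg.norm(v)
        if norm > eps:                                 # assumption: skip atoms no sample uses
            D[:, j] = v / norm
    return D

# Toy check (illustrative data): the reconstruction error should not increase.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))
D0 = rng.standard_normal((10, 5))
D0 /= np.linalg.norm(D0, axis=0)
A = rng.standard_normal((5, 30)) * (rng.random((5, 30)) < 0.3)   # sparse-ish codes
D1 = update_dictionary(X, D0, A)
err_before = np.linalg.norm(X - D0 @ A) ** 2
err_after = np.linalg.norm(X - D1 @ A) ** 2
print(err_before, err_after)
```

Because each atom is replaced by the unit vector that best fits its residual with \(a_j\) held fixed, the Frobenius error cannot go up during the sweep.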

### 3) Iterate
Alternate 1) and 2) until objective stabilizes.

## Geometry & Comparison (1-liner)
- PCA: orthogonal bases, **dense** coefficients.
- Dictionary Learning: non-orthogonal bases, **sparse** coefficients → parts-based, selective reconstruction.

## Practical Tips
- Choose \(k\) (atoms) and \(\lambda\) (sparsity) via validation.
- Mini-batch variants scale to large \(n\).
- Initialize \(D\) with normalized random columns or k-means centroids.



# 2.5.5 Factor Analysis — Understanding Mean and Covariance

## Model
Each sample \( x_i \in \mathbb{R}^d \) is generated as:
$$
x_i = W z_i + \mu + \epsilon_i
$$

- \( W \in \mathbb{R}^{d \times k} \): factor loading matrix
- \( z_i \sim \mathcal{N}(0, I_k) \): latent factors (shared causes)
- \( \epsilon_i \sim \mathcal{N}(0, \Psi) \): feature-specific noise (diagonal covariance)
- \( \mu \in \mathbb{R}^d \): mean vector

The marginal distribution is:
$$
p(x_i) = \mathcal{N}(x_i \mid \mu,\, W W^\top + \Psi)
$$

---

## 1️⃣ Mean vs Covariance

| Term | Controls | Meaning |
|------|-----------|----------|
| **\(\mu\)** | Location | Center of the Gaussian cloud (one mean per feature) |
| **\(W W^\top\)** | Shared structure | Correlation between features via common latent factors |
| **\(\Psi\)** | Independent noise | Feature-specific variance (no correlation) |

So the **mean vector** sets the baseline (average value for each feature),
while \( W W^\top + \Psi \) defines the **shape** and **orientation** of variations *around that mean*.

---

## 2️⃣ Estimating the Mean

\(\mu\) is not learned through \(W\); its maximum-likelihood estimate is simply the data's empirical mean:
$$
\mu = \frac{1}{n} \sum_i x_i
$$
It tells us where each feature "starts" — e.g., one variable might average 100 while another averages 5.

---

## 3️⃣ Covariance Structure

Taking the covariance of the model:
$$
\text{Cov}(x_i) = \text{Cov}(W z_i + \epsilon_i)
= W\,\text{Cov}(z_i)\,W^\top + \text{Cov}(\epsilon_i)
= W W^\top + \Psi
$$

- \(W W^\top\): correlations due to shared latent factors
- \(\Psi\): diagonal matrix of independent noise

This decomposition says:
> "Total variability = shared structure + feature-specific noise."
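
The decomposition can be checked numerically by sampling from the model and comparing the empirical mean and covariance with \(\mu\) and \(W W^\top + \Psi\). The dimensions, parameter values, and sample size below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 4, 2, 200_000
W = rng.standard_normal((d, k))               # factor loadings (illustrative values)
psi = np.array([0.5, 1.0, 0.2, 0.8])          # diagonal of Psi
mu = np.array([100.0, 5.0, -3.0, 0.0])

Z = rng.standard_normal((n, k))               # latent factors z_i ~ N(0, I_k)
E = rng.standard_normal((n, d)) * np.sqrt(psi)  # noise eps_i ~ N(0, Psi)
X = Z @ W.T + mu + E                          # rows are samples x_i = W z_i + mu + eps_i

mu_hat = X.mean(axis=0)                       # empirical mean
Sigma_hat = np.cov(X, rowvar=False)           # empirical covariance
Sigma_model = W @ W.T + np.diag(psi)          # model covariance
max_err = np.max(np.abs(Sigma_hat - Sigma_model))
print("max covariance error:", max_err)
```

Changing `mu` shifts `mu_hat` but leaves `Sigma_hat` essentially unchanged, which is the separation between location and shape described in these notes.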

---

## 4️⃣ Geometric Intuition

A multivariate Gaussian is fully described by:
$$
x \sim \mathcal{N}(\mu,\, \Sigma)
$$
- **\(\mu\)** shifts the center of the ellipse (mean of the cloud)
- **\(\Sigma = W W^\top + \Psi\)** determines the shape and tilt (covariance structure)

So in 2D:
- If \(\mu = (5, 10)\), the ellipse is centered at (5, 10).
- \(W W^\top\) controls how features move together (tilt of the ellipse).
- \(\Psi\) controls individual spread per axis (roundness).

---

## ✅ Summary

- Mean and covariance are separate: \(\mu\) sets the *location*, \(W, \Psi\) set the *shape*.
- Two features can start large (high means) yet still move together (high covariance).
- The covariance term \(W W^\top\) models *how features vary together* through shared latent factors (co-variation, not one feature directly influencing another).


# 2.5.6 Independent Component Analysis (ICA)

Reference: https://danieljyc.github.io/2014/06/13/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A015-3--%E7%8B%AC%E7%AB%8B%E6%88%90%E5%88%86%E5%88%86%E6%9E%90ica%EF%BC%88independent-component-analysis%EF%BC%89/

## Idea
ICA aims to recover **independent source signals** from observed mixtures.
Given whitened data \( P \in \mathbb{R}^{n \times k} \) (from PCA),
ICA finds a transformation \( R \) such that:
$$
S = P R^\top
$$
and the new components \( S = [s_1, s_2, \dots, s_k] \) are **statistically independent**, i.e.
$$
p(s_1, s_2, ..., s_k) = \prod_i p(s_i)
$$

---

## PCA vs ICA

- **PCA** (with whitening) makes components **uncorrelated** with unit variance (Cov = I).
  For zero-mean components this means:
  $$
  E[p_i p_j] = 0 \quad \text{for } i \ne j
  $$
  → the *dot product* of the centered component vectors is 0.

- But **uncorrelated ≠ independent**.
  Orthogonality (dot product = 0) only removes **linear** relationships —
  not nonlinear dependencies.

### Example
Let:
$$
x \sim \text{Uniform}(-1,1), \quad y = x^2
$$
Then:
$$
E[xy] = E[x^3] = 0 \ \text{ and } \ E[x] = 0 \ \Rightarrow\ \text{Cov}(x,y) = E[xy] - E[x]\,E[y] = 0
$$
They’re **uncorrelated**, but clearly **not independent** — knowing \(x\) gives \(y\).
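
A quick simulation of this example makes the gap visible. The test functions \(f(x) = x^2\) and \(g(y) = y\) are one arbitrary choice that exposes the dependence; independence would require \(E[f(x)g(y)] = E[f(x)]\,E[g(y)]\) for every such pair.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)
y = x ** 2

# Covariance: close to 0, so x and y are uncorrelated.
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# Dependence check with f(x) = x^2, g(y) = y:
# E[x^2 y] - E[x^2] E[y] = E[x^4] - E[x^2]^2 = 1/5 - 1/9 = 4/45 for Uniform(-1, 1).
gap = np.mean(x**2 * y) - np.mean(x**2) * np.mean(y)

print("cov(x, y):", cov_xy)
print("dependence gap:", gap, "theory:", 4 / 45)
```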

> **So:**
> Dot product \(=0\) ⟹ uncorrelated
> but does **not** imply
> \( p(\text{vector}_1, \text{vector}_2) = p(\text{vector}_1)p(\text{vector}_2) \)

---

## What ICA Does

After PCA whitening (Cov = I), the data cloud is a **sphere** in feature space —
no correlation, equal variance in all directions.

ICA now searches for a **rotation** \( R \) (orthogonal matrix) such that
the new axes (independent components) are **statistically independent**:
$$
S = P R^\top
$$

Since rotation doesn’t change variance, ICA can freely “spin” this sphere until
each axis (component) becomes maximally **non-Gaussian** —
a proxy for statistical independence.

---

## Training Objective

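ICA finds \(R\) (equivalently \(W = R^\top\), whose columns \(w_i\) are the rows of \(R\)) that maximizes **non-Gaussianity**,
measured by contrast functions such as kurtosis or negentropy.

Typical FastICA update, where \(p\) is a whitened sample (a row of \(P\) written as a column vector), \(g\) is the derivative of the contrast function (e.g. \(g = \tanh\)), and expectations are averages over samples:
$$
w_i \leftarrow E[p\,g(w_i^\top p)] - E[g'(w_i^\top p)]\,w_i
$$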
then normalize and orthogonalize \(w_i\).
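
The update can be sketched as a small deflation-style FastICA. Several choices here are assumptions for the example and not fixed by the notes: \(g = \tanh\), Gram-Schmidt orthogonalization against previously found components, the convergence test, and the toy sources and mixing matrix `M`.

```python
import numpy as np

def whiten(X):
    """Center X (n x k) and whiten it so the sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs / np.sqrt(eigvals)

def fastica_deflation(P, n_iter=200, tol=1e-8, seed=0):
    """Estimate an orthogonal R (rows w_i) from whitened P, one component at a time."""
    rng = np.random.default_rng(seed)
    k = P.shape[1]
    R = np.zeros((k, k))
    for i in range(k):
        w = rng.standard_normal(k)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            u = P @ w                                   # w^T p for every sample
            g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2
            w_new = (P * g[:, None]).mean(axis=0) - g_prime.mean() * w
            w_new -= R[:i].T @ (R[:i] @ w_new)          # orthogonalize against found rows
            w_new /= np.linalg.norm(w_new)              # normalize
            converged = abs(abs(w_new @ w) - 1.0) < tol  # direction stable up to sign
            w = w_new
            if converged:
                break
        R[i] = w
    return R

# Toy mixture (illustrative): one uniform and one Laplace source.
rng = np.random.default_rng(1)
n = 5000
S_true = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
M = np.array([[1.0, 0.5], [0.3, 1.0]])                  # hypothetical mixing matrix
P = whiten(S_true @ M.T)
R = fastica_deflation(P)
S = P @ R.T                                             # recovered components
corr = np.abs(np.corrcoef(S.T, S_true.T)[:2, 2:])
print(np.round(corr, 2))
```

ICA recovers sources only up to order and sign, so the check compares absolute correlations instead of expecting `S` to equal `S_true`.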

---

## Summary

| Step | What happens | Purpose |
|------|---------------|----------|
| PCA | Decorrelates and whitens data (Cov = I) | Remove 2nd-order correlation |
| ICA | Rotates whitened data | Remove higher-order dependence |
| Goal | \(p(s_1, s_2, ...)=\prod_i p(s_i)\) | Achieve independence |

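- **PCA:** makes components uncorrelated (orthogonal after centering).
- **ICA:** seeks components that are as statistically independent as possible.

> ICA goes a step beyond PCA:
> even when the dot product between component vectors is 0,
> ICA looks for the rotation under which their **joint probability factorizes** as closely as possible.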


# 2.5.7 Non-negative Matrix Factorization (NMF)

## Concept
NMF factorizes a **non-negative** matrix \( X \) into two smaller non-negative matrices:
$$
X \approx W H
$$
where:
- \( W \): latent basis (parts)
- \( H \): activations (weights)
- all entries ≥ 0

This gives a **parts-based**, additive representation — features combine only by addition, not subtraction.

---

## Why Non-negative
- Prevents cancellation of effects (no “negative part” to offset a positive one).
- Encourages **sparse**, interpretable components (e.g., eyes + nose + mouth = face).

---

## Training Objective
Minimize reconstruction error:
$$
\min_{W,H \ge 0} \|X - WH\|_F^2
$$
using **multiplicative updates** that maintain non-negativity (\(\odot\) and the fraction bar denote element-wise multiplication and division):
$$
H \leftarrow H \odot \frac{W^\top X}{W^\top W H}, \quad
W \leftarrow W \odot \frac{X H^\top}{W H H^\top}
$$
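
A minimal sketch of these updates follows. The small constant `eps` in the denominators (to avoid division by zero), the random initialization, and the fixed iteration count are assumptions added for the example.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Factor non-negative X (m x n) as W (m x k) @ H (k x n) with multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)     # H <- H * (W^T X) / (W^T W H), elementwise
        W *= (X @ H.T) / (W @ H @ H.T + eps)     # W <- W * (X H^T) / (W H H^T), elementwise
        errors.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, errors

# Toy data (illustrative): an exactly rank-3 non-negative matrix.
rng = np.random.default_rng(1)
X = rng.random((6, 3)) @ rng.random((3, 8))
W, H, errors = nmf_multiplicative(X, k=3)
print("first error:", errors[0], "last error:", errors[-1])
```

Since every factor in the updates is non-negative, `W` and `H` stay non-negative without any explicit projection step.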

---

## Intuition
- PCA: combines features through positive and negative weights (global patterns)
- **NMF:** only positive combinations (local, additive “parts”)

| PCA | NMF |
|------|------|
| Components can cancel each other | Components add up |
| Orthogonal basis | Non-orthogonal basis |
| Focus: variance | Focus: additive interpretability |

---

# 2.5.8 Latent Dirichlet Allocation (LDA)

## Concept
LDA is a **probabilistic topic model** for text.
It assumes:
- Each **document** is a **mixture of topics**.
- Each **topic** is a **distribution over words**.

Example:
> “Cats play with yarn” → 70% *pets*, 30% *toys*.

---

## Generative Process
For each topic \(k\):
- Draw word distribution: \( \phi_k \sim \text{Dirichlet}(\beta) \)

For each document \(d\):
1. Draw topic mixture: \( \theta_d \sim \text{Dirichlet}(\alpha) \)
2. For each word:
   - Choose topic \( z_{dn} \sim \text{Multinomial}(\theta_d) \)
   - Choose word \( w_{dn} \sim \text{Multinomial}(\phi_{z_{dn}}) \)
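
The generative process can be simulated directly. In this sketch the symmetric scalar values for \(\alpha\) and \(\beta\), the fixed document length, and the function name are assumptions; the notes leave these unspecified.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha=0.5, beta=0.1, seed=0):
    """Sample word ids from the LDA generative process with symmetric Dirichlet priors."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)   # topic-word, (K, V)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)    # doc-topic, (D, K)
    docs, topics = [], []
    for d in range(n_docs):
        z = rng.choice(n_topics, size=doc_len, p=theta[d])          # topic per word
        w = np.array([rng.choice(vocab_size, p=phi[t]) for t in z])  # word given its topic
        docs.append(w)
        topics.append(z)
    return phi, theta, docs, topics

phi, theta, docs, topics = generate_corpus(n_docs=5, doc_len=50, n_topics=3, vocab_size=20)
print("topic mixture of document 0:", np.round(theta[0], 2))
print("first words of document 0:", docs[0][:10])
```

Small values of `alpha` and `beta` tend to give documents dominated by few topics and topics dominated by few words.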

---

## Learning
Given only observed words, LDA infers:
- \( \phi_k \): topic-word distributions
- \( \theta_d \): topic proportions per document
- \( z_{dn} \): topic assignment for each word

Typically via:
- **Variational Bayes** (optimization)
- **Gibbs Sampling** (sampling-based posterior estimation)
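
As an illustration of the sampling route, here is a sketch of a collapsed Gibbs sampler, one common variant in which \(\theta\) and \(\phi\) are integrated out and only the assignments \(z_{dn}\) are resampled. The symmetric scalar priors, the number of sweeps, the point estimates taken from the final counts, and the tiny corpus are assumptions for the example.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.5, beta=0.1, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))     # words in document d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))    # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                   # total words assigned to topic k
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]   # random initial topics
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            n_dk[d, t] += 1
            n_kw[t, w] += 1
            n_k[t] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                    # remove the current assignment from the counts
                n_dk[d, t] -= 1
                n_kw[t, w] -= 1
                n_k[t] -= 1
                # Conditional p(z = k | all other assignments, words), up to a constant
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                    # record the new assignment
                n_dk[d, t] += 1
                n_kw[t, w] += 1
                n_k[t] += 1
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi, z

# Tiny corpus (illustrative): words 0-2 and words 3-5 never co-occur in a document.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0], [3, 5, 4, 4]]
theta, phi, z = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(np.round(theta, 2))
```

Topic labels are arbitrary, so different seeds may swap which index corresponds to which group of words.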

---

## Intuition
- **NMF**: additive, deterministic parts of data.
- **LDA**: probabilistic, interpretable topics in text.

| NMF | LDA |
|-----|-----|
| Linear factorization \(X \approx WH\) | Probabilistic model \(p(w,z,\theta,\phi)\) |
| Non-negative constraints | Dirichlet priors |
| Deterministic optimization | Bayesian inference |
| Often used on term-frequency matrix | Used on bag-of-words corpus |

---

✅ **Key takeaway:**
- **NMF** → additive parts in numeric data.
- **LDA** → probabilistic topics in text data.

Both uncover latent structure: NMF through linear decomposition, LDA through probabilistic generative modeling.
