# Linear Representation and Latent Variable Methods — A Unified Mathematical View

---

## 1. Principal Component Analysis (PCA)

### Mathematical Objective

Given a centered data matrix  
$$
X \in \mathbb{R}^{n \times d},
$$

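PCA seeks an orthonormal $W \in \mathbb{R}^{d \times k}$ that maximizes the total variance of the projected data: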

$$
\max_{W^\top W = I} \ \mathrm{Var}(XW)
$$

which, for centered data, is equivalent to:

$$
\max_{W^\top W = I} \ \mathbb{E}\left[\|W^\top x\|^2\right].
$$

### Core Mechanism

Covariance matrix:

$$
\Sigma = \mathbb{E}[x x^\top].
$$

PCA performs eigen-decomposition of $\Sigma$ and projects data onto eigenvectors corresponding to the largest eigenvalues.
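A minimal NumPy sketch of this mechanism, assuming toy data and a sample estimate of $\Sigma$ (the data, the mixing matrix, and the choice `k = 2` are illustrative assumptions, not part of the method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 500 samples, 3 correlated features.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])
X = X - X.mean(axis=0)            # PCA assumes centered data
n = X.shape[0]

# Sample estimate of Sigma = E[x x^T].
Sigma = X.T @ X / n

# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                # d x k, orthonormal columns
Z = X @ W                         # n x k projected data

# The variance along each component equals its eigenvalue.
proj_var = Z.var(axis=0)
```

The projected coordinates are uncorrelated, and their variances are the `k` largest eigenvalues.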

### What PCA Optimizes

Uses **second-order statistics only**.

Maximizes variance.

Minimizes reconstruction error:

$$
\min \ \|X - X W W^\top\|_F^2.
$$
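The two objectives are linked by a simple identity: the squared reconstruction error equals $n$ times the sum of the discarded eigenvalues. The sketch below checks this numerically on hypothetical toy data and compares against a random orthonormal basis (`W_rand` is an illustrative name):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
n = X.shape[0]

eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

k = 2
W = eigvecs[:, :k]

# Squared Frobenius reconstruction error of the rank-k PCA projection.
recon_error = np.linalg.norm(X - X @ W @ W.T, "fro") ** 2

# Variance that was thrown away, scaled back by n.
discarded = n * eigvals[k:].sum()

# Any other orthonormal basis of the same size should do no better.
W_rand, _ = np.linalg.qr(rng.normal(size=(4, k)))
rand_error = np.linalg.norm(X - X @ W_rand @ W_rand.T, "fro") ** 2
```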

### Statistical Interpretation

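Best rank-$k$ linear reconstruction in the mean-squared-error sense; if the data are jointly Gaussian, the uncorrelated components are also independent.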

Equivalent to the Karhunen–Loève Transform.

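Computed in practice via the Singular Value Decomposition (SVD) of the centered data matrix $X$: the right singular vectors are the principal directions.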
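The link to the SVD can be checked directly. In this sketch (toy data assumed), the squared singular values of the centered matrix divided by $n$ match the covariance eigenvalues, and the right singular vectors match the eigenvectors up to sign:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) * np.array([2.0, 1.0, 0.3])
X = X - X.mean(axis=0)
n = X.shape[0]

# Route 1: eigen-decomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: thin SVD of the centered data, X = U S V^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = S ** 2 / n

# Directions agree up to sign, so compare absolute inner products.
alignment = np.abs(np.sum(eigvecs * Vt.T, axis=0))
```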

### Strengths

Orthogonal, ordered components.

Fast, stable, deterministic.

Excellent for compression and denoising.

### Limitations

Blind to non-Gaussian structure.

Components are uncorrelated, not independent.

No class awareness.

---

## 2. Independent Component Analysis (ICA)

### Generative Model

$$
x = A s
$$

where $s$ is a vector of statistically independent latent sources and $A$ is an unknown mixing matrix.

Goal: estimate an unmixing matrix $W \approx A^{-1}$ so that

$$
\hat{s} = W x.
$$

### Mathematical Objective

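Maximize the non-Gaussianity of the estimated sources. By the central limit theorem, a mixture of independent sources tends to be closer to Gaussian than the sources themselves, so the least Gaussian projections are the best candidates for the original sources.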

Common criteria include kurtosis and negentropy.

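Negentropy formulation, where $H$ is differential entropy and $s_{\text{gauss}}$ is a Gaussian variable with the same covariance as $s$: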

$$
J(s) = H(s_{\text{gauss}}) - H(s).
$$
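Negentropy requires an entropy estimate, so this sketch uses the simpler criterion named above, excess kurtosis, to illustrate what "non-Gaussianity" measures. The distributions and sample size are illustrative assumptions. In theory the excess kurtosis is 0 for a Gaussian, $-1.2$ for a uniform variable, and $3$ for a Laplace variable:

```python
import numpy as np

def excess_kurtosis(s):
    """Sample excess kurtosis: E[z^4] - 3 for standardized z."""
    z = (s - s.mean()) / s.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(3)
n = 200_000

k_gauss = excess_kurtosis(rng.normal(size=n))
k_uniform = excess_kurtosis(rng.uniform(-1, 1, size=n))
k_laplace = excess_kurtosis(rng.laplace(size=n))

# Summing two independent uniform variables moves the result toward
# Gaussianity: the mixture has excess kurtosis closer to zero.
k_mix = excess_kurtosis(rng.uniform(-1, 1, size=n) + rng.uniform(-1, 1, size=n))
```

The last line is the reason non-Gaussianity works as a contrast function: mixing pushes signals toward the Gaussian, and unmixing pushes them away.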

### Core Insight

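Gaussian sources cannot be separated: after whitening, any rotation of independent Gaussian variables is again independent and Gaussian, so at most one source may be Gaussian.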

This explains why PCA cannot solve ICA problems, although ICA commonly uses PCA whitening as a preprocessing step.
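A toy two-source sketch of this two-stage view, assuming uniform sources and a hypothetical mixing matrix `A`. PCA whitening removes second-order structure, and a brute-force search over rotation angles (a stand-in for a real ICA optimizer such as FastICA) then maximizes non-Gaussianity:

```python
import numpy as np

def excess_kurtosis(Y):
    """Column-wise sample excess kurtosis."""
    z = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return np.mean(z ** 4, axis=0) - 3.0

def rotate(Z, theta):
    """Rotate the rows of Z by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return Z @ R.T

rng = np.random.default_rng(4)
n = 5000

# Two independent, unit-variance, non-Gaussian (uniform) sources.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # hypothetical mixing matrix
X = S @ A.T                           # each row is x = A s

# Step 1: PCA whitening (uses second-order statistics only).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
Z = Xc @ eigvecs / np.sqrt(eigvals)   # covariance of Z is the identity

# Step 2: pick the rotation that maximizes total non-Gaussianity.
thetas = np.linspace(0.0, np.pi / 2, 181)
scores = [np.abs(excess_kurtosis(rotate(Z, t))).sum() for t in thetas]
Y = rotate(Z, thetas[int(np.argmax(scores))])

# Absolute correlation between recovered components and true sources.
corr = np.abs(np.corrcoef(Y.T, S.T)[:2, 2:])
```

Each recovered component should correlate strongly with exactly one source. Sign and order are arbitrary, which is why the comparison uses absolute correlations.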

### What ICA Optimizes

Higher-order statistics.

Statistical independence (stronger than uncorrelatedness).

### Statistical Interpretation

Blind source separation.

Maximum likelihood estimation under non-Gaussian priors.

### Strengths

Recovers meaningful physical sources.

Works where PCA fails (EEG, audio, finance).

### Limitations

No intrinsic ordering of components.

Sensitive to noise.

Requires non-Gaussianity.

Identifiable only up to scale and permutation.

---

## 3. Linear Discriminant Analysis (LDA)

### Supervised Setting

Data with labels:

$$
y \in \{1, \ldots, C\}.
$$

### Objective Function

Maximize class separability:

$$
\max_{w} \ \frac{w^\top S_B w}{w^\top S_W w}
$$

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix.

### Core Mechanism

Solve the generalized eigenvalue problem:

$$
S_B w = \lambda S_W w.
$$
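One way to solve this problem with NumPy alone, sketched under the assumption that $S_W$ is positive definite: factor $S_W = L L^\top$ by Cholesky and reduce to an ordinary symmetric eigenproblem. The toy classes, sizes, and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
C, d, n_per_class = 3, 4, 100

# Hypothetical toy data: C Gaussian classes sharing one covariance.
means = rng.normal(scale=3.0, size=(C, d))
X = np.vstack([rng.normal(size=(n_per_class, d)) + m for m in means])
y = np.repeat(np.arange(C), n_per_class)

overall_mean = X.mean(axis=0)
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in range(C):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)               # within-class scatter
    diff = (mc - overall_mean)[:, None]
    S_B += len(Xc) * (diff @ diff.T)             # between-class scatter

# With S_W = L L^T, the problem S_B w = lambda S_W w becomes the symmetric
# problem (L^{-1} S_B L^{-T}) u = lambda u with w = L^{-T} u.
L = np.linalg.cholesky(S_W)
L_inv = np.linalg.inv(L)
eigvals, U = np.linalg.eigh(L_inv @ S_B @ L_inv.T)
eigvals, U = eigvals[::-1], U[:, ::-1]           # descending order
eigvecs = L_inv.T @ U

W = eigvecs[:, : C - 1]      # S_B has rank at most C - 1
Z = X @ W                    # projected data, shape (n, C - 1)
```

Only the first $C - 1$ eigenvalues are meaningfully non-zero here; the rest vanish up to round-off.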

### Dimensionality Constraint

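$S_B$ is built from $C$ class means that sum (weighted by class size) to the overall mean, so its rank is at most $C - 1$ and the projection has at most that many useful dimensions:

$$
\text{dim} \le C - 1.
$$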

### Statistical Interpretation

Optimal Bayes classifier when class-conditional distributions are Gaussian with equal covariance.

Equivalent to Fisher’s discriminant.

### Strengths

Explicitly class-aware.

Excellent for classification pipelines.

Interpretable projection directions.

### Limitations

Requires labeled data.

Assumes homoscedastic Gaussian classes.

Poor performance in high-dimensional, low-sample regimes.

---

## 4. Canonical Correlation Analysis (CCA)

### Objective

Given two views:

$$
X \quad \text{and} \quad Y,
$$

CCA solves:

$$
\max_{w_x, w_y} \ \mathrm{corr}(X w_x, Y w_y).
$$
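A common way to compute this, sketched here with NumPy on hypothetical paired data: whiten each view, then take the SVD of the whitened cross-covariance. The singular values are the canonical correlations. The helper `inv_sqrt` and the toy views are assumptions for illustration:

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(6)
n = 2000
z = rng.normal(size=n)       # shared latent signal (toy assumption)

# Two views that share z in one coordinate each, plus unrelated noise.
X = np.column_stack([z + 0.3 * rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n),
                     -z + 0.3 * rng.normal(size=n)])

Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
Cxx, Cyy, Cxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n

# Whiten each view, then decompose the whitened cross-covariance.
Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
U, rho, Vt = np.linalg.svd(Kx @ Cxy @ Ky)

# First pair of canonical directions in the original coordinates.
w_x, w_y = Kx @ U[:, 0], Ky @ Vt[0]
first_corr = np.corrcoef(Xc @ w_x, Yc @ w_y)[0, 1]
```

The empirical correlation of the two projections matches the leading singular value `rho[0]`.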

### What It Optimizes

Shared variance across datasets.

Does not maximize variance within each dataset independently.

### Use Cases

Multimodal learning.

Audio–video alignment.

Text–image correspondence.

---

## 5. Factor Analysis (FA)

### Generative Model

$$
x = \Lambda z + \epsilon
$$

where:

$$
z \sim \mathcal{N}(0, I),
$$

$$
\epsilon \sim \mathcal{N}(0, \Psi),
$$

and $\Psi$ is a diagonal noise covariance matrix.
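The model implies the marginal covariance $\mathrm{Cov}(x) = \Lambda \Lambda^\top + \Psi$. The sketch below samples from the generative model with hypothetical loadings and noise variances and compares the sample covariance against that expression:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 200_000, 4, 2

Lambda = np.array([[1.0, 0.0],
                   [0.8, 0.3],
                   [0.0, 1.2],
                   [0.5, -0.7]])                # hypothetical loadings
psi = np.array([0.1, 0.5, 0.2, 1.0])            # diagonal of Psi

z = rng.normal(size=(n, k))                     # z ~ N(0, I)
eps = rng.normal(size=(n, d)) * np.sqrt(psi)    # eps ~ N(0, Psi)
X = z @ Lambda.T + eps                          # each row is x = Lambda z + eps

model_cov = Lambda @ Lambda.T + np.diag(psi)
sample_cov = np.cov(X, rowvar=False)
max_gap = np.abs(sample_cov - model_cov).max()  # shrinks as n grows
```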

### Key Difference from PCA

Probabilistic PCA assumes isotropic noise, $\Psi = \sigma^2 I$.

FA models dimension-specific noise explicitly.

### Statistical Interpretation

Probabilistic latent variable model.

Direct ancestor of Variational Autoencoders.

---

## Unified Comparison Table

| Method | Supervised | Objective | Statistics Used | Orthogonal | Probabilistic | Typical Use |
|------|------------|-----------|-----------------|------------|---------------|-------------|
| PCA | No | Maximize variance | Second-order | Yes | Yes (Gaussian) | Compression, denoising |
| ICA | No | Independence | Higher-order | No | Yes | Source separation |
| LDA | Yes | Class separation | Second-order + labels | No | Yes | Classification |
| CCA | No | Cross-correlation | Second-order | No | Yes | Multiview learning |
| FA | No | Likelihood maximization | Second-order + noise | No | Yes | Latent modeling |

---

## Signal Processing Perspective (Critical)

| SSP Concept | PCA | ICA |
|------------|-----|-----|
| Noise removal | Yes | Limited |
| Source separation | No | Yes |
| Whitening | Decorrelates (whitens after rescaling) | Preprocessing step |
| Identifiability | Yes | Partial |
| Physical meaning | Weak | Strong |

---

## Geometric Interpretation

PCA rotates coordinate axes to align with the principal axes of a data ellipsoid.

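ICA, after whitening, rotates the axes so that the projections are as statistically independent as possible.

LDA applies a linear, generally non-orthogonal, projection that maximizes separation between class centroids relative to within-class scatter.

CCA finds a pair of linear projections, one per view, whose outputs are maximally correlated.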

---

## Modern Extensions

| Classical Method | Modern Descendant |
|-----------------|------------------|
| PCA | Kernel PCA, Autoencoders |
| ICA | Nonlinear ICA, Contrastive Learning |
| LDA | Metric learning, Fisher networks |
| FA | Variational Autoencoders, Diffusion latents |

---

## Final Mental Model

PCA sees shape.

ICA hears voices.

LDA sees labels.

CCA sees agreement.

FA sees hidden causes.


# Dimensionality Reduction and Manifold Visualization — PCA, t-SNE, and UMAP

---

## 1. Principal Component Analysis (PCA)

### Core Nature

Linear, global, deterministic.

### Mathematical Objective

Given centered data matrix $X$:

$$
\max_{W^\top W = I} \ \mathrm{Var}(XW)
$$

Equivalently:

$$
\min \ \|X - X W W^\top\|_F^2.
$$

### What PCA Preserves

Global variance.

Euclidean geometry.

Large-scale structure.

### What PCA Ignores

Local neighborhoods.

Nonlinear manifolds.

Cluster separability.

### Statistical Interpretation

Best rank-$k$ linear reconstruction in the mean-squared-error sense.

Eigen-decomposition of the covariance matrix.

Computed in practice via the Singular Value Decomposition (SVD) of the centered data matrix.

### Strengths

Interpretable axes.

Stable and fast.

Scales well.

Reusable embedding (out-of-sample extension is trivial).
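Out-of-sample extension amounts to storing the training mean and the projection matrix. A minimal sketch, with `fit_pca` and `transform` as hypothetical helper names and random toy data:

```python
import numpy as np

def fit_pca(X, k):
    """Return the training mean and the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def transform(X_new, mean, W):
    """Embed unseen rows with the stored mean and projection matrix."""
    return (X_new - mean) @ W

rng = np.random.default_rng(8)
X_train = rng.normal(size=(100, 5))
X_new = rng.normal(size=(10, 5))

mean, W = fit_pca(X_train, k=2)
Z_train = transform(X_train, mean, W)
Z_new = transform(X_new, mean, W)     # no refitting needed
```

New points are centered with the training mean, not their own, so that they land in the same coordinate system as the training embedding.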

### Weaknesses

Fails on curved manifolds.

Poor for visualization of complex clusters.

---

## 2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

### Core Nature

Nonlinear, local, probabilistic.

### Fundamental Idea

Preserve pairwise neighborhood similarity rather than global geometry.

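High-dimensional similarity, defined as a conditional probability and then symmetrized over the $n$ points:

$$
p_{j \mid i} \propto \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\right), \qquad p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}.
$$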

Low-dimensional similarity using Student-$t$ distribution:

$$
q_{ij} \propto \frac{1}{1 + \|y_i - y_j\|^2}.
$$

### Objective Function

Minimize Kullback–Leibler divergence:

$$
\mathrm{KL}(P \| Q) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
$$
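The objective can be evaluated for any candidate embedding. This sketch makes one loud simplification: it uses a single fixed `sigma` instead of the per-point, perplexity-based search for $\sigma_i$ used by real t-SNE. The function names and toy clusters are illustrative, and no optimization is performed:

```python
import numpy as np

def sq_dists(A):
    """Pairwise squared Euclidean distances between rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def high_dim_P(X, sigma=1.0):
    """Symmetrized Gaussian affinities (fixed sigma: a simplifying assumption)."""
    logits = -sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)       # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * X.shape[0])

def low_dim_Q(Y):
    """Student-t (one degree of freedom) affinities, normalized over all pairs."""
    num = 1.0 / (1.0 + sq_dists(Y))
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

def kl_divergence(P, Q):
    """KL(P || Q), summed over pairs with non-zero p_ij."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(9)
# Two well-separated clusters in 5 dimensions (toy data).
X = np.vstack([rng.normal(size=(20, 5)), rng.normal(size=(20, 5)) + 10.0])

Y_good = X[:, :2]                    # keeps each cluster together
Y_bad = rng.permutation(Y_good)      # same points, neighbors scrambled

P = high_dim_P(X)
kl_good = kl_divergence(P, low_dim_Q(Y_good))
kl_bad = kl_divergence(P, low_dim_Q(Y_bad))
```

An embedding that keeps high-dimensional neighbors together scores a lower divergence than one that scatters them, which is the quantity gradient descent drives down in the full algorithm.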

### What t-SNE Preserves

Local neighborhoods.

Cluster tightness.

### What t-SNE Destroys

Global distances.

Relative cluster positions.

Density information.

### Critical Properties

Non-parametric.

Non-invertible.

Stochastic.

No meaningful axes.

### Common Misinterpretation (Very Important)

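Distances between clusters in a t-SNE plot, and the apparent sizes of clusters, are generally not meaningful; they depend heavily on the perplexity setting and the random initialization.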

---

## 3. UMAP (Uniform Manifold Approximation and Projection)

### Core Nature

Nonlinear, topological, manifold-based.

### Mathematical Foundation

UMAP is grounded in:

Riemannian geometry.

Algebraic topology.

Fuzzy simplicial sets.

### Key Assumptions

Data lies on a low-dimensional manifold.

The manifold is locally connected.

Local distances are meaningful.

### High-Level Objective

Construct a fuzzy graph in high-dimensional space.

Construct a fuzzy graph in low-dimensional space.

Minimize the cross-entropy between them:

$$
\min \ \mathrm{CE}(\text{Graph}_{\text{high}}, \text{Graph}_{\text{low}}).
$$
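For edge membership strengths $w^{h}_{e}$ (high-dimensional graph) and $w^{l}_{e}$ (low-dimensional graph), the fuzzy-set cross-entropy is commonly written as $\sum_e \left[ w^{h}_{e} \log \frac{w^{h}_{e}}{w^{l}_{e}} + (1 - w^{h}_{e}) \log \frac{1 - w^{h}_{e}}{1 - w^{l}_{e}} \right]$. A minimal sketch of that loss, with hypothetical membership values and no graph construction or optimization:

```python
import numpy as np

def fuzzy_cross_entropy(w_high, w_low, eps=1e-12):
    """Cross-entropy between two arrays of edge memberships in [0, 1]."""
    w_high = np.clip(w_high, eps, 1.0 - eps)
    w_low = np.clip(w_low, eps, 1.0 - eps)
    # First term rewards keeping true neighbors close (attraction).
    attract = w_high * np.log(w_high / w_low)
    # Second term rewards keeping non-neighbors apart (repulsion).
    repel = (1.0 - w_high) * np.log((1.0 - w_high) / (1.0 - w_low))
    return float(np.sum(attract + repel))

# Hypothetical membership strengths for four candidate edges.
w_high = np.array([0.9, 0.8, 0.1, 0.05])
w_match = w_high.copy()
w_mismatch = np.array([0.1, 0.2, 0.9, 0.95])

ce_match = fuzzy_cross_entropy(w_high, w_match)        # graphs agree
ce_mismatch = fuzzy_cross_entropy(w_high, w_mismatch)  # graphs disagree
```

The repulsion term is one reason UMAP tends to retain more global layout than t-SNE: pairs that are not neighbors in the input are explicitly penalized for being close in the embedding.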

### What UMAP Preserves

Local neighborhoods.

Some global structure.

Manifold continuity.

### Compared to t-SNE

More stable.

Better global layout.

Faster.

Supports out-of-sample transform.

---

## Core Differences (Intuition First)

| Aspect | PCA | t-SNE | UMAP |
|------|-----|-------|------|
| Type | Linear | Nonlinear | Nonlinear |
| Geometry | Euclidean | Probabilistic | Topological |
| Focus | Global | Local | Local + global |
| Axes meaningful | Yes | No | No |
| Clusters | Weak | Strong | Strong |
| Global distances | Preserved | Destroyed | Partially preserved |
| Deterministic | Yes | No | Mostly |
| Scales to large data | Yes | Poor | Good |

---

## Mathematical Comparison (Deeper)

| Property | PCA | t-SNE | UMAP |
|--------|-----|-------|------|
| Objective | Variance maximization | KL divergence | Cross-entropy |
| Distance model | Euclidean | Probabilistic | Graph-based |
| Linear map | Yes | No | No |
| Probabilistic | Gaussian | Yes | Fuzzy |
| Manifold-aware | No | Implicit | Explicit |
| Out-of-sample | Trivial | No | Yes |

---

## Geometry Perspective

PCA fits a flat plane through the data.

t-SNE preserves which points are close to each other.

UMAP preserves how local neighborhoods are connected across the manifold.

---

## When to Use Which (Correctly)

### Use PCA When

You need compression.

You care about variance.

You want interpretable dimensions.

You plan to feed data into another model.

### Use t-SNE When

You want exploratory visualization.

You care only about local clusters.

The dataset is not too large.

You accept instability.

### Use UMAP When

You want visualization with structural fidelity.

You want speed and scalability.

You want repeatability.

You may need to embed new data later.

---

## A Crucial Best Practice

Avoid running t-SNE or UMAP directly on raw, very high-dimensional data when a PCA step is practical.

Typical pipeline:

$$
\text{Raw} \ \rightarrow \ \text{PCA (30–100D)} \ \rightarrow \ \text{t-SNE / UMAP}.
$$
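The PCA stage of this pipeline can be sketched with NumPy alone. The helper name `pca_reduce`, the synthetic data, and the choice of 50 components are assumptions for illustration; the reduced matrix would then be handed to whichever t-SNE or UMAP implementation is in use:

```python
import numpy as np

def pca_reduce(X, n_components=50):
    """Center X and project it onto its top principal directions (thin SVD)."""
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Fraction of total variance retained by the kept components.
    explained = float((S[:n_components] ** 2).sum() / (S ** 2).sum())
    return Xc @ Vt[:n_components].T, explained

rng = np.random.default_rng(10)

# Hypothetical raw data: a 10-dimensional signal embedded in 500 noisy dimensions.
latent = rng.normal(size=(300, 10))
X_raw = latent @ rng.normal(size=(10, 500)) + 0.1 * rng.normal(size=(300, 500))

X_reduced, explained = pca_reduce(X_raw, n_components=50)
# X_reduced (300 x 50) is the input for the subsequent t-SNE / UMAP step.
```

Checking `explained` before moving on is a cheap way to confirm that the chosen number of components has not discarded real structure.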

### Why This Matters

Noise removal.

Better distance estimates.

Faster optimization.

Improved manifold learning.

---

## Signal Processing and Statistical Insight

| SSP View | PCA | t-SNE | UMAP |
|--------|-----|-------|------|
| Noise suppression | Strong | Weak | Moderate |
| Density modeling | No | No (distorts density) | No (distorts density) |
| Stochastic model | Gaussian | Explicit | Implicit |
| Identifiability | High | Low | Medium |

---

## One-Sentence Mental Models

PCA: “Show me the directions of maximum energy.”

t-SNE: “Put neighbors together, no matter the cost.”

UMAP: “Preserve the manifold’s shape and connectivity.”

---

## Final Warning (Important)

t-SNE and UMAP are primarily visualization tools; treat their embeddings as downstream features only with caution.

Do not interpret distances, densities, or axes causally.


# Comprehensive Comparison Table of Dimensionality Reduction and Representation Methods

| Aspect | PCA | ICA | LDA | CCA | Factor Analysis | t-SNE | UMAP |
|------|-----|-----|-----|-----|----------------|-------|------|
| **Full Name** | Principal Component Analysis | Independent Component Analysis | Linear Discriminant Analysis | Canonical Correlation Analysis | Factor Analysis | t-Distributed Stochastic Neighbor Embedding | Uniform Manifold Approximation and Projection |
| **Learning Type** | Unsupervised | Unsupervised | Supervised | Unsupervised (paired data) | Unsupervised | Unsupervised | Unsupervised |
| **Primary Goal** | Maximize variance | Maximize independence | Maximize class separability | Maximize cross-correlation | Explain covariance with latent factors | Preserve local neighborhoods | Preserve manifold topology |
| **Data Assumption** | Linear structure | Linear mixing of sources | Gaussian classes, equal covariance | Paired views | Latent variables + noise | Manifold, local similarity | Manifold, local connectivity |
| **Uses Labels** | No | No | Yes | No | No | No | No |
| **Linear / Nonlinear** | Linear | Linear | Linear | Linear | Linear | Nonlinear | Nonlinear |
| **Core Mathematics** | Eigen-decomposition / SVD | Higher-order statistics | Generalized eigenproblem | Correlation optimization | Probabilistic latent model | KL divergence minimization | Graph cross-entropy minimization |
| **Statistics Used** | Second-order (covariance) | Higher-order | Second-order + labels | Second-order | Second-order + noise model | Probability distributions | Fuzzy topology |
| **Objective Function** | Variance / reconstruction error | Non-Gaussianity | Between / within class ratio | Correlation maximization | Likelihood maximization | KL(P‖Q) | Cross-entropy |
| **Preserves Global Structure** | Yes | Partially | Yes (class-wise) | Yes | Yes | No | Partially |
| **Preserves Local Structure** | Weak | Moderate | Moderate | Weak | Weak | Strong | Strong |
| **Orthogonal Components** | Yes | No | No | No | No | No | No |
| **Component Ordering** | Yes | No | Yes | Yes | No | No | No |
| **Interpretability of Axes** | High | Medium | High | Medium | Medium | None | None |
| **Probabilistic Model** | Yes (Gaussian) | Yes | Yes | Yes | Yes | Yes | Implicit |
| **Noise Modeling** | Implicit | Weak | Weak | Weak | Explicit | No | No |
| **Invertible Mapping** | Yes (linear) | Yes (up to scale/permutation) | Yes | Yes | Yes | No | No |
| **Out-of-Sample Extension** | Trivial | Trivial | Trivial | Trivial | Trivial | No | Yes |
| **Stability / Determinism** | High | Medium | High | High | High | Low | Medium–High |
| **Scalability** | Excellent | Good | Good | Good | Moderate | Poor | Good |
| **Main Use Case** | Compression, denoising | Source separation | Classification | Multiview learning | Latent modeling | Visualization | Visualization |
| **Common Pitfall** | Misses nonlinear structure | Sensitive to noise | Fails if assumptions break | Requires paired data | Over-assumes Gaussianity | Misinterpreting distances | Misinterpreting global distances |
| **SSP Interpretation** | Energy compaction | Blind source separation | Optimal linear classifier | Cross-signal alignment | Latent signal recovery | Neighborhood preservation | Manifold reconstruction |
| **Relation to Deep Learning** | Linear autoencoder | Nonlinear ICA | Metric learning | Multimodal models | VAEs | Visualization only | Visualization / embeddings |

---

## One-Line Conceptual Summary

PCA → Global variance  

ICA → Independent sources  

LDA → Class separation  

CCA → Shared structure across views  

Factor Analysis → Latent causes + noise  

t-SNE → Local neighborhood visualization  

UMAP → Manifold topology visualization
