# 📜 Loss Functions in Dimensionality Reduction (AI/ML/DL)

---

## 🔹 1. Classical ML / Statistical Losses

* **PCA (Principal Component Analysis)**  
  * **Loss:** Reconstruction error (squared Frobenius norm) on mean-centered data.  
  * $$L = \|X - X W W^T\|_F^2, \quad W^T W = I$$  
  * Minimizing this loss is equivalent to maximizing the variance retained by the projection.

* **MDS (Multidimensional Scaling, 1950s)**  
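  * **Loss:** Stress, the sum of squared differences between pairwise distances in the original (HD) space and in the embedding (LD).  
  * $$L = \sum_{i,j} (d_{ij}^{HD} - d_{ij}^{LD})^2$$  
  * Preserves global geometry.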

* **Isomap (2000)**  
  * **Loss:** Classical MDS objective applied to geodesic distances, estimated as shortest paths on a nearest-neighbor graph.  
  * Captures nonlinear manifolds.

* **t-SNE (2008)**  
  * **Loss:** KL divergence between high- and low-dim neighbor probabilities.  
  * Preserves local neighborhoods.  

  $$L = D_{KL}(P \parallel Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$  

* **UMAP (2018)**  
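  * **Loss:** Cross-entropy between fuzzy neighbor graphs (edge-membership strengths) built in the high- and low-dimensional spaces.  
  * Emphasizes local structure and often retains more global structure than t-SNE.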
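
For the PCA entry above, the link between reconstruction error and lost variance can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming a single unit-norm direction `w` (a one-dimensional projection) and a small made-up data set; the helper names `center`, `pca_recon_loss`, and `projected_energy` are hypothetical and not taken from any library.

```python
import math

def center(rows):
    """Subtract the per-column mean so the data has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def pca_recon_loss(rows, w):
    """Squared reconstruction error ||X - X w w^T||_F^2 for one unit direction w."""
    loss = 0.0
    for row in rows:
        score = sum(a * b for a, b in zip(row, w))   # 1-D code: x . w
        recon = [score * b for b in w]               # back-projection: (x . w) w
        loss += sum((a - r) ** 2 for a, r in zip(row, recon))
    return loss

def projected_energy(rows, w):
    """Sum of squared 1-D codes, i.e. n times the variance kept along w."""
    return sum(sum(a * b for a, b in zip(row, w)) ** 2 for row in rows)

# Illustrative data lying close to the diagonal y = x (made up for this sketch).
X = center([[2.0, 1.9], [1.0, 1.2], [0.0, -0.1], [-1.0, -0.8], [-2.0, -2.2]])
total = sum(v * v for row in X for v in row)         # total sum of squares

w_diag = [1 / math.sqrt(2), 1 / math.sqrt(2)]        # direction along the data
w_axis = [1.0, 0.0]                                  # a poorer direction

loss_diag = pca_recon_loss(X, w_diag)
loss_axis = pca_recon_loss(X, w_axis)
print(loss_diag, loss_axis)
```

For any unit-norm `w`, the loss equals the total sum of squares minus the energy kept along `w`, so the direction that keeps the most variance also gives the smallest reconstruction error. Here the diagonal direction yields a much smaller loss than the axis-aligned one.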

---

## 🔹 2. Autoencoder-Based Losses (DL)

* **Basic Autoencoder** (1986 → deep revival 2006)  
  * **Loss:**  
    $$L = \|x - \hat{x}\|^2$$  

* **Denoising Autoencoder (Vincent, 2008)**  
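  * The input is corrupted to $$\tilde{x}$$ (e.g., by added noise or masking); the target is the clean $$x$$.  
  * **Loss:** MSE between the clean input and the reconstruction of the corrupted input: $$L = \|x - g(f(\tilde{x}))\|^2$$  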

* **Sparse Autoencoder**  
  * **Loss:** Reconstruction MSE + L1 penalty on activations.  
  * $$L = \|x - \hat{x}\|^2 + \lambda \sum |h|$$  

* **Contractive Autoencoder (Rifai, 2011)**  
  * **Loss:**  
    $$L = \|x - \hat{x}\|^2 + \lambda \|\nabla_x h(x)\|_F^2$$  
  * Encourages robustness to perturbations.
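
The sparse and contractive penalties above can be evaluated by hand for a tiny model. The following is a minimal sketch, assuming a single sigmoid encoder layer and a tied-weight linear decoder (both illustrative choices, not prescribed by the text); the weights, input, and `lam` are made-up values. For a sigmoid layer the Jacobian norm has the closed form used in `contractive_penalty`.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def encode(x, W, b):
    """h = sigmoid(W x + b); W holds one row of weights per hidden unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def decode(h, W):
    """Tied-weight linear decoder x_hat = W^T h (an assumption of this sketch)."""
    return [sum(W[j][i] * h[j] for j in range(len(h))) for i in range(len(W[0]))]

def recon_error(x, x_hat):
    """Squared error ||x - x_hat||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def sparse_penalty(h):
    """L1 norm of the hidden activations."""
    return sum(abs(v) for v in h)

def contractive_penalty(h, W):
    """||dh/dx||_F^2 for a sigmoid layer: sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2."""
    return sum((hj * (1 - hj)) ** 2 * sum(w * w for w in row)
               for hj, row in zip(h, W))

# Made-up parameters: 3 inputs, 2 hidden units.
W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]
b = [0.0, 0.1]
x = [1.0, 0.5, -1.0]
lam = 0.1

h = encode(x, W, b)
x_hat = decode(h, W)
sparse_loss = recon_error(x, x_hat) + lam * sparse_penalty(h)
contractive_loss = recon_error(x, x_hat) + lam * contractive_penalty(h, W)
print(sparse_loss, contractive_loss)
```

In a real training loop these quantities would be averaged over a batch and differentiated by an autodiff framework; the closed-form Jacobian only holds for this one-layer sigmoid encoder.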

---

## 🔹 3. Generative & Probabilistic Losses

* **Variational Autoencoder (VAE, 2013)**  
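  * **Loss:** Negative ELBO, written here with a KL weight $$\beta$$ (the standard VAE uses $$\beta = 1$$; the squared-error term corresponds to a Gaussian decoder).  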

    $$
    \mathcal{L} = \mathbb{E}_{q(z|x)}[\|x - \hat{x}\|^2]
    + \beta \, D_{KL}(q(z|x) \parallel p(z))
    $$  

* **β-VAE (2017)**  
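  * Same loss as the VAE with $$\beta > 1$$, which strengthens the KL term and encourages disentangled latent factors, usually at some cost in reconstruction quality.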

* **VQ-VAE (2017)**  
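  * **Loss:** Reconstruction + vector-quantization (codebook) term + commitment term.  
  * $$L = \|x - \hat{x}\|^2 + \|\text{sg}[z_e(x)] - e\|^2 + \beta \, \|z_e(x) - \text{sg}[e]\|^2$$, where $$\text{sg}$$ is the stop-gradient operator, $$z_e(x)$$ the encoder output, and $$e$$ the nearest codebook vector.  
  * Discrete latent embeddings.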
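
The KL term in the VAE and β-VAE losses above has a closed form under a common modeling choice. The sketch below assumes a diagonal-Gaussian encoder $$q(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$$, a standard-normal prior, and a single-sample squared-error reconstruction term; the function names and the example numbers are illustrative only.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Squared-error reconstruction plus beta-weighted KL (beta = 1: standard VAE)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * gaussian_kl(mu, log_var)

# Made-up values standing in for one encoder/decoder pass.
x = [0.2, 0.7, 0.1]
x_hat = [0.25, 0.6, 0.15]
mu = [1.0, -0.5]
log_var = [0.0, -1.0]

loss_vae = vae_loss(x, x_hat, mu, log_var)            # beta = 1
loss_beta = vae_loss(x, x_hat, mu, log_var, beta=4.0)  # beta-VAE setting
print(loss_vae, loss_beta)
```

The KL term is zero only when the encoder output matches the prior exactly (`mu = 0`, `log_var = 0`), so raising `beta` pushes the posterior toward the prior more strongly.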

---

## 🔹 4. Contrastive & Representation Learning

* **SimCLR (2020)**  
  * **Loss:** NT-Xent (normalized temperature-scaled cross-entropy).  

    $$
    L = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \ne i} \exp(\text{sim}(z_i, z_k)/\tau)}
    $$  

* **BYOL (2020), SwAV (2020)**  
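  * BYOL predicts one view's target-network projection from the other view's online-network output; SwAV predicts one view's cluster assignment from the other view. Neither uses explicit negative pairs.  
  * The learned embedding serves as the reduced representation.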

* **Deep Embedding Clustering (DEC, 2016)**  
  * **Loss:** KL divergence $$D_{KL}(P \parallel Q)$$ between a sharpened target distribution $$P$$ and the soft cluster assignments $$Q$$.
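
The NT-Xent formula above can be computed directly for a single anchor. This is a minimal sketch in plain Python, assuming cosine similarity for `sim` and a made-up batch of four 2-D embeddings; `nt_xent` and `cosine` are hypothetical helper names, and the full SimCLR loss would average this term over all positive pairs in the batch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(z, i, j, tau=0.5):
    """Loss for anchor i with positive j; every k != i appears in the denominator."""
    num = math.exp(cosine(z[i], z[j]) / tau)
    den = sum(math.exp(cosine(z[i], z[k]) / tau)
              for k in range(len(z)) if k != i)
    return -math.log(num / den)

# Made-up embeddings: z[0] and z[1] point the same way, z[2] and z[3] do not.
z = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.3], [-0.8, -0.5]]

loss_true_pair = nt_xent(z, 0, 1)   # positive is actually similar to the anchor
loss_wrong_pair = nt_xent(z, 0, 2)  # "positive" is dissimilar to the anchor
print(loss_true_pair, loss_wrong_pair)
```

The loss is small when the positive is the most similar item in the batch and grows when other items are closer to the anchor than the positive is.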

---

## 🔹 5. Domain-Specific

* **Triplet Loss (FaceNet, 2015):**  
  $$L = \max(0, d(a,p) - d(a,n) + \alpha)$$  
  Pulls the anchor $$a$$ toward the positive $$p$$ and pushes the negative $$n$$ at least a margin $$\alpha$$ farther away than the positive.

* **Manifold Regularization:**  
  Adds a graph-Laplacian penalty so that points connected in the neighborhood graph receive similar outputs.

* **Deep t-SNE / Parametric t-SNE:**  
  A neural network encoder trained with the t-SNE KL divergence objective, so new points can be embedded without re-running the optimization.
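
The triplet loss above is simple enough to evaluate by hand. The sketch below assumes plain Euclidean distance for `d` (FaceNet itself uses squared L2 distance on normalized embeddings) and made-up 2-D points; `triplet_loss` is an illustrative helper, not a library call.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """max(0, d(a, p) - d(a, n) + alpha) with Euclidean d."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + alpha)

a = [0.0, 0.0]
near = [0.0, 1.0]
far = [0.0, 3.0]

satisfied = triplet_loss(a, near, far)   # negative already beyond the margin
violated = triplet_loss(a, far, near)    # positive is farther than the negative
print(satisfied, violated)
```

Triplets that already satisfy the margin contribute zero loss and zero gradient, which is why mining hard or semi-hard triplets matters in practice.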

---

## ✅ Quick Comparative Table

| Method / Family       | Loss Type                           | Preserves / Encourages           |
| --------------------- | ----------------------------------- | -------------------------------- |
| PCA                   | MSE (linear recon.)                 | Variance (global)                |
| MDS / Isomap          | Distance stress / geodesics         | Geometry, manifolds              |
| t-SNE                 | KL divergence (neighbors)           | Local clusters                   |
| UMAP                  | Cross-entropy (graphs)              | Local + global balance           |
| Autoencoders          | MSE (+ sparse/contractive terms)    | Compact latent recon.            |
| VAE / β-VAE / VQ-VAE  | Recon + KL (+ quantization)         | Probabilistic latent structure   |
| SimCLR, BYOL, SwAV    | Contrastive/SSL losses              | Representation alignment         |
| Triplet / Metric      | Distance-based (margin, triplets)   | Semantic similarity in embedding |

---


## 📊 Detailed Comparative Table: Loss Functions in Dimensionality Reduction (AI/ML/DL)

| Method / Loss            | Formula (simplified)                                                                 | Intuition                                   | Pros                                     | Cons                          | When to Use                                |
|---------------------------|---------------------------------------------------------------------------------------|---------------------------------------------|------------------------------------------|-------------------------------|--------------------------------------------|
| **PCA (Reconstruction)** | $$L = \lVert X - X W W^T \rVert_F^2$$                                                 | Preserve variance by minimizing reconstruction error | Simple, closed-form solution (eigendecomposition / SVD) | Linear only, ignores local structure | Large-scale linear DR, preprocessing       |
| **MDS (Stress Loss)**    | $$L = \sum_{i,j} (d_{ij}^{HD} - d_{ij}^{LD})^2$$                                      | Preserve pairwise distances                 | Captures global geometry                  | Sensitive to noise, costly for large $$n$$ | Visualization with distance preservation   |
| **Isomap**               | Geodesic distance preservation                                                        | Preserve manifold structure                 | Good for nonlinear manifolds              | Heavy compute, noise-sensitive   | Data lying on curved manifolds             |
| **t-SNE (KL Divergence)**| $$L = D_{KL}(P^{HD} \parallel Q^{LD}) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$ | Match high vs low-dim neighborhood distributions | Great for local clustering                | Poor global structure, non-parametric | Visualizing clusters in high-dim data      |
| **UMAP (Cross-Entropy)** | $$L = -\sum_{i \ne j} \Big[p_{ij}\log q_{ij} + (1-p_{ij})\log(1-q_{ij})\Big]$$        | Graph-based neighbor preservation           | Scales better than t-SNE, often retains more global structure | Sensitive to hyperparams        | Scalable nonlinear embedding               |
| **Autoencoder (MSE)**    | $$L = \lVert X - g(f(X)) \rVert^2$$                                                   | Encode–decode to minimize reconstruction    | Learns nonlinear embeddings               | Needs large data, may overfit    | General nonlinear DR, DL pipelines         |
| **Denoising AE**          | $$L = \lVert X - g(f(\tilde{X})) \rVert^2$$                                          | Reconstruct clean from noisy inputs         | Robust feature learning                   | Needs noise design               | Robust representation learning             |
| **Sparse AE**            | $$L = \lVert X - \hat{X} \rVert^2 + \lambda \lVert h \rVert_1$$                      | Enforce sparse hidden features              | Feature selection in latent space         | Extra hyperparam tuning          | Compressed sensing, embeddings             |
| **Contractive AE**       | $$L = \lVert X - \hat{X} \rVert^2 + \lambda \lVert \nabla_x f(x) \rVert_F^2$$         | Invariance to small input perturbations     | Smooth, robust representations            | Heavy (Jacobian cost)            | Robust DR in vision, speech                |
| **VAE (negative ELBO)**  | $$\mathcal{L} = -\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] + D_{KL}(q(z \mid x) \parallel p(z))$$ | Probabilistic latent space                  | Generative, interpretable latent vars     | May blur details                 | Generative modeling + DR                   |
| **β-VAE**                | Same as VAE but KL scaled by $$\beta > 1$$                                            | Disentangled latent factors                 | Interpretable, structured representations | Reconstruction quality drops as $$\beta$$ grows | Disentangled representation learning       |
| **VQ-VAE**               | Recon. + quantization + commitment losses                                             | Discrete latent embeddings                  | Compact discrete codes (used for images, audio, video) | Codebook collapse risk            | Discrete embeddings, compression           |
| **DEC (Clustering)**     | $$L = D_{KL}(P \parallel Q)$$, target $$P$$ vs. soft assignments $$Q$$                | Align latent clusters with targets          | Clustering + DR jointly                   | Sensitive to initialization       | Unsupervised clustering + embedding        |
| **SimCLR (NT-Xent)**     | $$L = -\log \frac{\exp(\text{sim}(z_i,z_j)/\tau)}{\sum_{k \ne i}\exp(\text{sim}(z_i,z_k)/\tau)}$$ | Pull positives close, push negatives apart  | Strong embeddings via contrastive SSL     | Needs large batch sizes           | Vision/NLP self-supervised embeddings      |
| **Triplet Loss**         | $$L = \max(0, d(a,p) - d(a,n) + \alpha)$$                                             | Anchor–positive close, negative far         | Good for metric learning                  | Needs careful triplet mining      | Face verification, retrieval tasks         |
| **BYOL / SwAV**          | Variants of contrastive / clustering losses w/o negatives                             | Self-supervised latent structuring          | No negatives needed                       | Trickier stability                | SSL rep. learning, multimodal embeddings   |
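
The UMAP row's cross-entropy can be evaluated edge by edge. The following is a minimal sketch, assuming the high-dimensional memberships `p` and low-dimensional memberships `q` are already given as flat lists of per-edge values in [0, 1]; how UMAP builds those memberships and optimizes the loss (sampling, the low-dimensional kernel) is not shown, and the clipping constant `eps` is an illustrative choice.

```python
import math

def fuzzy_cross_entropy(p, q, eps=1e-12):
    """-sum_ij [ p_ij log q_ij + (1 - p_ij) log(1 - q_ij) ] over listed edges."""
    total = 0.0
    for p_ij, q_ij in zip(p, q):
        q_ij = min(max(q_ij, eps), 1.0 - eps)   # clip to keep the logs finite
        total -= p_ij * math.log(q_ij) + (1.0 - p_ij) * math.log(1.0 - q_ij)
    return total

# Made-up edge memberships: two strong edges and two absent edges.
p = [1.0, 0.9, 0.0, 0.1]
q_good = [0.95, 0.85, 0.05, 0.1]   # embedding that roughly matches p
q_bad = [0.1, 0.2, 0.9, 0.8]       # embedding that inverts the neighborhoods

loss_good = fuzzy_cross_entropy(p, q_good)
loss_bad = fuzzy_cross_entropy(p, q_bad)
print(loss_good, loss_bad)
```

The first term rewards placing true neighbors close together (attraction) and the second penalizes placing non-neighbors close together (repulsion), so an embedding that inverts the neighborhoods scores far worse.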

---

✅ **Insights**

- **Classical ML losses** (PCA, MDS, t-SNE, UMAP): preserve **variance, distances, or neighborhoods**.  
- **Deep Learning losses** (Autoencoders, VAEs, VQ-VAEs): reconstruction + probabilistic/discrete latent modeling.  
- **Self-supervised and metric-learning losses** (SimCLR, BYOL, SwAV, Triplet): learn embeddings by pulling related views or samples together, **with negatives** (SimCLR, Triplet) or **without them** (BYOL, SwAV).  

👉 Choose depending on data + goal:  
- **Linear DR** → PCA.  
- **Nonlinear manifold visualization** → t-SNE, UMAP.  
- **Generative latent space** → VAE, β-VAE.  
- **Discrete codebook embeddings** → VQ-VAE.  
- **Joint clustering** → DEC.  
- **Strong SSL features** → Contrastive losses.  
