# 🌀 Understanding UMAP (Uniform Manifold Approximation and Projection)

UMAP is a **nonlinear dimensionality reduction** technique that builds upon concepts from **manifold learning** and **fuzzy topology**.
It’s similar in goal to *t-SNE* but is typically faster, often preserves more global structure, and can be used as a general-purpose embedding method rather than only for visualization.

---

## 🧭 1️⃣ Overview

UMAP assumes that:

> High-dimensional data lie on a manifold, and we want to find a low-dimensional projection that preserves the manifold’s *local topology*.

So instead of matching probability distributions like t-SNE,
UMAP builds a **fuzzy topological graph** in high-D and optimizes a low-D representation that preserves it.

---

## 🧩 2️⃣ Step-by-Step Mechanics

### (a) Build a weighted graph in high-D

For each data point \(i\):

1. Find its **k nearest neighbors** (based on some distance metric).
2. Define connection strength:

   $$
   w_{ij} = \exp\!\left(-\frac{\max(0,\, d(i,j) - \rho_i)}{\sigma_i}\right)
   $$

   where:
   - \(d(i,j)\): distance between \(i\) and \(j\)
   - \(\rho_i\): distance to \(i\)’s nearest neighbor (local connectivity floor)
   - \(\sigma_i\): per-point scale, chosen so that the weights from \(i\) to its \(k\) neighbors sum to \(\log_2 k\)

3. Symmetrize with a fuzzy union:

   $$
   w^{(sym)}_{ij} = w_{ij} + w_{ji} - w_{ij} w_{ji}
   $$

This gives a **fuzzy simplicial set** (a graph encoding probabilistic connections).
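The three steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes brute-force neighbor search, a dense weight matrix, and simple bisection for \(\sigma_i\) with the target \(\sum_j w_{ij} = \log_2 k\). The helper names `knn`, `calibrate_sigma`, and `fuzzy_graph` are hypothetical.

```python
import math

def knn(points, i, k):
    """Brute-force k nearest neighbors of point i: (indices, distances), nearest first."""
    ranked = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return [j for _, j in ranked[:k]], [d for d, _ in ranked[:k]]

def calibrate_sigma(dists, rho, n_iter=60):
    """Bisect for sigma so that sum_j exp(-max(0, d_j - rho) / sigma) = log2(k)."""
    target = math.log2(len(dists))

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in dists)

    lo, hi = 0.0, 1.0
    while total(hi) < target:      # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

def fuzzy_graph(points, k):
    """Symmetrized weights as a dense matrix (dense only to keep the sketch short)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        idx, dists = knn(points, i, k)
        rho = dists[0]                          # distance to the nearest neighbor
        sigma = calibrate_sigma(dists, rho)
        for j, d in zip(idx, dists):
            w[i][j] = math.exp(-max(0.0, d - rho) / sigma)
    # Fuzzy union: w_ij + w_ji - w_ij * w_ji
    return [[w[i][j] + w[j][i] - w[i][j] * w[j][i] for j in range(n)] for i in range(n)]

# Toy data, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (4.0, 4.0), (5.0, 4.0)]
graph = fuzzy_graph(points, k=3)
```

Each point's nearest neighbor receives weight 1 because \(d(i,j) - \rho_i = 0\) for that pair, so no point ends up disconnected from the graph.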

---

### (b) Define the low-D graph

In embedding space, define a smooth connection curve:

$$
w'_{ij} = \frac{1}{1 + a\,{d'(i,j)}^{2b}}
$$

where:
- \(d'(i,j)\): distance between embedded points \(y_i\) and \(y_j\)
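- \(a,b\): constants controlling curve shape, fitted from `min_dist` and `spread`; \((a,b) \approx (1.929, 0.7915)\) corresponds to a very small `min_dist` (about 0.001), while the `umap-learn` default `min_dist = 0.1` gives roughly \((1.58, 0.90)\)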

---

### (c) Optimize embeddings

UMAP minimizes a **cross-entropy** loss between the high-D and low-D fuzzy graphs (here \(w_{ij}\) denotes the symmetrized weight \(w^{(sym)}_{ij}\) from step (a)):

$$
L = \sum_{i \ne j} \Big[-w_{ij}\log(w'_{ij}) - (1 - w_{ij})\log(1 - w'_{ij})\Big]
$$

- The first term pulls close (neighbor) points together.
- The second term pushes unrelated (non-neighbor) points apart.

Optimization is performed via **stochastic gradient descent (SGD)**.
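To make the loss concrete, here is a small sketch that evaluates it exactly over all pairs for a toy graph. It assumes the \((a, b)\) values quoted above as defaults and clamps \(w'_{ij}\) away from 0 and 1 so the logarithms stay finite. The function names and the toy data are hypothetical, and real implementations never evaluate the full sum.

```python
import math

def low_d_weight(yi, yj, a=1.929, b=0.7915):
    """Low-D membership 1 / (1 + a * d^(2b)) for two embedded points."""
    d = math.dist(yi, yj)
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def umap_cross_entropy(w, Y, eps=1e-12):
    """Two-sided cross-entropy summed over all ordered pairs i != j."""
    n = len(Y)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wl = min(max(low_d_weight(Y[i], Y[j]), eps), 1.0 - eps)  # keep logs finite
            loss += -w[i][j] * math.log(wl) - (1.0 - w[i][j]) * math.log(1.0 - wl)
    return loss

# Toy high-D graph: points 0 and 1 are neighbors, point 2 is unrelated.
w = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
good = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]   # neighbors close, stranger far
bad = [(0.0, 0.0), (5.0, 0.0), (0.1, 0.0)]    # neighbors far, stranger close
loss_good = umap_cross_entropy(w, good)
loss_bad = umap_cross_entropy(w, bad)
```

The layout that keeps the connected pair together and the unrelated point away gets the lower loss, which is the behavior the two terms are meant to enforce.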

---

## 🧠 3️⃣ Intuition Behind the Components

| Symbol | Meaning | Description |
|---------|----------|-------------|
| \(d(i,j)\) | High-D distance | Distance between points in original data space |
| \(w_{ij}\) | High-D connection weight | Fuzzy measure of neighbor strength |
| \(d'(i,j)\) | Low-D distance | Distance between embedded points |
| \(w'_{ij}\) | Low-D connection weight | Learned similarity curve |
| Loss \(L\) | Cross-entropy | Encourages \(w'_{ij} \approx w_{ij}\) |

---

## ⚙️ 4️⃣ Handling Neighbors vs Non-neighbors

- The **k-NN graph** defines which pairs \((i,j)\) are neighbors.
- For neighbors → compute attractive term:
  $$-w_{ij}\log(w'_{ij})$$
- For non-neighbors → approximate the repulsive term via **negative sampling**:
  $$-\log(1 - w'_{ij^-})$$

This avoids computing all \(O(N^2)\) pairs — only a few random negatives are sampled for each positive edge.
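A single update for one positive edge might look like the sketch below. The gradients follow from differentiating \(-\log w'_{ij}\) and \(-\log(1 - w'_{ik})\) with respect to \(y_i\). The learning rate, the clipping limit, the small `eps` in the repulsive denominator, and the choice to skip the positive partner when sampling are assumptions made for this illustration, and all function names are hypothetical.

```python
import math
import random

def sq_dist(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))

def attract_grad(yi, yj, a, b):
    """Gradient of -log(w'_ij) with respect to y_i."""
    d2 = sq_dist(yi, yj)
    if d2 == 0.0:
        return [0.0] * len(yi)               # coincident points: nothing to pull
    coeff = 2.0 * a * b * d2 ** (b - 1.0) / (1.0 + a * d2 ** b)
    return [coeff * (p - q) for p, q in zip(yi, yj)]

def repel_grad(yi, yk, a, b, eps=1e-3):
    """Gradient of -log(1 - w'_ik) with respect to y_i; eps guards against d = 0."""
    d2 = sq_dist(yi, yk)
    coeff = -2.0 * b / ((eps + d2) * (1.0 + a * d2 ** b))
    return [coeff * (p - q) for p, q in zip(yi, yk)]

def clip(x, limit=4.0):
    """Bound each gradient component so one step cannot explode (an assumption here)."""
    return max(-limit, min(limit, x))

def sgd_edge_step(Y, i, j, n_neg=5, lr=0.1, a=1.929, b=0.7915, rng=random):
    """Update Y in place for one positive edge (i, j) and n_neg random negatives."""
    g = attract_grad(Y[i], Y[j], a, b)
    Y[i] = [p - lr * clip(x) for p, x in zip(Y[i], g)]      # pull i toward j
    Y[j] = [q + lr * clip(x) for q, x in zip(Y[j], g)]      # and j toward i
    for _ in range(n_neg):
        k = rng.randrange(len(Y))
        if k == i or k == j:
            continue                                        # not a usable negative
        g = repel_grad(Y[i], Y[k], a, b)
        Y[i] = [p - lr * clip(x) for p, x in zip(Y[i], g)]  # push i away from k
```

Note that the repulsive term drops the \((1 - w_{ij})\) factor: a randomly sampled pair is assumed to have \(w_{ij} \approx 0\), which is almost always true in a sparse k-NN graph.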

---

## ⚡ 5️⃣ Why UMAP Is Fast

1. **Approximate k-NN graph** via **NN-Descent** (near-linear in practice; its authors report empirical cost around \(O(N^{1.14})\)):
   - Iteratively refines neighbor lists using neighbor-of-neighbor propagation.
2. **Sparse graph representation**:
   - Only store k edges per node (not all pairs).
3. **SGD optimization** with sampled edges:
   - No full normalization constant like in t-SNE.

Result: near-linear scaling with dataset size in practice.
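
As a rough sense of scale for the sparse representation, here is a worked count under assumed values \(N = 10^6\) and \(k = 15\) (both chosen only for illustration):

$$
\underbrace{N k}_{\text{stored edges}} = 1.5\times10^{7}
\qquad\text{vs.}\qquad
\underbrace{\tfrac{N(N-1)}{2}}_{\text{all pairs}} \approx 5\times10^{11}
$$

The sparse graph holds about four orders of magnitude fewer entries than the full pairwise matrix.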

---

## 🧮 6️⃣ The Distances \(d(i,j)\) and \(d'(i,j)\)

| Symbol | Space | Typical Metric | Used For |
|---------|--------|----------------|-----------|
| \(d(i,j)\) | High-dimensional space | Euclidean / Cosine / Hamming | Building the fuzzy k-NN graph |
| \(d'(i,j)\) | Low-dimensional embedding | Euclidean | Measuring similarity between embedded points |

### High-D weights:
$$
w_{ij} = \exp\!\left(-\frac{\max(0,\,d(i,j)-\rho_i)}{\sigma_i}\right)
$$

### Low-D weights:
$$
w'_{ij} = \frac{1}{1 + a\,{d'(i,j)}^{2b}}
$$

UMAP optimizes embeddings so that these match: \(w'_{ij} \approx w_{ij}\).

---

## 🧩 7️⃣ Relation to t-SNE

| Aspect | **t-SNE** | **UMAP** |
|--------|------------|-----------|
| Core idea | Match pairwise probabilities | Preserve local fuzzy topology |
| High-D similarity | Gaussian | Exponential with adaptive scaling |
| Low-D similarity | Student-t | Smooth curve \(1/(1+a d^{2b})\) |
| Loss | KL divergence | Cross-entropy |
| Optimization | Gradient descent | SGD with sampled negatives |
| Structure preserved | Primarily local | Local + moderate global |
| Complexity | O(N²) exact, O(N log N) with Barnes-Hut | Roughly O(N log N) in practice |

---

## 🧠 8️⃣ Geometric Intuition

Think of each point as having a *local connectivity radius* \( \rho_i \):
- Inside that radius → fully connected (\(w_{ij} = 1\))
- Outside → exponentially decaying connection strength

In low-D space, points are moved so that their connection curve \(w'_{ij}\) matches \(w_{ij}\).
This yields an embedding that preserves local clusters and overall manifold shape.
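
A small worked example of the high-D weight, using assumed values \(\rho_i = 1.0\) and \(\sigma_i = 0.5\) picked purely for illustration:

$$
\begin{aligned}
d(i,j) = 0.8 &\;\Rightarrow\; w_{ij} = \exp(0) = 1 \\
d(i,j) = 1.5 &\;\Rightarrow\; w_{ij} = \exp(-0.5/0.5) = e^{-1} \approx 0.368 \\
d(i,j) = 3.0 &\;\Rightarrow\; w_{ij} = \exp(-2.0/0.5) = e^{-4} \approx 0.018
\end{aligned}
$$

The first case falls inside the connectivity radius, so the \(\max(0,\cdot)\) clamps the exponent to zero; beyond the radius the weight decays with the excess distance \(d(i,j) - \rho_i\), not with the raw distance.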

---

## ✅ 9️⃣ Summary

- \(d(i,j)\): high-dimensional distance
- \(d'(i,j)\): embedding-space distance
- \(w_{ij}\): fuzzy high-D connectivity
- \(w'_{ij}\): learned connectivity in low-D
- Optimization: SGD to minimize cross-entropy
- Non-neighbors: approximated via random negative sampling
- Efficiency: NN-Descent + sparse graph → near-linear scaling in practice (often quoted as O(N log N))

> **t-SNE** matches pairwise probabilities.
> **UMAP** matches fuzzy graph connectivity.

Result: faster, more scalable, and often better at preserving global structure.

---

*References:*
- McInnes, Healy, & Melville (2018). *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.*
- Official documentation (`umap-learn`): [https://umap-learn.readthedocs.io](https://umap-learn.readthedocs.io)


# 🔍 Why UMAP Preserves Global Structure (vs t-SNE)

## 1️⃣ Pairwise penalties

### t-SNE loss
$$
D_{KL}(P\|Q)=\sum_{ij} p_{ij}\log\frac{p_{ij}}{q_{ij}}
$$
- **Missed neighbor**: \(p\) large, \(q\) small → big term
  e.g. \(p=0.1, q=0.001 \Rightarrow 0.46\)
- **False neighbor**: \(p\) tiny, \(q\) large → small term
  e.g. \(p=10^{-4}, q=0.2 \Rightarrow -7.6\times10^{-4}\) (negligible in magnitude)

➡ KL heavily penalizes *missed* neighbors but barely cares about *false* ones → strong local clustering, weak global layout.

---

### UMAP loss
$$
L_{ij}=-[w_{ij}\log w'_{ij}+(1-w_{ij})\log(1-w'_{ij})]
$$
- **Missed neighbor**: \(w\!\approx\!1,\ w'\!\to\!0 \Rightarrow\) large
- **False neighbor**: \(w\!\approx\!0,\ w'\!\to\!1 \Rightarrow\) large

Both directions penalized → symmetric pressure.
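
The per-pair penalties can be checked numerically. The t-SNE inputs below are the ones used in the examples above; the UMAP inputs (0.99 and 0.01) are illustrative stand-ins for "\(\approx 1\)" and "\(\to 0\)", and the function names are hypothetical.

```python
import math

def kl_term(p, q):
    """One summand of KL(P || Q)."""
    return p * math.log(p / q)

def ce_term(w, w_low):
    """One summand of the two-sided cross-entropy."""
    return -(w * math.log(w_low) + (1.0 - w) * math.log(1.0 - w_low))

missed_kl = kl_term(0.1, 0.001)    # ≈ 0.46: large penalty
false_kl = kl_term(1e-4, 0.2)      # ≈ -7.6e-4: almost no penalty

missed_ce = ce_term(0.99, 0.01)    # neighbor in high-D, far apart in low-D
false_ce = ce_term(0.01, 0.99)     # unrelated in high-D, close in low-D
```

With these inputs both cross-entropy terms come out equal and large, while the two KL terms differ by nearly three orders of magnitude.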

---

## 2️⃣ Gradients (qualitative)

| Case | t-SNE | UMAP |
|------|--------|-------|
| Missed neighbor | Strong pull (big \(p-q\)) | Strong pull |
| False neighbor | Weak push (small \(p\)) | Strong push (via sampled negatives) |

KL’s asymmetry ⇒ pairs that are far apart in high-D contribute almost nothing to the loss, so distances between clusters are poorly constrained.
UMAP’s balanced cross-entropy ⇒ consistent pull-push, preserving manifold continuity.

---

## ✅ Summary

| Property | **t-SNE** | **UMAP** |
|-----------|------------|-----------|
| Divergence | Asymmetric KL | Two-sided (binary) cross-entropy |
| Penalizes | Missed neighbors only (strongly) | Both missed & false |
| Effect | Tight, separated blobs | Continuous global manifold |

> **t-SNE:** “protect local neighbors.”
> **UMAP:** “preserve neighborhoods *and* their global connections.”


# 📘 Parameters and Intuitions: t-SNE vs UMAP

---

## 🎯 Overview

Both **t-SNE** and **UMAP** aim to reduce dimensionality while preserving local structure.
They differ mainly in **how they model similarity** and **what their key parameters control**.

---

## 🧩 t-SNE Parameters

### **1. Perplexity**

**Definition:**
Perplexity controls the *effective number of neighbors* each point considers when defining its local probability distribution.

Formally, for each point \(x_i\):

$$
p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / (2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / (2\sigma_i^2)\right)}
$$

where \( \sigma_i \) is chosen such that:

$$
2^{H(P_i)} = \text{perplexity}
$$

and

$$
H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}
$$

So **perplexity ≈ effective number of neighbors**.
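
The search for \(\sigma_i\) can be sketched as a bisection on the perplexity, which grows monotonically with \(\sigma_i\). This is a minimal sketch for a single point. The bracket limits, the geometric midpoint, and the function names are assumptions of this illustration, and the target must lie strictly between 1 and the number of other points.

```python
import math

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point, given squared distances to every other point."""
    m = min(sq_dists)                         # shift for numerical stability
    e = [math.exp(-(d - m) / (2.0 * sigma ** 2)) for d in sq_dists]
    z = sum(e)
    return [x / z for x in e]

def perplexity(probs):
    """2 ** H(P), with H the Shannon entropy in bits."""
    h = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** h

def sigma_for_perplexity(sq_dists, target, n_iter=100):
    """Bisect (on a log scale) for the sigma whose perplexity equals target."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        mid = math.sqrt(lo * hi)              # geometric midpoint
        if perplexity(conditional_probs(sq_dists, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Toy squared distances from one point to ten others.
sq_dists = [float(k * k) for k in range(1, 11)]
sigma = sigma_for_perplexity(sq_dists, target=4.0)
```

A point in a dense region ends up with a small \(\sigma_i\) and a point in a sparse region with a large one, so every point "sees" roughly the same number of neighbors.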

---

**Intuition:**

- Small perplexity → very local view → captures fine clusters, may ignore global layout
- Large perplexity → wider neighborhood → captures more global structure but may blur clusters

Typical range: **5 – 50**

---

### **2. Learning Rate, Early Exaggeration (brief)**

- **Learning rate** affects how fast embeddings move: too small leaves points compressed in a dense cloud and slows convergence, too large scatters them into a diffuse ball with poorly separated neighbors.
- **Early exaggeration** temporarily increases attractive forces between similar points early on, helping clusters separate before refinement.

---

## 🧮 UMAP Parameters

UMAP builds a **k-NN graph** in high-dimensional space and optimizes a low-dimensional embedding that preserves those relationships.

### **1. n_neighbors**

Controls the *size of the local neighborhood* considered when building the graph.

- Small → focus on local structure
- Large → capture more global relationships

This is conceptually similar to t-SNE’s perplexity, but it is a hard neighbor count \(k\) rather than an entropy target that sets a soft Gaussian width.

---

### **2. min_dist**

Controls *how tightly points are allowed to pack together* in the low-dim embedding.

During optimization, the low-dim connection strength between points \(i,j\) is:

$$
w'_{ij} = \frac{1}{1 + a\|y_i - y_j\|^{2b}}
$$

where \(a,b\) are fitted from `min_dist` (together with the `spread` parameter).

- Smaller `min_dist` → points can get very close → compact clusters
- Larger `min_dist` → points repel earlier → smoother, more spread-out structure

| min_dist | Effect |
|-----------|---------|
| 0.0 | Very tight clusters |
| 0.1 (default) | Moderately compact clusters |
| 0.5 | Looser, more evenly spread points |
| >0.8 | Very diffuse layout; cluster boundaries blur |
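
How \(a\) and \(b\) follow from `min_dist` can be sketched as a least-squares fit of the curve \(1/(1 + a\,d^{2b})\) to a target that equals 1 up to `min_dist` and decays exponentially afterwards. The sketch below assumes that target shape and swaps in a coarse grid search for a proper nonlinear solver to stay dependency-free, so the results are only approximate. `fit_ab` is a hypothetical name.

```python
import math

def fit_ab(min_dist, spread=1.0, n_points=300):
    """Grid-search (a, b) so 1 / (1 + a * x^(2b)) matches the target curve."""
    xs = [3.0 * spread * t / (n_points - 1) for t in range(n_points)]
    # Target: flat at 1 inside min_dist, exponential decay beyond it.
    target = [1.0 if x < min_dist else math.exp(-(x - min_dist) / spread) for x in xs]
    best_err, best_a, best_b = float("inf"), None, None
    for bi in range(81):                      # b in [0.5, 1.5]
        b = 0.5 + 0.0125 * bi
        xp = [x ** (2.0 * b) for x in xs]     # reused for every a
        for ai in range(51):                  # a in [0.5, 3.0]
            a = 0.5 + 0.05 * ai
            err = sum((1.0 / (1.0 + a * p) - t) ** 2 for p, t in zip(xp, target))
            if err < best_err:
                best_err, best_a, best_b = err, a, b
    return best_a, best_b

a_default, b_default = fit_ab(0.1)    # min_dist = 0.1
```

Shrinking `min_dist` narrows the flat top of the target, so the fitted curve starts to fall off at shorter distances and points are allowed to sit closer together.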

---

### **3. Interplay between n_neighbors and min_dist**

| n_neighbors | min_dist | Result |
|--------------|-----------|---------|
| Small (5–15) | Small (0.0–0.1) | Very local, dense clusters |
| Large (30–100) | Large (0.3–0.8) | Global continuity, smooth manifold |

---

## 🧠 Conceptual Analogy

| Concept | t-SNE | UMAP |
|----------|--------|-------|
| Locality parameter | *Perplexity* → "How many neighbors each point considers" | *n_neighbors* → "How many neighbors are linked in the KNN graph" |
| Cluster compactness | Controlled indirectly | *min_dist* → "How close points can sit together" |
| Mechanism | Probabilistic similarity matching | Graph + topological optimization |
| Effect | Emphasizes local clusters | Balances local + global structure |

---

## 🧭 Summary Table

| Aspect | t-SNE | UMAP |
|--------|--------|-------|
| Core parameters | `perplexity`, `learning_rate`, `early_exaggeration` | `n_neighbors`, `min_dist` |
| Defines locality by | Entropy-based Gaussian width | KNN graph |
| Controls cluster compactness via | Indirectly (perplexity, exaggeration) | Directly (`min_dist`) |
| Local vs Global balance | Tuned by perplexity | Tuned by n_neighbors |
| Typical use | Visualization | Visualization or general reduction |
| Speed / scale | Slower | Much faster |

---

## 💡 Quick Takeaway

- **t-SNE:** adjusts *how many neighbors* each point “feels” → via **perplexity**.
- **UMAP:** fixes *how many neighbors* via **n_neighbors**, then adjusts *how tightly they’re packed* via **min_dist**.

In short:

> *t-SNE learns how wide to look; UMAP learns how close to hug.*
