# t-SNE — Notebook-style Notes

> **Goal:** compact but complete reference you can paste into a single Jupyter Markdown cell. Includes: quick stats refresher (z/t/p), t-distribution PDF, t-SNE formulation, why KL, why the *t* tail, optimization/gradient, hyperparams, tricks, pitfalls, and interview notes.

---

## 0) Quick refresher — z-test, t-test, p-value

- **Z-test (known σ or large n):**

  $$
  z=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}},\quad z\sim \mathcal N(0,1)\text{ under }H_0
  $$

- **t-test (unknown σ, small/medium n):**

  $$
  t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}},\quad t\sim t_\nu\text{ under }H_0,\ \nu=n-1
  $$

- **p-value:** probability of observing a statistic at least as extreme as the observed one **if** $H_0$ were true.
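
As a quick illustration of the formulas above, here is a minimal sketch using only the standard library. The sample, `mu0`, and `sigma_known` are made-up values chosen for the example, not data from these notes.

```python
import math
from statistics import NormalDist, mean, stdev

sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.5]  # made-up measurements
mu0 = 5.0           # mean under H0 (assumed for the example)
sigma_known = 0.3   # population sd, assumed known (z-test only)

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)   # sample sd with the n-1 denominator

z = (x_bar - mu0) / (sigma_known / math.sqrt(n))
t = (x_bar - mu0) / (s / math.sqrt(n))   # compare against t with nu = n-1

# Two-sided p-value for the z statistic: P(|Z| >= |z|) under H0.
p_z = 2.0 * (1.0 - NormalDist().cdf(abs(z)))

print(f"z = {z:.3f}, t = {t:.3f} (nu = {n - 1}), two-sided p (z-test) = {p_z:.4f}")
```

The p-value for the t statistic needs the t-distribution CDF, which the standard library does not provide; if SciPy is available, `2 * scipy.stats.t.sf(abs(t), df=n - 1)` is one way to get it.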

### Student’s t-distribution PDF

$$
f(t)=\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}
{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
\left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}
$$

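- Heavier tails than the normal; $t_\nu \to \mathcal N(0,1)$ as $\nu\to\infty$. For $\nu=1$ it is the Cauchy distribution, $f(t)=\frac{1}{\pi(1+t^2)}$, which is the case t-SNE uses.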
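
The PDF above can be checked numerically. The sketch below (function names are illustrative) evaluates it in log space with `math.lgamma` and compares it with the Cauchy and normal densities.

```python
import math

def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom, computed in log space."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def cauchy_pdf(t):
    """Standard Cauchy density, i.e. Student's t with nu = 1."""
    return 1.0 / (math.pi * (1.0 + t * t))

for t in (0.0, 2.0, 4.0):
    print(f"t={t}: t_1={t_pdf(t, 1):.5f}  cauchy={cauchy_pdf(t):.5f}  "
          f"t_1000={t_pdf(t, 1000):.5f}  normal={normal_pdf(t):.5f}")
```

At large $|t|$ the $\nu=1$ column stays orders of magnitude above the normal column, which is the heavy-tail property used later.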

---

## 1) What is t-SNE?

**t-Distributed Stochastic Neighbor Embedding** — a **nonlinear** dimensionality-reduction method primarily for **visualization** (2D/3D). It preserves **local neighborhood structure** of high-dimensional data.

- “SNE”: preserve neighbor **probabilities** rather than raw distances.
- “t-Distributed”: use a **Student’s t** distribution in low-D to mitigate the **crowding problem**.

---

## 2) High-D similarities $P=\{p_{ij}\}$

For data $x_1,\dots,x_n\in\mathbb R^D$:

1. Conditional similarities with a **Gaussian kernel**:

   $$
   p_{j|i}=\frac{\exp\!\left(-\frac{\lVert x_i-x_j\rVert^2}{2\sigma_i^2}\right)}
   {\sum_{k\neq i}\exp\!\left(-\frac{\lVert x_i-x_k\rVert^2}{2\sigma_i^2}\right)}
   $$

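   - $\sigma_i$ is found per point (typically by binary search) so that the **perplexity** $2^{H(P_i)}$, with $H(P_i)=-\sum_j p_{j|i}\log_2 p_{j|i}$, matches the user-chosen value; this sets the effective number of neighbors.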

2. Symmetrize:

   $$
   p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n},\qquad p_{ii}=0,\ \sum_{i\ne j}p_{ij}=1
   $$

> Intuition: $p_{ij}$ encodes “who is near whom” in high-D, focusing on **local** neighborhoods.
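
The two steps above can be sketched in NumPy as follows. This is a minimal dense $O(n^2)$ illustration, not how production libraries do it; the helper names (`conditional_p`, `perplexity_of`, `joint_p`) and the bisection bounds are assumptions made for the example. The search runs over $\beta_i = 1/(2\sigma_i^2)$ and uses the standard definition of perplexity, $2^{H(P_i)}$.

```python
import numpy as np

def conditional_p(sq_dists, beta):
    """p_{j|i} for one point; sq_dists excludes the point itself, beta = 1/(2 sigma_i^2)."""
    w = np.exp(-(sq_dists - sq_dists.min()) * beta)  # shift cancels after normalizing
    return w / w.sum()

def perplexity_of(p):
    """Perplexity 2^H(p) with entropy H in bits."""
    return 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))

def joint_p(X, perplexity=5.0, n_steps=50):
    """Symmetrized joint P; requires perplexity < n - 1."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.zeros((n, n))
    for i in range(n):
        d = np.delete(D[i], i)
        lo, hi = 1e-10, 1e10              # assumed search bounds on beta
        for _ in range(n_steps):
            beta = np.sqrt(lo * hi)       # bisection in log space
            if perplexity_of(conditional_p(d, beta)) > perplexity:
                lo = beta                 # too broad: shrink sigma_i
            else:
                hi = beta                 # too narrow: grow sigma_i
        P_cond[i, np.arange(n) != i] = conditional_p(d, beta)
    return (P_cond + P_cond.T) / (2 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))              # synthetic stand-in data
P = joint_p(X, perplexity=5.0)
print(P.sum(), np.allclose(P, P.T))
```

Each row of the conditional matrix sums to 1, so dividing the symmetrized sum by $2n$ makes the joint $P$ sum to 1.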

---

## 3) Low-D similarities $Q=\{q_{ij}\}$

For embedding points $y_i\in\mathbb R^{d}$ (usually $d=2$):

$$
q_{ij}=\frac{(1+\lVert y_i-y_j\rVert^2)^{-1}}
{\sum_{k\neq \ell}(1+\lVert y_k-y_\ell\rVert^2)^{-1}},\qquad q_{ii}=0
$$

- This is a **Student’s t** kernel with 1 d.o.f. (exactly the Cauchy kernel): **heavy tails**, and no per-point bandwidth to tune.
- Both $P$ and $Q$ use **Euclidean distances**; the difference is the **decay** (Gaussian vs t-tail).
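
A minimal NumPy sketch of the $q_{ij}$ formula above (the function name `low_dim_q` is illustrative). Note that the normalization runs over all pairs, not per row.

```python
import numpy as np

def low_dim_q(Y):
    """Joint low-D similarities with the Student-t (1 d.o.f.) kernel."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq)        # unnormalized kernel values
    np.fill_diagonal(W, 0.0)    # q_ii = 0
    return W / W.sum()          # single global normalizer over all k != l

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 2))     # random embedding, for illustration only
Q = low_dim_q(Y)
print(Q.sum(), np.allclose(Q, Q.T))
```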

---

## 4) Objective (what is optimized)

Minimize the **KL divergence** $KL(P\parallel Q)$ over the embedding $Y$ ($P$ is fixed; only $Q$ changes):

$$
C(Y)=KL(P\parallel Q)=\sum_{i\ne j}p_{ij}\log\frac{p_{ij}}{q_{ij}}
$$

- **Why KL (and not MSE on distances)?** It compares **distributions of similarities**, so it preserves **relative** neighbor structure rather than absolute distances.
- **Asymmetry matters:** a large $p_{ij}$ (true neighbors) paired with a small $q_{ij}$ incurs a large penalty, while a small $p_{ij}$ paired with a large $q_{ij}$ costs little ⇒ the objective **prioritizes local structure**.

> **Smaller KL is better.** $KL\ge 0$, with $KL=0$ if and only if $P\equiv Q$.
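
To make the asymmetry concrete, the sketch below evaluates the per-pair term $p\log(p/q)$ for the two kinds of mismatch. The numbers are made up for illustration, and `kl_divergence` and `kl_term` are hypothetical helper names.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over entries with P > 0 (terms with p = 0 contribute 0)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

def kl_term(p, q):
    """Contribution of a single pair to KL(P || Q)."""
    return p * np.log(p / q)

# True neighbors placed far apart: large p, small q.
missed_neighbor = kl_term(0.4, 0.01)
# Non-neighbors placed close together: small p, large q.
false_neighbor = kl_term(0.01, 0.4)
print(missed_neighbor, false_neighbor)

P = np.array([0.7, 0.2, 0.1])   # toy distributions
Q = np.array([0.1, 0.2, 0.7])
print(kl_divergence(P, Q), kl_divergence(P, P))
```

The first term is much larger in magnitude than the second, which is why separating true neighbors is expensive while merging non-neighbors is cheap under this objective.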

---

## 5) The gradient (how points move)

The variables are **only** the low-D coordinates $Y=\{y_i\}$. The gradient:

$$
\frac{\partial C}{\partial y_i}
=4\sum_{j}(p_{ij}-q_{ij})\,(y_i-y_j)\,(1+\lVert y_i-y_j\rVert^2)^{-1}
$$

- If $p_{ij}>q_{ij}$ → **attraction** (too far → pull together).
- If $p_{ij}<q_{ij}$ → **repulsion** (too close → push apart).
- $(1+r^2)^{-1}$ gives **long-range but decaying** influence (from t-tail).
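
One way to gain confidence in the gradient formula is a finite-difference check. The sketch below assumes a symmetric $P$ with zero diagonal that sums to 1 (here generated at random rather than from data); all function names are illustrative.

```python
import numpy as np

def q_and_weights(Y):
    """Return Q and the unnormalized kernel W = (1 + ||y_i - y_j||^2)^-1."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_cost(P, Y):
    Q, _ = q_and_weights(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def tsne_grad(P, Y):
    """Analytic gradient: 4 * sum_j (p_ij - q_ij) (y_i - y_j) w_ij."""
    Q, W = q_and_weights(Y)
    coeff = (P - Q) * W
    return 4.0 * (coeff.sum(axis=1, keepdims=True) * Y - coeff @ Y)

def numeric_grad(P, Y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[idx] += h
        Ym[idx] -= h
        G[idx] = (kl_cost(P, Yp) - kl_cost(P, Ym)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.random((6, 6))
P = A + A.T                      # random symmetric stand-in for P
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = rng.normal(size=(6, 2))

max_err = np.max(np.abs(tsne_grad(P, Y) - numeric_grad(P, Y)))
print("max abs difference:", max_err)
```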

**Update (batch GD with momentum):**

$$
y_i^{(t+1)}=y_i^{(t)}-\eta\frac{\partial C}{\partial y_i}+\alpha(t)\left[y_i^{(t)}-y_i^{(t-1)}\right]
$$

> All points update **together each iteration** (full batch). No per-point SGD.
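
The update rule can be sketched on a toy problem. Everything specific here is an assumption for illustration: $P$ is built by hand for two groups of five points, `eta = 1.0` is sized for this tiny example (far below typical values for real data), and the momentum schedule (0.5, then 0.8 after 250 iterations) is just one common choice.

```python
import numpy as np

def grad_and_kl(P, Y):
    """Gradient of KL(P || Q) w.r.t. Y, plus the current KL value."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    coeff = (P - Q) * W
    grad = 4.0 * (coeff.sum(axis=1, keepdims=True) * Y - coeff @ Y)
    mask = P > 0
    return grad, np.sum(P[mask] * np.log(P[mask] / Q[mask]))

# Hand-built P: strong similarity within each group, weak across groups.
labels = np.repeat([0, 1], 5)
P = np.where(labels[:, None] == labels[None, :], 1.0, 0.01)
np.fill_diagonal(P, 0.0)
P /= P.sum()

rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(10, 2))   # small random init
Y_prev = Y.copy()
eta = 1.0                                   # assumed, toy-sized step
kl_history = []

for t in range(500):
    grad, kl = grad_and_kl(P, Y)            # full batch: all points at once
    kl_history.append(kl)
    alpha = 0.5 if t < 250 else 0.8         # momentum schedule alpha(t)
    Y_next = Y - eta * grad + alpha * (Y - Y_prev)
    Y_prev, Y = Y, Y_next

print(f"KL: {kl_history[0]:.4f} -> {kl_history[-1]:.4f}")
```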

---

## 6) Why the **t** tail solves crowding (better visualization)

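- In high-D, a point can have many others at moderate, roughly similar distances; 2D does not have enough area to place them all at those distances.
- With a **Gaussian** low-D kernel, these moderately distant pairs end up too far apart in the map, each exerting a small attraction; the many small attractions add up ⇒ everything is crushed toward the center.
- The **t-distribution** decays polynomially: $q_{ij}\propto(1+r^2)^{-1}$ ⇒ a moderate high-D dissimilarity can be matched by a much larger map distance ⇒ the unwanted attractions vanish, and clusters spread and remain separable.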
- With **KL’s asymmetry** (local preservation), t-SNE yields compact, well-separated clusters.
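
A small numeric comparison of the two kernels makes the tail difference concrete. It assumes the unnormalized forms $e^{-r^2}$ (the Gaussian used by the original SNE in low-D) and $(1+r^2)^{-1}$; the function names are illustrative.

```python
import math

def gaussian_kernel(r):
    return math.exp(-r * r)

def student_kernel(r):
    return 1.0 / (1.0 + r * r)

def gaussian_distance(s):
    """Map distance at which the Gaussian kernel equals s."""
    return math.sqrt(-math.log(s))

def student_distance(s):
    """Map distance at which the Student-t kernel equals s."""
    return math.sqrt(1.0 / s - 1.0)

for r in (1.0, 3.0, 5.0):
    print(f"r={r}: gaussian={gaussian_kernel(r):.3e}  student={student_kernel(r):.3e}")

for s in (0.5, 0.1, 0.01):
    print(f"similarity {s}: gaussian r={gaussian_distance(s):.2f}  "
          f"student r={student_distance(s):.2f}")
```

For the same target similarity, the t kernel asks for a considerably larger map distance once the similarity is small, which is the extra room that relieves crowding.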

---

## 7) Practicalities: hyperparameters & tricks

- **perplexity:** 5–50; controls neighborhood size (larger → broader).
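- **learning_rate:** too small → points clump into a dense cloud; too large → points scatter into a diffuse ball. Typical 200–1000; a common heuristic is ~ n/12, but implementations scale the gradient differently, so check your library's convention.
- **n_iter:** 500–1000+ (named `max_iter` in recent scikit-learn releases); stop only once the KL has plateaued.
- **metric:** use cosine for embeddings, or L2-normalize and use Euclidean: for unit vectors $\lVert a-b\rVert^2=2(1-\cos\theta)$, so both give the same neighbor ranking.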
- **init:** `'pca'` (stable) vs `'random'` (stochastic).
- **early_exaggeration:** multiply all $p_{ij}$ by a factor of 4–12 for the first ~250 iterations → stronger early attraction, so clusters form tightly before spreading out.
- **Speed-ups:** Barnes–Hut ($O(n\log n)$) or FIt-SNE/FFT for large n.

---

## 8) What t-SNE is **not**

- **No explicit mapping** (no $f(x)=Wx$ as in PCA) → it learns only the coordinates $Y$ of the points it was fit on → mainly a **visualization** tool.
- **Cannot project new points** (unless parametric t-SNE or approximation).
- **Global distances, angles, and cluster sizes are not reliable**; only **local neighborhoods** are preserved by design.

---

## 9) Sanity checks & troubleshooting

- Run multiple seeds → clusters should broadly match.
- **Perplexity sweep:** small → fine clusters; large → smoother.
- **Learning rate:** collapse → increase slightly; oscillation → decrease.
- **Normalize embeddings**; use cosine metric.
- **Don’t over-interpret:** cluster size/gaps ≠ quantitative distance.
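
The "multiple seeds" check can be made quantitative by comparing nearest-neighbor sets between two embeddings, since those are unaffected by rotation or translation of the map. The sketch below is one possible measure, not a standard named metric; `knn_overlap` is a hypothetical helper, and the demo uses a rotated copy and an unrelated random layout instead of real t-SNE runs.

```python
import numpy as np

def knn_indices(Y, k):
    """Indices of the k nearest neighbors of each row (self excluded)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)
    return np.argsort(sq, axis=1)[:, :k]

def knn_overlap(Y_a, Y_b, k=10):
    """Mean fraction of shared k-nearest neighbors between two embeddings."""
    nn_a, nn_b = knn_indices(Y_a, k), knn_indices(Y_b, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
Y_a = rng.normal(size=(50, 2))

theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y_rotated = Y_a @ R.T + 3.0             # same layout, rotated and shifted
Y_unrelated = rng.normal(size=(50, 2))  # different layout entirely

print(knn_overlap(Y_a, Y_rotated), knn_overlap(Y_a, Y_unrelated))
```

In practice `Y_a` and `Y_b` would be embeddings of the same data from two seeds; an overlap close to that of an unrelated layout suggests the structure is not stable.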

---

## 10) Interview-ready contrasts & sound bites

- **PCA vs t-SNE:** PCA = linear/global; t-SNE = nonlinear/local.
- **Why KL(P‖Q):** asymmetric → protects close neighbors.
- **Why t-distribution:** heavy tails → long-range repulsion → less crowding.
- **No transform W:** pairwise objective ⇒ non-parametric.
- **UMAP vs t-SNE:** UMAP is typically faster, supports `.transform()` on new points, and often retains more global structure (this depends on initialization and settings).

---

## 11) Force-based intuition

Each pair exerts forces:

$$
\frac{\partial C}{\partial y_i}
=4\sum_j \underbrace{(p_{ij}-q_{ij})}_{\text{attract vs repel}}
\underbrace{(y_i-y_j)}_{\text{direction}}
\underbrace{(1+\lVert y_i-y_j\rVert^2)^{-1}}_{\text{decay (t-tail)}}
$$

- Attraction ∝ $p_{ij}$ (true neighbors).
- Repulsion ∝ $q_{ij}$ (low-D similarity).
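- Equilibrium when forces balance → a stationary point (local minimum) of KL; the objective is non-convex, so different runs can settle in different minima.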
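
The split into attraction and repulsion can be written out directly. In the sketch below (illustrative names, hand-picked toy values) the gradient is the sum of the two parts, and a descent step moves each point along the negative of each part.

```python
import numpy as np

def forces(P, Y):
    """Split the gradient into its attractive (p) and repulsive (q) parts."""
    diff = Y[:, None, :] - Y[None, :, :]              # y_i - y_j, shape (n, n, d)
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))      # t-tail decay
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    attract = 4.0 * np.sum((P * W)[:, :, None] * diff, axis=1)
    repel = -4.0 * np.sum((Q * W)[:, :, None] * diff, axis=1)
    return attract, repel                             # gradient = attract + repel

# Toy setup: points 0 and 1 are true neighbors, point 2 is not.
P = np.array([[0.00, 0.40, 0.05],
              [0.40, 0.00, 0.05],
              [0.05, 0.05, 0.00]])
Y = np.array([[0.0, 0.0],
              [3.0, 0.0],
              [0.0, 1.0]])

attract, repel = forces(P, Y)
to_neighbor = Y[1] - Y[0]
# Descent moves along the negative gradient, so flip the sign of each part.
print("attractive step on point 0:", -attract[0])
print("repulsive step on point 0: ", -repel[0])
print("along direction to neighbor:", (-attract[0]) @ to_neighbor, (-repel[0]) @ to_neighbor)
```

Barnes–Hut style speed-ups rely on this split: the attractive part only involves pairs with non-negligible $p_{ij}$, while the repulsive part is the one that gets approximated.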

---

## 12) Wrap-up TL;DR

- **What:** map high-D data → 2D/3D by matching neighbor probabilities $P$ and $Q$.
- **Loss:** $KL(P\parallel Q)$ (minimized).
- **Why it works:** KL asymmetry + t-tails → preserve local clusters, avoid crowding.
- **Use:** visualization & exploration only.
- **Mind:** hyperparams (perplexity, LR) and interpretability limits.
