# A Unified Mathematical Map of Generative Models

This document **normalizes** the core equations of the major generative model families into a clean, consistent mathematical narrative.  
Nothing new is added conceptually; the goal is **clarity, correctness, and structural alignment** across all families of generative models.

---

## 0. Shared Probabilistic Foundation

### Data distribution and model
$$
x \sim p_{\text{data}}(x), \qquad p_\theta(x) \;\text{(model)}
$$

### Maximum Likelihood Estimation (MLE)
$$
\theta^{\star} = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}
\big[ \log p_\theta(x) \big]
$$

### Cross-entropy and KL relations
$$
\mathbb{E}_{p_{\text{data}}}\big[-\log p_\theta(x)\big]
= H(p_{\text{data}}, p_\theta)
$$

$$
\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)
= \mathbb{E}_{p_{\text{data}}}
\left[
\log \frac{p_{\text{data}}(x)}{p_\theta(x)}
\right]
$$

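Since $H(p_{\text{data}}, p_\theta) = H(p_{\text{data}}) + \mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$ and the entropy $H(p_{\text{data}})$ does not depend on $\theta$, minimizing cross-entropy over $\theta$ is equivalent to minimizing the forward KL, and both are equivalent to MLE.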
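
The relation between cross-entropy, entropy, and forward KL can be checked numerically. The sketch below uses two hypothetical three-outcome distributions (`p_data` and `p_model` are illustrative values, not from any dataset) and confirms that the cross-entropy differs from the KL only by the entropy of the data distribution.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = E_p[-log q]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """Forward KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions, chosen only for illustration.
p_data = [0.5, 0.3, 0.2]
p_model = [0.4, 0.4, 0.2]

# H(p_data, p_model) - [H(p_data) + KL(p_data || p_model)] should vanish.
gap = cross_entropy(p_data, p_model) - (entropy(p_data) + kl(p_data, p_model))
print(f"decomposition gap: {gap:.2e}")
```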

### Bayes' rule
$$
p_\theta(z \mid x)
=
\frac{p_\theta(x \mid z)\,p(z)}{p_\theta(x)},
\qquad
p_\theta(x)=\int p_\theta(x \mid z)p(z)\,dz
$$

---

## 1. Autoregressive Generative Models

### Chain rule factorization
$$
p_\theta(x)
=
\prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})
$$

### Log-likelihood
$$
\log p_\theta(x)
=
\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$

### Discrete softmax parameterization
$$
p_\theta(x_t=v \mid x_{<t})
=
\frac{\exp(\ell_{\theta,v}(x_{<t}))}
{\sum_{v'} \exp(\ell_{\theta,v'}(x_{<t}))}
$$

Examples: NADE, PixelRNN, PixelCNN, Transformers as language models.
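
As a small illustration of the factorization and the softmax parameterization, the sketch below scores sequences token by token. The function `toy_logits` is a hypothetical stand-in for a trained network (its logits depend only on the previous token), so the numbers carry no meaning beyond showing that the product of conditionals defines a normalized distribution over sequences.

```python
import math
from itertools import product

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_norm for l in logits]

def toy_logits(prefix, vocab_size):
    """Hypothetical model: logits depend on the last token of the prefix only."""
    prev = prefix[-1] if prefix else 0
    return [0.5 * ((prev + v) % vocab_size) for v in range(vocab_size)]

def sequence_log_prob(x, vocab_size):
    """log p(x) = sum_t log p(x_t | x_{<t})."""
    return sum(
        log_softmax(toy_logits(x[:t], vocab_size))[x[t]]
        for t in range(len(x))
    )

VOCAB, T = 3, 3
# Summing p(x) over every length-T sequence should give 1.
total_mass = sum(
    math.exp(sequence_log_prob(list(x), VOCAB))
    for x in product(range(VOCAB), repeat=T)
)
print(f"total probability mass: {total_mass:.12f}")
```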

---

## 2. Latent-Variable Models (General)

### Marginal likelihood
$$
p_\theta(x)
=
\int p_\theta(x \mid z)p(z)\,dz
$$

### Variational inference  
Introduce an approximate posterior $q_\phi(z \mid x)$. For any such $q_\phi$, the log-evidence decomposes as:
$$
\log p_\theta(x)
=
\underbrace{
\mathbb{E}_{q_\phi(z \mid x)}
[\log p_\theta(x \mid z)]
-
\mathrm{KL}(q_\phi(z \mid x)\,\|\,p(z))
}_{\mathcal{L}_{\text{ELBO}}(x)}
+
\mathrm{KL}(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x))
$$

Because the last KL term is non-negative:
$$
\log p_\theta(x) \ge \mathcal{L}_{\text{ELBO}}(x)
$$
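
The decomposition can be verified exactly when the latent is discrete, since every term is then a finite sum. The sketch below assumes a three-state latent with hypothetical values for the prior, the likelihood at one fixed $x$, and a deliberately imperfect $q$.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical numbers for a three-state latent z and one fixed observation x.
prior = [0.5, 0.3, 0.2]        # p(z)
likelihood = [0.1, 0.6, 0.3]   # p(x | z)
q = [0.2, 0.5, 0.3]            # q(z | x), deliberately not the true posterior

evidence = sum(pz * lx for pz, lx in zip(prior, likelihood))           # p(x)
posterior = [pz * lx / evidence for pz, lx in zip(prior, likelihood)]  # p(z | x)

# ELBO = E_q[log p(x | z)] - KL(q || prior)
elbo = sum(qz * math.log(lx) for qz, lx in zip(q, likelihood)) - kl(q, prior)
gap = kl(q, posterior)  # KL(q || posterior), the slack in the bound

print(f"log p(x) = {math.log(evidence):.6f}, ELBO + gap = {elbo + gap:.6f}")
```

Setting `q = posterior` would drive `gap` to zero and make the bound tight.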

---

## 3. Variational Autoencoders (VAE)

### Standard ELBO
$$
\max_{\theta,\phi}
\;
\mathbb{E}_{x \sim p_{\text{data}}}
\left[
\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
-
\mathrm{KL}(q_\phi(z \mid x)\,\|\,p(z))
\right]
$$

### Typical choices
$$
p(z)=\mathcal{N}(0,I)
$$

$$
q_\phi(z \mid x)=
\mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x)))
$$

### Reparameterization trick
$$
z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon,
\qquad
\epsilon\sim\mathcal{N}(0,I)
$$
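
A minimal sketch of the reparameterized sample, together with the closed-form KL term that these Gaussian choices give, $\tfrac{1}{2}\sum_i(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2)$. The values of `mu` and `sigma` are hypothetical encoder outputs; in a real VAE they would come from a network and the operations would run in an autodiff framework.

```python
import math
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, elementwise; all randomness lives in eps."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [0.8, 1.5]          # hypothetical encoder outputs
eps = [rng.gauss(0.0, 1.0) for _ in mu]      # eps ~ N(0, I)
z = reparameterize(mu, sigma, eps)
kl_value = kl_to_standard_normal(mu, sigma)
print(f"z = {z}, KL = {kl_value:.4f}")
```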

### $\beta$-VAE
$$
\mathcal{L}_{\beta\text{-VAE}}
=
\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
-
\beta\,\mathrm{KL}(q_\phi(z \mid x)\,\|\,p(z))
$$

### IWAE
$$
\log p_\theta(x)
\ge
\mathbb{E}_{z_1,\dots,z_K\sim q_\phi(z\mid x)}
\left[
\log\frac{1}{K}
\sum_{k=1}^K
\frac{p_\theta(x,z_k)}{q_\phi(z_k\mid x)}
\right]
$$

---

## 4. Flow-Based Generative Models

### Invertible transform
$$
x=f_\theta(z),
\qquad
z=f_\theta^{-1}(x)
$$

### Change of variables
$$
p_\theta(x)
=
p(z)\left|
\det\frac{\partial z}{\partial x}
\right|
$$

$$
\log p_\theta(x)
=
\log p(z)
+
\log\left|
\det\frac{\partial f_\theta^{-1}(x)}{\partial x}
\right|
$$

### Composition of flows
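Here each $f_k$ acts in the normalizing direction (data to latent), so $f_\theta^{-1}=f_K\circ\cdots\circ f_1$:
$$
z_0=x,\quad
z_k=f_k(z_{k-1}),\quad
z_K=z
$$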

$$
\log p_\theta(x)
=
\log p(z_K)
+
\sum_{k=1}^K
\log
\left|
\det\frac{\partial f_k}{\partial z_{k-1}}
\right|
$$
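
To make the log-determinant bookkeeping concrete, the sketch below composes two one-dimensional affine layers and compares the resulting log-density with the Gaussian it must equal. The parameters `A`, `B`, `C` are hypothetical, and each layer is written in the data-to-latent direction used by the composition formula.

```python
import math

def log_normal_pdf(x, mean=0.0, std=1.0):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

# Hypothetical layer parameters.
A, B, C = 2.0, 1.0, 0.25

# Each entry: (f_k, log |d f_k / d z_{k-1}|), both functions of the layer input.
layers = [
    (lambda u: (u - B) / A, lambda u: -math.log(abs(A))),
    (lambda u: u / C,       lambda u: -math.log(abs(C))),
]

def flow_log_prob(x):
    """log p(x) = log p(z_K) + sum_k log |det d f_k / d z_{k-1}|."""
    z, log_det = x, 0.0
    for f, log_abs_det in layers:
        log_det += log_abs_det(z)  # evaluated at the layer input z_{k-1}
        z = f(z)
    return log_normal_pdf(z) + log_det

x = 0.3
# Inverting the layers gives x = A*C*z + B with z ~ N(0, 1), so x ~ N(B, (A*C)^2).
reference = log_normal_pdf(x, mean=B, std=abs(A * C))
print(f"flow: {flow_log_prob(x):.6f}, reference: {reference:.6f}")
```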

---

## 5. Energy-Based Models (EBMs)

### Energy parameterization
$$
p_\theta(x)=
\frac{\exp(-E_\theta(x))}{Z(\theta)},
\qquad
Z(\theta)=\int \exp(-E_\theta(x))\,dx
$$

### Log-likelihood
$$
\log p_\theta(x)
=
- E_\theta(x) - \log Z(\theta)
$$

### Score
$$
\nabla_x \log p_\theta(x)
=
- \nabla_x E_\theta(x)
$$

### Langevin sampling
$$
x_{k+1}
=
x_k
+
\frac{\eta}{2}\nabla_x \log p_\theta(x_k)
+
\sqrt{\eta}\,\epsilon_k,
\qquad
\epsilon_k\sim\mathcal{N}(0,I)
$$
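
The update can be tried on the simplest energy, $E(x)=x^2/2$, whose density is a standard Gaussian. This is a sketch under that assumption; the step size `eta`, chain length, and burn-in are hypothetical choices, and the unadjusted chain carries a small discretization bias in its variance.

```python
import math
import random

def score(x):
    """Score of N(0, 1): the energy is E(x) = x^2 / 2, so -dE/dx = -x."""
    return -x

def langevin_chain(n_steps, eta, x0=3.0, seed=0):
    """Unadjusted Langevin dynamics: x <- x + (eta/2) * score(x) + sqrt(eta) * eps."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x = x + 0.5 * eta * score(x) + math.sqrt(eta) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

samples = langevin_chain(n_steps=20000, eta=0.1)[1000:]  # drop burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```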

---

## 6. RBMs, Boltzmann Machines, Deep Belief Nets

### RBM energy
$$
E_\theta(v,h)
=
- b^\top v
- c^\top h
- v^\top W h
$$

### Joint and marginal
$$
p_\theta(v,h)=\frac{e^{-E_\theta(v,h)}}{Z(\theta)},
\qquad
p_\theta(v)=\sum_h p_\theta(v,h)
$$

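### Conditional independence
The energy has no visible-visible or hidden-hidden terms, so the units of one layer are independent given the other layer. For binary units: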
$$
p(h_j=1\mid v)
=
\sigma\!\left(c_j+\sum_i W_{ij}v_i\right)
$$

$$
p(v_i=1\mid h)
=
\sigma\!\left(b_i+\sum_j W_{ij}h_j\right)
$$

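### Log-likelihood gradient (approximated by contrastive divergence)
$$
\nabla_\theta \log p_\theta(v)
=
\mathbb{E}_{p_\theta(h\mid v)}[\nabla_\theta(-E_\theta(v,h))]
-
\mathbb{E}_{p_\theta(v',h)}[\nabla_\theta(-E_\theta(v',h))]
$$

This is the exact gradient. The second expectation is taken under the model's joint distribution and is intractable in general; contrastive divergence (CD-$k$) approximates it with samples from $k$ steps of block Gibbs sampling started at the data.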
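
For an RBM small enough to enumerate, the two-phase gradient can be checked against a finite-difference derivative of $\log p_\theta(v)$. The sketch below assumes a binary RBM with two visible and two hidden units and hypothetical parameter values; it computes the model expectation exactly rather than by Gibbs sampling, so it illustrates the identity, not the contrastive divergence approximation itself.

```python
import math
from itertools import product

# Hypothetical parameters for a 2-visible, 2-hidden binary RBM.
b = [0.1, -0.2]                  # visible biases
c = [0.3, 0.0]                   # hidden biases
W = [[0.5, -0.4], [0.2, 0.7]]    # W[i][j] couples v_i and h_j
STATES = list(product([0, 1], repeat=2))

def energy(v, h, W):
    bias = sum(b[i] * v[i] for i in range(2)) + sum(c[j] * h[j] for j in range(2))
    pair = sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2))
    return -bias - pair

def partition(W):
    return sum(math.exp(-energy(v, h, W)) for v in STATES for h in STATES)

def log_p_v(v, W):
    unnorm = sum(math.exp(-energy(v, h, W)) for h in STATES)
    return math.log(unnorm) - math.log(partition(W))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_W(v, W, i, j):
    # Positive phase: E_{p(h|v)}[v_i h_j] = v_i * p(h_j = 1 | v).
    pos = v[i] * sigmoid(c[j] + sum(W[k][j] * v[k] for k in range(2)))
    # Negative phase: E_{p(v,h)}[v_i h_j], exact by enumeration.
    neg = sum(
        vv[i] * h[j] * math.exp(-energy(vv, h, W))
        for vv in STATES for h in STATES
    ) / partition(W)
    return pos - neg

def finite_diff(v, W, i, j, eps=1e-5):
    up = [row[:] for row in W]
    down = [row[:] for row in W]
    up[i][j] += eps
    down[i][j] -= eps
    return (log_p_v(v, up) - log_p_v(v, down)) / (2 * eps)

v = (1, 0)
analytic = grad_W(v, W, 0, 1)
numeric = finite_diff(v, W, 0, 1)
print(f"analytic = {analytic:.8f}, finite difference = {numeric:.8f}")
```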

---

## 7. GANs as Divergence Minimization

### Minimax objective
$$
\min_G \max_D
\;
\mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)]
+
\mathbb{E}_{z\sim p(z)}[\log(1-D(G(z)))]
$$

### Optimal discriminator
$$
D^{\star}(x)=
\frac{p_{\text{data}}(x)}
{p_{\text{data}}(x)+p_g(x)}
$$

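At $D^{\star}$, the inner maximum equals $2\,\mathrm{JS}(p_{\text{data}}\|p_g)-\log 4$, where $p_g$ is the distribution of $G(z)$, so the generator minimizes the Jensen–Shannon divergence between $p_{\text{data}}$ and $p_g$.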
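
On a discrete sample space the optimal discriminator and the resulting objective value can be computed directly. The sketch below assumes two hypothetical three-outcome distributions and checks that plugging the optimal discriminator into the minimax objective gives $2\,\mathrm{JS}-\log 4$.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical data and generator distributions.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

# Optimal discriminator, evaluated pointwise.
d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

# Objective value E_data[log D] + E_g[log(1 - D)] at D = d_star.
value = (
    sum(pd * math.log(d) for pd, d in zip(p_data, d_star))
    + sum(pg * math.log(1 - d) for pg, d in zip(p_g, d_star))
)
print(f"V(D*, G) = {value:.6f}, 2*JS - log 4 = {2 * js(p_data, p_g) - math.log(4):.6f}")
```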

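### WGAN (Wasserstein-1, Kantorovich–Rubinstein dual form)
The supremum runs over all 1-Lipschitz functions $f$ (the critic):
$$
W_1(p,q)
=
\sup_{\|f\|_L\le 1}
\Big(\mathbb{E}_{x\sim p}[f(x)]-\mathbb{E}_{x\sim q}[f(x)]\Big)
$$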

---

## 8. Diffusion Models (DDPM)

### Forward noising
$$
q(x_t\mid x_{t-1})
=
\mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1},\beta_t I)
$$

Define:
$$
\alpha_t=1-\beta_t,
\qquad
\bar{\alpha}_t=\prod_{s=1}^t \alpha_s
$$

$$
q(x_t\mid x_0)
=
\mathcal{N}(\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)I)
$$

### Reverse model
$$
p_\theta(x_{t-1}\mid x_t)
=
\mathcal{N}(\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))
$$

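### Noise prediction
Sampling from the closed-form marginal gives $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$. The network is trained to recover the injected noise:
$$
\epsilon_\theta(x_t,t)\approx\epsilon
$$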

### Training loss
$$
\min_\theta
\;
\mathbb{E}_{t,x_0,\epsilon}
\big[\|\epsilon-\epsilon_\theta(x_t,t)\|_2^2\big]
$$
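
The sketch below checks the closed-form marginal by composing the one-step kernels, then builds one training pair for the noise-prediction loss. The `betas` schedule is hypothetical, the index `t` is zero-based, and `oracle_eps` is an illustrative stand-in for a trained $\epsilon_\theta$ (it cheats by knowing $x_0$), used only to show that the loss vanishes when the noise is recovered exactly.

```python
import math
import random

T = 10
betas = [0.01 + 0.02 * t for t in range(T)]   # hypothetical schedule
alphas = [1.0 - beta for beta in betas]

alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

# Compose x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) * noise, tracking the
# coefficient on x_0 and the accumulated noise variance.
mean_coef, var = 1.0, 0.0
for a, beta in zip(alphas, betas):
    mean_coef *= math.sqrt(a)
    var = a * var + beta

def q_sample(x0, t, eps):
    """Draw x_t from q(x_t | x_0) given the noise eps."""
    ab = alpha_bar[t]
    return [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(x0, eps)]

def noise_loss(eps, eps_pred):
    """Squared L2 error between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
t = rng.randrange(T)
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = q_sample(x0, t, eps)

def oracle_eps(x_t, t):
    """Stand-in for eps_theta that inverts q_sample using the known x0."""
    ab = alpha_bar[t]
    return [(xt - math.sqrt(ab) * x) / math.sqrt(1 - ab) for xt, x in zip(x_t, x0)]

print(f"loss with oracle predictor: {noise_loss(eps, oracle_eps(x_t, t)):.2e}")
```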

---

## 9. Score Matching

### Score definition
$$
s_\theta(x)=\nabla_x \log p_\theta(x)
$$

### Hyvärinen score matching
$$
J(\theta)
=
\mathbb{E}_{p_{\text{data}}}
\left[
\frac{1}{2}\|s_\theta(x)\|^2
+
\nabla\cdot s_\theta(x)
\right]
$$

### Denoising score matching (Gaussian noise)
$$
\min_\theta
\;
\mathbb{E}_{x\sim p_{\text{data}},\;\tilde{x}\sim q_\sigma(\tilde{x}\mid x)}
\big[
\|s_\theta(\tilde{x})-\nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\mid x)\|^2
\big]
$$

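For $q_\sigma(\tilde{x}\mid x)=\mathcal{N}(\tilde{x}\mid x,\sigma^2 I)$ the target score has a closed form:
$$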
\nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\mid x)
=
\frac{x-\tilde{x}}{\sigma^2}
$$
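
The Gaussian target score can be confirmed with a finite-difference derivative of the log-kernel. The scalar values below are hypothetical; the same check also shows that, writing $\tilde{x}=x+\sigma\epsilon$, the target equals $-\epsilon/\sigma$, which is the link to noise prediction.

```python
import math

def log_q(x_tilde, x, sigma):
    """log N(x_tilde | x, sigma^2) for scalars."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x_tilde - x) ** 2 / (2 * sigma ** 2)

def target_score(x_tilde, x, sigma):
    """Gradient of log_q with respect to x_tilde."""
    return (x - x_tilde) / sigma ** 2

# Hypothetical clean point, noise level, and noise draw.
x, sigma, eps = 1.5, 0.4, -0.7
x_tilde = x + sigma * eps

h = 1e-5
numeric = (log_q(x_tilde + h, x, sigma) - log_q(x_tilde - h, x, sigma)) / (2 * h)
analytic = target_score(x_tilde, x, sigma)
print(f"analytic = {analytic:.6f}, finite difference = {numeric:.6f}")
```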

---

## 10. Diffusion as SDE

### Forward SDE
$$
dx=f(x,t)\,dt+g(t)\,dW_t
$$

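### Reverse-time SDE
With $p_t$ the marginal density of the forward SDE at time $t$ and $\bar{W}_t$ a Wiener process running backward in time:
$$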
dx=
\big[f(x,t)-g(t)^2\nabla_x\log p_t(x)\big]dt
+
g(t)\,d\bar{W}_t
$$

### Probability flow ODE
$$
\frac{dx}{dt}
=
f(x,t)-\frac{1}{2}g(t)^2\nabla_x\log p_t(x)
$$
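
A case where the score is known in closed form makes the probability flow ODE easy to test. The sketch below assumes the forward SDE $dx=-\tfrac{1}{2}x\,dt+dW_t$ with Gaussian data $x_0\sim\mathcal{N}(0,4)$ (both hypothetical choices), for which $p_t=\mathcal{N}(0,v(t))$ with $v(t)=1+3e^{-t}$. Because the ODE is linear here, it must carry a point at time $T$ back to time $0$ by the ratio of standard deviations.

```python
import math

S0_SQ = 4.0  # hypothetical data variance: x_0 ~ N(0, 4)

def marginal_var(t):
    """Variance of p_t under dx = -x/2 dt + dW started from N(0, S0_SQ)."""
    return 1.0 + (S0_SQ - 1.0) * math.exp(-t)

def score(x, t):
    """Exact score of p_t = N(0, marginal_var(t))."""
    return -x / marginal_var(t)

def drift(x, t):      # f(x, t)
    return -0.5 * x

def diffusion(t):     # g(t)
    return 1.0

def probability_flow_rhs(x, t):
    """dx/dt = f(x, t) - (1/2) g(t)^2 * score(x, t)."""
    return drift(x, t) - 0.5 * diffusion(t) ** 2 * score(x, t)

def integrate_backward(x_T, T, n_steps):
    """Explicit Euler from t = T down to t = 0."""
    dt = T / n_steps
    x, t = x_T, T
    for _ in range(n_steps):
        x -= dt * probability_flow_rhs(x, t)
        t -= dt
    return x

T, x_T = 3.0, 1.0
x_0 = integrate_backward(x_T, T, n_steps=3000)
expected = x_T * math.sqrt(marginal_var(0.0) / marginal_var(T))
print(f"x(0) from ODE = {x_0:.4f}, closed form = {expected:.4f}")
```

In practice the exact `score` would be replaced by a learned $s_\theta(x,t)$, and a higher-order ODE solver is a common substitute for Euler.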

---

## 11. Discrete Latent Code Models (VQ-VAE)

### Quantization
$$
z_e=E_\phi(x),
\qquad
z_q=e_k,\quad
k=\arg\min_j\|z_e-e_j\|
$$

### VQ-VAE loss
$$
\mathcal{L}
=
\|x-D_\theta(z_q)\|^2
+
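\|\mathrm{sg}[z_e]-e_k\|^2
+
\beta\|z_e-\mathrm{sg}[e_k]\|^2
$$

Here $\mathrm{sg}[\cdot]$ is the stop-gradient operator and $e_k$ is the selected codebook vector.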
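
A plain-Python sketch of the nearest-code lookup and the three loss terms, using a hypothetical two-dimensional codebook and a hypothetical `beta`. Without an autodiff framework the stop-gradient has no effect on the values, so the codebook and commitment terms share the same forward value; they differ only in which argument would receive gradients.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def quantize(z_e, codebook):
    """Return (k, e_k) for the codebook entry nearest to z_e."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]

def vq_loss_terms(x, x_recon, z_e, e_k, beta):
    """Forward values of the reconstruction, codebook, and commitment terms."""
    recon = squared_distance(x, x_recon)
    codebook_term = squared_distance(z_e, e_k)        # gradient would reach e_k only
    commitment = beta * squared_distance(z_e, e_k)    # gradient would reach z_e only
    return recon, codebook_term, commitment

# Hypothetical codebook and encoder output.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.8, 1.3]
k, z_q = quantize(z_e, codebook)
terms = vq_loss_terms(x=[1.0], x_recon=[0.5], z_e=z_e, e_k=z_q, beta=0.25)
print(f"selected code k = {k}, z_q = {z_q}, loss terms = {terms}")
```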

### Prior over codes
$$
p(k)=\prod_t p(k_t\mid k_{<t})
$$

---

## 12. Mixture Models

### Mixture density
$$
p(x)=\sum_{k=1}^K \pi_k p(x\mid k),
\qquad
\pi_k\ge 0,\quad\sum_{k=1}^K \pi_k=1
$$

### Gaussian mixture
$$
p(x)=\sum_{k=1}^K \pi_k \mathcal{N}(x\mid\mu_k,\Sigma_k)
$$

### EM responsibilities
$$
\gamma_{nk}
=
\frac{\pi_k \mathcal{N}(x_n\mid\mu_k,\Sigma_k)}
{\sum_j \pi_j \mathcal{N}(x_n\mid\mu_j,\Sigma_j)}
$$
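
The E-step formula is short enough to write out directly. The sketch below assumes a one-dimensional, two-component mixture with hypothetical weights, means, and standard deviations, and computes $\gamma_{nk}$ for three points.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(xs, pis, mus, sigmas):
    """gamma[n][k] = pi_k N(x_n | mu_k, sigma_k^2) / sum_j pi_j N(x_n | mu_j, sigma_j^2)."""
    gamma = []
    for x in xs:
        weighted = [pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)]
        total = sum(weighted)
        gamma.append([w / total for w in weighted])
    return gamma

# Hypothetical data and mixture parameters.
xs = [-2.0, 0.0, 3.0]
gamma = responsibilities(xs, pis=[0.5, 0.5], mus=[-2.0, 3.0], sigmas=[1.0, 1.0])
for x, row in zip(xs, gamma):
    print(f"x = {x:+.1f}: responsibilities = {[round(g, 4) for g in row]}")
```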

---

## 13. Divergences Used in Generative Training

### Forward KL (mode-covering)
$$
\min_\theta \mathrm{KL}(p_{\text{data}}\|p_\theta)
$$

### Reverse KL (mode-seeking)
$$
\min_\theta \mathrm{KL}(p_\theta\|p_{\text{data}})
$$

### f-divergence
For a convex function $f$ with $f(1)=0$:
$$
D_f(p\|q)
=
\int q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)\,dx
$$

### Jensen–Shannon
$$
\mathrm{JS}(p\|q)
=
\frac{1}{2}\mathrm{KL}\!\left(p\Big\|\frac{p+q}{2}\right)
+
\frac{1}{2}\mathrm{KL}\!\left(q\Big\|\frac{p+q}{2}\right)
$$

### Wasserstein-1
$$
W_1(p,q)
=
\inf_{\gamma\in\Pi(p,q)}
\mathbb{E}_{(x,y)\sim\gamma}\big[\|x-y\|\big]
$$
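
In one dimension with two equal-size samples, the optimal coupling pairs the sorted points, so the infimum reduces to a mean of absolute differences. The sketch below relies on that special case only; it is not a general optimal-transport solver, and the sample values are hypothetical.

```python
def wasserstein1_1d(xs, ys):
    """W_1 between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # Sorting both samples realizes the monotone (optimal) coupling in 1-D.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

xs = [0.0, 1.0, 3.0]
ys = [5.0, 2.0, 3.0]   # xs shifted by 2, listed in a different order
w = wasserstein1_1d(xs, ys)
print(f"W_1 = {w}")
```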

---

## Final Meta-View

Every generative model above is a **different operationalization** of:

$$
\text{Simple distribution}
\;\xrightarrow{\text{learned transformation}}\;
\text{Data distribution}
$$

What changes is **how probability mass is moved**:  
likelihoods, energies, flows, games, diffusion, or transport.  
This is one theory, expressed through many mathematical lenses.
