# Generative AI Models with Equations & Detailed Explanations

Generative AI aims to model the data distribution \(p(x)\) and generate new samples that resemble the true data.  
Below are the major categories with their mathematical foundations and detailed component breakdowns.

---

## 1. Generative Adversarial Networks (GANs)

**Equation (Minimax Game):**

$$
\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\text{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]
$$

### Components:
- Generator (G): \(G(z; \theta_g)\) maps random noise \(z \sim p_z(z)\) into data space.

- Discriminator (D): \(D(x; \theta_d)\) outputs the probability that its input is real.

- Objective:
  - \(D\) maximizes \(V(D,G)\), i.e. it classifies real and fake samples correctly.
  - \(G\) minimizes \(V(D,G)\), i.e. it lowers \(\log(1 - D(G(z)))\), the log-probability that its samples are detected as fake.

Interpretation: A two-player game where \(G\) tries to fool \(D\), and \(D\) tries to detect fakes.
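
The value function can be estimated from a batch of discriminator outputs. The following is a minimal sketch in plain Python; the name `gan_value` and the example outputs are hypothetical, chosen only to illustrate the two expectation terms, and are not results from a trained model.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for a batch of real samples, each in (0, 1).
    d_fake: D(G(z)) for a batch of generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes the value toward 0 (its maximum).
v_strong_d = gan_value([0.99, 0.98], [0.01, 0.02])
# A fully confused discriminator (0.5 everywhere) gives -log 4.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator prefers the first situation and the generator prefers to drive the value down, which is the tension the minimax notation expresses.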

---

## 2. Variational Autoencoders (VAEs)

**Evidence Lower Bound (ELBO):**

$$
\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))
$$

### Components:
- Encoder: \( q_\phi(z|x) \), approximates posterior distribution of latent \(z\).
- Decoder: \( p_\theta(x|z) \), reconstructs input data from latent samples.
- Prior: \( p(z) \), often standard Gaussian \(\mathcal{N}(0,I)\).
- KL Divergence Term: Regularizes latent distribution to be close to prior.
- Reconstruction Term: Ensures decoded samples resemble the input.

Interpretation: VAEs balance reconstruction accuracy and latent regularization.
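
The bound follows from a single identity. As a sketch, writing \(p_\theta(z|x)\) for the true posterior (a symbol introduced here only for the derivation):

$$
\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))}_{\text{ELBO}} + \text{KL}(q_\phi(z|x) \| p_\theta(z|x))
$$

Because the last KL term is non-negative, dropping it gives the inequality, and the bound is tight exactly when the encoder matches the true posterior.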

---

## 3. Autoregressive Models

**Factorization of Probability:**

$$
p(x) = \prod_{t=1}^T p(x_t \mid x_{<t})
$$

### Components:
- Conditioning: Each token depends only on previous tokens \(x_{<t}\).
- Neural Architectures: RNNs, LSTMs, Transformers.
- Training: Teacher-forcing with maximum likelihood.

Examples: GPT, PixelRNN, WaveNet.  
Interpretation: Generate one element at a time in a sequence.

---

## 4. Normalizing Flows

**Change of Variables Formula:**

$$
p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|
$$

### Components:
- Invertible Mapping: \( f: X \to Z \) maps data to latent space; its inverse \( f^{-1} \) maps latent samples back to data.
- Prior: Simple distribution (Gaussian).
- Jacobian Determinant: Adjusts probability density when mapping between spaces.

Interpretation: Learn exact likelihoods with flexible, invertible functions.

---

## 5. Diffusion Models

**Forward (noising) process:**

$$
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)
$$

**Reverse (denoising) process:**

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t))
$$

### Components:
- Forward Process: Gradually corrupt data with Gaussian noise.
- Reverse Process: Learn denoising distribution parameterized by neural net (e.g., U-Net).
- Scheduler: Controls step sizes \(\beta_t\).
- Sampling: Start from noise, iteratively denoise.

Interpretation: Model learns to reverse noise injection and generate realistic samples.  
Examples: DDPM, Stable Diffusion, Imagen.
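
The forward process can be sketched in a few lines. The example below assumes a linear noise schedule from `1e-4` to `0.02` over 1000 steps (a common choice, not something specified above) and uses the closed form that follows from composing the Gaussian steps, \(x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon\) with \(\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)\). All function names are hypothetical.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def noise_sample(x0, t, abar, rng):
    """Draw x_t from x_0 in one shot instead of looping over t steps."""
    a = abar[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]

betas = linear_betas(1000)
abar = alpha_bars(betas)
rng = random.Random(0)
# At the last step almost none of the original signal is left.
x_T = noise_sample([1.0, -1.0], t=999, abar=abar, rng=rng)
```

Under this schedule `abar` shrinks from just below 1 toward 0, so early steps barely perturb the data and the final step is close to pure Gaussian noise.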

---

## 6. Energy-Based Models (EBMs)

**Boltzmann Distribution:**

$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \quad Z_\theta = \int e^{-E_\theta(x)} dx
$$

### Components:
- Energy Function: \(E_\theta(x)\) assigns low energy to real samples.
- Partition Function: \(Z_\theta\) ensures normalization (often intractable).
- Learning: Contrastive divergence, sampling via MCMC.

Interpretation: Learn an energy landscape where real samples lie in valleys.

---

## 7. Score-Based / Flow-Matching Models

**Stochastic Differential Equation (SDE):**

$$
dx = \nabla_x \log p_\theta(x) \, dt + \sqrt{2} \, dW_t
$$

### Components:
- Score Function: \(\nabla_x \log p_\theta(x)\), gradient of log-density.
- Sampling Dynamics: Langevin dynamics simulate motion in probability landscape.
- Flow Matching: Trains a velocity field whose ODE transports noise samples to data samples along a chosen probability path.

Interpretation: Model generates data by following gradient fields of probability density.

---

## 8. Neural Autoregressive Density Estimators (NADE)

**Density Factorization:**

$$
p(x) = \prod_{i=1}^D p(x_i \mid x_{<i}; \theta)
$$

### Components:
- Neural Parametrization: Conditionals parameterized by NN.
- Ordering: Sequence of features matters (fixed or random).
- Training: Exact likelihood maximization.

Interpretation: A fully tractable probabilistic model for high-dimensional data.

---

## 9. Transformer-based Generative Models

**Self-Attention Mechanism:**

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V
$$

### Components:
- Query (Q): Representation of the current token.
- Key (K): Encodes context of past tokens.
- Value (V): Carries the actual information.
- Softmax Scaling: Normalizes attention scores.

Interpretation: Each token attends to others for context, enabling long-range dependencies.  
Examples: GPT-3/4/5, BERT (masked variant).

---

## 10. Hybrid and Specialized Models

- VAE-GAN: Combines reconstruction likelihood with adversarial loss.  

$$
\mathcal{L} = \mathcal{L}_\text{VAE} + \lambda \mathcal{L}_\text{GAN}
$$

- Diffusion + Transformers: Stable Diffusion uses U-Net with cross-attention.
- Energy-Guided Diffusion: Injects energy-based priors into diffusion sampling.

---

# Summary

- GANs: Adversarial game → fake vs real.
- VAEs: Latent variable model → ELBO maximization.
- Autoregressive: Sequential factorization.
- Normalizing Flows: Exact likelihood via invertible transforms.
- Diffusion: Reverse noise → denoise to generate.
- EBMs: Energy landscape.
- Score Models: Gradient-based sampling.
- NADE: Exact conditional factorization.
- Transformers: Attention-powered autoregression.
- Hybrids: Combine strengths of multiple models.


# Generative AI Models: Probabilistic Essence

---

## 1. GANs

**Essence**: The discriminator outputs a probability between 0 and 1 (real vs fake).

**Why the logarithm?**  
Logarithms turn multiplicative probabilities into additive terms, making optimization stable.  

- For real data: the discriminator maximizes  
$$
\log D(x)
$$  
which forces \(D(x)\) close to 1.  

- For fake data: the discriminator maximizes  
$$
\log (1 - D(G(z)))
$$  
which forces \(D(G(z))\) close to 0.  

**Heart of idea**: The generator improves by pushing \(D(G(z))\) toward 1. At equilibrium the discriminator can do no better than output 0.5 for every input, which happens exactly when generated samples follow the true distribution.

---

## 2. VAEs

**Essence**: Probability of data is often intractable, so we lower-bound it with ELBO.

**Why the KL divergence?**  
It measures how different the encoder’s distribution \(q_\phi(z|x)\) is from a prior (usually Gaussian). By constraining it, the latent space becomes smooth and generalizable.

**Why the log-likelihood term?**  
$$
\log p_\theta(x|z)
$$  
is like a “chance” of reconstructing the input from latent codes. Higher likelihood → better reconstructions.

**Heart of idea**: Balance between two pressures:  
1. Data reconstruction accuracy.  
2. Keeping latent variables in a structured, normalized space.

---

## 3. Autoregressive Models

**Essence**: Probability of a sequence is factorized step by step.

Each conditional probability lies in [0,1], and the probability of the full sequence is the product of these terms.

**Factorization**:  
$$
p(x) = \prod_{t=1}^T p(x_t \mid x_{<t})
$$

**Why logs matter?**  
Maximizing  
$$
\log p(x)
$$  
avoids multiplying many small numbers (underflow) and instead adds them.

**Heart of idea**: Predict the next token given history → chance of the sequence is just the product of these conditional chances.

---

## 4. Normalizing Flows

**Essence**: Exact probability modeling using invertible transformations.

**Change of Variables**:  
$$
p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|
$$

**Why determinant of Jacobian?**  
When you stretch or compress space, the density changes accordingly. The determinant tells how much “volume” is scaled.

**Heart of idea**: Start with a simple distribution (like Gaussian), apply reversible transformations, and compute the exact likelihood of data.

---

## 5. Diffusion Models

**Essence**: Generation is modeled as the reversal of a gradual noising process.

- Forward process gradually destroys structure (adding Gaussian noise).  
- Reverse process learns how likely it is to denoise step by step.  

**Forward process**:  
$$
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)
$$

**Reverse process**:  
$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t))
$$

**Why Gaussian assumption?**  
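Sums of independent Gaussians are again Gaussian, so \(x_t\) can be sampled from \(x_0\) in closed form at any step, and for small \(\beta_t\) the reverse step is also well approximated by a Gaussian.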

**Heart of idea**: The chance of clean data is recovered by chaining probabilities of many denoising steps backward in time.

---

## 6. EBMs

**Essence**: Assigns an “energy” score to every configuration. Low energy = high probability.

**Boltzmann Distribution**:  
$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \quad
Z_\theta = \int e^{-E_\theta(x)} dx
$$

**Why exponentials and partition function?**  
The exponential makes every unnormalized density \(e^{-E_\theta(x)}\) strictly positive; dividing by \(Z_\theta\) makes the result integrate to 1.

**Heart of idea**: The world is modeled as an energy surface; real data sits in valleys, and fake data in high-energy peaks.
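
For a small discrete state space the Boltzmann distribution can be computed exactly. The sketch below uses made-up energies and a hypothetical helper `boltzmann`. In realistic EBMs the sum over states is intractable, which is why the partition function is the hard part.

```python
import math

def boltzmann(energies):
    """Turn a list of energies into probabilities p_i = exp(-E_i) / Z."""
    # Subtracting the minimum energy leaves all ratios unchanged
    # but keeps exp() from overflowing for very negative energies.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)  # partition function of the shifted energies
    return [w / z for w in weights]

# Three states with made-up energies; the lowest energy gets the most mass.
probs = boltzmann([0.5, 2.0, 4.0])
```

Only energy differences matter: the ratio of two probabilities is \(e^{-(E_i - E_j)}\), independent of the normalizer.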

---

## 7. Score-Based / Flow-Matching Models

**Essence**: Instead of modeling probability directly, model its gradient (the score function).

**Stochastic Differential Equation**:  
$$
dx = \nabla_x \log p_\theta(x) \, dt + \sqrt{2}\, dW_t
$$

**Why gradient of log probability?**  
It points in the direction of higher probability density — essentially, “where data is more likely.”

**Heart of idea**: Generate by simulating dynamics that climb toward regions of higher data likelihood.
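
Langevin sampling can be illustrated on a target whose score is known in closed form. The sketch below assumes a standard normal target, so the score is simply \(-x\); in an actual score-based model that function would be a trained neural network. The step size, chain count, and function names are illustrative assumptions.

```python
import math
import random

def score_standard_normal(x):
    """Score of N(0, 1): d/dx log p(x) = -x."""
    return -x

def langevin_sample(score, x0, step, num_steps, rng):
    """Euler-Maruyama discretization of dx = score(x) dt + sqrt(2) dW."""
    x = x0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * score(x) + math.sqrt(2.0 * step) * noise
    return x

rng = random.Random(0)
# Every chain starts far from the mode, at x = 3.
samples = [langevin_sample(score_standard_normal, 3.0, 0.01, 500, rng)
           for _ in range(500)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

After enough steps the chains forget their starting point, and the sample mean and variance should be close to the target's 0 and 1, up to Monte Carlo and discretization error.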

---

## 8. NADE

**Essence**: The autoregressive factorization applied to the dimensions of a fixed-size vector rather than to the time steps of a sequence.

**Density Factorization**:  
$$
p(x) = \prod_{i=1}^D p(x_i \mid x_{<i}; \theta)
$$

Every conditional probability is modeled explicitly so that the product remains a valid probability in [0,1].

**Heart of idea**: Fully tractable density estimation with neural nets.

---

## 9. Transformers

**Essence**: Attention mechanism computes a probability distribution over keys given a query.

**Self-Attention**:  
$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

**Why softmax?**  
Converts raw similarity scores into probabilities (0–1, summing to 1).

**Heart of idea**: Each token “chooses” how much chance it gives to others when forming its new representation.
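
The formula can be traced by hand on tiny inputs. The following is a minimal pure-Python sketch of scaled dot-product attention for a single head, with no masking and no learned projections; the helper names and the toy matrices are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V))
               for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0]]              # one query
K = [[1.0, 0.0], [0.0, 1.0]]  # two keys; the first matches the query
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
```

The query matches the first key, so the first weight is larger and the output leans toward the first value vector while still mixing in the second.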

---

## 10. Hybrids

**Essence**: Mix and match the probability tricks above.

**Example (VAE-GAN)**:  
$$
\mathcal{L} = \mathcal{L}_\text{VAE} + \lambda \mathcal{L}_\text{GAN}
$$

**Heart of idea**: Multiple probability constraints together yield sharper, more realistic models.

---

# Core Intuition Across All Models

- Probabilities must lie in [0,1] — hence the use of sigmoid, softmax, and normalization.  
- Logarithms stabilize products of probabilities (turning them into sums).  
- KL divergence / energy functions constrain distributions so they don’t collapse or spread arbitrarily.  
- Noise vs. structure — most models balance the randomness of generation with the order of real data.  
- Gradients guide learning — whether via adversarial games, variational bounds, or score matching, the key is optimizing probability surfaces.


# Understanding GAN Training as a Game

The GAN objective is formulated as:

$$
\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]
$$

- **Discriminator (D):** judges whether an input is real (from data) or fake (from generator).  
- **Generator (G):** takes random noise \(z\) and tries to produce samples \(G(z)\) that fool the discriminator.

---

## Step-by-Step Flow

1. Draw real samples \(x \sim p_{data}(x)\).  
   - Feed into \(D\).  
   - D’s goal: output probability close to 1 (real).  

2. Draw random noise \(z \sim p_z(z)\).  
   - Feed into \(G\).  
   - Generator outputs a fake sample \(G(z)\).  

3. Feed \(G(z)\) into \(D\).  
   - D’s goal: output probability close to 0 (fake).  
   - G’s goal: push D’s output closer to 1 (make fake look real).  

**Training alternates:**  
- Fix \(G\), update \(D\) to better separate real/fake.  
- Fix \(D\), update \(G\) to fool \(D\).  

---

## Intuition

- If **D is perfect**: it outputs 1 for real, 0 for fake. Generator gets crushed.  
- If **G is perfect**: its fakes are indistinguishable from real, so D outputs ~0.5 for everything. D is confused.  

The training aims for a **balance point**: the generator learns the true data distribution.
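
The balance point can be checked numerically on discrete distributions. For a fixed generator, the discriminator that maximizes the value function is \(D^*(x) = p_{data}(x) / (p_{data}(x) + p_g(x))\), a result from Goodfellow et al. (2014). The sketch below uses made-up three-outcome distributions and a hypothetical helper name.

```python
def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

p_data = [0.5, 0.3, 0.2]
# The generator puts its mass in the wrong places: D* can tell them apart.
d_mismatch = optimal_discriminator(p_data, [0.2, 0.3, 0.5])
# The generator matches the data exactly: D* is 0.5 for every outcome.
d_matched = optimal_discriminator(p_data, p_data)
```

When the two distributions coincide, even the best possible discriminator is reduced to guessing, which is the "confused" state described above.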

---

## Truth Table of Cases

| Input type   | D’s output = 1 (calls it Real) | D’s output = 0 (calls it Fake) | Who wins? |
|--------------|--------------------------------|--------------------------------|-----------|
| Real data \(x\) |  Correct (good for D, \(\log D(x)\) high) |  Mistake (bad for D, \(\log D(x) \to -\infty\)) | D wins if 1, loses if 0 |
| Fake data \(G(z)\) |  Mistake (bad for D, good for G, \(\log(1-D(G(z))) \to -\infty\)) |  Correct (good for D, bad for G, \(\log(1-D(G(z)))\) high) | D wins if 0, G wins if 1 |

---

## Summary

- **Discriminator’s job:** maximize correctness — say “1” for real, “0” for fake.  
- **Generator’s job:** flip the second row of the table — try to force D to say “1” for its fakes.  

**Training is a push–pull game:** each improvement by D pushes G to improve, and vice versa.


#  Generative AI Models & Their Origins

| Generative Model | Core Origin / Inspiration | Theoretical Foundation |
|------------------|--------------------------|-------------------------|
| **GANs (Generative Adversarial Networks)** | Game theory (two-player minimax) + probability | Adversarial training = minimax optimization (Goodfellow et al., 2014) |
| **VAEs (Variational Autoencoders)** | Bayesian inference + variational methods | Variational inference, KL divergence, ELBO maximization |
| **Autoregressive Models (RNN, PixelRNN, GPT)** | Probability theory + Markov property extensions | Chain rule factorization of joint distributions |
| **Normalizing Flows (RealNVP, Glow)** | Probability & calculus | Change-of-variables theorem + Jacobian determinants |
| **Diffusion Models (DDPM, Stable Diffusion)** | Markov chains + statistical physics (diffusion process) | Forward noising as Gaussian Markov chain, reverse denoising learned |
| **EBMs (Energy-Based Models, Boltzmann Machines)** | Statistical mechanics (Boltzmann distribution) | Probability from energy landscapes (low energy → high probability) |
| **Score-Based Models / Flow Matching** | Stochastic calculus + differential equations | Langevin dynamics, SDE/ODE probability flows |
| **NADE / MADE (Neural Autoregressive Density Estimators)** | Autoregressive probability modeling | Explicit factorization of densities with neural parametrization |
| **Transformers (GPT, BERT, etc.)** | Sequence modeling + attention mechanisms | Probability distributions over tokens via softmax attention |
| **Hybrids (VAE-GAN, Energy-Guided Diffusion)** | Combinations of above | Mix variational inference, adversarial games, and energy principles |

---

##  Key Insights

- **Markov roots:** Autoregressive & Diffusion models → rely on step-by-step probabilistic transitions.  
- **Physics roots:** EBMs & Diffusion → inspired by thermodynamics (energy, heat, diffusion).  
- **Game theory roots:** GANs → adversarial min-max optimization.  
- **Bayesian roots:** VAEs → approximate inference with variational bounds.  
- **Neural architectures roots:** Transformers, RNNs → sequence probability modeling.  

---

#  Generative, Unsupervised, Probabilistic?

### 1. Are they all **Generative**?
Yes — every model in the list (GAN, VAE, Diffusion, etc.) is a generative model, because its purpose is to model  

$$ p(x) $$  

(the data distribution) and produce new samples that look like the real data.  

---

### 2. Are they all **Unsupervised**?
Mostly yes, because they don’t require labeled data — they learn from the raw data distribution itself.  

 But:  
- Some can be used in **semi-supervised** or **conditional** modes (e.g., Conditional GANs, Conditional Diffusion, Conditional VAE).  
- Autoregressive models (like GPT) are trained with **teacher forcing** on sequences, which looks like supervised learning, but conceptually it’s still density estimation (so many call them unsupervised).  

---

### 3. Are they all **Probabilistic**?
Yes, at the core they are **probability-based**, but in different ways:  

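- **Explicit probabilistic models:**  
  Flows, Autoregressive, NADE (directly compute $$ p(x) $$); VAEs (optimize a lower bound on $$ \log p(x) $$).  

- **Implicit probabilistic models:**  
  GANs (define $$ p(x) $$ only through sampling); EBMs and Diffusion are often grouped here because they don’t give a closed-form normalized $$ p(x) $$, but they still define a probability model.  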

---

##  Final Answer
 These models are **generative, probabilistic, and largely unsupervised** approaches to modeling data distributions.  

With nuance:  
- GANs and EBMs = **implicit** probabilistic.  
- Autoregressive, VAEs, Flows, NADE = **explicit** probabilistic.  
- Most are **unsupervised**, but many have **conditional / semi-supervised** variants.  


#  Teacher-Forcing vs Residual Connections

| Aspect | Teacher-Forcing | Residual (Skip) Connections |
|--------|-----------------|-----------------------------|
| **Type** | Training strategy | Network architecture design |
| **Purpose** | Stabilize sequence learning by guiding the model with ground-truth tokens | Stabilize deep networks by easing gradient flow |
| **Where used** | RNNs, LSTMs, GRUs, autoregressive Transformers | Deep networks (CNNs, ResNets, Transformers, etc.) |
| **Mechanism** | Feed the true previous token instead of the model’s own prediction | Add a direct shortcut: output = f(x) + x |
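| **Problem addressed** | Compounding errors during training (early wrong predictions derailing later steps); the train/inference mismatch it introduces is known as exposure bias | Vanishing/exploding gradients in deep layers |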
| **Stage** | Training-time only | Architectural (always present, during training & inference) |
| **Analogy** | Teacher guiding step by step with correct answers | Shortcut path that lets information bypass obstacles |
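
The two mechanisms in the table can be sketched side by side. This is a toy illustration in plain Python with hypothetical helper names: the first function builds teacher-forced training pairs, the second applies a skip connection around an arbitrary layer `f`.

```python
def teacher_forcing_pairs(tokens):
    """Build (input, target) pairs where each input is the true previous token."""
    return list(zip(tokens[:-1], tokens[1:]))

def residual_block(f, x):
    """Apply a residual connection: output = f(x) + x, element-wise."""
    return [fx + xi for fx, xi in zip(f(x), x)]

# Training pairs come from the ground-truth sequence, never from model output.
pairs = teacher_forcing_pairs(["BOS", "the", "cat", "sat"])
# A layer that outputs zeros leaves the input unchanged through the skip path.
y = residual_block(lambda v: [0.0 for _ in v], [1.0, 2.0])
```

The first helper only matters while training; the second describes a computation that runs identically at training and inference time.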


#  Generative AI Models & Their Theoretical Foundations (Rooted)

| Generative Model | Deeper Theoretical Roots | Theoretical Foundation |
|------------------|--------------------------|-------------------------|
| **GANs (Generative Adversarial Networks)** | **Game theory** (Von Neumann, 1928: Minimax theorem) + **statistical decision theory** + probability distributions | Adversarial training = minimax optimization (Goodfellow et al., 2014) |
| **VAEs (Variational Autoencoders)** | **Bayesian inference** (Bayes, 1763) + **variational methods** in physics/statistics (mean-field theory, 1950s) + EM algorithm (Dempster, 1977) | Variational inference, KL divergence, ELBO maximization |
| **Autoregressive Models (RNN, PixelRNN, GPT)** | **Markov chains** (A. Markov, 1906) + **probability chain rule** + sequential data models (Shannon’s information theory, 1948) | Chain rule factorization of joint distributions |
| **Normalizing Flows (RealNVP, Glow)** | **Change-of-variables theorem** (Jacobian determinants in multivariate calculus, 19th c.) + **invertible transformations** in probability | Exact likelihood modeling via invertible mappings |
| **Diffusion Models (DDPM, Stable Diffusion)** | **Markov processes** (Kolmogorov, 1930s) + **statistical physics diffusion equations** (Fick’s law, 1855; Einstein’s Brownian motion, 1905) | Forward noising as Gaussian Markov chain, reverse denoising learned |
| **EBMs (Energy-Based Models, Boltzmann Machines)** | **Statistical mechanics** (Boltzmann distribution, 1872) + Gibbs distributions + Hopfield networks (1982) | Probability from energy landscapes (low energy → high probability) |
| **Score-Based Models / Flow Matching** | **Stochastic calculus** (Ito, 1940s) + **Langevin dynamics** (1908) + Fokker-Planck / SDE theory | Langevin dynamics, SDE/ODE probability flows |
| **NADE / MADE (Neural Autoregressive Density Estimators)** | **Autoregressive probability models** (Markov, Kolmogorov) + neural nets as parametrizers | Explicit factorization of densities with neural parametrization |
| **Transformers (GPT, BERT, etc.)** | **Sequence-to-sequence modeling** (Shannon, 1948 → LSTM, 1997 → encoder-decoder RNNs, 2014) + **attention mechanisms** (Bahdanau, 2014; Vaswani et al., 2017) | Probability distributions over tokens via softmax attention |
| **Hybrids (VAE-GAN, Energy-Guided Diffusion)** | Cross-pollination of the above theories: Bayesian inference + game theory + statistical mechanics + stochastic processes | Mix variational inference, adversarial games, and energy principles |


##  Observations

- **Physics roots:** Diffusion, EBMs, Score-based models all trace back to **statistical mechanics** and **stochastic processes** (Einstein, Boltzmann, Langevin).  
- **Math roots:** GANs, VAEs, Flows, NADE build on **probability theory, calculus, and optimization**.  
- **Information theory roots:** Autoregressive models and Transformers trace to **Shannon** and **Markov** for sequence modeling.  
- **Hybrid roots:** Combine these traditions into modern architectures.  


#  Generative AI Models & Their Theoretical Foundations (Rooted)

---

### 1. GANs (Generative Adversarial Networks)
- **Root Theory & Scientist:** Game theory by **John von Neumann (1928, minimax theorem)**; later extended in statistical decision theory.  
- **How it works:** Two networks (Generator vs Discriminator) play a minimax game: the generator produces samples, the discriminator distinguishes real vs fake.  
- **Applications:** Image synthesis (faces, art), deepfake generation, data augmentation, super-resolution.

---

### 2. VAEs (Variational Autoencoders)
- **Root Theory & Scientist:** **Thomas Bayes (1763)** — Bayesian inference; **variational methods** in physics/statistics (1950s, mean-field theory); EM algorithm by **Dempster et al., 1977**.  
- **How it works:** Encode data into a latent distribution, regularized by KL divergence, then decode to reconstruct; balances accuracy and smooth latent space.  
- **Applications:** Anomaly detection, latent space representation, molecule generation, semi-supervised learning.

---

### 3. Autoregressive Models (RNN, PixelRNN, GPT)
- **Root Theory & Scientist:** **Andrey Markov (1906)** — Markov chains; **Claude Shannon (1948)** — information theory; chain rule of probability.  
- **How it works:** Factorize probability of a sequence step-by-step: each token depends on previous ones. Neural nets (RNNs, Transformers) parameterize conditionals.  
- **Applications:** Text generation (GPT), speech synthesis (WaveNet), image modeling (PixelRNN).

---

### 4. Normalizing Flows (RealNVP, Glow)
- **Root Theory & Scientist:** **19th-century calculus & multivariate analysis** — change-of-variables theorem and Jacobians.  
- **How it works:** Apply invertible transformations to a simple base distribution (Gaussian) while tracking the Jacobian determinant to compute exact likelihood.  
- **Applications:** Density estimation, generative image modeling, likelihood-based anomaly detection.

---

### 5. Diffusion Models (DDPM, Stable Diffusion)
- **Root Theory & Scientist:** **Adolf Fick (1855)** — diffusion equations; **Albert Einstein (1905)** — Brownian motion; **Andrey Kolmogorov (1930s)** — Markov processes.  
- **How it works:** Add Gaussian noise step by step (forward process), then learn to reverse the process (denoising).  
- **Applications:** Text-to-image (Stable Diffusion, Imagen, DALL·E 2), audio generation, protein design.

---

### 6. EBMs (Energy-Based Models, Boltzmann Machines)
- **Root Theory & Scientist:** **Ludwig Boltzmann (1872)** — statistical mechanics; Gibbs distributions; neural energy models by **Hopfield (1982)**.  
- **How it works:** Assign energy values to states; low energy = high probability. Use Boltzmann distribution to normalize.  
- **Applications:** Feature learning, collaborative filtering (recommendation), early generative models (RBMs for Deep Belief Nets).

---

### 7. Score-Based Models / Flow Matching
- **Root Theory & Scientist:** **Paul Langevin (1908)** — Langevin dynamics; **Kiyoshi Ito (1940s)** — stochastic calculus; Fokker–Planck / SDE theory.  
- **How it works:** Learn the score function (gradient of log-density); sample using stochastic differential equations or deterministic ODE flows.  
- **Applications:** High-quality image/audio generation, diffusion alternatives, controllable sampling.

---

### 8. NADE / MADE (Neural Autoregressive Density Estimators)
- **Root Theory & Scientist:** **Markov/Kolmogorov** — autoregressive probability models; extended with neural nets for parameterization.  
- **How it works:** Factorize density dimension by dimension, each conditioned on earlier variables. Neural networks efficiently compute these conditionals.  
- **Applications:** Exact likelihood modeling, density estimation in high-dimensional data.

---

### 9. Transformers (GPT, BERT, etc.)
- **Root Theory & Scientist:** **Claude Shannon (1948)** — sequence modeling; LSTM (1997) and encoder-decoder RNNs (2014); **Bahdanau et al. (2014)** — attention mechanism.  
- **How it works:** Use self-attention to compute contextualized probabilities over tokens; autoregressive or masked training for generative tasks.  
- **Applications:** Large language models (GPT, LLaMA), translation, summarization, multimodal generative AI.

---

### 10. Hybrids (VAE-GAN, Energy-Guided Diffusion)
- **Root Theory & Scientist:** Combine Bayesian inference (Bayes), game theory (Von Neumann), statistical mechanics (Boltzmann), and stochastic processes (Einstein, Langevin).  
- **How it works:** Fuse techniques — e.g., VAE-GAN merges reconstruction likelihood with adversarial loss; energy-guided diffusion injects energy priors into denoising.  
- **Applications:** Text-to-image with high fidelity (Stable Diffusion), sharper reconstructions, domain-specific generative models.


#  Why Use Logs in Probability & Machine Learning

Logs play a central role in probability, statistics, and machine learning.  
Here’s **why they are so powerful**:

---

## 1. Logs turn products into sums
- Probabilities of sequences are **tiny numbers** multiplied together.  
- Example (for independent events):  
  $$P(x_1, x_2, ..., x_n) = \prod_{i=1}^n P(x_i)$$  
- Multiplying many small numbers becomes unwieldy.  
- Taking logs:  
  $$\log P(x_1, x_2, ..., x_n) = \sum_{i=1}^n \log P(x_i)$$  
- Sums are much easier to compute and reason about.

---

## 2. Logs avoid underflow
- Computers struggle with **very small numbers**: multiplying them can push results to **0 (underflow)**.  
- Logs keep values in a stable range.  
- Example:  
  $$10^{-1000} \to \text{underflow} \quad \text{but} \quad \log_{10}(10^{-1000}) = -1000 \quad \text{safe!}$$
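
Underflow is easy to reproduce. The sketch below multiplies 400 made-up probabilities of 0.1 in double precision and compares the result with the sum of their logs.

```python
import math

probs = [0.1] * 400  # 400 independent events, each with probability 0.1

product = 1.0
for p in probs:
    product *= p  # falls below the smallest representable double, ends at 0.0

log_sum = sum(math.log(p) for p in probs)  # stays an ordinary finite float
```

The product ends up as exactly `0.0`, so all information about the sequence is lost, while the log-sum (about -921) remains perfectly usable for comparison and optimization.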

---

## 3. Logs smooth growth
- Probabilities and likelihood ratios can span **many orders of magnitude**.  
- Logs compress these ranges into **manageable scales**.  
- Example:  
  $$\log_{10}(1{,}000{,}000) = 6 \quad \text{vs raw value } 1{,}000{,}000$$  
- This makes comparison and optimization more stable.

---

## 4. Logs make optimization easier
- In maximum likelihood estimation (MLE):  
  - Goal: maximize  
    $$P(\text{data}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$  
  - Equivalent (but easier): maximize  
    $$\log P(\text{data}|\theta) = \sum_{i=1}^n \log P(x_i|\theta)$$  
- Same optimum, but **log-likelihood** gives simpler math, better gradients, and easier optimization.
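
A small worked case makes the MLE recipe concrete. The sketch below assumes a Bernoulli model with a single parameter `theta` and ten made-up binary observations, and finds the maximizer of the log-likelihood by a plain grid search (chosen for transparency, not efficiency).

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Sum of log P(x_i | theta) for binary observations x_i in {0, 1}."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta)
               for x in data)

data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]    # 7 ones out of 10 (made-up data)
grid = [i / 100 for i in range(1, 100)]  # candidate theta values 0.01..0.99
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))
```

The maximizer lands on the sample mean, 0.7, which is also what setting the derivative of the log-likelihood to zero gives analytically.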

---

#  In short
**Logs make probabilities easier to compute, compare, and optimize.**  
They are the **mathematical lens** that turns fragile probability products into stable, scalable, and learnable objectives.


# GAN Objective Function (Minimax Game)

## Equation
$$
\min_{G} \max_{D} V(D, G) \;=\;
\mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
+ \mathbb{E}_{z \sim p_z(z)} \big[ \log (1 - D(G(z))) \big]
$$

---

## Explanation of Symbols

- $$\min_G \max_D$$  
  A **two-player minimax game**:  
  - The **Discriminator (D)** tries to maximize the objective (classify real vs fake correctly).  
  - The **Generator (G)** tries to minimize the objective (fool D).  

- $$V(D, G)$$  
  The **value function**, i.e., the score of the adversarial game.  

- $$\mathbb{E}_{x \sim p_{\text{data}}(x)}[\cdot]$$  
  Expectation over **real data samples** from the true data distribution $$p_{\text{data}}(x)$$.  

- $$\log D(x)$$  
  The log-probability that D assigns to a real sample being real.  
  - If $$D(x) \to 1$$ → high confidence for real.  

- $$\mathbb{E}_{z \sim p_z(z)}[\cdot]$$  
  Expectation over **noise vectors** $$z$$ sampled from a prior distribution $$p_z(z)$$ (e.g., Gaussian or uniform).  

- $$G(z)$$  
  The generator’s output — a fake sample created from noise $$z$$.  

- $$D(G(z))$$  
  The discriminator’s probability that the generated sample $$G(z)$$ is real.  

- $$\log (1 - D(G(z)))$$  
  The log-probability that D correctly identifies the fake as fake.  
  - If $$D(G(z)) \to 0$$ → D wins.  
  - If $$D(G(z)) \to 1$$ → G wins.  

---

## Intuition
- **Discriminator (D):** Maximizes the objective by making $$D(x) \to 1$$ for real data and $$D(G(z)) \to 0$$ for fake data.  
- **Generator (G):** Minimizes the objective by pushing $$D(G(z)) \to 1$$, making its fakes look real.  
- Training is a dynamic **push–pull game**.  
- The equilibrium occurs when the generator produces samples indistinguishable from real data, i.e., when $$p_g(x) \approx p_{\text{data}}(x)$$.


# Variational Autoencoder (VAE) — Evidence Lower Bound (ELBO)

## Equation
$$
\log p_{\theta}(x) \;\geq\;
\mathbb{E}_{z \sim q_{\phi}(z|x)} \big[ \log p_{\theta}(x|z) \big]
- \text{KL}\!\left(q_{\phi}(z|x) \,\|\, p(z)\right)
$$

---

## Explanation of Symbols

- $$\log p_{\theta}(x)$$  
  The **log-likelihood** of the observed data $$x$$ under the generative model with parameters $$\theta$$.  
  - Goal: maximize this, but it’s usually intractable.  

- $$\geq$$  
  Indicates we are computing a **lower bound** (the ELBO) on the true log-likelihood.  

- $$\mathbb{E}_{z \sim q_{\phi}(z|x)}[\cdot]$$  
  Expectation with respect to the **approximate posterior distribution** $$q_{\phi}(z|x)$$, which the encoder learns.  

- $$q_{\phi}(z|x)$$  
  The **encoder (inference model)** with parameters $$\phi$$.  
  - Approximates the true but intractable posterior $$p(z|x)$$.  

- $$\log p_{\theta}(x|z)$$  
  The **reconstruction likelihood**: probability of reconstructing $$x$$ given latent variable $$z$$, under decoder parameters $$\theta$$.  

- $$p(z)$$  
  The **prior distribution** over latent variables (commonly standard Gaussian $$\mathcal{N}(0, I)$$).  

- $$\text{KL}\!\left(q_{\phi}(z|x) \,\|\, p(z)\right)$$  
  Kullback–Leibler divergence: measures how far the encoder’s posterior $$q_{\phi}(z|x)$$ is from the prior $$p(z)$$.  
  - Acts as a **regularizer** for the latent space.  

---

## Intuition
- The ELBO has **two competing terms**:  
  1. **Reconstruction term**: $$\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$$ → encourages accurate reconstructions.  
  2. **Regularization term**: $$-\text{KL}(q_{\phi}(z|x)\,\|\,p(z))$$ → keeps latent space well-structured, close to the prior.  

- Together, this balances **data fidelity** and **latent smoothness**, making VAEs powerful generative models.
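
The regularization term has a closed form in the most common setup. The sketch below assumes the encoder outputs a diagonal Gaussian \(\mathcal{N}(\mu, \text{diag}(\sigma^2))\) and the prior is \(\mathcal{N}(0, I)\), in which case \(\text{KL} = \tfrac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)\). The function name and the log-variance parameterization are illustrative choices.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# Encoder output equals the prior: no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Mean shifted by 1 in one dimension: penalty of 0.5.
kl_shift = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

The penalty grows as the encoder's mean moves away from zero or its variance moves away from one, which is the pull toward the prior described above.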


# Factorization of Probability (Chain Rule)

## Equation
$$
p(x) \;=\; \prod_{t=1}^{T} \; p\big(x_t \mid x_{<t}\big)
$$

---

## Explanation of Symbols

- $$x = (x_1, x_2, \dots, x_T)$$  
  The full sequence (e.g., tokens in a sentence, frames, samples).

- $$p(x)$$  
  The **joint probability** of the entire sequence.

- $$\prod_{t=1}^{T} (\cdot)$$  
  Product operator: multiply the terms for all time steps \(t=1,\dots,T\).

- $$x_{<t} = (x_1, \dots, x_{t-1})$$  
  All elements **before** time \(t\) (the “history”).

- $$p(x_t \mid x_{<t})$$  
  The **conditional probability** of the current element \(x_t\) given all previous elements.

---

## Small Expansion (example \(T=3\))
$$
p(x_1, x_2, x_3)
= p(x_1)\; p(x_2 \mid x_1)\; p(x_3 \mid x_1, x_2).
$$

---

## Notes
- This is the **chain rule of probability**.  
- It underpins **autoregressive modeling**: learn each conditional $$p(x_t \mid x_{<t})$$ and multiply to get $$p(x)$$.
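
As a small worked sketch of the factorization, the code below scores a sequence under a hypothetical two-token model. The probability tables are made up for illustration, and each conditional looks only at the previous token (a bigram simplification); a full autoregressive model conditions on the whole history.

```python
import math

# Hypothetical toy model over the vocabulary {"a", "b"}.
FIRST = {"a": 0.6, "b": 0.4}                      # p(x_1)
NEXT = {"a": {"a": 0.7, "b": 0.3},                # p(x_t | x_{t-1})
        "b": {"a": 0.2, "b": 0.8}}

def conditional(token, history):
    """p(x_t | x_<t) under the toy model (uses only the last history token)."""
    if not history:
        return FIRST[token]
    return NEXT[history[-1]][token]

def sequence_log_prob(sequence):
    """log p(x) = sum_t log p(x_t | x_<t); summing logs avoids numeric underflow."""
    return sum(math.log(conditional(tok, sequence[:t]))
               for t, tok in enumerate(sequence))

log_p = sequence_log_prob(["a", "a", "b"])
p = math.exp(log_p)   # equals 0.6 * 0.7 * 0.3
```

Summing `exp(sequence_log_prob(...))` over all sequences of a fixed length gives 1, which is what makes the product of conditionals a valid joint distribution.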


# Normalizing Flows — Change of Variables

## Equation
$$
p_X(x) \;=\; p_Z(f(x)) \; \Bigg| \det \frac{\partial f(x)}{\partial x} \Bigg|
$$

---

## Explanation of Symbols

- $$x \in \mathbb{R}^n$$  
  A data point in the original (data) space.

- $$f: \mathbb{R}^n \to \mathbb{R}^n$$  
  An **invertible mapping function** that transforms data $$x$$ into latent variable $$z$$.  
  - $$z = f(x)$$.

- $$p_X(x)$$  
  Probability density of the data point $$x$$ in the **data space**.

- $$p_Z(z)$$  
  Probability density of the latent variable $$z$$ in the **latent space**.  
  - Usually a simple distribution (e.g., standard Gaussian).

- $$\frac{\partial f(x)}{\partial x}$$  
  The **Jacobian matrix** of function $$f$$ at point $$x$$.  
  - It contains all partial derivatives of $$f$$ with respect to $$x$$.  

- $$\det \frac{\partial f(x)}{\partial x}$$  
  The **determinant of the Jacobian**, which measures how much volume is stretched or compressed by the mapping $$f$$ at point $$x$$.  

- $$\Big|\det \frac{\partial f(x)}{\partial x}\Big|$$  
  The absolute value keeps the density non-negative.  
  - Magnitude above 1 = $$f$$ expands volume around $$x$$.  
  - Magnitude below 1 = $$f$$ compresses volume around $$x$$.  

---

## Intuition
- We start with a **simple prior distribution** in latent space $$p_Z(z)$$ (like Gaussian).  
- The invertible mapping $$f$$ sends data to latent variables; its inverse $$f^{-1}$$ turns latent samples into data, so the simple prior induces a **complex data distribution** $$p_X(x)$$.  
- The Jacobian determinant adjusts probability density to account for this stretching/compression of space.  
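
The change-of-variables formula can be checked on the simplest possible flow. The sketch below assumes a one-dimensional affine map $$f(x) = (x - \mu)/\sigma$$ with a standard Gaussian latent density; the function names and parameter values are illustrative only. Real flows such as RealNVP stack many learned invertible layers, but the bookkeeping is the same.

```python
import math

def standard_normal_pdf(z):
    """Latent density p_Z(z) for a standard Gaussian."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def forward(x, mu, sigma):
    """z = f(x): map a data point to latent space."""
    return (x - mu) / sigma

def inverse(z, mu, sigma):
    """x = f^{-1}(z): map a latent sample back to data space (used for sampling)."""
    return mu + sigma * z

def flow_density(x, mu, sigma):
    """p_X(x) = p_Z(f(x)) * |det df/dx|; in one dimension the Jacobian is 1/sigma."""
    abs_det_jacobian = 1.0 / abs(sigma)
    return standard_normal_pdf(forward(x, mu, sigma)) * abs_det_jacobian

def normal_pdf(x, mu, sigma):
    """Reference: the N(mu, sigma^2) density written out directly."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

density_via_flow = flow_density(2.5, mu=1.0, sigma=2.0)
density_direct = normal_pdf(2.5, mu=1.0, sigma=2.0)
```

The two densities agree, showing that a standard Gaussian pushed through $$f^{-1}$$ is exactly $$\mathcal{N}(\mu, \sigma^2)$$ once the Jacobian factor is included.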


# Diffusion Models — Forward and Reverse Processes

## Forward (noising) process
$$
q(x_t \mid x_{t-1}) \;=\; \mathcal{N}\!\Big(x_t \; ; \; \sqrt{1 - \beta_t}\, x_{t-1}, \; \beta_t I \Big)
$$

### Explanation
- $$q(x_t \mid x_{t-1})$$ → Probability of noisy sample $$x_t$$ given previous step $$x_{t-1}$$.  
- $$\mathcal{N}(\mu, \Sigma)$$ → Gaussian distribution with mean $$\mu$$ and covariance $$\Sigma$$.  
- $$\sqrt{1 - \beta_t}\, x_{t-1}$$ → Scaled version of the previous sample (shrinks signal).  
- $$\beta_t I$$ → Covariance term; adds isotropic Gaussian noise with variance $$\beta_t$$ (the noise-schedule value at step $$t$$).  
- Interpretation: Gradually adds Gaussian noise step by step until data becomes pure noise.  
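
A minimal sketch of the forward process follows. The linear schedule and its endpoints (`1e-4` to `0.02` over 1000 steps) are illustrative assumptions, not values given in the text above, and `signal_and_noise` is a hypothetical helper that tracks how much of the original sample survives.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the endpoints are illustrative."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, beta, rng):
    """Sample x_t ~ N(sqrt(1 - beta) * x_prev, beta * I)."""
    scale = math.sqrt(1.0 - beta)
    noise_std = math.sqrt(beta)
    return [scale * v + noise_std * rng.gauss(0.0, 1.0) for v in x_prev]

def signal_and_noise(betas):
    """Coefficient on x_0 and accumulated noise variance after all steps."""
    signal, noise_var = 1.0, 0.0
    for beta in betas:
        signal *= math.sqrt(1.0 - beta)               # shrink the surviving signal
        noise_var = (1.0 - beta) * noise_var + beta   # old noise shrinks, new noise adds
    return signal, noise_var

betas = linear_beta_schedule(1000)
rng = random.Random(0)
x = [1.0, -2.0, 0.5]
for beta in betas:
    x = forward_step(x, beta, rng)

signal, noise_var = signal_and_noise(betas)
```

With this schedule the coefficient on the original sample falls close to zero while the noise variance approaches one, which is the sense in which the data "becomes pure noise".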

---

## Reverse (denoising) process
$$
p_{\theta}(x_{t-1} \mid x_t) \;=\; \mathcal{N}\!\Big(x_{t-1} \; ; \; \mu_{\theta}(x_t, t), \; \Sigma_{\theta}(x_t, t) \Big)
$$

### Explanation
- $$p_{\theta}(x_{t-1} \mid x_t)$$ → Model’s learned distribution for denoising.  
- $$\mu_{\theta}(x_t, t)$$ → Neural network–predicted mean (denoised signal).  
- $$\Sigma_{\theta}(x_t, t)$$ → Predicted covariance (uncertainty in denoising).  
- Interpretation: Learn to reverse the noising process by predicting cleaner samples step by step.  

---

## Intuition
- **Forward process:** Data → Noise (fixed Gaussian corruption).  
- **Reverse process:** Noise → Data (learned by neural net).  
- Final sampling: Start with pure Gaussian noise and iteratively denoise using the reverse process to generate realistic data (e.g., images).


# Energy-Based Models (EBMs) — Boltzmann Distribution

## Equation
$$
p_{\theta}(x) \;=\; \frac{e^{-E_{\theta}(x)}}{Z_{\theta}},
\qquad
Z_{\theta} \;=\; \int e^{-E_{\theta}(x)} \, dx
$$

---

## Explanation of Symbols

- $$p_{\theta}(x)$$  
  Probability of data sample $$x$$ under the model, parameterized by $$\theta$$.

- $$E_{\theta}(x)$$  
  The **energy function**.  
  - Assigns a scalar "energy" to each configuration $$x$$.  
  - Lower energy = higher probability (valleys in energy landscape).  
  - Higher energy = lower probability (peaks).  

- $$e^{-E_{\theta}(x)}$$  
  Exponential weighting.  
  - Ensures probabilities are positive.  
  - Strongly favors low-energy (more likely) states.  

- $$Z_{\theta}$$ (partition function)  
  $$Z_{\theta} = \int e^{-E_{\theta}(x)} dx$$  
  - Normalizing constant to ensure $$p_{\theta}(x)$$ is a valid probability distribution (sums/integrates to 1).  
  - Often intractable in high dimensions.  

---

## Intuition
- The model defines an **energy landscape** over all possible data samples.  
- Realistic data lies in **low-energy valleys** (high probability).  
- Unrealistic data lies in **high-energy regions** (low probability).  
- Training = shaping the energy function so that real data is assigned lower energy than fake/noisy data.
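
On a small discrete state space the partition function is just a finite sum, so the Boltzmann distribution can be computed exactly. The sketch below assumes a hypothetical quadratic energy with its minimum at `x = 2`; in realistic high-dimensional settings this sum or integral is what becomes intractable.

```python
import math

def boltzmann_distribution(energy, states):
    """Return (probabilities, Z) with p(x) = exp(-E(x)) / Z over the given states."""
    weights = [math.exp(-energy(x)) for x in states]   # unnormalized, always positive
    partition = sum(weights)                           # Z: finite sum in place of the integral
    return [w / partition for w in weights], partition

def energy(x):
    """Hypothetical energy function with a single valley at x = 2."""
    return 0.5 * (x - 2) ** 2

states = list(range(-3, 8))
probs, partition = boltzmann_distribution(energy, states)
most_likely = states[probs.index(max(probs))]
```

The lowest-energy state receives the highest probability, and states far from the valley receive exponentially less.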


# Score-Based / Flow-Matching Models

## Stochastic Differential Equation (SDE)

$$
dx \;=\; \nabla_x \log p_{\theta}(x)\, dt \;+\; \sqrt{2}\, dW_t
$$

---

## Explanation of Symbols

- $$dx$$  
  The **infinitesimal change in the data sample** $$x$$ over a time increment $$dt$$ of continuous time $$t$$.  
  - Describes how $$x$$ evolves as a stochastic process (this form is known as Langevin dynamics).  

- $$\nabla_x \log p_{\theta}(x)$$  
  The **score function**: gradient of the log-probability with respect to $$x$$.  
  - Points in the direction where data is **more likely**.  
  - Guides the process toward regions of higher probability density.  

- $$p_{\theta}(x)$$  
  Probability distribution of the data under parameters $$\theta$$.  

- $$\sqrt{2}\, dW_t$$  
  Noise term:  
  - $$W_t$$ = Wiener process (standard Brownian motion).  
  - $$dW_t$$ = infinitesimal random Gaussian step.  
  - $$\sqrt{2}$$ = noise scale; paired with the unit coefficient on the score, it makes $$p_{\theta}(x)$$ the stationary distribution of the process.  

---

## Intuition
- This SDE says:  
  - Move **towards higher probability regions** (via the score function).  
  - Add a bit of **random Gaussian noise** at each step.  

- The balance of **gradient flow** and **stochastic noise** allows sampling from complex distributions.  
- In generative modeling:  
  - Start from noise.  
  - Simulate the dynamics step by step, using a neural network that approximates the score function (score-based diffusion models use a time-dependent score inside a reverse-time SDE).  
  - Result: realistic samples (images, text, etc.).  
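
The sampling idea can be sketched with an Euler-Maruyama discretization of the dynamics. To keep the example self-contained, the learned score network is replaced by the analytic score of a one-dimensional Gaussian; the target mean, step size, and step count are illustrative assumptions.

```python
import math
import random

def gaussian_score(x, mean=3.0, std=1.0):
    """Analytic score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std * std)

def langevin_sample(score, num_steps=500, step_size=0.01, rng=None):
    """Discretized Langevin dynamics: drift along the score plus scaled Gaussian noise."""
    rng = rng or random.Random()
    x = rng.gauss(0.0, 1.0)                        # start from noise
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        # x <- x + step * score(x) + sqrt(2 * step) * noise
        x = x + step_size * score(x) + math.sqrt(2.0 * step_size) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(gaussian_score, rng=rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

After enough steps the samples should concentrate around the target mean with roughly the target variance, up to discretization and Monte Carlo error. In a trained model, `gaussian_score` would be replaced by the network's score estimate.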


# Neural Autoregressive Density Estimators (NADE)

## Density Factorization
$$
p(x) \;=\; \prod_{i=1}^{D} \; p(x_i \mid x_{<i}; \theta)
$$

---

## Explanation of Symbols

- $$x = (x_1, x_2, \dots, x_D)$$  
  The full data vector with $$D$$ dimensions (e.g., pixels, features, tokens).

- $$p(x)$$  
  The **joint probability distribution** of the entire data vector.

- $$\prod_{i=1}^{D} (\cdot)$$  
  Product operator — multiply the conditional probabilities of all components.

- $$p(x_i \mid x_{<i}; \theta)$$  
  The conditional probability of the current component $$x_i$$ given all the **previous components** $$x_{<i} = (x_1, \dots, x_{i-1})$$.  
  - Parameterized by neural network weights $$\theta$$.  

- $$\theta$$  
  The parameters of the neural network (weights and biases) that learn to model the conditionals.  

---

## Intuition
- The **chain rule of probability** lets us break down the joint distribution into conditionals.  
- NADE uses a neural network to model each conditional distribution.  
- This gives an **exact, tractable density estimator** (unlike GANs, which define the density only implicitly, or EBMs, whose normalizing constant is usually intractable).  

---

## Example (D = 3)
$$
p(x_1, x_2, x_3) = p(x_1)\; p(x_2 \mid x_1)\; p(x_3 \mid x_1, x_2).
$$
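
The factorization can be sketched for binary data with a tiny NADE-style network. The weights below are hand-picked hypothetical values (three inputs, two hidden units), and the layer layout (shared input-to-hidden weights, one sigmoid output per dimension) is an assumption based on the original NADE formulation rather than something specified in the text above.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical parameters for D = 3 binary inputs and H = 2 hidden units.
W = [[0.5, -1.0, 0.3],     # H x D input-to-hidden weights, shared by all conditionals
     [-0.7, 0.2, 0.9]]
c = [0.1, -0.2]            # hidden biases
V = [[1.0, -0.5],          # D x H hidden-to-output weights, one row per dimension
     [0.4, 0.8],
     [-1.2, 0.6]]
b = [0.0, 0.3, -0.1]       # output biases

def nade_log_prob(x):
    """log p(x) = sum_i log p(x_i | x_<i) for a binary vector x."""
    pre_activation = list(c)          # built from x_<i only, so it starts at the bias
    log_p = 0.0
    for i, x_i in enumerate(x):
        hidden = [sigmoid(a) for a in pre_activation]
        p_one = sigmoid(b[i] + sum(v * h for v, h in zip(V[i], hidden)))
        log_p += math.log(p_one if x_i == 1 else 1.0 - p_one)
        # Fold x_i into the hidden state so the next conditional can see it.
        pre_activation = [a + W[k][i] * x_i for k, a in enumerate(pre_activation)]
    return log_p

all_vectors = [(x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
total = sum(math.exp(nade_log_prob(x)) for x in all_vectors)
```

Because every conditional is a normalized Bernoulli distribution, the probabilities of all eight binary vectors sum to 1 without computing any partition function.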


# Transformer-Based Generative Models

## Self-Attention Mechanism
$$
\text{Attention}(Q, K, V) \;=\;
\text{softmax}\!\left( \frac{QK^{\top}}{\sqrt{d_k}} \right) V
$$

---

## Explanation of Symbols

- $$Q$$ (**Query matrix**)  
  Each row is the query vector of one token (or position): what that token is looking for in the rest of the sequence.  
  - Shape: (sequence length, attention dimension).

- $$K$$ (**Key matrix**)  
  Represents the context of all tokens (encodings of inputs).  
  - Used to decide how much each token should attend to others.

- $$V$$ (**Value matrix**)  
  Contains the actual information content associated with tokens.  
  - The weighted sum of these values forms the new representation.

- $$QK^{\top}$$  
  Dot product between queries and keys.  
  - Measures **similarity** or relevance between tokens.

- $$\sqrt{d_k}$$  
  Scaling factor, where $$d_k$$ is the dimension of the key vectors.  
  - Prevents the dot product from growing too large, which stabilizes softmax.

- $$\text{softmax}(\cdot)$$  
  Converts similarity scores into a **probability distribution** over tokens (values between 0 and 1, summing to 1).

- Final multiplication with $$V$$  
  Produces a weighted combination of values, where weights are the attention scores.

---

## Intuition
- Each token asks a **query** about all other tokens’ **keys**.  
- Softmax decides how much attention each token should pay to others.  
- The output is a context-aware representation: every token becomes enriched with information from relevant tokens in the sequence.  
- This mechanism enables Transformers to capture **long-range dependencies** in data, unlike RNNs which struggle with distant context.
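
The formula can be sketched in plain Python with matrices stored as lists of rows. The input values are arbitrary illustrations, and the sketch omits the causal mask that autoregressive decoders add so a token cannot attend to later positions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V; returns (outputs, attention weights)."""
    d_k = len(K[0])
    d_v = len(V[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[col] for w_j, v in zip(w, V)) for col in range(d_v)])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]                  # two query tokens
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # three key tokens
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # one value vector per key
out, attn = attention(Q, K, V)
```

Each row of `attn` is a probability distribution over the three keys, and the first query puts equal, larger weight on the two keys it aligns with.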


# Hybrid Generative Models

## Essence
Hybrid models **combine techniques** from multiple generative frameworks to leverage their strengths and overcome weaknesses.  

- Example: **VAE-GAN** combines the **variational inference** power of VAEs with the **adversarial training** of GANs.  

---

## Example Equation (VAE-GAN)
$$
\mathcal{L} \;=\; \mathcal{L}_{\text{VAE}} \;+\; \lambda \, \mathcal{L}_{\text{GAN}}
$$

---

## Explanation of Symbols

- $$\mathcal{L}$$  
  The **total loss function** for the hybrid model.  

- $$\mathcal{L}_{\text{VAE}}$$  
  The **VAE loss**, i.e. the negative Evidence Lower Bound (ELBO), so that minimizing it maximizes the ELBO:  
  $$
  \mathcal{L}_{\text{VAE}} \;=\;
  -\,\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]
  + \text{KL}\!\left(q_\phi(z|x) \,\|\, p(z)\right)
  $$
  - First term = reconstruction error (negative log-likelihood).  
  - Second term = KL divergence regularizer on latent space.  

- $$\mathcal{L}_{\text{GAN}}$$  
  The **GAN loss** (the minimax value function $$V(D,G)$$):  
  $$
  \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]
  $$
  - Discriminator maximizes this term to separate real/fake.  
  - Generator (in a VAE-GAN, the VAE decoder) minimizes it to fool the discriminator.  

- $$\lambda$$  
  A **weighting coefficient** that balances how much influence the GAN loss has relative to the VAE loss.  

---

## Intuition
- **VAE alone** → generates blurry but consistent samples (good coverage, smooth latent space).  
- **GAN alone** → generates sharp samples, but training can be unstable and prone to mode collapse.  
- **VAE-GAN hybrid** → combines them:  
  - VAE ensures a structured latent space and coverage.  
  - GAN sharpens results via adversarial feedback.  


# Generative AI Models and Their Applications

| Generative Model | Core Idea | Typical Applications |
|------------------|-----------|-----------------------|
| **GANs (Generative Adversarial Networks)** | Adversarial game between Generator and Discriminator | Image generation (e.g., deepfakes), super-resolution, style transfer, data augmentation |
| **VAEs (Variational Autoencoders)** | Latent-variable model with ELBO training (reconstruction + KL regularization) | Image reconstruction, anomaly detection, representation learning, molecule design |
| **Autoregressive Models (RNN, LSTM, GPT, PixelRNN, WaveNet)** | Factorize joint distribution using chain rule; predict one token at a time | Language modeling (GPT), speech synthesis (WaveNet), image generation (PixelRNN) |
| **Normalizing Flows (RealNVP, Glow)** | Invertible transformations with Jacobian determinant for exact likelihoods | Density estimation, audio synthesis, molecular modeling, uncertainty estimation |
| **Diffusion Models (DDPM, Stable Diffusion, Imagen)** | Forward Gaussian noising (Markov chain) + learned reverse denoising | High-quality image generation (Stable Diffusion, Imagen), text-to-image synthesis, audio generation |
| **Energy-Based Models (EBMs, Boltzmann Machines)** | Probability from energy landscapes (low energy = high probability) | Representation learning, generative modeling, reinforcement learning, scientific modeling |
| **Score-Based / Flow-Matching Models** | Stochastic differential equations; simulate with score function (∇ log p(x)) | Image generation (score-based diffusion), physics-informed generative models |
| **NADE / MADE (Neural Autoregressive Density Estimators)** | Exact tractable density estimation via autoregressive factorization | Density estimation, structured data modeling, probabilistic reasoning |
| **Transformer-Based Generative Models (GPT, T5, etc.)** | Self-attention + softmax to model token dependencies | Large language models (GPT), translation, summarization, code generation |
| **Hybrids (VAE-GAN, Diffusion+Transformers, Energy-Guided Diffusion)** | Combine multiple generative frameworks to balance strengths | VAE-GAN: sharper images; Stable Diffusion: text-to-image with cross-attention; Energy-guided: physics + diffusion |

---


The reading list below gives the most-cited or representative academic papers for each generative model family covered above, pairing seminal work with influential follow-ups.

# Generative Model Families → Key Papers

| Model family | Seminal paper(s) | Influential follow-ups / recent guides |
|--------------|------------------|---------------------------------------|
| **GANs** | Goodfellow et al., “Generative Adversarial Networks” (2014). [arXiv][1] | WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017) for stability; StyleGAN (Karras et al., 2019) for controllable synthesis. [arXiv][2] |
| **VAEs** | Kingma & Welling, “Auto-Encoding Variational Bayes” (2013). [arXiv][3] | Doersch, “Tutorial on VAEs” (2016); β-VAE (Higgins et al., 2017) for disentanglement. [arXiv][4] |
| **Autoregressive (PixelRNN/WaveNet/GPT)** | PixelRNN (van den Oord et al., 2016) and WaveNet (van den Oord et al., 2016). [arXiv][5] | Transformer “Attention Is All You Need” (Vaswani et al., 2017); GPT-3 (Brown et al., 2020) as a large-scale autoregressive LM. |
| **Normalizing flows** | RealNVP (Dinh et al., 2017). [openreview.net][6] | Glow (Kingma & Dhariwal, 2018); FFJORD (Grathwohl et al., 2019) for continuous-time flows. [ar5iv][8] |
| **Diffusion models (DDPM, latent diffusion)** | Sohl-Dickstein et al., “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” (2015). [arXiv][9] | DDPM (Ho et al., 2020); Improved DDPM & Guided Diffusion (Nichol & Dhariwal, 2021); Latent Diffusion / Stable Diffusion (Rombach et al., 2022); SDXL (2023). [arXiv][10] |
| **EBMs (Boltzmann/energy)** | LeCun et al., “A Tutorial on Energy-Based Learning” (2006). [Stanford University][11] | Hinton’s RBM/DBN line (2006–2010) for practical training of energy models. [cs.toronto.edu][12] |
| **Score-based generative models / SDEs** | Score Matching (Hyvärinen, 2005) as the estimator; NCSN (Song & Ermon, 2019). [jmlr.org][13] | Score-based SDE framework (Song et al., 2021); Flow/Rectified-Flow perspectives (Lipman et al., 2022; Liu et al., 2022); practical guide to Flow Matching (2024). [arXiv][14] |
| **NADE / MADE (neural autoregressive density estimators)** | NADE (Larochelle & Murray, 2011). [Proceedings of Machine Learning Research][15] | MADE (Germain et al., 2015) and later NADE surveys/extensions. [arXiv][16] |
| **Transformers (as generative sequence models)** | Transformer (Vaswani et al., 2017). | GPT-3 (Brown et al., 2020) highlighting scaling for generative modeling. |
| **Hybrids (e.g., VAE-GAN, guided diffusion)** | VAE-GAN (Larsen et al., 2016). [arXiv][17] | Guided diffusion: classifier guidance (Dhariwal & Nichol, 2021) and classifier-free guidance (Ho & Salimans, 2022). [proceedings.nips.cc][18] |

---

*Notes*:

- Sources are authoritative where possible (original papers; PMLR/NeurIPS/ICLR/CVPR proceedings; arXiv), with modern guides added where they materially improve understanding (e.g., the Flow Matching guide).

[1]: https://arxiv.org/abs/1406.2661 "Generative Adversarial Networks"  
[2]: https://arxiv.org/abs/1701.07875 "Wasserstein GAN"  
[3]: https://arxiv.org/abs/1312.6114 "Auto-Encoding Variational Bayes"  
[4]: https://arxiv.org/abs/1606.05908 "Tutorial on Variational Autoencoders"  
[5]: https://arxiv.org/abs/1601.06759 "Pixel Recurrent Neural Networks"  
[6]: https://openreview.net/forum?id=HkpbnH9lx "Density estimation using Real NVP"  
[7]: https://www.semanticscholar.org/paper/Density-estimation-using-Real-NVP-Dinh-Sohl-Dickstein/09879f7956dddc2a9328f5c1472feeb8402bcbcf/figure/0 "Figure 1 from Density estimation using Real NVP"  
[8]: https://ar5iv.labs.arxiv.org/html/1807.03039 "Glow: Generative Flow with Invertible 1×1 Convolutions"  
[9]: https://arxiv.org/abs/1503.03585 "Deep Unsupervised Learning using Nonequilibrium Thermodynamics"  
[10]: https://arxiv.org/abs/2006.11239 "Denoising Diffusion Probabilistic Models"  
[11]: https://web.stanford.edu/class/cs379c/archive/2012/suggested_reading_list/documents/LeCunetal06.pdf "A Tutorial on Energy-Based Learning"  
[12]: https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf "A fast learning algorithm for deep belief nets"  
[13]: https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf "Estimation of Non-Normalized Statistical Models by Score ..."  
[14]: https://arxiv.org/abs/2011.13456 "Score-Based Generative Modeling through Stochastic ..."  
[15]: https://proceedings.mlr.press/v15/larochelle11a/larochelle11a.pdf "The Neural Autoregressive Distribution Estimator"  
[16]: https://arxiv.org/abs/1502.03509 "MADE: Masked Autoencoder for Distribution Estimation"  
[17]: https://arxiv.org/abs/1512.09300 "Autoencoding beyond pixels using a learned similarity metric"  
[18]: https://proceedings.nips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf "Diffusion Models Beat GANs on Image Synthesis"  
