# Generative AI Models with Equations & Detailed Explanations

Generative AI aims to model the data distribution \(p(x)\) and generate new samples that resemble the true data.  
Below are the major categories with their mathematical foundations and detailed component breakdowns.

---

## 1. Generative Adversarial Networks (GANs)

**Equation (Minimax Game):**

$$
\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\text{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]
$$

### Components:
- Generator (G): \(G(z; \theta_g)\) maps random noise \(z \sim p_z(z)\) into data space.

- Discriminator (D): \(D(x; \theta_d)\) outputs the probability that its input is real.

- Objective:
  - \(D\) maximizes \(V(D,G)\), i.e. correct classification of real vs. fake.
  - \(G\) minimizes \(\log(1 - D(G(z)))\), the log-probability that its samples are detected as fake.

Interpretation: A two-player game where \(G\) tries to fool \(D\), and \(D\) tries to detect fakes.
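
The value function can be estimated from a batch of discriminator outputs. The following is a minimal sketch, assuming the outputs are already available as plain lists of probabilities; the function name `gan_value` and the example numbers are illustrative, not part of any library.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for a batch of real samples, each in (0, 1)
    d_fake: D(G(z)) for a batch of generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes V toward its upper limit of 0.
confident = gan_value(d_real=[0.9, 0.95, 0.99], d_fake=[0.1, 0.05, 0.01])
# A fooled discriminator (fakes scored as real) yields a much lower V, which is what G wants.
fooled = gan_value(d_real=[0.9, 0.95, 0.99], d_fake=[0.9, 0.95, 0.99])
print(confident, fooled)
```

Both values are negative because each term is the log of a probability; \(D\) tries to raise the total and \(G\) tries to lower it.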

---

## 2. Variational Autoencoders (VAEs)

**Evidence Lower Bound (ELBO):**

$$
\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))
$$

### Components:
- Encoder: \( q_\phi(z|x) \), approximates posterior distribution of latent \(z\).
- Decoder: \( p_\theta(x|z) \), reconstructs input data from latent samples.
- Prior: \( p(z) \), often standard Gaussian \(\mathcal{N}(0,I)\).
- KL Divergence Term: Regularizes latent distribution to be close to prior.
- Reconstruction Term: Ensures decoded samples resemble the input.

Interpretation: VAEs balance reconstruction accuracy and latent regularization.
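
When the encoder is a diagonal Gaussian and the prior is \(\mathcal{N}(0,I)\), the KL term has a closed form. The sketch below assumes that setup and assumes the reconstruction log-likelihood has already been computed by a decoder; the function names and the input `recon_log_lik` are hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def elbo_estimate(recon_log_lik, mu, log_var):
    """Single-sample ELBO: reconstruction log-likelihood minus the KL term."""
    return recon_log_lik - gaussian_kl_to_standard_normal(mu, log_var)

# The KL term vanishes when the encoder output equals the prior.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Moving the mean away from zero is penalized.
kl_shifted = gaussian_kl_to_standard_normal([1.0, -1.0], [0.0, 0.0])
print(kl_at_prior, kl_shifted)
```

Training would maximize `elbo_estimate` averaged over the data, which trades reconstruction quality against the KL penalty.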

---

## 3. Autoregressive Models

**Factorization of Probability:**

$$
p(x) = \prod_{t=1}^T p(x_t \mid x_{<t})
$$

### Components:
- Conditioning: Each token depends only on previous tokens \(x_{<t}\).
- Neural Architectures: RNNs, LSTMs, Transformers.
- Training: Teacher-forcing with maximum likelihood.

Examples: GPT, PixelRNN, WaveNet.  
Interpretation: Generate one element at a time in a sequence.
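
The factorization can be illustrated with a toy model. This is a sketch under a strong simplifying assumption: each token is conditioned only on the previous token (a bigram table) rather than on all of \(x_{<t}\), and the probabilities in `COND` are made up for illustration.

```python
import math

# Hypothetical bigram table: COND[prev][token] = p(token | prev).
# "<s>" marks the start of the sequence.
COND = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_log_prob(tokens):
    """log p(x) = sum_t log p(x_t | x_{t-1}) under the bigram table.

    The true previous token is always fed in, which mirrors teacher forcing.
    """
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(COND[prev][tok])
        prev = tok
    return log_p

lp = sequence_log_prob(["a", "b", "b"])
print(math.exp(lp))  # equals 0.6 * 0.9 * 0.5 up to rounding
```

Maximum-likelihood training would adjust the conditionals to raise `sequence_log_prob` on observed sequences.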

---

## 4. Normalizing Flows

**Change of Variables Formula:**

$$
p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|
$$

### Components:
- Invertible Mapping: \( f: X \to Z \) ensures reversibility.
- Prior: Simple distribution (Gaussian).
- Jacobian Determinant: Adjusts probability density when mapping between spaces.

Interpretation: Learn exact likelihoods with flexible, invertible functions.
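
The change of variables formula can be checked on the simplest possible flow. The sketch below assumes a one-dimensional affine map \(f(x) = (x - \text{shift}) / \text{scale}\) with a standard normal base distribution; the function names are illustrative.

```python
import math

def standard_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def affine_flow_density(x, scale, shift):
    """p_X(x) for the flow z = f(x) = (x - shift) / scale with z ~ N(0, 1)."""
    z = (x - shift) / scale
    abs_det_jacobian = abs(1.0 / scale)  # |df/dx| in one dimension
    return standard_normal_pdf(z) * abs_det_jacobian

def normal_pdf(x, mean, std):
    """Reference density of N(mean, std^2), used only to check the flow."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

x = 1.3
print(affine_flow_density(x, scale=2.0, shift=0.5), normal_pdf(x, mean=0.5, std=2.0))
```

The two printed numbers agree: the Jacobian factor is exactly what turns the base density into the density of the transformed variable. Practical flows stack many such invertible layers with learned parameters.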

---

## 5. Diffusion Models

**Forward (noising) process:**

$$
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)
$$

**Reverse (denoising) process:**

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t))
$$

### Components:
- Forward Process: Gradually corrupt data with Gaussian noise.
- Reverse Process: Learn denoising distribution parameterized by neural net (e.g., U-Net).
- Scheduler: Controls step sizes \(\beta_t\).
- Sampling: Start from noise, iteratively denoise.

Interpretation: Model learns to reverse noise injection and generate realistic samples.  
Examples: DDPM, Stable Diffusion, Imagen.
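
The forward process and its scheduler can be sketched for a scalar \(x\). The code assumes a linear schedule whose endpoints are illustrative defaults, and it uses the standard cumulative product \(\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)\), which gives the fraction of signal variance remaining after \(t\) steps. All function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced beta_t values; the endpoints are illustrative defaults."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alpha(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def noise_step(x_prev, beta, rng):
    """One draw from q(x_t | x_{t-1}) for a scalar x."""
    return math.sqrt(1.0 - beta) * x_prev + math.sqrt(beta) * rng.gauss(0.0, 1.0)

betas = linear_beta_schedule(1000)
alpha_bar = cumulative_alpha(betas)

rng = random.Random(0)
x = 1.0  # a "clean" data point
for beta in betas:
    x = noise_step(x, beta, rng)
print(alpha_bar[0], alpha_bar[-1], x)
```

By the last step `alpha_bar` is close to zero, so almost nothing of the original signal remains and \(x_T\) is essentially a draw from \(\mathcal{N}(0, 1)\). That is the starting point for sampling with the learned reverse process.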

---

## 6. Energy-Based Models (EBMs)

**Boltzmann Distribution:**

$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \quad Z_\theta = \int e^{-E_\theta(x)} dx
$$

### Components:
- Energy Function: \(E_\theta(x)\) assigns low energy to real samples.
- Partition Function: \(Z_\theta\) ensures normalization (often intractable).
- Learning: Contrastive divergence, sampling via MCMC.

Interpretation: Learn an energy landscape where real samples lie in valleys.
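
On a finite state space the partition function is a plain sum, which makes the Boltzmann distribution easy to compute. The sketch below assumes exactly that (a sum in place of the integral) and uses made-up energies for three hypothetical states.

```python
import math

def boltzmann_probs(energies):
    """p(x) = exp(-E(x)) / Z over a finite state space, with Z computed by summation."""
    # Shifting by the minimum energy leaves p unchanged and avoids overflow in exp.
    e_min = min(energies.values())
    unnorm = {x: math.exp(-(e - e_min)) for x, e in energies.items()}
    z = sum(unnorm.values())
    return {x: u / z for x, u in unnorm.items()}

# Hypothetical energies; lower energy should mean higher probability.
energies = {"valley": 0.0, "slope": 1.0, "peak": 4.0}
probs = boltzmann_probs(energies)
print(probs)
```

For continuous, high-dimensional \(x\) the sum becomes an integral that usually cannot be computed, which is why training relies on sampling-based approximations.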

---

## 7. Score-Based / Flow-Matching Models

**Stochastic Differential Equation (SDE):**

$$
dx = \nabla_x \log p_\theta(x) \, dt + \sqrt{2} \, dW_t
$$

### Components:
- Score Function: \(\nabla_x \log p_\theta(x)\), gradient of log-density.
- Sampling Dynamics: Langevin dynamics simulate motion in probability landscape.
- Flow Matching: Trains a velocity field whose ODE transports samples from a simple prior to the data distribution.

Interpretation: Model generates data by following gradient fields of probability density.
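
Langevin sampling can be sketched when the score is known exactly. The example assumes the target is a standard normal, whose score is \(-x\), and discretizes the dynamics with a simple Euler-Maruyama step; in a real model the `score` argument would be a trained network. Step size, chain length, and number of chains are illustrative choices.

```python
import math
import random

def score_standard_normal(x):
    """Score of N(0, 1): d/dx log p(x) = -x."""
    return -x

def langevin_sample(score, x0, step, num_steps, rng):
    """Euler-Maruyama discretization of dx = score(x) dt + sqrt(2) dW."""
    x = x0
    for _ in range(num_steps):
        x = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
# Every chain starts far from the mode, at x0 = 3.0.
samples = [
    langevin_sample(score_standard_normal, x0=3.0, step=0.02, num_steps=500, rng=rng)
    for _ in range(1000)
]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should land near 0 and 1 respectively
```

The drift term pulls each chain toward high-density regions while the noise term keeps the samples spread out, so the chains end up distributed approximately like the target rather than collapsing onto its mode.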

---

## 8. Neural Autoregressive Density Estimators (NADE)

**Density Factorization:**

$$
p(x) = \prod_{i=1}^D p(x_i \mid x_{<i}; \theta)
$$

### Components:
- Neural Parametrization: Conditionals parameterized by NN.
- Ordering: Sequence of features matters (fixed or random).
- Training: Exact likelihood maximization.

Interpretation: A fully tractable probabilistic model for high-dimensional data.
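
Tractability can be verified directly on a small binary example. The sketch below is a simplification, not NADE itself: each conditional is an independent logistic regression on the earlier dimensions, whereas NADE shares a hidden layer across conditionals. The parameters are made up.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def autoregressive_log_prob(x, weights, biases):
    """log p(x) = sum_i log p(x_i | x_{<i}) for a binary vector x.

    Each conditional is a logistic regression on the earlier dimensions:
    p(x_i = 1 | x_{<i}) = sigmoid(biases[i] + sum_{j < i} weights[i][j] * x_j).
    """
    log_p = 0.0
    for i, x_i in enumerate(x):
        logit = biases[i] + sum(weights[i][j] * x[j] for j in range(i))
        p_one = sigmoid(logit)
        log_p += math.log(p_one if x_i == 1 else 1.0 - p_one)
    return log_p

# Hypothetical parameters for D = 3; only the strictly lower triangle of weights is used.
weights = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [-0.7, 2.0, 0.0]]
biases = [0.2, -0.4, 0.1]

# Summing p(x) over all 2^3 binary vectors checks that the model is normalized.
total = sum(
    math.exp(autoregressive_log_prob(x, weights, biases))
    for x in itertools.product([0, 1], repeat=3)
)
print(total)
```

The total is 1 up to rounding without any explicit normalization step, because each conditional is itself normalized.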

---

## 9. Transformer-based Generative Models

**Self-Attention Mechanism:**

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V
$$

### Components:
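- Query (Q): Representation of the current token.
- Key (K): Representation of each token that queries are compared against.
- Value (V): Carries the actual information.
- Scaling and Softmax: Dividing by \(\sqrt{d_k}\) keeps dot products from growing with dimension; softmax turns the scores into weights that sum to 1.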

Interpretation: Each token attends to others for context, enabling long-range dependencies.  
Examples: GPT-3/4/5, BERT (masked variant).
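
The attention formula can be written out for small matrices. This is a minimal sketch on plain Python lists, assuming a single head and no causal mask (an autoregressive model would additionally mask out future positions); the inputs are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v. Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
print(weights[0], outputs[0])
```

Each row of `weights` sums to 1, and each output row is the corresponding weighted average of the rows of `V`.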

---

## 10. Hybrid and Specialized Models

- VAE-GAN: Combines reconstruction likelihood with adversarial loss.  

$$
\mathcal{L} = \mathcal{L}_\text{VAE} + \lambda \mathcal{L}_\text{GAN}
$$

- Diffusion + Transformers: Stable Diffusion uses U-Net with cross-attention.
- Energy-Guided Diffusion: Injects energy-based priors into diffusion sampling.

---

# Summary

- GANs: Adversarial game → fake vs real.
- VAEs: Latent variable model → ELBO maximization.
- Autoregressive: Sequential factorization.
- Normalizing Flows: Exact likelihood via invertible transforms.
- Diffusion: Reverse noise → denoise to generate.
- EBMs: Energy landscape.
- Score Models: Gradient-based sampling.
- NADE: Exact conditional factorization.
- Transformers: Attention-powered autoregression.
- Hybrids: Combine strengths of multiple models.


# Generative AI Models: Probabilistic Essence

---

## 1. GANs

**Essence**: The discriminator outputs a probability between 0 and 1 (real vs fake).

**Why the logarithm?**  
Logarithms turn multiplicative probabilities into additive terms, making optimization stable.  

- For real data, \(D\) maximizes  
$$
\log D(x)
$$  
which pushes \(D(x)\) toward 1.  

- For fake data, \(D\) maximizes  
$$
\log (1 - D(G(z)))
$$  
which pushes \(D(G(z))\) toward 0, while \(G\) minimizes the same term to push \(D(G(z))\) toward 1.  

**Heart of idea**: The generator improves by pushing \(D(G(z))\) toward 1. At equilibrium its samples are indistinguishable from real data, so the discriminator can do no better than output 0.5 everywhere, which means the generated distribution matches the true one.

---

## 2. VAEs

**Essence**: The marginal likelihood \(p_\theta(x)\) is usually intractable, so we maximize a lower bound on its logarithm, the ELBO.

**Why the KL divergence?**  
It measures how different the encoder’s distribution \(q_\phi(z|x)\) is from a prior (usually Gaussian). By constraining it, the latent space becomes smooth and generalizable.

**Why the log-likelihood term?**  
\(\log p_\theta(x|z)\) scores how well the decoder reconstructs the input from a latent code. Higher likelihood → better reconstructions.

**Heart of idea**: Balance between two pressures:  
1. Data reconstruction accuracy.  
2. Keeping latent variables in a structured, normalized space.

---

## 3. Autoregressive Models

**Essence**: Probability of a sequence is factorized step by step.

Each conditional probability lies in [0,1], and the probability of the full sequence is the product of these terms.

**Factorization**:  
$$
p(x) = \prod_{t=1}^T p(x_t \mid x_{<t})
$$

**Why logs matter?**  
Maximizing  
$$
\log p(x)
$$  
avoids multiplying many small numbers (underflow) and instead adds them.
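
The underflow problem is easy to reproduce. The sketch below assumes a hypothetical sequence of 2000 tokens, each assigned a conditional probability of 0.1.

```python
import math

probs = [0.1] * 2000  # hypothetical per-token conditional probabilities

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)  # stays finite

print(product, log_sum)
```

The direct product becomes exactly `0.0` in double precision, so it carries no information for comparison or optimization, while the sum of logs remains an ordinary finite number.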

**Heart of idea**: Predict the next token given history → chance of the sequence is just the product of these conditional chances.

---

## 4. Normalizing Flows

**Essence**: Exact probability modeling using invertible transformations.

**Change of Variables**:  
$$
p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|
$$

**Why determinant of Jacobian?**  
When you stretch or compress space, the density changes accordingly. The determinant tells how much “volume” is scaled.

**Heart of idea**: Start with a simple distribution (like Gaussian), apply reversible transformations, and compute the exact likelihood of data.

---

## 5. Diffusion Models

**Essence**: Data is generated by learning to undo a fixed, gradual noising process.

- Forward process gradually destroys structure (adding Gaussian noise).  
- Reverse process learns a distribution over the slightly less noisy \(x_{t-1}\) given \(x_t\), one step at a time.  

**Forward process**:  
$$
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)
$$

**Reverse process**:  
$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t))
$$

**Why Gaussian assumption?**  
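Gaussians are closed under linear maps and under sums of independent Gaussians, so the noised sample \(x_t\) can be written in closed form given \(x_0\), and for small \(\beta_t\) the true reverse step is also approximately Gaussian.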

**Heart of idea**: The chance of clean data is recovered by chaining probabilities of many denoising steps backward in time.

---

## 6. EBMs

**Essence**: Assigns an “energy” score to every configuration. Low energy = high probability.

**Boltzmann Distribution**:  
$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \quad
Z_\theta = \int e^{-E_\theta(x)} dx
$$

**Why exponentials and partition function?**  
The exponential makes the unnormalized density \(e^{-E_\theta(x)}\) strictly positive for any real-valued energy, and dividing by \(Z_\theta\) makes it integrate to 1.

**Heart of idea**: The world is modeled as an energy surface; real data sits in valleys, and fake data in high-energy peaks.

---

## 7. Score-Based / Flow-Matching Models

**Essence**: Instead of modeling probability directly, model its gradient (the score function).

**Stochastic Differential Equation**:  
$$
dx = \nabla_x \log p_\theta(x) \, dt + \sqrt{2}\, dW_t
$$

**Why gradient of log probability?**  
It points in the direction of higher probability density — essentially, “where data is more likely.”

**Heart of idea**: Generate by simulating dynamics that climb toward regions of higher data likelihood.

---

## 8. NADE

**Essence**: An autoregressive factorization applied to the dimensions of a fixed-length vector rather than to the time steps of a sequence.

**Density Factorization**:  
$$
p(x) = \prod_{i=1}^D p(x_i \mid x_{<i}; \theta)
$$

Every conditional is an explicitly normalized distribution, so their product is itself normalized and \(p(x)\) can be evaluated exactly.

**Heart of idea**: Fully tractable density estimation with neural nets.

---

## 9. Transformers

**Essence**: Attention mechanism computes a probability distribution over keys given a query.

**Self-Attention**:  
$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

**Why softmax?**  
Converts raw similarity scores into probabilities (0–1, summing to 1).

**Heart of idea**: Each token “chooses” how much weight to give every other token when forming its new representation.

---

## 10. Hybrids

**Essence**: Mix and match the probability tricks above.

**Example (VAE-GAN)**:  
$$
\mathcal{L} = \mathcal{L}_\text{VAE} + \lambda \mathcal{L}_\text{GAN}
$$

**Heart of idea**: Multiple probability constraints together yield sharper, more realistic models.

---

# Core Intuition Across All Models

- Probabilities must lie in [0,1] — hence the use of sigmoid, softmax, and normalization.  
- Logarithms stabilize products of probabilities (turning them into sums).  
- KL divergence / energy functions constrain distributions so they don’t collapse or spread arbitrarily.  
- Noise vs. structure — most models balance the randomness of generation with the order of real data.  
- Gradients guide learning — whether via adversarial games, variational bounds, or score matching, the key is optimizing probability surfaces.


# Understanding GAN Training as a Game

The GAN objective is formulated as:

$$
\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\text{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_z(z)} [\log (1 - D(G(z)))]
$$

- **Discriminator (D):** judges whether an input is real (from data) or fake (from generator).  
- **Generator (G):** takes random noise \(z\) and tries to produce samples \(G(z)\) that fool the discriminator.

---

## Step-by-Step Flow

1. Draw real samples \(x \sim p_\text{data}(x)\).  
   - Feed into \(D\).  
   - D’s goal: output probability close to 1 (real).  

2. Draw random noise \(z \sim p_z(z)\).  
   - Feed into \(G\).  
   - Generator outputs a fake sample \(G(z)\).  

3. Feed \(G(z)\) into \(D\).  
   - D’s goal: output probability close to 0 (fake).  
   - G’s goal: push D’s output closer to 1 (make fake look real).  

**Training alternates:**  
- Fix \(G\), update \(D\) to better separate real/fake.  
- Fix \(D\), update \(G\) to fool \(D\).  
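
One round of the alternating updates can be sketched with hand-derived gradients. The setup is a toy assumption, not a practical GAN: data is drawn from \(\mathcal{N}(2, 1)\), the discriminator is a logistic model \(D(x) = \sigma(wx + b)\), and the generator only shifts its noise, \(G(z) = z + c\). All names, the learning rate, and the batch size are illustrative.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_loss(w, b, reals, fakes):
    """Negative of V(D, G) on a batch, with D(x) = sigmoid(w * x + b)."""
    real_term = sum(math.log(sigmoid(w * x + b)) for x in reals) / len(reals)
    fake_term = sum(math.log(1.0 - sigmoid(w * g + b)) for g in fakes) / len(fakes)
    return -(real_term + fake_term)

rng = random.Random(0)
w, b = 0.0, 0.0  # discriminator parameters
c = 0.0          # generator parameter: G(z) = z + c
lr = 0.1

reals = [rng.gauss(2.0, 1.0) for _ in range(256)]  # toy data ~ N(2, 1)
noise = [rng.gauss(0.0, 1.0) for _ in range(256)]
fakes = [z + c for z in noise]

loss_before = d_loss(w, b, reals, fakes)

# Step 1: fix G, take one gradient-descent step on D's loss (i.e. ascend V).
grad_w = (sum(-(1.0 - sigmoid(w * x + b)) * x for x in reals) / len(reals)
          + sum(sigmoid(w * g + b) * g for g in fakes) / len(fakes))
grad_b = (sum(-(1.0 - sigmoid(w * x + b)) for x in reals) / len(reals)
          + sum(sigmoid(w * g + b) for g in fakes) / len(fakes))
w, b = w - lr * grad_w, b - lr * grad_b
loss_after = d_loss(w, b, reals, fakes)

# Step 2: fix D, take one gradient-descent step on G's loss log(1 - D(G(z))).
# d/dc log(1 - sigmoid(w * (z + c) + b)) = -sigmoid(w * (z + c) + b) * w
grad_c = sum(-sigmoid(w * (z + c) + b) * w for z in noise) / len(noise)
c = c - lr * grad_c

print(loss_before, loss_after, c)
```

After the discriminator step its loss on the batch drops, and the generator step moves `c` in the direction of the data mean. Repeating the two steps gives the alternating training loop; this sketch makes no claim about convergence, which is a known difficulty of GAN training.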

---

## Intuition

- If **D is perfect**: it outputs 1 for real, 0 for fake. \(\log(1 - D(G(z)))\) then saturates and the generator receives almost no gradient to learn from.  
- If **G is perfect**: its fakes are indistinguishable from real, so D outputs ~0.5 for everything. D is confused.  

The training aims for a **balance point**: the generator learns the true data distribution.
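
The balance point can be checked numerically on a small discrete example. The sketch relies on the standard result that, for a fixed generator, the best discriminator is \(D^*(x) = p_\text{data}(x) / (p_\text{data}(x) + p_g(x))\), where \(p_g\) denotes the distribution of generated samples. The three-outcome distributions are made up.

```python
import math

def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data(x) / (p_data(x) + p_gen(x)) on a shared finite support."""
    return {x: p_data[x] / (p_data[x] + p_gen[x]) for x in p_data}

def value(p_data, p_gen, d):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_gen}[log(1 - D(x))] as exact sums."""
    return (sum(p_data[x] * math.log(d[x]) for x in p_data)
            + sum(p_gen[x] * math.log(1.0 - d[x]) for x in p_gen))

# Hypothetical three-outcome distributions.
p_data = {"a": 0.5, "b": 0.3, "c": 0.2}
p_mismatch = {"a": 0.2, "b": 0.3, "c": 0.5}

v_mismatch = value(p_data, p_mismatch, optimal_discriminator(p_data, p_mismatch))
d_matched = optimal_discriminator(p_data, p_data)
v_matched = value(p_data, p_data, d_matched)
print(v_mismatch, v_matched, -math.log(4.0))
```

When the generator matches the data, the optimal discriminator outputs 0.5 everywhere and the value equals \(-\log 4\); any mismatch lets the discriminator achieve a higher value.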

---

## Truth Table of Cases

| Input type   | D’s output = 1 (calls it Real) | D’s output = 0 (calls it Fake) | Who wins? |
|--------------|--------------------------------|--------------------------------|-----------|
| Real data \(x\) | ✅ Correct (good for D, \(\log D(x)\) high) | ❌ Mistake (bad for D, \(\log D(x) \to -\infty\)) | D wins if 1, loses if 0 |
| Fake data \(G(z)\) | ❌ Mistake (bad for D, good for G, \(\log(1-D(G(z))) \to -\infty\)) | ✅ Correct (good for D, bad for G, \(\log(1-D(G(z)))\) high) | D wins if 0, G wins if 1 |

---

## Summary

- **Discriminator’s job:** maximize correctness — say “1” for real, “0” for fake.  
- **Generator’s job:** flip the second row of the table — try to force D to say “1” for its fakes.  

**Training is a push–pull game:** each improvement by D pushes G to improve, and vice versa.
