# The Complete Problem–Solution Journey  
## Score-Based and Diffusion Generative Modeling

This section presents a complete, logically ordered account of the **problems that historically blocked likelihood-based generative modeling**, and the **precise theoretical solutions** that culminated in modern score-based diffusion models. Each step resolves a concrete failure mode and motivates the next conceptual advance.

---

## 1. Problem: Likelihood-Based Models Require Intractable Normalization

### The problem

Classical likelihood-based generative models define densities as

$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}
$$

where the normalizing constant

$$
Z_\theta = \int e^{-E_\theta(x)} dx
$$

is typically **intractable** in high dimensions.

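This forces practitioners to choose between:

- **Restrictive architectures** (normalizing flows, autoregressive models), which keep \( Z_\theta \) tractable by construction but constrain the network design, or  
- **Approximate objectives** (VAEs, contrastive divergence), which avoid computing \( Z_\theta \) but optimize only a bound on, or an approximate gradient of, the likelihood.

Either choice limits the model: the first in expressiveness, the second in how faithfully the likelihood is optimized.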

### The solution

Model the **score function** instead of the density:

$$
s(x) = \nabla_x \log p(x)
$$

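Key insight:

- The score **does not depend on the normalization constant**: since \( \log p_\theta(x) = -E_\theta(x) - \log Z_\theta \) and \( Z_\theta \) is constant in \( x \), we get \( \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x) \)
- Any neural network whose output has the same dimension as its input can parameterize it, with no architectural constraints imposed for tractability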
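As a small numerical check of this insight, the sketch below differentiates the log of an unnormalized density under two different scale constants and compares both results with the analytic score. The double-well energy, the evaluation point, and the helper names are illustrative assumptions, not part of the original text.

```python
def energy(x):
    # Hypothetical 1-D double-well energy, chosen only for illustration.
    return 0.25 * x**4 - 0.5 * x**2

def log_density_unnormalized(x, log_scale):
    # log of (scale * exp(-E(x))); log_scale stands in for -log Z.
    return -energy(x) + log_scale

def score_fd(log_density, x, h=1e-5):
    # Central finite-difference approximation of d/dx log p(x).
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

x = 0.7
score_a = score_fd(lambda v: log_density_unnormalized(v, 0.0), x)
score_b = score_fd(lambda v: log_density_unnormalized(v, -12.3), x)
score_exact = x - x**3  # analytic -dE/dx for the energy above

print(score_a, score_b, score_exact)
```

Both finite-difference estimates agree with each other and with the analytic value up to numerical error, because the constant offset cancels in the difference.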

### Outcome

Tractable learning without computing likelihoods or partition functions.

---

## 2. Problem: Ground-Truth Score Is Unknown

### The problem

The natural training objective is the Fisher divergence:

$$
\mathbb{E}_{p(x)} \left\| s_\theta(x) - \nabla_x \log p(x) \right\|^2
$$

However, the true score \( \nabla_x \log p(x) \) is unknown, making direct optimization impossible.

### The solution

**Score Matching** (Hyvärinen, 2005):

- Reformulates the Fisher divergence using integration by parts
- Eliminates dependence on the unknown true score, leaving only \( s_\theta \) and the trace of its Jacobian
- Produces an objective computable from data samples alone
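To make the reformulated objective concrete: up to an additive constant, it is \( \mathbb{E}_{p(x)}\left[ \operatorname{tr}(\nabla_x s_\theta(x)) + \tfrac{1}{2}\|s_\theta(x)\|^2 \right] \). The sketch below evaluates it in one dimension for a Gaussian-shaped score model \( s(x) = -(x - \mu)/v \), an assumed toy model, and checks that the sample mean and variance give a lower objective than perturbed parameters. The data distribution and all names are illustrative.

```python
import random

rng = random.Random(0)
# Toy data set: samples from N(2, 1.5^2). The objective never sees these parameters.
data = [rng.gauss(2.0, 1.5) for _ in range(20000)]

def sm_objective(mu, var, xs):
    # 1-D score matching objective E[ s'(x) + 0.5 * s(x)^2 ]
    # for the model s(x) = -(x - mu) / var, whose derivative is s'(x) = -1 / var.
    total = 0.0
    for x in xs:
        s = -(x - mu) / var
        total += -1.0 / var + 0.5 * s * s
    return total / len(xs)

sample_mean = sum(data) / len(data)
sample_var = sum((x - sample_mean) ** 2 for x in data) / len(data)

j_best = sm_objective(sample_mean, sample_var, data)
j_shifted_mean = sm_objective(sample_mean + 0.5, sample_var, data)
j_inflated_var = sm_objective(sample_mean, 1.5 * sample_var, data)

print(j_best, j_shifted_mean, j_inflated_var)
```

For this model family the minimizer can be found in closed form and coincides with the sample mean and variance, which is why no true score is needed to fit it.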

### Outcome

Score models can be trained with standard SGD, without adversarial objectives.

---

## 3. Problem: Naive Score Matching Fails in High Dimensions

### The problem

Score matching weights errors by the data density \( p(x) \):

- Low-density regions contribute almost nothing to the loss
- But sampling *starts* in low-density regions

### Consequence

Langevin dynamics, initialized from noise, relies on inaccurate scores from its very first steps and fails to reach high-density regions.  
This was the first major empirical failure encountered in practice.
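For reference, the Langevin update referred to here can be sketched as follows. The example assumes the exact score of a toy target \( \mathcal{N}(3, 1) \) is available everywhere, which is precisely the condition a learned score violates in low-density regions. The step size, chain count, and function names are illustrative assumptions.

```python
import math
import random

def langevin_sample(score, x0, step=0.01, n_steps=1000, rng=random):
    # Unadjusted Langevin dynamics: x <- x + step * score(x) + sqrt(2 * step) * z
    x = x0
    for _ in range(n_steps):
        x = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

def target_score(x):
    # Exact score of the toy target N(3, 1).
    return -(x - 3.0)

rng = random.Random(0)
# Every chain starts far from the mode, i.e. in a low-density region.
samples = [langevin_sample(target_score, x0=-10.0, rng=rng) for _ in range(500)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, var)
```

With an accurate score the chains drift from the poor starting point toward the target; if `target_score` returned unreliable values far from the mode, the early steps would not carry the chains there.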

---

## 4. Problem: Inaccurate Scores in Low-Density Regions Break Sampling

### The solution

Perturb the data with noise:

$$
x \sim p(x) \quad \Rightarrow \quad x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
$$

Effects:

- Fills low-density regions
- Makes score estimation well-posed everywhere
- Stabilizes both training and sampling
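One common way to train on perturbed data is denoising score matching, which this section does not name explicitly: regress the model onto \( -(\tilde{x} - x)/\sigma^2 \), the score of the Gaussian perturbation kernel. The sketch below assumes toy data \( x \sim \mathcal{N}(0, 1) \) and a linear score model \( s(\tilde{x}) = a\,\tilde{x} \), fits \( a \) by least squares, and compares it with the exact perturbed score coefficient \( -1/(1 + \sigma^2) \). All names and constants are illustrative.

```python
import random

rng = random.Random(0)
sigma = 0.5
n_samples = 20000

numerator = 0.0
denominator = 0.0
for _ in range(n_samples):
    x = rng.gauss(0.0, 1.0)                    # clean sample from p(x) = N(0, 1)
    x_noisy = x + sigma * rng.gauss(0.0, 1.0)  # perturbed sample
    target = -(x_noisy - x) / sigma**2         # score of N(x_noisy; x, sigma^2)
    numerator += x_noisy * target
    denominator += x_noisy * x_noisy

# Least-squares solution for the linear model s(x_noisy) = a * x_noisy.
a_hat = numerator / denominator
# Exact coefficient: the perturbed density is N(0, 1 + sigma^2).
a_true = -1.0 / (1.0 + sigma**2)

print(a_hat, a_true)
```

The regression target depends only on the sample and the added noise, yet the fitted model recovers the score of the perturbed distribution.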

### Outcome

Score estimates become reliable across the whole space, but for the noise-perturbed distribution rather than the original data distribution.

---

## 5. Problem: No Single Noise Level Is Optimal

### The problem

- Large noise: good coverage, poor fidelity  
- Small noise: accurate detail, poor coverage  

A single noise scale is fundamentally insufficient.

### The solution

**Noise Conditional Score Networks (NCSN)**

Train models of the form:

$$
s_\theta(x, \sigma_i) \approx \nabla_x \log p_{\sigma_i}(x)
$$

Using:

- Multiple noise scales
- Geometric noise schedules
- Joint training over all \( \sigma \)

Sampling via **Annealed Langevin Dynamics** gradually reduces noise.
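A minimal sketch of a geometric noise schedule and annealed Langevin dynamics follows. It assumes toy data \( \mathcal{N}(0, 1) \), so the perturbed score \( -x/(1 + \sigma^2) \) is known exactly and stands in for a trained \( s_\theta(x, \sigma_i) \). The per-level step size \( \alpha_i = \epsilon\,\sigma_i^2/\sigma_L^2 \) is one commonly used rule; the constants and function names are illustrative assumptions.

```python
import math
import random

def geometric_sigmas(sigma_max, sigma_min, n_levels):
    # Noise levels with a constant ratio between consecutive entries.
    ratio = (sigma_min / sigma_max) ** (1.0 / (n_levels - 1))
    return [sigma_max * ratio**i for i in range(n_levels)]

def perturbed_score(x, sigma):
    # Exact score of N(0, 1 + sigma^2): toy data N(0, 1) plus N(0, sigma^2) noise.
    return -x / (1.0 + sigma**2)

def annealed_langevin(score, sigmas, eps=0.002, steps_per_level=100, rng=random):
    x = rng.gauss(0.0, sigmas[0])  # start from broad noise
    for sigma in sigmas:           # largest noise level first
        alpha = eps * sigma**2 / sigmas[-1] ** 2
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * noise
    return x

rng = random.Random(0)
sigmas = geometric_sigmas(10.0, 0.1, 10)
samples = [annealed_langevin(perturbed_score, sigmas, rng=rng) for _ in range(500)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(sigmas[0], sigmas[-1], mean, var)
```

Large noise levels move the chains quickly toward the bulk of the distribution, and the small levels refine them toward the nearly unperturbed data distribution.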

### Outcome

High-quality samples competitive with GANs.

---

## 6. Problem: Discrete Noise Schedules Are Heuristic

### The problem

- A finite set of noise levels must be hand-tuned (number, range, spacing) and has little theoretical justification
- No exact likelihood computation
- Difficult to analyze or generalize

### The solution

Move to **continuous-time noise** using stochastic differential equations (SDEs).

Forward diffusion:

$$
dx = f(x,t)\,dt + g(t)\,dW_t
$$

Interpretation:

- Infinitely many noise scales
- Smooth diffusion from data to noise

Common choices:

- Variance Exploding (VE)
- Variance Preserving (VP)
- sub-VP SDEs
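As an illustration of the forward process, the sketch below simulates a VP SDE, \( f(x,t) = -\tfrac{1}{2}\beta x \) and \( g(t) = \sqrt{\beta} \), with Euler–Maruyama steps. It assumes a constant \( \beta \) for simplicity (practical schedules usually vary \( \beta \) with time) and compares the simulated marginal at \( t = 1 \) with the closed-form mean \( x_0 e^{-\beta t/2} \) and variance \( 1 - e^{-\beta t} \). The constants and function names are assumptions.

```python
import math
import random

def vp_drift(x, t, beta):
    # f(x, t) = -0.5 * beta * x (constant beta assumed for this sketch)
    return -0.5 * beta * x

def vp_diffusion(t, beta):
    # g(t) = sqrt(beta)
    return math.sqrt(beta)

def euler_maruyama_forward(x0, beta=5.0, t_end=1.0, n_steps=200, rng=random):
    dt = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + vp_drift(x, t, beta) * dt + vp_diffusion(t, beta) * math.sqrt(dt) * noise
        t += dt
    return x

rng = random.Random(0)
x0 = 2.0
paths = [euler_maruyama_forward(x0, rng=rng) for _ in range(2000)]
mean = sum(paths) / len(paths)
var = sum((p - mean) ** 2 for p in paths) / len(paths)

expected_mean = x0 * math.exp(-0.5 * 5.0 * 1.0)
expected_var = 1.0 - math.exp(-5.0 * 1.0)
print(mean, expected_mean, var, expected_var)
```

The data point at \( x_0 = 2 \) is almost entirely forgotten by \( t = 1 \), and the marginal is close to a standard normal, which is what makes it usable as a prior for sampling.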

### Outcome

Continuous theory with improved sample quality.

---

## 7. Problem: How Can an SDE Be Reversed?

### The problem

The forward SDE destroys information.  
A principled reverse process is required.

### The solution

**Reverse-time SDE** (Anderson, 1982):

$$
dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{W}_t
$$

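Here \( \bar{W}_t \) is a Brownian motion running backward in time, and \( dt \) is an infinitesimal negative time step, so the equation is integrated from the final time back to \( t = 0 \).

Crucial observation:

- The reverse drift depends on the data distribution only through the **score** \( \nabla_x \log p_t(x) \)
- This is exactly the quantity the model learns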
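The reverse-time SDE can be simulated with the same Euler–Maruyama scheme, stepping backward in time. The sketch below assumes a VP SDE with constant \( \beta \) and toy data \( \mathcal{N}(2, 0.5^2) \), for which \( p_t \) is Gaussian and its score is available in closed form. That exact score stands in for a trained model, and all names and constants are illustrative.

```python
import math
import random

BETA = 5.0       # constant beta for the VP SDE (assumption)
DATA_MEAN = 2.0  # toy data distribution N(2, 0.5^2)
DATA_STD = 0.5

def marginal_mean_var(t):
    # Mean and variance of p_t under the VP SDE for Gaussian data.
    decay = math.exp(-BETA * t)
    return DATA_MEAN * math.sqrt(decay), DATA_STD**2 * decay + (1.0 - decay)

def score(x, t):
    # Exact score of the Gaussian marginal p_t; a trained model would replace this.
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def reverse_sde_sample(n_steps=400, t_end=1.0, rng=random):
    dt = t_end / n_steps
    x = rng.gauss(0.0, 1.0)  # prior sample, approximately p_T
    t = t_end
    for _ in range(n_steps):
        # Reverse drift: f(x, t) - g(t)^2 * score(x, t)
        drift = -0.5 * BETA * x - BETA * score(x, t)
        # Step backward in time, hence the minus sign in front of the drift.
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
        t -= dt
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))

print(mean, std)
```

Starting from standard normal noise, the samples end up close to the toy data distribution, with small deviations due to the finite step size and the approximate prior.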

### Outcome

Sampling reduces to solving a reverse-time SDE.

---

## 8. Problem: Reverse SDE Requires Time-Dependent Scores

### The solution

Train a **time-dependent score model**:

$$
s_\theta(x,t) \approx \nabla_x \log p_t(x)
$$

Training via:

- Continuous-time score matching
- Weighted Fisher divergence

Special case:

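- Choosing the weight equal to \( g(t)^2 \) (likelihood weighting) turns the objective into an upper bound on the KL divergence between the data distribution and the model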
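Written out, one standard form of the continuous-time objective is the following, where \( \lambda(t) \) is a positive weighting function, \( T \) is the final diffusion time, and \( p_{0t}(x_t \mid x_0) \) is the transition kernel of the forward SDE. These symbols are introduced here for illustration and do not appear elsewhere in this section.

$$
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,T)} \, \lambda(t) \, \mathbb{E}_{x_0 \sim p_0} \, \mathbb{E}_{x_t \sim p_{0t}(x_t \mid x_0)} \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2
$$

For VE, VP, and sub-VP SDEs the kernel \( p_{0t} \) is Gaussian, so the inner target has a closed form. Likelihood weighting corresponds to the choice \( \lambda(t) = g(t)^2 \).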

### Outcome

Competitive likelihoods without directly maximizing the exact log-likelihood.

---

## 9. Problem: Numerical SDE Solvers Are Slow and Noisy

### The solution

**Predictor–Corrector Sampling**

- Predictor: Euler–Maruyama or higher-order solvers
- Corrector: Langevin MCMC using the learned score

Key principle:

- Only the marginal distributions must be correct
- Exact trajectories are unnecessary
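A minimal predictor–corrector loop might look like the sketch below. It assumes a VP SDE with constant \( \beta \) and toy data \( \mathcal{N}(2, 0.5^2) \), uses the exact Gaussian score in place of a trained model, and applies one Langevin corrector step with a fixed step size after each Euler–Maruyama predictor step. The fixed step size is a simplification (practical samplers often adapt it), and every name and constant here is illustrative.

```python
import math
import random

BETA = 5.0       # constant beta for the VP SDE (assumption)
DATA_MEAN = 2.0  # toy data distribution N(2, 0.5^2)
DATA_STD = 0.5

def marginal_mean_var(t):
    decay = math.exp(-BETA * t)
    return DATA_MEAN * math.sqrt(decay), DATA_STD**2 * decay + (1.0 - decay)

def score(x, t):
    # Exact score of the Gaussian marginal p_t; stands in for s_theta(x, t).
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def pc_sample(n_steps=200, corrector_step=0.005, t_end=1.0, rng=random):
    dt = t_end / n_steps
    x = rng.gauss(0.0, 1.0)  # prior sample, approximately p_T
    for i in range(n_steps):
        t = t_end - i * dt
        t_next = t_end - (i + 1) * dt
        # Predictor: one reverse-time Euler-Maruyama step from t to t_next.
        drift = -0.5 * BETA * x - BETA * score(x, t)
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
        # Corrector: one Langevin step targeting the marginal at t_next.
        noise = rng.gauss(0.0, 1.0)
        x = x + corrector_step * score(x, t_next) + math.sqrt(2.0 * corrector_step) * noise
    return x

rng = random.Random(0)
samples = [pc_sample(rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))

print(mean, std)
```

The corrector does not follow any particular trajectory; it only nudges each sample toward the correct marginal at the current time, which is the principle stated above.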

### Outcome

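Higher sample quality than predictor-only solvers, with reported results that match or exceed GANs on image benchmarks; sampling is still slower than a GAN's single forward pass.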

---

## 10. Problem: No Exact Likelihood with Stochastic Sampling

### The solution

**Probability Flow ODE**

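Every diffusion SDE admits a deterministic ODE whose trajectories have the same marginals \( p_t(x) \) as the SDE. Substituting the learned score \( s_\theta(x,t) \) for \( \nabla_x \log p_t(x) \) gives: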

$$
dx = \left[f(x,t) - \frac{1}{2} g(t)^2 s_\theta(x,t)\right]dt
$$

Benefits:

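- Exact likelihood computation via the instantaneous change-of-variables formula
- Invertible generative model: each data point maps to a unique latent code
- Compatible with the neural ODE framework and off-the-shelf ODE solvers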
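The deterministic nature of the probability flow ODE can be checked on a toy case. The sketch below assumes a VP SDE with constant \( \beta \) and Gaussian data \( \mathcal{N}(2, 0.5^2) \), integrates the ODE backward from \( t = 1 \) to \( t = 0 \) with a hand-written RK4 solver, and compares the result with the closed-form map that holds for Gaussian marginals, where the standardized coordinate \( (x - m_t)/\sqrt{v_t} \) stays constant along each trajectory. All names and constants are illustrative assumptions.

```python
import math

BETA = 5.0       # constant beta for the VP SDE (assumption)
DATA_MEAN = 2.0  # toy data distribution N(2, 0.5^2)
DATA_STD = 0.5

def marginal_mean_var(t):
    decay = math.exp(-BETA * t)
    return DATA_MEAN * math.sqrt(decay), DATA_STD**2 * decay + (1.0 - decay)

def score(x, t):
    # Exact score of the Gaussian marginal p_t; stands in for s_theta(x, t).
    mean, var = marginal_mean_var(t)
    return -(x - mean) / var

def ode_rhs(x, t):
    # dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t)
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)

def integrate_probability_flow(x_end, t_end=1.0, n_steps=200):
    h = -t_end / n_steps  # negative step: integrate from t_end down to 0
    x, t = x_end, t_end
    for _ in range(n_steps):
        k1 = ode_rhs(x, t)
        k2 = ode_rhs(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = ode_rhs(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = ode_rhs(x + h * k3, t + h)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return x

def closed_form_map(x_end, t_end=1.0):
    # Valid only for this Gaussian toy case.
    m_end, v_end = marginal_mean_var(t_end)
    m_0, v_0 = marginal_mean_var(0.0)
    return m_0 + math.sqrt(v_0 / v_end) * (x_end - m_end)

latents = [-1.5, 0.0, 1.0]
decoded = [integrate_probability_flow(z) for z in latents]
expected = [closed_form_map(z) for z in latents]
print(decoded, expected)
```

Because the map is deterministic and invertible, integrating the same ODE forward from a decoded point returns its latent, which is what enables likelihood evaluation.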

### Outcome

Exact density estimation and continuous normalizing flows.

---

## 11. Problem: Inverse Problems Usually Require Retraining

### The solution

Bayesian inference in **score space**:

$$
\nabla_x \log p(x \mid y)
=
\nabla_x \log p(x)
+
\nabla_x \log p(y \mid x)
$$

Implications:

- Reuse unconditional score
- Modify the drift term
- Sample from the posterior directly
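A toy instance of this decomposition: with an assumed prior \( x \sim \mathcal{N}(0, 1) \) and measurement model \( y \mid x \sim \mathcal{N}(x, \tau^2) \), the posterior is Gaussian, so the summed score can be checked against the analytic posterior score and then plugged into Langevin dynamics unchanged. The prior, the measurement model, and all names below are illustrative assumptions.

```python
import math
import random

def prior_score(x):
    # Score of the unconditional prior N(0, 1).
    return -x

def likelihood_score(x, y, tau):
    # Gradient with respect to x of log N(y; x, tau^2).
    return (y - x) / tau**2

def posterior_score(x, y, tau):
    # Bayes' rule in score space: the evidence term has zero gradient in x.
    return prior_score(x) + likelihood_score(x, y, tau)

y, tau = 1.2, 0.5
post_mean = y / (1.0 + tau**2)
post_var = tau**2 / (1.0 + tau**2)

def analytic_posterior_score(x):
    return -(x - post_mean) / post_var

points = [-2.0, 0.0, 0.5, 3.0]
max_gap = max(abs(posterior_score(x, y, tau) - analytic_posterior_score(x)) for x in points)

# Langevin sampling from the posterior, reusing the unconditional prior score.
rng = random.Random(0)
step = 0.01
samples = []
for _ in range(300):
    x = rng.gauss(0.0, 1.0)
    for _ in range(500):
        x = x + step * posterior_score(x, y, tau) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    samples.append(x)
sample_mean = sum(samples) / len(samples)

print(max_gap, sample_mean, post_mean)
```

Only `likelihood_score` encodes the measurement process; swapping it for a different forward model changes the task without touching the prior score.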

Applications:

- MRI
- CT
- Inpainting
- Colorization

### Outcome

Inverse problem solving with **zero retraining** of the unconditional score model.

---

## 12. Problem: Score-Based Models Appeared Separate from Diffusion Models

### The solution

Unification with diffusion probabilistic models.

Key identity:

- The DDPM ELBO is equivalent, up to constants, to a **weighted score matching** objective in its denoising form
- Diffusion models discretize the same underlying SDE
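The link between the two parameterizations can be seen in a single line of algebra: for the DDPM perturbation \( x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon \), the conditional score equals \( -\epsilon/\sqrt{1 - \bar{\alpha}_t} \), so predicting the noise and predicting the score differ only by a known scale. The sketch below checks this numerically; the value of \( \bar{\alpha}_t \) and the variable names are illustrative assumptions.

```python
import math
import random

rng = random.Random(0)

alpha_bar = 0.3  # hypothetical cumulative noise-schedule product at some step t
x0 = 1.7         # arbitrary clean data point
eps = rng.gauss(0.0, 1.0)

# DDPM forward perturbation.
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

# Score of the Gaussian kernel N(x_t; sqrt(alpha_bar) * x0, 1 - alpha_bar).
conditional_score = -(x_t - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)

# The same quantity expressed through the noise that was added.
score_from_eps = -eps / math.sqrt(1.0 - alpha_bar)

print(conditional_score, score_from_eps)
```

Consequently a squared error on the noise equals \( (1 - \bar{\alpha}_t) \) times the squared error on the conditional score, which is the sense in which the noise-prediction loss is a weighted score matching loss.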

Conceptual analogy:

- Wave mechanics vs matrix mechanics

### Outcome

A single unified diffusion framework.

---

## 13. Remaining Open Challenges

| Challenge | Partial Solution | Status |
|---------|-----------------|--------|
| Sampling speed | Probability flow ODE, distillation | Open |
| Discrete data | Latent diffusion, autoencoders | Open |
| Memory and compute | Adaptive solvers | Improving |

---

## Final Synthesis

Each obstacle in this journey was resolved by applying a single unifying principle:

**Learn the geometry of probability via its score, then use physics (SDEs) to move through it.**

Diffusion models are therefore not merely generative models, but a **geometric, dynamical theory of probability itself**.
