# Intellectual Origins and Theoretical Foundations of Diffusion Models  
## (for Image and Video Generation)

Diffusion models did not arise abruptly. Rather, they are the culmination of a long intellectual trajectory that spans statistical physics, stochastic calculus, information theory, numerical analysis, and modern generative modeling. This section presents a coherent historical and theoretical synthesis of the key ideas that ultimately converged into contemporary diffusion and score-based models.

---

## 1. Noise and Brownian Motion: From Physics to Forward Diffusion

The foundational idea of progressively injecting noise originates in the physical theory of Brownian motion.

At the microscopic level, Brownian motion describes the seemingly random movement of particles suspended in a fluid.  
Albert Einstein provided the first rigorous statistical explanation of this phenomenon, demonstrating how macroscopic diffusion emerges from microscopic randomness.

Subsequently, the Fokker–Planck equation formalized how probability densities evolve over time under random motion. This equation describes the temporal dynamics of distributions rather than individual particle trajectories.

Paul Langevin introduced a crucial conceptual decomposition of motion:

$$
\text{Particle dynamics} = \text{deterministic drift} + \text{stochastic noise}
$$

This drift-plus-noise form is the template for what is now recognized as a **forward diffusion process**, in which structured states are gradually corrupted by noise.

Later, Kiyosi Itô developed the mathematical framework of **stochastic differential equations (SDEs)**, enabling rigorous treatment of random processes whose paths are almost surely non-differentiable.

Finally, Brian Anderson formalized **reverse-time stochastic processes**, proving that under specific conditions diffusion processes admit a mathematically well-defined time reversal.

### Impact on diffusion models

This lineage establishes the core structure of diffusion-based generation:

$$
\text{Forward diffusion: data} \rightarrow \text{noise}
$$

$$
\text{Reverse diffusion: noise} \rightarrow \text{data}
$$

This bidirectional stochastic process constitutes the mathematical backbone of modern diffusion models.
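
In the notation common in the score-based literature, the two directions can be written as a pair of SDEs. The symbols here are assumptions introduced for illustration and are not defined elsewhere in this section: \( f \) is a drift, \( g \) a scalar noise scale, \( w \) and \( \bar{w} \) forward-time and reverse-time Wiener processes, and \( p_t \) the marginal density at time \( t \).

$$
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w
\qquad \text{(forward)}
$$

$$
\mathrm{d}x = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w}
\qquad \text{(reverse)}
$$

The only quantity in the reverse equation that is not fixed by the forward process is \( \nabla_x \log p_t(x) \), which is why the remaining sections concentrate on it.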

---

## 2. Energy-Based Models: Learning Without Explicit Normalization

A second major intellectual stream arises from energy-based models (EBMs).

Geoffrey Hinton and collaborators popularized the idea of modeling data via an **energy function**, for example in Boltzmann machines, rather than an explicitly normalized probability density. In this formulation, probability is defined implicitly as

$$
p(x) \propto e^{-E(x)}
$$

The principal obstacle in training EBMs is the **partition function** \( Z = \int e^{-E(x)}\,\mathrm{d}x \), the normalization constant, which is typically intractable in high-dimensional spaces.

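A decisive step forward was taken by Yang Song, who shifted the modeling focus from the energy itself to its derivatives. Writing \( Z \) for the partition function, \( \log p(x) = -E(x) - \log Z \), and \( Z \) does not depend on \( x \). The gradient \( \nabla_x \log p(x) = -\nabla_x E(x) \) therefore contains no partition function, and it disappears entirely from the learning objective.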

### Impact on diffusion models

Score-based diffusion models inherit the expressive power of EBMs while eliminating their most severe computational limitation: intractable normalization.

---

## 3. Log-Probability and the Score Function: Fisher’s Insight

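A foundational contribution from statistics was made by Sir Ronald Fisher, who introduced:

- The log-likelihood  
- Its gradient with respect to the model parameters, known as the **score function**

Score-based generative models use the analogous gradient taken with respect to the data \( x \):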

$$
s(x) = \nabla_x \log p(x)
$$

This concept initiated a profound shift in perspective. Rather than learning the probability density directly, one can instead learn its **local geometry**, encoded in the score.

A key insight is that the score function is invariant to normalization constants. Consequently, it avoids the most intractable component of unnormalized probability models.
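
The invariance can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-dimensional double-well energy and a finite-difference score; the function names and the scale value `7.3` are illustrative choices, not part of any library.

```python
import numpy as np

def energy(x):
    # Hypothetical double-well energy, chosen only for illustration.
    return 0.25 * x**4 - 0.5 * x**2

def log_unnormalized(x, log_scale=0.0):
    # log of c * exp(-E(x)); log_scale = log(c) stands in for an unknown -log Z.
    return -energy(x) + log_scale

def numerical_score(log_density, x, h=1e-5):
    # Central finite-difference approximation of d/dx log p(x).
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

xs = np.linspace(-2.0, 2.0, 9)
score_a = numerical_score(lambda x: log_unnormalized(x, 0.0), xs)
score_b = numerical_score(lambda x: log_unnormalized(x, 7.3), xs)
analytic = -(xs**3 - xs)  # exact value of -dE/dx

# Largest difference between the two scores; only rounding error remains.
max_gap = np.max(np.abs(score_a - score_b))
```

Both finite-difference scores agree with each other and with the analytic gradient of \( -E(x) \), because the additive constant in the log-density cancels under differentiation.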

### Impact on diffusion models

Learning the score transforms density estimation into **vector field estimation**, a formulation that is far more tractable in high-dimensional spaces such as images and videos.

---

## 4. Stein’s Method: Distribution Characterization via Score-Based Operators  

Score matching as a learning principle rests on a deeper probabilistic insight: under suitable regularity conditions, **probability distributions can be uniquely characterized by differential operators involving their score functions**.

This insight originates with Charles Stein, who introduced **Stein’s method** in 1972 while studying rates of convergence in the Central Limit Theorem.

---

### Core Idea of Stein’s Method

Rather than comparing probability densities directly, Stein proposed comparing distributions through **operator identities** of the form

$$
\mathbb{E}_{p}\left[ T_p f(X) \right] = 0
$$

for all suitable test functions \( f \).

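For continuous distributions on \( \mathbb{R}^d \) and vector-valued test functions \( f \), the **Stein operator** \( T_p \) explicitly involves the score function:

$$
T_p f(x)
=
\nabla_x \log p(x)^{\top} f(x)
+
\nabla_x \cdot f(x)
$$

In one dimension this reduces to \( T_p f(x) = \frac{\mathrm{d}}{\mathrm{d}x} \log p(x)\, f(x) + f'(x) \).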

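This identity characterizes \( p \): under suitable regularity conditions, a random variable \( X \) satisfies \( \mathbb{E}\left[ T_p f(X) \right] = 0 \) for all suitable test functions \( f \) **if and only if**

$$
X \sim p .
$$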
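
The identity can be probed with a Monte Carlo estimate. The sketch below assumes a one-dimensional setting in which the operator reads \( s(x) f(x) + f'(x) \), with \( p \) the standard normal (score \( -x \)) and the test function \( f = \sin \); the helper name `stein_expectation` and the comparison distribution \( N(1, 1) \) are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def stein_expectation(samples, score, f, f_prime):
    # Monte Carlo estimate of E[ score(X) f(X) + f'(X) ] in one dimension.
    return np.mean(score(samples) * f(samples) + f_prime(samples))

def score_p(x):
    # Score of the standard normal p = N(0, 1).
    return -x

n = 200_000
x_from_p = rng.standard_normal(n)         # X ~ p
x_from_q = 1.0 + rng.standard_normal(n)   # X ~ q = N(1, 1), which is not p

# Same operator T_p in both cases; only the sampling distribution differs.
value_p = stein_expectation(x_from_p, score_p, np.sin, np.cos)
value_q = stein_expectation(x_from_q, score_p, np.sin, np.cos)
```

With samples from \( p \) the estimate is close to zero up to Monte Carlo error, while samples from the shifted distribution give a clearly nonzero value. A single test function can only detect some mismatches; the characterization requires the identity to hold over a sufficiently rich class of \( f \).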

---

### What Stein Did *Not* Do (Important Clarification)

Stein did **not** invent the score function.

The score function was already well established through Fisher’s likelihood theory.

Stein’s contribution was its **conceptual repurposing**:

- From parameter estimation  
- To distribution comparison, approximation, and convergence  

---

### Why Stein’s Method Matters for Diffusion Models

Stein’s method establishes a crucial theoretical fact:

**Under suitable regularity conditions, knowing the score function is sufficient to characterize a distribution, without ever computing its density.**

This result provides the probabilistic justification for:

- Comparing distributions via scores  
- Defining discrepancies based on score fields  
- Training models that never evaluate likelihoods  

---

### Impact on Diffusion Models

Stein’s method provides the theoretical bridge between:

- Fisher’s score  
- Hyvärinen’s score matching  
- Yang Song’s score-based generative modeling  

Seen through Stein’s operator-based perspective, **score matching gains a deeper probabilistic foundation**, and score-based diffusion models appear as a principled generative framework rather than a heuristic.

---

## 5. Fisher Divergence and Score Matching: Learning Without Ground Truth

The natural way to compare a model’s score with the true data score is the **Fisher divergence**, the expected squared distance between the two. Evaluating it directly is impossible, however, because the true data score is unknown.

Aapo Hyvärinen resolved this difficulty by introducing **score matching**. This method reformulates the Fisher divergence in a way that:

- Eliminates dependence on the unknown true score  
- Enables direct optimization using only data samples  
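
As a concrete illustration, in one dimension the reformulated objective takes the form \( J(\theta) = \mathbb{E}_{\text{data}}\left[ \tfrac{1}{2} s_\theta(x)^2 + s_\theta'(x) \right] \), which equals the Fisher divergence up to a constant independent of \( \theta \). The sketch below assumes a hypothetical linear score model \( s_\theta(x) = -a\,(x - b) \) and synthetic Gaussian data; the names `a`, `b`, and `score_matching_loss` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=100_000)  # synthetic samples

def score_matching_loss(a, b, x):
    # Model score s(x) = -a * (x - b), so its derivative ds/dx is -a.
    s = -a * (x - b)
    # Sample average of 0.5 * s(x)^2 + s'(x); no true score is needed.
    return np.mean(0.5 * s**2 - a)

# Setting the gradient of the loss to zero gives a closed-form minimizer.
b_hat = data.mean()
a_hat = 1.0 / np.mean((data - b_hat) ** 2)

loss_at_fit = score_matching_loss(a_hat, b_hat, data)
loss_perturbed = score_matching_loss(1.2 * a_hat, b_hat + 0.5, data)
```

The fitted values recover the mean and inverse variance of the data distribution, which are exactly the parameters of the true Gaussian score, even though that score was never used in the loss.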

Yang Song and Stefano Ermon later adopted score matching, in scalable forms such as denoising score matching, as the core training principle for score-based diffusion models.

### Impact on diffusion models

Score matching yields training procedures that are:

- Non-adversarial  
- Numerically stable  
- Scalable to extremely high-dimensional data  

This contrasts sharply with adversarial approaches, whose minimax objectives are often unstable to train.

---

## 6. Numerical Methods: Bridging SDEs and ODEs

Numerical methods are what make stochastic processes computationally practical.

The Euler–Maruyama method extended classical Euler integration to stochastic differential equations, enabling numerical simulation of diffusion processes.
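
A minimal sketch of the scheme follows, assuming an SDE of the form \( \mathrm{d}X = a(X, t)\,\mathrm{d}t + b(t)\,\mathrm{d}W \). The function `euler_maruyama`, the constant rate `beta`, and the Ornstein-Uhlenbeck drift used as a stand-in forward diffusion are all hypothetical choices for illustration.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t_end, n_steps, rng):
    # Simulate dX = drift(X, t) dt + diffusion(t) dW for an array of paths.
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        dw = rng.normal(0.0, np.sqrt(dt), size=x.shape)  # Brownian increment
        x = x + drift(x, t) * dt + diffusion(t) * dw
    return x

beta = 2.0  # hypothetical constant noise rate
rng = np.random.default_rng(0)
x0 = np.full(20_000, 2.0)  # every path starts at the same "data point"

x_end = euler_maruyama(
    x0,
    drift=lambda x, t: -0.5 * beta * x,
    diffusion=lambda t: np.sqrt(beta),
    t_end=1.0,
    n_steps=500,
    rng=rng,
)

# Closed-form mean and variance of this Ornstein-Uhlenbeck process at t = 1.
exact_mean = 2.0 * np.exp(-0.5 * beta * 1.0)
exact_var = 1.0 - np.exp(-beta * 1.0)
```

Comparing `x_end.mean()` and `x_end.var()` with `exact_mean` and `exact_var` gives a quick check that the discretization tracks the continuous process; the agreement improves as `n_steps` grows.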

Later theoretical developments revealed that every diffusion SDE has an associated deterministic **probability flow ordinary differential equation (ODE)**. These ODEs:

- Share the same marginal distributions as the SDEs  
- Enable likelihood computation via the change-of-variables formula, exact up to numerical integration error  
- Define invertible generative flows  
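
The deterministic route can be sketched on a toy problem where the score is known in closed form. The example assumes a forward SDE \( \mathrm{d}X = -\tfrac{1}{2}\beta X\,\mathrm{d}t + \sqrt{\beta}\,\mathrm{d}W \) and Gaussian data \( N(0, \sigma_0^2) \), so that every marginal stays Gaussian; the ODE form \( \mathrm{d}x/\mathrm{d}t = f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x) \) follows the score-based literature, and all names and constants are hypothetical.

```python
import numpy as np

beta = 2.0      # hypothetical constant noise rate of the forward SDE
sigma0 = 0.3    # hypothetical data standard deviation, data ~ N(0, sigma0^2)
t_end = 3.0

def marginal_var(t):
    # Variance of p_t for the forward SDE started from N(0, sigma0^2).
    return 1.0 + (sigma0**2 - 1.0) * np.exp(-beta * t)

def score(x, t):
    # Exact score of the Gaussian marginal p_t = N(0, marginal_var(t)).
    return -x / marginal_var(t)

def ode_velocity(x, t):
    # Probability flow ODE: dx/dt = f(x, t) - 0.5 * g(t)^2 * score(x, t).
    return -0.5 * beta * x - 0.5 * beta * score(x, t)

rng = np.random.default_rng(0)
x = rng.normal(0.0, np.sqrt(marginal_var(t_end)), size=20_000)  # x_T ~ p_T

n_steps = 2_000
dt = t_end / n_steps
for k in range(n_steps):
    t = t_end - k * dt
    x = x - ode_velocity(x, t) * dt  # explicit Euler step backward in time

generated_std = x.std()
```

No noise is injected during integration, yet the generated samples end up with a standard deviation close to `sigma0`: the ODE transports the terminal distribution back to the data distribution along deterministic, invertible paths. In a real model the closed-form `score` would be replaced by a learned network.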

### Impact on diffusion models

This establishes a unified framework that encompasses both:

- Stochastic sampling via SDEs  
- Deterministic generation via ODEs  

---

## 7. Langevin Dynamics: MCMC as a Generative Mechanism

Langevin dynamics connects gradients of the log-probability to Markov Chain Monte Carlo (MCMC) sampling.

It provides a principled method for sampling from a distribution using only its score. Within diffusion models, Langevin dynamics and closely related score-driven updates appear as:

- Annealed Langevin dynamics  
- Predictor–Corrector samplers  
- Reverse-time SDE solvers  
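
The basic update behind these samplers is \( x_{k+1} = x_k + \epsilon\, \nabla_x \log p(x_k) + \sqrt{2\epsilon}\, z_k \) with \( z_k \sim N(0, I) \). The sketch below is an unadjusted Langevin sampler under simplifying assumptions: a hypothetical Gaussian target with a known score, a fixed step size, and no annealing or Metropolis correction.

```python
import numpy as np

mu, sigma = 1.0, 0.5  # hypothetical Gaussian target N(mu, sigma^2)

def score(x):
    # Score of the target; the sampler never evaluates the density itself.
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=5_000)  # arbitrary initialization of 5000 chains
step = 0.01

for _ in range(1_000):
    noise = rng.standard_normal(x.shape)
    # Unadjusted Langevin update: drift along the score plus injected noise.
    x = x + step * score(x) + np.sqrt(2.0 * step) * noise

sample_mean, sample_std = x.mean(), x.std()
```

After enough iterations the chains forget their initialization and `sample_mean` and `sample_std` approach the target values. The fixed step size introduces a small discretization bias, which is one reason practical samplers anneal the step size or add a correction step.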

### Impact on diffusion models

Sampling becomes:

- Geometry-driven  
- Physically interpretable  
- Independent of explicit density evaluation  

---

## Unified Perspective

Modern diffusion models emerge from the convergence of five major scientific traditions:

- **Statistical physics**: Brownian motion, diffusion, and reversibility  
- **Stochastic calculus**: Itô SDEs and reverse-time dynamics  
- **Energy-based modeling**: Unnormalized probability representations  
- **Information geometry**: Score functions, Fisher divergence, and score matching  
- **Numerical analysis and MCMC**: Practical and scalable sampling algorithms  

---

## Final Synthesis

Diffusion models do not learn probability distributions directly.  
Instead, they learn **vector fields** that describe how probability mass flows through space and time.

This conceptual shift:

- Eliminates normalization constraints  
- Avoids adversarial training  
- Enables likelihood computation through the probability flow ODE  
- Supports solving inverse problems without retraining  
- Unifies EBMs, diffusion models, and continuous normalizing flows  

In this sense, diffusion models represent a **geometric and dynamical theory of generative modeling**, rather than a purely statistical one.
