Diffusion models have become the dominant paradigm for image generation and are extending to video, audio, 3D, molecules, and beyond. This post provides a technical grounding: what diffusion models actually learn, how sampling works, and how to think about the design choices that distinguish different variants.



## The Forward Process: Adding Noise

A diffusion model starts by defining a **forward process** that gradually destroys data by adding noise. Given a data point $x_0$, we produce a sequence of increasingly noisy versions $x_1, x_2, \ldots, x_T$ by iteratively adding Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

where $\beta_t$ is a "noise schedule" that controls how much noise is added at each step.

A key mathematical property: you can skip directly from $x_0$ to $x_t$ without simulating all intermediate steps:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$.

With a suitable schedule, $\bar{\alpha}_T \approx 0$, so almost nothing of $x_0$ survives and $x_T$ is approximately standard Gaussian noise. Training amounts to learning to reverse this process.
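
The closed-form jump from $x_0$ to $x_t$ is short enough to sketch directly. The following is a minimal illustration in plain Python rather than a reference implementation: the helper names (`linear_betas`, `alpha_bars`, `q_sample`), the schedule endpoints, and the three-element "data point" are all assumptions made for the example.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (endpoints are illustrative defaults)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed, x0 is a list of floats."""
    a = abar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [1.0, -2.0, 0.5]                  # toy data point
x_mid = q_sample(x0, 500, abar, rng)   # partially noised
x_end = q_sample(x0, 1000, abar, rng)  # essentially pure noise
print(abar[0], abar[-1])               # signal fraction at t=1 and t=T
```

Because `alpha_bars` is computed once up front, noising a training example at a random timestep costs a single multiply-add per element, which is what makes training on randomly sampled $t$ cheap.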



## Score Matching: Learning ∇log p(x)

Here's the core insight. If you knew the **score function** $\nabla_x \log p(x)$—the gradient of the log probability density—you could sample from $p(x)$ using Langevin dynamics:

$$x_{k+1} = x_k + \frac{\epsilon}{2} \nabla_x \log p(x_k) + \sqrt{\epsilon} z, \quad z \sim \mathcal{N}(0, I)$$

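Start from an arbitrary point (for example, noise), follow the score uphill toward higher probability, and add a bit of noise at each step so the chain explores rather than collapsing onto a mode. For a small step size $\epsilon$ and enough iterations, the iterates are approximately distributed according to $p(x)$.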
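
To see the update rule in action, here is a small sketch that applies it to a case where the score is known exactly: a 1D Gaussian. The target parameters, the step size, the chain counts, and the function names are assumptions chosen for illustration.

```python
import math
import random

def score_gaussian(x, mu=3.0, sigma=0.5):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_sample(score, x_init, rng, step=0.01, n_steps=500):
    """Run the Langevin update x <- x + (step/2) * score(x) + sqrt(step) * z."""
    x = x_init
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * score(x) + math.sqrt(step) * z
    return x

rng = random.Random(0)
# Independent chains, each started from standard Gaussian noise.
samples = [langevin_sample(score_gaussian, rng.gauss(0.0, 1.0), rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should land near mu = 3.0 and sigma^2 = 0.25
```

The finite step size introduces a small bias in the stationary variance, which is one reason practical samplers shrink the step or use a correction.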

The problem: we don't know $p(x)$, so we can't compute its gradient. But we can learn to approximate it.

**Score matching** trains a neural network $s_\theta(x)$ to predict $\nabla_x \log p(x)$ using the objective:

$$\mathbb{E}_{x \sim p} \left[ \| s_\theta(x) - \nabla_x \log p(x) \|^2 \right]$$

This looks circular (how do you compute the target?), but there's a trick: **denoising score matching** shows that learning to denoise is equivalent to learning the score.



## Denoising Score Matching: Learn to Denoise

Given clean data $x_0$, add noise to get $x_t$. The score of the conditional distribution $q(x_t | x_0)$ is available in closed form and points from $x_t$ back toward the (scaled) clean data:

$$\nabla_{x_t} \log q(x_t | x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t} x_0}{1 - \bar{\alpha}_t}$$

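This is proportional to the direction from the noisy $x_t$ to the scaled clean point $\sqrt{\bar{\alpha}_t} x_0$. The key result of denoising score matching is that a network regressed onto this conditional target, averaged over $x_0$, converges to the score of the marginal noisy distribution $q(x_t)$. So by training a network to predict this direction, you are implicitly learning the score.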

In practice, most implementations train a network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was added:

$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

where $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.

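Predicting noise and predicting the score are equivalent up to a scaling factor: substituting $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ into the conditional score gives $-\epsilon / \sqrt{1-\bar{\alpha}_t}$, so $s_\theta(x_t, t) \approx -\epsilon_\theta(x_t, t) / \sqrt{1-\bar{\alpha}_t}$. The noise target has unit variance at every timestep, whereas the score's magnitude blows up as $\bar{\alpha}_t \to 1$, which is one reason the noise-prediction framing tends to be better behaved in practice.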
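
The loss is an ordinary regression, which a toy example makes concrete. The sketch below is deliberately stripped down and rests on several assumptions: a single fixed noise level instead of a random $t$, 1D Gaussian data, and a one-parameter "network" $\epsilon_\theta(x_t) = w \, x_t$ fitted by plain gradient descent. For this setup the optimal $w$ is known in closed form, so the fit can be checked.

```python
import math
import random

rng = random.Random(0)
abar_t = 0.5     # fixed noise level (assumption: one timestep only)
data_std = 2.0   # toy data: x0 ~ N(0, data_std^2)

# Build (x_t, eps) training pairs with the closed-form forward process.
pairs = []
for _ in range(5000):
    x0 = rng.gauss(0.0, data_std)
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
    pairs.append((x_t, eps))

def loss_and_grad(w):
    """MSE of the noise prediction w * x_t, and its derivative with respect to w."""
    n = len(pairs)
    loss = sum((eps - w * x_t) ** 2 for x_t, eps in pairs) / n
    grad = sum(-2.0 * (eps - w * x_t) * x_t for x_t, eps in pairs) / n
    return loss, grad

w = 0.0
for _ in range(100):          # full-batch gradient descent
    _, grad = loss_and_grad(w)
    w -= 0.1 * grad

# For Gaussian data the minimizer is E[eps * x_t] / E[x_t^2].
w_star = math.sqrt(1.0 - abar_t) / (abar_t * data_std ** 2 + 1.0 - abar_t)
print(w, w_star)
```

A real model replaces `w * x_t` with a neural network that also takes $t$ as input, but the training signal is the same squared error against the sampled noise.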



## Noise Schedules: Linear, Cosine, Learned

The noise schedule $\beta_t$ (equivalently, $\bar{\alpha}_t$) determines how quickly information is destroyed during the forward process and how sampling progresses during generation.

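**Linear schedule**: The original DDPM paper used $\beta_t$ increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ with $T = 1000$. Simple, but it destroys information faster than necessary: the last portion of the forward process is already close to pure noise, so those steps contribute little, particularly at low resolutions.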

**Cosine schedule**: Designed so that $\bar{\alpha}_t$ follows a squared-cosine curve, keeping more information at intermediate timesteps. This often produces better samples, with the clearest gains reported on low-resolution images.

**Learned schedules**: Some work optimizes the schedule jointly with the model. The optimal schedule depends on the data distribution—images, audio, and molecules may want different curves.

**Continuous-time formulations**: Instead of discrete steps $t = 1, \ldots, T$, some formulations use continuous time $t \in [0, 1]$. This enables more flexible sampling strategies and cleaner theoretical analysis.

The schedule is a hyperparameter that meaningfully affects sample quality. Getting it wrong can make sampling slow (too many steps needed) or produce artifacts (information destroyed at wrong rates).
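
The difference between the linear and cosine schedules is easiest to see by comparing $\bar{\alpha}_t$ directly. The sketch below assumes the DDPM linear endpoints and the cosine form $\bar{\alpha}_t = f(t)/f(0)$ with $f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and offset $s = 0.008$; the function names are made up for the example.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for linearly increasing betas."""
    abar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        abar.append(prod)
    return abar

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t defined directly by a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
lin_ab = linear_alpha_bar(T)
cos_ab = cosine_alpha_bar(T)
for t in (250, 500, 750):
    # Fraction of signal variance remaining at timestep t under each schedule.
    print(t, round(lin_ab[t - 1], 4), round(cos_ab[t - 1], 4))
```

At the midpoint the cosine curve retains far more signal than the linear one, which is the "keeps more information at intermediate timesteps" behaviour described above. When per-step $\beta_t$ values are recovered from a cosine $\bar{\alpha}_t$, implementations usually clip them below 1 to avoid a singular final step.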



## Sampling: DDPM, DDIM, and Beyond

Once trained, how do you actually generate samples?

**DDPM sampling**: The original approach reverses the forward process step by step. At each step, use the predicted noise to estimate the mean of the slightly less noisy image, then add a small amount of fresh noise. The original setup uses $T = 1000$ steps with one network evaluation each, which makes sampling slow.

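$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

Here $z \sim \mathcal{N}(0, I)$ (with no noise added on the final step) and $\sigma_t^2$ is the reverse-process variance; DDPM uses either $\beta_t$ or the posterior variance $\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$.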
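
The update can be exercised end to end without a trained network by using a toy case where the ideal noise prediction is known. The sketch below assumes a "dataset" consisting of a single scalar point, for which the exact noise is $(x_t - \sqrt{\bar{\alpha}_t}\, x_0)/\sqrt{1-\bar{\alpha}_t}$; `oracle_eps` stands in for $\epsilon_\theta$, and the choice $\sigma_t^2 = \beta_t$ and the linear schedule are further assumptions.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

TARGET = 1.5  # toy dataset: a single point

def oracle_eps(x_t, t):
    """Exact noise for a one-point dataset; a stand-in for a trained eps_theta."""
    a = abar[t - 1]
    return (x_t - math.sqrt(a) * TARGET) / math.sqrt(1.0 - a)

def ddpm_step(x_t, t, eps_pred, rng):
    """One reverse update x_t -> x_{t-1} (t is 1-indexed), with sigma_t^2 = beta_t."""
    beta_t = betas[t - 1]
    mean = (x_t - beta_t / math.sqrt(1.0 - abar[t - 1]) * eps_pred) / math.sqrt(1.0 - beta_t)
    sigma_t = math.sqrt(beta_t) if t > 1 else 0.0  # no noise on the final step
    return mean + sigma_t * rng.gauss(0.0, 1.0)

def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):      # T network evaluations in total
        x = ddpm_step(x, t, oracle_eps(x, t), rng)
    return x

sample = ddpm_sample(random.Random(0))
print(sample)  # should recover the single data point
```

With a learned $\epsilon_\theta$ the loop is identical; only `oracle_eps` changes. The loop also makes the cost visible: one model call per timestep.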

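**DDIM (Denoising Diffusion Implicit Models)**: Defines a non-Markovian process with the same marginals $q(x_t | x_0)$, so a network trained with the DDPM loss can be reused with a deterministic update. In the continuous-time view, this corresponds to integrating an ODE rather than simulating an SDE. This allows: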
- Deterministic sampling (same noise → same output)
- Fewer steps (50-100 instead of 1000)
- Interpolation in latent space
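
A deterministic DDIM update can be sketched as "predict $x_0$, then re-noise to the earlier timestep using the same predicted noise". The example below is an illustration under assumptions, not the paper's code: it uses a linear schedule, an evenly strided 50-step timestep subsequence, and the same one-point toy dataset trick to obtain an exact noise predictor (`oracle_eps`).

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

TARGET = 1.5  # toy dataset: a single point

def oracle_eps(x_t, t):
    """Exact noise for a one-point dataset; a stand-in for a trained eps_theta."""
    a = abar[t - 1]
    return (x_t - math.sqrt(a) * TARGET) / math.sqrt(1.0 - a)

def ddim_step(x_t, t, t_prev, eps_pred):
    """Deterministic DDIM update from t to t_prev < t (t_prev = 0 means clean data)."""
    a_t = abar[t - 1]
    a_prev = abar[t_prev - 1] if t_prev > 0 else 1.0
    x0_pred = (x_t - math.sqrt(1.0 - a_t) * eps_pred) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps_pred

def ddim_sample(x_T, n_steps=50):
    stride = T // n_steps
    timesteps = list(range(T, 0, -stride))          # 1000, 980, ..., 20
    x = x_T
    for t, t_prev in zip(timesteps, timesteps[1:] + [0]):
        x = ddim_step(x, t, t_prev, oracle_eps(x, t))
    return x

out_a = ddim_sample(0.7)
out_b = ddim_sample(0.7)
print(out_a, out_a == out_b)  # no randomness after the initial noise is fixed
```

No random numbers are drawn inside the loop, so the output is a function of the starting noise alone. In this toy setup every starting point maps to the single data point; with a real model, different starting noise yields different images, which is what makes latent interpolation possible.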

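**ODE solvers**: The continuous-time limit of the forward process is an SDE, and there is a deterministic "probability flow" ODE that shares its marginal distributions. Sampling by integrating that ODE lets you use standard numerical solvers (Euler, Heun, or adaptive Runge-Kutta methods such as RK45). Higher-order solvers typically reach a given quality with fewer function evaluations.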

**Distillation**: Train a student model to reproduce the teacher's output in fewer steps. Progressive distillation trains each student to match two teacher steps with one, halving the step count per round (for example 1024 → 512 → 256 → ... → 4).

The practical upshot: modern diffusion models generate high-quality images in 20-50 steps, not the 1000 originally required.



## Classifier-Free Guidance

How do you control what a diffusion model generates? One answer: **guidance**.

**Classifier guidance** uses an external classifier $p(y|x_t)$, trained on noisy inputs, to steer generation toward class $y$. During sampling, modify the score:

$$\tilde{\nabla} \log p(x_t | y) = \nabla \log p(x_t) + \gamma \nabla \log p(y | x_t)$$

The first term pushes toward high-probability images, the second toward images the classifier assigns to class $y$. The weight $\gamma$ controls how strongly to follow the classifier.
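
It may help to see where the formula comes from. For $\gamma = 1$ it is just Bayes' rule in log form, differentiated with respect to $x_t$:

$$\log p(x_t | y) = \log p(x_t) + \log p(y | x_t) - \log p(y)$$

$$\nabla_{x_t} \log p(x_t | y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y | x_t)$$

The $\log p(y)$ term drops out because it does not depend on $x_t$. Choosing $\gamma > 1$ scales the classifier term, which corresponds to sampling from a sharpened distribution proportional to $p(x_t)\, p(y | x_t)^{\gamma}$.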

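**Classifier-free guidance** eliminates the separate classifier. Instead, train the diffusion model with dropout on the conditioning: for a random fraction of training examples, replace the conditioning with a null input, so one network learns both the conditional and the unconditional prediction. At sampling time, combine the two: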

$$\tilde{\epsilon} = \epsilon_\theta(x_t, t, \emptyset) + \gamma \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right)$$

where $c$ is the conditioning (text prompt, class label) and $\emptyset$ is the unconditional case.

With $\gamma > 1$, you extrapolate past the conditional prediction, making samples more aligned with the prompt at the cost of diversity ($\gamma = 1$ is plain conditional sampling, $\gamma = 0$ is unconditional). This is the "guidance scale" setting exposed by Stable Diffusion and many other text-to-image tools.
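
Both halves of classifier-free guidance are only a few lines each. The sketch below uses plain lists in place of tensors; `NULL_COND`, `maybe_drop_condition`, `cfg_combine`, and the 10% dropout rate are hypothetical names and values chosen for illustration.

```python
import random

NULL_COND = None  # hypothetical stand-in for the "empty" conditioning

def maybe_drop_condition(cond, p_uncond, rng):
    """Training time: replace cond with the null input with probability p_uncond."""
    return NULL_COND if rng.random() < p_uncond else cond

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Sampling time: eps_uncond + scale * (eps_cond - eps_uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions from the same network with and without conditioning.
eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
print(cfg_combine(eps_u, eps_c, 0.0))  # unconditional prediction
print(cfg_combine(eps_u, eps_c, 1.0))  # conditional prediction
print(cfg_combine(eps_u, eps_c, 2.0))  # extrapolated past the conditional one

rng = random.Random(0)
conds = [maybe_drop_condition("a photo of a cat", 0.1, rng) for _ in range(10000)]
dropped = sum(c is NULL_COND for c in conds) / len(conds)
print(dropped)  # fraction of examples trained unconditionally
```

Note the cost implication: each guided sampling step needs two network evaluations (conditional and unconditional), usually batched together.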



## Latent Diffusion: Stable Diffusion Architecture

Running diffusion directly on high-resolution images is expensive. A 512×512 RGB image has 512 × 512 × 3 ≈ 786K dimensions, and the network has to process a tensor of that size at every sampling step.

**Latent diffusion** compresses images first:

1. Train a variational autoencoder (VAE) with encoder $\mathcal{E}$ and decoder $\mathcal{D}$
2. Encode images to latent space: $z = \mathcal{E}(x)$, typically 64×64×4 (~16K dims)
3. Train the diffusion model in latent space
4. At generation time, sample $z$, then decode $x = \mathcal{D}(z)$

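The VAE handles pixel-level detail; the diffusion model handles semantic structure. For a 512×512 image, the 64×64×4 latent has 48× fewer elements than the pixel grid (16,384 vs. 786,432), which makes each network evaluation much cheaper. The actual speedup depends on the architecture.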

Stable Diffusion is a latent diffusion model with:
- A pretrained VAE for image compression
- A U-Net architecture for the diffusion model
- A CLIP text encoder for conditioning
- Cross-attention to inject text embeddings into the U-Net



## Applications Beyond Images

The diffusion framework is surprisingly general. Anything you can add Gaussian noise to, you can learn to denoise.

**Video generation**: Add temporal dimensions. Challenges include maintaining consistency across frames and scaling to longer durations. Sora, Runway, and others are pushing this frontier.

**Audio**: Diffusion over spectrograms or waveforms. Used for music generation, speech synthesis, and sound effects.

**3D**: Diffusion over point clouds, neural radiance fields, or mesh representations. Enables text-to-3D generation.

**Molecules**: Diffusion over 3D atomic coordinates. AlphaFold 3 uses diffusion for structure prediction. Drug discovery applies diffusion to generate novel molecules with desired properties.

**Robotics**: Diffusion for trajectory planning. Sample diverse trajectories, then select or refine. Useful when you want multimodal predictions (multiple valid ways to accomplish a task).

**Text**: More challenging because text is discrete, but recent work maps discrete tokens to continuous embeddings, runs diffusion there, and maps back.



## Conceptual Summary

The diffusion framework reduces generative modeling to a simple idea:

1. Define a process that destroys data by adding noise
2. Learn to reverse that process by predicting the noise
3. Sample by starting from noise and iteratively denoising

The magic is that predicting noise is a well-defined regression problem (unlike trying to model $p(x)$ directly), and the learned denoiser implicitly captures gradient information about the data distribution.

All the variants—different noise schedules, sampling methods, conditioning strategies, latent spaces—are refinements of this core loop. Once you understand the basic mechanism, the rest is engineering.





```{=html}
<div style="text-align:center;">
  <img src="image.png" alt="Figure" width="65%"/>
  <p><em>Figure 1. The diffusion process: adding noise (forward) and learning to denoise (reverse)</em></p>
</div>
```

