# Forward Process

## What is the forward process?

The basic idea behind diffusion models is to take an image and progressively destroy it over a series of timesteps. At each step, we add a small amount of Gaussian noise to the image. By the end of the sequence, we are left with an image that is indistinguishable from pure Gaussian noise.

![](images/diffusion-forward-process.png)

This is called the **forward process**. 

At each step, we apply the following transition function to move from $x_{t-1}$ to $x_t$, where $\beta \in (0, 1)$ controls how much noise is added:

$$
\Huge
q(x_t | x_{t-1}) := \mathcal{N}(x_{t}; \sqrt{1 - \beta} x_{t-1}, \beta \mathbf{I})
$$

Our goal during training is to learn to reverse this process:

1. Start with a random image ($x_T$).
2. At each step, predict the image at the previous timestep ($x_{t-1}$) from the current one ($x_t$).
3. Repeat until we arrive back at the original image ($x_0$).

We'll go into more detail on the reverse process in the next lesson.

### PyTorch implementation

In [1]:
import torch

def q(x, beta):
    # Sample x_t ~ N(sqrt(1 - beta) * x, beta * I). This is equivalent to the formula above;
    # see the reparameterization trick below for more details.
    noise = torch.randn_like(x)  # Same shape, dtype, and device as x
    return torch.sqrt(1 - beta) * x + torch.sqrt(beta) * noise

#### Example: Forward process on an image

In [2]:
import ipywidgets as widgets
from IPython.display import display
import numpy as np
import torch
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'

img = Image.open('data/mandrill.png').convert('RGB')
x = np.array(img).astype(np.float32) / 255
x = torch.tensor(x, device=device)
x = 2 * x - 1  # Normalize to [-1, 1]

frames = [{'img': img, 't': 0}]
steps = 20
beta = torch.tensor(0.25, device=device)

for t in range(1, steps + 1):
    x = q(x, beta)
    img = x.cpu().numpy()
    img = (img + 1) / 2 # Denormalize to [0, 1]
    img = img.clip(0, 1)
    img = (img * 255).astype(np.uint8)
    img = Image.fromarray(img)
    frames.append({'img': img, 't': t})

# Set up an interactive display widget
def show_frame(frame_index):
    frame = frames[frame_index]
    img = frame['img']
    display(img)

# Slider widget
slider = widgets.IntSlider(value=0, min=0, max=len(frames)-1, step=1, description="T")
play_button = widgets.Play(value=0, min=0, max=len(frames)-1, step=1, interval=200)

# Link play button and slider
widgets.jslink((play_button, 'value'), (slider, 'value'))

# Combine play button and slider in the interactive display
interactive_display = widgets.interactive(show_frame, frame_index=slider)
display(widgets.HBox([play_button, interactive_display.children[0]]))  # Attach slider from interactive
display(interactive_display.children[1])  # Display the frame itself


## Why is this process called diffusion?

Diffusion is a physical process where particles spread out from areas of high concentration to areas of low concentration.

An example of this might be adding a drop of ink to a glass of water. The ink will spread out over time until it is evenly distributed throughout the water.


<iframe width="800" height="450" src="https://www.youtube.com/embed/prSgVi8WyjQ" title="Ink Drop/Drip in water 60fps_08 - Free HD Stock Footage" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Note that entropy is increasing in this process. The ink is spreading out, and the system is becoming more disordered.

From an information-theoretic perspective, the distribution of real images has much lower entropy than random noise. The forward process in diffusion models increases the entropy of the image, mimicking the process of diffusion in the physical world.
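
As a rough numerical illustration, the sketch below tracks a histogram-based estimate of Shannon entropy while noise is added to a toy two-valued "image". The toy image, the binning range, and the helper names `forward_step` and `histogram_entropy` are assumptions made for this example. The entropy of the pixel-value histogram is only a crude proxy for the entropy of the image distribution.

```python
import torch

def forward_step(x, beta):
    # One forward-process step: scale the input and add Gaussian noise
    return torch.sqrt(1 - beta) * x + torch.sqrt(beta) * torch.randn_like(x)

def histogram_entropy(x, bins=64, lo=-4.0, hi=4.0):
    # Shannon entropy (in bits) of the binned pixel-value histogram
    counts = torch.histc(x.clamp(lo, hi), bins=bins, min=lo, max=hi)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # Skip empty bins to avoid log(0)
    return -(probs * torch.log2(probs)).sum().item()

torch.manual_seed(0)
# Toy "image" (illustrative only): left half is -1, right half is +1
x = torch.cat([-torch.ones(64, 32), torch.ones(64, 32)], dim=1)
beta = torch.tensor(0.25)

entropies = [histogram_entropy(x)]
for _ in range(20):
    x = forward_step(x, beta)
    entropies.append(histogram_entropy(x))

print(f"entropy at t=0:  {entropies[0]:.2f} bits")
print(f"entropy at t=20: {entropies[-1]:.2f} bits")
```

The structured input occupies only two histogram bins, while the noised result spreads across many, so the estimated entropy rises as the forward process runs.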

## What are diffusion processes?

Here are several different ways to think about diffusion processes:

### 1. Stochastic Process

> In probability theory and statistics, diffusion processes are a class of continuous-time Markov process with almost surely continuous sample paths. Diffusion process is stochastic in nature and hence is used to model many real-life stochastic systems.

["Diffusion process", Wikipedia](https://en.wikipedia.org/wiki/Diffusion_process)

Let's break this down:

- **Continuous-time** means the process is happening continuously, not in discrete steps. Note that in the actual implementation of diffusion models, we observe the process at discrete timesteps. But the underlying process is continuous, and we can think of the discrete observations as samples from the continuous process.

- **Markov process** means the future state of the system depends only on the current state, not on how we got there. In diffusion models, the image at the next timestep depends only on the image at the current timestep:
$$
\Huge
P(X_{t+1} | X_t, X_{t-1}, \dots, X_0) = P(X_{t+1} | X_t)
$$

- **Stochastic** means the process involves randomness. In the case of diffusion models, we sample random noise from a Gaussian distribution and add it to the image at each timestep.

### 2. Statistics context

> In the context of statistics, meaning of Diffusion is quite similar, i.e. the process of transforming a complex distribution $p_{\text{complex}}$ on $\mathbb{R}^d$ to a simple (predefined) distribution $p_{\text{prior}}$ on the same domain. 
> 
> [Under certain conditions,] repeated application of a transition kernel $q(\mathbf{x} | \mathbf{x}')$ on the samples of *any* distribution would lead to samples from $p_{\text{prior}}$.

["An introduction to Diffusion Probabilistic Models", Ayan Das](https://ayandas.me/blogs/2021-12-04-diffusion-prob-models.html)

In our diffusion process, a set of real images can be seen as samples from a complex, high-dimensional distribution. By the end of the diffusion process, we have a set of images that are samples from a simple Gaussian distribution.

The transition kernel refers to the function $q(x_t | x_{t-1})$ that we apply at each step of the diffusion process. 
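
To see why repeated application of this kernel ends up at a standard Gaussian, we can unroll the recursion. This assumes a constant $\beta$, as in the examples here, and independent noise $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$ at each step:
$$
\Huge
\begin{aligned}
    x_t &= \sqrt{1 - \beta}\, x_{t-1} + \sqrt{\beta}\, \epsilon_t \\
    &= (1 - \beta)^{t/2}\, x_0 + \sqrt{1 - (1 - \beta)^t}\, \epsilon
\end{aligned}
$$

Here $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ collects the noise from all $t$ steps, whose variances sum to $\beta \sum_{k=0}^{t-1} (1 - \beta)^k = 1 - (1 - \beta)^t$. For any $0 < \beta < 1$, $(1 - \beta)^t \to 0$ as $t$ grows, so the contribution of $x_0$ vanishes and $x_t$ approaches $\mathcal{N}(0, \mathbf{I})$ regardless of the starting distribution. For instance, with $\beta = 0.5$ and $t = 40$, the factor on $x_0$ is $0.5^{20} \approx 10^{-6}$.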

#### Example: diffusion with a 1D mixture of Gaussians


In [3]:
from IPython.display import HTML
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import numpy as np
import torch
from scipy.stats import norm

# Define parameters for our mixture of Gaussians
weights = torch.tensor([0.5, 0.3, 0.2], dtype=torch.float32)
means = torch.tensor([0, 5, -3], dtype=torch.float32)
stds = torch.tensor([1, 0.5, 1.5], dtype=torch.float32)

# Construct Gaussian mixture model with torch distributions
mix = torch.distributions.Categorical(probs=weights)
components = torch.distributions.Normal(loc=means, scale=stds)
gmm = torch.distributions.MixtureSameFamily(mix, components)

# Sample from the mixture of Gaussians
num_samples = 10000
samples = gmm.sample((num_samples,))


# Plot histogram of samples
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_title("Example Mixture of Gaussians (t = 0)")
ax.set_xlabel("x")
ax.set_ylabel("Density")
ax.set_xlim(-10, 10)
ax.set_ylim(0, 0.5)
ax.grid(True)
# Use fixed bin edges so the bars line up with the histogram in every frame
bins = np.linspace(-10, 10, 51)

hist_values, bin_edges = np.histogram(samples.numpy(), bins=bins, density=True)
samples_t = [(hist_values, bin_edges)]
bars = ax.bar(bin_edges[:-1], hist_values, width=bin_edges[1] - bin_edges[0], align='edge', color='blue', alpha=0.7)

# Calculate the x values for the ideal normal distribution line
x = np.linspace(min(bin_edges), max(bin_edges), 500)

# Density of the standard normal distribution, which the samples should converge to
normal_density = norm.pdf(x)
ax.plot(x, normal_density, 'k-', lw=2, label="Normal Distribution")
ax.legend()

for rect in bars.patches:
    rect.set_animated(True)

beta = torch.tensor(0.5)
num_steps = 40
for i in range(num_steps):
    samples = q(samples, beta)
    hist_values, bin_edges = np.histogram(samples.numpy(), bins=bins, density=True)
    samples_t.append((hist_values, bin_edges))

def init():
    return bars

def update(i):
    hist_values, bin_edges = samples_t[i]
    for bar, h in zip(bars, hist_values):
        bar.set_height(h)
    ax.set_title(f"Example Mixture of Gaussians (t = {i})")
    return bars

# Create the animation
ani = FuncAnimation(fig, update, frames=len(samples_t), init_func=init, blit=True, interval=100)
plt.close(fig)  # Close the figure to prevent static display
HTML(ani.to_jshtml())

### 3. Stochastic Differential Equations (SDEs)

> In stochastic processes, a **diffusion process** satisfies a stochastic differential equation of the form:
> $$
\Huge
dX_t = a(X_t, t)\,dt + b(X_t, t)\,dW_t,
$$
> where $a(X_t, t)$ represents the deterministic drift term, and $b(X_t, t)\,dW_t$ is the stochastic term, with $dW_t$ being an increment of a Wiener process with mean zero and variance $dt$.

["Stochastic Differential Equations and Diffusion Processes," N. Ikeda and S. Watanabe](https://api.pageplace.de/preview/DT0400.9781483296159_A23889268/preview-9781483296159_A23889268.pdf)

In other words, diffusion processes follow equations of the form:
$$
\Huge
\text{change in } X = \text{deterministic term} + \text{stochastic term}
$$

Going back to our original definition for the transition function:
$$
\Huge
q(x_t | x_{t-1}) := \mathcal{N}(x_{t}; \sqrt{1 - \beta} x_{t-1}, \beta \mathbf{I})
$$

We can apply the [reparameterization trick](https://dilithjay.com/blog/the-reparameterization-trick-clearly-explained):
$$
\Huge
z \sim \mathcal{N}(\mu, \sigma^2) \implies z = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)
$$ 

Then we can rewrite this as:
$$
\Huge
x_t = \sqrt{1 - \beta} x_{t-1} + \sqrt{\beta} \mathcal{N}(0, \mathbf{I})
$$
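
As a sanity check on this rewrite, the sketch below draws samples in two ways and compares their statistics: directly from $\mathcal{N}(\sqrt{1 - \beta}\, x_{t-1}, \beta)$ using `torch.distributions.Normal`, and via the reparameterized form. The value of `x_prev` and the sample count are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
beta = torch.tensor(0.25)
# The same x_{t-1} repeated many times so we can estimate statistics (illustrative value)
x_prev = torch.full((100_000,), 0.8)

# Direct sampling from N(sqrt(1 - beta) * x_prev, beta)
mean = torch.sqrt(1 - beta) * x_prev
std = torch.sqrt(beta)
direct = torch.distributions.Normal(loc=mean, scale=std).sample()

# Reparameterized sampling: mu + sigma * epsilon, with epsilon ~ N(0, 1)
epsilon = torch.randn_like(x_prev)
reparam = mean + std * epsilon

print(f"direct:  mean={direct.mean().item():.3f}, std={direct.std().item():.3f}")
print(f"reparam: mean={reparam.mean().item():.3f}, std={reparam.std().item():.3f}")
```

Both sets of samples should have a mean near $\sqrt{1 - \beta}\, x_{t-1}$ and a standard deviation near $\sqrt{\beta}$, up to sampling error.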

Our change in $x$ is then:
$$
\Huge
\begin{aligned}
    x_t - x_{t-1} &= \sqrt{1 - \beta} x_{t-1} + \sqrt{\beta} \mathcal{N}(0, \mathbf{I}) - x_{t-1} \\
    &= (\sqrt{1 - \beta} - 1) x_{t-1} + \sqrt{\beta} \mathcal{N}(0, \mathbf{I}) \\
\end{aligned}
$$

Using this form, we can see that our function is a combination of:
$$
\Huge
\begin{aligned}
    (\sqrt{1 - \beta} - 1) x_{t-1} \qquad &\text{(Deterministic term)} \\
    \sqrt{\beta} \mathcal{N}(0, \mathbf{I}) \qquad &\text{(Stochastic term)}
\end{aligned}
$$
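
One way to check this decomposition numerically is to apply a single step many times to the same $x_{t-1}$ and look at the increments $x_t - x_{t-1}$. Their mean should approach the deterministic term, and their standard deviation should approach $\sqrt{\beta}$. This is a minimal sketch, with `x_prev`, `beta`, and `num_trials` chosen arbitrarily:

```python
import torch

torch.manual_seed(0)
beta = torch.tensor(0.25)
x_prev = torch.tensor([-1.0, 0.0, 0.5, 1.0])  # A few fixed "pixel" values (illustrative)
num_trials = 200_000

# One forward step per trial; x_prev broadcasts across the trial dimension
noise = torch.randn(num_trials, x_prev.numel())
x_next = torch.sqrt(1 - beta) * x_prev + torch.sqrt(beta) * noise
increments = x_next - x_prev

deterministic_term = (torch.sqrt(1 - beta) - 1) * x_prev

print("empirical mean of increments:", increments.mean(dim=0))
print("deterministic term:          ", deterministic_term)
print("empirical std of increments: ", increments.std(dim=0))
print("sqrt(beta):                  ", torch.sqrt(beta))
```

The deterministic term pulls every value toward zero in proportion to its size, while the stochastic term has the same spread for every value.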
