In [None]:
#@title 🎧 Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1K3QJmxvgc0_ZwXawSeV_oQll-6WeibaO", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/01_00_intro.mp3"))


In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

In [None]:
#@title 🎧 Listen: Why Matter Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_02_why_matter_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


# Forward Diffusion Process - Vizuara

## 1. Why Does This Matter?

Generative models have transformed AI, but how do we train a model to create images from pure noise? The key insight of diffusion models is that if we can learn how noise is systematically added to images, we can learn to reverse the process.

In this notebook, we will build the **forward diffusion process** from scratch. This is the foundation of all diffusion models: the process that takes clean images and gradually transforms them into pure Gaussian noise.

By the end of this notebook, you will:
- Understand why Gaussian noise is the right choice for corruption
- Implement the step-by-step forward process
- Derive and implement the closed-form "skip to any timestep" formula
- Visualize how images are destroyed at different noise levels
- Build the noise schedule that controls the rate of corruption

Let us begin!

## 2. Building Intuition

Let us start with a simple analogy. Imagine you have a cup of black coffee. You add a single drop of milk. The coffee looks almost the same. Add another drop: still barely noticeable. But if you keep adding drops, eventually the coffee becomes uniformly light brown. The original "structure" of pure black coffee has been completely destroyed.

This is exactly what happens in forward diffusion. We add tiny amounts of random noise at each step. After enough steps, the original image is completely unrecognizable, replaced by pure random noise.

Let us visualize this with actual images. First, let us set up our environment.

In [None]:
#@title 🎧 Narration: Setup Packages Data
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_03_setup_packages_data.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Install required packages
!pip install torch torchvision matplotlib numpy -q

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Load MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Normalize to [-1, 1]
])

dataset = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform
)

# Grab a single image
sample_image = dataset[0][0]  # Shape: [1, 28, 28]
print(f"Image shape: {sample_image.shape}")
print(f"Pixel range: [{sample_image.min():.2f}, {sample_image.max():.2f}]")

In [None]:
#@title 🎧 What to Look For: Visualize Clean Image
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_04_visualize_clean_image.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


Let us look at our clean image before we start adding noise.

In [None]:
plt.figure(figsize=(3, 3))
plt.imshow(sample_image.squeeze(), cmap='gray')
plt.title('Original Image (x₀)')
plt.axis('off')
plt.show()

In [None]:
#@title 🎧 Listen: Math Step By Step Equation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_05_math_step_by_step_equation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 3. The Mathematics

Now let us understand the mathematics behind the forward process.

At each time step $t$, we take the image from step $t-1$ and add noise according to:

$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

This means: sample $x_t$ from a Gaussian with:
- **Mean:** $\sqrt{1-\beta_t} \cdot x_{t-1}$ (scaled-down version of previous image)
- **Variance:** $\beta_t$ (amount of noise)

Let us plug in numbers. If pixel value $x_{t-1} = 0.8$ and $\beta_t = 0.01$:
- Mean = $\sqrt{0.99} \times 0.8 = 0.995 \times 0.8 = 0.796$
- Std = $\sqrt{0.01} = 0.1$
- So $x_t \sim \mathcal{N}(0.796, 0.01)$, where the second argument is the variance

The mean shrinks slightly, and noise is added. After many steps, the signal disappears.
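The worked numbers above can be checked by sampling. The following is a minimal sketch in plain Python (standard library only, so it runs independently of the notebook state). The helper name `sample_one_step`, the seed, and the sample count are illustrative choices, not part of the notebook.

```python
import math
import random

def sample_one_step(x_prev, beta_t, rng):
    """Draw one sample from q(x_t | x_{t-1}) for a single scalar pixel."""
    mean = math.sqrt(1.0 - beta_t) * x_prev  # slightly shrunk previous value
    std = math.sqrt(beta_t)                  # beta_t is the variance
    return rng.gauss(mean, std)

rng = random.Random(0)  # fixed seed so the check is repeatable
num_samples = 50_000
samples = [sample_one_step(0.8, 0.01, rng) for _ in range(num_samples)]

# Empirical mean and standard deviation of the sampled x_t values
empirical_mean = sum(samples) / num_samples
empirical_var = sum((s - empirical_mean) ** 2 for s in samples) / num_samples
empirical_std = math.sqrt(empirical_var)

print(f"empirical mean: {empirical_mean:.3f}")
print(f"empirical std:  {empirical_std:.3f}")
```

With this many samples, the empirical mean should land close to 0.796 and the empirical standard deviation close to 0.1.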

**The key trick:** We define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

Then we get the closed-form formula:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \cdot \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \cdot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$$

This is a massive shortcut: we can jump directly from $x_0$ to any $x_t$ without iterating through all intermediate steps!
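To see where the shortcut comes from, it helps to unroll two steps by hand. The sketch below writes one step as a scaled previous image plus fresh noise, with independent $\boldsymbol{\epsilon}_t, \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(0, \mathbf{I})$ (the per-step subscripts on $\boldsymbol{\epsilon}$ and the merged term $\bar{\boldsymbol{\epsilon}}$ are notation introduced here for illustration):

$$\begin{aligned}
\mathbf{x}_t &= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}\right) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}, \quad \bar{\boldsymbol{\epsilon}} \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}$$

The last line uses the fact that a sum of independent zero-mean Gaussians is Gaussian with the variances added: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Repeating the substitution all the way down to $\mathbf{x}_0$ turns the product of $\alpha$ values into $\bar{\alpha}_t$.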

Let us verify with numbers. If $\bar{\alpha}_t = 0.5$ and $x_0 = 0.8$, $\epsilon = 0.3$:
- $x_t = \sqrt{0.5} \times 0.8 + \sqrt{0.5} \times 0.3 = 0.566 + 0.212 = 0.778$
- Both coefficients are $\sqrt{0.5} \approx 0.707$, and their squares sum to 1, so signal and noise carry equal variance weight. This is what we expect at a mid-level noise amount.
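The same arithmetic can be reproduced in a few lines of plain Python. This is only a sketch for a single scalar pixel; the variable names `signal_coeff` and `noise_coeff` are illustrative and do not appear elsewhere in the notebook.

```python
import math

alpha_bar_t = 0.5   # cumulative signal fraction at this timestep
x_0 = 0.8           # clean pixel value
epsilon = 0.3       # one fixed draw of standard Gaussian noise

signal_coeff = math.sqrt(alpha_bar_t)        # weight on the clean pixel
noise_coeff = math.sqrt(1.0 - alpha_bar_t)   # weight on the noise
x_t = signal_coeff * x_0 + noise_coeff * epsilon

print(f"x_t = {x_t:.3f}")
print(f"signal_coeff^2 + noise_coeff^2 = {signal_coeff**2 + noise_coeff**2:.3f}")
```

The squared coefficients always sum to 1, whatever the value of `alpha_bar_t`, which is why the overall variance stays bounded as noise is added.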

## 4. Let's Build It: Component by Component

### Step 1: Define the Noise Schedule

The noise schedule determines $\beta_t$ at each time step. We use a **linear schedule** from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.

In [None]:
#@title 🎧 Code Walkthrough: Noise Schedule Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_07_noise_schedule_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
T = 1000  # Total diffusion steps

# Linear noise schedule
beta_start = 1e-4
beta_end = 0.02
betas = torch.linspace(beta_start, beta_end, T)

# Compute alphas and cumulative alpha products
# Note: tensors are 0-indexed, so betas[0] corresponds to beta_1 in the math above
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

print(f"beta range: [{betas[0]:.6f}, {betas[-1]:.4f}]")
print(f"alpha_bar at t=0:   {alpha_bars[0]:.6f}")
print(f"alpha_bar at t=250: {alpha_bars[250]:.6f}")
print(f"alpha_bar at t=500: {alpha_bars[500]:.6f}")
print(f"alpha_bar at t=750: {alpha_bars[750]:.6f}")
print(f"alpha_bar at t=999: {alpha_bars[999]:.6f}")

In [None]:
#@title 🎧 What to Look For: Noise Schedule Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_08_noise_schedule_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


Let us visualize how $\bar{\alpha}_t$ decreases over time.

In [None]:
plt.figure(figsize=(10, 4))
plt.plot(alpha_bars.numpy())
plt.xlabel('Time step t')
plt.ylabel('ᾱₜ (cumulative alpha)')
plt.title('How Much Original Signal Remains at Each Time Step')
plt.grid(True, alpha=0.3)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.5, label='50% signal')
plt.legend()
plt.show()

In [None]:
#@title 🎧 Listen: Step By Step Intro Func
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_09_step_by_step_intro_func.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


**Checkpoint:** You should see $\bar{\alpha}_t$ starting near 1.0 (almost no noise) and decreasing towards 0 (pure noise). The curve shows that most of the signal is destroyed in the first half of the schedule.
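To put a number on how quickly the signal decays, the sketch below rebuilds the same linear schedule in plain Python and finds the first index where the cumulative product drops below 0.5. The names `T_demo`, `betas_py`, `alpha_bars_py`, and `half_index` are hypothetical, chosen so they do not overwrite the notebook's tensors.

```python
T_demo = 1000
beta_lo, beta_hi = 1e-4, 0.02

# Evenly spaced betas, mirroring torch.linspace(beta_start, beta_end, T)
betas_py = [beta_lo + (beta_hi - beta_lo) * i / (T_demo - 1) for i in range(T_demo)]

# Running product of (1 - beta) gives alpha_bar at each index
alpha_bars_py = []
running = 1.0
for beta in betas_py:
    running *= 1.0 - beta
    alpha_bars_py.append(running)

# First index at which less than half of the signal variance remains
half_index = next(i for i, ab in enumerate(alpha_bars_py) if ab < 0.5)

print(f"alpha_bar first drops below 0.5 at index {half_index}")
print(f"alpha_bar at index {T_demo // 2}: {alpha_bars_py[T_demo // 2]:.4f}")
```

The crossing point should fall well inside the first half of the schedule, which matches the shape of the plotted curve.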

### Step 2: Implement Step-by-Step Forward Process

Let us first implement the naive iterative approach, step by step.

In [None]:
def forward_step(x_prev, t):
    """Apply one step of forward diffusion: q(x_t | x_{t-1})"""
    beta_t = betas[t]
    mean = torch.sqrt(1 - beta_t) * x_prev
    noise = torch.randn_like(x_prev)
    x_t = mean + torch.sqrt(beta_t) * noise
    return x_t

In [None]:
#@title 🎧 What to Look For: Step By Step Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_10_step_by_step_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Apply forward diffusion step by step
x = sample_image.clone()
noisy_images_stepwise = [x.clone()]

for t in range(T):
    x = forward_step(x, t)
    if t in [0, 99, 249, 499, 749, 999]:
        noisy_images_stepwise.append(x.clone())

# Visualize
fig, axes = plt.subplots(1, 7, figsize=(21, 3))
titles = ['t=0', 't=1', 't=100', 't=250', 't=500', 't=750', 't=1000']
for ax, img, title in zip(axes, noisy_images_stepwise, titles):
    ax.imshow(img.squeeze().clamp(-1, 1).numpy(), cmap='gray')
    ax.set_title(title)
    ax.axis('off')
plt.suptitle('Step-by-Step Forward Diffusion', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: Closed Form Intro Func
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_11_closed_form_intro_func.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### Step 3: Implement the Closed-Form Forward Process

Now let us implement the efficient closed-form version.

In [None]:
def forward_diffusion_closed_form(x_0, t, noise=None):
    """
    Jump directly from x_0 to x_t using the closed-form formula.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    """
    if noise is None:
        noise = torch.randn_like(x_0)

    sqrt_alpha_bar = torch.sqrt(alpha_bars[t])
    sqrt_one_minus_alpha_bar = torch.sqrt(1 - alpha_bars[t])

    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * noise
    return x_t, noise

In [None]:
#@title 🎧 What to Look For: Closed Form Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_12_closed_form_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Jump to different time steps directly
timesteps = [0, 50, 100, 250, 500, 750, 999]

fig, axes = plt.subplots(1, len(timesteps), figsize=(3 * len(timesteps), 3))
for ax, t in zip(axes, timesteps):
    x_t, _ = forward_diffusion_closed_form(sample_image, t)
    ax.imshow(x_t.squeeze().clamp(-1, 1).numpy(), cmap='gray')
    ax.set_title(f't={t}\nᾱ={alpha_bars[t]:.3f}')
    ax.axis('off')
plt.suptitle('Closed-Form Forward Diffusion (Direct Jump)', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Before You Start: Todo1 Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_13_todo1_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


**Checkpoint:** Both methods should produce visually similar results, with the image gradually dissolving into noise. But the closed-form version computes each noisy image independently in one operation.

## 5. Your Turn

### TODO 1: Experiment with Different Noise Schedules

Replace the linear schedule with a **cosine schedule** (proposed by Nichol & Dhariwal, 2021). The cosine schedule provides a more gradual noise increase.
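For reference, the formula given in the function docstring can be written out as follows, with $s = 0.008$ as the default offset used in the function signature:

$$f(t) = \cos^2\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right), \qquad \bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad \beta_t = \min\left(1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}},\; 0.999\right)$$

Dividing by $f(0)$ makes $\bar{\alpha}_0 = 1$, and the clipping keeps $\beta_t$ away from 1 near the end of the schedule, where $\bar{\alpha}_t$ approaches 0.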

In [None]:
#@title 🎧 Before You Start: Todo1 Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_14_todo1_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def cosine_beta_schedule(T, s=0.008):
    """
    Cosine noise schedule from Nichol & Dhariwal (2021).

    TODO: Implement the cosine schedule.

    The formula is:
    f(t) = cos((t/T + s) / (1 + s) * pi/2)^2,  alpha_bar_t = f(t) / f(0)

    Then compute betas from alpha_bars:
    beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    Clip beta_t to be at most 0.999

    Returns:
        betas: tensor of shape (T,)
    """
    # HINT: Use torch.cos and torch.clamp
    # Step 1: Create timestep array from 0 to T
    # Step 2: Compute f(t) = cos((t/T + s)/(1+s) * pi/2)^2
    # Step 3: Compute alpha_bars = f(t) / f(0)
    # Step 4: Compute betas = 1 - alpha_bars[t] / alpha_bars[t-1]
    # Step 5: Clip betas to [0, 0.999]

    pass  # YOUR CODE HERE

In [None]:
#@title 🎧 Before You Start: Todo1 Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_15_todo1_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Verification cell for TODO 1
cosine_betas = cosine_beta_schedule(T)
if cosine_betas is not None:
    cosine_alphas = 1.0 - cosine_betas
    cosine_alpha_bars = torch.cumprod(cosine_alphas, dim=0)

    plt.figure(figsize=(10, 4))
    plt.plot(alpha_bars.numpy(), label='Linear Schedule')
    plt.plot(cosine_alpha_bars.numpy(), label='Cosine Schedule')
    plt.xlabel('Time step t')
    plt.ylabel('ᾱₜ')
    plt.title('Linear vs Cosine Noise Schedules')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    print("The cosine schedule should decrease more slowly at first, then faster at the end.")
else:
    print("TODO 1 not yet implemented.")

In [None]:
#@title 🎧 Before You Start: Todo2 Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_16_todo2_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### TODO 2: Verify the Closed-Form Formula

Check empirically that the closed-form and step-by-step methods produce the same distribution.

In [None]:
#@title 🎧 Before You Start: Todo2 Implementation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_17_todo2_implementation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
def verify_closed_form(x_0, t, num_samples=10000):
    """
    TODO: Sample x_t using both methods many times and compare the distributions.

    1. Use forward_step() iteratively t times to get samples of x_t
    2. Use forward_diffusion_closed_form() to get samples of x_t
    3. Compare the mean and std of both sample sets
    4. Plot histograms of pixel values from both methods

    If the closed-form is correct, both distributions should match.
    """
    # HINT: Pick a single pixel location, e.g., [0, 14, 14]
    # Sample it many times with both methods and compare

    pass  # YOUR CODE HERE

In [None]:
#@title 🎧 Code Walkthrough: Putting It Together Class
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_18_putting_it_together_class.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 6. Putting It All Together

Let us combine everything into a complete forward diffusion module.

In [None]:
class ForwardDiffusion:
    """Complete forward diffusion process for DDPM."""

    def __init__(self, T=1000, beta_start=1e-4, beta_end=0.02, schedule='linear'):
        self.T = T

        if schedule == 'linear':
            self.betas = torch.linspace(beta_start, beta_end, T)
        else:
            raise ValueError(f"Unknown schedule: {schedule}")

        self.alphas = 1.0 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

        # Precompute useful quantities
        self.sqrt_alpha_bars = torch.sqrt(self.alpha_bars)
        self.sqrt_one_minus_alpha_bars = torch.sqrt(1.0 - self.alpha_bars)

    def add_noise(self, x_0, t, noise=None):
        """Add noise to x_0 at timestep t."""
        if noise is None:
            noise = torch.randn_like(x_0)

        # Handle batch of different timesteps
        if isinstance(t, int):
            sqrt_ab = self.sqrt_alpha_bars[t]
            sqrt_1_ab = self.sqrt_one_minus_alpha_bars[t]
        else:
            sqrt_ab = self.sqrt_alpha_bars[t].view(-1, 1, 1, 1)
            sqrt_1_ab = self.sqrt_one_minus_alpha_bars[t].view(-1, 1, 1, 1)

        x_t = sqrt_ab * x_0 + sqrt_1_ab * noise
        return x_t, noise

    def visualize(self, x_0, num_steps=8):
        """Visualize the forward process at evenly spaced timesteps."""
        timesteps = torch.linspace(0, self.T - 1, num_steps).long()

        fig, axes = plt.subplots(1, num_steps, figsize=(3 * num_steps, 3))
        for ax, t in zip(axes, timesteps):
            x_t, _ = self.add_noise(x_0, t.item())
            ax.imshow(x_t.squeeze().clamp(-1, 1).numpy(), cmap='gray')
            ax.set_title(f't={t.item()}\nᾱ={self.alpha_bars[t]:.3f}')
            ax.axis('off')
        plt.suptitle('Forward Diffusion Process', fontsize=14)
        plt.tight_layout()
        plt.show()


# Create the forward process and visualize
forward = ForwardDiffusion(T=1000)
forward.visualize(sample_image)

In [None]:
#@title 🎧 What to Look For: Batch Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_19_batch_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 7. Results on a Batch

Let us test our forward process on a batch of different MNIST digits.

In [None]:
# Load a batch of images
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)
batch_images, batch_labels = next(iter(dataloader))

# Show the original batch
fig, axes = plt.subplots(2, 8, figsize=(24, 6))
for i in range(8):
    # Original
    axes[0][i].imshow(batch_images[i].squeeze().numpy(), cmap='gray')
    axes[0][i].set_title(f'Original (digit {batch_labels[i].item()})')
    axes[0][i].axis('off')

    # At t=500
    x_noisy, _ = forward.add_noise(batch_images[i:i+1], 500)
    axes[1][i].imshow(x_noisy.squeeze().clamp(-1, 1).numpy(), cmap='gray')
    axes[1][i].set_title('t=500')
    axes[1][i].axis('off')

plt.suptitle('Forward Diffusion on MNIST Batch (t=0 vs t=500)', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Narration: Statistical Verification
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_20_statistical_verification.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Statistical verification: at t=T, x_T should be approximately standard Gaussian
x_T, _ = forward.add_noise(batch_images, T - 1)
print(f"x_T statistics:")
print(f"  Mean:  {x_T.mean().item():.4f}  (expected: ~0.0)")
print(f"  Std:   {x_T.std().item():.4f}  (expected: ~1.0)")
print(f"  Min:   {x_T.min().item():.4f}")
print(f"  Max:   {x_T.max().item():.4f}")

In [None]:
#@title 🎧 What to Look For: Final Output Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_21_final_output_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


**Checkpoint:** At $t = T$, the mean should be close to 0 and the std close to 1.0, confirming that our forward process correctly produces approximately standard Gaussian noise.
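The reason the statistics come out this way can be read off the schedule itself. As a rough sketch in plain Python (the names `T_check`, `b_start`, and `b_end` are hypothetical and simply mirror the linear schedule defaults used earlier), the final cumulative product tells us how much of $x_0$ is left at the last step:

```python
import math

T_check = 1000
b_start, b_end = 1e-4, 0.02

# alpha_bar_T is the product of (1 - beta_t) over the whole linear schedule
alpha_bar_T = 1.0
for i in range(T_check):
    beta = b_start + (b_end - b_start) * i / (T_check - 1)
    alpha_bar_T *= 1.0 - beta

signal_coeff = math.sqrt(alpha_bar_T)      # multiplies x_0 in the closed form
noise_std = math.sqrt(1.0 - alpha_bar_T)   # std of the Gaussian term

print(f"alpha_bar_T  = {alpha_bar_T:.2e}")
print(f"signal coeff = {signal_coeff:.4f}")
print(f"noise std    = {noise_std:.5f}")
```

The signal coefficient is under 1% and the noise standard deviation is essentially 1, so $x_T$ is dominated by the Gaussian term regardless of which digit we started from.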

## 8. Final Output

In [None]:
# Create a visualization showing two images being destroyed over time
fig, axes = plt.subplots(2, 10, figsize=(30, 6))
timesteps = [0, 50, 100, 200, 300, 400, 500, 700, 850, 999]

for row, img_idx in enumerate([0, 3]):
    img = dataset[img_idx][0]
    for col, t in enumerate(timesteps):
        x_t, _ = forward.add_noise(img, t)
        axes[row][col].imshow(x_t.squeeze().clamp(-1, 1).numpy(), cmap='gray')
        axes[row][col].set_title(f't={t}', fontsize=10)
        axes[row][col].axis('off')

plt.suptitle('Forward Diffusion Process: From Clean Image to Pure Noise', fontsize=16)
plt.tight_layout()
plt.show()
print("\nForward diffusion complete! We can now jump to any noise level in a single step.")
print("Next: Learn how to REVERSE this process to generate new images.")

In [None]:
#@title 🎧 Wrap-Up: Closing
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_22_closing.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 9. Reflection and Next Steps

### Key Takeaways
1. The forward process adds Gaussian noise gradually over T steps
2. The closed-form formula $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$ lets us skip to any timestep directly
3. The noise schedule controls how quickly the signal is destroyed
4. At $t = T$, the noisy image is approximately standard Gaussian noise

### Reflection Questions
- Why do we scale the mean down by $\sqrt{1-\beta_t}$ instead of just adding noise? What would happen if we did not scale?
- What is the effect of choosing a larger $\beta_T$ value? How would it change the quality of generated images?
- Why is the closed-form formula so important for training efficiency?

### What is Next
In the next notebook, we will tackle the reverse question: given a noisy image, how do we learn to remove the noise? This brings us to the DDPM loss function and the elegant insight that training a diffusion model reduces to predicting noise.