In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

In [None]:
#@title üéß Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1VimSva8YFMP_2-RyAjHL8KfiPhTmlA-o", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/01_01_intro.mp3"))

# 🚀 Image Diffusion Foundations: From Noise to Pictures

*Part 1 of the Vizuara series on Diffusion LLMs from Scratch*
*Estimated time: 30 minutes*

In [None]:
#@title üéß Listen: Why It Matters
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_02_why_it_matters.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 1. Why Does This Matter?

Imagine you are writing an essay on a typewriter. You type one letter at a time, left to right. Once a letter is pressed onto the page, it is permanent: you cannot go back and change it. If you realize halfway through a sentence that the beginning was wrong, too bad.

This is essentially how autoregressive language models like GPT-4 and LLaMA work. They generate text **one token at a time**, from left to right. Each token depends on all the previous tokens, but has no knowledge of what will come after it.

Now think of how an **artist** works. An artist does not paint the top-left pixel first, then the next pixel. Instead, they start with a rough sketch of the whole canvas, then progressively refine the details: adding color here, sharpening edges there, going back to fix proportions. The whole image comes into focus *at the same time.*

This is how **diffusion models** generate images. And in this notebook series, we will see how this same idea can be applied to **text generation**, leading to a fundamentally new paradigm for language models.

**By the end of this notebook, you will:**
- Understand how image diffusion works from first principles
- Build a working diffusion model that generates MNIST digits from pure noise
- See why this approach breaks for text (setting up Notebook 2)

In [None]:
#@title üéß Listen: Building Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_03_building_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 2. Building Intuition

### The Two Phases of Diffusion

The core idea behind diffusion models is beautifully simple:

**Phase 1, the Forward Process (Destroying):** Take a clean image. Gradually add random noise to it, step by step. After enough steps, the image becomes pure random static, with no trace of the original remaining.

**Phase 2, the Reverse Process (Creating):** Train a neural network that learns to reverse each noise step. Given a noisy image, predict what it looked like one step earlier (slightly less noisy). Chain these predictions together, and you can go from pure noise all the way back to a clean image.

The magic: once the network learns to denoise, you can start from **pure random noise** and generate entirely new images that never existed in the training set.

### 🤔 Think About This

If I gave you a completely noisy image and asked you to denoise it, what information would you need?

You would need to know:
1. **How noisy** the image is (a little noisy? a lot?)
2. **What kind of images** are possible (faces? digits? landscapes?)

The first is provided by the **timestep**. The second is learned from the **training data**. Keep these two ideas in mind: they drive every design decision.

In [None]:
#@title üéß Listen: The Math
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_04_the_math.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 3. The Mathematics

### The Forward Process

At any timestep $t$, we can jump directly from the clean image $x_0$ to the noisy version $x_t$ using:

$$x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

**What this says computationally:** Mix the original image with random noise. The parameter $\bar{\alpha}_t$ controls the ratio:
- When $\bar{\alpha}_t = 1$: all signal, no noise (clean image)
- When $\bar{\alpha}_t = 0$: no signal, all noise (pure static)
- In between: a blend of image and noise
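
One reason the two coefficients are square roots is that their squares sum to one, so the mix keeps the overall scale fixed. As a short check, assuming the pixels of $x_0$ have been normalized to unit variance and are independent of $\epsilon$ (an idealization for illustration, not something the notebook enforces):

$$\mathrm{Var}(x_t) = \bar{\alpha}_t \, \mathrm{Var}(x_0) + (1 - \bar{\alpha}_t) \, \mathrm{Var}(\epsilon) = \bar{\alpha}_t + (1 - \bar{\alpha}_t) = 1$$

Under that assumption the noised image has the same variance at every timestep; only the proportion of signal to noise changes.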

### Worked Example

Let us plug in real numbers. Suppose we have a single pixel with value $x_0 = 0.8$, and at timestep $t$, the noise schedule gives $\bar{\alpha}_t = 0.5$. A random noise sample gives $\epsilon = 0.3$.

$$x_t = \sqrt{0.5} \times 0.8 + \sqrt{0.5} \times 0.3 = 0.707 \times 0.8 + 0.707 \times 0.3 = 0.566 + 0.212 = 0.778$$

The pixel shifted from 0.8 towards the noise. As $\bar{\alpha}_t$ decreases towards 0, the first term shrinks and noise dominates. At $\bar{\alpha}_t = 0$, the original image is completely gone.
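
The arithmetic above can be reproduced in a few lines of plain Python. This is only a scalar sketch of the forward formula for one pixel; the variable names are chosen here for illustration, and the noise value is fixed at 0.3 rather than sampled.

```python
import math

# Values from the worked example (a single pixel, not a full image).
x_0 = 0.8          # clean pixel value
alpha_bar_t = 0.5  # cumulative signal retention at this timestep
epsilon = 0.3      # one fixed noise draw, chosen for illustration

signal_part = math.sqrt(alpha_bar_t) * x_0
noise_part = math.sqrt(1.0 - alpha_bar_t) * epsilon
x_t = signal_part + noise_part

print(f"signal: {signal_part:.3f}, noise: {noise_part:.3f}, x_t: {x_t:.3f}")
# prints: signal: 0.566, noise: 0.212, x_t: 0.778
```

Changing `alpha_bar_t` towards 0 shrinks `signal_part` and grows `noise_part`, which is the trade-off the formula encodes.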

### The Training Objective

The neural network learns to predict the noise $\epsilon$ that was added. The simplified loss is:

$$\mathcal{L} = \| \epsilon - \epsilon_\theta(x_t, t) \|^2$$

**What this says computationally:** Take the actual noise that was added, subtract the model's prediction of that noise, and square the difference. This is just mean squared error between the true noise and predicted noise.
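
As a minimal sketch of that loss, here is a pure-Python version on a four-element "image". The helper `mse` and the two stand-in predictions are hypothetical names introduced for this illustration; a real model would produce the prediction from $x_t$ and $t$.

```python
import random

def mse(true_noise, predicted_noise):
    """Mean of squared differences between two equal-length lists."""
    squared = [(e - p) ** 2 for e, p in zip(true_noise, predicted_noise)]
    return sum(squared) / len(squared)

random.seed(0)
true_noise = [random.gauss(0.0, 1.0) for _ in range(4)]  # stands in for epsilon

perfect_guess = list(true_noise)  # idealized model: predicts the noise exactly
zero_guess = [0.0] * 4            # trivial model: always predicts "no noise"

loss_perfect = mse(true_noise, perfect_guess)
loss_zero = mse(true_noise, zero_guess)
print(f"perfect prediction: {loss_perfect:.4f}")
print(f"always-zero prediction: {loss_zero:.4f}")
```

Always predicting zero gives a loss equal to the mean of $\epsilon^2$, which is 1 in expectation for unit-variance noise. That makes 1.0 a useful reference point when reading a training curve: a model that has learned something should sit well below it.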

In [None]:
#@title üéß Listen: Noise Schedule
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_05_noise_schedule.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 4. Let's Build It, Component by Component

### 4.1 The Noise Schedule

The noise schedule defines how quickly we destroy the image. We start with $\beta_t$ values (noise rates) and compute $\bar{\alpha}_t$ from them.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

TIMESTEPS = 1000
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

def linear_beta_schedule(timesteps):
    """Linear schedule from Ho et al. (2020).
    Beta increases linearly from 0.0001 to 0.02.
    """
    beta_start = 0.0001
    beta_end = 0.02
    return torch.linspace(beta_start, beta_end, timesteps)

# Compute all schedule quantities
betas = linear_beta_schedule(TIMESTEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
sqrt_alpha_bars = torch.sqrt(alpha_bars)
sqrt_one_minus_alpha_bars = torch.sqrt(1.0 - alpha_bars)

print(f"alpha_bar at t=0:   {alpha_bars[0]:.4f} (almost no noise)")
print(f"alpha_bar at t=500: {alpha_bars[500]:.4f} (schedule midpoint, already mostly noise)")
print(f"alpha_bar at t=999: {alpha_bars[999]:.4f} (nearly pure noise)")

In [None]:
# 📊 Visualize the noise schedule
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

axes[0].plot(betas.numpy(), color='#e53935', linewidth=2)
axes[0].set_xlabel('Timestep t', fontsize=11)
axes[0].set_ylabel(r'$\beta_t$', fontsize=13)
axes[0].set_title('Noise Rate (Beta)', fontsize=13)
axes[0].grid(True, alpha=0.3)

axes[1].plot(alpha_bars.numpy(), color='#1565c0', linewidth=2)
axes[1].set_xlabel('Timestep t', fontsize=11)
axes[1].set_ylabel(r'$\bar{\alpha}_t$', fontsize=13)
axes[1].set_title('Signal Retention (Alpha Bar)', fontsize=13)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print("Left: noise rate increases over time.")
print("Right: signal retention decreases. By t=999, almost no original signal remains.")

In [None]:
#@title üéß Listen: Forward Process Code
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_06_forward_process_code.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.2 The Forward Process

In [None]:
def forward_diffusion(x_0, t, sqrt_alpha_bar, sqrt_one_minus_alpha_bar):
    """Noise an image to timestep t in one step.

    Args:
        x_0: Clean images, shape (B, C, H, W)
        t: Timestep indices, shape (B,)
        sqrt_alpha_bar: Precomputed sqrt(alpha_bar), shape (T,)
        sqrt_one_minus_alpha_bar: Precomputed sqrt(1-alpha_bar), shape (T,)

    Returns:
        x_t: Noised images
        noise: The noise that was added (needed for training)
    """
    noise = torch.randn_like(x_0)

    # Gather the schedule values for each sample's timestep
    s_ab = sqrt_alpha_bar[t].view(-1, 1, 1, 1)
    s_omab = sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)

    x_t = s_ab * x_0 + s_omab * noise
    return x_t, noise

In [None]:
#@title üéß Listen: Forward Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_07_forward_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

In [None]:
# 📊 Visualize: a single MNIST digit being progressively noised
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Scale to [-1, 1]
])
train_dataset = datasets.MNIST(root='./data', train=True, download=True,
                                transform=transform)

# Pick a digit
sample_img = train_dataset[0][0].unsqueeze(0)  # (1, 1, 28, 28)

timesteps_to_show = [0, 50, 150, 300, 500, 750, 999]
fig, axes = plt.subplots(1, len(timesteps_to_show), figsize=(18, 3))

for ax, t_val in zip(axes, timesteps_to_show):
    t = torch.tensor([t_val])
    noised, _ = forward_diffusion(sample_img, t,
                                   sqrt_alpha_bars, sqrt_one_minus_alpha_bars)
    ax.imshow(noised[0, 0].numpy(), cmap='gray')
    ax.set_title(f't = {t_val}', fontsize=11)
    ax.axis('off')

plt.suptitle('Forward Process: Gradually Adding Noise to a Digit', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print("The digit is progressively destroyed until only noise remains.")

In [None]:
#@title üéß Listen: Unet Model
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_08_unet_model.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.3 The Denoising Model (Simple U-Net)

We need a neural network that takes a noisy image and timestep, and predicts the noise. We use a minimal U-Net with time embedding.
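
To get a feel for how a scalar timestep becomes a vector the network can consume, here is a pure-Python sketch of a sinusoidal embedding for a single timestep. The function name `sinusoidal_embedding` and the tiny dimension of 8 are illustrative choices; the frequency spacing (geometric, from 1 down to 1/10000) follows the common positional-encoding recipe.

```python
import math

def sinusoidal_embedding(t, dim):
    """Sinusoidal embedding of one scalar timestep (dim must be even and >= 4)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 down to 1/10000.
    freqs = [math.exp(-math.log(10000) * i / (half - 1)) for i in range(half)]
    angles = [t * f for f in freqs]
    # First half holds sines, second half holds cosines.
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

emb_0 = sinusoidal_embedding(0, 8)
emb_500 = sinusoidal_embedding(500, 8)
print([round(v, 3) for v in emb_0])
print([round(v, 3) for v in emb_500])
```

At `t = 0` every sine is 0 and every cosine is 1; other timesteps give different patterns, which is what lets the network tell noise levels apart.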

In [None]:
class SinusoidalTimeEmbedding(nn.Module):
    """Sinusoidal timestep embedding, like positional encoding."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        half_dim = self.dim // 2
        emb = math.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)
        emb = t.float().unsqueeze(1) * emb.unsqueeze(0)
        return torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)

In [None]:
class ConvBlock(nn.Module):
    """Conv -> GroupNorm -> SiLU, with time embedding injection."""
    def __init__(self, in_ch, out_ch, time_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
        )
        self.time_proj = nn.Linear(time_dim, out_ch)

    def forward(self, x, t_emb):
        h = self.conv(x)
        # Add time embedding (broadcast over spatial dims)
        h = h + self.time_proj(t_emb).unsqueeze(-1).unsqueeze(-1)
        return h

In [None]:
class SimpleUNet(nn.Module):
    """Minimal U-Net for 28x28 MNIST images."""
    def __init__(self, in_ch=1, base_ch=32, time_dim=64):
        super().__init__()
        self.time_embed = nn.Sequential(
            SinusoidalTimeEmbedding(time_dim),
            nn.Linear(time_dim, time_dim),
            nn.SiLU(),
        )

        # Encoder
        self.enc1 = ConvBlock(in_ch, base_ch, time_dim)
        self.enc2 = ConvBlock(base_ch, base_ch * 2, time_dim)
        self.pool = nn.MaxPool2d(2)

        # Bottleneck
        self.bottleneck = ConvBlock(base_ch * 2, base_ch * 2, time_dim)

        # Decoder
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.dec2 = ConvBlock(base_ch * 4, base_ch, time_dim)  # skip connection
        self.dec1 = ConvBlock(base_ch * 2, base_ch, time_dim)

        self.final = nn.Conv2d(base_ch, in_ch, 1)

    def forward(self, x, t):
        t_emb = self.time_embed(t)

        # Encoder
        h1 = self.enc1(x, t_emb)               # (B, 32, 28, 28)
        h2 = self.enc2(self.pool(h1), t_emb)    # (B, 64, 14, 14)

        # Bottleneck
        h = self.bottleneck(self.pool(h2), t_emb)  # (B, 64, 7, 7)

        # Decoder with skip connections
        h = self.up(h)                           # (B, 64, 14, 14)
        h = self.dec2(torch.cat([h, h2], dim=1), t_emb)  # (B, 32, 14, 14)
        h = self.up(h)                           # (B, 32, 28, 28)
        h = self.dec1(torch.cat([h, h1], dim=1), t_emb)  # (B, 32, 28, 28)

        return self.final(h)                     # (B, 1, 28, 28)


model = SimpleUNet().to(device)
n_params = sum(p.numel() for p in model.parameters())
print(f"SimpleUNet parameters: {n_params:,}")

In [None]:
#@title üéß Listen: Training
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_09_training.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.4 The Training Loop

In [None]:
from torch.utils.data import DataLoader

dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

NUM_EPOCHS = 10
losses = []

print("Training the diffusion model on MNIST...")
for epoch in range(NUM_EPOCHS):
    epoch_losses = []
    for batch_idx, (images, _) in enumerate(dataloader):
        images = images.to(device)

        # Sample random timesteps for each image
        t = torch.randint(0, TIMESTEPS, (images.shape[0],), device=device)

        # Forward process: add noise
        x_t, noise = forward_diffusion(images, t,
                                        sqrt_alpha_bars.to(device),
                                        sqrt_one_minus_alpha_bars.to(device))

        # Model predicts the noise
        predicted_noise = model(x_t, t)

        # MSE loss between true noise and predicted noise
        loss = F.mse_loss(predicted_noise, noise)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_losses.append(loss.item())

    avg_loss = np.mean(epoch_losses)
    losses.append(avg_loss)
    print(f"Epoch {epoch+1}/{NUM_EPOCHS} | Loss: {avg_loss:.4f}")

print("Training complete!")

In [None]:
# 📊 Training loss curve
plt.figure(figsize=(10, 4))
plt.plot(losses, marker='o', color='#1565c0', linewidth=2, markersize=6)
plt.xlabel('Epoch', fontsize=11)
plt.ylabel('Mean MSE Loss', fontsize=11)
plt.title('Diffusion Model Training Loss', fontsize=13)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("The loss should trend downward as the model learns to predict noise.")

In [None]:
#@title üéß Listen: Todo1 Reverse
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_10_todo1_reverse.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 5. 🔧 Your Turn

### TODO 1: Implement the Sampling / Generation Function

This is the reverse process: going from pure noise back to a clean image. At each timestep, the model predicts the noise, and we subtract a carefully scaled version of it.

The formula for each reverse step from $t$ to $t-1$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta(x_t, t) \right) + \sigma_t \cdot z$$

where $z \sim \mathcal{N}(0, I)$ for $t > 1$ and $z = 0$ for $t = 1$, and $\sigma_t = \sqrt{\beta_t}$.

**What this says computationally:** Take the current noisy image, subtract the model's noise prediction (scaled appropriately), and add a small amount of fresh noise. At the very last step, skip the fresh noise to get a clean output.
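
A scalar sanity check can make the scaling less mysterious. The sketch below applies only the deterministic part of the update (no $\sigma_t z$ term) to one pixel at the first timestep of the linear schedule, where $\bar{\alpha}_t = \alpha_t$. It assumes an idealized model that predicts the added noise exactly; `reverse_mean` is a hypothetical helper name, not part of the notebook's code.

```python
import math

def reverse_mean(x_t, eps_pred, beta_t, alpha_bar_t):
    """Deterministic part of one reverse step (the formula without sigma_t * z)."""
    alpha_t = 1.0 - beta_t
    noise_coeff = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return (x_t - noise_coeff * eps_pred) / math.sqrt(alpha_t)

# First timestep of the linear schedule: alpha_bar equals alpha there.
beta_t = 0.0001
alpha_bar_t = 1.0 - beta_t

x_0, eps = 0.8, 0.3  # illustrative pixel and noise values

# Forward process: noise the pixel.
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

# Reverse step with a perfect noise prediction (an assumption, not a trained model).
x_prev = reverse_mean(x_t, eps_pred=eps, beta_t=beta_t, alpha_bar_t=alpha_bar_t)
print(f"x_t = {x_t:.4f}, recovered = {x_prev:.4f}")
```

With a perfect prediction at this timestep, the update undoes the forward noising and returns the original pixel up to floating-point error. A trained network is never perfect, so in practice each step only moves the sample part of the way.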

In [None]:
@torch.no_grad()
def sample(model, n_samples=16, img_size=28):
    """Generate images via the reverse diffusion process.

    Args:
        model: Trained SimpleUNet
        n_samples: Number of images to generate
        img_size: Spatial size (28 for MNIST)

    Returns:
        Generated images, shape (n_samples, 1, img_size, img_size)
        history: List of intermediate snapshots
    """
    model.eval()
    betas_d = betas.to(device)
    alphas_d = alphas.to(device)
    alpha_bars_d = alpha_bars.to(device)

    # Start from pure Gaussian noise
    x = torch.randn(n_samples, 1, img_size, img_size, device=device)
    history = [x.cpu().clone()]

    # ============ TODO ============
    # Reverse loop: t = TIMESTEPS-1 down to 0
    for t_val in reversed(range(TIMESTEPS)):
        t = torch.full((n_samples,), t_val, device=device, dtype=torch.long)

        # Step 1: Get the model's noise prediction
        predicted_noise = ???  # YOUR CODE HERE

        # Step 2: Compute scaling coefficients
        beta_t = betas_d[t_val]
        alpha_t = alphas_d[t_val]
        alpha_bar_t = alpha_bars_d[t_val]
        noise_coeff = beta_t / torch.sqrt(1.0 - alpha_bar_t)

        # Step 3: Compute x_{t-1}
        x = ???  # YOUR CODE HERE: (1/sqrt(alpha_t)) * (x - noise_coeff * predicted_noise)

        # Step 4: Add stochastic noise (except at final step t=0)
        if t_val > 0:
            z = torch.randn_like(x)
            sigma_t = torch.sqrt(beta_t)
            x = ???  # YOUR CODE HERE: x + sigma_t * z

        if t_val % 100 == 0 or t_val == 0:
            history.append(x.cpu().clone())
    # ==============================

    model.train()
    return x, history

In [None]:
# ✅ Verification
try:
    test_samples, test_history = sample(model, n_samples=4)
    assert test_samples.shape == (4, 1, 28, 28), f"Wrong shape: {test_samples.shape}"
    print("✅ Sampling works! Generated 4 test images.")
    print(f"   Pixel range: [{test_samples.min():.2f}, {test_samples.max():.2f}]")
    print(f"   Snapshots: {len(test_history)}")
except NameError:
    print("❌ Replace the ??? placeholders with your code.")
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
#@title üéß Listen: Stop And Think Solution
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_11_stop_and_think_solution.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

---
### ✋ Stop and Think
Before looking at the solution:
1. Why do we add fresh noise $z$ at every step except the last?
2. What would happen if we used a larger $\sigma_t$?
3. Why do we go from $t = T-1$ all the way to $t = 0$?

*Take a minute. Then scroll down.*

---

### Solution

In [None]:
@torch.no_grad()
def sample(model, n_samples=16, img_size=28):
    """Generate images via the reverse diffusion process."""
    model.eval()
    betas_d = betas.to(device)
    alphas_d = alphas.to(device)
    alpha_bars_d = alpha_bars.to(device)

    x = torch.randn(n_samples, 1, img_size, img_size, device=device)
    history = [x.cpu().clone()]

    for t_val in reversed(range(TIMESTEPS)):
        t = torch.full((n_samples,), t_val, device=device, dtype=torch.long)

        predicted_noise = model(x, t)

        beta_t = betas_d[t_val]
        alpha_t = alphas_d[t_val]
        alpha_bar_t = alpha_bars_d[t_val]
        noise_coeff = beta_t / torch.sqrt(1.0 - alpha_bar_t)

        x = (1.0 / torch.sqrt(alpha_t)) * (x - noise_coeff * predicted_noise)

        if t_val > 0:
            z = torch.randn_like(x)
            sigma_t = torch.sqrt(beta_t)
            x = x + sigma_t * z

        if t_val % 100 == 0 or t_val == 0:
            history.append(x.cpu().clone())

    model.train()
    return x, history


generated_images, generation_history = sample(model, n_samples=16)
print(f"Generated {generated_images.shape[0]} images!")

In [None]:
#@title üéß Listen: Todo2 Cosine
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_12_todo2_cosine.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### TODO 2: Implement a Cosine Noise Schedule

The linear schedule works, but a **cosine schedule** (Nichol & Dhariwal, 2021) spends more time at intermediate noise levels:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2, \quad s = 0.008$$
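
Before implementing the full schedule, it can help to evaluate the formula at a few timesteps. The sketch below does this for a single scalar $t$ in plain Python; `cosine_alpha_bar` is a hypothetical helper for illustration, and it deliberately stops short of converting $\bar{\alpha}_t$ into betas, which is the part left for the exercise.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Evaluate alpha_bar_t = f(t) / f(0) for the cosine schedule at one timestep."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

T = 1000
for t in (0, T // 2, T):
    print(f"t={t:4d}  alpha_bar={cosine_alpha_bar(t, T):.4f}")
```

The endpoints behave as expected: the value is 1 at `t = 0` and essentially 0 at `t = T`. At the midpoint the cosine schedule still retains roughly half of the signal, whereas the linear schedule used earlier in this notebook has already dropped to roughly 0.08 by `t = 500`.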

In [None]:
def cosine_beta_schedule(timesteps, s=0.008):
    """Cosine noise schedule.

    Hints:
        1. Compute f(t) for t = 0, 1, ..., T
        2. alpha_bars = f(t) / f(0)
        3. betas = 1 - alpha_bar[t] / alpha_bar[t-1]
        4. Clamp betas to [0, 0.999]
    """
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps)

    # ============ TODO ============
    f_t = ???  # YOUR CODE: cos((t/T + s)/(1+s) * pi/2)^2
    alpha_bars_cos = ???  # YOUR CODE: f(t) / f(0)
    betas_cos = ???  # YOUR CODE: 1 - alpha_bar[t] / alpha_bar[t-1]
    betas_cos = ???  # YOUR CODE: clamp to [0, 0.999]
    # ==============================

    return betas_cos

In [None]:
# ✅ Verification: compare schedules
try:
    cosine_betas = cosine_beta_schedule(TIMESTEPS)
    cosine_alphas = 1.0 - cosine_betas
    cosine_alpha_bars = torch.cumprod(cosine_alphas, dim=0)

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(alpha_bars.numpy(), label='Linear', color='#1565c0', linewidth=2.5)
    ax.plot(cosine_alpha_bars.numpy(), label='Cosine', color='#e53935',
            linewidth=2.5, linestyle='--')
    ax.set_xlabel('Timestep t', fontsize=12)
    ax.set_ylabel(r'$\bar{\alpha}_t$', fontsize=14)
    ax.set_title('Linear vs Cosine Noise Schedule', fontsize=14)
    ax.legend(fontsize=12)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("✅ Cosine schedule implemented!")
    print("Notice: cosine spends more time at intermediate noise levels.")
except NameError:
    print("❌ Replace the ??? placeholders.")

In [None]:
#@title üéß Listen: Results
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_13_results.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 6. Putting It All Together

In [None]:
# 📊 Progressive denoising: watch a digit emerge from noise
torch.manual_seed(123)
_, full_history = sample(model, n_samples=1)

fig, axes = plt.subplots(1, len(full_history), figsize=(20, 3))
for idx, (ax, snapshot) in enumerate(zip(axes, full_history)):
    img = snapshot[0, 0].numpy()
    ax.imshow(img, cmap='gray')
    ax.axis('off')
    if idx == 0:
        ax.set_title('t=999\n(noise)', fontsize=9)
    elif idx == len(full_history) - 1:
        ax.set_title('t=0\n(clean)', fontsize=9)
    else:
        t_approx = 999 - idx * 100
        ax.set_title(f't≈{max(t_approx, 0)}', fontsize=9)

plt.suptitle('Reverse Process: From Pure Noise to a Digit', fontsize=14, y=1.05)
plt.tight_layout()
plt.show()
print("Watch the digit emerge: first the rough shape, then fine details.")
print("This is exactly like an artist refining a sketch.")

## 7. 🎯 Final Output

In [None]:
# Generate a grid of MNIST digits from pure noise
torch.manual_seed(42)
final_images, _ = sample(model, n_samples=64)
final_images = torch.clamp(final_images, -1, 1)

fig, axes = plt.subplots(8, 8, figsize=(12, 12))
for i in range(8):
    for j in range(8):
        idx = i * 8 + j
        axes[i, j].imshow(final_images[idx, 0].cpu().numpy(), cmap='gray')
        axes[i, j].axis('off')

plt.suptitle('Generated MNIST Digits: From Pure Noise', fontsize=16, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
# 📊 Side-by-side: Real vs Generated
fig, axes = plt.subplots(2, 10, figsize=(16, 4))

for i in range(10):
    real_img = train_dataset[i * 600][0][0].numpy()
    axes[0, i].imshow(real_img, cmap='gray')
    axes[0, i].axis('off')
    if i == 5:
        axes[0, i].set_title('Real MNIST', fontsize=12, pad=10)

torch.manual_seed(7)
comp_images, _ = sample(model, n_samples=10)
comp_images = torch.clamp(comp_images, -1, 1)

for i in range(10):
    axes[1, i].imshow(comp_images[i, 0].cpu().numpy(), cmap='gray')
    axes[1, i].axis('off')
    if i == 5:
        axes[1, i].set_title('Generated', fontsize=12, pad=10)

plt.tight_layout()
plt.show()
print("Top: Real MNIST digits. Bottom: Generated from pure noise.")

In [None]:
#@title üéß Listen: Breaks For Text
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_14_breaks_for_text.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 8. The Problem: Why This Breaks for Text

We have a working image diffusion model. Now let us confront the fundamental question that motivates this entire series: **why can't we apply this directly to text?**

### Continuous vs Discrete

Images live in a **continuous** space. Each pixel is a floating-point number. Adding Gaussian noise to pixel 0.8 gives you 0.73 or 0.85, which are still valid pixel values.

Text lives in a **discrete** space. Each token is an integer ‚Äî an index into a vocabulary. The word "cat" might be token 3421, "dog" might be token 7856. What does token 5638.5 mean? Nothing. It is not a real word.

In [None]:
# Demonstrate why Gaussian noise fails for text
print("=" * 55)
print("EXPERIMENT: Gaussian Noise on Token IDs")
print("=" * 55)

vocab = {42: "the", 3421: "cat", 891: "sat", 156: "on", 7856: "dog", 2001: "mat"}

sentence = "the cat sat on the mat"
tokens = [42, 3421, 891, 156, 42, 2001]
print(f"\nOriginal: '{sentence}'")
print(f"Token IDs: {tokens}")

print(f"\n--- Adding Gaussian noise (sigma=500) ---")
np.random.seed(42)
for token_id, word in zip(tokens, sentence.split()):
    noise = np.random.normal(0, 500)
    noised_id = token_id + noise
    rounded_id = int(round(noised_id))
    noised_word = vocab.get(rounded_id, "???")
    print(f"  '{word}' ({token_id:5d}) + noise {noise:+8.1f} "
          f"= {noised_id:8.1f} → '{noised_word}'")

print(f"\n⚠️  Most noised tokens map to NOTHING in the vocabulary!")
print(f"   Gaussian noise is MEANINGLESS for discrete data.")

### Three Approaches to Fix This

Researchers have proposed three solutions:

1. **Continuous embedding diffusion** (Diffusion-LM, 2022): embed tokens into continuous vectors, run Gaussian diffusion on embeddings, round back. Problem: rounding introduces errors.

2. **Masked diffusion** (MDLM, LLaDA): replace tokens with [MASK] instead of adding noise. Masking is a natural "noise" for discrete data. This is the simplest of the three and the approach the next notebook builds on.

3. **Score-based discrete diffusion** (SEDD): define transition probabilities directly in discrete space. Elegant but complex.
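
The rounding step in the first approach can be sketched with a toy example. Everything here is hypothetical: the three-word vocabulary, the 2-D embedding values, and the helper names are invented for illustration and are not taken from Diffusion-LM.

```python
import random

# Hypothetical 2-D embeddings for a toy vocabulary (values invented for illustration).
EMBEDDINGS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.3),
    "mat": (-1.0, 0.5),
}

def nearest_token(vec):
    """Round a continuous vector back to the closest vocabulary embedding."""
    def sq_dist(word):
        ex, ey = EMBEDDINGS[word]
        return (vec[0] - ex) ** 2 + (vec[1] - ey) ** 2
    return min(EMBEDDINGS, key=sq_dist)

def add_noise(vec, sigma, rng):
    """Gaussian noise in embedding space, where fractional values are valid."""
    return tuple(v + rng.gauss(0.0, sigma) for v in vec)

rng = random.Random(0)
clean = EMBEDDINGS["cat"]
for sigma in (0.0, 0.1, 1.0):
    noisy = add_noise(clean, sigma, rng)
    print(f"sigma={sigma:.1f} -> nearest token: {nearest_token(noisy)}")
```

Unlike raw token IDs, every noisy vector here is a meaningful point in embedding space. The catch is the final rounding: with small `sigma` the vector usually maps back to the original token, but with larger `sigma` it can land nearer a neighbor such as "dog", which is the rounding error mentioned in item 1.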

In [None]:
# Preview: what masked "noise" looks like
print("=" * 55)
print("PREVIEW: Masked Diffusion (Next Notebook)")
print("=" * 55)

words = ["The", "cat", "sat", "on", "the", "mat"]
print(f"\nOriginal:  {' '.join(words)}\n")

np.random.seed(0)
for ratio in [0.0, 0.17, 0.33, 0.50, 0.67, 0.83, 1.0]:
    masked = ["[M]" if np.random.random() < ratio else w for w in words]
    pct = masked.count("[M]") / len(masked) * 100
    print(f"  t={ratio:.2f} ({pct:3.0f}% masked):  {' '.join(masked)}")

print("\nEvery intermediate state is interpretable!")
print("No 'half-cat' nonsense. Tokens are present or masked.")
print("\n💡 Masking IS the noise process for text.")

In [None]:
#@title üéß Listen: Closing
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_15_closing.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 9. Reflection and Next Steps

### 🤔 Reflection Questions

1. **Why does the model need to know the timestep $t$?** Without it, the model cannot distinguish between slightly noisy images (needing gentle correction) and heavily noisy images (needing aggressive reconstruction).

2. **What would happen with a very aggressive noise schedule?** If the image is destroyed too quickly, most of the trajectory is spent going from pure noise to slightly-less-pure noise, so the model has few steps in which to learn fine details.

3. **Could we use a Transformer instead of a U-Net?** Yes! The Diffusion Transformer (DiT) does exactly this, and transformer backbones of this kind are used in models such as Stable Diffusion 3. The diffusion math stays the same; only the backbone changes.

### 🏆 Optional Challenges

1. Train on Fashion-MNIST and compare generation quality
2. Implement classifier-free guidance for conditional generation
3. Implement DDIM sampling (deterministic, fewer steps needed)

---

**Up Next, Notebook 2:** *Masked Diffusion for Text.* We will see how replacing Gaussian noise with token masking gives us a diffusion process that works for discrete text, and it turns out to be essentially BERT training, generalized to all masking ratios.

In [None]:
#@title üí¨ AI Teaching Assistant ‚Äî Click ‚ñ∂ to start
#@markdown This AI chatbot reads your notebook and can answer questions about any concept, code, or exercise.

import json as _json
import requests as _requests
from google.colab import output as _output
from IPython.display import display, HTML as _HTML, Markdown as _Markdown

# --- Read notebook content for context ---
def _get_notebook_context():
    try:
        from google.colab import _message
        nb = _message.blocking_request("get_ipynb", request="", timeout_sec=10)
        cells = nb.get("ipynb", {}).get("cells", [])
        parts = []
        for cell in cells:
            src = "".join(cell.get("source", []))
            tags = cell.get("metadata", {}).get("tags", [])
            if "chatbot" in tags:
                continue
            if src.strip():
                ct = cell.get("cell_type", "unknown")
                parts.append(f"[{ct.upper()}]\n{src}")
        return "\n\n---\n\n".join(parts)
    except Exception:
        return "Notebook content unavailable."

_NOTEBOOK_CONTEXT = _get_notebook_context()
_CHAT_HISTORY = []
_API_URL = "https://course-creator-brown.vercel.app/api/chat"

def _notebook_chat(question):
    global _CHAT_HISTORY
    try:
        resp = _requests.post(_API_URL, json={
            'question': question,
            'context': _NOTEBOOK_CONTEXT[:100000],
            'history': _CHAT_HISTORY[-10:],
        }, timeout=60)
        data = resp.json()
        answer = data.get('answer', 'Sorry, I could not generate a response.')
        _CHAT_HISTORY.append({'role': 'user', 'content': question})
        _CHAT_HISTORY.append({'role': 'assistant', 'content': answer})
        return answer
    except Exception as e:
        return f'Error connecting to teaching assistant: {str(e)}'

_output.register_callback('notebook_chat', _notebook_chat)

def ask(question):
    """Ask the AI teaching assistant a question about this notebook."""
    answer = _notebook_chat(question)
    display(_Markdown(answer))

print("\u2705 AI Teaching Assistant is ready!")
print("\U0001f4a1 Use the chat below, or call ask('your question') in any cell.")

# --- Display chat widget ---
# Raw string so the JavaScript regex escapes below (\n, \w, \d) reach the browser unchanged
display(_HTML(r'''<style>
  .vc-wrap{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;max-width:100%;border-radius:16px;overflow:hidden;box-shadow:0 4px 24px rgba(0,0,0,.12);background:#fff;border:1px solid #e5e7eb}
  .vc-hdr{background:linear-gradient(135deg,#667eea 0%,#764ba2 100%);color:#fff;padding:16px 20px;display:flex;align-items:center;gap:12px}
  .vc-avatar{width:42px;height:42px;background:rgba(255,255,255,.2);border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:22px}
  .vc-hdr h3{font-size:16px;font-weight:600;margin:0}
  .vc-hdr p{font-size:12px;opacity:.85;margin:2px 0 0}
  .vc-msgs{height:420px;overflow-y:auto;padding:16px;background:#f8f9fb;display:flex;flex-direction:column;gap:10px}
  .vc-msg{display:flex;flex-direction:column;animation:vc-fade .25s ease}
  .vc-msg.user{align-items:flex-end}
  .vc-msg.bot{align-items:flex-start}
  .vc-bbl{max-width:85%;padding:10px 14px;border-radius:16px;font-size:14px;line-height:1.55;word-wrap:break-word}
  .vc-msg.user .vc-bbl{background:linear-gradient(135deg,#667eea 0%,#764ba2 100%);color:#fff;border-bottom-right-radius:4px}
  .vc-msg.bot .vc-bbl{background:#fff;color:#1a1a2e;border:1px solid #e8e8e8;border-bottom-left-radius:4px}
  .vc-bbl code{background:rgba(0,0,0,.07);padding:2px 6px;border-radius:4px;font-size:13px;font-family:'Fira Code',monospace}
  .vc-bbl pre{background:#1e1e2e;color:#cdd6f4;padding:12px;border-radius:8px;overflow-x:auto;margin:8px 0;font-size:13px}
  .vc-bbl pre code{background:none;padding:0;color:inherit}
  .vc-bbl h3,.vc-bbl h4{margin:10px 0 4px;font-size:15px}
  .vc-bbl ul,.vc-bbl ol{margin:4px 0;padding-left:20px}
  .vc-bbl li{margin:2px 0}
  .vc-chips{display:flex;flex-wrap:wrap;gap:8px;padding:0 16px 12px;background:#f8f9fb}
  .vc-chip{background:#fff;border:1px solid #d1d5db;border-radius:20px;padding:6px 14px;font-size:12px;cursor:pointer;transition:all .15s;color:#4b5563}
  .vc-chip:hover{border-color:#667eea;color:#667eea;background:#f0f0ff}
  .vc-input{display:flex;padding:12px 16px;background:#fff;border-top:1px solid #eee;gap:8px}
  .vc-input input{flex:1;padding:10px 16px;border:2px solid #e8e8e8;border-radius:24px;font-size:14px;outline:none;transition:border-color .2s}
  .vc-input input:focus{border-color:#667eea}
  .vc-input button{background:linear-gradient(135deg,#667eea 0%,#764ba2 100%);color:#fff;border:none;border-radius:50%;width:42px;height:42px;cursor:pointer;display:flex;align-items:center;justify-content:center;font-size:18px;transition:transform .1s}
  .vc-input button:hover{transform:scale(1.05)}
  .vc-input button:disabled{opacity:.5;cursor:not-allowed;transform:none}
  .vc-typing{display:flex;gap:5px;padding:4px 0}
  .vc-typing span{width:8px;height:8px;background:#667eea;border-radius:50%;animation:vc-bounce 1.4s infinite ease-in-out}
  .vc-typing span:nth-child(2){animation-delay:.2s}
  .vc-typing span:nth-child(3){animation-delay:.4s}
  @keyframes vc-bounce{0%,80%,100%{transform:scale(0)}40%{transform:scale(1)}}
  @keyframes vc-fade{from{opacity:0;transform:translateY(8px)}to{opacity:1;transform:translateY(0)}}
  .vc-note{text-align:center;font-size:11px;color:#9ca3af;padding:8px 16px 12px;background:#fff}
</style>
<div class="vc-wrap">
  <div class="vc-hdr">
    <div class="vc-avatar">&#129302;</div>
    <div>
      <h3>Vizuara Teaching Assistant</h3>
      <p>Ask me anything about this notebook</p>
    </div>
  </div>
  <div class="vc-msgs" id="vcMsgs">
    <div class="vc-msg bot">
      <div class="vc-bbl">&#128075; Hi! I've read through this entire notebook. Ask me about any concept, code block, or exercise &mdash; I'm here to help you learn!</div>
    </div>
  </div>
  <div class="vc-chips" id="vcChips">
    <span class="vc-chip" onclick="vcAsk(this.textContent)">Explain the main concept</span>
    <span class="vc-chip" onclick="vcAsk(this.textContent)">Help with the TODO exercise</span>
    <span class="vc-chip" onclick="vcAsk(this.textContent)">Summarize what I learned</span>
  </div>
  <div class="vc-input">
    <input type="text" id="vcIn" placeholder="Ask about concepts, code, exercises..." />
    <button id="vcSend" onclick="vcSendMsg()">&#10148;</button>
  </div>
  <div class="vc-note">AI-generated &middot; Verify important information &middot; <a href="#" onclick="vcClear();return false" style="color:#667eea">Clear chat</a></div>
</div>
<script>
(function(){
  var msgs=document.getElementById('vcMsgs'),inp=document.getElementById('vcIn'),
      btn=document.getElementById('vcSend'),chips=document.getElementById('vcChips');

  function esc(s){var d=document.createElement('div');d.textContent=s;return d.innerHTML}

  function md(t){
    return t
      .replace(/```(\w*)\n([\s\S]*?)```/g,function(_,l,c){return '<pre><code>'+esc(c)+'</code></pre>'})
      .replace(/`([^`]+)`/g,'<code>$1</code>')
      .replace(/\*\*([^*]+)\*\*/g,'<strong>$1</strong>')
      .replace(/\*([^*]+)\*/g,'<em>$1</em>')
      .replace(/^#### (.+)$/gm,'<h4>$1</h4>')
      .replace(/^### (.+)$/gm,'<h4>$1</h4>')
      .replace(/^## (.+)$/gm,'<h3>$1</h3>')
      .replace(/^\d+\. (.+)$/gm,'<li>$1</li>')
      .replace(/^- (.+)$/gm,'<li>$1</li>')
      .replace(/\n\n/g,'<br><br>')
      .replace(/\n/g,'<br>');
  }

  function addMsg(text,isUser){
    var m=document.createElement('div');m.className='vc-msg '+(isUser?'user':'bot');
    var b=document.createElement('div');b.className='vc-bbl';
    b.innerHTML=isUser?esc(text):md(text);
    m.appendChild(b);msgs.appendChild(m);msgs.scrollTop=msgs.scrollHeight;
  }

  function showTyping(){
    var m=document.createElement('div');m.className='vc-msg bot';m.id='vcTyping';
    m.innerHTML='<div class="vc-bbl"><div class="vc-typing"><span></span><span></span><span></span></div></div>';
    msgs.appendChild(m);msgs.scrollTop=msgs.scrollHeight;
  }

  function hideTyping(){var e=document.getElementById('vcTyping');if(e)e.remove()}

  window.vcSendMsg=function(){
    var q=inp.value.trim();if(!q)return;
    inp.value='';chips.style.display='none';
    addMsg(q,true);showTyping();btn.disabled=true;
    google.colab.kernel.invokeFunction('notebook_chat',[q],{})
      .then(function(r){
        hideTyping();
        var a=r.data['application/json'];
        addMsg(typeof a==='string'?a:JSON.stringify(a),false);
      })
      .catch(function(){
        hideTyping();
        addMsg('Sorry, I encountered an error. Please check your internet connection and try again.',false);
      })
      .finally(function(){btn.disabled=false;inp.focus()});
  };

  window.vcAsk=function(q){inp.value=q;vcSendMsg()};
  window.vcClear=function(){
    msgs.innerHTML='<div class="vc-msg bot"><div class="vc-bbl">&#128075; Chat cleared. Ask me anything!</div></div>';
    chips.style.display='flex';
  };

  inp.addEventListener('keypress',function(e){if(e.key==='Enter')vcSendMsg()});
  inp.focus();
})();
</script>'''))