In [None]:
#@title 🎧 Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1ventuhdj998YNr_9KusKPNX2VFJg7Av1", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/00_intro.mp3"))

In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

In [None]:
#@title 🎧 Listen: Motivation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/01_motivation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

# The Reversal Curse: Why Direction Matters in Language Models

*Part 2 of the Vizuara series on Diffusion Language Models*
*Estimated time: 40 minutes*

## 1. Why Does This Matter?

Here is a surprising fact about GPT-4, Claude, and every autoregressive language model you have used:

If you train a model on *"The capital of France is Paris"*, it can complete *"The capital of France is ___"* perfectly. But ask it *"Paris is the capital of ___"* and it struggles, even though the answer is trivially implied by the training data.

This is called the **reversal curse**, and it is a fundamental limitation of left-to-right models. They can only model $P(\text{later tokens} \mid \text{earlier tokens})$, never the reverse.

Diffusion language models break free from this curse. Because they see all tokens simultaneously with bidirectional attention, they can fill in *any* position given *any* context: forward, backward, or in the middle.

In this notebook, we will **test this experimentally** by training an autoregressive model and a diffusion model on the same dataset and comparing their forward vs. reverse completion accuracy.

**Teaser (what we will see):**

| | Forward (A → B) | Reverse (B → A) |
|---|---|---|
| Autoregressive | ~95% | ~12% |
| Diffusion | ~93% | ~88% |

In [None]:
# 🔧 Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

torch.manual_seed(42)
np.random.seed(42)

%matplotlib inline

In [None]:
#@title 🎧 Listen: Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 2. Building Intuition

### The One-Way Street Problem

Think of an autoregressive model as someone who reads a book **only from left to right, one word at a time**. They have seen "The capital of France is Paris" many times. When you start them off with "The capital of France is ___", they smoothly continue with "Paris."

But when you say "Paris is the capital of ___", they are stuck. They have never practiced reading from right to left. Even though the knowledge is in their training data, the *direction* of their processing prevents them from accessing it.

A diffusion model is different. During training, tokens are masked **uniformly at random**: sometimes the left side is masked, sometimes the right, sometimes the middle, sometimes everything. The model learns to predict any token from any combination of context. There is no preferred direction.
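The random masking described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the notebook's training code: the string tokens, the `[MASK]` placeholder, and the helper `mask_sequence` are assumptions made for readability. It prints a few masked views at different masking rates `t`, then counts how often A is fully hidden while B is fully visible, and the other way around.

```python
import random

tokens = ["A1", "A2", "SEP", "B1", "B2", "PAD"]

def mask_sequence(seq, t, rng):
    """Replace each token with [MASK] independently with probability t."""
    return [tok if rng.random() >= t else "[MASK]" for tok in seq]

rng = random.Random(0)

# A few sample corruptions at low, medium, and high masking rates.
for t in (0.2, 0.5, 0.8):
    for _ in range(2):
        print(f"t={t}: {mask_sequence(tokens, t, rng)}")

# At t = 0.5, count the two "pure direction" patterns:
#   A hidden + B visible  -> the model must predict A from B (reverse)
#   B hidden + A visible  -> the model must predict B from A (forward)
n_trials = 20000
a_hidden_b_visible = 0
b_hidden_a_visible = 0
for _ in range(n_trials):
    masked = mask_sequence(tokens, 0.5, rng)
    a_hidden = masked[0] == "[MASK]" and masked[1] == "[MASK]"
    b_hidden = masked[3] == "[MASK]" and masked[4] == "[MASK]"
    a_visible = masked[0] != "[MASK]" and masked[1] != "[MASK]"
    b_visible = masked[3] != "[MASK]" and masked[4] != "[MASK]"
    a_hidden_b_visible += a_hidden and b_visible
    b_hidden_a_visible += b_hidden and a_visible

print(f"reverse-style patterns: {a_hidden_b_visible / n_trials:.3f}")
print(f"forward-style patterns: {b_hidden_a_visible / n_trials:.3f}")
```

Because each token is masked independently, both patterns have the same probability (here $0.5^4 = 1/16$), so the two printed frequencies should be close to each other.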

### 🤔 Think About This

Before we run the experiment, predict:
- Will the autoregressive model have *any* ability to do reverse completion, or will it be at 0%?
- Will the diffusion model perform *exactly* the same on forward and reverse, or will there be some asymmetry?

*Think about this for a moment. The answers may surprise you.*

In [None]:
#@title 🎧 Listen: Math
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/03_math.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 3. The Mathematics

### Why Does the Reversal Curse Happen?

An autoregressive model factorizes the joint probability as:

$$P(x_1, x_2, \ldots, x_L) = \prod_{i=1}^{L} P(x_i \mid x_1, \ldots, x_{i-1})$$

Each factor conditions *only on the left context*. The model never computes $P(x_i \mid x_{i+1}, \ldots, x_L)$ during training.

**Numerical example:** For the pair "cat → dog", the autoregressive model learns:
- $P(\text{dog} \mid \text{cat}) = 0.95$ (high, because this direction is seen in training)
- $P(\text{cat} \mid \text{dog}) = ?$ (never trained on this direction!)

The diffusion model's training objective sees all masking patterns:
- When $t = 0.5$: might mask "cat", must predict it from "dog" (reverse!)
- When $t = 0.5$: might mask "dog", must predict it from "cat" (forward!)
- Both directions are trained equally because masking is uniform random.
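To make the asymmetry concrete, here is a minimal sketch in plain Python. The toy corpus and the helper name `conditional` are illustrative assumptions, not the notebook's dataset. Both conditionals are computed from the same pair counts: the reverse conditional is fully determined by the data, yet a left-to-right factorization only ever parameterizes the forward one.

```python
from collections import Counter

# Hypothetical toy corpus of (A, B) pairs; "cat -> dog" dominates.
pairs = [("cat", "dog")] * 19 + [("cat", "bird")] * 1 + [("fox", "hen")] * 5

joint = Counter(pairs)                    # counts of each (A, B) pair
count_a = Counter(a for a, _ in pairs)    # how often each A appears
count_b = Counter(b for _, b in pairs)    # how often each B appears

def conditional(target, given, direction):
    """Empirical P(target | given); 'forward' is P(B | A), 'reverse' is P(A | B)."""
    if direction == "forward":
        return joint[(given, target)] / count_a[given]
    return joint[(target, given)] / count_b[given]

p_forward = conditional("dog", "cat", "forward")   # P(dog | cat)
p_reverse = conditional("cat", "dog", "reverse")   # P(cat | dog)

print(f"P(dog | cat) = {p_forward:.2f}")
print(f"P(cat | dog) = {p_reverse:.2f}")
```

In this toy corpus the forward conditional matches the 0.95 used in the example above, and the reverse conditional is also recoverable from the counts. The point is that the information is present in the data; what differs between the two model families is which conditionals the training objective actually exercises.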

In [None]:
#@title 🎧 Listen: Data Models
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_data_models.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 4. Let's Build It: Component by Component

### 4.1 The Paired-Sequence Dataset

We create a dataset of paired token groups: given the two-token group A, the "answer" is the two-token group B. Each training sequence is `[A1, A2, SEP, B1, B2, PAD]`. This simulates factual pairs like "France → Paris" or "cat → dog".

In [None]:
# --- Configuration ---
VOCAB_SIZE = 32
SEQ_LEN = 6          # [A1, A2, SEP, B1, B2, PAD]: simple pairs
MASK_TOKEN = 0
SEP_TOKEN = 1         # Separator between A and B
PAD_TOKEN = 2
D_MODEL = 64
N_HEADS = 4
N_LAYERS = 3
BATCH_SIZE = 64

# Create fixed A→B pairs (like "France → Paris")
NUM_PAIRS = 50
np.random.seed(42)
A_tokens = np.random.randint(3, VOCAB_SIZE, size=(NUM_PAIRS, 2))
B_tokens = np.random.randint(3, VOCAB_SIZE, size=(NUM_PAIRS, 2))

# Resample B until it differs from A at every position within the pair
# (a position-wise check per pair, not a global disjointness guarantee)
for i in range(NUM_PAIRS):
    while np.any(A_tokens[i] == B_tokens[i]):
        B_tokens[i] = np.random.randint(3, VOCAB_SIZE, size=2)

print(f"Created {NUM_PAIRS} A→B pairs")
print("Example pairs:")
for i in range(5):
    print(f"  [{A_tokens[i][0]}, {A_tokens[i][1]}] → [{B_tokens[i][0]}, {B_tokens[i][1]}]")

In [None]:
def make_forward_batch(batch_size):
    """Create training sequences in forward order: [A1, A2, SEP, B1, B2, PAD]"""
    indices = np.random.randint(0, NUM_PAIRS, size=batch_size)
    seqs = []
    for idx in indices:
        seq = [A_tokens[idx][0], A_tokens[idx][1], SEP_TOKEN,
               B_tokens[idx][0], B_tokens[idx][1], PAD_TOKEN]
        seqs.append(seq)
    return torch.tensor(seqs, dtype=torch.long, device=device)

# Peek at the data
batch = make_forward_batch(3)
for i in range(3):
    print(f"Training sequence: {batch[i].tolist()}")

### 4.2 The Autoregressive Model

A standard causal Transformer: each position can only attend to itself and earlier positions.
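As a quick illustration of the rule "attend to itself and earlier positions", the following sketch builds the visibility pattern in plain Python. The helper `causal_visibility` and the label list are assumptions for illustration only. Note the convention used here is `True` = may attend; PyTorch's boolean attention masks use the opposite convention (`True` = blocked), which is why the model code below builds its mask from the strict upper triangle.

```python
SEQ_LABELS = ["A1", "A2", "SEP", "B1", "B2"]

def causal_visibility(n):
    """visible[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]

visible = causal_visibility(len(SEQ_LABELS))

# Print, for every query position, the keys it is allowed to look at.
for q, row in enumerate(visible):
    seen = [SEQ_LABELS[k] for k, ok in enumerate(row) if ok]
    print(f"{SEQ_LABELS[q]:>3} attends to {seen}")
```

The B positions can look back at A, but the A positions never see B. That one-sided visibility is what the rest of the experiment probes.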

In [None]:
class AutoregressiveLM(nn.Module):
    """Causal (left-to-right) language model."""

    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(SEQ_LEN, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1, batch_first=True, norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

        # Causal mask: prevent attending to future positions
        mask = torch.triu(torch.ones(SEQ_LEN, SEQ_LEN), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)

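    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        h = self.embed(x) + self.pos_embed(positions)
        # Slice the mask to the actual input length: training feeds 5 tokens
        # and generation feeds shorter prefixes, not the full SEQ_LEN.
        h = self.transformer(h, mask=self.causal_mask[:seq_len, :seq_len])
        return self.head(h)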


ar_model = AutoregressiveLM(VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS).to(device)
print(f"AR model parameters: {sum(p.numel() for p in ar_model.parameters()):,}")

### 4.3 The Diffusion Model

Same architecture but **bidirectional**: no causal mask, plus time conditioning.

In [None]:
class DiffusionLM(nn.Module):
    """Bidirectional Transformer for masked diffusion."""

    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(SEQ_LEN, d_model)
        self.time_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=0.1, batch_first=True, norm_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x_t, t):
        positions = torch.arange(x_t.size(1), device=x_t.device)
        h = self.embed(x_t) + self.pos_embed(positions)
        h = h + self.time_mlp(t).unsqueeze(1)
        h = self.transformer(h)  # No causal mask: attention is bidirectional
        return self.head(h)


diff_model = DiffusionLM(VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS).to(device)
print(f"Diffusion model parameters: {sum(p.numel() for p in diff_model.parameters()):,}")

In [None]:
#@title 🎧 Listen: Training
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/05_training.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### 4.4 Training Both Models

In [None]:
def train_ar_model(model, n_steps=3000, lr=3e-4):
    """Train the autoregressive model with next-token prediction."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []

    for step in range(n_steps):
        x = make_forward_batch(BATCH_SIZE)

        # Predict next token: input is x[:, :-1], target is x[:, 1:]
        logits = model(x[:, :-1])
        targets = x[:, 1:]
        loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        losses.append(loss.item())

        if (step + 1) % 500 == 0:
            print(f"  AR Step {step+1}/{n_steps} | Loss: {loss.item():.4f}")

    return losses


def mask_tokens(x_0, t):
    """Forward masking process."""
    mask = torch.rand_like(x_0.float()) < t
    x_t = x_0.clone()
    x_t[mask] = MASK_TOKEN
    return x_t, mask


def train_diff_model(model, n_steps=3000, lr=3e-4):
    """Train the diffusion model with masked prediction."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []

    for step in range(n_steps):
        x_0 = make_forward_batch(BATCH_SIZE)
        t = torch.rand(BATCH_SIZE, 1, device=device) * 0.98 + 0.02
        x_t, mask = mask_tokens(x_0, t)
        logits = model(x_t, t)

        if mask.sum() == 0:
            continue
        loss = F.cross_entropy(logits[mask], x_0[mask])

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        losses.append(loss.item())

        if (step + 1) % 500 == 0:
            print(f"  Diff Step {step+1}/{n_steps} | Loss: {loss.item():.4f}")

    return losses


print("Training Autoregressive Model...")
ar_losses = train_ar_model(ar_model, n_steps=3000)
print("\nTraining Diffusion Model...")
diff_losses = train_diff_model(diff_model, n_steps=3000)
print("\nDone!")

In [None]:
# 📊 Training curves side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
window = 50

smoothed_ar = np.convolve(ar_losses, np.ones(window)/window, mode='valid')
ax1.plot(smoothed_ar, color='#c62828', linewidth=2)
ax1.set_title('Autoregressive Model', fontsize=13)
ax1.set_xlabel('Step')
ax1.set_ylabel('Loss')
ax1.grid(True, alpha=0.3)

smoothed_diff = np.convolve(diff_losses, np.ones(window)/window, mode='valid')
ax2.plot(smoothed_diff, color='#1565c0', linewidth=2)
ax2.set_title('Diffusion Model', fontsize=13)
ax2.set_xlabel('Step')
ax2.set_ylabel('Loss')
ax2.grid(True, alpha=0.3)

plt.suptitle('Training Curves', fontsize=15)
plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: Todo
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/06_todo.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 5. 🔧 Your Turn: Implement Forward and Reverse Evaluation

Now the key experiment. We test both models on:
- **Forward task:** Given A tokens, predict B tokens (the direction seen during training)
- **Reverse task:** Given B tokens, predict A tokens (the direction *never* seen by the AR model!)
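Both tasks are scored by exact match: a completion counts only if every predicted token equals the target. A minimal sketch of that metric follows; the function name `exact_match_accuracy` and the toy predictions are hypothetical, chosen only to illustrate the scoring rule.

```python
def exact_match_accuracy(predictions, targets):
    """Percentage of examples where every predicted token equals the target."""
    hits = sum(list(p) == list(t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)

# Hypothetical two-token predictions and gold targets.
preds = [[5, 9], [7, 7], [3, 4], [8, 2]]
golds = [[5, 9], [7, 6], [3, 4], [2, 8]]

acc = exact_match_accuracy(preds, golds)
print(f"Exact-match accuracy: {acc:.1f}%")
```

Getting one of the two tokens right earns no credit, and neither does predicting the right tokens in the wrong order, so this is a strict metric.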

### TODO: Implement the diffusion model evaluation

The autoregressive evaluation is provided. Complete the diffusion model evaluation.

In [None]:
@torch.no_grad()
def eval_ar_forward(model, n_tests=NUM_PAIRS):
    """AR model: given [A1, A2, SEP], predict [B1, B2]."""
    correct = 0
    for i in range(n_tests):
        prompt = torch.tensor(
            [[A_tokens[i][0], A_tokens[i][1], SEP_TOKEN, PAD_TOKEN, PAD_TOKEN]],
            dtype=torch.long, device=device
        )
        # Generate autoregressively
        for pos in range(3, 5):  # positions 3 and 4
            logits = model(prompt[:, :pos])
            next_token = logits[0, -1].argmax()
            prompt[0, pos] = next_token
        pred = prompt[0, 3:5].cpu().numpy()
        target = B_tokens[i]
        if np.array_equal(pred, target):
            correct += 1
    return correct / n_tests * 100


@torch.no_grad()
def eval_ar_reverse(model, n_tests=NUM_PAIRS):
    """AR model: given [B1, B2, SEP], predict [A1, A2]. (Reverse!)"""
    correct = 0
    for i in range(n_tests):
        # Present in REVERSE order: B first, then expect A
        prompt = torch.tensor(
            [[B_tokens[i][0], B_tokens[i][1], SEP_TOKEN, PAD_TOKEN, PAD_TOKEN]],
            dtype=torch.long, device=device
        )
        for pos in range(3, 5):
            logits = model(prompt[:, :pos])
            next_token = logits[0, -1].argmax()
            prompt[0, pos] = next_token
        pred = prompt[0, 3:5].cpu().numpy()
        target = A_tokens[i]
        if np.array_equal(pred, target):
            correct += 1
    return correct / n_tests * 100


@torch.no_grad()
def eval_diff_completion(model, source, target, n_steps=10):
    """Diffusion model: given source tokens, predict target tokens.

    Setup: [source1, source2, SEP, MASK, MASK, PAD]
    The model must fill in the MASKed positions.
    """
    # ============ TODO ============
    # Step 1: Build the input sequence with source visible and target masked
    x = torch.tensor(
        [[source[0], source[1], SEP_TOKEN, MASK_TOKEN, MASK_TOKEN, PAD_TOKEN]],
        dtype=torch.long, device=device
    )

    # Step 2: Run iterative unmasking for n_steps
    # At each step:
    #   a) Create the time tensor t = s/n_steps
    #   b) Get logits from model(x, t)
    #   c) For the masked positions (indices 3, 4), pick the argmax prediction
    #   d) Unmask them
    # Since we only have 2 positions to fill, a single step is usually enough,
    # but iterating helps with harder cases.

    for s in range(n_steps, 0, -1):
        t = ???  # YOUR CODE: shape (1, 1), value s/n_steps
        logits = ???  # YOUR CODE: get model predictions
        # For masked positions, take the argmax
        for pos in [3, 4]:
            if x[0, pos] == MASK_TOKEN:
                x[0, pos] = ???  # YOUR CODE: argmax of logits at this position
    # ==============================

    pred = x[0, 3:5].cpu().numpy()
    return np.array_equal(pred, target)

In [None]:
# ✅ Verification
try:
    test_result = eval_diff_completion(diff_model, A_tokens[0], B_tokens[0])
    print(f"✅ Diffusion evaluation works! Test result: {test_result}")
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
#@title 🎧 Listen: Post Todo
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/07_post_todo.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

### Solution

In [None]:
@torch.no_grad()
def eval_diff_completion(model, source, target, n_steps=10):
    """Diffusion model: given source tokens, predict target tokens."""
    x = torch.tensor(
        [[source[0], source[1], SEP_TOKEN, MASK_TOKEN, MASK_TOKEN, PAD_TOKEN]],
        dtype=torch.long, device=device
    )

    for s in range(n_steps, 0, -1):
        t = torch.tensor([[s / n_steps]], device=device, dtype=torch.float)
        logits = model(x, t)
        for pos in [3, 4]:
            if x[0, pos] == MASK_TOKEN:
                x[0, pos] = logits[0, pos].argmax()

    pred = x[0, 3:5].cpu().numpy()
    return np.array_equal(pred, target)


def eval_diff_forward(model, n_tests=NUM_PAIRS):
    """Diffusion: given A, predict B."""
    correct = sum(
        eval_diff_completion(model, A_tokens[i], B_tokens[i])
        for i in range(n_tests)
    )
    return correct / n_tests * 100


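@torch.no_grad()
def eval_diff_reverse(model, n_tests=NUM_PAIRS):
    """Diffusion: given B, predict A. (Reverse!)"""
    correct = 0
    for i in range(n_tests):
        # Keep the trained layout [A1, A2, SEP, B1, B2, PAD]; mask the A slots, leave B in place
        x = torch.tensor(
            [[MASK_TOKEN, MASK_TOKEN, SEP_TOKEN, B_tokens[i][0], B_tokens[i][1], PAD_TOKEN]],
            dtype=torch.long, device=device
        )
        t = torch.ones(1, 1, device=device)  # same starting t as eval_diff_completion
        pred = model(x, t)[0, :2].argmax(dim=-1).cpu().numpy()
        correct += int(np.array_equal(pred, A_tokens[i]))
    return correct / n_tests * 100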

In [None]:
#@title 🎧 Listen: Experiment
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/08_experiment.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 6. Putting It All Together: The Reversal Curse Experiment

In [None]:
print("Evaluating both models on forward and reverse tasks...\n")

# Switch to eval mode so dropout is disabled during evaluation
ar_model.eval()
diff_model.eval()

ar_fwd = eval_ar_forward(ar_model)
ar_rev = eval_ar_reverse(ar_model)
diff_fwd = eval_diff_forward(diff_model)
diff_rev = eval_diff_reverse(diff_model)

print(f"{'Model':<20} {'Forward (A→B)':<18} {'Reverse (B→A)':<18}")
print(f"{'-'*56}")
print(f"{'Autoregressive':<20} {ar_fwd:>10.1f}%       {ar_rev:>10.1f}%")
print(f"{'Diffusion':<20} {diff_fwd:>10.1f}%       {diff_rev:>10.1f}%")

In [None]:
# 📊 The Reversal Curse, Visualized
fig, ax = plt.subplots(figsize=(10, 6))

x_pos = np.array([0, 1.5])
width = 0.5

bars_fwd = ax.bar(x_pos - width/2, [ar_fwd, diff_fwd], width,
                   label='Forward (A → B)', color=['#ef9a9a', '#90caf9'],
                   edgecolor=['#c62828', '#1565c0'], linewidth=2)
bars_rev = ax.bar(x_pos + width/2, [ar_rev, diff_rev], width,
                   label='Reverse (B → A)', color=['#c62828', '#1565c0'],
                   edgecolor=['#c62828', '#1565c0'], linewidth=2)

# Add value labels
for bar in bars_fwd:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1.5,
            f'{bar.get_height():.0f}%', ha='center', fontsize=13, fontweight='bold')
for bar in bars_rev:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1.5,
            f'{bar.get_height():.0f}%', ha='center', fontsize=13, fontweight='bold')

ax.set_xticks(x_pos)
ax.set_xticklabels(['Autoregressive\n(Causal)', 'Diffusion\n(Bidirectional)'], fontsize=13)
ax.set_ylabel('Accuracy (%)', fontsize=13)
ax.set_ylim(0, 110)
ax.set_title('The Reversal Curse: Forward vs Reverse Completion', fontsize=15)
ax.legend(fontsize=12, loc='upper right')
ax.grid(True, axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
#@title 🎧 Listen: Why
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/09_why.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 7. 🎯 Why Does This Happen?

In [None]:
# Let's visualize WHAT the models see during training

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

# Autoregressive attention pattern
attn_ar = np.tril(np.ones((SEQ_LEN, SEQ_LEN)))
im1 = ax1.imshow(attn_ar, cmap='Reds', aspect='equal')
ax1.set_title('Autoregressive: Causal Attention', fontsize=13)
ax1.set_xlabel('Key Position')
ax1.set_ylabel('Query Position')
labels = ['A1', 'A2', 'SEP', 'B1', 'B2', 'PAD']
ax1.set_xticks(range(SEQ_LEN))
ax1.set_yticks(range(SEQ_LEN))
ax1.set_xticklabels(labels)
ax1.set_yticklabels(labels)
ax1.text(0.5, -0.2, 'B can see A ✓,  A cannot see B ✗',
         transform=ax1.transAxes, ha='center', fontsize=11, color='#c62828')

# Diffusion attention pattern (bidirectional)
attn_diff = np.ones((SEQ_LEN, SEQ_LEN))
im2 = ax2.imshow(attn_diff, cmap='Blues', aspect='equal')
ax2.set_title('Diffusion: Bidirectional Attention', fontsize=13)
ax2.set_xlabel('Key Position')
ax2.set_ylabel('Query Position')
ax2.set_xticks(range(SEQ_LEN))
ax2.set_yticks(range(SEQ_LEN))
ax2.set_xticklabels(labels)
ax2.set_yticklabels(labels)
ax2.text(0.5, -0.2, 'Every position sees every other position ✓',
         transform=ax2.transAxes, ha='center', fontsize=11, color='#1565c0')

plt.suptitle('Why the Reversal Curse Exists', fontsize=15)
plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print("   The autoregressive model can only condition B on A (left context).")
print("   The diffusion model conditions any token on all other tokens.")
print("   This is why diffusion models handle reverse completion naturally.")

In [None]:
#@title 🎧 Listen: Closing
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/10_closing.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")

## 8. Reflection and Next Steps

### 🤔 Reflection Questions

1. **Partial reversal:** The autoregressive model might still get a few reverse completions right. Why? (Hint: think about statistical correlations in the vocabulary.)

2. **Training cost:** Both models were trained for the same number of steps. Do you think the diffusion model needs more or fewer steps to learn bidirectional associations? Why?

3. **Real-world implications:** The LLaDA paper tested this on Chinese poetry couplets. LLaDA scored 42% on the reversal task vs. GPT-4o's 32%. Why might the gap be smaller in large models?

### 🏆 Optional Challenges

1. **Asymmetric pairs:** Create pairs where A→B is easy but B→A is ambiguous (e.g., multiple A values map to the same B). How does each model handle this?

2. **Longer sequences:** Increase the sequence length. At what point does the diffusion model's advantage over the AR model grow or shrink?

3. **Middle completion:** Instead of reverse, mask out the *middle* tokens and test both models. The AR model should also fail here. Can you show it?
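For the asymmetric-pairs challenge, it helps to know the best score any deterministic model could reach before blaming the architecture. The sketch below uses a hypothetical many-to-one pair list (not the notebook's `A_tokens` / `B_tokens`) and computes that ceiling: when several A values share one B, a single-answer predictor can be right for at most one of them.

```python
from collections import defaultdict

# Hypothetical many-to-one pairs: three different A values share the same B.
toy_pairs = [
    ((4, 5), (20, 21)),
    ((6, 7), (20, 21)),
    ((8, 9), (20, 21)),
    ((10, 11), (22, 23)),
    ((12, 13), (24, 25)),
]

# Group the A values by the B they map to.
b_to_as = defaultdict(list)
for a, b in toy_pairs:
    b_to_as[b].append(a)

# Reverse: one guess per distinct B, so at most one hit per distinct B.
best_reverse_hits = len(b_to_as)
reverse_ceiling = 100.0 * best_reverse_hits / len(toy_pairs)

# Forward: every A in this toy set maps to exactly one B, so 100% is reachable.
forward_ceiling = 100.0

print(f"Forward ceiling: {forward_ceiling:.0f}%")
print(f"Reverse ceiling: {reverse_ceiling:.0f}%")
```

Comparing each model's measured reverse accuracy against this ceiling separates errors caused by ambiguity in the data from errors caused by the direction of the model.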

**Next notebook:** We will explore different **sampling strategies**: varying the number of denoising steps, comparing remasking methods, and benchmarking speed.

In [None]:
#@title 💬 AI Teaching Assistant: Click ▶ to start
#@markdown This AI chatbot reads your notebook and can answer questions about any concept, code, or exercise.

import json as _json
import requests as _requests
from google.colab import output as _output
from IPython.display import display, HTML as _HTML, Markdown as _Markdown

# --- Read notebook content for context ---
def _get_notebook_context():
    try:
        from google.colab import _message
        nb = _message.blocking_request("get_ipynb", request="", timeout_sec=10)
        cells = nb.get("ipynb", {}).get("cells", [])
        parts = []
        for cell in cells:
            src = "".join(cell.get("source", []))
            tags = cell.get("metadata", {}).get("tags", [])
            if "chatbot" in tags:
                continue
            if src.strip():
                ct = cell.get("cell_type", "unknown")
                parts.append(f"[{ct.upper()}]\n{src}")
        return "\n\n---\n\n".join(parts)
    except Exception:
        return "Notebook content unavailable."

_NOTEBOOK_CONTEXT = _get_notebook_context()
_CHAT_HISTORY = []
_API_URL = "https://course-creator-brown.vercel.app/api/chat"

def _notebook_chat(question):
    global _CHAT_HISTORY
    try:
        resp = _requests.post(_API_URL, json={
            'question': question,
            'context': _NOTEBOOK_CONTEXT[:100000],
            'history': _CHAT_HISTORY[-10:],
        }, timeout=60)
        data = resp.json()
        answer = data.get('answer', 'Sorry, I could not generate a response.')
        _CHAT_HISTORY.append({'role': 'user', 'content': question})
        _CHAT_HISTORY.append({'role': 'assistant', 'content': answer})
        return answer
    except Exception as e:
        return f'Error connecting to teaching assistant: {str(e)}'

_output.register_callback('notebook_chat', _notebook_chat)

def ask(question):
    """Ask the AI teaching assistant a question about this notebook."""
    answer = _notebook_chat(question)
    display(_Markdown(answer))

print("\u2705 AI Teaching Assistant is ready!")
print("\U0001f4a1 Use the chat below, or call ask(\'your question\') in any cell.")

# --- Display chat widget ---
display(_HTML('''<style>
  .vc-wrap{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,sans-serif;max-width:100%;border-radius:16px;overflow:hidden;box-shadow:0 4px 24px rgba(0,0,0,.12);background:#fff;border:1px solid #e5e7eb}
  .vc-hdr{background:linear-gradient(135deg,#667eea 0%,#764ba2 100%);color:#fff;padding:16px 20px;display:flex;align-items:center;gap:12px}
  .vc-avatar{width:42px;height:42px;background:rgba(255,255,255,.2);border-radius:50%;display:flex;align-items:center;justify-content:center;font-size:22px}
  .vc-hdr h3{font-size:16px;font-weight:600;margin:0}
  .vc-hdr p{font-size:12px;opacity:.85;margin:2px 0 0}
  .vc-msgs{height:420px;overflow-y:auto;padding:16px;background:#f8f9fb;display:flex;flex-direction:column;gap:10px}
  .vc-msg{display:flex;flex-direction:column;animation:vc-fade .25s ease}
  .vc-msg.user{align-items:flex-end}
  .vc-msg.bot{align-items:flex-start}
  .vc-bbl{max-width:85%;padding:10px 14px;border-radius:16px;font-size:14px;line-height:1.55;word-wrap:break-word}
  .vc-msg.user .vc-bbl{background:linear-gradient(135deg,#667eea 0%,#764ba2 100%);color:#fff;border-bottom-right-radius:4px}
  .vc-msg.bot .vc-bbl{background:#fff;color:#1a1a2e;border:1px solid #e8e8e8;border-bottom-left-radius:4px}
  .vc-bbl code{background:rgba(0,0,0,.07);padding:2px 6px;border-radius:4px;font-size:13px;font-family:'Fira Code',monospace}
  .vc-bbl pre{background:#1e1e2e;color:#cdd6f4;padding:12px;border-radius:8px;overflow-x:auto;margin:8px 0;font-size:13px}
  .vc-bbl pre code{background:none;padding:0;color:inherit}
  .vc-bbl h3,.vc-bbl h4{margin:10px 0 4px;font-size:15px}
  .vc-bbl ul,.vc-bbl ol{margin:4px 0;padding-left:20px}
  .vc-bbl li{margin:2px 0}
  .vc-chips{display:flex;flex-wrap:wrap;gap:8px;padding:0 16px 12px;background:#f8f9fb}
  .vc-chip{background:#fff;border:1px solid #d1d5db;border-radius:20px;padding:6px 14px;font-size:12px;cursor:pointer;transition:all .15s;color:#4b5563}
  .vc-chip:hover{border-color:#667eea;color:#667eea;background:#f0f0ff}
  .vc-input{display:flex;padding:12px 16px;background:#fff;border-top:1px solid #eee;gap:8px}
  .vc-input input{flex:1;padding:10px 16px;border:2px solid #e8e8e8;border-radius:24px;font-size:14px;outline:none;transition:border-color .2s}
  .vc-input input:focus{border-color:#667eea}
  .vc-input button{background:linear-gradient(135deg,#667eea 0%,#764ba2 100%);color:#fff;border:none;border-radius:50%;width:42px;height:42px;cursor:pointer;display:flex;align-items:center;justify-content:center;font-size:18px;transition:transform .1s}
  .vc-input button:hover{transform:scale(1.05)}
  .vc-input button:disabled{opacity:.5;cursor:not-allowed;transform:none}
  .vc-typing{display:flex;gap:5px;padding:4px 0}
  .vc-typing span{width:8px;height:8px;background:#667eea;border-radius:50%;animation:vc-bounce 1.4s infinite ease-in-out}
  .vc-typing span:nth-child(2){animation-delay:.2s}
  .vc-typing span:nth-child(3){animation-delay:.4s}
  @keyframes vc-bounce{0%,80%,100%{transform:scale(0)}40%{transform:scale(1)}}
  @keyframes vc-fade{from{opacity:0;transform:translateY(8px)}to{opacity:1;transform:translateY(0)}}
  .vc-note{text-align:center;font-size:11px;color:#9ca3af;padding:8px 16px 12px;background:#fff}
</style>
<div class="vc-wrap">
  <div class="vc-hdr">
    <div class="vc-avatar">&#129302;</div>
    <div>
      <h3>Vizuara Teaching Assistant</h3>
      <p>Ask me anything about this notebook</p>
    </div>
  </div>
  <div class="vc-msgs" id="vcMsgs">
    <div class="vc-msg bot">
      <div class="vc-bbl">&#128075; Hi! I've read through this entire notebook. Ask me about any concept, code block, or exercise &mdash; I'm here to help you learn!</div>
    </div>
  </div>
  <div class="vc-chips" id="vcChips">
    <span class="vc-chip" onclick="vcAsk(this.textContent)">Explain the main concept</span>
    <span class="vc-chip" onclick="vcAsk(this.textContent)">Help with the TODO exercise</span>
    <span class="vc-chip" onclick="vcAsk(this.textContent)">Summarize what I learned</span>
  </div>
  <div class="vc-input">
    <input type="text" id="vcIn" placeholder="Ask about concepts, code, exercises..." />
    <button id="vcSend" onclick="vcSendMsg()">&#10148;</button>
  </div>
  <div class="vc-note">AI-generated &middot; Verify important information &middot; <a href="#" onclick="vcClear();return false" style="color:#667eea">Clear chat</a></div>
</div>
<script>
(function(){
  var msgs=document.getElementById('vcMsgs'),inp=document.getElementById('vcIn'),
      btn=document.getElementById('vcSend'),chips=document.getElementById('vcChips');

  function esc(s){var d=document.createElement('div');d.textContent=s;return d.innerHTML}

  function md(t){
    return t
      .replace(/```(\w*)\n([\s\S]*?)```/g,function(_,l,c){return '<pre><code>'+esc(c)+'</code></pre>'})
      .replace(/`([^`]+)`/g,'<code>$1</code>')
      .replace(/\*\*([^*]+)\*\*/g,'<strong>$1</strong>')
      .replace(/\*([^*]+)\*/g,'<em>$1</em>')
      .replace(/^#### (.+)$/gm,'<h4>$1</h4>')
      .replace(/^### (.+)$/gm,'<h4>$1</h4>')
      .replace(/^## (.+)$/gm,'<h3>$1</h3>')
      .replace(/^\d+\. (.+)$/gm,'<li>$1</li>')
      .replace(/^- (.+)$/gm,'<li>$1</li>')
      .replace(/\n\n/g,'<br><br>')
      .replace(/\n/g,'<br>');
  }

  function addMsg(text,isUser){
    var m=document.createElement('div');m.className='vc-msg '+(isUser?'user':'bot');
    var b=document.createElement('div');b.className='vc-bbl';
    b.innerHTML=isUser?esc(text):md(text);
    m.appendChild(b);msgs.appendChild(m);msgs.scrollTop=msgs.scrollHeight;
  }

  function showTyping(){
    var m=document.createElement('div');m.className='vc-msg bot';m.id='vcTyping';
    m.innerHTML='<div class="vc-bbl"><div class="vc-typing"><span></span><span></span><span></span></div></div>';
    msgs.appendChild(m);msgs.scrollTop=msgs.scrollHeight;
  }

  function hideTyping(){var e=document.getElementById('vcTyping');if(e)e.remove()}

  window.vcSendMsg=function(){
    var q=inp.value.trim();if(!q)return;
    inp.value='';chips.style.display='none';
    addMsg(q,true);showTyping();btn.disabled=true;
    google.colab.kernel.invokeFunction('notebook_chat',[q],{})
      .then(function(r){
        hideTyping();
        var a=r.data['application/json'];
        addMsg(typeof a==='string'?a:JSON.stringify(a),false);
      })
      .catch(function(){
        hideTyping();
        addMsg('Sorry, I encountered an error. Please check your internet connection and try again.',false);
      })
      .finally(function(){btn.disabled=false;inp.focus()});
  };

  window.vcAsk=function(q){inp.value=q;vcSendMsg()};
  window.vcClear=function(){
    msgs.innerHTML='<div class="vc-msg bot"><div class="vc-bbl">&#128075; Chat cleared. Ask me anything!</div></div>';
    chips.style.display='flex';
  };

  inp.addEventListener('keypress',function(e){if(e.key==='Enter')vcSendMsg()});
  inp.focus();
})();
</script>'''))