In [None]:
# Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"[OK] GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("[WARNING] No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n Python {sys.version.split()[0]}")
print(f" PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f" Random seed set to {SEED}")

%matplotlib inline

# Case Study: Real-Time Code Completion with Diffusion Language Models
## Implementation Notebook

In this notebook, you will build a masked diffusion language model for code completion from scratch. The model replaces autoregressive token-by-token generation with parallel iterative unmasking: each forward pass can commit several tokens at once, so a completion needs far fewer sequential model calls. The goal is lower inference latency at comparable code quality.

**Context:** You are an ML engineer at Velocode, a developer tools startup. Your current autoregressive code completion model has 350ms median latency, which is too slow for enterprise customers demanding sub-100ms suggestions. Your task is to prototype a diffusion-based alternative using masked diffusion, which generates all tokens in parallel through iterative refinement.

**What you will build:**
- A BPE tokenizer trained on real Python code
- A baseline bigram model for comparison
- A bidirectional Transformer with timestep conditioning (the diffusion model)
- Training and evaluation pipelines
- Inference benchmarking and error analysis

**Prerequisites:** Familiarity with PyTorch, Transformers, and basic probability. Understanding of BERT's masked language modeling is helpful but not required.

---

## 3.1 Data Acquisition and Preprocessing

We use the CodeSearchNet dataset, a curated collection of 2 million functions from open-source GitHub repositories. We focus on the Python subset (~450K functions), which provides realistic, production-quality code for training.

**Why CodeSearchNet?** It contains real functions from popular repositories, each with a docstring. The functions are self-contained, making them ideal for training a code completion model. The dataset is permissively licensed and freely available via HuggingFace.

In [None]:
# Install dependencies
!pip install -q datasets tokenizers torch matplotlib numpy tqdm

In [None]:
import os
import random
import time
import ast
import json
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

In [None]:
from datasets import load_dataset

# Load the Python subset of CodeSearchNet
dataset = load_dataset("code_search_net", "python", split="train", trust_remote_code=True)
print(f"Total functions: {len(dataset):,}")
print(f"Example:\n{dataset[0]['func_code_string'][:300]}")

### Training a BPE Tokenizer

We train a Byte-Pair Encoding tokenizer on our code corpus. BPE learns subword units that balance vocabulary size with token efficiency, which matters for code, where keywords, operators, and indentation patterns are frequent.
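
To make the idea of "learning subword units" concrete, here is a minimal pure-Python sketch of a single BPE merge step on a toy corpus. The helper names `most_frequent_pair` and `merge_pair` are hypothetical and are not part of the `tokenizers` library, which implements the same idea far more efficiently and at the byte level.

```python
from collections import Counter


def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Return the adjacent symbol pair with the highest corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Break ties on the pair itself so the result is deterministic
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]) -> dict:
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: word (as a tuple of characters) -> frequency
toy_corpus = {tuple("def"): 5, tuple("del"): 2, tuple("self"): 4}
best = most_frequent_pair(toy_corpus)
toy_corpus_after = merge_pair(toy_corpus, best)
print(best)
print(toy_corpus_after)
```

A real BPE trainer repeats this step until the vocabulary reaches the target size, so frequent fragments such as `def` or `self` end up as single tokens.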

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import ByteLevel as ByteLevelProcessor
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Train a BPE tokenizer on the code corpus
tokenizer = Tokenizer(BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.post_processor = ByteLevelProcessor(trim_offsets=False)
# Without a byte-level decoder, decode() would leave markers such as "Ġ" in the text
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=8192,
    special_tokens=["<PAD>", "<MASK>", "<BOS>", "<EOS>", "<UNK>"],
    min_frequency=2,
)

# Use a subset for faster training in Colab
train_texts = [dataset[i]["func_code_string"] for i in range(min(100000, len(dataset)))]
tokenizer.train_from_iterator(train_texts, trainer=trainer)

MASK_TOKEN_ID = tokenizer.token_to_id("<MASK>")
PAD_TOKEN_ID = tokenizer.token_to_id("<PAD>")
VOCAB_SIZE = tokenizer.get_vocab_size()
print(f"Vocabulary size: {VOCAB_SIZE}")
print(f"MASK token ID: {MASK_TOKEN_ID}")

### Data Augmentation: Creating Training Samples

Each training sample consists of a prefix (context before the cursor), a completion region (what the model must generate), and a suffix (context after the cursor). This simulates the real code completion scenario.
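
The three-way split can be illustrated with a small deterministic sketch. It takes the span start and length as explicit arguments (the exercise below samples them randomly), and the name `split_at_span` is hypothetical.

```python
def split_at_span(tokens: list[int], start: int, length: int) -> dict:
    """Split `tokens` into prefix / completion / suffix around a span."""
    # Clamp the end so the span never runs past the sequence
    end = min(start + length, len(tokens))
    return {
        "prefix": tokens[:start],
        "completion": tokens[start:end],
        "suffix": tokens[end:],
        "full_sequence": tokens,
    }


toy_tokens = list(range(20))
parts = split_at_span(toy_tokens, start=5, length=6)
print(parts["prefix"])      # tokens before the cursor
print(parts["completion"])  # tokens the model must generate
print(parts["suffix"])      # tokens after the cursor
```

Concatenating the three parts always reproduces the original sequence, which is the invariant the verification cell checks.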

**TODO:** Implement the function that creates training samples by selecting random spans.

In [None]:
def augment_code_sample(tokens: list[int], mask_token_id: int) -> dict:
    """
    Create a training sample by randomly selecting a contiguous span
    within the token sequence to serve as the 'completion region.'

    The function should:
    1. Randomly select a span start position and span length
       (span length between 10% and 50% of the sequence length)
    2. Split the sequence into prefix, completion, and suffix
    3. Return a dictionary with keys:
       - 'prefix': tokens before the span
       - 'completion': the original tokens in the span (ground truth)
       - 'suffix': tokens after the span
       - 'full_sequence': the complete token sequence

    Hint: Use random.randint for start position. The span length
    should be sampled uniformly between 0.1 * seq_len and 0.5 * seq_len.
    Make sure the span does not extend beyond the sequence.

    Args:
        tokens: List of token IDs for a complete function
        mask_token_id: The ID of the [MASK] token

    Returns:
        Dictionary with prefix, completion, suffix, and full_sequence
    """
    # TODO: Implement this function
    pass

In [None]:
# Verification: test your augmentation function
sample_tokens = list(range(100)) # dummy tokens 0-99
result = augment_code_sample(sample_tokens, MASK_TOKEN_ID)
assert "prefix" in result and "completion" in result and "suffix" in result
assert len(result["prefix"]) + len(result["completion"]) + len(result["suffix"]) == 100
assert 10 <= len(result["completion"]) <= 50, f"Completion length {len(result['completion'])} out of range"
print(f"Prefix length: {len(result['prefix'])}")
print(f"Completion length: {len(result['completion'])}")
print(f"Suffix length: {len(result['suffix'])}")
print("Augmentation function verified.")

### Building the Dataset and DataLoader

In [None]:
class CodeCompletionDataset(Dataset):
    """Dataset for code completion training."""

    def __init__(self, code_strings, tokenizer, max_len=512, mask_token_id=1):
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.mask_token_id = mask_token_id

        # Tokenize all functions and filter by length
        self.samples = []
        for code in tqdm(code_strings, desc="Tokenizing"):
            encoded = tokenizer.encode(code)
            ids = encoded.ids
            if 50 <= len(ids) <= max_len:
                self.samples.append(ids)
        print(f"Retained {len(self.samples):,} functions (50-{max_len} tokens)")

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        tokens = self.samples[idx]
        sample = augment_code_sample(tokens, self.mask_token_id)
        return sample


def collate_fn(batch):
    """Collate variable-length samples into padded tensors."""
    max_prefix = max(len(s["prefix"]) for s in batch)
    max_comp = max(len(s["completion"]) for s in batch)
    max_suffix = max(len(s["suffix"]) for s in batch)

    prefixes, completions, suffixes = [], [], []
    for s in batch:
        prefixes.append(s["prefix"] + [PAD_TOKEN_ID] * (max_prefix - len(s["prefix"])))
        completions.append(s["completion"] + [PAD_TOKEN_ID] * (max_comp - len(s["completion"])))
        suffixes.append(s["suffix"] + [PAD_TOKEN_ID] * (max_suffix - len(s["suffix"])))

    return {
        "prefix": torch.tensor(prefixes, dtype=torch.long),
        "completion": torch.tensor(completions, dtype=torch.long),
        "suffix": torch.tensor(suffixes, dtype=torch.long),
        "comp_lengths": torch.tensor([len(s["completion"]) for s in batch]),
    }

In [None]:
# Build datasets
all_codes = [dataset[i]["func_code_string"] for i in range(min(50000, len(dataset)))]
random.shuffle(all_codes)

train_codes = all_codes[:40000]
val_codes = all_codes[40000:45000]
test_codes = all_codes[45000:]

train_dataset = CodeCompletionDataset(train_codes, tokenizer, max_len=512, mask_token_id=MASK_TOKEN_ID)
val_dataset = CodeCompletionDataset(val_codes, tokenizer, max_len=512, mask_token_id=MASK_TOKEN_ID)
test_dataset = CodeCompletionDataset(test_codes, tokenizer, max_len=512, mask_token_id=MASK_TOKEN_ID)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)

print(f"Train: {len(train_dataset):,} | Val: {len(val_dataset):,} | Test: {len(test_dataset):,}")

---

## 3.2 Exploratory Data Analysis

Before building models, we need to understand our data. The distributions of token lengths, vocabulary frequencies, and code structure will inform our design choices.

**TODO:** Implement the token length distribution plot and answer the thought questions below.

In [None]:
def plot_token_length_distribution(dataset, tokenizer, max_length: int = 1024) -> None:
    """
    Plot a histogram of token sequence lengths for all functions in the dataset.

    Steps:
    1. Tokenize each function in the dataset using the provided tokenizer
    2. Record the length (number of tokens) for each function
    3. Plot a histogram with 50 bins, x-axis 'Token count', y-axis 'Number of functions'
    4. Add vertical lines at the 25th, 50th, and 75th percentiles
    5. Print the mean, median, and standard deviation of token lengths

    Hint: Use matplotlib for plotting. Use numpy for percentile calculations.

    Args:
        dataset: HuggingFace dataset with 'func_code_string' field
        tokenizer: Trained BPE tokenizer
        max_length: Maximum length to display on x-axis
    """
    # TODO: Implement this function
    pass

In [None]:
# Run your EDA function
plot_token_length_distribution(dataset, tokenizer)

In [None]:
# Additional EDA: vocabulary frequency distribution
def plot_vocab_frequency(dataset, tokenizer, top_k: int = 50):
    """
    Plot the token frequency distribution (log scale).

    Steps:
    1. Tokenize a sample of 10,000 functions
    2. Count the frequency of each token ID across the corpus
    3. Plot a log-log frequency rank plot (x: rank, y: frequency)
    4. Also print the top-k most common tokens (decoded back to strings)

    This reveals whether Zipf's law holds for code tokens and which
    tokens dominate the distribution.
    """
    # TODO: Implement this function
    pass

plot_vocab_frequency(dataset, tokenizer)

**Thought questions:**
1. Why does the token length distribution have a long right tail? What does this imply for choosing a maximum sequence length?
2. If 5% of your functions exceed 512 tokens, what are the tradeoffs between truncating them, splitting them, and discarding them?
3. How does the vocabulary frequency distribution for code compare to natural language? What does this tell you about the information density of code?
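
For thought question 2, it can help to quantify what each policy keeps. The sketch below uses made-up sequence lengths and a hypothetical helper `apply_length_policy`; it only counts tokens and does not model the loss of context that splitting or truncating causes.

```python
def apply_length_policy(lengths: list[int], max_len: int, policy: str) -> list[int]:
    """Return the lengths of the sequences that survive under `policy`."""
    kept = []
    for n in lengths:
        if n <= max_len:
            kept.append(n)
        elif policy == "truncate":
            kept.append(max_len)  # keep only the first max_len tokens
        elif policy == "split":
            full, rest = divmod(n, max_len)
            kept.extend([max_len] * full + ([rest] if rest else []))
        elif policy == "discard":
            continue  # drop the whole function
        else:
            raise ValueError(f"unknown policy: {policy}")
    return kept


toy_lengths = [120, 300, 512, 700, 1300]  # illustrative values only
policy_summary = {}
for policy in ("truncate", "split", "discard"):
    kept = apply_length_policy(toy_lengths, 512, policy)
    policy_summary[policy] = (len(kept), sum(kept))
    print(f"{policy:9s} sequences={len(kept)} tokens={sum(kept)}")
```

Splitting keeps every token but produces chunks that start mid-function, truncation keeps function headers but drops tails, and discarding keeps only complete functions at the cost of data.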

---

## 3.3 Baseline: Bigram Language Model

Before building the diffusion model, we implement a simple bigram model as a baseline. The bigram model predicts each token based only on the immediately preceding token. This gives us a performance floor to beat.
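
As a warm-up, here is a toy sketch of add-k smoothed bigram probabilities and perplexity on a four-token vocabulary. It uses sparse counters rather than the dense count matrix the exercise asks for, and the names `fit_bigram` and `perplexity` are hypothetical.

```python
import math
from collections import Counter


def fit_bigram(sequences: list[list[int]], vocab_size: int, k: float = 0.01):
    """Count token pairs and return a function computing P(cur | prev)."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(cur: int, prev: int) -> float:
        # Add-k smoothing: unseen pairs get a small non-zero probability
        return (pair_counts[(prev, cur)] + k) / (context_counts[prev] + k * vocab_size)

    return prob


def perplexity(prob, sequences: list[list[int]]) -> float:
    """exp of the average negative log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            log_sum += math.log(prob(cur, prev))
            n += 1
    return math.exp(-log_sum / n)


toy_train = [[0, 1, 2, 3], [0, 1, 2, 0]]
toy_prob = fit_bigram(toy_train, vocab_size=4)
seen_ppl = perplexity(toy_prob, toy_train)
unseen_ppl = perplexity(toy_prob, [[3, 2, 1, 0]])
print(f"seen: {seen_ppl:.2f}  unseen: {unseen_ppl:.2f}")
```

Sequences made of pairs seen during fitting score a much lower perplexity than sequences made of unseen pairs, which is the same train-versus-validation gap the verification cell asserts.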

**TODO:** Implement the bigram model and evaluate it.

In [None]:
class BigramCodeModel:
    """
    A bigram language model for code completion.

    The model estimates P(token_i | token_{i-1}) from the training corpus
    using simple counting with add-k smoothing.
    """

    def __init__(self, vocab_size: int, smoothing: float = 0.01):
        """
        Initialize the bigram count matrix.

        Args:
            vocab_size: Size of the token vocabulary
            smoothing: Laplace smoothing parameter (add-k)
        """
        # TODO: Initialize a (vocab_size x vocab_size) count matrix
        # and a total count vector
        pass

    def fit(self, token_sequences: list[list[int]]) -> None:
        """
        Fit the bigram model by counting token pairs in the training data.

        Steps:
        1. For each sequence, iterate over consecutive token pairs (t_{i-1}, t_i)
        2. Increment the count matrix at position [t_{i-1}, t_i]
        3. After counting, convert counts to probabilities using add-k smoothing:
           P(t_i | t_{i-1}) = (count[t_{i-1}, t_i] + k) / (total[t_{i-1}] + k * V)

        Args:
            token_sequences: List of tokenized code sequences
        """
        # TODO: Implement bigram counting and probability estimation
        pass

    def predict_next(self, context_token: int) -> np.ndarray:
        """
        Predict the probability distribution over next tokens given a context token.

        Args:
            context_token: The preceding token ID

        Returns:
            Probability distribution over vocabulary (shape: [vocab_size])
        """
        # TODO: Return the row of the probability matrix for the context token
        pass

    def evaluate_perplexity(self, token_sequences: list[list[int]]) -> float:
        """
        Compute perplexity of the model on a set of sequences.

        Perplexity = exp(-1/N * sum(log P(t_i | t_{i-1})))
        where N is the total number of predicted tokens.

        Hint: Use numpy for log calculations. Handle the case where
        a probability is very small (add a floor of 1e-10 to avoid log(0)).

        Args:
            token_sequences: List of tokenized code sequences

        Returns:
            Perplexity (float). Lower is better.
        """
        # TODO: Implement perplexity calculation
        pass

In [None]:
# Fit and evaluate the bigram baseline
train_token_sequences = [train_dataset.samples[i] for i in range(min(5000, len(train_dataset)))]
val_token_sequences = [val_dataset.samples[i] for i in range(min(1000, len(val_dataset)))]

bigram = BigramCodeModel(vocab_size=VOCAB_SIZE)
bigram.fit(train_token_sequences)

train_ppl = bigram.evaluate_perplexity(train_token_sequences[:100])
val_ppl = bigram.evaluate_perplexity(val_token_sequences[:100])
print(f"Bigram train perplexity: {train_ppl:.1f}")
print(f"Bigram validation perplexity: {val_ppl:.1f}")

assert train_ppl < val_ppl, "Train perplexity should be lower than validation"
assert val_ppl < 10000, "Perplexity seems too high: check your implementation"
print("Baseline implementation verified.")

---

## 3.4 Model Design: Bidirectional Transformer with Timestep Conditioning

Now we build the core diffusion model. This is a Transformer encoder (like BERT, not GPT) with one critical addition: a timestep embedding that tells the model the current noise level.

**Key design decisions:**
- **Bidirectional attention (no causal mask):** Every token can attend to every other token. Unlike GPT, the model sees context on both sides of each position, which is what enables fill-in-the-middle natively.
- **Timestep conditioning:** The model must know what fraction of tokens are masked to calibrate its confidence.
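
Timestep conditioning needs a vector representation of the scalar $t$. The following is a minimal pure-Python sketch of sinusoidal features, using the same frequency formula as standard positional encodings. The function name `sinusoidal_features` and the tiny `d_model=8` are illustrative assumptions; the exercise below asks for a PyTorch module that also passes these features through an MLP.

```python
import math


def sinusoidal_features(t: float, d_model: int = 8) -> list[float]:
    """Map a scalar timestep to d_model sin/cos features."""
    # Frequencies decay geometrically from 1 down towards 1/10000
    freqs = [math.exp(-math.log(10000.0) * i / d_model) for i in range(0, d_model, 2)]
    angles = [t * f for f in freqs]
    # First half sines, second half cosines
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


for t in (0.0, 0.2, 0.8):
    print(t, [round(v, 3) for v in sinusoidal_features(t)])
```

With $t$ restricted to $[0, 1]$, the low-frequency components change very little across timesteps. Some implementations multiply $t$ by a constant before computing the features to spread them out; that is a common trick rather than something this notebook prescribes.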

**TODO:** Implement the timestep embedding and the full diffusion model.

In [None]:
import math


class TimestepEmbedding(nn.Module):
    """
    Embeds a scalar timestep t into a d_model-dimensional vector using
    a small MLP with sinusoidal features.

    Architecture:
    1. Map t -> sinusoidal features (like positional encoding, but for time)
    2. Linear(d_model, d_model) -> SiLU -> Linear(d_model, d_model)

    The sinusoidal features ensure the model can distinguish nearby timesteps.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # TODO: Initialize the sinusoidal frequency table and MLP layers
        # Hint: Use torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        # for the frequency table. The MLP should be:
        # Linear(d_model, d_model) -> SiLU -> Linear(d_model, d_model)
        pass

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """
        Args:
            t: Timestep values, shape (batch_size, 1)
        Returns:
            Timestep embeddings, shape (batch_size, d_model)
        """
        # TODO: Compute sinusoidal features from t, then pass through MLP
        # Step 1: t * frequencies -> shape (batch_size, d_model // 2)
        # Step 2: Concatenate sin and cos -> shape (batch_size, d_model)
        # Step 3: Pass through MLP
        pass

In [None]:
class DiffusionCodeLM(nn.Module):
    """
    A masked diffusion language model for code completion.

    Architecture:
    - Token embedding + positional embedding + timestep embedding
    - N Transformer encoder layers (bidirectional attention)
    - Linear output head to vocabulary logits
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 256,
        n_heads: int = 8,
        n_layers: int = 6,
        d_ff: int = 1024,
        max_seq_len: int = 512,
        dropout: float = 0.1,
    ):
        super().__init__()
        # TODO: Initialize all layers:
        # 1. self.token_embed = nn.Embedding(vocab_size, d_model)
        # 2. self.pos_embed = nn.Embedding(max_seq_len, d_model)
        # 3. self.time_embed = TimestepEmbedding(d_model)
        # 4. self.transformer = nn.TransformerEncoder(...)
        #    Use nn.TransformerEncoderLayer with d_model, n_heads, d_ff,
        #    dropout, batch_first=True, and norm_first=True (Pre-LN)
        # 5. self.output_head = nn.Linear(d_model, vocab_size)
        # 6. self.dropout = nn.Dropout(dropout)
        #
        # Hint: Do NOT pass a mask to the TransformerEncoder, because we want
        # bidirectional attention (every token sees every other token).
        pass

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the diffusion model.

        Args:
            x_t: Partially masked token IDs, shape (batch_size, seq_len)
            t: Timestep, shape (batch_size, 1)

        Returns:
            Logits over vocabulary, shape (batch_size, seq_len, vocab_size)

        Steps:
        1. Compute token embeddings from x_t
        2. Add positional embeddings (positions 0, 1, ..., seq_len-1)
        3. Compute timestep embedding from t and ADD it to every position
        4. Apply dropout
        5. Pass through the Transformer encoder (no mask argument!)
        6. Project to vocabulary logits via the output head
        """
        # TODO: Implement the forward pass following the steps above
        pass

In [None]:
# Verification: test the model with dummy inputs
model = DiffusionCodeLM(vocab_size=VOCAB_SIZE).to(device)
dummy_tokens = torch.randint(0, VOCAB_SIZE, (4, 128)).to(device)
dummy_t = torch.rand(4, 1).to(device)
logits = model(dummy_tokens, dummy_t)

assert logits.shape == (4, 128, VOCAB_SIZE), f"Expected (4, 128, {VOCAB_SIZE}), got {logits.shape}"
n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params:,}")
print(f"Output shape: {logits.shape}")
print("Model architecture verified.")

---

## 3.5 Training: Forward Process and Training Loop

The training procedure has two parts: the forward process (masking) and the training loop.

**The forward process** takes clean tokens and a timestep $t$, and randomly masks tokens with probability $t$. This is the "noising" step of diffusion, but instead of adding Gaussian noise, we replace tokens with [MASK].
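
A plain-Python sketch of this masking step on a toy sequence is shown below. It is not the tensor version the exercise asks for; the name `mask_tokens` and the mask ID of 1 are assumptions made for illustration.

```python
import random


def mask_tokens(tokens: list[int], t: float, mask_id: int, rng: random.Random):
    """Independently replace each token with mask_id with probability t."""
    is_masked = [rng.random() < t for _ in tokens]
    noised = [mask_id if m else tok for tok, m in zip(tokens, is_masked)]
    return noised, is_masked


rng = random.Random(0)
toy_tokens = list(range(100, 1100))  # 1000 dummy token IDs, none equal to mask_id
for t in (0.0, 0.25, 0.75, 1.0):
    noised, is_masked = mask_tokens(toy_tokens, t, mask_id=1, rng=rng)
    print(f"t={t:.2f} -> masked fraction {sum(is_masked) / len(toy_tokens):.3f}")
```

At $t = 0$ the sequence is untouched, at $t = 1$ every token is masked, and in between the masked fraction concentrates around $t$ as the sequence gets longer.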

**The training loop** samples random timesteps, applies the forward process, runs the model, and computes cross-entropy loss only on the masked positions.
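
The phrase "only on the masked positions" hides an indexing detail: completion position `j` sits at index `prefix_len + j` in the concatenated input. The toy sketch below makes that explicit with hand-written probability tables instead of model outputs. The name `masked_nll` and all numbers are illustrative assumptions.

```python
import math


def masked_nll(probs_full: list[dict], targets: list[int],
               is_masked: list[bool], prefix_len: int) -> float:
    """Average negative log-likelihood over masked completion positions.

    probs_full[i] maps token ID -> predicted probability at position i
    of the concatenated [prefix, completion, suffix] sequence.
    """
    losses = []
    for j, (target, masked) in enumerate(zip(targets, is_masked)):
        if masked:
            # Offset by the prefix length to index into the full sequence
            losses.append(-math.log(probs_full[prefix_len + j][target]))
    return sum(losses) / len(losses) if losses else 0.0


# Full sequence: 2 prefix positions, 3 completion positions, 1 suffix position
toy_probs = [{}, {}, {7: 0.5}, {8: 0.01}, {9: 0.25}, {}]
toy_loss = masked_nll(toy_probs, targets=[7, 8, 9],
                      is_masked=[True, False, True], prefix_len=2)
print(f"masked loss: {toy_loss:.4f}")
```

The poor prediction at the unmasked middle position (probability 0.01) does not affect the loss, because the model could simply copy that token from its input.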

**TODO:** Implement the forward process and the training loop.

In [None]:
def forward_process(x_completion: torch.Tensor, t: torch.Tensor, mask_token_id: int) -> tuple:
    """
    Apply the forward (masking) process to the completion region.

    For each token in x_completion, independently mask it with probability t.

    Args:
        x_completion: Clean tokens in the completion region, shape (batch_size, comp_len)
        t: Masking probability for each example, shape (batch_size, 1)
        mask_token_id: ID of the [MASK] token

    Returns:
        x_t: Masked tokens, shape (batch_size, comp_len)
        mask: Boolean mask indicating which positions were masked, shape (batch_size, comp_len)

    Hint: Generate random values with torch.rand_like(x_completion.float()).
    Compare with t (broadcasted) to create the boolean mask.
    Then clone x_completion and set masked positions to mask_token_id.
    """
    # TODO: Implement the forward masking process
    pass

In [None]:
# Verification: test the forward process
test_comp = torch.tensor([[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]])
test_t = torch.tensor([[0.5]])
x_t, mask = forward_process(test_comp, test_t, MASK_TOKEN_ID)

print(f"Original: {test_comp[0].tolist()}")
print(f"Masked: {x_t[0].tolist()}")
print(f"Mask: {mask[0].tolist()}")
assert x_t.shape == test_comp.shape
assert mask.dtype == torch.bool
print(f"Masked {mask.sum().item()} out of {mask.numel()} tokens")
print("Forward process verified.")

In [None]:
def train_one_epoch(model, dataloader, optimizer, scheduler, mask_token_id: int, device: str) -> float:
    """
    Train the diffusion model for one epoch.

    For each batch:
    1. Move data to device
    2. Sample random timesteps t ~ U(0, 1), shape (batch_size, 1)
    3. Apply forward_process to get masked sequences and the boolean mask
    4. Concatenate [prefix, masked_completion, suffix] to form the full input
    5. Run model forward: logits = model(full_input, t)
    6. Extract logits only at the masked completion positions
    7. Compute cross-entropy loss between predicted logits and true tokens
    8. Backpropagate and step optimizer and scheduler

    Args:
        model: DiffusionCodeLM instance
        dataloader: DataLoader yielding batches with keys: prefix, completion, suffix, comp_lengths
        optimizer: AdamW optimizer
        scheduler: Learning rate scheduler
        mask_token_id: ID of the [MASK] token
        device: 'cuda' or 'cpu'

    Returns:
        Average loss for the epoch (float)

    Hint: For step 6, you need to figure out which positions in the full
    concatenated sequence correspond to masked completion tokens. Use the
    mask from forward_process, but offset by the prefix length.
    """
    # TODO: Implement the training loop
    pass

In [None]:
# Train for a few epochs and verify loss decreases
model = DiffusionCodeLM(vocab_size=VOCAB_SIZE).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Use a small subset for quick verification
small_loader = DataLoader(
    torch.utils.data.Subset(train_dataset, range(min(500, len(train_dataset)))),
    batch_size=16, shuffle=True, collate_fn=collate_fn
)

num_epochs = 3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * len(small_loader))

losses = []
for epoch in range(num_epochs):
    loss = train_one_epoch(model, small_loader, optimizer, scheduler, MASK_TOKEN_ID, device)
    losses.append(loss)
    print(f"Epoch {epoch + 1}/{num_epochs} | Loss: {loss:.4f}")

assert losses[-1] < losses[0], "Loss should decrease over epochs: check your training loop"
print("Training loop verified.")

### Full Training Run

Now train on the full dataset. On a T4 GPU, this should take approximately 30-60 minutes for 10 epochs.

In [None]:
# Full training run
model = DiffusionCodeLM(vocab_size=VOCAB_SIZE).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

num_epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * len(train_loader)
)

train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # Train
    train_loss = train_one_epoch(model, train_loader, optimizer, scheduler, MASK_TOKEN_ID, device)
    train_losses.append(train_loss)

    # Validate (compute loss on validation set without gradient)
    model.eval()
    val_loss_total = 0
    val_batches = 0
    with torch.no_grad():
        for batch in val_loader:
            prefix = batch["prefix"].to(device)
            completion = batch["completion"].to(device)
            suffix = batch["suffix"].to(device)
            t = torch.rand(prefix.shape[0], 1, device=device)
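            x_t, mask = forward_process(completion, t, MASK_TOKEN_ID)
            # Padded completion positions must not contribute to the loss
            mask = mask & (completion != PAD_TOKEN_ID)
            if not mask.any():
                continue  # nothing was masked in this batch; skip to avoid a NaN loss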
            full_input = torch.cat([prefix, x_t, suffix], dim=1)
            logits = model(full_input, t)
            comp_start = prefix.shape[1]
            comp_end = comp_start + completion.shape[1]
            comp_logits = logits[:, comp_start:comp_end, :]
            loss = F.cross_entropy(comp_logits[mask], completion[mask])
            val_loss_total += loss.item()
            val_batches += 1
    val_loss = val_loss_total / max(val_batches, 1)
    val_losses.append(val_loss)
    model.train()

    print(f"Epoch {epoch+1}/{num_epochs} | Train: {train_loss:.4f} | Val: {val_loss:.4f}")

# Plot training curves
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Cross-Entropy Loss")
plt.title("Training Curves")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 3.6 Generation: Iterative Unmasking

Now comes the exciting part. We implement the generation procedure: starting from a fully masked completion region and iteratively unmasking tokens based on model confidence.
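
The generation code below decides how many tokens to commit per step with the rule `n_unmask = max(1, remaining // (num_steps - step))`. A small sketch of that schedule in isolation shows how the masked positions are spread across steps; the function name `unmask_schedule` is hypothetical.

```python
def unmask_schedule(comp_len: int, num_steps: int) -> list[int]:
    """Number of tokens unmasked at each step under the even-split rule."""
    remaining, schedule = comp_len, []
    for step in range(num_steps):
        if remaining == 0:
            break
        # Spread what is left evenly over the steps that remain
        n = min(remaining, max(1, remaining // (num_steps - step)))
        schedule.append(n)
        remaining -= n
    return schedule


print(unmask_schedule(32, 10))  # more tokens than steps
print(unmask_schedule(5, 10))   # fewer tokens than steps: finishes early
```

Because the last step divides by 1, every remaining mask is filled by then, so the completion never comes back with leftover masked positions due to the schedule itself.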

In [None]:
@torch.no_grad()
def generate(
    model: nn.Module,
    prefix: torch.Tensor,
    suffix: torch.Tensor,
    comp_len: int,
    mask_token_id: int,
    num_steps: int = 10,
    temperature: float = 0.8,
) -> torch.Tensor:
    """
    Generate a code completion using iterative unmasking.

    Algorithm:
    1. Start with comp_len [MASK] tokens in the completion region
    2. For each step s from 0 to num_steps-1:
       a. Set timestep t = 1 - s / num_steps
       b. Concatenate [prefix, completion_region, suffix]
       c. Run model forward to get logits
       d. Apply temperature scaling: logits / temperature
       e. Sample tokens from softmax(logits) at masked positions
       f. Compute confidence (max probability) at each masked position
       g. Determine how many tokens to unmask at this step:
          n_unmask = max(1, remaining_masks // (num_steps - s))
       h. Unmask the top-n_unmask most confident predictions
    3. Return the final completion tokens

    Args:
        model: Trained DiffusionCodeLM
        prefix: Prefix tokens, shape (1, prefix_len)
        suffix: Suffix tokens, shape (1, suffix_len)
        comp_len: Length of the completion region
        mask_token_id: [MASK] token ID
        num_steps: Number of unmasking steps
        temperature: Sampling temperature (lower = more deterministic)

    Returns:
        Generated completion tokens, shape (1, comp_len)
    """
    model.eval()
    completion = torch.full((1, comp_len), mask_token_id, device=prefix.device)

    for step in range(num_steps):
        t = torch.tensor([[1.0 - step / num_steps]], device=prefix.device)

        # Build full input
        full_input = torch.cat([prefix, completion, suffix], dim=1)
        logits = model(full_input, t)

        # Extract completion logits
        comp_start = prefix.shape[1]
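        comp_logits = logits[:, comp_start:comp_start + comp_len, :] / temperature
        # Never sample <MASK> itself, otherwise a position could stay masked in the output
        comp_logits[..., mask_token_id] = float("-inf")
        probs = F.softmax(comp_logits, dim=-1)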

        # Sample predictions
        predicted = torch.multinomial(probs.view(-1, probs.shape[-1]), 1).view(1, comp_len)
        confidence = probs.max(dim=-1).values

        # Unmask the most confident predictions
        is_masked = (completion == mask_token_id)
        n_remaining = is_masked.sum().item()
        if n_remaining == 0:
            break
        n_unmask = max(1, n_remaining // max(num_steps - step, 1))

        masked_confidence = confidence * is_masked.float()
        _, top_idx = masked_confidence.view(-1).topk(min(n_unmask, n_remaining))
        completion.view(-1)[top_idx] = predicted.view(-1)[top_idx]

    return completion

In [None]:
# Test generation
model.eval()
sample = test_dataset[0]
prefix_t = torch.tensor([sample["prefix"][-64:]], dtype=torch.long, device=device)  # last 64 prefix tokens (nearest the cursor)
suffix_t = torch.tensor([sample["suffix"][:32]], dtype=torch.long, device=device)  # first 32 suffix tokens
comp_len = min(32, len(sample["completion"]))

generated = generate(model, prefix_t, suffix_t, comp_len=comp_len, mask_token_id=MASK_TOKEN_ID, num_steps=10)
generated_text = tokenizer.decode(generated[0].tolist())
original_text = tokenizer.decode(sample["completion"][:comp_len])

print("--- Generated Completion ---")
print(generated_text)
print("\n--- Ground Truth ---")
print(original_text)

---

## 3.7 Evaluation

Now we systematically evaluate the model against the bigram baseline.

**TODO:** Implement the evaluation function and generate comparison plots.
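
One of the metrics used in this section is edit similarity, defined as 1 minus the normalized Levenshtein distance. A minimal dynamic-programming sketch that works on strings or token lists follows; the function names are hypothetical, and a library such as python-Levenshtein would be faster on long sequences.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def edit_similarity(pred, truth) -> float:
    """1 - distance / max length; 1.0 means identical sequences."""
    denom = max(len(pred), len(truth))
    return 1.0 if denom == 0 else 1.0 - levenshtein(pred, truth) / denom


print(levenshtein("kitten", "sitting"))
print(edit_similarity([1, 2, 3, 4], [1, 2, 0, 4]))
```

Unlike exact match, this metric gives partial credit to completions that differ from the ground truth in only a few tokens.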

In [None]:
def evaluate_model(
    model: nn.Module,
    test_dataloader,
    mask_token_id: int,
    device: str,
    masking_ratios: list[float] = [0.25, 0.5, 0.75],
) -> dict:
    """
    Evaluate the diffusion model on the test set.

    For each masking ratio in masking_ratios:
    1. Apply forward_process with the fixed masking ratio (not random)
    2. Compute cross-entropy loss on masked positions
    3. Also compute token-level accuracy: what fraction of masked tokens
       does the model predict correctly (using argmax)?

    Additionally, for the full test set:
    4. Run the full iterative unmasking generation procedure
    5. Compute exact match rate against ground truth
    6. Compute average edit similarity (1 - normalized Levenshtein distance)

    Args:
        model: Trained DiffusionCodeLM
        test_dataloader: DataLoader for the test set
        mask_token_id: ID of the [MASK] token
        device: 'cuda' or 'cpu'
        masking_ratios: List of masking ratios to evaluate at

    Returns:
        Dictionary with keys:
        - 'loss_by_ratio': {ratio: avg_loss} for each masking ratio
        - 'accuracy_by_ratio': {ratio: avg_accuracy} for each masking ratio
        - 'exact_match': float, fraction of exact matches
        - 'edit_similarity': float, average edit similarity

    Hint: For edit similarity, you can use the python-Levenshtein library
    or implement it with dynamic programming. Normalize by max(len(pred), len(truth)).
    """
    # TODO: Implement evaluation
    pass


def plot_evaluation_results(eval_results: dict, baseline_perplexity: float) -> None:
    """
    Create three plots:
    1. Loss vs masking ratio (bar chart): shows how loss increases with more masking
    2. Token accuracy vs masking ratio (bar chart): shows how accuracy drops with more masking
    3. Comparison table: diffusion model vs bigram baseline on all metrics

    Hint: Use matplotlib subplots with 1 row, 3 columns.
    """
    # TODO: Create the evaluation plots
    pass

In [None]:
# Run evaluation
eval_results = evaluate_model(model, test_loader, MASK_TOKEN_ID, device)
print(f"Exact match: {eval_results['exact_match']:.3f}")
print(f"Edit similarity: {eval_results['edit_similarity']:.3f}")
for ratio, loss in eval_results["loss_by_ratio"].items():
    acc = eval_results["accuracy_by_ratio"][ratio]
    print(f"  t={ratio}: loss={loss:.3f}, accuracy={acc:.3f}")

plot_evaluation_results(eval_results, val_ppl)
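
The edit similarity metric described in the `evaluate_model` docstring can be sketched in pure Python as below. This is a minimal sketch, not the notebook's reference solution: the helper names `levenshtein` and `edit_similarity` are hypothetical, and the functions accept either decoded strings or lists of token IDs. It normalizes by `max(len(pred), len(truth))` as the hint suggests, and treats two empty sequences as identical (an assumption, since the docstring does not cover that case).

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token-ID lists)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer sequence, keep rows short
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # drop x
                curr[j - 1] + 1,     # drop y
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def edit_similarity(pred, truth):
    """1 - normalized Levenshtein distance, in [0, 1]."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as identical
    return 1.0 - levenshtein(pred, truth) / longest
```

Averaging `edit_similarity` over all test samples gives the `'edit_similarity'` entry of the results dictionary, and `pred == truth` gives the per-sample exact match.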

---

## 3.8 Error Analysis

**TODO:** Analyze the types of errors the model makes and identify the top 3 failure modes.

In [None]:
def analyze_errors(
    model: nn.Module,
    test_samples: list[dict],
    tokenizer,
    mask_token_id: int,
    device: str,
    num_steps: int = 10,
    num_samples: int = 50,
) -> dict:
    """
    Generate completions for test samples and categorize errors.

    For each test sample:
    1. Generate a completion using iterative unmasking
    2. Compare with the ground truth
    3. If they differ, assign exactly one error type, checked in this order:
       - 'syntax': ast.parse() on the generated code raises SyntaxError.
       - 'repetition': The code parses, but some 3-gram of tokens appears
         more than 3 times in the generated code.
       - 'semantic': The code parses and is not repetitive, but still
         differs from the ground truth.
    4. Collect examples of each error type.

    Args:
        model: Trained DiffusionCodeLM
        test_samples: List of dicts with 'prefix', 'completion', 'suffix'
        tokenizer: BPE tokenizer for decoding
        mask_token_id: [MASK] token ID
        device: 'cuda' or 'cpu'
        num_steps: Number of diffusion steps for generation
        num_samples: Number of samples to analyze

    Returns:
        Dictionary with:
        - 'error_counts': {error_type: count}
        - 'error_examples': {error_type: list of (generated, ground_truth) pairs}
        - 'total_errors': total number of incorrect completions
        - 'total_correct': total number of exact matches

    Hint: Use Python's ast module for syntax checking (requires `import ast`):
      try:
          ast.parse(generated_code)
          is_syntax_error = False
      except SyntaxError:
          is_syntax_error = True
    """
    # TODO: Implement error analysis
    pass

In [None]:
# Run error analysis
test_samples = [test_dataset[i] for i in range(min(50, len(test_dataset)))]
error_results = analyze_errors(model, test_samples, tokenizer, MASK_TOKEN_ID, device)

print(f"Total correct: {error_results['total_correct']} / {error_results['total_correct'] + error_results['total_errors']}")
print(f"\nError breakdown:")
for err_type, count in error_results["error_counts"].items():
    print(f"  {err_type}: {count}")

print(f"\nExample errors:")
for err_type, examples in error_results["error_examples"].items():
    if examples:
        gen, truth = examples[0]
        print(f"\n--- {err_type.upper()} ERROR ---")
        print(f"Generated: {gen[:200]}")
        print(f"Expected:  {truth[:200]}")

**Thought questions:**
1. Which error type is most common? Is this a fundamental limitation of the diffusion approach, or a training issue?
2. Does the model make more errors at the beginning or end of the completion region? What does this tell you about the unmasking order?
3. How might you modify the unmasking schedule to reduce the most common error type?
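
The categorization rules in the `analyze_errors` docstring can be sketched as a small standalone classifier. This is an illustrative sketch under stated assumptions: the names `has_repetition` and `classify_error` are hypothetical, 3-grams are computed over whitespace-separated tokens (the notebook's BPE tokens would work equally well), and the checks run in the order syntax, repetition, semantic. A completion fragment may not parse on its own, so in practice you may want to pass the prefix, completion, and suffix joined together.

```python
import ast
from collections import Counter


def has_repetition(tokens, n=3, max_count=3):
    """Return True if any n-gram occurs more than max_count times."""
    ngrams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return any(count > max_count for count in ngrams.values())


def classify_error(generated, truth):
    """Label a completion as 'correct', 'syntax', 'repetition', or 'semantic'."""
    if generated == truth:
        return "correct"
    try:
        ast.parse(generated)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions
        return "syntax"
    if has_repetition(generated.split()):  # assumption: whitespace tokens
        return "repetition"
    return "semantic"
```

Counting the labels with `collections.Counter` over all samples yields the `'error_counts'` dictionary, and recording the token position of the first mismatch is one way to approach the second thought question.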

---

## 3.9 Scalability and Deployment

**TODO:** Benchmark inference latency across different completion lengths and diffusion step counts.

In [None]:
def benchmark_inference(
    model: nn.Module,
    tokenizer,
    mask_token_id: int,
    device: str,
    completion_lengths: list[int] = [32, 64, 128, 256],
    num_steps_list: list[int] = [4, 8, 12, 16],
    num_trials: int = 20,
) -> dict:
    """
    Benchmark inference latency across different completion lengths and step counts.

    For each combination of (completion_length, num_steps):
    1. Create a dummy input with a fixed prefix (64 tokens) and the specified
       number of [MASK] tokens as the completion region
    2. Run the full iterative unmasking generation procedure
    3. Measure wall-clock time (call torch.cuda.synchronize() both before
       starting and before stopping the timer!)
    4. Record the median latency over num_trials runs

    Also measure:
    5. Tokens per second = completion_length / median latency in seconds
    6. Peak GPU memory via torch.cuda.max_memory_allocated(), which returns
       bytes (convert to megabytes for 'memory_mb')

    Create a 2D heatmap (completion_length x num_steps) showing median latency.

    Args:
        model: Trained DiffusionCodeLM
        tokenizer: BPE tokenizer
        mask_token_id: [MASK] token ID
        device: 'cuda' (must be GPU for meaningful results)
        completion_lengths: List of completion region sizes to test
        num_steps_list: List of diffusion step counts to test
        num_trials: Number of timing trials per configuration

    Returns:
        Dictionary with:
        - 'latencies': 2D dict {comp_len: {num_steps: median_latency_ms}}
        - 'throughput': 2D dict {comp_len: {num_steps: tokens_per_second}}
        - 'memory_mb': peak GPU memory in megabytes

    Hint: For accurate GPU timing (requires `import time`):
      torch.cuda.synchronize()
      start = time.perf_counter()
      # ... run generation ...
      torch.cuda.synchronize()
      elapsed_ms = (time.perf_counter() - start) * 1000
    """
    # TODO: Implement inference benchmarking
    pass

In [None]:
# Run benchmarks
# `device` may be a torch.device or a string, so compare on its type
if torch.device(device).type == "cuda":
    bench_results = benchmark_inference(model, tokenizer, MASK_TOKEN_ID, device)

    print(f"GPU memory: {bench_results['memory_mb']:.1f} MB")
    print(f"\nMedian latency (ms):")
    print(f"{'Comp Length':>12}", end="")
    for ns in [4, 8, 12, 16]:
        print(f"{f'{ns} steps':>13}", end="")  # same width as the value columns
    print()
    for cl in [32, 64, 128, 256]:
        print(f"{cl:>12}", end="")
        for ns in [4, 8, 12, 16]:
            lat = bench_results["latencies"][cl][ns]
            print(f"{lat:>13.1f}", end="")
        print()
else:
    print("Benchmarking requires a GPU. Skipping.")

**Thought questions:**
1. How does latency scale with completion length? Compare this to autoregressive decoding, where latency grows roughly linearly with the number of generated tokens. What do you observe?
2. What is the relationship between num_steps and generation quality? Is there a "sweet spot" where adding more steps gives diminishing returns?
3. If you needed to deploy this model on a T4 GPU (16GB, lower compute), what modifications would you make?
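
The timing procedure in the `benchmark_inference` docstring can be sketched as a small, device-agnostic harness. This is a minimal sketch with hypothetical names (`median_latency_ms`, `tokens_per_second`): the generation call is passed in as a zero-argument callable, and the optional `sync` argument is where `torch.cuda.synchronize` would go on a GPU. The warmup runs are an addition not mentioned in the docstring; they keep one-time costs such as CUDA kernel loading out of the measurements.

```python
import statistics
import time


def median_latency_ms(fn, num_trials=20, num_warmup=3, sync=None):
    """Median wall-clock latency of fn() in milliseconds.

    sync: optional callable that blocks until queued device work is done
          (for example torch.cuda.synchronize); None for CPU-only code.
    """
    def wait():
        if sync is not None:
            sync()

    for _ in range(num_warmup):  # untimed runs to absorb one-time setup costs
        fn()

    samples = []
    for _ in range(num_trials):
        wait()  # make sure no earlier work is still running
        start = time.perf_counter()
        fn()
        wait()  # make sure the timed work has actually finished
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def tokens_per_second(completion_length, latency_ms):
    """Convert a latency in milliseconds to throughput in tokens per second."""
    return completion_length / (latency_ms / 1000.0)
```

On a GPU, a call might look like `median_latency_ms(lambda: generate(...), sync=torch.cuda.synchronize)`, where `generate` stands in for whatever iterative unmasking function you implemented earlier.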

---

## 3.10 Ethical and Regulatory Analysis

**TODO:** Conduct a basic ethical assessment of the code completion model.

In [None]:
def ethical_assessment(
    model: nn.Module,
    tokenizer,
    mask_token_id: int,
    device: str,
    num_steps: int = 10,
) -> str:
    """
    Conduct a basic ethical assessment of the code completion model.

    Perform the following tests and return a written assessment:

    1. MEMORIZATION TEST: Generate 100 completions from the same prefix.
       Measure the fraction of completions that are exact duplicates of
       another completion. A high duplication rate may indicate
       memorization of training data.

    2. VULNERABILITY TEST: Create 5 prompts that could lead to insecure code:
       - SQL query construction
       - HTML template rendering
       - File path handling
       - Password/credential handling
       - HTTP request construction
       For each, generate a completion and manually inspect whether the
       generated code contains common vulnerability patterns (string
       concatenation for SQL, no input sanitization for HTML, etc.)

    3. DIVERSITY TEST: Generate 20 completions for a function stub that
       could be solved multiple ways. Measure the diversity of solutions
       (e.g., number of unique approaches). Low diversity may indicate
       the model overfits to dominant patterns.

    Return a structured string report with:
    - Findings for each test
    - Risk level (Low/Medium/High) for each category
    - Recommended mitigations

    Args:
        model: Trained DiffusionCodeLM
        tokenizer: BPE tokenizer
        mask_token_id: [MASK] token ID
        device: 'cuda' or 'cpu'
        num_steps: Number of diffusion steps

    Returns:
        String containing the structured ethical assessment report
    """
    # TODO: Implement the ethical assessment
    # This is intentionally open-ended: there is no single right answer.
    # The goal is to develop a systematic process for evaluating ML models.
    pass

In [None]:
# Run ethical assessment
report = ethical_assessment(model, tokenizer, MASK_TOKEN_ID, device)
print(report)
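
The duplication measurement used by the memorization and diversity tests can be sketched as follows. This is a rough sketch with a hypothetical helper name (`duplication_stats`); it compares completions as stripped strings, which only catches exact duplicates. Near-duplicates and overlap with the training set would need separate checks.

```python
from collections import Counter


def duplication_stats(completions):
    """Summarize exact-duplicate completions (compared after stripping whitespace)."""
    total = len(completions)
    if total == 0:
        return {"total": 0, "unique": 0, "duplicate_rate": 0.0, "most_common": None}
    counts = Counter(c.strip() for c in completions)
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        # fraction of completions that repeat an earlier one
        "duplicate_rate": 1.0 - unique / total,
        # (text, count) of the most frequent completion
        "most_common": counts.most_common(1)[0],
    }
```

A `duplicate_rate` near 1.0 means almost every sample collapsed to the same output, which is worth reporting for both the memorization and the diversity test.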

**Thought questions:**
1. If the model memorizes a block of GPL-licensed code and suggests it to a developer writing proprietary software, who is responsible? The model developer, the code completion company, or the end user?
2. How would you build a real-time filter that prevents the model from suggesting code with known vulnerability patterns?
3. What fairness metrics make sense for code completion? Is it fair if the model works better for Python than for Rust? What about for English variable names vs. non-English?
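
As a starting point for the second question, a pattern-based filter might look like the sketch below. Everything here is an illustrative assumption: the names `SUSPICIOUS_PATTERNS` and `flag_suspicious` are hypothetical, and the four regular expressions are a small hand-picked set, not a vetted security rule list. Regex matching is cheap enough for a latency budget but produces both false positives and false negatives, so it is better suited to flagging or down-ranking suggestions than to guaranteeing safety.

```python
import re

# Hand-picked example patterns; a real filter would need a reviewed rule set.
SUSPICIOUS_PATTERNS = {
    # execute(f"...") or execute("..." + x) or execute("..." % x)
    "sql_string_building": re.compile(
        r"""execute\(\s*(f["']|["'][^"']*["']\s*(%|\+))"""
    ),
    "shell_injection": re.compile(r"shell\s*=\s*True"),
    "eval_or_exec": re.compile(r"\b(eval|exec)\s*\("),
    "hardcoded_secret": re.compile(
        r"""(password|secret|api_key)\s*=\s*["'][^"']+["']""", re.IGNORECASE
    ),
}


def flag_suspicious(code):
    """Return the names of all patterns that match the given code string."""
    return [name for name, pattern in SUSPICIOUS_PATTERNS.items()
            if pattern.search(code)]
```

For example, `flag_suspicious('cursor.execute("SELECT * FROM t WHERE id = " + uid)')` flags the query built by concatenation, while the parameterized form `cursor.execute("SELECT * FROM t WHERE id = ?", (uid,))` passes.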

---

## Summary

In this notebook, you built a masked diffusion language model for code completion from scratch:

1. **Data pipeline:** Loaded CodeSearchNet Python functions, trained a BPE tokenizer, and created prefix-completion-suffix training samples.
2. **Baseline:** Implemented a bigram model to establish a performance floor.
3. **Diffusion model:** Built a bidirectional Transformer with timestep conditioning, the core architecture behind models like LLaDA and Mercury.
4. **Training:** Implemented the forward masking process and the cross-entropy training loop across random masking ratios.
5. **Generation:** Implemented iterative unmasking with confidence-based scheduling.
6. **Evaluation:** Measured quality (accuracy, edit similarity) and speed (latency, throughput).
7. **Error analysis:** Categorized failure modes (syntax, semantic, repetition) and identified the top errors.
8. **Deployment:** Benchmarked inference latency and analyzed scalability.
9. **Ethics:** Assessed memorization, vulnerability, and diversity risks.

The key takeaway: masked diffusion for text is conceptually simple (generalized BERT training) but enables fundamentally different generation properties: parallel tokens, bidirectional context, and natural infilling. These properties directly address the business requirements for fast, high-quality code completion.

For further reading on how this prototype would be scaled to serve 800,000 developers in production, see **Section 4 (Production and System Design Extension)** of the full case study document.