# Chapter 11: Building the Transformer

> "We are what we repeatedly do. Excellence, then, is not an act, but a habit."
> — **Aristotle**, Philosopher

---

## What You'll Learn

- How to stack Transformer blocks into a complete language model
- The configuration pattern that makes model experimentation easy
- What the language modeling head does and why we need it
- Weight tying: the elegant trick that saves roughly 38.6 million parameters
- Essential sanity checks to verify your model before training
- How to load pretrained GPT-2 weights into your architecture

---

## Setup

First, let's install the required packages:

In [None]:
# Install required packages
!pip install -q torch transformers

In [None]:
# ===== IMPORTS =====
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
from transformers import AutoTokenizer

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Components from Chapter 10

First, let's bring in the components we built in Chapter 10: `MultiHeadAttention`, `FeedForward`, and `TransformerBlock`.

In [None]:
# ===== MULTI-HEAD ATTENTION (from Chapter 10) =====

class MultiHeadAttention(nn.Module):
    """Efficient multi-head attention (batches all heads together)."""
    
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        
        # Combined QKV projection (with bias terms, as in GPT-2's attention layers)
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        batch, seq, d_model = x.shape
        
        # Project to Q, K, V
        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(batch, seq, 3, self.num_heads, self.d_head)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        
        # Scaled dot-product attention
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
        
        if mask is not None:
            if mask.dim() == 2:
                mask = mask.unsqueeze(0).unsqueeze(0)
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        # Weighted sum and concatenate
        attn_output = attn_weights @ V
        attn_output = attn_output.transpose(1, 2).reshape(batch, seq, d_model)
        
        return self.out_proj(attn_output), attn_weights

print("MultiHeadAttention defined!")

In [None]:
# ===== FEEDFORWARD NETWORK (from Chapter 10) =====

class FeedForward(nn.Module):
    """Position-wise feedforward network."""
    
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        x = self.fc1(x)
        x = F.gelu(x, approximate='tanh')  # GPT-2 uses the tanh approximation of GELU
        x = self.dropout(x)
        x = self.fc2(x)
        return x

print("FeedForward defined!")

In [None]:
# ===== TRANSFORMER BLOCK (from Chapter 10) =====

class TransformerBlock(nn.Module):
    """Complete Transformer block (pre-norm style like GPT-2)."""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Attention with residual
        attn_out, attn_weights = self.attn(self.ln1(x), mask)
        x = x + self.dropout(attn_out)
        
        # FFN with residual
        ffn_out = self.ffn(self.ln2(x))
        x = x + self.dropout(ffn_out)
        
        return x, attn_weights

print("TransformerBlock defined!")

## 2. Model Configuration

Let's create a configuration dataclass to bundle all model hyperparameters.

In [None]:
@dataclass
class GPTConfig:
    """Configuration for MiniGPT model."""
    vocab_size: int = 50257      # GPT-2 vocabulary size
    max_seq_len: int = 1024      # Maximum context length
    embed_dim: int = 768         # Embedding dimension
    num_heads: int = 12          # Number of attention heads
    num_layers: int = 12         # Number of Transformer blocks
    d_ff: int = 3072             # Feedforward hidden dimension
    dropout: float = 0.1         # Dropout probability

    def __post_init__(self):
        """Validate configuration."""
        assert self.embed_dim % self.num_heads == 0, \
            f"embed_dim ({self.embed_dim}) must be divisible by num_heads ({self.num_heads})"


# Test different configurations
print("GPT-2 Small (default):")
config = GPTConfig()
print(f"  Layers: {config.num_layers}, Heads: {config.num_heads}, Embed: {config.embed_dim}")

print("\nTiny config (for experiments):")
tiny_config = GPTConfig(embed_dim=64, num_heads=2, num_layers=2, d_ff=256)
print(f"  Layers: {tiny_config.num_layers}, Heads: {tiny_config.num_heads}, Embed: {tiny_config.embed_dim}")
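
One payoff of keeping hyperparameters in a dataclass is that variants are cheap to derive and are validated as soon as they are built. The sketch below uses a hypothetical two-field `DemoConfig` (a stand-in, not the chapter's `GPTConfig`) and the standard library's `dataclasses.replace`. It raises `ValueError` rather than using `assert`, which is an assumption of this sketch: assertions are skipped when Python runs with `-O`.

```python
from dataclasses import dataclass, replace


@dataclass
class DemoConfig:
    """Hypothetical stand-in for GPTConfig with just two fields."""
    embed_dim: int = 768
    num_heads: int = 12

    def __post_init__(self):
        if self.embed_dim % self.num_heads != 0:
            raise ValueError(
                f"embed_dim ({self.embed_dim}) must be divisible by "
                f"num_heads ({self.num_heads})"
            )


base = DemoConfig()

# replace() builds a new instance, so __post_init__ validates it again.
wide = replace(base, embed_dim=1024, num_heads=16)
print(f"Per-head dimension: {wide.embed_dim // wide.num_heads}")

# An invalid combination fails at construction time, not deep inside attention.
try:
    replace(base, num_heads=5)
except ValueError as err:
    print(f"Rejected: {err}")
```

The same pattern should carry over to `GPTConfig`: derive an experiment config from a known-good base and let `__post_init__` catch mismatched dimensions early.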

## 3. The Complete MiniGPT Model

Now let's build the complete model by stacking Transformer blocks and adding the language modeling head with weight tying.
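
To see where the savings from weight tying come from, note that the language modeling head maps `embed_dim` features to `vocab_size` scores, so its weight matrix has the same shape as the token embedding table. A quick back-of-the-envelope calculation, using the GPT-2 Small sizes from the config above:

```python
vocab_size = 50257   # GPT-2 vocabulary size
embed_dim = 768      # GPT-2 Small embedding dimension

embedding_params = vocab_size * embed_dim   # token embedding table
lm_head_params = embed_dim * vocab_size     # untied output projection (no bias)

# Tying reuses the embedding table as the head's weight, so the head adds nothing.
saved = lm_head_params
print(f"Token embedding: {embedding_params:,}")
print(f"Saved by tying:  {saved:,}")
```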

In [None]:
class MiniGPT(nn.Module):
    """
    A minimal GPT-style language model.
    Combines embeddings, Transformer blocks, and language modeling head.
    """

    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        # ===== Token and Position Embeddings =====
        self.token_embed = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_embed = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.dropout = nn.Dropout(config.dropout)

        # ===== Transformer Blocks =====
        self.blocks = nn.ModuleList([
            TransformerBlock(
                d_model=config.embed_dim,
                num_heads=config.num_heads,
                d_ff=config.d_ff,
                dropout=config.dropout
            )
            for _ in range(config.num_layers)
        ])

        # ===== Final LayerNorm =====
        self.ln_f = nn.LayerNorm(config.embed_dim)

        # ===== Language Modeling Head =====
        self.lm_head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)

        # ===== Weight Tying =====
        self.lm_head.weight = self.token_embed.weight

        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        """Initialize embedding weights from N(0, 0.02); other layers keep PyTorch defaults."""
        nn.init.normal_(self.token_embed.weight, std=0.02)
        nn.init.normal_(self.pos_embed.weight, std=0.02)

    def forward(self, token_ids, return_attention=False):
        """
        Forward pass through the model.

        Args:
            token_ids: Input token IDs (batch, seq)
            return_attention: Whether to return attention weights

        Returns:
            logits: Vocabulary scores (batch, seq, vocab_size)
        """
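        batch, seq = token_ids.shape
        assert seq <= self.config.max_seq_len, \
            f"Sequence length {seq} exceeds max_seq_len ({self.config.max_seq_len})"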
        device = token_ids.device

        # Embeddings
        tok_emb = self.token_embed(token_ids)
        positions = torch.arange(seq, device=device)
        pos_emb = self.pos_embed(positions)
        x = self.dropout(tok_emb + pos_emb)

        # Causal mask
        mask = torch.tril(torch.ones(seq, seq, device=device))

        # Transformer blocks
        attention_weights = []
        for block in self.blocks:
            x, attn = block(x, mask)
            if return_attention:
                attention_weights.append(attn)

        # Final norm and projection
        x = self.ln_f(x)
        logits = self.lm_head(x)

        if return_attention:
            return logits, attention_weights
        return logits


print("MiniGPT class defined!")

In [None]:
# Quick test with tiny config
tiny_config = GPTConfig(embed_dim=64, num_heads=2, num_layers=2, d_ff=256)
model = MiniGPT(tiny_config)

# Test forward pass
test_tokens = torch.randint(0, tiny_config.vocab_size, (2, 16))
logits = model(test_tokens)

print(f"Input shape: {test_tokens.shape}")
print(f"Output shape: {logits.shape}")
print(f"\nExpected: (2, 16, {tiny_config.vocab_size})")

## 4. Sanity Checks

Let's verify our model is wired correctly with five essential tests: output shapes, parameter count, causal masking, weight tying, and gradient flow.

In [None]:
# ===== TEST 1: Shape Verification =====

def test_forward_shapes(config):
    """Verify model output shapes."""
    model = MiniGPT(config)
    
    batch_size, seq_len = 2, 16
    token_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
    
    logits = model(token_ids)
    
    expected_shape = (batch_size, seq_len, config.vocab_size)
    assert logits.shape == expected_shape, \
        f"Expected {expected_shape}, got {logits.shape}"
    
    print(f"Shape check PASSED: {logits.shape}")

test_forward_shapes(tiny_config)

In [None]:
# ===== TEST 2: Parameter Count =====

def count_parameters(model):
    """Count total trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def test_parameter_count(config):
    """Print and return the total trainable parameter count."""
    model = MiniGPT(config)
    total = count_parameters(model)
    print(f"Total parameters: {total:,}")
    return total

# Test with tiny config
print("Tiny model:")
test_parameter_count(tiny_config)

# Test with full GPT-2 config
print("\nGPT-2 Small:")
test_parameter_count(GPTConfig())
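
A printed total is easier to trust when you can predict it. The sketch below derives a closed-form count from the layer shapes used in this chapter. `expected_params` is a hypothetical helper, and its `attn_bias` flag is an assumption of the sketch: set it to match whether your `qkv_proj` and `out_proj` layers carry bias terms.

```python
def expected_params(vocab_size, max_seq_len, embed_dim, num_layers, d_ff,
                    attn_bias=False):
    """Closed-form parameter count for the MiniGPT layout (tied lm_head)."""
    d = embed_dim
    embeddings = vocab_size * d + max_seq_len * d   # token + position tables
    layer_norms = 2 * (2 * d)                       # ln1 + ln2: weight and bias
    attention = d * 3 * d + d * d                   # qkv_proj + out_proj weights
    if attn_bias:
        attention += 3 * d + d                      # optional projection biases
    ffn = (d * d_ff + d_ff) + (d_ff * d + d)        # fc1 + fc2, both with bias
    per_block = layer_norms + attention + ffn
    final_norm = 2 * d                              # ln_f weight and bias
    return embeddings + num_layers * per_block + final_norm


gpt2_small = dict(vocab_size=50257, max_seq_len=1024, embed_dim=768,
                  num_layers=12, d_ff=3072)

print(f"Without attention biases: {expected_params(**gpt2_small):,}")
print(f"With attention biases:    {expected_params(**gpt2_small, attn_bias=True):,}")
```

With attention biases included, the formula gives 124,439,808, the size of the released GPT-2 Small checkpoint. If `count_parameters` disagrees with the variant that matches your layers, a layer is probably missing, duplicated, or untied.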

In [None]:
# ===== TEST 3: Causal Masking =====

def test_causal_masking():
    """Verify model can't see future tokens."""
    config = GPTConfig(
        num_layers=1,
        embed_dim=64,
        num_heads=2,
        d_ff=256,
        dropout=0.0  # No dropout for determinism
    )
    model = MiniGPT(config)
    model.eval()

    # Same prefix, different suffix
    tokens_a = torch.tensor([[100, 200, 300, 400]])
    tokens_b = torch.tensor([[100, 200, 300, 999]])  # Last token different

    with torch.no_grad():
        logits_a = model(tokens_a)
        logits_b = model(tokens_b)

    # Positions 0, 1, 2 should be IDENTICAL
    for pos in range(3):
        assert torch.allclose(logits_a[0, pos], logits_b[0, pos], atol=1e-5), \
            f"Position {pos} logits differ!"

    # Position 3 SHOULD differ
    assert not torch.allclose(logits_a[0, 3], logits_b[0, 3], atol=1e-5), \
        "Position 3 logits same despite different input!"

    print("Causal masking PASSED!")

test_causal_masking()
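
The test passes because of how the mask interacts with softmax. As a minimal illustration in plain Python (a toy sketch with made-up scores, not the model's actual attention code), setting future scores to negative infinity makes their softmax weight exactly zero:

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, so exp() turns them into exactly 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)                        # subtract max for stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy attention scores for a 3-token sequence (rows = query positions).
scores = [[0.5, 2.0, -1.0],
          [1.0, 0.0, 3.0],
          [0.2, 0.2, 0.2]]

for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but everything above the diagonal is zero, so changing a later token cannot affect the output at an earlier position.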

In [None]:
# ===== TEST 4: Weight Tying =====

def test_weight_tying():
    """Verify embedding and lm_head share weights."""
    config = GPTConfig(embed_dim=64, num_heads=2, num_layers=2, d_ff=256)
    model = MiniGPT(config)

    # Should be the SAME tensor
    assert model.lm_head.weight is model.token_embed.weight, \
        "Weight tying failed: different tensors!"

    # Modify one, check the other changes
    with torch.no_grad():
        original = model.token_embed.weight[0, 0].item()
        model.token_embed.weight[0, 0] = 999.0
        
        assert model.lm_head.weight[0, 0].item() == 999.0, \
            "Weight tying failed: changes don't propagate!"
        
        model.token_embed.weight[0, 0] = original

    print("Weight tying PASSED!")

test_weight_tying()

In [None]:
# ===== TEST 5: Gradient Flow =====

def test_gradient_flow():
    """Verify gradients reach all parameters."""
    config = GPTConfig(num_layers=2, embed_dim=64, num_heads=2, d_ff=256)
    model = MiniGPT(config)

    # Forward pass
    tokens = torch.randint(0, config.vocab_size, (1, 8))
    logits = model(tokens)

    # Backward pass
    loss = logits.sum()
    loss.backward()

    # Check all parameters have gradients
    for name, param in model.named_parameters():
        assert param.grad is not None, f"No gradient for {name}"

    print("Gradient flow PASSED!")

test_gradient_flow()

## 5. Your First Forward Pass

Let's run real text through our model and see what it outputs with random weights.

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Create small model
config = GPTConfig(num_layers=2, embed_dim=256, num_heads=4, d_ff=1024)
model = MiniGPT(config)
model.eval()

# Tokenize a prompt
prompt = "The quick brown fox"
token_ids = tokenizer.encode(prompt, return_tensors="pt")

print(f"Prompt: '{prompt}'")
print(f"Token IDs: {token_ids}")
print(f"Tokens: {[tokenizer.decode([t]) for t in token_ids[0]]}")

In [None]:
# Forward pass
with torch.no_grad():
    logits = model(token_ids)

print(f"Logits shape: {logits.shape}")

# Get top 5 predictions for the next token
last_logits = logits[0, -1, :]
top_probs, top_indices = torch.softmax(last_logits, dim=-1).topk(5)

print(f"\nTop 5 predictions after '{prompt}':")
for prob, idx in zip(top_probs, top_indices):
    token = tokenizer.decode([idx])
    print(f"  '{token}': {prob:.4f}")

print("\n(Random predictions - model has random weights!)")
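
A useful reference point for untrained output is the uniform distribution: a model that knows nothing should spread probability almost evenly over the vocabulary. The short calculation below gives that baseline. Treat it as a rough guide rather than an exact target, since freshly initialized logits are small but not identical.

```python
import math

vocab_size = 50257  # GPT-2 vocabulary size

uniform_prob = 1.0 / vocab_size
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats

print(f"Uniform probability per token: {uniform_prob:.6f}")
print(f"Uniform cross-entropy:         {uniform_loss:.2f} nats")
```

If the top-5 probabilities above are within a small factor of `uniform_prob`, the random model is behaving as expected. Probabilities near 1.0 at this stage would usually point to an initialization or wiring problem.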

## 6. Loading Pretrained Weights

Now the exciting part: let's load real GPT-2 weights into our MiniGPT!

In [None]:
def load_gpt2_weights(model, model_name="gpt2"):
    """
    Load pretrained GPT-2 weights into our MiniGPT model.
    """
    from transformers import GPT2LMHeadModel

    print(f"Loading weights from '{model_name}'...")

    # Load HuggingFace model
    hf_model = GPT2LMHeadModel.from_pretrained(model_name)
    hf_state = hf_model.state_dict()

    # Our model's state dict
    our_state = model.state_dict()

    # Copy embeddings
    our_state['token_embed.weight'].copy_(hf_state['transformer.wte.weight'])
    our_state['pos_embed.weight'].copy_(hf_state['transformer.wpe.weight'])

    # Copy each Transformer block
    for i in range(model.config.num_layers):
        # Layer norms
        our_state[f'blocks.{i}.ln1.weight'].copy_(
            hf_state[f'transformer.h.{i}.ln_1.weight'])
        our_state[f'blocks.{i}.ln1.bias'].copy_(
            hf_state[f'transformer.h.{i}.ln_1.bias'])
        our_state[f'blocks.{i}.ln2.weight'].copy_(
            hf_state[f'transformer.h.{i}.ln_2.weight'])
        our_state[f'blocks.{i}.ln2.bias'].copy_(
            hf_state[f'transformer.h.{i}.ln_2.bias'])

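        # Attention (need to transpose: HF stores these weights as (in, out))
        our_state[f'blocks.{i}.attn.qkv_proj.weight'].copy_(
            hf_state[f'transformer.h.{i}.attn.c_attn.weight'].T)
        our_state[f'blocks.{i}.attn.out_proj.weight'].copy_(
            hf_state[f'transformer.h.{i}.attn.c_proj.weight'].T)
        # GPT-2's attention projections also carry biases; copy them when our
        # layers were built with bias terms (skipped if they use bias=False)
        for ours, theirs in (('qkv_proj', 'c_attn'), ('out_proj', 'c_proj')):
            bias_key = f'blocks.{i}.attn.{ours}.bias'
            if bias_key in our_state:
                our_state[bias_key].copy_(
                    hf_state[f'transformer.h.{i}.attn.{theirs}.bias'])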

        # FFN (need to transpose!)
        our_state[f'blocks.{i}.ffn.fc1.weight'].copy_(
            hf_state[f'transformer.h.{i}.mlp.c_fc.weight'].T)
        our_state[f'blocks.{i}.ffn.fc1.bias'].copy_(
            hf_state[f'transformer.h.{i}.mlp.c_fc.bias'])
        our_state[f'blocks.{i}.ffn.fc2.weight'].copy_(
            hf_state[f'transformer.h.{i}.mlp.c_proj.weight'].T)
        our_state[f'blocks.{i}.ffn.fc2.bias'].copy_(
            hf_state[f'transformer.h.{i}.mlp.c_proj.bias'])

    # Final layer norm
    our_state['ln_f.weight'].copy_(hf_state['transformer.ln_f.weight'])
    our_state['ln_f.bias'].copy_(hf_state['transformer.ln_f.bias'])

    print("Weights loaded successfully!")
    return model
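
The transposes in the loader come from a storage convention. Hugging Face's GPT-2 implements these projections with a `Conv1D` module whose weight has shape `(in_features, out_features)`, while `nn.Linear` stores `(out_features, in_features)`. The plain-Python sketch below (toy numbers and hypothetical helper names) shows that the two conventions compute the same result once the weight is transposed:

```python
def matvec_linear(weight_out_in, x):
    """nn.Linear convention: weight is (out_features, in_features), y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_out_in]


def matvec_conv1d(weight_in_out, x):
    """Conv1D convention: weight is (in_features, out_features), y = x W."""
    n_out = len(weight_in_out[0])
    return [sum(x[i] * weight_in_out[i][j] for i in range(len(x)))
            for j in range(n_out)]


def transpose(matrix):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*matrix)]


# Toy weight stored the Conv1D way: 2 inputs, 3 outputs.
w_conv1d = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
x = [10.0, 1.0]

y_conv1d = matvec_conv1d(w_conv1d, x)
y_linear = matvec_linear(transpose(w_conv1d), x)
print(y_conv1d, y_linear)
```

Forgetting the transpose on a square matrix such as `c_proj` raises no shape error, which is why a quick generation test after loading is worth running.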

In [None]:
@torch.no_grad()
def generate_simple(model, tokenizer, prompt, max_new_tokens=20):
    """
    Generate text using greedy decoding.
    """
    model.eval()
    device = next(model.parameters()).device

    # Encode prompt
    token_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate tokens one at a time
    for _ in range(max_new_tokens):
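        # Crop to the context window so position indices stay in range
        context = token_ids[:, -model.config.max_seq_len:]
        logits = model(context)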
        next_logits = logits[:, -1, :]
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_token], dim=1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(token_ids[0])
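
Stripped of tensors and tokenizers, greedy decoding is a short loop: score every candidate, append the argmax, and stop at end-of-sequence. The toy below replaces the model with a hypothetical `toy_logits` function over a 4-token vocabulary, purely to make the control flow visible:

```python
def toy_logits(token_ids):
    """Hypothetical stand-in for a model: scores over a 4-token vocabulary."""
    last = token_ids[-1]
    # Favor the token that follows the last one (wrapping around at 4).
    return [1.0 if tok == (last + 1) % 4 else 0.0 for tok in range(4)]


def greedy_decode(token_ids, max_new_tokens, eos_id):
    """Repeatedly append the highest-scoring token until EOS or the limit."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        logits = toy_logits(token_ids)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        token_ids.append(next_token)
        if next_token == eos_id:
            break
    return token_ids


print(greedy_decode([0], max_new_tokens=10, eos_id=3))
```

Because there is no randomness anywhere in the loop, the same prompt always produces the same continuation, which is convenient for testing but tends to make long generations repetitive.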

In [None]:
# Create model with GPT-2 Small config
config = GPTConfig()  # Defaults match GPT-2 Small
model = MiniGPT(config)

# Load pretrained weights
model = load_gpt2_weights(model, "gpt2")
model = model.to(device)

print(f"\nModel on device: {device}")
print(f"Parameters: {count_parameters(model):,}")

In [None]:
# Generate text!
prompt = "The quick brown fox"
generated = generate_simple(model, tokenizer, prompt, max_new_tokens=30)

print(f"Prompt: '{prompt}'")
print(f"Generated: '{generated}'")

In [None]:
# Try more prompts!
prompts = [
    "Artificial intelligence will",
    "Once upon a time",
    "The capital of France is",
    "def fibonacci(n):"
]

for prompt in prompts:
    generated = generate_simple(model, tokenizer, prompt, max_new_tokens=20)
    print(f"'{prompt}' -> {generated}")
    print()

## 7. Random vs Pretrained Comparison

Let's put random weights and pretrained weights side by side on the same prompt.

In [None]:
# Random weights model
model_random = MiniGPT(GPTConfig())
model_random = model_random.to(device)

prompt = "Artificial intelligence will"

print("=" * 50)
print("RANDOM WEIGHTS:")
print(generate_simple(model_random, tokenizer, prompt, max_new_tokens=20))
print()
print("PRETRAINED WEIGHTS:")
print(generate_simple(model, tokenizer, prompt, max_new_tokens=20))
print("=" * 50)

print("\nSame architecture. Same code. Training makes all the difference!")

## 8. Exercises

### Exercise 1: Tiny Model

Build and test a tiny model with the specifications below.

In [None]:
# YOUR CODE HERE
# Build a tiny MiniGPT with:
# - vocab_size=500
# - max_seq_len=16
# - embed_dim=64
# - num_heads=2
# - num_layers=2
# - d_ff=256

# 1. Create the config
# 2. Build the model
# 3. Run a forward pass on random tokens
# 4. Verify output shape
# 5. Count parameters

### Exercise 2: Temperature Exploration

Modify generation to use temperature scaling.
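
Before touching the generation loop, it may help to see what temperature does to a distribution in isolation. This is a plain-Python sketch on made-up logits, not part of the exercise solution:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # toy scores for a 3-token vocabulary
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures sharpen the distribution toward the top token and higher temperatures flatten it. The ranking never changes, so temperature only has a visible effect once the next token is sampled (for example with `torch.multinomial`) rather than chosen by `argmax`.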

In [None]:
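# YOUR CODE HERE
# Modify generate_simple to accept a temperature parameter:
# next_logits = next_logits / temperature
# Then sample the next token from softmax(next_logits) instead of taking the
# argmax (dividing by a positive temperature never changes the argmax).
#
# Try temperatures: 0.5, 1.0, 1.5
# How does the output change?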

### Exercise 3: Attention Visualization

Visualize attention patterns in the pretrained model.

In [None]:
# YOUR CODE HERE
# 1. Run model with return_attention=True
# 2. Get attention weights from the last block
# 3. Plot a heatmap for head 0
# Hint: Use matplotlib.pyplot.imshow()

## Summary

**What we built:**

1. **GPTConfig**: Clean configuration pattern
2. **MiniGPT**: Complete language model with weight tying
3. **Sanity checks**: 5 tests to verify correctness
4. **Weight loading**: Transfer GPT-2 weights
5. **Text generation**: Greedy decoding

**Key concepts:**

- `nn.ModuleList` for stacking layers
- Weight tying saves about 38.6M parameters (`vocab_size × embed_dim`)
- LM head: `(batch, seq, embed_dim)` → `(batch, seq, vocab_size)`
- `model.eval()` disables dropout for deterministic inference; `model.train()` turns it back on

**Next:** Chapter 12 will teach you to train this model from scratch!