# GPT Research Paper | Part II

## The Architecture: Building the Decoder-Only Transformer

---

**Paper:** [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

**Authors:** Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI, 2018)

---

In Part I, we covered the motivation behind GPT: leverage massive unlabeled text through unsupervised pre-training, then fine-tune on specific tasks. Now we turn to the architecture that makes this possible.

This notebook provides:
- **Detailed explanations** of every architectural decision
- **Direct quotes** from the paper with analysis
- **Historical context** comparing to prior work
- **Complete implementations** with accurate visualizations

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle, Circle
import numpy as np
import math
from dataclasses import dataclass
from typing import Optional, Tuple

torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

---

## 1. The Big Picture: Why This Architecture?

### 1.1 The Core Insight from the Paper

From Section 3.1:

> *"We use a multi-layer Transformer decoder for the language model, which is a variant of the transformer. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens."*

These two sentences outline the entire architecture. Let's break them down:

| Phrase | Meaning |
|--------|--------|
| "multi-layer" | Stack multiple blocks (12 in GPT) |
| "Transformer decoder" | Not encoder-decoder, just decoder |
| "multi-headed self-attention" | Parallel attention mechanisms |
| "position-wise feedforward" | Same MLP applied to each position |
| "output distribution over target tokens" | Predict next token probabilities |

### 1.2 Why Decoder-Only? The Three Options in 2018

When GPT was being developed, researchers had three main architectural options:

**Option 1: RNN/LSTM** (the incumbent)
- Pros: Well-understood, proven on many tasks
- Cons: Sequential processing (slow), vanishing gradients, limited context

**Option 2: Transformer Encoder** (what BERT would use 4 months later)
- Pros: Bidirectional context, parallel processing
- Cons: Can't do autoregressive generation (sees future tokens)

**Option 3: Transformer Decoder** (GPT's choice)
- Pros: Parallel processing, natural for language modeling, can generate text
- Cons: Only sees left context (not bidirectional)

### 1.3 The Key Equation

From the paper's Equation 1:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1}; \Theta)$$

This is the **language modeling objective**: predict each token given only the *previous* tokens. This fundamentally requires a decoder architecture that can only look backwards.
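
To make the objective concrete, here is a minimal sketch that evaluates $L_1$ on a toy sequence. The logits are random stand-ins for a model's output and the sizes are arbitrary (both are assumptions for illustration); the context window $k$ is taken to cover the whole prefix.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, seq_len = 10, 6                  # toy sizes, not GPT's
tokens = torch.randint(0, vocab_size, (seq_len,))

# Stand-in for a model: logits[t] scores the token that follows position t.
logits = torch.randn(seq_len - 1, vocab_size)
targets = tokens[1:]                         # the token to predict at each step

# L1 = sum_i log P(u_i | previous tokens): pick the log-prob of each true next token.
log_probs = F.log_softmax(logits, dim=-1)
L1 = log_probs[torch.arange(seq_len - 1), targets].sum()

# Training maximizes L1, which is the same as minimizing summed cross-entropy.
nll = F.cross_entropy(logits, targets, reduction="sum")
print(f"L1 = {L1.item():.4f}, cross-entropy (sum) = {nll.item():.4f}")
```

The two printed numbers differ only in sign, which is why implementations usually call a cross-entropy loss instead of summing log-probabilities by hand.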

### 1.4 Historical Context: June 2018

GPT was published at a pivotal moment:

| Date | Model | Architecture | Key Innovation |
|------|-------|-------------|----------------|
| 2013 | Word2Vec | Static embeddings | Word vectors |
| Feb 2018 | ELMo | Bidirectional LSTM | Contextualized embeddings |
| **Jun 2018** | **GPT** | **Transformer Decoder** | **Pre-train + fine-tune paradigm** |
| Oct 2018 | BERT | Transformer Encoder | Bidirectional pre-training |

GPT established that **Transformers could be pre-trained on raw text and transfer to diverse tasks** - a paradigm that now dominates NLP.

---

## 2. Architecture Specifications: The Exact Numbers

### 2.1 What the Paper Says

From Section 4.1 (Model specifications):

> *"We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states."*

And on training:

> *"We trained for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens."*

And on regularization:

> *"We used ... residual, embedding, and attention dropouts with a rate of 0.1 for regularization."*

### 2.2 Complete Specification Table

| Parameter | Value | Paper Quote / Derivation |
|-----------|-------|-------------------------|
| Layers | 12 | "12-layer decoder-only transformer" |
| Hidden dimension | 768 | "768 dimensional states" |
| Attention heads | 12 | "12 attention heads" |
| Head dimension | 64 | 768 / 12 = 64 |
| FFN inner dim | 3072 | "3072 dimensional inner states" |
| Max sequence | 512 | "contiguous sequences of 512 tokens" |
| Vocab size | 40,478 | "40,000 merges" (BPE) plus the 478 base symbols of the released vocabulary (the total is not stated in the paper) |
| Dropout | 0.1 | "dropouts with a rate of 0.1" |

### 2.3 Why These Numbers?

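The paper does not justify these values; the points below are plausible reasons rather than stated ones.

**768 hidden dimension**:
- Divisible by 12 (heads) giving 64 per head
- 64 is a power of 2 (GPU-efficient)
- Larger than the 512 used by the original Transformer's base model, giving more capacity

**12 layers**:
- Double the 6 decoder layers of the original Transformer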
- More depth = more abstraction levels
- Same as BERT-base (published 4 months later)

**3072 FFN inner dimension**:
- Exactly 4x the hidden size (768 * 4)
- Standard ratio from original Transformer
- Provides "memory" for learned patterns

**512 context length**:
- Attention memory scales as O(n^2) with sequence length
- 512 tokens covers several paragraphs of contiguous text
- Longer than most prior work
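
The quadratic term is easy to quantify. The sketch below is a rough estimate that assumes 4-byte floats and counts only the attention score matrices for one sequence in one layer; activations, gradients, and other buffers are ignored. The helper name `attn_score_bytes` is made up for this example.

```python
def attn_score_bytes(seq_len: int, n_head: int = 12, bytes_per_float: int = 4) -> int:
    """Bytes needed for one layer's (n_head, seq_len, seq_len) attention scores."""
    return n_head * seq_len * seq_len * bytes_per_float


for n in (128, 512, 2048):
    mib = attn_score_bytes(n) / 2**20
    print(f"seq_len={n:5d}: {mib:8.2f} MiB per layer per sequence")
```

Under these assumptions 512 tokens costs 12 MiB per layer per sequence, and doubling the context quadruples that figure.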

In [None]:
@dataclass
class GPTConfig:
    """
    GPT-1 configuration.
    
    References:
    - Section 4.1 of the paper for architecture and regularization values
    - vocab_size comes from the released BPE vocabulary, not from the paper
    """
    # === From Section 4.1 (except vocab_size) ===
    vocab_size: int = 40478       # "40,000 merges" + 478 base symbols
    n_positions: int = 512        # "contiguous sequences of 512 tokens"
    n_embd: int = 768             # "768 dimensional states"
    n_layer: int = 12             # "12-layer decoder-only transformer"
    n_head: int = 12              # "12 attention heads"
    n_inner: int = 3072           # "3072 dimensional inner states"
    
    # === Regularization ===
    embd_pdrop: float = 0.1       # "dropouts with a rate of 0.1"
    attn_pdrop: float = 0.1       
    resid_pdrop: float = 0.1      
    
    @property
    def head_dim(self) -> int:
        """Each head has dimension 768/12 = 64"""
        return self.n_embd // self.n_head


config = GPTConfig()

print("GPT-1 Configuration (All values from paper Section 4.1)")
print("=" * 60)
print(f"\n[Architecture]")
print(f"  Vocabulary:        {config.vocab_size:,} tokens")
print(f"  Max sequence:      {config.n_positions} tokens")
print(f"  Hidden dimension:  {config.n_embd}")
print(f"  Layers:            {config.n_layer}")
print(f"  Attention heads:   {config.n_head}")
print(f"  Head dimension:    {config.head_dim} (= {config.n_embd}/{config.n_head})")
print(f"  FFN inner dim:     {config.n_inner} (= {config.n_embd} * 4)")
print(f"\n[Regularization]")
print(f"  Dropout rate:      {config.embd_pdrop}")

### 2.4 Complete Architecture Diagram

In [None]:
def draw_gpt_architecture():
    """Complete GPT architecture with paper annotations."""
    fig, ax = plt.subplots(figsize=(14, 18))
    ax.set_xlim(0, 14)
    ax.set_ylim(0, 18)
    ax.axis('off')
    
    # Colors
    c_emb = '#3498db'   # Blue - embeddings
    c_attn = '#e74c3c'  # Red - attention
    c_ffn = '#2ecc71'   # Green - FFN
    c_norm = '#f39c12'  # Orange - layer norm
    c_out = '#9b59b6'   # Purple - output
    
    # Title
    ax.text(7, 17.5, 'GPT Architecture', fontsize=20, fontweight='bold', ha='center')
    ax.text(7, 17, '"12-layer decoder-only transformer" - Section 4.1', 
            fontsize=11, ha='center', style='italic', color='gray')
    
    # === INPUT ===
    ax.text(7, 1.2, 'Input: "The cat sat on the mat"', fontsize=11, ha='center', fontweight='bold')
    tokens = ['<s>', 'The', 'cat', 'sat', 'on', '...']
    for i, tok in enumerate(tokens):
        x = 2.5 + i * 1.8
        rect = FancyBboxPatch((x-0.5, 0.3), 1.0, 0.6, boxstyle="round,pad=0.02",
                              facecolor='white', edgecolor='black', linewidth=1.5)
        ax.add_patch(rect)
        ax.text(x, 0.6, tok, ha='center', va='center', fontsize=10)
    
    ax.annotate('', xy=(7, 2.0), xytext=(7, 1.4),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    # === EMBEDDINGS ===
    rect_te = FancyBboxPatch((3.5, 2.1), 3, 1, boxstyle="round,pad=0.03",
                             facecolor=c_emb, edgecolor='black', linewidth=2)
    ax.add_patch(rect_te)
    ax.text(5, 2.6, 'Token Embedding', ha='center', va='center', fontsize=11,
            fontweight='bold', color='white')
    
    rect_pe = FancyBboxPatch((7.5, 2.1), 3, 1, boxstyle="round,pad=0.03",
                             facecolor=c_emb, edgecolor='black', linewidth=2)
    ax.add_patch(rect_pe)
    ax.text(9, 2.6, 'Position Embedding', ha='center', va='center', fontsize=11,
            fontweight='bold', color='white')
    
    ax.text(6.75, 2.6, '+', fontsize=20, ha='center', va='center', fontweight='bold')
    ax.text(12, 2.6, '40,478 x 768\n(learned)', fontsize=9, ha='left', 
            va='center', color='gray', style='italic')
    
    ax.annotate('', xy=(7, 3.6), xytext=(7, 3.2),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    # Dropout
    rect_drop = FancyBboxPatch((5.5, 3.7), 3, 0.5, boxstyle="round,pad=0.02",
                               facecolor='#ecf0f1', edgecolor='black', linewidth=1)
    ax.add_patch(rect_drop)
    ax.text(7, 3.95, 'Dropout (p=0.1)', ha='center', va='center', fontsize=9)
    
    # === TRANSFORMER BLOCK ===
    block_y = 4.5
    block_h = 7.5
    
    rect_block = FancyBboxPatch((2, block_y), 10, block_h, boxstyle="round,pad=0.05",
                                facecolor='#f8f9fa', edgecolor='#2c3e50', linewidth=2.5)
    ax.add_patch(rect_block)
    ax.text(0.8, block_y + block_h/2, 'Transformer\nDecoder\nBlock\n\nx 12', fontsize=11,
            ha='center', va='center', fontweight='bold')
    
    ax.annotate('', xy=(7, block_y + 0.4), xytext=(7, 4.3),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    # Layer Norm 1
    ln1_y = block_y + 0.6
    rect_ln1 = FancyBboxPatch((4, ln1_y), 6, 0.7, boxstyle="round,pad=0.02",
                              facecolor=c_norm, edgecolor='black', linewidth=2)
    ax.add_patch(rect_ln1)
    ax.text(7, ln1_y + 0.35, 'Layer Normalization', ha='center', va='center',
            fontsize=10, fontweight='bold', color='white')
    
    ax.annotate('', xy=(7, ln1_y + 1.1), xytext=(7, ln1_y + 0.75),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    # Attention
    attn_y = ln1_y + 1.2
    rect_attn = FancyBboxPatch((3.5, attn_y), 7, 1.6, boxstyle="round,pad=0.03",
                               facecolor=c_attn, edgecolor='black', linewidth=2)
    ax.add_patch(rect_attn)
    ax.text(7, attn_y + 1.0, 'Masked Multi-Head', ha='center', va='center',
            fontsize=12, fontweight='bold', color='white')
    ax.text(7, attn_y + 0.5, 'Self-Attention', ha='center', va='center',
            fontsize=12, fontweight='bold', color='white')
    ax.text(11.5, attn_y + 1.0, '12 heads', fontsize=9, ha='left', color='gray', style='italic')
    ax.text(11.5, attn_y + 0.6, '64 dim/head', fontsize=9, ha='left', color='gray', style='italic')
    
    # Residual 1
    res_x = 2.8
    ax.plot([res_x, res_x], [block_y + 0.4, attn_y + 2.0], 'k-', linewidth=2)
    ax.plot([res_x, 6.6], [attn_y + 2.0, attn_y + 2.0], 'k-', linewidth=2)
    
    ax.annotate('', xy=(7, attn_y + 2.15), xytext=(7, attn_y + 1.7),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    circle1 = Circle((7, attn_y + 2.2), 0.25, facecolor='white', edgecolor='black', linewidth=2)
    ax.add_patch(circle1)
    ax.text(7, attn_y + 2.2, '+', ha='center', va='center', fontsize=14, fontweight='bold')
    
    # Layer Norm 2
    ln2_y = attn_y + 2.6
    rect_ln2 = FancyBboxPatch((4, ln2_y), 6, 0.7, boxstyle="round,pad=0.02",
                              facecolor=c_norm, edgecolor='black', linewidth=2)
    ax.add_patch(rect_ln2)
    ax.text(7, ln2_y + 0.35, 'Layer Normalization', ha='center', va='center',
            fontsize=10, fontweight='bold', color='white')
    
    ax.annotate('', xy=(7, ln2_y + 1.0), xytext=(7, ln2_y + 0.75),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    # FFN
    ffn_y = ln2_y + 1.1
    rect_ffn = FancyBboxPatch((3.5, ffn_y), 7, 1.6, boxstyle="round,pad=0.03",
                              facecolor=c_ffn, edgecolor='black', linewidth=2)
    ax.add_patch(rect_ffn)
    ax.text(7, ffn_y + 1.0, 'Position-wise', ha='center', va='center',
            fontsize=12, fontweight='bold', color='white')
    ax.text(7, ffn_y + 0.5, 'Feed-Forward Network', ha='center', va='center',
            fontsize=12, fontweight='bold', color='white')
    ax.text(11.5, ffn_y + 1.0, '768 -> 3072 -> 768', fontsize=9, ha='left', color='gray', style='italic')
    ax.text(11.5, ffn_y + 0.6, 'GELU activation', fontsize=9, ha='left', color='gray', style='italic')
    
    # Residual 2
    ax.plot([res_x, res_x], [attn_y + 2.45, ffn_y + 1.9], 'k-', linewidth=2)
    ax.plot([res_x, 6.6], [ffn_y + 1.9, ffn_y + 1.9], 'k-', linewidth=2)
    
    ax.annotate('', xy=(7, ffn_y + 2.05), xytext=(7, ffn_y + 1.7),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    circle2 = Circle((7, ffn_y + 2.1), 0.25, facecolor='white', edgecolor='black', linewidth=2)
    ax.add_patch(circle2)
    ax.text(7, ffn_y + 2.1, '+', ha='center', va='center', fontsize=14, fontweight='bold')
    
    # === OUTPUT ===
    out_y = block_y + block_h + 0.3
    ax.annotate('', xy=(7, out_y + 0.3), xytext=(7, out_y),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    rect_ln_f = FancyBboxPatch((4, out_y + 0.4), 6, 0.7, boxstyle="round,pad=0.02",
                               facecolor=c_norm, edgecolor='black', linewidth=2)
    ax.add_patch(rect_ln_f)
    ax.text(7, out_y + 0.75, 'Layer Normalization', ha='center', va='center',
            fontsize=10, fontweight='bold', color='white')
    
    ax.annotate('', xy=(7, out_y + 1.5), xytext=(7, out_y + 1.15),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    rect_head = FancyBboxPatch((4, out_y + 1.6), 6, 0.8, boxstyle="round,pad=0.03",
                               facecolor=c_out, edgecolor='black', linewidth=2)
    ax.add_patch(rect_head)
    ax.text(7, out_y + 2.0, 'Linear (768 -> 40,478)', ha='center', va='center',
            fontsize=10, fontweight='bold', color='white')
    ax.text(11.5, out_y + 2.0, 'Weight tied with\ntoken embedding', fontsize=9, 
            ha='left', color='gray', style='italic')
    
    ax.annotate('', xy=(7, out_y + 2.8), xytext=(7, out_y + 2.45),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    
    rect_soft = FancyBboxPatch((5, out_y + 2.9), 4, 0.6, boxstyle="round,pad=0.02",
                               facecolor='#ecf0f1', edgecolor='black', linewidth=1.5)
    ax.add_patch(rect_soft)
    ax.text(7, out_y + 3.2, 'Softmax', ha='center', va='center', fontsize=11, fontweight='bold')
    
    ax.annotate('', xy=(7, out_y + 3.9), xytext=(7, out_y + 3.55),
                arrowprops=dict(arrowstyle='->', color='black', lw=2))
    ax.text(7, out_y + 4.2, 'P(next token | context)', fontsize=12, ha='center', 
            va='center', fontweight='bold')
    
    # Legend
    legend_items = [(c_emb, 'Embeddings'), (c_norm, 'Layer Norm'),
                    (c_attn, 'Attention'), (c_ffn, 'Feed-Forward'), (c_out, 'Output')]
    for i, (color, label) in enumerate(legend_items):
        x = 1.5 + i * 2.4
        rect = Rectangle((x, 16.3), 0.4, 0.4, facecolor=color, edgecolor='black')
        ax.add_patch(rect)
        ax.text(x + 0.55, 16.5, label, fontsize=9, va='center')
    
    plt.tight_layout()
    plt.show()

draw_gpt_architecture()

---

## 3. GELU Activation: Why Not ReLU?

### 3.1 The Paper's Choice

The original Transformer (2017) used ReLU. GPT explicitly chose **GELU**:

> *"For the activation function, we used the Gaussian Error Linear Unit (GELU)."*

### 3.2 What is GELU?

GELU (Gaussian Error Linear Unit) was introduced by Hendrycks & Gimpel in 2016. The key idea:

**ReLU asks**: "Is this value positive?" (binary gate)
$$\text{ReLU}(x) = \max(0, x) = x \cdot \mathbf{1}_{x > 0}$$

**GELU asks**: "How likely is a standard normal variable to fall below this value?" (probabilistic gate)
$$\text{GELU}(x) = x \cdot \Phi(x)$$

Where $\Phi(x)$ is the CDF of the standard normal distribution.

### 3.3 The Approximation Used in Practice

Evaluating the exact CDF requires the error function, so the original GPT code uses a tanh-based approximation:

$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right]\right)$$
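
How close is the approximation? A small check using only the standard library, with $\Phi$ written through `math.erf`. The function names `gelu_exact` and `gelu_tanh` are chosen for this sketch.

```python
import math


def gelu_exact(x: float) -> float:
    """x * Phi(x), with the normal CDF expressed through the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan [-5, 5] in steps of 0.01 and record the largest gap between the two.
xs = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh approx| on [-5, 5]: {max_err:.2e}")
```

The gap is far below typical activation magnitudes, so the two forms are interchangeable for most practical purposes.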

### 3.4 Why GELU Works Better

| Property | ReLU | GELU |
|----------|------|------|
| For x < 0 | Always 0 ("dead") | Small negative values |
| At x = 0 | Not differentiable | Smooth |
| Gradient | Discontinuous | Continuous everywhere |
| Interpretation | Hard gate | Soft, probabilistic gate |
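
The gradient row of the table can be checked with autograd. This sketch uses PyTorch's built-in `F.gelu` (the exact form by default) instead of the tanh approximation, a substitution made here for brevity.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

# Differentiating sum(f(x)) with respect to x yields the elementwise derivative f'(x).
(relu_grad,) = torch.autograd.grad(F.relu(x).sum(), x)
(gelu_grad,) = torch.autograd.grad(F.gelu(x).sum(), x)

print("x:        ", x.tolist())
print("ReLU grad:", relu_grad.tolist())   # exactly 0 for negative inputs
print("GELU grad:", gelu_grad.tolist())   # nonzero on both sides of 0
```

A unit whose pre-activation is negative receives no gradient under ReLU, while GELU still passes a small (possibly negative) signal.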

In [None]:
def gelu_approx(x: torch.Tensor) -> torch.Tensor:
    """GELU activation (tanh approximation as used in GPT)."""
    return 0.5 * x * (1.0 + torch.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
    ))


# Visualization
x = torch.linspace(-4, 4, 300)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: GELU vs ReLU
ax1 = axes[0]
ax1.plot(x.numpy(), gelu_approx(x).numpy(), 'b-', linewidth=3, label='GELU (GPT)')
ax1.plot(x.numpy(), F.relu(x).numpy(), 'r--', linewidth=2.5, label='ReLU (Original Transformer)')
ax1.axhline(y=0, color='gray', linewidth=0.8, linestyle=':')
ax1.axvline(x=0, color='gray', linewidth=0.8, linestyle=':')
ax1.fill_between(x.numpy(), 0, gelu_approx(x).numpy(), where=(x.numpy() < 0), 
                 alpha=0.3, color='blue', label='GELU allows small negatives')
ax1.set_xlabel('Input (x)', fontsize=12)
ax1.set_ylabel('Output', fontsize=12)
ax1.set_title('GELU vs ReLU Activation Functions', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-4, 4)
ax1.set_ylim(-0.5, 4)

# Plot 2: The "gate" interpretation
ax2 = axes[1]
gate = 0.5 * (1 + torch.tanh(math.sqrt(2/math.pi) * (x + 0.044715 * x**3)))
relu_gate = (x > 0).float()

ax2.plot(x.numpy(), gate.numpy(), 'b-', linewidth=3, label='GELU gate (smooth)')
ax2.plot(x.numpy(), relu_gate.numpy(), 'r--', linewidth=2.5, label='ReLU gate (hard)')
ax2.axhline(y=0.5, color='gray', linewidth=0.8, linestyle=':')
ax2.axvline(x=0, color='gray', linewidth=0.8, linestyle=':')
ax2.set_xlabel('Input (x)', fontsize=12)
ax2.set_ylabel('Gate Value', fontsize=12)
ax2.set_title('The "Gating" Function\nGELU(x) = x * gate(x)', fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-4, 4)
ax2.set_ylim(-0.1, 1.1)

ax2.annotate('Smooth\ntransition', xy=(0, 0.5), xytext=(1.5, 0.3),
            arrowprops=dict(arrowstyle='->', color='blue', lw=2),
            fontsize=11, color='blue', fontweight='bold')

plt.tight_layout()
plt.show()

print("Why GELU became a common choice in transformers:")
print("  1. Smooth gradients -> often easier optimization")
print("  2. Non-zero gradient for negative inputs -> fewer 'dead' units")
print("  3. Also used in GPT-2, GPT-3, BERT, and many later transformers")

---

## 4. Layer Normalization: Pre-LN vs Post-LN

### 4.1 Why Layer Norm?

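Normalization layers were originally motivated by **internal covariate shift**: the distribution of each layer's inputs changes during training as earlier layers update. Whatever the precise mechanism, normalizing activations makes deep networks easier to train.

**Batch Norm** (commonly used in vision): Normalizes across batch dimension
- Problem: Statistics depend on the batch, which works poorly with small batches and variable-length sequences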

**Layer Norm** (used in GPT): Normalizes across feature dimension
- Each position normalized independently
- Works with any batch size, any sequence length

### 4.2 The Math

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:
- $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ (mean across 768 features)
- $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$ (variance)
- $\gamma, \beta$ are learnable parameters
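
As a sanity check on the formula, the sketch below computes the normalization by hand (with $\gamma = 1$, $\beta = 0$) and compares it against PyTorch's `F.layer_norm`. The input scale and shift are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 768
x = torch.randn(2, 5, d) * 10 + 5                    # (batch, positions, features)

mu = x.mean(dim=-1, keepdim=True)                    # one mean per position
var = x.var(dim=-1, keepdim=True, unbiased=False)    # 1/d variance, as in the formula
eps = 1e-5
manual = (x - mu) / torch.sqrt(var + eps)            # gamma = 1, beta = 0

builtin = F.layer_norm(x, normalized_shape=(d,), eps=eps)
print("max abs difference:", (manual - builtin).abs().max().item())
```

Note the `unbiased=False`: the formula divides by $d$, not $d - 1$.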

### 4.3 Pre-LN vs Post-LN (Critical Choice!)

**Post-LN** (Original Transformer 2017):
```
x = LayerNorm(x + Sublayer(x))
```

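**Pre-LN** (GPT-2 and most later transformers; also the layout implemented in this notebook):
```
x = x + Sublayer(LayerNorm(x))
```

Note that the GPT paper says its model "largely follows the original transformer work", and the original release keeps the Post-LN layout. Moving the normalization to the input of each sub-block, plus a final LayerNorm after the last block, was introduced with GPT-2. The implementation in this notebook uses that later layout.

**Why Pre-LN is usually preferred:**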
1. Residual path stays "clean" (gradients flow directly)
2. More stable training without careful warmup
3. Enables training much deeper networks
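
The two arrangements differ only in where the normalization sits relative to the residual addition. A minimal sketch, with a single `nn.Linear` standing in for the attention or feed-forward sublayer (a simplification for illustration; the class names are made up for this example):

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """x = LayerNorm(x + Sublayer(x))"""

    def __init__(self, d: int):
        super().__init__()
        self.sublayer = nn.Linear(d, d)   # stand-in for attention or FFN
        self.ln = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """x = x + Sublayer(LayerNorm(x))"""

    def __init__(self, d: int):
        super().__init__()
        self.sublayer = nn.Linear(d, d)   # stand-in for attention or FFN
        self.ln = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.ln(x))


torch.manual_seed(0)
d = 16
x = torch.randn(4, d) * 3

with torch.no_grad():
    post_out = PostLNBlock(d)(x)
    pre_out = PreLNBlock(d)(x)

# Post-LN renormalizes the sum, so every output row has mean 0 and std 1.
# Pre-LN adds the sublayer output onto x, so x itself passes through untouched.
print("Post-LN row std:", post_out.std(dim=-1, unbiased=False))
print("Pre-LN  row std:", pre_out.std(dim=-1, unbiased=False))
```

In the Pre-LN block the input reaches the output through a plain addition, which is what "clean residual path" refers to.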

In [None]:
class LayerNorm(nn.Module):
    """Layer Normalization as used in GPT (normalizes across features)."""
    
    def __init__(self, n_embd: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(n_embd))   # Scale
        self.beta = nn.Parameter(torch.zeros(n_embd))   # Shift
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta


# Test
ln = LayerNorm(config.n_embd)
x = torch.randn(2, 5, config.n_embd) * 10 + 5  # Unnormalized
out = ln(x)

print("LayerNorm Test")
print("=" * 50)
print(f"Input:  mean={x[0,0].mean():.2f}, std={x[0,0].std():.2f}")
print(f"Output: mean={out[0,0].mean():.4f}, std={out[0,0].std():.4f}")
print(f"\nAfter LayerNorm: mean -> 0, std -> 1 (as expected)")

---

## 5. Causal Self-Attention: The Heart of GPT

### 5.1 What the Paper Says

> *"This model applies a **multi-headed self-attention** operation over the input context tokens..."*

> *"We trained a 12-layer decoder-only transformer with **masked self-attention heads**"*

The word **"masked"** is crucial - it means causal masking.

### 5.2 Why Causal Masking?

The training objective (Equation 1 from the paper):

$$L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1}; \Theta)$$

To predict token $u_i$, we can only use tokens $u_1, ..., u_{i-1}$. If position $i$ could see position $i+1$, it would be **cheating**.
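
One way to test for this kind of leak is to change a later token and confirm that earlier outputs do not move. The sketch below does so with a toy single-head attention; the random projection matrices and sizes are assumptions for illustration, not GPT's weights.

```python
import math
import torch

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(T, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # toy projection weights


def causal_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head attention in which position i sees only positions <= i."""
    n = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / math.sqrt(d)
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()   # True where j > i
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


out = causal_attention(x)

# Change only the last token and recompute.
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed = causal_attention(x_changed)

print("earlier positions unchanged:", torch.allclose(out[:-1], out_changed[:-1]))
print("last position changed:      ", not torch.allclose(out[-1], out_changed[-1]))
```

Without the mask, every row of `out` would shift when the last token changes.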

### 5.3 The Attention Equation

Standard attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Causal attention adds a mask:
$$\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Where:
$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$
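
The additive mask $M$ can be built directly. In this small sketch the raw scores are all set to zero (an assumption that makes the result easy to read), so each row spreads its attention evenly over the positions it is allowed to see.

```python
import torch

T = 4
# M[i, j] = 0 where j <= i, and -inf where j > i (strictly above the diagonal).
M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.zeros(T, T)                 # equal raw scores, for illustration
weights = torch.softmax(scores + M, dim=-1)

print(M)
print(weights)   # row i spreads its weight evenly over positions 0..i
```

Because $e^{-\infty} = 0$, masked positions receive exactly zero weight and each row still sums to 1.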

### 5.4 Multi-Head Attention

From the paper: **"12 attention heads"** with **"768 dimensional states"**

This means:
- 12 parallel attention operations
- Each head has dimension 768/12 = 64
- Each head can learn different patterns (syntax, semantics, etc.)
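
Splitting into heads is only a reshape of the feature axis. The sketch below applies it to a random tensor to show the shapes; in real attention the same reshape is applied to the Q, K, and V projections.

```python
import torch

B, T, n_embd, n_head = 2, 10, 768, 12
head_dim = n_embd // n_head                      # 64

x = torch.randn(B, T, n_embd)

# Split the feature axis into (heads, head_dim), then move heads ahead of time.
heads = x.view(B, T, n_head, head_dim).transpose(1, 2)    # (B, n_head, T, head_dim)

# Head h holds features [h*64, (h+1)*64) of the original vector.
assert torch.equal(heads[:, 3], x[:, :, 3 * head_dim : 4 * head_dim])

# Merging reverses the two steps and recovers the original tensor.
merged = heads.transpose(1, 2).contiguous().view(B, T, n_embd)
print(heads.shape, merged.shape, torch.equal(merged, x))
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view that `view` cannot reshape directly.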

In [None]:
class CausalSelfAttention(nn.Module):
    """
    Multi-head causal self-attention.
    
    From paper: "12 attention heads" with "768 dimensional states"
    -> 64 dimensions per head
    """
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        self.scale = 1.0 / math.sqrt(self.head_dim)
        
        # Q, K, V projections combined
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # Dropout
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        # Causal mask
        mask = torch.tril(torch.ones(config.n_positions, config.n_positions))
        self.register_buffer('mask', mask.view(1, 1, config.n_positions, config.n_positions))
    
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        B, T, C = x.shape
        
        # Project to Q, K, V
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        
        # Reshape for multi-head
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        
        # Attention scores
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn_weights = F.softmax(attn, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        
        # Apply to values
        out = attn_weights @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.resid_dropout(self.c_proj(out))
        
        return out, attn_weights


# Test
attn = CausalSelfAttention(config)
x = torch.randn(2, 10, config.n_embd)
out, weights = attn(x)

print("Causal Self-Attention Test")
print("=" * 50)
print(f"Input:            {x.shape}")
print(f"Output:           {out.shape}")
print(f"Attention shape:  {weights.shape} (batch, heads, queries, keys)")
print(f"Parameters:       {sum(p.numel() for p in attn.parameters()):,}")

In [None]:
def visualize_causal_attention():
    """Visualize the causal attention mechanism."""
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))
    
    tokens = ['<s>', 'The', 'cat', 'sat', 'on']
    n = len(tokens)
    
    # Generate example scores
    np.random.seed(42)
    scores = np.random.randn(n, n) * 0.5
    
    # Plot 1: Raw scores
    ax1 = axes[0]
    im1 = ax1.imshow(scores, cmap='RdBu_r', vmin=-2, vmax=2)
    ax1.set_title('Step 1: Raw Attention Scores\n(QK^T / sqrt(d_k))', fontsize=12, fontweight='bold')
    ax1.set_xticks(range(n))
    ax1.set_xticklabels(tokens, fontsize=10)
    ax1.set_yticks(range(n))
    ax1.set_yticklabels(tokens, fontsize=10)
    ax1.set_xlabel('Key (attending TO)', fontsize=10)
    ax1.set_ylabel('Query (attending FROM)', fontsize=10)
    plt.colorbar(im1, ax=ax1, shrink=0.8)
    
    # Plot 2: Causal mask
    ax2 = axes[1]
    mask = np.tril(np.ones((n, n)))
    scores_masked = np.where(mask == 1, scores, -10)
    im2 = ax2.imshow(scores_masked, cmap='RdBu_r', vmin=-3, vmax=2)
    ax2.set_title('Step 2: Apply Causal Mask\n(future = -inf)', fontsize=12, fontweight='bold')
    ax2.set_xticks(range(n))
    ax2.set_xticklabels(tokens, fontsize=10)
    ax2.set_yticks(range(n))
    ax2.set_yticklabels(tokens, fontsize=10)
    for i in range(n):
        for j in range(n):
            if j > i:
                ax2.text(j, i, 'X', ha='center', va='center', fontsize=12, 
                        color='red', fontweight='bold')
    plt.colorbar(im2, ax=ax2, shrink=0.8)
    
    # Plot 3: After softmax
    ax3 = axes[2]
    scores_inf = np.where(mask == 1, scores, -np.inf)
    attn_weights = np.zeros((n, n))
    for i in range(n):
        row = scores_inf[i, :i+1]
        exp_row = np.exp(row - np.max(row))
        attn_weights[i, :i+1] = exp_row / exp_row.sum()
    
    im3 = ax3.imshow(attn_weights, cmap='Blues', vmin=0, vmax=1)
    ax3.set_title('Step 3: Softmax\n(each row sums to 1)', fontsize=12, fontweight='bold')
    ax3.set_xticks(range(n))
    ax3.set_xticklabels(tokens, fontsize=10)
    ax3.set_yticks(range(n))
    ax3.set_yticklabels(tokens, fontsize=10)
    for i in range(n):
        for j in range(n):
            val = attn_weights[i, j]
            if val > 0.01:
                color = 'white' if val > 0.5 else 'black'
                ax3.text(j, i, f'{val:.2f}', ha='center', va='center', fontsize=9, color=color)
    plt.colorbar(im3, ax=ax3, shrink=0.8)
    
    plt.tight_layout()
    plt.show()
    
    print("Key insight: Each position can ONLY attend to itself and previous positions.")
    print("  - '<s>' (pos 0): 100% self-attention")
    print("  - 'on'  (pos 4): Can attend to all 5 tokens")

visualize_causal_attention()

---

## 6. Feed-Forward Network: The "Memory"

### 6.1 From the Paper

> *"For the position-wise feed-forward networks, we used 3072 dimensional inner states."*

### 6.2 Architecture

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

- $W_1 \in \mathbb{R}^{768 \times 3072}$ (expand)
- $W_2 \in \mathbb{R}^{3072 \times 768}$ (contract)
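
"Position-wise" means the same two matrices are applied to every position independently. The sketch below checks this by running the network on a whole sequence and on each position alone; it uses `nn.GELU()` (the exact form) as a stand-in for the tanh approximation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, n_inner, T = 768, 3072, 5

ffn = nn.Sequential(
    nn.Linear(n_embd, n_inner),    # W1, b1: expand
    nn.GELU(),
    nn.Linear(n_inner, n_embd),    # W2, b2: contract
)

x = torch.randn(T, n_embd)
with torch.no_grad():
    all_at_once = ffn(x)                                      # whole sequence
    one_by_one = torch.stack([ffn(x[t]) for t in range(T)])   # each position alone

print("same result:", torch.allclose(all_at_once, one_by_one, atol=1e-4))
```

No information moves between positions here; mixing across positions is left entirely to attention.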

### 6.3 Why FFN Matters

Think of attention and FFN as doing different jobs:
- **Attention**: Routing information between positions
- **FFN**: Processing information at each position

The 3072-dimensional hidden layer is often described as a "memory" that stores learned patterns. Some later interpretability studies report that individual FFN neurons respond to recognizable patterns, although this is not a claim made in the GPT paper.

In [None]:
class MLP(nn.Module):
    """Position-wise FFN. From paper: '3072 dimensional inner states'"""
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_inner)    # 768 -> 3072
        self.c_proj = nn.Linear(config.n_inner, config.n_embd)  # 3072 -> 768
        self.dropout = nn.Dropout(config.resid_pdrop)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.c_fc(x)
        x = gelu_approx(x)
        x = self.c_proj(x)
        return self.dropout(x)


mlp = MLP(config)
print("FFN Analysis")
print("=" * 50)
print(f"Architecture: 768 -> 3072 -> 768 (4x expansion)")
print(f"Parameters:   {sum(p.numel() for p in mlp.parameters()):,}")
print(f"  W1: 768 x 3072 + 3072 = {768*3072 + 3072:,}")
print(f"  W2: 3072 x 768 + 768 = {3072*768 + 768:,}")

---

## 7. Complete GPT Model

Now we assemble all components:

In [None]:
class Block(nn.Module):
    """Transformer block with Pre-LayerNorm."""
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd)
        self.mlp = MLP(config)
    
    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        attn_out, attn_weights = self.attn(self.ln_1(x))
        x = x + attn_out  # Residual
        x = x + self.mlp(self.ln_2(x))  # Residual
        return x, attn_weights


class GPTEmbeddings(nn.Module):
    """Token + Position embeddings (both learned)."""
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        self.drop = nn.Dropout(config.embd_pdrop)
    
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        B, T = input_ids.shape
        tok_emb = self.wte(input_ids)
        pos_emb = self.wpe(torch.arange(T, device=input_ids.device))
        return self.drop(tok_emb + pos_emb)


class GPT(nn.Module):
    """GPT model with the paper's dimensions (Pre-LN blocks and a final LayerNorm, as in GPT-2)."""
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        
        self.embeddings = GPTEmbeddings(config)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # Weight tying (important!)
        self.lm_head.weight = self.embeddings.wte.weight
        
        self.apply(self._init_weights)
        print(f"GPT: {sum(p.numel() for p in self.parameters()):,} parameters")
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, input_ids: torch.Tensor, targets: Optional[torch.Tensor] = None):
        x = self.embeddings(input_ids)
        for block in self.blocks:
            x, _ = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss


model = GPT(config)

In [None]:
def parameter_analysis(config):
    """Detailed parameter breakdown."""
    print("\nGPT-1 Parameter Breakdown")
    print("=" * 60)
    
    # Embeddings
    tok_emb = config.vocab_size * config.n_embd
    pos_emb = config.n_positions * config.n_embd
    print(f"\nEMBEDDINGS")
    print(f"  Token:    {config.vocab_size:,} x {config.n_embd} = {tok_emb:,}")
    print(f"  Position: {config.n_positions} x {config.n_embd} = {pos_emb:,}")
    
    # Per block
    attn = config.n_embd * (3 * config.n_embd) + (3 * config.n_embd) + \
           config.n_embd * config.n_embd + config.n_embd
    ffn = config.n_embd * config.n_inner + config.n_inner + \
          config.n_inner * config.n_embd + config.n_embd
    ln = 4 * config.n_embd  # 2 LayerNorms
    block = attn + ffn + ln
    
    print(f"\nPER BLOCK")
    print(f"  Attention:  {attn:,}")
    print(f"  FFN:        {ffn:,}")
    print(f"  LayerNorm:  {ln:,}")
    print(f"  Total:      {block:,}")
    
    all_blocks = block * config.n_layer
    final_ln = 2 * config.n_embd
    total = tok_emb + pos_emb + all_blocks + final_ln
    
    print(f"\nTOTAL")
    print(f"  12 Blocks:    {all_blocks:,}")
    print(f"  Final LN:     {final_ln:,}")
    print(f"  LM Head:      (tied, 0 additional)")
    print(f"  " + "="*40)
    print(f"  TOTAL:        {total:,}")
    print(f"\n  Commonly cited figure: ~117M parameters")

parameter_analysis(config)

In [None]:
# Test
model.eval()
x = torch.randint(0, config.vocab_size, (2, 64))
y = torch.randint(0, config.vocab_size, (2, 64))

with torch.no_grad():
    logits, loss = model(x, y)

print("\nForward Pass Test")
print("=" * 50)
print(f"Input:  {x.shape}")
print(f"Logits: {logits.shape}")
print(f"Loss:   {loss.item():.4f}")
print(f"Expected (random): {math.log(config.vocab_size):.4f}")
print(f"\nModel working correctly!")
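
Why is $\ln V$ the expected loss for an untrained model? If every token receives the same score, each has probability $1/V$, and the cross-entropy is $-\ln(1/V) = \ln V$. A small check with all-zero logits (a stand-in for a model that has learned nothing yet):

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 40478
logits = torch.zeros(8, vocab_size)               # equal score for every token
targets = torch.randint(0, vocab_size, (8,))      # any targets give the same loss

loss = F.cross_entropy(logits, targets)
print(f"cross-entropy: {loss.item():.4f}   ln(V): {math.log(vocab_size):.4f}")
```

A freshly initialized model with small weights produces nearly uniform predictions, so its loss should land close to this value.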

---

## 8. Summary

### Key Architectural Choices (From the Paper Unless Noted)

| Choice | GPT Specification | Why |
|--------|------------------|-----|
| Architecture | Decoder-only | Language modeling is autoregressive |
| Depth | 12 layers | Balance of capacity and efficiency |
| Hidden dim | 768 | Divisible by 12, GPU-efficient |
| Attention heads | 12 | Multiple parallel attention patterns |
| FFN inner | 3072 (4x) | Standard expansion ratio |
| Activation | GELU | Smooth, better than ReLU |
| Normalization | Pre-LN (this notebook; the original release is Post-LN) | Better gradient flow |
| Positions | Learned | Flexibility |
| Output | Tied weights | Reduces parameters |
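
The "Tied weights" row can be illustrated in a few lines. This is a toy sketch with made-up sizes; it shows that sharing one `Parameter` between the embedding and the output projection removes a full vocabulary-sized matrix from the count.

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 100, 16          # toy sizes for illustration
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)

# nn.Linear stores its weight as (out_features, in_features) = (vocab, n_embd),
# the same shape as the embedding table, so the Parameter can be shared directly.
lm_head.weight = wte.weight

tied = nn.ModuleDict({"wte": wte, "lm_head": lm_head})
n_params = sum(p.numel() for p in tied.parameters())   # shared tensor counted once
print(f"parameters with tying: {n_params:,} (one {vocab_size} x {n_embd} matrix)")
```

At GPT's sizes the shared matrix has 40,478 x 768 entries, roughly 31M parameters that would otherwise be duplicated.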

### Differences from Original Transformer (2017)

| Aspect | Original | GPT |
|--------|----------|-----|
| Structure | Encoder-Decoder | Decoder-only |
| Position | Sinusoidal | Learned |
| Activation | ReLU | GELU |
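| LayerNorm | Post-LN | Post-LN in the original release; Pre-LN in this notebook (as in GPT-2) |

Weight tying is not a difference: the original Transformer also shares the embedding matrix with the pre-softmax projection.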

### What's Next

**Part III**: Pre-training - Training objective, learning rate schedule, text generation

**Part IV**: Fine-tuning - Task-specific input transformations, auxiliary loss

**Part V**: Complete implementation and evaluation

---

## References

1. Radford et al. (2018). [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
2. Vaswani et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
3. Hendrycks & Gimpel (2016). [GELUs](https://arxiv.org/abs/1606.08415)
4. Ba et al. (2016). [Layer Normalization](https://arxiv.org/abs/1607.06450)