# GPT Research Paper | Part V

## Complete Implementation from Scratch

---

**Paper:** [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

**Authors:** Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI, 2018)

---

This is the **final, comprehensive implementation** of GPT-1. Every line is documented with:
- Paper references and quotes
- Mathematical derivations
- Design rationale
- Shape annotations

## Table of Contents

1. **Configuration** - All hyperparameters from Section 4.1
2. **GELU Activation** - Gaussian Error Linear Unit
3. **Layer Normalization** - Pre-LN variant
4. **Causal Self-Attention** - Multi-head with masking
5. **Feed-Forward Network** - Position-wise MLP
6. **Transformer Block** - Attention + FFN with residuals
7. **Complete GPT Model** - Token + learned positional embeddings, full architecture
8. **Training** - Language modeling objective
9. **Generation** - Autoregressive sampling
10. **Fine-tuning** - Task-specific adaptation
11. **Verification** - Tests and parameter counting

In [None]:
"""
GPT-1 Complete Implementation
=============================

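This module implements the GPT-1 architecture described in:
"Improving Language Understanding by Generative Pre-Training"
Radford et al., 2018

Hyperparameters follow the paper. One architectural choice differs:
LayerNorm is applied before each sublayer (Pre-LN, as in GPT-2), whereas
the original GPT-1 applies it after the residual add (Post-LN).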
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
import matplotlib.pyplot as plt
import numpy as np
import math
from dataclasses import dataclass
from typing import Optional, Tuple, List, Dict, Any
from tqdm import tqdm

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")

---

## 1. Configuration

### Paper Reference (Section 4.1):

> *"We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states."*

> *"We trained for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens."*

> *"We use... residual, embedding, and attention dropouts with a rate of 0.1 for regularization."*
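
Section 4.1 also fixes the learning-rate schedule: a linear warmup from zero over the first 2000 updates to a peak of 2.5e-4, followed by cosine annealing to 0. The sketch below is a minimal, dependency-free version of that schedule. The function name `lr_at_step` and the `total_steps` argument are assumptions made for illustration, since the total number of updates is not given in the quotes above.

```python
import math

def lr_at_step(step: int, total_steps: int,
               max_lr: float = 2.5e-4, warmup_steps: int = 2000) -> float:
    """Linear warmup to max_lr, then cosine decay to 0 (hypothetical helper)."""
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, max_lr at step == warmup_steps
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1]
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # The cosine term goes from 1 (progress = 0) to -1 (progress = 1)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# total_steps=10_000 is an arbitrary value chosen for the demo
for s in (0, 1000, 2000, 6000, 10_000):
    print(f"step {s:>6}: lr = {lr_at_step(s, 10_000):.2e}")
```

The rate peaks exactly at the end of warmup and returns to zero at `total_steps`, so the schedule needs the run length up front.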

In [None]:
@dataclass
class GPTConfig:
    """
    GPT-1 Configuration
    ===================
    
    All values are extracted directly from the paper, Section 4.1.
    
    Architecture (from "Model specifications"):
    - "12-layer decoder-only transformer"
    - "768 dimensional states"
    - "12 attention heads"
    - "3072 dimensional inner states" (FFN)
    
    Training:
    - "contiguous sequences of 512 tokens"
    - "40,000 merges" (BPE vocabulary)
    - "dropouts with a rate of 0.1"
    
    Attributes:
        vocab_size (int): Vocabulary size. Paper uses BPE with 40,000 merges;
            the released vocabulary has 40,478 entries.
        n_positions (int): Maximum sequence length. Paper: 512.
        n_embd (int): Embedding dimension. Paper: 768.
        n_layer (int): Number of transformer blocks. Paper: 12.
        n_head (int): Number of attention heads. Paper: 12.
        n_inner (int): FFN hidden dimension. Paper: 3072 (4x n_embd).
        embd_pdrop (float): Embedding dropout probability. Paper: 0.1.
        attn_pdrop (float): Attention dropout probability. Paper: 0.1.
        resid_pdrop (float): Residual dropout probability. Paper: 0.1.
    """
    
    # ========== ARCHITECTURE (Section 4.1) ==========
    vocab_size: int = 40478      # "40,000 merges" + special tokens
    n_positions: int = 512       # "contiguous sequences of 512 tokens"
    n_embd: int = 768            # "768 dimensional states"
    n_layer: int = 12            # "12-layer decoder-only transformer"
    n_head: int = 12             # "12 attention heads"
    n_inner: int = 3072          # "3072 dimensional inner states"
    
    # ========== REGULARIZATION (Section 4.1) ==========
    embd_pdrop: float = 0.1      # "dropouts with a rate of 0.1"
    attn_pdrop: float = 0.1      # Applied to attention weights
    resid_pdrop: float = 0.1     # Applied after projections
    
    # ========== DERIVED VALUES ==========
    @property
    def head_dim(self) -> int:
        """Dimension per attention head: 768 / 12 = 64"""
        assert self.n_embd % self.n_head == 0, "n_embd must be divisible by n_head"
        return self.n_embd // self.n_head
    
    def __post_init__(self):
        """Validate configuration."""
        assert self.n_embd % self.n_head == 0, \
            f"n_embd ({self.n_embd}) must be divisible by n_head ({self.n_head})"
        assert self.n_inner == 4 * self.n_embd, \
            f"n_inner ({self.n_inner}) should be 4 * n_embd ({4 * self.n_embd})"


@dataclass
class TrainingConfig:
    """
    Training Configuration
    ======================
    
    Pre-training hyperparameters from Section 4.1:
    - "trained for 100 epochs"
    - "minibatches of 64"
    - "Adam optimization scheme"
    - "max learning rate of 2.5e-4"
    - "increased linearly from zero over the first 2000 updates"
    - "annealed to 0 using a cosine schedule"
    - "weight initialization of N(0, 0.02)"
    """
    
    # ========== PRE-TRAINING (Section 4.1) ==========
    batch_size: int = 64         # "minibatches of 64"
    epochs: int = 100            # "trained for 100 epochs"
    max_lr: float = 2.5e-4       # "max learning rate of 2.5e-4"
    warmup_steps: int = 2000     # "over the first 2000 updates"
    
    # ========== OPTIMIZER ==========
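    weight_decay: float = 0.01   # Paper: modified L2 regularization, "w = 0.01 on all non bias or gain weights"
    betas: Tuple[float, float] = (0.9, 0.999)  # PyTorch Adam defaults (not stated in the paper)
    eps: float = 1e-8            # PyTorch Adam default (not stated in the paper)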
    
    # ========== INITIALIZATION ==========
    init_std: float = 0.02       # "weight initialization of N(0, 0.02)"
    
    # ========== GRADIENT CLIPPING ==========
    max_grad_norm: float = 1.0   # Common default; not specified in the paper


@dataclass
class FineTuningConfig:
    """
    Fine-tuning Configuration
    =========================
    
    From Section 4.1 ("Fine-tuning details"):
    - "learning rate of 6.25e-5"
    - "batchsize of 32"
    - "3 epochs of training was sufficient for most cases"
    - "linear learning rate decay schedule with warmup over 0.2% of training"
    - "λ was set to 0.5" (weight of the auxiliary LM loss)
    """
    
    learning_rate: float = 6.25e-5   # "learning rate of 6.25e-5"
    batch_size: int = 32             # "batchsize of 32"
    epochs: int = 3                  # "3 epochs of training was sufficient"
    warmup_fraction: float = 0.002   # "warmup over 0.2% of training"
    lm_weight: float = 0.5           # "λ was set to 0.5"


# Create default configs
config = GPTConfig()
train_config = TrainingConfig()
ft_config = FineTuningConfig()

print("=" * 70)
print("GPT-1 CONFIGURATION (All values from paper Section 4.1)")
print("=" * 70)
print(f"\n[Model Architecture]")
print(f"  Layers:           {config.n_layer} (\"12-layer decoder-only transformer\")")
print(f"  Hidden dim:       {config.n_embd} (\"768 dimensional states\")")
print(f"  Attention heads:  {config.n_head} (\"12 attention heads\")")
print(f"  Head dimension:   {config.head_dim} (768 / 12 = 64)")
print(f"  FFN inner dim:    {config.n_inner} (\"3072 dimensional inner states\")")
print(f"  Max sequence:     {config.n_positions} (\"sequences of 512 tokens\")")
print(f"  Vocabulary:       {config.vocab_size:,} (\"40,000 merges\" + special)")
print(f"\n[Regularization]")
print(f"  Dropout:          {config.embd_pdrop} (\"dropouts with a rate of 0.1\")")
print(f"\n[Pre-training]")
print(f"  Batch size:       {train_config.batch_size}")
print(f"  Epochs:           {train_config.epochs}")
print(f"  Max LR:           {train_config.max_lr}")
print(f"  Warmup steps:     {train_config.warmup_steps}")
print(f"\n[Fine-tuning]")
print(f"  Learning rate:    {ft_config.learning_rate}")
print(f"  Epochs:           {ft_config.epochs}")
print(f"  LM weight (λ):    {ft_config.lm_weight}")

---

## 2. GELU Activation Function

### Paper Reference:

> *"For the activation function, we used the Gaussian Error Linear Unit (GELU)."*

### Mathematical Definition:

**Exact form:**
$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

**Tanh approximation (used in this implementation):**
$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}(x + 0.044715x^3)\right]\right)$$

### Why GELU over ReLU?

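- **ReLU**: Hard threshold at 0; the gradient is exactly 0 for every x < 0, so a unit stuck in that region stops learning ("dead neurons")
- **GELU**: Smooth gating by the normal CDF; small negative inputs still receive a non-zero gradient, although it decays toward 0 as x becomes very negative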
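
The two GELU forms and the gradient contrast with ReLU can be checked numerically. This is a small dependency-free sketch using only the standard library; the helper names (`gelu_exact`, `gelu_tanh`, `num_grad`) and the evaluation grid are choices made for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    """x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def relu(x: float) -> float:
    return max(0.0, x)

def num_grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Largest gap between the two GELU forms on a grid over [-6, 6]
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-6, 6]: {max_gap:.1e}")

# Gradient at a small negative input: ReLU is flat, GELU is not
print(f"ReLU'(-0.5) = {num_grad(relu, -0.5):.4f}")
print(f"GELU'(-0.5) = {num_grad(gelu_exact, -0.5):.4f}")
```

The gap stays below 1e-3 but above 1e-4, which matters when choosing a tolerance for comparing the approximation against an exact implementation.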

In [None]:
def gelu(x: torch.Tensor) -> torch.Tensor:
    """
    Gaussian Error Linear Unit (GELU) Activation
    =============================================
    
    Paper: "For the activation function, we used the Gaussian Error Linear Unit (GELU)."
    
    GELU was introduced by Hendrycks & Gimpel (2016) and provides a smooth,
    probabilistic alternative to ReLU.
    
    Mathematical formulation:
        GELU(x) = x * Phi(x)
        
    where Phi(x) is the CDF of the standard normal distribution.
    
    This implementation uses the tanh approximation:
        GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    
    The approximation avoids evaluating erf and stays close to the exact
    value (max absolute error below 0.001, roughly 5e-4).
    
    Args:
        x: Input tensor of any shape
        
    Returns:
        Tensor of same shape with GELU applied element-wise
        
    Example:
        >>> x = torch.tensor([-1.0, 0.0, 1.0])
        >>> gelu(x)
        tensor([-0.1588,  0.0000,  0.8412])
    """
    # Constants for the approximation
    # sqrt(2/pi) ≈ 0.7978845608
    # The coefficient 0.044715 was empirically determined
    
    return 0.5 * x * (
        1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
        )
    )


# Verify GELU implementation
print("GELU Activation Function")
print("=" * 50)
test_values = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"Input:  {test_values.tolist()}")
print(f"Output: {[f'{v:.4f}' for v in gelu(test_values).tolist()]}")
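print(f"\nCompare with PyTorch's exact (erf-based) GELU:")
print(f"PyTorch: {[f'{v:.4f}' for v in F.gelu(test_values).tolist()]}")
# The tanh approximation deviates from the exact form by up to ~5e-4,
# so the comparison needs atol=1e-3 (atol=1e-4 reports a mismatch).
print(f"Match (atol=1e-3): {torch.allclose(gelu(test_values), F.gelu(test_values), atol=1e-3)}")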

---

## 3. Layer Normalization

### Why Layer Norm?

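Layer Normalization (Ba et al., 2016) was motivated by the observation that the distribution of each layer's inputs shifts during training (the **internal covariate shift** argument from batch normalization). It stabilizes training by normalizing each position's activations across the feature dimension, independently of the other examples in the batch.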

### Mathematical Definition:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:
- $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ (mean across features)
- $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$ (variance across features)
- $\gamma, \beta \in \mathbb{R}^d$ are learned scale and shift parameters
- $\epsilon$ is a small constant for numerical stability
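
A worked instance of the formula on a single four-dimensional vector, written without any dependencies so that each term is visible. The function below is an illustrative sketch that defaults to $\gamma = 1$ and $\beta = 0$.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm over one feature vector, following the formula above."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d   # identity scale
    beta = beta if beta is not None else [0.0] * d      # zero shift
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d           # divide by d, not d - 1
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]          # mu = 2.5, sigma^2 = 1.25
y = layer_norm(x)
print([round(v, 4) for v in y])

y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(f"mean = {y_mean:.6f}, var = {y_var:.6f}")
```

The output variance is slightly below 1 because $\epsilon$ sits inside the square root, and adding a constant to every input leaves the result unchanged.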

### Pre-LN vs Post-LN:

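This implementation uses **Pre-LN** (LayerNorm before each sublayer), the arrangement adopted in GPT-2. The original GPT-1 follows the original Transformer and applies LayerNorm after the residual add (**Post-LN**):
```
x = x + Sublayer(LayerNorm(x))  # Pre-LN (this implementation, GPT-2 style)
x = LayerNorm(x + Sublayer(x))  # Post-LN (original Transformer, GPT-1)
```

Pre-LN keeps the residual path free of normalization, which tends to give more stable gradients in deep stacks.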
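
A toy sketch of the two orderings on a single feature vector. `sublayer` is a hypothetical stand-in for attention or the FFN, and the LayerNorm here has no learned parameters. The point is only where normalization sits relative to the residual add.

```python
import math

def layer_norm(x, eps=1e-5):
    """Parameter-free LayerNorm (gamma = 1, beta = 0) for one vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Stand-in for attention or the FFN (hypothetical toy function)."""
    return [0.5 * math.tanh(v) for v in x]

def pre_ln_step(x):
    # x + Sublayer(LayerNorm(x)): the input passes through the add untouched
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_step(x):
    # LayerNorm(x + Sublayer(x)): the sum itself is normalized
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [10.0, 12.0, 9.0, 13.0]
print("Pre-LN :", [round(v, 3) for v in pre_ln_step(x)])
print("Post-LN:", [round(v, 3) for v in post_ln_step(x)])
```

After a Pre-LN step the output still carries the scale of the input, because only the sublayer's contribution saw normalized values. After a Post-LN step the output has zero mean and unit variance.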

In [None]:
class LayerNorm(nn.Module):
    """
    Layer Normalization
    ===================
    
    Normalizes inputs across the feature dimension (last dimension).
    Used extensively throughout GPT for training stability.
    
    From Ba et al. (2016): "Layer Normalization"
    
    Mathematical formulation:
        y = gamma * (x - mean) / sqrt(var + eps) + beta
        
    where mean and var are computed across the last dimension.
    
    This implementation uses Pre-LayerNorm (LN before attention/FFN), as in
    GPT-2, rather than the Post-LayerNorm (LN after residual) of the
    original GPT-1. Pre-LN provides:
    - More stable gradients
    - Less dependence on careful learning-rate warmup
    - Cleaner residual path
    
    Args:
        n_embd (int): Feature dimension (768 for GPT-1)
        eps (float): Small constant for numerical stability
        
    Shape:
        Input: (batch, seq_len, n_embd)
        Output: (batch, seq_len, n_embd)
        
    Parameters:
        gamma: Learned scale, shape (n_embd,), initialized to 1
        beta: Learned shift, shape (n_embd,), initialized to 0
    """
    
    def __init__(self, n_embd: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        
        # Learnable parameters
        # gamma (scale): initialized to 1, allows network to learn optimal scale
        # beta (shift): initialized to 0, allows network to learn optimal bias
        self.gamma = nn.Parameter(torch.ones(n_embd))
        self.beta = nn.Parameter(torch.zeros(n_embd))
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Apply layer normalization.
        
        Args:
            x: Input tensor, shape (batch, seq_len, n_embd)
            
        Returns:
            Normalized tensor, shape (batch, seq_len, n_embd)
        """
        # Compute mean across last dimension (features)
        # Shape: (batch, seq_len, 1)
        mean = x.mean(dim=-1, keepdim=True)
        
        # Compute variance across last dimension
        # unbiased=False uses N instead of N-1 in the denominator (matches the definition above)
        # Shape: (batch, seq_len, 1)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        
        # Normalize: (x - mean) / sqrt(var + eps)
        # Shape: (batch, seq_len, n_embd)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        
        # Scale and shift with learned parameters
        # gamma and beta broadcast across batch and seq_len
        return self.gamma * x_norm + self.beta


# Test LayerNorm
print("Layer Normalization")
print("=" * 50)
ln = LayerNorm(config.n_embd)
x = torch.randn(2, 5, config.n_embd) * 10 + 5  # Unnormalized
out = ln(x)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {out.shape}")
print(f"\nBefore LayerNorm (sample position):")
print(f"  Mean: {x[0, 0].mean().item():.4f}")
print(f"  Std:  {x[0, 0].std().item():.4f}")
print(f"\nAfter LayerNorm (sample position):")
print(f"  Mean: {out[0, 0].mean().item():.6f} (≈0)")
print(f"  Std:  {out[0, 0].std().item():.6f} (≈1)")
print(f"\nParameters: {sum(p.numel() for p in ln.parameters()):,} (gamma + beta)")

---

## 4. Causal Self-Attention

### Paper Reference:

> *"This model applies a multi-headed self-attention operation over the input context tokens"*

> *"We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads)."*

### Mathematical Formulation:

**Standard Attention:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**Causal (Masked) Attention:**
$$\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Where mask $M$ is:
$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \text{ (can attend)} \\ -\infty & \text{if } j > i \text{ (cannot attend)} \end{cases}$$

**Multi-Head Attention:**
- Split into 12 heads, each with dimension 64
- Process in parallel, concatenate, project
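
The effect of the mask $M$ is easiest to see on a tiny example. The sketch below applies the masked softmax to a 3×3 matrix of made-up scores (standing in for $QK^T/\sqrt{d_k}$) using plain Python lists. The function name is an assumption for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax of scores + M, where M[i][j] = -inf for j > i."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Add the mask: keep j <= i, send j > i to -inf
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(masked)                           # finite: the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scaled scores for a 3-token sequence (made-up numbers)
scores = [
    [0.2, 1.5, -0.3],
    [0.7, 0.1,  2.0],
    [1.0, 1.0,  1.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but all weight above the diagonal is zero: the first token can only attend to itself, and the large score of 2.0 in row 1 is ignored because it points at a future position.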

In [None]:
class CausalSelfAttention(nn.Module):
    """
    Multi-Head Causal Self-Attention
    =================================
    
    Paper: "12 attention heads" with "768 dimensional states"
    This gives 64 dimensions per head (768 / 12 = 64).
    
    The "masked" in "masked self-attention heads" refers to causal masking:
    each position can only attend to itself and previous positions.
    This is required for autoregressive language modeling.
    
    Architecture:
        1. Project input to Q, K, V (combined for efficiency)
        2. Split into 12 heads
        3. Compute attention scores: QK^T / sqrt(d_k)
        4. Apply causal mask (future = -inf)
        5. Softmax to get attention weights
        6. Apply attention to values
        7. Concatenate heads and project output
    
    Args:
        config: GPTConfig with n_embd, n_head, dropout settings
        
    Shape:
        Input: (batch, seq_len, n_embd)
        Output: (batch, seq_len, n_embd)
        
    Parameters:
        c_attn: Combined Q/K/V projection, (n_embd, 3 * n_embd)
        c_proj: Output projection, (n_embd, n_embd)
    """
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        
        # Store configuration
        self.n_head = config.n_head        # 12 heads
        self.n_embd = config.n_embd        # 768 dimensions
        self.head_dim = config.head_dim    # 64 dimensions per head
        
        # Scaling factor: 1/sqrt(d_k) for stable gradients
        # Without scaling, dot products grow with dimension, causing
        # softmax to have extremely small gradients
        self.scale = 1.0 / math.sqrt(self.head_dim)
        
        # === PROJECTIONS ===
        # Combined Q, K, V projection for efficiency
        # Instead of 3 separate (768, 768) matrices, use one (768, 2304)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        
        # Output projection: combines all heads back to n_embd
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        
        # === DROPOUT ===
        # Paper: "dropouts with a rate of 0.1"
        self.attn_dropout = nn.Dropout(config.attn_pdrop)   # On attention weights
        self.resid_dropout = nn.Dropout(config.resid_pdrop) # On output
        
        # === CAUSAL MASK ===
        # Lower triangular matrix: position i can attend to positions 0...i
        # Register as buffer (not a parameter, but saved with model)
        mask = torch.tril(torch.ones(config.n_positions, config.n_positions))
        self.register_buffer(
            'mask', 
            mask.view(1, 1, config.n_positions, config.n_positions)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for causal self-attention.
        
        Args:
            x: Input tensor, shape (batch, seq_len, n_embd)
            
        Returns:
            Output tensor, shape (batch, seq_len, n_embd)
        """
        B, T, C = x.shape  # batch, seq_len, n_embd
        
        # === STEP 1: Project to Q, K, V ===
        # x @ W_qkv + b_qkv
        # Shape: (B, T, C) @ (C, 3C) -> (B, T, 3C)
        qkv = self.c_attn(x)
        
        # Split into Q, K, V
        # Each has shape (B, T, C)
        q, k, v = qkv.split(self.n_embd, dim=2)
        
        # === STEP 2: Reshape for multi-head attention ===
        # (B, T, C) -> (B, T, n_head, head_dim) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # Now: q, k, v all have shape (B, n_head, T, head_dim)
        
        # === STEP 3: Compute attention scores ===
        # QK^T / sqrt(d_k)
        # (B, n_head, T, head_dim) @ (B, n_head, head_dim, T) -> (B, n_head, T, T)
        attn_scores = (q @ k.transpose(-2, -1)) * self.scale
        
        # === STEP 4: Apply causal mask ===
        # Where mask is 0, set attention score to -inf
        # After softmax, -inf becomes 0 (no attention to future)
        attn_scores = attn_scores.masked_fill(
            self.mask[:, :, :T, :T] == 0, 
            float('-inf')
        )
        
        # === STEP 5: Softmax to get attention weights ===
        # Shape: (B, n_head, T, T)
        attn_weights = F.softmax(attn_scores, dim=-1)
        
        # Apply dropout to attention weights
        attn_weights = self.attn_dropout(attn_weights)
        
        # === STEP 6: Apply attention to values ===
        # (B, n_head, T, T) @ (B, n_head, T, head_dim) -> (B, n_head, T, head_dim)
        out = attn_weights @ v
        
        # === STEP 7: Concatenate heads and project ===
        # (B, n_head, T, head_dim) -> (B, T, n_head, head_dim) -> (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        
        # Output projection with dropout
        out = self.resid_dropout(self.c_proj(out))
        
        return out


# Test Causal Self-Attention
print("Causal Self-Attention")
print("=" * 50)
attn = CausalSelfAttention(config)
x = torch.randn(2, 10, config.n_embd)
out = attn(x)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {out.shape}")
print(f"\nParameter breakdown:")
print(f"  c_attn (Q/K/V combined): {config.n_embd} x {3*config.n_embd} + {3*config.n_embd} = {config.n_embd * 3 * config.n_embd + 3 * config.n_embd:,}")
print(f"  c_proj (output):         {config.n_embd} x {config.n_embd} + {config.n_embd} = {config.n_embd * config.n_embd + config.n_embd:,}")
print(f"  Total: {sum(p.numel() for p in attn.parameters()):,}")

---

## 5. Feed-Forward Network (MLP)

### Paper Reference:

> *"For the position-wise feed-forward networks, we used 3072 dimensional inner states."*

### Mathematical Formulation:

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

Where:
- $W_1 \in \mathbb{R}^{768 \times 3072}$ (expand)
- $W_2 \in \mathbb{R}^{3072 \times 768}$ (contract)

### Why 4x Expansion?

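The 4x expansion ratio (768 → 3072 → 768) follows the original Transformer (512 → 2048):
- The wider hidden layer gives each position more "working space" for computation
- Attention routes information between positions; the FFN processes it at each position independently
- Interpretability work suggests that some FFN neurons respond to specific features or concepts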
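
"Position-wise" means the same two-layer network is applied to every position separately, with no mixing across the sequence. The sketch below illustrates this at toy scale (2 → 8 → 2, keeping the 4x ratio) with random weights. The sizes, seed, and helper names are assumptions chosen for the demo.

```python
import math
import random

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """y = xW + b for one vector x; W has shape (len(x), len(b))."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2 for a single position."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]   # expand: d -> 4d
    return linear(hidden, W2, b2)                   # contract: 4d -> d

random.seed(0)
d, d_inner = 2, 8   # toy sizes with the same 4x ratio as 768 -> 3072
W1 = [[random.gauss(0, 0.02) for _ in range(d_inner)] for _ in range(d)]
b1 = [0.0] * d_inner
W2 = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(d_inner)]
b2 = [0.0] * d

seq = [[1.0, -1.0], [0.5, 2.0], [-3.0, 0.1]]          # 3 positions
out = [ffn(x, W1, b1, W2, b2) for x in seq]           # same weights at every position
out_reversed = [ffn(x, W1, b1, W2, b2) for x in reversed(seq)]
print(out_reversed == list(reversed(out)))
```

Reversing the sequence simply reverses the outputs, which confirms that the FFN by itself carries no information between positions. That job belongs to attention.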

In [None]:
class MLP(nn.Module):
    """
    Position-wise Feed-Forward Network
    ===================================
    
    Paper: "3072 dimensional inner states"
    
    This is a simple two-layer MLP applied independently to each position:
        1. Linear: 768 -> 3072 (expand)
        2. GELU activation
        3. Linear: 3072 -> 768 (contract)
        4. Dropout
    
    The 4x expansion ratio (768 * 4 = 3072) is standard in transformers.
    This provides more "working space" for the network to process information.
    
    Research has shown that individual neurons in the FFN often encode
    specific concepts or features ("knowledge neurons").
    
    Args:
        config: GPTConfig with n_embd, n_inner, dropout settings
        
    Shape:
        Input: (batch, seq_len, n_embd)
        Output: (batch, seq_len, n_embd)
        
    Parameters:
        c_fc: First linear layer, (n_embd, n_inner) = (768, 3072)
        c_proj: Second linear layer, (n_inner, n_embd) = (3072, 768)
    """
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        
        # First linear layer: expand 768 -> 3072
        # "fc" = "fully connected"
        self.c_fc = nn.Linear(config.n_embd, config.n_inner)
        
        # Second linear layer: contract 3072 -> 768
        # "proj" = "projection"
        self.c_proj = nn.Linear(config.n_inner, config.n_embd)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(config.resid_pdrop)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for FFN.
        
        Args:
            x: Input tensor, shape (batch, seq_len, n_embd)
            
        Returns:
            Output tensor, shape (batch, seq_len, n_embd)
        """
        # Expand: (B, T, 768) -> (B, T, 3072)
        x = self.c_fc(x)
        
        # GELU activation
        x = gelu(x)
        
        # Contract: (B, T, 3072) -> (B, T, 768)
        x = self.c_proj(x)
        
        # Dropout
        x = self.dropout(x)
        
        return x


# Test MLP
print("Feed-Forward Network (MLP)")
print("=" * 50)
mlp = MLP(config)
x = torch.randn(2, 10, config.n_embd)
out = mlp(x)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {out.shape}")
print(f"\nParameter breakdown:")
print(f"  c_fc:   {config.n_embd} x {config.n_inner} + {config.n_inner} = {config.n_embd * config.n_inner + config.n_inner:,}")
print(f"  c_proj: {config.n_inner} x {config.n_embd} + {config.n_embd} = {config.n_inner * config.n_embd + config.n_embd:,}")
print(f"  Total: {sum(p.numel() for p in mlp.parameters()):,}")

---

## 6. Transformer Block

### Architecture:

Each block consists of:
1. **Layer Norm** → **Multi-Head Attention** → **Residual Add**
2. **Layer Norm** → **Feed-Forward** → **Residual Add**

```
x = x + Attention(LayerNorm(x))  # Pre-LN attention
x = x + FFN(LayerNorm(x))        # Pre-LN FFN
```

### Residual Connections:

Residual connections (He et al., 2016) allow gradients to flow directly:
- Enables training much deeper networks
- Each layer learns a "residual" or "delta" to add
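
The gradient argument can be made concrete with a scalar toy model. In the sketch below each "layer" is `0.1 * tanh(x)`, a hypothetical function chosen so that its own derivative is small. The chain rule is accumulated by hand through 12 layers, with and without the skip connection.

```python
import math

def layer(x: float, w: float = 0.1) -> float:
    """Toy scalar 'sublayer' with a deliberately small gain (hypothetical)."""
    return w * math.tanh(x)

def layer_grad(x: float, w: float = 0.1) -> float:
    """d/dx of layer(x): w * (1 - tanh(x)^2)."""
    return w * (1.0 - math.tanh(x) ** 2)

def input_gradient(x: float, n_layers: int, residual: bool) -> float:
    """d(output)/d(input) through n_layers, accumulated with the chain rule."""
    grad = 1.0
    for _ in range(n_layers):
        if residual:
            grad *= 1.0 + layer_grad(x)   # d/dx [x + f(x)] = 1 + f'(x)
            x = x + layer(x)
        else:
            grad *= layer_grad(x)         # d/dx [f(x)] = f'(x)
            x = layer(x)
    return grad

plain = input_gradient(0.5, n_layers=12, residual=False)
resid = input_gradient(0.5, n_layers=12, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {resid:.3e}")
```

Without the skip connection the gradient is a product of 12 small factors and collapses toward zero. With it, every factor is at least 1, so the signal reaching the input never vanishes.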

In [None]:
class Block(nn.Module):
    """
    Transformer Decoder Block
    =========================
    
    A single transformer block with:
        1. Layer Norm + Multi-Head Causal Self-Attention + Residual
        2. Layer Norm + Feed-Forward Network + Residual
    
    This implementation uses Pre-LayerNorm (LN before sublayers, as in GPT-2):
        x = x + Attention(LayerNorm(x))
        x = x + FFN(LayerNorm(x))
    
    vs the Post-LayerNorm of the original Transformer and the original GPT-1:
        x = LayerNorm(x + Attention(x))
        x = LayerNorm(x + FFN(x))
    
    Pre-LN provides:
    - Cleaner gradient flow through residual path
    - More stable training
    - Less sensitivity to hyperparameters
    
    Args:
        config: GPTConfig with all hyperparameters
        
    Shape:
        Input: (batch, seq_len, n_embd)
        Output: (batch, seq_len, n_embd)
    """
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        
        # Layer Norm before attention
        self.ln_1 = LayerNorm(config.n_embd)
        
        # Multi-head causal self-attention
        self.attn = CausalSelfAttention(config)
        
        # Layer Norm before FFN
        self.ln_2 = LayerNorm(config.n_embd)
        
        # Position-wise feed-forward network
        self.mlp = MLP(config)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the transformer block.
        
        Args:
            x: Input tensor, shape (batch, seq_len, n_embd)
            
        Returns:
            Output tensor, shape (batch, seq_len, n_embd)
        """
        # === Attention sub-block (Pre-LN) ===
        # 1. Apply LayerNorm
        # 2. Apply attention
        # 3. Add residual connection
        x = x + self.attn(self.ln_1(x))
        
        # === FFN sub-block (Pre-LN) ===
        # 1. Apply LayerNorm
        # 2. Apply FFN
        # 3. Add residual connection
        x = x + self.mlp(self.ln_2(x))
        
        return x


# Test Block
print("Transformer Block")
print("=" * 50)
block = Block(config)
x = torch.randn(2, 10, config.n_embd)
out = block(x)

print(f"Input shape:  {x.shape}")
print(f"Output shape: {out.shape}")
print(f"\nParameter breakdown:")
print(f"  ln_1:      {sum(p.numel() for p in block.ln_1.parameters()):,}")
print(f"  attention: {sum(p.numel() for p in block.attn.parameters()):,}")
print(f"  ln_2:      {sum(p.numel() for p in block.ln_2.parameters()):,}")
print(f"  mlp:       {sum(p.numel() for p in block.mlp.parameters()):,}")
print(f"  Total:     {sum(p.numel() for p in block.parameters()):,}")

---

## 7. Complete GPT Model

### Paper Reference (Section 3.1):

> *"We use a multi-layer Transformer decoder for the language model, which is a variant of the transformer. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens."*

### Complete Architecture:

```
Input tokens
    ↓
Token Embedding (40,478 x 768) + Position Embedding (512 x 768)
    ↓
Dropout
    ↓
Transformer Block × 12
    ↓
Final Layer Norm
    ↓
Language Model Head (tied with token embedding)
    ↓
Output logits (unnormalized scores over the vocabulary)
```
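
The diagram is enough to estimate the model size in closed form. The sketch below assumes linear layers with biases, a combined Q/K/V projection, two LayerNorms per block plus one final LayerNorm, and an LM head that shares the token-embedding matrix (so it adds no parameters). The function name and dictionary keys are chosen for illustration.

```python
def count_gpt_params(vocab_size=40478, n_positions=512, n_embd=768,
                     n_layer=12, n_inner=3072, final_ln=True) -> dict:
    """Closed-form parameter count for the architecture sketched above."""
    wte = vocab_size * n_embd                  # token embedding (shared with LM head)
    wpe = n_positions * n_embd                 # learned position embedding
    ln = 2 * n_embd                            # scale + shift per LayerNorm
    attn = (n_embd * 3 * n_embd + 3 * n_embd   # combined Q/K/V projection
            + n_embd * n_embd + n_embd)        # output projection
    mlp = (n_embd * n_inner + n_inner          # expand
           + n_inner * n_embd + n_embd)        # contract
    block = 2 * ln + attn + mlp
    total = wte + wpe + n_layer * block + (ln if final_ln else 0)
    return {"wte": wte, "wpe": wpe, "block": block, "total": total}

counts = count_gpt_params()
for name, n in counts.items():
    print(f"{name:>6}: {n:,}")
```

Under these assumptions the total comes to roughly 117M parameters, of which about 31M sit in the shared token embedding. An untied LM head would add another 31M.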

In [None]:
class GPT(nn.Module):
    """
    GPT-1 Language Model
    ====================
    
    Complete implementation of "Improving Language Understanding by Generative Pre-Training"
    (Radford et al., 2018)
    
    Architecture (from Section 4.1):
    - "12-layer decoder-only transformer"
    - "768 dimensional states"
    - "12 attention heads"
    - "3072 dimensional inner states" (FFN)
    - "contiguous sequences of 512 tokens"
    
    Components:
        1. Token Embedding: vocab_size -> n_embd
        2. Position Embedding: n_positions -> n_embd (LEARNED, not sinusoidal)
        3. 12 Transformer Blocks (attention + FFN with residuals)
        4. Final Layer Norm
        5. Language Model Head (weight-tied with token embedding)
    
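    Training Objective (Equation 1):
        L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})
    
    This is standard causal language modeling: predict the next token
    given the previous k tokens. The paper maximizes L1; forward() returns
    the mean cross-entropy, which is the negative of L1 averaged over tokens.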
    
    Args:
        config: GPTConfig with all hyperparameters
        
    Inputs:
        input_ids: Token IDs, shape (batch, seq_len)
        targets: Next-token IDs for the loss (input_ids shifted left by one),
            shape (batch, seq_len), optional
        
    Outputs:
        logits: Vocabulary logits, shape (batch, seq_len, vocab_size)
        loss: Cross-entropy loss if targets provided, else None
    """
    
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        
        # ========== EMBEDDINGS ==========
        # Token embeddings: map token IDs to vectors
        # Shape: (vocab_size, n_embd) = (40478, 768)
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        
        # Position embeddings: LEARNED (not sinusoidal like original Transformer)
        # Shape: (n_positions, n_embd) = (512, 768)
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        
        # Embedding dropout
        self.drop = nn.Dropout(config.embd_pdrop)
        
        # ========== TRANSFORMER BLOCKS ==========
        # 12 identical decoder blocks
        self.blocks = nn.ModuleList([
            Block(config) for _ in range(config.n_layer)
        ])
        
        # ========== OUTPUT ==========
        # Final layer normalization (with Pre-LN, the last block's output is otherwise unnormalized)
        self.ln_f = LayerNorm(config.n_embd)
        
        # Language model head: project to vocabulary
        # Note: bias=False because we'll tie weights with embedding
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # ========== WEIGHT TYING ==========
        # Share weights between token embedding and LM head.
        # Paper (Eq. 2): P(u) = softmax(h_n W_e^T), so the output projection
        # reuses the token embedding matrix W_e. Tying also:
        # - Removes a separate vocab_size x n_embd (~31M) output matrix
        # - Keeps input and output token representations consistent
        self.lm_head.weight = self.wte.weight
        
        # ========== INITIALIZATION ==========
        # Paper: "weight initialization of N(0, 0.02)"
        self.apply(self._init_weights)
        
        # Report parameter count
        n_params = sum(p.numel() for p in self.parameters())
        print(f"GPT-1 Model initialized with {n_params:,} parameters")
    
    def _init_weights(self, module: nn.Module):
        """
        Initialize weights.
        
        Paper: "weight initialization of N(0, 0.02) was sufficient"
        
        This simple initialization works because LayerNorm helps
        stabilize the forward pass.
        """
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(
        self, 
        input_ids: torch.Tensor,
        targets: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """
        Forward pass for language modeling.
        
        Args:
            input_ids: Token IDs, shape (batch, seq_len)
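            targets: Next-token IDs for loss computation, shape (batch, seq_len).
                The caller must shift them: targets[:, t] is the token that
                follows input_ids[:, t]. No shifting happens inside forward().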
            
        Returns:
            logits: Vocabulary logits, shape (batch, seq_len, vocab_size)
            loss: Cross-entropy loss if targets provided, else None
        """
        B, T = input_ids.shape
        assert T <= self.config.n_positions, \
            f"Sequence length {T} exceeds maximum {self.config.n_positions}"
        
        # ========== EMBEDDINGS ==========
        # Token embeddings: (B, T) -> (B, T, n_embd)
        tok_emb = self.wte(input_ids)
        
        # Position embeddings: (T,) -> (T, n_embd) -> broadcast to (B, T, n_embd)
        positions = torch.arange(T, device=input_ids.device)
        pos_emb = self.wpe(positions)
        
        # Combine embeddings
        x = self.drop(tok_emb + pos_emb)
        
        # ========== TRANSFORMER BLOCKS ==========
        for block in self.blocks:
            x = block(x)
        
        # ========== OUTPUT ==========
        # Final layer norm
        x = self.ln_f(x)
        
        # Project to vocabulary
        # (B, T, n_embd) -> (B, T, vocab_size)
        logits = self.lm_head(x)
        
        # ========== LOSS ==========
        loss = None
        if targets is not None:
            # Reshape for cross-entropy
            # logits: (B * T, vocab_size)
            # targets: (B * T,)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.reshape(-1)  # reshape also accepts non-contiguous slices
            )
        
        return logits, loss
    
    def get_num_params(self, non_embedding: bool = True) -> int:
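        """
        Count parameters.
        
        Args:
            non_embedding: If True, exclude the position embedding. The token
                embedding stays in the count because it is tied to the LM head
                and therefore also acts as the output projection.
            
        Returns:
            Number of parameters
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.wpe.weight.numel()
        return n_params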


# Create model
print("=" * 70)
print("CREATING GPT-1 MODEL")
print("=" * 70)
model = GPT(config)
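
The weight tying in `GPT.__init__` can be checked in isolation. The sketch below is a minimal illustration, not part of the model: `TinyTied` and its toy sizes (`vocab_size=10`, `n_embd=4`) are hypothetical and chosen only to keep the numbers small. It shows that the embedding and the output projection share one tensor, and that `parameters()` counts that tensor once.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only (not GPT-1's 40,478 x 768)
vocab_size, n_embd = 10, 4

class TinyTied(nn.Module):
    """Hypothetical two-layer module used only to demonstrate weight tying."""
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # nn.Linear stores its weight as (out_features, in_features),
        # i.e. (vocab_size, n_embd), the same shape as the embedding table,
        # so the two modules can point at the same Parameter.
        self.lm_head.weight = self.wte.weight

tiny = TinyTied()
shares_storage = tiny.lm_head.weight.data_ptr() == tiny.wte.weight.data_ptr()
n_tied = sum(p.numel() for p in tiny.parameters())  # shared tensor counted once

print(shares_storage, n_tied)
```

Because the two modules hold the same `Parameter`, a gradient step on either one updates both, which is why the full model's parameter count includes the vocabulary matrix only once.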

In [None]:
def detailed_parameter_count(model: GPT, config: GPTConfig):
    """
    Detailed breakdown of model parameters.
    """
    print("\n" + "=" * 70)
    print("DETAILED PARAMETER COUNT")
    print("=" * 70)
    
    # Embeddings
    tok_emb = config.vocab_size * config.n_embd
    pos_emb = config.n_positions * config.n_embd
    print(f"\n[EMBEDDINGS]")
    print(f"  Token embedding:    {config.vocab_size:,} x {config.n_embd} = {tok_emb:,}")
    print(f"  Position embedding: {config.n_positions} x {config.n_embd} = {pos_emb:,}")
    print(f"  Subtotal: {tok_emb + pos_emb:,}")
    
    # Per block
    print(f"\n[PER TRANSFORMER BLOCK]")
    
    # Attention
    c_attn = config.n_embd * (3 * config.n_embd) + (3 * config.n_embd)  # W + b
    c_proj = config.n_embd * config.n_embd + config.n_embd
    attn_total = c_attn + c_proj
    print(f"  Attention:")
    print(f"    c_attn (Q/K/V): {config.n_embd} x {3*config.n_embd} + {3*config.n_embd} = {c_attn:,}")
    print(f"    c_proj:         {config.n_embd} x {config.n_embd} + {config.n_embd} = {c_proj:,}")
    
    # FFN
    c_fc = config.n_embd * config.n_inner + config.n_inner
    c_proj_ffn = config.n_inner * config.n_embd + config.n_embd
    ffn_total = c_fc + c_proj_ffn
    print(f"  FFN:")
    print(f"    c_fc:           {config.n_embd} x {config.n_inner} + {config.n_inner} = {c_fc:,}")
    print(f"    c_proj:         {config.n_inner} x {config.n_embd} + {config.n_embd} = {c_proj_ffn:,}")
    
    # LayerNorms
    ln_params = 2 * config.n_embd  # gamma + beta
    ln_total = 2 * ln_params  # Two LayerNorms per block
    print(f"  LayerNorm:        2 x (2 x {config.n_embd}) = {ln_total:,}")
    
    block_total = attn_total + ffn_total + ln_total
    print(f"  Block total:      {block_total:,}")
    
    # All blocks
    all_blocks = block_total * config.n_layer
    print(f"\n[ALL BLOCKS]")
    print(f"  {config.n_layer} blocks x {block_total:,} = {all_blocks:,}")
    
    # Output
    ln_f = 2 * config.n_embd
    print(f"\n[OUTPUT]")
    print(f"  Final LayerNorm:  {ln_f:,}")
    print(f"  LM Head:          (tied with token embedding, 0 additional)")
    
    # Total
    total = tok_emb + pos_emb + all_blocks + ln_f
    print(f"\n[TOTAL]")
    print(f"  Embeddings:       {tok_emb + pos_emb:,}")
    print(f"  Blocks:           {all_blocks:,}")
    print(f"  Output:           {ln_f:,}")
    print(f"  " + "-" * 40)
    print(f"  TOTAL:            {total:,}")
    
    # Actual count
    actual = sum(p.numel() for p in model.parameters())
    print(f"\n  Actual (PyTorch): {actual:,}")
    print(f"  Paper reports:    ~117M parameters")

detailed_parameter_count(model, config)
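
The breakdown above collapses into a closed form when `n_inner = 4 * n_embd`. The sketch below is a cross-check under that assumption; `gpt_param_count` is a hypothetical helper, and the argument values are the ones listed in this notebook's configuration (vocabulary 40,478, 512 positions, width 768, 12 layers).

```python
def gpt_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Closed-form parameter count.

    Assumes n_inner = 4 * n_embd, a bias on every Linear except the tied
    LM head, two LayerNorms per block, and one final LayerNorm.
    """
    d = n_embd
    embeddings = vocab_size * d + n_positions * d
    # Attention: 4d^2 + 4d, FFN: 8d^2 + 5d, two LayerNorms: 4d
    per_block = 12 * d * d + 13 * d
    final_ln = 2 * d
    return embeddings + n_layer * per_block + final_ln

total = gpt_param_count(vocab_size=40478, n_positions=512, n_embd=768, n_layer=12)
print(f"{total:,}")  # 116,536,320
```

The result rounds to the ~117M figure the paper reports, with roughly 27% of the parameters sitting in the token embedding.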

---

## 8. Training

### Paper Reference (Section 3.1):

> *"Given an unsupervised corpus of tokens U = {u1, ..., un}, we use a standard language modeling objective to maximize the following likelihood:"*

$$L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1}; \Theta)$$
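
In code, maximizing this likelihood becomes minimizing a cross-entropy loss where the target at each position is the next token of the same sequence. The sketch below illustrates this on a toy example: the vocabulary size of 6 and the token values are arbitrary, and random logits stand in for real model output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 6  # toy vocabulary for illustration

# One sequence of 5 tokens yields 4 (context, next-token) training pairs
tokens = torch.tensor([[3, 1, 4, 1, 5]])
input_ids = tokens[:, :-1]   # (1, 4): u_1 .. u_4
targets = tokens[:, 1:]      # (1, 4): u_2 .. u_5

# Random logits stand in for the model output at each input position
logits = torch.randn(1, input_ids.size(1), vocab_size)

# Cross-entropy is the negative mean log-likelihood of the targets
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.reshape(-1))

# The same quantity computed by hand from log-probabilities
log_probs = F.log_softmax(logits, dim=-1)
target_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual = -target_log_probs.mean()

print(torch.allclose(loss, manual))
```

The only difference from $L_1$ is the sign and the averaging: the loss is the negated sum of log-probabilities divided by the number of predicted tokens.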

### Learning Rate Schedule:

> *"The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule."*
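
As a standalone illustration of this schedule, the sketch below evaluates it at a few steps. `lr_at` is a hypothetical helper written for this example, and the run length of 10,000 steps is an assumption made for the printout, not a value from the paper; the maximum rate of 2.5e-4 and the 2000 warmup updates are the values used in this notebook.

```python
import math

def lr_at(step: int, total_steps: int, max_lr: float = 2.5e-4, warmup: int = 2000) -> float:
    """Linear warmup from 0 to max_lr, then cosine decay to 0 at total_steps."""
    if step < warmup:
        return max_lr * step / warmup
    # Clamp so the rate stays at 0 if training runs past total_steps
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 10_000  # hypothetical run length for illustration
for s in (0, 1000, 2000, 6000, 10_000):
    print(f"step {s:>6}: lr = {lr_at(s, total_steps):.3e}")
```

The rate reaches its maximum exactly when warmup ends, is back at half the maximum at the midpoint of the decay phase, and reaches zero on the final step.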

In [None]:
class Trainer:
    """
    GPT Trainer
    ===========
    
    Implements the training procedure from Section 4.1:
    - Adam optimizer with max LR 2.5e-4 (AdamW here, for decoupled weight decay)
    - Linear warmup for 2000 steps
    - Cosine annealing to 0
    
    Gradient clipping is added as a common stability measure.
    
    The language modeling objective (Equation 1):
        L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})
    
    is implemented as cross-entropy loss between model predictions
    and the next token at each position.
    """
    
    def __init__(
        self, 
        model: GPT, 
        train_config: TrainingConfig,
        device: torch.device = torch.device('cpu')
    ):
        self.model = model
        self.config = train_config
        self.device = device
        
        # Move model to device
        self.model.to(device)
        
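        # Optimizer: "Adam optimization scheme"
        # AdamW adds decoupled weight decay. Note: it is applied to every
        # parameter here, whereas the paper exempts bias and gain weights.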
        self.optimizer = AdamW(
            model.parameters(),
            lr=train_config.max_lr,
            betas=train_config.betas,
            eps=train_config.eps,
            weight_decay=train_config.weight_decay
        )
        
        # Training state
        self.step = 0
        self.losses = []
    
    def get_lr(self, step: int, total_steps: int) -> float:
        """
        Get learning rate for current step.
        
        Paper: "The learning rate was increased linearly from zero over
        the first 2000 updates and annealed to 0 using a cosine schedule."
        
        Args:
            step: Current training step
            total_steps: Total number of training steps
            
        Returns:
            Learning rate for this step
        """
        warmup = self.config.warmup_steps
        max_lr = self.config.max_lr
        
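        if step < warmup:
            # Linear warmup: 0 -> max_lr
            return max_lr * step / warmup
        else:
            # Cosine annealing: max_lr -> 0
            # Clamp progress to [0, 1] so the LR stays at 0 past total_steps,
            # and guard the division when total_steps <= warmup
            progress = (step - warmup) / max(1, total_steps - warmup)
            progress = min(1.0, progress)
            return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))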
    
    def train_step(
        self, 
        input_ids: torch.Tensor, 
        targets: torch.Tensor,
        total_steps: int
    ) -> Tuple[float, float]:
        """
        Single training step.
        
        Args:
            input_ids: Input token IDs, shape (batch, seq_len)
            targets: Target token IDs, shape (batch, seq_len)
            total_steps: Total steps for LR schedule
            
        Returns:
            loss: Training loss
            lr: Current learning rate
        """
        # Update learning rate
        lr = self.get_lr(self.step, total_steps)
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        
        # Move data to device
        input_ids = input_ids.to(self.device)
        targets = targets.to(self.device)
        
        # Forward pass
        self.model.train()
        logits, loss = self.model(input_ids, targets)
        
        # Backward pass
        self.optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping (common practice for stability)
        torch.nn.utils.clip_grad_norm_(
            self.model.parameters(), 
            self.config.max_grad_norm
        )
        
        # Update weights
        self.optimizer.step()
        
        # Update state
        self.step += 1
        self.losses.append(loss.item())
        
        return loss.item(), lr


# Create trainer
trainer = Trainer(model, train_config, device)
print(f"\nTrainer initialized on {device}")

In [None]:
def demo_training(model: GPT, trainer: Trainer, num_steps: int = 100):
    """
    Demonstrate training on synthetic data.
    """
    print("\n" + "=" * 70)
    print("TRAINING DEMO")
    print("=" * 70)
    print(f"Running {num_steps} steps on synthetic data...")
    
    # Reset trainer
    trainer.step = 0
    trainer.losses = []
    
    batch_size = 8
    seq_len = 64
    
    losses = []
    lrs = []
    
    for step in range(num_steps):
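        # Generate a synthetic batch of random tokens; targets are the inputs
        # shifted left by one position (next-token prediction)
        tokens = torch.randint(0, model.config.vocab_size, (batch_size, seq_len + 1))
        input_ids = tokens[:, :-1].contiguous()
        targets = tokens[:, 1:].contiguous()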
        
        loss, lr = trainer.train_step(input_ids, targets, num_steps)
        losses.append(loss)
        lrs.append(lr)
        
        if (step + 1) % 25 == 0:
            print(f"  Step {step+1:4d} | Loss: {loss:.4f} | LR: {lr:.2e}")
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    
    axes[0].plot(losses, 'b-', linewidth=1.5, alpha=0.7)
    axes[0].set_xlabel('Step')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('Training Loss')
    axes[0].grid(True, alpha=0.3)
    
    axes[1].plot(lrs, 'r-', linewidth=2)
    axes[1].set_xlabel('Step')
    axes[1].set_ylabel('Learning Rate')
    axes[1].set_title('Learning Rate Schedule')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nFinal loss: {losses[-1]:.4f}")
    print(f"Expected random: {math.log(config.vocab_size):.4f}")

demo_training(model, trainer, num_steps=100)

---

## 9. Text Generation

### Autoregressive Generation:

Once trained, GPT generates text as follows:
1. Start with a prompt
2. Predict the distribution over the next token
3. Sample a token from that distribution
4. Append the sampled token to the sequence
5. Repeat until the desired length is reached

In [None]:
@torch.no_grad()
def generate(
    model: GPT,
    input_ids: torch.Tensor,
    max_new_tokens: int = 50,
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
) -> torch.Tensor:
    """
    Autoregressive Text Generation
    ==============================
    
    Generate text by repeatedly predicting and sampling the next token.
    
    Sampling strategies:
    - Temperature: Divide logits by T before softmax
        - T < 1: More confident (sharper distribution)
        - T > 1: More random (flatter, closer to uniform)
    - Top-k: Only consider the k highest probability tokens
    - Top-p (nucleus): Only consider the smallest set of top tokens whose
      cumulative probability exceeds p
    
    Args:
        model: Trained GPT model
        input_ids: Starting token IDs, shape (batch, seq_len)
        max_new_tokens: Number of tokens to generate
        temperature: Sampling temperature, must be > 0 (1.0 = no change)
        top_k: If set, only sample from top k tokens
        top_p: If set, use nucleus sampling
        
    Returns:
        Generated token IDs including input, shape (batch, seq_len + max_new_tokens)
    """
    model.eval()
    generated = input_ids.clone()
    
    for _ in range(max_new_tokens):
        # Crop context to max sequence length if needed
        idx_cond = generated
        if generated.size(1) > model.config.n_positions:
            idx_cond = generated[:, -model.config.n_positions:]
        
        # Get predictions
        logits, _ = model(idx_cond)
        
        # Get logits for the last position only
        logits = logits[:, -1, :]  # (batch, vocab_size)
        
        # Apply temperature
        logits = logits / temperature
        
        # Apply top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')
        
        # Apply top-p (nucleus) filtering
        if top_p is not None:
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            
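            # Remove tokens once the cumulative probability exceeds the threshold.
            # Shifting the mask right by one keeps the token that crosses it,
            # and the top token is always kept.
            sorted_indices_to_remove = cumulative_probs > top_p
            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
            sorted_indices_to_remove[..., 0] = False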
            
            indices_to_remove = sorted_indices_to_remove.scatter(
                1, sorted_indices, sorted_indices_to_remove
            )
            logits[indices_to_remove] = float('-inf')
        
        # Sample from the distribution
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        
        # Append to generated sequence
        generated = torch.cat([generated, next_token], dim=1)
    
    return generated


# Demo generation
print("\n" + "=" * 70)
print("GENERATION DEMO")
print("=" * 70)
print("Note: Model is untrained, so outputs are random.")

prompt = torch.randint(0, 100, (1, 5)).to(device)

settings = [
    ("Near-greedy (T=0.1)", {"temperature": 0.1}),
    ("Normal (T=1.0)", {"temperature": 1.0}),
    ("Creative (T=1.5)", {"temperature": 1.5}),
    ("Top-k (k=10)", {"top_k": 10}),
    ("Top-p (p=0.9)", {"top_p": 0.9}),
]

for name, kwargs in settings:
    output = generate(model, prompt, max_new_tokens=10, **kwargs)
    generated_tokens = output[0, 5:].tolist()
    print(f"\n{name}: {generated_tokens}")
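
To see what the two filters in `generate` actually keep, the sketch below applies the same masking logic to a hand-written distribution over five tokens. The probabilities, `k=2`, and `p=0.8` are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

# Toy next-token distribution over 5 tokens (values chosen for illustration)
probs = torch.tensor([[0.50, 0.25, 0.15, 0.07, 0.03]])
logits = probs.log()

# Top-k with k=2: keep the 2 largest logits, mask the rest
k = 2
kth_value = torch.topk(logits, k).values[:, [-1]]
topk_logits = logits.masked_fill(logits < kth_value, float('-inf'))
topk_kept = torch.isfinite(topk_logits)[0].tolist()

# Top-p with p=0.8: keep the smallest prefix of the sorted tokens whose
# cumulative probability exceeds p (the token that crosses p is kept)
p = 0.8
sorted_logits, sorted_idx = torch.sort(logits, descending=True)
cum = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
remove_sorted = cum > p
remove_sorted[..., 1:] = remove_sorted[..., :-1].clone()  # shift right by one
remove_sorted[..., 0] = False                              # always keep the top token
remove = torch.zeros_like(remove_sorted).scatter(1, sorted_idx, remove_sorted)
topp_logits = logits.masked_fill(remove, float('-inf'))
topp_kept = torch.isfinite(topp_logits)[0].tolist()

print("top-k keeps:", topk_kept)  # first two tokens
print("top-p keeps:", topp_kept)  # first three tokens (cumulative 0.90)
```

Top-k always keeps a fixed number of candidates, while top-p adapts: a peaked distribution keeps few tokens and a flat one keeps many.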

---

## 10. Fine-tuning for Classification

### Paper Reference (Section 3.2):

> *"The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$."*

$$P(y | x^1, ..., x^m) = \text{softmax}(h_l^m W_y)$$

### Combined Objective (Equation 3):

> *"We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning."*

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$
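
The combined objective can be illustrated with stand-in tensors. The paper writes $L_1$ and $L_2$ as log-likelihoods to maximize; in code, each is replaced by the corresponding cross-entropy loss to minimize. In the sketch below, random tensors stand in for the outputs of the two heads, and the toy sizes are assumptions made for this example only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
lam = 0.5                      # lambda, the value used in this notebook
batch, seq_len = 2, 4
vocab_size, num_labels = 6, 3  # toy sizes for illustration

# Random tensors stand in for the two heads' outputs
cls_logits = torch.randn(batch, num_labels)          # classifier head output
lm_logits = torch.randn(batch, seq_len, vocab_size)  # LM head output
labels = torch.tensor([0, 2])
lm_targets = torch.randint(0, vocab_size, (batch, seq_len))

# L2: task loss, L1: auxiliary language modeling loss
loss_task = F.cross_entropy(cls_logits, labels)
loss_lm = F.cross_entropy(lm_logits.view(-1, vocab_size), lm_targets.view(-1))

# L3 = L2 + lambda * L1
loss = loss_task + lam * loss_lm
print(f"L2={loss_task.item():.4f}  L1={loss_lm.item():.4f}  L3={loss.item():.4f}")
```

Both terms backpropagate into the shared transformer, so the auxiliary term acts as a regularizer on the backbone rather than on the classifier.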

In [None]:
class GPTForSequenceClassification(nn.Module):
    """
    GPT with Classification Head
    ============================
    
    For fine-tuning on classification tasks (sentiment, NLI, etc.).
    
    Paper (Section 3.2):
    "The inputs are passed through our pre-trained model to obtain the
    final transformer block's activation h_l^m, which is then fed into
    an added linear output layer with parameters W_y to predict y."
    
    Combined objective (Equation 3):
        L3(C) = L2(C) + lambda * L1(C)
        
    where:
        L2 = task-specific classification loss
        L1 = auxiliary language modeling loss
        lambda = 0.5 (from paper)
    
    Args:
        config: GPTConfig
        num_labels: Number of classification labels
        lm_weight: Weight for auxiliary LM loss (lambda = 0.5)
    """
    
    def __init__(
        self, 
        config: GPTConfig, 
        num_labels: int,
        lm_weight: float = 0.5
    ):
        super().__init__()
        
        # GPT backbone. It is randomly initialized here; load pre-trained
        # weights into self.gpt before fine-tuning on a real task.
        self.gpt = GPT(config)
        
        # Task configuration
        self.num_labels = num_labels
        self.lm_weight = lm_weight  # lambda from Equation 3
        
        # Classification head: single linear layer
        # "added linear output layer with parameters W_y"
        self.classifier = nn.Linear(config.n_embd, num_labels)
        self.dropout = nn.Dropout(0.1)
        
        # Initialize classifier
        nn.init.normal_(self.classifier.weight, std=0.02)
        nn.init.zeros_(self.classifier.bias)
        
        # Count parameters
        total = sum(p.numel() for p in self.parameters())
        classifier_params = sum(p.numel() for p in self.classifier.parameters())
        print(f"\nClassification Model:")
        print(f"  Total parameters:      {total:,}")
        print(f"  New classifier params: {classifier_params:,}")
    
    def forward(
        self,
        input_ids: torch.Tensor,
        labels: Optional[torch.Tensor] = None,
        lm_targets: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """
        Forward pass with combined loss.
        
        Args:
            input_ids: Token IDs, shape (batch, seq_len)
            labels: Classification labels, shape (batch,)
            lm_targets: Targets for auxiliary LM loss, shape (batch, seq_len)
            
        Returns:
            logits: Classification logits, shape (batch, num_labels)
            loss: Combined loss L3 = L2 + lambda * L1
        """
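        B, T = input_ids.shape
        
        # Run the backbone once and keep the hidden states.
        # GPT.forward returns only logits and loss, so the embeddings and
        # blocks are applied here directly and reused for both heads
        # (a second pass would also draw different dropout masks).
        tok_emb = self.gpt.wte(input_ids)
        pos_emb = self.gpt.wpe(torch.arange(T, device=input_ids.device))
        x = self.gpt.drop(tok_emb + pos_emb)
        
        for block in self.gpt.blocks:
            x = block(x)
        
        x = self.gpt.ln_f(x)  # (batch, seq_len, n_embd)
        
        # L1: auxiliary language modeling loss from the same hidden states
        loss_lm = None
        if lm_targets is not None:
            logits_lm = self.gpt.lm_head(x)  # (batch, seq_len, vocab_size)
            loss_lm = F.cross_entropy(
                logits_lm.view(-1, logits_lm.size(-1)),
                lm_targets.reshape(-1)
            )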
        
        # Extract representation at last position: h_l^m
        # "final transformer block's activation h_l^m"
        pooled = x[:, -1, :]  # (batch, n_embd)
        pooled = self.dropout(pooled)
        
        # Classification: softmax(h_l^m * W_y)
        classification_logits = self.classifier(pooled)  # (batch, num_labels)
        
        # Compute combined loss (Equation 3)
        loss = None
        if labels is not None:
            # L2: Task-specific classification loss
            loss_task = F.cross_entropy(classification_logits, labels)
            
            # L3 = L2 + lambda * L1
            if loss_lm is not None:
                loss = loss_task + self.lm_weight * loss_lm
            else:
                loss = loss_task
        
        return classification_logits, loss


# Demo
print("\n" + "=" * 70)
print("FINE-TUNING DEMO")
print("=" * 70)

# Create classification model (e.g., for sentiment: positive/negative)
model_cls = GPTForSequenceClassification(config, num_labels=2)
model_cls.to(device)

# Simulate fine-tuning step
batch_size = 4
seq_len = 64

input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len)).to(device)
labels = torch.randint(0, 2, (batch_size,)).to(device)
lm_targets = torch.randint(0, config.vocab_size, (batch_size, seq_len)).to(device)

# Forward pass
logits, loss = model_cls(input_ids, labels=labels, lm_targets=lm_targets)

print(f"\nForward pass:")
print(f"  Input shape:  {input_ids.shape}")
print(f"  Labels shape: {labels.shape}")
print(f"  Logits shape: {logits.shape}")
print(f"  Combined loss (L3 = L2 + 0.5*L1): {loss.item():.4f}")

---

## 11. Verification and Testing

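The tests below check output shapes, causal masking, the loss value near initialization, gradient flow, the parameter count, and generation length.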

In [None]:
def run_verification_tests(model: GPT, config: GPTConfig):
    """
    Run verification tests on the model.
    """
    print("\n" + "=" * 70)
    print("VERIFICATION TESTS")
    print("=" * 70)
    
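    all_passed = True
    
    # Disable dropout so repeated forward passes are deterministic
    # (Test 2 compares two passes and would fail spuriously in train mode)
    model.eval()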
    
    # Test 1: Output shape
    print("\n[Test 1] Output shape")
    x = torch.randint(0, config.vocab_size, (2, 32)).to(device)
    logits, _ = model(x)
    expected_shape = (2, 32, config.vocab_size)
    passed = logits.shape == expected_shape
    print(f"  Expected: {expected_shape}")
    print(f"  Got:      {tuple(logits.shape)}")
    print(f"  PASSED: {passed}")
    all_passed &= passed
    
    # Test 2: Causal masking
    print("\n[Test 2] Causal masking (no future leakage)")
    x1 = torch.tensor([[1, 2, 3, 4, 5]]).to(device)
    x2 = torch.tensor([[1, 2, 3, 9, 9]]).to(device)  # Different future
    logits1, _ = model(x1)
    logits2, _ = model(x2)
    # Positions 0, 1, 2 should be identical
    diff = (logits1[0, :3] - logits2[0, :3]).abs().max().item()
    passed = diff < 1e-5
    print(f"  Max diff at positions 0-2: {diff:.8f}")
    print(f"  PASSED: {passed} (should be ~0)")
    all_passed &= passed
    
    # Test 3: Loss computation
    print("\n[Test 3] Loss computation")
    x = torch.randint(0, config.vocab_size, (4, 64)).to(device)
    targets = torch.randint(0, config.vocab_size, (4, 64)).to(device)
    _, loss = model(x, targets)
    expected_loss = math.log(config.vocab_size)  # Random baseline
    passed = abs(loss.item() - expected_loss) < 1.0  # Within 1 of expected
    print(f"  Loss: {loss.item():.4f}")
    print(f"  Expected (random): {expected_loss:.4f}")
    print(f"  PASSED: {passed}")
    all_passed &= passed
    
    # Test 4: Gradient flow
    print("\n[Test 4] Gradient flow")
    model.zero_grad()
    x = torch.randint(0, config.vocab_size, (2, 32)).to(device)
    targets = torch.randint(0, config.vocab_size, (2, 32)).to(device)
    _, loss = model(x, targets)
    loss.backward()
    
    has_grad = all(
        p.grad is not None and p.grad.abs().sum() > 0 
        for p in model.parameters() 
        if p.requires_grad
    )
    print(f"  All parameters have gradients: {has_grad}")
    print(f"  PASSED: {has_grad}")
    all_passed &= has_grad
    
    # Test 5: Parameter count
    print("\n[Test 5] Parameter count")
    n_params = sum(p.numel() for p in model.parameters())
    # GPT-1 is approximately 117M parameters
    passed = 110_000_000 < n_params < 125_000_000
    print(f"  Parameters: {n_params:,}")
    print(f"  Expected: ~117M")
    print(f"  PASSED: {passed}")
    all_passed &= passed
    
    # Test 6: Generation
    print("\n[Test 6] Generation")
    prompt = torch.randint(0, 100, (1, 5)).to(device)
    generated = generate(model, prompt, max_new_tokens=10)
    passed = generated.shape == (1, 15)
    print(f"  Prompt length: 5")
    print(f"  Generated length: {generated.shape[1]}")
    print(f"  PASSED: {passed}")
    all_passed &= passed
    
    print("\n" + "=" * 70)
    print(f"ALL TESTS PASSED: {all_passed}")
    print("=" * 70)
    
    return all_passed

run_verification_tests(model, config)
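
The idea behind Test 2 can be reproduced without the full model. The sketch below uses a hypothetical `causal_attention` function with identity Q/K/V projections (an assumption made to keep it short; it is not the notebook's `CausalSelfAttention`). It shows that changing later positions leaves earlier outputs untouched.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def causal_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention with a lower-triangular mask (no learned weights)."""
    T, d = x.shape[-2], x.shape[-1]
    scores = x @ x.transpose(-2, -1) / math.sqrt(d)          # (T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # position t sees positions <= t
    scores = scores.masked_fill(~mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ x

x1 = torch.randn(5, 8)
x2 = x1.clone()
x2[3:] = torch.randn(2, 8)   # change only positions 3 and 4

out1, out2 = causal_attention(x1), causal_attention(x2)
prefix_diff = (out1[:3] - out2[:3]).abs().max().item()  # positions 0-2
suffix_diff = (out1[3:] - out2[3:]).abs().max().item()  # positions 3-4
print(prefix_diff, suffix_diff)
```

Positions 0 to 2 agree because their masked attention weights on positions 3 and 4 are zero, while the outputs at the changed positions differ as expected.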

---

## 12. Summary: Complete GPT-1 Implementation

### What We Built

| Component | Implementation | Paper Reference |
|-----------|---------------|----------------|
| **Configuration** | `GPTConfig` | Section 4.1 |
| **GELU** | `gelu()` | "Gaussian Error Linear Unit" |
| **Layer Norm** | `LayerNorm` | Pre-LN placement (GPT-2 style; the GPT-1 paper applies LayerNorm after each sub-block) |
| **Attention** | `CausalSelfAttention` | "masked self-attention heads" |
| **FFN** | `MLP` | "3072 dimensional inner states" |
| **Block** | `Block` | Residual connections around Pre-LN sub-blocks |
| **Model** | `GPT` | "12-layer decoder-only transformer" |
| **Training** | `Trainer` | Section 4.1 hyperparameters |
| **Generation** | `generate()` | Autoregressive sampling |
| **Fine-tuning** | `GPTForSequenceClassification` | Equation 3 |

### Architecture Specifications

| Parameter | Value | Source |
|-----------|-------|--------|
| Layers | 12 | "12-layer decoder-only transformer" |
| Hidden dim | 768 | "768 dimensional states" |
| Heads | 12 | "12 attention heads" |
| Head dim | 64 | 768 / 12 |
| FFN dim | 3072 | "3072 dimensional inner states" |
| Max seq | 512 | "sequences of 512 tokens" |
| Vocab | 40,478 | "40,000 merges" plus 478 base symbols |
| Dropout | 0.1 | "dropouts with a rate of 0.1" |
| Parameters | ~117M | Paper reports |

### Key Equations

**Pre-training (Equation 1):**
$$L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1}; \Theta)$$

**Fine-tuning (Equation 3):**
$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$

### Historical Impact

GPT-1 helped establish the **pre-train then fine-tune** paradigm that now dominates NLP:

- **BERT** (2018): Bidirectional variant, same paradigm
- **GPT-2** (2019): Scaled up, demonstrated zero-shot capabilities
- **GPT-3** (2020): 175B parameters, in-context learning
- **ChatGPT** (2022): RLHF fine-tuning for dialogue

---

## References

1. Radford et al. (2018). [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
2. Vaswani et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
3. Hendrycks & Gimpel (2016). [GELUs](https://arxiv.org/abs/1606.08415)
4. Ba et al. (2016). [Layer Normalization](https://arxiv.org/abs/1607.06450)
5. Sennrich et al. (2016). [BPE](https://arxiv.org/abs/1508.07909)