# Module 08: GPT and Autoregressive Models

**Difficulty**: ⭐⭐⭐ Advanced  
**Estimated Time**: 120 minutes  
**Prerequisites**: [Module 06-07: Transformers and BERT](07_bert_masked_lm.ipynb)

## Learning Objectives

1. Understand autoregressive language modeling
2. Learn GPT architecture (decoder-only transformer)
3. Implement causal (masked) attention
4. Generate text with different sampling strategies
5. Compare GPT vs BERT approaches
6. Understand scaling laws and GPT evolution

## GPT: Generative Pre-trained Transformer

**Key difference from BERT**:
- **BERT**: Bidirectional, encoder-only, masked LM
- **GPT**: Unidirectional, decoder-only, causal LM

### Autoregressive Modeling:

Predict next token given previous tokens:

$$P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})$$

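**Training**: Maximize the log-likelihood of each token given the tokens before it (equivalently, minimize the next-token cross-entropy).

**Generation**: Sample one token at a time, appending each sampled token to the context before predicting the next.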
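
The training objective can be sketched in a few lines of PyTorch. This is a minimal illustration, not GPT's actual training code: the random `logits` tensor stands in for a model's output, and `vocab_size`, `seq_len`, and `tokens` are toy values chosen for the example. The key step is the one-position shift, so that the prediction at position $i$ is scored against token $i+1$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, seq_len = 20, 6
# Hypothetical toy data: one sequence of token ids and random "model" outputs.
tokens = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)  # stand-in for model(tokens)

# Position i predicts token i+1, so drop the last logit and the first token.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# Mean negative log-likelihood of the next tokens.
loss = F.cross_entropy(pred, target)

# The same quantity via the chain rule: sum the per-token log-probabilities.
log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
token_log_probs = log_probs.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
sequence_log_prob = token_log_probs.sum()

print(f"loss = {loss.item():.4f}")
print(f"log P(x_2, ..., x_n | x_1) = {sequence_log_prob.item():.4f}")
```

The loss equals the negative sequence log-probability divided by the number of predicted tokens. Here the first token is treated as given context, so the product starts at $i=2$.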

## Setup

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
from transformers import pipeline
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
print('✓ Libraries imported!')

## 1. Causal (Masked) Attention

**Causal masking**: Each token can attend only to itself and to earlier tokens, never to later ones.

**Prevents cheating**: Without the mask, the model could read the very token it is being trained to predict.

In [None]:
def create_causal_mask(seq_len):
    """
    Create causal attention mask.
    
    Returns lower triangular matrix:
    [[1, 0, 0],
     [1, 1, 0],
     [1, 1, 1]]
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

# Visualize causal mask
mask = create_causal_mask(10)
plt.figure(figsize=(8, 8))
plt.imshow(mask, cmap='Blues')
plt.title('Causal Attention Mask')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.colorbar(label='Can Attend')
plt.show()
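
To show where the mask is applied, here is a minimal single-head sketch of causal self-attention. It assumes unbatched inputs of shape `(seq_len, d_k)` and omits the learned query/key/value projections and multi-head logic that a real GPT layer has. The function name `causal_self_attention` is introduced here for illustration. Masked positions are set to `-inf` before the softmax, so they receive exactly zero attention weight.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (seq_len, d_k).
    Returns (output, attention_weights).
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / math.sqrt(d_k)                # (seq_len, seq_len)
    mask = torch.tril(torch.ones(seq_len, seq_len))  # lower-triangular mask
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(5, 8)  # toy input; q = k = v = x, no learned projections
out, weights = causal_self_attention(x, x, x)

# Changing the last token must not change the outputs at earlier positions.
x_changed = x.clone()
x_changed[4] = torch.randn(8)
out_changed, _ = causal_self_attention(x_changed, x_changed, x_changed)
print(torch.allclose(out[:4], out_changed[:4]))
```

The final check is the property that makes training valid: the output at position $i$ depends only on positions $1, \dots, i$.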

## 2. Using Pre-trained GPT-2

In [None]:
# Load GPT-2 ('gpt2' is the smallest checkpoint, roughly 124M parameters)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(device)
model.eval()

print(f'✓ GPT-2 loaded!')
print(f'Parameters: {sum(p.numel() for p in model.parameters()):,}')

## 3. Text Generation Strategies

### 1. Greedy Decoding
- Always pick most probable token
- Deterministic but repetitive
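
A greedy decoding loop can be sketched as follows. The names `greedy_decode`, `next_logits_fn`, and `toy_model` are hypothetical and introduced only for this example. The toy "model" is a random bigram table rather than GPT-2, so the snippet runs without downloading anything.

```python
import torch

def greedy_decode(next_logits_fn, prompt_ids, max_new_tokens):
    """Repeatedly append the most probable next token.

    next_logits_fn: hypothetical callable mapping a list of token ids
    to a 1-D tensor of next-token logits.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits_fn(ids)
        ids.append(int(torch.argmax(logits)))
    return ids

# Toy stand-in for a language model: logits depend only on the last token.
torch.manual_seed(0)
bigram_logits = torch.randn(5, 5)  # vocabulary of 5 hypothetical tokens

def toy_model(ids):
    return bigram_logits[ids[-1]]

first = greedy_decode(toy_model, [0], max_new_tokens=8)
second = greedy_decode(toy_model, [0], max_new_tokens=8)
print(first)
print(first == second)  # no randomness is involved
```

Because this toy model looks only at the last token, the output falls into a loop as soon as any token repeats, which is an exaggerated version of the repetition that greedy decoding tends to produce.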

### 2. Temperature Sampling
- $P(x_i) = \frac{\exp(\text{logit}_i / T)}{\sum_j \exp(\text{logit}_j / T)}$
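- Lower T (below 1) sharpens the distribution: more conservative, approaching greedy decoding as T → 0
- Higher T (above 1) flattens the distribution: more diverse ("creative"), but less coherent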
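
The effect of temperature is easy to see on a small example. The logits below are made-up values for a four-token vocabulary, and `entropy` is a helper defined only for this illustration. Entropy is used here as a rough measure of how spread out the distribution is.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.0])  # hypothetical logits for 4 tokens

def entropy(p):
    """Shannon entropy (in nats) of a probability vector with no zero entries."""
    return -(p * p.log()).sum().item()

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=-1)
    rounded = [round(x, 3) for x in probs.tolist()]
    print(f"T={T}: probs={rounded}, entropy={entropy(probs):.3f}")

# Drawing a token at a given temperature:
torch.manual_seed(0)
next_token = torch.multinomial(F.softmax(logits / 0.7, dim=-1), num_samples=1)
print(next_token.item())
```

As T rises, the top token's probability drops and the entropy grows, which is the diversity side of the trade-off.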

### 3. Top-k Sampling
- Sample only from the k most probable tokens, after renormalizing their probabilities
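
One way to sketch top-k filtering is to set every logit outside the top k to `-inf`, so that the softmax renormalizes over the survivors. The helper name `top_k_filter` and the logits are illustrative assumptions, not the implementation used inside `transformers`.

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf.

    Ties with the k-th largest value are kept, so slightly more than k
    tokens can survive when logits are exactly equal.
    """
    kth_value = torch.topk(logits, k).values[-1]
    return logits.masked_fill(logits < kth_value, float("-inf"))

logits = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])  # hypothetical logits
probs = F.softmax(top_k_filter(logits, k=2), dim=-1)

torch.manual_seed(0)
samples = torch.multinomial(probs, num_samples=10, replacement=True)
print(probs)
print(samples)
```

Only tokens 0 and 1 can ever be drawn here, because every other token has probability exactly zero after filtering.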

### 4. Top-p (Nucleus) Sampling
- Sample from the smallest set of most probable tokens whose cumulative probability is at least p
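
Nucleus filtering can be sketched in the same style: sort the tokens by probability, accumulate the mass, and drop every token that is not needed to reach p. The helper name `top_p_filter` and the example probabilities are assumptions made for illustration. Library implementations differ in details such as tie handling.

```python
import torch
import torch.nn.functional as F

def top_p_filter(logits, p):
    """Mask every token outside the smallest set whose probability mass reaches p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    sorted_probs = F.softmax(sorted_logits, dim=-1)
    # Probability mass of all tokens ranked strictly before each token.
    mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    # A token is unnecessary if the tokens before it already reach p.
    remove_sorted = mass_before >= p
    # Map the decision back from sorted order to vocabulary order.
    remove = torch.zeros_like(remove_sorted)
    remove[sorted_idx] = remove_sorted
    return logits.masked_fill(remove, float("-inf"))

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = torch.tensor([0.15, 0.50, 0.05, 0.30])
logits = probs.log()

filtered = top_p_filter(logits, p=0.9)
nucleus_probs = F.softmax(filtered, dim=-1)
print(nucleus_probs)
```

With p = 0.9, the three most probable tokens (mass 0.95) form the nucleus and the 0.05 token is dropped. Unlike top-k, the number of surviving tokens adapts to how peaked the distribution is.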

In [None]:
# Text generation with different strategies
from transformers import set_seed
set_seed(42)

prompt = "Artificial intelligence will"

print("=" * 60)
print(f"Prompt: {prompt}\n")

# Build the pipeline once and reuse it for every strategy
generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                     device=0 if torch.cuda.is_available() else -1)

# GPT-2 has no pad token; reusing EOS avoids a warning on every call
common = dict(max_length=50, num_return_sequences=1,
              pad_token_id=tokenizer.eos_token_id)

# Greedy
print("1. GREEDY DECODING:")
result = generator(prompt, do_sample=False, **common)
print(result[0]['generated_text'])

# Temperature sampling (top_k=0 disables the default top-k filter of 50)
print("\n2. TEMPERATURE SAMPLING (T=0.7):")
result = generator(prompt, do_sample=True, temperature=0.7, top_k=0, **common)
print(result[0]['generated_text'])

# Top-k
print("\n3. TOP-K SAMPLING (k=50):")
result = generator(prompt, do_sample=True, top_k=50, **common)
print(result[0]['generated_text'])

# Top-p (top_k=0 so that only the nucleus filter applies)
print("\n4. TOP-P SAMPLING (p=0.9):")
result = generator(prompt, do_sample=True, top_p=0.9, top_k=0, **common)
print(result[0]['generated_text'])

**Exercise 1**: Compare generation strategies

1. Try different temperatures (0.1, 0.5, 1.0, 2.0)
2. Compare top-k with different k values
3. Analyze diversity vs quality trade-off
4. Find optimal parameters for your use case

In [None]:
# YOUR CODE HERE

## 4. GPT Evolution

### Timeline:

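- **GPT-1** (2018): 117M params, proof of concept
- **GPT-2** (2019): up to 1.5B params; the largest model was initially withheld as "too dangerous to release" and published in stages
- **GPT-3** (2020): 175B params, few-shot learning from examples in the prompt
- **GPT-4** (2023): Multimodal (text and image input), trained with RLHF; parameter count not disclosed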

### Scaling Laws:

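**Bigger is better** (up to a point): language-modeling loss falls smoothly and predictably as each of the following grows, provided the other two do not become the bottleneck:
- More parameters → lower loss
- More data → lower loss
- More compute → lower loss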
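
Empirical scaling-law studies typically summarize this trend as a power law. As a sketch of the functional form only (the symbols $N$, $N_c$, and $\alpha_N$ are introduced here and are not defined elsewhere in this module, and no fitted values are claimed), the loss as a function of parameter count $N$ can be written as:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

with analogous expressions for dataset size and compute. Under a power law, each doubling of $N$ multiplies the loss by the same factor, $2^{-\alpha_N}$, so improvements continue but shrink in absolute terms.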

## Summary

### Key Concepts:

1. **Autoregressive Modeling**: Predict next token
2. **Causal Attention**: Can't see future tokens
3. **Generation Strategies**: Greedy, sampling, top-k, top-p
4. **Scaling**: Loss improves predictably with more parameters, data, and compute

### GPT vs BERT:

| Aspect | GPT | BERT |
|--------|-----|------|
| Architecture | Decoder-only | Encoder-only |
| Attention | Causal | Bidirectional |
| Training | Next token | Masked LM |
| Best for | Generation | Understanding |

### What's Next?

In **Module 09: Fine-Tuning**, we'll learn to adapt pre-trained models to specific tasks.

### Resources:

- **GPT-3 Paper**: [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- **GPT-2 Blog**: [Better Language Models](https://openai.com/blog/better-language-models/)