# Assignment 5: Build and Train a GPT Model

**PSYC 51.17: Models of Language and Communication**

**Due Date: February 13, 2026 at 11:59 PM EST**

---

## Overview

In this capstone-style assignment, you will build a GPT (Generative Pre-trained Transformer) model from the ground up and train it to generate text.

### Learning Objectives

- Implement transformer architecture components (attention, FFN, layer norm)
- Understand autoregressive language modeling
- Implement causal (masked) self-attention
- Build a complete training pipeline
- Apply different text generation strategies
- Analyze learned representations

---

## Table of Contents

1. [Setup and Installation](#1-setup-and-installation)
2. [Dataset Selection and Loading](#2-dataset-selection-and-loading)
3. [Transformer Components](#3-transformer-components)
4. [GPT Model](#4-gpt-model)
5. [Training Pipeline](#5-training-pipeline)
6. [Text Generation](#6-text-generation)
7. [Analysis and Visualization](#7-analysis-and-visualization)
8. [Comparison with Pre-trained Models](#8-comparison-with-pre-trained-models)
9. [Conclusion](#9-conclusion)

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q torch transformers
!pip install -q datasets tiktoken
!pip install -q matplotlib seaborn plotly
!pip install -q tqdm numpy pandas
!pip install -q wandb  # Optional: for experiment tracking

In [None]:
# Core imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math
from tqdm import tqdm
import matplotlib.pyplot as plt

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

## 2. Dataset Selection and Loading

Choose a dataset:
- Shakespeare (small, good for testing)
- Code (Python from GitHub)
- Stories/Creative Writing
- Domain-specific text

In [None]:
# Option 1: Shakespeare dataset (small, good for initial testing)
!wget -q https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O shakespeare.txt

with open('shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Dataset size: {len(text):,} characters")
print("\nSample text:")
print(text[:500])

In [None]:
# Tokenization - you can use character-level or BPE
# Option A: Character-level (simpler)
chars = sorted(set(text))
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size}")

# Create mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

# Option B: Use GPT-2 tokenizer (more realistic)
# from transformers import GPT2Tokenizer
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# vocab_size = tokenizer.vocab_size
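
A quick round-trip check is a useful sanity test for the character-level tokenizer. The sketch below rebuilds the same mappings on a toy string (`sample_text` is an assumption standing in for the notebook's `text`) and confirms that decoding an encoded string returns the original.

```python
# Toy corpus standing in for the notebook's `text` variable (assumption for illustration)
sample_text = "hello world"

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
roundtrip = decode(ids)
print(ids, roundtrip)
```

Note that `encode` raises a `KeyError` for any character that never appeared in the corpus, so prompts used later for generation should stick to characters present in the training text.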

In [None]:
# Prepare data
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data shape: {data.shape}")

# Train/val split
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
print(f"Train: {len(train_data):,} tokens")
print(f"Val: {len(val_data):,} tokens")

## 3. Transformer Components

Implement the core building blocks of GPT.

In [None]:
# Hyperparameters
batch_size = 64
block_size = 256  # context length
n_embd = 384      # embedding dimension
n_head = 6        # number of attention heads
n_layer = 6       # number of transformer blocks
dropout = 0.2
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500
eval_iters = 200

In [None]:
# TODO: Implement Multi-Head Self-Attention with Causal Masking
class MultiHeadAttention(nn.Module):
    """Multi-head self-attention with causal masking."""
    
    def __init__(self, n_embd, n_head, dropout, block_size):
        super().__init__()
        # Your implementation here
        pass
    
    def forward(self, x):
        # Your implementation here
        pass
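
Before writing the full module, it may help to see causal masking on a single toy head. This is a minimal sketch, not the required implementation: the tensor sizes are arbitrary, and the query, key, and value tensors are random rather than produced by learned projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 8                      # toy sequence length and per-head dimension
q = torch.randn(1, T, head_size)         # (batch, time, head_size)
k = torch.randn(1, T, head_size)
v = torch.randn(1, T, head_size)

# Scaled dot-product scores, shape (batch, T, T)
scores = q @ k.transpose(-2, -1) / head_size ** 0.5

# Lower-triangular mask: position t may attend only to positions <= t
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = F.softmax(scores, dim=-1)      # rows sum to 1; future positions get weight 0
out = weights @ v                        # (batch, T, head_size)
print(weights[0])
```

In a full module the mask is often stored with `register_buffer` so that it moves to the GPU together with the model, and it is sliced to the current sequence length inside `forward`.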

In [None]:
# TODO: Implement Feed-Forward Network
class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    
    def __init__(self, n_embd, dropout):
        super().__init__()
        # Your implementation here
        pass
    
    def forward(self, x):
        # Your implementation here
        pass
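
One common shape for the feed-forward network is an expand-then-project MLP applied independently at every position. The sketch below is illustrative only: the class name `FeedForwardSketch`, the 4x hidden width, and the GELU activation are assumptions borrowed from GPT-2-style models, not requirements of this assignment.

```python
import torch
import torch.nn as nn

class FeedForwardSketch(nn.Module):
    """Illustrative position-wise MLP; the 4x width and GELU are assumptions."""

    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),   # project back to the model width
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # x: (batch, time, n_embd); nn.Linear acts on the last dimension only
        return self.net(x)

torch.manual_seed(0)
ffn = FeedForwardSketch(n_embd=32, dropout=0.0)
x = torch.randn(2, 5, 32)
y = ffn(x)
print(y.shape)
```

Because the layers act only on the last dimension, the output at one position does not depend on any other position; all mixing across time happens in the attention layer.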

In [None]:
# TODO: Implement Transformer Block
class TransformerBlock(nn.Module):
    """Transformer block with attention and FFN."""
    
    def __init__(self, n_embd, n_head, dropout, block_size):
        super().__init__()
        # Your implementation here
        pass
    
    def forward(self, x):
        # Your implementation here
        pass
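
The residual wiring of a block can be sketched as follows. This is one possible layout (pre-norm, as in GPT-2), not the only valid one. To keep the example self-contained, PyTorch's built-in `nn.MultiheadAttention` stands in for the hand-written attention module that the assignment asks for, and `BlockSketch` is a hypothetical name.

```python
import torch
import torch.nn as nn

class BlockSketch(nn.Module):
    """Illustrative pre-norm block; nn.MultiheadAttention is a stand-in."""

    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.size(1)
        # For nn.MultiheadAttention, True marks positions that may NOT be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                  # residual connection around attention
        x = x + self.ffn(self.ln2(x))     # residual connection around the FFN
        return x

torch.manual_seed(0)
block = BlockSketch(n_embd=32, n_head=4, dropout=0.0).eval()
x = torch.randn(2, 6, 32)
y = block(x)

# Causality check: perturbing the last token must not change earlier outputs
x2 = x.clone()
x2[:, -1] += 1.0
y2 = block(x2)
print(torch.allclose(y[:, :-1], y2[:, :-1], atol=1e-5))
```

The same perturbation check works on your own `TransformerBlock` and is a cheap way to catch a missing or transposed mask.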

## 4. GPT Model

Assemble the complete GPT model.

In [None]:
# TODO: Implement GPT Model
class GPT(nn.Module):
    """GPT Language Model."""
    
    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout):
        super().__init__()
        # Your implementation here
        pass
    
    def forward(self, idx, targets=None):
        # Your implementation here
        pass
    
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate text autoregressively."""
        # Your implementation here
        pass

In [None]:
# Create model
model = GPT(
    vocab_size=vocab_size,
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    block_size=block_size,
    dropout=dropout
).to(device)

# Print model size
num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params:,}")
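
A back-of-the-envelope estimate is a handy cross-check for the printed parameter count. The sketch below assumes a GPT-2-style layout (four `n_embd x n_embd` attention projections and a 4x-wide FFN per block, an untied output head) and ignores biases and layer-norm parameters, so it will not match your model exactly. The value `vocab_size = 65` is an assumption; use whatever your notebook prints.

```python
n_embd, n_layer, block_size = 384, 6, 256
vocab_size = 65  # assumed value for illustration; use the printed vocabulary size

attn_per_block = 4 * n_embd ** 2          # Q, K, V, and output projections
ffn_per_block = 2 * 4 * n_embd ** 2       # expand to 4*n_embd and project back
block_params = n_layer * (attn_per_block + ffn_per_block)

tok_emb = vocab_size * n_embd             # token embedding table
pos_emb = block_size * n_embd             # learned positional embedding table
lm_head = vocab_size * n_embd             # output projection (if not weight-tied)

estimate = block_params + tok_emb + pos_emb + lm_head
print(f"Blocks: {block_params:,}  Total estimate: {estimate:,}")
```

If the count reported by `model.parameters()` differs from this estimate by much more than a few percent, a missing or duplicated layer is a likely cause.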

## 5. Training Pipeline

In [None]:
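# Data loading function
def get_batch(split):
    """Sample a random batch of (input, target) sequences from the chosen split."""
    source = train_data if split == 'train' else val_data
    # Random start offsets; each must leave room for block_size + 1 tokens
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i+block_size] for i in ix])
    # Targets are the inputs shifted one position to the right
    y = torch.stack([source[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)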
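
The relationship between inputs and targets is easiest to see on toy data. In the sketch below, `toy_data` is a stand-in for `train_data`, and the small `block_size` and `batch_size` are chosen only for readability.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(20)               # stand-in for train_data: tokens 0..19
block_size, batch_size = 5, 3

ix = torch.randint(len(toy_data) - block_size, (batch_size,))
x = torch.stack([toy_data[i:i + block_size] for i in ix])
y = torch.stack([toy_data[i + 1:i + block_size + 1] for i in ix])

print(x)
print(y)
```

Each row of `y` is the matching row of `x` shifted by one token, so position `t` of the model output is trained to predict token `t + 1`. A single batch therefore provides `batch_size * block_size` training examples.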

In [None]:
# TODO: Implement training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Your training loop here
# ...
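
The overall shape of a training loop can be sketched on a toy problem. Everything here is a stand-in: `BigramStandIn` replaces the GPT model, the data is a repeating token stream, and the step counts are far smaller than `max_iters` and `eval_iters`. The sketch also assumes that `forward(idx, targets)` returns a `(logits, loss)` pair, which the notebook's signature suggests but does not state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 8, 16, 32
data = torch.arange(2000) % vocab_size    # toy token stream (assumption)

class BigramStandIn(nn.Module):
    """Tiny stand-in that returns (logits, loss), the interface assumed for GPT."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)          # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return logits, loss

def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()
def estimate_loss(model, eval_iters=10):
    """Average the loss over several batches with dropout disabled."""
    model.eval()
    losses = [model(*get_batch())[1].item() for _ in range(eval_iters)]
    model.train()
    return sum(losses) / len(losses)

model = BigramStandIn(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-1)

initial_loss = estimate_loss(model)
for step in range(200):
    xb, yb = get_batch()
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # optional safeguard
    optimizer.step()
final_loss = estimate_loss(model)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}")
```

For the real model, evaluating on both splits every `eval_interval` steps and recording the values gives the training and validation curves needed for the analysis sections.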

## 6. Text Generation

Implement and compare four decoding strategies: greedy decoding, temperature sampling, top-k sampling, and nucleus (top-p) sampling.

In [None]:
# TODO: Generate text with different strategies
prompt = "ROMEO:"
context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)

# Greedy decoding
print("Greedy decoding:")
# ...

# Temperature sampling
print("\nTemperature sampling (T=0.8):")
# ...

# Top-k sampling
print("\nTop-k sampling (k=40):")
# ...

# Nucleus sampling
print("\nNucleus sampling (p=0.9):")
# ...
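
The four strategies differ only in how the next-token logits are turned into a choice. The helper below is a sketch under stated assumptions: `sample_next_token` is a hypothetical name, it operates on the logits of the last position only, and treating `temperature=0` as greedy decoding is a convention chosen here rather than something the assignment specifies.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Pick one token id per row from `logits` of shape (B, vocab_size)."""
    if temperature == 0:                  # convention used here: 0 means greedy
        return logits.argmax(dim=-1, keepdim=True)

    logits = logits / temperature         # <1 sharpens, >1 flattens the distribution

    if top_k is not None:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = F.softmax(sorted_logits, dim=-1)
        mass_before = sorted_probs.cumsum(dim=-1) - sorted_probs
        # Drop a token once the tokens ranked above it already cover more than top_p
        sorted_logits = sorted_logits.masked_fill(mass_before > top_p, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])
greedy = sample_next_token(logits, temperature=0)
topk = sample_next_token(logits, temperature=0.8, top_k=2)
nucleus = sample_next_token(logits, top_p=0.9)
print(greedy.item(), topk.item(), nucleus.item())
```

Inside an autoregressive loop, the chosen token is appended to the context, the context is cropped to the last `block_size` tokens, and the model is called again.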

## 7. Analysis and Visualization

In [None]:
# TODO: Visualize attention patterns
# Your implementation here
# ...
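
An attention pattern is usually shown as a heatmap with query positions on one axis and key positions on the other. The sketch below shows only the plotting side: random scores stand in for a real head's pre-softmax scores, so the picture itself carries no meaning. To plot real patterns, your attention module would need to return or store its weights, which is a design choice left to you.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

torch.manual_seed(0)
tokens = list("ROMEO:")
T = len(tokens)

# Random scores stand in for one head's pre-softmax scores (assumption)
scores = torch.randn(T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
weights = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(weights.numpy(), cmap="viridis")
ax.set_xticks(range(T))
ax.set_xticklabels(tokens)
ax.set_yticks(range(T))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key position (attended to)")
ax.set_ylabel("Query position")
ax.set_title("Causal attention weights (toy data)")
fig.colorbar(im, ax=ax)
```

With a correct causal mask, everything above the diagonal is exactly zero. Comparing heads and layers side by side often reveals patterns such as attention to the previous token or to line breaks.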

In [None]:
# TODO: Visualize token embeddings (UMAP or t-SNE)
# Your implementation here
# ...
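
UMAP and t-SNE need extra packages (`umap-learn`, `scikit-learn`) that are not in the install cell above. As a dependency-free first look, the sketch below swaps in plain PCA computed with `torch.linalg.svd`. This is a different, linear technique, not a replacement for the nonlinear methods named in the task. The vocabulary and the random `emb` matrix are stand-ins for your model's token embedding weights.

```python
import torch

torch.manual_seed(0)
vocab = list("abcdefghij")                # hypothetical toy vocabulary
emb = torch.randn(len(vocab), 16)         # stand-in for the token embedding matrix

# PCA: center the rows, then project onto the top two right singular vectors
centered = emb - emb.mean(dim=0, keepdim=True)
U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
coords = centered @ Vh[:2].T              # (vocab_size, 2)

# Fraction of total variance captured by the two plotted components
explained = ((S[:2] ** 2).sum() / (S ** 2).sum()).item()
print(coords.shape, round(explained, 3))
```

The two columns of `coords` can be passed to `plt.scatter`, with each point labeled by its character. For a trained character-level model, it is worth checking whether vowels, uppercase letters, or punctuation form visible groups.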

## 8. Comparison with Pre-trained Models

In [None]:
# TODO: Compare with GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Your comparison here
# ...
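
One caution when comparing numbers: GPT-2 predicts BPE tokens while the character-level model predicts single characters, so their per-token perplexities are not directly comparable. Normalizing the total negative log-likelihood by the number of characters gives a shared unit. The sketch below shows both calculations on synthetic logits; the function names are hypothetical, and no pre-trained model is loaded.

```python
import math
import torch
import torch.nn.functional as F

def perplexity_from_logits(logits, targets):
    """exp of the mean cross-entropy (in nats) over all positions."""
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return math.exp(loss.item())

def bits_per_char(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood in nats to bits per character."""
    return total_nll_nats / (num_chars * math.log(2))

torch.manual_seed(0)
vocab_size = 65                           # assumed vocabulary size for illustration
logits = torch.zeros(2, 8, vocab_size)    # a model that predicts uniformly
targets = torch.randint(vocab_size, (2, 8))

ppl = perplexity_from_logits(logits, targets)
bpc = bits_per_char(total_nll_nats=16 * math.log(vocab_size), num_chars=16)
print(round(ppl, 2), round(bpc, 3))
```

A uniform predictor has perplexity equal to the vocabulary size, which makes a convenient baseline. For GPT-2, the summed loss over BPE tokens divided by the character count of the same text gives a bits-per-character figure that can be set beside your model's.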

## 9. Conclusion

*Summarize your implementation, discuss training dynamics, analyze results, and reflect on what you learned.*

TODO: Your conclusion here...