# 1.12a: Train Lil Gatsby

**Goal:** Train a tiny GPT-2 style model on The Great Gatsby and record everything.

## Model Architecture

- **Vocabulary:** 128 tokens (ASCII)
- **Context window:** 128 tokens
- **Hidden dimensions:** 64
- **Layers:** 2
- **Attention heads:** 2
- **Weight tying:** E = W^T (matches Qwen 3 4B)
- **Total parameters:** ~49k

## Training Configuration

- **Steps:** 10,000
- **Batch size:** 32
- **Optimizer:** Adam (lr=0.001)
- **Precision:** bfloat16 for model, float32 for optimizer

## Data Recording (Every Step)

Pure PyTorch tensors, no HDF5:
- **W (unembedding):** (10001, 128, 64) bf16
- **Gradients:** (10001, 128, 64) bf16
- **Adam momentum:** (10001, 128, 64) f32
- **Adam variance:** (10001, 128, 64) f32
- **Logits (mean):** (10001, 128) f32
- **Loss:** (10001,) f32

Total: ~650 MB saved as safetensors

## Parameters

In [102]:
# Model architecture
VOCAB_SIZE = 128
HIDDEN_DIM = 64
N_LAYERS = 2
N_HEADS = 2
CONTEXT_LENGTH = 128

# Training
N_STEPS = 10000
BATCH_SIZE = 32
LEARNING_RATE = 0.001
RANDOM_SEED = 42

# Data paths
CORPUS_PATH = "../data/gatsby_clean.txt"
TOKENIZER_PATH = "../data/tokenizers/ascii_128"
OUTPUT_PATH = "../tensors/Lil_Gatsby/1.12a_training_data.safetensors"

## Imports

In [103]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
from tqdm import tqdm
from transformers import PreTrainedTokenizerFast, GPT2Config, GPT2LMHeadModel
from safetensors.torch import save_file

## Device Detection

In [104]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Load Tokenizer and Corpus

In [105]:
# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH)

# Load and tokenize corpus
with open(CORPUS_PATH, 'r', encoding='utf-8') as f:
    corpus_text = f.read()

tokens = tokenizer.encode(corpus_text)

print(f"✓ Loaded corpus")
print(f"  Characters: {len(corpus_text):,}")
print(f"  Tokens: {len(tokens):,}")
print(f"  Vocab size: {len(tokenizer)}")

✓ Loaded corpus
  Characters: 266,252
  Tokens: 266,252
  Vocab size: 128


## Create Dataset

In [106]:
class GatsbyDataset(Dataset):
    def __init__(self, tokens, context_length):
        self.tokens = torch.tensor(tokens, dtype=torch.long)
        self.context_length = context_length
        
    def __len__(self):
        return len(self.tokens) - self.context_length
    
    def __getitem__(self, idx):
        x = self.tokens[idx:idx + self.context_length]
        y = self.tokens[idx + 1:idx + self.context_length + 1]
        return x, y

dataset = GatsbyDataset(tokens, CONTEXT_LENGTH)

# Don't create DataLoader yet - wait until after we're done with tokenizer
print(f"✓ Created dataset: {len(dataset):,} examples")

✓ Created dataset: 266,124 examples


## Create Model

In [107]:
torch.manual_seed(RANDOM_SEED)

config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=CONTEXT_LENGTH,
    n_embd=HIDDEN_DIM,
    n_layer=N_LAYERS,
    n_head=N_HEADS,
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    use_cache=False,
    tie_word_embeddings=True,  # E = W^T
    # Explicitly set loss_type to avoid warning
)

model = GPT2LMHeadModel(config).to(device).to(torch.bfloat16)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

n_params = sum(p.numel() for p in model.parameters())
print(f"✓ Created model: {n_params:,} parameters")

✓ Created model: 116,480 parameters


## Prepare Storage Tensors

In [108]:
# Pre-allocate tensors for all steps ON GPU (or device)
n_steps_total = N_STEPS + 1

# Store on device to avoid constant GPU→CPU transfers
W_history = torch.zeros(n_steps_total, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.bfloat16, device=device)
grads_history = torch.zeros(n_steps_total, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.bfloat16, device=device)
momentum_history = torch.zeros(n_steps_total, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32, device=device)
variance_history = torch.zeros(n_steps_total, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32, device=device)
logits_history = torch.zeros(n_steps_total, VOCAB_SIZE, dtype=torch.float32, device=device)
loss_history = torch.zeros(n_steps_total, dtype=torch.float32, device=device)

print(f"✓ Allocated storage tensors on {device}")
print(f"  Total memory: ~{(W_history.nbytes + grads_history.nbytes + momentum_history.nbytes + variance_history.nbytes + logits_history.nbytes + loss_history.nbytes) / 1024**2:.0f} MB")

# NOW create DataLoader (after all tokenizer usage is done)
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    generator=torch.Generator().manual_seed(RANDOM_SEED)
)

print(f"✓ Created DataLoader")

✓ Allocated storage tensors on mps
  Total memory: ~943 MB
✓ Created DataLoader


## Training Loop

In [109]:
print("\n" + "=" * 80)
print("TRAINING")
print("=" * 80)
print()

# Record step 0 (initialization)
with torch.no_grad():
    W_history[0] = model.lm_head.weight  # No .cpu() - keep on device
    # Gradients, momentum, variance are zero at step 0
    # Get dummy logits
    dummy_x, _ = next(iter(dataloader))
    dummy_out = model(dummy_x.to(device))
    logits_history[0] = dummy_out.logits[:, -1, :].mean(dim=0).float()

print("✓ Recorded step 0\n")

# Training loop
data_iter = iter(dataloader)

for step in tqdm(range(1, N_STEPS + 1), desc="Training"):
    # Get batch
    try:
        x, y = next(data_iter)
    except StopIteration:
        data_iter = iter(dataloader)
        x, y = next(data_iter)
    
    x, y = x.to(device), y.to(device)
    
    # Forward pass
    optimizer.zero_grad()
    outputs = model(x, labels=y)
    loss = outputs.loss
    
    # Backward pass
    loss.backward()
    
    # Record data BEFORE optimizer step (all on device, no transfers!)
    with torch.no_grad():
        # W matrix
        W_history[step] = model.lm_head.weight
        
        # Gradients
        if model.lm_head.weight.grad is not None:
            grads_history[step] = model.lm_head.weight.grad
        
        # Adam state (if it exists)
        if model.lm_head.weight in optimizer.state:
            state = optimizer.state[model.lm_head.weight]
            if 'exp_avg' in state:
                momentum_history[step] = state['exp_avg'].float()
            if 'exp_avg_sq' in state:
                variance_history[step] = state['exp_avg_sq'].float()
        
        # Mean logits
        logits_history[step] = outputs.logits[:, -1, :].mean(dim=0).float()
        
        # Loss
        loss_history[step] = loss.float()
    
    # Optimizer step
    optimizer.step()

print(f"\n✓ Training complete! Final loss: {loss_history[-1]:.4f}")


TRAINING

✓ Recorded step 0



Training: 100%|██████████| 10000/10000 [01:31<00:00, 108.95it/s]


✓ Training complete! Final loss: 1.9609





## Save Training Data

In [110]:
Path(OUTPUT_PATH).parent.mkdir(parents=True, exist_ok=True)

# Move everything to CPU for saving
print("Moving data to CPU for saving...")
save_file(
    {
        'W': W_history.cpu(),
        'grads': grads_history.cpu(),
        'momentum': momentum_history.cpu(),
        'variance': variance_history.cpu(),
        'logits': logits_history.cpu(),
        'loss': loss_history.cpu()
    },
    OUTPUT_PATH,
    metadata={
        'model': 'GPT2',
        'vocab_size': str(VOCAB_SIZE),
        'hidden_dim': str(HIDDEN_DIM),
        'n_layers': str(N_LAYERS),
        'n_heads': str(N_HEADS),
        'n_steps': str(N_STEPS),
        'batch_size': str(BATCH_SIZE),
        'learning_rate': str(LEARNING_RATE),
        'final_loss': str(loss_history[-1].item())
    }
)

file_size_mb = Path(OUTPUT_PATH).stat().st_size / 1024**2
print(f"✓ Saved training data: {OUTPUT_PATH}")
print(f"  File size: {file_size_mb:.1f} MB")

Moving data to CPU for saving...
✓ Saved training data: ../tensors/Lil_Gatsby/1.12a_training_data.safetensors
  File size: 942.5 MB


## Inference Test

In [111]:
print("\n" + "=" * 80)
print("INFERENCE TEST")
print("=" * 80)
print()

model.eval()

prompts = [
    "In my younger",
    "Gatsby",
    "The green light"
]

for prompt in prompts:
    input_ids = torch.tensor([tokenizer.encode(prompt)]).to(device)
    
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=50,
            do_sample=True,
            temperature=0.8,
            pad_token_id=tokenizer.pad_token_id
        )
    
    print(f"Prompt: '{prompt}'")
    print(f"Output: '{tokenizer.decode(output[0])}'")
    print()


INFERENCE TEST

Prompt: 'In my younger'
Output: 'I n   m y   y o u n g e r   a o   i   o   u l n t r   f o   r n a e <0x00> h g g t r n s a   r s r r s r'

Prompt: 'Gatsby'
Output: 'G a t s b y a l s y   n t o t i e , w r n n n l . <0x00> I h k c   t e l e h t e   h m a e   e o R s a i'

Prompt: 'The green light'
Output: 'T h e   g r e e n   l i g h t   t m e   t k t t o s   r y r b   h a r   i   o   o   s o n t d s o e'



## Summary

In [112]:
print("=" * 80)
print("SUMMARY")
print("=" * 80)
print()
print(f"Model: {N_LAYERS}L, {N_HEADS}H, {HIDDEN_DIM}D ({n_params:,} params)")
print(f"Training: {N_STEPS:,} steps, batch {BATCH_SIZE}")
print(f"Final loss: {loss_history[-1]:.4f}")
print()
print(f"Data saved: {OUTPUT_PATH}")
print(f"Size: {file_size_mb:.1f} MB")
print()
print(f"Next: 1.13x - Analyze dead token dynamics")
print("=" * 80)

SUMMARY

Model: 2L, 2H, 64D (116,480 params)
Training: 10,000 steps, batch 32
Final loss: 1.9609

Data saved: ../tensors/Lil_Gatsby/1.12a_training_data.safetensors
Size: 942.5 MB

Next: 1.13x - Analyze dead token dynamics
