
# STAT 453 — HW5 (Optional): Switch GPT‑0 to BPE

This notebook is **self-contained**, but builds on the [GPT-0 implementation](https://github.com/AdaptInfer/easy-gpt) discussed in class.

First, switch this notebook runtime to use GPU, then run this notebook top to bottom.

**What you will do:**
1) Train a tiny GPT-0 on Tiny Shakespeare **using a BPE tokenizer** (GPT‑2 tokenizer).  
2) Inspect the tokenizer behavior.  
3) Generate text and compare to the character-level version from lecture.  

**Submit:** A printout of this notebook with the questions at the bottom answered.


In [1]:

# --- Setup (Colab) ---
!pip -q install transformers==4.45.2



In [2]:

import math, torch, torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
from transformers import GPT2TokenizerFast

torch.manual_seed(1337)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


'cuda'

In [3]:

# --- GPT Config ---
@dataclass
class GPTConfig:
    block_size: int = 128
    vocab_size: int = None
    n_embd: int = 64
    n_head: int = 4
    n_layer: int = 4
    dropout: float = 0.0
    device: str = device


In [4]:

# --- Minimal Transformer block (decoder-only) ---
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head
        self.qkv = nn.Linear(config.n_embd, 3*config.n_embd, bias=False)
        self.proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.attn_drop = nn.Dropout(config.dropout)
        self.resid_drop = nn.Dropout(config.dropout)

        # causal mask as buffer (1 for allowed, 0 for masked)
        bias = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer('mask', bias.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x)  # (B, T, 3C)
        q, k, v = qkv.chunk(3, dim=-1)
        # shape into heads
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)  # (B, nh, T, hd)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # scaled dot-product attention
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # (B, nh, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v  # (B, nh, T, hd)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # (B, T, C)
        y = self.resid_drop(self.proj(y))
        return y

class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.n_embd, 4*config.n_embd),
            nn.GELU(),
            nn.Linear(4*config.n_embd, config.n_embd),
            nn.Dropout(config.dropout),
        )
    def forward(self, x): return self.net(x)

class Block(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
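
The causal masking step in `CausalSelfAttention.forward` can be illustrated without PyTorch. The following is a minimal, dependency-free sketch for a single head; `causal_attention_weights` is a hypothetical helper introduced only for illustration, and it operates on a plain list-of-lists score matrix rather than on tensors.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with a causal mask.

    Position t may only attend to positions s <= t, mirroring
    masked_fill(mask == 0, -inf) followed by softmax(dim=-1).
    """
    T = len(scores)
    weights = []
    for t in range(T):
        # Future positions (s > t) are set to -inf so they get zero weight.
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# With all-equal scores, each row is uniform over the allowed prefix.
w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in w:
    print([round(p, 3) for p in row])
```

Because the diagonal is never masked, every row has at least one finite score, so the softmax is always well defined.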


In [5]:

# --- GPT-zero model (from class) ---
class GPT(nn.Module):
    """
    GPT Language Model:
    - Transformer decoder stack
    - Token + positional embeddings
    - Predicts next token using causal self-attention
    """
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.block_size = config.block_size
        self.device = config.device
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embd)
        self.position_embedding = nn.Embedding(config.block_size, config.n_embd)
        self.transformer = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.block_size, "Sequence length exceeds block size"
        token_embeddings = self.token_embedding(idx)                                        # (B,T,C)
        position_embeddings = self.position_embedding(torch.arange(T, device=idx.device))   # (T,C)
        x = token_embeddings + position_embeddings                                          # (B,T,C)
        x = self.transformer(x)                                                             # (B,T,C)
        x = self.ln_f(x)                                                                    # (B,T,C)
        logits = self.lm_head(x)                                                            # (B,T,V)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B*T, -1), targets.view(B*T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        self.eval()
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('inf')
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx
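
The temperature and top-k logic inside `generate` can be traced on a single vector of logits. This is a rough pure-Python sketch of that one step, not the notebook's implementation; `sample_top_k` is a hypothetical name, and it uses the standard-library `random` module in place of `torch.multinomial`.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample one index after temperature scaling and top-k filtering.

    Returns (sampled_index, probabilities). Entries below the k-th largest
    scaled logit get probability 0, as in the masked assignment in generate().
    """
    scaled = [x / temperature for x in logits]
    # Threshold is the k-th largest value (k is clamped to the vocabulary size).
    kth = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
    masked = [x if x >= kth else float("-inf") for x in scaled]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs

idx, probs = sample_top_k([2.0, 1.0, 0.5, -1.0], k=2)
print(idx, [round(p, 3) for p in probs])
```

Lower temperatures sharpen the distribution toward the largest logit, while a smaller `k` removes the long tail of unlikely tokens entirely.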


In [6]:

# --- Data: Tiny Shakespeare ---
import os, requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
if not os.path.exists("input.txt"):
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly instead of saving an error page as the dataset
    with open("input.txt", "w", encoding="utf-8") as f:
        f.write(r.text)

with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

print("length of dataset in characters:", len(text))
print(text[:300])


length of dataset in characters: 1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us



## Tokenizer: **Replace char-level with BPE (GPT‑2 tokenizer)**


In [7]:

# --- BPE tokenizer (GPT-2) ---
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# No pad token is needed: every training example is a contiguous, fixed-length slice of the corpus.

# Encode entire corpus into token ids
ids = tokenizer.encode(text)
data = torch.tensor(ids, dtype=torch.long)
print("Total tokens in corpus:", len(data))
print("Vocab size:", tokenizer.vocab_size)

# Config
config = GPTConfig(
    block_size=128,         # token-level context
    vocab_size=tokenizer.vocab_size,
    n_embd=64, n_head=4, n_layer=4, dropout=0.0, device=device
)





Token indices sequence length is longer than the specified maximum sequence length for this model (338025 > 1024). Running this sequence through the model will result in indexing errors


Total tokens in corpus: 338025
Vocab size: 50257



### Inspect the tokenizer (include outputs in your submission)


In [8]:

sample = "Wherefore art thou, Romeo?"
print("tokenize:", tokenizer.tokenize(sample))
print("encode:", tokenizer.encode(sample))
for tok_id in tokenizer.encode("Wherefore"):
    print(tok_id, tokenizer.convert_ids_to_tokens(tok_id))


tokenize: ['Where', 'fore', 'Ġart', 'Ġthou', ',', 'ĠRomeo', '?']
encode: [8496, 754, 1242, 14210, 11, 43989, 30]
8496 Where
754 fore


In [9]:

# --- Train/Val split and batching ---
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

batch_size = 16
def get_batch(split):
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - config.block_size, (batch_size,))
    x = torch.stack([data_split[i:i+config.block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+config.block_size+1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss(model, eval_iters=200):
    out = {}
    model.eval()
    for split in ['train','val']:
        losses = torch.zeros(eval_iters, device=device)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss
        out[split] = losses.mean().item()
    model.train()
    return out

# Sanity check a batch
xb, yb = get_batch('train')
xb.shape, yb.shape, xb[0][:10], yb[0][:10]


(torch.Size([16, 128]),
 torch.Size([16, 128]),
 tensor([ 25, 198,  40, 423,  25, 475, 644, 286, 683,  30], device='cuda:0'),
 tensor([198,  40, 423,  25, 475, 644, 286, 683,  30, 198], device='cuda:0'))

In [10]:

# --- Model + training ---
model = GPT(config).to(device)
print(sum(p.numel() for p in model.parameters())/1e6, "M parameters")

learning_rate = 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

max_iters = 1000
eval_interval = 200

for it in range(max_iters+1):
    if it % eval_interval == 0 or it == max_iters:
        losses = estimate_loss(model, eval_iters=100)
        print(f"step {it}: train {losses['train']:.4f}, val {losses['val']:.4f}")
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


6.690385 M parameters
step 0: train 10.9754, val 10.9681
step 200: train 5.8571, val 6.0008
step 400: train 5.2345, val 5.5981
step 600: train 4.8937, val 5.2976
step 800: train 4.6276, val 5.1208
step 1000: train 4.4254, val 5.0030
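
The parameter count printed above can be reproduced by hand from the layer definitions, which also shows where the parameters sit. The sketch below is plain arithmetic; `count_gpt_params` is a hypothetical helper, and it assumes the exact layer shapes defined in this notebook (bias-free `qkv` and `proj`, biased MLP and `lm_head`, untied input and output embeddings).

```python
def count_gpt_params(vocab_size, block_size, n_embd, n_layer):
    """Count trainable parameters for the GPT class defined in this notebook."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    attn = n_embd * 3 * n_embd + n_embd * n_embd            # qkv + proj, no biases
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) \
        + (4 * n_embd * n_embd + n_embd)                    # two Linear layers with biases
    layer_norms = 2 * (2 * n_embd)                          # ln1 and ln2: weight + bias each
    blocks = n_layer * (attn + mlp + layer_norms)
    ln_f = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size              # weight + bias
    total = tok_emb + pos_emb + blocks + ln_f + lm_head
    return {"total": total, "vocab_dependent": tok_emb + lm_head, "blocks": blocks}

counts = count_gpt_params(vocab_size=50257, block_size=128, n_embd=64, n_layer=4)
print(counts["total"] / 1e6, "M parameters")
print(f"vocab-dependent share: {counts['vocab_dependent'] / counts['total']:.1%}")
```

Under these assumptions almost all of the parameters belong to the token embedding and the output layer, and only about 0.2M sit in the transformer blocks. The causal mask is registered as a buffer, so it does not count as a parameter.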


In [11]:

# --- Generation ---
context = torch.tensor([[tokenizer.eos_token_id]], dtype=torch.long, device=device)
out = model.generate(context, max_new_tokens=300, temperature=1.0, top_k=50)
sample_text = tokenizer.decode(out[0].tolist())
print(sample_text[:1200])


<|endoftext|>IO:
Why, that they have we be not soest all
Of no heart in thy head
For my heart thy queen?

Second Murderer:
Now, they follow this world. Go for it.

POLIXO:
Hath me:
The blood of such a piece!

I am dead's true.
ROMEO:

Bost:
O, good father with a thousand better thou woe.

HENRY VI:
The other:
Here? I have seen that I have said your good tongue in 'tis some other,
Is at the
Isio, one and how to the king's the ground to the stroke;
And for
And see thy quarrel
Whose's deadful day, but our course of me with this deed
Tis,
That thou do,
That thou in, a king, but mine,
Which's the earth
How to me,
By so be it, I will be mine canst
To call.
SICINCE:
And she lies:
No, for you say,
It, I am a soldier a thousand good he would, and one time,
Of the queen for
To do you may be gone's very truth'd on his good,
and a man?
But that be the king, my queen's this night be a gentleman,
Which
I'll follow of the



## Analysis Questions

**Q1. Tokenization (5 pts)**  
Paste the output of the *tokenize* and *encode* calls for `"Wherefore art thou, Romeo?"`.  
Explain in 2–3 sentences how this differs from character-level tokenization.

tokenize: ['Where', 'fore', 'Ġart', 'Ġthou', ',', 'ĠRomeo', '?']
encode: [8496, 754, 1242, 14210, 11, 43989, 30]

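Character-level tokenization assigns one ID to every letter, space, and punctuation mark, so this 26-character string would become 26 tokens. BPE instead merges frequently co-occurring character sequences into sub-word units, so the same string is only 7 tokens; the `Ġ` prefix marks a token that begins with a space. Each position in the context window therefore carries more text, at the cost of a much larger vocabulary (50,257 entries).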
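
To make the merging idea concrete, here is a toy BPE trainer in pure Python. It is only a sketch of the general algorithm: the function names are hypothetical, it works on whitespace-split words and characters, and it is not the byte-level procedure or the merge table that the GPT-2 tokenizer actually uses.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_toy_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; words start as tuples of characters."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_toy_bpe("where wherefore therefore there where", num_merges=6)
print("learned merges:", merges)
print("segmented words:", words)
```

Each merge adds one entry to the vocabulary and shortens the encoded corpus, which is the tradeoff between vocabulary size and sequence length discussed in the answer above.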

**Q2. IDs & length (5 pts)**  
Are there more or fewer tokens from the BPE tokenizer or the character-level tokenizer we discussed in class? (1–2 sentences.)

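Significantly fewer: the BPE tokenizer encodes the corpus as 338,025 tokens, whereas a character-level tokenizer produces one token per character, or 1,115,394 tokens. That is about 3.3 characters per BPE token.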
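
The ratio can be checked directly from the two counts printed earlier in the notebook. This short sketch hard-codes those printed values (it does not re-run the tokenizer) and also estimates how much text fits in one 128-token context.

```python
n_chars = 1_115_394   # "length of dataset in characters" from the data cell
n_tokens = 338_025    # "Total tokens in corpus" from the tokenizer cell
block_size = 128      # context length used in GPTConfig

chars_per_token = n_chars / n_tokens
approx_chars_in_context = block_size * chars_per_token

print(f"{chars_per_token:.2f} characters per BPE token on average")
print(f"a {block_size}-token context covers roughly {approx_chars_in_context:.0f} characters")
```

A character-level model with the same `block_size` would see only 128 characters at a time, so the BPE model conditions on roughly three times as much text per prediction.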

**Q3. Loss behavior (5 pts)**  
Copy the printed train/val losses at step 0 and final step. Are they qualitatively different from your char‑level run from lecture? (2–3 sentences.)

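step 0: train 10.9754, val 10.9681
step 1000: train 4.4254, val 5.0030

Yes. The starting loss is much higher (~10.9 compared to ~4.2 in a character-level run) because an untrained model spreads probability almost uniformly over the vocabulary, and the BPE vocabulary has 50,257 entries. The losses are also measured per token, and a BPE token covers several characters, so the raw values of the two runs are not directly comparable. By step 1000 a gap has opened between train and val (4.43 vs. 5.00), which suggests the model is beginning to overfit.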
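
The step-0 values can be sanity-checked against the loss of a uniform prediction, which is the natural log of the vocabulary size. The sketch below assumes a 65-symbol character vocabulary for Tiny Shakespeare (an assumption, since the character-level run is not part of this notebook) and uses the corpus counts printed earlier to convert the final BPE loss to a per-character figure.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform distribution over the vocabulary."""
    return math.log(vocab_size)

bpe_init = uniform_loss(50257)   # GPT-2 BPE vocabulary size printed above
char_init = uniform_loss(65)     # assumed character vocabulary size

# Convert the final BPE validation loss from nats/token to nats/character,
# using the average characters per token measured on this corpus.
chars_per_token = 1_115_394 / 338_025
final_val_per_char = 5.0030 / chars_per_token

print(f"uniform loss, BPE vocab:  {bpe_init:.2f} nats/token")
print(f"uniform loss, char vocab: {char_init:.2f} nats/token")
print(f"final BPE val loss:       {final_val_per_char:.2f} nats/character")
```

The observed step-0 losses sit slightly above the uniform value, which is expected because randomly initialized logits are not exactly uniform. The per-character conversion is approximate, but it puts the BPE loss on the same scale as a character-level loss.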

**Q4. Sample (5 pts)**  
Paste 5–10 lines of generated text from the BPE model.

Bost:
O, good father with a thousand better thou woe.

HENRY VI:
The other:
Here? I have seen that I have said your good tongue in 'tis some other,
Is at the

**Q5. Comparison (7 pts)**  
Compare BPE output to char‑level output along: word coherence, spaces/punctuation, repetition, long‑range structure. (3–4 sentences.)


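Word coherence: BPE output consists mostly of real words because the model picks from sub-word units rather than single letters, whereas character-level models more often produce nonsensical letter strings. It can still misspell by joining sub-words that do not belong together, as in "soest", "deadful", and "SICINCE" in the sample above. Spaces and punctuation: the leading space is part of the token itself (the `Ġ` prefix), so word spacing comes out clean, and speaker labels followed by a colon and a newline are reproduced reliably. Repetition and long-range structure: short patterns recur (consecutive lines beginning "That thou" or "And"), and the play format of speaker label plus short verse lines holds, but meaning does not carry from one line to the next.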


**Q6. Concept (8 pts)**  
What does a tokenizer *do* for a GPT? Why is BPE usually preferred over char‑level? What tradeoffs remain? What would you do to improve a BPE? (4–5 sentences.)



A tokenizer converts raw text into a sequence of integer IDs that the model can embed, and converts generated IDs back into text; it therefore fixes both the vocabulary size and how much text fits in one context window. Byte-Pair Encoding (BPE) is usually preferred over character-level tokenization because it packs more text into each position: by merging common character sequences into single tokens, the model sees more context within the same fixed sequence length (128 tokens in this notebook) than a character-level model would.


Tradeoffs remain. BPE needs a large embedding matrix and output layer (50,257 rows here, which account for about 6.5M of this model's 6.69M parameters), and rare words or unusual spellings are split into many small pieces that the model has seen less often. To improve a BPE tokenizer, you could add byte fallback so that every possible UTF-8 byte is representable (GPT-2's byte-level BPE already provides this), or use BPE-dropout during training, which randomly varies how words are segmented into sub-words so that the model becomes more robust to varied spelling patterns.