# AIG 230 – BPE Demo Notebook  
## Byte Pair Encoding (BPE) with the State of the Union Corpus

**Purpose of this notebook**  
This notebook demonstrates, end-to-end, how **Byte Pair Encoding (BPE)** works using a *real corpus*: the U.S. State of the Union addresses.

This notebook is intentionally simple and conceptual. Its goal is to build a correct mental model of subword tokenization.


## 1. Setup

We use:
- NLTK to access the State of the Union corpus
- Hugging Face `tokenizers` to train a simple BPE tokenizer


In [1]:
# Install if needed
! pip install nltk tokenizers

import nltk
from nltk.corpus import state_union

nltk.download("state_union")

[nltk_data] Downloading package state_union to
[nltk_data]     /Users/software/nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

## 2. Load and Inspect the Corpus

In [2]:
fileids = state_union.fileids()
len(fileids), fileids[:5]


(65,
 ['1945-Truman.txt',
  '1946-Truman.txt',
  '1947-Truman.txt',
  '1948-Truman.txt',
  '1949-Truman.txt'])

In [3]:
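# Use only the first 10 addresses (a small sample of the 65 files)
texts = [state_union.raw(fid) for fid in fileids[:10]]
corpus_text = "\n".join(texts)

# Preview the first 500 characters
corpus_text[:500]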


"PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\nOnly yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.\nYet, in this decisive hour, when worl"

## 3. Why BPE?

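BPE learns **subword units** from data instead of relying on a predefined word list. Starting from individual characters, it repeatedly merges the most frequent adjacent pair of symbols into a new symbol until the vocabulary reaches a target size. Frequent words end up as single tokens, while rare words are split into smaller, reusable pieces.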
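To make the merge loop concrete, here is a minimal pure-Python sketch of BPE training on a toy word list. It is not how the `tokenizers` library is implemented; the function name `learn_bpe_merges`, the toy words, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: returns the learned merges and the segmented words."""
    # Start with every word split into single characters, keeping word counts.
    segmented = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in segmented.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the alphabetically first pair
        # (an arbitrary choice made here for determinism).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        updated = Counter()
        for symbols, freq in segmented.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            updated[tuple(out)] += freq
        segmented = updated
    return merges, segmented


merges, segmented = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)           # [('l', 'o'), ('lo', 'w')]
print(dict(segmented))  # {('low',): 2, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```

After two merges the frequent stem `low` is a single symbol, while the rarer suffixes are still split into characters.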


## 4. Train a Simple BPE Tokenizer

In [5]:
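from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE model; pieces it cannot represent map to [UNK]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation first, so merges never cross word boundaries
tokenizer.pre_tokenizer = Whitespace()

# A deliberately tiny vocabulary (200 entries, including the 5 special tokens)
# keeps the learned merges easy to inspect
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)

# Train on the corpus text loaded above
tokenizer.train_from_iterator([corpus_text], trainer=trainer)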







## 5. Inspect the Learned Vocabulary

In [6]:
vocab = tokenizer.get_vocab()
list(vocab.items())[:30]

[('%', 8),
 ('ation', 111),
 ('as', 113),
 ("'", 9),
 ('g', 62),
 ('ate', 157),
 ('al', 95),
 ('st', 96),
 ('!', 5),
 ('are', 134),
 ('H', 36),
 ('su', 171),
 ('p', 71),
 ('by', 185),
 ('Q', 45),
 ('W', 51),
 ('ist', 193),
 ('A', 29),
 ('ith', 175),
 ('am', 153),
 ('l', 67),
 ('ill', 126),
 ('N', 42),
 (':', 26),
 ('˝', 83),
 ('ig', 160),
 ('ro', 107),
 ('ong', 183),
 ('v', 77),
 ('ag', 138)]
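The dictionary returned by `get_vocab()` is not sorted by ID, which is why the listing above mixes single characters with merged pieces. Sorting by ID is often easier to read. The sketch below uses a hand-copied sample of the pairs printed above (the name `sample_vocab` is an assumption); in the notebook you would sort `tokenizer.get_vocab().items()` the same way.

```python
# A few (token, id) pairs copied by hand from the output above.
sample_vocab = {
    "%": 8, "ation": 111, "as": 113, "'": 9, "g": 62,
    "al": 95, "st": 96, "!": 5, "H": 36, "ill": 126,
}

# Sort by token ID instead of relying on dictionary order.
by_id = sorted(sample_vocab.items(), key=lambda item: item[1])

for token, token_id in by_id:
    kind = "single character" if len(token) == 1 else "merged subword"
    print(f"{token_id:>4}  {token!r:<8}  {kind}")
```

In this sample every single character has a lower ID than every merged subword, which is consistent with the trainer numbering the special tokens first, then the initial alphabet, then the merges.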

## 6. BPE Tokenization Example

In [7]:
sentence = "Democracy and democratic institutions must be protected."
tokenizer.encode(sentence).tokens

['D',
 'e',
 'mo',
 'c',
 'r',
 'ac',
 'y',
 'and',
 'de',
 'mo',
 'c',
 'r',
 'at',
 'ic',
 'in',
 'st',
 'it',
 'u',
 'tion',
 's',
 'm',
 'ust',
 'be',
 'pro',
 't',
 'ec',
 't',
 'ed',
 '.']
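Conceptually, a trained BPE tokenizer segments a new word by starting from its characters and replaying the learned merges in order. The sketch below illustrates that idea in pure Python. The helper `apply_merges` and the merge list are hypothetical: the merges were chosen by hand to reproduce the pieces shown above and are not read from the trained tokenizer, whose actual merge order may differ.

```python
def apply_merges(word, merges):
    """Segment one word by replaying merge rules in the order given."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it occurs, scanning left to right.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, for illustration only.
merges = [("m", "o"), ("d", "e"), ("a", "t"), ("i", "c"), ("a", "c")]

print(apply_merges("democratic", merges))  # ['de', 'mo', 'c', 'r', 'at', 'ic']
print(apply_merges("Democracy", merges))   # ['D', 'e', 'mo', 'c', 'r', 'ac', 'y']
```

Note that `Democracy` does not benefit from the `("d", "e")` merge because the rule is case-sensitive, which matches the `'D', 'e'` split in the notebook output.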

## 7. Key Takeaways

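- BPE tokenization is **learned from data**: the merges depend on the training corpus and the chosen vocabulary size
- It captures shared structure across related words: in the example above, `Democracy` and `democratic` share the pieces `mo`, `c`, and `r`
- BPE and its variants are used by many modern language models (GPT-2, for example, uses a byte-level BPE)
- It is not among the standard tokenizers in NLTK or spaCy, which is why this notebook uses Hugging Face `tokenizers`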
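
The point about shared structure can be checked against the tokens printed in section 6. The sketch below copies that output by hand, regroups the pieces by word, and compares `Democracy` with `democratic`. The variable names are assumptions; the grouping relies on the fact that the pieces of each word concatenate back to the word itself.

```python
# Token output from section 6, copied by hand.
tokens = ['D', 'e', 'mo', 'c', 'r', 'ac', 'y', 'and', 'de', 'mo', 'c', 'r',
          'at', 'ic', 'in', 'st', 'it', 'u', 'tion', 's', 'm', 'ust', 'be',
          'pro', 't', 'ec', 't', 'ed', '.']
words = ["Democracy", "and", "democratic", "institutions",
         "must", "be", "protected", "."]

# Consume tokens until they spell out each word in turn.
pieces_per_word = {}
i = 0
for word in words:
    pieces = []
    while "".join(pieces) != word:
        pieces.append(tokens[i])
        i += 1
    pieces_per_word[word] = pieces

for word, pieces in pieces_per_word.items():
    print(f"{word:<13} {len(pieces)} piece(s): {pieces}")

# Pieces that appear in both related words.
shared = set(pieces_per_word["Democracy"]) & set(pieces_per_word["democratic"])
print(sorted(shared))  # ['c', 'mo', 'r']
```

With a vocabulary of only 200 entries, the 8 words and punctuation marks in the sentence become 29 pieces, and only short, frequent words such as `and` and `be` survive as single tokens.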
