# AIG 230 – BPE Demo Notebook  
## Byte Pair Encoding (BPE) with the State of the Union Corpus

**Purpose of this notebook**  
This notebook demonstrates, end-to-end, how **Byte Pair Encoding (BPE)** works using a *real corpus*: the U.S. State of the Union addresses.

This notebook is intentionally simple and conceptual. Its goal is to build a correct mental model of subword tokenization, not to produce a production-ready tokenizer.


## 1. Setup

We use:
- NLTK to access the State of the Union corpus
- Hugging Face `tokenizers` to train a simple BPE tokenizer


In [8]:
# Install if needed
# !pip install nltk tokenizers

import nltk
from nltk.corpus import state_union

nltk.download("state_union", quiet=True)


True

## 2. Load and Inspect the Corpus

In [2]:
fileids = state_union.fileids()
len(fileids), fileids[:5]


(65,
 ['1945-Truman.txt',
  '1946-Truman.txt',
  '1947-Truman.txt',
  '1948-Truman.txt',
  '1949-Truman.txt'])

In [3]:
# Read the raw text of the first 10 addresses and join them into one string
texts = [state_union.raw(fid) for fid in fileids[:10]]
corpus_text = "\n".join(texts)

corpus_text[:500]


"PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\nOnly yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.\nYet, in this decisive hour, when worl"

## 3. Why BPE?

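BPE learns **subword units** from data instead of relying on a predefined list of words. Starting from individual characters, it repeatedly merges the most frequent pair of adjacent symbols into a new symbol. Frequent words end up as single tokens, while rare words are split into smaller pieces that can be reused across words.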
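
The merge idea can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not the Hugging Face implementation: the word counts are invented, and the names `learn_bpe`, `merge_pair`, and `apply_merges` are hypothetical helpers written for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Start with every word split into single characters.
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties are broken by first occurrence (an arbitrary choice for this toy).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {merge_pair(symbols, best): count for symbols, count in words.items()}
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Invented toy counts, for illustration only.
toy_counts = {"democracy": 5, "democratic": 3, "protect": 4}
merges = learn_bpe(toy_counts, num_merges=6)
print(merges)
print(apply_merges("democrat", merges))
```

With these toy counts, every learned merge extends the prefix shared by "democracy" and "democratic", so an unseen word such as "democrat" is split into that shared stem plus a leftover character.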


## 4. Train a Simple BPE Tokenizer

In [4]:
!pip install tokenizers


In [5]:
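from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# BPE model; anything that cannot be encoded maps to [UNK]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split text into words and punctuation before merges are learned
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=200,  # target size, counting special tokens and single characters
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)

# The whole corpus is passed as a single training string
tokenizer.train_from_iterator([corpus_text], trainer=trainer)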







## 5. Inspect the Learned Vocabulary

In [6]:
vocab = tokenizer.get_vocab()
list(vocab.items())[:30]


[('ce', 144),
 ('7', 23),
 ('ow', 143),
 ('K', 39),
 ('(', 10),
 ('ab', 155),
 ('th', 84),
 ('am', 153),
 ('U', 49),
 ('that', 137),
 ('ir', 163),
 ('em', 181),
 ('[SEP]', 3),
 ('ha', 146),
 ('2', 18),
 ('pro', 116),
 ('z', 81),
 ('d', 59),
 ('are', 134),
 ('j', 65),
 ('M', 41),
 ('The', 152),
 ('ear', 168),
 ('over', 167),
 ('for', 117),
 ('en', 91),
 ('end', 145),
 ('ve', 112),
 ('I', 37),
 ('W', 51)]
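
The items above come back in no particular order, which makes the vocabulary hard to read. Sorting by token id can help. The sketch below hard-codes a handful of `(token, id)` pairs copied from the output above so that it runs without the trained tokenizer; in the notebook, the same `sorted(...)` call could be applied to `vocab.items()`. The `kind` helper and its three labels are made up for this example.

```python
# A few (token, id) pairs copied from the vocabulary output above.
vocab_sample = {
    "ce": 144, "7": 23, "K": 39, "th": 84, "that": 137, "[SEP]": 3,
    "pro": 116, "d": 59, "The": 152, "over": 167, "en": 91, "I": 37,
}
special_tokens = {"[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"}


def kind(token):
    """Label a token as special, a single character, or the result of merges."""
    if token in special_tokens:
        return "special"
    return "single character" if len(token) == 1 else "merged"


# Sort by id so that tokens with low ids come first.
ordered = sorted(vocab_sample.items(), key=lambda item: item[1])
for token, token_id in ordered:
    print(f"{token_id:>4}  {token!r:<8} {kind(token)}")
```

In this sample, the special token has the lowest id, single characters come next, and multi-character tokens produced by merges have the highest ids.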

## 6. BPE Tokenization Example

In [7]:
sentence = "Democracy and democratic institutions must be protected."
tokenizer.encode(sentence).tokens


['D',
 'e',
 'mo',
 'c',
 'r',
 'ac',
 'y',
 'and',
 'de',
 'mo',
 'c',
 'r',
 'at',
 'ic',
 'in',
 'st',
 'it',
 'u',
 'tion',
 's',
 'm',
 'ust',
 'be',
 'pro',
 't',
 'ec',
 't',
 'ed',
 '.']
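
With a vocabulary of only 200 tokens, most words are split into several pieces. One way to read the output is to regroup the tokens by word. The sketch below copies the token list from the output above and realigns it with the sentence. It relies on the `Whitespace` pre-tokenizer splitting text with the pattern `\w+|[^\w\s]+`, so no token crosses a word boundary. The helper name `group_by_word` is made up for this example.

```python
import re

sentence = "Democracy and democratic institutions must be protected."
# Token list copied from the encoder output above.
tokens = ["D", "e", "mo", "c", "r", "ac", "y", "and", "de", "mo", "c", "r",
          "at", "ic", "in", "st", "it", "u", "tion", "s", "m", "ust", "be",
          "pro", "t", "ec", "t", "ed", "."]


def group_by_word(sentence, tokens):
    """Pair each word (or punctuation run) with the tokens that spell it."""
    words = re.findall(r"\w+|[^\w\s]+", sentence)
    groups, i = [], 0
    for word in words:
        pieces = []
        # Consume tokens until they spell the current word.
        # Raises IndexError if the tokens do not line up with the sentence.
        while "".join(pieces) != word:
            pieces.append(tokens[i])
            i += 1
        groups.append((word, pieces))
    return groups


groups = group_by_word(sentence, tokens)
for word, pieces in groups:
    print(f"{word:<13} {len(pieces)} tokens  {pieces}")

# Pieces that appear in both "Democracy" and "democratic".
shared = set(groups[0][1]) & set(groups[2][1])
print(sorted(shared))
```

The two related words share the pieces `mo`, `c`, and `r`, but they begin differently (`D` + `e` versus `de`) because this tokenizer is case-sensitive.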

## 7. Key Takeaways

- BPE tokenization is **learned from data**
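- It captures shared structure across related words: in the example above, "Democracy" and "democratic" share the pieces `mo`, `c`, and `r`
- BPE and its variants are used by many modern language models, such as GPT-2
- Neither NLTK nor spaCy is built around it: both focus on word- and sentence-level tokenization, which is why this notebook trains the tokenizer with Hugging Face `tokenizers`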


### Different purposes: classical NLP vs. LLMs

| Library / method             | Main goal                     | Tokenization type              |
| ---------------------------- | ----------------------------- | ------------------------------ |
| **NLTK**                     | linguistics + traditional NLP | word / sentence tokenization   |
| **spaCy**                    | production NLP pipelines      | rule-based word tokenization   |
| **BPE (Byte Pair Encoding)** | ML model input efficiency     | subword tokens for neural nets |
