# AIG 230 – Week 2 Lab
## From Raw Text to Corpus: Tokenization, Normalization, and Vocabulary

Industry Context: Exploring the State of the Union Corpus

## Learning Objectives
- Understand raw text, documents, and corpora
- Explore a real-world corpus
- Compare NLTK and spaCy preprocessing pipelines
- Perform tokenization, normalization, and vocabulary analysis

In [54]:

import nltk
import spacy
import string
from collections import Counter

from nltk import sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("state_union", quiet=True)

True

In [2]:
# Load spaCy English model
# Run once if needed:
!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Breaking it down:

- `spacy.load()` loads a pre-trained language pipeline.
- `"en_core_web_sm"` is the small English pipeline. It includes:
  - Tokenizer: splits text into tokens (words, punctuation, whitespace)
  - Part-of-speech tagger: identifies nouns, verbs, adjectives, etc.
  - Dependency parser: analyzes grammatical relationships and sets sentence boundaries
  - Lemmatizer: maps each token to its base form
  - Named entity recognizer: identifies people, places, organizations
- `nlp = ...` stores the loaded pipeline in a variable so you can use it to process text.

Note: the small pipeline does not ship with static word vectors. Those come with the larger `en_core_web_md` and `en_core_web_lg` pipelines.

## Part 1 – Obtaining the Corpus

In [3]:

from nltk.corpus import state_union
state_union.fileids()[:10]


['1945-Truman.txt',
 '1946-Truman.txt',
 '1947-Truman.txt',
 '1948-Truman.txt',
 '1949-Truman.txt',
 '1950-Truman.txt',
 '1951-Truman.txt',
 '1953-Eisenhower.txt',
 '1954-Eisenhower.txt',
 '1955-Eisenhower.txt']

Each file is a document. The collection is the corpus.
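The document/corpus distinction can be made concrete with the file ids alone. The sketch below assumes every id follows the `YYYY-Name.txt` pattern seen in the listing above; `parse_fileid` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# First ten file ids, copied from the listing above.
fileids = [
    "1945-Truman.txt", "1946-Truman.txt", "1947-Truman.txt",
    "1948-Truman.txt", "1949-Truman.txt", "1950-Truman.txt",
    "1951-Truman.txt", "1953-Eisenhower.txt", "1954-Eisenhower.txt",
    "1955-Eisenhower.txt",
]


def parse_fileid(fileid):
    """Split 'YYYY-Name.txt' into (year, name). Assumes that naming pattern."""
    stem = fileid.rsplit(".", 1)[0]   # drop the '.txt' extension
    year, name = stem.split("-", 1)   # split on the first hyphen only
    return int(year), name


# Each file id is one document; the Counter summarizes this slice of the corpus.
documents_per_speaker = Counter(parse_fileid(f)[1] for f in fileids)
print(documents_per_speaker)
```

In the notebook itself, the same helper could be applied to the full `state_union.fileids()` list instead of the hard-coded slice.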

In [4]:

len(state_union.fileids())


65

## Part 2 – Inspect Raw Text

In [5]:

sample_file = state_union.fileids()[0]
raw_text = state_union.raw(sample_file)
raw_text[:1000]


"PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\nOnly yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.\nYet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.\nIn His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.\nNo man could possibly fill the tremendous void left by the passing of that noble soul. No words can ease the aching hearts of untold millions of every race, creed and color. The world knows it has lost a heroic champion of justice and freedom.\nTragic fate has 

## Part 3 – Word Tokenization

📌 Tokenization splits text into meaningful units (tokens).
There is no universal standard: conventions vary by task, language, and library.
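To illustrate how much the convention matters, here is a minimal sketch that tokenizes one line of the speech in two simple ways using only the standard library. Both rules are illustrative assumptions, not what NLTK or spaCy do internally.

```python
import re

text = "Mr. Speaker, Mr. President, Members of the Congress:"

# Convention 1: split on whitespace; punctuation stays attached to words.
whitespace_tokens = text.split()

# Convention 2: runs of word characters, or any single non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

The first rule keeps `'Speaker,'` as one token, while the second splits `'Mr.'` into `'Mr'` and `'.'`. Library tokenizers add rules for abbreviations and contractions on top of patterns like these.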


In [7]:
from nltk.tokenize import word_tokenize
tokenize_nltk = word_tokenize(raw_text)
tokenize_nltk

['PRESIDENT',
 'HARRY',
 'S.',
 'TRUMAN',
 "'S",
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 '16',
 ',',
 '1945',
 'Mr.',
 'Speaker',
 ',',
 'Mr.',
 'President',
 ',',
 'Members',
 'of',
 'the',
 'Congress',
 ':',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 ',',
 'my',
 'friends',
 'and',
 'colleagues',
 ',',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 '.',
 'Only',
 'yesterday',
 ',',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'President',
 ',',
 'Franklin',
 'Delano',
 'Roosevelt',
 '.',
 'At',
 'a',
 'time',
 'like',
 'this',
 ',',
 'words',
 'are',
 'inadequate',
 '.',
 'The',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 '.',
 'Yet',
 ',',
 'in',
 'this',
 'decisive',
 'hour',
 ',',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'rapidly',
 ',',
 'our',
 'silence',
 'migh


In [8]:
doc = nlp(raw_text)  # run the spaCy pipeline on the raw text; returns an annotated Doc
doc

PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS
 
April 16, 1945

Mr. Speaker, Mr. President, Members of the Congress:
It is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.
Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.
Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.
In His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.
No man could possibly fill the tremendous void left by the passing of that noble soul. No words can ease the aching hearts of untold millions of every race, creed and color. The world knows it has lost a heroic champion of justice and freedom.
Tragic fate has thrust upon

In [12]:
tokens_spacy = [token.text for token in doc]  # list comprehension over the Doc's tokens
tokens_spacy

['PRESIDENT',
 'HARRY',
 'S.',
 'TRUMAN',
 "'S",
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 '\n \n',
 'April',
 '16',
 ',',
 '1945',
 '\n\n',
 'Mr.',
 'Speaker',
 ',',
 'Mr.',
 'President',
 ',',
 'Members',
 'of',
 'the',
 'Congress',
 ':',
 '\n',
 'It',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 ',',
 'my',
 'friends',
 'and',
 'colleagues',
 ',',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 '.',
 '\n',
 'Only',
 'yesterday',
 ',',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'President',
 ',',
 'Franklin',
 'Delano',
 'Roosevelt',
 '.',
 'At',
 'a',
 'time',
 'like',
 'this',
 ',',
 'words',
 'are',
 'inadequate',
 '.',
 'The',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 '.',
 '\n',
 'Yet',
 ',',
 'in',
 'this',
 'decisive',
 'hour',
 ',',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'ra

## Part 4 – Sentence Tokenization

In [14]:
from nltk.tokenize import sent_tokenize

sentences_nltk = sent_tokenize(raw_text)

sentences_nltk

["PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.",
 'Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.',
 'At a time like this, words are inadequate.',
 'The most eloquent tribute would be a reverent silence.',
 'Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.',
 'In His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.',
 'No man could possibly fill the tremendous void left by the passing of that noble soul.',
 'No words can ease the aching hearts of untold millions of every race, creed and color.',
 'The world knows it has lost a heroic champion of justice a

In [15]:
sentences_spacy = list(doc.sents)  # sentence boundaries come from the spaCy pipeline

sentences_spacy

[PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS
  
 April 16, 1945
 
 Mr. Speaker, Mr. President, Members of the Congress:,
 It is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.,
 Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.,
 At a time like this, words are inadequate.,
 The most eloquent tribute would be a reverent silence.,
 Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.,
 In His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.,
 No man could possibly fill the tremendous void left by the passing of that noble soul.,
 No words can ease the aching hearts of untold millions of every race, creed and color.,
 The world knows it has lost a heroic champion of justice and freedom.,
 Tr

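## Part 5 – Normalization
Normalization makes text more consistent. Here that means lowercasing every token and dropping tokens that are not purely alphabetic (punctuation, numbers, and abbreviations such as 'Mr.').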

In [19]:
def normalize(tokens):
    """Lowercase tokens and keep only purely alphabetic ones."""
    return [
        token.lower()
        for token in tokens
        if token.isalpha()
    ]

In [20]:
fun_normalized_nltk = normalize(tokenize_nltk)
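It can help to look at what this filter throws away, not only at what it keeps. The sketch below applies the same lowercase-plus-`isalpha()` rule to a handful of tokens copied from the NLTK output in Part 3; the variable names are illustrative.

```python
# Leading tokens copied from the NLTK tokenizer output in Part 3.
sample_tokens = ["PRESIDENT", "HARRY", "S.", "TRUMAN", "'S", "ADDRESS",
                 "April", "16", ",", "1945", "Mr.", "Speaker"]

# Same rule as normalize(): lowercase, keep purely alphabetic tokens.
kept = [t.lower() for t in sample_tokens if t.isalpha()]

# Everything the rule discards.
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)
print(dropped)
```

The dropped list contains not only punctuation and numbers but also `'S.'`, `"'S"`, and `'Mr.'`, so this simple rule trades some information for consistency.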

In [17]:
# compact equivalent of normalize() above
normalized_nltk = [t.lower() for t in tokenize_nltk if t.isalpha()]
normalized_nltk

['president',
 'harry',
 'truman',
 'address',
 'before',
 'a',
 'joint',
 'session',
 'of',
 'the',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'of',
 'the',
 'congress',
 'it',
 'is',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'i',
 'stand',
 'before',
 'you',
 'my',
 'friends',
 'and',
 'colleagues',
 'in',
 'the',
 'congress',
 'of',
 'the',
 'united',
 'states',
 'only',
 'yesterday',
 'we',
 'laid',
 'to',
 'rest',
 'the',
 'mortal',
 'remains',
 'of',
 'our',
 'beloved',
 'president',
 'franklin',
 'delano',
 'roosevelt',
 'at',
 'a',
 'time',
 'like',
 'this',
 'words',
 'are',
 'inadequate',
 'the',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 'yet',
 'in',
 'this',
 'decisive',
 'hour',
 'when',
 'world',
 'events',
 'are',
 'moving',
 'so',
 'rapidly',
 'our',
 'silence',
 'might',
 'be',
 'misunderstood',
 'and',
 'might',
 'give',
 'comfort',
 'to',
 'our',
 'enemies',
 'in',
 'his',
 'infinite',
 'wisdom',
 'almigh

## Part 6 – Stop Word Removal

📌 Stop words are common words that often add little semantic meaning.
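As a small illustration of the idea, the sketch below filters one clause of the speech against a hand-picked stop word set. `demo_stop_words` is an assumption made for this example; the lab itself uses NLTK's English list.

```python
# Hand-picked subset for illustration only; not NLTK's actual stop word list.
demo_stop_words = {"our", "be", "and", "to"}

# Normalized tokens for one clause of the speech.
tokens = ["our", "silence", "might", "be", "misunderstood", "and",
          "might", "give", "comfort", "to", "our", "enemies"]

# Keep only tokens that are not in the stop word set.
filtered = [t for t in tokens if t not in demo_stop_words]
print(filtered)
```

Using a `set` rather than a list keeps each membership test fast, which matters once the token list covers a whole corpus.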

In [30]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

In [32]:
filtered_nltk = [t for t in normalized_nltk if t not in stop_words]
filtered_nltk[:100]

['president',
 'harry',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'congress',
 'united',
 'states',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remains',
 'beloved',
 'president',
 'franklin',
 'delano',
 'roosevelt',
 'time',
 'like',
 'words',
 'inadequate',
 'eloquent',
 'tribute',
 'would',
 'reverent',
 'silence',
 'yet',
 'decisive',
 'hour',
 'world',
 'events',
 'moving',
 'rapidly',
 'silence',
 'might',
 'misunderstood',
 'might',
 'give',
 'comfort',
 'enemies',
 'infinite',
 'wisdom',
 'almighty',
 'god',
 'seen',
 'fit',
 'take',
 'us',
 'great',
 'man',
 'loved',
 'beloved',
 'humanity',
 'man',
 'could',
 'possibly',
 'fill',
 'tremendous',
 'void',
 'left',
 'passing',
 'noble',
 'soul',
 'words',
 'ease',
 'aching',
 'hearts',
 'untold',
 'millions',
 'every',
 'race',
 'creed',
 'color',
 'world',
 'knows',
 'lost',
 'heroic',
 'champ

## Part 7 – Vocabulary and Frequency

📌 The vocabulary is the set of unique tokens in a corpus.
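A minimal sketch of vocabulary and frequency analysis with `collections.Counter`, using a few tokens copied from the stop-word-removal output above. The type-token ratio shown here is one common summary statistic, added for illustration.

```python
from collections import Counter

# A few filtered tokens copied from the Part 6 output.
tokens = ["president", "harry", "truman", "address", "joint", "session",
          "congress", "april", "speaker", "president", "members", "congress",
          "heavy", "heart", "stand", "friends", "colleagues", "congress"]

freq = Counter(tokens)       # token -> number of occurrences
vocabulary = set(freq)       # unique tokens (types)

# Share of tokens that are distinct; lower values mean more repetition.
type_token_ratio = len(vocabulary) / len(tokens)

print(len(tokens), len(vocabulary), round(type_token_ratio, 2))
print(freq.most_common(2))
```

In the notebook, `Counter(filtered_nltk)` would give the same statistics for the full speech.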

# Stemming vs Lemmatization

Stemming and lemmatization are both normalization techniques, but they make very different trade-offs:

- Stemming is fast and rule-based, but it can distort meaning and often produces non-words.
- Lemmatization is slower but linguistically informed: it maps each word to a dictionary form.

In industry, the choice depends on task, domain, and interpretability requirements.

Common stemmers in NLTK:

| Stemmer          | Best for                                            | Code                         |
| ---------------- | --------------------------------------------------- | ---------------------------- |
| PorterStemmer    | the classic, most widely used baseline              | `PorterStemmer()`            |
| SnowballStemmer  | refined rules (Porter2); supports several languages | `SnowballStemmer("english")` |
| LancasterStemmer | aggressive stemming                                 | `LancasterStemmer()`         |
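To compare the three stemmers in the table side by side, a sketch like the following can be run on a few words from the speech. It assumes NLTK is installed; these stemmers should not need any extra NLTK data downloads. The word list is an arbitrary choice for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer from the table.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

sample_words = ["president", "democracy", "rapidly", "enemies"]

# Print each word next to the stem produced by every stemmer.
for word in sample_words:
    stems = {name: s.stem(word) for name, s in stemmers.items()}
    print(word, stems)
```

Differences between the columns show how much each rule set is willing to cut away.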


In [33]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem every token that survived stop-word removal
stemmed = [stemmer.stem(word) for word in filtered_nltk]

stemmed[:100]

['presid',
 'harri',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'presid',
 'member',
 'congress',
 'heavi',
 'heart',
 'stand',
 'friend',
 'colleagu',
 'congress',
 'unit',
 'state',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remain',
 'belov',
 'presid',
 'franklin',
 'delano',
 'roosevelt',
 'time',
 'like',
 'word',
 'inadequ',
 'eloqu',
 'tribut',
 'would',
 'rever',
 'silenc',
 'yet',
 'decis',
 'hour',
 'world',
 'event',
 'move',
 'rapidli',
 'silenc',
 'might',
 'misunderstood',
 'might',
 'give',
 'comfort',
 'enemi',
 'infinit',
 'wisdom',
 'almighti',
 'god',
 'seen',
 'fit',
 'take',
 'us',
 'great',
 'man',
 'love',
 'belov',
 'human',
 'man',
 'could',
 'possibl',
 'fill',
 'tremend',
 'void',
 'left',
 'pass',
 'nobl',
 'soul',
 'word',
 'eas',
 'ach',
 'heart',
 'untold',
 'million',
 'everi',
 'race',
 'creed',
 'color',
 'world',
 'know',
 'lost',
 'heroic',
 'champion',
 'justic',
 'freedom',
 'tragic',
 'fate',
 'thrust',
 'u

For example, “democracy” → “democraci”.

Stems are not necessarily real words.

### spaCy does not include a built-in stemmer.

This is not a limitation. It is a design choice.

spaCy prioritizes:

- linguistically informed processing
- lemmatization over stemming

However, in real pipelines, you can still perform stemming alongside spaCy, for example by passing spaCy's tokens to an NLTK stemmer.

In [40]:
spacy_lemmatized = [token.lemma_ for token in doc]  # lemma of every token, including punctuation
spacy_lemmatized

['PRESIDENT',
 'HARRY',
 'S.',
 'TRUMAN',
 "'S",
 'ADDRESS',
 'before',
 'a',
 'JOINT',
 'session',
 'of',
 'the',
 'CONGRESS',
 '\n \n',
 'April',
 '16',
 ',',
 '1945',
 '\n\n',
 'Mr.',
 'Speaker',
 ',',
 'Mr.',
 'President',
 ',',
 'Members',
 'of',
 'the',
 'Congress',
 ':',
 '\n',
 'it',
 'be',
 'with',
 'a',
 'heavy',
 'heart',
 'that',
 'I',
 'stand',
 'before',
 'you',
 ',',
 'my',
 'friend',
 'and',
 'colleague',
 ',',
 'in',
 'the',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 '.',
 '\n',
 'only',
 'yesterday',
 ',',
 'we',
 'lay',
 'to',
 'rest',
 'the',
 'mortal',
 'remain',
 'of',
 'our',
 'beloved',
 'President',
 ',',
 'Franklin',
 'Delano',
 'Roosevelt',
 '.',
 'at',
 'a',
 'time',
 'like',
 'this',
 ',',
 'word',
 'be',
 'inadequate',
 '.',
 'the',
 'most',
 'eloquent',
 'tribute',
 'would',
 'be',
 'a',
 'reverent',
 'silence',
 '.',
 '\n',
 'yet',
 ',',
 'in',
 'this',
 'decisive',
 'hour',
 ',',
 'when',
 'world',
 'event',
 'be',
 'move',
 'so',
 'rapidly',
 '

NLTK does support lemmatization, but it requires:

- a lemmatizer
- part-of-speech information to work well

By default, NLTK’s lemmatizer treats every word as a noun.

In [41]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')

lemmatized_nltk = [lemmatizer.lemmatize(word) for word in filtered_nltk]
lemmatized_nltk[:100]


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/olgaleikin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['president',
 'harry',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'president',
 'member',
 'congress',
 'heavy',
 'heart',
 'stand',
 'friend',
 'colleague',
 'congress',
 'united',
 'state',
 'yesterday',
 'laid',
 'rest',
 'mortal',
 'remains',
 'beloved',
 'president',
 'franklin',
 'delano',
 'roosevelt',
 'time',
 'like',
 'word',
 'inadequate',
 'eloquent',
 'tribute',
 'would',
 'reverent',
 'silence',
 'yet',
 'decisive',
 'hour',
 'world',
 'event',
 'moving',
 'rapidly',
 'silence',
 'might',
 'misunderstood',
 'might',
 'give',
 'comfort',
 'enemy',
 'infinite',
 'wisdom',
 'almighty',
 'god',
 'seen',
 'fit',
 'take',
 'u',
 'great',
 'man',
 'loved',
 'beloved',
 'humanity',
 'man',
 'could',
 'possibly',
 'fill',
 'tremendous',
 'void',
 'left',
 'passing',
 'noble',
 'soul',
 'word',
 'ease',
 'aching',
 'heart',
 'untold',
 'million',
 'every',
 'race',
 'creed',
 'color',
 'world',
 'know',
 'lost',
 'heroic',
 'champion',
 'justi

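Notice that many verbs are not lemmatized correctly: 'laid', 'moving', 'seen', and 'lost' come back unchanged.
This is because NLTK’s lemmatizer treats every word as a noun unless it is given a POS tag. The same default turns 'us' into 'u', since the word is handled like a plural noun.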

In [48]:
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger_eng')

def treebank_to_pos(tag: str):
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("N"):
        return wordnet.NOUN
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # default fallback

# POS tag your tokens
tagged = nltk.pos_tag(filtered_nltk)

tagged[:10]


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/olgaleikin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


[('president', 'NN'),
 ('harry', 'NN'),
 ('truman', 'NN'),
 ('address', 'NN'),
 ('joint', 'NN'),
 ('session', 'NN'),
 ('congress', 'NN'),
 ('april', 'JJ'),
 ('speaker', 'NN'),
 ('president', 'NN')]

In [49]:

lemmatized_nltk = [
    lemmatizer.lemmatize(word, pos=treebank_to_pos(tag))
    for word, tag in tagged
]

lemmatized_nltk[:100]

['president',
 'harry',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'president',
 'member',
 'congress',
 'heavy',
 'heart',
 'stand',
 'friend',
 'colleague',
 'congress',
 'united',
 'state',
 'yesterday',
 'lay',
 'rest',
 'mortal',
 'remain',
 'beloved',
 'president',
 'franklin',
 'delano',
 'roosevelt',
 'time',
 'like',
 'word',
 'inadequate',
 'eloquent',
 'tribute',
 'would',
 'reverent',
 'silence',
 'yet',
 'decisive',
 'hour',
 'world',
 'event',
 'move',
 'rapidly',
 'silence',
 'might',
 'misunderstand',
 'might',
 'give',
 'comfort',
 'enemy',
 'infinite',
 'wisdom',
 'almighty',
 'god',
 'see',
 'fit',
 'take',
 'u',
 'great',
 'man',
 'love',
 'beloved',
 'humanity',
 'man',
 'could',
 'possibly',
 'fill',
 'tremendous',
 'void',
 'leave',
 'pass',
 'noble',
 'soul',
 'word',
 'ease',
 'ache',
 'heart',
 'untold',
 'million',
 'every',
 'race',
 'creed',
 'color',
 'world',
 'know',
 'lose',
 'heroic',
 'champion',
 'justice',
 'fre

# From Words to Subwords: Byte Pair Encoding (BPE)

So far, we have treated words as the basic unit of meaning.
Modern NLP systems often go one step further and operate on subword units.

One of the most common subword tokenization methods is Byte Pair Encoding (BPE).

In large corpora like the State of the Union addresses, word-level tokenization creates several problems:

- Rare words occur too infrequently for a model to learn much about them
- New words appear over time (e.g. cybersecurity, biotechnology)
- Related words are treated as completely separate tokens

Subword tokenization addresses these problems by breaking words into frequently occurring pieces.

In [50]:
# We will use a small subset of real policy-related words that appear in State of the Union speeches.

words = [
    "democracy",
    "democratic",
    "democratization",
    "economy",
    "economic",
    "economics"
]

words
# At the word level, all of these are treated as separate tokens.

['democracy',
 'democratic',
 'democratization',
 'economy',
 'economic',
 'economics']

## Step 1 – Character-Level Representation

BPE starts by representing each word as a sequence of characters
(with a special end-of-word marker).

In [51]:
char_tokens = [list(word) + ["</w>"] for word in words]
char_tokens


[['d', 'e', 'm', 'o', 'c', 'r', 'a', 'c', 'y', '</w>'],
 ['d', 'e', 'm', 'o', 'c', 'r', 'a', 't', 'i', 'c', '</w>'],
 ['d',
  'e',
  'm',
  'o',
  'c',
  'r',
  'a',
  't',
  'i',
  'z',
  'a',
  't',
  'i',
  'o',
  'n',
  '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'y', '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', 's', '</w>']]

## Step 2 – Count Frequent Character Pairs

BPE repeatedly merges the most frequent adjacent character pairs across the corpus.

In [52]:
from collections import Counter

pair_counts = Counter()

for word in char_tokens:
    for i in range(len(word) - 1):
        pair = (word[i], word[i+1])
        pair_counts[pair] += 1

pair_counts.most_common(10)


[(('o', 'n'), 4),
 (('d', 'e'), 3),
 (('e', 'm'), 3),
 (('m', 'o'), 3),
 (('o', 'c'), 3),
 (('c', 'r'), 3),
 (('r', 'a'), 3),
 (('a', 't'), 3),
 (('t', 'i'), 3),
 (('i', 'c'), 3)]

## Step 3 – Merge Frequent Pairs (Conceptual)

The most frequent pair is ('o', 'n').
BPE merges it into a new token: "on".

This process repeats many times, gradually forming meaningful subwords.
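The count-and-merge loop described above can be sketched in a few lines. This is a minimal teaching version under simplifying assumptions (each word counted once, no word frequencies), not a production tokenizer. The helper names `count_pairs`, `merge_pair`, and `learn_bpe` are made up for this example.

```python
from collections import Counter

words = ["democracy", "democratic", "democratization",
         "economy", "economic", "economics"]


def count_pairs(corpus):
    """Count adjacent symbol pairs over all words in the corpus."""
    counts = Counter()
    for symbols in corpus:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in one word with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words, num_merges):
    """Return the learned merges and the segmented corpus."""
    corpus = [list(word) + ["</w>"] for word in words]
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(corpus)
        if not counts:
            break  # every word is already a single symbol
        # Ties are broken by Counter.most_common's ordering.
        best = counts.most_common(1)[0][0]
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


merges, segmented = learn_bpe(words, num_merges=8)
print(merges)
for symbols in segmented:
    print(symbols)
```

The first learned merge is `('o', 'n')`, matching the pair counts from Step 2. Later merges depend on how ties are broken, so the resulting pieces may differ from the hand-written example that follows.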

In [53]:
bpe_tokens_example = [
    ["democr", "acy</w>"],
    ["democr", "atic</w>"],
    ["democr", "atization</w>"],
    ["econ", "omy</w>"],
    ["econ", "omic</w>"],
    ["econ", "omics</w>"]
]

bpe_tokens_example


[['democr', 'acy</w>'],
 ['democr', 'atic</w>'],
 ['democr', 'atization</w>'],
 ['econ', 'omy</w>'],
 ['econ', 'omic</w>'],
 ['econ', 'omics</w>']]
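Once merges have been learned, they can be replayed on words that never appeared in the training list. The sketch below uses a hand-written merge list that builds the pieces "democr" and "econ". Both the list and the `apply_merges` helper are assumptions for illustration, not the output of a trained tokenizer.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying `merges` in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if (i + 1 < len(symbols)
                    and symbols[i] == left and symbols[i + 1] == right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, written by hand for this example.
example_merges = [
    ("d", "e"), ("de", "m"), ("dem", "o"), ("demo", "c"), ("democ", "r"),
    ("e", "c"), ("ec", "o"), ("eco", "n"),
]

print(apply_merges("democrat", example_merges))
print(apply_merges("economist", example_merges))
```

Neither word is in the original list, yet each one still starts with a familiar piece. This is how subword vocabularies cope with new or rare words.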