# AIG 230 – Week 2 Lab
## From Raw Text to Corpus: Tokenization, Normalization, and Vocabulary

Industry Context: Exploring the State of the Union Corpus

## Learning Objectives
- Understand raw text, documents, and corpora
- Explore a real-world corpus
- Compare NLTK and spaCy preprocessing pipelines
- Perform tokenization, normalization, and vocabulary analysis

In [1]:

import nltk
import spacy
import string
from collections import Counter

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("state_union")

[nltk_data] Downloading package punkt to /Users/arad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/arad/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     /Users/arad/nltk_data...
[nltk_data]   Package state_union is already up-to-date!


True

In [2]:
# Load spaCy English model
# Run once if needed:
!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Breaking it down:

- `spacy.load()` loads a pre-trained language pipeline.
- `"en_core_web_sm"` is the small English pipeline. It includes:
  - Tokenizer: splits text into tokens
  - Part-of-speech tagger: identifies nouns, verbs, adjectives, etc.
  - Dependency parser: analyzes grammatical relationships and supplies sentence boundaries
  - Lemmatizer: maps tokens to their base forms
  - Named entity recognizer: identifies people, places, organizations
- `nlp = ...` stores the loaded pipeline in a variable so you can use it to process text.

Note: the small model does not ship static word vectors. The medium and large models (`en_core_web_md`, `en_core_web_lg`) do.

## Part 1 – Obtaining the Corpus

In [3]:

from nltk.corpus import state_union
state_union.fileids()[:10]


['1945-Truman.txt',
 '1946-Truman.txt',
 '1947-Truman.txt',
 '1948-Truman.txt',
 '1949-Truman.txt',
 '1950-Truman.txt',
 '1951-Truman.txt',
 '1953-Eisenhower.txt',
 '1954-Eisenhower.txt',
 '1955-Eisenhower.txt']

Each file is a document. The collection is the corpus.

In [4]:

len(state_union.fileids())


65

## Part 2 – Inspect Raw Text

In [5]:

sample_file = state_union.fileids()[0]
raw_text = state_union.raw(sample_file)
raw_text[:1000]


"PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\nOnly yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.\nYet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.\nIn His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.\nNo man could possibly fill the tremendous void left by the passing of that noble soul. No words can ease the aching hearts of untold millions of every race, creed and color. The world knows it has lost a heroic champion of justice and freedom.\nTragic fate has 

## Part 3 – Word Tokenization

📌 Tokenization splits text into meaningful units (tokens).
There is no universal standard; conventions vary by task and language.
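
To make the "no universal standard" point concrete, here is a minimal sketch that tokenizes one sentence under two simple conventions using only the standard library. The sample sentence and the regular expression are illustrative assumptions; neither reproduces what NLTK or spaCy do internally.

```python
import re

# Illustrative sentence (an assumption for this sketch, not taken from the corpus)
text = "Mr. Speaker, it's a heavy-hearted day."

# Convention 1: split on whitespace only; punctuation stays attached to words
whitespace_tokens = text.split()

# Convention 2: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The two conventions disagree on abbreviations ("Mr."), contractions ("it's"), and hyphenated words, which is exactly where library tokenizers also make different choices.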

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
tokens = word_tokenize(raw_text)
tokens[:20]

['PRESIDENT',
 'HARRY',
 'S.',
 'TRUMAN',
 "'S",
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 'April',
 '16',
 ',',
 '1945',
 'Mr.',
 'Speaker',
 ',']

In [9]:
document = nlp(raw_text)
document

PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS
 
April 16, 1945

Mr. Speaker, Mr. President, Members of the Congress:
It is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.
Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt. At a time like this, words are inadequate. The most eloquent tribute would be a reverent silence.
Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.
In His infinite wisdom, Almighty God has seen fit to take from us a great man who loved, and was beloved by, all humanity.
No man could possibly fill the tremendous void left by the passing of that noble soul. No words can ease the aching hearts of untold millions of every race, creed and color. The world knows it has lost a heroic champion of justice and freedom.
Tragic fate has thrust upon

In [10]:
tokens_spacy = [token.text for token in document]
tokens_spacy[:20]

['PRESIDENT',
 'HARRY',
 'S.',
 'TRUMAN',
 "'S",
 'ADDRESS',
 'BEFORE',
 'A',
 'JOINT',
 'SESSION',
 'OF',
 'THE',
 'CONGRESS',
 '\n \n',
 'April',
 '16',
 ',',
 '1945',
 '\n\n',
 'Mr.']

## Part 4 – Sentence Tokenization

In [11]:
tokens_sent_nltk = sent_tokenize(raw_text)
tokens_sent_nltk[:5]

["PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\nIt is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.",
 'Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.',
 'At a time like this, words are inadequate.',
 'The most eloquent tribute would be a reverent silence.',
 'Yet, in this decisive hour, when world events are moving so rapidly, our silence might be misunderstood and might give comfort to our enemies.']

In [12]:
tokens_sent_spacy = [sent.text for sent in document.sents]
tokens_sent_spacy[:5]

["PRESIDENT HARRY S. TRUMAN'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS\n \nApril 16, 1945\n\nMr. Speaker, Mr. President, Members of the Congress:\n",
 'It is with a heavy heart that I stand before you, my friends and colleagues, in the Congress of the United States.\n',
 'Only yesterday, we laid to rest the mortal remains of our beloved President, Franklin Delano Roosevelt.',
 'At a time like this, words are inadequate.',
 'The most eloquent tribute would be a reverent silence.\n']

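## Part 5 – Normalization
Normalization makes text more consistent. Here we lowercase every token and keep only purely alphabetic tokens, which drops punctuation, numbers, and tokens such as "S." and "Mr." that contain a period.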

In [13]:
normalized_tokens = [token.lower() for token in tokens if token.isalpha()]
normalized_tokens[:20]

['president',
 'harry',
 'truman',
 'address',
 'before',
 'a',
 'joint',
 'session',
 'of',
 'the',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'of',
 'the',
 'congress',
 'it',
 'is']

In [14]:
normalized_tokens_spacy = [token.text.lower() for token in document if token.is_alpha]
normalized_tokens_spacy[:20]

['president',
 'harry',
 'truman',
 'address',
 'before',
 'a',
 'joint',
 'session',
 'of',
 'the',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'of',
 'the',
 'congress',
 'it',
 'is']

## Part 6 – Stop Word Removal

📌 Stop words are common words that often add little semantic meaning.

In [19]:
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /Users/arad/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
filtered_tokens = [token for token in normalized_tokens if token not in stop_words]
filtered_tokens[:20]

['president',
 'harry',
 'truman',
 'address',
 'joint',
 'session',
 'congress',
 'april',
 'speaker',
 'president',
 'members',
 'congress',
 'heavy',
 'heart',
 'stand',
 'friends',
 'colleagues',
 'congress',
 'united',
 'states']

## Part 7 – Vocabulary and Frequency

📌 The vocabulary is the set of unique tokens in a corpus.
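
A minimal sketch of building a vocabulary and counting frequencies with `collections.Counter`. The short token list here is a hypothetical stand-in so the example runs on its own; in the notebook you would likely pass `filtered_tokens` from Part 6 instead.

```python
from collections import Counter

# Hypothetical stand-in for filtered_tokens from Part 6
tokens = [
    "president", "congress", "heavy", "heart", "congress",
    "united", "states", "congress", "president",
]

vocabulary = set(tokens)   # unique tokens (types)
freq = Counter(tokens)     # how often each token occurs

print("Total tokens:", len(tokens))
print("Vocabulary size:", len(vocabulary))
print("Most common:", freq.most_common(2))

# Type-token ratio: a rough measure of lexical diversity
type_token_ratio = len(vocabulary) / len(tokens)
print("Type-token ratio:", round(type_token_ratio, 2))
```

The distinction between tokens (every occurrence) and types (unique entries) matters because vocabulary size, not token count, determines how many distinct items a model has to represent.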

# Stemming vs Lemmatization

Stemming and lemmatization are both normalization techniques, but they make very different trade-offs.

- Stemming is fast and rule-based, but it can distort meaning.
- Lemmatization is slower but linguistically informed.

In industry, the choice depends on task, domain, and interpretability requirements.

Stems are not necessarily real words. For example, the Porter stemmer maps “democracy” → “democraci”.
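
To see this distortion directly, the following minimal sketch runs NLTK's `PorterStemmer` over a few words. The word list is an illustrative assumption rather than a sample drawn from the corpus, and other stemmers (for example Snowball or Lancaster) may produce different stems.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word list (an assumption for this sketch, not drawn from the corpus)
sample_words = ["democracy", "democratic", "economy", "economics", "running"]

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The Porter stemmer needs no extra data download, which is part of why stemming is cheap to add to a pipeline.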

### spaCy does not include a built-in stemmer by default.

This is not a limitation. It is a design choice.

spaCy prioritizes:

- linguistically informed processing
- lemmatization over stemming

However, in real pipelines, you can still perform stemming alongside spaCy, for example by passing spaCy token text to an NLTK stemmer.

NLTK does support lemmatization, but it requires:

- a lemmatizer (such as `WordNetLemmatizer`)
- part-of-speech information to work well

By default, NLTK’s lemmatizer treats every word as a noun, so many verbs are not lemmatized correctly unless a POS tag is supplied.

# From Words to Subwords: Byte Pair Encoding (BPE)

So far, we have treated words as the basic unit of meaning.
Modern NLP systems often go one step further and operate on subword units.

One of the most common subword tokenization methods is Byte Pair Encoding (BPE).

In large corpora like the State of the Union addresses, word-level tokenization creates several problems:

- Rare words appear very infrequently.
- New words appear over time (e.g. cybersecurity, biotechnology).
- Related words are treated as completely separate tokens.

Subword tokenization solves this by breaking words into frequently occurring pieces.

In [33]:
# We will use a small subset of real policy-related words that appear in State of the Union speeches.

words = [
    "democracy",
    "democratic",
    "democratization",
    "economy",
    "economic",
    "economics"
]

words
# At the word level, all of these are treated as separate tokens.

['democracy',
 'democratic',
 'democratization',
 'economy',
 'economic',
 'economics']

## Step 1 – Character-Level Representation

BPE starts by representing each word as a sequence of characters
(with a special end-of-word marker).

In [34]:
char_tokens = [list(word) + ["</w>"] for word in words]
char_tokens


[['d', 'e', 'm', 'o', 'c', 'r', 'a', 'c', 'y', '</w>'],
 ['d', 'e', 'm', 'o', 'c', 'r', 'a', 't', 'i', 'c', '</w>'],
 ['d',
  'e',
  'm',
  'o',
  'c',
  'r',
  'a',
  't',
  'i',
  'z',
  'a',
  't',
  'i',
  'o',
  'n',
  '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'y', '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', '</w>'],
 ['e', 'c', 'o', 'n', 'o', 'm', 'i', 'c', 's', '</w>']]

## Step 2 – Count Frequent Character Pairs

BPE repeatedly merges the most frequent adjacent character pairs across the corpus.

In [35]:
from collections import Counter

pair_counts = Counter()

for word in char_tokens:
    for i in range(len(word) - 1):
        pair = (word[i], word[i+1])
        pair_counts[pair] += 1

pair_counts.most_common(10)


[(('o', 'n'), 4),
 (('d', 'e'), 3),
 (('e', 'm'), 3),
 (('m', 'o'), 3),
 (('o', 'c'), 3),
 (('c', 'r'), 3),
 (('r', 'a'), 3),
 (('a', 't'), 3),
 (('t', 'i'), 3),
 (('i', 'c'), 3)]

## Step 3 – Merge Frequent Pairs (Conceptual)

The most frequent pair is ('o', 'n').
BPE merges it into a new token: "on".

This process repeats many times, gradually forming meaningful subwords.

In [36]:
# Illustrative segmentation, written by hand (not computed by running BPE)
bpe_tokens_example = [
    ["democr", "acy</w>"],
    ["democr", "atic</w>"],
    ["democr", "atization</w>"],
    ["econ", "omy</w>"],
    ["econ", "omic</w>"],
    ["econ", "omics</w>"]
]

bpe_tokens_example


[['democr', 'acy</w>'],
 ['democr', 'atic</w>'],
 ['democr', 'atization</w>'],
 ['econ', 'omy</w>'],
 ['econ', 'omic</w>'],
 ['econ', 'omics</w>']]
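
The count-and-merge loop from Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration on the same six words, not a production tokenizer: each word is counted once (real BPE training usually weights pairs by word frequency), ties are broken by first occurrence, and the number of merges (`num_merges`) is a hypothetical choice. The helper names `count_pairs` and `merge_pair` are introduced here for illustration.

```python
from collections import Counter

words = ["democracy", "democratic", "democratization",
         "economy", "economic", "economics"]

def count_pairs(corpus):
    """Count adjacent symbol pairs across all words."""
    counts = Counter()
    for symbols in corpus:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_corpus = []
    for symbols in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus

# Start from characters plus an end-of-word marker, as in Step 1
corpus = [list(word) + ["</w>"] for word in words]

num_merges = 10  # hypothetical setting; real vocabularies use thousands of merges
merges = []
for _ in range(num_merges):
    counts = count_pairs(corpus)
    if not counts:
        break
    best_pair, _freq = counts.most_common(1)[0]
    merges.append(best_pair)
    corpus = merge_pair(corpus, best_pair)

print("Learned merges:", merges)
for symbols in corpus:
    print(symbols)
```

The segments this loop produces will generally differ from the hand-written example above, because the result depends on the corpus, the tie-breaking rule, and the number of merges.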