<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/1.words/HW1_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 1: Tokenization

In this homework, you'll compare the tokenization outputs of different classes of tokenizers. This homework is also an opportunity to check in on your Python proficiency: for all of the operations below (downloading a file, reading it in, counting objects), you should either be comfortable implementing them already or know how to find out how to do so yourself. If you find yourself struggling with them, we encourage you to take this class at a later date, with a bit more Python experience under your belt.

We've added some space for you to write the code for each section, but feel free to create more code cells if you'd like.

## Part 1

Tokenize the following document with each of these models. Feel free to use the linked documentation (and AI assistance) for this low-level operation, but remember that you must be able to explain what your code is doing. For each of the tokenizers below, we want to see a list of tokens for this document (not numeric token IDs, but legible words), e.g., \["London", ".", ...\]

* NLTK `word_tokenize` (https://www.nltk.org/book/ch03.html)
* spaCy tokenization (https://spacy.io/usage/spacy-101#annotations-token)
* Tiktoken BPE tokenization (https://github.com/openai/tiktoken) -- cl100k_base (GPT-3.5, GPT-4).



In [41]:
document = "London. Michaelmas term lately over, and the Lord Chancellor sitting in Lincoln’s Inn Hall. Implacable November weather. As much mud in the streets as if the waters had but newly retired from the face of the earth, and it would not be wonderful to meet a Megalosaurus, forty feet long or so, waddling like an elephantine lizard up Holborn Hill. Smoke lowering down from chimney-pots, making a soft black drizzle, with flakes of soot in it as big as full-grown snowflakes—gone into mourning, one might imagine, for the death of the sun. Dogs, undistinguishable in mire. Horses, scarcely better; splashed to their very blinkers. Foot passengers, jostling one another’s umbrellas in a general infection of ill temper, and losing their foot-hold at street-corners, where tens of thousands of other foot passengers have been slipping and sliding since the day broke (if this day ever broke), adding new deposits to the crust upon crust of mud, sticking at those points tenaciously to the pavement, and accumulating at compound interest."

In [42]:
# make sure dependencies are installed
!pip install nltk
!pip install spacy
!pip install tiktoken



In [43]:
# NLTK word_tokenize
import nltk
nltk.download('punkt_tab')
from nltk import word_tokenize

def nltkTokenize(document):
  tokens = word_tokenize(document)
  return tokens

print(nltkTokenize(document))

['London', '.', 'Michaelmas', 'term', 'lately', 'over', ',', 'and', 'the', 'Lord', 'Chancellor', 'sitting', 'in', 'Lincoln', '’', 's', 'Inn', 'Hall', '.', 'Implacable', 'November', 'weather', '.', 'As', 'much', 'mud', 'in', 'the', 'streets', 'as', 'if', 'the', 'waters', 'had', 'but', 'newly', 'retired', 'from', 'the', 'face', 'of', 'the', 'earth', ',', 'and', 'it', 'would', 'not', 'be', 'wonderful', 'to', 'meet', 'a', 'Megalosaurus', ',', 'forty', 'feet', 'long', 'or', 'so', ',', 'waddling', 'like', 'an', 'elephantine', 'lizard', 'up', 'Holborn', 'Hill', '.', 'Smoke', 'lowering', 'down', 'from', 'chimney-pots', ',', 'making', 'a', 'soft', 'black', 'drizzle', ',', 'with', 'flakes', 'of', 'soot', 'in', 'it', 'as', 'big', 'as', 'full-grown', 'snowflakes—gone', 'into', 'mourning', ',', 'one', 'might', 'imagine', ',', 'for', 'the', 'death', 'of', 'the', 'sun', '.', 'Dogs', ',', 'undistinguishable', 'in', 'mire', '.', 'Horses', ',', 'scarcely', 'better', ';', 'splashed', 'to', 'their', 'very

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [44]:
# Spacy tokenize
import spacy

nlp = spacy.load("en_core_web_sm")
document_nlp = nlp(document)

def spacyTokenize(document_nlp):
  # collect the surface text of each token in the processed Doc
  return [token.text for token in document_nlp]

print(spacyTokenize(document_nlp))

['London', '.', 'Michaelmas', 'term', 'lately', 'over', ',', 'and', 'the', 'Lord', 'Chancellor', 'sitting', 'in', 'Lincoln', '’s', 'Inn', 'Hall', '.', 'Implacable', 'November', 'weather', '.', 'As', 'much', 'mud', 'in', 'the', 'streets', 'as', 'if', 'the', 'waters', 'had', 'but', 'newly', 'retired', 'from', 'the', 'face', 'of', 'the', 'earth', ',', 'and', 'it', 'would', 'not', 'be', 'wonderful', 'to', 'meet', 'a', 'Megalosaurus', ',', 'forty', 'feet', 'long', 'or', 'so', ',', 'waddling', 'like', 'an', 'elephantine', 'lizard', 'up', 'Holborn', 'Hill', '.', 'Smoke', 'lowering', 'down', 'from', 'chimney', '-', 'pots', ',', 'making', 'a', 'soft', 'black', 'drizzle', ',', 'with', 'flakes', 'of', 'soot', 'in', 'it', 'as', 'big', 'as', 'full', '-', 'grown', 'snowflakes', '—', 'gone', 'into', 'mourning', ',', 'one', 'might', 'imagine', ',', 'for', 'the', 'death', 'of', 'the', 'sun', '.', 'Dogs', ',', 'undistinguishable', 'in', 'mire', '.', 'Horses', ',', 'scarcely', 'better', ';', 'splashed', 

In [45]:
# Tiktoken BPE tokenization
import tiktoken

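def bpeTokenize(document):
  # cl100k_base is the BPE vocabulary named in the assignment
  enc = tiktoken.get_encoding("cl100k_base")
  token_ids = enc.encode(document)

  # decode each ID on its own so every token is shown as legible text
  # (a token holding only part of a multi-byte character decodes to U+FFFD)
  return [enc.decode([token_id]) for token_id in token_ids]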

print(bpeTokenize(document))

['London', '.', ' Michael', 'mas', ' term', ' lately', ' over', ',', ' and', ' the', ' Lord', ' Chancellor', ' sitting', ' in', ' Lincoln', '’s', ' Inn', ' Hall', '.', ' Impl', 'ac', 'able', ' November', ' weather', '.', ' As', ' much', ' mud', ' in', ' the', ' streets', ' as', ' if', ' the', ' waters', ' had', ' but', ' newly', ' retired', ' from', ' the', ' face', ' of', ' the', ' earth', ',', ' and', ' it', ' would', ' not', ' be', ' wonderful', ' to', ' meet', ' a', ' Meg', 'al', 'os', 'aurus', ',', ' forty', ' feet', ' long', ' or', ' so', ',', ' w', 'add', 'ling', ' like', ' an', ' elephant', 'ine', ' lizard', ' up', ' Hol', 'born', ' Hill', '.', ' Smoke', ' lowering', ' down', ' from', ' chimney', '-p', 'ots', ',', ' making', ' a', ' soft', ' black', ' dr', 'izzle', ',', ' with', ' flakes', ' of', ' so', 'ot', ' in', ' it', ' as', ' big', ' as', ' full', '-g', 'rown', ' snow', 'fl', 'akes', '—', 'gone', ' into', ' mourning', ',', ' one', ' might', ' imagine', ',', ' for', ' the'
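
Reading three long token lists by eye is error-prone, so a small helper can surface the places where two tokenizations disagree. The sketch below is one possible approach using the standard library's `difflib`; the function name `token_differences` is hypothetical, and the two short lists are hand-copied fragments of the NLTK and spaCy outputs above, not the full lists.

```python
from difflib import SequenceMatcher

def token_differences(a, b):
    """Return (tokens_from_a, tokens_from_b) pairs wherever two token lists disagree."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]

# Hand-copied fragments of the outputs above (not the full token lists).
nltk_excerpt = ['Lincoln', '’', 's', 'Inn', 'Hall', '.', 'from', 'chimney-pots', ',']
spacy_excerpt = ['Lincoln', '’s', 'Inn', 'Hall', '.', 'from', 'chimney', '-', 'pots', ',']

for left, right in token_differences(nltk_excerpt, spacy_excerpt):
    print(left, "vs", right)
```

On these fragments the helper reports two disagreements: the possessive after 'Lincoln' and the hyphenated 'chimney-pots'. Passing the full lists returned by `nltkTokenize` and `spacyTokenize` should work the same way, though the alignment may be slower on book-length input.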

## Part 2

Examine the different tokenizations for the passage above -- i.e., actually read through them and see how they differ. In a paragraph or two, characterize the salient differences in tokenization between a.) NLTK and Spacy and b.) NLTK and BPE.  Reference real examples in the text. At the end of this homework, you want to be able to discuss the practical differences between tokenization methods.

**Response**:

Reading through the three outputs, the word 'chimney-pots' shows the differences most clearly. NLTK keeps it as a single token, spaCy splits it into three tokens ('chimney', '-', 'pots'), and BPE produces the hybrid ' chimney', '-p', 'ots'.

a.) NLTK and spaCy both produce whole words and punctuation marks, but they disagree on word-internal punctuation. NLTK leaves hyphenated compounds intact ('chimney-pots', 'full-grown') and even keeps 'snowflakes—gone' as one token, while spaCy splits at every hyphen and dash ('full', '-', 'grown'; 'snowflakes', '—', 'gone'). They also treat the curly apostrophe differently: NLTK gives 'Lincoln', '’', 's', whereas spaCy gives 'Lincoln', '’s', keeping the possessive as a unit. spaCy's tokens are therefore still human-interpretable, just smaller, while NLTK's output makes the most intuitive sense when compounds like 'chimney-pots' should count as one word.

b.) NLTK and BPE differ more fundamentally. BPE tokens are subword units that carry their leading space: common words survive whole (' the', ' streets'), but rarer ones are broken into pieces that need not be meaningful, such as ' Meg', 'al', 'os', 'aurus' or ' w', 'add', 'ling'. I find this of particular interest because the BPE vocabulary has nothing to do with the meaning of the words. It is built by iteratively merging the most frequent adjacent pairs of symbols in the training data, so it sacrifices human interpretability for a compact, computationally efficient vocabulary.
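
To make the merge procedure behind BPE concrete, here is a minimal character-level sketch that learns merges from a toy word list. It is an illustration under simplifying assumptions, not tiktoken's implementation: `cl100k_base` ships a pretrained merge table and operates on bytes after a regex pre-tokenization step, whereas this sketch trains on five hypothetical words and works on characters. The function name `learn_bpe_merges` and the toy corpus are made up for the example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical toy corpus; ('o', 'o') is the most frequent pair (5 occurrences).
corpus = ["foot", "foot", "foot", "soot", "food"]
merges, vocab = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(dict(vocab))
```

The first merge here is ('o', 'o') purely because it is the most frequent pair, which is the sense in which the resulting pieces reflect frequency rather than meaning.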


## Part 3

Download the full text of *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) and tokenize it using each of the methods above. How many word types (in the formal sense we discussed in class) does each tokenization method have for that complete file?

In [46]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt
# read the whole novel into one string, closing the file afterwards
with open('1342_pride_and_prejudice.txt', encoding='utf-8') as f:
  prideAndPrejudice = f.read()

nltkTokensPrideAndPrejudice = nltkTokenize(prideAndPrejudice)
spacyTokens = spacyTokenize(nlp(prideAndPrejudice))
bpeTokens = bpeTokenize(prideAndPrejudice)
# number of distinct token strings (types) for NLTK, spaCy, and BPE: 7475, 6780, and 8364 respectively
print([len(set(nltkTokensPrideAndPrejudice)), len(set(spacyTokens)), len(set(bpeTokens))])

--2025-09-03 23:01:12--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 691804 (676K) [text/plain]
Saving to: ‘1342_pride_and_prejudice.txt.9’


2025-09-03 23:01:12 (2.72 MB/s) - ‘1342_pride_and_prejudice.txt.9’ saved [691804/691804]

[7475, 6780, 8364]
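
The counts above take a type to be a distinct token string, so `len(set(tokens))` is case-sensitive ('As' and 'as' are two types) and, for BPE, ' the' with a leading space differs from 'the'. The sketch below illustrates the type/token distinction on a small fragment of the Part 1 passage; the helper name `count_types_and_tokens` is hypothetical.

```python
from collections import Counter

def count_types_and_tokens(tokens):
    """Return (number of distinct types, number of tokens) for a token list."""
    return len(set(tokens)), len(tokens)

# Toy sample: a fragment of the Part 1 passage, already tokenized.
sample = ['the', 'face', 'of', 'the', 'earth', ',', 'and', 'it']
n_types, n_tokens = count_types_and_tokens(sample)
print(n_types, n_tokens)  # 7 8 ('the' occurs twice)

# Counter exposes the per-type frequencies behind the type count.
print(Counter(sample).most_common(1))  # [('the', 2)]
```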


## Part 4

Which text has the greater type-token ratio, *Pride and Prejudice* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/1342_pride_and_prejudice.txt) or *Emma* (https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt)?  Calculate the TTR for both texts using the NLTK tokenizer, but only use the first 1,000 tokens from each text when calculating its TTR.

In [47]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt
with open('158_emma.txt', encoding='utf-8') as f:
  emma = f.read()

nltkTokensEmma = nltkTokenize(emma)

nltkTokensPrideAndPrejudice = nltkTokensPrideAndPrejudice[:1000]
nltkTokensEmma = nltkTokensEmma[:1000]

prideAndPrejudiceTTR = len(set(nltkTokensPrideAndPrejudice)) / len(nltkTokensPrideAndPrejudice)
emmaTTR = len(set(nltkTokensEmma)) / len(nltkTokensEmma)

print(prideAndPrejudiceTTR, emmaTTR) # for the first 1000 tokens, Emma has the higher TTR

--2025-09-03 23:02:00--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/158_emma.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 891405 (871K) [text/plain]
Saving to: ‘158_emma.txt.9’


2025-09-03 23:02:01 (3.14 MB/s) - ‘158_emma.txt.9’ saved [891405/891405]

0.36 0.41


In [48]:
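pp_ttr = 0.36  # TTR of the first 1,000 NLTK tokens of Pride and Prejudice (computed above)
emma_ttr = 0.41  # TTR of the first 1,000 NLTK tokens of Emma (computed above)
answer = "Emma"  # the text with the higher TTR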

print("The TTR for 'Pride and Prejudice' is", pp_ttr)
print("The TTR for 'Emma' is", emma_ttr)
print(f"{answer} has the higher TTR.")

The TTR for 'Pride and Prejudice' is 0.36
The TTR for 'Emma' is 0.41
Emma has the higher TTR.
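
The TTR computation used above (distinct tokens divided by total tokens, over a fixed-size window) can be wrapped in a small helper. This is a sketch with a hypothetical function name, `type_token_ratio`, and a toy token list. The fixed window matters because TTR tends to fall as a text gets longer, so comparing equal-length samples keeps the comparison fair.

```python
def type_token_ratio(tokens, limit=1000):
    """TTR over the first `limit` tokens: distinct tokens / total tokens."""
    window = tokens[:limit]
    if not window:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(window)) / len(window)

# Hypothetical mini-sample taken from the Part 1 passage.
toy = ['as', 'big', 'as', 'full-grown', 'snowflakes']
print(type_token_ratio(toy))  # 4 types / 5 tokens = 0.8
```

With the notebook's variables, `type_token_ratio(nltkTokenize(emma))` would compute the same quantity as the cell above without overwriting the full token lists.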
