# Tokenization, Lemmatization and Stemming

In this notebook, we cover three important concepts in NLP: tokenization, lemmatization and stemming.

In summary:
1. Tokenization: Splitting text into smaller units (tokens). (happiness -> ['hap', 'pi', 'ness'])
2. Stemming: Cutting words down to a crude root form. (happiness -> happi)
3. Lemmatization: Reducing words to their dictionary base form. (is -> be)

The pipeline is typically: raw text → tokenization → (optional) stemming/lemmatization → further processing
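
As a rough illustration of this pipeline shape, here is a minimal sketch. The function name `run_pipeline` and its parameters are made up for this example; the tokenizer and the optional normalization step are passed in as plain functions, with `str.split` and `str.lower` standing in for the real components introduced later.

```python
from collections import Counter

def run_pipeline(text, tokenize, normalize=None):
    """raw text -> tokens -> (optional) normalized tokens -> counts."""
    tokens = tokenize(text)
    if normalize is not None:
        # Stand-in for stemming/lemmatization: any function str -> str works here.
        tokens = [normalize(t) for t in tokens]
    # "Further processing": here we simply count the tokens.
    return Counter(tokens)

counts = run_pipeline("The cat saw the other cat", tokenize=str.split, normalize=str.lower)
print(counts.most_common(2))
```

Swapping in a different `tokenize` or `normalize` function changes one stage without touching the others.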

We start by loading the books that you downloaded during your first assignment.

In [1]:
from collections import Counter
from scrape_books import load_books

books = load_books("downloads", n_books=5, seed=42)

We get a list of `n_books` books:

In [2]:
books

['The Project Gutenberg eBook of The lesser Key of Solomon, Goetia, the book of evil spirits\n    \nThis ebook...our email newsletter to hear about new eBooks.\n\n\n',
 'The Project Gutenberg eBook of Simple Sabotage Field Manual\n    \nThis ebook is for the use of anyone anywh...our email newsletter to hear about new eBooks.\n\n\n',
 'The Project Gutenberg eBook of Ulysses\n    \nThis ebook is for the use of anyone anywhere in the United Sta...our email newsletter to hear about new eBooks.\n\n\n',
 'The Project Gutenberg eBook of Jane Eyre: An Autobiography\n    \nThis ebook is for the use of anyone anywhe...our email newsletter to hear about new eBooks.\n\n\n',
 'The Project Gutenberg eBook of The Adventures of Ferdinand Count Fathom — Complete\n    \nThis ebook is for ...to our email newsletter to hear about new eBooks.\n\n\n']

Let's first create a list of all words in the books via `.split()`. That's a very simple and naive way to tokenize text: it just splits the text on whitespace (spaces, tabs and newlines).
Note that this is often not enough because it doesn't handle punctuation, contractions, special characters, or subword structures, which are important for accurate NLP processing. That's why we look into the BPE tokenizer later.

In [3]:
raw_tokens = []
for text in books:
    raw_tokens.extend(text.split())

In [4]:
raw_tokens[:20]

['The',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'The',
 'lesser',
 'Key',
 'of',
 'Solomon,',
 'Goetia,',
 'the',
 'book',
 'of',
 'evil',
 'spirits',
 'This',
 'ebook',
 'is',
 'for']
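
In the output above, punctuation stays glued to the words (`'Solomon,'`, `'Goetia,'`). One slightly less naive option is a regular expression that separates word characters from punctuation. This is a sketch rather than a full tokenizer, and `regex_tokenize` is a name made up for this example.

```python
import re

def regex_tokenize(text):
    # \w+ matches a run of letters/digits/underscores,
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The lesser Key of Solomon, Goetia, the book of evil spirits"
print(sample.split()[4:6])          # whitespace splitting keeps the commas attached
print(regex_tokenize(sample)[4:7])  # the regex emits the comma as its own token
```

Contractions such as "don't" are still handled poorly by this pattern (they are split into three pieces), which is one more reason to look at learned subword tokenizers.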

Let's compute some statistics on `raw_tokens`.

In [5]:
counter = Counter(raw_tokens)

print("Total tokens:", len(raw_tokens))
print("Unique tokens:", len(counter))
print("Top 20 tokens:", counter.most_common(20))

Total tokens: 657732
Unique tokens: 78948
Top 20 tokens: [('the', 32789), ('of', 21236), ('and', 18954), ('to', 15864), ('a', 14133), ('in', 11221), ('I', 9948), ('his', 8184), ('he', 7148), ('was', 6440), ('with', 6259), ('that', 5879), ('her', 4705), ('for', 4388), ('you', 4108), ('it', 3874), ('as', 3778), ('had', 3685), ('is', 3553), ('on', 3553)]


Note that these counts are case-sensitive: 'The' and 'the' are counted as different tokens.

In [6]:
counter['The']

1915

To deal with the case sensitivity, we can first call `.lower()` on each token.

In [7]:
lower_tokens = [t.lower() for t in raw_tokens]
lower_counter = Counter(lower_tokens)

print("Unique tokens before:", len(counter))
print("Unique tokens after:", len(lower_counter))

print("Before:")
print(counter["the"], counter["The"], counter["THE"])

print("After:")
print(lower_counter["the"])

Unique tokens before: 78948
Unique tokens after: 72440
Before:
32789 1915 383
After:
35087


### Tokenization

Tokenization is the process of splitting text into smaller units called tokens, such as words, subwords, or characters.
It is usually the first step in an NLP pipeline, turning raw text into pieces that a computer can process and analyze.
Modern systems often use subword tokenization (like BPE) so they can handle new or rare words by breaking them into smaller, frequently occurring parts.

We start by introducing BPE (Byte Pair Encoding): a subword tokenization method that iteratively merges the most frequent pairs of characters or symbols to build a compact, reusable vocabulary. In the following, we learn how to tokenize a text with BPE in Python.
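
To make the merge loop concrete, here is a minimal pure-Python sketch of BPE training. It is a simplified illustration, not the implementation used by any library: the function name `learn_bpe_merges` is made up, and ties between equally frequent pairs are broken by taking the pair seen first, which is an assumption (real implementations may break ties differently).

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Return the list of symbol pairs merged, in the order they were learned."""
    # Each word starts out as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; on ties, max() keeps the first one encountered.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges("low lower lowest newer wider best".split(), num_merges=3)
print(merges)
```

Each merge adds exactly one new symbol to the vocabulary, so the final vocabulary size is the number of base characters plus the number of merges.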

We use the tokenizers library from Hugging Face, which implements BPE efficiently in Rust and exposes it in Python.

In [83]:
from tokenizers import Tokenizer # main object for tokenization
from tokenizers.models import BPE # model implementing Byte Pair Encoding.
from tokenizers.trainers import BpeTrainer # trains the BPE vocabulary
from tokenizers.pre_tokenizers import Whitespace # simple pre-tokenizer splitting on whitespace and punctuation

Instead of applying BPE immediately to the whole text, let's first focus on a small example.

In [84]:
corpus = ["low lower lowest newer wider best"]

Initialize a BPE Tokenizer first. Pre-tokenization ensures BPE merges operate inside words rather than on raw text.

In [85]:
tokenizer = Tokenizer(BPE())  # BPE model (no unknown token configured)
tokenizer.pre_tokenizer = Whitespace()         # Split text into words and punctuation first

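Train the tokenizer:
- The trainer counts frequent symbol pairs and merges them iteratively.
- `vocab_size` limits the total number of tokens in the vocabulary (single characters plus merged subwords). Our small corpus contains 11 distinct characters, so `vocab_size=23` leaves room for 12 merges.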

In [None]:
trainer = BpeTrainer(vocab_size=23, min_frequency=1, special_tokens=[])
tokenizer.train_from_iterator(corpus, trainer=trainer)

Let's see what the created vocabulary (the collection of all tokens) looks like.

In [87]:
tokenizer.get_vocab()

{'l': 4,
 'ew': 18,
 'r': 7,
 'i': 3,
 'ider': 19,
 'est': 15,
 'o': 6,
 'd': 1,
 'wider': 21,
 'new': 20,
 'lower': 22,
 'lo': 12,
 'best': 16,
 'n': 5,
 'er': 11,
 'der': 17,
 'es': 14,
 'e': 2,
 'low': 13,
 'w': 10,
 's': 8,
 'b': 0,
 't': 9}

You see that the word 'lowest' is not included in this vocabulary. Let's see how it's split into tokens.

In [88]:
output = tokenizer.encode("lowest")
print("Tokens:", output.tokens)
print("IDs:", output.ids)

Tokens: ['low', 'est']
IDs: [13, 15]


Let's continue by tokenizing our books.

In [130]:
tokenizer = Tokenizer(BPE())  # BPE model (no unknown token configured)
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=20000, min_frequency=1, special_tokens=[])
tokenizer.train_from_iterator(books, trainer=trainer)

for w in ['lowest', 'bathroom', 'availability', 'heelllooooo']:
    output = tokenizer.encode(w)
    print("Tokens:", output.tokens)




Tokens: ['lowest']
Tokens: ['bath', 'room']
Tokens: ['avail', 'ability']
Tokens: ['hee', 'll', 'lo', 'oooo']


We can also reverse the encoding operation by going back from ids to text.

In [131]:
ids = tokenizer.encode('bathroom').ids
print(tokenizer.encode('bathroom').tokens)
print(tokenizer.decode(ids))


['bath', 'room']
bath room


Note that decoding the ids for 'bathroom' yields 'bath room'. With no decoder configured, `decode` simply joins the tokens with spaces, and the tokens themselves carry no information about whether there was originally a space between them.

BPE tokenizers used in real LLMs add a word-boundary symbol such as:
- `</w>` (classic BPE, as in the textbook)
- `Ġ` (used by GPT-2 / RoBERTa)

These markers record where a word ends (`</w>`) or whether a token was preceded by a space (`Ġ`), so word boundaries can be restored when decoding. To achieve this we can, for example, add `end_of_word_suffix="</w>"` to the `BpeTrainer`. This introduces a new symbol that marks the end of a word.

In [135]:
tokenizer = Tokenizer(BPE())  # BPE model (no unknown token configured)
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=20000, min_frequency=1, special_tokens=[], end_of_word_suffix="</w>")
tokenizer.train_from_iterator(books, trainer=trainer)
print(tokenizer.encode('bathroom').tokens)

from tokenizers.decoders import BPEDecoder
tokenizer.decoder = BPEDecoder(suffix="</w>")
ids = tokenizer.encode('bathroom').ids
tokenizer.decode(ids)




['bath', 'room</w>']


'bathroom'
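
Conceptually, the end-of-word suffix makes decoding a simple string operation. The following is a simplified sketch of that idea and not the library's actual decoder; `decode_with_suffix` is a name made up for this example.

```python
def decode_with_suffix(tokens, suffix="</w>"):
    """Glue tokens together and turn each end-of-word marker into a space."""
    return "".join(tokens).replace(suffix, " ").strip()

print(decode_with_suffix(['bath', 'room</w>']))
print(decode_with_suffix(['This</w>', 'is</w>', 'a</w>', 'test</w>']))
```

Tokens without the suffix are concatenated directly, which is why 'bath' and 'room</w>' come back as a single word.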

Now let's see how we can save and reload a tokenizer.

In [140]:
# Save tokenizer to a file
tokenizer.save("bpe_tokenizer.json")

# Load tokenizer from file
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Now you can use your tokenizer as before
output = tokenizer.encode("This is a test")
print(output.tokens)

['This</w>', 'is</w>', 'a</w>', 'test</w>']


You can also save your encoded text and decode it later.

In [139]:
import json
encoded = tokenizer.encode("This is a small example")

# Save only the IDs (this is what models actually use)
ids = encoded.ids
print(ids)

with open("encoded_text.json", "w") as f:
    json.dump(ids, f)

# Reload the data
with open("encoded_text.json") as f:
    ids = json.load(f)

# And decode the text
decoded_text = tokenizer.decode(ids)
print(decoded_text)

[973, 324, 154, 1834, 3692]
This is a small example


Finally, let's have a look at how to load the tokenizer of an actual LLM. Let's load the GPT-2 tokenizer using the Hugging Face `transformers` library.

In [145]:
from transformers import AutoTokenizer

# Load pretrained GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [146]:
text = "I am learning about tokenization."

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("IDs:", ids)
print("Decoded: ", tokenizer.decode(ids))

Tokens: ['I', 'Ġam', 'Ġlearning', 'Ġabout', 'Ġtoken', 'ization', '.']
IDs: [40, 716, 4673, 546, 11241, 1634, 13]
Decoded:  I am learning about tokenization.


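Here you see the symbol Ġ. It is not a separate special token: GPT-2's byte-level BPE represents the space character as Ġ, so a token starting with Ġ is one that was preceded by a space. This is how GPT-2 preserves word boundaries.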
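
Where does Ġ come from? Byte-level BPE maps every byte to a printable character before merging, and the space byte ends up as Ġ. The sketch below follows the byte-to-character mapping used by GPT-2 as far as I know it; treat it as an illustration rather than the reference implementation. The token list is copied from the output above.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable unicode character."""
    # Bytes that are already printable keep their own character:
    # '!'..'~' (0x21-0x7E), 0xA1-0xAC and 0xAE-0xFF.
    printable = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (including the space) are moved above 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

mapping = bytes_to_unicode()
space_symbol = mapping[ord(" ")]

tokens = ['I', 'Ġam', 'Ġlearning', 'Ġabout', 'Ġtoken', 'ization', '.']
restored = "".join(tokens).replace(space_symbol, " ")
print(repr(space_symbol), "->", restored)
```

Because every byte has its own symbol, concatenating the tokens and mapping the symbols back reproduces the original text exactly, spaces included.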

### Lemmatization

We continue with lemmatization. We will use the `nltk` package, for which we first have to download a few data resources.

In [26]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/fabianlaakmann/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/fabianlaakmann/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/fabianlaakmann/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/fabianlaakmann/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

A lemmatizer reduces a word to its dictionary base form (lemma) using linguistic knowledge such as part of speech and vocabulary. Unlike a stem, a lemma is a valid word that represents the word's underlying meaning.

1. "running" → "run" (verb in progressive form)
2. "mice" → "mouse" (irregular plural)
3. "is" → "be" (infinitive)

In [None]:
lemmatizer = WordNetLemmatizer()

In [40]:
# Words to lemmatize
words = ["running", "mice", "is"]

# Apply lemmatizer
lemmas = [lemmatizer.lemmatize(w) for w in words]

# Print results
for word, lemma in zip(words, lemmas):
    print(f"{word} → {lemma}")

running → running
mice → mouse
is → is


Hmm, this only seems to have worked for 'mice'. What about 'running' and 'is'?

We have to tell the lemmatizer the word's part of speech via the `pos` argument. If you leave out `pos`, the NLTK `WordNetLemmatizer` assumes the word is a noun.
This can lead to incorrect lemmas for verbs, adjectives, or adverbs.


In [28]:
# Apply lemmatizer with appropriate POS
lemmas = [
    lemmatizer.lemmatize("running", pos="v"),  # verb
    lemmatizer.lemmatize("mice", pos="n"),     # noun
    lemmatizer.lemmatize("is", pos="v")        # verb
]

# Print results
for word, lemma in zip(words, lemmas):
    print(f"{word} → {lemma}")

running → run
mice → mouse
is → be


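To determine the part of speech automatically, we tag the tokens with `nltk.pos_tag` and use the following function to map the resulting Penn Treebank tags to the WordNet POS constants that the lemmatizer expects.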

In [35]:
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default

pos_tags = nltk.pos_tag(raw_tokens)

Now we lemmatize all the words in our books.

In [45]:
lemmatized_tokens_pos = [
    lemmatizer.lemmatize(word.lower(), pos=get_wordnet_pos(tag))
    for word, tag in pos_tags
]

lemma_counter_pos = Counter(lemmatized_tokens_pos)

In [46]:
print("Unique tokens after lemmatization:", len(lemma_counter_pos))

print("dog:", lemma_counter_pos["dog"])
print("dogs:", lemma_counter_pos["dogs"])

print("run:", lemma_counter_pos["run"])
print("running:", lemma_counter_pos["running"])

print("be:", lemma_counter_pos["be"])
print("is:", lemma_counter_pos["is"])

Unique tokens after lemmatization: 66580
dog: 69
dogs: 0
run: 260
running: 6
be: 18194
is: 24


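In a few cases the tagger apparently labeled 'running' as something other than a verb (for example as a noun or an adjective), so the lemmatizer left it unchanged. The same holds for the 24 remaining occurrences of 'is'.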
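
One way to check this is to count which tags a given word received. The helper below is a small sketch; `tag_distribution` is a name made up for this example, and the sample list is invented data in the `(word, tag)` format returned by `nltk.pos_tag`. In the notebook you would pass `pos_tags` instead.

```python
from collections import Counter

def tag_distribution(tagged_tokens, target):
    """Count the POS tags assigned to `target` (case-insensitive)."""
    return Counter(tag for word, tag in tagged_tokens if word.lower() == target)

# Invented sample data, only to show the shape of the input.
sample = [
    ("He", "PRP"), ("was", "VBD"), ("running", "VBG"),
    ("the", "DT"), ("running", "NN"), ("water", "NN"),
    ("Running", "VBG"),
]
dist = tag_distribution(sample, "running")
print(dist.most_common())
```

Any occurrence whose tag does not start with `V` is passed to the lemmatizer as a non-verb and therefore stays 'running'.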

### Stemming

We continue with stemming. Stemming is a text-normalization technique that reduces words to a common root by mechanically removing prefixes or suffixes.

1. "connection" → "connect" (removes -ion, grouping related words)
2. "happiness" → "happi" (removes -ness, but leaves a non-word stem)
3. "studies" → "studi" (the stemmer chops off -es, but the result is not a real word)

How does it compare to a lemmatizer?
1. "running" → "run" (suffix -ing is stripped)
2. "mice" → "mice" (unchanged, because most stemmers don't handle irregular forms)
3. "is" → "is" (unchanged, because a stemmer doesn't know that "be" is the infinitive of "is")
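
To get a feel for what "mechanically removing suffixes" means, here is a toy suffix stripper. It is far cruder than the Porter stemmer used below and only serves as an illustration; the suffix list and the minimum stem length of 3 are arbitrary choices made for this example.

```python
# Longer suffixes come first so that "ness" is tried before "s".
SUFFIXES = ("ness", "ing", "ion", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connection", "happiness", "studies", "running", "mice", "is"]:
    print(f"{w} → {toy_stem(w)}")
```

The toy version turns "running" into "runn", whereas Porter yields "run": real stemmers add further rules, such as undoubling the final consonant, on top of plain suffix removal.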

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [41]:
# Words to stem
words = ["connection", "happiness", "studies", "running", "mice", "is"]

# Apply stemmer
stems = [stemmer.stem(w) for w in words]

# Print results
for word, stem in zip(words, stems):
    print(f"{word} → {stem}")

connection → connect
happiness → happi
studies → studi
running → run
mice → mice
is → is


We continue by stemming all words in our books.

In [42]:
stemmed_tokens = [stemmer.stem(t.lower()) for t in raw_tokens]
stem_counter = Counter(stemmed_tokens)

In [44]:
print("Unique tokens after stemming:", len(stem_counter))
stem_counter['dog']  # not displayed: a notebook only echoes the last expression of a cell
# Inspect common stems
stem_counter.most_common(5)

Unique tokens after stemming: 62330


[('the', 35087), ('of', 21489), ('and', 19744), ('to', 16121), ('a', 14892)]

We observe that we have fewer unique tokens after stemming (62330) than after lemmatization (66580). That makes sense, as stemming strips endings more aggressively and merges many forms into the same rough root, whereas lemmatization only combines words when they truly share the same dictionary base form.