# Tokenizers

For a model such as a **Large Language Model (LLM)** to read data, the data must first be transformed into a format the model can process and "understand". Models do not comprehend raw input directly; several steps happen "under the hood" to make the input usable. This notebook explores one of those steps: tokenization.

## What is tokenization?

Tokenization is the process of splitting text into smaller units called tokens (often subwords) and mapping them to numeric IDs that a model can process; detokenization reverses the mapping to produce text again. Tokenizers are therefore the first and last step around the model: encode text to token IDs, embed and process them, then decode the output IDs back to text.
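The encode/decode round trip can be sketched with a toy word-level vocabulary. This is only an illustration: the vocabulary, the `encode` and `decode` helpers, and the `<unk>` fallback are made up for this example, and real tokenizers use learned subword vocabularies instead.

```python
# Toy word-level vocabulary (hypothetical); real tokenizers learn subword vocabularies.
vocab = {"<unk>": 0, "tokenizers": 1, "turn": 2, "text": 3, "into": 4, "ids": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Map each lowercased, whitespace-separated word to its ID (unknown words -> <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Tokenizers turn text into IDs")
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # tokenizers turn text into ids
```

Note that the round trip is lossy here (the original casing is gone), which is one reason production tokenizers are designed more carefully than this sketch.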

## Real-world impact

**Cost**: Most API pricing is per token (or, more precisely, per million tokens), so more tokens mean higher cost. Typically, input data (the data we enter), cached input data (previously processed inputs stored for faster access), and output data (the data produced by the model) are billed differently.
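As a rough illustration of per-token billing, the arithmetic can be sketched as below. The prices are hypothetical placeholders, not those of any real provider, and `estimate_cost` is a helper invented for this example.

```python
# Hypothetical prices in USD per million tokens; real prices vary by provider and model.
PRICE_PER_MILLION = {"input": 2.00, "cached_input": 0.50, "output": 8.00}

def estimate_cost(input_tokens, cached_input_tokens, output_tokens):
    """Return the estimated request cost in USD across the three billing categories."""
    total = (
        input_tokens * PRICE_PER_MILLION["input"]
        + cached_input_tokens * PRICE_PER_MILLION["cached_input"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000

cost = estimate_cost(input_tokens=1_200, cached_input_tokens=800, output_tokens=500)
print(f"${cost:.4f}")  # $0.0068
```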

**Performance and limits**: Tokenization determines how many tokens a prompt becomes, which affects speed, attention cost, and whether the prompt fits within the model's context limit.

**Practical control**: Optimizing prompts and model choice can reduce token counts and costs without losing necessary context.

## Rule of thumb

A common rule of thumb for English text is that 1 token is about 4 characters, or roughly 3/4 of a word, so 100 tokens correspond to about 75 words. The actual ratio depends on the tokenizer and on the language.
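The rule of thumb translates into a quick estimate like the one below. It is only a back-of-the-envelope sketch for English text; `estimate_tokens` is a helper made up for this example, and an actual tokenizer should be used when exact counts matter.

```python
def estimate_tokens(text):
    """Rough English-only estimates: ~4 characters per token, ~3/4 of a word per token."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return by_chars, by_words

sample = " ".join(["word"] * 75)  # 75 four-letter words
by_chars, by_words = estimate_tokens(sample)
print(by_chars, by_words)  # 93.5 100.0
```

The two estimates differ slightly because they rest on different assumptions, which is expected for a heuristic.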

## Main tokenization algorithms in Natural Language Processing (NLP) models

- **Whitespace / Rule-based** - split on spaces and punctuation, often using simple rules (like how to treat quotes or hyphens). Fast and easy to implement, but can break down with tricky language cases or different writing styles.

- **Word tokenization** - uses language-specific rules to split text into words. It works well for languages that separate words with spaces, like English, but struggles with languages that do not, like Chinese or Japanese, and with contractions and punctuation quirks.

- **Sentence tokenization** - identifies where sentences start and end, which is usually based on punctuation and common abbreviations. Often used as a first step before splitting text into words or subwords.

- **Character tokenization** - each character is a separate token. It avoids out-of-vocabulary (OOV) issues but results in longer sequences and weaker semantics per token.

- **Subword tokenization**

  - **Byte-Pair Encoding (BPE)** - starts with characters and repeatedly merges the most frequent adjacent pairs to build a vocabulary. It is efficient and consistent but can create unusual splits for rare or made-up words.

  - **WordPiece** - similar to BPE, but instead of just merging the most frequent pairs, it chooses merges that maximize the likelihood of the training data under a language model. It is used in models like **BERT** and usually produces more linguistically meaningful subwords. In many implementations, continuation subwords are marked with a prefix like “##” (e.g., “play” + “##ing”), whereas **BPE** does not require such a convention.
 
  - **Unigram** - starts with a large set of possible subword tokens and gradually removes less useful ones, keeping the vocabulary that best explains the training data with the lowest loss. Unlike **BPE** or **WordPiece**, it can generate multiple valid segmentations for the same word and chooses the most likely one. It also supports sampling, which helps make models more robust during training (a technique called subword regularization).
 
  - **SentencePiece** - a framework that trains tokenizers (like **BPE** or **Unigram**) directly on raw text (no pre-tokenization), commonly marking word starts with a special symbol (▁, U+2581, which is not an underscore). This makes it a good fit for languages that don't use spaces.

- **Byte-level variants**: **GPT** models use **byte-level BPE** via OpenAI encodings like cl100k_base, operating on raw bytes to improve Unicode robustness and compression, avoid OOVs, and ensure consistent behavior across scripts and emoji.
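To make the BPE description above more concrete, here is a minimal sketch of the training loop: count adjacent symbol pairs, merge the most frequent one, and repeat. The toy word frequencies and the helper names (`most_frequent_pair`, `merge_pair`) are assumptions for illustration; library implementations add pre-tokenization, special tokens, and many optimizations.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (hypothetical): word -> frequency, each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # number of merges stands in for the vocabulary-size budget
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

In this toy corpus the frequent suffix "est" ends up as a single symbol after the first two merges, which shows how BPE picks up common word parts without any linguistic rules.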

### Whitespace / Rule-based

In [1]:
import re
import warnings
warnings.filterwarnings("ignore")

text = "I don't like the so-called 'state-of-the-art' design, she said. It's too over-the-top for my taste, especially in the U.S."
tokens = [t for t in re.split(r"[^\w'-]+", text) if t]
print(tokens)

['I', "don't", 'like', 'the', 'so-called', "'state-of-the-art'", 'design', 'she', 'said', "It's", 'too', 'over-the-top', 'for', 'my', 'taste', 'especially', 'in', 'the', 'U', 'S']


### Word tokenization (spaCy)

In [2]:
import spacy

text = "I don't like the so-called 'state-of-the-art' design, she said. It's too over-the-top for my taste, especially in the U.S."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)

['I', 'do', "n't", 'like', 'the', 'so', '-', 'called', "'", 'state', '-', 'of', '-', 'the', '-', 'art', "'", 'design', ',', 'she', 'said', '.', 'It', "'s", 'too', 'over', '-', 'the', '-', 'top', 'for', 'my', 'taste', ',', 'especially', 'in', 'the', 'U.S.']


### Sentence tokenization (NLTK)

In [3]:
import nltk
from nltk.tokenize import sent_tokenize

text = "Dr. Smith arrived at 5 p.m. He left at 6."
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith arrived at 5 p.m.', 'He left at 6.']
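For comparison, a naive rule that splits after every sentence-ending punctuation mark shows why abbreviation handling matters. This is a sketch of the failure mode using only the standard library, not a recommended splitter.

```python
import re

text = "Dr. Smith arrived at 5 p.m. He left at 6."

# Naive rule: split wherever '.', '!' or '?' is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
# ['Dr.', 'Smith arrived at 5 p.m.', 'He left at 6.']
```

The naive rule breaks the text after "Dr.", while the NLTK output above keeps "Dr. Smith" together.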


### Character tokenization

In [4]:
text = "A simple text."
tokens = list(text)
print(tokens)

['A', ' ', 's', 'i', 'm', 'p', 'l', 'e', ' ', 't', 'e', 'x', 't', '.']


### Byte-Pair Encoding (BPE)

In [5]:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = [
    "lowercasing",
    "unbelievable",
    "state-of-the-art",
    "tokenization is fun!",  # comma added: without it, this string is concatenated with the next
    "Dr. Smith",
]

# initialize a BPE tokenizer with unknown token
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# use whitespace to split text initially (pre-tokenization)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# trainer with small vocab size for demo, plus special tokens
trainer = trainers.BpeTrainer(
    vocab_size=50,  # small vocab for demo clarity
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)

# train the tokenizer on the toy corpus
tokenizer.train_from_iterator(corpus, trainer)

# show learned vocabulary tokens
print("Learned vocabulary:")
print(sorted(tokenizer.get_vocab().keys()))

# encode a sample word and show the tokens and IDs
sample = "lowercasing"
encoded = tokenizer.encode(sample)

print(f"\nEncoding the word '{sample}':")
print("Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)

Learned vocabulary:
['!', '-', '.', 'D', 'Dr', 'S', 'Sm', '[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]', 'a', 'ab', 'ar', 'as', 'at', 'b', 'be', 'c', 'cas', 'e', 'en', 'er', 'ev', 'f', 'fun', 'g', 'h', 'i', 'iev', 'in', 'io', 'is', 'ith', 'iz', 'k', 'l', 'm', 'n', 'o', 'r', 's', 't', 'th', 'u', 'un', 'v', 'w', 'z']

Encoding the word 'lowercasing':
Tokens: ['l', 'o', 'w', 'er', 'cas', 'in', 'g']
Token IDs: [19, 22, 28, 41, 39, 44, 15]


### WordPiece

In [6]:
corpus = [
    "lowercasing",
    "unbelievable",
    "state-of-the-art",
    "tokenization is fun!",  # comma added: without it, this string is concatenated with the next
    "Dr. Smith",
]

# initialize a WordPiece tokenizer with unknown token
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# use whitespace to split text initially (pre-tokenization)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# trainer with small vocab size for demo, plus special tokens
trainer = trainers.WordPieceTrainer(
    vocab_size=50,  # small vocab for demo clarity
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
)

# train the tokenizer on the toy corpus
tokenizer.train_from_iterator(corpus, trainer)

# show learned vocabulary tokens
print("Learned vocabulary:")
print(sorted(tokenizer.get_vocab().keys()))

# encode a sample word and show the tokens and IDs
sample = "lowercasing"
encoded = tokenizer.encode(sample)

print(f"\nEncoding the word '{sample}':")
print("Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)

Learned vocabulary:
['!', '##a', '##b', '##c', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##r', '##s', '##t', '##u', '##v', '##w', '##z', '-', '.', 'D', 'S', '[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]', 'a', 'b', 'c', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'r', 's', 't', 'u', 'v', 'w', 'z']

Encoding the word 'lowercasing':
Tokens: ['l', '##o', '##w', '##e', '##r', '##c', '##a', '##s', '##i', '##n', '##g']
Token IDs: [19, 35, 36, 32, 37, 38, 31, 39, 40, 34, 41]
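Once a WordPiece vocabulary exists, encoding a word is typically done with a greedy longest-match-first search, as in BERT's tokenizer. The sketch below illustrates that idea with a hand-written vocabulary; `wordpiece_encode` and the vocabulary are assumptions for this example and not the `tokenizers` library's internals.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary, not taken from any trained model.
vocab = {"lower", "low", "##er", "##cas", "##ing", "##case", "play"}
print(wordpiece_encode("lowercasing", vocab))  # ['lower', '##cas', '##ing']
print(wordpiece_encode("playing", vocab))      # ['play', '##ing']
print(wordpiece_encode("xyz", vocab))          # ['[UNK]']
```

With the tiny 50-token vocabulary trained above, only single characters are available, so the same greedy search falls back to one piece per character.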


### Unigram

In [7]:
import sentencepiece as spm
import tempfile

# corpus text
corpus_text = "lowercasing\nunbelievable\nstate-of-the-art\ntokenization is fun!\nDr. Smith\n"

# create a temporary file and write the corpus text there
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp_file:
    tmp_file.write(corpus_text)
    tmp_filename = tmp_file.name

# train the unigram model using the temp file
spm.SentencePieceTrainer.Train(
    input=tmp_filename,
    model_prefix="spm_unigram",
    vocab_size=32,
    model_type="unigram",
    character_coverage=1.0
)

# load and use the model
sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")

sample = "lowercasing"

print("Learned vocabulary:")
vocab_list = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
print(vocab_list)

print(f"\nEncoding the word '{sample}':")

tokens = sp.encode(sample, out_type=str)
print("Tokens:", tokens)

# encode the same sample to token IDs (integers)
token_ids = sp.encode(sample, out_type=int)
print("Token IDs:", token_ids)

# delete the temporary file
import os
os.remove(tmp_filename)

Learned vocabulary:
['<unk>', '<s>', '</s>', '▁', 'e', 'i', 'o', 't', 'a', 'n', '-', 's', 'r', 'l', 'b', 'f', 'th', 'un', 'at', 'u', '!', '.', 'D', 'S', 'g', 'k', 'm', 'z', 'v', 'w', 'c', 'h']

Encoding the word 'lowercasing':
Tokens: ['▁', 'l', 'o', 'w', 'e', 'r', 'c', 'a', 's', 'i', 'n', 'g']
Token IDs: [3, 13, 6, 29, 4, 12, 30, 8, 11, 5, 9, 24]
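The Unigram idea of choosing the most likely segmentation among several valid ones can be sketched with a small Viterbi search. The piece probabilities below are hand-picked for illustration (and not normalized), and `unigram_segment` is a hypothetical helper, not part of SentencePiece.

```python
import math

def unigram_segment(word, probs):
    """Return the most likely segmentation of `word` under a unigram model (Viterbi)."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of word[:i]
    best_start = [0] * (n + 1)            # start index of the last piece on that best path
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(probs[piece])
                if score > best_score[end]:
                    best_score[end], best_start[end] = score, start
    if best_score[n] == -math.inf:
        return None  # the word cannot be built from the vocabulary
    pieces, end = [], n
    while end > 0:  # walk back through the stored split points
        start = best_start[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, chosen by hand for illustration only.
probs = {"un": 0.1, "believ": 0.05, "able": 0.1, "u": 0.02, "n": 0.02,
         "b": 0.01, "e": 0.03, "l": 0.02, "i": 0.02, "v": 0.01, "a": 0.02}
print(unigram_segment("unbelievable", probs))  # ['un', 'believ', 'able']
```

A character-by-character split is also valid under this vocabulary, but its product of probabilities is far smaller, so the search prefers the three longer pieces.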


**NOTE**: Results from subword models depend heavily on the training corpus and chosen hyperparameters. Therefore, variations in training data or settings can produce different tokenization results.

This is why pretrained tokenizers, trained on large diverse corpora with carefully tuned hyperparameters, are commonly used - they provide consistent and reliable tokenization across many tasks.

## Pretrained tokenizers

In [8]:
text = (
    "Natural Language Processing (NLP) enables computers to understand human language. "
    "Challenges include handling idioms like 'kick the bucket', contractions such as 'won't', "
    "abbreviations like 'e.g.', hyphenated words like 'well-known', and even emojis 🤖. "
    "Context is key: 'bank' in 'river bank' vs. 'bank' in 'savings bank'. "
    "Also, numbers (e.g., 42, 3.1415) and proper nouns like 'New York' add complexity."
)

### BERT (WordPiece)

In [9]:
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tok(text, return_tensors="pt")
print(tok.tokenize(text))
print(enc["input_ids"].shape) # [1, seq_len]
out = model(**enc) # out.last_hidden_state: [1, seq_len, hidden]
print(out.last_hidden_state.shape)

['natural', 'language', 'processing', '(', 'nl', '##p', ')', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.', 'challenges', 'include', 'handling', 'id', '##ioms', 'like', "'", 'kick', 'the', 'bucket', "'", ',', 'contraction', '##s', 'such', 'as', "'", 'won', "'", 't', "'", ',', 'abbreviation', '##s', 'like', "'", 'e', '.', 'g', '.', "'", ',', 'h', '##yp', '##hen', '##ated', 'words', 'like', "'", 'well', '-', 'known', "'", ',', 'and', 'even', 'em', '##oj', '##is', '[UNK]', '.', 'context', 'is', 'key', ':', "'", 'bank', "'", 'in', "'", 'river', 'bank', "'", 'vs', '.', "'", 'bank', "'", 'in', "'", 'savings', 'bank', "'", '.', 'also', ',', 'numbers', '(', 'e', '.', 'g', '.', ',', '42', ',', '3', '.', '141', '##5', ')', 'and', 'proper', 'nouns', 'like', "'", 'new', 'york', "'", 'add', 'complexity', '.']
torch.Size([1, 117])
torch.Size([1, 117, 768])


### RoBERTa (BPE)

In [10]:
from transformers import AutoTokenizer, AutoModel

name = "roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tok(text, return_tensors="pt")
print(tok.tokenize(text)) # byte-level BPE; Ġ marks a preceding space
print(enc["input_ids"].shape) # [1, seq_len]
out = model(**enc) # out.last_hidden_state: [1, seq_len, hidden]
print(out.last_hidden_state.shape)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


['Natural', 'ĠLanguage', 'ĠProcessing', 'Ġ(', 'N', 'LP', ')', 'Ġenables', 'Ġcomputers', 'Ġto', 'Ġunderstand', 'Ġhuman', 'Ġlanguage', '.', 'ĠChallenges', 'Ġinclude', 'Ġhandling', 'Ġidi', 'oms', 'Ġlike', "Ġ'", 'kick', 'Ġthe', 'Ġbucket', "',", 'Ġcontract', 'ions', 'Ġsuch', 'Ġas', "Ġ'", 'won', "'t", "',", 'Ġabbrevi', 'ations', 'Ġlike', "Ġ'", 'e', '.', 'g', ".'", ',', 'Ġhyp', 'hen', 'ated', 'Ġwords', 'Ġlike', "Ġ'", 'well', '-', 'known', "',", 'Ġand', 'Ġeven', 'Ġem', 'oj', 'is', 'ĠðŁ', '¤', 'ĸ', '.', 'ĠContext', 'Ġis', 'Ġkey', ':', "Ġ'", 'bank', "'", 'Ġin', "Ġ'", 'river', 'Ġbank', "'", 'Ġvs', '.', "Ġ'", 'bank', "'", 'Ġin', "Ġ'", 'sav', 'ings', 'Ġbank', "'.", 'ĠAlso', ',', 'Ġnumbers', 'Ġ(', 'e', '.', 'g', '.,', 'Ġ42', ',', 'Ġ3', '.', '14', '15', ')', 'Ġand', 'Ġproper', 'Ġnoun', 's', 'Ġlike', "Ġ'", 'New', 'ĠYork', "'", 'Ġadd', 'Ġcomplexity', '.']
torch.Size([1, 113])
torch.Size([1, 113, 768])


### GPT (Byte-level BPE)

In [11]:
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tok(text, return_tensors="pt")
print(tok.tokenize(text)) # byte-level BPE
print(enc["input_ids"].shape) # [1, seq_len]
out = model(**enc) # out.last_hidden_state: [1, seq_len, hidden]
print(out.last_hidden_state.shape)

gen_model = AutoModelForCausalLM.from_pretrained(name)
gen = gen_model.generate(**enc, max_new_tokens=20)
print(tok.decode(gen[0], skip_special_tokens=True))

['Natural', 'ĠLanguage', 'ĠProcessing', 'Ġ(', 'N', 'LP', ')', 'Ġenables', 'Ġcomputers', 'Ġto', 'Ġunderstand', 'Ġhuman', 'Ġlanguage', '.', 'ĠChallenges', 'Ġinclude', 'Ġhandling', 'Ġidi', 'oms', 'Ġlike', "Ġ'", 'kick', 'Ġthe', 'Ġbucket', "',", 'Ġcontract', 'ions', 'Ġsuch', 'Ġas', "Ġ'", 'won', "'t", "',", 'Ġabbrevi', 'ations', 'Ġlike', "Ġ'", 'e', '.', 'g', ".'", ',', 'Ġhyp', 'hen', 'ated', 'Ġwords', 'Ġlike', "Ġ'", 'well', '-', 'known', "',", 'Ġand', 'Ġeven', 'Ġem', 'oj', 'is', 'ĠðŁ', '¤', 'ĸ', '.', 'ĠContext', 'Ġis', 'Ġkey', ':', "Ġ'", 'bank', "'", 'Ġin', "Ġ'", 'river', 'Ġbank', "'", 'Ġvs', '.', "Ġ'", 'bank', "'", 'Ġin', "Ġ'", 'sav', 'ings', 'Ġbank', "'.", 'ĠAlso', ',', 'Ġnumbers', 'Ġ(', 'e', '.', 'g', '.,', 'Ġ42', ',', 'Ġ3', '.', '14', '15', ')', 'Ġand', 'Ġproper', 'Ġnoun', 's', 'Ġlike', "Ġ'", 'New', 'ĠYork', "'", 'Ġadd', 'Ġcomplexity', '.']
torch.Size([1, 111])
torch.Size([1, 111, 768])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Natural Language Processing (NLP) enables computers to understand human language. Challenges include handling idioms like 'kick the bucket', contractions such as 'won't', abbreviations like 'e.g.', hyphenated words like 'well-known', and even emojis 🤖. Context is key: 'bank' in 'river bank' vs. 'bank' in 'savings bank'. Also, numbers (e.g., 42, 3.1415) and proper nouns like 'New York' add complexity.

The NLP is a powerful tool for understanding human language. It is a tool for understanding


In [12]:
import tiktoken

# cl100k_base is used by many newer OpenAI models
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(text)
tokens = [enc.decode([i]) for i in ids] # decode each ID on its own; partial UTF-8 sequences show up as �
print(tokens) # byte-level BPE pieces; robust to any unicode
print(ids)
print(enc.decode(ids))

['Natural', ' Language', ' Processing', ' (', 'N', 'LP', ')', ' enables', ' computers', ' to', ' understand', ' human', ' language', '.', ' Challenges', ' include', ' handling', ' idi', 'oms', ' like', " '", 'kick', ' the', ' bucket', "',", ' contr', 'actions', ' such', ' as', " '", 'won', "'t", "',", ' abbrev', 'iations', ' like', " '", 'e', '.g', ".',", ' hy', 'phen', 'ated', ' words', ' like', " '", 'well', '-known', "',", ' and', ' even', ' emojis', ' �', '�', '�', '.', ' Context', ' is', ' key', ':', " '", 'bank', "'", ' in', " '", 'river', ' bank', "'", ' vs', '.', " '", 'bank', "'", ' in', " '", 's', 'avings', ' bank', "'.", ' Also', ',', ' numbers', ' (', 'e', '.g', '.,', ' ', '42', ',', ' ', '3', '.', '141', '5', ')', ' and', ' proper', ' nouns', ' like', " '", 'New', ' York', "'", ' add', ' complexity', '.']
[55381, 11688, 29225, 320, 45, 12852, 8, 20682, 19002, 311, 3619, 3823, 4221, 13, 69778, 2997, 11850, 41760, 7085, 1093, 364, 56893, 279, 15994, 518, 6155, 4109, 1778, 43
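The ' �' pieces in the token display above come from the emoji being split across several byte-level tokens. A small standard-library sketch shows the underlying bytes; it does not use tiktoken and only illustrates why decoding part of a multi-byte character cannot give readable text.

```python
emoji = "🤖"
raw = emoji.encode("utf-8")

print(len(emoji), len(raw))        # 1 4
print([f"{b:02x}" for b in raw])   # ['f0', '9f', 'a4', '96']

# Decoding only part of the byte sequence cannot recover the character;
# with errors="replace" Python substitutes U+FFFD (the � symbol).
partial = raw[:2].decode("utf-8", errors="replace")
print(partial)

# Decoding all four bytes together restores the emoji.
print(raw.decode("utf-8"))
```

This is also why the GPT-2 and RoBERTa outputs show the emoji as several odd-looking pieces: each piece stands for one or more of these raw bytes.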

### T5 (SentencePiece Unigram)

In [13]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "t5-small"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

text = "Translate English to German: The tokenization is robust."
enc = tok(text, return_tensors="pt")
out = model.generate(**enc, max_new_tokens=20)

print(tok.convert_ids_to_tokens(enc["input_ids"][0][:10]))  
print(tok.decode(out[0], skip_special_tokens=True))

['▁Translat', 'e', '▁English', '▁to', '▁German', ':', '▁The', '▁token', 'ization', '▁is']
Die tokenization ist robust.
