# Chapter 8: Building the Tokenizer

> "Language is the dress of thought." - **Samuel Johnson**, Writer

---

## What is Tokenization?

**Tokenization** is the process of converting text into numbers so neural networks can process it. Think of it as translating English into a secret code where each word, subword, or character gets a unique number.

```
"Hello world" → [15496, 995]  (using GPT-2's tokenizer)
```

The tokenizer also works in reverse: given numbers, it produces text.

---

## What You'll Learn

- How text becomes numbers through three different tokenization strategies
- Why character-level tokenization is simple but inefficient
- How word-level tokenization handles unknown words and why vocabulary size matters
- The clever trick behind subword tokenization (BPE) that powers modern LLMs
- How to use production tokenizers like tiktoken and Hugging Face transformers
- The quirks and gotchas that affect prompts and API costs

---

## Setup

First, let's install the required packages:

In [None]:
# Install required packages
!pip install -q tiktoken transformers torch

## 1. Character-Level Tokenization

Let's build the simplest tokenizer: treat every character as a token.

**Python Class Reminder:**
- A **class** is a blueprint that bundles data and functions together
- `__init__(self)` runs when you create an object (initializes its data)
- `self` refers to "this specific object" (like "this car" vs "cars in general")
- `@property` makes a method behave like an attribute (no parentheses needed)

In [None]:
class CharTokenizer:
    def __init__(self):
        # Two dictionaries for bidirectional lookup
        self.char_to_id = {}
        self.id_to_char = {}

    def fit(self, text):
        """Build vocabulary from text.
        
        Why sorted()? So the vocab is deterministic: the same text always
        produces the same IDs. Without sorting, set iteration order is
        arbitrary and can differ between runs.
        """
        # Extract unique characters - sets automatically handle uniqueness
        chars = sorted(set(text))
        
        # Assign each character an ID (starting from 0)
        for i, c in enumerate(chars):
            self.char_to_id[c] = i
            self.id_to_char[i] = c

    def encode(self, text):
        """Convert text to list of token IDs.
        
        Returns a list of integers, one per character.
        """
        return [self.char_to_id[c] for c in text]

    def decode(self, ids):
        """Convert list of token IDs back to text.
        
        Uses str.join() to concatenate characters with no spaces
        between them (unlike word tokenization which needs spaces).
        """
        return "".join(self.id_to_char[i] for i in ids)

    @property
    def vocab_size(self):
        """How many unique characters we know."""
        return len(self.char_to_id)

### Try It: Complete Example

In [None]:
# Create and train tokenizer
tokenizer = CharTokenizer()
tokenizer.fit("Hello, world!")

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Vocab: {tokenizer.char_to_id}")

# Encode
text = "Hello"
ids = tokenizer.encode(text)
print(f"\n'{text}' → {ids}")

# Decode - round trip should be lossless!
decoded = tokenizer.decode(ids)
print(f"{ids} → '{decoded}'")
print(f"Perfect round trip? {text == decoded}")

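One limitation of the class above is worth seeing directly: `encode` looks each character up without a fallback, so any character that was absent from the text passed to `fit` raises a `KeyError`. The sketch below reproduces that with a plain dictionary and shows one possible workaround. The names `UNK_ID` and `encode_with_fallback` are hypothetical and not part of `CharTokenizer`.

```python
# Build a character vocabulary the same way CharTokenizer.fit does.
char_to_id = {c: i for i, c in enumerate(sorted(set("Hello, world!")))}

# A direct lookup fails for a character that was never seen during fit.
try:
    [char_to_id[c] for c in "Help"]
    lookup_failed = False
except KeyError:
    lookup_failed = True  # "p" does not occur in "Hello, world!"

# Hypothetical workaround: reserve one extra ID for unseen characters.
UNK_ID = len(char_to_id)

def encode_with_fallback(text):
    """Map each character to its ID, or to UNK_ID if it is unknown."""
    return [char_to_id.get(c, UNK_ID) for c in text]

print(lookup_failed)                 # True
print(encode_with_fallback("Help"))  # [3, 5, 6, 10]
```

The fallback trades a crash for information loss: every unseen character maps to the same ID, so the round trip is no longer lossless for such text.
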
### The Vocabulary Size vs Sequence Length Tradeoff

In [None]:
sentence = "Tokenization is the first step in any language model."

# Character-level
char_tok = CharTokenizer()
char_tok.fit(sentence)
char_ids = char_tok.encode(sentence)

print(f"Original text: {len(sentence)} characters")
print(f"Vocab size: {char_tok.vocab_size}")
print(f"Sequence length: {len(char_ids)}")
print(f"First 20 tokens: {char_ids[:20]}")

## 2. Word-Level Tokenization

Now let's build a word-level tokenizer that handles unknown words with special tokens.

**Special Tokens:** Reserved IDs with specific meanings:
- `<PAD>` (ID 0): Padding, which fills sequences to equal length for batching
- `<UNK>` (ID 1): Unknown, which represents words not in our vocabulary
- `<BOS>` (ID 2): Beginning of Sequence, which marks where text starts
- `<EOS>` (ID 3): End of Sequence, which marks where text ends

Why do we need these? Without `<UNK>`, our tokenizer would crash on new words. Without `<PAD>`, we couldn't process multiple sentences at once (they'd have different lengths).

In [None]:
from collections import Counter

class WordTokenizer:
    def __init__(self, max_vocab_size=10000):
        """
        max_vocab_size: Maximum vocabulary size (including special tokens)
        
        Why 10,000? It's a balance:
        - Too small (1,000): Too many unknowns
        - Too large (100,000): Huge embedding table, slow training
        - 10,000-50,000: Sweet spot for learning
        """
        self.max_vocab_size = max_vocab_size
        self.word_to_id = {}
        self.id_to_word = {}

    def fit(self, text):
        """Build vocabulary from most frequent words."""
        # Start simple: split on whitespace and lowercase
        words = text.lower().split()
        
        # Count word frequencies - why? Common words get their own IDs,
        # rare words become <UNK>. This minimizes unknowns in practice.
        counts = Counter(words)
        
        # Reserve IDs 0-3 for special tokens
        self.word_to_id = {
            "<PAD>": 0,
            "<UNK>": 1,
            "<BOS>": 2,
            "<EOS>": 3
        }
        self.id_to_word = {v: k for k, v in self.word_to_id.items()}
        
        # Add most common words (keeping max_vocab_size limit)
        # Start IDs at 4 since 0-3 are reserved for special tokens
        for i, (word, count) in enumerate(counts.most_common(self.max_vocab_size - 4), start=4):
            self.word_to_id[word] = i
            self.id_to_word[i] = word

    def encode(self, text, add_special_tokens=False):
        """Convert text to token IDs.
        
        add_special_tokens: If True, add <BOS> at start and <EOS> at end
        """
        words = text.lower().split()
        
        # Look up each word, fallback to <UNK> (ID=1) if not found
        # .get(word, 1) returns 1 if word isn't in our vocabulary
        ids = [self.word_to_id.get(w, 1) for w in words]
        
        if add_special_tokens:
            ids = [2] + ids + [3]  # [<BOS>] + text + [<EOS>]
        
        return ids

    def decode(self, ids, skip_special_tokens=True):
        """Convert token IDs back to text.
        
        skip_special_tokens: If True, don't output <PAD>, <BOS>, etc.
        Why? You don't want output like: "<BOS> Hello world <EOS>"
        """
        words = []
        for i in ids:
            word = self.id_to_word.get(i, "<UNK>")
            # Skip special tokens in output if requested
            if skip_special_tokens and word in ["<PAD>", "<BOS>", "<EOS>"]:
                continue
            words.append(word)
        
        # Join with spaces (unlike char tokenizer which used "".join)
        return " ".join(words)

    @property
    def vocab_size(self):
        """Current vocabulary size (number of unique tokens)."""
        return len(self.word_to_id)

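The batching role of `<PAD>` described earlier can be sketched in a few lines. This is a minimal illustration rather than part of `WordTokenizer`: the helper name `pad_batch` and the token IDs in `batch` are made up, and only the convention that `<PAD>` has ID 0 comes from the class above.

```python
PAD_ID = 0  # matches the <PAD> ID reserved by WordTokenizer

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one.

    Returns (padded, mask); mask holds 1 for real tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [[4, 5, 6], [4, 7], [4, 5, 6, 8, 9]]  # made-up token IDs
padded, mask = pad_batch(batch)
print(padded)  # [[4, 5, 6, 0, 0], [4, 7, 0, 0, 0], [4, 5, 6, 8, 9]]
print(mask)    # [[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 1, 1]]
```

The mask is what lets a model ignore the padded positions, so all rows can share one rectangular shape.
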
### Try It: Complete Example with Unknown Words

In [None]:
# Create tokenizer with small vocab to force unknowns
tokenizer = WordTokenizer(max_vocab_size=10)

# Train on limited text
training_text = """
The cat sat on the mat.
The cat was on the mat.
The dog sat on the mat.
"""
tokenizer.fit(training_text)

print(f"Vocabulary: {tokenizer.word_to_id}")

# Encode a sentence with known words
text1 = "the cat sat"
ids1 = tokenizer.encode(text1)
print(f"\n'{text1}' → {ids1}")
print(f"Decoded: '{tokenizer.decode(ids1)}'")

# Encode with unknown word
text2 = "the elephant sat"  # "elephant" not in vocab!
ids2 = tokenizer.encode(text2)
print(f"\n'{text2}' → {ids2}")
print(f"Decoded: '{tokenizer.decode(ids2)}'")

# Try with special tokens
ids3 = tokenizer.encode("the cat", add_special_tokens=True)
print(f"\nWith special tokens: {ids3}")
print(f"Decoded (showing special): '{tokenizer.decode(ids3, skip_special_tokens=False)}'")
print(f"Decoded (hiding special): '{tokenizer.decode(ids3, skip_special_tokens=True)}'")

### Comparison Exercise: See the Tradeoff

In [None]:
text = "The quick brown fox jumps over the lazy dog"

# Character tokenizer
char_tok = CharTokenizer()
char_tok.fit(text)
char_ids = char_tok.encode(text)

# Word tokenizer
word_tok = WordTokenizer(max_vocab_size=20)
word_tok.fit(text)
word_ids = word_tok.encode(text)

print("CHARACTER TOKENIZER:")
print(f"  Vocab size: {char_tok.vocab_size}")
print(f"  Sequence length: {len(char_ids)}")
print(f"  Tokens: {char_ids[:20]}...")

print("\nWORD TOKENIZER:")
print(f"  Vocab size: {word_tok.vocab_size}")
print(f"  Sequence length: {len(word_ids)}")
print(f"  Tokens: {word_ids}")

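The `WordTokenizer` docstring names embedding-table size as the cost of a large vocabulary. A rough sketch of that arithmetic follows, assuming the table stores one vector per token. The width of 768 is a hypothetical value chosen only for illustration, and `embedding_params` is a made-up helper.

```python
def embedding_params(vocab_size, embedding_dim):
    """Number of weights in an embedding table: one vector per token."""
    return vocab_size * embedding_dim

EMBEDDING_DIM = 768  # hypothetical width, for illustration only

for vocab_size in (1_000, 10_000, 100_000):
    params = embedding_params(vocab_size, EMBEDDING_DIM)
    print(f"vocab {vocab_size:>7,} -> {params:>11,} embedding weights")
```

Under these assumptions the table grows linearly with the vocabulary, which is the pressure that pushes against simply adding every word.
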
## How Production Tokenizers Work: BPE (Byte-Pair Encoding)

Modern LLMs use **subword tokenization**, a clever middle ground between characters and words:

**The Problem:**
- Character tokenizers: Too many tokens per text (slow, expensive)
- Word tokenizers: Can't handle new words ("ChatGPT" → `<UNK>`)

**The Solution: BPE (Byte-Pair Encoding)**

BPE learns subwords automatically by repeatedly merging the most frequent pair of adjacent symbols (single characters at first, then the pieces produced by earlier merges):

```
Step 1: Start with characters: ["l", "o", "w", "e", "r"]
Step 2: Most frequent pair is ("l", "o") → merge to "lo"
Step 3: Most frequent pair is ("lo", "w") → merge to "low"
Step 4: Continue until vocab size reached...
```

**Result:** Common words become single tokens, rare words split into known pieces:
- "lower" → ["low", "er"] ✓ (common, efficient)
- "lowest" → ["low", "est"] ✓ (inflected form handled!)
- "ChatGPT" → ["Chat", "G", "PT"] ✓ (no `<UNK>` needed!)

**Key Insight:** Byte-level BPE, the variant behind GPT-2 and tiktoken, never needs `<UNK>`: its base vocabulary covers every possible byte, so any text can be broken into known pieces!

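The merge loop described above can be sketched in plain Python. This is a toy version under stated assumptions, not how tiktoken or GPT-2 is implemented: it works on characters rather than bytes, skips pre-tokenization, and breaks ties by taking the first pair seen. The word frequencies are invented, and `merge_pair`, `learn_bpe`, and `segment` are hypothetical names. Replaying the learned merges on "lowest" yields the split shown earlier, and a word outside the toy corpus ("slow") still splits into known pieces.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a {word: frequency} dict."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges

def segment(word, merges):
    """Split a word into subwords by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Made-up toy corpus: word -> frequency
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=4)

print(merges)                     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segment("lowest", merges))  # ['low', 'est']
print(segment("slow", merges))    # ['s', 'low']
```
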
---

## 3. Production Tokenizers: tiktoken

Now let's use OpenAI's tiktoken library to load the tokenizer used by GPT-4.

In [None]:
import tiktoken

# Load GPT-4's tokenizer encoding
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4, GPT-3.5-turbo

# Encode text
text = "Hello, world! How are you?"
tokens = enc.encode(text)

print(f"Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

# Decode back
decoded = enc.decode(tokens)
print(f"Decoded: '{decoded}'")
print(f"Perfect round-trip: {text == decoded}")

### See the Actual Token Strings

In [None]:
# Decode each token individually to see what it represents
token_strings = [enc.decode([t]) for t in tokens]

print(f"Token breakdown:")
for token_id, token_str in zip(tokens, token_strings):
    print(f"  {token_id:5d} → '{token_str}'")

### Token Counting for API Budgeting

In [None]:
prompts = [
    "Write a haiku about programming.",
    "Explain quantum computing in simple terms for a 10-year-old child.",
    "Generate a 500-word essay on climate change."
]

enc = tiktoken.get_encoding("cl100k_base")

for prompt in prompts:
    tokens = enc.encode(prompt)
    # Example pricing (check current rates at openai.com/api/pricing)
    cost = len(tokens) * 0.00001  # $0.01 per 1K input tokens
    
    print(f"Prompt: '{prompt}'")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Cost (input): ~${cost:.5f}\n")

## 4. Production Tokenizers: Hugging Face Transformers

**Hugging Face** is a company whose `transformers` library provides pre-trained models and tokenizers. `AutoTokenizer.from_pretrained` loads the tokenizer that matches the model name you pass in.

The Hugging Face tokenizer returns a dictionary with:
- `input_ids`: The token IDs (what we care about most)
- `attention_mask`: 1s for real tokens, 0s for padding (tells model what to ignore)

In [None]:
from transformers import AutoTokenizer

# Load GPT-2 tokenizer (open-source)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, world! How are you?"

# Encode - returns dict with token IDs and attention mask
encoded = tokenizer(text, return_tensors="pt")  # "pt" = PyTorch tensors

print(f"Text: '{text}'")
print(f"Token IDs: {encoded['input_ids']}")
print(f"Attention mask: {encoded['attention_mask']}")

# Decode
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"Decoded: '{decoded}'")

### Token Breakdown

In [None]:
tokens = encoded['input_ids'][0].tolist()
token_strings = [tokenizer.decode([t]) for t in tokens]

print(f"Token breakdown:")
for tid, tstr in zip(tokens, token_strings):
    print(f"  {tid:5d} → '{tstr}'")

## 5. Tokenization Quirks

### Quirk #1: Leading Spaces Change Everything

In [None]:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Compare with and without leading space
texts = ["Hello", " Hello", "world", " world"]

for text in texts:
    tokens = enc.encode(text)
    print(f"'{text}' → {tokens} ({len(tokens)} token{'s' if len(tokens) > 1 else ''})")

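One way to see why the leading space matters: GPT-style tokenizers first cut text into chunks with a regular expression, and a single preceding space stays attached to the word that follows it. The pattern below is a simplified, ASCII-only stand-in written for this illustration. It is an assumption, not the pattern tiktoken ships, which also handles Unicode classes and contractions.

```python
import re

# Simplified stand-in for a GPT-style pre-tokenization pattern (assumption).
# Each alternative allows one optional leading space before the chunk.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    """Split text into chunks, keeping a single leading space with each chunk."""
    return PRETOKENIZE.findall(text)

print(pretokenize("Hello world"))    # ['Hello', ' world']
print(pretokenize(" Hello world"))   # [' Hello', ' world']
print(pretokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

Because the space is part of the chunk, "Hello" and " Hello" are different strings before BPE even starts, so they can end up with different token IDs.
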
### Quirk #2: Numbers Split by Digits

In [None]:
numbers = ["10", "100", "1000", "10000", "42", "2024"]

enc = tiktoken.get_encoding("cl100k_base")

for num in numbers:
    tokens = enc.encode(num)
    token_strs = [enc.decode([t]) for t in tokens]
    print(f"'{num}' → {tokens} = {token_strs}")

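A rough model of the splitting seen above, assuming the pre-tokenization step caps each run of digits at three characters, as the published `cl100k_base` pattern appears to do. The helper `chunk_digits` is hypothetical and only mimics that chunking. It does not call tiktoken, so compare its output with the cell above before relying on it.

```python
import re

def chunk_digits(number_text, max_len=3):
    """Split a digit string left to right into chunks of at most max_len digits."""
    return re.findall(rf"[0-9]{{1,{max_len}}}", number_text)

for num in ["10", "100", "1000", "10000", "42", "2024"]:
    print(f"{num!r} -> {chunk_digits(num)}")
```

Under this assumption BPE merges cannot cross chunk boundaries, so a long number costs at least one token per chunk, and the chunks ("202" and "4" for 2024) rarely line up with how people read the number.
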
### Quirk #3: Emoji and Special Characters

In [None]:
emojis = ["😀", "🚀", "👍", "Hello 😀 world", "🔥🔥🔥"]

enc = tiktoken.get_encoding("cl100k_base")

for text in emojis:
    tokens = enc.encode(text)
    token_strs = [enc.decode([t]) for t in tokens]
    print(f"'{text}' ‚Üí {len(tokens)} tokens: {token_strs}")

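Emoji tend to cost more tokens because byte-level tokenizers work on UTF-8 bytes, and an emoji takes several of them. The standard-library sketch below shows the byte counts and what happens when only part of a character's bytes is decoded. The helper `utf8_bytes` is a hypothetical name, and no tokenizer is involved.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values for text."""
    return list(text.encode("utf-8"))

for ch in ["A", "é", "😀"]:
    print(ch, len(utf8_bytes(ch)), utf8_bytes(ch))
# A 1 [65]
# é 2 [195, 169]
# 😀 4 [240, 159, 152, 128]

# Two of the four bytes are not valid UTF-8 on their own, so a lenient
# decode substitutes the replacement character U+FFFD.
partial = "😀".encode("utf-8")[:2]
lossy = partial.decode("utf-8", errors="replace")
print("\ufffd" in lossy)  # True
```

If a tokenizer splits those four bytes across more than one token, decoding the tokens one at a time shows replacement characters even though decoding them together restores the emoji.
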
## 6. Practical Exercise: Tokenize Your Dataset

Connect Chapter 7's dataset to tokenization. The cell below uses a small inline example; swap in the records from your chapter7_output.jsonl file to tokenize your own data:

In [None]:
import json
import tiktoken

# Example dataset (replace with your Chapter 7 output)
example_dataset = [
    {"text": "AI systems learn from examples", "split": "train"},
    {"text": "Neural networks need lots of data", "split": "train"},
    {"text": "Deep learning uses multiple layers", "split": "val"}
]

# Tokenize each example
enc = tiktoken.get_encoding("cl100k_base")

for record in example_dataset:
    text = record["text"]
    tokens = enc.encode(text)
    record["token_ids"] = tokens
    record["token_count"] = len(tokens)

print(f"Tokenized {len(example_dataset)} examples")

# Compute statistics
token_counts = [r["token_count"] for r in example_dataset]
avg_tokens = sum(token_counts) / len(token_counts)
max_tokens = max(token_counts)
min_tokens = min(token_counts)

print(f"\nStatistics:")
print(f"  Average tokens per example: {avg_tokens:.1f}")
print(f"  Max tokens: {max_tokens}")
print(f"  Min tokens: {min_tokens}")

# Show first example
print(f"\nFirst example:")
print(json.dumps(example_dataset[0], indent=2))

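To swap the inline example for a real file, something like the following could work. It is a sketch that assumes the file holds one JSON object per line with a `text` field, as in `example_dataset`; the actual schema of chapter7_output.jsonl is not shown here. The helpers `load_jsonl` and `save_jsonl` are hypothetical, and the demo writes to a temporary file so it runs without the Chapter 7 output.

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

def save_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Demo with a temporary file standing in for chapter7_output.jsonl
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    save_jsonl([{"text": "AI systems learn from examples", "split": "train"}], demo_path)
    loaded = load_jsonl(demo_path)

print(loaded)
```

With your own data, `load_jsonl("chapter7_output.jsonl")` would take the place of `example_dataset`, and `save_jsonl` can write the records back out once `token_ids` and `token_count` have been added.
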
## Chapter Summary

**What we built:**

1. **Character tokenizer:** Simple but inefficient (tiny vocab ~100, long sequences)
2. **Word tokenizer:** Efficient sequences but huge vocab and unknown word problems
3. **Production tools:** Used tiktoken and Hugging Face for real-world tokenization

**What we learned:**

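- Character-level and byte-level BPE tokenization round-trip losslessly; our word tokenizer does not, since it lowercases text and maps unknown words to `<UNK>`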
- The vocab size vs sequence length tradeoff is fundamental
- Special tokens serve specific purposes (BOS/EOS/PAD/UNK)
- BPE learns subwords automatically by merging frequent pairs
- Tokenization has quirks (leading spaces, number splitting, emoji)

**Next:** Chapter 9 will convert these token IDs to embedding vectors!