# Week 8 Lab: Tokenization & Vocabulary

## Learning Objectives
- Understand different tokenization strategies (word, character, subword)
- Implement Byte-Pair Encoding (BPE) from scratch
- Compare tokenizers from popular NLP libraries
- Analyze vocabulary efficiency and coverage

## Prerequisites
```bash
pip install transformers torch numpy matplotlib sentencepiece
```

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import re

# Setup
print('Week 8: Tokenization & Vocabulary')
print('=' * 50)

## Part 1: Basic Tokenization Strategies

Let's compare three fundamental approaches:
1. **Word-level**: Split on whitespace/punctuation
2. **Character-level**: Each character is a token
3. **Subword-level**: Split rare words into smaller, frequent pieces, balancing vocabulary size against sequence length
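
To make the tradeoff concrete, here is a minimal sketch of the out-of-vocabulary problem that motivates subword methods. The two strings and the `train_text`/`unseen_text` names are made up for illustration: a word-level vocabulary built from one sentence misses most words in another, while the character-level vocabulary still covers every character.

```python
import re

def word_tokenize(text):
    # Lowercase and keep runs of word characters
    return re.findall(r"\w+", text.lower())

# Hypothetical example strings, not part of the lab data
train_text = "tokenization is the first step in most pipelines"
unseen_text = "the tokenizer splits items"

word_vocab = set(word_tokenize(train_text))
char_vocab = set(train_text.lower())

# Words and characters of the unseen text that the vocabularies cannot represent
oov_words = [w for w in word_tokenize(unseen_text) if w not in word_vocab]
oov_chars = sorted({c for c in unseen_text.lower() if c not in char_vocab})

print(f"Word vocab size: {len(word_vocab)}, char vocab size: {len(char_vocab)}")
print(f"Out-of-vocabulary words: {oov_words}")
print(f"Out-of-vocabulary characters: {oov_chars}")
```

The character vocabulary pays for its coverage with much longer sequences, which is the gap subword tokenization tries to close.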

In [None]:
# Sample text for tokenization
sample_text = """Natural language processing enables computers to understand human language.
Tokenization is the first step in most NLP pipelines.
Different tokenization strategies have different tradeoffs."""

# Word-level tokenization
def word_tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# Character-level tokenization
def char_tokenize(text):
    return list(text)

word_tokens = word_tokenize(sample_text)
char_tokens = char_tokenize(sample_text)

print(f"Original text length: {len(sample_text)} characters")
print(f"\nWord tokens ({len(word_tokens)} tokens):")
print(word_tokens[:15], '...')
print(f"\nCharacter tokens ({len(char_tokens)} tokens):")
print(char_tokens[:30], '...')

In [None]:
# Vocabulary analysis
word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

print("Vocabulary Comparison:")
print(f"  Word vocabulary size: {len(word_vocab)}")
print(f"  Character vocabulary size: {len(char_vocab)}")
print("\nCompression ratio (tokens/chars, lower means shorter sequences):")
print(f"  Word: {len(word_tokens)/len(sample_text):.3f}")
print(f"  Character: {len(char_tokens)/len(sample_text):.3f}")

## Part 2: Byte-Pair Encoding (BPE) from Scratch

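BPE underlies many modern subword tokenizers. It starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new token. Let's implement it step by step.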
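
Before the full class, it may help to see a single merge step in isolation. The sketch below uses a toy corpus with made-up word counts (the `corpus`, `count_pairs`, and `merge_pair` names are illustrative, not part of the lab code): it counts adjacent symbol pairs weighted by word frequency, picks the most frequent pair, and merges it everywhere.

```python
from collections import Counter

# Toy corpus with made-up counts: each word is a tuple of symbols
# ending in the end-of-word marker
corpus = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

pair_counts = count_pairs(corpus)
# Ties are broken by first occurrence, because max() returns the first maximum
best_pair = max(pair_counts, key=pair_counts.get)
new_corpus = {merge_pair(symbols, best_pair): freq for symbols, freq in corpus.items()}

print(f"Most frequent pair: {best_pair} (count {pair_counts[best_pair]})")
for symbols, freq in new_corpus.items():
    print(freq, " ".join(symbols))
```

Repeating this step a fixed number of times, and recording each chosen pair in order, is all that training amounts to in this simplified setting.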

In [None]:
class SimpleBPE:
    """Simple Byte-Pair Encoding implementation"""
    
    def __init__(self, num_merges=100):
        self.num_merges = num_merges
        self.merges = {}  # (pair) -> merged token
        self.vocab = set()
    
    def get_stats(self, vocab):
        """Count frequency of adjacent pairs"""
        pairs = defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i+1])] += freq
        return pairs
    
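    def merge_vocab(self, pair, vocab):
        """Merge every occurrence of `pair` in the vocabulary"""
        new_vocab = {}
        # Match the pair only at symbol boundaries; a plain str.replace would
        # also match 'a b' inside 'xa b', where 'xa' is a single symbol
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        replacement = ''.join(pair)
        
        for word in vocab:
            # A function replacement avoids backslash interpretation in re.sub
            new_word = pattern.sub(lambda _: replacement, word)
            new_vocab[new_word] = vocab[word]
        return new_vocab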
    
    def fit(self, text):
        """Learn BPE merges from text"""
        # Initialize vocabulary with character-level tokens
        words = text.lower().split()
        word_freqs = Counter(words)
        
        # Add end-of-word marker and split into characters
        vocab = {' '.join(list(word) + ['</w>']): freq 
                 for word, freq in word_freqs.items()}
        
        print(f"Initial vocabulary: {len(set(' '.join(vocab.keys()).split()))} tokens")
        
        # Iteratively merge most frequent pairs
        for i in range(self.num_merges):
            pairs = self.get_stats(vocab)
            if not pairs:
                break
            
            best_pair = max(pairs, key=pairs.get)
            vocab = self.merge_vocab(best_pair, vocab)
            self.merges[best_pair] = ''.join(best_pair)
            
            if (i + 1) % 20 == 0:
                print(f"Merge {i+1}: {best_pair} -> {''.join(best_pair)} (freq: {pairs[best_pair]})")
        
        # Build final vocabulary
        self.vocab = set(' '.join(vocab.keys()).split())
        print(f"\nFinal vocabulary size: {len(self.vocab)} tokens")
        return self
    
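    def tokenize(self, word):
        """Tokenize a single word using learned merges"""
        word = list(word.lower()) + ['</w>']
        # Dicts preserve insertion order, so a pair's position in self.merges
        # is the order in which it was learned (its merge priority)
        ranks = {pair: rank for rank, pair in enumerate(self.merges)}
        
        while len(word) > 1:
            pairs = [(word[i], word[i+1]) for i in range(len(word)-1)]
            
            # Keep only pairs that have a learned merge
            mergeable = [p for p in pairs if p in ranks]
            if not mergeable:
                break
            
            # Apply the earliest-learned merge first, mirroring training order
            pair = min(mergeable, key=ranks.get)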
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i+1]) == pair:
                    new_word.append(self.merges[pair])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            word = new_word
        
        return word

In [None]:
# Train BPE on sample text
training_text = """the cat sat on the mat
the dog ran in the park
cats and dogs are popular pets
running and sitting are activities
the quick brown fox jumps over the lazy dog"""

bpe = SimpleBPE(num_merges=50)
bpe.fit(training_text)

In [None]:
# Test tokenization
test_words = ['cat', 'cats', 'running', 'unknown', 'tokenization']

print("BPE Tokenization Results:")
print("-" * 40)
for word in test_words:
    tokens = bpe.tokenize(word)
    print(f"{word:15} -> {tokens}")
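
The `</w>` marker is what makes the tokenization reversible: it records where each word ended. As a minimal sketch (the `bpe_decode` helper and the token lists below are hypothetical, not output from the trained model above), decoding concatenates each word's tokens and strips the marker:

```python
def bpe_decode(token_lists):
    """Join per-word BPE tokens back into a whitespace-separated string."""
    marker = "</w>"
    words = []
    for tokens in token_lists:
        word = "".join(tokens)
        # The end-of-word marker is always the suffix of the last token
        if word.endswith(marker):
            word = word[: -len(marker)]
        words.append(word)
    return " ".join(words)

# Hypothetical tokenizations, one list per word
encoded = [["the</w>"], ["c", "at", "s</w>"], ["r", "un", "n", "ing</w>"]]
print(bpe_decode(encoded))
```

Note that this only recovers the lowercased, whitespace-normalized text, since `SimpleBPE` discards case and the original spacing.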

## Part 3: Using HuggingFace Tokenizers

Let's compare our implementation with production tokenizers.
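
Not every production tokenizer segments text the way BPE does. BERT's WordPiece tokenizer, for example, splits a word by greedy longest-match-first lookup against a fixed vocabulary and marks continuation pieces with `##`. The sketch below illustrates that lookup with a tiny hypothetical vocabulary; the real tokenizer also normalizes text and splits on punctuation first, which is omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical; a real BERT vocabulary has roughly 30,000 entries)
toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}

for word in ["tokenization", "tokens", "unknown", "xyz"]:
    print(f"{word:15} -> {wordpiece_tokenize(word, toy_vocab)}")
```

Keep the `##` prefix in mind when reading the BERT output in the next cell: pieces without it start a new word.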

In [None]:
from transformers import AutoTokenizer

# Load different tokenizers
tokenizers = {
    'BERT': AutoTokenizer.from_pretrained('bert-base-uncased'),
    'GPT-2': AutoTokenizer.from_pretrained('gpt2'),
    'T5': AutoTokenizer.from_pretrained('t5-small'),
}

test_sentence = "Tokenization is fundamental to natural language processing."

print(f"Test sentence: '{test_sentence}'")
print("=" * 60)

for name, tokenizer in tokenizers.items():
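    tokens = tokenizer.tokenize(test_sentence)
    # encode() also adds model-specific special tokens (e.g. [CLS]/[SEP] for BERT),
    # so the ID list can be longer than the token list
    ids = tokenizer.encode(test_sentence)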
    print(f"\n{name} ({tokenizer.__class__.__name__}):")
    print(f"  Vocab size: {tokenizer.vocab_size:,}")
    print(f"  Tokens ({len(tokens)}): {tokens}")
    print(f"  Token IDs: {ids}")

In [None]:
# Visualize tokenization differences
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sentences = [
    "Hello world",
    "Tokenization is important",
    "Supercalifragilisticexpialidocious"
]

for ax, sentence in zip(axes, sentences):
    token_counts = [len(tok.tokenize(sentence)) for tok in tokenizers.values()]
    # Pass a list rather than a dict view so the category labels are an explicit sequence
    bars = ax.bar(list(tokenizers.keys()), token_counts, color=['#3333B2', '#FF7F0E', '#2CA02C'])
    ax.set_title(f'"{sentence}"', fontsize=10)
    ax.set_ylabel('Number of tokens')
    
    # Add value labels
    for bar, count in zip(bars, token_counts):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                str(count), ha='center', fontsize=11, fontweight='bold')

plt.suptitle('Token Count Comparison Across Tokenizers', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 4: Vocabulary Analysis

Let's analyze how different tokenizers handle various text types.

In [None]:
# Analyze tokenization of different text types
test_cases = {
    'Regular English': 'The quick brown fox jumps over the lazy dog.',
    'Technical': 'The API endpoint returns JSON with UTF-8 encoding.',
    'Numbers': 'In 2024, the model achieved 99.5% accuracy on 10,000 samples.',
    'Code-like': 'def calculate_loss(y_pred, y_true): return mse(y_pred, y_true)',
    'Rare words': 'Pneumonoultramicroscopicsilicovolcanoconiosis is a lung disease.',
}

results = []
for case_name, text in test_cases.items():
    row = {'Case': case_name, 'Text': text[:30] + '...'}
    for tok_name, tokenizer in tokenizers.items():
        tokens = tokenizer.tokenize(text)
        row[f'{tok_name}_tokens'] = len(tokens)
        row[f'{tok_name}_ratio'] = len(tokens) / len(text.split())
    results.append(row)

# Display results
print("Tokenization Analysis")
print("=" * 80)
for r in results:
    print(f"\n{r['Case']}:")
    print(f"  Text: {r['Text']}")
    for tok_name in tokenizers.keys():
        print(f"  {tok_name}: {r[f'{tok_name}_tokens']} tokens ({r[f'{tok_name}_ratio']:.2f} per whitespace-separated word)")

## Part 5: Special Tokens and Vocabulary

Models reserve special tokens for padding, unknown input, and sequence boundaries. Let's see which ones each tokenizer defines.
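
As a rough illustration of what these tokens are for, the sketch below wraps a token list in a BERT-style `[CLS] ... [SEP]` frame, pads it to a fixed length, and builds the matching attention mask. The `add_special_tokens` helper and the sample tokens are hypothetical; real tokenizers do this internally when encoding.

```python
def add_special_tokens(tokens, max_len, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Frame, truncate, and pad a token list; return (tokens, attention_mask)."""
    # Reserve two positions for [CLS] and [SEP]; truncate the rest if needed
    sequence = [cls] + tokens[: max_len - 2] + [sep]
    attention_mask = [1] * len(sequence)  # 1 = real token
    padding = max_len - len(sequence)
    # 0 in the mask tells the model to ignore padding positions
    return sequence + [pad] * padding, attention_mask + [0] * padding

padded, mask = add_special_tokens(["token", "##ization", "is", "fun"], max_len=8)
print(padded)
print(mask)
```

Which of these roles a model actually uses differs by architecture, which is what the comparison in the next cell shows.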

In [None]:
print("Special Tokens Comparison")
print("=" * 60)

for name, tokenizer in tokenizers.items():
    print(f"\n{name}:")
    special = {
        'PAD': getattr(tokenizer, 'pad_token', None),
        'UNK': getattr(tokenizer, 'unk_token', None),
        'CLS': getattr(tokenizer, 'cls_token', None),
        'SEP': getattr(tokenizer, 'sep_token', None),
        'MASK': getattr(tokenizer, 'mask_token', None),
        'BOS': getattr(tokenizer, 'bos_token', None),
        'EOS': getattr(tokenizer, 'eos_token', None),
    }
    for token_type, token in special.items():
        if token:
            token_id = tokenizer.convert_tokens_to_ids(token)
            print(f"  {token_type}: '{token}' (ID: {token_id})")

## Exercises

1. **Extend BPE**: Modify the SimpleBPE class to handle punctuation properly
2. **Vocabulary Analysis**: Calculate the percentage of unknown tokens for each tokenizer on a corpus
3. **Compression Ratio**: Compare compression ratios across different languages
4. **Custom Tokenizer**: Train a SentencePiece tokenizer on a custom corpus
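
One possible starting point for the first exercise is a pre-tokenization step that separates punctuation from words before BPE sees them. This is only a sketch under a simple assumption (every non-word, non-space character becomes its own token); the `pre_tokenize` name is illustrative, and contractions such as "don't" would be split into three pieces.

```python
import re

def pre_tokenize(text):
    """Split text into lowercase words and standalone punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(pre_tokenize("Cats, dogs, and pets!"))
```

Using this in place of `text.lower().split()` inside `SimpleBPE.fit` would stop tokens like `dogs,` from being treated as distinct words.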

In [None]:
# Exercise 2: Calculate unknown token percentage
def calculate_unk_percentage(tokenizer, text):
    """Calculate percentage of tokens that map to the unknown-token ID"""
    unk_id = tokenizer.unk_token_id
    if unk_id is None:
        return 0.0
    # Compare IDs rather than token strings: some tokenizers only map a
    # piece to UNK when it is converted to an ID
    ids = tokenizer.encode(text, add_special_tokens=False)
    return 100 * ids.count(unk_id) / len(ids) if ids else 0.0

# Test with rare words
rare_text = "The xyzzy plugh was discovered in the colossal cave."
for name, tokenizer in tokenizers.items():
    unk_pct = calculate_unk_percentage(tokenizer, rare_text)
    print(f"{name}: {unk_pct:.1f}% unknown tokens")
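
When interpreting these numbers, it helps to remember that GPT-2 uses a byte-level variant of BPE: its base symbols are bytes, so any string can be represented without an unknown token. The sketch below shows only that fallback idea using plain UTF-8 bytes (the `byte_tokenize` helper is illustrative and performs no merges).

```python
def byte_tokenize(text):
    """Represent text as a list of UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))

for sample in ["cat", "naïve", "日本語"]:
    ids = byte_tokenize(sample)
    # Round trip: the byte IDs always decode back to the original string
    assert bytes(ids).decode("utf-8") == sample
    print(f"{sample!r}: {len(sample)} characters -> {len(ids)} byte tokens")
```

The cost is length: non-ASCII characters take several bytes each, so languages outside the Latin alphabet start from longer sequences before any merges are applied.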

## Summary

In this lab, we explored:

1. **Basic tokenization strategies**: Word, character, and subword approaches
2. **BPE algorithm**: Implemented from scratch to understand merge operations
3. **Production tokenizers**: Compared BERT (WordPiece), GPT-2 (byte-level BPE), and T5 (SentencePiece)
4. **Vocabulary analysis**: Understood how tokenizers handle different text types
5. **Special tokens**: Learned about model-specific tokens (PAD, UNK, CLS, etc.)

**Key Takeaways**:
- Subword tokenization balances vocabulary size with sequence length
- Different models use different tokenization algorithms
- Tokenization choices determine sequence length and how numbers, code, and rare words are split, which in turn affects model performance