# Lab 2.3.4: Tokenization Lab

**Module:** 2.3 - Natural Language Processing & Transformers  
**Time:** 3 hours  
**Difficulty:** ⭐⭐

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why tokenization is crucial for NLP
- [ ] Implement Byte Pair Encoding (BPE) from scratch
- [ ] Train your own tokenizer on custom text
- [ ] Compare tokenizers from GPT-2, BERT, and LLaMA
- [ ] Understand the trade-offs of different vocabulary sizes

---

## Prerequisites

- Completed: Labs 2.3.1-2.3.3
- Basic Python string manipulation
- Understanding of dictionaries and collections

---

## Real-World Context

**Tokenization affects everything in NLP:**
- **Context length**: More efficient tokenization = more text in context window
- **Cost**: APIs charge per token (e.g., ~$0.002 per 1K tokens for `gpt-3.5-turbo` at its launch)
- **Multilingual**: Some tokenizers handle languages better than others
- **Code**: Programming syntax needs special handling

A word like "unbelievable" might be:
- 1 token (word-level)
- 3 tokens: "un" + "believ" + "able" (subword)
- 12 tokens (character-level)
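
The three counts above can be checked with a few lines of plain Python. This is only a sketch: the subword split is written by hand to match the example, and a real tokenizer may split the word differently.

```python
word = "unbelievable"

# Word-level: the whole word is one token (if it is in the vocabulary).
word_level = [word]

# Subword-level: hand-picked split for illustration, not a real tokenizer's output.
subword_level = ["un", "believ", "able"]

# Character-level: one token per character.
char_level = list(word)

for name, toks in [("word", word_level), ("subword", subword_level), ("char", char_level)]:
    print(f"{name:8s} {len(toks):2d} tokens: {toks}")
```

All three token lists spell the same word; they differ only in how many pieces the model has to process.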

---

## ELI5: What is Tokenization?

> **Imagine you're creating a recipe for a robot chef.**
>
> You can't just say "make a sandwich" - the robot needs specific ingredients:
> - Option 1: Individual letters ("m", "a", "k", "e", " ", "a", " ", "s", ...)
>   - Very flexible, but 100s of steps for simple tasks!
> - Option 2: Full recipes ("make a sandwich", "bake a cake")
>   - Fast, but can't combine or create new dishes
> - Option 3: **Ingredients** ("bread", "cheese", "spread", "layer")
>   - Best of both worlds! Can make many dishes efficiently
>
> **Subword tokenization** is like using ingredients instead of letters or full recipes.
> - Common words stay whole: "the", "is", "and"
> - Rare words break into pieces: "un" + "believ" + "able"
> - New words can be built from known pieces!

---

## Part 1: Setup and Imports

In [None]:
import re
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import numpy as np
import platform

# We'll also use the transformers library for pre-trained tokenizers
HAS_TRANSFORMERS = False
try:
    from transformers import AutoTokenizer
    HAS_TRANSFORMERS = True
    print("✅ transformers already installed")
except ImportError:
    # Check if we're on ARM64 (DGX Spark)
    if platform.machine() == 'aarch64':
        print("⚠️ ARM64 detected (DGX Spark). transformers should be pre-installed in NGC container.")
        print("If missing, restart with: nvcr.io/nvidia/pytorch:25.11-py3 which includes transformers")
        print("Continuing without transformers library...")
    else:
        print("Installing transformers...")
        !pip install transformers -q
        try:
            from transformers import AutoTokenizer
            HAS_TRANSFORMERS = True
            print("✅ transformers installed successfully")
        except ImportError:
            print("❌ Could not install transformers")

print(f"\nSetup complete! transformers available: {HAS_TRANSFORMERS}")

---

## Part 2: Tokenization Approaches

Let's explore the three main approaches before diving deep into BPE.

In [None]:
# Sample text
text = "The quick brown fox jumps over the lazy dog. AI is transforming the world!"

# 1. Character-level tokenization
char_tokens = list(text)
print(f"Character-level: {len(char_tokens)} tokens")
print(f"  Tokens: {char_tokens[:20]}...")
print()

# 2. Word-level tokenization
word_tokens = text.split()
print(f"Word-level: {len(word_tokens)} tokens")
print(f"  Tokens: {word_tokens}")
print()

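# 3. Subword tokenization (using GPT-2 tokenizer)
# Guarded so the cell still runs when transformers could not be imported in Part 1
if HAS_TRANSFORMERS:
    gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
    subword_tokens = gpt2_tokenizer.tokenize(text)
    print(f"Subword (GPT-2): {len(subword_tokens)} tokens")
    print(f"  Tokens: {subword_tokens}")
else:
    print("Subword (GPT-2): skipped (transformers not available)")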

In [None]:
# The problem with word-level: OOV (Out of Vocabulary)

rare_words = "Transformers use self-attention mechanisms for contextualization."

# Simple word vocabulary from common words
simple_vocab = {"the", "a", "is", "for", "use", "and", "to", "of"}

words = rare_words.lower().split()
print("Word-level problem:")
for word in words:
    # Remove punctuation for checking
    clean_word = word.strip(".,!?")
    if clean_word in simple_vocab:
        print(f"  '{word}' -> ✓ In vocabulary")
    else:
        print(f"  '{word}' -> ✗ OOV (unknown word!)")

print("\nSubword solution:")
if HAS_TRANSFORMERS:
    tokens = gpt2_tokenizer.tokenize(rare_words)
    print(f"  {tokens}")
    print("  Every piece is known! No OOV problem.")
else:
    print("  Skipped (transformers not available)")
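
Why does GPT-2 never hit an unknown word? Its tokenizer is a byte-level BPE: the base alphabet is the 256 possible byte values, so any string can be broken down into known symbols before merges are applied. The sketch below uses only the standard library to show that base decomposition; `to_byte_symbols` is a hypothetical helper name, not part of any tokenizer API.

```python
def to_byte_symbols(text):
    """Return the UTF-8 bytes of `text` as a list of ints in 0..255."""
    return list(text.encode("utf-8"))

# "\u00ef" is the precomposed character "ï"
samples = ["contextualization", "na\u00efve", "你好"]

for s in samples:
    symbols = to_byte_symbols(s)
    # Every symbol is one of the 256 base values, so nothing is out of vocabulary
    assert all(0 <= b < 256 for b in symbols)
    print(f"{s!r}: {len(s)} chars -> {len(symbols)} byte symbols")
```

Non-Latin scripts need several bytes per character, which is one reason byte-level tokenizers tend to spend more tokens on them.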

---

## Part 3: Byte Pair Encoding (BPE) from Scratch

### ELI5: How BPE Works

> **Start with characters, and repeatedly merge the most common pairs.**
>
> Imagine you're creating shortcuts for common letter combinations:
> 1. Start: "l o w" "l o w e r" "n e w e s t" "w i d e s t"
> 2. Find a most common pair: ("e", "s") appears 2 times (several pairs tie at 2 in this tiny example, so we simply pick one)
> 3. Create new token: "es" and replace all occurrences
> 4. Now: "l o w" "l o w e r" "n e w es t" "w i d es t"
> 5. Repeat until you have enough tokens!
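
The walkthrough above can be reproduced in a few lines. This is a minimal sketch of a single merge step on the same four words, assuming each word occurs once; `count_pairs` and `merge` are illustrative helper names.

```python
from collections import Counter

# Toy corpus from the walkthrough; each word is a list of symbols, seen once.
corpus = [
    ["l", "o", "w"],
    ["l", "o", "w", "e", "r"],
    ["n", "e", "w", "e", "s", "t"],
    ["w", "i", "d", "e", "s", "t"],
]

def count_pairs(words):
    """Count every adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs

def merge(words, pair):
    """Replace each occurrence of `pair` with the concatenated symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

pairs = count_pairs(corpus)
print("count of ('e', 's'):", pairs[("e", "s")])

merged = merge(corpus, ("e", "s"))
print([" ".join(w) for w in merged])
```

With real data, word frequencies break most ties; here the pair is chosen by hand to match the walkthrough.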

In [None]:
class SimpleBPE:
    """
    Simple Byte Pair Encoding implementation.
    
    This is a teaching implementation - not optimized for production!
    """
    
    def __init__(self):
        self.merges = {}  # (a, b) -> new_token
        self.vocab = {}   # token -> id
        
    def _get_word_freqs(self, text):
        """Split text into words and count frequencies."""
        # Simple word splitting
        words = re.findall(r"\w+|[^\w\s]", text.lower())
        word_freqs = Counter(words)
        return word_freqs
    
    def _word_to_chars(self, word):
        """Convert word to list of characters with end-of-word marker."""
        return list(word) + ["</w>"]
    
    def _get_pair_freqs(self, word_freqs, word_tokens):
        """Count frequency of adjacent pairs."""
        pair_freqs = Counter()
        
        for word, freq in word_freqs.items():
            tokens = word_tokens[word]
            for i in range(len(tokens) - 1):
                pair = (tokens[i], tokens[i + 1])
                pair_freqs[pair] += freq
                
        return pair_freqs
    
    def _merge_pair(self, word_tokens, pair):
        """Merge a pair in all words."""
        new_word_tokens = {}
        
        for word, tokens in word_tokens.items():
            new_tokens = []
            i = 0
            while i < len(tokens):
                if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                    # Merge this pair
                    new_tokens.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            new_word_tokens[word] = new_tokens
            
        return new_word_tokens
    
    def train(self, text, num_merges=100, verbose=True):
        """
        Train BPE on text.
        
        Args:
            text: Training text
            num_merges: Number of merge operations
            verbose: Print progress
        """
        # Get word frequencies
        word_freqs = self._get_word_freqs(text)
        
        # Initialize with characters
        word_tokens = {word: self._word_to_chars(word) for word in word_freqs}
        
        # Get initial vocabulary
        vocab = set()
        for tokens in word_tokens.values():
            vocab.update(tokens)
        
        if verbose:
            print(f"Initial vocabulary size: {len(vocab)}")
            print(f"Unique words: {len(word_freqs)}")
            print()
        
        # Perform merges
        for i in range(num_merges):
            # Find most frequent pair
            pair_freqs = self._get_pair_freqs(word_freqs, word_tokens)
            
            if not pair_freqs:
                break
                
            best_pair = pair_freqs.most_common(1)[0][0]
            best_freq = pair_freqs[best_pair]
            
            # Merge the pair
            word_tokens = self._merge_pair(word_tokens, best_pair)
            
            # Add to merges
            new_token = best_pair[0] + best_pair[1]
            self.merges[best_pair] = new_token
            vocab.add(new_token)
            
            if verbose and (i + 1) % 10 == 0:
                print(f"Merge {i+1}: {best_pair} -> '{new_token}' (freq: {best_freq})")
        
        # Build final vocabulary
        self.vocab = {token: i for i, token in enumerate(sorted(vocab))}
        
        if verbose:
            print(f"\nFinal vocabulary size: {len(self.vocab)}")
            print(f"Number of merges: {len(self.merges)}")
    
    def tokenize(self, word):
        """Tokenize a single word using learned merges."""
        tokens = self._word_to_chars(word.lower())
        
        # Replay merges in the order they were learned.
        # (dicts preserve insertion order, so self.merges is already in priority order)
        for pair, merged in self.merges.items():
            if len(tokens) < 2:
                break
            new_tokens = []
            i = 0
            while i < len(tokens):
                if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens
            
        return tokens

# Training text (simple example)
training_text = """
The quick brown fox jumps over the lazy dog.
The dog was very lazy and the fox was quick.
Jumping foxes are quicker than sleeping dogs.
The lazy dog dreamed of quick brown foxes.
""" * 10  # Repeat for more data

# Train BPE
bpe = SimpleBPE()
bpe.train(training_text, num_merges=30, verbose=True)

In [None]:
# Test our BPE tokenizer

test_words = ["the", "quick", "jumping", "foxes", "laziest", "unquickly"]

print("Testing our BPE tokenizer:")
print("=" * 50)

for word in test_words:
    tokens = bpe.tokenize(word)
    print(f"  '{word}' -> {tokens}")

### What Just Happened?

1. We started with individual characters (plus the end-of-word marker `</w>`) as our vocabulary
2. We found the most frequent pair of adjacent tokens
3. We merged them into a new token
4. We repeated this for `num_merges` steps, adding one new token per merge

Frequent sequences like "the" now end up as single tokens; with more merges and more data, common subwords like "ing" and "er" do too!
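
The order of the learned merges matters when tokenizing a new word: earlier merges have priority and are replayed first. The sketch below shows this with a standalone helper and a hypothetical merge list (not the one learned above), so the effect is easy to trace by hand.

```python
def apply_merges(word, merges):
    """Tokenize `word` by replaying `merges` (a list of pairs) in priority order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, in the order a trainer might have learned it.
merges = [("t", "h"), ("th", "e"), ("the", "</w>"), ("e", "r")]

the_tokens = apply_merges("the", merges)
other_tokens = apply_merges("other", merges)
print(the_tokens)
print(other_tokens)
```

In "other", the "e" is absorbed into "the" before the lower-priority ("e", "r") merge gets a chance, so "er" never forms.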

---

## Part 4: Comparing Real Tokenizers

In [None]:
# Load different tokenizers
tokenizers = {}

if HAS_TRANSFORMERS:
    try:
        tokenizers["GPT-2"] = AutoTokenizer.from_pretrained("gpt2")
        print("✅ Loaded GPT-2 tokenizer")
    except Exception as e:
        print(f"⚠️ Could not load GPT-2 tokenizer: {e}")

    try:
        tokenizers["BERT"] = AutoTokenizer.from_pretrained("bert-base-uncased")
        print("✅ Loaded BERT tokenizer")
    except Exception as e:
        print(f"⚠️ Could not load BERT tokenizer: {e}")

    # Try to load LLaMA tokenizer with graceful fallback
    # Note: the meta-llama repositories are gated, so loading requires an
    # accepted license and a Hugging Face access token
    try:
        tokenizers["LLaMA"] = AutoTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-hf",
            legacy=False,
            token=True  # Use the token saved by `huggingface-cli login` (or HF_TOKEN)
        )
        print("✅ Loaded LLaMA tokenizer")
    except Exception as e:
        print(f"⚠️ Could not load LLaMA tokenizer: {e}")
        print("   To use LLaMA tokenizer:")
        print("   1. Create account at huggingface.co")
        print("   2. Accept LLaMA license at: https://huggingface.co/meta-llama/Llama-2-7b-hf")
        print("   3. Run: huggingface-cli login")
        print("   Continuing without LLaMA tokenizer...")

if tokenizers:
    print(f"\nLoaded {len(tokenizers)} tokenizers:")
    for name, tok in tokenizers.items():
        print(f"  {name}: vocab size = {tok.vocab_size:,}")
else:
    print("\n⚠️ No tokenizers loaded. Install transformers to use this section.")

In [None]:
# Compare tokenization of the same text

test_texts = [
    "Hello, world!",
    "The transformer architecture revolutionized NLP.",
    "def fibonacci(n): return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)",
    "Tokenization is crucial for language models.",
    "supercalifragilisticexpialidocious",  # Long rare word
]

for text in test_texts:
    print(f"\nText: '{text}'")
    print("-" * 60)
    
    for name, tokenizer in tokenizers.items():
        tokens = tokenizer.tokenize(text)
        ids = tokenizer.encode(text, add_special_tokens=False)
        print(f"  {name}:")
        print(f"    Tokens ({len(tokens)}): {tokens}")
        print(f"    IDs: {ids}")

In [None]:
# Efficiency comparison

long_text = """
Artificial intelligence has made remarkable progress in recent years, particularly 
in the field of natural language processing. Large language models like GPT-4, 
Claude, and LLaMA have demonstrated impressive capabilities in understanding and 
generating human-like text. These models use the transformer architecture, which 
relies on self-attention mechanisms to process sequences of tokens.

The tokenization process is crucial for these models. It converts raw text into 
a sequence of tokens that the model can process. Different tokenization strategies 
have different trade-offs in terms of vocabulary size, token efficiency, and 
handling of rare or unknown words.
""" * 3

print(f"Text length: {len(long_text)} characters")
print("\nToken counts:")
print("=" * 40)

token_counts = {}
for name, tokenizer in tokenizers.items():
    tokens = tokenizer.encode(long_text, add_special_tokens=False)
    token_counts[name] = len(tokens)
    chars_per_token = len(long_text) / len(tokens)
    print(f"  {name}: {len(tokens)} tokens ({chars_per_token:.1f} chars/token)")

# Visualize
plt.figure(figsize=(10, 5))
plt.bar(token_counts.keys(), token_counts.values(), color=['blue', 'green', 'red'][:len(token_counts)])
plt.ylabel('Number of Tokens')
plt.title('Token Efficiency Comparison\n(Fewer tokens = more efficient)')
plt.xticks(rotation=45)
for i, (name, count) in enumerate(token_counts.items()):
    plt.text(i, count + 5, str(count), ha='center')
plt.tight_layout()
plt.show()

---

## Part 5: Special Tokens

In [None]:
# Explore special tokens

for name, tokenizer in tokenizers.items():
    print(f"\n{name} Special Tokens:")
    print("=" * 40)
    
    special_tokens = {
        'BOS (Beginning of Sequence)': tokenizer.bos_token,
        'EOS (End of Sequence)': tokenizer.eos_token,
        'PAD (Padding)': tokenizer.pad_token,
        'UNK (Unknown)': tokenizer.unk_token,
        'SEP (Separator)': getattr(tokenizer, 'sep_token', None),
        'CLS (Classification)': getattr(tokenizer, 'cls_token', None),
        'MASK': getattr(tokenizer, 'mask_token', None),
    }
    
    for token_name, token in special_tokens.items():
        if token:
            token_id = tokenizer.convert_tokens_to_ids(token)
            print(f"  {token_name}: '{token}' (id: {token_id})")

In [None]:
# How BERT uses special tokens

bert_tokenizer = tokenizers.get("BERT")

if bert_tokenizer:
    text = "Hello, how are you?"
    
    # Without special tokens
    tokens_no_special = bert_tokenizer.tokenize(text)
    ids_no_special = bert_tokenizer.encode(text, add_special_tokens=False)
    
    # With special tokens
    ids_with_special = bert_tokenizer.encode(text, add_special_tokens=True)
    tokens_with_special = bert_tokenizer.convert_ids_to_tokens(ids_with_special)
    
    print("BERT tokenization:")
    print(f"  Text: '{text}'")
    print(f"  Without special tokens: {tokens_no_special}")
    print(f"  With special tokens:    {tokens_with_special}")
    print(f"\n  [CLS] = start of sequence")
    print(f"  [SEP] = end of sequence / separator")
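
For sentence pairs, BERT's input layout is `[CLS] A [SEP] B [SEP]`, with segment (token type) IDs marking which sentence each token belongs to. The tokenizer builds this for you when you pass two texts; the sketch below assembles it by hand from hand-written token lists, purely to make the layout visible. `build_bert_pair` is a hypothetical helper.

```python
def build_bert_pair(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] and the matching segment (token type) IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Hand-written token lists for illustration; a real tokenizer would produce these.
pair_tokens, pair_segments = build_bert_pair(["hello", "there"], ["how", "are", "you", "?"])

for tok, seg in zip(pair_tokens, pair_segments):
    print(f"{tok:8s} segment {seg}")
```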

---

## Part 6: Training a Custom Tokenizer

Let's train a proper tokenizer using the `tokenizers` library.

In [None]:
# Try to import tokenizers library
HAS_TOKENIZERS = False
try:
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace
    HAS_TOKENIZERS = True
    print("✅ tokenizers library available")
except ImportError:
    # Check if we're on ARM64 (DGX Spark)
    import platform
    if platform.machine() == 'aarch64':
        print("⚠️ ARM64 detected (DGX Spark).")
        print("   For NGC container pytorch:25.11-py3, tokenizers should be included.")
        print("   If missing, ensure you're using the correct container version.")
        print("   Skipping custom tokenizer training sections...")
    else:
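        print("Installing tokenizers...")
        import subprocess
        import sys
        # Use the running interpreter's pip so the package lands in this environment
        result = subprocess.run([sys.executable, '-m', 'pip', 'install', 'tokenizers', '-q'], capture_output=True)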
        try:
            from tokenizers import Tokenizer
            from tokenizers.models import BPE
            from tokenizers.trainers import BpeTrainer
            from tokenizers.pre_tokenizers import Whitespace
            HAS_TOKENIZERS = True
            print("✅ tokenizers installed successfully")
        except ImportError:
            print("❌ Could not install tokenizers")

if HAS_TOKENIZERS:
    # Create a larger training corpus
    training_corpus = [
        "Machine learning is a subset of artificial intelligence.",
        "Deep learning uses neural networks with many layers.",
        "Transformers revolutionized natural language processing.",
        "Attention mechanisms allow models to focus on relevant parts.",
        "Pre-training on large datasets improves model performance.",
        "Fine-tuning adapts models to specific tasks.",
        "Tokenization converts text to numerical representations.",
        "Embeddings capture semantic meaning of words.",
        "Self-attention computes relationships between all tokens.",
        "Language models predict the next word in a sequence.",
    ] * 100  # Repeat for more data

    # Initialize tokenizer
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Create trainer
    trainer = BpeTrainer(
        vocab_size=500,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    )

    # Train on our corpus
    tokenizer.train_from_iterator(training_corpus, trainer=trainer)

    print(f"Trained tokenizer with vocabulary size: {tokenizer.get_vocab_size()}")
else:
    print("⚠️ Skipping custom tokenizer training (tokenizers library not available)")

In [None]:
# Test our custom tokenizer

if HAS_TOKENIZERS:
    test_texts = [
        "Machine learning is fascinating.",
        "Transformers use self-attention mechanisms.",
        "This is a completely new sentence about quantum computing.",
    ]

    print("Testing custom tokenizer:")
    print("=" * 60)

    for text in test_texts:
        output = tokenizer.encode(text)
        print(f"\nText: '{text}'")
        print(f"  Tokens: {output.tokens}")
        print(f"  IDs:    {output.ids}")
else:
    print("⚠️ Custom tokenizer testing skipped (tokenizers library not available)")

---

## Part 7: Vocabulary Size Trade-offs

In [None]:
# Experiment with different vocabulary sizes

if HAS_TOKENIZERS:
    vocab_sizes = [100, 500, 1000, 5000]
    test_text = """
    Natural language processing has evolved significantly with the introduction of 
    transformer-based architectures. These models leverage self-attention mechanisms 
    to capture long-range dependencies in text, enabling impressive performance on 
    various downstream tasks.
    """

    results = []

    for vocab_size in vocab_sizes:
        # Train new tokenizer
        tok = Tokenizer(BPE(unk_token="[UNK]"))
        tok.pre_tokenizer = Whitespace()
        trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
        tok.train_from_iterator(training_corpus, trainer=trainer)
        
        # Tokenize test text
        encoded = tok.encode(test_text)
        num_tokens = len(encoded.tokens)
        num_unk = encoded.tokens.count("[UNK]")
        
        results.append({
            'vocab_size': vocab_size,
            'num_tokens': num_tokens,
            'num_unk': num_unk,
            'chars_per_token': len(test_text) / num_tokens
        })
        
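        # The small corpus may yield fewer tokens than requested, so report the actual size too
        print(f"Vocab size {vocab_size} (actual: {tok.get_vocab_size()}): {num_tokens} tokens, {num_unk} [UNK], "
              f"{len(test_text)/num_tokens:.1f} chars/token")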

    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    ax = axes[0]
    ax.plot([r['vocab_size'] for r in results], [r['num_tokens'] for r in results], 'bo-')
    ax.set_xlabel('Vocabulary Size')
    ax.set_ylabel('Number of Tokens')
    ax.set_title('Token Count vs Vocabulary Size\n(Larger vocab = fewer tokens)')
    ax.grid(True, alpha=0.3)

    ax = axes[1]
    ax.plot([r['vocab_size'] for r in results], [r['chars_per_token'] for r in results], 'go-')
    ax.set_xlabel('Vocabulary Size')
    ax.set_ylabel('Characters per Token')
    ax.set_title('Token Efficiency vs Vocabulary Size\n(Higher = more efficient)')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Vocabulary size experiment skipped (tokenizers library not available)")
    print("This section requires the tokenizers library to train BPE tokenizers.")

### Trade-offs Summary:

| Vocab Size | Tokens per Text | Embedding Memory | OOV Risk |
|------------|-----------------|------------------|----------|
| Small (1K) | Many | Low | High |
| Medium (30K) | Balanced | Moderate | Low |
| Large (100K+) | Fewer | High | Very Low |

Many widely used models fall in the 30K-100K range:
- GPT-2: 50,257
- BERT (`bert-base-uncased`): 30,522
- LLaMA (1 and 2): 32,000
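
The "Embedding Memory" column can be made concrete: an embedding table holds `vocab_size × hidden_dim` parameters. The sketch below is a rough estimate under stated assumptions. The hidden sizes (768 for GPT-2 small and BERT-base, 4096 for LLaMA-7B) come from the published model configurations rather than from this lab, and 2 bytes per parameter assumes fp16 storage.

```python
def embedding_params(vocab_size, hidden_dim):
    """Number of parameters in a vocab_size x hidden_dim embedding table."""
    return vocab_size * hidden_dim

# (vocab_size, hidden_dim); hidden sizes are assumptions taken from published configs
models = {
    "GPT-2 (small)": (50257, 768),
    "BERT-base": (30522, 768),
    "LLaMA-7B": (32000, 4096),
}

BYTES_PER_PARAM = 2  # assumption: fp16 weights

for name, (vocab_size, hidden_dim) in models.items():
    params = embedding_params(vocab_size, hidden_dim)
    megabytes = params * BYTES_PER_PARAM / 1e6
    print(f"{name:14s} {params:>12,} params  ~{megabytes:,.0f} MB")
```

Doubling the vocabulary doubles this table (and the output projection, when it is not tied to the embeddings), which is the memory side of the trade-off.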

---

## Try It Yourself: Exercises

### Exercise 1: Multilingual Tokenization

Compare how different tokenizers handle non-English text.

In [None]:
# Test multilingual tokenization
multilingual_texts = [
    "Hello, world!",                    # English
    "Bonjour le monde!",                # French
    "Hola, mundo!",                     # Spanish
    "Hallo, Welt!",                     # German
    "Привет мир!",                      # Russian
    "你好世界！",                         # Chinese
    "こんにちは世界！",                    # Japanese
]

# YOUR CODE HERE
# Compare token counts for each language across different tokenizers
# Which tokenizer is most efficient for which languages?

### Exercise 2: Code Tokenization

Analyze how different tokenizers handle programming code.

In [None]:
code_samples = [
    'print("Hello, World!")',
    'def factorial(n): return 1 if n <= 1 else n * factorial(n-1)',
    'class MyClass:\n    def __init__(self, value):\n        self.value = value',
]

# YOUR CODE HERE
# How do different tokenizers handle code?
# Look for patterns in how they split identifiers, keywords, and operators

---

## Common Mistakes

### Mistake 1: Not handling special characters

In [None]:
# Problem: Special characters can cause issues
problematic_text = "This costs $100 (25% off!) @store #sale"

for name, tok in tokenizers.items():
    tokens = tok.tokenize(problematic_text)
    print(f"{name}: {tokens}")

print("\nNote: Different tokenizers handle symbols differently!")

### Mistake 2: Ignoring token limits

In [None]:
# Each model has a maximum token limit!

model_limits = {
    "GPT-2": 1024,
    "BERT": 512,
    "GPT-4": 8192,  # or 32K/128K for extended versions
    "Claude": 100000,
}

long_text = "word " * 1000  # 1000 words

gpt2_tok = tokenizers.get("GPT-2")
if gpt2_tok:
    tokens = gpt2_tok.encode(long_text)
    print(f"Text with 1000 words produces {len(tokens)} tokens")
    
    # Note: the count comes from the GPT-2 tokenizer. Other models tokenize
    # differently, so the comparison is only approximate for them.
    for model, limit in model_limits.items():
        status = "✓ OK" if len(tokens) <= limit else "✗ EXCEEDS LIMIT"
        print(f"  {model} (limit {limit}): {status}")
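
When a text exceeds the limit, one common workaround is to split the token IDs into overlapping windows instead of silently truncating. The sketch below shows the idea on a plain list of integers standing in for token IDs; `chunk_ids` and its `overlap` parameter are hypothetical names, not part of any tokenizer API.

```python
def chunk_ids(ids, max_len, overlap=0):
    """Split token IDs into windows of at most `max_len`, sharing `overlap` IDs."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        # Stop once a window has reached the end of the sequence
        if start + max_len >= len(ids):
            break
    return chunks

# Stand-in for 1200 token IDs; a BERT-sized window of 512 with 64 IDs of overlap
fake_ids = list(range(1200))
chunks = chunk_ids(fake_ids, max_len=512, overlap=64)
print([len(c) for c in chunks])
```

The overlap keeps some shared context between neighbouring windows, at the cost of processing those tokens twice.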

### Mistake 3: Forgetting to add special tokens

In [None]:
# BERT requires [CLS] and [SEP] tokens!

bert_tok = tokenizers.get("BERT")
if bert_tok:
    text = "This is a test."
    
    # Wrong: no special tokens
    wrong_ids = bert_tok.encode(text, add_special_tokens=False)
    
    # Right: with special tokens
    right_ids = bert_tok.encode(text, add_special_tokens=True)
    
    print(f"Without special tokens: {bert_tok.convert_ids_to_tokens(wrong_ids)}")
    print(f"With special tokens:    {bert_tok.convert_ids_to_tokens(right_ids)}")
    print("\n⚠️ BERT expects [CLS] at start and [SEP] at end!")

---

## Checkpoint

You've learned:
- ✅ Why tokenization is necessary for NLP
- ✅ Character, word, and subword tokenization
- ✅ How Byte Pair Encoding (BPE) works
- ✅ Differences between GPT-2, BERT, and LLaMA tokenizers
- ✅ Special tokens and their purposes
- ✅ Vocabulary size trade-offs

---

## Challenge (Optional)

Implement **WordPiece** tokenization (used by BERT), which differs from BPE:
- BPE merges based on frequency
- WordPiece merges based on likelihood improvement
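
To see how the two criteria can disagree, the sketch below ranks two candidate pairs both ways, using the score `freq(ab) / (freq(a) * freq(b))`. The counts are made up for illustration, and this covers only the scoring step, not a full WordPiece trainer.

```python
from collections import Counter

# Hypothetical counts for illustration only.
symbol_freq = Counter({"t": 100, "h": 60, "q": 5, "u": 40})
pair_freq = Counter({("t", "h"): 50, ("q", "u"): 5})

def wordpiece_score(pair):
    """Pair frequency normalized by the frequencies of its two parts."""
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

best_by_freq = max(pair_freq, key=pair_freq.get)   # BPE-style choice
best_by_score = max(pair_freq, key=wordpiece_score)  # WordPiece-style choice

print("Most frequent pair:", best_by_freq)
print("Highest-scoring pair:", best_by_score)
```

The normalized score favours pairs whose parts rarely occur apart ("q" is almost always followed by "u" here), even when the pair itself is less frequent.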

In [None]:
# Challenge: Implement WordPiece tokenization
# Hint: Instead of counting pair frequencies, compute:
# score(a, b) = freq(ab) / (freq(a) * freq(b))

# YOUR CODE HERE

---

## Further Reading

- [Hugging Face Tokenizers](https://huggingface.co/docs/tokenizers/) - Fast tokenizers library
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909) - BPE paper
- [SentencePiece](https://arxiv.org/abs/1808.06226) - Unsupervised text tokenizer
- [Let's Build a Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE) - Andrej Karpathy's video

---

## Cleanup

In [None]:
# Clean up
import gc

# Safely delete variables that may or may not exist
for var_name in ['tokenizers', 'tokenizer', 'bpe']:
    if var_name in globals():
        del globals()[var_name]

gc.collect()

print("Memory cleared! Ready for the next notebook.")

---

## Next Up

In **Notebook 05: BERT Fine-tuning**, we'll:
- Load a pre-trained BERT model
- Fine-tune it for sentiment classification
- Evaluate performance and understand transfer learning

---

*Great job! Tokenization is the essential first step for any NLP model.*