# Chapter 5: Tokenization - The Gateway to Transformer Understanding

This notebook contains all the tokenization examples from article5.md, plus the advanced examples from Chapter 3 whose explanation was deferred to Chapter 5.

## Table of Contents
1. [Environment Setup](#setup)
2. [Basic Tokenization](#basic)
3. [Tokenization Algorithms (BPE, WordPiece, Unigram)](#algorithms)
4. [Custom Tokenization](#custom)
5. [Debugging and Visualization](#debugging)
6. [Multimodal Tokenization](#multimodal)
7. [Chapter 3 Advanced Examples](#chapter3)
8. [Exercises](#exercises)

## 1. Environment Setup <a id='setup'></a>

First, let's set up our environment and import necessary libraries.

In [None]:
# Import required libraries
from transformers import (
    AutoTokenizer, 
    AutoModel,
    AutoImageProcessor,
    CLIPProcessor,
    AutoProcessor
)
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from tokenizers.processors import TemplateProcessing
import torch
import numpy as np
from PIL import Image
import requests
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Check device
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

## 2. Basic Tokenization <a id='basic'></a>

Let's start with the fundamental concepts of tokenization.

### 2.1 Basic Tokenization Example

Tokenization converts raw text into tokens and numerical IDs that models can process.

In [None]:
# Load a pre-trained fast tokenizer (BERT)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Transformers are revolutionizing AI!"

# Tokenize and prepare model inputs in one step
encoded = tokenizer(text)
print('Input IDs:', encoded['input_ids'])
print('Tokens:', tokenizer.convert_ids_to_tokens(encoded['input_ids']))

# For direct tensor output (e.g., for PyTorch models):
tensor_inputs = tokenizer(text, return_tensors="pt")
print('\nTensor Input IDs:', tensor_inputs['input_ids'])
print('Tensor shape:', tensor_inputs['input_ids'].shape)

### 2.2 Multilingual Tokenization with Emojis

Modern tokenizers need to handle multiple languages and special characters like emojis.

In [None]:
# Load multilingual tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

text = "Transformers están revolucionando la IA! 🚀"

# Tokenize and map to IDs in one step (recommended)
encoded = tokenizer(text, return_tensors='pt')
print('Input IDs:', encoded['input_ids'])
print('Tokens:', tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))

# Inspect special tokens
print('\nSpecial tokens:', tokenizer.special_tokens_map)

### 2.3 Batch Tokenization with Padding and Alignment

For efficient processing, we often tokenize multiple texts at once.

In [None]:
# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentences = [
    "Tokenization is fun!",
    "Let's build smarter models."
]

# Tokenize the batch, including alignment info
encoded = tokenizer(
    sentences,
    padding=True,                # Pad to the longest sentence
    truncation=True,             # Truncate if too long
    return_tensors='pt',         # PyTorch tensors
    return_offsets_mapping=True  # Get character-to-token alignment
)

print('Input IDs shape:', encoded['input_ids'].shape)
print('\nInput IDs:')
print(encoded['input_ids'])
print('\nAttention Mask:')
print(encoded['attention_mask'])

# Show tokens for each sentence
for i, sentence in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][i])
    print(f'\nSentence {i+1} tokens: {tokens}')
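
What `padding=True` does to a batch can be sketched without any library. This is a toy illustration only: the helper name `pad_batch` and the token IDs are made up, and right-padding with ID 0 is an assumption that happens to match BERT's defaults.

```python
PAD_ID = 0  # assumed padding ID for this sketch

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two made-up sequences of different lengths
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

The attention mask is what lets the model treat the padded positions as absent rather than as real input.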

### 2.4 Special Token Handling

Special tokens like [CLS], [SEP], and custom tokens are crucial for many transformer tasks.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Inspect current special tokens
print('Special tokens:', tokenizer.special_tokens_map)
print('\nSpecial token IDs:')
for token_name, token in tokenizer.special_tokens_map.items():
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(f"  {token_name}: '{token}' -> ID: {token_id}")

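# Add custom special tokens if needed (a model used with this tokenizer must then
# be resized with model.resize_token_embeddings(len(tokenizer)))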
special_tokens_dict = {'additional_special_tokens': ['<CUSTOM>', '<MEDICAL>']}
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f'\nAdded {num_added} special tokens.')

# Visualize tokenization with special tokens
text = "Classify this sentence."
encoded = tokenizer(text)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'])
print(f'\nTokens with Special Tokens: {tokens}')
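
The way BERT-style special tokens wrap one or two sequences can be sketched by hand. The helper `build_inputs` below is hypothetical and works on already-split tokens; it only mirrors the `[CLS] A [SEP] B [SEP]` layout and the 0/1 segment IDs, not the real tokenizer internals.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT-style special tokens."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)          # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)  # second segment and its closing [SEP]
    return tokens, type_ids

single_tokens, single_types = build_inputs(["classify", "this", "sentence", "."])
pair_tokens, pair_types = build_inputs(["what", "is", "it", "?"], ["a", "token"])
print(single_tokens, single_types)
print(pair_tokens, pair_types)
```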

## 3. Tokenization Algorithms <a id='algorithms'></a>

Let's explore the three main tokenization algorithms: BPE, WordPiece, and Unigram.

### 3.1 Byte Pair Encoding (BPE) - Used by GPT, RoBERTa

In [None]:
# Load RoBERTa's BPE tokenizer
bpe_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

text = 'unhappiness'
tokens = bpe_tokenizer.tokenize(text)
print('BPE Tokens:', tokens)

# Show how BPE handles various words
test_words = ['tokenization', 'transformer', 'preprocessing', 'pneumothorax']
print('\nBPE tokenization examples:')
for word in test_words:
    tokens = bpe_tokenizer.tokenize(word)
    print(f"  '{word}' -> {tokens}")
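
The merge procedure behind BPE can be sketched in a few lines. This is a toy, character-level version under simplifying assumptions (no byte-level alphabet, no word-boundary marker); the function name `learn_bpe` and the word frequencies are invented for illustration and say nothing about how `roberta-base` was actually trained.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy, character-level)."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_freqs, num_merges=2)
print("Merges:", merges)
print("Segmented words:", list(segmented))
```

After two merges the frequent suffix `est` has become a single symbol, which is the same effect that lets a trained BPE vocabulary keep common word parts intact.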

### 3.2 WordPiece - Used by BERT

In [None]:
# Load BERT's WordPiece tokenizer
wordpiece_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = 'unhappiness'
tokens = wordpiece_tokenizer.tokenize(text)
print('WordPiece Tokens:', tokens)
print('Notice the ## prefix for subword continuations!')

# Show how WordPiece handles various words
print('\nWordPiece tokenization examples:')
for word in test_words:
    tokens = wordpiece_tokenizer.tokenize(word)
    print(f"  '{word}' -> {tokens}")
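
At inference time, WordPiece segments each word by greedy longest-match-first lookup against its vocabulary. The sketch below shows that lookup on a hypothetical six-entry vocabulary (`toy_vocab`); the function name is made up, and the real BERT vocabulary will generally split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##happi", "##ness", "##happy", "token", "##ization"}
print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```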

### 3.3 Unigram - Used by XLNet, ALBERT, XLM-RoBERTa

In [None]:
# Load XLM-RoBERTa's Unigram tokenizer
unigram_tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

text = 'unhappiness'
tokens = unigram_tokenizer.tokenize(text)
print('Unigram Tokens:', tokens)

# Show how Unigram handles various words
print('\nUnigram tokenization examples:')
for word in test_words:
    tokens = unigram_tokenizer.tokenize(word)
    print(f"  '{word}' -> {tokens}")
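
A Unigram tokenizer chooses the segmentation with the highest total probability under a unigram language model over pieces. The Viterbi search for that segmentation can be sketched as follows; the piece probabilities in `toy_probs` are invented, and the function name `unigram_segment` is an assumption for this example.

```python
import math

def unigram_segment(text, log_probs):
    """Most probable segmentation of `text` under a unigram piece model (Viterbi)."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n   # best log-probability of text[:i]
    best_start = [0] * (n + 1)             # where the last piece of text[:i] begins
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        return None  # the vocabulary cannot cover this text
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

toy_probs = {"un": 0.1, "happi": 0.05, "ness": 0.1, "happiness": 0.001, "u": 0.01, "n": 0.01}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}
print(unigram_segment("unhappiness", toy_log_probs))
```

Unlike BPE and WordPiece, the choice here is global: `un + happi + ness` wins because its combined probability beats `un + happiness`, not because of any greedy rule.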

### 3.4 Algorithm Comparison

Let's compare all three algorithms side by side.

In [None]:
# Compare algorithms on various text types
test_texts = [
    "unhappiness",
    "I love pizza! 🍕🔥",
    "COVID-19 pandemic",
    "user@example.com",
    "myocardial infarction"
]

tokenizers = {
    'BPE (RoBERTa)': AutoTokenizer.from_pretrained('roberta-base'),
    'WordPiece (BERT)': AutoTokenizer.from_pretrained('bert-base-uncased'),
    'Unigram (XLM-R)': AutoTokenizer.from_pretrained('xlm-roberta-base')
}

for text in test_texts:
    print(f"\nTokenizing: '{text}'")
    for name, tokenizer in tokenizers.items():
        tokens = tokenizer.tokenize(text)
        print(f"  {name}: {tokens} (length: {len(tokens)})")

## 4. Custom Tokenization <a id='custom'></a>

For specialized domains, you might need to train your own tokenizer.

### 4.1 Training a Custom Tokenizer (Simple Method)

In [None]:
# Domain-specific medical texts
medical_texts = [
    "Patient exhibits signs of pneumothorax.",
    "CT scan reveals bilateral infiltrates.",
    "Myocardial infarction confirmed via ECG.",
    "Administered 5mg of morphine for pain management.",
    "Post-operative recovery progressing normally.",
    "CBC shows elevated white blood cell count.",
    "MRI indicates herniated disc at L4-L5.",
    "Patient history includes hypertension and diabetes.",
    "Prescribed antibiotics for bacterial infection.",
    "Radiology report shows no acute findings.",
    "Chronic obstructive pulmonary disease exacerbation.",
    "Electrocardiogram shows atrial fibrillation."
]

# Start with a base tokenizer as template
base_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Train a new tokenizer on domain data
custom_tokenizer = base_tokenizer.train_new_from_iterator(
    medical_texts,
    vocab_size=1000,
)

# Test the custom tokenizer
test_text = "Patient exhibits signs of pneumothorax."
print("Original BERT tokenization:")
print(base_tokenizer.tokenize(test_text))
print("\nCustom medical tokenization:")
print(custom_tokenizer.tokenize(test_text))

### 4.2 Training a Custom BPE Tokenizer (Advanced)

In [None]:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from tokenizers.processors import TemplateProcessing

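# Initialize a tokenizer with a BPE model; setting unk_token maps characters unseen
# during training to [UNK] instead of silently dropping them
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))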

# Pre-tokenization (splitting on whitespace and punctuation)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train the tokenizer
trainer = trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

# Train from our medical corpus
tokenizer.train_from_iterator(medical_texts, trainer=trainer)

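# Add post-processing for BERT-style tokens
# (look the IDs up instead of hard-coding them, so they always match the trained vocabulary)
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)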

# Test the custom BPE tokenizer
test_text = "Patient with myocardial infarction"
encoding = tokenizer.encode(test_text)
print(f"BPE tokens: {encoding.tokens}")
print(f"BPE IDs: {encoding.ids}")

### 4.3 Comparing General vs Domain-Specific Tokenization

In [None]:
# Medical terms that might be split differently
medical_terms = [
    "pneumothorax",
    "myocardial",
    "electrocardiogram",
    "thrombocytopenia",
    "cholecystectomy"
]

# Load general tokenizer
general_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

print("Comparing tokenization of medical terms:")
print("=" * 50)
for term in medical_terms:
    general_tokens = general_tokenizer.tokenize(term)
    custom_tokens = custom_tokenizer.tokenize(term)
    
    print(f"\n'{term}':")
    print(f"  General BERT: {general_tokens} (length: {len(general_tokens)})")
    print(f"  Custom Medical: {custom_tokens} (length: {len(custom_tokens)})")

## 5. Debugging and Visualization <a id='debugging'></a>

Understanding how tokenization works is crucial for debugging NLP pipelines.

### 5.1 Visualizing Tokenization with Offsets

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

text = "Let's test: 🤖 transformers!"
output = tokenizer(
    text,
    return_offsets_mapping=True,
    return_tensors=None
)
tokens = tokenizer.convert_ids_to_tokens(output['input_ids'])
offsets = output['offset_mapping']

print(f"Original text: '{text}'")
print("\nToken breakdown:")
print("-" * 50)
for token, (start, end) in zip(tokens, offsets):
    if start == end:  # Special tokens
        print(f"  {token:15} [SPECIAL TOKEN]")
    else:
        print(f"  {token:15} [{start:2}, {end:2}] -> '{text[start:end]}'")
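
Offsets are simply `(start, end)` character spans into the original string. A minimal sketch with a regex-based splitter (not BERT's tokenizer; the helper names are hypothetical) shows how such spans are produced and how they can be used in the other direction, from a character position to a token index.

```python
import re

def tokenize_with_offsets(text):
    """Split into words and punctuation, keeping (token, start, end) spans."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def token_at_char(spans, char_index):
    """Return the index of the token covering `char_index`, or None for whitespace."""
    for i, (_, start, end) in enumerate(spans):
        if start <= char_index < end:
            return i
    return None

sample = "Let's test: transformers!"
spans = tokenize_with_offsets(sample)
for token, start, end in spans:
    # Slicing the original text with the span recovers the token exactly
    print(f"{token:15} [{start:2}, {end:2}] -> '{sample[start:end]}'")
print("Token index at character 13:", token_at_char(spans, 13))
```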

### 5.2 Detecting Tokenizer-Model Mismatch

In [None]:
# Example: Using a mismatched tokenizer and model
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
roberta_model = AutoModel.from_pretrained('roberta-base')

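text = "Tokenization mismatch!"
inputs = bert_tokenizer(text, return_tensors='pt')

print("Using BERT tokenizer with RoBERTa model:")
print(f"BERT tokens: {bert_tokenizer.tokenize(text)}")
print(f"BERT input IDs: {inputs['input_ids'][0].tolist()}")
print(f"BERT vocab size: {len(bert_tokenizer)} vs. RoBERTa embedding rows: {roberta_model.config.vocab_size}")
print(f"BERT special tokens: {bert_tokenizer.special_tokens_map}")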

# Show the correct pairing
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
print(f"\nRoBERTa tokens: {roberta_tokenizer.tokenize(text)}")
print(f"RoBERTa special tokens: {roberta_tokenizer.special_tokens_map}")
print("\n⚠️  Notice the different special tokens and tokenization!")
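
One reason this mismatch is dangerous is that it does not always fail loudly. The sketch below checks whether token IDs fit inside a model's embedding table, using the published vocabulary sizes of `bert-base-uncased` (30,522) and `roberta-base` (50,265). The helper name and the middle IDs in each list are illustrative assumptions, not real tokenizer output.

```python
BERT_VOCAB_SIZE = 30522     # bert-base-uncased
ROBERTA_VOCAB_SIZE = 50265  # roberta-base

def out_of_range_ids(input_ids, embedding_rows):
    """Return the IDs that would index past the model's embedding table."""
    return [i for i in input_ids if not 0 <= i < embedding_rows]

# 101/102 are BERT's [CLS]/[SEP]; 0/2 are RoBERTa's <s>/</s>. The rest is made up.
bert_style_ids = [101, 19204, 3989, 102]
roberta_style_ids = [0, 45643, 1938, 2]

# BERT IDs fed to RoBERTa: every ID is in range, so nothing crashes,
# but each ID now points at an unrelated RoBERTa token.
print(out_of_range_ids(bert_style_ids, ROBERTA_VOCAB_SIZE))

# RoBERTa IDs fed to BERT: some IDs exceed BERT's table and would raise an index error.
print(out_of_range_ids(roberta_style_ids, BERT_VOCAB_SIZE))
```

A range check like this only catches the second case, which is why the tokenizer and the model should always be loaded from the same checkpoint name.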

### 5.3 Analyzing Unknown Tokens

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Text with potentially unknown tokens
test_texts = [
    "Normal English text",
    "Emojis: 😀 🚀 🤖",
    "Special chars: ™ © ® µ",
    "Mixed: Hello世界Bonjour",
    "Medical: pneumonoultramicroscopicsilicovolcanoconiosis",
    "Code: def foo(x): return x**2",
    "Email: user@example.com",
    "URL: https://example.com/path"
]

print("Analyzing unknown token generation:")
print("=" * 60)

for text in test_texts:
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    
    # Check for unknown tokens
    unk_token_id = tokenizer.unk_token_id
    unk_count = token_ids.count(unk_token_id)
    
    print(f"\nText: '{text}'")
    print(f"  Tokens: {tokens}")
    if unk_count > 0:
        print(f"  ⚠️  Contains {unk_count} unknown tokens!")
        unk_positions = [i for i, tid in enumerate(token_ids) if tid == unk_token_id]
        print(f"  Unknown at positions: {unk_positions}")

### 5.4 Debugging Padding and Truncation

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sentences of different lengths
sentences = [
    "Short",
    "This is a medium length sentence.",
    "This is a much longer sentence that will definitely exceed the maximum length limit we set for truncation testing purposes."
]

max_length = 10

print(f"Testing padding and truncation (max_length={max_length})")
print("=" * 60)

# Pad every sequence to max_length and truncate anything longer
encoded = tokenizer(
    sentences,
    padding='max_length',
    truncation=True,
    max_length=max_length,
    return_tensors='pt'
)

for i, sent in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][i])
    preview = f"{sent[:30]}..." if len(sent) > 30 else sent
    print(f"\nOriginal: '{preview}'")
    print(f"Tokens: {tokens}")
    print(f"Attention mask: {encoded['attention_mask'][i].tolist()}")
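
The combination of `truncation=True` and `padding='max_length'` can be sketched by hand. The key detail is that room for `[CLS]` and `[SEP]` is reserved before the text is cut. The helper `truncate_and_pad` is hypothetical; the default IDs 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` IDs of `bert-base-uncased`, and the body IDs are made up.

```python
def truncate_and_pad(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Fit `token_ids` into exactly `max_length` positions, BERT-style."""
    # Reserve two positions for [CLS] and [SEP] before cutting the body.
    body = token_ids[: max_length - 2]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = truncate_and_pad([7, 8, 9], max_length=6)
long_ids, long_mask = truncate_and_pad(list(range(10, 30)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```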

## 6. Multimodal Tokenization <a id='multimodal'></a>

Modern transformers can process not just text, but also images and other modalities.

### 6.1 Image Tokenization with Vision Transformer (ViT)

In [None]:
# Create a sample image for testing
def create_sample_image():
    image = Image.new('RGB', (224, 224), color='red')
    # Add some variation
    pixels = image.load()
    for i in range(0, 224, 20):
        for j in range(0, 224, 20):
            pixels[i, j] = (0, 255, 0)  # Green dots
    return image

# Load a vision processor (tokenizer for images)
processor = AutoImageProcessor.from_pretrained('google/vit-base-patch16-224')

# Create sample image
image = create_sample_image()
print(f"Image size: {image.size}")

# Process image into model-ready inputs
inputs = processor(images=image, return_tensors='pt')
print(f'\nPixel values shape: {inputs["pixel_values"].shape}')

# Vision Transformer details
patch_size = 16
image_size = 224
num_patches = (image_size // patch_size) ** 2
print(f"\nViT tokenization details:")
print(f"  Number of image patches: {num_patches}")
print(f"  Each patch: {patch_size}x{patch_size} pixels")
print(f"  Sequence length: {num_patches + 1} (patches + [CLS] token)")
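
The patch arithmetic above follows from cutting the image into a grid of non-overlapping squares. A small pure-Python sketch on a toy 8x8 single-channel grid (the `patchify` helper is hypothetical, and a real ViT additionally flattens and linearly projects each patch) makes the splitting explicit.

```python
def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][left:left + patch_size] for r in range(top, top + patch_size)]
            patches.append(patch)
    return patches

# Toy 8x8 "image" whose pixel value encodes its position
toy_image = [[r * 8 + c for c in range(8)] for r in range(8)]
toy_patches = patchify(toy_image, patch_size=4)
print("Number of patches:", len(toy_patches))
print("First row of the top-right patch:", toy_patches[1][0])
```

The same formula, `(image_size // patch_size) ** 2`, gives 4 patches here and 196 for a 224x224 image with 16x16 patches.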

### 6.2 CLIP Multimodal Tokenization

In [None]:
# Load CLIP processor (handles both text and images)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Create sample image
image = create_sample_image()

# Sample texts for similarity comparison
texts = [
    "a red square with green dots",
    "a photo of a cat", 
    "a photo of a dog",
    "a colorful pattern"
]

# Process both text and images together
inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding=True
)

print("CLIP multimodal inputs:")
print(f"  Text input shape: {inputs['input_ids'].shape}")
print(f"  Image input shape: {inputs['pixel_values'].shape}")

# Show text tokenization
print("\nText tokenization:")
for i, text in enumerate(texts):
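    # Keep only real tokens via the attention mask; CLIP reuses its end-of-text token
    # for padding, so filtering on pad_token would also drop the genuine end marker
    real_ids = inputs['input_ids'][i][inputs['attention_mask'][i].bool()].tolist()
    tokens = processor.tokenizer.convert_ids_to_tokens(real_ids)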
    print(f"  '{text}': {tokens}")

### 6.3 Comparing Different Image Processors

In [None]:
# Compare different image processors/tokenizers
image = create_sample_image()

processors = {
    "ViT": "google/vit-base-patch16-224",
    "CLIP": "openai/clip-vit-base-patch32",
    "DeiT": "facebook/deit-base-patch16-224"
}

print("Comparing image processors:")
print("=" * 60)

for name, model_name in processors.items():
    try:
        if name == "CLIP":
            processor = CLIPProcessor.from_pretrained(model_name)
            inputs = processor(images=image, return_tensors="pt")
        else:
            processor = AutoImageProcessor.from_pretrained(model_name)
            inputs = processor(images=image, return_tensors="pt")
        
        print(f"\n{name} processor ({model_name}):")
        print(f"  Input shape: {inputs['pixel_values'].shape}")
        
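        # Get processor configuration (CLIPProcessor wraps its image processor)
        image_processor = getattr(processor, 'image_processor', processor)
        if hasattr(image_processor, 'size'):
            print(f"  Expected size: {image_processor.size}")
        if hasattr(image_processor, 'do_normalize'):
            print(f"  Normalization: {image_processor.do_normalize}")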
            
    except Exception as e:
        print(f"\n{name}: Error - {e}")

## 7. Chapter 3 Advanced Tokenization Examples <a id='chapter3'></a>

These are the advanced tokenization examples from Chapter 3 that were deferred to Chapter 5 for deeper explanation.

### 7.1 Advanced Padding and Truncation Strategies

Understanding how padding and truncation work is crucial for batch processing.

In [ ]:
# Advanced Padding Examples from Chapter 3
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 1. Padding examples with different strategies
print("=== Advanced Padding Examples ===")
texts = [
    "Short text.",
    "This is a medium length sentence that demonstrates padding.",
    "This is a much longer sentence that will show how padding works with "
    "multiple sentences of different lengths in a batch.",
]

# Padding to max length in batch
batch_encoding = tokenizer(texts, padding=True, return_tensors="pt")
print(f"Original texts lengths: {[len(text.split()) for text in texts]}")
print(f"Padded sequence lengths: {batch_encoding['input_ids'].shape}")
print(f"Attention mask shape: {batch_encoding['attention_mask'].shape}")
print(f"\nAttention masks (1=real token, 0=padding):")
for i, mask in enumerate(batch_encoding['attention_mask']):
    print(f"  Text {i+1}: {mask.tolist()}")

# Different padding strategies
print("\n=== Padding Strategies ===")

# Dynamic padding (to longest in batch)
dynamic_padding = tokenizer(texts, padding="longest", return_tensors="pt")
print(f"Dynamic padding shape: {dynamic_padding['input_ids'].shape}")

# Fixed padding to specific length
fixed_padding = tokenizer(texts, padding="max_length", max_length=30, return_tensors="pt")
print(f"Fixed padding shape (max_length=30): {fixed_padding['input_ids'].shape}")

# No padding
no_padding = tokenizer(texts, padding=False)
print(f"No padding: {[len(ids) for ids in no_padding['input_ids']]}")

### 7.2 Advanced Truncation Strategies

Truncation is essential when dealing with texts longer than the model's maximum sequence length.

In [ ]:
# Advanced Truncation Examples
print("=== Advanced Truncation Examples ===")

# Create a very long text
long_text = " ".join(["This is a very long sentence."] * 50)

# Without truncation (will be very long)
tokens_no_trunc = tokenizer.tokenize(long_text)
print(f"Without truncation: {len(tokens_no_trunc)} tokens")

# With truncation to max_length
tokens_with_trunc = tokenizer(
    long_text, truncation=True, max_length=20, return_tensors="pt"
)
print(f"With truncation (max_length=20): {tokens_with_trunc['input_ids'].shape[1]} tokens")

# Show truncated tokens
truncated_tokens = tokenizer.convert_ids_to_tokens(tokens_with_trunc['input_ids'][0].tolist())
print(f"Truncated tokens: {truncated_tokens}")

# Truncation strategies for sentence pairs
print("\n=== Truncation Strategies for Sentence Pairs ===")
question = "What is the capital of France?"
context = " ".join(["Paris is the capital and most populous city of France."] * 10)

# Strategy: 'only_second' - truncate only the context
encoding_only_second = tokenizer(
    question, context,
    truncation='only_second',
    max_length=50,
    return_tensors='pt'
)
print(f"'only_second' strategy: {encoding_only_second['input_ids'].shape}")

# Strategy: 'longest_first' - truncate the longest sequence first
encoding_longest_first = tokenizer(
    question, context,
    truncation='longest_first',
    max_length=50,
    return_tensors='pt'
)
print(f"'longest_first' strategy: {encoding_longest_first['input_ids'].shape}")

# Show which parts were kept
tokens_only_second = tokenizer.convert_ids_to_tokens(encoding_only_second['input_ids'][0])
print(f"\nTokens with 'only_second' (first 10): {tokens_only_second[:10]}...")
print(f"Question preserved: {'what' in ' '.join(tokens_only_second).lower()}")

### 7.3 Handling Multiple Sequences (Question-Answering Example)

Many NLP tasks require processing multiple sequences together, like question-answering or text entailment.

In [ ]:
# Multiple Sequences for Question-Answering
print("=== Multiple Sequences (Question-Answering) ===")

question = "What is tokenization?"
context = (
    "Tokenization is the process of breaking down text into smaller units "
    "called tokens. These tokens can be words, subwords, or even characters. "
    "In NLP, tokenization is a crucial preprocessing step that converts "
    "raw text into a format that machine learning models can understand."
)

# Encode question and context together
qa_encoding = tokenizer(
    question, context, 
    padding=True, 
    truncation=True, 
    return_tensors="pt"
)

# Convert to tokens to visualize
tokens = tokenizer.convert_ids_to_tokens(qa_encoding["input_ids"][0])
token_type_ids = qa_encoding['token_type_ids'][0].tolist()

print(f"Question: {question}")
print(f"Context: {context[:100]}...")
print(f"\nCombined tokens (first 20): {tokens[:20]}...")
print(f"\nToken type IDs visualization:")
print("  0 = Question/First sequence")
print("  1 = Context/Second sequence")

# Visualize token types
for i in range(min(20, len(tokens))):
    print(f"  Token {i:2d}: '{tokens[i]:15}' -> Type {token_type_ids[i]}")

# Find where question ends and context begins
sep_positions = [i for i, token in enumerate(tokens) if token == '[SEP]']
print(f"\n[SEP] token positions: {sep_positions}")
print(f"Question ends at position: {sep_positions[0]}")
print(f"Context starts at position: {sep_positions[0] + 1}")

### 7.4 Token-to-Character Offset Mapping

Offset mapping is crucial for tasks like Named Entity Recognition where you need to map model predictions back to the original text.

In [ ]:
# Token-to-Character Offset Mapping
print("=== Token-to-Character Offset Mapping ===")

text = "Hugging Face's tokenizers are extremely powerful!"
encoding = tokenizer(
    text, 
    return_offsets_mapping=True, 
    add_special_tokens=True
)

tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
offsets = encoding['offset_mapping']

print(f"Original text: '{text}'")
print(f"\nToken to character mapping:")
print("-" * 60)
print(f"{'Token':15} {'Text':20} {'Start':>6} {'End':>6}")
print("-" * 60)

for token, (start, end) in zip(tokens, offsets):
    if start == end:  # Special tokens have (0, 0) offsets
        print(f"{token:15} {'[SPECIAL TOKEN]':20} {start:6} {end:6}")
    else:
        original_text = text[start:end]
        print(f"{token:15} {original_text:20} {start:6} {end:6}")

# Practical use case: Highlighting entities
print("\n=== Practical Example: Entity Highlighting ===")

# Simulate NER predictions (token indices for "Hugging Face")
entity_token_indices = [1, 2]  # Tokens at positions 1 and 2

print("Detected entity tokens:")
entity_chars = []
for idx in entity_token_indices:
    token = tokens[idx]
    start, end = offsets[idx]
    entity_chars.extend(range(start, end))
    print(f"  Token '{token}' -> '{text[start:end]}'")

# Reconstruct the entity from character positions
min_char = min(entity_chars)
max_char = max(entity_chars) + 1
entity_text = text[min_char:max_char]
print(f"\nExtracted entity: '{entity_text}'")

### 7.5 Subword Tokenization Deep Dive

Let's explore how different subword tokenization methods handle complex words and why it matters.

In [ ]:
# Subword Tokenization Deep Dive
print("=== Subword Tokenization Methods Comparison ===")

# Example text with various challenges
text = (
    "Tokenization is fundamental to NLP. Let's explore BPE, WordPiece, and "
    "SentencePiece algorithms!"
)

# 1. BPE (Byte Pair Encoding) - GPT-2
print("\n1. BPE Tokenization (GPT-2):")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tokenizer.tokenize(text)
gpt2_ids = gpt2_tokenizer.encode(text)
print(f"   Tokens: {gpt2_tokens}")
print(f"   Token count: {len(gpt2_tokens)}")
print(f"   Vocabulary size: {gpt2_tokenizer.vocab_size}")

# 2. WordPiece - BERT
print("\n2. WordPiece Tokenization (BERT):")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tokenizer.tokenize(text)
bert_ids = bert_tokenizer.encode(text)
print(f"   Tokens: {bert_tokens}")
print(f"   Token count: {len(bert_tokens)}")
print(f"   Notice '##' prefix for subword continuations")

# 3. SentencePiece - T5
print("\n3. SentencePiece Tokenization (T5):")
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5_tokens = t5_tokenizer.tokenize(text)
t5_ids = t5_tokenizer.encode(text)
print(f"   Tokens: {t5_tokens}")
print(f"   Token count: {len(t5_tokens)}")
print(f"   Notice '▁' for word boundaries")

# Handling unknown/rare words
print("\n=== Handling Unknown/Rare Words ===")
rare_word = "Supercalifragilisticexpialidocious"

print(f"\nRare word: '{rare_word}'")
print(f"GPT-2 (BPE): {gpt2_tokenizer.tokenize(rare_word)}")
print(f"BERT (WordPiece): {bert_tokenizer.tokenize(rare_word.lower())}")
print(f"T5 (SentencePiece): {t5_tokenizer.tokenize(rare_word)}")

# Demonstrate vocabulary lookup
print("\n=== Vocabulary Lookup Example ===")
word = "tokenization"
print(f"Looking up '{word}':")

# Check if whole word is in vocabulary
if word in gpt2_tokenizer.vocab:
    print(f"  GPT-2: '{word}' is in vocabulary with ID {gpt2_tokenizer.vocab[word]}")
else:
    print(f"  GPT-2: '{word}' not in vocabulary, will be split into subwords")

# For BERT (lowercase)
word_lower = word.lower()
if word_lower in bert_tokenizer.vocab:
    print(f"  BERT: '{word_lower}' is in vocabulary with ID {bert_tokenizer.vocab[word_lower]}")
else:
    print(f"  BERT: '{word_lower}' not in vocabulary, will be split into subwords")

### 7.6 Comparing Tokenizers with TikToken (GPT-3.5/4)

In [ ]:
# Install tiktoken if not already installed
try:
    import tiktoken
except ImportError:
    print("Installing tiktoken...")
    import subprocess
    subprocess.check_call(["pip", "install", "tiktoken"])
    import tiktoken

# Comparing with TikToken (used by GPT-3.5/4)
print("=== Comparing HuggingFace Tokenizers with TikToken ===")

# Initialize tiktoken
encoding = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4 encoding

# Test texts
test_texts = [
    "Hello world!",
    "The transformer architecture revolutionized NLP in 2017.",
    "def tokenize(text): return text.split()",
    "Email: user@example.com, URL: https://example.com",
]

# Compare tokenization
for text in test_texts:
    print(f"\nText: '{text}'")
    print("-" * 60)

    # TikToken
    tiktoken_ids = encoding.encode(text)
    tiktoken_tokens = [encoding.decode([tid]) for tid in tiktoken_ids]
    print(f"TikToken (GPT-3.5/4): {tiktoken_tokens} ({len(tiktoken_tokens)} tokens)")

    # GPT-2
    gpt2_tokens = gpt2_tokenizer.tokenize(text)
    print(f"GPT-2 (BPE): {gpt2_tokens} ({len(gpt2_tokens)} tokens)")

    # BERT
    bert_tokens = bert_tokenizer.tokenize(text.lower())
    print(f"BERT (WordPiece): {bert_tokens} ({len(bert_tokens)} tokens)")

# Vocabulary size comparison
print("\n=== Vocabulary Size Comparison ===")
print(f"TikToken (cl100k_base): {encoding.n_vocab:,} tokens")
print(f"GPT-2: {gpt2_tokenizer.vocab_size:,} tokens")
print(f"BERT: {bert_tokenizer.vocab_size:,} tokens")
print(f"T5: {t5_tokenizer.vocab_size:,} tokens")

### 7.7 Out-of-Vocabulary (OOV) Handling Strategies

In [ ]:
# Out-of-Vocabulary (OOV) Handling
print("=== Out-of-Vocabulary (OOV) Word Handling ===")

# Test with made-up and rare words
test_texts = [
    "The flibbertigibbet jumped over the moon.",
    "Pneumonoultramicroscopicsilicovolcanoconiosis is a lung disease.",
    "The 🦄 and 🌈 are beautiful.",
    "Contact us at support@企业.com",
]

tokenizers_to_test = {
    "BERT": bert_tokenizer,
    "GPT-2": gpt2_tokenizer,
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base"),
}

for text in test_texts:
    print(f"\nText: '{text}'")
    print("-" * 70)

    for name, tokenizer in tokenizers_to_test.items():
        # Get UNK token for this tokenizer
        unk_token = getattr(tokenizer, 'unk_token', None)

        # Tokenize
        if name == "BERT":
            tokens = tokenizer.tokenize(text.lower())
        else:
            tokens = tokenizer.tokenize(text)

        # Check for UNK tokens
        unk_count = tokens.count(unk_token) if unk_token else 0

        print(f"{name:10} ({len(tokens):2} tokens): ", end="")
        if unk_count > 0:
            print(f"⚠️  {unk_count} UNK token(s)! ", end="")

        # Show first few tokens
        display_tokens = tokens[:8] + ["..."] if len(tokens) > 8 else tokens
        print(display_tokens)

# Demonstrate subword handling of OOV
print("\n=== Subword Decomposition of OOV Words ===")
made_up_word = "supersupercalifragilisticexpialidocious"

print(f"\nMade-up word: '{made_up_word}'")
print("\nHow different tokenizers handle it:")
print(f"BERT:     {bert_tokenizer.tokenize(made_up_word)}")
print(f"GPT-2:    {gpt2_tokenizer.tokenize(made_up_word)}")
print(f"RoBERTa:  {tokenizers_to_test['RoBERTa'].tokenize(made_up_word)}")

# Key insight
print("\n💡 Key Insight:")
print("- BERT uses [UNK] tokens for unknown words/characters")
print("- GPT-2 and RoBERTa use BPE to break down any word into known subwords")
print("- This is why BPE-based models handle OOV words better!")

### 7.8 Performance Comparison of Tokenizers

Let's compare the performance of different tokenizers to understand speed vs. functionality tradeoffs.

## 8. Exercises <a id='exercises'></a>

Now let's implement the exercises from article5.md

### Exercise 1: Tokenize Multilingual Sentences with Emojis

In [None]:
# Exercise 1: Tokenize multilingual sentences including emojis and domain-specific terms

multilingual_sentences = [
    "Hello world! 👋",
    "Bonjour le monde! 🇫🇷",
    "Hola mundo! 🇪🇸",
    "你好世界！🇨🇳",
    "Привет мир! 🇷🇺",
    "The patient has COVID-19 🦠",
    "Machine learning is amazing! 🤖💡"
]

# Use multilingual BERT
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

print("Exercise 1: Multilingual Tokenization Analysis")
print("=" * 60)

for sentence in multilingual_sentences:
    tokens = tokenizer.tokenize(sentence)
    ids = tokenizer.encode(sentence, add_special_tokens=False)
    
    print(f"\nText: '{sentence}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")
    
    # Check for unknown tokens
    if tokenizer.unk_token in tokens:
        print("⚠️  Contains unknown tokens!")

# Analysis of unusual tokenization results
print("\n" + "=" * 60)
print("Analysis:")
print("- Emojis are often tokenized as [UNK] or split into multiple tokens")
print("- Different scripts (Chinese, Cyrillic) are handled differently")
print("- Domain terms like 'COVID-19' may be split unexpectedly")

### Exercise 2: Train a Custom BPE Tokenizer

In [None]:
# Exercise 2: Train a custom BPE tokenizer on domain-specific corpus

# Create a small domain-specific corpus (scientific abstracts)
scientific_corpus = [
    "The quantum entanglement phenomenon demonstrates non-local correlations.",
    "CRISPR-Cas9 enables precise genome editing in mammalian cells.",
    "Machine learning algorithms optimize hyperparameters automatically.",
    "Photosynthesis converts light energy into chemical energy.",
    "Neurotransmitters facilitate synaptic transmission in neurons.",
    "The Higgs boson was discovered at the Large Hadron Collider.",
    "DNA polymerase synthesizes new DNA strands during replication.",
    "Quantum computing leverages superposition and entanglement.",
    "Mitochondria produce ATP through oxidative phosphorylation.",
    "Climate change affects global temperature and precipitation patterns."
]

# Train custom tokenizer
print("Training custom BPE tokenizer on scientific corpus...")
base_tokenizer = AutoTokenizer.from_pretrained('gpt2')
custom_tokenizer = base_tokenizer.train_new_from_iterator(
    scientific_corpus,
    vocab_size=1000
)

# Compare tokenization
test_terms = [
    "quantum entanglement",
    "CRISPR-Cas9",
    "photosynthesis",
    "neurotransmitters"
]

print("\nComparing standard GPT-2 vs custom scientific tokenizer:")
print("=" * 60)

for term in test_terms:
    standard_tokens = base_tokenizer.tokenize(term)
    custom_tokens = custom_tokenizer.tokenize(term)
    
    print(f"\nTerm: '{term}'")
    print(f"  Standard GPT-2: {standard_tokens} (length: {len(standard_tokens)})")
    print(f"  Custom Scientific: {custom_tokens} (length: {len(custom_tokens)})")

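print("\nObservation: compare the lengths above. Terms seen in the training corpus are often kept in fewer pieces,")
print("but the result depends on casing, leading spaces, and corpus size.")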
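
To see roughly what BPE training does, the toy sketch below learns merges at the character level by repeatedly joining the most frequent adjacent pair. It is not the implementation behind `train_new_from_iterator`, which works on bytes and runs in the `tokenizers` library. The function names `learn_bpe_merges` and `apply_merges` and the three-word corpus are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merges from a list of words (toy, character-level)."""
    vocab = Counter(tuple(w) for w in words)  # word as symbols -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe_merges(["quantum", "quantum", "quark"], num_merges=10)
print(apply_merges("quantum", merges))  # a word seen in training
print(apply_merges("quartz", merges))   # an unseen word falls back to smaller pieces
```

The unseen word shares a prefix with the training words, so part of it is merged while the remainder stays as single characters. The same effect explains why a domain-trained tokenizer keeps in-domain terms compact.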

### Exercise 3: Identify and Fix Tokenization Mismatch Bug

In [None]:
# Exercise 3: Intentionally create and fix a tokenizer-model mismatch

print("Exercise 3: Tokenizer-Model Mismatch Detection")
print("=" * 60)

# Step 1: Create the mismatch
print("\nStep 1: Creating intentional mismatch...")
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Note: We're not loading the full model to save memory in the notebook
# In practice, you would use: roberta_model = AutoModel.from_pretrained('roberta-base')

text = "Tokenization is crucial for NLP!"

# Show the problem: these IDs index into BERT's vocabulary, not RoBERTa's
bert_tokens = bert_tokenizer(text, return_tensors='pt')
print("BERT tokenizer output:")
print(f"  Tokens: {bert_tokenizer.tokenize(text)}")
print(f"  Special tokens: {list(bert_tokenizer.special_tokens_map.keys())}")
print(f"  Input IDs shape: {bert_tokens['input_ids'].shape}")

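# Step 2: Fix the mismatch
print("\nStep 2: Fixing the mismatch...")
roberta_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
roberta_tokens = roberta_tokenizer(text, return_tensors='pt')

# Reading BERT's IDs through RoBERTa's vocabulary shows what a RoBERTa model would actually see
mismatched = roberta_tokenizer.convert_ids_to_tokens(bert_tokens['input_ids'][0].tolist())
print(f"BERT IDs interpreted with RoBERTa's vocabulary: {mismatched}")

print("RoBERTa tokenizer output (correct):")
print(f"  Tokens: {roberta_tokenizer.tokenize(text)}")
print(f"  Special tokens: {list(roberta_tokenizer.special_tokens_map.keys())}")
print(f"  Input IDs shape: {roberta_tokens['input_ids'].shape}")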

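# Step 3: Key differences
print("\nStep 3: Key differences:")
print("- BERT uses [CLS] and [SEP] tokens")
print("- RoBERTa uses <s> and </s> tokens")
print(f"- Different vocabularies: {bert_tokenizer.vocab_size} (BERT) vs {roberta_tokenizer.vocab_size} (RoBERTa) entries, so the same ID maps to different tokens")
print("- Different algorithms: WordPiece with '##' continuation pieces vs byte-level BPE with 'Ġ' marking a leading space")
print("\n✅ Always load the tokenizer from the same checkpoint as the model!")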
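
The core of the mismatch problem can be reproduced without downloading anything. In the sketch below, two small made-up vocabularies stand in for two tokenizers. The names `vocab_a`, `vocab_b`, `encode`, and `decode` are hypothetical and are not part of the `transformers` API.

```python
# Two hypothetical vocabularies that assign different IDs to the same tokens
vocab_a = {"[CLS]": 0, "[SEP]": 1, "hello": 2, "world": 3}
vocab_b = {"<s>": 0, "<pad>": 1, "</s>": 2, "world": 3, "hello": 4}


def encode(tokens, vocab):
    """Map tokens to IDs using the given vocabulary."""
    return [vocab[t] for t in tokens]


def decode(ids, vocab):
    """Map IDs back to tokens using the given vocabulary."""
    id_to_token = {i: t for t, i in vocab.items()}
    return [id_to_token.get(i, "<out-of-range>") for i in ids]


ids = encode(["[CLS]", "hello", "world", "[SEP]"], vocab_a)
print("IDs from tokenizer A:    ", ids)
print("Read back with A (match):", decode(ids, vocab_a))
print("Read back with B (wrong):", decode(ids, vocab_b))
```

No error is raised in the mismatched case: the IDs are valid in both vocabularies, so a model paired with the wrong tokenizer silently receives a different sentence.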

### Exercise 4: Visualize Special Tokens

In [None]:
# Exercise 4: Visualize special tokens for classification task

print("Exercise 4: Special Tokens Visualization")
print("=" * 60)

# Text classification example
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Single sentence
single_text = "This movie is fantastic!"
single_encoded = tokenizer(single_text)

print("Single sentence classification:")
print(f"Text: '{single_text}'")
tokens = tokenizer.convert_ids_to_tokens(single_encoded['input_ids'])
print(f"Tokens: {tokens}")
print(f"\nExplanation:")
print(f"  - [CLS] at position 0: Classification token")
print(f"  - Content tokens: {tokens[1:-1]}")
print(f"  - [SEP] at position {len(tokens)-1}: Separator token")

# Sentence pair (for tasks like entailment)
text_a = "The weather is sunny."
text_b = "It's a beautiful day."
pair_encoded = tokenizer(text_a, text_b)

print("\n" + "-" * 50)
print("\nSentence pair classification:")
print(f"Text A: '{text_a}'")
print(f"Text B: '{text_b}'")
pair_tokens = tokenizer.convert_ids_to_tokens(pair_encoded['input_ids'])
token_type_ids = pair_encoded['token_type_ids']

print(f"\nTokens with segment IDs:")
for i, (token, segment) in enumerate(zip(pair_tokens, token_type_ids)):
    print(f"  {i:2d}: '{token:15}' (segment {segment})")

print("\nExplanation:")
print("  - [CLS]: Start of sequence, used for classification")
print("  - First [SEP]: Separates sentence A from sentence B")
print("  - Second [SEP]: End of sequence")
print("  - Segment 0: [CLS], first sentence, and first [SEP]")
print("  - Segment 1: Second sentence + final [SEP]")
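
The layout printed above can be rebuilt by hand, which makes the segment boundaries explicit. The helper `build_pair_inputs` below is a hypothetical sketch that mirrors the BERT-style `[CLS] A [SEP] B [SEP]` template. It works on already-split tokens and does not call the tokenizer.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] plus matching segment (token type) IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


sketch_tokens, sketch_segments = build_pair_inputs(
    ["the", "weather", "is", "sunny", "."],
    ["it", "'", "s", "a", "beautiful", "day", "."],
)
for token, segment in zip(sketch_tokens, sketch_segments):
    print(f"{token:12} segment {segment}")
```

Models without segment embeddings, such as RoBERTa, use a different template and do not rely on `token_type_ids`, so this layout should be treated as BERT-specific.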

### Exercise 5: Experiment with Noisy Text

In [None]:
# Exercise 5: Tokenizing noisy text with different algorithms

print("Exercise 5: Noisy Text Tokenization Comparison")
print("=" * 60)

# Noisy text examples
noisy_texts = [
    # Typos
    "I lvoe naturl langauge procesing",
    "Thsi is vrey interessting",
    
    # Slang and informal
    "gonna b late 2nite lol",
    "ur awesome btw :)",
    
    # Code snippets
    "def calculate_loss(y_true, y_pred): return mse(y_true, y_pred)",
    "import torch.nn as nn",
    
    # Mixed case and special chars
    "CamelCaseExample_with_underscores",
    "email@domain.com | phone: +1-234-567-8900"
]

# Load different tokenizers
tokenizers = {
    'BPE (GPT-2)': AutoTokenizer.from_pretrained('gpt2'),
    'WordPiece (BERT)': AutoTokenizer.from_pretrained('bert-base-uncased'),
    'Unigram (XLM-R)': AutoTokenizer.from_pretrained('xlm-roberta-base')
}

for text in noisy_texts:
    print(f"\nText: '{text}'")
    print("-" * 50)
    
    results = {}
    for name, tok in tokenizers.items():
        # No manual lowercasing needed: bert-base-uncased lowercases internally
        tokens = tok.tokenize(text)
        results[name] = {
            'tokens': tokens,
            'length': len(tokens)
        }
        print(f"{name}: {tokens} (length: {len(tokens)})")
    
    # Report the most compact tokenization (fewer tokens means shorter
    # sequences, though not necessarily better downstream quality)
    best = min(results.items(), key=lambda x: x[1]['length'])
    print(f"\n✓ Most compact: {best[0]} with {best[1]['length']} tokens")

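print("\n" + "=" * 60)
print("Analysis (check these against the output above):")
print("- Byte-level BPE (GPT-2) never emits an unknown token; unseen strings fall back to smaller pieces")
print("- WordPiece (BERT) splits typos into several '##' pieces and uses [UNK] when a word cannot be built from its vocabulary")
print("- Unigram (XLM-R) relies on a large multilingual SentencePiece vocabulary")
print("- Code snippets tend to produce long token sequences with all three tokenizers")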
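
Token count alone depends on how long the input is, so a normalized measure makes comparisons easier. The sketch below computes tokens per whitespace-separated word (sometimes called fertility) and the share of unknown tokens. The function names are hypothetical, and two trivial stand-in tokenizers are used so the example runs without downloads. With a Hugging Face tokenizer you would pass `tokenizer.tokenize` instead.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def unk_rate(tokens, unk_token="[UNK]"):
    """Fraction of tokens equal to the unknown token."""
    if not tokens:
        return 0.0
    return sum(t == unk_token for t in tokens) / len(tokens)


# Stand-in tokenizers for illustration only
def word_tokenize(text):
    return text.split()


def char_tokenize(text):
    return [ch for ch in text if not ch.isspace()]


text = "gonna b late"
print(f"word-level: {tokens_per_word(word_tokenize, text):.2f} tokens/word")
print(f"char-level: {tokens_per_word(char_tokenize, text):.2f} tokens/word")
print(f"unknown rate: {unk_rate(['a', '[UNK]', 'b', '[UNK]']):.2f}")
```

A value close to 1 means most words survive as single tokens, while higher values indicate heavier fragmentation, which is typical for typos, slang, and code.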

## Summary and Key Takeaways

In this notebook, we've explored:

1. **Basic Tokenization**: Converting text to tokens and IDs
2. **Tokenization Algorithms**: BPE, WordPiece, and Unigram differences
3. **Custom Tokenization**: Training tokenizers for specialized domains
4. **Debugging Tools**: Visualizing and understanding tokenization
5. **Multimodal Tokenization**: Processing images alongside text

### Key Points to Remember:

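- Always load your tokenizer from the same checkpoint as your model
- Custom tokenizers can shorten sequences on domain-specific text, but the model must be trained or adapted to the new vocabulary
- Different algorithms make different trade-offs: byte-level BPE can encode any string without unknown tokens, while WordPiece applies greedy longest-match over a fixed vocabulary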
- Special tokens are crucial for task-specific fine-tuning
- Modern transformers can handle multiple modalities through specialized tokenization

### Next Steps:

- Try training a custom tokenizer on your own domain data
- Experiment with different tokenization strategies for your use case
- Explore multimodal models like CLIP for vision-language tasks
- Practice debugging tokenization issues in your NLP pipelines