## BPE TOKENIZER - COMPLETE IMPLEMENTATION

A custom implementation of the Byte Pair Encoding (BPE) tokenization algorithm.
BPE, in byte-level variants, is the tokenization algorithm used by language models such as GPT.

Author: Ángel Morales Romero
Date: November 2024
License: MIT

Requirements:
    - datasets (HuggingFace)
    - tqdm

Installation:
    pip install datasets tqdm


### SECTION 1: IMPORTS

The dependencies are minimal:
- datasets: To load text corpora from HuggingFace
- collections.Counter: To count character pair frequencies
- tqdm: For progress bars during training
- time: To measure training time
- os: For file system operations
- re: For regex-based pretokenization (separating punctuation)

In [60]:
from collections import Counter  # used by get_pair_stats
import os
import re
import time

from datasets import load_dataset
from tqdm import tqdm

### SECTION 2: DATASET CONFIGURATION

Dataset Configuration Dictionary

Define your available datasets here. Each dataset needs:
- dataset_name: Local identifier
- hf_dataset_name: HuggingFace dataset identifier
- text_column: Column name containing the text data
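
As a minimal sketch, the three required fields can be checked before any download is attempted. The helper `validate_datasets` and the `incomplete` entry are hypothetical and not part of the notebook; they only illustrate the expected shape of an entry.

```python
# Hypothetical helper: reports which required fields each entry is missing.
REQUIRED_KEYS = {"dataset_name", "hf_dataset_name", "text_column"}

def validate_datasets(datasets):
    """Return a dict mapping each dataset key to the set of fields it lacks."""
    problems = {}
    for key, config in datasets.items():
        missing = REQUIRED_KEYS - set(config)
        if missing:
            problems[key] = missing
    return problems

example = {
    "stories": {
        "dataset_name": "AI_Storyteller_Dataset",
        "hf_dataset_name": "jaydenccc/AI_Storyteller_Dataset",
        "text_column": "short_story",
    },
    "incomplete": {"dataset_name": "only_a_name"},  # deliberately broken entry
}

print(validate_datasets(example))
```

Only the `incomplete` entry is reported, because it lacks `hf_dataset_name` and `text_column`.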


In [61]:
DATASETS = {
    "stories": {
        "dataset_name": "AI_Storyteller_Dataset",
        "hf_dataset_name": "jaydenccc/AI_Storyteller_Dataset",
        "text_column": "short_story",
    },
    "harry_potter": {
        "dataset_name": "HarryPotter_books_1to7",
        "hf_dataset_name": "WutYee/HarryPotter_books_1to7",
        "text_column": "text",
    }
}

### SECTION 3: LOAD DATASET FROM HUGGINGFACE

Loading the Dataset

This section downloads and loads a dataset from HuggingFace.
You can modify the dataset selection by changing the key below.


In [62]:
def load_text_dataset(dataset_key="stories"):
    """
    Load a dataset from HuggingFace.
    
    Args:
        dataset_key (str): Key from DATASETS dictionary
        
    Returns:
        tuple: (dataset, dataset_name, text_column)
    """
    dataset_config = DATASETS[dataset_key]
    
    dataset_name = dataset_config["dataset_name"]
    hf_dataset_name = dataset_config["hf_dataset_name"]
    text_column = dataset_config["text_column"]
    
    print(f"Loading dataset: {hf_dataset_name}")
    dataset = load_dataset(hf_dataset_name)
    
    print(f"\n✅ Dataset loaded successfully!")
    print(f"Splits: {list(dataset.keys())}")
    print(f"Training examples: {len(dataset['train'])}")
    
    # Show a few examples
    print(f"\nFirst 3 examples:")
    for i, example in enumerate(dataset["train"][:3][text_column]):
        print(f"  [{i+1}] {example[:50]}...")
    
    return dataset, dataset_name, text_column

### SECTION 4: SAVE DATASET TO TEXT FILE

Save Dataset to File

Convert the HuggingFace dataset to a plain text file for easier processing.
This also allows you to inspect the data manually if needed.


In [63]:
def save_dataset_to_file(dataset, dataset_name, text_column, 
                         output_dir="./data", fraction=1.0):
    """
    Save dataset to a text file.
    
    Args:
        dataset: HuggingFace dataset object
        dataset_name (str): Name for output file
        text_column (str): Column containing text
        output_dir (str): Directory to save files
        fraction (float): Fraction of data to use (0.0 to 1.0)
        
    Returns:
        tuple: (full_text, output_path)
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Combine all text
    print("\nCombining text from all examples...")
    text = ""
    for item in tqdm(dataset["train"], desc="Processing"):
        text += item[text_column] + "\n"
    
    # Apply fraction if specified
    if fraction < 1.0:
        text = text[:int(len(text) * fraction)]
        output_path = f"{output_dir}/{dataset_name}_small.txt"
        print(f"\nUsing {fraction*100:.0f}% of the data")
    else:
        output_path = f"{output_dir}/{dataset_name}.txt"
    
    # Save to file
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(text)
    
    # Statistics
    word_count = len(text.split())
    char_count = len(text)
    
    print(f"\n✅ Text saved to: {output_path}")
    print(f"   Characters: {char_count:,} ({char_count/1_000_000:.2f}M)")
    print(f"   Words: {word_count:,}")
    print(f"   Lines: {text.count(chr(10)):,}")
    
    return text, output_path

### SECTION 5: PREPARE TEXT FOR BPE TRAINING

Text Preprocessing for BPE

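Key concepts:
1. Normalize text (convert to lowercase)
2. Split into words and punctuation (pretokenization, described below)
3. Add end-of-word marker '</w>' to preserve word boundaries
4. Convert each word to a list of characters

Example:
    "Hello World" → ['h', 'e', 'l', 'l', 'o', '<', '/', 'w', '>'] ['w', 'o', 'r', 'l', 'd', '<', '/', 'w', '>']

Because the marker is appended as a string before the word is split, it starts out as four separate characters; BPE merges reassemble it into '</w>' during training.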
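
The four steps can be sketched as follows. This is a simplified stand-in that splits on whitespace only, rather than using the notebook's `pretokenize`; the function name `to_char_sequences` is an assumption made for illustration. It mirrors the way the notebook appends the marker as a string and then calls `list()` on the result.

```python
def to_char_sequences(text):
    """Lowercase, split on whitespace, append '</w>', then split into characters."""
    words = [word + "</w>" for word in text.lower().split()]
    return [list(word) for word in words]

sequences = to_char_sequences("Hello World")
print(sequences[0])
# ['h', 'e', 'l', 'l', 'o', '<', '/', 'w', '>']
```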

### SECTION 5.1: PRETOKENIZATION

Pretokenization Function

**What is Pretokenization?**

Pretokenization is the FIRST step before BPE. It lowercases the text and splits it into basic units (pretokens); BPE merges never cross pretoken boundaries. It is responsible for:

1. **Separating punctuation**: "Hello, world!" → ["hello", ",", "world", "!"]
2. **Handling contractions**: "don't" stays a single pretoken, apostrophe included
3. **Preserving numbers**: "2024" stays as one pretoken (but "19.99" becomes "19", ".", "99")
4. **Splitting on whitespace**: whitespace is discarded, so multiple spaces produce no empty pretokens

**Why is it important?**

Without pretokenization (splitting on whitespace only):
- "hello, world" would yield the unit "hello,", so BPE could learn merges such as "o," (wrong!)
- Punctuation would be glued to words
- Poor generalization to new text

With pretokenization:
- Each run of punctuation is its own pretoken
- Words and punctuation are independent
- Better compression and generalization

**Example:**
```
Input:  "Hello, world! It's 2024."
Output: ["hello", ",", "world", "!", "it's", "2024", "."]
```
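
To make the contrast concrete, the sketch below compares a plain whitespace split with the regex the notebook uses in `pretokenize`. The constant name `PATTERN` is introduced here only for illustration.

```python
import re

# Same regex as in the notebook's pretokenize(); the constant name is illustrative.
PATTERN = r"\w+(?:'[a-z]*)?|\d+|[^\s\w]+|\S"

text = "Hello, world! It's 2024."

naive = text.lower().split()                    # whitespace only
pretokens = re.findall(PATTERN, text.lower())   # regex pretokenization

print(naive)      # ['hello,', 'world!', "it's", '2024.']
print(pretokens)  # ['hello', ',', 'world', '!', "it's", '2024', '.']
```

With the whitespace split, punctuation stays glued to the neighbouring word; with the regex, it becomes a separate pretoken while the contraction stays whole.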

In [64]:
def pretokenize(text):
    """
    Pretokenize text by separating punctuation and splitting on whitespace.
    
    This is the FIRST step before BPE. It ensures that:
    - Text is lowercased
    - Punctuation is separated from words
    - Contractions stay attached to their word (e.g., "don't" → "don't")
    - Runs of digits are kept as single tokens
    - Whitespace is discarded, so multiple spaces are handled correctly
    
    Args:
        text (str): Raw text
        
    Returns:
        list: List of pretokens (words and punctuation)
        
    Example:
        >>> pretokenize("Hello, world! It's 2024.")
        ['hello', ',', 'world', '!', "it's", '2024', '.']
    """
    # Pattern explanation (alternatives are tried left to right):
    # \w+          : Match word characters (letters, digits, underscore)
    # (?:'[a-z]*)? : Optionally keep an apostrophe suffix ('s, 't, 're) attached;
    #                this also matches a bare trailing apostrophe
    # \d+          : Match numbers (never reached: \w+ already matches digits)
    # [^\s\w]+     : Match runs of punctuation (anything that's not space or word char)
    # \S           : Match any remaining non-whitespace
    
    pattern = r"\w+(?:'[a-z]*)?|\d+|[^\s\w]+|\S"
    
    # Find all matches and convert to lowercase
    tokens = re.findall(pattern, text.lower())
    
    return tokens

### TEST PRETOKENIZATION

Quick test to verify pretokenization is working correctly:

In [65]:
# Test pretokenization with various examples
test_sentences = [
    "Hello, world! How are you?",
    "It's a beautiful day, isn't it?",
    "The price is $19.99 (on sale).",
    "She said: 'Don't worry!' and left.",
    "Year 2024... what's next?"
]

print("PRETOKENIZATION TEST")
print("="*80)

for sentence in test_sentences:
    tokens = pretokenize(sentence)
    print(f"\nInput: {sentence}")
    print(f"Tokens: {tokens}")
    print(f"Count: {len(tokens)} tokens")

PRETOKENIZATION TEST

Input: Hello, world! How are you?
Tokens: ['hello', ',', 'world', '!', 'how', 'are', 'you', '?']
Count: 8 tokens

Input: It's a beautiful day, isn't it?
Tokens: ["it's", 'a', 'beautiful', 'day', ',', "isn't", 'it', '?']
Count: 8 tokens

Input: The price is $19.99 (on sale).
Tokens: ['the', 'price', 'is', '$', '19', '.', '99', '(', 'on', 'sale', ').']
Count: 11 tokens

Input: She said: 'Don't worry!' and left.
Tokens: ['she', 'said', ':', "'", "don't", 'worry', "!'", 'and', 'left', '.']
Count: 10 tokens

Input: Year 2024... what's next?
Tokens: ['year', '2024', '...', "what's", 'next', '?']
Count: 6 tokens


In [66]:
def prepare_text_for_bpe(text):
    """
    Prepare text for BPE training with proper pretokenization.
    
    Args:
        text (str): Raw text
        
    Returns:
        tuple: (word_tokens, vocab, words)
            - word_tokens: List of words as character lists
            - vocab: Initial character vocabulary
            - words: List of words with </w> markers
    """
    print("\n" + "="*80)
    print("PREPARING TEXT FOR BPE TRAINING")
    print("="*80)
    
    # Step 1: PRETOKENIZE - Split into words and punctuation
    print("\n1️⃣  Pretokenizing text...")
    words_raw = pretokenize(text)
    print(f"   Total pretokens: {len(words_raw):,}")
    print(f"   Example: {words_raw[:10]}")
    
    # Step 2: Add end-of-word markers
    words = [word + '</w>' for word in words_raw]
    print(f"\n2️⃣  Added </w> markers")
    print(f"   Example: {words[:10]}")
    
    # Step 3: Extract character vocabulary
    vocab_chars = set()
    for word in words:
        vocab_chars.update(list(word))
    
    vocab = sorted(list(vocab_chars))
    print(f"\n3️⃣  Initial character vocabulary: {len(vocab)} unique characters")
    print(f"   Characters: {vocab[:30]}{'...' if len(vocab) > 30 else ''}")
    
    # Step 4: Represent words as character sequences
    word_tokens = [list(word) for word in words]
    print(f"\n4️⃣  Converted to character sequences")
    print(f"   Example: '{words[0]}' → {word_tokens[0]}")
    
    return word_tokens, vocab, words

### SECTION 6: BPE HELPER FUNCTIONS

Core BPE Algorithm Functions

These functions implement the two key operations:
1. Count pair frequencies across all words
2. Merge a specific pair in all words
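
As a worked illustration of these two operations, here is a standalone sketch on a three-word toy corpus. The function names `count_pairs` and `merge` are stand-ins chosen for this example; they follow the same logic as the notebook's helpers but are not the notebook's code.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for word in words:
        pairs.update(zip(word, word[1:]))
    return pairs

def merge(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    a, b = pair
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2  # consumed both symbols of the pair
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("low"), list("lower"), list("lowest")]

stats = count_pairs(corpus)
print(stats[("l", "o")])  # 3
print(stats[("w", "e")])  # 2

corpus = merge(corpus, ("l", "o"))
print(corpus[0])  # ['lo', 'w']
```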


In [67]:
def get_pair_stats(word_tokens):
    """
    Count frequencies of all adjacent character pairs.
    
    Args:
        word_tokens (list): Words as lists of characters
        
    Returns:
        Counter: Pair frequencies {(char1, char2): count}
        
    Example:
        [['t','h','e','</w>'], ['t','h','e','</w>']]
        → {('t','h'): 2, ('h','e'): 2, ('e','</w>'): 2}
    """
    pairs = Counter()
    for word in word_tokens:
        for i in range(len(word) - 1):
            pair = (word[i], word[i+1])
            pairs[pair] += 1
    return pairs

In [68]:
def merge_pair(word_tokens, pair_to_merge):
    """
    Merge a specific pair in all words.
    
    Args:
        word_tokens (list): Words as character sequences
        pair_to_merge (tuple): (char1, char2) to merge
        
    Returns:
        list: Updated word_tokens with pair merged
        
    Example:
        word = ['t','h','e','</w>'], pair = ('t','h')
        → ['th','e','</w>']
    """
    a, b = pair_to_merge
    new_token = a + b
    
    new_word_tokens = []
    for word in word_tokens:
        new_word = []
        i = 0
        while i < len(word):
            # Check if current position matches the pair
            if i < len(word) - 1 and word[i] == a and word[i+1] == b:
                new_word.append(new_token)
                i += 2  # Skip both characters
            else:
                new_word.append(word[i])
                i += 1
        new_word_tokens.append(new_word)
    
    return new_word_tokens

### SECTION 7: TRAIN BPE VOCABULARY

BPE Training Algorithm

This is the heart of the tokenizer. The algorithm:
1. Starts with a character-level vocabulary
2. Iteratively finds the most frequent character pair
3. Merges that pair into a new token
4. Repeats until reaching the target vocabulary size

The order of merges is CRITICAL - we save it for later use during tokenization.
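
A small sketch of why the order matters: the same two merge rules applied in different orders split the same word differently. `apply_merges` is a hypothetical helper written for this example only.

```python
def apply_merges(symbols, merges):
    """Apply each merge rule in turn, scanning the word left to right."""
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

first = apply_merges(list("abc"), [("a", "b"), ("b", "c")])
second = apply_merges(list("abc"), [("b", "c"), ("a", "b")])

print(first)   # ['ab', 'c']
print(second)  # ['a', 'bc']
```

Once `a` and `b` are merged, the rule `('b', 'c')` no longer finds a standalone `b`, so a tokenizer that replayed the rules in a different order than training would produce tokens the vocabulary may not contain.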


In [69]:
def train_bpe(word_tokens, vocab, vocab_size=4000, verbose=True):
    """
    Train a BPE vocabulary.
    
    Args:
        word_tokens (list): Words as character sequences
        vocab (list): Initial vocabulary (characters)
        vocab_size (int): Target vocabulary size
        verbose (bool): Print progress
        
    Returns:
        tuple: (final_vocab, merges, word_tokens)
            - final_vocab: Complete vocabulary including merged tokens
            - merges: Ordered list of merge operations
            - word_tokens: Final word representations
    """
    print("\n" + "="*80)
    print("TRAINING BPE VOCABULARY")
    print("="*80)
    
    num_merges = vocab_size - len(vocab)
    print(f"\nTarget vocab size: {vocab_size}")
    print(f"Initial vocab size: {len(vocab)}")
    print(f"Merges needed: {num_merges}\n")
    
    # Store merge history (CRUCIAL for tokenization)
    merges = []
    
    start_time = time.time()
    
    # Main BPE loop
    for i in tqdm(range(num_merges), desc="Training BPE", disable=not verbose):
        # Count pair frequencies
        pairs = get_pair_stats(word_tokens)
        
        if not pairs:
            print(f"\n⚠️  No more pairs to merge at iteration {i}")
            break
        
        # Find most frequent pair
        best_pair = max(pairs, key=pairs.get)
        best_count = pairs[best_pair]
        
        # Save merge operation
        merges.append(best_pair)
        
        # Merge in all words
        word_tokens = merge_pair(word_tokens, best_pair)
        
        # Add new token to vocabulary
        new_token = best_pair[0] + best_pair[1]
        vocab.append(new_token)
    
    elapsed = time.time() - start_time
    
    print(f"\n{'='*80}")
    print(f"✅ BPE TRAINING COMPLETE")
    print(f"{'='*80}")
    print(f"Final vocabulary size: {len(vocab)}")
    print(f"Total merges performed: {len(merges)}")
    print(f"Training time: {int(elapsed//60)}m {elapsed%60:.1f}s")
    
    # Show examples of learned tokens
    print(f"\n📝 Sample learned tokens (last 20):")
    for i, token in enumerate(vocab[-20:], 1):
        print(f"   {len(vocab)-20+i:4d}. '{token}'")
    
    return vocab, merges, word_tokens

### SECTION 8: SAVE AND LOAD VOCABULARY

Persistence Functions

Save the trained vocabulary and merge operations to disk.
This allows you to:
1. Reuse the tokenizer without retraining
2. Share the tokenizer with others
3. Version control your tokenizers
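
The merges file format (one space-separated pair per line, in merge order) can be illustrated with a round trip through a temporary directory. This is a sketch under the assumption that no token contains whitespace, which holds for pretokens produced by the regex above; the file name `toy_merges.txt` is made up for the example.

```python
import os
import tempfile

merges = [("t", "h"), ("th", "e"), ("the", "</w>")]  # toy merge list

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_merges.txt")

    # Write one "left right" pair per line, preserving order.
    with open(path, "w", encoding="utf-8") as f:
        for a, b in merges:
            f.write(f"{a} {b}\n")

    # Read the pairs back in the same order.
    with open(path, "r", encoding="utf-8") as f:
        loaded = [tuple(line.split()) for line in f]

print(loaded == merges)  # True
```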


In [70]:
def save_vocab_and_merges(vocab, merges, dataset_name, output_dir="./data"):
    """
    Save vocabulary and merges to files.
    
    Args:
        vocab (list): Complete vocabulary
        merges (list): List of merge operations
        dataset_name (str): Name for output files
        output_dir (str): Directory to save files
    """
    os.makedirs(output_dir, exist_ok=True)
    
    # Save vocabulary
    vocab_path = f"{output_dir}/{dataset_name}_vocab.txt"
    with open(vocab_path, 'w', encoding='utf-8') as f:
        for token in vocab:
            f.write(f"{token}\n")
    
    # Save merges (CRITICAL - preserves merge order)
    merges_path = f"{output_dir}/{dataset_name}_merges.txt"
    with open(merges_path, 'w', encoding='utf-8') as f:
        for a, b in merges:
            f.write(f"{a} {b}\n")
    
    print(f"\n✅ Saved vocabulary to: {vocab_path}")
    print(f"✅ Saved merges to: {merges_path}")
    print(f"   Vocab size: {len(vocab)}")
    print(f"   Merge operations: {len(merges)}")

In [71]:
def load_vocab_and_merges(dataset_name, data_dir="./data"):
    """
    Load vocabulary and merges from files.
    
    Args:
        dataset_name (str): Name of saved files
        data_dir (str): Directory containing files
        
    Returns:
        tuple: (vocab, merges)
    """
    vocab_path = f"{data_dir}/{dataset_name}_vocab.txt"
    merges_path = f"{data_dir}/{dataset_name}_merges.txt"
    
    # Load vocabulary
    with open(vocab_path, 'r', encoding='utf-8') as f:
        vocab = [line.rstrip('\n') for line in f.readlines()]
    
    # Load merges
    with open(merges_path, 'r', encoding='utf-8') as f:
        merges = [tuple(line.strip().split()) for line in f.readlines()]
    
    print(f"\n✅ Loaded vocabulary from: {vocab_path}")
    print(f"✅ Loaded merges from: {merges_path}")
    print(f"   Vocab size: {len(vocab)}")
    print(f"   Merge operations: {len(merges)}")
    
    return vocab, merges

### SECTION 9: TOKENIZER CLASS

Tokenizer Class

This class provides a clean interface for encoding and decoding text.

Key methods:
- tokenize(text): Convert text to token IDs
- decode(token_ids): Convert token IDs back to text

The tokenizer applies the same merge operations in the same order as training.
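
The ID mapping and the decoding step can be sketched in isolation with a hand-written toy vocabulary (the four tokens below are invented for the example, not taken from a trained tokenizer). The standalone `decode` function mirrors the join-then-replace logic of the class method.

```python
vocab = ["hello</w>", ",</w>", "world</w>", "!</w>"]  # toy vocabulary

# A token's ID is its position in the vocabulary list.
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def decode(token_ids):
    """Join tokens, then turn every end-of-word marker into a space."""
    text = "".join(id_to_token[i] for i in token_ids)
    return text.replace("</w>", " ").strip()

ids = [token_to_id[t] for t in ["hello</w>", ",</w>", "world</w>", "!</w>"]]
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # hello , world !
```

Because every `</w>` becomes a space, punctuation comes back separated from the preceding word; decoding is therefore not an exact inverse of encoding for text that contains punctuation or uppercase letters.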

In [72]:
class Tokenizer:
    """
    BPE Tokenizer for encoding and decoding text with pretokenization.
    
    Attributes:
        vocab (list): Complete vocabulary
        token_to_id (dict): Maps tokens to IDs
        id_to_token (dict): Maps IDs to tokens
        merges (list): Ordered merge operations
        
    Example:
        >>> tokenizer = Tokenizer(vocab, merges)
        >>> ids = tokenizer.tokenize("hello world")
        >>> text = tokenizer.decode(ids)
    """
    
    def __init__(self, vocab, merges):
        """
        Initialize tokenizer with vocabulary and merges.
        
        Args:
            vocab (list): List of tokens
            merges (list): List of (char1, char2) tuples in merge order
        """
        self.vocab = vocab
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        self.merges = merges
        
        print(f"✅ Tokenizer initialized")
        print(f"   Vocabulary size: {len(vocab)}")
        print(f"   Merge rules: {len(merges)}")
    
    def tokenize(self, text):
        """
        Encode text to token IDs using pretokenization + BPE algorithm.
        
        The algorithm:
        1. PRETOKENIZE: Split into words and punctuation
        2. Add </w> markers
        3. Start with character-level representation
        4. Apply merge operations in training order
        5. Convert tokens to IDs
        
        Args:
            text (str): Input text
            
        Returns:
            list: Token IDs
            
        Example:
            >>> tokenizer.tokenize("Hello, world!")
            [1234, 45, 5678, 90]  # illustrative IDs for ['hello</w>', ',</w>', 'world</w>', '!</w>']
        """
        # Step 1: PRETOKENIZE (same as training)
        words = pretokenize(text)
        
        token_ids = []
        
        for word in words:
            # Add end-of-word marker
            word = word + '</w>'
            
            # Start with character-level tokens
            word_tokens = list(word)
            
            # Apply merges in order
            for a, b in self.merges:
                new_token = a + b
                
                i = 0
                new_word_tokens = []
                
                while i < len(word_tokens):
                    # Check if current and next character match the pair
                    if (i < len(word_tokens) - 1 and 
                        word_tokens[i] == a and 
                        word_tokens[i+1] == b):
                        new_word_tokens.append(new_token)
                        i += 2
                    else:
                        new_word_tokens.append(word_tokens[i])
                        i += 1
                
                word_tokens = new_word_tokens
            
            # Convert tokens to IDs
            for token in word_tokens:
                if token in self.token_to_id:
                    token_ids.append(self.token_to_id[token])
                else:
                    # Handle unknown tokens by splitting to characters;
                    # characters missing from the vocabulary are silently dropped
                    for char in token:
                        if char in self.token_to_id:
                            token_ids.append(self.token_to_id[char])
        
        return token_ids
    
    def decode(self, token_ids):
        """
        Decode token IDs back to text.
        
        Args:
            token_ids (list): List of token IDs
            
        Returns:
            str: Decoded text
            
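        Example:
            >>> tokenizer.decode([1234, 45, 5678, 90])  # illustrative IDs
            "hello , world !"

        Note:
            Every '</w>' becomes a space, so punctuation is separated from
            the preceding word and the original casing is not restored.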
        """
        # Convert IDs to tokens
        tokens = []
        for token_id in token_ids:
            if token_id in self.id_to_token:
                tokens.append(self.id_to_token[token_id])
        
        # Concatenate tokens
        text = ''.join(tokens)
        
        # Replace end-of-word markers with spaces
        text = text.replace('</w>', ' ')
        
        return text.strip()
    
    def __repr__(self):
        return f"Tokenizer(vocab_size={len(self.vocab)}, merges={len(self.merges)})"

### SECTION 10: EXAMPLE USAGE AND TESTING

Example Usage

This section demonstrates how to use the tokenizer once trained.


In [73]:
def test_tokenizer(tokenizer, test_phrases):
    """
    Test tokenizer with sample phrases.
    
    Args:
        tokenizer (Tokenizer): Trained tokenizer
        test_phrases (list): List of test strings
    """
    print("\n" + "="*80)
    print("TESTING TOKENIZER")
    print("="*80)
    
    for phrase in test_phrases:
        print(f"\n📝 Input: \"{phrase}\"")
        
        # Encode
        token_ids = tokenizer.tokenize(phrase)
        print(f"   Token IDs: {token_ids}")
        
        # Show actual tokens
        tokens = [tokenizer.id_to_token[token_id] for token_id in token_ids]
        print(f"   Tokens: {tokens}")
        
        # Decode
        decoded = tokenizer.decode(token_ids)
        print(f"   Decoded: \"{decoded}\"")
        
        # Verify. decode() puts a space before standalone punctuation, so
        # phrases containing punctuation are reported as mismatches.
        if phrase.lower() == decoded:
            print("   ✅ Perfect reconstruction!")
        else:
            print(f"   ⚠️  Mismatch (expected: \"{phrase.lower()}\")")

### SECTION 11: MAIN EXECUTION SCRIPT

Main Script

This is the complete pipeline to train and test a BPE tokenizer.
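
For experimenting without downloading a dataset or writing files, the same stages can be run on an in-memory string. The sketch below is a condensed stand-in for the pipeline, not the notebook's `main`: the names `merge`, `train`, and `encode` and the toy corpus are assumptions made for illustration, and the result is a list of token strings rather than IDs.

```python
import re
from collections import Counter

def pretokenize(text):
    # Same regex as the notebook's pretokenize().
    return re.findall(r"\w+(?:'[a-z]*)?|\d+|[^\s\w]+|\S", text.lower())

def merge(symbols, pair):
    """Merge every occurrence of `pair` inside one word."""
    a, b = pair
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    words = [list(w + "</w>") for w in pretokenize(text)]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in words:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge(word, best) for word in words]
    return merges

def encode(text, merges):
    """Replay the learned merges, in order, on each pretoken."""
    tokens = []
    for word in pretokenize(text):
        symbols = list(word + "</w>")
        for pair in merges:
            symbols = merge(symbols, pair)
        tokens.extend(symbols)
    return tokens

merges = train("the cat sat on the mat. the cat ate.", num_merges=10)
tokens = encode("the cat", merges)
print(merges[:3])
print(tokens)
```

Whatever merges are learned, concatenating the tokens always gives back the marked-up input (`the</w>cat</w>`), which is a convenient sanity check when changing the training code.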


In [74]:
def main(train_new=True, dataset_key="stories", vocab_size=2000, data_fraction=1.0):
    """
    Complete BPE tokenizer training pipeline with pretokenization.
    
    Args:
        train_new (bool): If True, train new tokenizer. If False, load existing.
        dataset_key (str): Key from DATASETS dictionary
        vocab_size (int): Target vocabulary size (only used if train_new=True)
        data_fraction (float): Fraction of data to use (only used if train_new=True)
    
    Returns:
        Tokenizer: Trained or loaded tokenizer
    """
    print("="*80)
    print("BPE TOKENIZER TRAINING PIPELINE")
    print("="*80)
    
    # Get dataset name
    dataset_name = DATASETS[dataset_key]["dataset_name"]
    
    if not train_new:
        # ==================================================================
        # LOADING EXISTING TOKENIZER
        # ==================================================================
        print("\n🔄 Loading existing tokenizer...")
        print(f"   Dataset: {dataset_name}")
        
        try:
            vocab, merges = load_vocab_and_merges(dataset_name)
            tokenizer = Tokenizer(vocab, merges)
            
            print("\n" + "="*80)
            print("✅ TOKENIZER LOADED SUCCESSFULLY!")
            print("="*80)
            
            return tokenizer
            
        except FileNotFoundError as e:
            print(f"\n❌ Error: Could not find saved tokenizer files!")
            print(f"   {e}")
            print(f"\n💡 Tip: Set train_new=True to train a new tokenizer")
            return None
    
    else:
        # ==================================================================
        # TRAINING NEW TOKENIZER
        # ==================================================================
        print(f"\n🚀 Training new tokenizer...")
        print(f"   Dataset: {dataset_key}")
        print(f"   Vocab size: {vocab_size}")
        print(f"   Data fraction: {data_fraction*100:.0f}%")
        
        # Step 1: Load dataset
        print("\n" + "="*80)
        print("STEP 1: LOADING DATASET")
        print("="*80)
        dataset, dataset_name, text_column = load_text_dataset(dataset_key)
        
        # Step 2: Save to file
        print("\n" + "="*80)
        print("STEP 2: SAVING DATASET TO FILE")
        print("="*80)
        text, output_path = save_dataset_to_file(
            dataset, dataset_name, text_column, 
            fraction=data_fraction
        )
        
        # Step 3: Prepare for BPE (with pretokenization)
        print("\n" + "="*80)
        print("STEP 3: PREPARING TEXT (WITH PRETOKENIZATION)")
        print("="*80)
        word_tokens, vocab, words = prepare_text_for_bpe(text)
        
        # Step 4: Train BPE
        print("\n" + "="*80)
        print("STEP 4: TRAINING BPE")
        print("="*80)
        final_vocab, merges, final_word_tokens = train_bpe(
            word_tokens, vocab, 
            vocab_size=vocab_size
        )
        
        # Step 5: Save vocabulary
        print("\n" + "="*80)
        print("STEP 5: SAVING VOCABULARY")
        print("="*80)
        save_vocab_and_merges(final_vocab, merges, dataset_name)
        
        # Step 6: Create tokenizer
        print("\n" + "="*80)
        print("STEP 6: CREATING TOKENIZER")
        print("="*80)
        tokenizer = Tokenizer(final_vocab, merges)
        
        # Step 7: Test tokenizer (with punctuation)
        print("\n" + "="*80)
        print("STEP 7: TESTING TOKENIZER")
        print("="*80)
        test_phrases = [
            "Harry Potter and the Philosopher's Stone.",
            "Once upon a time, there was a wizard.",
            "The magical world of Hogwarts!",
            "Hermione said: 'Don't worry!'"
        ]
        test_tokenizer(tokenizer, test_phrases)
        
        print("\n" + "="*80)
        print("✅ PIPELINE COMPLETE!")
        print("="*80)
        print(f"\nYour tokenizer is ready to use!")
        print(f"Vocabulary saved to: ./data/{dataset_name}_vocab.txt")
        print(f"Merges saved to: ./data/{dataset_name}_merges.txt")
        
        return tokenizer


### SECTION 12: SCRIPT ENTRY POINT


In [75]:
if __name__ == "__main__":
    # ======================================================================
    # CONFIGURATION
    # ======================================================================
    
    # Set to True to train a new tokenizer, False to load existing
    TRAIN_NEW = False
    
    # Dataset selection (if training new)
    DATASET_KEY = "stories"  # Options: "harry_potter", "stories"
    
    # Vocabulary size (if training new)
    VOCAB_SIZE = 3000
    
    # Data fraction (if training new)
    DATA_FRACTION = 1.0  # Use 100% of data
    
    # ======================================================================
    # RUN PIPELINE
    # ======================================================================
    
    if TRAIN_NEW:
        print("🚀 MODE: Training new tokenizer")
        tokenizer = main(
            train_new=True,
            dataset_key=DATASET_KEY,
            vocab_size=VOCAB_SIZE,
            data_fraction=DATA_FRACTION
        )
    else:
        print("🔄 MODE: Loading existing tokenizer")
        tokenizer = main(
            train_new=False,
            dataset_key=DATASET_KEY
        )
    
    # ======================================================================
    # EXAMPLE USAGE
    # ======================================================================
    
    if tokenizer is not None:
        print("\n" + "="*80)
        print("EXAMPLE: USING THE TRAINED TOKENIZER")
        print("="*80)
        
        sample_texts = [
            "once upon a time in a land far away",
            "he was a brave knight",
            "it's a beautiful day, isn't it?"
        ]
        
        for sample_text in sample_texts:
            print(f"\n📝 Input: \"{sample_text}\"")
            
            # Encode
            ids = tokenizer.tokenize(sample_text)
            print(f"   Token IDs: {ids}")
            print(f"   Num tokens: {len(ids)}")
            
            # Show tokens
            tokens = [tokenizer.id_to_token[token_id] for token_id in ids]
            print(f"   Tokens: {tokens}")
            
            # Decode
            reconstructed = tokenizer.decode(ids)
            print(f"   Decoded: \"{reconstructed}\"")
            
            # Compression ratio
            orig_chars = len(sample_text)
            token_count = len(ids)
            ratio = orig_chars / token_count if token_count > 0 else 0
            print(f"   Compression: {orig_chars} chars → {token_count} tokens ({ratio:.2f}x)")


🔄 MODE: Loading existing tokenizer
BPE TOKENIZER TRAINING PIPELINE

🔄 Loading existing tokenizer...
   Dataset: AI_Storyteller_Dataset

✅ Loaded vocabulary from: ./data/AI_Storyteller_Dataset_vocab.txt
✅ Loaded merges from: ./data/AI_Storyteller_Dataset_merges.txt
   Vocab size: 3000
   Merge operations: 2950
✅ Tokenizer initialized
   Vocabulary size: 3000
   Merge rules: 2950

✅ TOKENIZER LOADED SUCCESSFULLY!

EXAMPLE: USING THE TRAINED TOKENIZER

📝 Input: "once upon a time in a land far away"
   Token IDs: [724, 970, 77, 352, 97, 77, 592, 1826, 691]
   Num tokens: 9
   Tokens: ['once</w>', 'upon</w>', 'a</w>', 'time</w>', 'in</w>', 'a</w>', 'land</w>', 'far</w>', 'away</w>']
   Decoded: "once upon a time in a land far away"
   Compression: 35 chars → 9 tokens (3.89x)

📝 Input: "he was a brave knight"
   Token IDs: [147, 92, 77, 995, 137, 29, 528]
   Num tokens: 7
   Tokens: ['he</w>', 'was</w>', 'a</w>', 'bra', 've</w>', 'k', 'night</w>']
   Decoded: "he was a 