# My Learning Journey: Understanding Byte Pair Encoding (BPE)

**Date**: January 27, 2026  
**Learning Goal**: I'm deepening my understanding of how BPE tokenization works by exploring different implementations

## What I'm Learning Today

I'm revisiting Byte Pair Encoding (BPE), which is the tokenization method used in GPT-2 and many other LLMs. I want to understand:

1. **Why BPE exists** - What problem does it solve compared to simple word splitting?
2. **How different libraries implement BPE** - tiktoken, transformers, and the original OpenAI code
3. **Performance differences** - Which implementation is fastest and why?
4. **Practical usage** - How do I actually use these tokenizers in my code?

## My Understanding So Far

From Chapter 2, I learned that BPE:
- Starts with individual characters as base tokens (single bytes, in GPT-2's byte-level variant)
- Iteratively merges the most frequent pair of adjacent tokens
- Builds up a vocabulary of subword units
- Balances between character-level (flexible but long sequences) and word-level (fixed vocab, OOV issues)

**I still need to clarify**: How exactly does the merging algorithm decide which pairs to merge? And how does the pre-tokenization step (splitting on whitespace/punctuation) interact with BPE?
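A partial answer to my first question, as a toy sketch rather than the real GPT-2 training code: count every adjacent pair across the corpus, merge the most frequent one, and repeat. The corpus `toy_counts`, the helper names, and the tie-breaking rule (first pair seen wins) are all my own assumptions for illustration.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_counts, num_merges):
    """Return the ordered list of merges learned from a word-frequency dict."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges

# Hypothetical mini-corpus: word -> number of occurrences
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_merges(toy_counts, num_merges=3)
print(merges)
```

The order of the returned list matters: it is the priority order used later when encoding new text.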

## Source Attribution

This learning notebook synthesizes concepts from:
- **Book**: *Build a Large Language Model From Scratch* by Sebastian Raschka  
- **Reference**: Author's BPE comparison notebook in `source-material/ch02_bonus_bpe_from_author/`
- **Original BPE**: OpenAI's GPT-2 implementation (Modified MIT License)

I'm writing this in my own words to build deeper understanding, not just copying code.

---

## Setup: Installing Required Packages

I need some extra packages for this exploration. The main ones are:
- **tiktoken** - OpenAI's fast BPE tokenizer (already in my environment)
- **transformers** - Hugging Face library with GPT-2 tokenizer
- **requests & tqdm** - For downloading vocabulary files

**Note to self**: I'll keep these as optional dependencies since they're for comparison experiments, not core learning.

In [None]:
# Check what I already have installed
from importlib.metadata import version
import sys

packages = ['tiktoken', 'torch', 'numpy']
for package in packages:
    try:
        print(f"{package}: {version(package)}")
    except Exception:
        print(f"{package}: NOT INSTALLED")

print(f"\nPython: {sys.version}")

**My observation**: tiktoken should already be available since it's in my pyproject.toml dependencies. If I want to run the transformers comparisons, I'll need to install that separately.

Installing transformers (optional - only if I want to compare):
```python
# Uncomment to install
# !pip install transformers requests tqdm
```

---

## Part 1: Understanding tiktoken (OpenAI's Modern BPE)

I'm starting with tiktoken because:
1. It's the current recommended tokenizer from OpenAI
2. It's significantly faster than the original GPT-2 encoder
3. It's easier to use (cleaner API)

**My mental model**: tiktoken takes text → splits on regex patterns → applies BPE merges → returns token IDs

In [None]:
import tiktoken

# I'm loading the GPT-2 BPE encoder
# This uses the same vocabulary and merge rules as GPT-2
tokenizer = tiktoken.get_encoding("gpt2")

# Let me see what this tokenizer does with a simple example
text = "Hello, world. Is this-- a test?"
print(f"Original text: {text}")

In [None]:
# Encoding: text → token IDs
token_ids = tokenizer.encode(text)
print(f"\nToken IDs: {token_ids}")
print(f"Number of tokens: {len(token_ids)}")

**I'm noticing**: The text got split into 10 tokens. Let me decode each one to see how BPE grouped the characters.

In [None]:
# Let me decode each token individually to understand the subwords
print("\nBreakdown of each token:")
for i, token_id in enumerate(token_ids):
    decoded = tokenizer.decode([token_id])
    print(f"  Token {i}: ID={token_id:5d} → '{decoded}'")

**My observations**:
- "Hello" stayed as one token (common word)
- "," and "." are separate tokens (punctuation gets its own tokens)
- " world" includes the leading space (BPE encodes space as part of tokens)
- "--" gets tokenized as one unit (punctuation sequences)
- " Is", " this", " a", " test", "?" - notice the spaces are part of the tokens

**Key insight I'm building**: BPE doesn't just split on spaces. It learned during training which character sequences appear frequently together, so "Hello" is one token because it's common in English text.

In [None]:
# Can I go back from tokens to text?
decoded_text = tokenizer.decode(token_ids)
print(f"\nDecoded back: {decoded_text}")
print(f"Same as original? {decoded_text == text}")

**Great!** The encoding is reversible (lossless). This is important for LLMs because they need to decode their predictions back into human-readable text.

In [None]:
# How big is the vocabulary?
vocab_size = tokenizer.n_vocab
print(f"\nVocabulary size: {vocab_size:,} tokens")

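**My understanding**: GPT-2 uses a vocabulary of 50,257 tokens: 256 byte-level base tokens, 50,000 learned merges, and one special `<|endoftext|>` token. That is far smaller than a word-level vocabulary covering all of English would need to be, but much bigger than a character-level one (26 letters plus punctuation). That's the sweet spot of BPE - balancing vocab size with sequence length.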

**I'm wondering**: What happens if I give it text with characters not in the vocabulary? Let me test with some emojis...

In [None]:
# Testing with special characters
emoji_text = "Hello! 👋 How are you doing? 😊"
emoji_tokens = tokenizer.encode(emoji_text)
print(f"Text with emojis: {emoji_text}")
print(f"Number of tokens: {len(emoji_tokens)}")
print(f"\nToken breakdown:")
for i, tid in enumerate(emoji_tokens):
    print(f"  {i}: {tid:5d} → '{tokenizer.decode([tid])}'")

**My observation**: The emojis got broken down into multiple tokens! This makes sense - emojis are UTF-8 characters that get encoded as multiple bytes, and BPE operates on these bytes.

**Takeaway**: BPE can handle ANY text (even emojis, Chinese characters, etc.) because it ultimately works on byte level. But rare characters will take up more tokens.
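To convince myself of the byte-level point, here is a small check that uses only the standard library (no tokenizer involved). The sample characters are my own picks; the point is just that the UTF-8 byte count grows for characters outside ASCII, which is why a single emoji can cost several tokens.

```python
# Compare how many UTF-8 bytes different characters need
samples = ["H", "é", "中", "👋"]

for ch in samples:
    raw = ch.encode("utf-8")
    print(f"{ch!r}: {len(raw)} byte(s) -> {list(raw)}")

# Keep the counts around for later inspection
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
```

A byte-level BPE tokenizer starts from these raw bytes, so a character only becomes a single token if its bytes were merged often enough during training.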

---

## Part 2: Comparing Different BPE Implementations

I've learned that there are several implementations of BPE for GPT-2. I want to understand:
1. Do they produce the same token IDs?
2. Which one is faster?
3. When should I use each one?

### The implementations I'm comparing:
1. **tiktoken** - Modern, fast, OpenAI's current recommendation
2. **Original GPT-2 encoder** - Historical reference, slower but educational
3. **Hugging Face transformers** - Widely used in the community

**Note**: Since I don't have transformers installed yet, I'll focus on understanding the concepts. I can install it later if needed for actual experiments.

### Understanding the Original OpenAI Implementation

The original GPT-2 encoder (from 2019) is in `source-material/ch02_bonus_bpe_from_author/bpe_openai_gpt2.py`. 

**What I learned from reading that code**:
1. It uses a `bytes_to_unicode()` function to map UTF-8 bytes to printable Unicode characters
2. The BPE algorithm iteratively merges character pairs based on learned merge rules
3. It caches results to avoid recomputing the same tokens
4. The pre-tokenization uses regex to split on words and punctuation
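To get a feel for point 4, here is a simplified, ASCII-only approximation of the split pattern. This is my own stand-in, not the original: the real pattern relies on `\p{L}` and `\p{N}` (Unicode letters and numbers), which need the third-party `regex` module, so I swap in `[A-Za-z]` and `[0-9]` to stay within the standard library. The names `SIMPLE_SPLIT` and `pre_tokenize` are hypothetical.

```python
import re

# ASCII-only approximation of the GPT-2 split pattern (assumption: the
# Unicode classes \p{L} and \p{N} are replaced by [A-Za-z] and [0-9]).
SIMPLE_SPLIT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[A-Za-z]+"              # a word, with an optional leading space
    r"| ?[0-9]+"                 # a number, with an optional leading space
    r"| ?[^\sA-Za-z0-9]+"        # a run of punctuation/symbols
    r"|\s+(?!\S)|\s+"            # leftover whitespace
)

def pre_tokenize(text):
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return SIMPLE_SPLIT.findall(text)

chunks = pre_tokenize("Hello, world. Is this-- a test?")
print(chunks)
print(len(chunks))
```

Each chunk is then byte-encoded and merged on its own, which is why the leading space ends up inside tokens like " world".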

**Key difference from tiktoken**: The original implementation is pure Python with lots of dictionary lookups. tiktoken is written in Rust for speed.

**I'm not copying that code here** - instead, I'm building my understanding of the algorithm.

### My Mental Model of the BPE Algorithm

Here's how I understand BPE works (in my own words):

1. **Pre-tokenization**: Split text on whitespace and punctuation using regex
   - Example: "Hello, world!" → ["Hello", ",", " world", "!"]

2. **Byte encoding**: Convert each substring to bytes, then to a special Unicode representation
   - This ensures we can handle any character set

3. **BPE merging**: For each substring:
   - Start with individual characters as tokens
   - Look up which pairs can be merged (from pre-learned merge rules)
   - Repeatedly merge the highest-priority pair until no more merges possible
   - This creates subword tokens like "Hello", "ing", "ed", etc.

4. **Token ID lookup**: Convert each subword token to its ID from the vocabulary
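Step 3 is the part I most wanted to see in code, so here is a minimal sketch of it in my own words. It assumes the merge rules are given as an ordered list where earlier means higher priority; `toy_merges` and `apply_merges` are made-up names, and the rules are invented for the example rather than taken from GPT-2's merges file.

```python
def apply_merges(symbols, ranks):
    """Repeatedly merge the adjacent pair with the best (lowest) rank."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules; position in the list is the priority
toy_merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
ranks = {pair: rank for rank, pair in enumerate(toy_merges)}

print(apply_merges("lowest", ranks))  # ['low', 'est']
print(apply_merges("xyz", ranks))     # ['x', 'y', 'z']
```

Note that the choice is driven by rule priority, not by position in the word: "es" is merged before "lo" even though "lo" comes first in "lowest".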

**What I still find tricky**: Understanding exactly how the merge rules were originally learned (that involves counting pair frequencies in training data, which is a separate process).

---

## Part 3: Performance Considerations

From the author's experiments (see source-material), the performance ranking is roughly:

1. **tiktoken** - Fastest (10-20x faster than original)
   - Written in Rust with optimized string operations
   - Minimal Python overhead
   
2. **Transformers (Fast)** - Fast (uses Rust tokenizers library)
   - ~2-3x faster than original
   - Slight overhead from the framework
   
3. **Original GPT-2** - Baseline (pure Python)
   - Good for understanding the algorithm
   - Too slow for production use
   
4. **Transformers (Python)** - Slower than original
   - More features but slower due to framework overhead

**My takeaway for practical use**: Use tiktoken for anything performance-critical. Use transformers if I need compatibility with other Hugging Face models.
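Since timings depend on hardware and input size, I'd rather re-measure than trust a ranking. Below is a small timing helper built on the standard `timeit` module. `time_encoder` is my own hypothetical helper, and the UTF-8 byte "encoder" is only a stand-in so the sketch runs without extra packages.

```python
import timeit

def time_encoder(encode, text, repeats=5, number=20):
    """Return the best average time per call, in milliseconds."""
    runs = timeit.repeat(lambda: encode(text), repeat=repeats, number=number)
    return min(runs) / number * 1000

sample = "Hello, world. Is this-- a test? " * 1000

# Stand-in encoder (assumption): raw UTF-8 bytes instead of a real tokenizer
baseline_ms = time_encoder(lambda t: list(t.encode("utf-8")), sample)
print(f"UTF-8 bytes baseline: {baseline_ms:.3f} ms per call")
```

To compare real tokenizers, I can pass `tokenizer.encode` from the tiktoken cell above (or any other callable that takes a string) in place of the stand-in.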

---

## Part 4: When to Use Each Implementation

Here's my decision tree for choosing a tokenizer:

### Use **tiktoken** when:
- ✅ I'm working with OpenAI models (GPT-2, GPT-3, GPT-4)
- ✅ Performance matters (processing large datasets)
- ✅ I want clean, simple code
- ✅ I'm learning from scratch (like I am now!)

### Use **Hugging Face transformers** when:
- ✅ I need compatibility with many different models
- ✅ I'm using pre-trained models from HuggingFace Hub
- ✅ I need additional features (padding, truncation, attention masks)
- ✅ I'm building end-to-end pipelines with the transformers library

### Use **original encoder** when:
- ✅ I'm studying how BPE actually works (educational purposes)
- ✅ I need to understand legacy code
- ❌ NOT for production (too slow)

**My choice for this learning repo**: tiktoken, because it's fast, clean, and I'm focused on understanding fundamentals.

---

## My Takeaways and Next Steps

### What I learned today:
1. ✅ BPE is a compression algorithm that balances vocabulary size and sequence length
2. ✅ tiktoken is the modern, fast way to use GPT-2's BPE encoding
3. ✅ BPE works on bytes, so it can handle any text (even emojis!)
4. ✅ Different implementations exist with different speed/feature tradeoffs

### What I still need to practice:
- 🔄 Implementing a simple BPE algorithm from scratch (to really understand the merge process)
- 🔄 Understanding how the merge rules are learned from training data
- 🔄 Experimenting with creating my own vocabulary on a small dataset
- 🔄 Comparing how different vocabularies affect model performance

### Next learning steps:
1. Build a minimal BPE tokenizer from scratch (Chapter 2.6)
2. Train it on a small text corpus to see how merge rules are learned
3. Compare my implementation with tiktoken to validate correctness

### References:
- **Book**: *Build a Large Language Model From Scratch* by Sebastian Raschka, Chapter 2
- **Source material**: `source-material/ch02_bonus_bpe_from_author/` (author's comparison)
- **OpenAI**: Original GPT-2 encoder.py (Modified MIT License)
- **tiktoken docs**: https://github.com/openai/tiktoken

---

**Date completed**: January 27, 2026  
**Status**: Ready to move on to implementing BPE from scratch! 🚀