# 1.19a: Flannel Tokenizer Training

**Goal:** Train a custom 1000-token BPE tokenizer on our mixed English-Thai corpus.

## How BPE Tokenizer Training Works

BPE (Byte Pair Encoding) is a greedy algorithm that learns a vocabulary from a corpus by repeatedly merging the most frequent pair of adjacent tokens into a single new token.

### The Algorithm

1. **Start with a base vocabulary:** For byte-level BPE, this is the 256 individual bytes (0x00 through 0xFF).

2. **Count all pairs in the corpus:** Scan through the entire training corpus and count how many times each pair of adjacent tokens appears.
   - Example: If "th" appears together 1 million times, that pair gets a count of 1,000,000.

3. **Merge the most frequent pair:** Find the pair with the highest count and create a new token for it.
   - Example: Create a new token for "th" (the 257th vocabulary entry, which gets ID 256 if the bytes occupy IDs 0 through 255)
   - Add it to the vocabulary
   - Replace all instances of that pair in the corpus with the new token

4. **Repeat:** Re-count pairs with the updated corpus (now "th" is a single token), merge the next most frequent pair, and continue.

5. **Stop when you hit your target vocabulary size** (in our case, 1000 tokens).
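
The five steps above can be sketched in a few lines of plain Python. This is a toy, character-level illustration and not how the `tokenizers` library is implemented: the function name `train_bpe` and its `num_merges` parameter are made up for this example, it starts from characters rather than bytes, and it recounts every pair on each iteration, which is far too slow for a real corpus.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> tuple[list[str], list[tuple[str, str]]]:
    """Toy BPE trainer: returns the final token sequence and the learned merges."""
    tokens = list(text)  # Step 1: start from single symbols (characters here)
    merges = []
    for _ in range(num_merges):  # Step 5: stop after a fixed number of merges
        # Step 2: count every pair of adjacent tokens
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Step 3: pick the most frequent pair (ties go to the pair seen first)
        best, _count = pair_counts.most_common(1)[0]
        merges.append(best)
        # Replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged  # Step 4: the next iteration recounts on the updated corpus
    return tokens, merges


tokens, merges = train_bpe("the cat", num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
print(tokens)  # ['the', ' ', 'c', 'a', 't']
```

Production trainers avoid the full recount by updating only the pair counts affected by each merge, but the result of each step is the same idea.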

### Example

```
Corpus:     "the cat"
Base vocab: ['t', 'h', 'e', ' ', 'c', 'a']

Iteration 1:
  Most frequent pair: ('t', 'h') appears 1 time (every pair ties at 1; take the first)
  Create new token: 'th'
  Corpus becomes: "the cat" → [th][e][ ][c][a][t]
  Vocab: ['t', 'h', 'e', ' ', 'c', 'a', 'th']

Iteration 2:
  Most frequent pair: ('th', 'e') appears 1 time
  Create new token: 'the'
  Corpus becomes: [the][ ][c][a][t]
  Vocab: ['t', 'h', 'e', ' ', 'c', 'a', 'th', 'the']

...and so on
```

### Key Properties

- **Order doesn't matter:** The algorithm only cares about pair frequencies, not the order documents appear in the corpus. (Order can at most affect how exact ties between pairs are broken.)
- **Greedy:** It always picks the most frequent pair at each step. This is a good heuristic for compression, but it is not guaranteed to find the best possible vocabulary, and it does not necessarily follow linguistic structure.
- **Fast:** No neural networks, no gradients, just counting and sorting. Takes seconds to minutes on laptop CPUs.
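
The order-independence claim is easy to check on a toy corpus. The sketch below counts adjacent character pairs within each document and compares the counts before and after shuffling; the helper `pair_counts` and the three sample documents are invented for illustration.

```python
import random
from collections import Counter


def pair_counts(docs: list[str]) -> Counter:
    """Count adjacent symbol pairs within each document (never across documents)."""
    counts = Counter()
    for doc in docs:
        symbols = list(doc)
        counts.update(zip(symbols, symbols[1:]))
    return counts


docs = ["the cat", "the hat", "a thin thread"]
shuffled = docs[:]
random.Random(0).shuffle(shuffled)

# Counter equality ignores insertion order, so this compares frequencies only
same = pair_counts(docs) == pair_counts(shuffled)
print(same)  # True
```

Because the counts are identical, the most frequent pair at each step is the same regardless of document order, as long as no two pairs tie.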

## This Notebook

We'll use HuggingFace's `tokenizers` library (written in Rust for speed) to train a 1000-token BPE tokenizer on our mixed corpus:
- 80% English (from FineWeb)
- 20% Thai (from FineWeb-2)

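Rough expectation: because merges are driven by pair frequency, most learned tokens should be English subwords and a smaller share Thai, very roughly tracking the 80/20 mix. The split will not be exact. Thai characters take three bytes each in UTF-8, so part of the Thai merge budget goes to rebuilding single characters, and the base byte symbols and the special token also count toward the 1000.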

## Output

- `../data/flannel_tokenizer.json` - Trained tokenizer in HuggingFace format

## Parameters

In [1]:
# Input corpus (from 1.18a)
CORPUS_PATH = "../data/flannel_tokenizer_corpus.txt"

# Output tokenizer
TOKENIZER_OUTPUT = "../data/flannel_tokenizer.json"

# Tokenizer parameters
VOCAB_SIZE = 1000
MIN_FREQUENCY = 2  # Skip pairs that appear fewer than this many times

# Special tokens
SPECIAL_TOKENS = ["<|endoftext|>"]

# Random seed (not used in this notebook)
RANDOM_SEED = 42

## Imports

In [2]:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
from pathlib import Path
import time

print("✓ Imports complete")

✓ Imports complete


## Verify Corpus Exists

In [3]:
corpus_path = Path(CORPUS_PATH)

if not corpus_path.exists():
    raise FileNotFoundError(f"Corpus not found at {CORPUS_PATH}. Run 1.18a first.")

# Check corpus size
corpus_bytes = corpus_path.stat().st_size
corpus_mb = corpus_bytes / (1024 * 1024)

print(f"✓ Found corpus at {CORPUS_PATH}")
print(f"  Size: {corpus_bytes:,} bytes ({corpus_mb:.2f} MB)")

✓ Found corpus at ../data/flannel_tokenizer_corpus.txt
  Size: 104,920,936 bytes (100.06 MB)


## Create Tokenizer

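We'll use byte-level BPE, which represents text as UTF-8 bytes, so the base alphabet has at most 256 symbols, and merges up to our target vocabulary size. Note that `BpeTrainer` seeds the alphabet only with byte symbols that actually occur in the corpus unless `initial_alphabet=pre_tokenizers.ByteLevel.alphabet()` is passed, and special tokens also take vocabulary slots. The "256 base bytes" and "744 merges" figures printed by this notebook are therefore approximations.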

In [4]:
print("Creating tokenizer...\n")

# Create a byte-level BPE tokenizer
tokenizer = Tokenizer(models.BPE())

# Use byte-level pre-tokenizer: maps each UTF-8 byte to a printable character
# and splits text into word-like chunks, so merges never cross chunk boundaries
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Use byte-level decoder (converts bytes back to UTF-8)
tokenizer.decoder = decoders.ByteLevel()

print(f"✓ Created byte-level BPE tokenizer")
print(f"  Base vocabulary: 256 bytes")
print(f"  Target vocabulary: {VOCAB_SIZE:,} tokens")
print(f"  Merges to perform: {VOCAB_SIZE - 256:,}")

Creating tokenizer...

✓ Created byte-level BPE tokenizer
  Base vocabulary: 256 bytes
  Target vocabulary: 1,000 tokens
  Merges to perform: 744
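
The 744 figure above is `VOCAB_SIZE - 256`, which assumes all 256 byte symbols are in the base alphabet and that nothing else occupies a vocabulary slot. A slightly more careful estimate subtracts special tokens and the actual alphabet size. The helper below is a hypothetical sketch of that arithmetic; it assumes the trainer counts both the alphabet and the special tokens toward `vocab_size`, and the alphabet size of 200 is an invented value, not a measurement from this corpus.

```python
def merge_budget(vocab_size: int, alphabet_size: int, n_special: int) -> int:
    """Number of merges left after reserving slots for base symbols and special tokens."""
    return max(0, vocab_size - alphabet_size - n_special)


# Full byte alphabet plus the one special token used in this notebook
print(merge_budget(1000, 256, 1))  # 743

# Hypothetical case: only 200 distinct byte symbols occur in the corpus
print(merge_budget(1000, 200, 1))  # 799
```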


## Train Tokenizer

This is where the BPE algorithm runs. It will:
1. Scan the entire corpus
2. Count all byte pairs
3. Iteratively merge the most frequent pairs until we reach 1000 tokens

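Expected time: anywhere from a few seconds to a few minutes for ~100 MB of text, depending on hardware. The 1-3 minute estimate printed by the cell is conservative: the run recorded below finished in 5.6 seconds.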

In [5]:
print(f"Training tokenizer on {CORPUS_PATH}...\n")
print(f"This will take 1-3 minutes. The tokenizer is:")
print(f"  1. Scanning {corpus_mb:.0f} MB of text")
print(f"  2. Counting all byte pairs")
print(f"  3. Performing {VOCAB_SIZE - 256:,} merge operations\n")

# Create trainer
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    special_tokens=SPECIAL_TOKENS,
    show_progress=True
)

# Train (this is the slow part)
start_time = time.time()
tokenizer.train(files=[str(corpus_path)], trainer=trainer)
elapsed = time.time() - start_time

print(f"\n✓ Training complete in {elapsed:.1f} seconds ({elapsed/60:.2f} minutes)")

Training tokenizer on ../data/flannel_tokenizer_corpus.txt...

This will take 1-3 minutes. The tokenizer is:
  1. Scanning 100 MB of text
  2. Counting all byte pairs
  3. Performing 744 merge operations





✓ Training complete in 5.6 seconds (0.09 minutes)


## Analyze Tokenizer

Let's see what the tokenizer learned.

In [6]:
print(f"\nTokenizer statistics:\n")

vocab_size = tokenizer.get_vocab_size()
print(f"  Vocabulary size: {vocab_size:,} tokens")
print(f"  Special tokens: {len(SPECIAL_TOKENS)}")
# Approximation: assumes a full 256-symbol base alphabet and ignores special tokens
print(f"  Learned merges: {vocab_size - 256:,}")

# Get the vocabulary (token string -> ID)
vocab = tokenizer.get_vocab()

# Show the tokens with IDs 256-275. These are learned merges, but not necessarily
# the first 20: merged tokens in this vocabulary can have IDs below 256.
print(f"\nExample tokens (first 20 after base bytes):")
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])
for token, idx in sorted_vocab[256:276]:
    # repr() escapes control characters; strip its outer quotes for display
    display = repr(token)[1:-1]
    print(f"  {idx:4d}: {display}")


Tokenizer statistics:

  Vocabulary size: 1,000 tokens
  Special tokens: 1
  Learned merges: 744

Example tokens (first 20 after base bytes):
   256: Ġl
   257: à¸ĩ
   258: ve
   259: st
   260: Ġe
   261: Ġn
   262: à¸±
   263: ro
   264: Ġre
   265: à¸¡
   266: Ġy
   267: Ġg
   268: ĠI
   269: ly
   270: à¹Ģà¸
   271: ct
   272: Ġbe
   273: à¸µ
   274: ĠT
   275: ut
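
Strings like `Ġl` and `à¸ĩ` are not corrupted text. Byte-level BPE displays each byte as a printable stand-in character, so a space shows up as `Ġ` and each Thai character shows up as three Latin-looking symbols. The sketch below rebuilds that byte-to-character table and reverses it. It assumes the library uses the same table as GPT-2's byte-level BPE; the names `bytes_to_unicode` and `readable` are local helpers written for this example, not part of the `tokenizers` API.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are mapped to code points starting at U+0100
            table[b] = chr(256 + shift)
            shift += 1
    return table


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def readable(token: str) -> str:
    """Turn a byte-level token string back into text (partial characters become U+FFFD)."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


for tok in ["Ġl", "Ġre", "à¸ĩ", "à¸±"]:
    print(repr(tok), "->", repr(readable(tok)))
```

Tokens that end partway through a multi-byte character, such as `à¹Ģà¸` in the list above, decode to a character followed by a replacement marker, because the token holds only the first bytes of the next character.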


## Test Tokenizer

Let's encode some test strings to verify the tokenizer works.

In [7]:
print(f"\nTesting tokenizer...\n")

test_strings = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "สวัสดีครับ",  # Thai: "Hello"
    "ภาษาไทย"      # Thai: "Thai language"
]

for test_str in test_strings:
    encoding = tokenizer.encode(test_str)
    tokens = encoding.tokens
    ids = encoding.ids
    
    print(f"Input:  {repr(test_str)}")
    print(f"Tokens: {tokens}")
    print(f"IDs:    {ids}")
    print(f"Count:  {len(ids)} tokens")
    print()


Testing tokenizer...

Input:  'Hello, world!'
Tokens: ['H', 'ell', 'o', ',', 'Ġwor', 'ld', '!']
IDs:    [40, 977, 79, 12, 450, 342, 1]
Count:  7 tokens

Input:  'The quick brown fox jumps over the lazy dog.'
Tokens: ['The', 'Ġqu', 'ick', 'Ġb', 'ro', 'wn', 'Ġf', 'o', 'x', 'Ġj', 'um', 'p', 's', 'Ġover', 'Ġthe', 'Ġl', 'a', 'z', 'y', 'Ġdo', 'g', '.']
IDs:    [465, 605, 585, 224, 263, 681, 223, 79, 88, 459, 414, 80, 83, 648, 209, 256, 65, 90, 89, 441, 71, 14]
Count:  22 tokens

Input:  'สวัสดีครับ'
Tokens: ['à¸ª', 'à¸§', 'à¸±', 'à¸ª', 'à¸Ķ', 'à¸µ', 'à¸Ħà¸£', 'à¸±', 'à¸ļ']
IDs:    [317, 282, 262, 317, 291, 273, 751, 262, 311]
Count:  9 tokens

Input:  'ภาษาไทย'
Tokens: ['à¸ł', 'à¸²à¸', '©', 'à¸²', 'à¹Ħ', 'à¸Ĺ', 'à¸¢']
IDs:    [737, 233, 103, 227, 333, 303, 290]
Count:  7 tokens
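
One way to compare how well the tokenizer compresses each language is bytes per token: the UTF-8 length of the input divided by the token count. The sketch below applies that to the four test strings, using the token counts from the output above. The helper name `bytes_per_token` is made up for this example, and four short strings are far too few to draw conclusions about the corpus as a whole.

```python
def bytes_per_token(text: str, n_tokens: int) -> float:
    """Average number of UTF-8 bytes covered by each token."""
    return len(text.encode("utf-8")) / n_tokens


# Token counts copied from the test output above
samples = {
    "Hello, world!": 7,
    "The quick brown fox jumps over the lazy dog.": 22,
    "สวัสดีครับ": 9,
    "ภาษาไทย": 7,
}

for text, n_tokens in samples.items():
    print(f"{bytes_per_token(text, n_tokens):.2f} bytes/token  {text!r}")
```

Higher values mean better compression. Thai text starts at three bytes per character, so a Thai figure near 3.0 means the tokenizer is emitting roughly one token per character.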



## Save Tokenizer

In [8]:
print(f"Saving tokenizer to {TOKENIZER_OUTPUT}...\n")

# Ensure directory exists
Path(TOKENIZER_OUTPUT).parent.mkdir(parents=True, exist_ok=True)

# Save in HuggingFace format (JSON)
tokenizer.save(str(TOKENIZER_OUTPUT))

# Verify file was created
output_path = Path(TOKENIZER_OUTPUT)
if output_path.exists():
    output_kb = output_path.stat().st_size / 1024
    print(f"✓ Saved tokenizer")
    print(f"  Path: {TOKENIZER_OUTPUT}")
    print(f"  Size: {output_kb:.1f} KB")
else:
    raise FileNotFoundError(f"Failed to save tokenizer to {TOKENIZER_OUTPUT}")

Saving tokenizer to ../data/flannel_tokenizer.json...

✓ Saved tokenizer
  Path: ../data/flannel_tokenizer.json
  Size: 54.2 KB


## Summary

In [9]:
print(f"\n{'='*70}")
print(f"TOKENIZER TRAINING COMPLETE")
print(f"{'='*70}\n")

print(f"Training corpus:")
print(f"  Path: {CORPUS_PATH}")
print(f"  Size: {corpus_mb:.2f} MB")
print(f"  Composition: ~80% English, ~20% Thai")
print()

print(f"Tokenizer:")
print(f"  Type: Byte-level BPE")
print(f"  Vocabulary size: {vocab_size:,} tokens")
print(f"  Base tokens: 256 bytes")
print(f"  Learned merges: {vocab_size - 256:,}")
print(f"  Training time: {elapsed:.1f} seconds")
print()

print(f"Output:")
print(f"  Path: {TOKENIZER_OUTPUT}")
print()

print(f"Next steps:")
print(f"  → Use this tokenizer to train Flannel models (notebook 1.20a+)")
print(f"  → Expect ~200 Thai tokens to be dead (never appear in English model training)")
print(f"  → Watch what happens to those dead tokens during training")
print()
print(f"{'='*70}")


TOKENIZER TRAINING COMPLETE

Training corpus:
  Path: ../data/flannel_tokenizer_corpus.txt
  Size: 100.06 MB
  Composition: ~80% English, ~20% Thai

Tokenizer:
  Type: Byte-level BPE
  Vocabulary size: 1,000 tokens
  Base tokens: 256 bytes
  Learned merges: 744
  Training time: 5.6 seconds

Output:
  Path: ../data/flannel_tokenizer.json

Next steps:
  → Use this tokenizer to train Flannel models (notebook 1.20a+)
  → Expect ~200 Thai tokens to be dead (never appear in English model training)
  → Watch what happens to those dead tokens during training

