# 1.19b: Flannel Tokenizer Training (Character-level)

**Goal:** Train a custom 10,000-token BPE tokenizer on our mixed English-Thai corpus using **character-level** encoding.

## Why Character-level instead of Byte-level?

In **1.19a** we used byte-level BPE, which works on raw bytes (0x00-0xFF). Its tokens are often **partial UTF-8 sequences**: they are not valid Unicode characters and are hard to interpret.

**Character-level BPE** works on Unicode characters instead:
- Start with a vocabulary of actual characters (a, b, c, ก, ข, ค, etc.)
- Merge character pairs, not byte pairs
- All tokens are **valid Unicode strings** → much more interpretable

### Example

```
Byte-level:   Token 257 = à¸ĩ  (partial UTF-8, unreadable)
Character-level: Token 257 = "วั"  (two Thai characters, readable)
```
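
To make the byte/character distinction concrete, here is a small sketch using only Python's built-in string handling. It shows that the two-character Thai string from the example occupies six UTF-8 bytes, and that a prefix cut in the middle of a character cannot be decoded. The variable names are illustrative and not part of the notebook.

```python
# Compare the character-level and byte-level views of a short Thai string.
text = "\u0e27\u0e31"  # the two characters shown in the example above (U+0E27, U+0E31)

utf8_bytes = text.encode("utf-8")
print(f"characters: {len(text)}")
print(f"bytes:      {len(utf8_bytes)} ({utf8_bytes.hex(' ')})")

# A byte-level merge may end in the middle of a character.
# Such a fragment is not valid UTF-8 on its own.
partial = utf8_bytes[:2]
try:
    partial.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False
print(f"first two bytes decode cleanly: {decodes_cleanly}")
```

Each Thai character takes three bytes in UTF-8, so a byte-level tokenizer needs several merges before a token lines up with a character boundary.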

### Interpretability Benefits

- Easy to classify tokens as English vs Thai (check Unicode script ranges)
- Can decode and display any token meaningfully
- Produces tokens that display as clean Unicode, like the decoded output of large models such as Qwen 3 4B (whose tokenizer, to our knowledge, is itself byte-level BPE under the hood)

## How Character-level BPE Works

The algorithm is the same as byte-level BPE, but it operates on characters:

1. **Start with character vocabulary:** Extract all unique Unicode characters from the corpus (a-z, A-Z, Thai script, punctuation, etc.)

2. **Count character pairs:** Scan corpus and count how often each pair of adjacent characters appears.

3. **Merge most frequent pair:** Create new token, add to vocabulary.

4. **Repeat** until target vocabulary size.
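
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation inside the `tokenizers` library: the function name `train_char_bpe` is hypothetical, the corpus is a short word list, and ties are broken by whichever pair was seen first, which may differ from the library's behavior.

```python
from collections import Counter

def train_char_bpe(words, num_merges):
    """Toy character-level BPE: return the learned merges in order."""
    # Step 1: each word starts as a tuple of single characters, weighted by count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Step 3: merge the most frequent pair (ties: first pair seen wins).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
        # Step 4: the loop repeats until num_merges is reached.
    return merges

merges = train_char_bpe(["the", "then", "they", "การ", "การ"], num_merges=3)
for left, right in merges:
    print(f"{left!r} + {right!r} -> {left + right!r}")
```

Every merged token is a concatenation of whole characters, so it is always a valid Unicode string, which is the property this notebook relies on.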

## This Notebook

Train a 10,000-token character-level BPE tokenizer on:
- 80% English (from FineWeb)
- 20% Thai (from FineWeb-2)

Expected result: merged tokens split roughly 80/20 between English and Thai, in line with the corpus mix, all human-readable.

## Output

- `../data/flannel_tokenizer_chars.json` - Trained tokenizer in HuggingFace format

## Parameters

In [23]:
# Input corpus (from 1.18a)
CORPUS_PATH = "../data/flannel_tokenizer_corpus.txt"

# Output tokenizer
TOKENIZER_OUTPUT = "../data/flannel_tokenizer_chars.json"

# Tokenizer parameters
VOCAB_SIZE = 10000
MIN_FREQUENCY = 2  # Ignore pairs that appear fewer than this many times

# Special tokens
SPECIAL_TOKENS = ["<|endoftext|>"]

# Random seed
RANDOM_SEED = 42

## Imports

In [24]:
# Only the pieces used below; no normalizer is applied in this notebook
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from pathlib import Path
import time

print("✓ Imports complete")

✓ Imports complete


## Verify Corpus Exists

In [25]:
corpus_path = Path(CORPUS_PATH)

if not corpus_path.exists():
    raise FileNotFoundError(f"Corpus not found at {CORPUS_PATH}. Run 1.18a first.")

# Check corpus size
corpus_bytes = corpus_path.stat().st_size
corpus_mb = corpus_bytes / (1024 * 1024)

print(f"✓ Found corpus at {CORPUS_PATH}")
print(f"  Size: {corpus_bytes:,} bytes ({corpus_mb:.2f} MB)")

✓ Found corpus at ../data/flannel_tokenizer_corpus.txt
  Size: 104,920,936 bytes (100.06 MB)


## Analyze Corpus Character Set

Before training, let's see what characters appear in our corpus.

In [26]:
print(f"Analyzing corpus character set...\n")

# Read corpus
with open(corpus_path, 'r', encoding='utf-8') as f:
    corpus_text = f.read()

# Find unique characters
unique_chars = set(corpus_text)

print(f"✓ Found {len(unique_chars):,} unique characters in corpus")

# Classify by Unicode range
ascii_chars = [c for c in unique_chars if ord(c) < 128]
thai_chars = [c for c in unique_chars if 0x0E00 <= ord(c) <= 0x0E7F]  # Thai script range
other_chars = [c for c in unique_chars if c not in ascii_chars and c not in thai_chars]

print(f"  ASCII characters: {len(ascii_chars):,}")
print(f"  Thai characters: {len(thai_chars):,}")
print(f"  Other characters: {len(other_chars):,}")
print()

# Show some Thai characters
if thai_chars:
    print(f"Sample Thai characters (first 20):")
    for i, char in enumerate(sorted(thai_chars)[:20]):
        print(f"  {char} (U+{ord(char):04X})")
    print()

print(f"This character set will be the base vocabulary for BPE.")

Analyzing corpus character set...

✓ Found 2,800 unique characters in corpus
  ASCII characters: 96
  Thai characters: 86
  Other characters: 2,618

Sample Thai characters (first 20):
  ก (U+0E01)
  ข (U+0E02)
  ฃ (U+0E03)
  ค (U+0E04)
  ฅ (U+0E05)
  ฆ (U+0E06)
  ง (U+0E07)
  จ (U+0E08)
  ฉ (U+0E09)
  ช (U+0E0A)
  ซ (U+0E0B)
  ฌ (U+0E0C)
  ญ (U+0E0D)
  ฎ (U+0E0E)
  ฏ (U+0E0F)
  ฐ (U+0E10)
  ฑ (U+0E11)
  ฒ (U+0E12)
  ณ (U+0E13)
  ด (U+0E14)

This character set will be the base vocabulary for BPE.
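
The "Other characters" bucket is large, so it can help to break it down further. One rough way, sketched below with the standard `unicodedata` module, is to label each character by the first word of its Unicode name. The helper `script_hint` and the sample string are assumptions for illustration; in the notebook you would pass `unique_chars` instead of `sample_chars`.

```python
import unicodedata
from collections import Counter

def script_hint(char):
    """Rough script label: the first word of the character's Unicode name."""
    name = unicodedata.name(char, "")
    return name.split()[0] if name else "UNNAMED"

# Stand-in for the notebook's unique_chars set.
sample_chars = set("Hello, สวัสดี! Привет 你好\n")

by_script = Counter(script_hint(c) for c in sample_chars)
for label, count in by_script.most_common():
    print(f"{label:12s} {count}")
```

This is only a heuristic: punctuation and symbols get labels such as `COMMA` or `SPACE`, and control characters have no name at all, so they land in `UNNAMED`.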


## Create Tokenizer

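We'll use character-level BPE with the `Whitespace` pre-tokenizer, which splits text into runs of word characters and runs of punctuation and discards the whitespace itself. As a result, merges never cross word boundaries and spaces do not become tokens.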

In [27]:
print(f"\nCreating tokenizer...\n")

# Create a character-level BPE tokenizer
tokenizer = Tokenizer(models.BPE())

# Whitespace pre-tokenizer: splits with the regex \w+|[^\w\s]+, so words and
# punctuation become separate pieces and whitespace is dropped
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

print(f"✓ Created character-level BPE tokenizer")
print(f"  Base vocabulary: {len(unique_chars):,} unique characters from corpus")
print(f"  Target vocabulary: {VOCAB_SIZE:,} tokens")
# Rough estimate: ignores the vocabulary slots taken by special tokens
print(f"  Merges to perform: {VOCAB_SIZE - len(unique_chars):,}")


Creating tokenizer...

✓ Created character-level BPE tokenizer
  Base vocabulary: 2,800 unique characters from corpus
  Target vocabulary: 10,000 tokens
  Merges to perform: 7,200
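
The merge count printed above is an estimate. As a sketch of the arithmetic, assuming each special token and each base character occupies exactly one vocabulary slot (consistent with the first merged token ID being 2801 in the vocabulary listing later in this notebook), the budget for learned merges is slightly smaller. The function name `merge_budget` and the constant `NUM_BASE_CHARS` are hypothetical.

```python
VOCAB_SIZE = 10_000
NUM_BASE_CHARS = 2_800               # unique characters found in the corpus
SPECIAL_TOKENS = ["<|endoftext|>"]

def merge_budget(vocab_size, num_base_chars, num_special):
    """Slots left for learned merges after special tokens and base characters."""
    # Assumption: one slot per special token and one per base character.
    return max(0, vocab_size - num_base_chars - num_special)

budget = merge_budget(VOCAB_SIZE, NUM_BASE_CHARS, len(SPECIAL_TOKENS))
print(f"Merge budget: {budget:,}")
```

The `max(0, ...)` guard matters when the base alphabet alone exceeds the target size, for example a 1,000-token target with 2,800 base characters would leave no room for merges at all.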


## Train Tokenizer

This is where the BPE algorithm runs. It will:
1. Scan the entire corpus
2. Count all character pairs
3. Iteratively merge the most frequent pairs until the vocabulary reaches `VOCAB_SIZE` (10,000) tokens

Expected time: up to a few minutes on an M4 Pro for ~100 MB of text; the run recorded below finished in about 6 seconds.

In [28]:
print(f"Training tokenizer on {CORPUS_PATH}...\n")
print(f"This will take 1-3 minutes. The tokenizer is:")
print(f"  1. Scanning {corpus_mb:.0f} MB of text")
print(f"  2. Counting all character pairs")
print(f"  3. Performing {max(0, VOCAB_SIZE - len(unique_chars)):,} merge operations\n")

# Create trainer
trainer = trainers.BpeTrainer(
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    special_tokens=SPECIAL_TOKENS,
    show_progress=True,
    initial_alphabet=list(unique_chars)  # Start with corpus character set
)

# Train (this is the slow part)
start_time = time.time()
tokenizer.train(files=[str(corpus_path)], trainer=trainer)
elapsed = time.time() - start_time

print(f"\n✓ Training complete in {elapsed:.1f} seconds ({elapsed/60:.2f} minutes)")

Training tokenizer on ../data/flannel_tokenizer_corpus.txt...

This will take 1-3 minutes. The tokenizer is:
  1. Scanning 100 MB of text
  2. Counting all character pairs
  3. Performing 7,200 merge operations





✓ Training complete in 6.1 seconds (0.10 minutes)


## Analyze Tokenizer

Let's see what the tokenizer learned. All tokens should be valid Unicode strings now.

In [29]:
print(f"\nTokenizer statistics:\n")

vocab_size = tokenizer.get_vocab_size()
print(f"  Vocabulary size: {vocab_size:,} tokens")
print(f"  Special tokens: {len(SPECIAL_TOKENS)}")
print(f"  Base characters: {len(unique_chars):,}")
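# Approximate: special tokens also occupy vocabulary slots, so the true merge count is slightly lower
print(f"  Learned merges: {vocab_size - len(unique_chars):,}")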

# Get the vocabulary
vocab = tokenizer.get_vocab()

# Show some example merged tokens (skip base characters)
print(f"\nExample merged tokens (first 30):")
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])

# Skip to merged tokens (past base character set)
merged_tokens = [(token, idx) for token, idx in sorted_vocab if len(token) > 1 and token not in SPECIAL_TOKENS]

for token, idx in merged_tokens[:30]:
    # All tokens should be valid Unicode now
    print(f"  {idx:4d}: {repr(token)}")


Tokenizer statistics:

  Vocabulary size: 10,000 tokens
  Special tokens: 1
  Base characters: 2,800
  Learned merges: 7,200

Example merged tokens (first 30):
  2801: 'th'
  2802: 'in'
  2803: 'er'
  2804: 'an'
  2805: 'on'
  2806: 'the'
  2807: 're'
  2808: 'at'
  2809: 'en'
  2810: 'or'
  2811: 'ou'
  2812: 'es'
  2813: 'al'
  2814: 'is'
  2815: 'to'
  2816: 'ing'
  2817: 'ed'
  2818: 'and'
  2819: 'ar'
  2820: 'it'
  2821: 'as'
  2822: 'of'
  2823: 'ic'
  2824: 'le'
  2825: 'st'
  2826: 'ion'
  2827: 'om'
  2828: 'il'
  2829: 'ent'
  2830: 'he'


## Classify Token Languages

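Since all tokens are valid Unicode, we can easily classify them by script. Note that the `english` label below really means "contains an ASCII character", so digits and punctuation are counted as English too.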

In [30]:
print(f"\nClassifying tokens by language...\n")

def classify_token(token):
    """Classify a token as 'special', 'mixed', 'thai', 'english', or 'other'."""
    if token in SPECIAL_TOKENS:
        return 'special'
    
    # Check character composition
    has_ascii = any(ord(c) < 128 for c in token)
    has_thai = any(0x0E00 <= ord(c) <= 0x0E7F for c in token)
    
    if has_thai and has_ascii:
        return 'mixed'
    elif has_thai:
        return 'thai'
    elif has_ascii:
        return 'english'
    else:
        return 'other'

# Classify all tokens
token_classes = {}
for token, idx in vocab.items():
    token_classes[idx] = classify_token(token)

# Count by category
from collections import Counter
class_counts = Counter(token_classes.values())

print(f"Token classification:")
print(f"  English tokens: {class_counts['english']:,} ({100*class_counts['english']/vocab_size:.1f}%)")
print(f"  Thai tokens: {class_counts['thai']:,} ({100*class_counts['thai']/vocab_size:.1f}%)")
print(f"  Mixed tokens: {class_counts['mixed']:,} ({100*class_counts['mixed']/vocab_size:.1f}%)")
print(f"  Special tokens: {class_counts['special']:,}")
print(f"  Other tokens: {class_counts['other']:,}")
print()

# Show some examples of each type
print(f"Example English tokens:")
english_tokens = [(t, i) for t, i in vocab.items() if token_classes[i] == 'english' and len(t) > 1]
for token, idx in sorted(english_tokens, key=lambda x: x[1])[:10]:
    print(f"  {idx:4d}: {repr(token)}")
print()

print(f"Example Thai tokens:")
thai_tokens = [(t, i) for t, i in vocab.items() if token_classes[i] == 'thai' and len(t) > 1]
for token, idx in sorted(thai_tokens, key=lambda x: x[1])[:10]:
    print(f"  {idx:4d}: {repr(token)}")


Classifying tokens by language...

Token classification:
  English tokens: 6,108 (61.1%)
  Thai tokens: 1,272 (12.7%)
  Mixed tokens: 0 (0.0%)
  Special tokens: 1
  Other tokens: 2,619

Example English tokens:
  2801: 'th'
  2802: 'in'
  2803: 'er'
  2804: 'an'
  2805: 'on'
  2806: 'the'
  2807: 're'
  2808: 'at'
  2809: 'en'
  2810: 'or'

Example Thai tokens:
  2901: 'าร'
  2903: 'อง'
  2912: 'ี่'
  2914: '่า'
  2934: 'ที่'
  2943: 'การ'
  2956: '้า'
  2975: 'ระ'
  2980: 'าม'
  2985: '่อ'


## Test Tokenizer

Let's encode some test strings to verify the tokenizer works and produces readable tokens.

In [31]:
print(f"\nTesting tokenizer...\n")

test_strings = [
    "Hello, world!",
    "The quick brown fox jumps over the lazy dog.",
    "สวัสดีครับ",  # Thai: "Hello" (polite, male)
    "ภาษาไทย"      # Thai: "Thai language"
]

for test_str in test_strings:
    encoding = tokenizer.encode(test_str)
    tokens = encoding.tokens
    ids = encoding.ids
    
    print(f"Input:  {repr(test_str)}")
    print(f"Tokens: {tokens}")
    print(f"IDs:    {ids}")
    print(f"Count:  {len(ids)} tokens")
    
    # Classify tokens in this encoding
    classes = [token_classes[tid] for tid in ids]
    print(f"Types:  {classes}")
    print()


Testing tokenizer...

Input:  'Hello, world!'
Tokens: ['Hel', 'lo', ',', 'world', '!']
IDs:    [5976, 2840, 14, 3501, 3]
Count:  5 tokens
Types:  ['english', 'english', 'english', 'english', 'english']

Input:  'The quick brown fox jumps over the lazy dog.'
Tokens: ['The', 'quick', 'brown', 'fo', 'x', 'jum', 'ps', 'over', 'the', 'l', 'azy', 'dog', '.']
IDs:    [2870, 4469, 8657, 3760, 90, 9440, 3432, 3093, 2806, 78, 7951, 5013, 16]
Count:  13 tokens
Types:  ['english', 'english', 'english', 'english', 'english', 'english', 'english', 'english', 'english', 'english', 'english', 'english', 'english']

Input:  'สวัสดีครับ'
Tokens: ['ส', 'วั', 'ส', 'ดี', 'ครับ']
IDs:    [714, 4985, 714, 3631, 4681]
Count:  5 tokens
Types:  ['thai', 'thai', 'thai', 'thai', 'thai']

Input:  'ภาษาไทย'
Tokens: ['ภาษา', 'ไทย']
IDs:    [7397, 4240]
Count:  2 tokens
Types:  ['thai', 'thai']



## Save Tokenizer

In [32]:
print(f"Saving tokenizer to {TOKENIZER_OUTPUT}...\n")

# Ensure directory exists
Path(TOKENIZER_OUTPUT).parent.mkdir(parents=True, exist_ok=True)

# Save in HuggingFace format (JSON)
tokenizer.save(str(TOKENIZER_OUTPUT))

# Verify file was created
output_path = Path(TOKENIZER_OUTPUT)
if output_path.exists():
    output_kb = output_path.stat().st_size / 1024
    print(f"✓ Saved tokenizer")
    print(f"  Path: {TOKENIZER_OUTPUT}")
    print(f"  Size: {output_kb:.1f} KB")
else:
    raise FileNotFoundError(f"Failed to save tokenizer to {TOKENIZER_OUTPUT}")

Saving tokenizer to ../data/flannel_tokenizer_chars.json...

✓ Saved tokenizer
  Path: ../data/flannel_tokenizer_chars.json
  Size: 526.1 KB


## Summary

In [33]:
print(f"\n{'='*70}")
print(f"TOKENIZER TRAINING COMPLETE")
print(f"{'='*70}\n")

print(f"Training corpus:")
print(f"  Path: {CORPUS_PATH}")
print(f"  Size: {corpus_mb:.2f} MB")
print(f"  Composition: ~80% English, ~20% Thai")
print(f"  Unique characters: {len(unique_chars):,}")
print()

print(f"Tokenizer:")
print(f"  Type: Character-level BPE")
print(f"  Vocabulary size: {vocab_size:,} tokens")
print(f"  Base characters: {len(unique_chars):,}")
print(f"  Learned merges: {vocab_size - len(unique_chars):,}")
print(f"  Training time: {elapsed:.1f} seconds")
print()

print(f"Token composition:")
print(f"  English: {class_counts['english']:,} tokens ({100*class_counts['english']/vocab_size:.1f}%)")
print(f"  Thai: {class_counts['thai']:,} tokens ({100*class_counts['thai']/vocab_size:.1f}%)")
print(f"  Mixed: {class_counts['mixed']:,} tokens")
print(f"  Other: {class_counts['other']:,} tokens")
print()

print(f"Output:")
print(f"  Path: {TOKENIZER_OUTPUT}")
print()

print(f"Benefits of character-level encoding:")
print(f"  ✓ All tokens are valid Unicode strings (interpretable)")
print(f"  ✓ Easy to classify tokens by language/script")
print(f"  ✓ Matches approach used by models like Qwen 3 4B")
print(f"  ✓ Better for tracking token movement during training")
print()

print(f"Next steps:")
print(f"  → Use this tokenizer to train Flannel models (notebook 1.20a+)")
print(f"  → ~{class_counts['thai']} Thai tokens will be dead (never appear in English training)")
print(f"  → Watch what happens to those dead tokens during training")
print()
print(f"{'='*70}")


TOKENIZER TRAINING COMPLETE

Training corpus:
  Path: ../data/flannel_tokenizer_corpus.txt
  Size: 100.06 MB
  Composition: ~80% English, ~20% Thai
  Unique characters: 2,800

Tokenizer:
  Type: Character-level BPE
  Vocabulary size: 10,000 tokens
  Base characters: 2,800
  Learned merges: 7,200
  Training time: 6.1 seconds

Token composition:
  English: 6,108 tokens (61.1%)
  Thai: 1,272 tokens (12.7%)
  Mixed: 0 tokens
  Other: 2,619 tokens

Output:
  Path: ../data/flannel_tokenizer_chars.json

Benefits of character-level encoding:
  ✓ All tokens are valid Unicode strings (interpretable)
  ✓ Easy to classify tokens by language/script
  ✓ Matches approach used by models like Qwen 3 4B
  ✓ Better for tracking token movement during training

Next steps:
  → Use this tokenizer to train Flannel models (notebook 1.20a+)
  → ~1272 Thai tokens will be dead (never appear in English training)
  → Watch what happens to those dead tokens during training

