# 1.19b2: Flannel Tokenizer Demographics

**Goal:** Analyze the composition of our 10,000-token Flannel tokenizer.

## Why This Matters

We trained our tokenizer on 80% English + 20% Thai, and we'll train our model on 100% English. This means **all Thai tokens should be dead** (never appear during model training).

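By counting exact token demographics now, we can validate the experiment later:
- If we find **N pure Thai tokens** in the vocabulary
- And exactly **N tokens never get trained** during English-only training
- Then the dead tokens are exactly the Thai tokens, which validates our experimental design

The counts only line up if every non-Thai token appears at least once in the training corpus, so the comparison is most reliable when done on token IDs rather than on totals alone.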
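
The check described above can be sketched as a set comparison. This is a minimal illustration on made-up token IDs; `compare_dead_tokens`, `predicted_dead`, and `seen_in_training` are hypothetical names, not variables from this notebook.

```python
def compare_dead_tokens(predicted_dead, seen_in_training, vocab_size):
    """Compare predicted-dead token IDs against IDs never seen in training."""
    all_ids = set(range(vocab_size))
    actually_dead = all_ids - set(seen_in_training)
    predicted = set(predicted_dead)
    return {
        "match": actually_dead == predicted,
        # Predicted dead, but showed up in the training data
        "unexpectedly_alive": sorted(predicted - actually_dead),
        # Never seen in training, but not predicted dead
        "unexpectedly_dead": sorted(actually_dead - predicted),
    }


# Toy example: a 6-token vocabulary where IDs 4 and 5 are Thai
report = compare_dead_tokens(
    predicted_dead=[4, 5],
    seen_in_training=[0, 1, 2],  # ID 3 is a non-Thai token that never occurs
    vocab_size=6,
)
print(report)
```

When the sets differ, the two diagnostic lists show which side of the prediction failed.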

## Token Categories

We'll classify every token as:
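1. **Pure English:** Contains ASCII letters; digits, whitespace, and ASCII punctuation are allowed, but no Thai or other characters
2. **Pure Thai:** Contains Thai script (U+0E00–U+0E7F); whitespace and ASCII punctuation are allowed
3. **Mixed:** Combines Thai with English, or either of them with other characters
4. **Numeric:** Contains ASCII digits; whitespace and ASCII punctuation are allowed
5. **Whitespace:** Spaces, tabs, newlines (ASCII punctuation is allowed alongside)
6. **Punctuation:** Pure ASCII punctuation (no letters, digits, or whitespace)
7. **Special:** Special tokens like `<|endoftext|>`
8. **Other:** Everything else (emoji, other scripts, Thai combined with ASCII digits, etc.)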

## This Notebook

1. Load the Flannel tokenizer
2. Classify every token
3. Show statistics and examples
4. Save demographics data for later validation

## Parameters

In [1]:
# Input tokenizer (from 1.19b)
TOKENIZER_PATH = "../data/flannel_tokenizer_chars.json"

# Output demographics
OUTPUT_PATH = "../data/flannel_token_demographics.json"

# Special tokens
SPECIAL_TOKENS = ["<|endoftext|>"]

print("✓ Parameters set")

✓ Parameters set


## Imports

In [2]:
from tokenizers import Tokenizer
from pathlib import Path
from collections import Counter, defaultdict
import json
import unicodedata

print("✓ Imports complete")

✓ Imports complete


## Load Tokenizer

In [3]:
tokenizer_path = Path(TOKENIZER_PATH)

if not tokenizer_path.exists():
    raise FileNotFoundError(f"Tokenizer not found at {TOKENIZER_PATH}. Run 1.19b first.")

print(f"Loading tokenizer from {TOKENIZER_PATH}...\n")
tokenizer = Tokenizer.from_file(str(tokenizer_path))

vocab = tokenizer.get_vocab()
vocab_size = len(vocab)

print(f"✓ Loaded tokenizer")
print(f"  Vocabulary size: {vocab_size:,} tokens")

Loading tokenizer from ../data/flannel_tokenizer_chars.json...

✓ Loaded tokenizer
  Vocabulary size: 10,000 tokens
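
Later steps use token IDs as keys, so it can help to confirm that the IDs are unique and contiguous. The sketch below runs on a small hand-written dict in the same token → ID shape that `get_vocab()` returns; `check_vocab_ids` is a hypothetical helper, not part of the `tokenizers` library.

```python
def check_vocab_ids(vocab):
    """Return (id_to_token, is_contiguous) for a token -> ID mapping."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    # Duplicate IDs would collapse entries when the mapping is inverted
    if len(id_to_token) != len(vocab):
        raise ValueError("Duplicate token IDs in vocabulary")
    is_contiguous = sorted(id_to_token) == list(range(len(vocab)))
    return id_to_token, is_contiguous


# Hand-written stand-in for tokenizer.get_vocab()
sample_vocab = {"<|endoftext|>": 0, "A": 1, "ก": 2}
id_to_token, is_contiguous = check_vocab_ids(sample_vocab)
print(id_to_token[2], is_contiguous)
```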


## Define Classification Function

In [4]:
def classify_token(token):
    """
    Classify a token into one of several categories.
    
    Returns: (category, subcategory)
    """
    # Special tokens
    if token in SPECIAL_TOKENS:
        return ('special', 'special')
    
    # Empty token (shouldn't happen, but check)
    if not token:
        return ('other', 'empty')
    
    # Analyze character composition
    has_thai = False
    has_ascii_letter = False
    has_digit = False
    has_whitespace = False
    has_punctuation = False
    has_other = False
    
    for char in token:
        code = ord(char)
        
        # Thai script: U+0E00 to U+0E7F
        if 0x0E00 <= code <= 0x0E7F:
            has_thai = True
        
        # ASCII letters: a-z, A-Z
        elif (65 <= code <= 90) or (97 <= code <= 122):
            has_ascii_letter = True
        
        # Digits: 0-9
        elif 48 <= code <= 57:
            has_digit = True
        
        # Whitespace
        elif char.isspace():
            has_whitespace = True
        
        # Common ASCII punctuation
        elif code < 128 and not char.isalnum():
            has_punctuation = True
        
        # Everything else (emoji, other scripts, etc.)
        else:
            has_other = True
    
    # Classify based on composition
    # Pure categories first
    if has_thai and not (has_ascii_letter or has_digit or has_other):
        # Pure Thai (whitespace/punctuation allowed)
        return ('thai', 'pure')
    
    if has_ascii_letter and not (has_thai or has_other):
        # Pure English (whitespace/punctuation/digits allowed)
        return ('english', 'pure')
    
    if has_whitespace and not (has_ascii_letter or has_thai or has_digit or has_other):
        # Pure whitespace (with possible punctuation)
        return ('whitespace', 'pure')
    
    if has_punctuation and not (has_ascii_letter or has_thai or has_digit or has_other or has_whitespace):
        # Pure punctuation
        return ('punctuation', 'pure')
    
    if has_digit and not (has_ascii_letter or has_thai or has_other):
        # Numeric (with possible punctuation/whitespace)
        return ('numeric', 'pure')
    
    # Mixed categories
    if has_thai and has_ascii_letter:
        return ('mixed', 'thai_english')
    
    if has_thai and has_other:
        return ('mixed', 'thai_other')
    
    if has_ascii_letter and has_other:
        return ('mixed', 'english_other')
    
    # Catch-all: tokens with only non-ASCII, non-Thai characters, or Thai combined with ASCII digits
    return ('other', 'unknown')

print("✓ Classification function defined")

✓ Classification function defined
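
To see how the per-character checks interact, the sketch below reduces a token to the set of character kinds it contains, using the same ranges and the same check order as `classify_token`. `char_kind` and `token_kinds` are hypothetical helpers written for illustration only.

```python
def char_kind(char):
    """Label one character using the same ranges and check order as classify_token."""
    code = ord(char)
    if 0x0E00 <= code <= 0x0E7F:
        return "thai"
    if (65 <= code <= 90) or (97 <= code <= 122):
        return "ascii_letter"
    if 48 <= code <= 57:
        return "digit"
    if char.isspace():
        return "whitespace"
    if code < 128 and not char.isalnum():
        return "punctuation"
    return "other"


def token_kinds(token):
    """Return the set of character kinds present in a token."""
    return {char_kind(char) for char in token}


# A space-prefixed word, Thai plus an ASCII digit, an accented word, a Thai digit
for sample in [" the", "ก1", "café", "๑"]:
    print(repr(sample), sorted(token_kinds(sample)))
```

Under the rules above, a Thai character combined with an ASCII digit matches none of the pure or mixed branches, so such a token falls through to `('other', 'unknown')`. Thai digits such as `๑` sit inside the Thai block and therefore count as Thai.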


## Classify All Tokens

In [5]:
print(f"\nClassifying {vocab_size:,} tokens...\n")

# Classify every token
token_classifications = {}  # token_id -> (category, subcategory)
tokens_by_category = defaultdict(list)  # category -> [(token, id), ...]

for token_str, token_id in vocab.items():
    category, subcategory = classify_token(token_str)
    token_classifications[token_id] = (category, subcategory)
    tokens_by_category[category].append((token_str, token_id))

# Count by category
category_counts = Counter(cat for cat, _ in token_classifications.values())

print(f"✓ Classification complete")
print(f"  Total tokens classified: {len(token_classifications):,}")


Classifying 10,000 tokens...

✓ Classification complete
  Total tokens classified: 10,000


## Summary Statistics

In [6]:
print(f"\n{'='*70}")
print(f"TOKEN DEMOGRAPHICS")
print(f"{'='*70}\n")

print(f"Total vocabulary: {vocab_size:,} tokens\n")

print(f"Breakdown by category:")
for category in ['english', 'thai', 'mixed', 'numeric', 'whitespace', 'punctuation', 'special', 'other']:
    count = category_counts[category]
    pct = 100 * count / vocab_size
    print(f"  {category.capitalize():12s}: {count:5,} tokens ({pct:5.2f}%)")

print()
print(f"Key findings:")
print(f"  Pure English tokens: {category_counts['english']:,}")
print(f"  Pure Thai tokens: {category_counts['thai']:,}")
print(f"  Mixed tokens: {category_counts['mixed']:,}")
print()
print(f"Expected dead tokens (Thai + mixed with Thai):")
mixed_with_thai = sum(1 for cat, subcat in token_classifications.values() 
                      if cat == 'mixed' and 'thai' in subcat)
dead_token_estimate = category_counts['thai'] + mixed_with_thai
print(f"  Estimated: {dead_token_estimate:,} tokens")
print(f"  (These should never appear during English-only training)")


TOKEN DEMOGRAPHICS

Total vocabulary: 10,000 tokens

Breakdown by category:
  English     : 5,897 tokens (58.97%)
  Thai        : 1,272 tokens (12.72%)
  Mixed       :     0 tokens ( 0.00%)
  Numeric     :   132 tokens ( 1.32%)
  Whitespace  :     3 tokens ( 0.03%)
  Punctuation :    68 tokens ( 0.68%)
  Special     :     1 tokens ( 0.01%)
  Other       : 2,627 tokens (26.27%)

Key findings:
  Pure English tokens: 5,897
  Pure Thai tokens: 1,272
  Mixed tokens: 0

Expected dead tokens (Thai + mixed with Thai):
  Estimated: 1,272 tokens
  (These should never appear during English-only training)
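
The `other` category holds 2,627 tokens, more than twice the Thai count, so it is worth looking inside before relying on the dead-token estimate. One possible approach is to group characters by the first word of their Unicode name, which is usually the script. The helpers below are a hypothetical sketch run on made-up sample tokens, not on the real vocabulary.

```python
import unicodedata
from collections import Counter


def script_hint(char):
    """Return the first word of a character's Unicode name (usually its script)."""
    # Control characters have no name, so fall back to a placeholder
    name = unicodedata.name(char, "UNNAMED")
    return name.split()[0]


def summarize_scripts(tokens):
    """Count script hints over all characters in the given tokens."""
    counts = Counter()
    for token in tokens:
        counts.update(script_hint(char) for char in token)
    return counts


# Made-up tokens containing only non-ASCII, non-Thai characters
sample_tokens = ["é", "中文", "Ġ"]
print(summarize_scripts(sample_tokens).most_common())
```

Running something like this over `tokens_by_category['other']` would show which scripts or symbol families dominate that bucket.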


## Show Examples of Each Category

In [7]:
print(f"\n{'='*70}")
print(f"EXAMPLES BY CATEGORY")
print(f"{'='*70}\n")

def show_examples(category, n=20):
    """Show the first n examples of a category (n=None shows all of them)"""
    tokens = tokens_by_category[category]
    if not tokens:
        print(f"  (none)\n")
        return
    
    # Sort by token ID for consistency
    sorted_tokens = sorted(tokens, key=lambda x: x[1])
    shown = sorted_tokens if n is None else sorted_tokens[:n]
    
    for token_str, token_id in shown:
        # repr() makes whitespace and escape characters visible
        display = repr(token_str)
        print(f"  {token_id:5d}: {display}")
    
    if len(tokens) > len(shown):
        print(f"  ... and {len(tokens) - len(shown):,} more")
    print()

print(f"Pure English tokens (first 20):")
show_examples('english', 20)

print(f"Pure Thai tokens (first 20):")
show_examples('thai', 20)

# n=None so that the "(all)" labels below are accurate
print(f"Mixed tokens (all):")
show_examples('mixed', None)

print(f"Numeric tokens (all):")
show_examples('numeric', None)

print(f"Whitespace tokens (all):")
show_examples('whitespace', None)

print(f"Punctuation tokens (all):")
show_examples('punctuation', None)

print(f"Special tokens (all):")
show_examples('special', None)

print(f"Other tokens (first 20):")
show_examples('other', 20)


EXAMPLES BY CATEGORY

Pure English tokens (first 20):
     35: 'A'
     36: 'B'
     37: 'C'
     38: 'D'
     39: 'E'
     40: 'F'
     41: 'G'
     42: 'H'
     43: 'I'
     44: 'J'
     45: 'K'
     46: 'L'
     47: 'M'
     48: 'N'
     49: 'O'
     50: 'P'
     51: 'Q'
     52: 'R'
     53: 'S'
     54: 'T'
  ... and 5,877 more

Pure Thai tokens (first 20):
    673: 'ก'
    674: 'ข'
    675: 'ฃ'
    676: 'ค'
    677: 'ฅ'
    678: 'ฆ'
    679: 'ง'
    680: 'จ'
    681: 'ฉ'
    682: 'ช'
    683: 'ซ'
    684: 'ฌ'
    685: 'ญ'
    686: 'ฎ'
    687: 'ฏ'
    688: 'ฐ'
    689: 'ฑ'
    690: 'ฒ'
    691: 'ณ'
    692: 'ด'
  ... and 1,252 more

Mixed tokens (all):
  (none)

Numeric tokens (all):
     18: '0'
     19: '1'
     20: '2'
     21: '3'
     22: '4'
     23: '5'
     24: '6'
     25: '7'
     26: '8'
     27: '9'
   2937: '20'
   3023: '00'
   3101: '201'
   3153: '19'
   3231: '10'
   3266: '200'
   3514: '12'
   3538: '30'
   3559: '15'
   3564: '18'
   3606: '25'
   3653: '11'


## Analyze Token Length Distribution

In [8]:
print(f"\n{'='*70}")
print(f"TOKEN LENGTH ANALYSIS")
print(f"{'='*70}\n")

# Calculate length stats for each category
for category in ['english', 'thai']:
    tokens = tokens_by_category[category]
    if not tokens:
        continue
    
    # len() counts Unicode code points, so each Thai vowel sign or tone mark adds 1
    lengths = [len(token_str) for token_str, _ in tokens]
    avg_length = sum(lengths) / len(lengths)
    max_length = max(lengths)
    min_length = min(lengths)
    
    print(f"{category.capitalize()} tokens:")
    print(f"  Count: {len(tokens):,}")
    print(f"  Average length: {avg_length:.2f} characters")
    print(f"  Min length: {min_length}")
    print(f"  Max length: {max_length}")
    
    # Show longest tokens
    longest = sorted(tokens, key=lambda x: len(x[0]), reverse=True)[:5]
    print(f"  Longest tokens:")
    for token_str, token_id in longest:
        print(f"    {token_id:5d}: {repr(token_str)} ({len(token_str)} chars)")
    print()


TOKEN LENGTH ANALYSIS

English tokens:
  Count: 5,897
  Average length: 4.91 characters
  Min length: 1
  Max length: 14
  Longest tokens:
     8009: 'responsibility' (14 chars)
     9347: 'infrastructure' (14 chars)
     9102: 'communications' (14 chars)
     7854: 'administration' (14 chars)
     9781: 'Unfortunately' (13 chars)

Thai tokens:
  Count: 1,272
  Average length: 3.28 characters
  Min length: 1
  Max length: 11
  Longest tokens:
     9793: 'เปลี่ยนแปลง' (11 chars)
     9422: 'คอมพิวเตอร์' (11 chars)
     8115: 'เนื่องจาก' (9 chars)
     7118: 'เกี่ยวกับ' (9 chars)
     8725: 'ประเทศไทย' (9 chars)
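
The lengths above come from `len()`, which counts Unicode code points. In Thai, vowel signs and tone marks written above or below a consonant are separate code points, so these counts can be larger than the number of visible character cells. A small sketch of the difference, using `unicodedata.category` (the helper name `count_code_points_and_marks` is hypothetical):

```python
import unicodedata


def count_code_points_and_marks(token):
    """Return (code point count, nonspacing mark count) for a token."""
    # "Mn" is the Unicode general category for nonspacing marks
    marks = sum(1 for char in token if unicodedata.category(char) == "Mn")
    return len(token), marks


total, marks = count_code_points_and_marks("เปลี่ยนแปลง")
print(f"{total} code points, {marks} of them nonspacing marks")
```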



## Save Demographics Data

In [9]:
print(f"Saving demographics to {OUTPUT_PATH}...\n")

# Prepare data for JSON serialization
demographics_data = {
    'vocab_size': vocab_size,
    'category_counts': dict(category_counts),
    'dead_token_estimate': dead_token_estimate,
    'token_classifications': {
        str(token_id): {'category': cat, 'subcategory': subcat}
        for token_id, (cat, subcat) in token_classifications.items()
    },
    'tokens_by_category': {
        category: [(token_str, token_id) for token_str, token_id in tokens]
        for category, tokens in tokens_by_category.items()
    }
}

# Ensure directory exists
Path(OUTPUT_PATH).parent.mkdir(parents=True, exist_ok=True)

# Save to JSON
with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(demographics_data, f, ensure_ascii=False, indent=2)

# Verify file was created
output_path = Path(OUTPUT_PATH)
if output_path.exists():
    output_kb = output_path.stat().st_size / 1024
    print(f"✓ Saved demographics data")
    print(f"  Path: {OUTPUT_PATH}")
    print(f"  Size: {output_kb:.1f} KB")
else:
    raise FileNotFoundError(f"Failed to save demographics to {OUTPUT_PATH}")

Saving demographics to ../data/flannel_token_demographics.json...

✓ Saved demographics data
  Path: ../data/flannel_token_demographics.json
  Size: 1217.7 KB
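
For the later validation step, the saved file can be read back to recover the Thai token IDs. The sketch below writes a tiny file with the same layout as `demographics_data` to a temporary directory and loads it again; the sample entries are made up, and `load_thai_token_ids` is a hypothetical helper. JSON stores keys as strings and tuples as lists, so IDs are converted back to `int` on load.

```python
import json
import tempfile
from pathlib import Path


def load_thai_token_ids(path):
    """Read a demographics file and return the set of IDs classified as Thai."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {
        int(token_id)
        for token_id, info in data["token_classifications"].items()
        if info["category"] == "thai"
    }


# Tiny stand-in with the same layout as the saved demographics file
sample = {
    "vocab_size": 3,
    "category_counts": {"english": 1, "thai": 2},
    "dead_token_estimate": 2,
    "token_classifications": {
        "35": {"category": "english", "subcategory": "pure"},
        "673": {"category": "thai", "subcategory": "pure"},
        "674": {"category": "thai", "subcategory": "pure"},
    },
    "tokens_by_category": {
        "english": [["A", 35]],
        "thai": [["ก", 673], ["ข", 674]],
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_demographics.json"
    sample_path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    thai_ids = load_thai_token_ids(sample_path)

print(sorted(thai_ids))
```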


## Summary

In [10]:
print(f"\n{'='*70}")
print(f"DEMOGRAPHICS ANALYSIS COMPLETE")
print(f"{'='*70}\n")

print(f"Tokenizer: {TOKENIZER_PATH}")
print(f"Vocabulary: {vocab_size:,} tokens\n")

print(f"Key Statistics:")
print(f"  Pure English: {category_counts['english']:,} tokens ({100*category_counts['english']/vocab_size:.1f}%)")
print(f"  Pure Thai: {category_counts['thai']:,} tokens ({100*category_counts['thai']/vocab_size:.1f}%)")
print(f"  Mixed: {category_counts['mixed']:,} tokens")
print(f"  Other: {sum(category_counts[c] for c in ['numeric', 'whitespace', 'punctuation', 'special', 'other']):,} tokens\n")

print(f"Experimental Validation:")
print(f"  Expected dead tokens: {dead_token_estimate:,}")
print(f"  (Thai + mixed Thai/English tokens)\n")

print(f"When we train on English-only corpus:")
print(f"  ✓ {category_counts['english']:,} English tokens should be trained")
print(f"  ✓ {dead_token_estimate:,} Thai tokens should NEVER appear")
print(f"  ✓ If untrained token count = {dead_token_estimate:,}, experiment validated!\n")

print(f"Output:")
print(f"  Demographics data: {OUTPUT_PATH}\n")

print(f"Next steps:")
print(f"  → Use this tokenizer to train Flannel 1 (notebook 1.20a)")
print(f"  → After training, compare untrained tokens to this demographics")
print(f"  → Perfect match = perfect experimental design!")
print()
print(f"{'='*70}")


DEMOGRAPHICS ANALYSIS COMPLETE

Tokenizer: ../data/flannel_tokenizer_chars.json
Vocabulary: 10,000 tokens

Key Statistics:
  Pure English: 5,897 tokens (59.0%)
  Pure Thai: 1,272 tokens (12.7%)
  Mixed: 0 tokens
  Other: 2,831 tokens

Experimental Validation:
  Expected dead tokens: 1,272
  (Thai + mixed Thai/English tokens)

When we train on English-only corpus:
  ✓ 5,897 English tokens should be trained
  ✓ 1,272 Thai tokens should NEVER appear
  ✓ If untrained token count = 1,272, experiment validated!

Output:
  Demographics data: ../data/flannel_token_demographics.json

Next steps:
  → Use this tokenizer to train Flannel 1 (notebook 1.20a)
  → After training, compare untrained tokens to this demographics
  → Perfect match = perfect experimental design!

