# 1.19b3: Token Demographics with Unicode Block Classification

**Goal:** Subdivide the 2,627 "other" tokens by Unicode block to see which scripts and symbol sets they contain.

## The Problem

In **1.19b2**, we classified tokens as English, Thai, numeric, etc., but 26% (2,627 of 10,000) fell into "other", which is too large a share to leave unexplained.

A look at those "other" tokens shows that they include:
- Latin-1 Supplement (¡¢£¤¥§©ª...)
- Latin Extended (Ā ā Ă ă Ą ą...)
- Emoji and symbols
- Other scripts (Cyrillic, Arabic, Chinese, etc.)

## Unicode Blocks

Unicode organizes code points into **blocks**, which are named contiguous ranges that usually group characters by script or purpose:
- U+0000–U+007F: Basic Latin (ASCII)
- U+0080–U+00FF: Latin-1 Supplement
- U+0100–U+017F: Latin Extended-A
- U+0E00–U+0E7F: Thai
- U+1F600–U+1F64F: Emoticons
- ...and hundreds more

By classifying tokens by their Unicode blocks, we can see exactly what's in that "other" category.
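
To make the ranges above concrete, here is a minimal sketch that prints the code point and Unicode name of a few characters and checks them against three of the listed ranges. The `in_range` helper and the sample characters are illustrative assumptions, not part of the notebook.

```python
import unicodedata

def in_range(char, start, end):
    """Return True if char's code point lies in the inclusive range [start, end]."""
    return start <= ord(char) <= end

# Print each sample character's code point and official Unicode name
for ch in ["A", "£", "Ā", "ก", "😀"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

is_thai = in_range("ก", 0x0E00, 0x0E7F)         # Thai block
is_latin1 = in_range("£", 0x0080, 0x00FF)       # Latin-1 Supplement block
is_emoticon = in_range("😀", 0x1F600, 0x1F64F)  # Emoticons block
print(is_thai, is_latin1, is_emoticon)
```

Block membership is purely a question of where the code point falls, which is why a range table is enough to classify every character.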

## This Notebook

1. Load tokenizer and existing demographics
2. For each token, identify which Unicode blocks its characters belong to
3. Show breakdown by block
4. Identify tokens that span multiple blocks

## Parameters

In [1]:
# Input files
TOKENIZER_PATH = "../data/flannel_tokenizer_chars.json"
DEMOGRAPHICS_PATH = "../data/flannel_token_demographics.json"

# Output
OUTPUT_PATH = "../data/flannel_token_unicode_blocks.json"

print("✓ Parameters set")

✓ Parameters set


## Imports

In [2]:
from tokenizers import Tokenizer
from pathlib import Path
from collections import Counter, defaultdict
import json
import unicodedata

print("✓ Imports complete")

✓ Imports complete


## Unicode Block Detection

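Python's `unicodedata` module has no function that returns a character's block name, so we look it up from a table of code point ranges. The table below covers only the major blocks; characters outside it are reported as `Unknown`.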

In [3]:
# Define major Unicode blocks
# Format: (start, end, name)
UNICODE_BLOCKS = [
    (0x0000, 0x007F, 'Basic Latin (ASCII)'),
    (0x0080, 0x00FF, 'Latin-1 Supplement'),
    (0x0100, 0x017F, 'Latin Extended-A'),
    (0x0180, 0x024F, 'Latin Extended-B'),
    (0x0250, 0x02AF, 'IPA Extensions'),
    (0x02B0, 0x02FF, 'Spacing Modifier Letters'),
    (0x0300, 0x036F, 'Combining Diacritical Marks'),
    (0x0370, 0x03FF, 'Greek and Coptic'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x0500, 0x052F, 'Cyrillic Supplement'),
    (0x0530, 0x058F, 'Armenian'),
    (0x0590, 0x05FF, 'Hebrew'),
    (0x0600, 0x06FF, 'Arabic'),
    (0x0700, 0x074F, 'Syriac'),
    (0x0780, 0x07BF, 'Thaana'),
    (0x0900, 0x097F, 'Devanagari'),
    (0x0980, 0x09FF, 'Bengali'),
    (0x0E00, 0x0E7F, 'Thai'),
    (0x0E80, 0x0EFF, 'Lao'),
    (0x1000, 0x109F, 'Myanmar'),
    (0x10A0, 0x10FF, 'Georgian'),
    (0x1100, 0x11FF, 'Hangul Jamo'),
    (0x1200, 0x137F, 'Ethiopic'),
    (0x13A0, 0x13FF, 'Cherokee'),
    (0x1400, 0x167F, 'Unified Canadian Aboriginal Syllabics'),
    (0x1680, 0x169F, 'Ogham'),
    (0x16A0, 0x16FF, 'Runic'),
    (0x1700, 0x171F, 'Tagalog'),
    (0x1780, 0x17FF, 'Khmer'),
    (0x1800, 0x18AF, 'Mongolian'),
    (0x1E00, 0x1EFF, 'Latin Extended Additional'),
    (0x1F00, 0x1FFF, 'Greek Extended'),
    (0x2000, 0x206F, 'General Punctuation'),
    (0x2070, 0x209F, 'Superscripts and Subscripts'),
    (0x20A0, 0x20CF, 'Currency Symbols'),
    (0x20D0, 0x20FF, 'Combining Diacritical Marks for Symbols'),
    (0x2100, 0x214F, 'Letterlike Symbols'),
    (0x2150, 0x218F, 'Number Forms'),
    (0x2190, 0x21FF, 'Arrows'),
    (0x2200, 0x22FF, 'Mathematical Operators'),
    (0x2300, 0x23FF, 'Miscellaneous Technical'),
    (0x2400, 0x243F, 'Control Pictures'),
    (0x2460, 0x24FF, 'Enclosed Alphanumerics'),
    (0x2500, 0x257F, 'Box Drawing'),
    (0x2580, 0x259F, 'Block Elements'),
    (0x25A0, 0x25FF, 'Geometric Shapes'),
    (0x2600, 0x26FF, 'Miscellaneous Symbols'),
    (0x2700, 0x27BF, 'Dingbats'),
    (0x27C0, 0x27EF, 'Miscellaneous Mathematical Symbols-A'),
    (0x2800, 0x28FF, 'Braille Patterns'),
    (0x2E80, 0x2EFF, 'CJK Radicals Supplement'),
    (0x2F00, 0x2FDF, 'Kangxi Radicals'),
    (0x3000, 0x303F, 'CJK Symbols and Punctuation'),
    (0x3040, 0x309F, 'Hiragana'),
    (0x30A0, 0x30FF, 'Katakana'),
    (0x3100, 0x312F, 'Bopomofo'),
    (0x3130, 0x318F, 'Hangul Compatibility Jamo'),
    (0x3200, 0x32FF, 'Enclosed CJK Letters and Months'),
    (0x3300, 0x33FF, 'CJK Compatibility'),
    (0x4E00, 0x9FFF, 'CJK Unified Ideographs'),
    (0xA000, 0xA48F, 'Yi Syllables'),
    (0xA490, 0xA4CF, 'Yi Radicals'),
    (0xAC00, 0xD7AF, 'Hangul Syllables'),
    (0xE000, 0xF8FF, 'Private Use Area'),
    (0xF900, 0xFAFF, 'CJK Compatibility Ideographs'),
    (0xFB00, 0xFB4F, 'Alphabetic Presentation Forms'),
    (0xFB50, 0xFDFF, 'Arabic Presentation Forms-A'),
    (0xFE00, 0xFE0F, 'Variation Selectors'),
    (0xFE10, 0xFE1F, 'Vertical Forms'),
    (0xFE20, 0xFE2F, 'Combining Half Marks'),
    (0xFE30, 0xFE4F, 'CJK Compatibility Forms'),
    (0xFE50, 0xFE6F, 'Small Form Variants'),
    (0xFE70, 0xFEFF, 'Arabic Presentation Forms-B'),
    (0xFF00, 0xFFEF, 'Halfwidth and Fullwidth Forms'),
    (0xFFF0, 0xFFFF, 'Specials'),
    (0x10000, 0x1007F, 'Linear B Syllabary'),
    (0x10080, 0x100FF, 'Linear B Ideograms'),
    (0x10100, 0x1013F, 'Aegean Numbers'),
    (0x10300, 0x1032F, 'Old Italic'),
    (0x10330, 0x1034F, 'Gothic'),
    (0x10380, 0x1039F, 'Ugaritic'),
    (0x103A0, 0x103DF, 'Old Persian'),
    (0x10400, 0x1044F, 'Deseret'),
    (0x10450, 0x1047F, 'Shavian'),
    (0x10480, 0x104AF, 'Osmanya'),
    (0x10800, 0x1083F, 'Cypriot Syllabary'),
    (0x10900, 0x1091F, 'Phoenician'),
    (0x10A00, 0x10A5F, 'Kharoshthi'),
    (0x12000, 0x123FF, 'Cuneiform'),
    (0x12400, 0x1247F, 'Cuneiform Numbers and Punctuation'),
    (0x1D000, 0x1D0FF, 'Byzantine Musical Symbols'),
    (0x1D100, 0x1D1FF, 'Musical Symbols'),
    (0x1D200, 0x1D24F, 'Ancient Greek Musical Notation'),
    (0x1D300, 0x1D35F, 'Tai Xuan Jing Symbols'),
    (0x1D360, 0x1D37F, 'Counting Rod Numerals'),
    (0x1D400, 0x1D7FF, 'Mathematical Alphanumeric Symbols'),
    (0x1F000, 0x1F02F, 'Mahjong Tiles'),
    (0x1F030, 0x1F09F, 'Domino Tiles'),
    (0x1F0A0, 0x1F0FF, 'Playing Cards'),
    (0x1F100, 0x1F1FF, 'Enclosed Alphanumeric Supplement'),
    (0x1F200, 0x1F2FF, 'Enclosed Ideographic Supplement'),
    (0x1F300, 0x1F5FF, 'Miscellaneous Symbols and Pictographs'),
    (0x1F600, 0x1F64F, 'Emoticons'),
    (0x1F650, 0x1F67F, 'Ornamental Dingbats'),
    (0x1F680, 0x1F6FF, 'Transport and Map Symbols'),
    (0x1F700, 0x1F77F, 'Alchemical Symbols'),
    (0x1F780, 0x1F7FF, 'Geometric Shapes Extended'),
    (0x1F800, 0x1F8FF, 'Supplemental Arrows-C'),
    (0x1F900, 0x1F9FF, 'Supplemental Symbols and Pictographs'),
    (0x20000, 0x2A6DF, 'CJK Unified Ideographs Extension B'),
    (0x2A700, 0x2B73F, 'CJK Unified Ideographs Extension C'),
    (0x2B740, 0x2B81F, 'CJK Unified Ideographs Extension D'),
    (0x2B820, 0x2CEAF, 'CJK Unified Ideographs Extension E'),
]

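def get_unicode_block(char):
    """Return the Unicode block name for a single character.

    Falls back to 'Unknown' when the code point is outside every range in UNICODE_BLOCKS.
    """
    code = ord(char)
    # Linear scan; ranges are inclusive on both ends
    for start, end, name in UNICODE_BLOCKS:
        if start <= code <= end:
            return name
    return 'Unknown'

print("✓ Unicode block detection ready")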
print(f"  Defined {len(UNICODE_BLOCKS)} Unicode blocks")

✓ Unicode block detection ready
  Defined 113 Unicode blocks
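
The linear scan in `get_unicode_block` is fast enough for a 10,000-token vocabulary. If the lookup ever became a bottleneck, one alternative is a binary search over the block start points. The sketch below assumes the table is sorted by start and has no overlapping ranges, as the list above appears to be. `SMALL_BLOCKS`, `STARTS`, and `lookup_block` are hypothetical names used only for illustration.

```python
from bisect import bisect_right

# A small subset of the table, sorted by start (assumption: no overlaps)
SMALL_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin (ASCII)"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]
STARTS = [start for start, _, _ in SMALL_BLOCKS]

def lookup_block(char):
    """Binary-search the block whose start is the largest one <= the code point."""
    code = ord(char)
    idx = bisect_right(STARTS, code) - 1
    if idx >= 0:
        start, end, name = SMALL_BLOCKS[idx]
        if code <= end:  # the code point may fall in a gap between blocks
            return name
    return "Unknown"

for ch in ["a", "ก", "中", "я"]:
    print(f"{ch!r}: {lookup_block(ch)}")
```

The `code <= end` check matters because the table has gaps: a code point can sit after one block's end and before the next block's start, in which case it should still come back as `Unknown`.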


## Load Data

In [4]:
print("Loading tokenizer and demographics...\n")

# Load tokenizer
tokenizer = Tokenizer.from_file(str(TOKENIZER_PATH))
vocab = tokenizer.get_vocab()

# Load existing demographics
with open(DEMOGRAPHICS_PATH, 'r', encoding='utf-8') as f:
    demographics = json.load(f)

print(f"✓ Loaded tokenizer ({len(vocab):,} tokens)")
print(f"✓ Loaded demographics data")

Loading tokenizer and demographics...

✓ Loaded tokenizer (10,000 tokens)
✓ Loaded demographics data


## Analyze Unicode Blocks

In [5]:
print(f"\nAnalyzing Unicode blocks for all tokens...\n")

# For each token, find which blocks it contains
token_blocks = {}  # token_id -> set of block names
block_token_counts = Counter()  # block_name -> count of tokens containing it
pure_block_tokens = defaultdict(list)  # block_name -> [(token, id), ...] for pure tokens

for token_str, token_id in vocab.items():
    # Get blocks for all characters in this token
    blocks = set(get_unicode_block(char) for char in token_str)
    token_blocks[token_id] = blocks
    
    # Count occurrences
    for block in blocks:
        block_token_counts[block] += 1
    
    # If the token is pure (all characters in a single block), add it to that block's list
    if len(blocks) == 1:
        block_name = next(iter(blocks))
        pure_block_tokens[block_name].append((token_str, token_id))

print(f"✓ Analysis complete")
print(f"  Found {len(block_token_counts)} different Unicode blocks represented")


Analyzing Unicode blocks for all tokens...

✓ Analysis complete
  Found 55 different Unicode blocks represented
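
As a small worked example of the counting logic, the sketch below runs the same per-token classification on a three-token mini-vocabulary. The token list, the `RANGES` subset, and the `block_name` helper are assumptions made so the example runs on its own.

```python
from collections import Counter

# Three ranges copied from UNICODE_BLOCKS so the sketch is self-contained
RANGES = [
    (0x0000, 0x007F, "Basic Latin (ASCII)"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x2000, 0x206F, "General Punctuation"),
]

def block_name(char):
    code = ord(char)
    for start, end, name in RANGES:
        if start <= code <= end:
            return name
    return "Unknown"

sample_tokens = ["the", "ก", ".”"]  # hypothetical mini-vocabulary
containing = Counter()  # block -> tokens containing at least one character from it
pure = Counter()        # block -> tokens made up entirely of that block

for token in sample_tokens:
    blocks = {block_name(ch) for ch in token}
    containing.update(blocks)
    if len(blocks) == 1:
        pure.update(blocks)

print(dict(containing))
print(dict(pure))
```

The mixed token `'.”'` is counted once under each of its two blocks, so the per-block counts can sum to more than the number of tokens, while the pure counts cannot.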


## Show Block Statistics

In [6]:
print(f"\n{'='*70}")
print(f"UNICODE BLOCK DISTRIBUTION")
print(f"{'='*70}\n")

print(f"Tokens containing characters from each block:")
print(f"(A token may appear in multiple blocks if it contains mixed characters)\n")

# Sort by count, descending
sorted_blocks = sorted(block_token_counts.items(), key=lambda x: x[1], reverse=True)

for block_name, count in sorted_blocks:
    pct = 100 * count / len(vocab)
    pure_count = len(pure_block_tokens[block_name])
    print(f"  {block_name:45s}: {count:5,} tokens ({pct:5.2f}%) - {pure_count:5,} pure")


UNICODE BLOCK DISTRIBUTION

Tokens containing characters from each block:
(A token may appear in multiple blocks if it contains mixed characters)

  Basic Latin (ASCII)                          : 6,109 tokens (61.09%) - 6,100 pure
  Thai                                         : 1,272 tokens (12.72%) - 1,272 pure
  CJK Unified Ideographs                       : 1,035 tokens (10.35%) - 1,035 pure
  Hangul Syllables                             :   385 tokens ( 3.85%) -   385 pure
  Unknown                                      :   107 tokens ( 1.07%) -   107 pure
  Latin-1 Supplement                           :    89 tokens ( 0.89%) -    89 pure
  Cyrillic                                     :    83 tokens ( 0.83%) -    83 pure
  Miscellaneous Symbols and Pictographs        :    70 tokens ( 0.70%) -    70 pure
  Katakana                                     :    65 tokens ( 0.65%) -    65 pure
  Latin Extended-A                             :    63 tokens ( 0.63%) -    63 pure
  Arabic    

## Focus on "Other" Category

Let's see which Unicode blocks appear in the 2,627 "other" tokens from 1.19b2.

In [7]:
print(f"\n{'='*70}")
print(f"BREAKDOWN OF 'OTHER' CATEGORY")
print(f"{'='*70}\n")

# Get list of "other" tokens from demographics
other_tokens = [(t, i) for t, i in demographics['tokens_by_category']['other']]
other_token_ids = set(i for _, i in other_tokens)

print(f"Total 'other' tokens: {len(other_tokens):,}\n")

# Count blocks within "other" category
other_block_counts = Counter()
for token_str, token_id in other_tokens:
    blocks = token_blocks[token_id]
    for block in blocks:
        other_block_counts[block] += 1

print(f"Unicode blocks in 'other' category:")
sorted_other_blocks = sorted(other_block_counts.items(), key=lambda x: x[1], reverse=True)

for block_name, count in sorted_other_blocks:
    pct = 100 * count / len(other_tokens)
    print(f"  {block_name:45s}: {count:5,} tokens ({pct:5.1f}%)")


BREAKDOWN OF 'OTHER' CATEGORY

Total 'other' tokens: 2,627

Unicode blocks in 'other' category:
  CJK Unified Ideographs                       : 1,035 tokens ( 39.4%)
  Hangul Syllables                             :   385 tokens ( 14.7%)
  Unknown                                      :   107 tokens (  4.1%)
  Latin-1 Supplement                           :    88 tokens (  3.3%)
  Cyrillic                                     :    83 tokens (  3.2%)
  Miscellaneous Symbols and Pictographs        :    70 tokens (  2.7%)
  Katakana                                     :    65 tokens (  2.5%)
  Latin Extended-A                             :    63 tokens (  2.4%)
  Arabic                                       :    61 tokens (  2.3%)
  Hiragana                                     :    59 tokens (  2.2%)
  Devanagari                                   :    55 tokens (  2.1%)
  Greek and Coptic                             :    47 tokens (  1.8%)
  Myanmar                                      :   

## Show Examples from Top Blocks

In [8]:
print(f"\n{'='*70}")
print(f"EXAMPLES FROM TOP UNICODE BLOCKS")
print(f"{'='*70}\n")

# Show top 5 blocks
for block_name, _ in sorted_blocks[:5]:
    pure_tokens = pure_block_tokens[block_name]
    print(f"{block_name} (pure tokens: {len(pure_tokens):,}):")
    
    # Show first 20 examples
    sorted_tokens = sorted(pure_tokens, key=lambda x: x[1])[:20]
    for token_str, token_id in sorted_tokens:
        print(f"  {token_id:5d}: {repr(token_str)}")
    
    if len(pure_tokens) > 20:
        print(f"  ... and {len(pure_tokens) - 20:,} more")
    print()


EXAMPLES FROM TOP UNICODE BLOCKS

Basic Latin (ASCII) (pure tokens: 6,100):
      0: '<|endoftext|>'
      1: '\n'
      2: ' '
      3: '!'
      4: '"'
      5: '#'
      6: '$'
      7: '%'
      8: '&'
      9: "'"
     10: '('
     11: ')'
     12: '*'
     13: '+'
     14: ','
     15: '-'
     16: '.'
     17: '/'
     18: '0'
     19: '1'
  ... and 6,080 more

Thai (pure tokens: 1,272):
    673: 'ก'
    674: 'ข'
    675: 'ฃ'
    676: 'ค'
    677: 'ฅ'
    678: 'ฆ'
    679: 'ง'
    680: 'จ'
    681: 'ฉ'
    682: 'ช'
    683: 'ซ'
    684: 'ฌ'
    685: 'ญ'
    686: 'ฎ'
    687: 'ฏ'
    688: 'ฐ'
    689: 'ฑ'
    690: 'ฒ'
    691: 'ณ'
    692: 'ด'
  ... and 1,252 more

CJK Unified Ideographs (pure tokens: 1,035):
   1202: '一'
   1203: '七'
   1204: '万'
   1205: '丈'
   1206: '三'
   1207: '上'
   1208: '下'
   1209: '不'
   1210: '与'
   1211: '世'
   1212: '両'
   1213: '並'
   1214: '个'
   1215: '中'
   1216: '为'
   1217: '主'
   1218: '乃'
   1219: '久'
   1220: '义'
   1221: '之'
  ... and 1,01

## Identify Mixed-Block Tokens

In [9]:
print(f"\n{'='*70}")
print(f"MIXED-BLOCK TOKENS")
print(f"{'='*70}\n")

# Find tokens that span multiple blocks
mixed_block_tokens = [(token_str, token_id, token_blocks[token_id]) 
                      for token_str, token_id in vocab.items() 
                      if len(token_blocks[token_id]) > 1]

print(f"Tokens spanning multiple Unicode blocks: {len(mixed_block_tokens):,}\n")

if mixed_block_tokens:
    print(f"Examples (first 20):")
    sorted_mixed = sorted(mixed_block_tokens, key=lambda x: x[1])[:20]
    for token_str, token_id, blocks in sorted_mixed:
        block_list = ', '.join(sorted(blocks))
        print(f"  {token_id:5d}: {repr(token_str):20s} - {block_list}")
    
    if len(mixed_block_tokens) > 20:
        print(f"  ... and {len(mixed_block_tokens) - 20:,} more")


MIXED-BLOCK TOKENS

Tokens spanning multiple Unicode blocks: 9

Examples (first 20):
   3378: '.”'                 - Basic Latin (ASCII), General Punctuation
   3579: ',”'                 - Basic Latin (ASCII), General Punctuation
   6005: '?”'                 - Basic Latin (ASCII), General Punctuation
   6326: '”.'                 - Basic Latin (ASCII), General Punctuation
   7299: '”,'                 - Basic Latin (ASCII), General Punctuation
   8035: '!”'                 - Basic Latin (ASCII), General Punctuation
   8761: '.’'                 - Basic Latin (ASCII), General Punctuation
   9463: '’.'                 - Basic Latin (ASCII), General Punctuation
   9712: '’,'                 - Basic Latin (ASCII), General Punctuation


## Save Unicode Block Data

In [10]:
print(f"\nSaving Unicode block data to {OUTPUT_PATH}...\n")

# Prepare data
unicode_block_data = {
    'block_token_counts': dict(block_token_counts),
    # JSON has no set type, so store each token's blocks as a sorted list (stable across runs)
    'token_blocks': {str(tid): sorted(blocks) for tid, blocks in token_blocks.items()},
    'pure_block_token_counts': {block: len(tokens) for block, tokens in pure_block_tokens.items()},
    'mixed_block_token_count': len(mixed_block_tokens),
    'other_category_blocks': dict(other_block_counts)
}

# Save
Path(OUTPUT_PATH).parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(unicode_block_data, f, ensure_ascii=False, indent=2)

output_path = Path(OUTPUT_PATH)
if output_path.exists():
    output_kb = output_path.stat().st_size / 1024
    print(f"✓ Saved Unicode block data")
    print(f"  Path: {OUTPUT_PATH}")
    print(f"  Size: {output_kb:.1f} KB")


Saving Unicode block data to ../data/flannel_token_unicode_blocks.json...

✓ Saved Unicode block data
  Path: ../data/flannel_token_unicode_blocks.json
  Size: 461.3 KB
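
A later notebook that reads this file back has to undo two conversions made when saving: token IDs were turned into strings (JSON object keys are always strings) and block sets were turned into lists. The sketch below shows one way to do that. It writes a tiny stand-in file to a temporary directory rather than reading the real output, and the file name and sample entries are assumptions.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the saved file, using the same 'token_blocks' layout
sample = {
    "token_blocks": {
        "3378": ["Basic Latin (ASCII)", "General Punctuation"],
        "673": ["Thai"],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_unicode_blocks.json"  # hypothetical file name
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# Restore integer token IDs and set-valued block collections
restored = {int(tid): set(blocks) for tid, blocks in loaded["token_blocks"].items()}
print(restored[673])
```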


## Summary

In [11]:
print(f"\n{'='*70}")
print(f"UNICODE BLOCK ANALYSIS COMPLETE")
print(f"{'='*70}\n")

print(f"Key Findings:")
print(f"  Total Unicode blocks represented: {len(block_token_counts)}")
print(f"  Tokens spanning multiple blocks: {len(mixed_block_tokens):,}")
print(f"  Tokens in single block: {len(vocab) - len(mixed_block_tokens):,}\n")

print(f"Top 5 blocks by token count:")
for i, (block_name, count) in enumerate(sorted_blocks[:5], 1):
    pct = 100 * count / len(vocab)
    pure = len(pure_block_tokens[block_name])
    print(f"  {i}. {block_name}: {count:,} tokens ({pct:.1f}%), {pure:,} pure")

print(f"\nMystery of 'other' category solved:")
if sorted_other_blocks:
    top_other = sorted_other_blocks[0]
    print(f"  Largest component: {top_other[0]} ({top_other[1]:,} tokens)")
    if len(sorted_other_blocks) > 1:
        print(f"  Plus {len(sorted_other_blocks) - 1} other Unicode blocks")

print(f"\nOutput:")
print(f"  Unicode block data: {OUTPUT_PATH}")
print()
print(f"{'='*70}")


UNICODE BLOCK ANALYSIS COMPLETE

Key Findings:
  Total Unicode blocks represented: 55
  Tokens spanning multiple blocks: 9
  Tokens in single block: 9,991

Top 5 blocks by token count:
  1. Basic Latin (ASCII): 6,109 tokens (61.1%), 6,100 pure
  2. Thai: 1,272 tokens (12.7%), 1,272 pure
  3. CJK Unified Ideographs: 1,035 tokens (10.3%), 1,035 pure
  4. Hangul Syllables: 385 tokens (3.9%), 385 pure
  5. Unknown: 107 tokens (1.1%), 107 pure

Mystery of 'other' category solved:
  Largest component: CJK Unified Ideographs (1,035 tokens)
  Plus 53 other Unicode blocks

Output:
  Unicode block data: ../data/flannel_token_unicode_blocks.json

