# 1.10d: Classify All Tokens by Script

**Complete taxonomy of the 151,669 tokens in Qwen's tokenizer vocabulary**

## The Question

We've been exploring:
- Cluster tokens (geometric collapse)
- Halo tokens (unreachable via Unicode, but reachable via bytes)
- Bulk tokens (normal vocabulary)

But we still don't know: **What ARE these tokens?**

## The Goal

Classify every token in Qwen's vocabulary by:
1. **Script/Alphabet** (Han/CJK, Latin, Thai, Cyrillic, Arabic, etc.)
2. **Character type** (printable, replacement character �, control characters, mixed)
3. **Geometric category** (cluster, halo, bulk)

Then save this classification as a dataset for future analysis.

## Method

Use the **`alphabet-detector`** library for automatic script detection:
```python
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.detect_alphabet("你好")    # {'CJK'}
ad.detect_alphabet("Hello")  # {'LATIN'}
ad.detect_alphabet("สวัสดี")  # {'THAI'}
```
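
The labels that appear later in this notebook (`CJK`, `KELVIN`, `ANGSTROM`, and so on) look like the first word of each character's Unicode name. The sketch below reproduces that behaviour with the standard library only. `detect_scripts` is a hypothetical stand-in written for illustration, not the library's own code, and it assumes that non-alphabetic characters are skipped.

```python
import unicodedata


def detect_scripts(text):
    """Return the first word of the Unicode name of each alphabetic character.

    Hypothetical helper: an approximation of how the script labels seem to be
    derived, not the alphabet-detector implementation itself.
    """
    labels = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and symbols contribute no label
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if name:
            labels.add(name.split(" ")[0])
    return labels


print(detect_scripts("Hello"))   # {'LATIN'}
print(detect_scripts("你好"))     # {'CJK'}
print(detect_scripts("42!"))     # set()
print(detect_scripts("\u212a"))  # {'KELVIN'}  (U+212A KELVIN SIGN)
```

Under this reading, a token made only of digits or punctuation yields an empty set, which would explain why such tokens end up in the `UNKNOWN` category, and single-character names such as `KELVIN` show up as their own "script".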

For each token:
1. Decode it to text (tokens whose bytes are not valid UTF-8 on their own decode to the replacement character �, U+FFFD)
2. Detect script(s) using `alphabet-detector`
3. Check for special cases (replacement chars, control chars, empty)
4. Record geometric category (cluster/halo/bulk)
5. Save the results as a CSV file and a safetensors file for later use
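
Step 1 is where the replacement character comes from: a token that holds only part of a multi-byte UTF-8 sequence cannot be decoded on its own. A minimal standard-library illustration, assuming the tokenizer decodes invalid bytes with replacement (the bytes here are an example, not taken from Qwen's vocabulary):

```python
# "你" occupies three bytes in UTF-8; a token holding only the first two
# bytes is not valid text by itself.
full = "你".encode("utf-8")
partial = full[:2]

decoded_full = full.decode("utf-8", errors="replace")
decoded_partial = partial.decode("utf-8", errors="replace")

print(len(full))                    # 3
print(repr(decoded_full))           # '你'
print("\ufffd" in decoded_partial)  # True
```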

## Parameters

In [1]:
# Model
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Input data
CLUSTER_TOKENS_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors"
REACHABILITY_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.8d_full_vocab_reachability.safetensors"

# Output
OUTPUT_CSV = "../tensors/Qwen3-4B-Instruct-2507/1.10d_token_classification.csv"
OUTPUT_SAFETENSORS = "../tensors/Qwen3-4B-Instruct-2507/1.10d_token_classification.safetensors"

## Imports

In [2]:
import torch
import pandas as pd
import unicodedata
from transformers import AutoTokenizer
from safetensors.torch import load_file, save_file
from tqdm import tqdm
from collections import Counter

# Install alphabet-detector if needed: uv add alphabet-detector
try:
    from alphabet_detector import AlphabetDetector
    ad = AlphabetDetector()
    print("✓ alphabet-detector imported successfully")
except ImportError:
    print("ERROR: alphabet-detector not installed")
    print("Please run: uv add alphabet-detector")
    raise

✓ alphabet-detector imported successfully


## Load Token Classifications

In [3]:
print("Loading geometric classifications...\n")

# Load cluster tokens
cluster_data = load_file(CLUSTER_TOKENS_PATH)
cluster_token_ids = set(cluster_data['cluster_token_ids'].tolist())

# Load halo tokens (unreachable outside cluster)
reachability_data = load_file(REACHABILITY_PATH)
halo_token_ids = set(reachability_data['unreachable_outside_cluster'].tolist())

print(f"✓ Loaded geometric classifications")
print(f"  Cluster tokens: {len(cluster_token_ids):,}")
print(f"  Halo tokens: {len(halo_token_ids):,}")
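TOKENIZER_VOCAB_SIZE = 151_669  # len(tokenizer); the tokenizer is loaded in the next cell
print(f"  Bulk tokens: {TOKENIZER_VOCAB_SIZE - len(cluster_token_ids) - len(halo_token_ids):,}")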

Loading geometric classifications...

✓ Loaded geometric classifications
  Cluster tokens: 2,212
  Halo tokens: 1,423
  Bulk tokens: 148,034
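
The bulk count above is obtained by subtraction, which is only correct if the cluster and halo sets are disjoint and every id lies inside the vocabulary. A small sanity check along those lines might look like the sketch below; `partition_counts` is a hypothetical helper and the ids are toy values, not the real data.

```python
def partition_counts(vocab_size, cluster_ids, halo_ids):
    """Count cluster, halo, and bulk tokens, rejecting inconsistent inputs."""
    overlap = cluster_ids & halo_ids
    if overlap:
        raise ValueError(f"{len(overlap)} token ids are in both cluster and halo")

    out_of_range = [t for t in cluster_ids | halo_ids if not 0 <= t < vocab_size]
    if out_of_range:
        raise ValueError(f"{len(out_of_range)} token ids fall outside the vocabulary")

    bulk = vocab_size - len(cluster_ids) - len(halo_ids)
    return {"cluster": len(cluster_ids), "halo": len(halo_ids), "bulk": bulk}


# Toy example: 10 tokens, two in the cluster, one in the halo.
counts = partition_counts(10, {1, 2}, {7})
print(counts)  # {'cluster': 2, 'halo': 1, 'bulk': 7}
```

The check is worth having because `classify_token` tests cluster membership first, so an id present in both sets would silently be counted as cluster.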


## Load Tokenizer

In [4]:
print(f"\nLoading tokenizer: {HF_MODEL_NAME}\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
vocab_size = len(tokenizer)

print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {vocab_size:,} tokens")


Loading tokenizer: Qwen/Qwen3-4B-Instruct-2507

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Helper Functions

In [5]:
def classify_token(token_id, tokenizer, ad, cluster_ids, halo_ids):
    """
    Classify a single token by script, character type, and geometric category.
    
    Returns dict with:
    - token_id
    - decoded_text
    - scripts (comma-separated, sorted list of detected scripts)
    - is_replacement (contains U+FFFD)
    - is_printable (no non-whitespace character in a Unicode "C" category)
    - is_empty (empty or whitespace only)
    - char_count (number of characters in the decoded text)
    - geometric_category (cluster, halo, bulk)
    - primary_script (the single detected script, 'MIXED' if several, or a
      special label: REPLACEMENT, EMPTY, CONTROL, UNKNOWN)
    """
    # Decode token
    decoded = tokenizer.decode([token_id])
    
    # Geometric category
    if token_id in cluster_ids:
        geometric_category = 'cluster'
    elif token_id in halo_ids:
        geometric_category = 'halo'
    else:
        geometric_category = 'bulk'
    
    # Character type checks
    is_replacement = '\ufffd' in decoded
    is_empty = len(decoded.strip()) == 0
    char_count = len(decoded)
    
    # Printability check
    is_printable = True
    if decoded:
        for char in decoded:
            cat = unicodedata.category(char)
            if cat.startswith('C') and not char.isspace():
                is_printable = False
                break
    
    # Script detection
    scripts = set()
    if decoded and not is_replacement:
        try:
            detected = ad.detect_alphabet(decoded)
            if detected:
                scripts = detected
        except Exception:  # treat any detection failure as "no script detected"
            pass
    
    # Primary script classification
    if is_replacement:
        primary_script = 'REPLACEMENT'
    elif is_empty:
        primary_script = 'EMPTY'
    elif not is_printable:
        primary_script = 'CONTROL'
    elif len(scripts) == 0:
        primary_script = 'UNKNOWN'
    elif len(scripts) == 1:
        primary_script = list(scripts)[0]
    else:
        # Several scripts detected: label the token 'MIXED'
        primary_script = 'MIXED'
    
    return {
        'token_id': token_id,
        'decoded_text': decoded,
        'scripts': ','.join(sorted(scripts)) if scripts else '',
        'is_replacement': is_replacement,
        'is_printable': is_printable,
        'is_empty': is_empty,
        'char_count': char_count,
        'geometric_category': geometric_category,
        'primary_script': primary_script,
    }

print("✓ Helper functions defined")

✓ Helper functions defined
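
The order of the checks in `classify_token` matters, because one token can satisfy several of them at once. The sketch below isolates that precedence in a hypothetical helper, `primary_label`, which takes the decoded text and an already-detected script set. It mirrors the logic above but is written only for illustration.

```python
import unicodedata


def primary_label(decoded, scripts):
    """Apply the label precedence: REPLACEMENT > EMPTY > CONTROL > UNKNOWN > script > MIXED."""
    if "\ufffd" in decoded:
        return "REPLACEMENT"
    if not decoded.strip():
        return "EMPTY"
    # Unicode categories starting with "C" are control/format/unassigned characters.
    if any(unicodedata.category(c).startswith("C") and not c.isspace() for c in decoded):
        return "CONTROL"
    if not scripts:
        return "UNKNOWN"
    if len(scripts) == 1:
        return next(iter(scripts))
    return "MIXED"


print(primary_label("a\ufffd", {"LATIN"}))         # REPLACEMENT
print(primary_label("  \n", set()))                # EMPTY
print(primary_label("\u200b", set()))              # CONTROL (zero-width space, category Cf)
print(primary_label("123", set()))                 # UNKNOWN
print(primary_label("abc", {"LATIN"}))             # LATIN
print(primary_label("aб", {"LATIN", "CYRILLIC"}))  # MIXED
```

One consequence of this ordering is that a token containing both readable text and a replacement character is always reported as `REPLACEMENT`, never under its script.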


## Classify All Tokens

In [6]:
print(f"\n{'='*70}")
print("CLASSIFYING ALL TOKENS")
print(f"{'='*70}\n")

print(f"Processing {vocab_size:,} tokens...\n")

# Classify all tokens
classifications = []

for token_id in tqdm(range(vocab_size), desc="Classifying tokens"):
    classification = classify_token(token_id, tokenizer, ad, cluster_token_ids, halo_token_ids)
    classifications.append(classification)

# Create dataframe
df = pd.DataFrame(classifications)

print(f"\n✓ Classification complete")
print(f"  Tokens classified: {len(df):,}")
print(f"  Dataframe shape: {df.shape}")


CLASSIFYING ALL TOKENS

Processing 151,669 tokens...



Classifying tokens: 100%|██████████| 151669/151669 [00:00<00:00, 250522.25it/s]


✓ Classification complete
  Tokens classified: 151,669
  Dataframe shape: (151669, 9)





## Summary Statistics

In [7]:
print(f"\n{'='*70}")
print("SUMMARY STATISTICS")
print(f"{'='*70}\n")

print("Primary script distribution:")
script_counts = df['primary_script'].value_counts()
for script, count in script_counts.items():
    pct = 100 * count / len(df)
    print(f"  {script:20s}: {count:6,} ({pct:5.2f}%)")

print(f"\nGeometric category distribution:")
geom_counts = df['geometric_category'].value_counts()
for category, count in geom_counts.items():
    pct = 100 * count / len(df)
    print(f"  {category:20s}: {count:6,} ({pct:5.2f}%)")

print(f"\nSpecial categories:")
print(f"  Replacement chars: {df['is_replacement'].sum():,} ({100*df['is_replacement'].sum()/len(df):.2f}%)")
print(f"  Empty/whitespace: {df['is_empty'].sum():,} ({100*df['is_empty'].sum()/len(df):.2f}%)")
print(f"  Non-printable: {(~df['is_printable']).sum():,} ({100*(~df['is_printable']).sum()/len(df):.2f}%)")


SUMMARY STATISTICS

Primary script distribution:
  LATIN               : 94,610 (62.38%)
  CJK                 : 25,464 (16.79%)
  UNKNOWN             :  8,239 ( 5.43%)
  CYRILLIC            :  4,142 ( 2.73%)
  ARABIC              :  3,978 ( 2.62%)
  HANGUL              :  3,567 ( 2.35%)
  HEBREW              :  3,179 ( 2.10%)
  THAI                :  2,549 ( 1.68%)
  REPLACEMENT         :  1,457 ( 0.96%)
  HIRAGANA            :    939 ( 0.62%)
  MIXED               :    681 ( 0.45%)
  KATAKANA            :    450 ( 0.30%)
  EMPTY               :    441 ( 0.29%)
  MATHEMATICAL        :    434 ( 0.29%)
  GREEK               :    232 ( 0.15%)
  ETHIOPIC            :    112 ( 0.07%)
  MODIFIER            :     80 ( 0.05%)
  ARMENIAN            :     73 ( 0.05%)
  CANADIAN            :     71 ( 0.05%)
  HALFWIDTH           :     63 ( 0.04%)
  DEVANAGARI          :     56 ( 0.04%)
  CONTROL             :     53 ( 0.03%)
  FULLWIDTH           :     52 ( 0.03%)
  TAI                 :     43 ( 0.03%)

## Cross-Tabulation: Script vs Geometric Category

In [8]:
print(f"\n{'='*70}")
print("SCRIPT × GEOMETRIC CATEGORY")
print(f"{'='*70}\n")

# Cross-tab
crosstab = pd.crosstab(df['primary_script'], df['geometric_category'])

# Sort by total count
crosstab['total'] = crosstab.sum(axis=1)
crosstab = crosstab.sort_values('total', ascending=False)

print(crosstab)

print(f"\nKey insights:")
print(f"  - Are cluster tokens dominated by specific scripts?")
print(f"  - Are halo tokens mostly REPLACEMENT characters?")
print(f"  - What scripts dominate the bulk vocabulary?")


SCRIPT × GEOMETRIC CATEGORY

geometric_category   bulk  cluster  halo  total
primary_script                                 
LATIN               94585       25     0  94610
CJK                 25269      195     0  25464
UNKNOWN              8233        6     0   8239
CYRILLIC             4142        0     0   4142
ARABIC               3903       75     0   3978
...                   ...      ...   ...    ...
FEMININE                1        0     0      1
EULER                   1        0     0      1
CARON                   1        0     0      1
BATAK                   1        0     0      1
KELVIN                  0        1     0      1

[91 rows x 4 columns]

Key insights:
  - Are cluster tokens dominated by specific scripts?
  - Are halo tokens mostly REPLACEMENT characters?
  - What scripts dominate the bulk vocabulary?
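
Raw counts make these questions hard to answer, because the bulk column dwarfs the others. Normalizing within each geometric category gives the share of each script directly. The sketch below does this with the standard library on toy data; `script_share_by_category` is a hypothetical helper, and with the real dataframe the pairs could come from `zip(df['primary_script'], df['geometric_category'])`.

```python
from collections import Counter


def script_share_by_category(rows):
    """rows: iterable of (primary_script, geometric_category) pairs.

    Returns {category: {script: fraction of that category's tokens}}.
    """
    pair_counts = Counter(rows)

    # Total number of tokens in each geometric category.
    totals = Counter()
    for (_script, category), n in pair_counts.items():
        totals[category] += n

    shares = {}
    for (script, category), n in pair_counts.items():
        shares.setdefault(category, {})[script] = n / totals[category]
    return shares


# Toy data, not the real classification.
toy = (
    [("THAI", "cluster")] * 3
    + [("CJK", "cluster")]
    + [("REPLACEMENT", "halo")] * 2
    + [("LATIN", "bulk")] * 4
)
shares = script_share_by_category(toy)
print(shares["cluster"]["THAI"])      # 0.75
print(shares["halo"]["REPLACEMENT"])  # 1.0
```

In pandas, passing `normalize='columns'` to `pd.crosstab` should give the same per-category fractions.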


## Sample Tokens by Category

In [9]:
print(f"\n{'='*70}")
print("SAMPLE TOKENS BY CATEGORY")
print(f"{'='*70}\n")

# Show examples from each major category
categories = [
    ('REPLACEMENT', 'halo'),
    ('HAN', 'bulk'),   # matches no rows: alphabet-detector labels Han text 'CJK'
    ('LATIN', 'bulk'),
    ('THAI', 'cluster'),
]

for primary_script, geom_cat in categories:
    samples = df[(df['primary_script'] == primary_script) & (df['geometric_category'] == geom_cat)].head(5)
    
    if len(samples) > 0:
        print(f"\n{primary_script} ({geom_cat}):")
        for _, row in samples.iterrows():
            decoded_display = repr(row['decoded_text'])[:40]
            print(f"  Token {row['token_id']:6d}: {decoded_display}")


SAMPLE TOKENS BY CATEGORY


REPLACEMENT (halo):
  Token     94: '�'
  Token     95: '�'
  Token     96: '�'
  Token     97: '�'
  Token     98: '�'

LATIN (bulk):
  Token     32: 'A'
  Token     33: 'B'
  Token     34: 'C'
  Token     35: 'D'
  Token     36: 'E'

THAI (cluster):
  Token 123828: 'ที่'
  Token 123939: 'ร์'
  Token 123948: 'ติ'
  Token 123952: 'ด้'
  Token 124027: 'เป็น'


## Save Results

In [10]:
print(f"\n{'='*70}")
print("SAVING RESULTS")
print(f"{'='*70}\n")

# Save CSV (human-readable)
df.to_csv(OUTPUT_CSV, index=False)
print(f"✓ Saved CSV to: {OUTPUT_CSV}")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {len(df.columns)}")

# Save safetensors (for fast loading in future notebooks)
save_dict = {
    'token_ids': torch.tensor(df['token_id'].values, dtype=torch.int64),
    'is_replacement': torch.tensor(df['is_replacement'].values, dtype=torch.bool),
    'is_printable': torch.tensor(df['is_printable'].values, dtype=torch.bool),
    'is_empty': torch.tensor(df['is_empty'].values, dtype=torch.bool),
    'char_count': torch.tensor(df['char_count'].values, dtype=torch.int32),
}

# Encode categorical variables as integers
geom_cat_map = {'bulk': 0, 'halo': 1, 'cluster': 2}
df['geometric_category_code'] = df['geometric_category'].map(geom_cat_map)
save_dict['geometric_category'] = torch.tensor(df['geometric_category_code'].values, dtype=torch.int8)

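# Script codes: each code is the script's index in the sorted list of unique labels
# (the mapping itself is not written to the safetensors file; it can be rebuilt from the CSV)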
unique_scripts = sorted(df['primary_script'].unique())
script_to_code = {script: i for i, script in enumerate(unique_scripts)}
df['primary_script_code'] = df['primary_script'].map(script_to_code)
save_dict['primary_script_code'] = torch.tensor(df['primary_script_code'].values, dtype=torch.int16)

save_file(save_dict, OUTPUT_SAFETENSORS)
print(f"\n✓ Saved safetensors to: {OUTPUT_SAFETENSORS}")

# Print the first ten script codes for reference
print(f"\nScript code mapping:")
for script, code in sorted(script_to_code.items(), key=lambda x: x[1])[:10]:
    print(f"  {code:3d}: {script}")
if len(unique_scripts) > 10:
    print(f"  ... ({len(unique_scripts) - 10} more)")


SAVING RESULTS

✓ Saved CSV to: ../tensors/Qwen3-4B-Instruct-2507/1.10d_token_classification.csv
  Rows: 151,669
  Columns: 9

✓ Saved safetensors to: ../tensors/Qwen3-4B-Instruct-2507/1.10d_token_classification.safetensors

Script code mapping:
    0: ALEF
    1: ANGSTROM
    2: ARABIC
    3: ARMENIAN
    4: BALINESE
    5: BAMUM
    6: BATAK
    7: BENGALI
    8: BLACK-LETTER
    9: BOPOMOFO
  ... (81 more)
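
The cell above prints the script-code mapping but does not write it anywhere, so the integer codes in the safetensors file cannot be decoded without the CSV. One option is a small JSON sidecar file. The sketch below shows the round trip on a toy mapping; the helper names and the file name `script_codes.json` are hypothetical.

```python
import json
import os
import tempfile


def save_code_map(script_to_code, path):
    """Write the script -> code mapping as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(script_to_code, f, ensure_ascii=False, indent=2, sort_keys=True)


def load_code_to_script(path):
    """Read the mapping back and invert it to code -> script."""
    with open(path, encoding="utf-8") as f:
        script_to_code = json.load(f)
    return {code: script for script, code in script_to_code.items()}


# Toy mapping matching the first three codes printed above.
toy_map = {"ALEF": 0, "ANGSTROM": 1, "ARABIC": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "script_codes.json")  # hypothetical file name
    save_code_map(toy_map, path)
    code_to_script = load_code_to_script(path)

print(code_to_script[2])  # ARABIC
```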


## Summary

This notebook classified all 151,669 tokens in Qwen's tokenizer vocabulary by:
- **Script/Alphabet** (using `alphabet-detector`)
- **Character type** (replacement, printable, empty, control)
- **Geometric category** (cluster, halo, bulk)

**Key findings to look for:**

1. **Cluster tokens:** Are they dominated by Thai script? (Expected from 1.7a)
2. **Halo tokens:** Are they mostly REPLACEMENT characters? (Expected from 1.10a)
3. **Bulk tokens:** What scripts dominate? (HAN/CJK for Chinese? LATIN for English?)
4. **Script distribution:** Overall breakdown of vocabulary by writing system

**Next steps:**
- Load this classification in future notebooks to filter/analyze by script
- Investigate specific script categories in detail
- Cross-reference with training data properties (if available)
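
As a sketch of the first next step, the saved CSV can be filtered by script and geometric category with the standard library alone. The rows below are toy data that follow the column layout written by this notebook, and `load_rows` is a hypothetical helper. Boolean columns arrive as the strings `True`/`False` and need converting.

```python
import csv
import io


def load_rows(handle):
    """Parse classification rows, converting numeric and boolean columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["token_id"] = int(row["token_id"])
        row["char_count"] = int(row["char_count"])
        for flag in ("is_replacement", "is_printable", "is_empty"):
            row[flag] = row[flag] == "True"
        rows.append(row)
    return rows


# Toy stand-in for the CSV file; a real run would open OUTPUT_CSV instead.
toy_csv = io.StringIO(
    "token_id,decoded_text,scripts,is_replacement,is_printable,is_empty,"
    "char_count,geometric_category,primary_script\n"
    "32,A,LATIN,False,True,False,1,bulk,LATIN\n"
    "94,\ufffd,,True,True,False,1,halo,REPLACEMENT\n"
    "123828,ที่,THAI,False,True,False,3,cluster,THAI\n"
)

rows = load_rows(toy_csv)
thai_cluster = [
    r["token_id"]
    for r in rows
    if r["primary_script"] == "THAI" and r["geometric_category"] == "cluster"
]
print(thai_cluster)  # [123828]
```

If pandas is used instead, decoded strings such as `null` or `NA` may be read back as missing values unless `keep_default_na=False` is passed to `pd.read_csv`.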