# 1.7b: Thai Token Census

This notebook counts all Thai tokens in the full vocabulary to contextualize the cluster.

## The Question

We found that 71.4% of the cluster (1,579 of 2,212 tokens) is Thai script. How does this compare to the full vocabulary?

**Is the cluster unusual, or is Thai just heavily represented in the tokenizer?**

Possible scenarios:
1. **Cluster is anomalous:** Thai is ~1-2% of full vocabulary → 71% in cluster is extreme concentration
2. **Thai is overrepresented:** Thai is ~20-30% of vocabulary → cluster reflects overall distribution
3. **Cluster captures most Thai:** Thai is ~1% of vocabulary → cluster contains >50% of all Thai tokens
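
Two numbers separate these scenarios: the enrichment factor (Thai share of the cluster divided by Thai share of the vocabulary) and the capture rate (fraction of all Thai tokens that sit in the cluster). The sketch below is only an illustration. `census_ratios` is a hypothetical helper name, and the counts are made up to show how the same 70% cluster share reads very differently for a rare versus a common script.

```python
def census_ratios(cluster_thai, cluster_total, vocab_thai, vocab_total):
    """Return (enrichment, capture_rate) for one script.

    enrichment   = (script share of cluster) / (script share of vocabulary)
    capture_rate = fraction of the script's tokens that fall in the cluster
    """
    if min(cluster_total, vocab_thai, vocab_total) <= 0:
        raise ValueError("totals must be positive")
    cluster_share = cluster_thai / cluster_total
    vocab_share = vocab_thai / vocab_total
    return cluster_share / vocab_share, cluster_thai / vocab_thai


# Illustrative counts only (not measurements): 700 of 1,000 cluster tokens are Thai.
# (a) Thai is rare in the vocabulary: 1,000 of 100,000 tokens
enrichment_a, capture_a = census_ratios(700, 1_000, 1_000, 100_000)
# (b) Thai is common in the vocabulary: 25,000 of 100,000 tokens
enrichment_b, capture_b = census_ratios(700, 1_000, 25_000, 100_000)

print(f"rare script:   enrichment {enrichment_a:.1f}x, capture {capture_a:.1%}")
print(f"common script: enrichment {enrichment_b:.1f}x, capture {capture_b:.1%}")
```

A high enrichment together with a high capture rate points to the cluster being anomalous, while an enrichment near 2-3× with a small capture rate points to Thai simply being common in the tokenizer.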

## Method

We'll:
1. Decode the entire tokenizer vocabulary (all 151,669 tokens reported by `len(tokenizer)`, not the 151,936 rows of the embedding matrix)
2. Classify each token by script (reuse 1.7a's classifier)
3. Count Thai tokens across full vocabulary
4. Compare cluster Thai tokens to full vocabulary Thai tokens
5. Compute percentages and ratios

## Parameters

In [1]:
# Model to analyze
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

## Imports

In [2]:
import torch
import pandas as pd
import unicodedata
from safetensors.torch import load_file
from pathlib import Path
from transformers import AutoTokenizer
from collections import Counter
from tqdm import tqdm

## Load Tokenizer

In [3]:
print(f"Loading tokenizer from {HF_MODEL_NAME}...\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
vocab_size = len(tokenizer)
print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {vocab_size:,} tokens")

Loading tokenizer from Qwen/Qwen3-4B-Instruct-2507...

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Load Cluster Data

In [4]:
# Load cluster tokens from 1.4h
tensor_path = Path(f"../tensors/{MODEL_NAME}/1.4h_cluster_tokens.safetensors")
data = load_file(tensor_path)
cluster_token_ids = set(data['cluster_token_ids'].tolist())

print(f"\nLoaded {len(cluster_token_ids):,} cluster token IDs")


Loaded 2,212 cluster token IDs


## Helper Functions

In [5]:
def classify_script(text):
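    """
    Classify text by its dominant script.

    The "script" is approximated by the first word of each character's Unicode
    name, so labels are upper-case name prefixes (e.g., 'THAI', 'CJK', 'LATIN'),
    and punctuation yields labels such as 'LEFT', 'RIGHT', or 'QUOTATION'.
    """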
    if not text or not text.strip():
        return 'Empty/Whitespace'
    
    # Count characters by script
    script_counts = Counter()
    
    for char in text:
        # Skip whitespace and control characters for classification
        if char.isspace() or unicodedata.category(char).startswith('C'):
            continue
            
        # Approximate the script by the first word of the Unicode character name
        try:
            script = unicodedata.name(char).split()[0]
            script_counts[script] += 1
        except (ValueError, IndexError):
            # unicodedata.name() raises ValueError for unnamed characters;
            # fall back to the general category
            category = unicodedata.category(char)
            if category.startswith('P'):
                script_counts['Punctuation'] += 1
            elif category.startswith('S'):
                script_counts['Symbol'] += 1
            else:
                script_counts['Other'] += 1
    
    if not script_counts:
        return 'Empty/Whitespace'
    
    # Return most common script
    return script_counts.most_common(1)[0][0]


print("Helper function defined: classify_script(text)")

Helper function defined: classify_script(text)
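
Because `classify_script` keys on the first word of the Unicode character name rather than the Unicode Script property, some labels are name prefixes and not scripts. The short sketch below shows what that prefix looks like for a few sample characters. `name_prefix` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import unicodedata


def name_prefix(char):
    """First word of the Unicode character name, or None if the character is unnamed."""
    try:
        return unicodedata.name(char).split()[0]
    except ValueError:
        return None


# A Thai letter, a Han ideograph, Latin and Cyrillic letters, then punctuation,
# the replacement character, and a digit.
samples = ["ก", "中", "a", "я", "(", ")", '"', "\ufffd", "7"]
labels = {ch: name_prefix(ch) for ch in samples}

for ch, label in labels.items():
    print(f"U+{ord(ch):04X} -> {label}")
```

This is why a script table built with this classifier can contain rows such as `LEFT`, `RIGHT`, `QUOTATION`, and `REPLACEMENT` alongside real scripts, and why Han characters appear as `CJK`.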


## Decode and Classify All Tokens

In [6]:
print(f"\nDecoding and classifying all {vocab_size:,} tokens...\n")

# Build classification for every token
records = []

for token_id in tqdm(range(vocab_size), desc="Processing tokens"):
    # Decode
    decoded = tokenizer.decode([token_id])
    
    # Classify script
    script = classify_script(decoded)
    
    # Check if in cluster
    in_cluster = token_id in cluster_token_ids
    
    records.append({
        'token_id': token_id,
        'decoded': decoded,
        'script': script,
        'in_cluster': in_cluster,
    })

# Create dataframe
df = pd.DataFrame(records)

print(f"\n✓ Processed {len(df):,} tokens")


Decoding and classifying all 151,669 tokens...



Processing tokens: 100%|██████████| 151669/151669 [00:00<00:00, 207919.28it/s]


✓ Processed 151,669 tokens





## Thai Token Statistics

In [7]:
print("\n" + "="*70)
print("THAI TOKEN CENSUS")
print("="*70)

# Overall counts
total_tokens = len(df)
total_thai = (df['script'] == 'THAI').sum()
total_cluster = df['in_cluster'].sum()

print(f"\nFull vocabulary:")
print(f"  Total tokens: {total_tokens:,}")
print(f"  Thai tokens: {total_thai:,} ({100*total_thai/total_tokens:.2f}%)")
print(f"  Non-Thai tokens: {total_tokens - total_thai:,} ({100*(total_tokens-total_thai)/total_tokens:.2f}%)")

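# Cluster counts
# NOTE: total_cluster counts only cluster IDs below len(tokenizer), so it can be
# smaller than len(cluster_token_ids); the "2,212" in the header below is hard-coded.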
cluster_thai = ((df['script'] == 'THAI') & df['in_cluster']).sum()
cluster_non_thai = (df['in_cluster'] & (df['script'] != 'THAI')).sum()

print(f"\nCluster (2,212 tokens):")
print(f"  Thai tokens: {cluster_thai:,} ({100*cluster_thai/total_cluster:.2f}%)")
print(f"  Non-Thai tokens: {cluster_non_thai:,} ({100*cluster_non_thai/total_cluster:.2f}%)")

# Non-cluster counts
non_cluster_thai = ((df['script'] == 'THAI') & ~df['in_cluster']).sum()
non_cluster_total = (~df['in_cluster']).sum()

print(f"\nNon-cluster tokens ({non_cluster_total:,} tokens):")
print(f"  Thai tokens: {non_cluster_thai:,} ({100*non_cluster_thai/non_cluster_total:.2f}%)")
print(f"  Non-Thai tokens: {non_cluster_total - non_cluster_thai:,} ({100*(non_cluster_total-non_cluster_thai)/non_cluster_total:.2f}%)")

print("\n" + "="*70)


THAI TOKEN CENSUS

Full vocabulary:
  Total tokens: 151,669
  Thai tokens: 2,571 (1.70%)
  Non-Thai tokens: 149,098 (98.30%)

Cluster (2,212 tokens):
  Thai tokens: 1,579 (81.18%)
  Non-Thai tokens: 366 (18.82%)

Non-cluster tokens (149,724 tokens):
  Thai tokens: 992 (0.66%)
  Non-Thai tokens: 148,732 (99.34%)
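
The cluster header says 2,212 tokens, but the two cluster rows sum to 1,579 + 366 = 1,945, and the percentages use that smaller total. The loop only visits `range(len(tokenizer))`, so cluster IDs at or above 151,669 never enter `df`. The sketch below checks the arithmetic and shows the split on toy IDs. `split_by_tokenizer_range` is a hypothetical helper, and reading the gap as "embedding rows without a tokenizer entry" is an assumption based on the 151,936 figure quoted in the Method section.

```python
def split_by_tokenizer_range(cluster_ids, tokenizer_size):
    """Split cluster IDs into those a loop over range(tokenizer_size) visits and the rest."""
    in_range = {i for i in cluster_ids if 0 <= i < tokenizer_size}
    out_of_range = set(cluster_ids) - in_range
    return in_range, out_of_range


# Arithmetic check on the counts reported in this notebook
reported_cluster_ids = 2_212      # size of cluster_token_ids
counted_in_df = 1_579 + 366       # Thai + non-Thai cluster rows in the census
missing = reported_cluster_ids - counted_in_df
id_gap = 151_936 - 151_669        # Method-section count minus len(tokenizer)
print(f"cluster IDs missing from df: {missing}, ID gap: {id_gap}")

# Toy example: a tokenizer with 10 entries and a cluster that reaches past it
toy_in, toy_out = split_by_tokenizer_range({3, 7, 10, 11}, tokenizer_size=10)
print(sorted(toy_in), sorted(toy_out))
```

Running the helper on `cluster_token_ids` and `vocab_size` in this notebook would list the out-of-range IDs directly. Over all 2,212 cluster IDs the Thai share is 1,579 / 2,212 = 71.4%, which is the figure quoted in The Question.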



## Key Ratios and Comparisons

In [8]:
print("\n" + "="*70)
print("KEY RATIOS")
print("="*70)

# What fraction of all Thai tokens are in the cluster?
thai_capture_rate = 100 * cluster_thai / total_thai
print(f"\nThai token capture rate:")
print(f"  {cluster_thai:,} of {total_thai:,} Thai tokens are in cluster")
print(f"  = {thai_capture_rate:.1f}% of all Thai tokens")

# Enrichment factor
thai_pct_cluster = 100 * cluster_thai / total_cluster
thai_pct_vocab = 100 * total_thai / total_tokens
enrichment = thai_pct_cluster / thai_pct_vocab

print(f"\nThai enrichment in cluster:")
print(f"  Cluster: {thai_pct_cluster:.1f}% Thai")
print(f"  Full vocabulary: {thai_pct_vocab:.2f}% Thai")
print(f"  Enrichment factor: {enrichment:.1f}×")
print(f"  (Cluster is {enrichment:.1f}× more Thai-dense than expected)")

# Comparison to non-cluster
non_cluster_thai_pct = 100 * non_cluster_thai / non_cluster_total
cluster_vs_non_cluster = thai_pct_cluster / non_cluster_thai_pct

print(f"\nCluster vs non-cluster:")
print(f"  Cluster Thai: {thai_pct_cluster:.1f}%")
print(f"  Non-cluster Thai: {non_cluster_thai_pct:.2f}%")
print(f"  Ratio: {cluster_vs_non_cluster:.1f}×")

print("\n" + "="*70)


KEY RATIOS

Thai token capture rate:
  1,579 of 2,571 Thai tokens are in cluster
  = 61.4% of all Thai tokens

Thai enrichment in cluster:
  Cluster: 81.2% Thai
  Full vocabulary: 1.70% Thai
  Enrichment factor: 47.9×
  (Cluster is 47.9× more Thai-dense than expected)

Cluster vs non-cluster:
  Cluster Thai: 81.2%
  Non-cluster Thai: 0.66%
  Ratio: 122.5×
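
All three ratios follow from the four cells of a 2×2 table (cluster or not, Thai or not). As a cross-check, the sketch below recomputes them from the counts printed in the census output. The dictionary layout and variable names are choices made for this illustration.

```python
# 2x2 contingency table taken from the census output above
counts = {
    ("cluster", "thai"): 1_579,
    ("cluster", "other"): 366,
    ("rest", "thai"): 992,
    ("rest", "other"): 148_732,
}

cluster_total = counts[("cluster", "thai")] + counts[("cluster", "other")]
rest_total = counts[("rest", "thai")] + counts[("rest", "other")]
thai_total = counts[("cluster", "thai")] + counts[("rest", "thai")]
vocab_total = cluster_total + rest_total

# Share of Thai tokens in each group
thai_share_cluster = counts[("cluster", "thai")] / cluster_total
thai_share_rest = counts[("rest", "thai")] / rest_total
thai_share_vocab = thai_total / vocab_total

capture = counts[("cluster", "thai")] / thai_total      # fraction of Thai in cluster
enrichment = thai_share_cluster / thai_share_vocab      # cluster vs full vocabulary
ratio = thai_share_cluster / thai_share_rest            # cluster vs non-cluster

print(f"capture rate: {100 * capture:.1f}%")
print(f"enrichment:   {enrichment:.1f}×")
print(f"ratio:        {ratio:.1f}×")
```

The cluster-versus-non-cluster ratio is larger than the enrichment factor because removing the cluster from the baseline also removes most Thai tokens from it.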



## Script Distribution Comparison

In [9]:
print("\n" + "="*70)
print("SCRIPT DISTRIBUTION: CLUSTER vs FULL VOCABULARY")
print("="*70)

# Get top 15 scripts from full vocabulary
vocab_script_counts = df['script'].value_counts().head(15)
cluster_script_counts = df[df['in_cluster']]['script'].value_counts()

print(f"\n{'Script':<20} {'Full Vocab':>12} {'Cluster':>12} {'Enrichment':>12}")
print("-" * 70)

for script in vocab_script_counts.index:
    vocab_count = vocab_script_counts[script]
    vocab_pct = 100 * vocab_count / total_tokens
    
    cluster_count = cluster_script_counts.get(script, 0)
    cluster_pct = 100 * cluster_count / total_cluster
    
    if vocab_pct > 0:
        enrichment = cluster_pct / vocab_pct
    else:
        enrichment = 0
    
    print(f"{script:<20} {vocab_pct:>11.2f}% {cluster_pct:>11.1f}% {enrichment:>11.1f}×")

print("\n" + "="*70)


SCRIPT DISTRIBUTION: CLUSTER vs FULL VOCABULARY

Script                 Full Vocab      Cluster   Enrichment
----------------------------------------------------------------------
LATIN                      61.83%         1.3%         0.0×
CJK                        17.01%        10.2%         0.6×
CYRILLIC                    2.74%         0.0%         0.0×
ARABIC                      2.64%         3.9%         1.5×
HANGUL                      2.36%         0.0%         0.0×
HEBREW                      2.10%         1.3%         0.6×
THAI                        1.70%        81.2%        47.9×
REPLACEMENT                 0.91%         0.9%         1.0×
HIRAGANA                    0.78%         0.0%         0.0×
RIGHT                       0.76%         0.0%         0.0×
LEFT                        0.54%         0.0%         0.0×
KATAKANA                    0.39%         0.0%         0.0×
QUOTATION                   0.34%         0.0%         0.0×
Empty/Whitespace            0.32%      

## Observations

The census distinguishes three possible outcomes:

**Scenario 1: Extreme concentration**
- Thai is <5% of full vocabulary
- Cluster captures >50% of all Thai tokens
- Enrichment factor >10×
- **Interpretation:** Cluster is a Thai token graveyard: most Thai tokens are dead

**Scenario 2: Representative sample**
- Thai is 20-30% of full vocabulary
- Cluster captures ~5% of Thai tokens
- Enrichment factor ~2-3×
- **Interpretation:** Cluster is slightly enriched for Thai, but not dramatically

**Scenario 3: Modest overrepresentation**
- Thai is 2-5% of full vocabulary
- Cluster captures 20-40% of Thai tokens
- Enrichment factor 5-8×
- **Interpretation:** Thai tokens disproportionately end up in dead cluster

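The measured values match Scenario 1: Thai is 1.70% of the tokenizer vocabulary (2,571 of 151,669 tokens), the cluster holds 61.4% of all Thai tokens (1,579 of 2,571), and the enrichment factor is 47.9×. The concentration is therefore specific to Thai rather than a reflection of the tokenizer's overall script mix. Two caveats apply. First, 992 Thai tokens (38.6%) lie outside the cluster, and CJK, Arabic, Hebrew, and Latin tokens also appear inside it. Second, the 81.2% Thai share is computed over the 1,945 cluster tokens that have tokenizer entries; over all 2,212 cluster IDs it is the 71.4% quoted in The Question.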