# 1.7a: Decode Cluster Tokens

This notebook decodes all 2,212 cluster tokens to see what text they represent.

## The Question

We've found a geometrically anomalous cluster: 2,212 tokens packed into a tiny volume (diameter ~0.0016), isolated from the rest of the vocabulary by a conspicuous void.

**What *are* these tokens?**

This is where geometric cartography meets linguistics. We have the token IDs; now let's see what they decode to.

## Method

We'll:
1. Load the cluster token IDs from 1.4h
2. Sort by token ID (BPE natural order)
3. Decode each token with the tokenizer
4. Build a dataframe with metadata (byte length, printability, script)
5. Display the full table and summary statistics

## Parameters

In [1]:
# Model to analyze
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

## Imports

In [2]:
import torch
import pandas as pd
import unicodedata
from safetensors.torch import load_file
from pathlib import Path
from transformers import AutoTokenizer
from collections import Counter

## Load Cluster Token IDs

In [3]:
# Load cluster tokens from 1.4h
tensor_path = Path(f"../tensors/{MODEL_NAME}/1.4h_cluster_tokens.safetensors")
data = load_file(tensor_path)
cluster_token_ids = data['cluster_token_ids']

print(f"Loaded {len(cluster_token_ids):,} cluster token IDs from {tensor_path.name}")
print(f"  Token ID range: [{cluster_token_ids.min().item()}, {cluster_token_ids.max().item()}]")

Loaded 2,212 cluster token IDs from 1.4h_cluster_tokens.safetensors
  Token ID range: [124, 151935]


In [4]:
# Sort by token ID (BPE natural order)
cluster_token_ids_sorted = torch.sort(cluster_token_ids)[0]

print(f"\nSorted token IDs:")
print(f"  First 10: {cluster_token_ids_sorted[:10].tolist()}")
print(f"  Last 10: {cluster_token_ids_sorted[-10:].tolist()}")


Sorted token IDs:
  First 10: [124, 125, 177, 178, 179, 180, 181, 182, 183, 184]
  Last 10: [151926, 151927, 151928, 151929, 151930, 151931, 151932, 151933, 151934, 151935]


## Load Tokenizer

In [5]:
print(f"\nLoading tokenizer from {HF_MODEL_NAME}...\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {len(tokenizer):,} tokens")


Loading tokenizer from Qwen/Qwen3-4B-Instruct-2507...

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Helper Functions

In [6]:
def classify_script(text):
    """
    Classify text by dominant script, approximated by the first word of each
    character's Unicode name (e.g., 'THAI', 'CJK', 'LATIN').
    Symbols yield labels such as 'LESS-THAN' or 'OHM', not true script names.
    """
    if not text or not text.strip():
        return 'Empty/Whitespace'
    
    # Count characters by script
    script_counts = Counter()
    
    for char in text:
        # Skip whitespace and control characters for classification
        if char.isspace() or unicodedata.category(char).startswith('C'):
            continue
            
        # Use the first word of the Unicode character name as a script proxy
        try:
            script = unicodedata.name(char).split()[0]
            script_counts[script] += 1
        except (ValueError, IndexError):
            # unicodedata.name raises ValueError for characters without a name;
            # fall back to the general category
            category = unicodedata.category(char)
            if category.startswith('P'):
                script_counts['Punctuation'] += 1
            elif category.startswith('S'):
                script_counts['Symbol'] += 1
            else:
                script_counts['Other'] += 1
    
    if not script_counts:
        return 'Empty/Whitespace'
    
    # Return most common script
    return script_counts.most_common(1)[0][0]


def is_printable(text):
    """
    Check if text contains only printable characters.
    Returns False for an empty string.
    """
    if not text:
        return False
    
    for char in text:
        cat = unicodedata.category(char)
        # Allow letters, numbers, punctuation, symbols, and whitespace
        # Exclude the "Other" categories (Cc, Cf, Cs, Co, Cn)
        if cat.startswith('C') and not char.isspace():
            return False
    
    return True


print("Helper functions defined:")
print("  classify_script(text) - Identify dominant Unicode script")
print("  is_printable(text) - Check if all characters are printable")

Helper functions defined:
  classify_script(text) - Identify dominant Unicode script
  is_printable(text) - Check if all characters are printable
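To make the labels in the later script distribution easier to read, the sketch below applies the same first-word-of-name rule that `classify_script` relies on to a few single characters. The sample characters are chosen for illustration and are not taken from the cluster, and `name_prefix` is a hypothetical helper name.

```python
import unicodedata

def name_prefix(char):
    """First word of the Unicode character name (the rule classify_script relies on)."""
    return unicodedata.name(char).split()[0]

# Illustrative characters only; none are read from the cluster file.
samples = ["\u0e01", "\u4e2d", "\u2126", "\u212a", "<", "\ufffd"]

for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):30s} -> {name_prefix(ch)}")

# U+FFFD is in category 'So' (Symbol, other), so is_printable treats it as printable.
replacement_category = unicodedata.category("\ufffd")
print(replacement_category)
```

This is why labels such as `OHM`, `KELVIN`, `LESS-THAN`, and `REPLACEMENT` can appear alongside genuine script names: the rule reads character names, not the Unicode Script property.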


## Decode All Tokens

In [7]:
print(f"\nDecoding {len(cluster_token_ids_sorted):,} tokens...\n")

# Build list of records
records = []

for token_id in cluster_token_ids_sorted:
    tid = token_id.item()
    
    # Decode
    decoded = tokenizer.decode([tid])
    
    # Compute metadata
    byte_len = len(decoded.encode('utf-8'))
    printable = is_printable(decoded)
    script = classify_script(decoded)
    
    records.append({
        'token_id': tid,
        'decoded': decoded,
        'byte_length': byte_len,
        'printable': printable,
        'script': script,
    })

# Create dataframe
df = pd.DataFrame(records)

print(f"✓ Decoded {len(df):,} tokens")
print(f"\nDataframe shape: {df.shape}")
print(f"Columns: {list(df.columns)}")


Decoding 2,212 tokens...

✓ Decoded 2,212 tokens

Dataframe shape: (2212, 5)
Columns: ['token_id', 'decoded', 'byte_length', 'printable', 'script']
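Some tokens decode to U+FFFD (the replacement character) with a byte length of 3. One plausible explanation, assuming the tokenizer is a byte-level BPE that decodes with replacement on invalid UTF-8 (not verified in this notebook), is that a token holding only part of a multi-byte sequence cannot decode to a valid character on its own. The sketch below reproduces the effect with plain Python bytes instead of the tokenizer; the Thai character is an arbitrary example.

```python
# A Thai character occupies three bytes in UTF-8.
full = "\u0e01".encode("utf-8")  # b'\xe0\xb8\x81'

# Hypothetical single-byte token: only the lead byte of that sequence.
partial = full[:1]

# Decoding an incomplete sequence with errors="replace" yields U+FFFD.
decoded = partial.decode("utf-8", errors="replace")

# U+FFFD itself encodes to three bytes, so the measured length is 3,
# regardless of how many bytes the token actually holds.
byte_len = len(decoded.encode("utf-8"))
print(repr(decoded), byte_len)
```

If this is what is happening, `byte_length` for such rows measures the replacement character, not the token's raw bytes. Inspecting the raw token strings with `tokenizer.convert_ids_to_tokens` is one way to check.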


## Summary Statistics

In [8]:
print("\n" + "="*60)
print("CLUSTER TOKEN SUMMARY")
print("="*60)

print(f"\nTotal tokens: {len(df):,}")
print(f"\nByte length statistics:")
print(f"  Min: {df['byte_length'].min()}")
print(f"  Max: {df['byte_length'].max()}")
print(f"  Mean: {df['byte_length'].mean():.2f}")
print(f"  Median: {df['byte_length'].median():.0f}")

print(f"\nPrintability:")
print(f"  Printable: {df['printable'].sum():,} ({100*df['printable'].sum()/len(df):.1f}%)")
print(f"  Non-printable: {(~df['printable']).sum():,} ({100*(~df['printable']).sum()/len(df):.1f}%)")

print(f"\nScript distribution:")
script_counts = df['script'].value_counts()
for script, count in script_counts.items():
    pct = 100 * count / len(df)
    print(f"  {script:20s}: {count:4,} ({pct:5.1f}%)")

print("\n" + "="*60)


CLUSTER TOKEN SUMMARY

Total tokens: 2,212

Byte length statistics:
  Min: 0
  Max: 42
  Mean: 11.29
  Median: 12

Printability:
  Printable: 1,945 (87.9%)
  Non-printable: 267 (12.1%)

Script distribution:
  THAI                : 1,579 ( 71.4%)
  Empty/Whitespace    :  267 ( 12.1%)
  CJK                 :  198 (  9.0%)
  ARABIC              :   75 (  3.4%)
  HEBREW              :   25 (  1.1%)
  LATIN               :   25 (  1.1%)
  REPLACEMENT         :   18 (  0.8%)
  GREEK               :   12 (  0.5%)
  LESS-THAN           :    3 (  0.1%)
  FULLWIDTH           :    2 (  0.1%)
  OHM                 :    1 (  0.0%)
  LEFT-POINTING       :    1 (  0.0%)
  TIBETAN             :    1 (  0.0%)
  ANGSTROM            :    1 (  0.0%)
  KELVIN              :    1 (  0.0%)
  DEVANAGARI          :    1 (  0.0%)
  BENGALI             :    1 (  0.0%)
  RIGHT-POINTING      :    1 (  0.0%)
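
One hedged reading of the 267 empty tokens: the tokenizer reports 151,669 entries, while the cluster contains IDs up to 151,935. If the cluster includes every ID in that gap (an assumption; only the last ten sorted IDs are shown to be contiguous), those IDs have no tokenizer entry and would plausibly decode to an empty string. The arithmetic below uses only numbers printed earlier in this notebook; `count_beyond_vocab` is a hypothetical helper.

```python
VOCAB_SIZE = 151_669      # len(tokenizer), from the tokenizer cell output
MAX_CLUSTER_ID = 151_935  # upper end of the cluster token ID range

def count_beyond_vocab(token_ids, vocab_size):
    """Count IDs that have no tokenizer entry (id >= vocab_size)."""
    return sum(1 for tid in token_ids if tid >= vocab_size)

# Assumption: every ID in the gap belongs to the cluster.
gap_ids = range(VOCAB_SIZE, MAX_CLUSTER_ID + 1)
gap_count = count_beyond_vocab(gap_ids, VOCAB_SIZE)
print(gap_count)  # 267
```

In the notebook itself, `count_beyond_vocab(cluster_token_ids_sorted.tolist(), len(tokenizer))` would test the assumption directly. A result of 267 would match both the Empty/Whitespace and Non-printable counts.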



## Full Token Table

In [9]:
# Display settings for full table
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 50)

print(f"\nFull table of {len(df):,} cluster tokens:\n")
display(df)


Full table of 2,212 cluster tokens:



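| | token_id | decoded | byte_length | printable | script |
|---|---|---|---|---|---|
| 0 | 124 | � | 3 | True | REPLACEMENT |
| 1 | 125 | � | 3 | True | REPLACEMENT |
| 2 | 177 | � | 3 | True | REPLACEMENT |
| 3 | 178 | � | 3 | True | REPLACEMENT |
| 4 | 179 | � | 3 | True | REPLACEMENT |
| 5 | 180 | � | 3 | True | REPLACEMENT |
| 6 | 181 | � | 3 | True | REPLACEMENT |
| 7 | 182 | � | 3 | True | REPLACEMENT |
| 8 | 183 | � | 3 | True | REPLACEMENT |
| 9 | 184 | � | 3 | True | REPLACEMENT |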


## Observations

What patterns do we see in the decoded tokens?

**Key questions:**
1. What's the dominant script? (Thai? CJK? Mixed?)
2. Are these common words or rare sequences?
3. Are there any obvious patterns in token ID ranges?
4. How many are high-byte-count sequences (complex Unicode)?
5. Do the non-printable tokens cluster in specific ID ranges?

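The script distribution tells us whether this is a linguistic phenomenon (e.g., "dead Thai tokens") or something more general. By the name-prefix labels above, THAI dominates at 71.4%, but the remainder spans empty strings (12.1%), CJK (9.0%), Arabic, Hebrew, Latin, and assorted symbols, so the cluster is not purely Thai.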
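
Questions 3 and 5 are easier to answer once the sorted IDs are collapsed into contiguous runs. A minimal sketch follows; `contiguous_runs` is a hypothetical helper, and the sample input is only the first and last ten sorted IDs printed earlier, not the full cluster.

```python
def contiguous_runs(sorted_ids):
    """Collapse ascending integer IDs into (start, end) runs of consecutive values."""
    runs = []
    for tid in sorted_ids:
        if runs and tid == runs[-1][1] + 1:
            # Extend the current run by one ID
            runs[-1] = (runs[-1][0], tid)
        else:
            # Gap found (or first ID): start a new run
            runs.append((tid, tid))
    return runs

# First and last ten sorted IDs from the earlier cell output
sample = [124, 125, 177, 178, 179, 180, 181, 182, 183, 184] + list(range(151926, 151936))
runs = contiguous_runs(sample)
print(runs)  # [(124, 125), (177, 184), (151926, 151935)]
```

Applied to `cluster_token_ids_sorted.tolist()`, the run boundaries could then be compared against the `printable` and `script` columns to see whether each category occupies its own ID range.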