# 1.9c: Decode Halo Tokens

This notebook decodes the 1,423 "unreachable outside cluster" tokens to validate our reachability test.

## The Question

In 1.8d, we found that 1,423 tokens are "unreachable" (fail round-trip test) but are NOT in the geometric cluster.

**Are these truly unreachable, or is our test wrong?**

## The Concern

The round-trip test (`decode(token_id) → encode(string) → compare`) might have false positives:
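- A token's decoded string may re-encode to a different token sequence in isolation, even though the token is produced inside longer text
- Greedy merging may segment the same string into different tokens
- Whether a token is produced may depend on surrounding context, which the isolated test ignores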
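
To make the failure mode concrete, here is a minimal sketch of an isolated round-trip check. The exact implementation used in 1.8d is not shown in this notebook, so `fails_round_trip` and `ToyByteTokenizer` are hypothetical names, and the four-entry vocabulary is invented for illustration. It is not the Qwen tokenizer.

```python
class ToyByteTokenizer:
    """Hypothetical stand-in for a byte-level tokenizer (illustration only)."""

    # id -> raw bytes; id 0 is a lone UTF-8 lead byte, incomplete on its own
    vocab = {0: b"\xd0", 1: b"\xd0\xbe", 2: b"a", 3: b"\xef\xbf\xbd"}

    def decode(self, ids):
        # Invalid or incomplete byte sequences become U+FFFD
        return b"".join(self.vocab[i] for i in ids).decode("utf-8", errors="replace")

    def encode(self, text, add_special_tokens=False):
        # Longest-match segmentation over the toy vocabulary (an assumption;
        # real BPE applies learned merge rules instead)
        data, ids = text.encode("utf-8"), []
        by_len = sorted(self.vocab.items(), key=lambda kv: -len(kv[1]))
        while data:
            tid, piece = next((t, p) for t, p in by_len if data.startswith(p))
            ids.append(tid)
            data = data[len(piece):]
        return ids


def fails_round_trip(tokenizer, token_id):
    """True if decode -> encode does not give back exactly [token_id]."""
    text = tokenizer.decode([token_id])
    return tokenizer.encode(text, add_special_tokens=False) != [token_id]


toy = ToyByteTokenizer()
for tid in toy.vocab:
    print(tid, repr(toy.decode([tid])), fails_round_trip(toy, tid))
```

In this toy setup, token 0 decodes to U+FFFD, which re-encodes to token 3, so it fails the check even though it would still be emitted as part of the byte sequence for a character outside the vocabulary. That is the kind of false positive the bullets above describe.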

## Method

1. Load the 1,423 "unreachable_outside_cluster" token IDs from 1.8d
2. Decode each token
3. Classify by script, printability, content
4. Look for patterns that suggest genuine unreachability vs. test artifacts

**If these are genuinely broken tokens (mojibake, malformed Unicode), the test is probably correct.**  
**If these are plausible linguistic content, the test might be wrong.**

## Parameters

In [1]:
# Model to analyze
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Input from 1.8d
REACHABILITY_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.8d_full_vocab_reachability.safetensors"

## Imports

In [2]:
import torch
import pandas as pd
import unicodedata
from safetensors.torch import load_file
from pathlib import Path
from transformers import AutoTokenizer
from collections import Counter

## Load Halo Token IDs

In [3]:
# Load reachability data from 1.8d
data = load_file(REACHABILITY_PATH)
halo_token_ids = data['unreachable_outside_cluster']

print(f"Loaded {len(halo_token_ids):,} halo token IDs from {Path(REACHABILITY_PATH).name}")
print(f"  Token ID range: [{halo_token_ids.min().item()}, {halo_token_ids.max().item()}]")

Loaded 1,423 halo token IDs from 1.8d_full_vocab_reachability.safetensors
  Token ID range: [94, 151560]


In [4]:
# Sort by token ID (BPE natural order)
halo_token_ids_sorted = torch.sort(halo_token_ids)[0]

print(f"\nSorted token IDs:")
print(f"  First 10: {halo_token_ids_sorted[:10].tolist()}")
print(f"  Last 10: {halo_token_ids_sorted[-10:].tolist()}")


Sorted token IDs:
  First 10: [94, 95, 96, 97, 98, 99, 100, 101, 102, 103]
  Last 10: [151266, 151268, 151270, 151272, 151274, 151276, 151278, 151282, 151366, 151560]


## Load Tokenizer

In [5]:
print(f"\nLoading tokenizer from {HF_MODEL_NAME}...\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {len(tokenizer):,} tokens")


Loading tokenizer from Qwen/Qwen3-4B-Instruct-2507...

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Helper Functions

In [6]:
def classify_script(text):
    """
    Classify text by dominant script, approximated by the first word of each
    character's Unicode name. Returns an uppercase label (e.g., 'THAI', 'CJK', 'LATIN').
    """
    if not text or not text.strip():
        return 'Empty/Whitespace'
    
    # Count characters by script
    script_counts = Counter()
    
    for char in text:
        # Skip whitespace and control characters for classification
        if char.isspace() or unicodedata.category(char).startswith('C'):
            continue
            
        # Approximate the script by the first word of the Unicode character name
        try:
            script = unicodedata.name(char).split()[0]
            script_counts[script] += 1
        except (ValueError, IndexError):
            # Characters with no Unicode name: fall back to the general category
            category = unicodedata.category(char)
            if category.startswith('P'):
                script_counts['Punctuation'] += 1
            elif category.startswith('S'):
                script_counts['Symbol'] += 1
            else:
                script_counts['Other'] += 1
    
    if not script_counts:
        return 'Empty/Whitespace'
    
    # Return most common script
    return script_counts.most_common(1)[0][0]


def is_printable(text):
    """
    Check if text contains only printable characters.
    """
    if not text:
        return False
    
    for char in text:
        cat = unicodedata.category(char)
        # Allow letters, numbers, punctuation, symbols, and whitespace.
        # Reject every other 'C*' category (Cc, Cf, Cs, Co, Cn).
        # Note: U+FFFD has category 'So', so it counts as printable here.
        if cat.startswith('C') and not char.isspace():
            return False
    
    return True


def is_replacement_character(text):
    """
    Check if text consists only of Unicode replacement characters (U+FFFD).
    This indicates that the token's bytes are not valid UTF-8 on their own.
    """
    return all(c == '\ufffd' for c in text) and len(text) > 0


print("Helper functions defined:")
print("  classify_script(text) - Identify dominant Unicode script")
print("  is_printable(text) - Check if all characters are printable")
print("  is_replacement_character(text) - Check if text is U+FFFD (broken encoding)")

Helper functions defined:
  classify_script(text) - Identify dominant Unicode script
  is_printable(text) - Check if all characters are printable
  is_replacement_character(text) - Check if text is U+FFFD (broken encoding)
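
As a quick check on what these helpers measure, the sketch below prints the first word of the Unicode name and the general category for a few characters. The characters are chosen for illustration and are not taken from the halo set; `first_name_word` is a hypothetical helper that mirrors the expression used inside `classify_script`.

```python
import unicodedata


def first_name_word(char):
    """First word of the Unicode character name (the label classify_script counts)."""
    return unicodedata.name(char).split()[0]


# Illustrative characters only
samples = ["\ufffd", "о", "中", "ー", "("]

for char in samples:
    label = first_name_word(char)
    category = unicodedata.category(char)
    print(f"U+{ord(char):04X}  {label:18s} {category}  {unicodedata.name(char)}")
```

Two consequences follow. Labels are name prefixes rather than true Unicode scripts, so values such as `REPLACEMENT`, `CJK`, or `LEFT` can appear as "scripts". And because U+FFFD has category `So`, `is_printable` treats it as printable, so printability alone does not separate broken tokens from valid ones.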


## Decode All Halo Tokens

In [7]:
print(f"\nDecoding {len(halo_token_ids_sorted):,} halo tokens...\n")

# Build list of records
records = []

for token_id in halo_token_ids_sorted:
    tid = token_id.item()
    
    # Decode
    decoded = tokenizer.decode([tid])
    
    # Compute metadata
    byte_len = len(decoded.encode('utf-8'))
    printable = is_printable(decoded)
    script = classify_script(decoded)
    is_mojibake = is_replacement_character(decoded)
    
    records.append({
        'token_id': tid,
        'decoded': decoded,
        'byte_length': byte_len,
        'printable': printable,
        'script': script,
        'is_mojibake': is_mojibake,
    })

# Create dataframe
df = pd.DataFrame(records)

print(f"✓ Decoded {len(df):,} tokens")
print(f"\nDataframe shape: {df.shape}")
print(f"Columns: {list(df.columns)}")


Decoding 1,423 halo tokens...

✓ Decoded 1,423 tokens

Dataframe shape: (1423, 6)
Columns: ['token_id', 'decoded', 'byte_length', 'printable', 'script', 'is_mojibake']


## Summary Statistics

In [8]:
print("\n" + "="*60)
print("HALO TOKEN SUMMARY")
print("="*60)

print(f"\nTotal tokens: {len(df):,}")
print(f"\nByte length statistics:")
print(f"  Min: {df['byte_length'].min()}")
print(f"  Max: {df['byte_length'].max()}")
print(f"  Mean: {df['byte_length'].mean():.2f}")
print(f"  Median: {df['byte_length'].median():.0f}")

print(f"\nPrintability:")
print(f"  Printable: {df['printable'].sum():,} ({100*df['printable'].sum()/len(df):.1f}%)")
print(f"  Non-printable: {(~df['printable']).sum():,} ({100*(~df['printable']).sum()/len(df):.1f}%)")

print(f"\nMojibake (U+FFFD replacement characters):")
print(f"  Mojibake: {df['is_mojibake'].sum():,} ({100*df['is_mojibake'].sum()/len(df):.1f}%)")
print(f"  Valid: {(~df['is_mojibake']).sum():,} ({100*(~df['is_mojibake']).sum()/len(df):.1f}%)")

print(f"\nScript distribution:")
script_counts = df['script'].value_counts()
for script, count in script_counts.items():
    pct = 100 * count / len(df)
    print(f"  {script:20s}: {count:4,} ({pct:5.1f}%)")

print("\n" + "="*60)


HALO TOKEN SUMMARY

Total tokens: 1,423

Byte length statistics:
  Min: 3
  Max: 15
  Mean: 3.59
  Median: 3

Printability:
  Printable: 1,422 (99.9%)
  Non-printable: 1 (0.1%)

Mojibake (U+FFFD replacement characters):
  Mojibake: 1,119 (78.6%)
  Valid: 304 (21.4%)

Script distribution:
  REPLACEMENT         : 1,351 ( 94.9%)
  CJK                 :   18 (  1.3%)
  KATAKANA            :   11 (  0.8%)
  CYRILLIC            :    7 (  0.5%)
  HIRAGANA            :    7 (  0.5%)
  HANGUL              :    7 (  0.5%)
  DEVANAGARI          :    4 (  0.3%)
  HEBREW              :    4 (  0.3%)
  BENGALI             :    3 (  0.2%)
  KATAKANA-HIRAGANA   :    2 (  0.1%)
  LATIN               :    2 (  0.1%)
  TAMIL               :    2 (  0.1%)
  KHMER               :    2 (  0.1%)
  ARABIC              :    1 (  0.1%)
  MALAYALAM           :    1 (  0.1%)
  LEFT                :    1 (  0.1%)



## Full Token Table

In [9]:
# Display settings for full table
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 50)

print(f"\nFull table of {len(df):,} halo tokens:\n")
display(df)


Full table of 1,423 halo tokens:



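| | token_id | decoded | byte_length | printable | script | is_mojibake |
|---|---|---|---|---|---|---|
| 0 | 94 | � | 3 | True | REPLACEMENT | True |
| 1 | 95 | � | 3 | True | REPLACEMENT | True |
| 2 | 96 | � | 3 | True | REPLACEMENT | True |
| 3 | 97 | � | 3 | True | REPLACEMENT | True |
| 4 | 98 | � | 3 | True | REPLACEMENT | True |
| 5 | 99 | � | 3 | True | REPLACEMENT | True |
| 6 | 100 | � | 3 | True | REPLACEMENT | True |
| 7 | 101 | � | 3 | True | REPLACEMENT | True |
| 8 | 102 | � | 3 | True | REPLACEMENT | True |
| 9 | 103 | � | 3 | True | REPLACEMENT | True |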


## Sample by Category

Let's look at some examples from each category to understand what these tokens actually are.

In [10]:
print("\n" + "="*60)
print("SAMPLE TOKENS BY CATEGORY")
print("="*60)

# Mojibake samples
if df['is_mojibake'].any():
    print(f"\nMojibake tokens (U+FFFD replacement characters):")
    mojibake_samples = df[df['is_mojibake']].head(10)
    for _, row in mojibake_samples.iterrows():
        print(f"  Token {row['token_id']:6d}: {repr(row['decoded'])}")

# Non-mojibake but non-printable
non_printable = df[~df['is_mojibake'] & ~df['printable']]
if len(non_printable) > 0:
    print(f"\nNon-printable but not mojibake:")
    for _, row in non_printable.head(10).iterrows():
        print(f"  Token {row['token_id']:6d}: {repr(row['decoded'])} (script: {row['script']})")

# Printable samples from each script
printable = df[df['printable']]
if len(printable) > 0:
    print(f"\nPrintable tokens by script:")
    for script in printable['script'].unique()[:5]:  # First 5 scripts
        print(f"\n  {script}:")
        script_samples = printable[printable['script'] == script].head(5)
        for _, row in script_samples.iterrows():
            print(f"    Token {row['token_id']:6d}: {repr(row['decoded'])}")


SAMPLE TOKENS BY CATEGORY

Mojibake tokens (U+FFFD replacement characters):
  Token     94: '�'
  Token     95: '�'
  Token     96: '�'
  Token     97: '�'
  Token     98: '�'
  Token     99: '�'
  Token    100: '�'
  Token    101: '�'
  Token    102: '�'
  Token    103: '�'

Non-printable but not mojibake:
  Token  42476: '\x80�' (script: REPLACEMENT)

Printable tokens by script:

  REPLACEMENT:
    Token     94: '�'
    Token     95: '�'
    Token     96: '�'
    Token     97: '�'
    Token     98: '�'

  CYRILLIC:
    Token   2226: 'о�'
    Token   3648: 'а�'
    Token   4175: 'е�'
    Token  10768: 'ра�'
    Token  24212: ' ра�'

  HIRAGANA:
    Token  20472: 'し�'
    Token 125388: 'っ�'
    Token 127124: 'に�'
    Token 135078: 'ご�'
    Token 137973: 'に�'

  KATAKANA-HIRAGANA:
    Token  24041: 'ー�'
    Token  37148: 'ー�'

  LATIN:
    Token  25772: 'ư�'
    Token  46500: ' t�'
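
The samples above include tokens such as `'о�'` and `' ра�'` that mix valid characters with a trailing U+FFFD. `is_mojibake` is only true when every character is U+FFFD, so these count as "Valid" in the summary. A minimal sketch of a finer three-way split follows; `replacement_profile` is a hypothetical helper, not part of the original notebook.

```python
REPLACEMENT = "\ufffd"


def replacement_profile(text):
    """Return 'all', 'partial', or 'none' depending on how much of text is U+FFFD."""
    if not text or REPLACEMENT not in text:
        return "none"
    if all(c == REPLACEMENT for c in text):
        return "all"
    return "partial"


# Example strings modeled on the samples above, plus a clean control
examples = ["\ufffd", "о\ufffd", " ра\ufffd", "abc"]
for s in examples:
    print(repr(s), replacement_profile(s))
```

Applied to the dataframe built earlier, `df['decoded'].map(replacement_profile).value_counts()` would give the split across all halo tokens. That result is not shown here because it has not been run as part of this notebook.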


## Diagnosis

**If most tokens are mojibake (U+FFFD):**
- These are genuinely broken tokens (malformed UTF-8)
- The round-trip test is correct
- These tokens are truly unreachable

**If most tokens are valid linguistic content:**
- The round-trip test has false positives
- These tokens might be reachable in other contexts
- We need a better test for reachability

**If mixed:**
- Some tokens are genuinely broken
- Others might be test artifacts
- Need to investigate specific cases