# 1.8c: Tokenizer Reachability Test

This notebook tests whether cluster tokens are **reachable** by the tokenizer.

## The Question

In 1.8b, we found that **zero** cluster tokens appeared when tokenizing 100 Thai Wikipedia articles.

Two possible explanations:

1. **Rare but reachable**: Cluster tokens represent legitimate Thai words that just didn't happen to appear in our sample
2. **Unreachable**: Cluster tokens exist in the vocabulary but the tokenizer's BPE algorithm will never produce them

## The Round-Trip Test

For each cluster token, we:

1. **Decode**: token_id → string
2. **Re-encode**: string → token_ids
3. **Compare**: Does re-encoding produce the original token_id?

**If round-trip succeeds:**
- Token is **reachable** (tokenizer can produce it)
- Absence in Wikipedia means it's genuinely rare

**If round-trip fails:**
- Token is treated as **unreachable** (the tokenizer produces different tokens for the token's own decoded string)
- It's likely an orphaned vocabulary entry
- Tokenizer can decode it (ID→string) but won't encode to it (string→ID)
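
To make the idea of an orphaned entry concrete, here is a minimal sketch with a toy BPE tokenizer. The vocabulary, merge list, and function names are all hypothetical and chosen only for illustration; the encoder is a simplified one that applies each merge rule once in rank order, not the Qwen tokenizer's actual implementation.

```python
# Hypothetical toy vocabulary. "bc" has an ID, but no merge rule produces it.
VOCAB = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4, "bc": 5}
MERGES = [("a", "b"), ("ab", "c")]  # ranked merge rules, highest priority first
ID_TO_STR = {i: s for s, i in VOCAB.items()}


def encode(text):
    """Simplified BPE: start from characters, apply each merge in rank order."""
    symbols = list(text)
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [VOCAB[s] for s in symbols]


def decode(ids):
    """Map IDs back to their strings and concatenate."""
    return "".join(ID_TO_STR[i] for i in ids)


def round_trips(token_id):
    """True if re-encoding the decoded string yields exactly the original ID."""
    return encode(decode([token_id])) == [token_id]


print(round_trips(4))  # "abc" is built by the two merges
print(round_trips(5))  # "bc" decodes fine but re-encodes as "b" + "c"
```

In this toy setup, ID 5 behaves like an orphaned entry: it can be decoded, but no sequence of merges ever produces it.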

## Method

Test all 2,212 cluster tokens exhaustively:
- Decode each token
- Re-encode the decoded string
- Check if we get back the original token ID
- Categorize: reachable vs unreachable
- Analyze patterns

## Parameters

In [1]:
# Model and tokenizer
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Cluster tokens (from 1.4h)
CLUSTER_TOKENS_PATH = '../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors'

# Output
OUTPUT_PATH = '../tensors/Qwen3-4B-Instruct-2507/1.8c_reachability_analysis.safetensors'

## Imports

In [2]:
import torch
import numpy as np
from transformers import AutoTokenizer
from safetensors.torch import load_file, save_file
from collections import Counter, defaultdict
from tqdm import tqdm

## Load Tokenizer

In [3]:
print(f"Loading tokenizer: {HF_MODEL_NAME}\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
vocab_size = len(tokenizer)

print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {vocab_size:,} tokens")

Loading tokenizer: Qwen/Qwen3-4B-Instruct-2507

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Load Cluster Token IDs

In [4]:
print(f"\nLoading cluster token IDs...\n")
cluster_data = load_file(CLUSTER_TOKENS_PATH)
cluster_token_ids = cluster_data['cluster_token_ids'].tolist()

print(f"✓ Loaded {len(cluster_token_ids):,} cluster token IDs")
print(f"  Token ID range: [{min(cluster_token_ids)}, {max(cluster_token_ids)}]")


Loading cluster token IDs...

✓ Loaded 2,212 cluster token IDs
  Token ID range: [124, 151935]


## Round-Trip Test

In [5]:
print(f"\n{'='*70}")
print("ROUND-TRIP REACHABILITY TEST")
print(f"{'='*70}\n")

print(f"Testing {len(cluster_token_ids):,} cluster tokens...\n")

# Track results
reachable = []  # Tokens where round-trip succeeds
unreachable = []  # Tokens where round-trip fails
unreachable_details = []  # (token_id, decoded_string, reencoded_ids)

for token_id in tqdm(cluster_token_ids, desc="Testing reachability"):
    # Decode: token ID → string
    decoded_string = tokenizer.decode([token_id])
    
    # Re-encode: string → token IDs (no special tokens)
    reencoded_ids = tokenizer.encode(decoded_string, add_special_tokens=False)
    
    # Check round-trip
    if reencoded_ids == [token_id]:
        # SUCCESS: tokenizer produces this token
        reachable.append(token_id)
    else:
        # FAILURE: tokenizer produces different tokens
        unreachable.append(token_id)
        unreachable_details.append((token_id, decoded_string, reencoded_ids))

print(f"\n✓ Round-trip test complete\n")

print(f"Results:")
print(f"  Reachable tokens: {len(reachable):,} ({100*len(reachable)/len(cluster_token_ids):.1f}%)")
print(f"  Unreachable tokens: {len(unreachable):,} ({100*len(unreachable)/len(cluster_token_ids):.1f}%)")


ROUND-TRIP REACHABILITY TEST

Testing 2,212 cluster tokens...



Testing reachability: 100%|██████████| 2212/2212 [00:00<00:00, 60023.68it/s]


✓ Round-trip test complete

Results:
  Reachable tokens: 11 (0.5%)
  Unreachable tokens: 2,201 (99.5%)





## Analysis: Unreachable Tokens

In [6]:
if unreachable:
    print(f"\n{'='*70}")
    print("UNREACHABLE TOKEN ANALYSIS")
    print(f"{'='*70}\n")
    
    print(f"Found {len(unreachable):,} unreachable tokens\n")
    
    # Analyze re-encoding patterns
    reencoding_lengths = Counter()
    for token_id, decoded, reencoded in unreachable_details:
        reencoding_lengths[len(reencoded)] += 1
    
    print(f"Re-encoding patterns:")
    print(f"  (Original token is length 1, re-encoded as N tokens)\n")
    for length in sorted(reencoding_lengths.keys()):
        count = reencoding_lengths[length]
        pct = 100 * count / len(unreachable)
        print(f"    Re-encoded as {length} tokens: {count:,} ({pct:.1f}%)")
    
    # Show examples
    print(f"\nExample unreachable tokens (first 20):\n")
    print(f"  {'Token ID':>8} | {'Decoded String':<30} | Re-encoded IDs")
    print(f"  {'-'*8}-+-{'-'*30}-+{'-'*30}")
    
    for token_id, decoded, reencoded in unreachable_details[:20]:
        # Truncate decoded string so it fits the 30-character column
        decoded_display = decoded[:27] + '...' if len(decoded) > 30 else decoded
        reencoded_display = str(reencoded[:5]) + '...' if len(reencoded) > 5 else str(reencoded)
        print(f"  {token_id:8d} | {decoded_display:<30} | {reencoded_display}")
    
    if len(unreachable_details) > 20:
        print(f"  ... ({len(unreachable_details) - 20} more)")

else:
    print(f"\n{'='*70}")
    print("ALL TOKENS ARE REACHABLE")
    print(f"{'='*70}\n")
    
    print(f"✓ All {len(cluster_token_ids):,} cluster tokens passed round-trip test")
    print(f"\n  Tokenizer CAN produce these tokens")
    print(f"  Their absence in Wikipedia means they're genuinely rare/unused")


UNREACHABLE TOKEN ANALYSIS

Found 2,201 unreachable tokens

Re-encoding patterns:
  (Original token is length 1, re-encoded as N tokens)

    Re-encoded as 0 tokens: 267 (12.1%)
    Re-encoded as 1 tokens: 208 (9.5%)
    Re-encoded as 2 tokens: 713 (32.4%)
    Re-encoded as 3 tokens: 495 (22.5%)
    Re-encoded as 4 tokens: 331 (15.0%)
    Re-encoded as 5 tokens: 118 (5.4%)
    Re-encoded as 6 tokens: 38 (1.7%)
    Re-encoded as 7 tokens: 22 (1.0%)
    Re-encoded as 8 tokens: 8 (0.4%)
    Re-encoded as 9 tokens: 1 (0.0%)

Example unreachable tokens (first 20):

  Token ID | Decoded String                 | Re-encoded IDs
  ---------+--------------------------------+------------------------------
       124 | �                              | [5691]
       125 | �                              | [5691]
       177 | �                              | [5691]
       178 | �                              | [5691]
       179 | �                              | [5691]
       180 | �                
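
Several of the rows above decode to `�` (U+FFFD, the Unicode replacement character) and re-encode to the same ID. One possible explanation, assuming these IDs are byte-level tokens, is that each holds a byte or byte fragment that is not valid UTF-8 on its own, so decoding is lossy. If so, the round-trip failure for these tokens reflects the decode step, and whether they are reachable inside longer text is a separate question. The sketch below uses only Python's built-in UTF-8 codec, not the tokenizer, to show how distinct byte fragments collapse to the same string.

```python
# UTF-8 encoding of the Thai letter KO KAI (U+0E01): three bytes.
full = "\u0e01".encode("utf-8")

# A lead byte alone, a continuation byte alone, and a byte that never
# appears in valid UTF-8 (0xC0).
fragments = [full[:1], full[1:2], b"\xc0"]

# With errors="replace", undecodable input becomes U+FFFD instead of raising.
decoded = [frag.decode("utf-8", errors="replace") for frag in fragments]

# Every fragment collapses to the same one-character string, so the
# original byte cannot be recovered from the decoded text.
all_replacement = all(text == "\ufffd" for text in decoded)
print(all_replacement)

# The complete three-byte sequence decodes to the Thai letter as expected.
print(full.decode("utf-8") == "\u0e01")
```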

## Analysis: Reachable Tokens

In [7]:
if reachable:
    print(f"\n{'='*70}")
    print("REACHABLE TOKEN ANALYSIS")
    print(f"{'='*70}\n")
    
    print(f"Found {len(reachable):,} reachable tokens\n")
    
    # Show examples
    print(f"Example reachable tokens (first 20):\n")
    print(f"  {'Token ID':>8} | Decoded String")
    print(f"  {'-'*8}-+{'-'*40}")
    
    for token_id in reachable[:20]:
        decoded = tokenizer.decode([token_id])
        decoded_display = decoded[:37] + '...' if len(decoded) > 40 else decoded
        print(f"  {token_id:8d} | {decoded_display}")
    
    if len(reachable) > 20:
        print(f"  ... ({len(reachable) - 20} more)")
    
    print(f"\nInterpretation:")
    print(f"  These tokens CAN be produced by the tokenizer")
    print(f"  They didn't appear in 100 Wikipedia articles (1.8b)")
    print(f"  → Either extremely rare OR not in Wikipedia's Thai vocabulary")


REACHABLE TOKEN ANALYSIS

Found 11 reachable tokens

Example reachable tokens (first 20):

  Token ID | Decoded String
  ---------+----------------------------------------
     83971 | $PostalCodesNL
    151646 | <|object_ref_start|>
    151647 | <|object_ref_end|>
    151648 | <|box_start|>
    151649 | <|box_end|>
    151650 | <|quad_start|>
    151651 | <|quad_end|>
    151654 | <|vision_pad|>
    151655 | <|image_pad|>
    151656 | <|video_pad|>
    151662 | <|fim_pad|>

Interpretation:
  These tokens CAN be produced by the tokenizer
  They didn't appear in 100 Wikipedia articles (1.8b)
  → Either extremely rare OR not in Wikipedia's Thai vocabulary


## Detailed Breakdown by Category

In [8]:
print(f"\n{'='*70}")
print("DETAILED BREAKDOWN")
print(f"{'='*70}\n")

# Categorize tokens by decoded content
categories = defaultdict(list)
reachable_set = set(reachable)  # set gives constant-time membership checks

for token_id in cluster_token_ids:
    decoded = tokenizer.decode([token_id])
    is_reachable = token_id in reachable_set
    
    # Categorize by content
    if not decoded or decoded.isspace():
        category = 'Empty/Whitespace'
    elif all(ord(c) < 128 for c in decoded):
        category = 'ASCII'
    elif any('\u0e00' <= c <= '\u0e7f' for c in decoded):
        category = 'Thai'
    elif any('\u4e00' <= c <= '\u9fff' for c in decoded):
        category = 'CJK'
    else:
        category = 'Other'
    
    categories[category].append((token_id, is_reachable))

print(f"Reachability by content category:\n")
print(f"  {'Category':<20} | {'Total':>6} | {'Reachable':>10} | {'Unreachable':>12} | % Unreachable")
print(f"  {'-'*20}-+-{'-'*6}-+-{'-'*10}-+-{'-'*12}-+{'-'*14}")

for category in sorted(categories.keys()):
    tokens = categories[category]
    total = len(tokens)
    n_reachable = sum(1 for _, is_reach in tokens if is_reach)
    n_unreachable = total - n_reachable
    pct_unreachable = 100 * n_unreachable / total if total > 0 else 0
    
    print(f"  {category:<20} | {total:6d} | {n_reachable:10d} | {n_unreachable:12d} | {pct_unreachable:6.1f}%")

print(f"\n  {'TOTAL':<20} | {len(cluster_token_ids):6d} | {len(reachable):10d} | {len(unreachable):12d} | {100*len(unreachable)/len(cluster_token_ids):6.1f}%")


DETAILED BREAKDOWN

Reachability by content category:

  Category             |  Total |  Reachable |  Unreachable | % Unreachable
  ---------------------+--------+------------+--------------+--------------
  ASCII                |     15 |         11 |            4 |   26.7%
  CJK                  |      4 |          0 |            4 |  100.0%
  Empty/Whitespace     |    267 |          0 |          267 |  100.0%
  Other                |    347 |          0 |          347 |  100.0%
  Thai                 |   1579 |          0 |         1579 |  100.0%

  TOTAL                |   2212 |         11 |         2201 |   99.5%
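
The 267 Empty/Whitespace tokens may have a simple explanation. The cluster IDs run up to 151,935 while `len(tokenizer)` is 151,669, so at most 267 cluster IDs can lie at or above the tokenizer's vocabulary size, which is the same number as the Empty/Whitespace row. The sketch below only works through that arithmetic. `count_ids_beyond_vocab` is a hypothetical helper, and the idea that all 267 such IDs are cluster tokens with no tokenizer entry is an assumption that has not been checked here.

```python
def count_ids_beyond_vocab(token_ids, vocab_size):
    """Count IDs at or above vocab_size, i.e. with no tokenizer entry."""
    return sum(1 for t in token_ids if t >= vocab_size)


TOKENIZER_VOCAB_SIZE = 151_669  # len(tokenizer), from the Load Tokenizer output
MAX_CLUSTER_ID = 151_935        # upper end of the reported cluster ID range

# Upper bound: every integer ID from the vocabulary size through the maximum ID.
candidate_ids = range(TOKENIZER_VOCAB_SIZE, MAX_CLUSTER_ID + 1)
upper_bound = count_ids_beyond_vocab(candidate_ids, TOKENIZER_VOCAB_SIZE)
print(upper_bound)  # 267
```

In the notebook itself, `count_ids_beyond_vocab(cluster_token_ids, len(tokenizer))` would give the actual count to compare against 267.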


## Save Results

In [9]:
print(f"\n{'='*70}")
print("SAVING RESULTS")
print(f"{'='*70}\n")

# Prepare data for saving
reachable_tensor = torch.tensor(sorted(reachable), dtype=torch.int64)
unreachable_tensor = torch.tensor(sorted(unreachable), dtype=torch.int64)

save_file({
    'reachable_token_ids': reachable_tensor,
    'unreachable_token_ids': unreachable_tensor,
}, OUTPUT_PATH)

print(f"✓ Saved reachability analysis to {OUTPUT_PATH}")
print(f"\nSaved data:")
print(f"  reachable_token_ids: {len(reachable):,} tokens")
print(f"  unreachable_token_ids: {len(unreachable):,} tokens")


SAVING RESULTS

✓ Saved reachability analysis to ../tensors/Qwen3-4B-Instruct-2507/1.8c_reachability_analysis.safetensors

Saved data:
  reachable_token_ids: 11 tokens
  unreachable_token_ids: 2,201 tokens


## Summary

This notebook tested whether cluster tokens are reachable by the tokenizer's BPE algorithm.

**Method:**
- For each cluster token: decode(token_id) → re-encode(string)
- Check if re-encoding produces original token_id
- Classify as reachable (round-trip succeeds) or unreachable (fails)

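**Results:**
- Reachable: 11 of 2,212 (0.5%), all ASCII: `$PostalCodesNL` plus ten special tokens such as `<|box_start|>` and `<|image_pad|>`
- Unreachable: 2,201 of 2,212 (99.5%), including all 1,579 Thai tokens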

**Interpretation:**

With 99.5% of cluster tokens failing the round trip, the result matches the **unreachable** case:
- These are **orphaned vocabulary entries**
- Tokenizer can decode them (ID→string) but does not produce them from their decoded strings (string→ID)
- They exist in the model's embedding matrix but are inaccessible via tokenization
- Explains why they don't appear in Wikipedia (tokenizer can't produce them)
- **Likely root cause (not yet verified):** Vocabulary and BPE merge rules are out of sync

**Caveat:** For tokens whose decoded string is `�` or empty, the failure may come from a lossy decode rather than from the encoder, so these need a separate check.

**The 11 reachable tokens:**
- Tokenizer CAN produce these tokens
- None of them is Thai, so their absence from the 100 Wikipedia articles (1.8b test) is unsurprising

Had most tokens been reachable instead, their absence from 1.8b would have pointed to one of two possibilities:
  1. Extremely rare words (1-in-a-million)
  2. Valid Thai that's not in Wikipedia's vocabulary (archaic, technical, dialectal)

**Next steps:**
- Investigate why vocabulary and BPE merge rules diverged for the 2,201 unreachable tokens
- Test a larger Thai corpus to confirm that the unreachable tokens never appear in practice
- No Thai token was reachable, so assess the decoded strings of the 1,579 unreachable Thai tokens for legitimacy (real words vs garbage)
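
Before investigating further, the round-trip failures could be split by likely cause. The following is a minimal sketch under stated assumptions: `classify_round_trip` and its category names are hypothetical, and the toy `decode`/`encode` functions stand in for the real tokenizer. With the notebook's tokenizer, the callables would be `tokenizer.decode` and `lambda s: tokenizer.encode(s, add_special_tokens=False)`.

```python
def classify_round_trip(token_id, decode, encode, vocab_size):
    """Assign one token ID to a coarse round-trip category (hypothetical names)."""
    if token_id >= vocab_size:
        return "no_tokenizer_entry"      # ID is outside the tokenizer's vocabulary
    text = decode([token_id])
    if not text or "\ufffd" in text:
        return "lossy_decode"            # decoded string lost the original bytes
    if encode(text) == [token_id]:
        return "reachable"               # round trip succeeds
    return "re_encodes_differently"      # clean string, but other tokens come back


# Toy stand-ins for a tokenizer, for illustration only.
TOY_STRINGS = {0: "a", 1: "b", 2: "ab", 3: "\ufffd", 4: "ba"}
TOY_LOOKUP = {"a": 0, "b": 1, "ab": 2}


def toy_decode(ids):
    return "".join(TOY_STRINGS.get(i, "") for i in ids)


def toy_encode(text):
    # Whole-string lookup first, then fall back to single characters.
    if text in TOY_LOOKUP:
        return [TOY_LOOKUP[text]]
    return [TOY_LOOKUP[ch] for ch in text if ch in TOY_LOOKUP]


labels = {
    tid: classify_round_trip(tid, toy_decode, toy_encode, vocab_size=5)
    for tid in [2, 3, 4, 7]
}
print(labels)
```

Only the `re_encodes_differently` group would point directly at a mismatch between the vocabulary and the merge rules; the other failure groups would need a different kind of check.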