# 1.8d: Full Vocabulary Reachability Scan

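This notebook tests every token ID known to the Qwen3 tokenizer for reachability. The vocabulary is nominally **151,936 tokens**, but `len(tokenizer)` reports 151,669, so the scan covers IDs 0 through 151,668.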

## The Question

In 1.8c, we found that **2,201 out of 2,212 cluster tokens are unreachable** by the tokenizer.

Now we ask: **Are there unreachable tokens OUTSIDE the cluster?**

## Two Competing Hypotheses

**Hypothesis 1 (Jeffery's prediction):**
- Unreachable tokens ⊂ Cluster tokens
- All unreachable tokens are in the cluster
- The cluster is DEFINED by unreachability (+ 11 rare-but-reachable tokens)
- Unreachable → no training signal → geometric collapse

**Hypothesis 2 (Alpha's prediction):**
- Unreachable tokens ⊃ Cluster tokens
- There are unreachable tokens scattered throughout the vocabulary
- Cluster = subset of unreachable tokens that ALSO collapsed geometrically
- Unreachability is necessary but not sufficient for collapse
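
The two hypotheses are claims about how two sets of token IDs relate, so they can be told apart with Python's built-in set comparisons. The following is a minimal sketch; `classify_relation` and the toy IDs are made up for illustration and are not part of the notebook.

```python
def classify_relation(unreachable: set[int], cluster: set[int]) -> str:
    """Name the set relationship between unreachable and cluster token IDs."""
    if unreachable == cluster:
        return "equal"
    if unreachable < cluster:   # proper subset: what Hypothesis 1 predicts
        return "subset"
    if unreachable > cluster:   # proper superset: what Hypothesis 2 predicts
        return "superset"
    if unreachable.isdisjoint(cluster):
        return "disjoint"
    return "partial overlap"    # neither hypothesis holds strictly


# Toy token IDs, invented for illustration only.
print(classify_relation({1, 2}, {1, 2, 3}))
print(classify_relation({1, 2, 3, 4}, {1, 2, 3}))
print(classify_relation({1, 2, 9}, {1, 2, 3}))
```

Note that "partial overlap" is a possible outcome that neither hypothesis predicts, so a test for one hypothesis failing does not by itself confirm the other.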

## Method

Test every token ID in `range(len(tokenizer))` exhaustively:
1. For each token: decode → re-encode → compare
2. Track reachable vs unreachable
3. Compare unreachable tokens to cluster tokens
4. Check: unreachable tokens = cluster tokens?
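
To make the round-trip idea concrete without downloading a model, here is a self-contained sketch using a toy tokenizer. `ToyByteTokenizer` and `is_reachable` are hypothetical names, and the greedy longest-match encoder is a simplification standing in for real BPE merges; it is not how the Qwen3 tokenizer works internally.

```python
class ToyByteTokenizer:
    """Hypothetical byte-level tokenizer, not the Qwen3 one.

    IDs 0-255 are single raw bytes; further IDs are multi-byte pieces.
    Encoding is greedy longest-match, a stand-in for real BPE merges.
    """

    def __init__(self, extra_pieces: list[bytes]):
        self.pieces = [bytes([b]) for b in range(256)] + list(extra_pieces)
        self.lookup = {piece: i for i, piece in enumerate(self.pieces)}
        self.max_len = max(len(piece) for piece in self.pieces)

    def decode(self, ids: list[int]) -> str:
        raw = b"".join(self.pieces[i] for i in ids)
        # Invalid UTF-8 becomes U+FFFD (the replacement character).
        return raw.decode("utf-8", errors="replace")

    def encode(self, text: str) -> list[int]:
        data, ids, pos = text.encode("utf-8"), [], 0
        while pos < len(data):
            # Every single byte is in the vocabulary, so a match always exists.
            for length in range(min(self.max_len, len(data) - pos), 0, -1):
                piece = data[pos:pos + length]
                if piece in self.lookup:
                    ids.append(self.lookup[piece])
                    pos += length
                    break
        return ids


def is_reachable(tokenizer, token_id: int) -> bool:
    """Round-trip test: decode one ID, re-encode, compare."""
    return tokenizer.encode(tokenizer.decode([token_id])) == [token_id]


toy = ToyByteTokenizer(extra_pieces=["é".encode("utf-8")])  # gets ID 256
unreachable = [i for i in range(len(toy.pieces)) if not is_reachable(toy, i)]
print(len(unreachable), "of", len(toy.pieces), "toy tokens fail the round trip")
```

In this toy, the single-byte IDs above 0x7F fail because a lone high byte is not valid UTF-8: it decodes to U+FFFD, which re-encodes to three different bytes. A string-based round trip can therefore flag byte-fragment tokens as unreachable even when they could appear inside a longer byte sequence.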

## Expected Runtime

A few seconds. The original estimate was ~30-60 seconds, but the run below finishes in about 2 seconds (roughly 74,000 tokens/s).

## Parameters

In [1]:
# Model and tokenizer
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Cluster tokens (from 1.4h)
CLUSTER_TOKENS_PATH = '../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors'

# Output
OUTPUT_PATH = '../tensors/Qwen3-4B-Instruct-2507/1.8d_full_vocab_reachability.safetensors'

## Imports

In [2]:
import torch
import numpy as np
from transformers import AutoTokenizer
from safetensors.torch import load_file, save_file
from tqdm import tqdm

## Load Tokenizer

In [3]:
print(f"Loading tokenizer: {HF_MODEL_NAME}\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
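# len(tokenizer) counts the IDs the tokenizer knows about; it can be smaller
# than the number of embedding rows, so IDs above this range are not scanned.
vocab_size = len(tokenizer)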

print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {vocab_size:,} tokens")

Loading tokenizer: Qwen/Qwen3-4B-Instruct-2507

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Load Cluster Token IDs

In [4]:
print(f"\nLoading cluster token IDs...\n")
cluster_data = load_file(CLUSTER_TOKENS_PATH)
cluster_token_ids = set(cluster_data['cluster_token_ids'].tolist())

print(f"✓ Loaded {len(cluster_token_ids):,} cluster token IDs")
print(f"  Token ID range: [{min(cluster_token_ids)}, {max(cluster_token_ids)}]")


Loading cluster token IDs...

✓ Loaded 2,212 cluster token IDs
  Token ID range: [124, 151935]


## Exhaustive Round-Trip Test

In [5]:
print(f"\n{'='*70}")
print("EXHAUSTIVE VOCABULARY REACHABILITY SCAN")
print(f"{'='*70}\n")

print(f"Testing all {vocab_size:,} tokens in vocabulary...\n")

# Track results
reachable = []
unreachable = []

for token_id in tqdm(range(vocab_size), desc="Testing reachability"):
    # Decode: token ID → string
    decoded_string = tokenizer.decode([token_id])
    
    # Re-encode: string → token IDs (no special tokens)
    reencoded_ids = tokenizer.encode(decoded_string, add_special_tokens=False)
    
    # Check round-trip
    if reencoded_ids == [token_id]:
        reachable.append(token_id)
    else:
        unreachable.append(token_id)

print(f"\n✓ Scan complete\n")

print(f"Full vocabulary results:")
print(f"  Total tokens: {vocab_size:,}")
print(f"  Reachable: {len(reachable):,} ({100*len(reachable)/vocab_size:.2f}%)")
print(f"  Unreachable: {len(unreachable):,} ({100*len(unreachable)/vocab_size:.2f}%)")


EXHAUSTIVE VOCABULARY REACHABILITY SCAN

Testing all 151,669 tokens in vocabulary...



Testing reachability: 100%|██████████| 151669/151669 [00:02<00:00, 74041.91it/s]


✓ Scan complete

Full vocabulary results:
  Total tokens: 151,669
  Reachable: 148,312 (97.79%)
  Unreachable: 3,357 (2.21%)





## Compare Unreachable Tokens to Cluster

In [11]:
print(f"\n{'='*70}")
print("COMPARING UNREACHABLE TOKENS TO CLUSTER")
print(f"{'='*70}\n")

unreachable_set = set(unreachable)

# Compute overlaps
unreachable_in_cluster = unreachable_set & cluster_token_ids
unreachable_outside_cluster = unreachable_set - cluster_token_ids
cluster_but_reachable = cluster_token_ids - unreachable_set

print(f"Set analysis:")
print(f"  Cluster tokens: {len(cluster_token_ids):,}")
print(f"  Unreachable tokens (full vocab): {len(unreachable_set):,}")
print(f"\nOverlaps:")
print(f"  Unreachable ∩ Cluster: {len(unreachable_in_cluster):,}")
print(f"  Unreachable \\ Cluster: {len(unreachable_outside_cluster):,}")
print(f"  Cluster \\ Unreachable: {len(cluster_but_reachable):,}")

# Percentages
pct_cluster_unreachable = 100 * len(unreachable_in_cluster) / len(cluster_token_ids)
pct_unreachable_in_cluster = 100 * len(unreachable_in_cluster) / len(unreachable_set)

print(f"\nPercentages:")
print(f"  {pct_cluster_unreachable:.1f}% of cluster tokens are unreachable")
print(f"  {pct_unreachable_in_cluster:.1f}% of unreachable tokens are in cluster")


COMPARING UNREACHABLE TOKENS TO CLUSTER

Set analysis:
  Cluster tokens: 2,212
  Unreachable tokens (full vocab): 3,357

Overlaps:
  Unreachable ∩ Cluster: 1,934
  Unreachable \ Cluster: 1,423
  Cluster \ Unreachable: 278

Percentages:
  87.4% of cluster tokens are unreachable
  57.6% of unreachable tokens are in cluster
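
As a sanity check on the figures above, the three overlap counts should partition both sets exactly. The sketch below copies the counts from the printed output; the variable names are chosen for this example only.

```python
# Counts copied from the output above.
n_cluster = 2_212
n_unreachable = 3_357
n_both = 1_934               # unreachable and in the cluster
n_unreachable_only = 1_423   # unreachable, outside the cluster
n_cluster_only = 278         # in the cluster, not flagged unreachable

# Each set splits into its shared part and its exclusive part.
assert n_both + n_unreachable_only == n_unreachable
assert n_both + n_cluster_only == n_cluster

pct_cluster_unreachable = 100 * n_both / n_cluster
pct_unreachable_in_cluster = 100 * n_both / n_unreachable
print(f"{pct_cluster_unreachable:.1f}% of cluster tokens are unreachable")
print(f"{pct_unreachable_in_cluster:.1f}% of unreachable tokens are in cluster")
```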


## Test Hypotheses

In [12]:
print(f"\n{'='*70}")
print("HYPOTHESIS TESTING")
print(f"{'='*70}\n")

print(f"Hypothesis 1 (Jeffery): Unreachable tokens ⊂ Cluster tokens")
print(f"  Prediction: All unreachable tokens are in cluster")
print(f"  Test: len(Unreachable \\ Cluster) == 0")
print(f"  Result: {len(unreachable_outside_cluster)} unreachable tokens outside cluster")

if len(unreachable_outside_cluster) == 0:
    print(f"  ✓ SUPPORTED: All unreachable tokens are in cluster")
else:
    print(f"  ✗ REJECTED: Found {len(unreachable_outside_cluster):,} unreachable tokens outside cluster")

print(f"\nHypothesis 2 (Alpha): Unreachable tokens ⊃ Cluster tokens")
print(f"  Prediction: Some unreachable tokens outside cluster")
print(f"  Test: len(Unreachable \\ Cluster) > 0")
print(f"  Result: {len(unreachable_outside_cluster)} unreachable tokens outside cluster")

if len(unreachable_outside_cluster) > 0:
    print(f"  ✓ SUPPORTED: Unreachability extends beyond cluster")
else:
    print(f"  ✗ REJECTED: No unreachable tokens outside cluster")

# Perfect match test
print(f"\nPerfect match test:")
print(f"  Unreachable == Cluster? {unreachable_set == cluster_token_ids}")

if unreachable_set == cluster_token_ids:
    print(f"  ✓ PERFECT MATCH: Cluster is exactly the set of unreachable tokens")
elif len(unreachable_outside_cluster) == 0 and len(cluster_but_reachable) > 0:
    print(f"  SUBSET: All unreachable tokens in cluster, plus {len(cluster_but_reachable)} reachable cluster tokens")
elif len(unreachable_outside_cluster) > 0 and len(cluster_but_reachable) == 0:
    print(f"  SUPERSET: All cluster tokens unreachable, plus {len(unreachable_outside_cluster)} unreachable non-cluster tokens")
else:
    print(f"  PARTIAL OVERLAP: Complex relationship")


HYPOTHESIS TESTING

Hypothesis 1 (Jeffery): Unreachable tokens ⊂ Cluster tokens
  Prediction: All unreachable tokens are in cluster
  Test: len(Unreachable \ Cluster) == 0
  Result: 1423 unreachable tokens outside cluster
  ✗ REJECTED: Found 1,423 unreachable tokens outside cluster

Hypothesis 2 (Alpha): Unreachable tokens ⊃ Cluster tokens
  Prediction: Some unreachable tokens outside cluster
  Test: len(Unreachable \ Cluster) > 0
  Result: 1423 unreachable tokens outside cluster
  ✓ SUPPORTED: Unreachability extends beyond cluster

Perfect match test:
  Unreachable == Cluster? False
  PARTIAL OVERLAP: Complex relationship




## Unreachable Tokens Outside Cluster (if any)

In [13]:
if unreachable_outside_cluster:
    print(f"\n{'='*70}")
    print("UNREACHABLE TOKENS OUTSIDE CLUSTER")
    print(f"{'='*70}\n")
    
    print(f"Found {len(unreachable_outside_cluster):,} unreachable tokens NOT in cluster\n")
    
    # Show examples
    print(f"Examples (first 20):\n")
    print(f"  {'Token ID':>8} | Decoded String")
    print(f"  {'-'*8}-+{'-'*50}")
    
    for token_id in sorted(unreachable_outside_cluster)[:20]:
        decoded = tokenizer.decode([token_id])
        decoded_display = decoded[:48] + '...' if len(decoded) > 50 else decoded
        print(f"  {token_id:8d} | {decoded_display}")
    
    if len(unreachable_outside_cluster) > 20:
        print(f"  ... ({len(unreachable_outside_cluster) - 20} more)")
    
    print(f"\nImplication:")
    print(f"  Unreachability is a VOCABULARY-WIDE problem")
    print(f"  Cluster = subset of unreachable tokens that collapsed geometrically")
    print(f"  Some unreachable tokens have normal embeddings (didn't collapse)")

else:
    print(f"\n{'='*70}")
    print("ALL UNREACHABLE TOKENS ARE IN CLUSTER")
    print(f"{'='*70}\n")
    
    print(f"✓ Zero unreachable tokens found outside cluster")
    print(f"\nImplication:")
    print(f"  Unreachable tokens = Cluster tokens (± reachable cluster tokens)")
    print(f"  Unreachability → geometric collapse (strong correlation)")
    print(f"  The cluster is DEFINED by unreachability")


UNREACHABLE TOKENS OUTSIDE CLUSTER

Found 1,423 unreachable tokens NOT in cluster

Examples (first 20):

  Token ID | Decoded String
  ---------+--------------------------------------------------
        94 | �
        95 | �
        96 | �
        97 | �
        98 | �
        99 | �
       100 | �
       101 | �
       102 | �
       103 | �
       104 | �
       105 | �
       106 | �
       107 | �
       108 | �
       109 | �
       110 | �
       111 | �
       112 | �
       113 | �
  ... (1403 more)

Implication:
  Unreachability is a VOCABULARY-WIDE problem
  Cluster = subset of unreachable tokens that collapsed geometrically
  Some unreachable tokens have normal embeddings (didn't collapse)
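
The first 20 examples all decode to the replacement character, which hints that a good share of the 1,423 tokens may be byte fragments rather than ordinary strings. One way to check is to bucket the decoded strings by kind. This is a sketch only: `categorize_decoded` is a hypothetical helper and the sample strings are invented.

```python
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"  # what invalid UTF-8 decodes to


def categorize_decoded(decoded: str) -> str:
    """Hypothetical helper: coarse label for a token's decoded string."""
    if decoded == "":
        return "empty"
    if REPLACEMENT_CHAR in decoded:
        return "invalid UTF-8 (byte fragment)"
    if decoded.strip() == "":
        return "whitespace only"
    return "other"


# Invented sample strings; in the notebook these would come from
# tokenizer.decode([token_id]) for each ID in unreachable_outside_cluster.
samples = ["\ufffd", "\ufffd", "", "   ", " hello"]
counts = Counter(categorize_decoded(s) for s in samples)
for label, n in counts.most_common():
    print(f"{n:4d}  {label}")
```

If most of the non-cluster unreachable tokens fall into the byte-fragment bucket, their round-trip failure says more about string decoding than about whether the tokenizer can ever emit them.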


## Reachable Cluster Tokens (if any)

In [14]:
if cluster_but_reachable:
    print(f"\n{'='*70}")
    print("REACHABLE CLUSTER TOKENS")
    print(f"{'='*70}\n")
    
    print(f"Found {len(cluster_but_reachable):,} cluster tokens that ARE reachable\n")
    
    # Show all (should be small)
    print(f"All reachable cluster tokens:\n")
    print(f"  {'Token ID':>8} | Decoded String")
    print(f"  {'-'*8}-+{'-'*50}")
    
    for token_id in sorted(cluster_but_reachable):
        decoded = tokenizer.decode([token_id])
        decoded_display = decoded[:48] + '...' if len(decoded) > 50 else decoded
        print(f"  {token_id:8d} | {decoded_display}")
    
    print(f"\nImplication:")
    print(f"  These tokens CAN be produced by tokenizer")
    print(f"  But still ended up in geometric cluster")
    print(f"  → Reachable but never appeared in training data")
    print(f"  → Received no differential gradients → collapsed with unreachable tokens")


REACHABLE CLUSTER TOKENS

Found 278 cluster tokens that ARE reachable

All reachable cluster tokens:

  Token ID | Decoded String
  ---------+--------------------------------------------------
     83971 | $PostalCodesNL
    151646 | <|object_ref_start|>
    151647 | <|object_ref_end|>
    151648 | <|box_start|>
    151649 | <|box_end|>
    151650 | <|quad_start|>
    151651 | <|quad_end|>
    151654 | <|vision_pad|>
    151655 | <|image_pad|>
    151656 | <|video_pad|>
    151662 | <|fim_pad|>
    151669 | 
    151670 | 
    151671 | 
    151672 | 
    151673 | 
    151674 | 
    151675 | 
    151676 | 
    151677 | 
    151678 | 
    151679 | 
    151680 | 
    151681 | 
    151682 | 
    151683 | 
    151684 | 
    151685 | 
    151686 | 
    151687 | 
    151688 | 
    151689 | 
    151690 | 
    151691 | 
    151692 | 
    151693 | 
    151694 | 
    151695 | 
    151696 | 
    151697 | 
    151698 | 
    151699 | 
    151700 | 
    151701 | 
    151702 | 
    151703 | 
    15170
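
The listing above includes IDs from 151669 upward with empty decoded strings. The scan loop iterates over `range(vocab_size)` with `vocab_size = 151,669`, so those IDs were never round-trip tested; they land in `cluster_but_reachable` only because they are absent from the unreachable set. A minimal sketch for separating the two groups follows, where `split_by_scan_coverage` is a hypothetical helper and the IDs are a small illustrative sample.

```python
def split_by_scan_coverage(
    token_ids: set[int], n_scanned: int
) -> tuple[set[int], set[int]]:
    """Hypothetical helper: separate IDs the scan visited from IDs it never saw.

    The scan loop visits range(n_scanned), so any ID >= n_scanned was never
    round-trip tested and should not be counted as reachable.
    """
    scanned = {t for t in token_ids if t < n_scanned}
    never_scanned = token_ids - scanned
    return scanned, never_scanned


# Small illustrative sample; in the notebook the arguments would be
# cluster_but_reachable and vocab_size (151,669 in the run above).
tested, untested = split_by_scan_coverage({83971, 151646, 151669, 151670}, 151669)
print(f"tested and reachable: {sorted(tested)}")
print(f"never tested:         {sorted(untested)}")
```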

## Save Results

In [15]:
print(f"\n{'='*70}")
print("SAVING RESULTS")
print(f"{'='*70}\n")

# Prepare data for saving
save_file({
    'all_reachable': torch.tensor(sorted(reachable), dtype=torch.int64),
    'all_unreachable': torch.tensor(sorted(unreachable), dtype=torch.int64),
    'unreachable_in_cluster': torch.tensor(sorted(unreachable_in_cluster), dtype=torch.int64),
    'unreachable_outside_cluster': torch.tensor(sorted(unreachable_outside_cluster), dtype=torch.int64),
    'cluster_but_reachable': torch.tensor(sorted(cluster_but_reachable), dtype=torch.int64),
}, OUTPUT_PATH)

print(f"✓ Saved full vocabulary reachability analysis to {OUTPUT_PATH}")
print(f"\nSaved data:")
print(f"  all_reachable: {len(reachable):,} tokens")
print(f"  all_unreachable: {len(unreachable):,} tokens")
print(f"  unreachable_in_cluster: {len(unreachable_in_cluster):,} tokens")
print(f"  unreachable_outside_cluster: {len(unreachable_outside_cluster):,} tokens")
print(f"  cluster_but_reachable: {len(cluster_but_reachable):,} tokens")


SAVING RESULTS

✓ Saved full vocabulary reachability analysis to ../tensors/Qwen3-4B-Instruct-2507/1.8d_full_vocab_reachability.safetensors

Saved data:
  all_reachable: 148,312 tokens
  all_unreachable: 3,357 tokens
  unreachable_in_cluster: 1,934 tokens
  unreachable_outside_cluster: 1,423 tokens
  cluster_but_reachable: 278 tokens


## Summary

This notebook tested all 151,669 token IDs reported by the Qwen3 tokenizer for reachability. IDs 151,669 through 151,935 were not scanned.

**Method:**
- Exhaustive round-trip test: decode(token) → re-encode(string)
- Compare unreachable tokens to geometric cluster
- Test hypothesis: unreachable = cluster?

**Results:**
- Tokens scanned: 151,669 (of a nominal 151,936)
- Reachable: 148,312 (97.79%)
- Unreachable: 3,357 (2.21%)
- Unreachable in cluster: 1,934 (87.4% of the 2,212 cluster tokens)
- Unreachable outside cluster: 1,423
- Cluster tokens not flagged unreachable: 278

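**Key Findings:**

**Unreachable ⊂ cluster (Jeffery's prediction): rejected.**
- 1,423 unreachable tokens lie outside the cluster, so the cluster is not defined by unreachability alone.
- The first 20 of them (IDs 94-113) decode to `�`, which suggests byte-level tokens that are not valid UTF-8 on their own. For such tokens, a failed round trip may reflect the string-based test rather than true unreachability.

**Unreachable ⊃ cluster (Alpha's prediction): partially supported.**
- Unreachable tokens do exist outside the cluster, as predicted.
- The strict superset claim fails under this test: 278 cluster tokens are not in the unreachable set.
- Those 278 include IDs at or above 151,669, which the scan never visited because the loop stops at `len(tokenizer)`. They count as "reachable" only by default. The listing is truncated, but the counts are consistent with 11 genuinely reachable tokens plus all 267 IDs from 151,669 to 151,935, which would reconcile this result with the 2,201 of 2,212 reported in 1.8c.

**Overall:** the two sets partially overlap, with 1,934 tokens both unreachable and in the cluster.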

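**Implications:**

1. **Vocabulary quality issue**: 3,357 tokens (2.21% of those scanned) fail the round-trip test, though some failures may be artifacts of decoding byte-level tokens to strings.
2. **Training corpus gaps**: Reachable cluster tokens such as `<|box_start|>` and `<|fim_pad|>` presumably appeared rarely or never in training.
3. **Geometric signatures**: Embedding geometry may help identify unused tokens, but it does not coincide with round-trip unreachability.
4. **Model archaeology**: Dead tokens preserve evidence of data engineering artifacts.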