# 1.10b: Random Bytes Tokenization Test

**Testing the binary garbage hypothesis**

## The Hypothesis

Jeffery's insight: **The tokenizer operates on bytes, not Unicode.**

What if halo tokens (unreachable via Unicode round-trip) are actually *reachable via non-UTF-8 byte sequences*?

**Proposed mechanism:**
1. Qwen's training data includes binary garbage (corrupted files, non-UTF-8 encodings, PDF fragments, etc.)
2. Tokenizer encodes these byte sequences → produces "halo" tokens
3. Model receives gradients for these tokens → they develop normal embeddings
4. But we can't decode them to valid Unicode → round-trip test fails → labeled "unreachable"
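
Step 4 can be illustrated without loading the tokenizer. The sketch below assumes the round-trip test amounts to "decode the token's bytes to text, re-encode, and compare", which is an assumption about how the earlier reachability notebook works. The helper name `round_trips` is hypothetical.

```python
def round_trips(token_bytes: bytes) -> bool:
    """Return True if the bytes survive decode -> encode unchanged."""
    # errors="replace" substitutes U+FFFD for any invalid UTF-8 sequence,
    # so the original bytes cannot be recovered from the decoded text.
    text = token_bytes.decode("utf-8", errors="replace")
    return text.encode("utf-8") == token_bytes


complete_char = "é".encode("utf-8")  # two bytes forming one full character
lone_lead_byte = complete_char[:1]   # first byte only: an incomplete fragment

print(round_trips(complete_char))
print(round_trips(lone_lead_byte))
```

A token whose bytes are an incomplete UTF-8 fragment decodes to the replacement character, which is what a failed round trip would look like under this assumption.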

**Meanwhile:**
- Cluster tokens = valid Unicode that never appeared in training (Thai script, etc.)
- No gradients → geometric collapse
- Different category entirely

## The Test

Feed the tokenizer **random bytes** (not valid UTF-8) and see which token categories appear:

**If hypothesis is correct:**
- Halo tokens should appear **disproportionately often**
- Random bytes → invalid UTF-8 → tokens that can't round-trip → halo tokens

**Null hypothesis:**
- Random bytes produce mostly bulk tokens, with each category appearing in proportion to its share of the vocabulary
- Halo tokens appear at baseline rate (~0.9% of vocabulary)

## Method

1. Generate random byte sequences (various lengths)
2. Tokenize them (decode as Latin-1 so that any byte sequence is accepted as a string; count failures instead of stopping)
3. Collect all produced token IDs
4. Classify each token: cluster / halo / bulk
5. Compare distributions

**Statistical validity:**
- Generate 10,000 sequences of 10 to 1,000 random bytes each (roughly 5 million bytes in total)
- Tokenize every sequence and pool the resulting token IDs
- Count token category frequencies
- Compute enrichment: P(halo | random bytes) vs P(halo | vocabulary)
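
The classification and enrichment steps can be sketched on toy data before running them on the real vocabulary. This is a minimal illustration, not the notebook's implementation: the function names `classify` and `enrichment`, the toy ID sets, and the toy vocabulary size of 10 are all made up for the example, and the sketch assumes the cluster and halo sets are disjoint and non-empty.

```python
from collections import Counter


def classify(token_id, cluster_ids, halo_ids):
    """Assign a token ID to exactly one category (cluster wins ties)."""
    if token_id in cluster_ids:
        return "cluster"
    if token_id in halo_ids:
        return "halo"
    return "bulk"


def enrichment(token_ids, cluster_ids, halo_ids, vocab_size):
    """Observed share of each category divided by its vocabulary share."""
    counts = Counter(classify(t, cluster_ids, halo_ids) for t in token_ids)
    sizes = {
        "cluster": len(cluster_ids),
        "halo": len(halo_ids),
        "bulk": vocab_size - len(cluster_ids) - len(halo_ids),
    }
    total = len(token_ids)
    return {
        name: (counts[name] / total) / (sizes[name] / vocab_size)
        for name in sizes
    }


# Hypothetical toy data: 10-token vocabulary, 2 cluster IDs, 1 halo ID.
toy_cluster = {1, 2}
toy_halo = {3}
toy_tokens = [3, 3, 4, 5, 3, 6, 7, 3]

result = enrichment(toy_tokens, toy_cluster, toy_halo, vocab_size=10)
print(result)
```

In the toy data the single halo ID makes up half of the observed tokens but only a tenth of the vocabulary, so its enrichment is well above 1, while the cluster IDs never appear and score 0.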

## Parameters

In [7]:
# Model
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Input data
CLUSTER_TOKENS_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors"
REACHABILITY_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.8d_full_vocab_reachability.safetensors"

# Random byte generation
NUM_SEQUENCES = 10000  # Number of random byte sequences to generate
MIN_SEQ_LENGTH = 10
MAX_SEQ_LENGTH = 1000

RANDOM_SEED = 42

## Imports

In [8]:
import torch
import numpy as np
from transformers import AutoTokenizer
from safetensors.torch import load_file
from collections import Counter
from tqdm import tqdm

np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x1168adb30>

## Load Token Classifications

In [9]:
print("Loading token classifications...\n")

# Load cluster tokens
cluster_data = load_file(CLUSTER_TOKENS_PATH)
cluster_token_ids = set(cluster_data['cluster_token_ids'].tolist())

# Load halo tokens (unreachable outside cluster)
reachability_data = load_file(REACHABILITY_PATH)
halo_token_ids = set(reachability_data['unreachable_outside_cluster'].tolist())

print(f"✓ Loaded token classifications")
print(f"  Cluster tokens: {len(cluster_token_ids):,}")
print(f"  Halo tokens: {len(halo_token_ids):,}")
# 151,669 is the tokenizer vocabulary size (the tokenizer itself is loaded in the next cell)
print(f"  Bulk tokens: {151669 - len(cluster_token_ids) - len(halo_token_ids):,}")

Loading token classifications...

✓ Loaded token classifications
  Cluster tokens: 2,212
  Halo tokens: 1,423
  Bulk tokens: 148,034


## Load Tokenizer

In [10]:
print(f"\nLoading tokenizer: {HF_MODEL_NAME}\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
vocab_size = len(tokenizer)

print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {vocab_size:,} tokens")


Loading tokenizer: Qwen/Qwen3-4B-Instruct-2507

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Generate and Tokenize Random Bytes

In [11]:
print(f"\n{'='*70}")
print("GENERATING RANDOM BYTE SEQUENCES")
print(f"{'='*70}\n")

print(f"Configuration:")
print(f"  Number of sequences: {NUM_SEQUENCES:,}")
print(f"  Sequence length: {MIN_SEQ_LENGTH}-{MAX_SEQ_LENGTH} bytes")
print(f"\nTokenizing...\n")

# Collect all token IDs produced
all_tokens = []
failed_sequences = 0

for i in tqdm(range(NUM_SEQUENCES), desc="Tokenizing random bytes"):
    # Generate random byte sequence
    seq_length = np.random.randint(MIN_SEQ_LENGTH, MAX_SEQ_LENGTH + 1)
    random_bytes = bytes(np.random.randint(0, 256, size=seq_length, dtype=np.uint8))
    
    try:
        # Convert bytes to a string using Latin-1 encoding
        # Latin-1 maps bytes 0-255 to Unicode U+0000-U+00FF (lossless as a string)
        # Caveat: the tokenizer works on the UTF-8 encoding of this string, so
        # each byte >= 0x80 reaches the BPE as a valid two-byte UTF-8 sequence
        byte_string = random_bytes.decode('latin-1')
        
        # Tokenize the byte string
        token_ids = tokenizer.encode(byte_string, add_special_tokens=False)
        all_tokens.extend(token_ids)
    except Exception as e:
        # Track failures but continue
        failed_sequences += 1
        if failed_sequences <= 5:  # Show first 5 errors
            print(f"  Warning: Failed to tokenize sequence {i}: {e}")

print(f"\n✓ Tokenization complete")
print(f"  Sequences processed: {NUM_SEQUENCES - failed_sequences:,} / {NUM_SEQUENCES:,}")
print(f"  Failed sequences: {failed_sequences:,}")
print(f"  Total tokens produced: {len(all_tokens):,}")
print(f"  Unique tokens: {len(set(all_tokens)):,}")


GENERATING RANDOM BYTE SEQUENCES

Configuration:
  Number of sequences: 10,000
  Sequence length: 10-1000 bytes

Tokenizing...



Tokenizing random bytes: 100%|██████████| 10000/10000 [00:03<00:00, 2719.78it/s]


✓ Tokenization complete
  Sequences processed: 10,000 / 10,000
  Failed sequences: 0
  Total tokens produced: 5,421,516
  Unique tokens: 5,848
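
It is worth checking what the tokenizer actually receives after the Latin-1 step. The sketch below assumes that the byte-level BPE operates on the UTF-8 encoding of the input string, which is how Hugging Face byte-level tokenizers generally behave; it does not load the tokenizer, and the seed and sequence length are arbitrary.

```python
import random

rng = random.Random(42)  # seed chosen for illustration only
raw = bytes(rng.randrange(256) for _ in range(1000))

as_text = raw.decode("latin-1")      # the string handed to tokenizer.encode()
utf8_view = as_text.encode("utf-8")  # assumed input to the byte-level BPE

high_bytes = sum(b >= 0x80 for b in raw)
lead_bytes = sorted({b for b in utf8_view if b >= 0xC0})

# Latin-1 itself is lossless at the string level.
print(as_text.encode("latin-1") == raw)
# Every byte >= 0x80 has turned into a two-byte UTF-8 sequence.
print(len(utf8_view) == len(raw) + high_bytes)
# The only possible lead bytes for U+0080-U+00FF are 0xC2 and 0xC3.
print([hex(b) for b in lead_bytes])
# Decoding does not raise, so the tokenizer's view is valid UTF-8.
utf8_view.decode("utf-8")
```

If that assumption holds, this run measures how the tokenizer handles valid UTF-8 for code points U+0000 to U+00FF rather than invalid UTF-8, and halo tokens would appear wherever a two-byte character has no merged token and is split into single-byte fragments. That would be consistent with one token dominating the counts in the table of most common tokens.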





## Classify Tokens

In [12]:
print(f"\n{'='*70}")
print("CLASSIFYING TOKENS")
print(f"{'='*70}\n")

# Count tokens by category
cluster_count = 0
halo_count = 0
bulk_count = 0

for token_id in all_tokens:
    if token_id in cluster_token_ids:
        cluster_count += 1
    elif token_id in halo_token_ids:
        halo_count += 1
    else:
        bulk_count += 1

total_tokens = len(all_tokens)

if total_tokens == 0:
    print("ERROR: No tokens were produced!")
    print("Cannot compute statistics with zero tokens.")
else:
    print(f"Token distribution from random bytes:")
    print(f"  Cluster tokens: {cluster_count:,} ({100*cluster_count/total_tokens:.2f}%)")
    print(f"  Halo tokens: {halo_count:,} ({100*halo_count/total_tokens:.2f}%)")
    print(f"  Bulk tokens: {bulk_count:,} ({100*bulk_count/total_tokens:.2f}%)")
    print(f"  Total: {total_tokens:,}")


CLASSIFYING TOKENS

Token distribution from random bytes:
  Cluster tokens: 0 (0.00%)
  Halo tokens: 1,187,940 (21.91%)
  Bulk tokens: 4,233,576 (78.09%)
  Total: 5,421,516


## Compare to Vocabulary Baseline

In [13]:
print(f"\n{'='*70}")
print("COMPARISON TO VOCABULARY BASELINE")
print(f"{'='*70}\n")

# Vocabulary proportions (if tokens were sampled uniformly)
vocab_cluster_pct = 100 * len(cluster_token_ids) / vocab_size
vocab_halo_pct = 100 * len(halo_token_ids) / vocab_size
vocab_bulk_pct = 100 * (vocab_size - len(cluster_token_ids) - len(halo_token_ids)) / vocab_size

# Observed proportions from random bytes
obs_cluster_pct = 100 * cluster_count / total_tokens
obs_halo_pct = 100 * halo_count / total_tokens
obs_bulk_pct = 100 * bulk_count / total_tokens

# Enrichment (observed / expected)
cluster_enrichment = obs_cluster_pct / vocab_cluster_pct if vocab_cluster_pct > 0 else 0
halo_enrichment = obs_halo_pct / vocab_halo_pct if vocab_halo_pct > 0 else 0
bulk_enrichment = obs_bulk_pct / vocab_bulk_pct if vocab_bulk_pct > 0 else 0

print("Baseline (vocabulary proportions):")
print(f"  Cluster: {vocab_cluster_pct:.2f}%")
print(f"  Halo: {vocab_halo_pct:.2f}%")
print(f"  Bulk: {vocab_bulk_pct:.2f}%")

print(f"\nObserved (from random bytes):")
print(f"  Cluster: {obs_cluster_pct:.2f}%")
print(f"  Halo: {obs_halo_pct:.2f}%")
print(f"  Bulk: {obs_bulk_pct:.2f}%")

print(f"\nEnrichment (observed / baseline):")
print(f"  Cluster: {cluster_enrichment:.2f}x")
print(f"  Halo: {halo_enrichment:.2f}x {'← ENRICHED' if halo_enrichment > 1.5 else ''}")
print(f"  Bulk: {bulk_enrichment:.2f}x")


COMPARISON TO VOCABULARY BASELINE

Baseline (vocabulary proportions):
  Cluster: 1.46%
  Halo: 0.94%
  Bulk: 97.60%

Observed (from random bytes):
  Cluster: 0.00%
  Halo: 21.91%
  Bulk: 78.09%

Enrichment (observed / baseline):
  Cluster: 0.00x
  Halo: 23.35x ← ENRICHED
  Bulk: 0.80x


## Statistical Significance

In [14]:
print(f"\n{'='*70}")
print("STATISTICAL SIGNIFICANCE")
print(f"{'='*70}\n")

# Chi-square test: are observed frequencies different from expected?
from scipy.stats import chisquare

# Expected frequencies (if tokens were sampled uniformly from vocab)
expected = [
    total_tokens * len(cluster_token_ids) / vocab_size,
    total_tokens * len(halo_token_ids) / vocab_size,
    total_tokens * (vocab_size - len(cluster_token_ids) - len(halo_token_ids)) / vocab_size,
]

# Observed frequencies
observed = [cluster_count, halo_count, bulk_count]

# Chi-square test
chi2, p_value = chisquare(observed, expected)

print(f"Chi-square goodness-of-fit test:")
print(f"  Null hypothesis: Random bytes produce tokens uniformly from vocabulary")
print(f"  Chi-square statistic: {chi2:.2f}")
print(f"  p-value: {p_value:.2e}")

if p_value < 0.001:
    print(f"\n  ✓ SIGNIFICANT: Random bytes do NOT sample uniformly (p < 0.001)")
    print(f"    The distribution is significantly different from baseline.")
elif p_value < 0.05:
    print(f"\n  ✓ SIGNIFICANT: Random bytes do NOT sample uniformly (p < 0.05)")
else:
    print(f"\n  ✗ NOT SIGNIFICANT: Cannot reject null hypothesis (p ≥ 0.05)")
    print(f"    Random bytes might sample uniformly from vocabulary.")


STATISTICAL SIGNIFICANCE

Chi-square goodness-of-fit test:
  Null hypothesis: Random bytes produce tokens uniformly from vocabulary
  Chi-square statistic: 25709026.65
  p-value: 0.00e+00

  ✓ SIGNIFICANT: Random bytes do NOT sample uniformly (p < 0.001)
    The distribution is significantly different from baseline.
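
The statistic can be reproduced by hand from the counts printed in the earlier cells, which also shows how much each category contributes. This is a sketch of the standard goodness-of-fit formula, the sum of (observed - expected)² / expected, written without SciPy; the dictionary names are chosen for this example only.

```python
vocab_size = 151_669
category_sizes = {"cluster": 2_212, "halo": 1_423, "bulk": 148_034}
observed = {"cluster": 0, "halo": 1_187_940, "bulk": 4_233_576}

total = sum(observed.values())

# Per-category contribution to the chi-square statistic.
contributions = {}
for name, size in category_sizes.items():
    expected = total * size / vocab_size
    contributions[name] = (observed[name] - expected) ** 2 / expected

chi2 = sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:>8}: {value:>14,.2f} ({100 * value / chi2:.1f}% of chi2)")
print(f"{'total':>8}: {chi2:>14,.2f}")
```

Two caveats apply when reading the result. The printed p-value of `0.00e+00` is floating-point underflow, not an exact zero. The test also treats every token as an independent draw, which tokens from the same byte sequence are not, so the enrichment ratio is probably a more informative summary than the p-value.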


## Most Common Tokens from Random Bytes

In [15]:
print(f"\n{'='*70}")
print("MOST COMMON TOKENS FROM RANDOM BYTES")
print(f"{'='*70}\n")

token_counter = Counter(all_tokens)
most_common = token_counter.most_common(20)

print(f"Top 20 tokens produced by random bytes:\n")
print(f"  {'Token ID':>8} | {'Count':>7} | {'Category':>8} | Decoded")
print(f"  {'-'*8}-+-{'-'*7}-+-{'-'*8}-+{'-'*30}")

for token_id, count in most_common:
    # Classify
    if token_id in cluster_token_ids:
        category = "Cluster"
    elif token_id in halo_token_ids:
        category = "Halo"
    else:
        category = "Bulk"
    
    # Decode
    decoded = tokenizer.decode([token_id])
    decoded_display = repr(decoded)[:28]
    
    # Width 7 fits six-digit counts with a thousands separator
    print(f"  {token_id:8d} | {count:7,} | {category:>8} | {decoded_display}")


MOST COMMON TOKENS FROM RANDOM BYTES

Top 20 tokens produced by random bytes:

  Token ID |   Count | Category | Decoded
  ---------+---------+----------+------------------------------
       126 | 585,833 |     Halo | '�'
       236 |  20,148 |     Halo | '�'
     41873 |  20,081 |     Bulk | '¼'
       206 |  19,954 |     Bulk | '\x12'
     59497 |  19,950 |     Bulk | '¹'
       250 |  19,927 |     Halo | '�'
       214 |  19,925 |     Bulk | '\x1a'
       253 |  19,912 |     Halo | '�'
       216 |  19,907 |     Bulk | '\x1c'
       212 |  19,895 |     Bulk | '\x18'
     45913 |  19,874 |     Bulk | 'Ç'
        22 |  19,865 |     Bulk | '7'
       221 |  19,861 |     Bulk | '\x7f'
    131371 |  19,858 |     Bulk | 'Ô'
       219 |  19,857 |     Bulk | '\x1f'
       205 |  19,849 |     Bulk | '\x11'
       190 |  19,847 |     Bulk | '\x02'
       249 |  19,841 |     Halo | '�'
       232 |  19,841 |     Halo | '�'
       217 |  19,832 |     Bulk | '\x1d'


## Interpretation

**If halo enrichment > 1.5x** (observed in this run: 23.35x):
- ✓ **Hypothesis supported**: Random bytes disproportionately produce halo tokens
- Halo tokens are reachable via non-UTF-8 byte sequences
- They likely appeared in Qwen's training data as binary garbage
- Round-trip test fails because they can't be decoded to valid Unicode
- But they're not "dead": they received gradients and have normal embeddings

**If halo enrichment ≈ 1.0x:**
- ✗ **Hypothesis not supported**: Random bytes sample uniformly from vocabulary
- Halo tokens are not preferentially produced by binary garbage
- Need alternative explanation for why they're outside the cluster

**If cluster enrichment > 1.0x** (observed in this run: 0.00x, no cluster tokens appeared):
- Unexpected! Cluster tokens appearing from random bytes?
- Would suggest cluster tokens are also reachable via byte sequences
- But they collapsed geometrically—why?

**Key insight:**

This test distinguishes between:
- **Halo tokens**: Reachable via bytes, not via Unicode (binary garbage)
- **Cluster tokens**: Not reachable via bytes OR Unicode (truly unused)
- **Bulk tokens**: Reachable via Unicode (normal training data)