# 1.10c: GB2312 Chinese Character Tokenization Test

**Testing the legacy encoding hypothesis**

## The Hypothesis

From 1.10b, we found that **random bytes produce halo tokens at 23.35× enrichment**.

This suggests halo tokens appear in Qwen's training data as **non-UTF-8 byte sequences**.

**New question:** Did Qwen train on pre-Unicode Chinese text?

## GB2312 Encoding

GB2312 (1980) is a legacy Chinese character encoding that predates Unicode. In its usual byte form (EUC-CN):
- 2 bytes per character
- 6,763 Simplified Chinese characters, plus 682 symbols and non-Chinese letters
- High byte: 0xA1–0xF7 (0xA0 + area number)
- Low byte: 0xA1–0xFE (0xA0 + position number)

**Character areas:**
- Areas 1-9 (0xA1A1–0xA9FE): Symbols, punctuation, kana, Greek, Cyrillic
- Areas 16-55 (0xB0A1–0xD7FE): Level 1 Chinese (3,755 common characters; the last 5 positions, 0xD7FA–0xD7FE, are unassigned)
- Areas 56-87 (0xD8A1–0xF7FE): Level 2 Chinese (3,008 less common characters)
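
The relationship between area/position numbers and the byte values listed above can be sketched in a few lines. This is a minimal illustration, assuming the EUC-CN byte form in which each byte is 0xA0 plus the area or position number; the helper name `area_pos_to_bytes` is hypothetical and not part of this notebook.

```python
def area_pos_to_bytes(area: int, position: int) -> bytes:
    """Map a GB2312 area/position pair (each 1-94) to its two-byte form."""
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must each be in 1..94")
    # Each byte is the area or position number offset by 0xA0.
    return bytes([0xA0 + area, 0xA0 + position])


first_level1 = area_pos_to_bytes(16, 1)   # first Level 1 character
last_level2 = area_pos_to_bytes(87, 94)   # last Level 2 character

print(first_level1.hex().upper())  # B0A1
print(last_level2.hex().upper())   # F7FE

# Python's built-in gb2312 codec turns the pair back into a character.
print(first_level1.decode("gb2312"))
```

The two printed hex values match the endpoints of the Chinese character range used in the rest of this notebook.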

## The Test

Feed the tokenizer **all valid GB2312 Chinese character byte pairs** and measure halo token enrichment.

**If Qwen trained on GB2312 text:**
- Halo enrichment should be **higher** than random bytes (> 23.35x)
- GB2312 byte pairs specifically target the byte sequences that would appear in legacy Chinese text
- This is a **targeted test** vs. random byte noise

**If Qwen never saw GB2312:**
- Enrichment similar to random bytes (~23x) or lower
- No special affinity for these specific byte patterns

## Method

1. Generate all valid GB2312 Chinese character byte pairs (0xB0A1–0xF7FE)
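2. Decode each pair to a Unicode string with Python's `gb2312` codec, so the tokenizer receives the Chinese character itself (re-encoded as UTF-8) rather than the raw GB2312 bytes; pairs with no assigned character fail to decode and are skipped
3. Tokenize each decoded character individually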
4. Classify tokens: cluster / halo / bulk
5. Compare to vocabulary baseline and random bytes baseline
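
The decoding choice in step 2 determines what text the tokenizer actually receives. A minimal sketch, using only Python's built-in codecs and the first Level 1 pair as an example: Latin-1 maps each byte to its own code point and round-trips losslessly, while the `gb2312` codec yields the single Chinese character that the pair encodes.

```python
pair = bytes([0xB0, 0xA1])  # first Level 1 GB2312 pair

as_latin1 = pair.decode("latin-1")  # one code point per byte
as_gb2312 = pair.decode("gb2312")   # one Chinese character

for label, text in [("latin-1", as_latin1), ("gb2312", as_gb2312)]:
    utf8 = text.encode("utf-8")
    print(f"{label:8s} -> {text!r} ({len(text)} code point(s)), "
          f"UTF-8 bytes: {utf8.hex()}")

# Latin-1 is lossless: re-encoding recovers the original bytes.
print(as_latin1.encode("latin-1") == pair)
```

A tokenizer that works on UTF-8 text sees different byte sequences in the two cases, so the results of this test depend on which decoding is used.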

## Parameters

In [12]:
# Model
MODEL_NAME = "Qwen3-4B-Instruct-2507"
HF_MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Input data
CLUSTER_TOKENS_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors"
REACHABILITY_PATH = "../tensors/Qwen3-4B-Instruct-2507/1.8d_full_vocab_reachability.safetensors"

# GB2312 Chinese character ranges
# Level 1: Common characters (areas 16-55)
LEVEL1_HIGH_START = 0xB0
LEVEL1_HIGH_END = 0xD7

# Level 2: Less common characters (areas 56-87)
LEVEL2_HIGH_START = 0xD8
LEVEL2_HIGH_END = 0xF7

# Low byte range (same for both levels)
LOW_BYTE_START = 0xA1
LOW_BYTE_END = 0xFE

# Baseline from 1.10b (random bytes)
RANDOM_BYTES_HALO_ENRICHMENT = 23.35

## Imports

In [13]:
import torch
import numpy as np
from transformers import AutoTokenizer
from safetensors.torch import load_file
from collections import Counter
from tqdm import tqdm

## Load Token Classifications

In [14]:
print("Loading token classifications...\n")

# Load cluster tokens
cluster_data = load_file(CLUSTER_TOKENS_PATH)
cluster_token_ids = set(cluster_data['cluster_token_ids'].tolist())

# Load halo tokens (unreachable outside cluster)
reachability_data = load_file(REACHABILITY_PATH)
halo_token_ids = set(reachability_data['unreachable_outside_cluster'].tolist())

print(f"✓ Loaded token classifications")
print(f"  Cluster tokens: {len(cluster_token_ids):,}")
print(f"  Halo tokens: {len(halo_token_ids):,}")
print(f"  Bulk tokens: {151669 - len(cluster_token_ids) - len(halo_token_ids):,}")

Loading token classifications...

✓ Loaded token classifications
  Cluster tokens: 2,212
  Halo tokens: 1,423
  Bulk tokens: 148,034


## Load Tokenizer

In [15]:
print(f"\nLoading tokenizer: {HF_MODEL_NAME}\n")
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
vocab_size = len(tokenizer)

print(f"✓ Tokenizer loaded")
print(f"  Vocabulary size: {vocab_size:,} tokens")


Loading tokenizer: Qwen/Qwen3-4B-Instruct-2507

✓ Tokenizer loaded
  Vocabulary size: 151,669 tokens


## Generate GB2312 Byte Pairs

In [16]:
print(f"\n{'='*70}")
print("GENERATING GB2312 CHINESE CHARACTER BYTE PAIRS")
print(f"{'='*70}\n")

# Generate all valid GB2312 Chinese character byte pairs
gb2312_byte_pairs = []

# Level 1: Common characters (0xB0A1 - 0xD7FE)
for high_byte in range(LEVEL1_HIGH_START, LEVEL1_HIGH_END + 1):
    for low_byte in range(LOW_BYTE_START, LOW_BYTE_END + 1):
        gb2312_byte_pairs.append(bytes([high_byte, low_byte]))

# Level 2: Less common characters (0xD8A1 - 0xF7FE)
for high_byte in range(LEVEL2_HIGH_START, LEVEL2_HIGH_END + 1):
    for low_byte in range(LOW_BYTE_START, LOW_BYTE_END + 1):
        gb2312_byte_pairs.append(bytes([high_byte, low_byte]))

print(f"✓ Generated {len(gb2312_byte_pairs):,} GB2312 byte pairs")
print(f"\nByte range coverage:")
level1_count = (LEVEL1_HIGH_END - LEVEL1_HIGH_START + 1) * (LOW_BYTE_END - LOW_BYTE_START + 1)
level2_count = (LEVEL2_HIGH_END - LEVEL2_HIGH_START + 1) * (LOW_BYTE_END - LOW_BYTE_START + 1)
print(f"  Level 1 (0xB0A1-0xD7FE): {level1_count:,} pairs")
print(f"  Level 2 (0xD8A1-0xF7FE): {level2_count:,} pairs")
print(f"  Total: {len(gb2312_byte_pairs):,} pairs")

# Show first few examples
print(f"\nFirst 5 byte pairs (hex):")
for i in range(min(5, len(gb2312_byte_pairs))):
    pair = gb2312_byte_pairs[i]
    print(f"  {i+1}. 0x{pair[0]:02X}{pair[1]:02X}")


GENERATING GB2312 CHINESE CHARACTER BYTE PAIRS

✓ Generated 6,768 GB2312 byte pairs

Byte range coverage:
  Level 1 (0xB0A1-0xD7FE): 3,760 pairs
  Level 2 (0xD8A1-0xF7FE): 3,008 pairs
  Total: 6,768 pairs

First 5 byte pairs (hex):
  1. 0xB0A1
  2. 0xB0A2
  3. 0xB0A3
  4. 0xB0A4
  5. 0xB0A5


## Tokenize GB2312 Byte Pairs

In [17]:
print(f"\n{'='*70}")
print("TOKENIZING GB2312 BYTE PAIRS")
print(f"{'='*70}\n")

# Collect all token IDs produced
all_tokens = []
failed_pairs = 0

for byte_pair in tqdm(gb2312_byte_pairs, desc="Tokenizing GB2312 pairs"):
    try:
        # Decode the byte pair to a Unicode string with the GB2312 codec
        # (raises for pairs that have no assigned character)
        byte_string = byte_pair.decode('gb2312')
        
        # Tokenize
        token_ids = tokenizer.encode(byte_string, add_special_tokens=False)
        all_tokens.extend(token_ids)
    except Exception as e:
        failed_pairs += 1
        if failed_pairs <= 5:
            print(f"  Warning: Failed to tokenize 0x{byte_pair[0]:02X}{byte_pair[1]:02X}: {e}")

print(f"\n✓ Tokenization complete")
print(f"  Byte pairs processed: {len(gb2312_byte_pairs) - failed_pairs:,} / {len(gb2312_byte_pairs):,}")
print(f"  Failed pairs: {failed_pairs:,}")
print(f"  Total tokens produced: {len(all_tokens):,}")
print(f"  Unique tokens: {len(set(all_tokens)):,}")
print(f"  Tokens per byte pair (avg): {len(all_tokens) / max(1, len(gb2312_byte_pairs) - failed_pairs):.2f}")


TOKENIZING GB2312 BYTE PAIRS



Tokenizing GB2312 pairs: 100%|██████████| 6768/6768 [00:00<00:00, 20989.76it/s]


✓ Tokenization complete
  Byte pairs processed: 6,763 / 6,768
  Failed pairs: 5
  Total tokens produced: 6,885
  Unique tokens: 6,789
  Tokens per byte pair (avg): 1.02
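
The five failed pairs are plausibly decode errors rather than tokenizer errors, since GB2312 leaves the last five positions of area 55 unassigned. A quick check, assuming Python's built-in `gb2312` codec rejects unassigned code points (the variable name `undecodable` is illustrative):

```python
undecodable = []

# Scan the same range the notebook generates: 0xB0A1 through 0xF7FE.
for high in range(0xB0, 0xF7 + 1):
    for low in range(0xA1, 0xFE + 1):
        pair = bytes([high, low])
        try:
            pair.decode("gb2312")
        except UnicodeDecodeError:
            undecodable.append(pair)

print(f"Undecodable pairs: {len(undecodable)}")
for pair in undecodable:
    print(f"  0x{pair.hex().upper()}")
```

If the count matches the number of failed pairs reported above, every decodable pair reached the tokenizer.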





## Classify Tokens

In [18]:
print(f"\n{'='*70}")
print("CLASSIFYING TOKENS")
print(f"{'='*70}\n")

# Count tokens by category
cluster_count = 0
halo_count = 0
bulk_count = 0

for token_id in all_tokens:
    if token_id in cluster_token_ids:
        cluster_count += 1
    elif token_id in halo_token_ids:
        halo_count += 1
    else:
        bulk_count += 1

total_tokens = len(all_tokens)

if total_tokens == 0:
    print("ERROR: No tokens were produced!")
    print("Cannot compute statistics with zero tokens.")
else:
    print(f"Token distribution from GB2312 byte pairs:")
    print(f"  Cluster tokens: {cluster_count:,} ({100*cluster_count/total_tokens:.2f}%)")
    print(f"  Halo tokens: {halo_count:,} ({100*halo_count/total_tokens:.2f}%)")
    print(f"  Bulk tokens: {bulk_count:,} ({100*bulk_count/total_tokens:.2f}%)")
    print(f"  Total: {total_tokens:,}")


CLASSIFYING TOKENS

Token distribution from GB2312 byte pairs:
  Cluster tokens: 0 (0.00%)
  Halo tokens: 244 (3.54%)
  Bulk tokens: 6,641 (96.46%)
  Total: 6,885


## Compare to Baselines

In [19]:
print(f"\n{'='*70}")
print("COMPARISON TO BASELINES")
print(f"{'='*70}\n")

# Vocabulary proportions (uniform sampling baseline)
vocab_cluster_pct = 100 * len(cluster_token_ids) / vocab_size
vocab_halo_pct = 100 * len(halo_token_ids) / vocab_size
vocab_bulk_pct = 100 * (vocab_size - len(cluster_token_ids) - len(halo_token_ids)) / vocab_size

# Observed proportions from GB2312
obs_cluster_pct = 100 * cluster_count / total_tokens
obs_halo_pct = 100 * halo_count / total_tokens
obs_bulk_pct = 100 * bulk_count / total_tokens

# Enrichment (observed / expected)
cluster_enrichment = obs_cluster_pct / vocab_cluster_pct if vocab_cluster_pct > 0 else 0
halo_enrichment = obs_halo_pct / vocab_halo_pct if vocab_halo_pct > 0 else 0
bulk_enrichment = obs_bulk_pct / vocab_bulk_pct if vocab_bulk_pct > 0 else 0

print("Baseline 1: Vocabulary (uniform sampling):")
print(f"  Cluster: {vocab_cluster_pct:.2f}%")
print(f"  Halo: {vocab_halo_pct:.2f}%")
print(f"  Bulk: {vocab_bulk_pct:.2f}%")

print(f"\nBaseline 2: Random bytes (from 1.10b):")
print(f"  Halo enrichment: {RANDOM_BYTES_HALO_ENRICHMENT:.2f}x")

print(f"\nObserved (GB2312 byte pairs):")
print(f"  Cluster: {obs_cluster_pct:.2f}%")
print(f"  Halo: {obs_halo_pct:.2f}%")
print(f"  Bulk: {obs_bulk_pct:.2f}%")

print(f"\nEnrichment vs. vocabulary:")
print(f"  Cluster: {cluster_enrichment:.2f}x")
print(f"  Halo: {halo_enrichment:.2f}x", end="")

# Compare to random bytes baseline
if halo_enrichment > RANDOM_BYTES_HALO_ENRICHMENT * 1.2:
    print(f" ← HIGHER THAN RANDOM BYTES ({RANDOM_BYTES_HALO_ENRICHMENT:.2f}x)")
    print(f"\n  ✓ GB2312 byte pairs show STRONGER halo affinity than random noise!")
    print(f"    This suggests Qwen was trained on GB2312-encoded Chinese text.")
elif halo_enrichment > RANDOM_BYTES_HALO_ENRICHMENT * 0.8:
    print(f" ← SIMILAR TO RANDOM BYTES ({RANDOM_BYTES_HALO_ENRICHMENT:.2f}x)")
    print(f"\n  ≈ GB2312 byte pairs behave like random noise")
    print(f"    No special affinity for these byte patterns.")
else:
    print(f" ← LOWER THAN RANDOM BYTES ({RANDOM_BYTES_HALO_ENRICHMENT:.2f}x)")
    print(f"\n  ✗ GB2312 byte pairs produce FEWER halo tokens than random noise")
    print(f"    Qwen likely did not train on GB2312 text.")

print(f"  Bulk: {bulk_enrichment:.2f}x")


COMPARISON TO BASELINES

Baseline 1: Vocabulary (uniform sampling):
  Cluster: 1.46%
  Halo: 0.94%
  Bulk: 97.60%

Baseline 2: Random bytes (from 1.10b):
  Halo enrichment: 23.35x

Observed (GB2312 byte pairs):
  Cluster: 0.00%
  Halo: 3.54%
  Bulk: 96.46%

Enrichment vs. vocabulary:
  Cluster: 0.00x
  Halo: 3.78x ← LOWER THAN RANDOM BYTES (23.35x)

  ✗ GB2312 byte pairs produce FEWER halo tokens than random noise
    Qwen likely did not train on GB2312 text.
  Bulk: 0.99x
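
The halo enrichment figure can be reproduced by hand from the counts printed earlier in this notebook. A minimal worked example, with the counts copied in as constants:

```python
VOCAB_SIZE = 151_669      # tokenizer vocabulary size
HALO_IN_VOCAB = 1_423     # halo tokens in the vocabulary
HALO_OBSERVED = 244       # halo tokens produced by GB2312 characters
TOTAL_OBSERVED = 6_885    # all tokens produced by GB2312 characters

# Share of halo tokens among the observed tokens vs. in the vocabulary.
observed_share = HALO_OBSERVED / TOTAL_OBSERVED
vocab_share = HALO_IN_VOCAB / VOCAB_SIZE

# Enrichment is the ratio of the two shares.
halo_enrichment = observed_share / vocab_share

print(f"Observed halo share:   {100 * observed_share:.2f}%")
print(f"Vocabulary halo share: {100 * vocab_share:.2f}%")
print(f"Halo enrichment:       {halo_enrichment:.2f}x")
```

An enrichment of 1.0x would mean halo tokens appear exactly as often as their share of the vocabulary predicts.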


## Statistical Significance

In [20]:
print(f"\n{'='*70}")
print("STATISTICAL SIGNIFICANCE")
print(f"{'='*70}\n")

from scipy.stats import chisquare

# Expected frequencies (if tokens were sampled uniformly from vocab)
expected = [
    total_tokens * len(cluster_token_ids) / vocab_size,
    total_tokens * len(halo_token_ids) / vocab_size,
    total_tokens * (vocab_size - len(cluster_token_ids) - len(halo_token_ids)) / vocab_size,
]

# Observed frequencies
observed = [cluster_count, halo_count, bulk_count]

# Chi-square test
chi2, p_value = chisquare(observed, expected)

print(f"Chi-square goodness-of-fit test:")
print(f"  Null hypothesis: GB2312 byte pairs produce tokens uniformly from vocabulary")
print(f"  Chi-square statistic: {chi2:.2f}")
print(f"  p-value: {p_value:.2e}")

if p_value < 0.001:
    print(f"\n  ✓ SIGNIFICANT: Distribution is non-uniform (p < 0.001)")
elif p_value < 0.05:
    print(f"\n  ✓ SIGNIFICANT: Distribution is non-uniform (p < 0.05)")
else:
    print(f"\n  ✗ NOT SIGNIFICANT: Cannot reject null hypothesis (p ≥ 0.05)")


STATISTICAL SIGNIFICANCE

Chi-square goodness-of-fit test:
  Null hypothesis: GB2312 byte pairs produce tokens uniformly from vocabulary
  Chi-square statistic: 599.59
  p-value: 6.31e-131

  ✓ SIGNIFICANT: Distribution is non-uniform (p < 0.001)
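
To see which category drives the statistic, the chi-square sum can be recomputed term by term without SciPy. A minimal sketch using the counts reported above; it computes only the statistic, not the p-value.

```python
observed = {"cluster": 0, "halo": 244, "bulk": 6_641}
vocab_counts = {"cluster": 2_212, "halo": 1_423, "bulk": 148_034}

total_observed = sum(observed.values())
vocab_size = sum(vocab_counts.values())

chi2 = 0.0
for category, count in observed.items():
    # Expected count if categories followed their vocabulary share.
    expected = total_observed * vocab_counts[category] / vocab_size
    contribution = (count - expected) ** 2 / expected
    chi2 += contribution
    print(f"{category:8s} observed={count:6,} expected={expected:9.2f} "
          f"contribution={contribution:7.2f}")

print(f"Chi-square statistic: {chi2:.2f}")
```

Most of the statistic comes from the excess of halo tokens, and the complete absence of cluster tokens accounts for most of the remainder.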


## Most Common Tokens from GB2312

In [21]:
print(f"\n{'='*70}")
print("MOST COMMON TOKENS FROM GB2312 BYTE PAIRS")
print(f"{'='*70}\n")

token_counter = Counter(all_tokens)
most_common = token_counter.most_common(20)

print(f"Top 20 tokens produced by GB2312 byte pairs:\n")
print(f"  {'Token ID':>8} | {'Count':>6} | {'Category':>8} | Decoded")
print(f"  {'-'*8}-+-{'-'*6}-+-{'-'*8}-+{'-'*30}")

for token_id, count in most_common:
    # Classify
    if token_id in cluster_token_ids:
        category = "Cluster"
    elif token_id in halo_token_ids:
        category = "Halo"
    else:
        category = "Bulk"
    
    # Decode
    decoded = tokenizer.decode([token_id])
    decoded_display = repr(decoded)[:28]
    
    print(f"  {token_id:8d} | {count:6,} | {category:>8} | {decoded_display}")


MOST COMMON TOKENS FROM GB2312 BYTE PAIRS

Top 20 tokens produced by GB2312 byte pairs:

  Token ID |  Count | Category | Decoded
  ---------+--------+----------+------------------------------
       114 |      8 |     Halo | '�'
       222 |      6 |     Halo | '�'
      3490 |      4 |     Halo | '�'
       251 |      4 |     Halo | '�'
       113 |      4 |     Halo | '�'
       116 |      4 |     Halo | '�'
       248 |      3 |     Halo | '�'
       224 |      3 |     Halo | '�'
     13519 |      3 |     Halo | '�'
       244 |      3 |     Halo | '�'
       112 |      3 |     Halo | '�'
     20778 |      3 |     Halo | '�'
       110 |      3 |     Halo | '�'
       241 |      3 |     Halo | '�'
        96 |      3 |     Halo | '�'
       253 |      3 |     Halo | '�'
     24864 |      3 |     Halo | '�'
       233 |      3 |     Halo | '�'
        99 |      3 |     Halo | '�'
     37472 |      3 |     Halo | '�'


## Unique Halo Tokens Hit

In [22]:
print(f"\n{'='*70}")
print("HALO TOKEN COVERAGE")
print(f"{'='*70}\n")

# How many unique halo tokens did we hit?
unique_halo_tokens = set(token_id for token_id in all_tokens if token_id in halo_token_ids)

print(f"Halo token coverage:")
print(f"  Total halo tokens in vocabulary: {len(halo_token_ids):,}")
print(f"  Unique halo tokens produced: {len(unique_halo_tokens):,}")
print(f"  Coverage: {100 * len(unique_halo_tokens) / len(halo_token_ids):.1f}%")

if len(unique_halo_tokens) / len(halo_token_ids) > 0.5:
    print(f"\n  ✓ HIGH COVERAGE: GB2312 byte pairs hit majority of halo tokens!")
    print(f"    This strongly suggests halo tokens come from GB2312 text.")
elif len(unique_halo_tokens) / len(halo_token_ids) > 0.1:
    print(f"\n  ≈ MODERATE COVERAGE: GB2312 hits some halo tokens")
else:
    print(f"\n  ✗ LOW COVERAGE: GB2312 rarely produces halo tokens")


HALO TOKEN COVERAGE

Halo token coverage:
  Total halo tokens in vocabulary: 1,423
  Unique halo tokens produced: 148
  Coverage: 10.4%

  ≈ MODERATE COVERAGE: GB2312 hits some halo tokens


## Interpretation

**If halo enrichment > 28x (20% higher than random bytes):**
- ✓ **Smoking gun**: GB2312-encoded Chinese text is specifically overrepresented
- Qwen's training data likely included pre-Unicode Chinese corpora
- Halo tokens are artifacts of legacy encoding in training data

**If halo enrichment ≈ 23x (similar to random bytes):**
- ≈ **Inconclusive**: GB2312 behaves like general binary noise
- Halo tokens come from various non-UTF-8 sources, not specifically GB2312

**If halo enrichment < 18.7x (20% lower than random bytes):**
- ✗ **Evidence against**: GB2312 is less halo-enriched than random noise
- Qwen likely did not train on a significant amount of GB2312 text
- This run falls here: the observed halo enrichment is 3.78x

**High halo token coverage (>50%):**
- Would indicate that most halo tokens are reachable via GB2312 byte pairs
- Would be strong evidence that legacy Chinese encoding is the primary source of halo tokens
- Not observed in this run: coverage is 10.4% (148 of 1,423 halo tokens)
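
The decision rules above can be written as two small functions. This is a sketch that mirrors the thresholds used in this notebook (a 20% band around the random-bytes baseline, and 50% / 10% coverage cutoffs); the function names are hypothetical.

```python
def classify_enrichment(halo_enrichment: float,
                        baseline: float = 23.35,
                        tolerance: float = 0.2) -> str:
    """Compare a halo enrichment to the random-bytes baseline."""
    if halo_enrichment > baseline * (1 + tolerance):
        return "higher"
    if halo_enrichment > baseline * (1 - tolerance):
        return "similar"
    return "lower"


def classify_coverage(unique_halo_hit: int, total_halo: int) -> str:
    """Bucket the fraction of halo tokens that were produced at all."""
    coverage = unique_halo_hit / total_halo
    if coverage > 0.5:
        return "high"
    if coverage > 0.1:
        return "moderate"
    return "low"


# Values reported earlier in this notebook.
print(classify_enrichment(3.78))     # lower
print(classify_coverage(148, 1423))  # moderate
```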