# 06.1p: Cuckoo Zone Survey

**Goal:** Survey the 27 tokens in the "cuckoo zone": tokens whose Euclidean distance from the black hole centroid lies between 1e-4 and 1e-3.

These tokens sit just outside the ultra-dense black hole cluster, in a region where local density varies sharply. For each cuckoo token we report:
- Token ID and decoded string
- Distance from black hole centroid
- Pairwise Euclidean distances to all other cuckoos
- Angular separations from other cuckoos
- Character classification (based on the first character of the decoded string)
- Nearest black hole

This notebook is part of Volume 6: Pathologies and Singularities.

## Parameters

In [1]:
TENSOR_DIR = "../data/tensors"
GAMMA_FILE = "gamma_qwen3_4b_instruct_2507.safetensors"
MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# Cuckoo zone boundaries
CUCKOO_R_INNER = 1e-4
CUCKOO_R_OUTER = 1e-3

# Random seed
RANDOM_SEED = 42

## Imports

In [2]:
import torch
import numpy as np
from safetensors.torch import load_file
from pathlib import Path
from collections import defaultdict
from transformers import AutoTokenizer
import unicodedata

print("Imports loaded successfully.")

Imports loaded successfully.


## Step 1: Load Data and Identify Black Hole Centroid

In [3]:
print("Loading gamma matrix...")
gamma_path = Path(TENSOR_DIR) / GAMMA_FILE
gamma = load_file(gamma_path)['gamma']
N, d = gamma.shape
print(f"Loaded: {gamma.shape}\n")

print("Finding degenerate clusters...\n")

gamma_np = gamma.cpu().numpy()
unique_vecs, inverse_indices, counts = np.unique(
    gamma_np, 
    axis=0, 
    return_inverse=True, 
    return_counts=True
)

# Build clusters
clusters = defaultdict(list)
for token_id, unique_idx in enumerate(inverse_indices):
    clusters[unique_idx].append(token_id)

# Filter to degenerate clusters
degenerate_clusters = {idx: tokens for idx, tokens in clusters.items() if len(tokens) > 1}

# Sort by size, descending
sorted_clusters = sorted(degenerate_clusters.items(), key=lambda x: len(x[1]), reverse=True)

# Extract all black hole token IDs: the 13 largest degenerate clusters (count is hard-coded here)
black_hole_token_ids = []
for unique_idx, token_ids in sorted_clusters[:13]:
    black_hole_token_ids.extend(token_ids)

black_hole_token_ids = sorted(black_hole_token_ids)
black_hole_set = set(black_hole_token_ids)

print(f"Identified {len(black_hole_token_ids)} black hole tokens")

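# Compute centroid as the mean over all black hole tokens (each token counts once,
# so larger clusters carry more weight)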
black_hole_embeddings = gamma[black_hole_token_ids]
centroid = black_hole_embeddings.mean(dim=0)
centroid_norm = torch.norm(centroid)

print(f"Black hole centroid norm: {centroid_norm.item():.12f}")

Loading gamma matrix...
Loaded: torch.Size([151936, 2560])

Finding degenerate clusters...

Identified 2100 black hole tokens
Black hole centroid norm: 0.370916873217
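
The duplicate-row grouping in this step can be illustrated on a toy matrix. The following is a minimal pure-Python sketch, not the notebook's NumPy code: `find_degenerate_clusters`, `centroid`, and `toy_rows` are hypothetical names, and exact tuple equality stands in for `np.unique(..., axis=0)`. It also shows that a centroid taken over all member tokens is weighted by cluster size.

```python
from collections import defaultdict

def find_degenerate_clusters(rows):
    """Group row indices by exact row content; keep groups with more than one member."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row)].append(idx)
    clusters = [ids for ids in groups.values() if len(ids) > 1]
    # Largest clusters first (the stable sort keeps first-seen order on ties)
    return sorted(clusters, key=len, reverse=True)

def centroid(rows, member_ids):
    """Mean over member rows; each token counts once, so big clusters dominate."""
    dim = len(rows[0])
    return [sum(rows[i][k] for i in member_ids) / len(member_ids) for k in range(dim)]

# Toy data: rows 0, 2, 5 are identical; rows 1, 4 are identical; row 3 is unique
toy_rows = [
    [0.0, 0.0], [4.0, 0.0], [0.0, 0.0],
    [9.0, 9.0], [4.0, 0.0], [0.0, 0.0],
]
clusters = find_degenerate_clusters(toy_rows)
members = sorted(i for ids in clusters for i in ids)
print(clusters)                     # [[0, 2, 5], [1, 4]]
print(centroid(toy_rows, members))  # [1.6, 0.0]
```

The centroid lands at x = 1.6 rather than midway between the two cluster positions (x = 2.0), because the three-member cluster contributes three rows to the mean.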


## Step 2: Extract Cuckoo Zone Tokens

In [4]:
print("\n" + "="*70)
print(f"EXTRACTING CUCKOO ZONE TOKENS ({CUCKOO_R_INNER:.0e} < r < {CUCKOO_R_OUTER:.0e})")
print("="*70)
print()

# Compute distances from centroid
distances = torch.norm(gamma - centroid, dim=1)

# Find tokens in cuckoo zone (inner bound exclusive, outer bound inclusive)
in_cuckoo_zone = (distances > CUCKOO_R_INNER) & (distances <= CUCKOO_R_OUTER)
cuckoo_token_ids = torch.where(in_cuckoo_zone)[0].tolist()
cuckoo_distances = distances[in_cuckoo_zone]
cuckoo_embeddings = gamma[in_cuckoo_zone]

print(f"Found {len(cuckoo_token_ids)} tokens in the cuckoo zone")
print(f"\nDistance range:")
print(f"  Min: {cuckoo_distances.min().item():.12e}")
print(f"  Max: {cuckoo_distances.max().item():.12e}")
print(f"  Mean: {cuckoo_distances.mean().item():.12e}")
print(f"  Median: {cuckoo_distances.median().item():.12e}")


EXTRACTING CUCKOO ZONE TOKENS (1e-04 < r < 1e-03)

Found 27 tokens in the cuckoo zone

Distance range:
  Min: 1.011977947201e-04
  Max: 9.954774286598e-04
  Mean: 3.191469004378e-04
  Median: 1.878242037492e-04
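
The zone selection is a half-open shell test on the distance from the centroid: the mask in the cell above uses `>` for the inner bound and `<=` for the outer one. A minimal sketch of the same test in plain Python follows; `radial_distances` and `shell_mask` are hypothetical helper names and the points are toy data.

```python
import math

R_INNER, R_OUTER = 1e-4, 1e-3  # same bounds as the notebook parameters

def radial_distances(points, center):
    """Euclidean distance of each point from the center."""
    return [math.dist(p, center) for p in points]

def shell_mask(radii, r_inner=R_INNER, r_outer=R_OUTER):
    """True where r_inner < r <= r_outer (inner exclusive, outer inclusive)."""
    return [r_inner < r <= r_outer for r in radii]

center = (0.0, 0.0)
points = [(5e-5, 0.0), (3e-4, 4e-4), (0.0, 2e-3)]  # inside, in the shell, outside
radii = radial_distances(points, center)
print(shell_mask(radii))               # [False, True, False]
print(shell_mask([R_INNER, R_OUTER]))  # [False, True]
```

The second call shows the boundary convention: a token exactly at the inner radius is excluded, while one exactly at the outer radius is kept.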


## Step 3: Load Tokenizer

In [5]:
print("\n" + "="*70)
print("LOADING TOKENIZER")
print("="*70)
print()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Tokenizer loaded: {len(tokenizer)} tokens\n")


LOADING TOKENIZER

Tokenizer loaded: 151669 tokens



## Step 4: Character Classification Function

In [6]:
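def classify_character(char):
    """Classify a string by the Unicode category and name of its first character.

    Only char[0] is inspected, so a multi-character token is labelled by
    whatever it starts with (e.g. a leading space yields 'Whitespace').
    """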
    if not char:
        return 'Empty'
    
    try:
        cat = unicodedata.category(char[0])
        name = unicodedata.name(char[0], '<UNNAMED>')
        
        # Check for specific scripts in the name
        if 'THAI' in name:
            return 'Thai'
        elif 'CJK' in name or 'HIRAGANA' in name or 'KATAKANA' in name or 'HANGUL' in name:
            return 'CJK'
        elif 'ARABIC' in name:
            return 'Arabic'
        elif 'HEBREW' in name:
            return 'Hebrew'
        elif 'CYRILLIC' in name:
            return 'Cyrillic'
        elif 'GREEK' in name:
            return 'Greek'
        elif 'LATIN' in name:
            # Further subdivide Latin
            if cat.startswith('L'):  # Letter
                return 'Latin Letter'
            elif cat.startswith('N'):  # Number
                return 'Latin Number'
            elif cat.startswith('P'):  # Punctuation
                return 'Latin Punctuation'
            else:
                return 'Latin Other'
        elif cat.startswith('Z'):  # Separator (space, etc.)
            return 'Whitespace'
        elif cat.startswith('P'):  # Punctuation
            return 'Punctuation'
        elif cat.startswith('N'):  # Number
            return 'Number'
        elif cat.startswith('S'):  # Symbol
            return 'Symbol'
        else:
            return 'Other'
    except Exception as e:
        return f'Error: {e}'

print("Character classification function defined.")

Character classification function defined.
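
Because `classify_character` looks only at `char[0]`, the label for a multi-character token reflects its first character alone. The sketch below shows the raw `unicodedata` values that drive that decision; `first_char_info` is a hypothetical helper, not part of the notebook.

```python
import unicodedata

def first_char_info(token_str):
    """Return (category, name) of the first character, or None for an empty string."""
    if not token_str:
        return None
    ch = token_str[0]
    return unicodedata.category(ch), unicodedata.name(ch, "<UNNAMED>")

print(first_char_info("$PostalCodesNL"))  # ('Sc', 'DOLLAR SIGN')
print(first_char_info(" hello"))          # ('Zs', 'SPACE')
print(first_char_info("ด้"))               # a Thai letter; the name starts with 'THAI'
print(first_char_info(""))                # None
```

This is why a token such as `$PostalCodesNL` is reported as `Symbol` even though the rest of the string is Latin letters.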


## Step 5: Compute Pairwise Distances

In [7]:
print("\n" + "="*70)
print("COMPUTING PAIRWISE DISTANCES BETWEEN CUCKOOS")
print("="*70)
print()

# Euclidean pairwise distances
pairwise_dists = torch.cdist(cuckoo_embeddings, cuckoo_embeddings, p=2)

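# Angular distances between embedding directions, measured from the origin
# (not from the centroid). Note: acos is ill-conditioned near cos = 1, so
# very small angles can round to exactly 0 at limited floating-point precision.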
cuckoo_normed = cuckoo_embeddings / torch.norm(cuckoo_embeddings, dim=1, keepdim=True)
cosine_sims = cuckoo_normed @ cuckoo_normed.T
angular_dists = torch.acos(torch.clamp(cosine_sims, -1.0, 1.0))

print(f"Computed {len(cuckoo_token_ids)}×{len(cuckoo_token_ids)} distance matrices")
print(f"\nEuclidean distance statistics:")
# Extract upper triangle (excluding diagonal)
mask = torch.triu(torch.ones_like(pairwise_dists, dtype=bool), diagonal=1)
unique_dists = pairwise_dists[mask]
print(f"  Min: {unique_dists.min().item():.12e}")
print(f"  Max: {unique_dists.max().item():.12e}")
print(f"  Mean: {unique_dists.mean().item():.12e}")
print(f"  Median: {unique_dists.median().item():.12e}")

print(f"\nAngular distance statistics:")
unique_angles = angular_dists[mask]
print(f"  Min: {unique_angles.min().item():.6f} rad ({torch.rad2deg(unique_angles.min()).item():.4f}°)")
print(f"  Max: {unique_angles.max().item():.6f} rad ({torch.rad2deg(unique_angles.max()).item():.4f}°)")
print(f"  Mean: {unique_angles.mean().item():.6f} rad ({torch.rad2deg(unique_angles.mean()).item():.4f}°)")
print(f"  Median: {unique_angles.median().item():.6f} rad ({torch.rad2deg(unique_angles.median()).item():.4f}°)")


COMPUTING PAIRWISE DISTANCES BETWEEN CUCKOOS

Computed 27×27 distance matrices

Euclidean distance statistics:
  Min: 4.401307669468e-04
  Max: 1.722013694234e-03
  Mean: 9.701765957288e-04
  Median: 9.533996344544e-04

Angular distance statistics:
  Min: 0.000000 rad (0.0000°)
  Max: 0.003347 rad (0.1918°)
  Mean: 0.000355 rad (0.0204°)
  Median: 0.000000 rad (0.0000°)
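
The minimum and median angles of exactly 0 are most likely floating-point artifacts rather than truly collinear embeddings: with norms near 0.371 and pairwise separations of at least 4.4e-4, typical angles should be on the order of 1e-3 rad. When the cosine rounds to 1, `acos` returns 0. A common alternative, sketched here in plain Python under the assumption that it would be swapped in for the `acos` formula, is `2 * atan2(|a - b|, |a + b|)` on unit vectors `a`, `b`. The function names are hypothetical, and the demonstration uses float64 with a deliberately tiny angle to trigger the same effect.

```python
import math

def _unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angle_acos(u, v):
    """Angle via acos of the clamped cosine (the formula used in the cell above)."""
    a, b = _unit(u), _unit(v)
    cos = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, cos)))

def angle_atan2(u, v):
    """Angle via 2 * atan2(|a - b|, |a + b|), which stays accurate for small angles."""
    a, b = _unit(u), _unit(v)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(a, b)))
    return 2.0 * math.atan2(diff, total)

theta = 1e-9  # far below what acos can resolve in float64
u = [1.0, 0.0]
v = [math.cos(theta), math.sin(theta)]
print(angle_acos(u, v))   # 0.0 (the cosine rounds to exactly 1)
print(angle_atan2(u, v))  # approximately 1e-09
```

For unit vectors separated by angle θ, `|a - b| = 2 sin(θ/2)` and `|a + b| = 2 cos(θ/2)`, so the `atan2` call recovers θ/2 without passing through a value that saturates at 1.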


## Step 6: Find Nearest Black Hole for Each Cuckoo

In [8]:
print("\n" + "="*70)
print("FINDING NEAREST BLACK HOLE FOR EACH CUCKOO")
print("="*70)
print()

# Compute distances from each cuckoo to all black holes
cuckoo_to_bh_dists = torch.cdist(cuckoo_embeddings, black_hole_embeddings, p=2)

# For each cuckoo, find nearest black hole
nearest_bh_dists, nearest_bh_indices = cuckoo_to_bh_dists.min(dim=1)
nearest_bh_token_ids = [black_hole_token_ids[idx] for idx in nearest_bh_indices.tolist()]

print(f"Computed nearest black hole for each of {len(cuckoo_token_ids)} cuckoos")
print(f"\nDistance to nearest black hole:")
print(f"  Min: {nearest_bh_dists.min().item():.12e}")
print(f"  Max: {nearest_bh_dists.max().item():.12e}")
print(f"  Mean: {nearest_bh_dists.mean().item():.12e}")


FINDING NEAREST BLACK HOLE FOR EACH CUCKOO

Computed nearest black hole for each of 27 cuckoos

Distance to nearest black hole:
  Min: 5.459150415845e-04
  Max: 1.170857343823e-03
  Mean: 8.674336131662e-04
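
All tokens inside one degenerate cluster share the same vector, so every member of the nearest cluster is equally near. The "nearest black hole" token reported for a cuckoo is therefore one representative of a tied group, and the search can equivalently run over one vector per cluster. The sketch below illustrates that idea on toy data; `nearest_cluster`, `representatives`, and `cluster_members` are hypothetical names, not objects from the notebook.

```python
import math

def nearest_cluster(query, representatives):
    """Return (cluster_index, distance) for the closest representative vector."""
    dists = [math.dist(query, rep) for rep in representatives]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists[best]

# Toy data: 3 unique vectors standing in for the black hole clusters,
# each with the token IDs that share that vector
representatives = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
cluster_members = [[10, 11, 12], [20, 21], [30, 31]]

idx, dist = nearest_cluster((0.9, 0.1), representatives)
print(idx, cluster_members[idx])  # 1 [20, 21]
```

Reporting the whole member list, or the cluster size, alongside a single token ID makes it clear that the nearest neighbour is a cluster rather than an individual token.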


## Step 7: Complete Dossier for Each Cuckoo

In [9]:
print("\n" + "="*70)
print("CUCKOO ZONE COMPLETE SURVEY")
print("="*70)
print()

for i, token_id in enumerate(cuckoo_token_ids):
    # Basic info
    token_str = tokenizer.decode([token_id])
    dist_from_centroid = cuckoo_distances[i].item()
    embedding = cuckoo_embeddings[i]
    embedding_norm = torch.norm(embedding).item()
    
    # Character classification
    char_type = classify_character(token_str)
    
    # Nearest black hole
    nearest_bh_id = nearest_bh_token_ids[i]
    nearest_bh_dist = nearest_bh_dists[i].item()
    nearest_bh_str = tokenizer.decode([nearest_bh_id])
    
    # Distances to other cuckoos
    dists_to_others = pairwise_dists[i].cpu().numpy()
    angles_to_others = angular_dists[i].cpu().numpy()
    
    # Exclude self (distance = 0)
    other_indices = [j for j in range(len(cuckoo_token_ids)) if j != i]
    dists_to_others_filtered = dists_to_others[other_indices]
    angles_to_others_filtered = angles_to_others[other_indices]
    
    closest_other_idx = other_indices[dists_to_others_filtered.argmin()]
    closest_other_id = cuckoo_token_ids[closest_other_idx]
    closest_other_dist = dists_to_others[closest_other_idx]
    
    farthest_other_idx = other_indices[dists_to_others_filtered.argmax()]
    farthest_other_id = cuckoo_token_ids[farthest_other_idx]
    farthest_other_dist = dists_to_others[farthest_other_idx]
    
    print(f"\n{'='*70}")
    print(f"CUCKOO #{i+1} of {len(cuckoo_token_ids)}")
    print(f"{'='*70}")
    print(f"Token ID: {token_id}")
    print(f"Token string: '{token_str}'")
    print(f"Character type: {char_type}")
    print()
    print(f"Spatial properties:")
    print(f"  Distance from black hole centroid: {dist_from_centroid:.12e}")
    print(f"  Embedding norm (distance from origin): {embedding_norm:.12f}")
    print()
    print(f"Nearest black hole:")
    print(f"  Token ID: {nearest_bh_id}")
    print(f"  Token string: '{nearest_bh_str}'")
    print(f"  Distance: {nearest_bh_dist:.12e}")
    print()
    print(f"Distances to other cuckoos:")
    print(f"  Closest: token {closest_other_id} at {closest_other_dist:.12e}")
    print(f"  Farthest: token {farthest_other_id} at {farthest_other_dist:.12e}")
    print(f"  Mean: {dists_to_others_filtered.mean():.12e}")
    print(f"  Median: {np.median(dists_to_others_filtered):.12e}")
    print()
    print(f"Angular separations from other cuckoos:")
    print(f"  Min: {angles_to_others_filtered.min():.6f} rad ({np.rad2deg(angles_to_others_filtered.min()):.4f}°)")
    print(f"  Max: {angles_to_others_filtered.max():.6f} rad ({np.rad2deg(angles_to_others_filtered.max()):.4f}°)")
    print(f"  Mean: {angles_to_others_filtered.mean():.6f} rad ({np.rad2deg(angles_to_others_filtered.mean()):.4f}°)")


CUCKOO ZONE COMPLETE SURVEY


CUCKOO #1 of 27
Token ID: 83971
Token string: '$PostalCodesNL'
Character type: Symbol

Spatial properties:
  Distance from black hole centroid: 7.319091237150e-04
  Embedding norm (distance from origin): 0.370950996876

Nearest black hole:
  Token ID: 136831
  Token string: 'สติ'
  Distance: 1.118792919442e-03

Distances to other cuckoos:
  Closest: token 133377 at 9.917039424181e-04
  Farthest: token 125869 at 1.485049491748e-03
  Mean: 1.138059189543e-03
  Median: 1.108753611334e-03

Angular separations from other cuckoos:
  Min: 0.000000 rad (0.0000°)
  Max: 0.002740 rad (0.1570°)
  Mean: 0.000953 rad (0.0546°)

CUCKOO #2 of 27
Token ID: 123952
Token string: 'ด้'
Character type: Thai

Spatial properties:
  Distance from black hole centroid: 1.268317428185e-04
  Embedding norm (distance from origin): 0.370916336775

Nearest black hole:
  Token ID: 136831
  Token string: 'สติ'
  Distance: 7.324218750000e-04

Distances to other cuckoos:
  Closest: token 1

## Step 8: Character Type Summary

In [10]:
print("\n" + "="*70)
print("CHARACTER TYPE DISTRIBUTION")
print("="*70)
print()

# Classify all cuckoos
char_types = []
for token_id in cuckoo_token_ids:
    token_str = tokenizer.decode([token_id])
    char_type = classify_character(token_str)
    char_types.append(char_type)

# Count
from collections import Counter
type_counts = Counter(char_types)

print(f"{'Type':<20} {'Count':>8} {'Percentage':>12}")
print("-" * 45)
for char_type, count in sorted(type_counts.items(), key=lambda x: x[1], reverse=True):
    pct = 100 * count / len(cuckoo_token_ids)
    print(f"{char_type:<20} {count:>8} {pct:>11.1f}%")

print(f"\nTotal: {len(cuckoo_token_ids)} tokens")


CHARACTER TYPE DISTRIBUTION

Type                    Count   Percentage
---------------------------------------------
Thai                       18        66.7%
CJK                         5        18.5%
Symbol                      1         3.7%
Whitespace                  1         3.7%
Greek                       1         3.7%
Empty                       1         3.7%

Total: 27 tokens


## Summary

This notebook surveyed the 27 tokens in the cuckoo zone.

**Key questions answered:**
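1. What are these tokens? (decoded strings and character types: 18 of 27 are Thai and 5 are CJK)
2. How are they spatially arranged? (distances from the centroid span 1.01e-4 to 9.95e-4; pairwise separations span 4.40e-4 to 1.72e-3)
3. How close are they to the black holes? (each cuckoo's nearest black hole lies between 5.46e-4 and 1.17e-3 away)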
4. Are they linguistically similar to black holes? (character distribution)

One reading of these results is that the cuckoo zone is the chaotic region where tokens are close enough to feel the gravitational influence of the black hole cluster but far enough away that they did not collapse into degeneracy during training.