# 06.1j: Black Hole Survey

**Goal:** Comprehensive analysis of each of the 13 primordial black holes.

For each black hole cluster, we examine:
1. **Identity:** Rank, size, representative token, complete token list with decoding
2. **Spatial properties:** Norm, distance from centroid, deviation from mean radius
3. **Pairwise relationships:** Distances and angular separations to all other black holes
4. **Token linguistics:** Script types, character patterns
5. **Statistical summary:** Nearest/farthest neighbors, outlier detection

This is Volume 6: Pathologies and Singularities

## Parameters

In [1]:
TENSOR_DIR = "../data/tensors"
GAMMA_FILE = "gamma_qwen3_4b_instruct_2507.safetensors"
MODEL_NAME = "Qwen/Qwen3-4B-Instruct-2507"

# The 13 black hole clusters (from 06.1h)
# Format: (representative_token_id, cluster_size, all_token_ids)
BLACK_HOLE_CLUSTERS = [
    (80091, 814, list(range(80091, 80091+814))),  # Placeholder - will load from 06.1h results
    (125, 704, None),
    (124, 306, None),
    (124350, 228, None),
    (123939, 11, [123939, 131955, 134350, 134430, 138099, 139299, 139794, 140074]),  # From 06.1h output; only 8 of the 11 IDs are listed here
    (119349, 10, [119349, 125087, 126630, 137856, 138110, 139345, 142061, 143029, 143036, 143048]),
    (126268, 6, [126268, 132713, 138041, 146501, 148028, 151889]),
    (132383, 5, [132383, 132398, 139050, 142718, 142719]),
    (135619, 4, [135619, 138490, 140815, 143457]),
    (136831, 4, [136831, 138068, 138072, 139278]),
    (180, 3, [180, 138979, 141503]),
    (126775, 3, [126775, 140303, 147056]),
    (126816, 2, [126816, 147836]),
]

# NOTE: The four largest clusters are placeholders here (token lists missing or synthetic).
# Step 2 recomputes the complete token lists for all 13 clusters directly from gamma.
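
Because the records above mix real and placeholder entries, a quick consistency check can show which ones are trustworthy. The following is a minimal sketch, assuming the `(representative, size, token_ids)` tuple format from the comment above; the helper name `check_cluster_records` is hypothetical and not part of the notebook.

```python
def check_cluster_records(records):
    """Return (representative, declared_size, listed_count) for inconsistent records."""
    problems = []
    for rep, size, ids in records:
        if ids is None:
            continue  # token list not recorded; nothing to compare
        if len(ids) != size or rep not in ids:
            problems.append((rep, size, len(ids)))
    return problems


# Three records copied from BLACK_HOLE_CLUSTERS above
sample = [
    (123939, 11, [123939, 131955, 134350, 134430, 138099, 139299, 139794, 140074]),
    (126816, 2, [126816, 147836]),
    (125, 704, None),
]
print(check_cluster_records(sample))
```

Any record the check flags should be treated as incomplete until the full list is recomputed from the embedding matrix.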

## Imports

In [2]:
import torch
import numpy as np
from safetensors.torch import load_file
from pathlib import Path
from transformers import AutoTokenizer
from collections import Counter
import unicodedata

print("Imports loaded successfully.")

Imports loaded successfully.


## Step 1: Load Data

In [3]:
print("Loading gamma matrix...")
gamma_path = Path(TENSOR_DIR) / GAMMA_FILE
gamma = load_file(gamma_path)['gamma']
print(f"Loaded: {gamma.shape}\n")

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Tokenizer loaded. Vocab size: {len(tokenizer):,}\n")

Loading gamma matrix...
Loaded: torch.Size([151936, 2560])

Loading tokenizer...
Tokenizer loaded. Vocab size: 151,669
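
The embedding matrix has more rows (151,936) than the tokenizer reports entries (151,669), so some row indices may not correspond to any decodable token. A small sketch of how cluster members could be split along that boundary is given below. It assumes tokenizer IDs run contiguously from 0 to `len(tokenizer) - 1`, which has not been verified here; `split_by_vocab` is a hypothetical helper.

```python
def split_by_vocab(token_ids, vocab_size):
    """Split IDs into those below vocab_size and those at or above it."""
    in_vocab = [t for t in token_ids if t < vocab_size]
    out_of_vocab = [t for t in token_ids if t >= vocab_size]
    return in_vocab, out_of_vocab


VOCAB_SIZE = 151_669  # len(tokenizer) as printed above
N_ROWS = 151_936      # rows in gamma as printed above

print("Rows beyond the tokenizer vocabulary:", N_ROWS - VOCAB_SIZE)

# Token IDs of one small cluster listed in the Parameters cell
cluster = [126268, 132713, 138041, 146501, 148028, 151889]
in_vocab, out_of_vocab = split_by_vocab(cluster, VOCAB_SIZE)
print("In vocabulary:", in_vocab)
print("Beyond vocabulary:", out_of_vocab)
```

IDs in the second group have no tokenizer entry under this assumption, so their decoded strings in the survey should be read with caution.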



## Step 2: Re-identify All Clusters

We need the complete token lists for all 13 clusters. Let's re-run the deduplication to get them.

In [4]:
print("Finding all degenerate clusters...\n")

from collections import defaultdict

gamma_np = gamma.cpu().numpy()
unique_vecs, inverse_indices, counts = np.unique(
    gamma_np, 
    axis=0, 
    return_inverse=True, 
    return_counts=True
)

# Build clusters
clusters = defaultdict(list)
for token_id, unique_idx in enumerate(np.ravel(inverse_indices)):  # flatten in case the inverse is not 1-D
    clusters[unique_idx].append(token_id)

# Filter to degenerate clusters
degenerate_clusters = {idx: tokens for idx, tokens in clusters.items() if len(tokens) > 1}

# Sort by size, descending
sorted_clusters = sorted(degenerate_clusters.items(), key=lambda x: len(x[1]), reverse=True)

print(f"Found {len(sorted_clusters)} degenerate clusters\n")

# Extract black hole data
black_holes_data = []
for rank, (unique_idx, token_ids) in enumerate(sorted_clusters[:13], 1):
    rep_token = token_ids[0]
    size = len(token_ids)
    black_holes_data.append({
        'rank': rank,
        'size': size,
        'representative': rep_token,
        'token_ids': token_ids,
        'embedding': gamma[rep_token],
    })

print(f"Extracted data for top 13 black holes")

Finding all degenerate clusters...

Found 13 degenerate clusters

Extracted data for top 13 black holes
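
To make the grouping logic easier to follow, here is a toy version of the same deduplication on a six-row matrix. It is a sketch of the technique used in the cell above (`np.unique` with `axis=0` and `return_inverse=True`), not a replacement for it; the function name `group_identical_rows` is hypothetical.

```python
import numpy as np
from collections import defaultdict


def group_identical_rows(matrix):
    """Return lists of row indices that share bit-identical rows, largest group first."""
    _, inverse = np.unique(matrix, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # defensive: flatten in case the inverse is not 1-D
    groups = defaultdict(list)
    for row_id, unique_idx in enumerate(inverse):
        groups[int(unique_idx)].append(row_id)
    # Keep only groups with more than one member (the "degenerate" ones)
    clusters = [ids for ids in groups.values() if len(ids) > 1]
    return sorted(clusters, key=len, reverse=True)


toy = np.array(
    [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.5, 0.6], [0.3, 0.4], [0.1, 0.2]],
    dtype=np.float32,
)
print(group_identical_rows(toy))  # [[0, 2, 5], [1, 4]]
```

Note that this groups rows only when they are exactly equal; rows that differ in the last bit land in separate clusters, which is why nearby black holes can still appear as distinct entries.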


## Step 3: Compute Global Statistics

In [5]:
print("\nComputing global statistics...\n")

# Extract embeddings for the 13 representative tokens
black_hole_embeddings = torch.stack([bh['embedding'] for bh in black_holes_data])
n_holes = len(black_holes_data)

# Compute norms
norms = torch.norm(black_hole_embeddings, dim=1)
mean_norm = norms.mean()

# Compute centroid of 13 black holes
centroid = black_hole_embeddings.mean(dim=0)
centroid_norm = torch.norm(centroid)

# Distances to centroid
distances_to_centroid = torch.norm(black_hole_embeddings - centroid, dim=1)

# Pairwise distances (Euclidean)
pairwise_distances = torch.cdist(black_hole_embeddings, black_hole_embeddings, p=2)

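# Pairwise angular separations
# (acos on float32 cosines cannot resolve angles below roughly 3e-4 rad, about 0.02°;
#  smaller separations come out as exactly 0)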
black_holes_normed = black_hole_embeddings / torch.norm(black_hole_embeddings, dim=1, keepdim=True)
cosine_sims = black_holes_normed @ black_holes_normed.T
cosine_sims_clamped = torch.clamp(cosine_sims, -1.0, 1.0)
angular_distances = torch.acos(cosine_sims_clamped)
angular_distances_deg = angular_distances * 180 / np.pi

# Store in black_holes_data
for i, bh in enumerate(black_holes_data):
    bh['norm'] = norms[i].item()
    bh['distance_to_centroid'] = distances_to_centroid[i].item()
    bh['norm_deviation'] = (norms[i] - mean_norm).item()

print(f"Global statistics:")
print(f"  Mean norm: {mean_norm.item():.12f}")
print(f"  Centroid norm: {centroid_norm.item():.12f}")
print(f"  RMS spread: {torch.sqrt((distances_to_centroid**2).mean()).item():.12f}")


Computing global statistics...

Global statistics:
  Mean norm: 0.370916843414
  Centroid norm: 0.370916843414
  RMS spread: 0.000026409623
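
The identical mean norm and centroid norm follow from how tight the constellation is. A short derivation, writing $x_i$ for the 13 representative embeddings, $c$ for their centroid, $\bar r$ for the mean norm, and $\sigma$ for the RMS spread printed above (symbols introduced here for illustration):

$$
\frac{1}{n}\sum_{i=1}^{n}\lVert x_i\rVert^2 \;=\; \lVert c\rVert^2 + \sigma^2,
\qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\lVert x_i - c\rVert^2 .
$$

Since $\lVert c\rVert \le \bar r \le \sqrt{\tfrac{1}{n}\sum_i \lVert x_i\rVert^2}$, the gap between the two printed values is bounded by

$$
0 \;\le\; \bar r - \lVert c\rVert \;\le\; \sqrt{\lVert c\rVert^2 + \sigma^2} - \lVert c\rVert
\;\le\; \frac{\sigma^2}{2\lVert c\rVert}
\;\approx\; \frac{(2.64\times 10^{-5})^2}{2 \times 0.3709}
\;\approx\; 9.4\times 10^{-10}.
$$

That bound is well below the float32 spacing near 0.37 ($2^{-25} \approx 3\times 10^{-8}$), so if the embeddings are stored in float32, as the printed values suggest, the two numbers are expected to coincide.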


## Step 4: Survey Each Black Hole

Now we generate a detailed report for each of the 13 black holes.

In [6]:
def classify_character(char):
    """Classify a character by its Unicode category."""
    if not char:
        return 'Empty'
    try:
        cat = unicodedata.category(char[0])
        name = unicodedata.name(char[0], '<UNNAMED>')
        
        # Broad classification
        if cat.startswith('L'):  # Letter
            # Try to identify script
            if 'THAI' in name:
                return 'Thai'
            elif 'CJK' in name or 'HANZI' in name or 'KANJI' in name:
                return 'CJK'
            elif 'ARABIC' in name:
                return 'Arabic'
            elif 'HEBREW' in name:
                return 'Hebrew'
            elif 'CYRILLIC' in name:
                return 'Cyrillic'
            elif 'LATIN' in name:
                return 'Latin'
            else:
                return 'Other Letter'
        elif cat.startswith('M'):  # Mark/diacritic
            return 'Diacritic'
        elif cat.startswith('N'):  # Number
            return 'Number'
        elif cat.startswith('P'):  # Punctuation
            return 'Punctuation'
        elif cat.startswith('S'):  # Symbol
            return 'Symbol'
        elif cat.startswith('Z'):  # Separator
            return 'Whitespace'
        elif cat.startswith('C'):  # Control/other
            return 'Control'
        else:
            return 'Other'
    except Exception:
        return 'Unknown'

def survey_black_hole(bh_idx):
    """Generate comprehensive report for a single black hole."""
    bh = black_holes_data[bh_idx]
    
    print("\n" + "="*80)
    print(f"BLACK HOLE #{bh['rank']}: {bh['size']} TOKENS")
    print("="*80)
    print()
    
    # SECTION 1: Identity
    print(f"**IDENTITY**")
    print(f"  Representative token ID: {bh['representative']}")
    print(f"  Cluster size: {bh['size']} tokens")
    print()
    
    # SECTION 2: Spatial Properties
    print(f"**SPATIAL PROPERTIES**")
    print(f"  Norm (distance from origin): {bh['norm']:.12f}")
    print(f"  Distance from constellation centroid: {bh['distance_to_centroid']:.12f}")
    print(f"  Deviation from mean norm: {bh['norm_deviation']:.12e}")
    print()
    
    # SECTION 3: Pairwise Relationships
    print(f"**RELATIONSHIPS TO OTHER BLACK HOLES**")
    print(f"{'To BH':>7} {'Euclidean Dist':>18} {'Angular Sep (°)':>18}")
    print("-" * 50)
    
    euclidean_dists = []
    angular_seps = []
    
    for j in range(n_holes):
        if j == bh_idx:
            continue
        euclidean_dist = pairwise_distances[bh_idx, j].item()
        angular_sep = angular_distances_deg[bh_idx, j].item()
        
        euclidean_dists.append(euclidean_dist)
        angular_seps.append(angular_sep)
        
        print(f"#{j+1:>6} {euclidean_dist:>18.12f} {angular_sep:>18.6f}")
    
    # Summary statistics
    print()
    # Map positions in euclidean_dists back to black hole indices (bh_idx was skipped)
    others = [j for j in range(n_holes) if j != bh_idx]
    nearest = others[int(np.argmin(euclidean_dists))]
    farthest = others[int(np.argmax(euclidean_dists))]
    print(f"  Mean distance to others: {np.mean(euclidean_dists):.12f}")
    print(f"  Min distance (nearest): {np.min(euclidean_dists):.12f} → BH #{nearest + 1}")
    print(f"  Max distance (farthest): {np.max(euclidean_dists):.12f} → BH #{farthest + 1}")
    print()
    print(f"  Mean angular separation: {np.mean(angular_seps):.6f}°")
    print(f"  Min angular separation: {np.min(angular_seps):.6f}°")
    print(f"  Max angular separation: {np.max(angular_seps):.6f}°")
    print()
    
    # SECTION 4: Token Linguistics
    print(f"**TOKEN ANALYSIS**")
    
    # Decode all tokens
    token_data = []
    char_types = []
    
    for token_id in bh['token_ids']:
        try:
            decoded = tokenizer.decode([token_id])
            decoded_repr = repr(decoded)
            char_type = classify_character(decoded)
            char_types.append(char_type)
            token_data.append((token_id, decoded_repr, decoded))
        except Exception:
            token_data.append((token_id, '<ERROR>', '<ERROR>'))
            char_types.append('Error')
    
    # Character type distribution
    char_type_counts = Counter(char_types)
    print(f"\n  Character type distribution:")
    for char_type, count in char_type_counts.most_common():
        pct = count / len(char_types) * 100
        print(f"    {char_type:>15}: {count:>4} ({pct:>5.1f}%)")
    
    # Token table (show all for small clusters, sample for large)
    print(f"\n  Tokens in this cluster:")
    print(f"  {'Token ID':>10} {'Repr':>40}  Token String")
    print("  " + "-" * 80)
    
    if bh['size'] <= 50:
        # Show all
        for token_id, decoded_repr, decoded in token_data:
            # Truncate repr if too long
            if len(decoded_repr) > 40:
                decoded_repr_display = decoded_repr[:37] + "..."
            else:
                decoded_repr_display = decoded_repr
            print(f"  {token_id:>10} {decoded_repr_display:>40}  '{decoded}'")
    else:
        # Show first 20, ellipsis, last 10
        for token_id, decoded_repr, decoded in token_data[:20]:
            if len(decoded_repr) > 40:
                decoded_repr_display = decoded_repr[:37] + "..."
            else:
                decoded_repr_display = decoded_repr
            print(f"  {token_id:>10} {decoded_repr_display:>40}  '{decoded}'")
        
        print(f"\n  ... ({bh['size'] - 30} tokens omitted) ...\n")
        
        for token_id, decoded_repr, decoded in token_data[-10:]:
            if len(decoded_repr) > 40:
                decoded_repr_display = decoded_repr[:37] + "..."
            else:
                decoded_repr_display = decoded_repr
            print(f"  {token_id:>10} {decoded_repr_display:>40}  '{decoded}'")
    
    print()

print("Survey function defined. Ready to iterate over all 13 black holes.")

Survey function defined. Ready to iterate over all 13 black holes.
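
`classify_character` decides the script by searching for substrings in the Unicode character name, so it helps to see what those names look like. The sketch below prints the category and name for a few sample characters using only the standard `unicodedata` module; `script_prefix` is a hypothetical helper. As far as I am aware, CJK ideograph names begin with `CJK UNIFIED IDEOGRAPH` and do not contain `HANZI` or `KANJI`, and kana and Hangul names match none of the listed scripts, so those characters would be reported as 'Other Letter'.

```python
import unicodedata


def script_prefix(ch):
    """First word of the Unicode name, or '<UNNAMED>' if the character has no name."""
    return unicodedata.name(ch, '<UNNAMED>').split()[0]


# Latin, Cyrillic, Hebrew, Thai, CJK ideograph, Hiragana, Hangul
for ch in ['a', 'ж', 'א', 'ก', '中', 'あ', '한']:
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch, '<UNNAMED>'))
```

Using the first word of the name as a script label is only a rough heuristic, but it makes gaps in the substring list easy to spot.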


## Step 5: Generate Reports for All 13 Black Holes

In [7]:
print("\n" + "#"*80)
print("# COMPREHENSIVE BLACK HOLE SURVEY")
print("#"*80)

for i in range(n_holes):
    survey_black_hole(i)

print("\n" + "#"*80)
print("# END OF SURVEY")
print("#"*80)


################################################################################
# COMPREHENSIVE BLACK HOLE SURVEY
################################################################################

BLACK HOLE #1: 814 TOKENS

**IDENTITY**
  Representative token ID: 80091
  Cluster size: 814 tokens

**SPATIAL PROPERTIES**
  Norm (distance from origin): 0.370916754007
  Distance from constellation centroid: 0.000019422885
  Deviation from mean norm: -8.940696716309e-08

**RELATIONSHIPS TO OTHER BLACK HOLES**
  To BH     Euclidean Dist    Angular Sep (°)
--------------------------------------------------
#     2     0.000034137182           0.000000
#     3     0.000034975299           0.000000
#     4     0.000015258789           0.000000
#     5     0.000017059845           0.000000
#     6     0.000017192635           0.000000
#     7     0.000034975306           0.000000
#     8     0.000046574147           0.000000
#     9     0.000000007451           0.000000
#    10     0.0000705852

## Summary

We've surveyed all 13 primordial black holes in detail.

**Key observations to look for:**
1. **Black hole #9** - Is it significantly more distant from the others?
2. **Token types** - Are certain clusters dominated by specific scripts (Thai, CJK, Hebrew)?
3. **Angular clustering** - Within the constellation, are black holes angularly close or spread out?
4. **Outliers** - Do any black holes have unusual properties?
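
On the angular question, every separation visible in the survey output prints as 0.000000°. With Euclidean distances of order 1e-5 at radius ≈ 0.37, the true angles are at most about 1e-4 rad, where cos θ differs from 1 by less than float32 can represent, so those zeros may be a precision artifact of `acos` rather than exact alignment. A minimal sketch of a more stable calculation follows, using the chord-based `atan2` formula in float64. The function names and the sample vectors are illustrative assumptions, not values from the notebook.

```python
import numpy as np


def angular_separation_deg(u, v):
    """Angle between u and v in degrees, stable for nearly parallel vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    # theta = 2 * atan2(|u_hat - v_hat|, |u_hat + v_hat|) avoids acos near cos = 1
    half_angle = np.arctan2(np.linalg.norm(u_hat - v_hat), np.linalg.norm(u_hat + v_hat))
    return float(np.degrees(2.0 * half_angle))


def naive_angle_deg_float32(u, v):
    """acos-of-cosine in float32, mirroring the approach used in Step 3."""
    u = np.asarray(u, dtype=np.float32)
    v = np.asarray(v, dtype=np.float32)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


# Two made-up vectors at radius 0.37, separated by 3e-5 perpendicular to the radius
u = [0.37, 0.0, 0.0]
v = [0.37, 3.0e-5, 0.0]
print(f"stable float64: {angular_separation_deg(u, v):.6f} deg")
print(f"naive float32:  {naive_angle_deg_float32(u, v):.6f} deg")
```

Applying the stable version to the 13 representative embeddings (cast to float64) would show whether the black holes are truly collinear with the origin or merely very close in angle.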

All 13 black holes lie at essentially the same radius (r ≈ 0.3709, with an RMS spread of about 2.6e-5 around their centroid). This suggests a common formation mechanism: most likely these tokens were initialized near this radius and received little or no training signal, leaving them frozen in place.