# 1.9a: Global Uniqueness Check

**Critical verification:** We've been analyzing 4 black holes in the core cluster. But have we actually verified that these are the ONLY duplicated vectors (vectors shared by two or more tokens) in the entire vocabulary?

**Method:**
1. Load W in native bfloat16 (no conversions)
2. Find all unique rows via exact equality
3. Group tokens by their vector (equivalence classes)
4. Report global black hole statistics
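
Steps 2 and 3 can be illustrated on a toy matrix. The sketch below is not the notebook's code: it assumes the rows are already available as `uint16` bit patterns (the array `bits` is made up for illustration) and uses each row's raw bytes as a dictionary key, so two rows land in the same class only if they are bit-for-bit identical.

```python
from collections import defaultdict

import numpy as np

# Hypothetical 5-token "vocabulary" with 3-dimensional rows, stored as raw
# 16-bit patterns (stand-ins for bfloat16 bits; the values are made up).
bits = np.array(
    [
        [0x3F80, 0x0000, 0x4000],  # token 0
        [0x3F80, 0x0000, 0x4000],  # token 1: identical to token 0
        [0x3F80, 0x0001, 0x4000],  # token 2: differs from token 0 by one bit
        [0xBF80, 0x4040, 0x0000],  # token 3
        [0x3F80, 0x0000, 0x4000],  # token 4: identical to token 0
    ],
    dtype=np.uint16,
)

# Group token IDs by the exact byte content of their row.
groups = defaultdict(list)
for token_id, row in enumerate(bits):
    groups[row.tobytes()].append(token_id)

# Equivalence classes, largest first; "shared" keeps classes with 2+ tokens.
classes = sorted(groups.values(), key=len, reverse=True)
shared = [ids for ids in classes if len(ids) > 1]

print(f"unique rows: {len(groups)}")
print(f"shared rows: {shared}")
```

One design point worth keeping in mind: byte keys compare bit patterns, so `+0.0` and `-0.0` would count as different rows, whereas a value-based comparison treats them as equal.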

**Expected results (if our analysis is correct):**
- Total unique vectors < 151,936
- In core cluster: exactly 4 black holes (vectors shared by multiple tokens)
- Populations: 866, 734, 329, 249

**Nightmare scenario:**
- Fewer than 4 black holes (our analysis overcounted)
- Populations don't match (wrong grouping)
- More black holes outside the core (incomplete analysis)

## Parameters

In [1]:
# Model to analyze
MODEL_NAME = "Qwen3-4B-Instruct-2507"

## Imports

In [2]:
import torch
import ml_dtypes
import numpy as np
from safetensors.torch import load_file
from pathlib import Path
from collections import defaultdict
from tqdm import tqdm

## Helper Functions

In [3]:
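def torch_bf16_to_numpy_bf16(tensor):
    """Reinterpret a PyTorch bfloat16 tensor as a NumPy array of ml_dtypes.bfloat16.

    The 16 stored bits are viewed, not converted, so values are preserved exactly.
    Needs a PyTorch version that provides torch.uint16.
    """
    return tensor.cpu().view(torch.uint16).numpy().view(ml_dtypes.bfloat16)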
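
The helper reinterprets stored bits rather than converting values, which is what keeps the comparison exact. As a NumPy-only illustration (not part of the notebook; both function names are hypothetical), the sketch below relies on the fact that a bfloat16 value occupies the upper 16 bits of the corresponding float32, so widening to float32 and taking the top half back returns the original bit pattern.

```python
import numpy as np


def bf16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Widen raw bfloat16 bit patterns (uint16) to float32 values, exactly."""
    return (bits.astype(np.uint32) << 16).view(np.float32)


def float32_to_bf16_bits(values: np.ndarray) -> np.ndarray:
    """Keep the upper 16 bits of each float32 (truncation, no rounding)."""
    return (values.view(np.uint32) >> 16).astype(np.uint16)


# Made-up bit patterns: 1.0, -2.0, 0.0, and the smallest positive subnormal.
bits = np.array([0x3F80, 0xC000, 0x0000, 0x0001], dtype=np.uint16)

widened = bf16_bits_to_float32(bits)
round_trip = float32_to_bf16_bits(widened)

print(widened[:2])                       # the first two values are 1.0 and -2.0
print(np.array_equal(bits, round_trip))  # bit patterns survive the round trip
```

Because the widening step is exact, two rows that differ in bfloat16 also differ after an upcast to float32, and two rows that are bitwise equal stay equal.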

## Load Data

In [4]:
# Load W in bfloat16 (NATIVE format, no conversions)
W_path = Path(f"../tensors/{MODEL_NAME}/W.safetensors")
W_bf16 = load_file(W_path)["W"]

print(f"Loaded W: {W_bf16.shape}")
print(f"Data type: {W_bf16.dtype}")
print(f"Total tokens: {W_bf16.shape[0]:,}")

Loaded W: torch.Size([151936, 2560])
Data type: torch.bfloat16
Total tokens: 151,936


In [5]:
# Load core mask for comparison
core_path = Path(f"../tensors/{MODEL_NAME}/1.8a_core.safetensors")
core_data = load_file(core_path)

core_mask = core_data["core_mask"].to(torch.bool)
n_core = core_data["n_core"].item()

print(f"\nCore cluster: {n_core:,} tokens")


Core cluster: 2,179 tokens


## Find All Unique Vectors (Global)

In [6]:
print("\nFinding all unique vectors (global search)...\n")
print("This may take a minute...\n")

# Convert entire W to numpy bfloat16 for hashing
W_np_bf16 = torch_bf16_to_numpy_bf16(W_bf16)

# Group tokens by their vector
# Use tuple of vector as key (hashable)
vector_groups = defaultdict(list)

for token_id in tqdm(range(W_bf16.shape[0]), desc="Grouping tokens"):
    vector = W_np_bf16[token_id]
    vector_key = tuple(vector)  # Convert to hashable tuple
    vector_groups[vector_key].append(token_id)

n_unique = len(vector_groups)
n_total = W_bf16.shape[0]

print(f"\n✓ Grouping complete")
print(f"\nUnique vectors: {n_unique:,}")
print(f"Total tokens: {n_total:,}")
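# Note: n_total - n_unique counts surplus tokens (every token beyond the first
# in each shared vector), not every token that belongs to a shared vector.
print(f"Degenerate tokens: {n_total - n_unique:,} ({(n_total - n_unique) / n_total * 100:.2f}%)")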


Finding all unique vectors (global search)...

This may take a minute...



Grouping tokens: 100%|██████████| 151936/151936 [00:09<00:00, 15357.79it/s]


✓ Grouping complete

Unique vectors: 149,849
Total tokens: 151,936
Degenerate tokens: 2,087 (1.37%)
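
The Python loop above builds one tuple of 2,560 scalars per token. If that ever becomes a bottleneck, one possible alternative (an assumption, not what this notebook ran) is to let NumPy group the rows on their `uint16` bit patterns with `np.unique(..., axis=0)`. The toy array `bits` below is made up; on real data one would pass something like `W_np_bf16.view(np.uint16)`.

```python
import numpy as np

# Hypothetical 6-token vocabulary with 2-dimensional rows (raw bit patterns).
bits = np.array(
    [[1, 2], [3, 4], [1, 2], [5, 6], [3, 4], [1, 2]],
    dtype=np.uint16,
)

# One row per distinct vector, a group label per token, and group sizes.
unique_rows, inverse, counts = np.unique(
    bits, axis=0, return_inverse=True, return_counts=True
)
inverse = inverse.reshape(-1)  # flatten defensively; shape has varied across NumPy versions

n_total = bits.shape[0]
n_unique = unique_rows.shape[0]

# Token IDs for each vector that is shared by more than one token.
shared = {
    int(g): np.flatnonzero(inverse == g).tolist()
    for g in np.flatnonzero(counts > 1)
}

print(f"unique: {n_unique}, surplus: {n_total - n_unique}")
print(shared)
```

`np.unique` sorts the rows, so its time and memory cost on the full matrix should be measured before adopting it; like byte keys, it also compares bit patterns rather than floating-point values.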





## Find Black Holes (Vectors with Multiple Tokens)

In [7]:
print("\nFinding black holes (degenerate vectors)...\n")

# Find all vectors with more than 1 token
black_holes = [(vector_key, token_ids) for vector_key, token_ids in vector_groups.items() 
               if len(token_ids) > 1]

# Sort by population (largest first)
black_holes.sort(key=lambda x: len(x[1]), reverse=True)

n_black_holes = len(black_holes)
total_degenerate = sum(len(token_ids) for _, token_ids in black_holes)

print(f"Black holes found: {n_black_holes}")
print(f"Total degenerate tokens: {total_degenerate:,}")
print()

# Show top 20 black holes
print("Top 20 black holes by population:")
print("Rank  Population  Sample Token IDs")
print("-" * 60)
for i, (vector_key, token_ids) in enumerate(black_holes[:20], 1):
    sample_ids = token_ids[:5]  # Show first 5 token IDs
    sample_str = ", ".join(str(tid) for tid in sample_ids)
    if len(token_ids) > 5:
        sample_str += ", ..."
    print(f"{i:4d}  {len(token_ids):10,}  {sample_str}")

print(f"\n✓ Black hole analysis complete")


Finding black holes (degenerate vectors)...

Black holes found: 13
Total degenerate tokens: 2,100

Top 20 black holes by population:
Rank  Population  Sample Token IDs
------------------------------------------------------------
   1         814  80091, 119346, 119348, 123806, 123828, ...
   2         704  125, 177, 178, 179, 181, ...
   3         306  124, 123876, 123948, 124076, 124129, ...
   4         228  124350, 124658, 125147, 125460, 126425, ...
   5          11  123939, 131955, 131957, 134988, 134991, ...
   6          10  119349, 125087, 126630, 137856, 138110, ...
   7           6  126268, 132713, 138041, 146501, 148028, ...
   8           5  132383, 132398, 139050, 142718, 142719
   9           4  135619, 138490, 140815, 143457
  10           4  136831, 138068, 138072, 139278
  11           3  180, 138979, 141503
  12           3  126775, 140303, 147056
  13           2  126816, 147836

✓ Black hole analysis complete
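
The two summary numbers printed so far measure different things. The grouping cell reports `n_total - n_unique` (2,087), which counts surplus tokens, while this cell counts every member of a shared vector (2,100). The gap equals the number of black holes, because each shared vector is itself counted once among the unique vectors. A quick consistency check using only the populations from the table above:

```python
# Populations copied from the "Top 20 black holes" table above.
populations = [814, 704, 306, 228, 11, 10, 6, 5, 4, 4, 3, 3, 2]
n_total = 151_936

n_black_holes = len(populations)
members = sum(populations)         # tokens that share a vector with another token
surplus = members - n_black_holes  # tokens beyond the first in each group
n_unique = n_total - surplus       # distinct vectors in the vocabulary

print(n_black_holes, members, surplus, n_unique)
```

The resulting values agree with the 13 black holes, 2,100 degenerate tokens, 2,087 surplus tokens, and 149,849 unique vectors reported in the cell outputs.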


## Focus on Core Cluster

In [8]:
print("\nAnalyzing core cluster...\n")

# Get core token IDs (plus a set, built once, for fast membership tests)
core_token_ids = torch.where(core_mask)[0].tolist()
core_token_set = set(core_token_ids)

# Find which black holes are in the core
core_black_holes = []
for vector_key, token_ids in black_holes:
    # Tokens from this black hole that fall inside the core
    overlap = core_token_set.intersection(token_ids)
    if overlap:
        # Count how many tokens from this BH are in the core
        n_in_core = len(overlap)
        core_black_holes.append((vector_key, token_ids, n_in_core))

print(f"Black holes with tokens in core: {len(core_black_holes)}")
print()

# Sort by number in core
core_black_holes.sort(key=lambda x: x[2], reverse=True)

print("Black holes in core cluster:")
print("Rank  Total Pop  In Core  Sample Token IDs")
print("-" * 70)
for i, (vector_key, token_ids, n_in_core) in enumerate(core_black_holes[:10], 1):
    sample_ids = token_ids[:5]
    sample_str = ", ".join(str(tid) for tid in sample_ids)
    if len(token_ids) > 5:
        sample_str += ", ..."
    print(f"{i:4d}  {len(token_ids):9,}  {n_in_core:7,}  {sample_str}")

print(f"\n✓ Core analysis complete")


Analyzing core cluster...

Black holes with tokens in core: 13

Black holes in core cluster:
Rank  Total Pop  In Core  Sample Token IDs
----------------------------------------------------------------------
   1        814      814  80091, 119346, 119348, 123806, 123828, ...
   2        704      704  125, 177, 178, 179, 181, ...
   3        306      306  124, 123876, 123948, 124076, 124129, ...
   4        228      228  124350, 124658, 125147, 125460, 126425, ...
   5         11       11  123939, 131955, 131957, 134988, 134991, ...
   6         10       10  119349, 125087, 126630, 137856, 138110, ...
   7          6        6  126268, 132713, 138041, 146501, 148028, ...
   8          5        5  132383, 132398, 139050, 142718, 142719
   9          4        4  135619, 138490, 140815, 143457
  10          4        4  136831, 138068, 138072, 139278

✓ Core analysis complete
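
The in-core count can also be read straight off the boolean mask by indexing it with each group's token IDs. This is a sketch with NumPy and made-up data rather than the notebook's tensors; the names mirror the cell above only for readability.

```python
import numpy as np

# Hypothetical 10-token vocabulary; tokens 2..6 form the "core".
core_mask = np.zeros(10, dtype=bool)
core_mask[2:7] = True

# Hypothetical black holes: lists of token IDs sharing one vector.
black_holes = [[2, 3, 4], [5, 9], [0, 1]]

rows = []
for token_ids in black_holes:
    ids = np.asarray(token_ids)
    n_in_core = int(core_mask[ids].sum())  # members that fall inside the core
    if n_in_core > 0:
        rows.append((len(token_ids), n_in_core))

# Sort by in-core count, largest first, as in the table above.
rows.sort(key=lambda r: r[1], reverse=True)
print(rows)
```

A group whose total population exceeds its in-core count straddles the core boundary, which is the case the "Total Pop" and "In Core" columns are meant to expose.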


## Verification Against Expected Results

In [9]:
print("\n" + "=" * 80)
print("VERIFICATION: COMPARING TO EXPECTED RESULTS")
print("=" * 80)
print()

# Expected results from 1.8e
expected_populations = [866, 734, 329, 249]
expected_total = sum(expected_populations)
expected_n_bh = 4

print("Expected (from 1.8e):")
print(f"  Number of black holes in core: {expected_n_bh}")
print(f"  Populations: {expected_populations}")
print(f"  Total degenerate in core: {expected_total:,}")
print()

# Actual results
actual_populations = [n_in_core for _, _, n_in_core in core_black_holes[:4]]
actual_total = sum(actual_populations)
actual_n_bh = len([bh for bh in core_black_holes if bh[2] > 1])

print("Actual (from global uniqueness check):")
print(f"  Number of black holes in core: {actual_n_bh}")
print(f"  Top 4 populations: {actual_populations}")
print(f"  Total in top 4: {actual_total:,}")
print()

# Check if they match
if actual_n_bh == expected_n_bh:
    print(f"✓ Number of black holes MATCHES ({actual_n_bh})")
else:
    print(f"✗ Number of black holes MISMATCH (expected {expected_n_bh}, found {actual_n_bh})")

if sorted(actual_populations) == sorted(expected_populations):
    print(f"✓ Populations MATCH: {sorted(actual_populations)}")
else:
    print(f"✗ Populations MISMATCH")
    print(f"  Expected: {sorted(expected_populations)}")
    print(f"  Found: {sorted(actual_populations)}")

if actual_total == expected_total:
    print(f"✓ Total degenerate tokens MATCHES ({actual_total:,})")
else:
    print(f"✗ Total degenerate tokens MISMATCH (expected {expected_total:,}, found {actual_total:,})")

print()

# Overall verdict
if (actual_n_bh == expected_n_bh and 
    sorted(actual_populations) == sorted(expected_populations) and 
    actual_total == expected_total):
    print("=" * 80)
    print("VERDICT: ✓✓✓ ANALYSIS VERIFIED ✓✓✓")
    print("=" * 80)
    print()
    print("The 4 black holes identified in 1.8e are EXACTLY the degenerate")
    print("vectors found by global uniqueness check in native bfloat16.")
    print()
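    # Note: this branch only confirms that counts and populations match;
    # it does not by itself test the float32 explanation printed below.
    print("Our analysis is CORRECT. The earlier float32 analysis showing")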
    print("13 black holes was an artifact of precision mismatch.")
else:
    print("=" * 80)
    print("VERDICT: ⚠ DISCREPANCY DETECTED ⚠")
    print("=" * 80)
    print()
    print("The global uniqueness check does NOT match our expected results.")
    print("Further investigation required.")

print()
print("=" * 80)


VERIFICATION: COMPARING TO EXPECTED RESULTS

Expected (from 1.8e):
  Number of black holes in core: 4
  Populations: [866, 734, 329, 249]
  Total degenerate in core: 2,178

Actual (from global uniqueness check):
  Number of black holes in core: 13
  Top 4 populations: [814, 704, 306, 228]
  Total in top 4: 2,052

✗ Number of black holes MISMATCH (expected 4, found 13)
✗ Populations MISMATCH
  Expected: [249, 329, 734, 866]
  Found: [228, 306, 704, 814]
✗ Total degenerate tokens MISMATCH (expected 2,178, found 2,052)

VERDICT: ⚠ DISCREPANCY DETECTED ⚠

The global uniqueness check does NOT match our expected results.
Further investigation required.
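
To see where the mismatch sits, it helps to compare the four largest groups one by one. The figures below are copied from the outputs above; the only computation is the per-group difference and two sums.

```python
expected = [866, 734, 329, 249]  # populations reported in 1.8e
actual = [814, 704, 306, 228]    # top 4 populations from the exact-equality check

# How many tokens each of the four large groups is missing.
shortfall = [e - a for e, a in zip(expected, actual)]
print(shortfall, sum(shortfall))

# The nine smaller black holes found by the global check.
small_groups = [11, 10, 6, 5, 4, 4, 3, 3, 2]
print(sum(small_groups))
```

Every one of the four large groups is smaller than expected, by 126 tokens in total, while the nine additional groups hold only 48 tokens. The extra groups alone therefore do not account for the difference, which is one concrete thread for the follow-up investigation.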



## Save Global Uniqueness Data

In [10]:
from safetensors.torch import save_file

print("\nSaving global uniqueness data...\n")

# Create a mapping: token_id -> black_hole_id (or -1 if unique)
token_to_bh = torch.full((W_bf16.shape[0],), -1, dtype=torch.int64)

for bh_id, (vector_key, token_ids) in enumerate(black_holes):
    for token_id in token_ids:
        token_to_bh[token_id] = bh_id

# Count unique vectors
n_unique_tensor = torch.tensor(n_unique, dtype=torch.int64)
n_black_holes_tensor = torch.tensor(n_black_holes, dtype=torch.int64)

# Save
output_path = Path(f"../tensors/{MODEL_NAME}/1.9a_global_uniqueness.safetensors")
save_file({
    'token_to_black_hole': token_to_bh,  # -1 = unique, >=0 = black hole ID
    'n_unique_vectors': n_unique_tensor,
    'n_black_holes': n_black_holes_tensor,
}, str(output_path))

print(f"✓ Saved to {output_path.name}")
print(f"  token_to_black_hole: {token_to_bh.shape}")
print(f"  n_unique_vectors: {n_unique}")
print(f"  n_black_holes: {n_black_holes}")


Saving global uniqueness data...

✓ Saved to 1.9a_global_uniqueness.safetensors
  token_to_black_hole: torch.Size([151936])
  n_unique_vectors: 149849
  n_black_holes: 13
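
Later notebooks can recover group sizes from the saved `token_to_black_hole` vector without re-running the grouping. A minimal NumPy sketch with a made-up mapping follows; the real tensor would first have to be loaded from the safetensors file and converted to an array.

```python
import numpy as np

# Hypothetical mapping for 8 tokens: -1 = unique vector, >= 0 = black hole ID.
token_to_bh = np.array([-1, 0, 0, 1, -1, 0, 1, -1], dtype=np.int64)

member = token_to_bh >= 0
populations = np.bincount(token_to_bh[member])   # size of each black hole
n_black_holes = populations.size
n_unique = int((~member).sum()) + n_black_holes  # singletons + one per black hole

print(populations.tolist(), n_black_holes, n_unique)
```

Because black hole IDs were assigned in order of decreasing population, `populations` comes back sorted largest first when computed from the real file.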
