# 07.3n: Qwen2.5 Black Hole Geometry

**Geometric analysis of the 60 black hole vectors in Qwen2.5-3B-Instruct**

We discovered that Qwen2.5-3B-Instruct has 2,152 duplicate tokens that collapse to 60 unique vectors (black holes). This is similar to Qwen3-4B-Instruct's 2,100 tokens → 13 unique vectors.

Key question: Are Qwen2.5's black holes co-located at the same tight scale as Qwen3's? If so, that's a fingerprint proving the same initialization procedure was used in both models, despite different hidden dimensions (2048 vs 2560).

## Analysis Plan

1. Load Qwen2.5-3B-Instruct embedding matrix
2. Identify the 60 unique black hole vectors
3. Compute centroid of full vocabulary
4. Compute centroid of black holes
5. Measure distance between centroids
6. Compute all pairwise distances between black holes
7. Compare scale to bfloat16 quantization (2× ULP threshold)

## Parameters

In [1]:
MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
RANDOM_SEED = 42

## Imports

In [2]:
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from tqdm.auto import tqdm

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

## Load Model and Extract Embedding Matrix

In [3]:
print(f"Loading model: {MODEL_NAME}\n")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="cpu"
)

# Extract unembedding matrix
gamma = model.lm_head.weight.data.clone().to(torch.float32)
vocab_size, hidden_dim = gamma.shape

print(f"✓ Model loaded")
print(f"Embedding matrix shape: {gamma.shape}")
print(f"Vocabulary size: {vocab_size:,}")
print(f"Hidden dimension: {hidden_dim:,}")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading model: Qwen/Qwen2.5-3B-Instruct



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✓ Model loaded
Embedding matrix shape: torch.Size([151936, 2048])
Vocabulary size: 151,936
Hidden dimension: 2,048


## Find Black Holes

Use `torch.unique()` to deduplicate and identify the 60 unique black hole vectors.

In [4]:
print("Finding black holes...\n")

unique_vectors, inverse_indices, counts = torch.unique(
    gamma,
    dim=0,
    return_inverse=True,
    return_counts=True
)

# Identify black hole vectors (shared by 2+ tokens)
black_hole_mask = counts > 1
black_hole_vectors = unique_vectors[black_hole_mask]
black_hole_counts = counts[black_hole_mask]

n_black_holes = len(black_hole_vectors)
n_duplicate_tokens = black_hole_counts.sum().item()

print(f"Black hole vectors: {n_black_holes:,}")
print(f"Duplicate tokens: {n_duplicate_tokens:,}")
print(f"Tokens per black hole (mean): {n_duplicate_tokens / n_black_holes:.1f}")
print(f"Tokens per black hole (median): {black_hole_counts.median().item():.0f}")
print(f"Largest black hole: {black_hole_counts.max().item()} tokens")

Finding black holes...

Black hole vectors: 60
Duplicate tokens: 2,212
Tokens per black hole (mean): 36.9
Tokens per black hole (median): 6
Largest black hole: 600 tokens


## Compute Centroids

In [5]:
# Full vocabulary centroid
centroid_full = gamma.mean(dim=0)
print(f"Full vocabulary centroid:")
print(f"  L2 norm: {centroid_full.norm().item():.6f}")
print(f"  Mean component: {centroid_full.mean().item():.6e}")
print(f"  Std component: {centroid_full.std().item():.6e}")

# Black hole centroid
centroid_bh = black_hole_vectors.mean(dim=0)
print(f"\nBlack hole centroid:")
print(f"  L2 norm: {centroid_bh.norm().item():.6f}")
print(f"  Mean component: {centroid_bh.mean().item():.6e}")
print(f"  Std component: {centroid_bh.std().item():.6e}")

# Distance between centroids
centroid_distance = (centroid_full - centroid_bh).norm().item()
print(f"\nDistance between centroids:")
print(f"  L2 distance: {centroid_distance:.6f}")
print(f"  As fraction of full centroid norm: {centroid_distance / centroid_full.norm().item():.4f}")

Full vocabulary centroid:
  L2 norm: 0.298720
  Mean component: -9.587228e-05
  Std component: 6.601759e-03

Black hole centroid:
  L2 norm: 0.404852
  Mean component: -4.744114e-05
  Std component: 8.948124e-03

Distance between centroids:
  L2 distance: 0.202457
  As fraction of full centroid norm: 0.6777


## Black Hole Pairwise Distances

Compute L2, L∞, and L1 distances between all pairs of black holes.

In [6]:
print(f"Computing pairwise distances for {n_black_holes} black holes...\n")

# Pairwise differences
v1 = black_hole_vectors.unsqueeze(1)  # (n, 1, d)
v2 = black_hole_vectors.unsqueeze(0)  # (1, n, d)
diffs = v1 - v2  # (n, n, d)

# L2 distances
l2_distances = torch.norm(diffs, p=2, dim=2)

# L∞ (Chebyshev) distances
l_inf_distances = torch.abs(diffs).max(dim=2)[0]

# L1 distances
l1_distances = torch.abs(diffs).sum(dim=2)

# Mask out diagonal (self-distances)
mask = ~torch.eye(n_black_holes, dtype=torch.bool)
l2_nonzero = l2_distances[mask]
l_inf_nonzero = l_inf_distances[mask]
l1_nonzero = l1_distances[mask]

print("L2 distances (Euclidean):")
print(f"  Min: {l2_nonzero.min().item():.6e}")
print(f"  Max: {l2_nonzero.max().item():.6e}")
print(f"  Mean: {l2_nonzero.mean().item():.6e}")
print(f"  Median: {l2_nonzero.median().item():.6e}")

print("\nL∞ distances (Chebyshev):")
print(f"  Min: {l_inf_nonzero.min().item():.6e}")
print(f"  Max: {l_inf_nonzero.max().item():.6e}")
print(f"  Mean: {l_inf_nonzero.mean().item():.6e}")
print(f"  Median: {l_inf_nonzero.median().item():.6e}")

print("\nL1 distances (Manhattan):")
print(f"  Min: {l1_nonzero.min().item():.6e}")
print(f"  Max: {l1_nonzero.max().item():.6e}")
print(f"  Mean: {l1_nonzero.mean().item():.6e}")
print(f"  Median: {l1_nonzero.median().item():.6e}")

Computing pairwise distances for 60 black holes...

L2 distances (Euclidean):
  Min: 1.490116e-08
  Max: 1.299978e-04
  Mean: 6.577558e-05
  Median: 7.954951e-05

L∞ distances (Chebyshev):
  Min: 1.490116e-08
  Max: 6.103516e-05
  Mean: 4.459896e-05
  Median: 6.103516e-05

L1 distances (Manhattan):
  Min: 1.490116e-08
  Max: 4.203320e-04
  Mean: 1.575693e-04
  Median: 1.775920e-04


## bfloat16 Quantization Analysis

Compare black hole separations to bfloat16 ULP (Unit in Last Place) to determine if they're at quantization scale.

For bfloat16:
- 1 sign bit
- 8 exponent bits
- 7 mantissa bits

ULP for a value x: `2^(floor(log2(|x|)) - 6)`

In [7]:
# Estimate typical ULP scale from black hole centroid components
typical_magnitude = torch.abs(centroid_bh).mean().item()
exponent = np.floor(np.log2(typical_magnitude))
ulp = 2 ** (exponent - 6)  # bfloat16 has 7 mantissa bits

print(f"Typical black hole component magnitude: {typical_magnitude:.6e}")
print(f"Estimated exponent: 2^{exponent:.0f}")
print(f"bfloat16 ULP at this scale: {ulp:.6e}")
print(f"2× ULP threshold: {2 * ulp:.6e}")

# Compare to observed distances
print(f"\nComparison to observed L∞ distances:")
print(f"  Min L∞ / ULP: {l_inf_nonzero.min().item() / ulp:.2f}")
print(f"  Max L∞ / ULP: {l_inf_nonzero.max().item() / ulp:.2f}")
print(f"  Mean L∞ / ULP: {l_inf_nonzero.mean().item() / ulp:.2f}")

# How many pairs are within 2× ULP?
within_2ulp = (l_inf_nonzero <= 2 * ulp).sum().item()
total_pairs = len(l_inf_nonzero)
pct_within_2ulp = 100 * within_2ulp / total_pairs

print(f"\nPairs within 2× ULP threshold:")
print(f"  Count: {within_2ulp:,} / {total_pairs:,}")
print(f"  Percentage: {pct_within_2ulp:.1f}%")

Typical black hole component magnitude: 5.927444e-03
Estimated exponent: 2^-8
bfloat16 ULP at this scale: 6.103516e-05
2× ULP threshold: 1.220703e-04

Comparison to observed L∞ distances:
  Min L∞ / ULP: 0.00
  Max L∞ / ULP: 1.00
  Mean L∞ / ULP: 0.73

Pairs within 2× ULP threshold:
  Count: 3,540 / 3,540
  Percentage: 100.0%


## Black Hole Norms

Analyze the L2 norms of black hole vectors.

In [8]:
black_hole_norms = torch.norm(black_hole_vectors, p=2, dim=1)

print("Black hole L2 norms:")
print(f"  Min: {black_hole_norms.min().item():.6f}")
print(f"  Max: {black_hole_norms.max().item():.6f}")
print(f"  Mean: {black_hole_norms.mean().item():.6f}")
print(f"  Median: {black_hole_norms.median().item():.6f}")
print(f"  Std: {black_hole_norms.std().item():.6f}")

# Compare to full vocabulary norms
full_norms = torch.norm(gamma, p=2, dim=1)
print(f"\nFull vocabulary L2 norms (for comparison):")
print(f"  Mean: {full_norms.mean().item():.6f}")
print(f"  Median: {full_norms.median().item():.6f}")
print(f"  Std: {full_norms.std().item():.6f}")

Black hole L2 norms:
  Min: 0.404852
  Max: 0.404856
  Mean: 0.404852
  Median: 0.404852
  Std: 0.000001

Full vocabulary L2 norms (for comparison):
  Mean: 1.088976
  Median: 1.113063
  Std: 0.159866


## Summary

In [9]:
print(f"{'='*80}")
print("QWEN2.5-3B-INSTRUCT BLACK HOLE GEOMETRY")
print(f"{'='*80}")
print(f"Model: {MODEL_NAME}")
print(f"Embedding dimensions: {vocab_size:,} × {hidden_dim:,}")
print()
print(f"Black holes: {n_black_holes} unique vectors")
print(f"Duplicate tokens: {n_duplicate_tokens:,}")
print(f"Deduplication ratio: {n_duplicate_tokens / n_black_holes:.1f}× average")
print()
print(f"Centroid separation: {centroid_distance:.6f}")
print(f"Black hole pairwise distances (L∞):")
print(f"  Range: [{l_inf_nonzero.min().item():.3e}, {l_inf_nonzero.max().item():.3e}]")
print(f"  Median: {l_inf_nonzero.median().item():.3e}")
print()
print(f"Quantization scale (bfloat16):")
print(f"  Typical ULP: {ulp:.3e}")
print(f"  Max L∞ / ULP: {l_inf_nonzero.max().item() / ulp:.1f}×")
print(f"  Pairs within 2× ULP: {pct_within_2ulp:.1f}%")
print(f"{'='*80}")

QWEN2.5-3B-INSTRUCT BLACK HOLE GEOMETRY
Model: Qwen/Qwen2.5-3B-Instruct
Embedding dimensions: 151,936 × 2,048

Black holes: 60 unique vectors
Duplicate tokens: 2,212
Deduplication ratio: 36.9× average

Centroid separation: 0.202457
Black hole pairwise distances (L∞):
  Range: [1.490e-08, 6.104e-05]
  Median: 6.104e-05

Quantization scale (bfloat16):
  Typical ULP: 6.104e-05
  Max L∞ / ULP: 1.0×
  Pairs within 2× ULP: 100.0%
