# 1.13a: Black Hole Detection

**Goal:** Find tokens with identical embeddings (black holes).

## Method

Use `torch.unique()` with `dim=0` to deduplicate the rows of the embedding matrix in one call.

- Two or more tokens whose embeddings are exactly equal in every dimension form a black hole
- Report how many tokens participate in black holes
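
A minimal sketch of the idea on a toy matrix, to show what `torch.unique(..., dim=0)` returns before it is applied to the real embeddings. The names `W_toy`, `shared`, and `in_black_hole` are made up for this illustration and do not appear elsewhere in the notebook.

```python
import torch

# Hypothetical toy "embedding matrix": rows 0, 2 and 4 are identical,
# rows 1 and 3 are identical, and row 5 is unique.
W_toy = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0],
    [1.0, 2.0],
    [3.0, 4.0],
    [1.0, 2.0],
    [5.0, 6.0],
])

# unique_rows: the distinct rows
# inverse[i]:  index into unique_rows for original row i
# counts[j]:   how many original rows equal unique_rows[j]
unique_rows, inverse, counts = torch.unique(
    W_toy, dim=0, return_inverse=True, return_counts=True
)

# A unique row with count > 1 is shared by several tokens.
shared = counts > 1

# Map the per-unique-row flag back onto every original row.
in_black_hole = shared[inverse]
token_ids = in_black_hole.nonzero(as_tuple=True)[0].tolist()

print(f"unique rows: {len(unique_rows)}")
print(f"black holes: {int(shared.sum())}")
print(f"tokens in black holes: {token_ids}")
```

Indexing `shared` with `inverse` flags every participating token in one step, which is one possible alternative to looping over the duplicated unique rows.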

## Parameters

In [None]:
# Tensor to analyze
TENSOR_FILE = "../tensors/Qwen3-4B-Instruct-2507/W.safetensors"
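TENSOR_KEY = "W"      # name of the tensor inside the safetensors file
TENSOR_INDEX = None   # optional index applied to the loaded tensor (None = use as-is)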

## Imports

In [2]:
import torch
from safetensors.torch import load_file
from pathlib import Path
from tqdm import tqdm

## Load Data

In [3]:
# Load tensor
data = load_file(TENSOR_FILE)
W = data[TENSOR_KEY]

# Apply indexing if specified
if TENSOR_INDEX is not None:
    W = W[TENSOR_INDEX]

n_vectors, n_dims = W.shape

print(f"✓ Loaded W from {Path(TENSOR_FILE).name}")
print(f"  Shape: {W.shape}")
print(f"  Dtype: {W.dtype}")
print(f"  Memory: ~{W.element_size() * W.numel() / 1024**3:.2f} GB")
print()
print(f"Analyzing {n_vectors:,} vectors in {n_dims:,} dimensions")

✓ Loaded W from W.safetensors
  Shape: torch.Size([151936, 2560])
  Dtype: torch.bfloat16
  Memory: ~0.72 GB

Analyzing 151,936 vectors in 2,560 dimensions


## Find Black Holes

In [4]:
print("\n" + "=" * 80)
print("BLACK HOLE DETECTION")
print("=" * 80)
print()

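# torch.unique is not implemented on MPS in Torch 2.8, so run it on CPU
print("Finding unique vectors...")
W_cpu = W.cpu()
# inverse_indices[i] is the row of W_unique that token i maps to;
# counts[j] is the number of tokens sharing unique row j
W_unique, inverse_indices, counts = torch.unique(W_cpu, dim=0, return_inverse=True, return_counts=True)

n_unique = len(W_unique)
# Redundant copies only: a group of k identical vectors contributes k - 1
n_duplicates = n_vectors - n_unique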

print(f"  ✓ Found {n_unique:,} unique vectors")
print(f"  ✓ {n_duplicates:,} vectors are duplicates")
print()

# Count tokens participating in black holes
duplicate_mask = counts > 1
n_black_hole_centroids = duplicate_mask.sum().item()

black_hole_tokens = []
if n_black_hole_centroids > 0:
    print(f"Found {n_black_hole_centroids} black hole centroids (unique vectors with duplicates)")
    print("Counting tokens...")
    
    black_hole_unique_ids = duplicate_mask.nonzero(as_tuple=True)[0]
    
    for unique_id in tqdm(black_hole_unique_ids, desc="Processing"):
        # Find all tokens that map to this unique vector
        tokens = (inverse_indices == unique_id).nonzero(as_tuple=True)[0].tolist()
        black_hole_tokens.extend(tokens)
    
    print()

n_black_hole_tokens = len(black_hole_tokens)

print("=" * 80)
print("RESULTS")
print("=" * 80)
print()
print(f"Black hole tokens: {n_black_hole_tokens:,} ({100 * n_black_hole_tokens / n_vectors:.2f}%)")

if n_black_hole_tokens == 0:
    print("\n✓ No black holes found. All vectors are unique.")
else:
    print(f"\n⚠️  {n_black_hole_tokens:,} tokens share embeddings with other tokens!")
    print(f"   Organized into {n_black_hole_centroids} black hole centroids")

print("\n" + "=" * 80)


BLACK HOLE DETECTION

Finding unique vectors...
  ✓ Found 149,849 unique vectors
  ✓ 2,087 vectors are duplicates

Found 13 black hole centroids (unique vectors with duplicates)
Counting tokens...


Processing: 100%|██████████| 13/13 [00:00<00:00, 6255.85it/s]


RESULTS

Black hole tokens: 2,100 (1.38%)

⚠️  2,100 tokens share embeddings with other tokens!
   Organized into 13 black hole centroids
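
The reported numbers can be cross-checked with a little arithmetic. A group of k identical vectors contributes k - 1 "duplicates" but k black hole tokens, so the token count should equal the duplicate count plus the number of centroids. The sketch below re-derives the figures from the output above; the variable names are chosen for this illustration only.

```python
# Figures copied from the outputs above.
n_vectors = 151_936
n_unique = 149_849
n_centroids = 13

# Redundant copies: every vector beyond the first in each group.
n_duplicates = n_vectors - n_unique

# Each centroid adds back the one copy that was not counted as a duplicate.
n_black_hole_tokens = n_duplicates + n_centroids

share = 100 * n_black_hole_tokens / n_vectors

print(f"duplicates:        {n_duplicates:,}")
print(f"black hole tokens: {n_black_hole_tokens:,} ({share:.2f}%)")
```

This gives 2,087 duplicates and 2,100 black hole tokens (1.38%), matching the results. It says nothing about how the 2,100 tokens are distributed across the 13 centroids; that would require inspecting `counts[duplicate_mask]`.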




