# 07.3m: Singularity Survey

**Search an unembedding matrix for singularities (bit-for-bit identical vectors)**

An earlier notebook found that Qwen3-4B-Instruct-2507 has 2,100 "black hole" tokens that deduplicate to just 13 unique vectors. These represent dead vocabulary: tokens the tokenizer never emits, so their embeddings were never trained.

Question: Is this unique to Qwen, or do other models have similar structures?

This notebook loads an arbitrary model's unembedding matrix and searches for non-unique vectors with a single `torch.unique(dim=0)` call, which avoids an explicit pairwise comparison of all token vectors.

## Method

1. Load the model's `lm_head` weight matrix (the unembedding matrix)
2. Use `torch.unique()` to find all unique vectors
3. Identify singularities: groups of 2+ tokens sharing identical vectors
4. Report statistics and optionally decode tokens in each group
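
The steps above can be sketched without any dependencies. This is a pure-Python illustration, not the notebook's implementation: `find_singularities` is a hypothetical helper, and keying each row on its packed float32 bytes is a dictionary-based stand-in for the `torch.unique(dim=0)` call used in the cells below.

```python
import struct
from collections import defaultdict


def find_singularities(rows):
    """Group row indices by exact float32 bit pattern and keep groups of 2+."""
    groups = defaultdict(list)
    for token_id, row in enumerate(rows):
        # Packing as little-endian float32 gives a bit-exact, hashable key.
        key = struct.pack(f"<{len(row)}f", *row)
        groups[key].append(token_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Toy "unembedding matrix": 6 tokens, 2 dimensions (made-up values).
toy = [
    [0.5, -1.0],
    [0.25, 2.0],
    [0.5, -1.0],
    [0.5, -1.0],
    [0.25, 2.0],
    [3.0, 3.0],
]

singularities = find_singularities(toy)
print(singularities)  # [[0, 2, 3], [1, 4]]
```

Token 5 has a vector of its own, so it does not appear in any group. Note that a byte key distinguishes `0.0` from `-0.0`, whereas a value-based comparison treats them as equal, so the two approaches can differ on that edge case.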

## Parameters

In [49]:
# Model to analyze
MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"  # Change this to test different models

# Display options
DECODE_TOKENS = True  # Whether to decode and display tokens in singularity groups
MAX_TOKENS_PER_GROUP = 20  # Limit display for very large singularity groups
RANDOM_SEED = 42

## Imports

In [50]:
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm.auto import tqdm
from collections import defaultdict

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

## Load Model

We only need the `lm_head` (unembedding matrix), but `from_pretrained` loads the full model to get it, so large checkpoints take a while.

In [51]:
print(f"Loading model: {MODEL_NAME}")
print("(This may take a while for large models...)\n")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="cpu"
)

print("✓ Model loaded successfully")

Loading model: Qwen/Qwen2.5-3B-Instruct
(This may take a while for large models...)



✓ Model loaded successfully


## Extract Unembedding Matrix

In [52]:
# Get unembedding matrix (typically model.lm_head.weight)
gamma = model.lm_head.weight.data.clone().to(torch.float32)
vocab_size, hidden_dim = gamma.shape

print(f"Unembedding matrix shape: {gamma.shape}")
print(f"Vocabulary size: {vocab_size:,}")
print(f"Hidden dimension: {hidden_dim:,}")
print(f"Total parameters: {vocab_size * hidden_dim:,}")
print(f"Memory footprint: {gamma.element_size() * gamma.numel() / 1e9:.2f} GB (float32)")

Unembedding matrix shape: torch.Size([151936, 2048])
Vocabulary size: 151,936
Hidden dimension: 2,048
Total parameters: 311,164,928
Memory footprint: 1.24 GB (float32)
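
The weights are loaded as bfloat16 and upcast to float32 before the comparison. That upcast should neither create nor destroy duplicates, because a bfloat16 value is the top 16 bits of a float32 and widening only appends 16 zero bits. A small standard-library sketch of that argument follows; the helper names are illustrative and not part of any library, and NaN patterns are left out of the demonstration.

```python
import struct


def bf16_bits_to_float(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def float_to_bf16_bits(x: float) -> int:
    """Take the top 16 bits of the float32 encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


# A few bfloat16 patterns: +0, -0, 1.0, -1.0, 3.140625, smallest subnormal.
patterns = [0x0000, 0x8000, 0x3F80, 0xBF80, 0x4049, 0x0001]

# Widening then truncating returns the original bits, so the mapping is
# one-to-one: equal bfloat16 rows stay equal, distinct rows stay distinct.
round_trip = [float_to_bf16_bits(bf16_bits_to_float(p)) for p in patterns]
print(round_trip == patterns)  # True
```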


## Find Singularities

Use `torch.unique()` to find all unique vectors and identify groups of tokens that share identical vectors.

In [53]:
print("Searching for singularities...\n")

# Find unique vectors
unique_vectors, inverse_indices, counts = torch.unique(
    gamma,
    dim=0,
    return_inverse=True,
    return_counts=True
)

n_unique = len(unique_vectors)
n_total = vocab_size
# Redundant copies: every token beyond the first in each group of identical vectors
n_duplicate = n_total - n_unique

print(f"Total tokens: {n_total:,}")
print(f"Unique vectors: {n_unique:,}")
print(f"Duplicate tokens: {n_duplicate:,}")
print(f"Uniqueness: {100 * n_unique / n_total:.2f}%\n")

if n_duplicate == 0:
    print("✓ No singularities found. Every token has a unique vector.")
else:
    print(f"⚠ Found {n_duplicate:,} duplicate tokens")

Searching for singularities...

Total tokens: 151,936
Unique vectors: 149,784
Duplicate tokens: 2,152
Uniqueness: 98.58%

⚠ Found 2,152 duplicate tokens


## Analyze Singularity Groups

Group tokens by their shared vector and report statistics.

In [54]:
if n_duplicate > 0:
    # Build map from unique vector index to list of token IDs
    singularity_groups = defaultdict(list)
    
    counts_list = counts.tolist()  # plain ints avoid a tensor lookup per token
    for token_id, unique_idx in enumerate(inverse_indices.tolist()):
        if counts_list[unique_idx] > 1:  # Only include vectors shared by 2+ tokens
            singularity_groups[unique_idx].append(token_id)
    
    n_groups = len(singularity_groups)
    group_sizes = [len(tokens) for tokens in singularity_groups.values()]
    
    print(f"\nSingularity groups: {n_groups:,}")
    print(f"Largest group: {max(group_sizes):,} tokens")
    print(f"Smallest group: {min(group_sizes):,} tokens")
    print(f"Mean group size: {np.mean(group_sizes):.1f} tokens")
    print(f"Median group size: {np.median(group_sizes):.1f} tokens")
    
    # Histogram of group sizes
    size_counts = defaultdict(int)
    for size in group_sizes:
        size_counts[size] += 1
    
    print("\nGroup size distribution:")
    for size in sorted(size_counts.keys()):
        count = size_counts[size]
        print(f"  {size:4d} tokens: {count:4d} groups")


Singularity groups: 60
Largest group: 600 tokens
Smallest group: 2 tokens
Mean group size: 36.9 tokens
Median group size: 6.5 tokens

Group size distribution:
     2 tokens:   11 groups
     3 tokens:    3 groups
     4 tokens:    7 groups
     5 tokens:    5 groups
     6 tokens:    4 groups
     7 tokens:    3 groups
     8 tokens:    1 groups
     9 tokens:    2 groups
    10 tokens:    1 groups
    12 tokens:    1 groups
    14 tokens:    1 groups
    15 tokens:    1 groups
    16 tokens:    1 groups
    17 tokens:    1 groups
    19 tokens:    1 groups
    21 tokens:    2 groups
    25 tokens:    1 groups
    27 tokens:    1 groups
    32 tokens:    1 groups
    38 tokens:    1 groups
    46 tokens:    1 groups
    63 tokens:    1 groups
    92 tokens:    1 groups
    94 tokens:    1 groups
   112 tokens:    2 groups
   135 tokens:    1 groups
   144 tokens:    1 groups
   188 tokens:    1 groups
   204 tokens:    1 groups
   600 tokens:    1 groups
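
As a cross-check, the histogram above can be reconciled with the earlier counts. The sketch below re-enters the printed distribution by hand (the dictionary is transcribed from the output, not recomputed from the model) and shows how the 2,152 "duplicate tokens" relate to the number of tokens that actually sit inside a singularity group.

```python
from statistics import mean, median

# group size -> number of groups, transcribed from the output above
size_to_groups = {
    2: 11, 3: 3, 4: 7, 5: 5, 6: 4, 7: 3, 8: 1, 9: 2, 10: 1, 12: 1,
    14: 1, 15: 1, 16: 1, 17: 1, 19: 1, 21: 2, 25: 1, 27: 1, 32: 1, 38: 1,
    46: 1, 63: 1, 92: 1, 94: 1, 112: 2, 135: 1, 144: 1, 188: 1, 204: 1, 600: 1,
}

group_sizes = [size for size, n in size_to_groups.items() for _ in range(n)]

n_groups = len(group_sizes)            # 60
tokens_in_groups = sum(group_sizes)    # 2212
redundant = tokens_in_groups - n_groups  # 2152: one copy per group is kept

print(n_groups, tokens_in_groups, redundant)
print(round(mean(group_sizes), 1), median(group_sizes))  # 36.9 6.5
```

So 2,212 tokens share a vector with at least one other token, and 2,152 of them are redundant once each of the 60 groups is reduced to a single representative. That matches `n_total - n_unique` from the previous cell.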


## Decode Singularity Tokens

Load the tokenizer and decode the tokens in each singularity group.

In [55]:
if n_duplicate > 0 and DECODE_TOKENS:
    print(f"\nLoading tokenizer for {MODEL_NAME}...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    print("✓ Tokenizer loaded\n")
    
    # Sort groups by size (largest first)
    sorted_groups = sorted(
        singularity_groups.items(),
        key=lambda x: len(x[1]),
        reverse=True
    )
    
    print(f"Displaying up to {MAX_TOKENS_PER_GROUP} tokens per group\n")
    print("=" * 80)
    
    for group_idx, (unique_idx, token_ids) in enumerate(sorted_groups, 1):
        n_tokens = len(token_ids)
        print(f"\nGroup {group_idx}/{n_groups}: {n_tokens} tokens sharing vector #{unique_idx}")
        print("-" * 80)
        
        # Show first N tokens
        display_tokens = token_ids[:MAX_TOKENS_PER_GROUP]
        
        for token_id in display_tokens:
            try:
                token_str = tokenizer.decode([token_id])
                # Show repr for visibility of whitespace/control chars
                print(f"  Token {token_id:6d}: {repr(token_str)}")
            except Exception as e:
                print(f"  Token {token_id:6d}: <decode error: {e}>")
        
        if n_tokens > MAX_TOKENS_PER_GROUP:
            print(f"  ... and {n_tokens - MAX_TOKENS_PER_GROUP} more tokens")
        
        print("=" * 80)


Loading tokenizer for Qwen/Qwen2.5-3B-Instruct...
✓ Tokenizer loaded

Displaying up to 20 tokens per group


Group 1/60: 600 tokens sharing vector #77384
--------------------------------------------------------------------------------
  Token    177: '�'
  Token    187: '�'
  Token  77150: '１０'
  Token  80091: '２０'
  Token  83969: 'PostalCodesNL'
  Token 119346: '珊�'
  Token 123806: '�'
  Token 123828: 'ที่'
  Token 123870: '�'
  Token 123939: 'ร์'
  Token 123948: 'ติ'
  Token 124033: 'ไม่'
  Token 124055: 'ได้'
  Token 124139: 'นี้'
  Token 124175: 'คุ'
  Token 124254: 'กัน'
  Token 124258: 'ผู้'
  Token 124294: 'วั'
  Token 124311: 'ดี'
  Token 124361: 'ณ์'
  ... and 580 more tokens

Group 2/60: 204 tokens sharing vector #77422
--------------------------------------------------------------------------------
  Token 151554: '臘'
  Token 151555: '怒'
  Token 151556: '辰'
  Token 151618: '嘆'
  Token 151638: '見'
  Token 151649: '<|box_end|>'
  Token 151651: '<|quad_end|>'
  Token 151652: '
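
Several tokens above decode to `'�'` (U+FFFD). Assuming a byte-level BPE tokenizer, this is most likely a token whose bytes are an incomplete UTF-8 sequence, which the decoder replaces with the Unicode replacement character. The standard-library sketch below reproduces the effect with a Thai string taken from the output; it does not use the tokenizer itself.

```python
text = "ที่"                     # three code points, three UTF-8 bytes each
raw = text.encode("utf-8")

# Two different truncated byte sequences, neither a complete character.
partial_a = raw[:1]
partial_b = raw[:2]

decoded_a = partial_a.decode("utf-8", errors="replace")
decoded_b = partial_b.decode("utf-8", errors="replace")

print(repr(decoded_a), repr(decoded_b))
print(partial_a != partial_b)  # different bytes, same-looking replacement text
```

The practical consequence is that the decoded string alone does not identify a token: distinct token IDs can print identically, which is why the cell above shows the token ID next to each `repr`.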

## Summary Statistics

In [56]:
print(f"\n{'='*80}")
print("SUMMARY")
print(f"{'='*80}")
print(f"Model: {MODEL_NAME}")
print(f"Vocabulary size: {vocab_size:,}")
print(f"Unique vectors: {n_unique:,}")
print(f"Duplicate tokens: {n_duplicate:,} ({100 * n_duplicate / n_total:.2f}%)")

if n_duplicate > 0:
    print(f"Singularity groups: {n_groups:,}")
    print(f"Largest singularity: {max(group_sizes):,} tokens")
    print(f"Deduplication ratio: {n_total / n_unique:.2f}x")
else:
    print("No singularities detected.")

print(f"{'='*80}")


SUMMARY
Model: Qwen/Qwen2.5-3B-Instruct
Vocabulary size: 151,936
Unique vectors: 149,784
Duplicate tokens: 2,152 (1.42%)
Singularity groups: 60
Largest singularity: 600 tokens
Deduplication ratio: 1.01x
