# 1.4h: Extract Cluster Tokens

This notebook identifies and saves the token IDs belonging to the dense cluster.

## The Question

In 1.4g we identified a giant connected component of 2,212 tokens. It forms a tight cluster with diameter ~0.0016 and is separated by a void from 18,161 ambient singletons.

Now we need to **identify these tokens by their vocabulary IDs** so we can:
- Save them for future analysis
- Decode them to see what text they represent
- Visualize them on sky maps to confirm spatial coherence
- Analyze their internal structure

## Method

We'll:
1. Rebuild the adjacency graph (same threshold as 1.4g)
2. Find connected components
3. Extract token IDs for the giant cluster
4. Save to disk
5. Verify the saved data
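
Steps 1 to 3 can be sketched end to end on a toy distance matrix. This is a minimal illustration rather than the notebook's actual data: the helper name `largest_component_ids`, the toy distances, and the toy IDs are all hypothetical. Only the thresholding logic and the `connected_components` call mirror the cells that follow.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components


def largest_component_ids(dists, token_ids, threshold):
    """Return the token IDs in the largest connected component.

    Hypothetical helper for illustration; the notebook runs these steps inline.
    """
    # Step 1: two tokens are adjacent when their distance is below the threshold
    adjacency = dists < threshold
    np.fill_diagonal(adjacency, False)  # no self-loops

    # Step 2: label every node with its connected-component index
    graph = sp.csr_matrix(adjacency.astype(np.int8))
    _, labels = connected_components(graph, directed=False)

    # Step 3: pick the most populous label and map its rows back to token IDs
    giant_label = np.bincount(labels).argmax()
    return token_ids[labels == giant_label]


# Toy data: the first three tokens sit within 0.001 of each other,
# the fourth is far away from all of them
toy_dists = np.array([
    [0.000, 0.001, 0.001, 0.500],
    [0.001, 0.000, 0.001, 0.500],
    [0.001, 0.001, 0.000, 0.500],
    [0.500, 0.500, 0.500, 0.000],
])
toy_ids = np.array([10, 11, 12, 9000])  # made-up IDs, not real vocabulary entries

toy_cluster = largest_component_ids(toy_dists, toy_ids, threshold=0.002)
print(toy_cluster.tolist())  # → [10, 11, 12]
```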

## Parameters

In [9]:
# Model to analyze
MODEL_NAME = "Qwen3-4B-Instruct-2507"

# Adjacency threshold (must match 1.4g)
ADJACENCY_THRESHOLD = 0.002

## Imports

In [10]:
import torch
import numpy as np
from safetensors.torch import load_file, save_file
from pathlib import Path
from collections import Counter
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

## Load Distance Matrix

In [11]:
# Load distances and token IDs from 1.4b
tensor_path = Path(f"../tensors/{MODEL_NAME}/1.4b_overdensity_distances.safetensors")
data = load_file(tensor_path)
dists = data['distances']
spike_token_ids = data['spike_token_ids']

print(f"Loaded distance matrix from {tensor_path}")
print(f"  Shape: {dists.shape}")
print(f"  Spike tokens: {len(spike_token_ids):,}")

Loaded distance matrix from ../tensors/Qwen3-4B-Instruct-2507/1.4b_overdensity_distances.safetensors
  Shape: torch.Size([20373, 20373])
  Spike tokens: 20,373


## Build Adjacency Graph

In [12]:
print(f"\nBuilding adjacency matrix with threshold = {ADJACENCY_THRESHOLD}...\n")

# Create binary adjacency matrix
adjacency = (dists < ADJACENCY_THRESHOLD).float()
adjacency.fill_diagonal_(0)

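# Each undirected edge appears twice in the symmetric matrix. Accumulate in
# float64 so the count stays exact beyond float32's 2**24 integer limit.
num_edges = int(adjacency.sum(dtype=torch.float64).item()) // 2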

print(f"✓ Adjacency matrix created")
print(f"  Nodes: {len(adjacency):,}")
print(f"  Edges: {int(num_edges):,}")


Building adjacency matrix with threshold = 0.002...

✓ Adjacency matrix created
  Nodes: 20,373
  Edges: 2,445,366
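
The dense float32 adjacency matrix built above occupies 20,373² × 4 bytes, roughly 1.66 GB, although only about 4.9 million of its entries (two per edge) are nonzero. One possible alternative is to build the sparse matrix directly from the index pairs. The sketch below assumes the distances are available as a NumPy array (for the notebook's tensor, something like `dists.numpy()`); `sparse_adjacency` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import scipy.sparse as sp


def sparse_adjacency(dists, threshold):
    """Build a symmetric sparse 0/1 adjacency matrix from a dense distance matrix."""
    n = dists.shape[0]
    # Row/column indices of every pair closer than the threshold
    rows, cols = np.nonzero(dists < threshold)
    # Drop the diagonal: a token is not its own neighbour
    off_diag = rows != cols
    rows, cols = rows[off_diag], cols[off_diag]
    data = np.ones(len(rows), dtype=np.int8)
    return sp.csr_matrix((data, (rows, cols)), shape=(n, n))


# Toy example: nodes 0 and 1 are close, node 2 is isolated
toy = np.array([
    [0.0, 0.001, 0.9],
    [0.001, 0.0, 0.9],
    [0.9, 0.9, 0.0],
])
adj = sparse_adjacency(toy, threshold=0.002)
num_edges_sparse = adj.nnz // 2  # each undirected edge is stored twice
print(num_edges_sparse)  # → 1
```

The comparison `dists < threshold` still allocates a full boolean mask, so this saves the float copy rather than all dense storage. The result can be passed straight to `connected_components`.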


## Find Connected Components

In [13]:
print(f"\nFinding connected components...\n")

# Convert to scipy sparse matrix
adjacency_sparse = sp.csr_matrix(adjacency.numpy())

# Find connected components
n_components, labels = connected_components(
    csgraph=adjacency_sparse, 
    directed=False, 
    return_labels=True
)

print(f"✓ Found {n_components:,} connected components")

# Count component sizes
component_sizes = Counter(labels)
sorted_components = sorted(component_sizes.items(), key=lambda x: x[1], reverse=True)

print(f"\nLargest component: {sorted_components[0][1]:,} tokens")
print(f"Singletons: {sum(1 for _, size in sorted_components if size == 1):,}")


Finding connected components...

✓ Found 18,162 connected components

Largest component: 2,212 tokens
Singletons: 18,161
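
These counts allow a quick cross-check. One giant component plus 18,161 singletons accounts for all 18,162 components, and singletons have no edges, so all 2,445,366 edges must lie inside the 2,212-token component. The arithmetic below uses only the numbers printed above and shows that this equals the maximum possible edge count for a component of that size. In other words, every pair of cluster tokens is closer than the 0.002 threshold, which is consistent with the ~0.0016 diameter reported in 1.4g.

```python
from math import comb

cluster_size = 2_212        # largest component, from the output above
observed_edges = 2_445_366  # edge count from the adjacency cell

# Number of edges in a complete graph on 2,212 nodes
max_edges = comb(cluster_size, 2)
print(max_edges, max_edges == observed_edges)  # → 2445366 True
```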


## Extract Giant Cluster Token IDs

In [14]:
print(f"\nExtracting giant cluster token IDs...\n")

# Get the giant cluster ID (largest component)
giant_cluster_id = sorted_components[0][0]
giant_cluster_size = sorted_components[0][1]

print(f"Giant cluster ID: {giant_cluster_id}")
print(f"Giant cluster size: {giant_cluster_size:,} tokens")
print()

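# Create boolean mask over the spike tokens: True for tokens in the giant cluster
cluster_mask = (labels == giant_cluster_id)

# Map mask positions to token IDs (these are indices into the full 151,936 vocabulary).
# labels is a NumPy array, so convert the mask before indexing the torch tensor.
cluster_token_ids = spike_token_ids[torch.from_numpy(cluster_mask)]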

print(f"✓ Extracted {len(cluster_token_ids):,} cluster token IDs")
print(f"  Token ID range: [{cluster_token_ids.min().item()}, {cluster_token_ids.max().item()}]")
print(f"  First 20 IDs: {cluster_token_ids[:20].tolist()}")


Extracting giant cluster token IDs...

Giant cluster ID: 0
Giant cluster size: 2,212 tokens

✓ Extracted 2,212 cluster token IDs
  Token ID range: [124, 151935]
  First 20 IDs: [124, 125, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 77150, 80091, 83971, 119346, 119347, 119348, 119349]


## Save Cluster Data

In [15]:
print(f"\nSaving cluster data...\n")

# Prepare output
output_dir = Path(f"../tensors/{MODEL_NAME}")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "1.4h_cluster_tokens.safetensors"

# Save token IDs, mask, and the threshold used to build the graph
save_file({
    'cluster_token_ids': cluster_token_ids.cpu(),
    'cluster_mask': torch.tensor(cluster_mask, dtype=torch.bool),
    'adjacency_threshold': torch.tensor([ADJACENCY_THRESHOLD], dtype=torch.float32),
}, output_path)

print(f"✓ Saved to {output_path}")
print(f"  Size: {output_path.stat().st_size / 1024:.1f} KB")
print()
print(f"Saved data:")
print(f"  cluster_token_ids: {len(cluster_token_ids):,} token IDs in full vocabulary")
print(f"  cluster_mask: {cluster_mask.sum():,} True values (mask into spike_token_ids)")
print(f"  adjacency_threshold: {ADJACENCY_THRESHOLD} (for reference)")


Saving cluster data...

✓ Saved to ../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors
  Size: 37.4 KB

Saved data:
  cluster_token_ids: 2,212 token IDs in full vocabulary
  cluster_mask: 2,212 True values (mask into spike_token_ids)
  adjacency_threshold: 0.002 (for reference)


## Verify Saved Data

In [16]:
print(f"\nVerifying saved data...\n")

# Load back
verification = load_file(output_path)
loaded_ids = verification['cluster_token_ids']
loaded_mask = verification['cluster_mask']
loaded_threshold = verification['adjacency_threshold']

# Verify counts match
assert len(loaded_ids) == giant_cluster_size, f"Count mismatch: {len(loaded_ids)} != {giant_cluster_size}"
assert loaded_mask.sum().item() == giant_cluster_size, f"Mask count mismatch"
assert torch.all(loaded_ids == cluster_token_ids), "Token IDs don't match"

# Use approximate equality for float comparison
assert torch.allclose(loaded_threshold, torch.tensor([ADJACENCY_THRESHOLD], dtype=torch.float32)), \
    f"Threshold doesn't match: {loaded_threshold.item()} != {ADJACENCY_THRESHOLD}"

print(f"✓ Verification passed")
print(f"  Token IDs: {len(loaded_ids):,}")
print(f"  Mask True count: {loaded_mask.sum().item():,}")
print(f"  Threshold: {loaded_threshold.item()}")
print()
print(f"All checks passed! Data is ready for future analysis.")


Verifying saved data...

✓ Verification passed
  Token IDs: 2,212
  Mask True count: 2,212
  Threshold: 0.0020000000949949026

All checks passed! Data is ready for future analysis.
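
The assertions above confirm the counts and the ID values, but not that the saved mask and the saved IDs describe the same tokens. One further check is to re-apply the mask to the spike token IDs and compare the result with the saved IDs. The sketch below uses NumPy stand-ins for the tensors; `mask_matches_ids` is a hypothetical helper and the arrays are toy values.

```python
import numpy as np


def mask_matches_ids(spike_ids, mask, cluster_ids):
    """True when applying `mask` to `spike_ids` reproduces `cluster_ids` exactly."""
    if mask.shape != spike_ids.shape:
        return False  # the mask must index the spike-token array one-to-one
    return np.array_equal(spike_ids[mask], cluster_ids)


toy_spike_ids = np.array([10, 11, 500, 12])
toy_mask = np.array([True, True, False, True])

ok = mask_matches_ids(toy_spike_ids, toy_mask, np.array([10, 11, 12]))
bad = mask_matches_ids(toy_spike_ids, toy_mask, np.array([10, 500, 12]))
print(ok, bad)  # → True False
```

In the notebook itself, the equivalent comparison would presumably be `torch.equal(spike_token_ids[loaded_mask], loaded_ids)`.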


## Summary

We've successfully identified and saved the dense cluster tokens:

- **Cluster size**: 2,212 tokens
- **Diameter**: ~0.0016 (from 1.4g)
- **Void boundary**: ~0.0036 (from 1.4g)
- **Saved to**: `../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors`

**What's included:**
- `cluster_token_ids`: Token IDs in the full 151,936 vocabulary
- `cluster_mask`: Boolean mask (True for cluster tokens) into the 20,373 spike tokens
- `adjacency_threshold`: The 0.002 threshold used for clustering

**Next steps:**
- Decode these tokens to see what text they represent
- Highlight them on the sky maps (1.3a/1.3b) in a distinct color
- Analyze their internal structure (are the black holes within this cluster?)
- Compare to other models
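
The first of these steps, decoding, might look like the sketch below. It accepts any `decode` callable so that it does not depend on a particular tokenizer library. `decode_cluster` and the stand-in vocabulary are hypothetical. With Hugging Face `transformers`, one option (an assumption, not something this notebook does) would be to pass a function wrapping `tokenizer.decode([token_id])` for the tokenizer that matches `MODEL_NAME`.

```python
from typing import Callable, Dict, Iterable


def decode_cluster(token_ids: Iterable[int], decode: Callable[[int], str]) -> Dict[int, str]:
    """Map each token ID to its decoded text.

    The text is wrapped in repr() so whitespace and control characters stay visible.
    """
    return {int(tid): repr(decode(int(tid))) for tid in token_ids}


# Stand-in vocabulary for illustration only; real IDs need the model's tokenizer
fake_vocab = {1: "a", 2: " ", 3: "\n"}

decoded = decode_cluster([1, 2, 3], fake_vocab.__getitem__)
for tid, text in decoded.items():
    print(f"{tid:>6}  {text}")
```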