# Extract Core Tokens: The Crystallized Solid

In 1.11h we found that 124 unique vectors (2,211 tokens) are packed within 2 ULP of the Big One, with one outlier at 32 ULP.

This notebook:
- Identifies which tokens belong to the crystallized core (L∞ ≤ 3 ULP from the Big One, set by `MAX_CORE_RADIUS_ULP`)
- Saves them to a new safetensors file compatible with existing analysis notebooks
- Creates `1.11i_core_cluster_tokens.safetensors` (drop-in replacement for `1.4h_cluster_tokens.safetensors`)

With this cleaned dataset, we can re-run:
- 1.11g (power-of-two quantization - should be perfect now)
- 1.11c (visualization - cleaner structure)
- Any other analysis that was polluted by the outlier

## Parameters

In [103]:
# Paths
CLUSTER_TOKENS_PATH = '../tensors/Qwen3-4B-Instruct-2507/1.4h_cluster_tokens.safetensors'
GAMMA_PATH = '../tensors/Qwen3-4B-Instruct-2507/W.safetensors'
OUTPUT_PATH = '../tensors/Qwen3-4B-Instruct-2507/1.11i_core_cluster_tokens.safetensors'

# Core boundary (from 1.11h analysis)
MAX_CORE_RADIUS_ULP = 3.0  # Include everything within 3 ULP of Big One

## Imports

In [104]:
import torch
import numpy as np
from safetensors.torch import load_file, save_file
from pathlib import Path

## Device Detection

In [105]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f'Using device: {device}')

Using device: mps


## Load Data

In [106]:
# Load cluster token IDs
cluster_data = load_file(CLUSTER_TOKENS_PATH)
cluster_token_ids = cluster_data['cluster_token_ids'].to(device)

print(f'Loaded {len(cluster_token_ids)} cluster token IDs from 1.4h')

Loaded 2212 cluster token IDs from 1.4h


In [107]:
# Load gamma matrix in bfloat16
gamma_data = load_file(GAMMA_PATH)
W = gamma_data['W'].to(torch.bfloat16).to(device)

print(f'Loaded gamma matrix: {W.shape}')
print(f'Precision: {W.dtype}')

Loaded gamma matrix: torch.Size([151936, 2560])
Precision: torch.bfloat16


In [108]:
# Extract cluster vectors
cluster_vectors = W[cluster_token_ids]

print(f'Extracted {cluster_vectors.shape[0]} vectors of dimension {cluster_vectors.shape[1]}')

Extracted 2212 vectors of dimension 2560


## Find Unique Vectors and Big One

In [109]:
# Find unique vectors and their populations
unique_vectors, inverse_indices = torch.unique(cluster_vectors.to('cpu'), dim=0, return_inverse=True)
unique_vectors = unique_vectors.to(device)
inverse_indices = inverse_indices.to(device)

populations = torch.bincount(inverse_indices)

print(f'Found {len(unique_vectors)} unique vectors')
print(f'Population range: {populations.min().item()} to {populations.max().item()}')

Found 125 unique vectors
Population range: 1 to 814


In [110]:
# Find the Big One (814-token black hole)
big_one_idx = torch.argmax(populations).item()
big_one_population = populations[big_one_idx].item()
big_one_vector = unique_vectors[big_one_idx]

print(f'\nThe Big One:')
print(f'  Index: {big_one_idx}')
print(f'  Population: {big_one_population} tokens')
print(f'  Vector norm: {torch.norm(big_one_vector.to(torch.float32)).item():.6f}')


The Big One:
  Index: 60
  Population: 814 tokens
  Vector norm: 0.370917


## Compute L∞ Distances from Big One

In [111]:
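def bfloat16_ulp(x):
    """Return the spacing between adjacent bfloat16 values at magnitude x."""
    # bfloat16 stores 7 explicit mantissa bits, so ULP = 2**(exponent - 7).
    if x == 0:
        return 2**(-133)  # smallest subnormal: 2**(-126 - 7)
    # Clamp at the minimum normal exponent so subnormal inputs also give 2**-133.
    exponent = max(int(np.floor(np.log2(np.abs(x)))), -126)
    return 2**(exponent - 7)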

# Compute ULP at Big One scale
big_one_f32 = big_one_vector.to(torch.float32).cpu().numpy()
typical_value = np.median(np.abs(big_one_f32[big_one_f32 != 0]))
ulp_at_origin = bfloat16_ulp(typical_value)

print(f'ULP at Big One scale: {ulp_at_origin:.6e}')

ULP at Big One scale: 1.525879e-05
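
The closed form used in `bfloat16_ulp` can be cross-checked against the bit layout itself. The following is a minimal standard-library sketch and not part of the original notebook: the helper names (`bf16_bits`, `bf16_value`, `ulp_by_formula`, `ulp_by_bits`) and the sample magnitude `0.003` are assumptions chosen for illustration, and plain truncation stands in for a proper round-to-nearest conversion.

```python
import math
import struct

def bf16_bits(x: float) -> int:
    """Return a 16-bit bfloat16 pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack('<I', struct.pack('<f', x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits

def bf16_value(bits16: int) -> float:
    """Decode a 16-bit bfloat16 pattern back into a Python float."""
    (value,) = struct.unpack('<f', struct.pack('<I', (bits16 & 0xFFFF) << 16))
    return value

def ulp_by_formula(x: float) -> float:
    """Spacing from the closed form 2**(floor(log2|x|) - 7); normal values only."""
    return 2.0 ** (math.floor(math.log2(abs(x))) - 7)

def ulp_by_bits(x: float) -> float:
    """Spacing measured as the gap to the next representable bfloat16 value."""
    bits = bf16_bits(abs(x))
    return bf16_value(bits + 1) - bf16_value(bits)

x = 0.003  # hypothetical magnitude inside [2**-9, 2**-8)
print(f'formula: {ulp_by_formula(x):.6e}')  # formula: 1.525879e-05
print(f'bits:    {ulp_by_bits(x):.6e}')     # bits:    1.525879e-05
print(ulp_by_bits(0.0) == 2.0 ** -133)      # True
```

Both routes give 2^-16 for a value in this binade, and the bit-level route also reproduces the 2^-133 constant returned for zero.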


In [112]:
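# Compute L∞ distance from Big One to all unique vectors.
# Subtract in float32 so the differences are not re-rounded to bfloat16.
differences = unique_vectors.to(torch.float32) - big_one_vector.to(torch.float32)
linf_distances = differences.abs().max(dim=1).values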

# Convert to ULP units
linf_distances_f32 = linf_distances.to(torch.float32).cpu().numpy()
distances_in_ulp = linf_distances_f32 / ulp_at_origin

print(f'\nDistances in ULP units:')
print(f'  Range: [{distances_in_ulp.min():.2f}, {distances_in_ulp.max():.2f}]')
print(f'  Mean: {distances_in_ulp.mean():.2f}')


Distances in ULP units:
  Range: [0.00, 32.00]
  Mean: 1.61
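
To make the unit conversion concrete, here is a small pure-Python sketch of the same L∞-in-ULP measurement. The three-dimensional vectors and the name `linf_in_ulp` are hypothetical. The values are written as integer multiples of the ULP so that every step is exact in floating point.

```python
ULP = 2.0 ** -16  # spacing for values in [2**-9, 2**-8)

def linf_in_ulp(vector, reference, ulp):
    """Largest per-coordinate gap between vector and reference, in units of ulp."""
    return max(abs(a - b) for a, b in zip(vector, reference)) / ulp

# Hypothetical toy vectors; multipliers 128..255 keep |value| in one binade.
reference = [200 * ULP, -170 * ULP, 230 * ULP]
candidates = {
    'identical': [200 * ULP, -170 * ULP, 230 * ULP],
    'two_ulp': [202 * ULP, -170 * ULP, 229 * ULP],
    'thirty_two_ulp': [200 * ULP, -138 * ULP, 230 * ULP],
}

distances = {name: linf_in_ulp(vec, reference, ULP) for name, vec in candidates.items()}
print(distances)
# {'identical': 0.0, 'two_ulp': 2.0, 'thirty_two_ulp': 32.0}
```

Only the single largest coordinate gap matters: `two_ulp` differs in two coordinates (by 2 and by 1 ULP) but its distance is 2.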


## Identify Core Vectors

In [113]:
# Mask: True for unique vectors in the core
core_vector_mask = distances_in_ulp <= MAX_CORE_RADIUS_ULP

n_core_vectors = core_vector_mask.sum()
n_halo_vectors = (~core_vector_mask).sum()

print(f'Core boundary: L∞ ≤ {MAX_CORE_RADIUS_ULP} ULP')
print(f'\nUnique vectors:')
print(f'  Core: {n_core_vectors}')
print(f'  Halo: {n_halo_vectors}')
print(f'  Total: {len(unique_vectors)}')

Core boundary: L∞ ≤ 3.0 ULP

Unique vectors:
  Core: 122
  Halo: 3
  Total: 125


## Map to Original Token IDs

In [114]:
# inverse_indices[i] is the index of the unique vector that cluster token i maps to,
# so indexing the per-vector mask by it yields a per-token mask.
core_vector_mask_tensor = torch.from_numpy(core_vector_mask).to(device)
token_in_core = core_vector_mask_tensor[inverse_indices]

# Extract core token IDs
core_token_ids = cluster_token_ids[token_in_core]
halo_token_ids = cluster_token_ids[~token_in_core]

print(f'\nTokens:')
print(f'  Core: {len(core_token_ids)}')
print(f'  Halo: {len(halo_token_ids)}')
print(f'  Total: {len(cluster_token_ids)}')

print(f'\nCore token population: {len(core_token_ids)} / {len(cluster_token_ids)} '
      f'({100*len(core_token_ids)/len(cluster_token_ids):.2f}%)')


Tokens:
  Core: 2206
  Halo: 6
  Total: 2212

Core token population: 2206 / 2212 (99.73%)
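
The step from a per-vector mask to a per-token mask is easy to misread, so here is a toy version in plain Python. It is a sketch under stated assumptions: the token IDs, the two-dimensional vectors, and the decision to flag `(9, 9)` as the outlier are all made up, and list comprehensions stand in for the tensor indexing used in the notebook.

```python
from collections import Counter

# Hypothetical toy data: six token IDs sharing three distinct vectors.
token_ids = [11, 12, 13, 14, 15, 16]
token_vectors = [(1, 2), (1, 2), (1, 3), (1, 2), (9, 9), (1, 3)]

# Deduplicate, then record for each token which unique vector it uses.
unique_vectors = sorted(set(token_vectors))
index_of = {vec: i for i, vec in enumerate(unique_vectors)}
inverse_indices = [index_of[vec] for vec in token_vectors]
populations = Counter(inverse_indices)  # tokens per unique vector

# Pretend a distance test flagged (9, 9) as lying outside the core.
core_vector_mask = [vec != (9, 9) for vec in unique_vectors]

# Indexing the per-vector mask by inverse_indices yields a per-token mask.
token_in_core = [core_vector_mask[i] for i in inverse_indices]
core_token_ids = [t for t, keep in zip(token_ids, token_in_core) if keep]
halo_token_ids = [t for t, keep in zip(token_ids, token_in_core) if not keep]

print(inverse_indices)   # [0, 0, 1, 0, 2, 1]
print(core_token_ids)    # [11, 12, 13, 14, 16]
print(halo_token_ids)    # [15]
```

The mask has one entry per unique vector (three here), while the result has one entry per token (six here); `inverse_indices` is what bridges the two.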


## Verify Core Statistics

In [115]:
# Cross-check: per-vector populations should sum to the token counts found above
core_populations = populations[core_vector_mask_tensor]
halo_populations = populations[~core_vector_mask_tensor]

print(f'Population verification:')
print(f'  Core vectors population sum: {core_populations.sum().item()}')
print(f'  Halo vectors population sum: {halo_populations.sum().item()}')
print(f'  Total: {populations.sum().item()}')
print()
print(f'  Core: {core_populations.sum().item()} tokens across {len(core_populations)} unique vectors')
print(f'  Halo: {halo_populations.sum().item()} tokens across {len(halo_populations)} unique vectors')

Population verification:
  Core vectors population sum: 2206
  Halo vectors population sum: 6
  Total: 2212

  Core: 2206 tokens across 122 unique vectors
  Halo: 6 tokens across 3 unique vectors
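
The printed sums agree, but the same invariants could also be enforced in code rather than checked by eye. The helper below is a hypothetical addition (`check_partition` does not appear in the notebook) that works on plain Python lists; with the notebook's tensors one would presumably pass `.tolist()` versions of `cluster_token_ids`, `core_token_ids`, and `halo_token_ids`.

```python
def check_partition(all_ids, core_ids, halo_ids):
    """Raise ValueError unless core_ids and halo_ids split all_ids exactly."""
    core, halo = set(core_ids), set(halo_ids)
    if core & halo:
        raise ValueError(f'IDs in both core and halo: {sorted(core & halo)}')
    if len(core_ids) + len(halo_ids) != len(all_ids):
        raise ValueError('core and halo counts do not add up to the total')
    if core | halo != set(all_ids):
        raise ValueError('core and halo do not cover the original IDs')

# Hypothetical toy IDs: a valid split passes silently.
check_partition([1, 2, 3, 4], core_ids=[1, 2, 4], halo_ids=[3])

# An ID listed on both sides is rejected.
try:
    check_partition([1, 2, 3], core_ids=[1, 2], halo_ids=[2, 3])
except ValueError as err:
    print(err)  # IDs in both core and halo: [2]
```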


## Show Halo Outliers

In [116]:
if len(halo_token_ids) > 0:
    print(f'Halo outlier token IDs: {halo_token_ids.cpu().numpy().tolist()}')
    print(f'Halo outlier distances (ULP): {distances_in_ulp[~core_vector_mask].tolist()}')
else:
    print('No halo outliers (all tokens in core)')

Halo outlier token IDs: [83971, 136755, 136831, 138068, 138072, 139278]
Halo outlier distances (ULP): [32.0, 4.0, 4.0]
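
The printout lists six token IDs but only three distances, because distances are stored per unique vector rather than per token. A per-token pairing can be recovered through the inverse index. The sketch below shows the idea on made-up data; the function name `halo_tokens_with_distance` and all values are hypothetical. In the notebook the analogous lookup would presumably index `distances_in_ulp` by `inverse_indices` after moving the latter to the CPU.

```python
def halo_tokens_with_distance(token_ids, inverse_indices, vector_distances, radius):
    """Return (token_id, distance) pairs for tokens whose vector lies beyond radius."""
    pairs = []
    for token_id, vector_index in zip(token_ids, inverse_indices):
        distance = vector_distances[vector_index]  # look up the token's own vector
        if distance > radius:
            pairs.append((token_id, distance))
    return pairs

# Hypothetical toy data: five tokens mapped onto three unique vectors.
token_ids = [101, 102, 103, 104, 105]
inverse_indices = [0, 1, 0, 2, 2]
vector_distances = [0.0, 32.0, 4.0]  # one distance per unique vector, in ULP

pairs = halo_tokens_with_distance(token_ids, inverse_indices, vector_distances, radius=3.0)
print(pairs)  # [(102, 32.0), (104, 4.0), (105, 4.0)]
```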


## Save Core Cluster Data

In [117]:
print(f'\nSaving core cluster data...')

# Save in same format as 1.4h for compatibility
save_file({
    'cluster_token_ids': core_token_ids.cpu(),
    'max_core_radius_ulp': torch.tensor([MAX_CORE_RADIUS_ULP], dtype=torch.float32),
    'n_core_tokens': torch.tensor([len(core_token_ids)], dtype=torch.int64),
    'n_halo_tokens': torch.tensor([len(halo_token_ids)], dtype=torch.int64),
}, OUTPUT_PATH)

print(f'✓ Saved to {OUTPUT_PATH}')
print(f'  Size: {Path(OUTPUT_PATH).stat().st_size / 1024:.1f} KB')
print()
print(f'Saved data:')
print(f'  cluster_token_ids: {len(core_token_ids):,} core tokens')
print(f'  max_core_radius_ulp: {MAX_CORE_RADIUS_ULP} (boundary)')
print(f'  n_core_tokens: {len(core_token_ids)}')
print(f'  n_halo_tokens: {len(halo_token_ids)}')


Saving core cluster data...
✓ Saved to ../tensors/Qwen3-4B-Instruct-2507/1.11i_core_cluster_tokens.safetensors
  Size: 17.6 KB

Saved data:
  cluster_token_ids: 2,206 core tokens
  max_core_radius_ulp: 3.0 (boundary)
  n_core_tokens: 2206
  n_halo_tokens: 6


## Verify Saved Data

In [118]:
print(f'\nVerifying saved data...')

# Load back
verification = load_file(OUTPUT_PATH)
loaded_ids = verification['cluster_token_ids']
loaded_radius = verification['max_core_radius_ulp']
loaded_n_core = verification['n_core_tokens']
loaded_n_halo = verification['n_halo_tokens']

# Verify
assert len(loaded_ids) == len(core_token_ids), 'Count mismatch'
assert torch.all(loaded_ids == core_token_ids.cpu()), 'Token IDs do not match'
assert loaded_n_core.item() == len(core_token_ids), 'Core count mismatch'
assert loaded_n_halo.item() == len(halo_token_ids), 'Halo count mismatch'

print(f'✓ Verification passed')
print(f'  Token IDs: {len(loaded_ids):,}')
print(f'  Core radius: {loaded_radius.item()} ULP')
print(f'  Core tokens: {loaded_n_core.item()}')
print(f'  Halo tokens: {loaded_n_halo.item()}')
print()
print(f'All checks passed! Core cluster data is ready.')


Verifying saved data...
✓ Verification passed
  Token IDs: 2,206
  Core radius: 3.0 ULP
  Core tokens: 2206
  Halo tokens: 6

All checks passed! Core cluster data is ready.


## Summary

In [119]:
print('='*70)
print('CORE CLUSTER EXTRACTION COMPLETE')
print('='*70)
print()
print(f'Original cluster (1.4h): {len(cluster_token_ids)} tokens')
print(f'Core (L∞ ≤ {MAX_CORE_RADIUS_ULP} ULP): {len(core_token_ids)} tokens ({len(core_populations)} unique vectors)')
print(f'Halo (L∞ > {MAX_CORE_RADIUS_ULP} ULP): {len(halo_token_ids)} tokens ({len(halo_populations)} unique vectors)')
print()
print(f'Saved to: {OUTPUT_PATH}')
print()
print(f'This file is a drop-in replacement for 1.4h_cluster_tokens.safetensors')
print(f'Use it to re-run analysis on the crystallized core only:')
print(f'  - 1.11g (power-of-two quantization proof)')
print(f'  - 1.11c (cluster visualization)')
print(f'  - 1.11d (dimensional diversity)')
print(f'  - Any other notebook that loads cluster_token_ids')
print()
print('='*70)

CORE CLUSTER EXTRACTION COMPLETE

Original cluster (1.4h): 2212 tokens
Core (L∞ ≤ 3.0 ULP): 2206 tokens (122 unique vectors)
Halo (L∞ > 3.0 ULP): 6 tokens (3 unique vectors)

Saved to: ../tensors/Qwen3-4B-Instruct-2507/1.11i_core_cluster_tokens.safetensors

This file is a drop-in replacement for 1.4h_cluster_tokens.safetensors
Use it to re-run analysis on the crystallized core only:
  - 1.11g (power-of-two quantization proof)
  - 1.11c (cluster visualization)
  - 1.11d (dimensional diversity)
  - Any other notebook that loads cluster_token_ids

