# 14.2b: Dead Token Topology Evolution

**Does the primordial atom crystallize or evaporate?**

## The Question

As dead tokens evolve during training, do they:
- **Evaporate**: Spread out into a diffuse gas of isolated tokens?
- **Crystallize**: Form a connected lattice structure like the 2×2×2 hypercube in Qwen 3 4B Instruct?

## The Hypothesis

Jeffery's picture: The primordial atom starts as a tight cluster (crystal), expands thermally into a gas, but as the centroid moves away from the origin, **bfloat16 precision gets coarser**. The lattice spacing (ULP) increases, causing tokens that were separated to become adjacent. The atom **re-crystallizes** not by moving closer, but by the universe losing resolution around them.

## Method

For each training step:
1. Compute **mean ULP** at the centroid (local lattice spacing)
2. Build **adjacency graph**: tokens connected if L∞ distance ≤ mean ULP
3. Compute **graph statistics**:
   - Graph density (fraction of possible edges)
   - Largest connected component size
   - Number of isolated tokens (singletons)
   - Total number of connected components

## Output

`data/instrumented_run/dead_token_topology.safetensors` (~500 KB)

## Parameters

In [1]:
# Data
INPUT_PATH = "../data/instrumented_run/dead_token_kinematics.safetensors"
OUTPUT_PATH = "../data/instrumented_run/dead_token_topology.safetensors"

# Computation
USE_GPU = True  # Set to False to force CPU

RANDOM_SEED = 42

## Imports

In [2]:
import torch
import numpy as np
from safetensors.torch import load_file, save_file
from tqdm import tqdm
import scipy.sparse.csgraph as csgraph

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Detect device
if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}")
elif USE_GPU and torch.backends.mps.is_available():
    device = torch.device('mps')
    print(f"✓ Using Apple Silicon GPU (MPS)")
else:
    device = torch.device('cpu')
    print(f"✓ Using CPU")

print("✓ Imports complete")

✓ Using Apple Silicon GPU (MPS)
✓ Imports complete


## Load Data

In [3]:
print(f"Loading: {INPUT_PATH}")

data = load_file(INPUT_PATH)

recorded_steps = data['recorded_steps']
dead_token_ids = data['dead_token_ids']
positions = data['positions']  # [n_recorded, n_dead, hidden_dim]
centroid = data['centroid']  # [n_recorded, hidden_dim]

n_recorded = len(recorded_steps)
n_dead = len(dead_token_ids)
hidden_dim = positions.shape[2]

print(f"\n  Recorded steps: {n_recorded}")
print(f"  Dead tokens: {n_dead}")
print(f"  Hidden dim: {hidden_dim}")
print(f"\n✓ Data loaded")

Loading: ../data/instrumented_run/dead_token_kinematics.safetensors

  Recorded steps: 10000
  Dead tokens: 51
  Hidden dim: 64

✓ Data loaded


## Helper Functions

In [4]:
def compute_ulp_at_point(point_bf16):
    """
    Compute ULP (unit in last place) for each dimension of a bfloat16 point.
    Returns mean ULP across dimensions.
    
    point_bf16: [hidden_dim] tensor in bfloat16
    """
    # ULP = nextafter(x, x+1) - x
    next_vals = torch.nextafter(point_bf16, point_bf16 + 1)
    ulps = next_vals - point_bf16
    return ulps.mean().item()


def compute_adjacency_matrix(positions_bf16, ulp_threshold):
    """
    Compute adjacency matrix based on L∞ distance.
    Two tokens are adjacent if L∞ distance ≤ ulp_threshold.
    
    positions_bf16: [n_tokens, hidden_dim] in bfloat16
    ulp_threshold: scalar
    
    Returns: [n_tokens, n_tokens] boolean adjacency matrix (excludes self-loops)
    """
    n_tokens = positions_bf16.shape[0]
    
    # Pairwise differences: [n_tokens, n_tokens, hidden_dim]
    diff = positions_bf16.unsqueeze(0) - positions_bf16.unsqueeze(1)
    
    # L∞ distance: [n_tokens, n_tokens]
    linf_dist = diff.abs().max(dim=2)[0]
    
    # Adjacency: L∞ > 0 (not same token) and L∞ ≤ threshold
    adjacency = (linf_dist > 0) & (linf_dist <= ulp_threshold)
    
    return adjacency.cpu().numpy()  # Return as numpy for graph analysis


def compute_graph_statistics(adjacency):
    """
    Compute topological statistics from adjacency matrix.
    
    adjacency: [n_tokens, n_tokens] boolean numpy array
    
    Returns:
        density: fraction of possible edges that exist
        largest_component_size: size of largest connected component
        n_singletons: number of isolated tokens (degree 0)
        n_components: total number of connected components
    """
    n_tokens = adjacency.shape[0]
    
    # Graph density
    n_edges = adjacency.sum() / 2  # Divide by 2 because adjacency is symmetric
    max_edges = n_tokens * (n_tokens - 1) / 2
    density = n_edges / max_edges if max_edges > 0 else 0
    
    # Connected components
    n_components, labels = csgraph.connected_components(adjacency, directed=False)
    
    # Component sizes
    component_sizes = np.bincount(labels)
    largest_component_size = component_sizes.max()
    
    # Singletons (components of size 1)
    n_singletons = (component_sizes == 1).sum()
    
    return density, largest_component_size, n_singletons, n_components


print("✓ Helper functions defined")

✓ Helper functions defined


## Compute Topology Evolution

In [5]:
print("Computing topology evolution...")
print("This may take a few minutes.\n")

# Storage for results
mean_ulps = np.zeros(n_recorded)
densities = np.zeros(n_recorded)
largest_components = np.zeros(n_recorded, dtype=np.int32)
n_singletons_array = np.zeros(n_recorded, dtype=np.int32)
n_components_array = np.zeros(n_recorded, dtype=np.int32)

# Process each step
for t in tqdm(range(n_recorded)):
    # Get positions and centroid at this step
    pos_t = positions[t].to(device)  # [n_dead, hidden_dim]
    cent_t = centroid[t].to(device)  # [hidden_dim]
    
    # Quantize to bfloat16
    pos_bf16 = pos_t.to(torch.bfloat16)
    cent_bf16 = cent_t.to(torch.bfloat16)
    
    # Compute mean ULP at centroid
    mean_ulp = compute_ulp_at_point(cent_bf16)
    mean_ulps[t] = mean_ulp
    
    # Compute adjacency matrix
    adjacency = compute_adjacency_matrix(pos_bf16, mean_ulp)
    
    # Compute graph statistics
    density, largest_comp, n_sing, n_comp = compute_graph_statistics(adjacency)
    
    densities[t] = density
    largest_components[t] = largest_comp
    n_singletons_array[t] = n_sing
    n_components_array[t] = n_comp

print(f"\n✓ Topology computed for {n_recorded} steps")

Computing topology evolution...
This may take a few minutes.



100%|██████████| 10000/10000 [00:08<00:00, 1226.80it/s]


✓ Topology computed for 10000 steps





## Summary Statistics

In [6]:
print(f"\n{'='*80}")
print(f"TOPOLOGY EVOLUTION SUMMARY")
print(f"{'='*80}\n")

print(f"ULP evolution:")
print(f"  Initial mean ULP: {mean_ulps[0]:.6e}")
print(f"  Final mean ULP: {mean_ulps[-1]:.6e}")
print(f"  Ratio (final/initial): {mean_ulps[-1] / mean_ulps[0]:.2f}")
print(f"  Max mean ULP: {mean_ulps.max():.6e} (step {recorded_steps.numpy()[mean_ulps.argmax()]})\n")

print(f"Graph density evolution:")
print(f"  Initial density: {densities[0]:.4f}")
print(f"  Final density: {densities[-1]:.4f}")
print(f"  Max density: {densities.max():.4f} (step {recorded_steps.numpy()[densities.argmax()]})")
print(f"  Min density: {densities.min():.4f} (step {recorded_steps.numpy()[densities.argmin()]})\n")

print(f"Largest connected component:")
print(f"  Initial size: {largest_components[0]} / {n_dead}")
print(f"  Final size: {largest_components[-1]} / {n_dead}")
print(f"  Min size: {largest_components.min()} (step {recorded_steps.numpy()[largest_components.argmin()]})\n")

print(f"Singletons (isolated tokens):")
print(f"  Initial: {n_singletons_array[0]} / {n_dead}")
print(f"  Final: {n_singletons_array[-1]} / {n_dead}")
print(f"  Max: {n_singletons_array.max()} (step {recorded_steps.numpy()[n_singletons_array.argmax()]})\n")

print(f"Number of components:")
print(f"  Initial: {n_components_array[0]}")
print(f"  Final: {n_components_array[-1]}")
print(f"  Max: {n_components_array.max()} (step {recorded_steps.numpy()[n_components_array.argmax()]})")

print(f"\n{'='*80}")


TOPOLOGY EVOLUTION SUMMARY

ULP evolution:
  Initial mean ULP: 0.000000e+00
  Final mean ULP: 0.000000e+00
  Ratio (final/initial): nan
  Max mean ULP: 0.000000e+00 (step 0)

Graph density evolution:
  Initial density: 0.0000
  Final density: 0.0000
  Max density: 0.0000 (step 0)
  Min density: 0.0000 (step 0)

Largest connected component:
  Initial size: 1 / 51
  Final size: 1 / 51
  Min size: 1 (step 0)

Singletons (isolated tokens):
  Initial: 51 / 51
  Final: 51 / 51
  Max: 51 (step 0)

Number of components:
  Initial: 51
  Final: 51
  Max: 51 (step 0)



  print(f"  Ratio (final/initial): {mean_ulps[-1] / mean_ulps[0]:.2f}")


## Save Results

In [7]:
print(f"\nSaving to: {OUTPUT_PATH}")

save_dict = {
    'recorded_steps': recorded_steps,
    'dead_token_ids': dead_token_ids,
    'mean_ulps': torch.tensor(mean_ulps, dtype=torch.float32),
    'densities': torch.tensor(densities, dtype=torch.float32),
    'largest_components': torch.tensor(largest_components, dtype=torch.int32),
    'n_singletons': torch.tensor(n_singletons_array, dtype=torch.int32),
    'n_components': torch.tensor(n_components_array, dtype=torch.int32),
}

save_file(save_dict, OUTPUT_PATH)

# Check file size
import os
file_size_kb = os.path.getsize(OUTPUT_PATH) / 1024

print(f"\n✓ Saved successfully")
print(f"  File size: {file_size_kb:.1f} KB")


Saving to: ../data/instrumented_run/dead_token_topology.safetensors

✓ Saved successfully
  File size: 274.4 KB


## Interpretation

**Crystallization signature:**
- Mean ULP increases as centroid moves away from origin (coarser lattice)
- Graph density increases or stays high
- Largest component ≈ n_dead (most tokens stay connected)
- Few singletons

**Evaporation signature:**
- Mean ULP stays constant or decreases
- Graph density decreases to near zero
- Largest component shrinks
- Many singletons

**Jeffery's hypothesis:**
Tokens spread out in *physical* space (expansion we saw), but as ULP grows, they become adjacent again in *topological* space. The universe loses resolution and everything re-crystallizes.