# 1.6a: Cluster-Centric Reference Frame

This notebook establishes a coordinate system centered on the cluster. It computes the cluster centroid and a PCA basis for the full embedding matrix W, then saves both so later notebooks can reuse them.

## The Question

We've identified a tight cluster of 2,212 tokens (1.4g-1.4h) and seen that it appears as a concentrated spike in the radial density profile (1.5c). But what about the remaining ~18,161 non-cluster tokens in the overdensity?

Are they:
- **Uniform background** ("ice cube floating in diffuse gas")
- **Thermal halo** (tokens that escaped the cluster)
- **Structured** (sub-clusters, filaments, anisotropies)

To answer this, we need to view the universe **from the cluster's perspective**. This notebook sets up the infrastructure:

1. **Cluster centroid** - the origin of our new coordinate system
2. **PCA basis** - consistent axes for all future visualizations

These will be reused throughout the 1.6 series to study the structure surrounding the cluster.
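
As a rough illustration of how the two pieces fit together, the sketch below translates points so the cluster centroid sits at the origin and then projects them onto the leading PCA directions. It runs on synthetic NumPy data so it is standalone. The helper name `to_cluster_frame`, the toy sizes, and the choice of `k=3` are assumptions made for this example, not part of the notebook. The PCA axes are computed about the global mean while the origin is the cluster centroid; the basis only fixes the orientation of the axes, so the two choices are compatible.

```python
import numpy as np

def to_cluster_frame(X, centroid, eigenvectors, k=3):
    """Shift rows of X so `centroid` is the origin, then project onto the first k basis columns."""
    return (X - centroid) @ eigenvectors[:, :k]

# Synthetic stand-ins (hypothetical sizes): 1,000 "tokens" in 8 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
member_ids = np.arange(50)              # pretend the first 50 rows are the cluster
centroid = X[member_ids].mean(axis=0)   # unweighted mean of the cluster members

# PCA basis of the full data: columns are directions, sorted by descending variance
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(X))
order = np.argsort(evals)[::-1]
evecs = evecs[:, order]

coords = to_cluster_frame(X, centroid, evecs, k=3)
print(coords.shape)
print(np.abs(coords[member_ids].mean(axis=0)).max())  # cluster members average to ~0
```

In this frame the cluster members average to the origin by construction, so everything else is described relative to the cluster.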

## Method

We'll:
1. Load W and compute PCA over the full vocabulary (keeping eigenvalues and eigenvectors)
2. Load cluster token IDs and compute geometric centroid
3. Save everything for reuse
4. Verify the saved data

## Parameters

In [1]:
# Model to analyze
MODEL_NAME = "Qwen3-4B-Instruct-2507"

## Imports

In [2]:
import torch
import numpy as np
from safetensors.torch import load_file, save_file
from pathlib import Path

## Device Detection

In [3]:
# Detect available device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Load W

In [4]:
# Load W
tensor_path = Path(f"../tensors/{MODEL_NAME}/W.safetensors")
W_bf16 = load_file(tensor_path)["W"]
W = W_bf16.to(torch.float32).to(device)
N, d = W.shape

print(f"Loaded W: {W.shape}")
print(f"  {N:,} tokens in {d:,} dimensions")

Loaded W: torch.Size([151936, 2560])
  151,936 tokens in 2,560 dimensions


## Compute PCA Basis

In [5]:
print("\nComputing PCA...\n")

# Center the data
W_centered = W - W.mean(dim=0)

# Covariance matrix (population form: divides by N, not N - 1)
cov = (W_centered.T @ W_centered) / N

# Move to CPU for eigen decomposition (MPS doesn't support eigh)
cov_cpu = cov.cpu()

# Eigendecomposition (eigh returns eigenvalues in ascending order)
eigenvalues, eigenvectors = torch.linalg.eigh(cov_cpu)

# Sort descending by eigenvalue
idx = torch.argsort(eigenvalues, descending=True)
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(f"✓ PCA computed")
print(f"\nTop 10 eigenvalues:")
for i in range(10):
    variance_explained = eigenvalues[i] / eigenvalues.sum() * 100
    print(f"  PC{i+1}: {eigenvalues[i].item():.6f} ({variance_explained:.2f}% variance)")

# Compute cumulative variance
cumsum = torch.cumsum(eigenvalues, dim=0)
total_variance = eigenvalues.sum()
variance_10 = cumsum[9] / total_variance * 100
variance_100 = cumsum[99] / total_variance * 100

print(f"\nCumulative variance:")
print(f"  First 10 PCs: {variance_10:.2f}%")
print(f"  First 100 PCs: {variance_100:.2f}%")


Computing PCA...

✓ PCA computed

Top 10 eigenvalues:
  PC1: 0.010487 (0.94% variance)
  PC2: 0.003178 (0.28% variance)
  PC3: 0.002791 (0.25% variance)
  PC4: 0.002616 (0.23% variance)
  PC5: 0.001973 (0.18% variance)
  PC6: 0.001805 (0.16% variance)
  PC7: 0.001609 (0.14% variance)
  PC8: 0.001549 (0.14% variance)
  PC9: 0.001468 (0.13% variance)
  PC10: 0.001389 (0.12% variance)

Cumulative variance:
  First 10 PCs: 2.58%
  First 100 PCs: 10.16%
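
A few properties of the basis are worth checking before reusing it elsewhere: the eigenvectors should be orthonormal, the eigenvalues should sum to the total per-dimension variance, and together they should reconstruct the covariance matrix. The sketch below shows these checks on small synthetic data with NumPy. The sizes and scale factors are assumptions chosen for illustration, and the same checks could be written against the tensors in this notebook.

```python
import numpy as np

# Synthetic data with deliberately unequal variance per dimension
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.5, 0.1])

# Same recipe as above: center, population covariance, symmetric eigendecomposition
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)
evals, evecs = np.linalg.eigh(cov)          # ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # flip to descending

# 1. Columns of evecs form an orthonormal basis
orthonormal = np.allclose(evecs.T @ evecs, np.eye(6), atol=1e-10)

# 2. Eigenvalues sum to the trace of the covariance (total variance)
trace_matches = np.isclose(evals.sum(), Xc.var(axis=0).sum())

# 3. The decomposition reconstructs the covariance matrix
reconstructs = np.allclose(evecs @ np.diag(evals) @ evecs.T, cov)

# Fraction of variance per component, as printed in the cell above
explained = evals / evals.sum()
print(orthonormal, trace_matches, reconstructs)
```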


## Load Cluster Tokens and Compute Centroid

In [6]:
print("\nLoading cluster tokens...\n")

# Load cluster token IDs
cluster_path = Path(f"../tensors/{MODEL_NAME}/1.4h_cluster_tokens.safetensors")
cluster_data = load_file(cluster_path)
cluster_token_ids = cluster_data['cluster_token_ids']

print(f"✓ Loaded {len(cluster_token_ids):,} cluster tokens")


Loading cluster tokens...

✓ Loaded 2,212 cluster tokens


In [7]:
print("\nComputing cluster centroid...\n")

# Get cluster vectors
cluster_vecs = W[cluster_token_ids]

# Compute geometric centroid (unweighted mean)
cluster_centroid = cluster_vecs.mean(dim=0)

print(f"✓ Cluster centroid computed")
print(f"  Shape: {cluster_centroid.shape}")
print(f"  Norm: {torch.linalg.vector_norm(cluster_centroid).item():.6f}")
print(f"\n  First 10 components: {cluster_centroid.cpu().numpy()[:10]}")


Computing cluster centroid...

✓ Cluster centroid computed
  Shape: torch.Size([2560])
  Norm: 0.370917

  First 10 components: [ 0.00607304  0.01324463  0.01177979  0.03662109  0.01586914 -0.01257324
  0.00659181  0.00830075 -0.02148438 -0.05200195]
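
With a centroid in hand, the natural next quantity is each token's distance from it. The sketch below is a synthetic illustration of that idea, not a computation on W: it builds a tight blob inside a diffuse background and compares radial distances from the blob's centroid. All sizes, the blob center, and the spreads are assumptions picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical setup: 1,000 diffuse background points plus a tight 50-point blob
background = rng.normal(size=(1000, dim))
blob_center = np.full(dim, 0.5)
blob = blob_center + 1e-3 * rng.normal(size=(50, dim))

# Centroid of the blob (unweighted mean), as in the cell above
centroid = blob.mean(axis=0)

# Euclidean distance of every point from the centroid
blob_r = np.linalg.norm(blob - centroid, axis=1)
background_r = np.linalg.norm(background - centroid, axis=1)

print(f"blob:       max radius    = {blob_r.max():.4f}")
print(f"background: min radius    = {background_r.min():.4f}")
print(f"background: median radius = {np.median(background_r):.4f}")
```

A large gap between the two sets of radii is what a concentrated spike in a radial density profile looks like when measured from the centroid.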


## Save Reference Frame Data

In [8]:
print("\nSaving reference frame data...\n")

# Prepare output
output_dir = Path(f"../tensors/{MODEL_NAME}")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "1.6a_cluster_reference_frame.safetensors"

# Save
save_file({
    'cluster_centroid': cluster_centroid.cpu(),
    'W_eigenvalues': eigenvalues,
    'W_eigenvectors': eigenvectors,
}, output_path)

print(f"✓ Saved to {output_path}")
print(f"  Size: {output_path.stat().st_size / 1024 / 1024:.1f} MB")
print()
print(f"Saved data:")
print(f"  cluster_centroid: {cluster_centroid.shape} (geometric center of {len(cluster_token_ids):,} cluster tokens)")
print(f"  W_eigenvalues: {eigenvalues.shape} (PCA eigenvalues for full vocabulary)")
print(f"  W_eigenvectors: {eigenvectors.shape} (PCA basis vectors for full vocabulary)")


Saving reference frame data...

✓ Saved to ../tensors/Qwen3-4B-Instruct-2507/1.6a_cluster_reference_frame.safetensors
  Size: 25.0 MB

Saved data:
  cluster_centroid: torch.Size([2560]) (geometric center of 2,212 cluster tokens)
  W_eigenvalues: torch.Size([2560]) (PCA eigenvalues for full vocabulary)
  W_eigenvectors: torch.Size([2560, 2560]) (PCA basis vectors for full vocabulary)
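
The reported file size can be sanity-checked by hand. The arithmetic below assumes all three tensors are stored as float32 (4 bytes per element), which follows from W being cast to float32 before any computation. It ignores the small metadata header that safetensors adds.

```python
d = 2560
bytes_per_float32 = 4

# centroid (d,) + eigenvalues (d,) + eigenvectors (d, d)
n_elements = d + d + d * d
payload_bytes = n_elements * bytes_per_float32
payload_mb = payload_bytes / 1024 / 1024

print(f"{payload_bytes:,} bytes = {payload_mb:.1f} MB")  # 26,234,880 bytes = 25.0 MB
```

Almost all of the file is the 2560 × 2560 eigenvector matrix; the two length-2560 vectors contribute about 20 KB.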


## Verify Saved Data

In [9]:
print("\nVerifying saved data...\n")

# Load back
verification = load_file(output_path)
loaded_centroid = verification['cluster_centroid']
loaded_eigenvalues = verification['W_eigenvalues']
loaded_eigenvectors = verification['W_eigenvectors']

# Verify shapes
assert loaded_centroid.shape == (d,), f"Centroid shape mismatch: {loaded_centroid.shape}"
assert loaded_eigenvalues.shape == (d,), f"Eigenvalues shape mismatch: {loaded_eigenvalues.shape}"
assert loaded_eigenvectors.shape == (d, d), f"Eigenvectors shape mismatch: {loaded_eigenvectors.shape}"

# Verify values match
assert torch.allclose(loaded_centroid, cluster_centroid.cpu()), "Centroid values don't match"
assert torch.allclose(loaded_eigenvalues, eigenvalues), "Eigenvalues don't match"
assert torch.allclose(loaded_eigenvectors, eigenvectors), "Eigenvectors don't match"

print(f"✓ Verification passed")
print(f"  Centroid: {loaded_centroid.shape}")
print(f"  Eigenvalues: {loaded_eigenvalues.shape}")
print(f"  Eigenvectors: {loaded_eigenvectors.shape}")
print()
print(f"All checks passed! Reference frame is ready for use.")


Verifying saved data...

✓ Verification passed
  Centroid: torch.Size([2560])
  Eigenvalues: torch.Size([2560])
  Eigenvectors: torch.Size([2560, 2560])

All checks passed! Reference frame is ready for use.


## Summary

We've established a cluster-centric reference frame for studying the surrounding token distribution:

**Saved data:**
- `cluster_centroid` (2560,) - geometric center (unweighted mean) of the 2,212-token cluster
- `W_eigenvalues` (2560,) - principal component variances, sorted in descending order
- `W_eigenvectors` (2560, 2560) - principal component directions, stored as columns in the same order

**Next steps:**
- **1.6b:** View universe from cluster perspective (Mollweide + polar projections)
- **1.6c:** Test isotropy - is the surrounding distribution uniform?
- **1.6d:** Compare cluster-centric view to global view - is the cluster special?

The cluster centroid becomes our new origin. Any anisotropies or structure in the surrounding tokens will be visible as deviations from spherical symmetry.
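
To make "deviation from spherical symmetry" concrete, one crude statistic is the mean resultant length: normalize each point's offset from the origin to a unit vector, average those unit vectors, and take the norm of the result. It is near 0 when directions are spread evenly and grows as they pile up on one side. The sketch below is only an illustration on synthetic data; the function name, the sizes, and the choice of statistic are assumptions, and the later notebooks may test isotropy differently.

```python
import numpy as np

def mean_resultant_length(X, origin):
    """Norm of the average unit direction from `origin` to each row of X."""
    offsets = X - origin
    norms = np.linalg.norm(offsets, axis=1, keepdims=True)
    units = offsets / norms  # assumes no row coincides exactly with the origin
    return float(np.linalg.norm(units.mean(axis=0)))

rng = np.random.default_rng(0)
origin = np.zeros(16)

# Isotropic cloud around the origin vs. the same cloud pushed along one axis
isotropic = rng.normal(size=(5000, 16))
shift = np.zeros(16)
shift[0] = 2.0
lopsided = isotropic + shift

r_iso = mean_resultant_length(isotropic, origin)
r_lop = mean_resultant_length(lopsided, origin)
print(f"isotropic: {r_iso:.3f}")
print(f"lopsided:  {r_lop:.3f}")
```

A single number like this only detects a net directional bias. Structure such as symmetric filaments or multiple sub-clusters can cancel out in the average, which is why projections are still worth viewing directly.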