# 12.2a: Sigma Sweep Data Collection

**Goal:** Generate synthetic snowballs at Qwen scale and measure their structure.

## Experiment

For each of 1,000 linearly spaced values of σ ∈ [1.5×10⁻⁹, 5×10⁻⁹]:
1. Initialize 2,100 tokens: `qwen_centroid + Gaussian(0, σ)`
2. Quantize to bfloat16
3. Measure black hole structure (groups of bit-identical vectors and the L∞ distances between them)
4. Save results to CSV

No analysis, no plots: pure data collection.
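The three steps above can be illustrated without the Qwen tensor. What follows is a minimal sketch under stated assumptions: `to_bfloat16` and `count_groups` are hypothetical helpers, the rounding is emulated in pure Python (round to nearest, ties to even, on the upper 16 bits of the float32 encoding, with no NaN or infinity handling), and the centroid is a toy value that sits exactly on the bfloat16 grid, which the real centroid need not do.

```python
import random
import struct
from collections import Counter


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias so that dropping the low 16 bits rounds to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def count_groups(center, sigma, n_tokens=200, dim=4, seed=0):
    """Draw tokens around a toy centroid, quantize them, and count identical vectors."""
    rng = random.Random(seed)
    tokens = [
        tuple(to_bfloat16(center + rng.gauss(0.0, sigma)) for _ in range(dim))
        for _ in range(n_tokens)
    ]
    return Counter(tokens)


center = 2.0 ** -8  # toy component value, exactly representable in bfloat16
tiny = count_groups(center, sigma=1e-9)   # noise far below the grid spacing
wide = count_groups(center, sigma=1e-4)   # noise above the grid spacing

print("unique vectors, sigma=1e-9:", len(tiny))
print("unique vectors, sigma=1e-4:", len(wide))
```

With noise far below the grid spacing, every token rounds back to the same vector and forms a single black hole. With noise above the spacing, tokens scatter across many distinct vectors. A component that lies close to a rounding boundary can split even at very small σ, which this toy setup deliberately avoids.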

## Parameters

In [1]:
# Experiment parameters
N_TOKENS = 2100        # Match Qwen's dead token count
HIDDEN_DIM = 2560      # Qwen's embedding dimension

# Sweep range
SIGMA_MIN = 1.5e-9
SIGMA_MAX = 5e-9
N_SAMPLES = 1000

# Reference scale
EPSILON = 6e-5  # bfloat16 ULP at Qwen black hole magnitude

# Output
OUTPUT_FILE = "../data/analysis/sigma_sweep_qwen_scale.csv"

RANDOM_SEED = 42
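
`EPSILON` is described above as the bfloat16 ULP at the black hole magnitude. The value 6e-5 is close to 2⁻¹⁴ ≈ 6.10e-5, which is the bfloat16 spacing for magnitudes in [2⁻⁷, 2⁻⁶). A minimal sketch of how such a spacing can be computed is shown below. The helper name `bfloat16_ulp` is hypothetical, only normal (non-subnormal) values are handled, and the two magnitudes are illustrative rather than read from the Qwen tensor.

```python
import math


def bfloat16_ulp(x: float) -> float:
    """Spacing between adjacent bfloat16 values at the magnitude of x (normal range)."""
    if x == 0.0:
        raise ValueError("spacing at zero depends on subnormal handling")
    # frexp returns (m, e) with abs(x) == m * 2**e and 0.5 <= m < 1.
    _, exponent = math.frexp(abs(x))
    # bfloat16 keeps 8 significant bits (1 implicit + 7 stored).
    return math.ldexp(1.0, exponent - 8)


for magnitude in (0.003, 0.01):
    print(f"{magnitude}: {bfloat16_ulp(magnitude):.3e}")
# 0.003 lies in [2**-9, 2**-8), so the spacing is 2**-16 (about 1.526e-05)
# 0.01 lies in [2**-7, 2**-6), so the spacing is 2**-14 (about 6.104e-05)
```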

## Imports

In [2]:
import torch
import numpy as np
import pandas as pd
from safetensors.torch import load_file
from tqdm.auto import tqdm
from pathlib import Path
import gc

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

## Load Qwen Centroid

In [3]:
print("Loading Qwen black hole centroid...\n")

centroid_data = load_file("../data/tensors/black_hole_centroid_qwen3_4b.safetensors")
qwen_centroid = centroid_data['centroid'].to(torch.float32)

print("✓ Centroid loaded")
print(f"  Shape: {qwen_centroid.shape}")
print(f"  Norm: {qwen_centroid.norm().item():.6f}")

Loading Qwen black hole centroid...

✓ Centroid loaded
  Shape: torch.Size([2560])
  Norm: 0.166061


## Define Measurement Function

In [4]:
def measure_snowball(centroid, n_tokens, sigma):
    """
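    Generate a synthetic snowball and measure its structure.

    Draws n_tokens vectors as centroid + N(0, sigma^2) noise, rounds them
    to bfloat16, and groups bit-identical vectors. A group with two or
    more members is counted as a black hole.

    Returns a dict of group counts and pairwise L∞ statistics between
    black holes (0.0 when there are fewer than two).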
    """
    hidden_dim = len(centroid)
    
    # Generate embeddings
    noise = torch.randn(n_tokens, hidden_dim, dtype=torch.float32) * sigma
    embeddings = centroid.unsqueeze(0) + noise
    
    # Quantize to bfloat16
    embeddings = embeddings.to(torch.bfloat16).to(torch.float32)
    
    # Find unique vectors
    unique_vectors, inverse_indices, counts = torch.unique(
        embeddings,
        dim=0,
        return_inverse=True,
        return_counts=True
    )
    
    # Black holes (count ≥ 2)
    black_hole_mask = counts >= 2
    n_black_holes = black_hole_mask.sum().item()
    black_hole_population = counts[black_hole_mask].sum().item() if n_black_holes > 0 else 0
    # Largest group size over all unique vectors (1 if nothing collapsed)
    largest_bh = counts.max().item()
    n_singletons = len(unique_vectors) - n_black_holes
    
    # Pairwise L∞ distances (only if we have multiple black holes)
    if n_black_holes > 1:
        black_hole_vectors = unique_vectors[black_hole_mask]
        v1 = black_hole_vectors.unsqueeze(1)
        v2 = black_hole_vectors.unsqueeze(0)
        diffs = v1 - v2
        l_inf_distances = torch.abs(diffs).max(dim=2)[0]
        
        mask = ~torch.eye(n_black_holes, dtype=torch.bool)
        l_inf_nondiag = l_inf_distances[mask]
        
        max_l_inf = l_inf_nondiag.max().item()
        mean_l_inf = l_inf_nondiag.mean().item()
        median_l_inf = l_inf_nondiag.median().item()
    else:
        max_l_inf = 0.0
        mean_l_inf = 0.0
        median_l_inf = 0.0
    
    return {
        'sigma': sigma,
        'unique_vectors': len(unique_vectors),
        'n_black_holes': n_black_holes,
        'black_hole_population': black_hole_population,
        'n_singletons': n_singletons,
        'largest_bh': largest_bh,
        'max_l_inf': max_l_inf,
        'mean_l_inf': mean_l_inf,
        'median_l_inf': median_l_inf,
    }

print("✓ Measurement function defined")

✓ Measurement function defined
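
The quantities returned by `measure_snowball` can be checked by hand on a small example. The sketch below is a pure-Python restatement of the same definitions, not the notebook's implementation. `summarize` is a hypothetical helper, the toy vectors are made up, and only the maximum L∞ distance is computed.

```python
from collections import Counter
from itertools import combinations


def summarize(vectors):
    """Group identical vectors and report the same counts as the notebook's dict."""
    counts = Counter(vectors)
    # A black hole is a vector shared by at least two tokens.
    black_holes = [v for v, c in counts.items() if c >= 2]
    # L∞ (Chebyshev) distance: largest absolute per-component difference.
    pair_dists = [
        max(abs(a - b) for a, b in zip(u, v))
        for u, v in combinations(black_holes, 2)
    ]
    return {
        "unique_vectors": len(counts),
        "n_black_holes": len(black_holes),
        "black_hole_population": sum(counts[v] for v in black_holes),
        "n_singletons": len(counts) - len(black_holes),
        "largest_bh": max(counts.values()),
        "max_l_inf": max(pair_dists, default=0.0),
    }


# Six toy tokens: one group of 3, one group of 2, one singleton.
toy = [(0.0, 0.0)] * 3 + [(0.5, 0.0)] * 2 + [(0.0, 2.0)]
print(summarize(toy))
```

Here the two black holes hold 5 of the 6 tokens, and the singleton at `(0.0, 2.0)` is excluded from the distance statistics, so `max_l_inf` is 0.5 rather than 2.0.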


## Run Sweep

In [5]:
print("\nRunning sigma sweep...")
print(f"  Range: σ ∈ [{SIGMA_MIN:.2e}, {SIGMA_MAX:.2e}]")
print(f"  Samples: {N_SAMPLES}")
print(f"  Tokens: {N_TOKENS:,}")
print(f"  Dimension: {HIDDEN_DIM:,}\n")

sigmas = np.linspace(SIGMA_MIN, SIGMA_MAX, N_SAMPLES)
results = []

for sigma in tqdm(sigmas, desc="Sweeping σ"):
    measurements = measure_snowball(qwen_centroid, N_TOKENS, sigma)
    results.append(measurements)
    
    # Periodic garbage collection
    if len(results) % 20 == 0:
        gc.collect()

print(f"\n✓ Sweep complete: {len(results)} measurements")


Running sigma sweep...
  Range: σ ∈ [1.50e-09, 5.00e-09]
  Samples: 1000
  Tokens: 2,100
  Dimension: 2,560




✓ Sweep complete: 1000 measurements
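
The grid built by `np.linspace` is evenly spaced in σ. The following pure-Python sketch reproduces the same arithmetic to show the step size and the σ/ε range the sweep covers. The `linspace` helper here is a stand-in written for illustration, not NumPy's implementation.

```python
SIGMA_MIN, SIGMA_MAX, N_SAMPLES = 1.5e-9, 5e-9, 1000
EPSILON = 6e-5


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (illustrative stand-in)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


sigmas = linspace(SIGMA_MIN, SIGMA_MAX, N_SAMPLES)
print(f"second grid point: {sigmas[1]:.6e}")  # 1.503504e-09
print(f"sigma/epsilon: {sigmas[0] / EPSILON:.2e} to {sigmas[-1] / EPSILON:.2e}")
# sigma/epsilon: 2.50e-05 to 8.33e-05
```

Every σ in this sweep is therefore more than four orders of magnitude smaller than ε.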


## Save Results

In [6]:
df = pd.DataFrame(results)

# Add metadata columns
df['n_tokens'] = N_TOKENS
df['hidden_dim'] = HIDDEN_DIM
df['epsilon'] = EPSILON
df['sigma_over_epsilon'] = df['sigma'] / EPSILON
df['max_l_inf_over_epsilon'] = df['max_l_inf'] / EPSILON

# Save
output_path = Path(OUTPUT_FILE)
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

print(f"\n✓ Results saved to {output_path}")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {list(df.columns)}")
print(f"  File size: {output_path.stat().st_size / 1024:.1f} KB")


✓ Results saved to ../data/analysis/sigma_sweep_qwen_scale.csv
  Rows: 1,000
  Columns: ['sigma', 'unique_vectors', 'n_black_holes', 'black_hole_population', 'n_singletons', 'largest_bh', 'max_l_inf', 'mean_l_inf', 'median_l_inf', 'n_tokens', 'hidden_dim', 'epsilon', 'sigma_over_epsilon', 'max_l_inf_over_epsilon']
  File size: 154.0 KB
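
Because the saved columns are not independent, a few identities implied by `measure_snowball` can serve as a sanity check on the CSV: every unique vector is either a black hole or a singleton, and every token belongs to exactly one of them. A minimal sketch, assuming each row is available as a plain dict, is shown below. `check_row` is a hypothetical helper and both example rows are invented for illustration.

```python
def check_row(row):
    """Return descriptions of violated identities for one result row."""
    problems = []
    if row["unique_vectors"] != row["n_black_holes"] + row["n_singletons"]:
        problems.append("unique_vectors != n_black_holes + n_singletons")
    if row["black_hole_population"] + row["n_singletons"] != row["n_tokens"]:
        problems.append("black_hole_population + n_singletons != n_tokens")
    if row["largest_bh"] > row["n_tokens"]:
        problems.append("largest_bh exceeds n_tokens")
    return problems


# Hypothetical rows for illustration only, not taken from the sweep output.
good = {"unique_vectors": 15, "n_black_holes": 14, "n_singletons": 1,
        "black_hole_population": 2099, "largest_bh": 900, "n_tokens": 2100}
bad = dict(good, n_singletons=3)

print(check_row(good))  # []
print(check_row(bad))
```

With the DataFrame from the cell above, `df.to_dict("records")` yields one dict of this shape per row.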


## Quick Summary

In [7]:
print(f"\n{'='*60}")
print("DATA COLLECTION COMPLETE")
print(f"{'='*60}")
print(f"Samples: {len(df):,}")
print(f"\nBlack hole count range: [{df['n_black_holes'].min()}, {df['n_black_holes'].max()}]")
print(f"Max L∞ / ε range: [{df['max_l_inf_over_epsilon'].min():.2f}, {df['max_l_inf_over_epsilon'].max():.2f}]")
print(f"\nData saved: {output_path}")
print(f"{'='*60}")


DATA COLLECTION COMPLETE
Samples: 1,000

Black hole count range: [12, 240]
Max L∞ / ε range: [0.01, 0.25]

Data saved: ../data/analysis/sigma_sweep_qwen_scale.csv


## Inspect Saved Results

In [8]:
import pandas as pd

df = pd.read_csv('../data/analysis/sigma_sweep_qwen_scale.csv')

# Show rows where we got black holes
black_hole_rows = df[df['n_black_holes'] > 0]

print(f"Total samples: {len(df)}")
print(f"Samples with black holes: {len(black_hole_rows)}")
print(f"\nBlack hole detections:\n")
print(black_hole_rows[['sigma', 'n_black_holes', 'black_hole_population', 'max_l_inf', 'sigma_over_epsilon', 'max_l_inf_over_epsilon']])


Total samples: 1000
Samples with black holes: 1000

Black hole detections:

            sigma  n_black_holes  black_hole_population  max_l_inf  \
0    1.500000e-09             14                   2099   0.000015   
1    1.503504e-09             14                   2098   0.000015   
2    1.507007e-09             14                   2098   0.000015   
3    1.510511e-09             13                   2097   0.000015   
4    1.514014e-09             13                   2099   0.000015   
..            ...            ...                    ...        ...   
995  4.985986e-09            230                   1698   0.000015   
996  4.989489e-09            222                   1703   0.000015   
997  4.992993e-09            235                   1720   0.000015   
998  4.996496e-09            227                   1696   0.000015   
999  5.000000e-09            232                   1690   0.000015   

     sigma_over_epsilon  max_l_inf_over_epsilon  
0              0.000025          