# 12.2d: Comprehensive Synthetic Statistics

**Goal:** Stream through all 10,000 synthetic snowball trials and collect EVERY statistic we care about. Save to One CSV To Rule Them All.

## What We Compute (Per Trial)

### Basic Counts
- `n_tokens` - Total tokens (should always be 2,100)
- `n_unique` - Number of unique vectors
- `n_black_holes` - Vectors with count ≥ 2
- `n_singletons` - Vectors with count = 1
- `total_population` - Sum of all counts (= n_tokens)
- `black_hole_population` - Sum of counts where count ≥ 2
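
To make these definitions concrete, here is a minimal dependency-free sketch that computes the same six counts for a toy list of vectors. It uses `collections.Counter` on tuples instead of the notebook's `torch.unique()`, and the name `basic_counts` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter

def basic_counts(vectors):
    """Count duplicate vectors; each vector is a hashable tuple of floats."""
    counts = Counter(vectors)                 # maps vector -> population
    populations = list(counts.values())
    black_holes = [c for c in populations if c >= 2]
    return {
        "n_tokens": len(vectors),
        "n_unique": len(populations),
        "n_black_holes": len(black_holes),
        "n_singletons": sum(1 for c in populations if c == 1),
        "total_population": sum(populations),
        "black_hole_population": sum(black_holes),
    }

# Toy data (hypothetical): one vector seen 3x, one seen 2x, one seen once
toy = [(0.0, 1.0)] * 3 + [(2.0, 2.0)] * 2 + [(5.0, 5.0)]
print(basic_counts(toy))
```

In this toy case there are two black holes (populations 3 and 2) and one singleton, so `black_hole_population` is 5 out of 6 tokens.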

### Per-Black-Hole Statistics
- `largest_bh` - Max population among black holes
- `smallest_bh` - Min population among black holes
- `mean_bh_size` - Mean population per black hole
- `median_bh_size` - Median population per black hole
- `top2_population` - Sum of two largest black holes (concentration metric)
- `gini_coefficient` - Inequality measure of population distribution
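
As a rough illustration of the per-black-hole metrics, the sketch below recomputes them in plain Python for a small list of populations. It assumes the same rank-weighted Gini formula as the helper defined later in this notebook. The function names and toy numbers are hypothetical. `statistics.median_low` is used so that the median behaves like `torch.median`, which returns the lower of the two middle values for even-length input.

```python
import statistics

def gini(populations):
    """Rank-weighted Gini: G = 2*sum(i*x_i) / (n*sum(x_i)) - (n+1)/n, x sorted ascending."""
    xs = sorted(populations)
    n, total = len(xs), sum(xs)
    if n <= 1 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def black_hole_stats(populations):
    """Summarize black hole populations (every entry is assumed to be >= 2)."""
    if not populations:
        # Mirrors the notebook's convention of reporting zeros when no black holes exist
        return {"largest_bh": 0, "smallest_bh": 0, "mean_bh_size": 0.0,
                "median_bh_size": 0.0, "top2_population": 0, "gini_coefficient": 0.0}
    return {
        "largest_bh": max(populations),
        "smallest_bh": min(populations),
        "mean_bh_size": sum(populations) / len(populations),
        "median_bh_size": float(statistics.median_low(populations)),
        "top2_population": sum(sorted(populations, reverse=True)[:2]),
        "gini_coefficient": gini(populations),
    }

print(black_hole_stats([5, 2, 3]))
```

For the toy populations `[5, 2, 3]` the Gini coefficient works out to 0.2, and a perfectly equal list such as `[4, 4, 4, 4]` gives 0.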

### Spatial Extent (L∞ Distances)
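- `max_l_inf` - Maximum pairwise Chebyshev distance between unique vectors (in units of ε)
- `mean_l_inf` - Mean pairwise Chebyshev distance between unique vectors (in units of ε)
- `median_l_inf` - Median pairwise Chebyshev distance between unique vectors (in units of ε)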
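
The Chebyshev (L∞) distance between two vectors is the largest absolute difference across coordinates. A minimal plain-Python sketch of the three spatial statistics follows. It is not the notebook's vectorized torch code, and `chebyshev` and `l_inf_stats` are hypothetical names. It visits each unordered pair once, while the notebook keeps both (i, j) and (j, i). That duplication should not change the max or the mean.

```python
from itertools import combinations
import statistics

def chebyshev(u, v):
    """L-infinity distance: largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def l_inf_stats(unique_vectors, epsilon):
    """Pairwise Chebyshev statistics, expressed in units of epsilon."""
    dists = [chebyshev(u, v) / epsilon
             for u, v in combinations(unique_vectors, 2)]
    if not dists:  # zero or one vector: no pairs to measure
        return {"max_l_inf": 0.0, "mean_l_inf": 0.0, "median_l_inf": 0.0}
    return {
        "max_l_inf": max(dists),
        "mean_l_inf": sum(dists) / len(dists),
        # Lower median, to match torch.median on even-length input
        "median_l_inf": statistics.median_low(dists),
    }

# Toy vectors (hypothetical); raw pairwise distances are 3, 2, and 2
toy = [(0.0, 0.0), (1.0, 3.0), (2.0, 1.0)]
print(l_inf_stats(toy, epsilon=0.5))
```

With `epsilon=0.5` the raw distances 3, 2, 2 become 6, 4, 4 in ε units, so the max is 6 and the median is 4.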

### Topology (Graph Structure)
- `n_components` - Number of connected components in adjacency graph (nodes = unique vectors; edge when L∞ distance ≤ `TOUCHING_THRESHOLD` = 2ε)
- `n_isolated` - Number of nodes with degree = 0
- `largest_component_size` - Size of largest connected component
- `largest_component_density` - Edge density of largest component (0-1)
- `global_density` - Edge density of full graph (0-1)
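
To show what these topology numbers mean on a tiny graph, here is a sketch that swaps NetworkX for a hand-rolled depth-first search so it runs with the standard library alone. The names `topology` and `chebyshev` are hypothetical. The density convention for a single node (1.0) is an assumption copied from the helper defined later in this notebook.

```python
from itertools import combinations

def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def topology(unique_vectors, threshold):
    """Graph statistics where two vectors are adjacent if L-inf distance <= threshold."""
    n = len(unique_vectors)
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if chebyshev(unique_vectors[i], unique_vectors[j]) <= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Connected components via iterative depth-first search
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(neighbors[node] - comp)
        seen |= comp
        components.append(comp)

    def density(nodes):
        k = len(nodes)
        if k < 2:
            return 1.0 if k == 1 else 0.0  # assumed convention for trivial graphs
        edges = sum(len(neighbors[i] & nodes) for i in nodes) // 2
        return edges / (k * (k - 1) // 2)

    largest = max(components, key=len) if components else set()
    return {
        "n_components": len(components),
        "n_isolated": sum(1 for i in range(n) if not neighbors[i]),
        "largest_component_size": len(largest),
        "largest_component_density": density(largest),
        "global_density": density(set(range(n))),
    }

# Toy 1-D points (hypothetical): a chain 0-1-2 plus one far-away point
print(topology([(0.0,), (1.0,), (2.0,), (10.0,)], threshold=1.0))
```

The chain has 2 of its 3 possible edges, so its density is 2/3, while the full graph has 2 of 6 possible edges. The far-away point is both its own component and the only isolated node.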

## Approach

- Stream HDF5 in 100-trial batches (~2 GB RAM per batch)
- For each trial: run `torch.unique()` to get vectors + counts
- Compute all statistics using vectorized operations where possible
- Use NetworkX only for graph topology (unavoidable, but fast for ~12 nodes)
- Save results to CSV: one row per trial, 21 columns (`trial_id` plus 20 statistics)
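
The batching arithmetic behind the first bullet can be sketched as follows. `batch_ranges` and `batch_bytes` are hypothetical helper names. The hidden dimension of 2,560 is an assumption, not something this notebook states. It is used here only because it makes a 100-trial float32 batch come out near the quoted ~2 GB.

```python
def batch_ranges(n_total, batch_size):
    """Yield (start, end) index pairs covering n_total items; the last batch may be short."""
    for start in range(0, n_total, batch_size):
        yield start, min(start + batch_size, n_total)

def batch_bytes(batch_size, n_tokens, hidden_dim, bytes_per_value=4):
    """Memory for one batch held as float32 (4 bytes per value)."""
    return batch_size * n_tokens * hidden_dim * bytes_per_value

# 10,000 trials in batches of 100 gives 100 batches
print(len(list(batch_ranges(10_000, 100))))

# hidden_dim=2560 is an ASSUMED value; substitute the real dimension of the dataset
print(batch_bytes(100, 2_100, 2_560) / 1e9, "GB")
```

If the real hidden dimension differs, the per-batch footprint scales linearly with it, so `BATCH_SIZE` can be adjusted to keep the same memory budget.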

## Output

`../data/analysis/synthetic_comprehensive_n10000.csv`

**Runtime:** ~2-3 minutes (bottleneck: torch.unique() runs on CPU)

## Parameters

In [1]:
# Data source
DATA_H5 = "../data/tensors/synthetic_snowballs_n10000_sigma1.5e-9.h5"

# Streaming configuration
BATCH_SIZE = 100  # Trials per batch (~2 GB RAM per batch)

# Reference scale
EPSILON = 6e-5  # bfloat16 ULP at Qwen magnitude

# Adjacency threshold for topology
TOUCHING_THRESHOLD = 2 * EPSILON

# Output
OUTPUT_CSV = "../data/analysis/synthetic_comprehensive_n10000.csv"

RANDOM_SEED = 42

## Imports

In [2]:
import torch
import numpy as np
import pandas as pd
import h5py
import networkx as nx
from tqdm.auto import tqdm
from pathlib import Path
import gc

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print("✓ Imports complete")

✓ Imports complete


## Helper Functions

In [3]:
def compute_gini_coefficient(populations):
    """
    Compute Gini coefficient of inequality.
    
    Args:
        populations: Tensor of black hole populations
    
    Returns:
        float: Gini coefficient (0 = perfect equality, 1 = maximum inequality)
    """
    if len(populations) <= 1:
        return 0.0
    
    # Sort populations
    sorted_pops = torch.sort(populations.float())[0]
    n = len(sorted_pops)
    
    # Gini formula (x sorted ascending, i = 1..n):
    # G = (2 * sum(i * x_i)) / (n * sum(x_i)) - (n + 1) / n
    indices = torch.arange(1, n + 1, dtype=torch.float32)
    numerator = 2 * torch.sum(indices * sorted_pops)
    denominator = n * torch.sum(sorted_pops)
    
    if denominator == 0:
        return 0.0
    
    gini = (numerator / denominator) - (n + 1) / n
    return gini.item()


def compute_l_inf_stats(unique_vectors, epsilon):
    """
    Compute L∞ (Chebyshev) distance statistics.
    
    Args:
        unique_vectors: Tensor of shape [n_unique, hidden_dim]
        epsilon: Reference scale for normalization
    
    Returns:
        dict with max_l_inf, mean_l_inf, median_l_inf (all in units of epsilon)
    """
    n = len(unique_vectors)
    
    if n <= 1:
        return {'max_l_inf': 0.0, 'mean_l_inf': 0.0, 'median_l_inf': 0.0}
    
    # Compute pairwise L∞ distances (vectorized)
    v1 = unique_vectors.unsqueeze(1)  # [n, 1, d]
    v2 = unique_vectors.unsqueeze(0)  # [1, n, d]
    diffs = v1 - v2  # [n, n, d]
    l_inf_matrix = torch.abs(diffs).max(dim=2)[0]  # [n, n]
    
    # Exclude diagonal (self-distances)
    mask = ~torch.eye(n, dtype=torch.bool)
    l_inf_values = l_inf_matrix[mask]
    
    # Normalize by epsilon
    l_inf_values = l_inf_values / epsilon
    
    return {
        'max_l_inf': l_inf_values.max().item(),
        'mean_l_inf': l_inf_values.mean().item(),
        # torch.median returns the lower of the two middle values for even-length input
        'median_l_inf': l_inf_values.median().item(),
    }


def compute_topology(unique_vectors, threshold):
    """
    Compute graph topology statistics.
    
    Args:
        unique_vectors: Tensor of shape [n_unique, hidden_dim]
        threshold: Adjacency threshold (L∞ distance)
    
    Returns:
        dict with n_components, n_isolated, largest_component_size, 
        largest_component_density, global_density
    """
    n = len(unique_vectors)
    
    if n == 0:
        return {
            'n_components': 0,
            'n_isolated': 0,
            'largest_component_size': 0,
            'largest_component_density': 0.0,
            'global_density': 0.0,
        }
    
    if n == 1:
        return {
            'n_components': 1,
            'n_isolated': 1,
            'largest_component_size': 1,
            'largest_component_density': 1.0,
            'global_density': 1.0,
        }
    
    # Compute pairwise L∞ distances
    v1 = unique_vectors.unsqueeze(1)
    v2 = unique_vectors.unsqueeze(0)
    diffs = v1 - v2
    l_inf_matrix = torch.abs(diffs).max(dim=2)[0]
    
    # Build adjacency matrix (exclude self-loops)
    adjacency = (l_inf_matrix <= threshold) & (~torch.eye(n, dtype=torch.bool))
    
    # Convert to NetworkX graph
    G = nx.Graph()
    G.add_nodes_from(range(n))
    edges = torch.nonzero(adjacency, as_tuple=False).tolist()
    G.add_edges_from(edges)
    
    # Connected components
    components = list(nx.connected_components(G))
    component_sizes = sorted([len(c) for c in components], reverse=True)
    
    n_components = len(components)
    largest_size = component_sizes[0] if component_sizes else 0
    
    # Isolated nodes (degree = 0)
    n_isolated = sum(1 for node in G.nodes() if G.degree(node) == 0)
    
    # Density of largest component
    if largest_size > 1:
        largest_component = max(components, key=len)
        subgraph = G.subgraph(largest_component)
        n_edges = subgraph.number_of_edges()
        max_edges = largest_size * (largest_size - 1) // 2
        largest_density = n_edges / max_edges if max_edges > 0 else 0.0
    else:
        largest_density = 1.0 if largest_size == 1 else 0.0
    
    # Global density
    n_edges = G.number_of_edges()
    max_edges = n * (n - 1) // 2
    global_density = n_edges / max_edges if max_edges > 0 else 0.0
    
    return {
        'n_components': n_components,
        'n_isolated': n_isolated,
        'largest_component_size': largest_size,
        'largest_component_density': largest_density,
        'global_density': global_density,
    }


def compute_trial_statistics(embeddings, epsilon, threshold):
    """
    Compute all statistics for a single trial.
    
    Args:
        embeddings: Tensor of shape [n_tokens, hidden_dim]
        epsilon: Reference scale
        threshold: Adjacency threshold
    
    Returns:
        dict with all 20 statistics (the caller adds trial_id)
    """
    # Get unique vectors and their populations (inverse indices are not needed)
    unique_vectors, counts = torch.unique(
        embeddings,
        dim=0,
        return_counts=True
    )
    
    # Basic counts
    n_tokens = len(embeddings)
    n_unique = len(unique_vectors)
    black_hole_mask = counts >= 2
    n_black_holes = black_hole_mask.sum().item()
    n_singletons = (~black_hole_mask).sum().item()
    total_population = counts.sum().item()
    black_hole_population = counts[black_hole_mask].sum().item() if n_black_holes > 0 else 0
    
    # Per-black-hole statistics
    if n_black_holes > 0:
        bh_populations = counts[black_hole_mask]
        largest_bh = bh_populations.max().item()
        smallest_bh = bh_populations.min().item()
        mean_bh_size = bh_populations.float().mean().item()
        # torch.median returns the lower of the two middle values for even-length input
        median_bh_size = bh_populations.float().median().item()
        
        # Top-2 concentration
        top_k = min(2, len(bh_populations))
        top2_population = bh_populations.topk(top_k)[0].sum().item()
        
        # Gini coefficient
        gini = compute_gini_coefficient(bh_populations)
    else:
        largest_bh = 0
        smallest_bh = 0
        mean_bh_size = 0.0
        median_bh_size = 0.0
        top2_population = 0
        gini = 0.0
    
    # Spatial extent (L∞ distances)
    l_inf_stats = compute_l_inf_stats(unique_vectors, epsilon)
    
    # Topology
    topology_stats = compute_topology(unique_vectors, threshold)
    
    # Combine all statistics
    return {
        # Basic counts
        'n_tokens': n_tokens,
        'n_unique': n_unique,
        'n_black_holes': n_black_holes,
        'n_singletons': n_singletons,
        'total_population': total_population,
        'black_hole_population': black_hole_population,
        
        # Per-BH statistics
        'largest_bh': largest_bh,
        'smallest_bh': smallest_bh,
        'mean_bh_size': mean_bh_size,
        'median_bh_size': median_bh_size,
        'top2_population': top2_population,
        'gini_coefficient': gini,
        
        # Spatial extent
        'max_l_inf': l_inf_stats['max_l_inf'],
        'mean_l_inf': l_inf_stats['mean_l_inf'],
        'median_l_inf': l_inf_stats['median_l_inf'],
        
        # Topology
        'n_components': topology_stats['n_components'],
        'n_isolated': topology_stats['n_isolated'],
        'largest_component_size': topology_stats['largest_component_size'],
        'largest_component_density': topology_stats['largest_component_density'],
        'global_density': topology_stats['global_density'],
    }

print("✓ Helper functions defined")

✓ Helper functions defined


## Stream Processing

In [4]:
print(f"Loading dataset from {DATA_H5}...\n")

results = []

with h5py.File(DATA_H5, 'r') as f:
    n_total_trials = f['embeddings'].shape[0]
    n_batches = (n_total_trials + BATCH_SIZE - 1) // BATCH_SIZE
    
    print(f"✓ Dataset loaded")
    print(f"  Total trials: {n_total_trials:,}")
    print(f"  Batch size: {BATCH_SIZE}")
    print(f"  Number of batches: {n_batches}")
    print(f"\nProcessing...\n")
    
    for batch_idx in tqdm(range(n_batches), desc="Processing batches"):
        # Load batch
        batch_start = batch_idx * BATCH_SIZE
        batch_end = min(batch_start + BATCH_SIZE, n_total_trials)
        
        embeddings_batch = torch.from_numpy(f['embeddings'][batch_start:batch_end]).to(torch.float32)
        
        # Process each trial in batch
        for trial_idx in range(len(embeddings_batch)):
            embeddings = embeddings_batch[trial_idx]
            
            # Compute all statistics
            stats = compute_trial_statistics(embeddings, EPSILON, TOUCHING_THRESHOLD)
            stats['trial_id'] = batch_start + trial_idx
            
            results.append(stats)
        
        # Free batch memory
        del embeddings_batch
        gc.collect()

print(f"\n✓ Processing complete: {len(results):,} trials analyzed")

Loading dataset from ../data/tensors/synthetic_snowballs_n10000_sigma1.5e-9.h5...

✓ Dataset loaded
  Total trials: 10,000
  Batch size: 100
  Number of batches: 100

Processing...





✓ Processing complete: 10,000 trials analyzed


## Save to CSV

In [5]:
# Convert to DataFrame
df = pd.DataFrame(results)

# Reorder columns for readability
column_order = [
    'trial_id',
    # Basic counts
    'n_tokens', 'n_unique', 'n_black_holes', 'n_singletons',
    'total_population', 'black_hole_population',
    # Per-BH stats
    'largest_bh', 'smallest_bh', 'mean_bh_size', 'median_bh_size',
    'top2_population', 'gini_coefficient',
    # Spatial
    'max_l_inf', 'mean_l_inf', 'median_l_inf',
    # Topology
    'n_components', 'n_isolated', 'largest_component_size',
    'largest_component_density', 'global_density',
]

df = df[column_order]

# Save
output_path = Path(OUTPUT_CSV)
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

print(f"\n✓ Saved to {output_path}")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {len(df.columns)}")
print(f"  File size: {output_path.stat().st_size / (1024**2):.2f} MB")


✓ Saved to ../data/analysis/synthetic_comprehensive_n10000.csv
  Rows: 10,000
  Columns: 21
  File size: 1.43 MB


## Preview Results

In [6]:
print(f"\n{'='*70}")
print(f"PREVIEW: First 5 Trials")
print(f"{'='*70}\n")

print(df.head())

print(f"\n{'='*70}")
print(f"SUMMARY STATISTICS")
print(f"{'='*70}\n")

print(df.describe())


PREVIEW: First 5 Trials

   trial_id  n_tokens  n_unique  n_black_holes  n_singletons  \
0         0      2100        10             10             0   
1         1      2100         9              9             0   
2         2      2100        12             10             2   
3         3      2100        11              9             2   
4         4      2100        12              8             4   

   total_population  black_hole_population  largest_bh  smallest_bh  \
0              2100                   2100        1137            2   
1              2100                   2100        1122            3   
2              2100                   2098        1098            2   
3              2100                   2098        1125            2   
4              2100                   2096        1118            4   

   mean_bh_size  ...  top2_population  gini_coefficient  max_l_inf  \
0    210.000000  ...             1526          0.724381   0.254313   
1    233.333328  ...  

## Quick Sanity Checks

In [7]:
print(f"\n{'='*70}")
print(f"SANITY CHECKS")
print(f"{'='*70}\n")

# Check 1: All trials have n_tokens = 2100
assert (df['n_tokens'] == 2100).all(), "ERROR: Some trials have wrong token count!"
print("✓ All trials have n_tokens = 2,100")

# Check 2: total_population == n_tokens
assert (df['total_population'] == df['n_tokens']).all(), "ERROR: Population doesn't match token count!"
print("✓ total_population == n_tokens for all trials")

# Check 3: n_unique == n_black_holes + n_singletons
assert (df['n_unique'] == df['n_black_holes'] + df['n_singletons']).all(), "ERROR: Unique count mismatch!"
print("✓ n_unique == n_black_holes + n_singletons")

# Check 4 (reported, not asserted): count trials with exactly 1 connected component
n_single_component = (df['n_components'] == 1).sum()
print(f"\n✓ Trials with 1 component: {n_single_component:,} / {len(df):,} ({n_single_component/len(df)*100:.1f}%)")

# Check 5 (reported, not asserted): count trials whose largest component has density = 1.0
n_full_density = (df['largest_component_density'] == 1.0).sum()
print(f"✓ Trials with density = 1.0: {n_full_density:,} / {len(df):,} ({n_full_density/len(df)*100:.1f}%)")

print(f"\n{'='*70}")
print(f"ALL SANITY CHECKS PASSED")
print(f"{'='*70}")


SANITY CHECKS

✓ All trials have n_tokens = 2,100
✓ total_population == n_tokens for all trials
✓ n_unique == n_black_holes + n_singletons

✓ Trials with 1 component: 10,000 / 10,000 (100.0%)
✓ Trials with density = 1.0: 10,000 / 10,000 (100.0%)

ALL SANITY CHECKS PASSED
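
One further cross-check follows from the definitions above: if a trial's `max_l_inf` is at most 2 (in ε units, i.e. `TOUCHING_THRESHOLD / EPSILON`), every pair of unique vectors is adjacent, so the graph has to be a single component with density 1.0. The sketch below flags rows that violate this. `inconsistent_trials` is a hypothetical name, and the example rows are made up for illustration rather than taken from the CSV. Values sitting exactly at the threshold may disagree by floating-point rounding, so treat borderline hits with caution.

```python
def inconsistent_trials(rows, threshold_eps=2.0):
    """Return trial_ids where max_l_inf <= threshold but the graph is not complete."""
    flagged = []
    for row in rows:
        forced_complete = row["max_l_inf"] <= threshold_eps
        is_complete = (row["n_components"] == 1
                       and row["largest_component_density"] == 1.0)
        if forced_complete and not is_complete:
            flagged.append(row["trial_id"])
    return flagged

# Hypothetical rows, NOT real results from the CSV
rows = [
    {"trial_id": 0, "max_l_inf": 0.25, "n_components": 1, "largest_component_density": 1.0},
    {"trial_id": 1, "max_l_inf": 0.25, "n_components": 2, "largest_component_density": 1.0},
    {"trial_id": 2, "max_l_inf": 3.00, "n_components": 2, "largest_component_density": 1.0},
]
print(inconsistent_trials(rows))
```

Only the second row is flagged. The third has a spread larger than the threshold, so multiple components are allowed there. With the real data, the same function could be fed `df.to_dict("records")`.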
