# 12.2d: Comprehensive Synthetic Statistics

**Goal:** Stream through all 10,000 synthetic snowball trials and collect EVERY statistic we care about. Save to One CSV To Rule Them All.

## What We Compute (Per Trial)

### Basic Counts
- `n_tokens` - Total tokens (should always be 2,100)
- `n_unique` - Number of unique vectors
- `n_black_holes` - Unique vectors with count ≥ 2
- `n_singletons` - Unique vectors with count = 1
- `total_population` - Sum of all counts (= n_tokens)
- `black_hole_population` - Sum of counts where count ≥ 2
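
A minimal sketch of how these counts could fall out of a single `torch.unique()` call, assuming each trial is a `(n_tokens, dim)` tensor. The function name `basic_counts` and the toy input are illustrative, not taken from the notebook.

```python
import torch

def basic_counts(tokens: torch.Tensor) -> dict:
    """Basic counts for one trial; `tokens` has shape (n_tokens, dim)."""
    # Collapse identical rows; counts[i] is the population of unique vector i.
    _, counts = torch.unique(tokens, dim=0, return_counts=True)
    is_bh = counts >= 2  # a "black hole" is a vector shared by 2+ tokens
    return {
        "n_tokens": tokens.shape[0],
        "n_unique": counts.numel(),
        "n_black_holes": int(is_bh.sum()),
        "n_singletons": int((counts == 1).sum()),
        "total_population": int(counts.sum()),
        "black_hole_population": int(counts[is_bh].sum()),
    }

# Toy trial: 6 tokens, 3 unique vectors (populations 3, 2, 1).
toy = torch.tensor([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                    [1.0, 0.0], [1.0, 0.0],
                    [2.0, 5.0]])
stats = basic_counts(toy)
```

Checking `total_population == n_tokens` on every trial is a cheap sanity test that nothing was dropped during deduplication.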

### Per-Black-Hole Statistics
- `largest_bh` - Max population among black holes
- `smallest_bh` - Min population among black holes
- `mean_bh_size` - Mean population per black hole
- `median_bh_size` - Median population per black hole
- `top2_population` - Sum of two largest black holes (concentration metric)
- `gini_coefficient` - Inequality measure of population distribution
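
The per-black-hole columns could be derived from the same `counts` array, roughly as sketched below. Two assumptions here are mine rather than the notebook's: the Gini coefficient is taken over black-hole populations only (not singletons), and trials with no black holes get `NaN` in every column. The helper names `gini` and `black_hole_stats` are hypothetical.

```python
import numpy as np

def gini(values) -> float:
    """Gini coefficient via the sorted-rank formula (0 = perfectly equal)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(((2 * ranks - n - 1) @ x) / (n * x.sum()))

def black_hole_stats(counts) -> dict:
    """Per-black-hole statistics from the per-vector counts of one trial."""
    counts = np.asarray(counts)
    bh = np.sort(counts[counts >= 2])[::-1]  # black-hole populations, descending
    if bh.size == 0:
        # Assumption: no black holes means every column is undefined.
        keys = ["largest_bh", "smallest_bh", "mean_bh_size",
                "median_bh_size", "top2_population", "gini_coefficient"]
        return {k: float("nan") for k in keys}
    return {
        "largest_bh": int(bh[0]),
        "smallest_bh": int(bh[-1]),
        "mean_bh_size": float(bh.mean()),
        "median_bh_size": float(np.median(bh)),
        "top2_population": int(bh[:2].sum()),  # handles a single black hole too
        "gini_coefficient": gini(bh),
    }

# Toy counts: three black holes (10, 4, 2) and two singletons.
bh_stats = black_hole_stats([10, 4, 2, 1, 1])
```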

### Spatial Extent (L∞ Distances)
- `max_l_inf` - Maximum pairwise Chebyshev distance (in units of ε)
- `mean_l_inf` - Mean pairwise Chebyshev distance
- `median_l_inf` - Median pairwise Chebyshev distance
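
One way to get these three numbers is a broadcasted pairwise difference, sketched below. It assumes the distances are taken between the *unique* vectors of a trial and that `eps` (the grid spacing ε) is passed in by the caller; both the function name `l_inf_stats` and the `eps` parameter are illustrative.

```python
import torch

def l_inf_stats(vectors: torch.Tensor, eps: float) -> dict:
    """Pairwise Chebyshev distance summary, in units of eps.

    `vectors` has shape (n_unique, dim).
    """
    n = vectors.shape[0]
    if n < 2:
        # Assumption: with fewer than two vectors there are no pairs.
        nan = float("nan")
        return {"max_l_inf": nan, "mean_l_inf": nan, "median_l_inf": nan}
    v = vectors.double()  # avoid low-precision rounding in the differences
    # (n, n) matrix: largest absolute coordinate difference for each pair.
    dist = (v[:, None, :] - v[None, :, :]).abs().amax(dim=-1) / eps
    # Keep each unordered pair once (strict upper triangle).
    i, j = torch.triu_indices(n, n, offset=1)
    pairs = dist[i, j]
    return {
        "max_l_inf": pairs.max().item(),
        "mean_l_inf": pairs.mean().item(),
        # quantile interpolates for even pair counts, like numpy.median
        "median_l_inf": torch.quantile(pairs, 0.5).item(),
    }

toy_vectors = torch.tensor([[0.0, 0.0], [1.0, 0.0], [3.0, 2.0]])
spatial = l_inf_stats(toy_vectors, eps=0.5)
```

The broadcast builds an `(n, n, dim)` intermediate, which is only reasonable because the number of unique vectors per trial is small.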

### Topology (Graph Structure)
- `n_components` - Number of connected components in adjacency graph
- `n_isolated` - Number of nodes with degree = 0
- `largest_component_size` - Size of largest connected component
- `largest_component_density` - Edge density of largest component (0-1)
- `global_density` - Edge density of full graph (0-1)
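
The adjacency rule is not spelled out above, so the sketch below makes one up for illustration: two unique vectors are treated as adjacent when their L∞ distance (in units of ε) is at most a hypothetical `threshold`. Only the NetworkX calls are standard; the function name and the rule itself are assumptions.

```python
import networkx as nx
import numpy as np

def topology_stats(dist: np.ndarray, threshold: float = 1.0) -> dict:
    """Graph statistics from an (n, n) pairwise distance matrix (n >= 1).

    Assumption: nodes i and j share an edge when dist[i, j] <= threshold.
    """
    n = dist.shape[0]
    # Strict upper triangle skips self-loops and duplicate pairs.
    adj = np.triu(dist <= threshold, k=1)
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    graph.add_edges_from((int(i), int(j)) for i, j in zip(*np.nonzero(adj)))

    components = list(nx.connected_components(graph))
    largest = max(components, key=len)
    return {
        "n_components": len(components),
        "n_isolated": sum(1 for _, deg in graph.degree() if deg == 0),
        "largest_component_size": len(largest),
        "largest_component_density": nx.density(graph.subgraph(largest)),
        "global_density": nx.density(graph),
    }

# Toy case: nodes 0-1-2 form a chain, node 3 is far from everything.
toy_dist = np.array([[0, 1, 2, 9],
                     [1, 0, 1, 9],
                     [2, 1, 0, 9],
                     [9, 9, 9, 0]], dtype=float)
topo = topology_stats(toy_dist, threshold=1.0)
```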

## Approach

- Stream HDF5 in 100-trial batches (~2 GB RAM per batch)
- For each trial: run `torch.unique()` to get vectors + counts
- Compute all statistics using vectorized operations where possible
- Use NetworkX only for graph topology (unavoidable, but fast for ~12 nodes)
- Save results to CSV: one row per trial, 20+ columns
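
The batching and CSV steps above can be sketched as follows. This is an outline under assumptions, not the notebook's actual loop: `iter_batches` and `write_stats_csv` are hypothetical helpers, `compute_stats` stands in for whatever per-trial function returns the statistics dict, and the extra `trial` index column is my addition. The dataset argument only needs `.shape` and slicing, so a NumPy array works for testing and an open `h5py` dataset should work the same way, since slicing it reads just that slab from disk.

```python
import csv
import io
import numpy as np

def iter_batches(dataset, batch_size=100):
    """Yield (start_index, batch) so only one batch is in memory at a time."""
    for start in range(0, dataset.shape[0], batch_size):
        yield start, dataset[start:start + batch_size]

def write_stats_csv(dataset, compute_stats, out, batch_size=100):
    """Write one CSV row per trial to the open text file `out`."""
    writer = None
    n_rows = 0
    for start, batch in iter_batches(dataset, batch_size):
        for offset, trial in enumerate(batch):
            row = {"trial": start + offset, **compute_stats(trial)}
            if writer is None:
                # Column order is taken from the first row's keys.
                writer = csv.DictWriter(out, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
            n_rows += 1
    return n_rows

# Stand-in data: 250 fake trials of 4 tokens each, in 2 dimensions.
fake_trials = np.zeros((250, 4, 2))
batch_sizes = [len(b) for _, b in iter_batches(fake_trials, 100)]
buf = io.StringIO()
n_written = write_stats_csv(fake_trials, lambda t: {"n_tokens": t.shape[0]}, buf)
```

For a real run, `out` would be a file opened with `newline=""`, as the `csv` module recommends.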

## Output

`../data/analysis/synthetic_comprehensive_n10000.csv`

**Runtime:** ~2-3 minutes (bottleneck: `torch.unique()` runs on CPU)