Skip to content

Simulation Health & Debugging

LiranOG edited this page May 9, 2026 · 8 revisions

🩺 Simulation Health & Debugging

GRANITE v0.6.8 | ← Scientific Context

This page replaces and corrects the outdated docs/diagnostic_handbook.md.
All thresholds and descriptions reflect actual v0.6.8 production run behavior.


Table of Contents

  1. Primary Health Indicators
  2. Healthy Output — What Normal Looks Like
  3. The Lapse Lifecycle — α_center Explained
  4. Hamiltonian Constraint ‖H‖₂ — Interpretation Guide
  5. NaN Forensics
  6. The Debugging Flowchart
  7. Common Failure Modes with Solutions
  8. The Health Check Checklist
  9. Known Diagnostic Limitations

1. Primary Health Indicators

Monitor these six quantities at every diagnostic output step:

Indicator Source Healthy Warning Critical
alpha_center Lapse at AMR L0, r≈0.65M ⚠️ Stable after collapse Oscillating > 5.0 or NaN
‖H‖₂ Hamiltonian constraint L2 norm Decaying or γ < 0.005/M 0.005 < γ < 0.02/M γ > 0.02/M or NaN
NaN_events Explicit NaN scan from C++ 0 at all times Any event
AMR_blocks Block count in hierarchy Grows toward max_levels Frozen at initial count Dropping (regrid bug)
CFL_finest Advection CFL at finest level < 0.5 0.5 – 0.9 > 0.95 (emergency dt reduction)
speed (M/s) Simulated time per wall-second Consistent Stalling Near-zero (zombie state)

2. Healthy Output — What Normal Looks Like

2.1 Single Puncture (single_puncture benchmark, 64³)

GRANITE v0.6.7.2 — single_puncture benchmark
step=2    t=0.375M  α_center=0.9921  ‖H‖₂=1.083e-02  [OK]  blocks=3  speed=0.017M/s
step=4    t=0.750M  α_center=0.9844  ‖H‖₂=9.214e-03  [OK]  blocks=4  speed=0.017M/s
...
step=16   t=3.000M  α_center=0.9634  ‖H‖₂=6.437e-03  [OK]  blocks=4  speed=0.017M/s
step=32   t=6.000M  α_center=0.9399  ‖H‖₂=4.021e-03  [OK]  blocks=4  speed=0.017M/s
...
step=320  t=60.0M   α_center=0.9399  ‖H‖₂=5.551e-04  [OK]  blocks=4  speed=0.017M/s
step=1334 t=500.0M  α_center=0.9399  ‖H‖₂=1.277e-04  [OK]  blocks=4  speed=0.017M/s

=== STABILITY SUMMARY ===
Run completed to t=500.0M  ✓
‖H‖₂ reduction: ×84.8 (1.083e-02 → 1.277e-04)  ✓
NaN events: 0  ✓
Stability: STABLE

2.2 Binary Black Hole Inspiral (B2_eq, 64³, t=500M)

GRANITE v0.6.7.2 — B2_eq benchmark (equal-mass BBH)
step=2    t=3.125M  α_center=0.81207  ‖H‖₂=4.929e-04  [OK]  blocks=3
step=4    t=6.250M  α_center=0.86547  ‖H‖₂=8.226e-04  [OK]  blocks=4
step=6    t=9.375M  α_center=0.89896  ‖H‖₂=6.350e-04  [OK]  blocks=4
step=32   t=50.00M  α_center=0.96214  ‖H‖₂=1.571e-04  [OK]  blocks=4
step=64   t=100.0M  α_center=0.96668  ‖H‖₂=7.540e-05  [OK]  blocks=4
step=128  t=200.0M  α_center=0.96656  ‖H‖₂=4.095e-05  [OK]  blocks=4
step=256  t=400.0M  α_center=0.97264  ‖H‖₂=1.960e-05  [OK]  blocks=4
step=320  t=500.0M  α_center=0.96748  ‖H‖₂=1.341e-05  [OK]  blocks=4

=== STABILITY SUMMARY ===
Run completed to t=500.0M  ✓
‖H‖₂ reduction: ×61.3 (peak 8.226e-04 → 1.341e-05)  ✓
NaN events: 0  ✓
Stability: STABLE

Key signatures of a healthy run:

  • ‖H‖₂ monotonically decaying after initial gauge adjustment peak
  • α_center stable (not oscillating wildly, not → 0 or → ∞)
  • NaN events = 0
  • AMR block count stable at 4

3. The Lapse Lifecycle — α_center Explained

3.1 What α_center Actually Measures

⚠️ Critical caveat: alpha_center reads from AMR level 0 (the coarsest level) at approximately r ≈ 0.65M from the domain center. This is not the lapse at the puncture (r → 0). The values displayed are systematically too high. This is a known diagnostic bug targeted for v0.7.

Despite this, the quantity is still useful as a stability proxy — the key is knowing what values to expect and what patterns indicate problems.

3.2 Expected Lapse Behavior by Phase

Phase Time α_center (L0) Physical α at puncture Status
Initial flat t = 0 1.000 1.000 Normal
Gauge adjustment t < 5M 0.90 – 0.98 Dropping Normal
Trumpet forming t = 5–20M 0.85 – 0.95 → 0.003 Normal
Trumpet stable t > 20M ~0.94 – 0.97 ~0.03 – 0.3 Normal ✓
PROBLEM: lapse explosion any > 2.0 diverging CRASH
PROBLEM: lapse floor any → 1e-6 → 0 Gauge collapse

3.3 Why α_center Doesn't Drop to ~0.3 in Production Runs

At 128³ resolution with dx_base = 0.75M and 4 AMR levels, dx_finest ≈ 0.094M. The trumpet profile has a characteristic scale of ~1 Schwarzschild radius = 2M. At this resolution, the trumpet geometry is not resolved — the finest cell is comparable to the feature scale. Therefore:

  • The lapse never drops below the trumpet phase threshold (α ≈ 0.02)
  • The phase classifier shows "Early Inspiral" throughout
  • This is expected behavior, not a bug

To resolve the trumpet and see α drop to its physical value:

  • Increase AMR levels to 6+ with finer innermost dx (< 0.05M)
  • Or run at 256³ base resolution

4. Hamiltonian Constraint ‖H‖₂ — Interpretation Guide

4.1 What ‖H‖₂ Measures

‖H‖₂ = sqrt(∫ H² dV / V)

where H = R + K² - K_ij K^ij - 16π ρ

In a perfect numerical solution: H = 0 everywhere.
In practice: H ≠ 0 due to truncation error, which we monitor.

4.2 Thresholds — Calibrated to v0.6.8 Production Runs

‖H‖₂ Value Growth Rate γ [M⁻¹] Classification Action
< 1e-2 and decaying < 0 ✅ EXCELLENT None
1e-2 – 1e-1, decaying < 0.005 ✅ HEALTHY Monitor
Stable plateau ≈ 0 ✅ ACCEPTABLE Monitor
Slow growth 0.005 – 0.02 ⚠️ WARNING Check domain size, ko_sigma
Rapid growth > 0.02 🔴 CRITICAL Investigate immediately
→ NaN / ∞ exponential 💥 CRASH Crash analysis required

Why relative thresholds matter more than absolute values:
Initial ‖H‖₂ varies with resolution. 128³ produces higher initial ‖H‖₂ than 64³ (steeper gradients resolved). What matters is the trend — is it decaying or growing?

4.3 Constraint Explosion Pattern — Forensic Analysis

When ‖H‖₂ explodes, it almost never happens gradually — it jumps multiple orders of magnitude in one output window. This is characteristic of:

  1. Domain-size crash: Gauge wave hits boundary, reflects, constraint explosion in 1 step
  2. CFL violation: dt too large for finest AMR level — goes NaN immediately
  3. BC incompatibility: Sommerfeld + BL data — exponential growth from step 1
  4. Over-dissipation (ko_sigma > 0.1): Trumpet destroyed — secondary gauge collapse

5. NaN Forensics

5.1 NaN Is An Infection, Not A Crash

A NaN (Not-a-Number per IEEE 754) propagates through arithmetic: NaN + anything = NaN. The first NaN event ("Patient Zero") is the diagnostic target.

5.2 The "Clean Summary Paradox"

The granite_analysis.cli.sim_tracker 8-layer stability guard can catch NaN events that bypass the C++ summary. However, the inverse is also possible: a crash so catastrophic that the engine terminates before the tracker flushes the NaN event to the summary.

Rule: Always inspect the step log, not just the summary.

5.3 Forensic Checklist

Step 1: Check the very first RK3 step
  └─ The C++ engine scans RHS at step 0 and reports:
     "Spacetime RHS: all finite ✓"
     "Hydro RHS: all finite ✓"
     If either fails → initial data is corrupted

Step 2: Identify the first NaN variable
  └─ The engine scans the first 20 steps for NaN
  └─ Reports: "[NaN] var=17 (alpha) at (i=32,j=32,k=32)"
  └─ Variable index lookup:
     0=chi, 1-6=gamma_tilde, 7=K, 8-13=A_tilde, 14-16=Gamma_hat, 17=alpha, 18-20=beta, 21=Theta

Step 3: Locate the infection source
  └─ Center (i≈nx/2, j≈ny/2, k≈nz/2): near puncture → resolution issue
  └─ Corner (i≈0 or i≈nx): boundary condition failure
  └─ Spreads at ~1 cell/step: normal infection propagation
  └─ Spreads at >5 cells/step: CFL violation (signal faster than grid)

Step 4: Identify the cause
  └─ VAR 0 (chi) fails first: conformal factor collapse at puncture
     → Cause: insufficient resolution or too-aggressive gauge
  └─ VAR 17 (alpha) fails first: lapse instability
     → Cause: CFL violation or Gamma-driver η too large
  └─ VAR 14-16 (Gamma_hat) fails first: shift instability
     → Cause: pure upwinding (d1up) in near-puncture region (known issue)
  └─ Boundary cells fail first: BC problem
     → Cause: Sommerfeld + BL data, or domain too small

6. The Debugging Flowchart

Simulation has unexpected behavior
            │
            ├─ NaN events > 0?
            │    │
            │    └─ YES → Which variable fails first?
            │             ├─ alpha (VAR 17): CFL or gauge issue
            │             │   → Check: CFL at finest AMR level
            │             │   → Check: ko_sigma ≤ 0.1
            │             ├─ chi (VAR 0): resolution at puncture
            │             │   → Increase AMR levels or reduce dx
            │             └─ Boundary cell: BC incompatibility
            │                 → Check: Sommerfeld + BL data (forbidden!)
            │                 → Check: domain.size ≥ 48M (SP) / 200M (BBH)
            │
            ├─ ‖H‖₂ growing rapidly (γ > 0.02/M)?
            │    │
            │    ├─ From step 0 or 1?
            │    │   └─ YES → BC incompatibility (Sommerfeld + BL data)
            │    │
            │    ├─ Sudden jump after t ≈ domain_size/√2?
            │    │   └─ YES → Domain too small. Gauge wave reflected.
            │    │       → Solution: increase domain.size
            │    │
            │    └─ Gradual growth throughout?
            │        └─ Check ko_sigma (> 0.1 = over-dissipation)
            │           Check kappa1 (> 0.05 = over-damping)
            │
            ├─ No merger despite d=10M BBH at t=500M?
            │    │
            │    └─ Check initial_data.bh*.momentum
            │        → momentum: [0,0,0] = head-on collision (NOT inspiral!)
            │        → Need: momentum: [0, ±0.0840, 0] for d=10M equal-mass
            │
            ├─ Lapse not forming trumpet (α stays ≈ 1.0)?
            │    │
            │    └─ Check resolution:
            │        If dx_finest > 0.2M → trumpet unresolved (expected, not a bug)
            │        If ko_sigma > 0.1 → over-dissipation destroying trumpet
            │
            ├─ AMR blocks stuck at 4?
            │    │
            │    └─ This is a known v0.6.8 limitation.
            │       Dynamic regridding not fully implemented.
            │       Not a bug — planned fix in v0.7.
            │
            └─ Phase labels show "Early Inspiral" for entire run?
                 │
                 └─ Phase labels are TIME-BASED, not separation-based.
                    "Early Inspiral" for all t < 100M by definition.
                    This is a known limitation — not a physics diagnosis.

7. Common Failure Modes with Solutions

7.1 Crash at t ≈ 6–11M (Single Puncture)

Symptom: Simulation stable to some time, then ‖H‖₂ jumps to NaN in one step.
Cause: Gauge wave emitted from puncture reaches boundary and reflects.
Formula: t_crash ≈ domain_size / √2
Solution:

domain.size t_crash Adequate for
16M ≈ 11.3M Quick tests only
32M ≈ 22.6M t < 20M runs
48M ≈ 33.9M ✅ t = 500M runs (validated)
200M ≈ 141M ✅ BBH runs to t = 500M

7.2 High ‖H‖₂ from Step 0 (Sommerfeld + BL)

Symptom: ‖H‖₂ starts at 10× higher than expected and grows immediately.
Cause: Sommerfeld BCs applied with Brill-Lindquist initial data.
Solution: Change boundary.type: sommerfeld to boundary.type: copy.

7.3 No Inspiral in BBH Run

Symptom: Two BHs placed at d=10M, run to t=500M, no merger or orbital motion.
Cause: Zero tangential momentum — head-on collision, not inspiral.
Solution: Set momentum: [0.0, 0.0840, 0.0] for bh1 and [0.0, -0.0840, 0.0] for bh2.

7.4 Trumpet Not Forming

Symptom: α_center stays near 1.0 throughout the run, no lapse collapse.
Cause 1: Resolution too coarse (dx_finest > 0.2M) — trumpet unresolved. Not a bug.
Cause 2: ko_sigma > 0.1 — over-dissipation destroying gauge profile. Fix: reduce to 0.1.
Diagnosis: Check ko_sigma first; if 0.1, resolution is the cause.

7.5 Secondary Gauge Collapse Without NaN

Symptom: Lapse hits floor (α → 1e-6) with zero NaN events, then recovers.
Classification: Gauge pathology, NOT a numerical crash.
Action: Do not terminate the simulation. Monitor and continue. If recovery occurs (α increases from floor), the simulation is self-healing.


8. The Health Check Checklist

Run before every production simulation:

python3 scripts/health_check.py

Expected output:

=== GRANITE Pre-Flight Health Check ===
Build type:       Release ✓
Compiler flags:   -O3 -march=native -ffast-math ✓
OpenMP enabled:   YES — 6 threads ✓
HDF5 parallel:    YES ✓
Available RAM:    15.7 GB ✓
CMake config:     Release build confirmed ✓

=== PARAMETER VALIDATION ===
domain.size = 48.0M ... ✓ (≥ 48M minimum)
ko_sigma = 0.1 ... ✓ (≤ 0.15 safe range)
kappa1 = 0.02 ... ✓ (standard value)
boundary = copy with initial_data = brill_lindquist ... ✓ COMPATIBLE

All checks passed. Safe to launch simulation.

Manual checklist:

Check Pass Criterion
Build type Release (-O3 flags confirmed by health_check.py)
ko_sigma ≤ 0.1
kappa1 0.01 – 0.04
domain.size ≥ 48M for SP, ≥ 200M for BBH t > 100M
BC compatibility copy for BL data; sommerfeld for Two-Punctures
BBH momentum [0, ±0.0840, 0] for d=10M equal-mass inspiral
Initial NaN scan "all finite" at step 0
‖H‖₂ at t=5M < 1.0

9. Known Diagnostic Limitations

These limitations are confirmed in v0.6.8 and planned for fix in v0.7:

Limitation Impact Fix Version
alpha_center reads from AMR level 0, not finest level near puncture Displayed values (≈0.97) are NOT the puncture lapse (~0.03) v0.7
Phase labels (Early/Mid/Late Inspiral) are time-based, not separation-based Phase labels are not physics diagnostics v0.7
AMR block count fixed at initialization AMR level count shown in telemetry reflects initial setup, not dynamic refinement v0.7
loadCheckpoint() not implemented Checkpoint files are written but cannot be used to resume v0.7
M1 radiation not wired into RK3 Radiation physics completely inactive in all production runs v0.7

See also: Parameter Reference | Known Fixed Bugs | Architecture Overview


Clone this wiki locally