-
Notifications
You must be signed in to change notification settings - Fork 0
Simulation Health & Debugging
GRANITE v0.6.8 | ← Scientific Context
This page replaces and corrects the outdated
docs/diagnostic_handbook.md.
All thresholds and descriptions reflect actual v0.6.8 production run behavior.
- Primary Health Indicators
- Healthy Output — What Normal Looks Like
- The Lapse Lifecycle — α_center Explained
- Hamiltonian Constraint ‖H‖₂ — Interpretation Guide
- NaN Forensics
- The Debugging Flowchart
- Common Failure Modes with Solutions
- The Health Check Checklist
- Known Diagnostic Limitations
Monitor these six quantities at every diagnostic output step:
| Indicator | Source | Healthy | Warning | Critical |
|---|---|---|---|---|
alpha_center |
Lapse at AMR L0, r≈0.65M |
Stable after collapse | Oscillating | > 5.0 or NaN |
‖H‖₂ |
Hamiltonian constraint L2 norm | Decaying or γ < 0.005/M | 0.005 < γ < 0.02/M | γ > 0.02/M or NaN |
NaN_events |
Explicit NaN scan from C++ | 0 at all times | — | Any event |
AMR_blocks |
Block count in hierarchy | Grows toward max_levels | Frozen at initial count | Dropping (regrid bug) |
CFL_finest |
Advection CFL at finest level | < 0.5 | 0.5 – 0.9 | > 0.95 (emergency dt reduction) |
speed (M/s) |
Simulated time per wall-second | Consistent | Stalling | Near-zero (zombie state) |
GRANITE v0.6.7.2 — single_puncture benchmark
step=2 t=0.375M α_center=0.9921 ‖H‖₂=1.083e-02 [OK] blocks=3 speed=0.017M/s
step=4 t=0.750M α_center=0.9844 ‖H‖₂=9.214e-03 [OK] blocks=4 speed=0.017M/s
...
step=16 t=3.000M α_center=0.9634 ‖H‖₂=6.437e-03 [OK] blocks=4 speed=0.017M/s
step=32 t=6.000M α_center=0.9399 ‖H‖₂=4.021e-03 [OK] blocks=4 speed=0.017M/s
...
step=320 t=60.0M α_center=0.9399 ‖H‖₂=5.551e-04 [OK] blocks=4 speed=0.017M/s
step=1334 t=500.0M α_center=0.9399 ‖H‖₂=1.277e-04 [OK] blocks=4 speed=0.017M/s
=== STABILITY SUMMARY ===
Run completed to t=500.0M ✓
‖H‖₂ reduction: ×84.8 (1.083e-02 → 1.277e-04) ✓
NaN events: 0 ✓
Stability: STABLE
GRANITE v0.6.7.2 — B2_eq benchmark (equal-mass BBH)
step=2 t=3.125M α_center=0.81207 ‖H‖₂=4.929e-04 [OK] blocks=3
step=4 t=6.250M α_center=0.86547 ‖H‖₂=8.226e-04 [OK] blocks=4
step=6 t=9.375M α_center=0.89896 ‖H‖₂=6.350e-04 [OK] blocks=4
step=32 t=50.00M α_center=0.96214 ‖H‖₂=1.571e-04 [OK] blocks=4
step=64 t=100.0M α_center=0.96668 ‖H‖₂=7.540e-05 [OK] blocks=4
step=128 t=200.0M α_center=0.96656 ‖H‖₂=4.095e-05 [OK] blocks=4
step=256 t=400.0M α_center=0.97264 ‖H‖₂=1.960e-05 [OK] blocks=4
step=320 t=500.0M α_center=0.96748 ‖H‖₂=1.341e-05 [OK] blocks=4
=== STABILITY SUMMARY ===
Run completed to t=500.0M ✓
‖H‖₂ reduction: ×61.3 (peak 8.226e-04 → 1.341e-05) ✓
NaN events: 0 ✓
Stability: STABLE
Key signatures of a healthy run:
- ‖H‖₂ monotonically decaying after initial gauge adjustment peak
- α_center stable (not oscillating wildly, not → 0 or → ∞)
- NaN events = 0
- AMR block count stable at 4
⚠️ Critical caveat:alpha_centerreads from AMR level 0 (the coarsest level) at approximately r ≈ 0.65M from the domain center. This is not the lapse at the puncture (r → 0). The values displayed are systematically too high. This is a known diagnostic bug targeted for v0.7.
Despite this, the quantity is still useful as a stability proxy — the key is knowing what values to expect and what patterns indicate problems.
| Phase | Time | α_center (L0) | Physical α at puncture | Status |
|---|---|---|---|---|
| Initial flat | t = 0 | 1.000 | 1.000 | Normal |
| Gauge adjustment | t < 5M | 0.90 – 0.98 | Dropping | Normal |
| Trumpet forming | t = 5–20M | 0.85 – 0.95 | → 0.003 | Normal |
| Trumpet stable | t > 20M | ~0.94 – 0.97 | ~0.03 – 0.3 | Normal ✓ |
| PROBLEM: lapse explosion | any | > 2.0 | diverging | CRASH |
| PROBLEM: lapse floor | any | → 1e-6 | → 0 | Gauge collapse |
At 128³ resolution with dx_base = 0.75M and 4 AMR levels, dx_finest ≈ 0.094M. The trumpet profile has a characteristic scale of ~1 Schwarzschild radius = 2M. At this resolution, the trumpet geometry is not resolved — the finest cell is comparable to the feature scale. Therefore:
- The lapse never drops below the trumpet phase threshold (α ≈ 0.02)
- The phase classifier shows "Early Inspiral" throughout
- This is expected behavior, not a bug
To resolve the trumpet and see α drop to its physical value:
- Increase AMR levels to 6+ with finer innermost dx (< 0.05M)
- Or run at 256³ base resolution
‖H‖₂ = sqrt(∫ H² dV / V)
where H = R + K² - K_ij K^ij - 16π ρ
In a perfect numerical solution: H = 0 everywhere.
In practice: H ≠ 0 due to truncation error, which we monitor.
| ‖H‖₂ Value | Growth Rate γ [M⁻¹] | Classification | Action |
|---|---|---|---|
| < 1e-2 and decaying | < 0 | ✅ EXCELLENT | None |
| 1e-2 – 1e-1, decaying | < 0.005 | ✅ HEALTHY | Monitor |
| Stable plateau | ≈ 0 | ✅ ACCEPTABLE | Monitor |
| Slow growth | 0.005 – 0.02 | Check domain size, ko_sigma | |
| Rapid growth | > 0.02 | 🔴 CRITICAL | Investigate immediately |
| → NaN / ∞ | exponential | 💥 CRASH | Crash analysis required |
Why relative thresholds matter more than absolute values:
Initial ‖H‖₂ varies with resolution. 128³ produces higher initial ‖H‖₂ than 64³ (steeper gradients resolved). What matters is the trend — is it decaying or growing?
When ‖H‖₂ explodes, it almost never happens gradually — it jumps multiple orders of magnitude in one output window. This is characteristic of:
- Domain-size crash: Gauge wave hits boundary, reflects, constraint explosion in 1 step
- CFL violation: dt too large for finest AMR level — goes NaN immediately
- BC incompatibility: Sommerfeld + BL data — exponential growth from step 1
- Over-dissipation (ko_sigma > 0.1): Trumpet destroyed — secondary gauge collapse
A NaN (Not-a-Number per IEEE 754) propagates through arithmetic: NaN + anything = NaN. The first NaN event ("Patient Zero") is the diagnostic target.
The granite_analysis.cli.sim_tracker 8-layer stability guard can catch NaN events that bypass the C++ summary. However, the inverse is also possible: a crash so catastrophic that the engine terminates before the tracker flushes the NaN event to the summary.
Rule: Always inspect the step log, not just the summary.
Step 1: Check the very first RK3 step
└─ The C++ engine scans RHS at step 0 and reports:
"Spacetime RHS: all finite ✓"
"Hydro RHS: all finite ✓"
If either fails → initial data is corrupted
Step 2: Identify the first NaN variable
└─ The engine scans the first 20 steps for NaN
└─ Reports: "[NaN] var=17 (alpha) at (i=32,j=32,k=32)"
└─ Variable index lookup:
0=chi, 1-6=gamma_tilde, 7=K, 8-13=A_tilde, 14-16=Gamma_hat, 17=alpha, 18-20=beta, 21=Theta
Step 3: Locate the infection source
└─ Center (i≈nx/2, j≈ny/2, k≈nz/2): near puncture → resolution issue
└─ Corner (i≈0 or i≈nx): boundary condition failure
└─ Spreads at ~1 cell/step: normal infection propagation
└─ Spreads at >5 cells/step: CFL violation (signal faster than grid)
Step 4: Identify the cause
└─ VAR 0 (chi) fails first: conformal factor collapse at puncture
→ Cause: insufficient resolution or too-aggressive gauge
└─ VAR 17 (alpha) fails first: lapse instability
→ Cause: CFL violation or Gamma-driver η too large
└─ VAR 14-16 (Gamma_hat) fails first: shift instability
→ Cause: pure upwinding (d1up) in near-puncture region (known issue)
└─ Boundary cells fail first: BC problem
→ Cause: Sommerfeld + BL data, or domain too small
Simulation has unexpected behavior
│
├─ NaN events > 0?
│ │
│ └─ YES → Which variable fails first?
│ ├─ alpha (VAR 17): CFL or gauge issue
│ │ → Check: CFL at finest AMR level
│ │ → Check: ko_sigma ≤ 0.1
│ ├─ chi (VAR 0): resolution at puncture
│ │ → Increase AMR levels or reduce dx
│ └─ Boundary cell: BC incompatibility
│ → Check: Sommerfeld + BL data (forbidden!)
│ → Check: domain.size ≥ 48M (SP) / 200M (BBH)
│
├─ ‖H‖₂ growing rapidly (γ > 0.02/M)?
│ │
│ ├─ From step 0 or 1?
│ │ └─ YES → BC incompatibility (Sommerfeld + BL data)
│ │
│ ├─ Sudden jump after t ≈ domain_size/√2?
│ │ └─ YES → Domain too small. Gauge wave reflected.
│ │ → Solution: increase domain.size
│ │
│ └─ Gradual growth throughout?
│ └─ Check ko_sigma (> 0.1 = over-dissipation)
│ Check kappa1 (> 0.05 = over-damping)
│
├─ No merger despite d=10M BBH at t=500M?
│ │
│ └─ Check initial_data.bh*.momentum
│ → momentum: [0,0,0] = head-on collision (NOT inspiral!)
│ → Need: momentum: [0, ±0.0840, 0] for d=10M equal-mass
│
├─ Lapse not forming trumpet (α stays ≈ 1.0)?
│ │
│ └─ Check resolution:
│ If dx_finest > 0.2M → trumpet unresolved (expected, not a bug)
│ If ko_sigma > 0.1 → over-dissipation destroying trumpet
│
├─ AMR blocks stuck at 4?
│ │
│ └─ This is a known v0.6.8 limitation.
│ Dynamic regridding not fully implemented.
│ Not a bug — planned fix in v0.7.
│
└─ Phase labels show "Early Inspiral" for entire run?
│
└─ Phase labels are TIME-BASED, not separation-based.
"Early Inspiral" for all t < 100M by definition.
This is a known limitation — not a physics diagnosis.
Symptom: Simulation stable to some time, then ‖H‖₂ jumps to NaN in one step.
Cause: Gauge wave emitted from puncture reaches boundary and reflects.
Formula: t_crash ≈ domain_size / √2
Solution:
| domain.size | t_crash | Adequate for |
|---|---|---|
| 16M | ≈ 11.3M | Quick tests only |
| 32M | ≈ 22.6M | t < 20M runs |
| 48M | ≈ 33.9M | ✅ t = 500M runs (validated) |
| 200M | ≈ 141M | ✅ BBH runs to t = 500M |
Symptom: ‖H‖₂ starts at 10× higher than expected and grows immediately.
Cause: Sommerfeld BCs applied with Brill-Lindquist initial data.
Solution: Change boundary.type: sommerfeld to boundary.type: copy.
Symptom: Two BHs placed at d=10M, run to t=500M, no merger or orbital motion.
Cause: Zero tangential momentum — head-on collision, not inspiral.
Solution: Set momentum: [0.0, 0.0840, 0.0] for bh1 and [0.0, -0.0840, 0.0] for bh2.
Symptom: α_center stays near 1.0 throughout the run, no lapse collapse.
Cause 1: Resolution too coarse (dx_finest > 0.2M) — trumpet unresolved. Not a bug.
Cause 2: ko_sigma > 0.1 — over-dissipation destroying gauge profile. Fix: reduce to 0.1.
Diagnosis: Check ko_sigma first; if 0.1, resolution is the cause.
Symptom: Lapse hits floor (α → 1e-6) with zero NaN events, then recovers.
Classification: Gauge pathology, NOT a numerical crash.
Action: Do not terminate the simulation. Monitor and continue. If recovery occurs (α increases from floor), the simulation is self-healing.
Run before every production simulation:
python3 scripts/health_check.pyExpected output:
=== GRANITE Pre-Flight Health Check ===
Build type: Release ✓
Compiler flags: -O3 -march=native -ffast-math ✓
OpenMP enabled: YES — 6 threads ✓
HDF5 parallel: YES ✓
Available RAM: 15.7 GB ✓
CMake config: Release build confirmed ✓
=== PARAMETER VALIDATION ===
domain.size = 48.0M ... ✓ (≥ 48M minimum)
ko_sigma = 0.1 ... ✓ (≤ 0.15 safe range)
kappa1 = 0.02 ... ✓ (standard value)
boundary = copy with initial_data = brill_lindquist ... ✓ COMPATIBLE
All checks passed. Safe to launch simulation.
Manual checklist:
| Check | Pass Criterion |
|---|---|
| Build type | Release (-O3 flags confirmed by health_check.py) |
| ko_sigma | ≤ 0.1 |
| kappa1 | 0.01 – 0.04 |
| domain.size | ≥ 48M for SP, ≥ 200M for BBH t > 100M |
| BC compatibility | copy for BL data; sommerfeld for Two-Punctures |
| BBH momentum | [0, ±0.0840, 0] for d=10M equal-mass inspiral |
| Initial NaN scan | "all finite" at step 0 |
| ‖H‖₂ at t=5M | < 1.0 |
These limitations are confirmed in v0.6.8 and planned for fix in v0.7:
| Limitation | Impact | Fix Version |
|---|---|---|
alpha_center reads from AMR level 0, not finest level near puncture |
Displayed values (≈0.97) are NOT the puncture lapse (~0.03) | v0.7 |
| Phase labels (Early/Mid/Late Inspiral) are time-based, not separation-based | Phase labels are not physics diagnostics | v0.7 |
| AMR block count fixed at initialization | AMR level count shown in telemetry reflects initial setup, not dynamic refinement | v0.7 |
loadCheckpoint() not implemented |
Checkpoint files are written but cannot be used to resume | v0.7 |
| M1 radiation not wired into RK3 | Radiation physics completely inactive in all production runs | v0.7 |
See also: Parameter Reference | Known Fixed Bugs | Architecture Overview
v0.6.8 · Repository · Issues
- 🩺 Simulation Health & Debugging
- 📊 Benchmarks & Validation
- 🗂️ AMR Design
- 🖥️ HPC Deployment
- 🌀 VORTEX Engine
"Simulate the unimaginable."