Skip to content

Known Fixed Bugs

LiranOG edited this page May 9, 2026 · 7 revisions

🐛 Known Fixed Bugs

GRANITE v0.6.8 | ← Initial Data | Parameter Reference →

Authoritative record of all confirmed fixed bugs. Never re-introduce any item on this list.
Cross-referenced with CHANGELOG.md for full technical details.


⚠️ Standing Order

Every PR that touches any of the files listed in this table must explicitly verify that the corresponding fix is still in place. Include in your PR description: "Verified fix [ID] is present: [how you verified it]."


Complete Bug Registry

C1 — EOS Sound Speed: Hardcoded γ=1

Field Value
ID C1
Severity CRITICAL — Physics wrong for all non-trivial EOS
File src/grmhd/grmhd.cppmaxWavespeeds()
Phase Fixed Phase 4D
Commit Phase 4D session

What was wrong:
The fast magnetosonic wavespeed computation used cs² = p / (ρh) — the ideal gas formula for Γ=1 — regardless of the actual EOS. For IdealGasEOS(Γ=5/3), the correct value is cs² = Γ·p/(ρh), giving a wavespeed underestimate of factor √(5/3) ≈ 1.29. For stiff nuclear EOS (Γ≈2.5), the underestimate reached √2.5 ≈ 1.58.

Consequence: Underestimated wavespeeds → HLLE stencil narrower than the true light cone → subcritical scheme → potential CFL violation in production simulations.

The fix:

// BEFORE (wrong):
const Real cs2 = press / (rho * h + 1.0e-30);  // Hardcoded Γ=1

// AFTER (correct):
const Real cs_raw = eos.soundSpeed(rho, eps, press);  // Exact from EOS
const Real cs = std::clamp(cs_raw, 0.0, 1.0 - 1.0e-8);
const Real cs2 = cs * cs;

Verification test: GRMHDGRTest.HLLEFluxScalesWithLapse — passes when wavespeeds are correct.


C3 — HLLE Flat-Metric Hard-Code

Field Value
ID C3
Severity CRITICAL — All previous simulations used SR not GR fluxes
File src/grmhd/grmhd.cppcomputeHLLEFlux()
Phase Fixed Phase 4D

What was wrong:
computeHLLEFlux() internally constructed a dummy flat Minkowski metric:

Metric3 g; g.alpha=1; g.betax=0; g.betay=0; g.betaz=0; g.sqrtg=1; // FLAT!

This meant the HLLE solver computed SR (special-relativistic) fluxes regardless of the actual spacetime curvature. Near a neutron star surface (α≈0.8, √γ≈1.2), the flux magnitude was wrong by α·√γ ≈ 0.96 — a ~4% systematic error.

Additionally, the wavespeed formula was wrong: λ± = α·v_pm - α (subtracted lapse instead of shift).

The fix: Promoted Metric3 to public GRMetric3 struct. Changed computeHLLEFlux() signature to accept const GRMetric3& g. Corrected wavespeed formula:

// AFTER (correct GR formula):
const Real beta_dir = (dir==0)?g.betax:(dir==1)?g.betay:g.betaz;
return {g.alpha * vL - beta_dir, g.alpha * vR - beta_dir};

Architectural consequence: GRMetric3 is now the mandatory, public-only interface between CCZ4 and GRMHD. This architectural rule must be maintained.


H1 — KO Dissipation: 22 Separate OpenMP Regions

Field Value
ID H1
Severity HIGH — ~30% of compute time wasted; potential race conditions
File src/spacetime/ccz4.cppapplyDissipation()
Phase Fixed Phase 4D

What was wrong:

for (int var = 0; var < 22; ++var) {         // Outer loop: 22 vars
    #pragma omp parallel for collapse(2)      // ← 22 separate thread spawns!
    for (int k ...) for (int j ...) for (int i ...)
        rhs.data(var, i, j, k) += ko_term;
}

This spawned and joined the OpenMP thread pool 22 times per RHS call. Each cycle: ~50 μs overhead × 22 × 10⁶ RHS calls ≈ 1,100 seconds wasted per long run (~30% of total compute time).

The fix: Single parallel region, variable loop innermost:

#pragma omp parallel for collapse(2) schedule(static)  // ONE thread spawn
for (int k ...) for (int j ...) {
    Real diss[22] = {};  // Stack-allocated, L1 cache resident
    for (int i ...) {
        for (int var = 0; var < 22; ++var)
            rhs.data(var, i, j, k) += ko_compute(var, i, j, k);
    }
}

Verification test: CCZ4FlatTest.KODissipationIsZeroForFlatSpacetime — confirms zero dissipation on constant data with the new loop structure.


H2 — GridBlock: vector-of-vectors Allocation

Field Value
ID H2
Severity HIGH — Cache thrashing, no SIMD, 22 heap allocations per block
File include/granite/core/grid.hpp, src/core/grid.cpp
Phase Fixed Phase 5

What was wrong:

std::vector<std::vector<Real>> data_;  // 22 separate heap allocations!
// Each var's data at a different memory address
// Cache miss on every var access in KO dissipation accumulator

The fix:

std::vector<Real> data_;  // SINGLE flat allocation
// Indexed as: data_[var * stride + flat(i,j,k)]
// All vars of any given cell are stride_ words apart in memory
// All cells of any given var are 8 bytes apart (SIMD-friendly)

22 → 1 heap allocations per GridBlock. All data() accessor calls produce identical results — zero behavioral change to callers.

Verification tests: GridBlockFlatLayoutTest (4 tests): FlatBufferContiguity, RawDataEquivalentToAccessor, FlatBufferSpansAllVars, MultiVarWriteReadBack.


H3 — RHS Zero-Out Loop Order

Field Value
ID H3
Severity HIGH — Cache thrashing in the most frequently executed code path
File src/grmhd/grmhd.cppGRMHDEvolution::computeRHS()
Phase Fixed Phase 5

What was wrong:

// WRONG — var outermost causes stride_ jumps between writes:
for (int var = 0; var < 9; ++var)
    for (int k ...) for (int j ...) for (int i ...)
        rhs.data(var, i, j, k) = 0.0;
// Consecutive writes: stride_ words apart → TLB pressure on large grids

The fix:

// CORRECT — spatial outermost, var innermost = sequential writes:
for (int k ...) for (int j ...) for (int i ...)
    for (int var = 0; var < 9; ++var)
        rhs.data(var, i, j, k) = 0.0;
// Consecutive writes: 8 bytes apart (one double) → L1 cache hits

Rule: Spatial loops must always be outer; variable loop must always be inner. This must not be changed.


TOV — TOV Solver RSUN_CGS Conversion

Field Value
ID TOV
Severity CRITICAL — TOV solver produced nonsensical results
File src/initial_data/initial_data.cpp
Phase Fixed Phase 4A/4C

What was wrong:

const Real r_cgs = r_km * RSUN_CGS;  // RSUN_CGS ≈ 6.957e10 cm  ← WRONG!
// Used solar radius instead of km→cm conversion factor
// Off by factor: 6.957e10 / 1e5 = ~700,000
// Step sizes were astronomically large; solver diverged on first step

The fix:

const Real r_cgs = r_km * 1.0e5;  // 1 km = 1e5 cm ← CORRECT

Verification: PolytropeTest.TOVSolver — M ≈ 1.4 M☉, R ≈ 10 km for standard polytrope.


KO-σ — Over-Dissipation at ko_sigma = 0.35

Field Value
ID KO-σ
Severity HIGH — Destroys trumpet gauge profile silently
File benchmarks/B2_eq/params.yaml
Phase Fixed v0.6.5 (reverted B2_eq config)

What happened:
The B2_eq benchmark YAML was configured with ko_sigma: 0.35 (from Phase 7 experimentation with aggressive gauge damping). This value over-dissipates the trumpet gauge profile: the high-frequency noise is suppressed but so is the physically important steep gradient that forms the trumpet geometry. Secondary gauge collapse occurs without NaN events, making the run appear stable while the physics is unphysical.

The fix: ko_sigma: 0.35ko_sigma: 0.1 in all benchmark YAML files.

Rule: ko_sigma > 0.15 is forbidden in all production configurations. Default and enforced value: 0.1.


Sommerfeld+BL — Incompatible BC + Initial Data

Field Value
ID Sommerfeld+BL
Severity HIGH — 8× higher constraint violations from step 1
File Configuration / user error
Phase Fixed Phase 6 (documented; user responsibility)

Root cause: Sommerfeld BCs assume a spherical 1/r outgoing wave profile. Brill-Lindquist initial data has a conformal factor ψ = 1 + Σ(m/2r) which does NOT match this profile. The BC formula therefore introduces artificial metric gradients from t=0 that feed into the Ricci tensor.

Evidence (Phase 6 experiment):

  • Copy BC at t=5.625M: ‖H‖₂ = 0.276
  • Sommerfeld BC at t=5.625M: ‖H‖₂ = 2.257 (8.2× worse)

Rule: Use boundary.type: copy with initial_data.type: brill_lindquist. This is non-negotiable.


Domain — Boundary Gauge Reflections on Small Domains

Field Value
ID Domain
Severity HIGH — Constraint explosion at predictable time
File benchmarks/*/params.yaml — domain.size parameter
Phase Fixed Phase 6 (documented) / Phase 7 (domain expansion to ±48M)

Physics:
Gauge waves in 1+log slicing propagate at speed c_gauge = √2 ≈ 1.41 M/M. A wave emitted at t=0 from r=0 reaches the outer boundary at t = domain_size / √2. Upon reflection, constraint violations grow exponentially to NaN in O(1) output window.

Minimum safe domain sizes:

  • Single puncture (t=500M): ±48M (t_reflection ≈ 33.9M; stable to t=500M confirmed)
  • BBH inspiral (t=500M): ±200M (t_reflection ≈ 141M; validated in B2_eq benchmark)

alpha_center — Diagnostic Bug: AMR Level 0 Sampling

Field Value
ID alpha_center
Severity MEDIUM — Misleading diagnostic output
File src/core/evolution_loop.cppalpha_center diagnostic
Status Known but NOT YET FIXED — v0.7 target

What is wrong:
The alpha_center diagnostic reads the lapse from AMR level 0 (the coarsest grid level) at the grid point nearest to the domain center. At 64³ base resolution with dx=6.25M, this cell center is at approximately r ≈ 0.65·dx ≈ 4M from the origin — not at the puncture.

Consequence: The displayed α_center ≈ 0.97 in B2_eq runs is the lapse at r≈4M from the puncture, where the spacetime is nearly flat. The actual puncture lapse (r→0) should be ~0.03 in a resolved trumpet.

Fix required: Sample α from the finest AMR level at the tracking sphere center.


See also: Simulation Health & Debugging | CHANGELOG | Parameter Reference


Clone this wiki locally