Fix SPIRV ID overflow for large kernels due to autodiff. #489

Merged
duburcqa merged 11 commits into main from duburcqa/fix_spirv_id_overflow on Apr 16, 2026

@duburcqa
Contributor

@duburcqa duburcqa commented Apr 16, 2026

Summary

Fixes a SIGSEGV during loss.backward() on Metal/Vulkan when differentiating articulated-body
simulations (e.g. Genesis rigid ABD with freejoint + child joints). The reverse-mode autodiff
kernel generates SPIR-V modules with ~89K IDs, and the SPIRV-Tools SSA-construction pass
(LocalMultiStoreElim) was overflowing the default 4M ID cap, producing corrupt SPIR-V that
crashed the GPU driver.

Root cause: SPIRV-Tools caps TakeNextId() at 0x3FFFFF (~4M) by default. The SSA pass
creates phi nodes proportional to (variables x join blocks), easily exceeding 4M on large autodiff
kernels. When TakeNextId() returns 0, the pass continues with invalid id references and segfaults.

Fix: Raise max_id_bound to 0x3FFFFFF (~64M) via the public OptimizerOptions::set_max_id_bound()
API, add intermediate AggressiveDCE passes matching the upstream RegisterPerformancePasses
pipeline, and append CompactIdsPass at the end to renumber IDs back to a dense range for
Metal/Vulkan drivers. No vendored SPIRV-Tools code is modified.
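
For a sense of scale, the two caps mentioned above work out as follows. This is plain arithmetic on the constants quoted in this PR; the variable and join-block counts are made-up illustration values (an assumption, not measurements from the failing kernel), and `worst_case_phi_count` is a hypothetical helper.

```python
# Default SPIRV-Tools cap and the raised cap quoted in this PR.
DEFAULT_MAX_ID_BOUND = 0x3FFFFF    # 4,194,303 (~4M)
RAISED_MAX_ID_BOUND = 0x3FFFFFF    # 67,108,863 (64 * 2**20 - 1)


def worst_case_phi_count(num_variables: int, num_join_blocks: int) -> int:
    """Upper bound on phi nodes if every variable needs a phi at every join block."""
    return num_variables * num_join_blocks


# Hypothetical kernel shape, chosen only to illustrate the growth.
phis = worst_case_phi_count(num_variables=3_000, num_join_blocks=2_000)
print(phis, phis > DEFAULT_MAX_ID_BOUND, phis > RAISED_MAX_ID_BOUND)
# prints: 6000000 True False
```

With these illustrative numbers the phi nodes alone exceed the default cap but stay well inside the raised one.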

Reproduction

Reproduction script for Genesis-Embodied-AI/Genesis#2537:

repro_2537.py
"""Reproduction for Genesis-Embodied-AI/Genesis#2537 on macOS (Metal backend).

Symptom: `loss.backward()` SIGSEGVs (exit 139) for an articulated robot whose
root is a freejoint with at least one child joint. The crash is preceded by a
flood of `[spirv_codegen] ID overflow. Try running compact-ids.` warnings,
suggesting the Quadrants-generated reverse-mode SPIR-V kernel for the rigid
ABD (Articulated Body Dynamics) submodule overflows the SPIR-V ID space.

On Linux + CUDA the same scene hangs indefinitely instead of crashing; on the
CPU backend the same scene completes correctly in ~66 s, which suggests the
bug is in GPU-target codegen of the rigid backward kernels rather than in the
algorithm itself.

Run:
    python repro_2537.py

Requirements:
    macOS with Metal-capable GPU. `performance_mode=True` is mandatory because
    Metal does not support autodiff in the default ndarray mode (see
    `genesis/engine/scene.py:279-284`).
"""
import os
import sys
import tempfile
import time

import torch

import genesis as gs


MJCF_ARTICULATED = """
<mujoco model="free_plus_hinge">
  <worldbody>
    <body name="chassis" pos="0 0 0">
      <freejoint name="root"/>
      <inertial mass="1.0" pos="0 0 0" diaginertia="0.1 0.1 0.1"/>
      <geom type="box" size="0.1 0.1 0.1" contype="0" conaffinity="0"/>
      <body name="wheel" pos="0.2 0 0">
        <joint name="hinge_y" type="hinge" axis="0 1 0"/>
        <inertial mass="0.5" pos="0 0 0" diaginertia="0.05 0.05 0.05"/>
        <geom type="cylinder" fromto="0 -0.05 0 0 0.05 0" size="0.1"
              contype="0" conaffinity="0"/>
      </body>
    </body>
  </worldbody>
</mujoco>
"""


def main():
    if sys.platform != "darwin":
        print("[repro] WARNING: this script is intended for macOS (Metal backend).", flush=True)

    gs.init(backend=gs.gpu, performance_mode=True, logging_level="warning")

    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "w") as f:
        f.write(MJCF_ARTICULATED)

    scene = gs.Scene(
        sim_options=gs.options.SimOptions(dt=0.01, gravity=(0, 0, 0), requires_grad=True),
        rigid_options=gs.options.RigidOptions(enable_collision=False),
        show_viewer=False,
    )
    robot = scene.add_entity(gs.morphs.MJCF(file=path))
    scene.build()

    # 6 freejoint DOFs + 1 hinge DOF
    ctrl = gs.tensor([0.0] * 7, requires_grad=True)
    target = torch.tensor([0.05, 0.0, 0.0], device=gs.device)

    scene.reset()
    for _ in range(5):
        robot.set_dofs_velocity(ctrl)
        scene.step()

    state = robot.get_state()
    loss = torch.nn.functional.mse_loss(state.pos.squeeze(), target)
    print(f"[repro] forward done, loss={loss.item():.6f} -- calling backward()", flush=True)

    t0 = time.time()
    loss.backward()  # SIGSEGV here on macOS Metal + performance_mode=True
    print(f"[repro] BACKWARD COMPLETED in {time.time() - t0:.2f}s", flush=True)
    print(f"[repro] ctrl.grad = {ctrl.grad}", flush=True)


if __name__ == "__main__":
    main()

Test plan

  • repro_2537.py (Genesis freejoint + hinge): EXIT=0, backward completes in 18s (cached),
    correct gradient
  • pytest tests/python/test_ad_ndarray_torch.py test_ndarray.py test_function.py test_struct.py
    (349 passed, 0 failed)

@duburcqa
Contributor Author

@claude review

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
@duburcqa duburcqa force-pushed the duburcqa/fix_spirv_id_overflow branch 3 times, most recently from 27a14ce to 3ffcb5e, April 16, 2026 10:17
@duburcqa
Contributor Author

@claude review

@claude claude Bot left a comment

All previously flagged bugs have been addressed in c3abc8f; the remaining inline comments are nits that don't affect production correctness.

Extended reasoning...

Overview

This PR modifies spirv_codegen.cpp to fix a SIGSEGV crash during autodiff backward passes on Metal/Vulkan. The changes include: raising the SPIRV-Tools max ID bound from ~4M to ~64M, inserting intermediate AggressiveDCE passes matching the upstream performance pipeline, appending a CompactIdsPass, adding an ID-overflow detection flag via the message consumer callback, and falling back to un-optimized SPIR-V when optimization fails. Commit c3abc8f (pushed after the initial review) addressed the null-pointer UB in the message consumer, the dedup-state flush/reset gap, and the duplicate comment block.

Security risks

None. This is GPU shader compilation infrastructure with no auth, crypto, or network surface. The only external data is SPIR-V bytecode generated internally by the Quadrants compiler.

Level of scrutiny

Moderate. The changes touch a critical codegen path (SPIR-V optimization pipeline), but the logic is well-motivated and follows the upstream RegisterPerformancePasses ordering. The fallback to un-optimized SPIR-V is simple and correct. The author provided a reproduction script and passing test results (349 tests).

Other factors

The two remaining inline bugs are both nits: one is a comment inaccuracy on a value description and spec claim (doc-only, no code change needed), the other is a debug-only dump path that could show corrupt SPIR-V when DUMP_IR is set and optimization fails (zero production impact). Neither affects the correctness of the fix. The author has been responsive, addressing all three previously raised concerns in a single follow-up commit.


Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 quadrants/codegen/spirv/spirv_codegen.cpp:2383-2395 — The debug IR dump block (enabled via DUMP_IR env var) calls spirv_tools_->Disassemble(optimized_spv) unconditionally, even when success=false due to ID overflow — meaning the _after_opt.spirv file will contain the corrupt/partial SPIR-V rather than the actual un-optimized fallback that was submitted to the GPU. The fix is to mirror the production push_back logic: use success ? optimized_spv : task_res.spirv_code in the dump path.

    Extended reasoning...

    What the bug is and how it manifests

    The PR introduces a correct fallback at the generated_spirv.push_back call (line 2425): success ? std::move(optimized_spv) : std::move(task_res.spirv_code). However, the debug IR dump block immediately above — which is gated on an env-var and writes a _after_opt.spirv file — calls spirv_tools_->Disassemble(optimized_spv, &spirv_asm) unconditionally, without consulting the success flag.

    The specific code path that triggers it

    1. Developer enables DUMP_IR to investigate an optimizer issue.
    2. A large autodiff kernel triggers the ID-overflow path: spirv_opt_id_overflow_seen becomes true, success is set to false.
    3. The dump_ir block is entered (it is not guarded by success).
    4. spirv_tools_->Disassemble(optimized_spv) is called with the corrupt, partially-optimized SPIR-V.
    5. The disassembler writes garbage or partial output to _after_opt.spirv.
    6. The developer inspects the dump expecting to see the un-optimized SPIR-V that was actually shipped to the GPU, but sees the corrupt optimized blob instead.

    Why existing code does not prevent it

    The success flag check introduced by this PR at lines 2392–2394 correctly gates the production path, but the dump block was not updated in tandem. The PR's own comment at line 2423 explicitly warns: "optimized_spv may reference id 0 and is unsafe to ship to the GPU backend" — yet this same pointer is passed to the dump disassembler. Notably, the disabled/dead code block at lines 2409–2420 even defines a local spirv variable that correctly selects success ? optimized_spv : task_res.spirv_code, but then still calls Disassemble(optimized_spv) directly rather than Disassemble(spirv) — a further sign that the dump path was overlooked during this refactor.

    Impact

    This is a debug-only code path (off by default), so there is zero production impact. The disassembler is robust enough to handle id-0 references without crashing (it will just emit %0 = ...), so there is no crash risk in practice. The real harm is diagnostic: the developer who enables DUMP_IR specifically to debug an ID-overflow failure will see a dump that does not match what was submitted to the GPU, potentially leading them to chase a red herring.

    How to fix it

    Replace spirv_tools_->Disassemble(optimized_spv, &spirv_asm) in the dump block with spirv_tools_->Disassemble(success ? optimized_spv : task_res.spirv_code, &spirv_asm), mirroring the pattern already used correctly at line 2425. Alternatively, move the dump block after the push_back and disassemble from generated_spirv.back().

    Step-by-step proof

    1. success = false (ID overflow path).
    2. Execution enters the if (dump_ir) block — there is no if (success) guard.
    3. spirv_tools_->Disassemble(optimized_spv, &spirv_asm) is called; optimized_spv may contain references to id 0.
    4. The disassembler emits textual output with %0 references into _after_opt.spirv.
    5. Lines 2422–2425 execute and correctly push task_res.spirv_code (the un-optimized version) to generated_spirv.
    6. The dump file and the actual GPU submission now show different SPIR-V — the dump is misleading.
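
The dump-path fix proposed in the finding above can be sketched as follows. This is a toy Python model, not the C++ in spirv_codegen.cpp: `select_shipped_spirv`, `dump_after_opt`, and `fake_disassemble` are hypothetical names, and the word lists stand in for real SPIR-V modules.

```python
from typing import Callable, List


def select_shipped_spirv(success: bool, optimized: List[int], unoptimized: List[int]) -> List[int]:
    """Return the module that is actually submitted to the GPU backend."""
    return optimized if success else unoptimized


def dump_after_opt(
    success: bool,
    optimized: List[int],
    unoptimized: List[int],
    disassemble: Callable[[List[int]], str],
) -> str:
    """Disassemble the same module that gets shipped, not always the optimized one."""
    shipped = select_shipped_spirv(success, optimized, unoptimized)
    return disassemble(shipped)


def fake_disassemble(words: List[int]) -> str:
    """Toy stand-in for a disassembler: formats each word as an id."""
    return " ".join(f"%{w}" for w in words)


# Optimization failed, so the dump shows the un-optimized fallback.
print(dump_after_opt(False, [0, 0, 7], [1, 2, 3], fake_disassemble))
# prints: %1 %2 %3
```

Routing both the dump and the submission through one selection function keeps the two from drifting apart again.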

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
@hughperkins
Collaborator

@claude can I ask questions here?

@hughperkins
Collaborator

@claude What are potential downsides to this PR? What are potential risks of this PR? What is the potential performance impact of this PR? What effects will this PR have on memory usage? What are the risks that this PR might affect other SPIR-V platforms such as Metal?

@hughperkins
Collaborator

From Opus, in answer to the questions I posed to Claude above:

Potential downsides

  • Silent quality-of-service change: When optimization is treated as failed (including ID overflow), users get unoptimized SPIR-V. Correctness should stay OK, but GPU
    time can get much worse with no loud “your kernel is now slow” signal beyond existing warnings/logs.
  • Fragile overflow detection: Tying behavior to the substring "ID overflow" in SPIRV-Tools messages is version- and wording-sensitive. A tooling update could change
    text and stop setting the flag (overflow might still be bad but no fallback), or rarely a different message could spuriously match.
  • More aggressive optimization: Extra AggressiveDCE, duplicated local elim passes, CompactIds, and a higher ID bound change failure modes and compile cost more
    than they change the “happy path” API surface.

Potential risks

  • False negatives on overflow: If overflow is not reported with that phrase, spirv_opt_id_overflow_seen might stay false and you could still ship broken optimized IR
    (the comment says optimized_spv may reference id 0). The PR reduces a known gap but does not guarantee detection in all cases.
  • Heavier pass order: More passes can expose latent SPIRV-Tools bugs or backend conversion quirks (different instruction mix than before) on some kernels.
  • Message dedup / pointer to message: spirv_msg_last = message stores a const char *. If the next call uses a different buffer with the same contents, dedup
    might not behave as intended; lifetime bugs would be UB. The null check helps one class of issues; this area remains subtle.

Performance impact

  • Kernel compile time (CPU): Extra DCE / local-elim / dead-branch / compact passes almost certainly increase SPIRV-Tools work on kernels that already optimized
    successfully. Amount is workload-dependent (small kernels: noise; huge autodiff graphs: noticeable).
  • GPU execution time:
    • Success path: Goal is fewer ID blowups and leaner IR between passes — can be neutral to slightly better once optimization completes.
    • Fallback path: Worse GPU performance because optimizations are dropped entirely for that task’s SPIR-V chunk.

Memory usage effects

  • set_max_id_bound(0x3FFFFFF): Raises the ceiling SPIRV-Tools allows; it does not allocate tens of millions of IDs by itself. Actual memory still scales with
    module size and live IDs during passes. For pathological kernels, peak optimizer memory can grow versus the old lower cap (which might have failed earlier).
  • Message dedup: One saved string / counters — negligible.
  • Fallback: Unoptimized binaries might be smaller than a half-broken oversized optimized blob; optimized success might be slightly smaller after compaction, though this is
    not dominant compared to overall kernel size.

Metal and other SPIR-V platforms

  • Shared pipeline: SPIR-V from this codegen feeds any arch that uses this path; Metal ingests the same SPIR-V and translates via SPIRV-Cross (and similar stacks
    for others). This PR does not add a Metal-specific branch.
  • Semantic effect: Valid SPIR-V should behave the same across backends; different optimization lowers a different IR into SPIRV-Cross, which can change MSL shape
    (register pressure, unrolling, etc.). So perf characteristics on Metal vs Vulkan can diverge from before, even when both remain correct.
  • Realistic “Metal risk”: Less “wrong SPIR-V for Metal,” more (1) rarer SPIRV-Cross edge cases on odd IR, (2) different perf/debuggability when fallback kicks in,
    (3) if anything ever relied on specific IDs (unusual), CompactIds renumbers IDs — reflection usually uses decorations/bindings, so this is rarely a problem.

Possible follow-ups (optional)

  • Surface an explicit user-visible diagnostic when fallback happens.
  • Replace string matching with a SPIRV-Tools-supported overflow signal if available in the version you ship.

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
@hughperkins
Collaborator

Description + good points + bad points, from Opus:

Summary

This PR hardens SPIRV-Tools optimization in spirv_codegen.cpp against SPIR-V ID space overflow on large kernels (notably autodiff-heavy code paths), improves logging
during optimization, and adjusts the optimizer pass pipeline to reduce intermediate IR bloat.

What changed

  • Detect ID overflow during optimization: A thread-local flag is set when SPIRV-Tools reports a message whose text contains "ID overflow". Optimization is treated as
    failed if spirv_opt_->Run fails or this flag is set.
  • Safe fallback: On optimization failure (including overflow), the runtime uses unoptimized SPIR-V from TaskCodegen instead of pushing possibly-invalid optimized
    bytes (e.g. bogus id references) to the backend.
  • Raise the optimizer ID bound via spirv_opt_options_.set_max_id_bound(0x3FFFFFF) so very large modules are not rejected purely due to the default ~4M cap (SPIR-V has no
    fixed upper bound in the spec).
  • Richer pass sequence: Adds intermediate AggressiveDCE (and related local elim / dead-branch passes where noted), plus CreateCompactIdsPass() at the end,
    aligned with the goal of trimming dead instructions between expensive transformations.
  • Logging: Deduplicates consecutive identical SPIRV-Tools messages and flushes suppression counts after each optimizer run. Null message pointers are skipped.
  • Bugfix: The message consumer’s trace branch incorrectly compared against SPV_MSG_INFO twice; it now uses SPV_MSG_DEBUG for QD_TRACE.
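
The overflow flag and message deduplication described in the list above can be modelled roughly as below. This is a toy Python sketch under stated assumptions, not the C++ message consumer from the PR: the class and attribute names are hypothetical, and it compares message text rather than pointers.

```python
from typing import Callable, Optional


class MessageConsumer:
    """Toy model of an optimizer message callback with dedup and an overflow flag."""

    def __init__(self, log: Callable[[str], None]) -> None:
        self.log = log
        self.last: Optional[str] = None   # text of the last message logged
        self.suppressed = 0               # repeats of `last` not yet reported
        self.id_overflow_seen = False

    def __call__(self, message: Optional[str]) -> None:
        if message is None:               # skip null messages
            return
        if "ID overflow" in message:      # substring match, as described above
            self.id_overflow_seen = True
        if message == self.last:          # consecutive duplicate: count, do not log
            self.suppressed += 1
            return
        self.flush()
        self.last = message
        self.log(message)

    def flush(self) -> None:
        """Report how many repeats were suppressed, then reset the count."""
        if self.suppressed:
            self.log(f"(previous message repeated {self.suppressed} more times)")
        self.suppressed = 0


lines = []
consumer = MessageConsumer(lines.append)
overflow = "ID overflow. Try running compact-ids."
for msg in [overflow, overflow, overflow, None, "done"]:
    consumer(msg)
consumer.flush()  # flush after each optimizer run
print(lines, consumer.id_overflow_seen)
```

In this run the three identical overflow warnings collapse into one line plus a repeat count, and the flag is set so the caller can treat the run as failed.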

Good

  • Stability: Avoids shipping broken optimized SPIR-V when the optimizer leaves the module in a bad state; explicit fallback to known-good unoptimized IR.
  • Large-kernel support: Higher max_id_bound addresses realistic autodiff/SSA-style blowups that hit the old cap.
  • Compile pipeline: Extra DCE / compaction aims to prevent overflow by keeping IR smaller between heavy passes, not only bumping limits.
  • Operability: Message dedup and overflow-aware failure make logs easier to read and failures easier to reason about.
  • Small surface area: Changes are localized to SPIR-V optimization in one file.

Tradeoffs / risks

  • Performance when fallback triggers: Users may see much slower GPU kernels when optimization is skipped; this is a silent-ish QoS change unless surfaced clearly
    elsewhere.
  • Host compile cost: Additional optimizer passes can increase CPU time spent in SPIRV-Tools on kernels that already optimized successfully.
  • Overflow detection coupling: Reliance on the literal substring "ID overflow" in tool messages is fragile across SPIRV-Tools versions or wording changes.
  • Incomplete guarantee: If overflow occurs without that message, the flag may not set; the risk of bad optimized output is reduced but not theoretically eliminated without
    a first-class API signal.
  • Cross-backend behavior: All SPIR-V consumers (Vulkan, Metal via SPIRV-Cross, etc.) see different IR when optimization succeeds vs. falls back; performance and
    codegen can diverge by platform even when correctness holds.

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp Outdated
@hughperkins
Collaborator

# what would be consequences if we only increased the limit, with no other changes?

If we only raised the ID limit (set_max_id_bound) and nothing else

Assuming literally the only change is adding something like spirv_opt_options_.set_max_id_bound(0x3FFFFFF) (plus maybe a comment), and no extra passes, no
overflow flag, no message dedup, no CompactIds, no fallback in run().

What would get better

| Area | Effect |
| --- | --- |
| Failures from the default bound | SPIRV-Tools would stop rejecting work purely because the built-in ~4M cap is too low. Many “large autodiff / SSA” cases that only failed that check could start optimizing successfully. |
| Code churn | Minimal diff: one option + comment; easiest to review and revert. |

What would not improve (vs the full branch)

| Area | Effect |
| --- | --- |
| ID growth between passes | The branch adds intermediate DCE (and related ordering) because dead IR can pile up between heavy passes. Cap-only does not reduce that growth; you only allow a higher ceiling. Some kernels might still exhaust IDs, memory, or time, or stay on the edge of failure. |
| Output / peak resources | Without extra cleanup + CompactIds, modules can stay larger in ID space and possibly heavier for the rest of the pipeline than with the full change set. |
| “Run succeeded but output is bad” | The full PR treats “ID overflow” in messages as failure because Run() may not always fail when overflow is reported. Cap-only does not detect that; behavior stays whatever SPIRV-Tools already did in those ambiguous cases. |
| Logs | No dedup of repeated SPIRV-Tools lines; spam unchanged. |
| Consumer bug | The SPV_MSG_INFO duplicated branch (debug/trace path) stays wrong on main. |

Behavior when optimization still fails (Run returns failure)

On origin/main, the code still push_backs optimized_spv even when optimization is marked failed (see the pre‑PR
generated_spirv.push_back(std::move(optimized_spv))).
Cap-only does not add the fallback to task_res.spirv_code, so:

  • Any remaining Run failures (pass bugs, other limits, corruption) still ship whatever is in optimized_spv, which the PR comments describe as potentially
    invalid (e.g. bad IDs).
    So raising the cap alone mainly shrinks how often you hit that failure path; it does not fix the “failed opt still gets used” hazard.

Summary

| Question | Cap-only answer |
| --- | --- |
| Simpler? | Yes. |
| Fixes bound-too-low cases? | Usually yes, if that was the sole bottleneck. |
| Matches full PR robustness? | No: no structural relief for ID blowup, no compacting, no overflow-aware failure, no safe fallback, no log/diagnostic improvements. |

Net: Cap-only is a reasonable minimal mitigation when failures are only “default ID bound too small.” The rest of the branch targets root-cause IR growth,
ambiguous optimizer outcomes, and safe behavior when optimization fails; those consequences remain if you only bump the limit.

# Is there some intermediate change possible beyond simply increasing the id cap, but without turning off optimization? What would be the consequences?

Intermediate options (more than cap-only, but no “turn off optimization” fallback)

Here “turn off optimization” means the PR’s behavior of shipping unoptimized task_res.spirv_code when optimization fails or overflow is seen. Intermediate designs
try to stay on an optimized path or fail loudly instead.

Option A — Cap + pass pipeline / CompactIds (no fallback)

What: Keep (or add) intermediate DCE, CompactIds, maybe reorder passes — and raise max_id_bound, but do not replace successful/failed output with
unoptimized SPIR-V.
Consequences

| Upside | Downside |
| --- | --- |
| Addresses why IDs explode (dead IR between heavy passes), not only the numeric ceiling. | More CPU time per kernel compile (extra passes). |
| CompactIds can shrink ID space in the final module. | If Run() still fails, you’re back to whatever main did with failed optimized_spv unless you add something else (see B). |
| You never silently drop to unoptimized. | Doesn’t fix “Run reports success but messages say overflow” unless you add explicit handling (see B/C). |

Option B — Cap + pipeline + treat overflow as hard compile error (no fallback)

What: Detect overflow (message flag or better API if available) or failed Run(), and QD_ERROR / abort the kernel compile (or return an error to the user) instead
of emitting unoptimized SPIR-V.
Consequences

| Upside | Downside |
| --- | --- |
| No invalid optimized SPIR-V shipped by “best effort.” | Some kernels stop compiling that the full PR would have run in unoptimized mode. |
| Clear failure mode: “optimizer/ID space failed” vs silent QoS change. | Worse availability for huge graphs unless you invest in more fixes upstream. |

This is “don’t turn off optimization” in the sense of not emitting unoptimized IR, but the user loses that kernel until they disable opt or change the program.

Option C — Retry with a lighter optimizer preset (still optimized)

What: On failure or overflow signal, run again with a smaller pass list, or skip the heaviest passes, still producing optimized SPIR-V (weaker optimization).
Consequences

| Upside | Downside |
| --- | --- |
| Still some optimization; avoids full unoptimized path. | Two passes on failure = slower compile on those kernels. |
| May avoid ID blowup without max unoptimized IR. | Tuning burden: which passes to drop; risk of rare miscompiles if preset is wrong. |

Option D — Cap + post-Run validation (fail compile if invalid)

What: After Run(), run the SPIR-V validator (or your own checks). If invalid, error (or retry C), don’t push bad bytes.
Consequences

| Upside | Downside |
| --- | --- |
| Catches more than “cap too low.” | Validator cost; may need environment/extensions aligned with your targets. |
| No unoptimized fallback required for safety. | Still need a policy for “invalid after opt”: error vs retry. |

(set_run_validator(false) in the branch suggests validation was off for speed/tooling reasons; turning it on is a product tradeoff.)

How this compares to the full PR

| Approach | Optimization “off”? | Typical consequence |
| --- | --- | --- |
| Full PR (fallback) | Sometimes uses unoptimized SPIR-V | Best availability, possible big GPU perf hit when fallback triggers. |
| A (pipeline + cap only) | Never deliberately unoptimized | Better than cap-only at avoiding overflow; doesn’t fix bad-output-on-failed-Run by itself. |
| B (fail compile) | No unoptimized emit | Safest semantics if you can’t trust failed output; hardest on huge kernels. |
| C (lighter retry) | No full unoptimized | Middle ground on perf vs compile time vs complexity. |

Short answer: Yes. The natural intermediate step is cap + IR cleanup (DCE / CompactIds / pass order) without fallback. That does not “turn off”
optimization; it tries to keep optimization succeeding. The consequences are mainly higher compile cost and no automatic recovery when optimization still
fails; for that you either accept main-style risk, fail the build (B), or retry a weaker opt (C).

# What's the difference between option A and B?

Difference between Option A and Option B

Option A — Cap + pipeline / CompactIds (no fallback)

  • Goal: Make optimization succeed more often by reducing ID pressure (extra DCE between heavy passes, compaction, higher cap).
  • When things go wrong: You did not define a new rule in my summary; behavior on failed Run() or ambiguous overflow is basically unchanged from whatever the
    codebase already does (e.g. still appending optimized_spv even on failure, unless something else in the tree fixes that).
  • Emphasis: Prevention, not a new failure policy.

Option B — Same tooling improvements + hard error on failure / overflow (no fallback)

  • Goal: Never ship suspect optimized SPIR-V and never fall back to unoptimized — instead fail the kernel compile (or abort with a clear error) when Run()
    fails or overflow is detected.
  • When things go wrong: Explicit failure for the user; no silent alternate path.
  • Emphasis: Safety / correctness policy when optimization does not look trustworthy.

One-line contrast

| | Option A | Option B |
| --- | --- | --- |
| Primary lever | Try harder so optimization usually works | Stop when optimization is not trustworthy |
| User-visible outcome on bad opt | Unspecified / legacy behavior unless you add more | Compile error (or explicit failure) |

Analogy: A is “widen the road and clear debris.” B is “if the bridge still looks unsafe, close it — don’t send traffic over it or on a detour you didn’t approve.”
You can combine them: A’s pass schedule + B’s error-on-failure is a common pattern (prevent first, then fail closed if prevention wasn’t enough).

@hughperkins
Collaborator

Can you write down option E: Cap + pass pipeline / CompactIds + treat overflow as hard compile error (no fallback)

Option E — Cap + pass pipeline / CompactIds + hard error on overflow (no fallback)

What it is

Combine:

  • Higher max_id_bound (same intent as the PR: avoid failing only because the default ~4M cap is too small).
  • Pass pipeline changes that reduce ID growth between heavy optimizations (e.g. intermediate AggressiveDCE, extra local elim / dead-branch passes as appropriate,
    ordering aligned with upstream-style performance passes).
  • CreateCompactIdsPass() (or equivalent end-of-pipeline ID compaction) so the final module’s ID space is tighter.
  • On overflow or failed optimization: treat as a hard compile error (clear diagnostic), not a fallback to unoptimized SPIR-V.
    Explicitly excluded: emitting task_res.spirv_code when optimization fails or overflow is detected (no “turn off optimization” escape hatch).

Intended behavior

| Situation | Outcome |
| --- | --- |
| Optimization succeeds | Ship optimized SPIR-V (same goal as today’s success path). |
| Run() fails or overflow is detected (e.g. message-based flag, or better API if available) | Fail the kernel compile with a specific error (e.g. “SPIR-V optimizer failed” / “ID space overflow”). |
| Ambiguous success (messages suggest overflow but Run() doesn’t fail) | Policy choice: either same as failure (strict) or needs a stronger signal from SPIRV-Tools; strict E prefers fail closed if you treat overflow messages as fatal. |
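
The strict reading of the outcomes above can be sketched as a small decision function. This is a minimal Python illustration of the fail-closed policy, not code from the PR: `SpirvOptimizationError` and `finalize_kernel` are hypothetical names, and the two booleans stand for the result of Run() and the message-based overflow flag.

```python
class SpirvOptimizationError(RuntimeError):
    """Hypothetical error type raised when a kernel compile is failed closed."""


def finalize_kernel(run_ok: bool, id_overflow_seen: bool, optimized: bytes) -> bytes:
    """Return optimized SPIR-V only when optimization is trustworthy; never fall back."""
    if not run_ok:
        raise SpirvOptimizationError("SPIR-V optimizer failed")
    if id_overflow_seen:
        # Strict policy: an overflow message is fatal even if Run() reported success.
        raise SpirvOptimizationError("ID space overflow")
    return optimized
```

There is deliberately no branch that returns unoptimized bytes, which is what distinguishes Option E from the fallback behavior.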

Upsides

  • No silent QoS cliff from falling back to unoptimized shaders.
  • No shipping potentially invalid optimized_spv on failure — fail closed instead of “best effort” bytes.
  • Structural mitigation (DCE / compact / cap) reduces how often you hit the error path vs cap-only.

Downsides / consequences

  • Availability: Kernels that would have run with the full PR’s unoptimized fallback will fail to compile under Option E unless you add another mitigation (e.g.
    weaker retry preset, user disables opt, or smaller kernel).
  • Compile time: Extra passes increase SPIRV-Tools work on the success path (same as Option A).
  • Overflow detection: If you rely on message substring matching (or similar), version fragility remains; false negatives could still ship bad IR unless you add
    validation or a first-class overflow bit from the toolchain.
  • Support burden: Users hit hard errors more often than with fallback — need actionable diagnostics (which pass, approximate module size, hint to disable
    optimization or reduce kernel).

How Option E relates to A and B

  • Versus A: Same prevention stack; Option E adds a defined failure policy (hard error, no legacy “push failed optimized” ambiguity).
  • Versus B: Same fail closed policy; Option E includes the full pipeline + cap prevention layer, not only policy.
    Option E is essentially “maximum prevention + fail closed,” with no unoptimized fallback.

@hughperkins
Collaborator

hughperkins commented Apr 16, 2026

@duburcqa I'm a little concerned about the fallback to unoptimized code, without warning. I'm fairly tempted by option E above, which I interpret as:

  • raise the cap
  • put in place measures to remove old ids, in between compilations and/or passes
  • fail with hard error if we hit cap count, or something is questionable about code generation

Thoughts?

@hughperkins
Collaborator

is there some way we can make the cap dynamic instead of fixed, and grow as needed?

Dynamic cap — what’s possible?

What SPIRV-Tools does today

In SPIRV-Tools, max_id_bound is applied once per Optimizer::Run() to the optimizer’s IRContext (see optimizer.cpp:
context->set_max_id_bound(opt_options->max_id_bound_) before the pass manager runs). It is not described as something that grows automatically between passes inside
a single Run().
So there is no built-in “grow the cap on demand mid-pipeline” mode in the usual API you call from Quadrants. Dynamic behavior has to live in your code around Run().

Practical pattern: retry with increasing bound

  1. Start from a first guess for the bound (default SPIRV-Tools value, or something derived from the input module’s header id bound × a safety factor).
  2. Call Run(..., spirv_opt_options_).
  3. If optimization fails and your consumer / logs indicate ID overflow (or Run fails for that reason), increase the bound (e.g. double, or add a fixed step),
    up to a hard maximum (memory / sanity ceiling).
  4. Re-run from the same original SPIR-V bytes (each Run rebuilds the module from scratch).
    That is “dynamic” in the sense of adaptive cap selection, not a continuously growing limit inside one optimizer invocation.

Tradeoffs

| Aspect | Notes |
| --- | --- |
| Cost | Worst case: multiple full optimizer runs, which can be much slower than picking one high cap up front. |
| Correctness | Each attempt is a normal Run(); no semantic difference vs setting a fixed high cap, if you eventually use a bound that suffices. |
| Safety | You still need a maximum bound; unbounded growth risks OOM or runaway compile time on pathological inputs. |
| Heuristic start | Seeding from the input SPIR-V header’s bound can avoid paying a huge cap on tiny shaders and avoid one failure on medium ones; tuning required. |
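As a sketch of the "heuristic start" idea: a SPIR-V module begins with five header words (magic number, version, generator, bound, schema), and every id in the module is smaller than the bound. The helper names, the safety factor, and the clamping policy below are assumptions for illustration, not something the PR implements.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr uint32_t kSpirvMagic = 0x07230203u;

// Returns the id bound stored in word 3 of the SPIR-V header, or nullopt if
// the buffer is too short or does not start with the magic number.
// Byte-swapped (opposite-endian) modules are not handled in this sketch.
std::optional<uint32_t> read_header_id_bound(const uint32_t *words,
                                             size_t count) {
  if (words == nullptr || count < 5 || words[0] != kSpirvMagic) {
    return std::nullopt;
  }
  return words[3];
}

// Hypothetical seeding policy: scale the header bound by a safety factor,
// then clamp the result to [floor_bound, ceiling].
uint32_t initial_id_bound(uint32_t header_bound, uint32_t safety_factor,
                          uint32_t floor_bound, uint32_t ceiling) {
  // 64-bit arithmetic so the multiplication cannot wrap.
  uint64_t scaled = uint64_t{header_bound} * safety_factor;
  if (scaled < floor_bound) scaled = floor_bound;
  if (scaled > ceiling) scaled = ceiling;
  return static_cast<uint32_t>(scaled);
}
```

The right safety factor depends on how much the SSA passes inflate the id count for a given kernel, so it would need tuning against real workloads.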

What “dynamic” does not solve by itself

  • Root cause of ID explosion (dead IR between passes) — that’s still pass ordering / DCE, etc.
  • Ambiguous failures (e.g. success + overflow messages) — retry logic must tie into whatever signal you trust (same issue as the current PR’s message flag).

Short answer

Yes, in app code: implement retry (or binary search) on set_max_id_bound with a ceiling, re-calling Run() until success or cap. No, not inside SPIRV-Tools as a single fixed call: the bound is per Run, not an auto-growing knob during one run.

@duburcqa
Contributor Author

@hughperkins Answering your questions:

What would be consequences if we only increased the limit, with no other changes?

The optimizer would produce valid SPIR-V (no crash), but with very sparse IDs (we measured up to 6.8M). MoltenVK/SPIRV-Cross on Metal can't handle IDs that high — we got an assertion failure in bind_pipeline. So CompactIdsPass at the end is required to renumber back to a dense range.
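To illustrate what "renumber back to a dense range" means, here is a toy model of id compaction. It is not the SPIRV-Tools CompactIdsPass implementation; `compact_ids` and `CompactResult` are hypothetical names, and the input is just a flat list of ids in order of appearance rather than a real module.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CompactResult {
  std::vector<uint32_t> ids;  // input ids rewritten to dense values
  uint32_t bound;             // one past the largest dense id
};

// Maps each distinct sparse id to the next free dense id, starting at 1
// (id 0 is not a valid SPIR-V result id). Repeated uses of the same id map
// to the same dense value.
CompactResult compact_ids(const std::vector<uint32_t> &sparse_ids) {
  std::unordered_map<uint32_t, uint32_t> remap;
  CompactResult out{{}, 1};
  out.ids.reserve(sparse_ids.size());
  for (uint32_t id : sparse_ids) {
    auto res = remap.emplace(id, out.bound);
    if (res.second) ++out.bound;  // first time this id is seen
    out.ids.push_back(res.first->second);
  }
  return out;
}
```

In this model an id in the millions ends up as a small number, and the resulting bound depends only on how many distinct ids survive optimization, not on how many were allocated along the way.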

Is there some intermediate change possible beyond simply increasing the id cap, but without turning off optimization?

That's exactly what commit 4 does. Nothing is turned off — the full optimization pipeline runs (with additional intermediate DCE passes matching SPIRV-Tools' own RegisterPerformancePasses). The raised cap just gives the optimizer room to allocate temporary IDs during SSA construction, and CompactIds at the end brings the output back into spec range.

What's the difference between option A and B?

Commits 1-3 are defense-in-depth (detect overflow via message consumer, dedup warnings, threshold skip). Commit 4 is the real fix (raise cap + DCE + CompactIds). Commit 5 replaced the silent fallback with a hard error per @hughperkins's feedback. With the cap raised to ~67M and CompactIds renumbering the output, the defense-in-depth code paths should never trigger in practice, but they're there in case a future kernel is pathologically large.

Can we make the cap dynamic?

The SPIRV-Tools API (OptimizerOptions::set_max_id_bound) only accepts a fixed uint32_t. We could set it to UINT32_MAX and rely entirely on CompactIds to bring the output into spec range. The risk is that a truly pathological kernel could allocate billions of IDs, consuming many GB of RAM inside the optimizer before CompactIds runs. The 67M cap is a pragmatic safety net — 16x the spec limit, enough for any kernel we've seen, without unbounded memory growth.
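The numbers in this answer can be checked with a few constants. The two bounds come from this PR (0x3FFFFF as the default, 0x3FFFFFF as the raised value); the constant names and `worst_case_bytes`, including its `bytes_per_id` parameter, are assumptions used only to show how memory scales with the cap, not measurements of the optimizer.

```cpp
#include <cstdint>

constexpr uint32_t kDefaultMaxIdBound = 0x3FFFFF;  // 4,194,303
constexpr uint32_t kRaisedMaxIdBound = 0x3FFFFFF;  // 67,108,863

// The raised cap is 16x the default one.
static_assert((kRaisedMaxIdBound + 1) / (kDefaultMaxIdBound + 1) == 16,
              "raised bound should be 16x the default bound");

// Rough estimate of bookkeeping memory if every id up to the cap were
// allocated and each one cost `bytes_per_id` (a hypothetical figure).
constexpr uint64_t worst_case_bytes(uint32_t id_bound, uint64_t bytes_per_id) {
  return uint64_t{id_bound} * bytes_per_id;
}
```

Under an assumed 64 bytes per id, the raised cap tops out near 4 GiB, while a cap of UINT32_MAX would allow roughly 256 GiB, which is the unbounded-growth risk described above.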

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
@duburcqa duburcqa force-pushed the duburcqa/fix_spirv_id_overflow branch from 57627d8 to dfb6c14 Compare April 16, 2026 13:34
@hughperkins
Collaborator

@claude review

@hughperkins
Collaborator

Opus description:

Summary

Addresses SPIR-V optimizer ID-space exhaustion on very large kernels (e.g. autodiff / heavy SSA-style transforms) by raising the SPIRV-Tools max ID bound, adding intermediate dead-code elimination and a final compact-IDs pass, and detecting “ID overflow” from SPIRV-Tools messages so compilation fails explicitly instead of potentially continuing with bad output.

What changed

  • Optimizer pipeline: Extra AggressiveDCE (and related load/store elim) placements to reduce instruction accumulation before expensive passes; CompactIdsPass at the end; set_max_id_bound(0x3FFFFFF) with comments explaining why the default ~4M bound is insufficient for some kernels.
  • Message consumer: Thread-local flag when messages contain "ID overflow"; deduplication of consecutive identical SPIRV-Tools log lines; fatal-level messages logged at WARN (no throw from inside the callback) with QD_ERROR_IF after Run() for the actual error path.
  • Bugfix: Corrected duplicate SPV_MSG_INFO branch so debug messages use SPV_MSG_DEBUG.
  • KernelCodegen::run: Reset overflow flag before Run(), flush dedup state after; treat Run() failure or overflow flag as failure; only dump post-opt IR when optimization succeeded; disassembly on failure uses the same buffer as the chosen SPIR-V (optimized vs pre-opt).
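The message-consumer behavior described above (overflow flag plus deduplication of consecutive identical lines) could look roughly like this. It is a simplified sketch, not the code in spirv_codegen.cpp: `DedupLog` and `g_id_overflow_seen` are hypothetical names, and lines are collected into a vector instead of being sent to the project's logging macros.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Set when any message mentions "ID overflow". The caller resets it before
// running the optimizer and checks it afterwards, outside the callback.
thread_local bool g_id_overflow_seen = false;

class DedupLog {
 public:
  // Records one message, collapsing runs of identical consecutive lines.
  void on_message(const std::string &msg) {
    if (msg.find("ID overflow") != std::string::npos) {
      g_id_overflow_seen = true;
    }
    if (has_last_ && msg == last_) {
      ++repeats_;
      return;
    }
    flush();
    last_ = msg;
    has_last_ = true;
    repeats_ = 1;
  }

  // Emits the pending line, annotated with its repeat count if it occurred
  // more than once. Must be called once after the optimizer returns.
  void flush() {
    if (!has_last_) return;
    lines_.push_back(repeats_ > 1
                         ? last_ + " (repeated " + std::to_string(repeats_) +
                               " times)"
                         : last_);
    has_last_ = false;
    repeats_ = 0;
  }

  const std::vector<std::string> &lines() const { return lines_; }

 private:
  std::string last_;
  bool has_last_ = false;
  std::size_t repeats_ = 0;
  std::vector<std::string> lines_;
};
```

Keeping the callback free of exceptions and deferring the decision to the caller mirrors the structure described in the bullets: the flag is inspected after Run(), where raising a hard error is safe.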

Strengths

  • Targets a real failure mode: ID exhaustion during optimization on large shaders.
  • CompactIds + higher bound is a standard mitigation; end-of-pipeline compaction helps keep emitted IDs dense.
  • Explicit failure on overflow detection is better than silent success with corrupt SPIR-V.
  • Log dedup keeps SPIRV-Tools spam readable during failures.
  • Avoids throwing from the SPIRV-Tools callback by deferring to QD_ERROR_IF after Run().
  • Small correctness fix in message-level branching and consistent disassembly/dump behavior on failure paths.

Limitations / risks

  • Substring match on "ID overflow" is fragile if SPIRV-Tools rewords messages; comments in code acknowledge this.
  • Pipeline churn: Many extra passes can change optimization behavior (compile time, code size, and possibly semantics at edge cases); worth validating on representative
    kernels and perf-sensitive paths.
  • Very high max_id_bound: Reduces how soon the tools fail, but work/memory can still blow up before that; this is a pragmatic cap raise, not a guarantee of
    scalability.
  • Logging FATAL as WARN in the callback is intentional but can confuse log triage unless reviewers know errors are finalized in QD_ERROR_IF.

Collaborator

@hughperkins hughperkins left a comment


Great! Thank you.

@duburcqa duburcqa force-pushed the duburcqa/fix_spirv_id_overflow branch from 93330ba to c1a8c83 Compare April 16, 2026 18:22
@duburcqa duburcqa enabled auto-merge (squash) April 16, 2026 18:22
@duburcqa duburcqa merged commit 4331f12 into main Apr 16, 2026
48 checks passed
@duburcqa duburcqa deleted the duburcqa/fix_spirv_id_overflow branch April 16, 2026 19:50