Fix SPIRV ID overflow for large kernels due to autodiff.#489
Conversation
|
@claude review |
27a14ce to
3ffcb5e
Compare
|
@claude review |
There was a problem hiding this comment.
All previously flagged bugs have been addressed in c3abc8f; the remaining inline comments are nits that don't affect production correctness.
Extended reasoning...
Overview
This PR modifies spirv_codegen.cpp to fix a SIGSEGV crash during autodiff backward passes on Metal/Vulkan. The changes include: raising the SPIRV-Tools max ID bound from ~4M to ~64M, inserting intermediate AggressiveDCE passes matching the upstream performance pipeline, appending a CompactIdsPass, adding an ID-overflow detection flag via the message consumer callback, and falling back to un-optimized SPIR-V when optimization fails. Commit c3abc8f (pushed after the initial review) addressed the null-pointer UB in the message consumer, the dedup-state flush/reset gap, and the duplicate comment block.
Security risks
None. This is GPU shader compilation infrastructure with no auth, crypto, or network surface. The only external data is SPIR-V bytecode generated internally by the Quadrants compiler.
Level of scrutiny
Moderate. The changes touch a critical codegen path (SPIR-V optimization pipeline), but the logic is well-motivated and follows the upstream RegisterPerformancePasses ordering. The fallback to un-optimized SPIR-V is simple and correct. The author provided a reproduction script and passing test results (349 tests).
Other factors
The two remaining inline bugs are both nits: one is a comment inaccuracy on a value description and spec claim (doc-only, no code change needed), the other is a debug-only dump path that could show corrupt SPIR-V when DUMP_IR is set and optimization fails (zero production impact). Neither affects the correctness of the fix. The author has been responsive, addressing all three previously raised concerns in a single follow-up commit.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
quadrants/codegen/spirv/spirv_codegen.cpp:2383-2395— The debug IR dump block (enabled viaDUMP_IRenv var) callsspirv_tools_->Disassemble(optimized_spv)unconditionally, even whensuccess=falsedue to ID overflow — meaning the_after_opt.spirvfile will contain the corrupt/partial SPIR-V rather than the actual un-optimized fallback that was submitted to the GPU. The fix is to mirror the production push_back logic: usesuccess ? optimized_spv : task_res.spirv_codein the dump path.Extended reasoning...
What the bug is and how it manifests
The PR introduces a correct fallback at the
generated_spirv.push_backcall (line 2425):success ? std::move(optimized_spv) : std::move(task_res.spirv_code). However, the debug IR dump block immediately above — which is gated on an env-var and writes a_after_opt.spirvfile — callsspirv_tools_->Disassemble(optimized_spv, &spirv_asm)unconditionally, without consulting thesuccessflag.The specific code path that triggers it
- Developer enables
DUMP_IRto investigate an optimizer issue. - A large autodiff kernel triggers the ID-overflow path:
spirv_opt_id_overflow_seenbecomestrue,successis set tofalse. - The dump_ir block is entered (it is not guarded by
success). spirv_tools_->Disassemble(optimized_spv)is called with the corrupt, partially-optimized SPIR-V.- The disassembler writes garbage or partial output to
_after_opt.spirv. - The developer inspects the dump expecting to see the un-optimized SPIR-V that was actually shipped to the GPU, but sees the corrupt optimized blob instead.
Why existing code does not prevent it
The
successflag check introduced by this PR at lines 2392–2394 correctly gates the production path, but the dump block was not updated in tandem. The PR's own comment at line 2423 explicitly warns: "optimized_spv may reference id 0 and is unsafe to ship to the GPU backend" — yet this same pointer is passed to the dump disassembler. Notably, the disabled/dead code block at lines 2409–2420 even defines a localspirvvariable that correctly selectssuccess ? optimized_spv : task_res.spirv_code, but then still callsDisassemble(optimized_spv)directly rather thanDisassemble(spirv)— a further sign that the dump path was overlooked during this refactor.Impact
This is a debug-only code path (off by default), so there is zero production impact. The disassembler is robust enough to handle id-0 references without crashing (it will just emit
%0 = ...), so there is no crash risk in practice. The real harm is diagnostic: the developer who enablesDUMP_IRspecifically to debug an ID-overflow failure will see a dump that does not match what was submitted to the GPU, potentially leading them to chase a red herring.How to fix it
Replace
spirv_tools_->Disassemble(optimized_spv, &spirv_asm)in the dump block withspirv_tools_->Disassemble(success ? optimized_spv : task_res.spirv_code, &spirv_asm), mirroring the pattern already used correctly at line 2425. Alternatively, move the dump block after thepush_backand disassemble fromgenerated_spirv.back().Step-by-step proof
success = false(ID overflow path).- Execution enters the
if (dump_ir)block — there is noif (success)guard. spirv_tools_->Disassemble(optimized_spv, &spirv_asm)is called;optimized_spvmay contain references to id 0.- The disassembler emits textual output with
%0references into_after_opt.spirv. - Lines 2422–2425 execute and correctly push
task_res.spirv_code(the un-optimized version) togenerated_spirv. - The dump file and the actual GPU submission now show different SPIR-V — the dump is misleading.
- Developer enables
|
@claude can I ask questions here? |
|
@claude What are potential downsides to this PR? What are potential risks of this PR? What are potential performance impact of this PR? What effects will this PR have on memory usage? What risks are that this PR might affect other spir-v platofrms such as metal? |
|
from Opus, on answer to questions I posed to Claude above: Potential downsides
Potential risks
Performance impact
Memory usage effects
Metal and other SPIR-V platforms
Possible follow-ups (optional)
|
|
Description + good points + bad points, from Opus: SummaryThis PR hardens SPIRV-Tools optimization in What changed
Good
Tradeoffs / risks
|
# what would be consequences if we only increased the limit, with no other changes?If we only raised the ID limit (
|
| Area | Effect |
|---|---|
| Failures from the default bound | SPIRV-Tools would stop rejecting work purely because the built‑in ~4M cap is too low. Many “large autodiff / SSA” cases that |
| only failed that check could start optimizing successfully. | |
| Code churn | Minimal diff: one option + comment; easiest to review and revert. |
What would not improve (vs the full branch)
| Area | Effect |
|---|---|
| ID growth between passes | The branch adds intermediate DCE (and related ordering) because dead IR can pile up between heavy passes. Cap-only does not reduce |
| that growth; you only allow a higher ceiling. Some kernels might still exhaust IDs, memory, or time—or stay on the edge of failure. | |
| Output / peak resources | Without extra cleanup + CompactIds, modules can stay larger in ID space and possibly heavier for the rest of the pipeline than |
| with the full change set. | |
| “Run succeeded but output is bad” | The full PR treats “ID overflow” in messages as failure because Run() may not always fail when overflow is reported. |
| Cap-only does not detect that; behavior stays whatever SPIRV-Tools already did in those ambiguous cases. | |
| Logs | No dedup of repeated SPIRV-Tools lines; spam unchanged. |
| Consumer bug | The SPV_MSG_INFO duplicated branch (debug/trace path) stays wrong on main. |
Behavior when optimization still fails (Run returns failure)
On origin/main, the code still push_backs optimized_spv even when optimization is marked failed (see the pre‑PR
generated_spirv.push_back(std::move(optimized_spv))).
Cap-only does not add the fallback to task_res.spirv_code, so:
- Any remaining
Runfailures (pass bugs, other limits, corruption) still ship whatever is inoptimized_spv, which the PR comments describe as potentially
invalid (e.g. bad IDs).
So raising the cap alone mainly shrinks how often you hit that failure path; it does not fix the “failed opt still gets used” hazard.
Summary
| Question | Cap-only answer |
|---|---|
| Simpler? | Yes. |
| Fixes bound-too-low cases? | Usually yes, if that was the sole bottleneck. |
| Matches full PR robustness? | No — no structural relief for ID blowup, no compacting, no overflow-aware failure, no safe fallback, no log/diagnostic improvements. |
| Net: Cap-only is a reasonable minimal mitigation when failures are only “default ID bound too small.” The rest of the branch targets root-cause IR growth, | |
| ambiguous optimizer outcomes, and safe behavior when optimization fails—those consequences remain if you only bump the limit. |
# Is there some intermediate change possible beyond simply increaseing the id cap, but without turning off optimization? What would be the consequences?
Intermediate options (more than cap-only, but no “turn off optimization” fallback)
Here “turn off optimization” means the PR’s behavior of shipping unoptimized task_res.spirv_code when optimization fails or overflow is seen. Intermediate designs
try to stay on an optimized path or fail loudly instead.
Option A — Cap + pass pipeline / CompactIds (no fallback)
What: Keep (or add) intermediate DCE, CompactIds, maybe reorder passes — and raise max_id_bound, but do not replace successful/failed output with
unoptimized SPIR-V.
Consequences
| Upside | Downside |
|---|---|
| Addresses why IDs explode (dead IR between heavy passes), not only the numeric ceiling. | More CPU time per kernel compile (extra passes). |
CompactIds can shrink ID space in the final module. |
If Run() still fails, you’re back to whatever main did with failed optimized_spv unless you add |
| something else (see B). | |
| You never silently drop to unoptimized. | Doesn’t fix “Run reports success but messages say overflow” unless you add explicit handling (see B/C). |
Option B — Cap + pipeline + treat overflow as hard compile error (no fallback)
What: Detect overflow (message flag or better API if available) or failed Run(), and QD_ERROR / abort the kernel compile (or return an error to the user) instead
of emitting unoptimized SPIR-V.
Consequences
| Upside | Downside |
|---|---|
| No invalid optimized SPIR-V shipped by “best effort.” | Some kernels stop compiling that the full PR would have run in unoptimized mode. |
| Clear failure mode: “optimizer/ID space failed” vs silent QoS change. | Worse availability for huge graphs unless you invest in more fixes upstream. |
| This is “don’t turn off optimization” in the sense of not emitting unoptimized IR — but the user loses that kernel until they disable opt or change the program. |
Option C — Retry with a lighter optimizer preset (still optimized)
What: On failure or overflow signal, run again with a smaller pass list, or skip the heaviest passes, still producing optimized SPIR-V (weaker optimization).
Consequences
| Upside | Downside |
|---|---|
| Still some optimization; avoids full unoptimized path. | Two passes on failure = slower compile on those kernels. |
| May avoid ID blowup without max unoptimized IR. | Tuning burden: which passes to drop; risk of rare miscompiles if preset is wrong. |
Option D — Cap + post-Run validation (fail compile if invalid)
What: After Run(), run the SPIR-V validator (or your own checks). If invalid, error (or retry C), don’t push bad bytes.
Consequences
| Upside | Downside |
|---|---|
| Catches more than “cap too low.” | Validator cost; may need environment/extensions aligned with your targets. |
| No unoptimized fallback required for safety. | Still need a policy for “invalid after opt”: error vs retry. |
(set_run_validator(false) in the branch suggests validation was off for speed/tooling reasons — turning it on is a product tradeoff.) |
How this compares to the full PR
| Approach | Optimization “off”? | Typical consequence |
|---|---|---|
| Full PR (fallback) | Sometimes uses unoptimized SPIR-V | Best availability, possible big GPU perf hit when fallback triggers. |
| A (pipeline + cap only) | Never deliberately unoptimized | Better than cap-only at avoiding overflow; doesn’t fix bad-output-on-failed-Run by itself. |
| B (fail compile) | No unoptimized emit | Safest semantics if you can’t trust failed output; hardest on huge kernels. |
| C (lighter retry) | No full unoptimized | Middle ground on perf vs compile time vs complexity. |
Short answer: Yes — the natural intermediate step is cap + IR cleanup (DCE / CompactIds / pass order) without fallback. That does not “turn off”
optimization; it tries to keep optimization succeeding. The consequences are mainly higher compile cost and no automatic recovery when optimization still
fails — for that you either accept main-style risk, fail the build (B), or retry a weaker opt (C).
# whats the difference between option A and B?
Difference between Option A and Option B
Option A — Cap + pipeline / CompactIds (no fallback)
- Goal: Make optimization succeed more often by reducing ID pressure (extra DCE between heavy passes, compaction, higher cap).
- When things go wrong: You did not define a new rule in my summary — behavior on failed
Run()or ambiguous overflow is basically unchanged from whatever the
codebase already does (e.g. still appendingoptimized_spveven on failure, unless something else in the tree fixes that). - Emphasis: Prevention, not a new failure policy.
Option B — Same tooling improvements + hard error on failure / overflow (no fallback)
- Goal: Never ship suspect optimized SPIR-V and never fall back to unoptimized — instead fail the kernel compile (or abort with a clear error) when
Run()
fails or overflow is detected. - When things go wrong: Explicit failure for the user; no silent alternate path.
- Emphasis: Safety / correctness policy when optimization does not look trustworthy.
One-line contrast
| Option A | Option B | |
|---|---|---|
| Primary lever | Try harder so optimization usually works | Stop when optimization is not trustworthy |
| User-visible outcome on bad opt | Unspecified / legacy behavior unless you add more | Compile error (or explicit failure) |
Analogy: A is “widen the road and clear debris.” B is “if the bridge still looks unsafe, close it — don’t send traffic over it or on a detour you didn’t approve.”
You can combine them: A’s pass schedule + B’s error-on-failure is a common pattern (prevent first, then fail closed if prevention wasn’t enough).
an you write down option E: Cap + pass pipeline / CompactIds + treat overflow as hard compile error (no fallback)Option E — Cap + pass pipeline /
|
| Situation | Outcome |
|---|---|
| Optimization succeeds | Ship optimized SPIR-V (same goal as today’s success path). |
Run() fails or overflow is detected (e.g. message-based flag, or better API if available) |
Fail the kernel compile with a specific error (e.g. “SPIR-V |
| optimizer failed” / “ID space overflow”). | |
Ambiguous success (messages suggest overflow but Run() doesn’t fail) |
Policy choice: either same as failure (strict) or needs a stronger signal from |
| SPIRV-Tools; strict E prefers fail closed if you treat overflow messages as fatal. |
Upsides
- No silent QoS cliff from falling back to unoptimized shaders.
- No shipping potentially invalid
optimized_spvon failure — fail closed instead of “best effort” bytes. - Structural mitigation (DCE / compact / cap) reduces how often you hit the error path vs cap-only.
Downsides / consequences
- Availability: Kernels that would have run with the full PR’s unoptimized fallback will fail to compile under Option E unless you add another mitigation (e.g.
weaker retry preset, user disables opt, or smaller kernel). - Compile time: Extra passes increase SPIRV-Tools work on the success path (same as Option A).
- Overflow detection: If you rely on message substring matching (or similar), version fragility remains; false negatives could still ship bad IR unless you add
validation or a first-class overflow bit from the toolchain. - Support burden: Users hit hard errors more often than with fallback — need actionable diagnostics (which pass, approximate module size, hint to disable
optimization or reduce kernel).
How Option E relates to A and B
- Versus A: Same prevention stack; Option E adds a defined failure policy (hard error, no legacy “push failed optimized” ambiguity).
- Versus B: Same fail closed policy; Option E includes the full pipeline + cap prevention layer, not only policy.
Option E is essentially “maximum prevention + fail closed,” with no unoptimized fallback.
|
@duburcqa I'm a little concerned about the fallback to unoptimized code, without warning. I'm fairly tempted by option E above, which I interpret as:
Thoughts? |
is there some way we can make the cap dynamic instead of fixed, and grow as needed?Dynamic cap — what’s possible?What SPIRV-Tools does todayIn SPIRV-Tools,
|
| Aspect | Notes |
|---|---|
| Cost | Worst case: multiple full optimizer runs — can be much slower than picking one high cap up front. |
| Correctness | Each attempt is a normal Run(); no semantic difference vs setting a fixed high cap, if you eventually use a bound that suffices. |
| Safety | You still need a maximum bound; unbounded growth risks OOM or runaway compile time on pathological inputs. |
| Heuristic start | Seeding from the input SPIR-V header’s bound can avoid paying a huge cap on tiny shaders and avoid one failure on medium ones — tuning |
| required. |
What “dynamic” does not solve by itself
- Root cause of ID explosion (dead IR between passes) — that’s still pass ordering / DCE, etc.
- Ambiguous failures (e.g. success + overflow messages) — retry logic must tie into whatever signal you trust (same issue as the current PR’s message flag).
Short answer
Yes — in app code: implement retry (or binary search) on set_max_id_bound with a ceiling, re-calling Run() until success or cap. No — not inside SPIRV-Tools as a
single fixed call: the bound is per Run, not an auto-growing knob during one run.
|
@hughperkins Answering your questions: What would be consequences if we only increased the limit, with no other changes? The optimizer would produce valid SPIR-V (no crash), but with very sparse IDs (we measured up to 6.8M). MoltenVK/SPIRV-Cross on Metal can't handle IDs that high — we got an assertion failure in Is there some intermediate change possible beyond simply increasing the id cap, but without turning off optimization? That's exactly what commit 4 does. Nothing is turned off — the full optimization pipeline runs (with additional intermediate DCE passes matching SPIRV-Tools' own What's the difference between option A and B? Commits 1-3 are defense-in-depth (detect overflow via message consumer, dedup warnings, threshold skip). Commit 4 is the real fix (raise cap + DCE + CompactIds). Commit 5 replaced the silent fallback with a hard error per @duburcqa's feedback. With the cap raised to ~67M and CompactIds renumbering the output, the defense-in-depth code paths should never trigger in practice — but they're there in case a future kernel is pathologically large. Can we make the cap dynamic? The SPIRV-Tools API ( |
57627d8 to
dfb6c14
Compare
|
@claude review |
|
Opus description: SummaryAddresses SPIR-V optimizer ID-space exhaustion on very large kernels (e.g. autodiff / heavy SSA-style transforms) by raising the SPIRV-Tools max ID bound, adding What changed
Strengths
Limitations / risks
|
…overflows the ID space.
…e overflow crash.
…IDs after optimization.
…(), remove duplicate comment.
… skip dump_ir on failed optimization.
…ROR to QD_ERROR, warn on overflow detection, fix dead-code Disassemble.
…hrough SPIRV-Tools.
…, fix attribs/spirv sync on throw.
…entiate error messages, fix comment accuracy.
93330ba to
c1a8c83
Compare
Summary
Fixes a SIGSEGV during
loss.backward()on Metal/Vulkan when differentiating articulated-bodysimulations (e.g. Genesis rigid ABD with freejoint + child joints). The reverse-mode autodiff
kernel generates SPIR-V modules with ~89K IDs, and the SPIRV-Tools SSA-construction pass
(
LocalMultiStoreElim) was overflowing the default 4M ID cap, producing corrupt SPIR-V thatcrashed the GPU driver.
Root cause: SPIRV-Tools caps
TakeNextId()at0x3FFFFF(~4M) by default. The SSA passcreates phi nodes proportional to (variables x join blocks), easily exceeding 4M on large autodiff
kernels. When
TakeNextId()returns 0, the pass continues with invalid id references and segfaults.Fix: Raise
max_id_boundto 64M via the publicOptimizerOptions::set_max_id_bound()API, add intermediate
AggressiveDCEpasses matching the upstreamRegisterPerformancePassespipeline, and append
CompactIdsPassat the end to renumber IDs back to a dense range forMetal/Vulkan drivers. No vendored SPIRV-Tools code is modified.
Reproduction
Reproduction script for Genesis-Embodied-AI/Genesis#2537:
repro_2537.py
Test plan
repro_2537.py(Genesis freejoint + hinge): EXIT=0, backward completes in 18s (cached),correct gradient
pytest tests/python/test_ad_ndarray_torch.py test_ndarray.py test_function.py test_struct.py(349 passed, 0 failed)