investigation(bb/msm): root-cause ba_walker_combine dx==0 — uninitialized partial slots mis-linked to bucket 0#23741
Draft
AztecBot wants to merge 3 commits into
Draft
Conversation
…ized partial slots mis-linked to bucket 0
AztecBot
added a commit
that referenced
this pull request
May 30, 2026
…orrect
Per-nb cross-check on real macOS hardware (Chrome 148, logn16): nb=1 matches
@noble/curves but nb=2,3,4,5,10 ALL disagree. This refutes the lever's premise
('no correctness change, same path as the logn20 default'). Root cause is host
side: planner/walker bind groups + params are built once with no per-batch
bucket offset, so the batchBuckets-vs-bTotal index spaces only coincide at nb=1.
Blast radius: the same multi-batch code is the wgFits-forced default at logn19
(nb=2) and logn20 (nb=3), so MsmV2 is very likely incorrect by default at
logn>=19 (never cross-checked there — noble too slow). Distinct from the
invisible bucket-0 issue in PR #23741.
Warns on MsmConfig.memBudgetBytes and documents the evidence + root-cause
direction in MSM_V2_MEMORY.md. Does not change runtime behavior (default budget
stays a no-op); the fix to the multi-batch path is required follow-up.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root-cause investigation (not a symptom patch)
Target: the
ba_walker_combineoff-curve-bucket bug at logn 8..16 — a few buckets go off-curve, each with exactly onedx == 0during the linked-list traversal. The prior session was about to make the combine's affine addition exception-safe; this PR looks upstream instead.Finding 1 — the walker/combine logic is correct (rules out the hypotheses)
dev/msm-webgpu/walker-dx0-repro.mjsis a dependency-free, GPU-free port of the exact integer logic ofdecompose_scalars_booth,ba_planner_cumsum/partition_thread/partition_task,ba_stream_walker, andba_walker_combine, plus inline BN254. Across logn 8..16 × 25 seeds it finds zero structural defects:This rules out the hypothesized upstream causes — scalar decomposition, signed-digit (Booth) recoding, bucket-index assignment, and walker partial emission producing a duplicate or P/−P sign collision. (I also confirmed
BW = ceil((2^(c-1)+1)/256)*256correctly reserves the top digit2^(c-1), so the max-magnitude digit does not overflow into the next window.)Finding 2 — the real defect: a sentinel mismatch injects off-curve garbage
ba_walker_combine's input list is built byba_walker_partials_index, which links every slot withpartial_dest != NO_BUCKET(0xffffffff). But the host prepares that buffer withenc.clearBuffer(walkerPartialDest), which writes 0. The walker is indirect-dispatched over onlynumActive = nwg*256threads (≤streamNumThreads = 8192), so it only initializes the slots it owns. Every slot owned by a non-dispatched thread still reads 0, which the indexer treats asbucket_id = 0and links into global bucket 0's combine list — withpartials_buf = (0,0)(off-curve, same zero-clear). Combining(0,0)+(0,0)is exactly thedx == 0.ba_reduce_init_bench.wgsldrops column 0 (the zero digit,src = w*bw + i + 1), so the reference cross-check still passes even with bucket 0 corrupt — consistent with detection via a per-bucket on-curve assertion.numActive < streamNumThreads(smaller logn);node_counternever overflows (realPartials ≤ dispatched).A harness gotcha was found and fixed in the repro: an LCG PRNG has non-random low bits that bias the Booth digits and can mask bucket-distribution bugs — the repro uses mulberry32.
Fix (two upstream changes, both implemented here)
ba_planner_classify(primary): buckets withid % BW == 0are the Booth zero digit, contribute nothing, and are already dropped by the reduction. Excluding them from the dense/size1 lists means global bucket 0 is never combined, so the mis-linked uninitialized slots are never read. Result is identical; it also saves accumulating every per-window zero bucket.classifyParams.ynow carriesBW.ba_walker_partials_indexagainstbucket_id == 0(defense in depth): keeps the zero-cleared slots out of the linked list regardless of classification.Regenerated
_generated/shaders.ts. The deeper sentinel hardening (initializepartial_desttoNO_BUCKET, or bound the indexer scan to the dispatched range) is documented indev/msm-webgpu/WALKER_DX0_FINDINGS.mdas a follow-up.Validation status
Logic correctness and the slot-staleness arithmetic are validated deterministically on CPU (
node dev/msm-webgpu/walker-dx0-repro.mjs sweep). The final GPU cross-check (20+ runs vs@noble/curvesunder headless SwiftShader) could not be run here — no Chrome/Playwright in the container and the in-treebarretenberg.wasm.gzis a 213-byte stub. The repro and the exact buckets/nodes to read back are provided so a GPU-equipped run can confirm.