Skip to content

investigation(bb/msm): root-cause ba_walker_combine dx==0 — uninitialized partial slots mis-linked to bucket 0#23741

Draft
AztecBot wants to merge 3 commits into
stream-walker-implfrom
cb/stream-walker-dx0-rootcause
Draft

investigation(bb/msm): root-cause ba_walker_combine dx==0 — uninitialized partial slots mis-linked to bucket 0#23741
AztecBot wants to merge 3 commits into
stream-walker-implfrom
cb/stream-walker-dx0-rootcause

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 30, 2026

Root-cause investigation (not a symptom patch)

Target: the ba_walker_combine off-curve-bucket bug at logn 8..16 — a few buckets go off-curve, each with exactly one dx == 0 during the linked-list traversal. The prior session was about to make the combine's affine addition exception-safe; this PR looks upstream instead.

Finding 1 — the walker/combine logic is correct (rules out the hypotheses)

dev/msm-webgpu/walker-dx0-repro.mjs is a dependency-free, GPU-free port of the exact integer logic of decompose_scalars_booth, ba_planner_cumsum/partition_thread/partition_task, ba_stream_walker, and ba_walker_combine, plus inline BN254. Across logn 8..16 × 25 seeds it finds zero structural defects:

  • no range overlap (double-count), no gap — partials always cleanly partition each bucket;
  • no bucket holds the same input point twice (case a), and none holds a point and its negation (case b);
  • the affine combine of the partials is on-curve and matches the reference.

This rules out the hypothesized upstream causes — scalar decomposition, signed-digit (Booth) recoding, bucket-index assignment, and walker partial emission producing a duplicate or P/−P sign collision. (I also confirmed BW = ceil((2^(c-1)+1)/256)*256 correctly reserves the top digit 2^(c-1), so the max-magnitude digit does not overflow into the next window.)

Finding 2 — the real defect: a sentinel mismatch injects off-curve garbage

ba_walker_combine's input list is built by ba_walker_partials_index, which links every slot with partial_dest != NO_BUCKET (0xffffffff). But the host prepares that buffer with enc.clearBuffer(walkerPartialDest), which writes 0. The walker is indirect-dispatched over only numActive = nwg*256 threads (≤ streamNumThreads = 8192), so it only initializes the slots it owns. Every slot owned by a non-dispatched thread still reads 0, which the indexer treats as bucket_id = 0 and links into global bucket 0's combine list — with partials_buf = (0,0) (off-curve, same zero-clear). Combining (0,0)+(0,0) is exactly the dx == 0.

  • Matches "off-curve value for a small number of buckets" (global bucket 0).
  • Invisible to the final result: ba_reduce_init_bench.wgsl drops column 0 (the zero digit, src = w*bw + i + 1), so the reference cross-check still passes even with bucket 0 corrupt — consistent with detection via a per-bucket on-curve assertion.
  • Manifests only when numActive < streamNumThreads (smaller logn); node_counter never overflows (realPartials ≤ dispatched).

A harness gotcha was found and fixed in the repro: an LCG PRNG has non-random low bits that bias the Booth digits and can mask bucket-distribution bugs — the repro uses mulberry32.

Fix (two upstream changes, both implemented here)

  1. Exclude zero-digit buckets in ba_planner_classify (primary): buckets with id % BW == 0 are the Booth zero digit, contribute nothing, and are already dropped by the reduction. Excluding them from the dense/size1 lists means global bucket 0 is never combined, so the mis-linked uninitialized slots are never read. Result is identical; it also saves accumulating every per-window zero bucket. classifyParams.y now carries BW.
  2. Guard ba_walker_partials_index against bucket_id == 0 (defense in depth): keeps the zero-cleared slots out of the linked list regardless of classification.

Regenerated _generated/shaders.ts. The deeper sentinel hardening (initialize partial_dest to NO_BUCKET, or bound the indexer scan to the dispatched range) is documented in dev/msm-webgpu/WALKER_DX0_FINDINGS.md as a follow-up.

Validation status

Logic correctness and the slot-staleness arithmetic are validated deterministically on CPU (node dev/msm-webgpu/walker-dx0-repro.mjs sweep). The final GPU cross-check (20+ runs vs @noble/curves under headless SwiftShader) could not be run here — no Chrome/Playwright in the container and the in-tree barretenberg.wasm.gz is a 213-byte stub. The repro and the exact buckets/nodes to read back are provided so a GPU-equipped run can confirm.

@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026
AztecBot added a commit that referenced this pull request May 30, 2026
…orrect

Per-nb cross-check on real macOS hardware (Chrome 148, logn16): nb=1 matches
@noble/curves but nb=2,3,4,5,10 ALL disagree. This refutes the lever's premise
('no correctness change, same path as the logn20 default'). Root cause is host
side: planner/walker bind groups + params are built once with no per-batch
bucket offset, so the batchBuckets-vs-bTotal index spaces only coincide at nb=1.

Blast radius: the same multi-batch code is the wgFits-forced default at logn19
(nb=2) and logn20 (nb=3), so MsmV2 is very likely incorrect by default at
logn>=19 (never cross-checked there — noble too slow). Distinct from the
invisible bucket-0 issue in PR #23741.

Warns on MsmConfig.memBudgetBytes and documents the evidence + root-cause
direction in MSM_V2_MEMORY.md. Does not change runtime behavior (default budget
stays a no-op); the fix to the multi-batch path is required follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant