investigation(bb/msm): root-cause ba_walker_combine dx==0 — uninitialized partial slots mis-linked to bucket 0 by AztecBot · Pull Request #23741 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T03:50:20Z

Root-cause investigation (not a symptom patch)

Target: the ba_walker_combine off-curve-bucket bug at logn 8..16 — a few buckets go off-curve, each with exactly one dx == 0 during the linked-list traversal. The prior session was about to make the combine's affine addition exception-safe; this PR looks upstream instead.

Finding 1 — the walker/combine logic is correct (rules out the hypotheses)

dev/msm-webgpu/walker-dx0-repro.mjs is a dependency-free, GPU-free port of the exact integer logic of decompose_scalars_booth, ba_planner_cumsum/partition_thread/partition_task, ba_stream_walker, and ba_walker_combine, plus inline BN254. Across logn 8..16 × 25 seeds it finds zero structural defects:

no range overlap (double-count), no gap — partials always cleanly partition each bucket;
no bucket holds the same input point twice (case a), and none holds a point and its negation (case b);
the affine combine of the partials is on-curve and matches the reference.
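The structural checks listed above can be sketched as follows. This is a minimal illustration, not the code in walker-dx0-repro.mjs: the `Range` type, the function names, and the signed-index encoding of bucket entries (+k for point k, -k for its negation) are assumptions made for the example.

```typescript
// Hypothetical types and names, for illustration only.
type Range = { start: number; end: number }; // half-open [start, end)

// True when the partial ranges tile [bucketStart, bucketEnd) with no overlap and no gap.
function partitionsCleanly(ranges: Range[], bucketStart: number, bucketEnd: number): boolean {
  const sorted = ranges.slice().sort((a, b) => a.start - b.start);
  let cursor = bucketStart;
  for (const r of sorted) {
    // A range must begin exactly where the previous one ended and be non-empty.
    if (r.start !== cursor || r.end <= r.start) return false;
    cursor = r.end;
  }
  return cursor === bucketEnd;
}

// Assumed encoding: +k is input point k, -k is its negation (k >= 1).
// Returns true for case (a) (same point twice) and case (b) (P and -P together).
function hasDuplicateOrNegation(entries: number[]): boolean {
  const seen = new Set<number>();
  for (const e of entries) {
    const k = Math.abs(e);
    if (seen.has(k)) return true;
    seen.add(k);
  }
  return false;
}

console.log(partitionsCleanly([{ start: 0, end: 3 }, { start: 3, end: 7 }], 0, 7));
console.log(hasDuplicateOrNegation([1, 2, -1]));
```

Either failure would produce a dx == 0 in the combine, which is why both are checked before looking elsewhere.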

This rules out the hypothesized upstream causes — scalar decomposition, signed-digit (Booth) recoding, bucket-index assignment, and walker partial emission producing a duplicate or P/−P sign collision. (I also confirmed BW = ceil((2^(c-1)+1)/256)*256 correctly reserves the top digit 2^(c-1), so the max-magnitude digit does not overflow into the next window.)
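The BW check in the parenthetical can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: `c` is the window size in bits, and a bucket's index inside its window equals the digit magnitude, so the window needs slots 0..2^(c-1). The function names are illustrative; only the BW formula comes from the text above.

```typescript
// Hypothetical helper names; the formula itself is the one quoted in the PR text.
function bucketWidth(c: number): number {
  const maxMagnitude = 2 ** (c - 1); // largest Booth digit magnitude
  // Slots 0..maxMagnitude are needed, rounded up to a multiple of 256.
  return Math.ceil((maxMagnitude + 1) / 256) * 256;
}

// The top digit stays inside its own window when its index is below BW.
function topDigitFits(c: number): boolean {
  return 2 ** (c - 1) < bucketWidth(c);
}

for (const c of [4, 8, 9, 12, 16]) {
  console.log(`c=${c} BW=${bucketWidth(c)} topDigitFits=${topDigitFits(c)}`);
}
```

The `+1` matters at c = 9, where 2^(c-1) = 256 is itself a multiple of 256: without it BW would be 256 and the top digit would alias slot 0 of the next window.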

Finding 2 — the real defect: a sentinel mismatch injects off-curve garbage

ba_walker_combine's input list is built by ba_walker_partials_index, which links every slot whose partial_dest != NO_BUCKET (0xffffffff). But the host prepares that buffer with enc.clearBuffer(walkerPartialDest), which writes 0, not the sentinel. The walker is indirect-dispatched over only numActive = nwg*256 threads (≤ streamNumThreads = 8192), so it initializes only the slots those threads own. Every slot owned by a non-dispatched thread still reads 0, which the indexer treats as bucket_id = 0 and links into global bucket 0's combine list, with partials_buf = (0,0) (off-curve, from the same zero-clear). Combining (0,0)+(0,0) means adding two points with equal x-coordinates, which is exactly the observed dx == 0.

Matches "off-curve value for a small number of buckets" (global bucket 0).
Invisible to the final result: ba_reduce_init_bench.wgsl drops column 0 (the zero digit, src = w*bw + i + 1), so the reference cross-check still passes even with bucket 0 corrupt — consistent with detection via a per-bucket on-curve assertion.
Manifests only when numActive < streamNumThreads (smaller logn); node_counter never overflows (realPartials ≤ dispatched).
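The sentinel mismatch described in Finding 2 can be modelled on the CPU. This is a minimal sketch, not the shader logic: the two-slots-per-thread layout, the `simulate` function, and the bucket ids written by dispatched threads are all assumptions chosen to make the effect visible. Only NO_BUCKET = 0xffffffff, the zero clear, and the "link everything that is not the sentinel" rule come from the text above.

```typescript
const NO_BUCKET = 0xffffffff;
const SLOTS_PER_THREAD = 2; // assumed layout, for illustration only

function simulate(streamNumThreads: number, numActive: number, clearValue: number) {
  // Host-side clear: every slot starts at clearValue (0 models clearBuffer).
  const partialDest = new Uint32Array(streamNumThreads * SLOTS_PER_THREAD).fill(clearValue);

  // Only dispatched threads initialize the slots they own.
  for (let t = 0; t < numActive; t++) {
    partialDest[t * SLOTS_PER_THREAD] = 1 + t; // a real partial for a nonzero bucket
    partialDest[t * SLOTS_PER_THREAD + 1] = NO_BUCKET; // explicitly marked unused
  }

  // Indexer rule from the PR text: link every slot that is not the sentinel.
  let linkedTotal = 0;
  let linkedToBucket0 = 0;
  for (let i = 0; i < partialDest.length; i++) {
    if (partialDest[i] === NO_BUCKET) continue;
    linkedTotal++;
    if (partialDest[i] === 0) linkedToBucket0++;
  }
  return { linkedTotal, linkedToBucket0 };
}

// Zero clear with a partial dispatch: stale slots land in bucket 0.
console.log(simulate(8192, 1024, 0));
// Clearing to the sentinel, or dispatching every thread, removes them.
console.log(simulate(8192, 1024, NO_BUCKET));
console.log(simulate(8192, 8192, 0));

// A zero-cleared point is not on BN254 (y^2 = x^3 + 3), and two of them give dx == 0.
// Plain numbers suffice here because both sides are far below the field modulus.
const stale = { x: 0, y: 0 };
const dx = stale.x - stale.x;
const onCurve = stale.y * stale.y === stale.x ** 3 + 3;
console.log({ dx, onCurve });
```

In this model the stale links vanish when numActive equals streamNumThreads, which is consistent with the bug appearing only at smaller logn.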

A harness gotcha was found and fixed in the repro: an LCG PRNG has non-random low bits that bias the Booth digits and can mask bucket-distribution bugs — the repro uses mulberry32.
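The low-bit weakness is easy to observe directly. The sketch below compares a power-of-two-modulus LCG against mulberry32; the LCG constants are the common Numerical Recipes ones and are an assumption, since the PR does not say which LCG the earlier harness used. `lowBitsRepeat` is a hypothetical helper.

```typescript
// mulberry32: a small 32-bit PRNG, here returning raw uint32 values.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return (t ^ (t >>> 14)) >>> 0;
  };
}

// LCG modulo 2^32 (assumed constants; any odd multiplier and increment behave alike).
function lcg32(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(1664525, s) + 1013904223) >>> 0;
    return s;
  };
}

// True when the low `bits` bits of the stream repeat with period 2^bits.
function lowBitsRepeat(next: () => number, bits: number, samples: number): boolean {
  const mask = (1 << bits) - 1;
  const period = 1 << bits;
  const low: number[] = [];
  for (let i = 0; i < samples; i++) low.push(next() & mask);
  for (let i = 0; i + period < samples; i++) {
    if (low[i] !== low[i + period]) return false;
  }
  return true;
}

console.log("lcg low 4 bits periodic:", lowBitsRepeat(lcg32(1), 4, 64));
console.log("mulberry32 low 4 bits periodic:", lowBitsRepeat(mulberry32(1), 4, 64));
```

For an LCG modulo 2^32 the low k bits depend only on the low k bits of the previous state, so they cycle with period at most 2^k. If scalars are assembled from such words, the low Booth windows see a short repeating pattern rather than a uniform spread over buckets.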

Fix (two upstream changes, both implemented here)

Exclude zero-digit buckets in ba_planner_classify (primary): buckets with id % BW == 0 are the Booth zero digit, contribute nothing, and are already dropped by the reduction. Excluding them from the dense/size1 lists means global bucket 0 is never combined, so the mis-linked uninitialized slots are never read. Result is identical; it also saves accumulating every per-window zero bucket. classifyParams.y now carries BW.
Guard ba_walker_partials_index against bucket_id == 0 (defense in depth): keeps the zero-cleared slots out of the linked list regardless of classification.
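Both changes reduce to small predicates. The sketch below is a CPU-side illustration, not the WGSL in this PR: `classify`, `shouldLink`, and `reductionSource` are hypothetical names, the size-1 versus dense split is simplified to "exactly one entry" versus "more than one", and the range of `i` in the reduction is assumed to be 0..BW-2.

```typescript
const NO_BUCKET = 0xffffffff;

// Buckets whose id is a multiple of BW hold the Booth zero digit of their window.
function isZeroDigitBucket(id: number, bw: number): boolean {
  return id % bw === 0;
}

// Primary fix: leave zero-digit buckets out of both work lists.
function classify(bucketSizes: number[], bw: number) {
  const dense: number[] = [];
  const size1: number[] = [];
  bucketSizes.forEach((size, id) => {
    if (size === 0 || isZeroDigitBucket(id, bw)) return;
    (size === 1 ? size1 : dense).push(id);
  });
  return { dense, size1 };
}

// Defense in depth: the indexer also refuses to link bucket_id == 0.
function shouldLink(dest: number): boolean {
  return dest !== NO_BUCKET && dest !== 0;
}

// Reduction source index as quoted earlier in this PR: src = w*bw + i + 1.
function reductionSource(w: number, bw: number, i: number): number {
  return w * bw + i + 1;
}

// Tiny example with BW = 4: ids 0 and 4 are the zero-digit buckets of windows 0 and 1.
console.log(classify([5, 1, 0, 3, 2, 1, 4, 0], 4));
console.log(shouldLink(0), shouldLink(NO_BUCKET), shouldLink(17));
```

Under the assumed range of `i`, `reductionSource` never produces an id that is a multiple of BW, which is why dropping those buckets in classification leaves the result unchanged.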

Regenerated _generated/shaders.ts. The deeper sentinel hardening (initialize partial_dest to NO_BUCKET, or bound the indexer scan to the dispatched range) is documented in dev/msm-webgpu/WALKER_DX0_FINDINGS.md as a follow-up.

Validation status

Logic correctness and the slot-staleness arithmetic are validated deterministically on CPU (node dev/msm-webgpu/walker-dx0-repro.mjs sweep). The final GPU cross-check (20+ runs vs @noble/curves under headless SwiftShader) could not be run here — no Chrome/Playwright in the container and the in-tree barretenberg.wasm.gz is a 213-byte stub. The repro and the exact buckets/nodes to read back are provided so a GPU-equipped run can confirm.

…orrect Per-nb cross-check on real macOS hardware (Chrome 148, logn16): nb=1 matches @noble/curves but nb=2,3,4,5,10 ALL disagree. This refutes the lever's premise ('no correctness change, same path as the logn20 default'). Root cause is host side: planner/walker bind groups + params are built once with no per-batch bucket offset, so the batchBuckets-vs-bTotal index spaces only coincide at nb=1. Blast radius: the same multi-batch code is the wgFits-forced default at logn19 (nb=2) and logn20 (nb=3), so MsmV2 is very likely incorrect by default at logn>=19 (never cross-checked there — noble too slow). Distinct from the invisible bucket-0 issue in PR #23741. Warns on MsmConfig.memBudgetBytes and documents the evidence + root-cause direction in MSM_V2_MEMORY.md. Does not change runtime behavior (default budget stays a no-op); the fix to the multi-batch path is required follow-up.

Commit 18cd099: investigation(bb/msm): root-cause ba_walker_combine dx==0 — uninitialized partial slots mis-linked to bucket 0

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) on May 30, 2026

AztecBot added 2 commits May 30, 2026 03:53

update PR #23741

18eb50f

update PR #23741

15bff54

This was referenced May 30, 2026

fix(bb/msm): correct MsmV2 multi-batch (numBatches>1) + honest peak-memory map #23733

Draft

perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128 #23726

Draft

AztecBot wants to merge 3 commits into stream-walker-impl from cb/stream-walker-dx0-rootcause
