perf: portable SIMD (NEON + WASM simd128) + cross-platform CI & wheel testing #19
The 128 GiB payload cap was a usize literal (128 * 1024 * 1024 * 1024), which overflows const-evaluation on 32-bit targets (wasm32, armv7) where usize is 32-bit. Retype as u64 and widen the comparison; on 32-bit, usize::MAX (~4 GiB) is already the ceiling so the check never trips. No behaviour change on 64-bit.
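A minimal sketch of the retyping described above, assuming a length check of roughly this shape. MAX_PAYLOAD is the constant named in the PR; payload_len_ok is a hypothetical helper invented for illustration, not a function from the crate.

```rust
/// 128 GiB as a u64. Written as a usize literal, the same product overflows
/// during const evaluation on targets where usize is 32 bits wide.
const MAX_PAYLOAD: u64 = 128 * 1024 * 1024 * 1024;

/// Hypothetical helper (not from the crate): true when `len` is within the cap.
/// Widening usize to u64 is lossless on both 32-bit and 64-bit targets.
fn payload_len_ok(len: usize) -> bool {
    (len as u64) <= MAX_PAYLOAD
}
```

On a 32-bit target every usize value is below the cap, so the comparison is always true there; on 64-bit it behaves like the original usize check.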
The bitmap/sign scan fallbacks ran scalar u64::count_ones() on every non-x86 target, so aarch64 (Graviton, Apple silicon, Axion) and wasm left popcount throughput unused. Add shared and_popcount/xor_popcount helpers dispatching NEON (vcntq_u8/vaddvq_u8, baseline on aarch64, no runtime detection), simd128 (u8x16_popcnt + pairwise reduce) on wasm32+simd128, and scalar elsewhere; wired into all six bitmap/sign scan fallbacks. The x86_64 AVX-512/AVX2 kernels are untouched. popcount is exact-integer, so every path is bit-identical. Also: total_cmp for the top-k finalize sort (true total order, robust to any non-finite score slipping past the guards; agrees with partial_cmp on finite scores), and a popcount equivalence unit test that is the runtime correctness gate for the NEON and simd128 paths.
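The dispatch described above could look roughly like the following. This is an illustrative sketch, not the crate's code: and_popcount is the helper name from the PR, while and_popcount_scalar and and_popcount_impl are hypothetical names, the simd128 arm is omitted, and the input is assumed to be rows of u64 words.

```rust
/// Scalar reference: popcount of (q AND d), one u64 word at a time.
fn and_popcount_scalar(q: &[u64], d: &[u64]) -> u32 {
    q.iter().zip(d).map(|(a, b)| (a & b).count_ones()).sum()
}

/// NEON path (sketch). NEON is baseline on aarch64, so no runtime detection.
#[cfg(target_arch = "aarch64")]
fn and_popcount_impl(q: &[u64], d: &[u64]) -> u32 {
    use core::arch::aarch64::{vaddvq_u8, vandq_u64, vcntq_u8, vld1q_u64, vreinterpretq_u8_u64};
    let pairs = q.len() / 2; // two u64 words per 128-bit vector
    let mut total = 0u32;
    for i in 0..pairs {
        // SAFETY: 2*i + 1 < q.len() == d.len(), so both 16-byte loads stay in bounds.
        total += unsafe {
            let a = vld1q_u64(q.as_ptr().add(2 * i));
            let b = vld1q_u64(d.as_ptr().add(2 * i));
            // Per-byte popcount, then a horizontal add (at most 128, fits in u8).
            let counts = vcntq_u8(vreinterpretq_u8_u64(vandq_u64(a, b)));
            u32::from(vaddvq_u8(counts))
        };
    }
    // An odd trailing word, if any, goes through the scalar path.
    total + and_popcount_scalar(&q[2 * pairs..], &d[2 * pairs..])
}

/// Every other target falls back to the scalar loop in this sketch.
#[cfg(not(target_arch = "aarch64"))]
fn and_popcount_impl(q: &[u64], d: &[u64]) -> u32 {
    and_popcount_scalar(q, d)
}

/// Public entry point: checks the length invariant, then dispatches.
pub fn and_popcount(q: &[u64], d: &[u64]) -> u32 {
    assert_eq!(q.len(), d.len(), "rows must have equal length");
    and_popcount_impl(q, d)
}
```

Because both arms compute an exact integer, comparing and_popcount against and_popcount_scalar on the same rows is a sufficient equivalence check on any target.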
python.yml: test the bindings on windows-latest, macOS Intel (macos-13), and Linux aarch64 (ubuntu-24.04-arm) on top of linux-x86_64 + macOS-arm64, so every wheel target is behaviourally tested (build/install/pytest); the ARM legs are NEON's runtime gate. ci.yml: a wasm job (build wasm32 +simd128, run the popcount test under wasmtime via wasip1) and a bench job (bench_rank on x86 AVX vs Linux-ARM NEON, numbers in the CI log). release-python.yml: a T2 gate that installs + pytests each freshly-built wheel before publish (cross-built aarch64 leg skipped, covered by python.yml's native ARM runner).
Review Summary by Qodo
Portable SIMD popcount (NEON + WASM simd128) with cross-platform CI
Walkthroughs
Description
• Add portable SIMD popcount kernels for aarch64 (NEON) and wasm32 (simd128)
 - Previously only x86_64 had SIMD acceleration; other targets fell back to scalar
 - NEON and simd128 paths return bit-identical results to scalar/AVX-512
• Fix 32-bit const-eval overflow in MAX_PAYLOAD by retyping as u64
• Expand CI coverage to test all wheel targets (Windows, Intel-Mac, ARM-Linux)
 - Add wasm job with simd128 runtime correctness gate
 - Add benchmark job for cross-platform SIMD performance visibility
• Improve sort robustness with total_cmp for top-k finalize
Diagram
flowchart LR
A["Bitmap/Sign Scans"] -->|"x86_64"| B["AVX-512/AVX2<br/>VPOPCNTDQ"]
A -->|"aarch64"| C["NEON<br/>vcntq_u8"]
A -->|"wasm32+simd128"| D["WASM simd128<br/>u8x16_popcnt"]
A -->|"other"| E["Scalar<br/>count_ones"]
B --> F["Bit-identical<br/>Results"]
C --> F
D --> F
E --> F
G["CI Matrix"] -->|"Test"| H["Linux x86_64<br/>macOS arm64<br/>macOS x86_64<br/>Windows<br/>Linux aarch64"]
I["Benchmarks"] -->|"Measure"| J["x86_64 AVX<br/>aarch64 NEON"]
File Changes
1. src/util.rs
Code Review by Qodo
1. WASM job drops -D warnings
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 359b86bc0f
Code Review
This pull request introduces portable SIMD-accelerated popcount utilities for aarch64 (NEON) and wasm32 (simd128), replacing manual scalar loops in the bitmap and sign-bitmap scanning logic. Additionally, it enhances the robustness of the TopK collector by utilizing f32::total_cmp for sorting and prevents potential constant-evaluation overflows on 32-bit targets by explicitly typing MAX_PAYLOAD as u64. All review comments were filtered as they provided validation rather than actionable feedback; therefore, I have no additional feedback to provide.
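To illustrate the total_cmp change mentioned in this review, here is a small sketch of a descending sort over (score, id) pairs. The function name sort_hits_desc and the tuple layout are assumptions made for the example; the crate's TopK collector is not shown in this PR page.

```rust
/// Hypothetical finalize step: order hits by descending score.
/// f32::total_cmp is a total order over all f32 values, so the comparator
/// never has to handle the `None` that partial_cmp returns for NaN.
fn sort_hits_desc(hits: &mut [(f32, u32)]) {
    hits.sort_by(|a, b| b.0.total_cmp(&a.0));
}
```

With this comparator a stray NaN score sorts to one end of the slice instead of forcing an unwrap or an arbitrary tie-break, and the finite scores keep their usual descending order.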
/agentic_review
Persistent review updated to latest commit 359b86b
and_popcount/xor_popcount are pub(crate) safe fns whose NEON/simd128 paths read q at offsets up to doc.len(). The length invariant was only a debug_assert (stripped in release), so a future caller passing a shorter q would be a release-mode OOB read - and the scalar fallback would silently truncate instead (divergent behaviour). All six current callers pass equal-length qpv rows, so this is latent, not live; the hard assert_eq! turns any future misuse into a clean panic, matching the crate's hard-assert-before-SIMD pattern (body_overlap_scores_subset). Surfaced by qodo's review on #19.
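The hard-assert-before-unchecked-read pattern from this commit can be sketched as follows. The raw-pointer loop stands in for a SIMD kernel; the body is an assumption for illustration and only the name xor_popcount comes from the PR.

```rust
/// Sketch: popcount of (q XOR doc) using unchecked reads, guarded by a hard assert.
pub fn xor_popcount(q: &[u64], doc: &[u64]) -> u32 {
    // assert_eq! is kept in release builds; debug_assert_eq! would be compiled out
    // and leave the reads below unguarded.
    assert_eq!(q.len(), doc.len(), "q and doc must have equal length");
    let mut total = 0u32;
    for i in 0..doc.len() {
        // SAFETY: i < doc.len() == q.len(), guaranteed by the assert above.
        let (a, b) = unsafe { (*q.as_ptr().add(i), *doc.as_ptr().add(i)) };
        total += (a ^ b).count_ones();
    }
    total
}
```

A caller that passes a shorter q now gets a panic at the assert in every build profile, rather than an out-of-bounds read on one path and silent truncation on another.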
…n only
macos-13 is GitHub's last Intel-mac image (deprecated + scarce); on a private repo those jobs sit queued and block the PR indefinitely. Coverage is redundant - the x86_64 logic is tested on the linux-x86_64 legs and Mach-O on macos-arm64, and the Intel wheel is still built + shipped by release-python.yml. Also restrict the push trigger to main so a feature branch with an open PR doesn't run python.yml twice (push + pull_request).
Summary
Gives the bitmap / sign-bitmap popcount scans a portable SIMD path so every architecture we ship is fast and tested, not just x86. Previously all SIMD was #[cfg(target_arch = "x86_64")] (AVX-512/AVX2) and every other target fell through to scalar u64::count_ones(), so aarch64 (Graviton, Apple silicon, Axion) and wasm ran the slow path. This adds: simd128 popcount kernels, runtime-dispatched alongside the existing AVX-512/AVX2 → scalar tiers.
popcount(Q AND/XOR D) is an exact integer, so the NEON and simd128 paths return bit-identical results to scalar/AVX: no cross-CPU score drift to reconcile (unlike float kernels).
Changes
Perf (src/)
- util.rs: shared and_popcount / xor_popcount helpers: NEON (vcntq_u8/vaddvq_u8, baseline on aarch64, no runtime detection), simd128 (u8x16_popcnt + pairwise reduce, cfg(target_feature = "simd128")), scalar elsewhere. x86_64 AVX-512/AVX2 kernels untouched.
- bitmap.rs / sign_bitmap.rs: all six scan fallbacks route through the helpers.
- rank_io.rs: MAX_PAYLOAD (128 GiB) was a usize literal that overflowed const-eval on 32-bit targets; retyped as u64. The crate had never been 32-bit-clean.
- util.rs: total_cmp for the top-k finalize sort (true total order; agrees with partial_cmp on the finite scores the input guards already enforce), plus a popcount_helpers_match_naive test that is the runtime correctness gate for NEON/simd128.
- Cargo.toml: MSRV / SIMD-dispatch comment accuracy.
CI (.github/workflows/)
- python.yml: bindings tested on Windows + Intel-mac + ARM-Linux as well as linux-x86_64 + macOS-arm64. Every wheel release-python.yml ships is now behaviourally tested (build → install → pytest); the ARM legs are NEON's runtime gate.
- ci.yml: a wasm job (build wasm32-unknown-unknown + simd128, run the popcount test under wasmtime via wasm32-wasip1; this is the simd128 runtime gate, the wasm analogue of the AVX-512-under-SDE job) and a bench job (bench_rank on x86-AVX vs Linux-ARM-NEON, numbers in the CI log).
- release-python.yml: a T2 gate: every natively-runnable wheel is install + pytest'd before it's uploaded for publish (cross-built aarch64 leg skipped; covered by python.yml's native ARM runner).
An audit, verified, not applied blind
These changes act on the one verified-real finding from a May-2026 "production footguns" audit. Verifying each finding against the source first mattered: of 8 findings, only the ARM/WASM scalar gap (#2) was real-and-valuable.
- … FLUSH = 128 "fix" would only have halved the flush window and regressed throughput. Not applied.
- … total_cmp and #7 MSRV comment: landed as cheap, correct hygiene (note: #5 is defense-in-depth, not a live bug; the finite-input policy already makes a NaN score unreachable).
- … k, where the O(k) cost is negligible.
- … finalize_into alloc) and #8 (Rayon+PyO3 panic → abort): declined. #8 is mischaracterized: with the default unwind, a worker panic propagates to a PyO3 PanicException (not a process abort), and the binding already pre-validates every input.
- … std::simd is nightly, so it's stable-blocked; the std::arch path needs a rank-storage layout refactor).
Test plan
Local gate (all green):
- cargo fmt --all --check
- cargo clippy --all-targets --all-features -- -D warnings on x86_64, aarch64 (NEON), and wasm32 + simd128
- cargo test / --features experimental / --no-default-features → 89 / 96 / 89, 0 failed
- cargo +1.89.0 build
- aarch64-unknown-linux-gnu + wasm32-unknown-unknown (+simd128)
The new popcount_helpers_match_naive test exercises whichever path is active per target (scalar on x86, NEON on the ARM CI runner, simd128 on the wasm lane), so this PR's CI is the runtime correctness gate for the new kernels.
CI additionally exercises: the bindings on all 5 wheel targets, the simd128 kernel under wasmtime, and bench_rank on x86 + Linux-ARM.
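An equivalence gate of the kind the test plan describes might be sketched like this. The test body in the PR is not shown on this page, so everything here is an assumption: and_popcount is a scalar stand-in for the helper under test, and and_popcount_naive, xorshift64, and helpers_match_naive are hypothetical names.

```rust
/// Reference implementation: count the set bits of (a AND b) one bit at a time.
fn and_popcount_naive(q: &[u64], d: &[u64]) -> u32 {
    let mut n = 0u32;
    for (a, b) in q.iter().zip(d) {
        let w = a & b;
        for bit in 0..64 {
            n += ((w >> bit) & 1) as u32;
        }
    }
    n
}

/// Stand-in for the helper under test (the real one would dispatch to SIMD).
fn and_popcount(q: &[u64], d: &[u64]) -> u32 {
    q.iter().zip(d).map(|(a, b)| (a & b).count_ones()).sum()
}

/// Small deterministic xorshift64 generator, so the check needs no external crates.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// True when the helper agrees with the naive count for every row length in
/// 0..=max_words, which covers both even and odd lengths (vector body and tail).
fn helpers_match_naive(max_words: usize) -> bool {
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    (0..=max_words).all(|len| {
        let q: Vec<u64> = (0..len).map(|_| xorshift64(&mut seed)).collect();
        let d: Vec<u64> = (0..len).map(|_| xorshift64(&mut seed)).collect();
        and_popcount(&q, &d) == and_popcount_naive(&q, &d)
    })
}
```

Sweeping lengths rather than fixing one is the point of the design: a kernel that handles full vectors correctly but miscounts the tail would only fail on the odd lengths.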