Skip to content

perf: portable SIMD (NEON + WASM simd128) + cross-platform CI & wheel testing#19

Merged
Fieldnote-Echo merged 5 commits into
mainfrom
perf/portable-simd
May 23, 2026
Merged

perf: portable SIMD (NEON + WASM simd128) + cross-platform CI & wheel testing#19
Fieldnote-Echo merged 5 commits into
mainfrom
perf/portable-simd

Conversation

@Fieldnote-Echo
Copy link
Copy Markdown
Owner

Summary

Gives the bitmap / sign-bitmap popcount scans a portable SIMD path so every architecture we ship is fast and tested — not just x86. Previously all SIMD was #[cfg(target_arch = "x86_64")] (AVX-512/AVX2) and every other target fell through to scalar u64::count_ones(), so aarch64 (Graviton, Apple silicon, Axion) and wasm ran the slow path. This adds:

  • NEON (aarch64) and WASM simd128 popcount kernels, runtime-dispatched alongside the existing AVX-512/AVX2 → scalar tiers.
  • A real 32-bit portability fix the wasm build surfaced.
  • Cross-platform CI that tests every wheel target we publish and measures the per-arch SIMD numbers.

popcount(Q AND/XOR D) is an exact integer, so the NEON and simd128 paths return bit-identical results to scalar/AVX — no cross-CPU score drift to reconcile (unlike float kernels).

Changes

Perf (src/)

  • util.rs: shared and_popcount / xor_popcount helpers — NEON (vcntq_u8/vaddvq_u8, baseline on aarch64, no runtime detection), simd128 (u8x16_popcnt + pairwise reduce, cfg(target_feature = "simd128")), scalar elsewhere. x86_64 AVX-512/AVX2 kernels untouched.
  • bitmap.rs / sign_bitmap.rs: all six scan fallbacks route through the helpers.
  • rank_io.rs: MAX_PAYLOAD (128 GiB) was a usize literal that overflowed const-eval on 32-bit targets — retyped u64. The crate had never been 32-bit-clean.
  • util.rs: total_cmp for the top-k finalize sort (true total order; agrees with partial_cmp on the finite scores the input guards already enforce), plus a popcount_helpers_match_naive test that is the runtime correctness gate for NEON/simd128.
  • Cargo.toml: MSRV / SIMD-dispatch comment accuracy.

CI (.github/workflows/)

  • python.yml: bindings tested on Windows + Intel-mac + ARM-Linux as well as linux-x86_64 + macOS-arm64. Every wheel release-python.yml ships is now behaviourally tested (build → install → pytest); the ARM legs are NEON's runtime gate.
  • ci.yml: a wasm job (build wasm32-unknown-unknown +simd128, run the popcount test under wasmtime via wasm32-wasip1 — the simd128 runtime gate, the wasm analogue of the AVX-512-under-SDE job) and a bench job (bench_rank on x86-AVX vs Linux-ARM-NEON, numbers in the CI log).
  • release-python.yml: a T2 gate — every natively-runnable wheel is install + pytest'd before it's uploaded for publish (cross-built aarch64 leg skipped; covered by python.yml's native ARM runner).

An audit, verified — not applied blind

These changes act on the one verified-real finding from a May-2026 "production footguns" audit. Verifying each finding against the source first mattered: of 8 findings, only the ARM/WASM scalar gap (#2) was real-and-valuable.

Test plan

Local gate (all green):

  • cargo fmt --all --check
  • cargo clippy --all-targets --all-features -- -D warnings on x86_64, aarch64 (NEON), and wasm32 + simd128
  • cargo test / --features experimental / --no-default-features89 / 96 / 89, 0 failed
  • MSRV cargo +1.89.0 build
  • cross-compile: aarch64-unknown-linux-gnu + wasm32-unknown-unknown (+simd128)

The new popcount_helpers_match_naive test exercises whichever path is active per target — scalar on x86, NEON on the ARM CI runner, simd128 on the wasm lane — so this PR's CI is the runtime correctness gate for the new kernels.

CI additionally exercises: the bindings on all 5 wheel targets, the simd128 kernel under wasmtime, and bench_rank on x86 + Linux-ARM.

Closes nothing on its own; tracks the remaining scalar inner loop in #18.

The 128 GiB payload cap was a usize literal (128 * 1024 * 1024 * 1024), which overflows const-evaluation on 32-bit targets (wasm32, armv7) where usize is 32-bit. Retype as u64 and widen the comparison; on 32-bit, usize::MAX (~4 GiB) is already the ceiling so the check never trips. No behaviour change on 64-bit.
The bitmap/sign scan fallbacks ran scalar u64::count_ones() on every non-x86 target, so aarch64 (Graviton, Apple silicon, Axion) and wasm left popcount throughput unused. Add shared and_popcount/xor_popcount helpers dispatching NEON (vcntq_u8/vaddvq_u8, baseline on aarch64, no runtime detection), simd128 (u8x16_popcnt + pairwise reduce) on wasm32+simd128, and scalar elsewhere; wired into all six bitmap/sign scan fallbacks. The x86_64 AVX-512/AVX2 kernels are untouched. popcount is exact-integer, so every path is bit-identical.

Also: total_cmp for the top-k finalize sort (true total order, robust to any non-finite score slipping past the guards; agrees with partial_cmp on finite scores), and a popcount equivalence unit test that is the runtime correctness gate for the NEON and simd128 paths.
python.yml: test the bindings on windows-latest, macOS Intel (macos-13), and Linux aarch64 (ubuntu-24.04-arm) on top of linux-x86_64 + macOS-arm64, so every wheel target is behaviourally tested (build/install/pytest); the ARM legs are NEON's runtime gate. ci.yml: a wasm job (build wasm32 +simd128, run the popcount test under wasmtime via wasip1) and a bench job (bench_rank on x86 AVX vs Linux-ARM NEON, numbers in the CI log). release-python.yml: a T2 gate that installs + pytests each freshly-built wheel before publish (cross-built aarch64 leg skipped, covered by python.yml's native ARM runner).
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Portable SIMD popcount (NEON + WASM simd128) with cross-platform CI

✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Add portable SIMD popcount kernels for aarch64 (NEON) and wasm32 (simd128)
  - Previously only x86_64 had SIMD acceleration; other targets fell back to scalar
  - NEON and simd128 paths return bit-identical results to scalar/AVX-512
• Fix 32-bit const-eval overflow in MAX_PAYLOAD by retyping as u64
• Expand CI coverage to test all wheel targets (Windows, Intel-Mac, ARM-Linux)
  - Add wasm job with simd128 runtime correctness gate
  - Add benchmark job for cross-platform SIMD performance visibility
• Improve sort robustness with total_cmp for top-k finalize
Diagram
flowchart LR
  A["Bitmap/Sign Scans"] -->|"x86_64"| B["AVX-512/AVX2<br/>VPOPCNTDQ"]
  A -->|"aarch64"| C["NEON<br/>vcntq_u8"]
  A -->|"wasm32+simd128"| D["WASM simd128<br/>u8x16_popcnt"]
  A -->|"other"| E["Scalar<br/>count_ones"]
  B --> F["Bit-identical<br/>Results"]
  C --> F
  D --> F
  E --> F
  G["CI Matrix"] -->|"Test"| H["Linux x86_64<br/>macOS arm64<br/>macOS x86_64<br/>Windows<br/>Linux aarch64"]
  I["Benchmarks"] -->|"Measure"| J["x86_64 AVX<br/>aarch64 NEON"]

Loading

File Changes

1. src/util.rs ✨ Enhancement +229/-3

Portable SIMD popcount dispatch with NEON and simd128

• Add and_popcount and xor_popcount helper functions with runtime dispatch
• Implement NEON kernels for aarch64 using vcntq_u8/vaddvq_u8 intrinsics
• Implement WASM simd128 kernels using u8x16_popcnt with pairwise reduction
• Provide scalar fallback for non-SIMD targets
• Add total_cmp for robust top-k sort (true total order vs partial_cmp)
• Add popcount_helpers_match_naive unit test as runtime correctness gate

src/util.rs


2. src/bitmap.rs ✨ Enhancement +8/-23

Refactor bitmap scans to use portable popcount helper

• Replace inline popcount loops with and_popcount helper calls
• Update fallback scan functions to use the new portable helper
• Update documentation to reflect NEON availability on aarch64
• Simplify four separate scan functions by centralizing popcount logic

src/bitmap.rs


3. src/sign_bitmap.rs ✨ Enhancement +3/-11

Refactor sign-bitmap scans to use portable popcount helper

• Replace inline XOR-popcount loops with xor_popcount helper calls
• Update fallback scan functions to use the new portable helper
• Update comments to reflect portable SIMD dispatch (NEON on aarch64)
• Simplify two scan functions by centralizing popcount logic

src/sign_bitmap.rs


View more (5)
4. src/rank_io.rs 🐞 Bug fix +6/-3

Fix 32-bit const-eval overflow in MAX_PAYLOAD

• Retype MAX_PAYLOAD constant from usize to u64 to avoid 32-bit overflow
• Widen comparison to u64 to handle 32-bit targets (wasm32, armv7)
• Add detailed comment explaining the 32-bit portability fix

src/rank_io.rs


5. Cargo.toml 📝 Documentation +6/-4

Update MSRV and SIMD dispatch documentation

• Update MSRV comment to clarify AVX-512 stabilization and u64::is_multiple_of floor
• Expand feature documentation to describe SIMD dispatch across all targets
• Document that aarch64 uses NEON baseline, wasm32 uses simd128 when enabled

Cargo.toml


6. .github/workflows/ci.yml ⚙️ Configuration changes +60/-0

Add wasm and benchmark CI jobs for SIMD coverage

• Add wasm job to build wasm32-unknown-unknown with simd128 and test under wasmtime
• Add bench job to run bench_rank on x86_64 and aarch64 for SIMD performance visibility
• Install wasmtime for wasm test execution via wasm32-wasip1
• Document that wasm job is the runtime correctness gate for simd128 path

.github/workflows/ci.yml


7. .github/workflows/python.yml ⚙️ Configuration changes +16/-4

Expand Python test matrix to all wheel targets

• Expand test matrix from 2 OS targets to 6 (add Windows, Intel-Mac, Linux aarch64)
• Test all wheel targets with native runners to exercise NEON on ARM platforms
• Keep python version sweep (3.9/3.13) only on linux x86_64 for abi3 coverage
• Document that aarch64 runners exercise NEON kernels through bindings

.github/workflows/python.yml


8. .github/workflows/release-python.yml ⚙️ Configuration changes +18/-0

Add pre-publish wheel runtime validation gate

• Add T2 gate: install wheel and run pytest before upload for publish
• Skip cross-built aarch64 wheel test (covered by native ubuntu-24.04-arm runner)
• Test all natively-runnable wheels to catch runtime regressions pre-publish

.github/workflows/release-python.yml


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 23, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (2) 📎 Requirement gaps (0)

Grey Divider


Action required

1. WASM job drops -D warnings 📘 Rule violation ✧ Quality
Description
The new wasm CI job overrides the workflow-level RUSTFLAGS: -D warnings, allowing warnings to
pass without failing the build for that target and weakening the warnings-as-errors gate. Because
other jobs do not compile wasm targets, wasm-only code paths can introduce warnings that are not
caught elsewhere, violating the requirement to keep the crate building cleanly with warnings denied.
Code

.github/workflows/ci.yml[R234-246]

Evidence
The compliance requirement is to build with warnings denied (e.g., RUSTFLAGS="-D warnings"), and
the workflow enforces this globally by setting RUSTFLAGS: -D warnings. However, the added wasm
job sets its own RUSTFLAGS to only enable -C target-feature=+simd128 and explicitly notes it is
“dropping the workflow-level -D warnings,” which means wasm builds/tests in that job no longer treat
warnings as errors, letting warnings slip through on that target.

Rule 737888: Crate must build with zero rustc warnings
.github/workflows/ci.yml[29-29]
.github/workflows/ci.yml[234-246]
.github/workflows/ci.yml[26-30]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `wasm` CI job overrides `RUSTFLAGS` and drops the workflow-level `-D warnings`, weakening the warnings-as-errors policy specifically for wasm targets so warnings can slip through on wasm builds.

## Issue Context
The workflow sets `RUSTFLAGS: -D warnings` globally, but the wasm steps override it with only `-C target-feature=+simd128` (explicitly “dropping the workflow-level -D warnings”). A comment indicates this was done to avoid wasm-only warnings in dependencies failing the build, but it also removes the CI signal for warnings introduced in your own wasm-only code, and other jobs do not compile wasm targets to catch those warnings.

## Fix Focus Areas
- .github/workflows/ci.yml[26-30]
- .github/workflows/ci.yml[234-246]

### Suggested direction
Add an additional step that checks your crate with warnings-as-errors without applying it to dependencies, e.g. a `cargo rustc -p ordvec --target wasm32-wasip1 -- -D warnings` compile-only check (plus `-C target-feature=+simd128`), while keeping the existing runtime test step permissive if needed.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. CI curl-pipes to bash 🐞 Bug ⛨ Security
Description
The wasm CI job installs wasmtime by piping an unpinned remote script into bash, which executes
arbitrary remote code during CI runs. If the endpoint is ever compromised, this can compromise the
runner and any credentials available on trusted branches.
Code

.github/workflows/ci.yml[R229-233]

Evidence
The wasm job step explicitly downloads a remote install script and pipes it directly to bash,
which executes whatever the server returns at runtime.

.github/workflows/ci.yml[220-246]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The workflow installs wasmtime via `curl ... | bash`, which is a supply-chain risk because it executes unpinned remote code.

### Issue Context
This runs in CI on Ubuntu runners and is part of the wasm correctness gate.

### Fix Focus Areas
- .github/workflows/ci.yml[229-233]

### Suggested direction
Replace with a pinned GitHub Action or a pinned release download (specific version) with checksum verification, or install from a trusted package source.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Unsafe popcount lacks length check ✓ Resolved 🐞 Bug ⛨ Security
Description
and_popcount/xor_popcount are safe functions but only debug_assert that slice lengths match;
in release, a mismatch can make the NEON/wasm SIMD paths perform unchecked loads from q based on
doc.len(), which is undefined behavior. The scalar fallback also silently truncates via zip(),
masking the mismatch instead of failing fast.
Code

src/util.rs[R69-129]

Evidence
The helpers are safe (pub(crate) fn) and only enforce equal lengths via debug_assert_eq. The
NEON and simd128 implementations compute loop bounds from doc.len() and then load from both doc
and q pointers without any runtime check that q is as long as doc, which makes the safe
wrapper unsound if called with mismatched slices; additionally, the scalar path uses zip() which
truncates to the shorter slice, hiding length bugs.

src/util.rs[66-111]
src/util.rs[118-129]
src/util.rs[137-152]
src/util.rs[185-209]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`and_popcount` / `xor_popcount` are safe wrappers around unsafe SIMD loads but only use `debug_assert_eq!(doc.len(), q.len())`. In release builds, a length mismatch can cause out-of-bounds reads (UB) in the NEON and simd128 implementations, and the scalar implementation would silently compute only the min length due to `zip()`.

### Issue Context
These helpers are `pub(crate)` and used from bitmap/sign-bitmap scan fallbacks. Even if current call sites pass equal-length slices, keeping the wrapper safe means it must enforce its preconditions in release as well.

### Fix Focus Areas
- src/util.rs[69-129]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. Signed-zero tie-break drift 🐞 Bug ≡ Correctness
Description
TopK::finalize_into now uses f32::total_cmp (IEEE-754 totalOrder), which orders -0.0 and
+0.0 differently; this means numeric-equal zero scores may no longer be tie-broken by doc_id as
documented, and the comment claiming total_cmp and partial_cmp “agree for finite scores” is not
strictly true. This can change the final returned ordering (even when the kept top-k set is
unchanged) in edge cases where computations yield signed zero.
Code

src/util.rs[R359-365]

Evidence
The TopK docs state that equal-score ties are broken by doc_id, but the new comparator uses IEEE
totalOrder semantics via total_cmp, which can distinguish signed zeros and therefore may bypass
the doc_id tie-break in that case. The code itself acknowledges use of IEEE-754 totalOrder, and the
rest of the collector logic uses == to detect score ties, creating a semantic mismatch if signed
zeros occur in practice.

src/util.rs[251-260]
src/util.rs[348-366]
src/util.rs[289-333]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`TopK::finalize_into` switched from `partial_cmp(...).unwrap_or(Equal)` to `total_cmp`, which uses IEEE-754 total ordering and can distinguish `-0.0` from `+0.0`. This conflicts with the module’s stated ordering rule that *equal scores* are tie-broken by doc_id, and it makes the nearby comment (“for finite scores we actually have, the two agree”) not strictly correct.

### Issue Context
The collector’s eviction logic still treats `score == other_score` as a tie, but final ordering now may treat signed zeros as non-ties.

### Fix Focus Areas
- src/util.rs[348-366]

### Suggested fix
Pick one of:
1) Canonicalize signed zero before comparison (and/or before storing): map any `score` with `score == 0.0` to `0.0` so `-0.0` cannot affect ordering.
2) Keep `total_cmp` but explicitly treat `-0.0` and `+0.0` as equal in the comparator (normalize both operands when `== 0.0`).
3) If you intentionally want IEEE total ordering (including signed zero distinction), update the doc comment that says equal-score ties are broken by doc_id and adjust the “two agree” comment accordingly.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Uncited Zen 5 throughput claim 📘 Rule violation § Compliance
Description
A new comment claims u64::count_ones() “Zen 5 retires at 1/cycle” without tying it to the repo’s
reproducible benchmark (examples/bench_rank) or otherwise making it verifiable in-repo. This
violates the policy against unverifiable quantitative performance claims.
Code

src/bitmap.rs[R388-390]

Evidence
The rule forbids quantitative performance claims unless they are backed by the repo’s reproducible
benchmark instructions. The modified comment introduces a specific numeric throughput claim
(“1/cycle”) that is not linked to examples/bench_rank or any reproducibility guidance.

Rule 737884: Avoid fabricated benchmarks or unverifiable performance claims
src/bitmap.rs[386-390]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The code comment includes a quantitative performance statement (`Zen 5` “1/cycle”) that is not reproducible/verifiable via the repo’s benchmark guidance.

## Issue Context
The compliance policy requires quantitative performance numbers to be backed by the in-repo benchmark (`examples/bench_rank`) and environment details, or otherwise avoid specific numbers.

## Fix Focus Areas
- src/bitmap.rs[388-390]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread .github/workflows/ci.yml
Comment thread src/util.rs
Comment thread .github/workflows/ci.yml
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 359b86bc0f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/util.rs Outdated
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces portable SIMD-accelerated popcount utilities for aarch64 (NEON) and wasm32 (simd128), replacing manual scalar loops in the bitmap and sign-bitmap scanning logic. Additionally, it enhances the robustness of the TopK collector by utilizing f32::total_cmp for sorting and prevents potential constant-evaluation overflows on 32-bit targets by explicitly typing MAX_PAYLOAD as u64. All review comments were filtered as they provided validation rather than actionable feedback; therefore, I have no additional feedback to provide.

@Fieldnote-Echo
Copy link
Copy Markdown
Owner Author

/agentic_review

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 23, 2026

Persistent review updated to latest commit 359b86b

and_popcount/xor_popcount are pub(crate) safe fns whose NEON/simd128 paths read q at offsets up to doc.len(). The length invariant was only a debug_assert (stripped in release), so a future caller passing a shorter q would be a release-mode OOB read - and the scalar fallback would silently truncate instead (divergent behaviour). All six current callers pass equal-length qpv rows, so this is latent, not live; the hard assert_eq! turns any future misuse into a clean panic, matching the crate's hard-assert-before-SIMD pattern (body_overlap_scores_subset). Surfaced by qodo's review on #19.
…n only

macos-13 is GitHub's last Intel-mac image (deprecated + scarce); on a private repo those jobs sit queued and block the PR indefinitely. Coverage is redundant - the x86_64 logic is tested on the linux-x86_64 legs and Mach-O on macos-arm64, and the Intel wheel is still built + shipped by release-python.yml. Also restrict the push trigger to main so a feature branch with an open PR doesn't run python.yml twice (push + pull_request).
@Fieldnote-Echo Fieldnote-Echo merged commit 2d092f2 into main May 23, 2026
16 checks passed
@Fieldnote-Echo Fieldnote-Echo deleted the perf/portable-simd branch May 24, 2026 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant