perf(simd): vectorize I8x16::saturating_abs (VPABSB) + binding W1a tests#204
Merged
Merged
Conversation
…a tests I8x16::saturating_abs now uses _mm_abs_epi8 + _mm_min_epu8 (the contract's VPABSB correction: VPABSB returns 0x80 for i8::MIN, VPMINUB clamps to 0x7f) instead of a per-lane branching scalar loop — 16 lanes branchless. Also adds the binding W1a unit tests that #203 shipped without (only rust,ignore doctests existed): saturating_abs(i8::MIN)==i8::MAX for I8x16 and I8x32, a scalar-reference corpus, i4 sign-extension, U64x8 popcnt / xor_popcount, and gather_u16. All 6 pass on the v3 build. Not changed (measured, not assumed): U64x8::popcnt on AVX2 already lowers to hardware POPCNT via count_ones; gather_u16 stays scalar because a 32-bit _mm256_i32gather over a &[u16] over-reads past the last index (no 16-bit hardware gather exists). https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Contributor
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThe PR optimizes ChangesI8x16 Saturating Abs Optimization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up hardening of the W1a SIMD primitives merged in #203 — turning the AVX-512-baseline
I8x16::saturating_absfrom a scalar loop into real SIMD, and adding the binding parity tests #203 shipped without.I8x16::saturating_abs→_mm_abs_epi8+_mm_min_epu8(simd_avx512.rs). The contract's VPABSB correction: bare VPABSB returns0x80fori8::MIN; VPMINUB clamps it to0x7f(=i8::MAX). 16 lanes, branchless, vs the prior per-lane branching scalar loop. (I8x32::saturating_abswas already real.)rust,ignoredoctests existed before):saturating_abs(i8::MIN)==i8::MAXforI8x16+I8x32, a scalar-reference corpus, i4 sign-extension,U64x8::popcnt/xor_popcount, andgather_u16. All 6 pass on the v3 build.Measured, not assumed (deliberately not changed)
U64x8::popcnton AVX2 already lowers to the hardwarePOPCNTinstruction viau64::count_ones()— a VPSHUFB-Mula rewrite adds complexity for ~zero gain at 8 lanes.gather_u16stays scalar: a 32-bit_mm256_i32gather_epi32over a&[u16]over-reads 2 bytes past the last valid index (UB even for in-bounds indices), and no 16-bit hardware gather exists (AVX2/AVX-512 gather granularity is 32/64-bit). The safe SIMD path for small palettes (≤32u16) would be_mm512_permutexvar_epi16(VPERMW, register permute) — a possible follow-up.Posture
Compile-time dispatch only (runtime dispatch deferred). Consumer site:
lance-graph:crates/lance-graph-contract/src/mul.rs(i4 saturating-abs classifier). The AVX-512 path is CI-verified — it can't be compiled on a non-AVX-512 runner (the v4 build SIGILLs in build scripts).Test plan
cargo test --lib w1a_— 6/6 pass on default v3cargo fmt --all --checkcleancargo clippy --libcleantier4-avx512-checkcompiles the AVX-512 path; NEON job covers aarch64https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Generated by Claude Code
Summary by CodeRabbit
Release Notes
Bug Fixes
Performance
Tests