perf(simd): vectorize I8x16::saturating_abs (VPABSB) + binding W1a tests by AdaWorldAPI · Pull Request #204 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-26T02:24:31Z

Summary

Follow-up hardening of the W1a SIMD primitives merged in #203 — turning the AVX-512-baseline I8x16::saturating_abs from a scalar loop into real SIMD, and adding the binding parity tests #203 shipped without.

- I8x16::saturating_abs → _mm_abs_epi8 + _mm_min_epu8 (simd_avx512.rs). This is the contract's VPABSB correction: bare VPABSB returns 0x80 for i8::MIN, and VPMINUB against 0x7f clamps it to 0x7f (= i8::MAX). All 16 lanes are handled branchlessly, replacing the prior per-lane branching scalar loop. (I8x32::saturating_abs was already real SIMD.)
- Binding W1a unit tests added (only rust,ignore doctests existed before): saturating_abs(i8::MIN)==i8::MAX for I8x16 and I8x32, a scalar-reference corpus, i4 sign-extension, U64x8::popcnt/xor_popcount, and gather_u16. All 6 pass on the v3 build.
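The abs-then-unsigned-min trick described above can be sketched on a plain `[i8; 16]`. This is a minimal illustration, not the crate's code: the function names are hypothetical, and the runtime `is_x86_feature_detected!` check is only there so the sketch runs on any machine (the PR itself uses compile-time dispatch).

```rust
/// Scalar reference: |x| per lane, with i8::MIN saturating to i8::MAX.
fn saturating_abs_scalar(v: [i8; 16]) -> [i8; 16] {
    v.map(|x| x.saturating_abs())
}

/// SIMD version (hypothetical name): PABSB followed by PMINUB against 0x7f.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn saturating_abs_ssse3(v: [i8; 16]) -> [i8; 16] {
    use std::arch::x86_64::*;
    unsafe {
        let x = _mm_loadu_si128(v.as_ptr() as *const __m128i);
        // Wrapping abs: every lane becomes 0..=127, except i8::MIN which stays 0x80.
        let abs = _mm_abs_epi8(x);
        // Read as unsigned, 0x80 is 128 > 127, so the min clamps only that lane.
        let clamped = _mm_min_epu8(abs, _mm_set1_epi8(0x7f));
        let mut out = [0i8; 16];
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, clamped);
        out
    }
}

/// Picks the SIMD path when available, otherwise the scalar reference.
fn saturating_abs(v: [i8; 16]) -> [i8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            return unsafe { saturating_abs_ssse3(v) };
        }
    }
    saturating_abs_scalar(v)
}
```

Comparing `saturating_abs` against `saturating_abs_scalar` over a corpus that includes i8::MIN is the kind of parity check the binding tests describe.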

Measured, not assumed (deliberately not changed)

- U64x8::popcnt on AVX2 already lowers to the hardware POPCNT instruction via u64::count_ones(), so a VPSHUFB-Mula rewrite would add complexity for ~zero gain at 8 lanes.
- gather_u16 stays scalar: a 32-bit _mm256_i32gather_epi32 over a &[u16] loads 4 bytes per lane, so a lane that indexes the last element reads 2 bytes past the end of the slice (UB even though the index is in bounds), and no 16-bit hardware gather exists (AVX2/AVX-512 gather granularity is 32/64-bit). The safe SIMD path for small palettes (≤32 u16) would be _mm512_permutexvar_epi16 (VPERMW, an AVX-512BW register permute), a possible follow-up.
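The two decisions above can be sketched in scalar Rust. The signatures below are assumptions for illustration (the real U64x8 and gather_u16 types are not shown in this PR text), and `overread_bytes` is a hypothetical helper that only does the byte arithmetic behind the over-read argument.

```rust
/// Per-lane population count; `u64::count_ones` can lower to POPCNT
/// when the target enables it.
fn popcnt(lanes: [u64; 8]) -> [u64; 8] {
    lanes.map(|x| u64::from(x.count_ones()))
}

/// Hamming distance between two 512-bit values stored as 8 x u64.
fn xor_popcount(a: [u64; 8], b: [u64; 8]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Scalar gather: each lane reads exactly one bounds-checked u16.
fn gather_u16(table: &[u16], idx: [usize; 8]) -> [u16; 8] {
    idx.map(|i| table[i])
}

/// Bytes a 4-byte load at element index `i` (byte offset 2*i) would read
/// past the end of a `table_len`-element u16 table.
fn overread_bytes(table_len: usize, i: usize) -> usize {
    let end_of_load = i * 2 + 4;
    end_of_load.saturating_sub(table_len * 2)
}
```

For a 4-element table, `overread_bytes(4, 3)` is 2 while `overread_bytes(4, 2)` is 0, which is the "last valid index" case named in the description.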

Posture

Compile-time dispatch only (runtime dispatch deferred). Consumer site: lance-graph:crates/lance-graph-contract/src/mul.rs (i4 saturating-abs classifier). The AVX-512 path is CI-verified — it can't be compiled on a non-AVX-512 runner (the v4 build SIGILLs in build scripts).
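Compile-time dispatch of this kind is usually expressed with `cfg` on `target_feature`. The following is a minimal sketch under that assumption; `backend_name` is a hypothetical function, and reading "v3"/"v4" as the x86-64-v3/x86-64-v4 target-cpu levels is also an assumption not confirmed by the PR text.

```rust
// Exactly one of these definitions survives compilation. The choice is fixed
// by the target features the build was compiled with (for example via
// `-C target-cpu=...`), not by the CPU the binary later runs on.

#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
pub fn backend_name() -> &'static str {
    "avx512"
}

#[cfg(all(
    target_arch = "x86_64",
    target_feature = "avx2",
    not(target_feature = "avx512f")
))]
pub fn backend_name() -> &'static str {
    "avx2"
}

#[cfg(not(all(
    target_arch = "x86_64",
    any(target_feature = "avx2", target_feature = "avx512f")
)))]
pub fn backend_name() -> &'static str {
    "scalar"
}
```

Because the selection happens at build time, a binary built with AVX-512 enabled contains AVX-512 instructions unconditionally, which is consistent with the SIGILL noted above when such code executes on a machine without AVX-512.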

Test plan

- cargo test --lib w1a_: 6/6 pass on default v3
- cargo fmt --all --check clean
- cargo clippy --lib clean
- CI: tier4-avx512-check compiles the AVX-512 path; NEON job covers aarch64

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41

Generated by Claude Code

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed saturation behavior for minimum integer values in SIMD absolute value operations.
Performance
- Optimized integer absolute value saturation computation using hardware-accelerated SIMD instructions.
Tests
- Added comprehensive test coverage for saturating arithmetic operations and SIMD primitives.

…a tests

I8x16::saturating_abs now uses _mm_abs_epi8 + _mm_min_epu8 (the contract's VPABSB correction: VPABSB returns 0x80 for i8::MIN, VPMINUB clamps to 0x7f) instead of a per-lane branching scalar loop: 16 lanes, branchless.

Also adds the binding W1a unit tests that #203 shipped without (only rust,ignore doctests existed): saturating_abs(i8::MIN)==i8::MAX for I8x16 and I8x32, a scalar-reference corpus, i4 sign-extension, U64x8 popcnt / xor_popcount, and gather_u16. All 6 pass on the v3 build.

Not changed (measured, not assumed): U64x8::popcnt on AVX2 already lowers to hardware POPCNT via count_ones; gather_u16 stays scalar because a 32-bit _mm256_i32gather over a &[u16] over-reads past the last index (no 16-bit hardware gather exists).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41

coderabbitai · 2026-05-26T02:24:48Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The PR optimizes I8x16::saturating_abs in the AVX-512 backend by replacing a scalar per-lane loop with x86 SIMD intrinsics, and adds comprehensive test coverage for the optimized function alongside related W1a primitives to validate correctness and behavior.

Changes

I8x16 Saturating Abs Optimization

| Layer / File(s) | Summary |
| --- | --- |
| I8x16::saturating_abs SIMD implementation `src/simd_avx512.rs` | Replaced scalar loop with x86 intrinsics: `_mm_abs_epi8` computes raw absolute values, then `_mm_min_epu8` clamps unsigned results to `0x7f` to enforce `i8::MIN → i8::MAX` saturation across all 16 lanes. |
| W1a primitive test coverage `src/simd_avx512.rs` | Added test cases for `I8x16::saturating_abs` (`i8::MIN` saturation and scalar parity), `I8x32::saturating_abs` saturation, `I8x16::from_i4_packed_u64` sign-extension, `U64x8::popcnt` and `xor_popcount` Hamming-distance behavior, and `U16x8::gather_u16` in-bounds correctness. |
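The i4 sign-extension covered by these tests can be illustrated with a scalar reference. This is a sketch only: the function name is hypothetical, and the nibble order (lane k taken from bits 4k..4k+3, lowest nibble first) is an assumption, since the PR text does not state the packing used by `I8x16::from_i4_packed_u64`.

```rust
/// Unpack 16 signed 4-bit values from a u64 into sign-extended i8 lanes.
/// Assumption: lane k comes from bits 4k..4k+3 of `packed`.
fn from_i4_packed_u64_scalar(packed: u64) -> [i8; 16] {
    let mut out = [0i8; 16];
    for (k, lane) in out.iter_mut().enumerate() {
        let nibble = ((packed >> (4 * k)) & 0xF) as u8;
        // Move the nibble into the high half of the byte, then use an
        // arithmetic right shift to replicate its sign bit downward.
        *lane = ((nibble << 4) as i8) >> 4;
    }
    out
}
```

Under this packing, nibble 0xF decodes to -1 and nibble 0x8 to -8, the most negative i4 value.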

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through bytes with care, 🐰
Where i8::MIN floats through the air,
SIMD intrinsics clamp it tight,
To i8::MAX with all their might,
Tests ensure saturation's fair!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: vectorizing I8x16::saturating_abs using VPABSB and adding W1a binding tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


AdaWorldAPI merged commit f373c75 into master May 26, 2026
16 of 17 checks passed

AdaWorldAPI mentioned this pull request May 26, 2026

feat(cesium): implement tileset.rs cold-import parser — Group-A entry, no-serde #205

Merged

3 tasks

AdaWorldAPI merged 1 commit into master from claude/splat3d-cpu-simd-renderer-MAOO0
