perf(simd): vectorize I8x16::saturating_abs (VPABSB) + binding W1a tests #204

Merged
AdaWorldAPI merged 1 commit into master from claude/splat3d-cpu-simd-renderer-MAOO0
May 26, 2026
Conversation

@AdaWorldAPI (Owner) commented May 26, 2026

Summary

Follow-up hardening of the W1a SIMD primitives merged in #203 — turning the AVX-512-baseline I8x16::saturating_abs from a scalar loop into real SIMD, and adding the binding parity tests #203 shipped without.

  • I8x16::saturating_abs now uses _mm_abs_epi8 + _mm_min_epu8 (simd_avx512.rs). This is the contract's VPABSB correction: bare VPABSB returns 0x80 for i8::MIN, and VPMINUB against 0x7f clamps it to 0x7f (= i8::MAX). All 16 lanes are processed branchlessly, replacing the prior per-lane branching scalar loop. (I8x32::saturating_abs was already real SIMD.)
  • Binding W1a unit tests added (only rust,ignore doctests existed before): saturating_abs(i8::MIN)==i8::MAX for I8x16+I8x32, a scalar-reference corpus, i4 sign-extension, U64x8::popcnt/xor_popcount, and gather_u16. All 6 pass on the v3 build.
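
The abs-then-clamp trick described above can be sketched as follows. This is a minimal illustration, not the code from simd_avx512.rs: the free functions `saturating_abs_16`, `saturating_abs_ssse3`, and `saturating_abs_scalar` are hypothetical stand-ins for the `I8x16` method, and the sketch uses runtime feature detection with a scalar fallback only so that it runs on any machine (the PR itself uses compile-time dispatch).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: i8::saturating_abs maps i8::MIN to i8::MAX.
pub fn saturating_abs_scalar(input: [i8; 16]) -> [i8; 16] {
    input.map(|v| v.saturating_abs())
}

/// VPABSB followed by VPMINUB over 16 lanes (hypothetical helper).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn saturating_abs_ssse3(input: [i8; 16]) -> [i8; 16] {
    let v = _mm_loadu_si128(input.as_ptr() as *const __m128i);
    // Bare abs leaves i8::MIN as 0x80, which is 128 when read as unsigned.
    let abs = _mm_abs_epi8(v);
    // Unsigned min against 0x7f changes only that 0x80 lane, giving i8::MAX.
    let clamped = _mm_min_epu8(abs, _mm_set1_epi8(0x7f));
    let mut out = [0i8; 16];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, clamped);
    out
}

/// Picks the SIMD path when SSSE3 is present, otherwise the scalar reference.
pub fn saturating_abs_16(input: [i8; 16]) -> [i8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // SAFETY: SSSE3 support was checked at runtime just above.
            return unsafe { saturating_abs_ssse3(input) };
        }
    }
    saturating_abs_scalar(input)
}
```

Comparing `saturating_abs_16([b; 16])` against `[b.saturating_abs(); 16]` for every `b` in `i8::MIN..=i8::MAX` is a cheap exhaustive parity check, in the spirit of the scalar-reference corpus listed above.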

Measured, not assumed (deliberately not changed)

  • U64x8::popcnt on AVX2 already lowers to the hardware POPCNT instruction via u64::count_ones() — a VPSHUFB-Mula rewrite adds complexity for ~zero gain at 8 lanes.
  • gather_u16 stays scalar: a 32-bit _mm256_i32gather_epi32 over a &[u16] over-reads 2 bytes past the last valid index (UB even for in-bounds indices), and no 16-bit hardware gather exists (AVX2/AVX-512 gather granularity is 32/64-bit). The safe SIMD path for small palettes (≤32 u16) would be _mm512_permutexvar_epi16 (VPERMW, register permute) — a possible follow-up.
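
For reference, the two scalar strategies kept above look roughly like this. This is a sketch under assumptions: the function names, the `[usize; 8]` index type, and the plain-array signatures are hypothetical and do not mirror the crate's `U64x8` / `U16x8` API. The gather reads exactly two bytes per index, so the last table element is safe, whereas a 32-bit gather at index `len - 1` would touch two bytes beyond the slice.

```rust
/// Scalar gather: every index is bounds-checked and each read is exactly
/// one u16, so nothing past the end of `table` is touched.
pub fn gather_u16_scalar(table: &[u16], indices: [usize; 8]) -> [u16; 8] {
    indices.map(|i| table[i])
}

/// Per-lane population count using u64::count_ones.
pub fn popcnt_lanes(words: [u64; 8]) -> [u64; 8] {
    words.map(|w| u64::from(w.count_ones()))
}

/// Hamming distance between two 512-bit blocks: XOR, then count set bits.
pub fn xor_popcount(a: [u64; 8], b: [u64; 8]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```

An out-of-range index panics here instead of reading stray memory, which is the safety property the scalar path preserves.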

Posture

Compile-time dispatch only (runtime dispatch deferred). Consumer site: lance-graph:crates/lance-graph-contract/src/mul.rs (i4 saturating-abs classifier). The AVX-512 path is verified in CI: it cannot be built on a non-AVX-512 runner, where the v4 build hits SIGILL in build scripts.

Test plan

  • cargo test --lib w1a_ — 6/6 pass on default v3
  • cargo fmt --all --check clean
  • cargo clippy --lib clean
  • CI: tier4-avx512-check compiles the AVX-512 path; NEON job covers aarch64

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41


Generated by Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed saturation behavior for minimum integer values in SIMD absolute value operations.
  • Performance

    • Optimized integer absolute value saturation computation using hardware-accelerated SIMD instructions.
  • Tests

    • Added comprehensive test coverage for saturating arithmetic operations and SIMD primitives.

…a tests

I8x16::saturating_abs now uses _mm_abs_epi8 + _mm_min_epu8 (the contract's
VPABSB correction: VPABSB returns 0x80 for i8::MIN, VPMINUB clamps to 0x7f)
instead of a per-lane branching scalar loop — 16 lanes branchless.

Also adds the binding W1a unit tests that #203 shipped without (only
rust,ignore doctests existed): saturating_abs(i8::MIN)==i8::MAX for I8x16
and I8x32, a scalar-reference corpus, i4 sign-extension, U64x8 popcnt /
xor_popcount, and gather_u16. All 6 pass on the v3 build.

Not changed (measured, not assumed): U64x8::popcnt on AVX2 already lowers
to hardware POPCNT via count_ones; gather_u16 stays scalar because a 32-bit
_mm256_i32gather over a &[u16] over-reads past the last index (no 16-bit
hardware gather exists).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41

@coderabbitai (Bot, Contributor) commented May 26, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The PR optimizes I8x16::saturating_abs in the AVX-512 backend by replacing a scalar per-lane loop with x86 SIMD intrinsics, and adds comprehensive test coverage for the optimized function alongside related W1a primitives to validate correctness and behavior.

Changes

I8x16 Saturating Abs Optimization

| Layer / File(s) | Summary |
|---|---|
| I8x16::saturating_abs SIMD implementation (src/simd_avx512.rs) | Replaced scalar loop with x86 intrinsics: _mm_abs_epi8 computes raw absolute values, then _mm_min_epu8 clamps unsigned results to 0x7f to enforce i8::MIN → i8::MAX saturation across all 16 lanes. |
| W1a primitive test coverage (src/simd_avx512.rs) | Added test cases for I8x16::saturating_abs (i8::MIN saturation and scalar parity), I8x32::saturating_abs saturation, I8x16::from_i4_packed_u64 sign-extension, U64x8::popcnt and xor_popcount Hamming-distance behavior, and U16x8::gather_u16 in-bounds correctness. |
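
The i4 sign-extension covered by these tests can be illustrated with a scalar sketch. The nibble order (lowest nibble first) and the names `sign_extend_i4` and `unpack_i4_u64` are assumptions made for illustration; the PR does not show the layout or signature of I8x16::from_i4_packed_u64.

```rust
/// Sign-extend the low 4 bits of `nibble` to an i8 in the range -8..=7.
pub fn sign_extend_i4(nibble: u8) -> i8 {
    // Move the nibble into the high half, then arithmetic-shift back down
    // so that bit 3 is replicated into bits 4..=7.
    ((nibble << 4) as i8) >> 4
}

/// Unpack 16 signed 4-bit values from a u64.
/// Assumed layout: lane 0 is the least significant nibble.
pub fn unpack_i4_u64(packed: u64) -> [i8; 16] {
    let mut out = [0i8; 16];
    for (lane, slot) in out.iter_mut().enumerate() {
        let nibble = ((packed >> (4 * lane)) & 0xF) as u8;
        *slot = sign_extend_i4(nibble);
    }
    out
}
```

Under the assumed layout, `unpack_i4_u64(0xF871)` yields lanes 1, 7, -8, -1 followed by zeros, which exercises both the largest positive and the most negative i4 values.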

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through bytes with care, 🐰
Where i8::MIN floats through the air,
SIMD intrinsics clamp it tight,
To i8::MAX with all their might,
Tests ensure saturation's fair!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: vectorizing I8x16::saturating_abs using VPABSB and adding W1a binding tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@AdaWorldAPI merged commit f373c75 into master on May 26, 2026
16 of 17 checks passed