Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972] by dor-forer · Pull Request #973 · RedisAI/VectorSimilarity

dor-forer · 2026-05-31T13:01:54Z

Describe the changes in the pull request

Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.

The SVE hot loop uses svld1uh_u32 to zero-extend each FP16 halfword into a 32-bit lane, allowing svcvt_f32_f16_x to read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.
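The residual arithmetic described above can be sketched in portable C++. This is only an illustrative model of the control flow, assuming 16 elements per full NEON chunk so the residual is `dim % 16`; the names `ResidualPlan`, `plan_residual`, and `old_scalar_tail` are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): describes how a residual of
// r = dim % 16 elements is split between 4-lane vector sub-steps and scalar code.
struct ResidualPlan {
    std::size_t vector_substeps; // number of 4-lane sub-steps taken (0..3)
    std::size_t scalar_tail;     // elements left for the scalar loop (0..3)
};

// New scheme: three independent sub-steps guarded by r >= 4, r >= 8, r >= 12.
inline ResidualPlan plan_residual(std::size_t dim) {
    const std::size_t r = dim % 16; // assumption: 16 elements per full chunk
    ResidualPlan plan{0, 0};
    if (r >= 4) ++plan.vector_substeps;
    if (r >= 8) ++plan.vector_substeps;
    if (r >= 12) ++plan.vector_substeps;
    plan.scalar_tail = r - 4 * plan.vector_substeps;
    assert(plan.scalar_tail <= 3);
    return plan;
}

// Previous scheme as described: one 8-lane block, then scalar for the rest.
inline std::size_t old_scalar_tail(std::size_t dim) {
    const std::size_t r = dim % 16;
    return r >= 8 ? r - 8 : r; // up to 7 software conversions
}
```

For example, a residual of 15 leaves 3 scalar elements under the new scheme and 7 under the old one.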

Which issues this PR fixes

MOD-14972

Main objects this PR modified

src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h — new NEON_HP IP kernel
src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h — new SVE/SVE2 IP kernel
src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h — new NEON_HP L2 kernel
src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h — new SVE/SVE2 L2 kernel
src/VecSim/spaces/functions/NEON_HP.{h,cpp}, SVE.{h,cpp}, SVE2.{h,cpp} — chooser symbols
src/VecSim/spaces/IP_space.cpp, L2_space.cpp — AArch64 dispatcher blocks
tests/unit/test_spaces.cpp — tier-walk tests for NEON_HP / SVE / SVE2
tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp — ARM microbench registrations

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

Note

Medium Risk
Changes hot-path vector distance math and CPU dispatch; incorrect SIMD or residual handling could skew ANN results, though extensive parity tests mitigate this.

Overview
Adds ARM SIMD for asymmetric SQ8 storage ↔ FP16 query distances (inner product, L2, cosine), paralleling the existing x86 SQ8↔FP16 path.

New NEON_HP kernels use 16-byte chunks, FP16→FP32 widening, and a residual path aligned with the SQ8↔FP32 NEON style. New SVE/SVE2 IP kernels load FP16 via svld1uh_u32 and widen with svcvt_f32_f16; L2 reuses the IP core plus sum-of-squares metadata.
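One way to read "L2 reuses the IP core plus sum-of-squares metadata" is through the standard identity ‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩. The sketch below shows that algebra on plain float vectors. It is an assumption about the approach, not the PR's kernel code: the actual SQ8 dequantization and metadata layout are not shown here, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain scalar inner product, standing in for the SIMD "IP core".
inline float inner_product(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) acc += x[i] * y[i];
    return acc;
}

// Squared L2 distance recovered from an inner product plus the two
// sums of squares (assumed to be available as precomputed metadata).
inline float l2_sq_from_ip(float ip, float x_sum_sq, float y_sum_sq) {
    return x_sum_sq + y_sum_sq - 2.0f * ip;
}

// Direct computation, used only to check the identity.
inline float l2_sq_direct(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float d = x[i] - y[i];
        acc += d * d;
    }
    return acc;
}
```

With the sums of squares stored alongside the vectors, the per-query work reduces to a single inner-product pass, which would explain why the L2 headers include the IP header.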

Dispatch in IP_space.cpp and L2_space.cpp selects SVE2 → SVE → NEON_HP when dim >= 16 and the matching CPU features are present. Chooser wiring lives in NEON_HP, SVE, and SVE2 function modules.
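The selection order can be illustrated with a small portable sketch. The `Tier` enum, the `CpuFeatures` struct, and `choose_tier` are hypothetical stand-ins for the real dispatcher and chooser symbols; only the priority order and the `dim >= 16` guard come from the description above.

```cpp
#include <cassert>
#include <cstddef>

enum class Tier { Scalar, NEON_HP, SVE, SVE2 };

// Hypothetical feature flags; the real code queries CPU features differently.
struct CpuFeatures {
    bool sve2;
    bool sve;
    bool neon_hp;
};

// Picks the highest available tier, falling back to scalar for small
// dimensions or when no matching feature is present.
inline Tier choose_tier(std::size_t dim, const CpuFeatures& f) {
    if (dim < 16) return Tier::Scalar; // SIMD kernels expect dim >= 16
    if (f.sve2) return Tier::SVE2;
    if (f.sve) return Tier::SVE;
    if (f.neon_hp) return Tier::NEON_HP;
    return Tier::Scalar;
}
```

Under this model, dim=15 always lands on the scalar path regardless of CPU features, which is consistent with the tests using dim=15 for portable scalar checks.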

Tests/benchmarks: unit tests walk each ARM tier and compare to baseline; GetDistFuncSQ8FP16Asymmetric uses dim=15 for portable scalar checks; benchmarks register NEON_HP, SVE, and SVE2 variants.

Reviewed by Cursor Bugbot for commit 6f6ef26.

Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…972]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…OD-14972] The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32 sister blocks which assert it. Adds the assertions to lock the contract that ARM tiers leave the caller's alignment value untouched.

…D-14972] svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e]. The step function loaded chunk consecutive FP16 values into positions 0..chunk-1, then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,... and silently skipped positions 1,3,5,... For chunk=4 (128-bit SVE), only 2 of 4 FP16 values per step were used, producing wrong dot products.

Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so FCVT correctly reads v[0],v[1],v[2],... Applied to both the full step helper and the partial-chunk path.

Discovered and fixed during ARM host verification (Task 14, MOD-14972).
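The lane-selection problem in this commit message can be modeled without SVE hardware. The sketch below treats a vector as an array of 16-bit lanes and applies the rule quoted above (output lane e reads halfword lane 2e). It tracks raw bit patterns only and performs no actual FP16 to FP32 conversion; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models the quoted rule: widened lane e takes its input from halfword lane 2e.
inline std::vector<std::uint16_t> widen_even_lanes(const std::vector<std::uint16_t>& h) {
    std::vector<std::uint16_t> out(h.size() / 2);
    for (std::size_t e = 0; e < out.size(); ++e) out[e] = h[2 * e];
    return out;
}

// Buggy layout: chunk values packed into halfword lanes 0..chunk-1.
inline std::vector<std::uint16_t> load_packed(const std::vector<std::uint16_t>& v,
                                              std::size_t chunk) {
    assert(v.size() >= chunk);
    std::vector<std::uint16_t> h(2 * chunk, 0);
    for (std::size_t i = 0; i < chunk; ++i) h[i] = v[i];
    return h;
}

// Fixed layout: each value sits in an even halfword lane with a zero above it,
// which is the effect of zipping with zeros or of a zero-extending load.
inline std::vector<std::uint16_t> load_interleaved(const std::vector<std::uint16_t>& v,
                                                   std::size_t chunk) {
    assert(v.size() >= chunk);
    std::vector<std::uint16_t> h(2 * chunk, 0);
    for (std::size_t i = 0; i < chunk; ++i) h[2 * i] = v[i];
    return h;
}
```

With four input values and chunk=4, the packed layout feeds only the first and third values to the widening step, matching the "2 of 4" symptom, while the interleaved layout feeds all four.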

…D-14972] SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with svld1uh_u32 (1 op), which zero-extends each FP16 halfword into a 32-bit lane so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to the partial-chunk path, which also drops the now-redundant pg16_partial predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z to match the SQ8_FP32 SVE sister.

NEON residual: replace the single 8-lane block + up-to-7 software-scalar iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12), leaving at most 3 elements for scalar. This mirrors the SQ8_FP32 NEON sister exactly and eliminates expensive vecsim_types::FP16_to_FP32 calls for residuals 4..15 (previously up to 7 software conversions per call).

Both IP headers: remove assert()+<cassert> (no sister kernel uses them).

Both L2 headers: drop redundant float16.h include and using declarations (arrive transitively through the included IP header).

CLAassistant · 2026-05-31T13:02:02Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ dor-forer
❌ Ubuntu

Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

jit-ci · 2026-05-31T13:03:37Z

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR

Security scan by Jit

…MOD-14972]
- Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970 removed its equivalent doc before merge.
- Drop the 5-line "No alignment write" prose comment from the three AArch64 NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no such comment, and the absent alignment write already encodes the intent.
- Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister). The scalar-fallback assertion it previously duplicated is already covered by the trailing block of SQ8_FP16_SpacesOptimizationTest.

codecov · 2026-05-31T13:22:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.07%. Comparing base (999580f) to head (6f6ef26).
⚠️ Report is 1 commit behind head on dor-forer-sq8-fp16-x86-kernels-mod-14954.

Additional details and impacted files

@@                             Coverage Diff                              @@
##           dor-forer-sq8-fp16-x86-kernels-mod-14954     #973      +/-   ##
============================================================================
+ Coverage                                     97.04%   97.07%   +0.03%     
============================================================================
  Files                                           141      141              
  Lines                                          8110     8110              
============================================================================
+ Hits                                           7870     7873       +3     
+ Misses                                          240      237       -3


dor-forer and others added 17 commits May 31, 2026 11:32

Retarget SQ8↔FP16 scalar-fallback dispatcher test to dim=0/15 [MOD-14972]

bb17ad1

Add NEON_HP SQ8↔FP16 IP kernel header [MOD-14972]

6a69a83

Add NEON_HP SQ8↔FP16 L2 kernel header [MOD-14972]

143d195

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wire NEON_HP SQ8↔FP16 choosers [MOD-14972]

cc78728

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Dispatch SQ8↔FP16 to NEON_HP tier on AArch64 [MOD-14972]

4c90779

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extend SQ8↔FP16 tier-walk tests with NEON_HP [MOD-14972]

94c4299

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add SVE SQ8↔FP16 IP kernel header [MOD-14972]

472a6a2

Add SVE SQ8↔FP16 L2 kernel header [MOD-14972]

49ca426

Wire SVE/SVE2 SQ8↔FP16 choosers [MOD-14972]

5d2fee9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Dispatch SQ8↔FP16 to SVE/SVE2 tiers on AArch64 [MOD-14972]

d6c7c41

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extend SQ8↔FP16 tier-walk tests with SVE/SVE2 [MOD-14972]

ee9da04

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Register ARM SQ8↔FP16 microbenchmarks [MOD-14972]

f98aa09

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ubuntu added 2 commits May 31, 2026 13:03

Apply clang-format [MOD-14972]

8fdeaa5

Apply clang-format 18.1.8 (matches CI) [MOD-14972]

6f6ef26

dor-forer added the benchmarks-all label May 31, 2026

dor-forer requested review from BenGoldberger, lerman25 and ofiryanai May 31, 2026 14:44

dor-forer wants to merge 20 commits into dor-forer-sq8-fp16-x86-kernels-mod-14954 from dor-forer-sq8-fp16-arm-kernels-mod-14972


2 participants
