Dorer SQ8 dist functions [MOD-9626] #673

dor-forer · 2025-05-12T12:29:10Z

Describe the changes in the pull request

Add dist functions and tests to support SQ8

Which issues this PR fixes

MOD-9626

Main objects this PR modified

dist functions

Mark if applicable

This PR introduces API changes
This PR introduces serialization changes

Copilot

Pull Request Overview

This PR adds support for SQ8 (8-bit quantized) distance functions for L2, inner product, and cosine similarities across various architectures, enhancing SIMD-based performance.

Introduces SQ8_L2Sqr and its SIMD implementations (SVE, SSE4, NEON, AVX2, AVX512, AVX2_FMA)
Adds SQ8_InnerProduct and SQ8_Cosine with matching SIMD variants and a common dispatcher in IP_space.cpp
Updates headers (L2.h, IP.h), build scripts (CMakeLists, instruction flags) for new functions and compiler flags
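
For orientation, here is a minimal scalar sketch of what an SQ8 distance function computes. It assumes the blob layout visible in the review snippets further down: `dimension` uint8 codes, followed by a float `min_val` and a float `delta`, with the first vector being plain float32. The names `load_f32`, `sq8_l2sqr_ref` and `sq8_dot_ref` are hypothetical and are not the functions added by this PR; the sketch returns the raw dot product and makes no claim about how `SQ8_InnerProduct` turns it into a distance.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a float from a possibly unaligned byte address.
static float load_f32(const std::uint8_t *p) {
    float v;
    std::memcpy(&v, p, sizeof(float));
    return v;
}

// Squared L2 distance between a float32 query and an SQ8 blob laid out as
// [dim x uint8 codes][min_val (float)][delta (float)] (assumed layout).
float sq8_l2sqr_ref(const float *query, const std::uint8_t *blob, std::size_t dim) {
    const float min_val = load_f32(blob + dim);
    const float delta = load_f32(blob + dim + sizeof(float));
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        const float dequant = blob[i] * delta + min_val; // dequantize one code
        const float diff = query[i] - dequant;
        sum += diff * diff;
    }
    return sum;
}

// Raw dot product between a float32 query and the same SQ8 blob.
float sq8_dot_ref(const float *query, const std::uint8_t *blob, std::size_t dim) {
    const float min_val = load_f32(blob + dim);
    const float delta = load_f32(blob + dim + sizeof(float));
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        sum += query[i] * (blob[i] * delta + min_val);
    }
    return sum;
}
```

The SIMD variants listed above would be expected to produce the same values as a reference like this, up to floating-point summation order.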

Reviewed Changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
src/VecSim/spaces/L2/L2_SVE_SQ8.h	Adds SVE-based SQ8 L2 squared distance step and template
src/VecSim/spaces/L2/L2_SSE4_SQ8.h	Adds SSE4-based SQ8 L2 squared distance
src/VecSim/spaces/L2/L2_NEON_SQ8.h	Adds NEON-based SQ8 L2 squared distance
src/VecSim/spaces/L2/L2_AVX512F_BW_VL_VNNI_SQ8.h	Adds AVX512F BW VL VNNI SQ8 L2 squared distance
src/VecSim/spaces/L2/L2_AVX2_SQ8.h	Adds AVX2-based SQ8 L2 squared distance
src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8.h	Adds AVX2+FMA SQ8 L2 squared distance
src/VecSim/spaces/L2/L2.h	Declares `SQ8_L2Sqr`
src/VecSim/spaces/L2/L2.cpp	Implements naive `SQ8_L2Sqr`
src/VecSim/spaces/IP_space.cpp	Registers SQ8 inner-product and cosine dispatch functions
src/VecSim/spaces/IP/IP_SVE_SQ8.h	Adds SVE-based SQ8 inner-product and cosine
src/VecSim/spaces/IP/IP_SSE4_SQ8.h	Adds SSE4-based SQ8 inner-product and cosine
...	(other SIMD variants for IP and Cosine)
src/VecSim/spaces/IP/IP.h	Declares `SQ8_InnerProduct`, `SQ8_Cosine`
src/VecSim/spaces/CMakeLists.txt	Enables SSE4 and AVX2+FMA source files with proper flags
cmake/x86_64InstructionFlags.cmake	Adds detection and definitions for SSE4 and AVX2_FMA flags

Comments suppressed due to low confidence (6)

src/VecSim/spaces/IP/IP_SVE_SQ8.h:45

[nitpick] Avoid naming a variable min which can conflict with std::min; consider renaming to min_val for clarity and to prevent shadowing.

    float min = *(float *)(pVect2 + dimension);

src/VecSim/spaces/L2/L2.cpp:22

Typo in comment: 'structred' should be 'structured'.

    // it structred as [quantized values (uint8_t * dim)][min_val (float)][delta

src/VecSim/spaces/L2/L2.h:14

[nitpick] No unit tests were added for SQ8_L2Sqr; please add tests covering full-chunk, partial-chunk, and edge-case dimensions to ensure correctness.

float SQ8_L2Sqr(const void *pVect1v, const void *pVect2v, size_t dimension);

src/VecSim/spaces/L2/L2.cpp:13

Remove the unused #include <iostream> from this file to avoid unnecessary dependencies.

#include <iostream>

src/VecSim/spaces/IP/IP_SVE_SQ8.h:11

The <iostream> header is not used in this implementation, consider removing it to keep the header lightweight.

#include <iostream>

src/VecSim/spaces/L2/L2.h:14

Consider adding a corresponding L2_SQ8_GetDistFunc and registering it in the L2 space selector (similar to IP_space) so that the SQ8 L2 implementation can be chosen dynamically.

float SQ8_L2Sqr(const void *pVect1v, const void *pVect2v, size_t dimension);

lerman25

Nice L2
Some comments

lerman25 · 2025-05-26T06:23:34Z

src/VecSim/spaces/IP/IP_SVE_SQ8.h

+    sum = svmla_f32_x(pg, sum, v1, v2_dequant);
+
+    // Move to the next set of elements
+    offset += svcntw();


In the "regular" IP/L2 SVE implementations we pass the offset to the function as a parameter; I think we should align the behavior here

It is passed as a parameter.
Or am I missing something?

I meant the offset += svcntw(); -> offset += chunk;
Look at IP_SVE_INT8 for reference

Oh I see now.
You all did it behind my back :) I didn't do it on my FP32 implementation.
I will align with your implementation.

src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8.h

src/VecSim/spaces/L2/L2_SVE_SQ8.h

src/VecSim/spaces/L2/L2_AVX2_SQ8.h

src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8.h

lerman25

Great job 👑
Other than my 2 comments nothing else to add

lerman25 · 2025-05-27T09:06:55Z

src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8.h

+    __m256 diff_squared = _mm256_mul_ps(diff, diff);
+
+    // Add to running sum
+    sum256 = _mm256_add_ps(sum256, diff_squared);


fmadd is an option
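
To illustrate the suggestion: the separate multiply and add can be fused into one multiply-add, which in the AVX2+FMA kernel would presumably be `_mm256_fmadd_ps(diff, diff, sum256)`. The sketch below shows the same idea in portable scalar C++ with `std::fma`; the function names are hypothetical and this is not the code from the PR.

```cpp
#include <cmath>
#include <cstddef>

// Accumulate squared differences with a separate multiply and add.
float l2sqr_mul_add(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float diff = a[i] - b[i];
        const float diff_squared = diff * diff;
        sum = sum + diff_squared;
    }
    return sum;
}

// Same accumulation with a fused multiply-add: sum = diff * diff + sum,
// rounded once instead of twice.
float l2sqr_fma(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float diff = a[i] - b[i];
        sum = std::fma(diff, diff, sum);
    }
    return sum;
}
```

Because the fused form rounds once, the two variants can differ in the last bits on general inputs, which is worth keeping in mind when comparing against a tight test tolerance.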

lerman25 · 2025-05-27T09:42:34Z

tests/unit/test_spaces.cpp

+    params[2] = inv_norm;
+
+    float dist = SQ8_L2Sqr((const void *)v1_orig, (const void *)v2_compressed.data(), dim);
+    ASSERT_NEAR(dist, 0.0f, 0.01f) << "SQ8_Cosine failed to match expected distance";


is 0.01 enough?
@GuyAv46 WDYT?
Other tests used 0.000001

The IP tests pass with 0.000001 and the L2 tests pass with 0.00001
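
One way to reason about a tolerance for a test that compares a vector with its own quantized form is to bound the quantization error. The sketch below assumes a min/max scalar quantizer with round-to-nearest (`delta = (max - min) / 255`); the PR's test utilities are not shown on this page, so `Sq8Vector`, `quantize_sq8`, `self_distance` and `worst_case_self_distance` are all hypothetical. Under that assumption each element is off by at most `delta / 2`, so the squared L2 self-distance is at most `dim * (delta / 2)^2`, plus floating-point noise.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sq8Vector {
    std::vector<std::uint8_t> codes;
    float min_val;
    float delta;
};

// Assumed quantizer: map [min, max] onto 0..255 with round-to-nearest.
// The input must be non-empty.
Sq8Vector quantize_sq8(const std::vector<float> &v) {
    const auto [lo, hi] = std::minmax_element(v.begin(), v.end());
    Sq8Vector out;
    out.min_val = *lo;
    out.delta = (*hi - *lo) / 255.0f;
    if (out.delta == 0.0f) out.delta = 1.0f; // constant vector: any step works
    out.codes.reserve(v.size());
    for (float x : v) {
        const float q = std::round((x - out.min_val) / out.delta);
        out.codes.push_back(static_cast<std::uint8_t>(std::clamp(q, 0.0f, 255.0f)));
    }
    return out;
}

// Squared L2 distance between a vector and its dequantized form.
float self_distance(const std::vector<float> &v, const Sq8Vector &q) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        const float diff = v[i] - (q.codes[i] * q.delta + q.min_val);
        sum += diff * diff;
    }
    return sum;
}

// Upper bound on self_distance when every element is off by at most delta / 2.
float worst_case_self_distance(std::size_t dim, float delta) {
    const float half_step = delta / 2.0f;
    return static_cast<float>(dim) * half_step * half_step;
}
```

The bound scales with both the dimension and the value range, so a fixed tolerance that is comfortable for one test vector may be tight or loose for another.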

GuyAv46 · 2025-06-05T12:38:22Z

src/VecSim/spaces/IP/IP.cpp

+    const float min_val = *reinterpret_cast<const float *>(pVect2 + dimension);
+    const float delta = *reinterpret_cast<const float *>(pVect2 + dimension + sizeof(float));


consider remodelling so the metadata is at the start of the vector
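
A minimal sketch of the two layouts, for comparison. The tail layout matches the offsets in the snippet above; the head layout is only one hypothetical way to read the suggestion. `Sq8Params`, `read_params_tail`, `read_params_head` and `codes_head` are illustrative names, not VecSim API. The sketch uses `std::memcpy` rather than `reinterpret_cast` because in the tail layout the floats start at byte offset `dimension`, which is only 4-byte aligned when `dimension` is a multiple of 4.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Sq8Params {
    float min_val;
    float delta;
};

// Layout from the diff: [codes (dim bytes)][min_val][delta].
// The metadata offset depends on dim.
Sq8Params read_params_tail(const std::uint8_t *blob, std::size_t dim) {
    Sq8Params p;
    std::memcpy(&p.min_val, blob + dim, sizeof(float));
    std::memcpy(&p.delta, blob + dim + sizeof(float), sizeof(float));
    return p;
}

// Hypothetical alternative: [min_val][delta][codes (dim bytes)].
// The metadata sits at a fixed offset, independent of dim.
Sq8Params read_params_head(const std::uint8_t *blob) {
    Sq8Params p;
    std::memcpy(&p.min_val, blob, sizeof(float));
    std::memcpy(&p.delta, blob + sizeof(float), sizeof(float));
    return p;
}

// In the head layout the codes start right after the two floats.
const std::uint8_t *codes_head(const std::uint8_t *blob) {
    return blob + 2 * sizeof(float);
}
```

With the metadata first, the floats are naturally aligned whenever the blob itself is, at the cost of the codes starting 8 bytes into the allocation.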

GuyAv46 · 2025-06-05T12:43:25Z

src/VecSim/spaces/IP_space.cpp

+        if (dim % 16 == 0) // no point in aligning if we have an offsetting residual
+            *alignment = 16 * sizeof(float); // handles 16 floats


If the vector is not aligned once its metadata is included, there is no need to set an alignment. If we're not sure what the alignment should be, it is better not to set it (keep it 0)
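
One possible reading of this comment, sketched as a standalone helper: only request an alignment when the whole stored blob (codes plus metadata) is a multiple of the chunk size, and otherwise leave the hint at 0. The function `sq8_alignment_hint` and its parameters are hypothetical; the timeline below shows that the PR ended up removing the alignment logic rather than adopting something like this.

```cpp
#include <cstddef>

// Hypothetical helper, not part of VecSim.
// dim:            number of uint8 codes (one byte each)
// metadata_bytes: bytes of per-vector metadata stored with the codes
// chunk_bytes:    bytes the SIMD kernel consumes per step
// Returns the alignment to request, or 0 for "no alignment requirement".
std::size_t sq8_alignment_hint(std::size_t dim, std::size_t metadata_bytes,
                               std::size_t chunk_bytes) {
    const std::size_t blob_bytes = dim + metadata_bytes;
    return (blob_bytes % chunk_bytes == 0) ? chunk_bytes : 0;
}
```

For example, 16 codes plus two floats of metadata occupy 24 bytes, which is not a multiple of a 64-byte chunk, so the sketch returns 0.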

GuyAv46 · 2025-06-05T12:57:52Z

src/VecSim/spaces/IP/IP_NEON_SQ8.h

+    float32x4_t v2_f = vcvtq_f32_u32(v2_u32);
+
+    // Dequantize: (val * delta) + min_val
+    float32x4_t v2_dequant = vmlaq_f32(min_val_vec, v2_f, delta_vec);


There are intrinsics for MLA with a scalar, removing the need to create min_val_vec and delta_vec. We should check (in the following PR) what's the difference in performance
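
A minimal sketch of the scalar-multiplier form, using `vmlaq_n_f32(a, b, c)`, which computes `a + b * c` with `c` a scalar. The function `dequantize4` is a hypothetical name, and the non-NEON branch exists only so the sketch compiles on other targets. Note that in this form the accumulator still has to be a vector, so only the `delta_vec` broadcast goes away; whether that helps is exactly the performance question left for the follow-up PR.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Dequantize four already-converted codes: out[i] = codes_f32[i] * delta + min_val.
void dequantize4(const float codes_f32[4], float delta, float min_val, float out[4]) {
#if defined(__ARM_NEON)
    const float32x4_t v2_f = vld1q_f32(codes_f32);
    const float32x4_t acc = vdupq_n_f32(min_val); // accumulator is still a vector
    // Multiply-accumulate by a scalar: acc + v2_f * delta, no delta_vec needed.
    const float32x4_t v2_dequant = vmlaq_n_f32(acc, v2_f, delta);
    vst1q_f32(out, v2_dequant);
#else
    // Portable fallback so the sketch builds on non-NEON targets.
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = codes_f32[i] * delta + min_val;
    }
#endif
}
```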

GuyAv46 · 2025-06-05T12:58:56Z

src/VecSim/spaces/IP/IP_NEON_SQ8.h

+        if constexpr (final_residual >= 1) {
+            v1 = vld1q_lane_f32(pVect1, v1, 0);
+            float dequant0 = pVect2[0] * delta + min_val;
+            v2_dequant = vld1q_lane_f32(&dequant0, v2_dequant, 0);
+        }
+        if constexpr (final_residual >= 2) {
+            v1 = vld1q_lane_f32(pVect1 + 1, v1, 1);
+            float dequant1 = pVect2[1] * delta + min_val;
+            v2_dequant = vld1q_lane_f32(&dequant1, v2_dequant, 1);
+        }
+        if constexpr (final_residual >= 3) {
+            v1 = vld1q_lane_f32(pVect1 + 2, v1, 2);
+            float dequant2 = pVect2[2] * delta + min_val;
+            v2_dequant = vld1q_lane_f32(&dequant2, v2_dequant, 2);
+        }


should be evaluated (performance)

GuyAv46 · 2025-06-05T13:02:32Z

src/VecSim/spaces/IP/IP_SSE4_SQ8.h

+                float dequant0 = quantized[0] * delta + min;
+                v2_dequant = _mm_load_ss(&dequant0);
+
+                // Dequantize next two values
+                float dequant_high[2] = {quantized[1] * delta + min, quantized[2] * delta + min};
+                v2_dequant = _mm_loadh_pi(v2_dequant, (__m64 *)dequant_high);


Consider using a set intrinsic to set the relevant elements of the vector manually, instead of loading from a stack variable
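
A minimal sketch of that idea for a three-element residual, building both operands with `_mm_set_ps` instead of loading temporaries from the stack. The function `dot3_residual` is a hypothetical name, the lane order is an assumption (the original places the values in different lanes, which is fine as long as both operands agree), and the scalar branch exists only so the sketch compiles on non-x86 targets.

```cpp
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SQ8_SKETCH_HAVE_SSE 1
#endif

// Dot product of three floats with three dequantized SQ8 codes.
float dot3_residual(const float *v1, const std::uint8_t *quantized, float delta,
                    float min_val) {
#ifdef SQ8_SKETCH_HAVE_SSE
    // _mm_set_ps takes its arguments from the highest lane to the lowest;
    // the unused top lane is zero in both operands.
    const __m128 a = _mm_set_ps(0.0f, v1[2], v1[1], v1[0]);
    const __m128 v2_dequant = _mm_set_ps(0.0f,
                                         quantized[2] * delta + min_val,
                                         quantized[1] * delta + min_val,
                                         quantized[0] * delta + min_val);
    const __m128 prod = _mm_mul_ps(a, v2_dequant);
    float lanes[4];
    _mm_storeu_ps(lanes, prod); // simple horizontal sum for the sketch
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i) {
        sum += v1[i] * (quantized[i] * delta + min_val);
    }
    return sum;
#endif
}
```

Whether the compiler generates better code for the set form than for the load form would need to be measured, in line with the other performance comments on this PR.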

GuyAv46 · 2025-06-08T16:09:37Z

src/VecSim/spaces/L2_space.cpp

+        if (dim % 16 == 0) // no point in aligning if we have an offsetting residual
+            *alignment = 16 * sizeof(float); // handles 16 floats


remove redundant alignment

GuyAv46

We should still validate the implementation performance and address the performance comments left on this PR

dor-forer added 30 commits May 11, 2025 10:25

add sq8

69d63ac

Change to IP_AVX512F

af85432

Change

b215799

vec1

8b4188b

float

a1d1a16

finish

b5860bb

now

0d07d71

remove Choose_SQ8_Cosine_implementation_AVX512F

66c49e8

in test

aa26c71

alignemnt

43b58a8

back to bw

1e12fa3

back again

984a030

again

c3670a8

optimization

11303b7

more BW

7474c05

fix avx

2cfd9b6

add avx cosine test

3cdf05e

avx

fc8bc7d

add impl

513839b

add l2

f676c1b

replace OPT_AVX512_F_BW_VL_VNNI

9a899cc

align

4fa5327

Fix avx

1379d6d

add l2 sse

f7fdb2b

Remove prints

4fa88b2

sve2 l2

4476833

add neon

2a7477c

fix sve

b1f502c

add sq8 cosine test

dc154b5

test utils

25a9400

dor-forer added 3 commits May 22, 2025 09:35

format

ef09ead

change to _mm_cvtsi32_si128

7567730

Change in the l2

a767547

dor-forer requested review from lerman25 and Copilot May 22, 2025 07:22

Copilot AI reviewed May 22, 2025

lerman25 reviewed May 26, 2025

PR changes

e6422dc

dor-forer requested a review from lerman25 May 27, 2025 06:17

added chunk to functions

10a6098

lerman25 previously approved these changes May 27, 2025

diff squared

767e190

dor-forer dismissed lerman25’s stale review via 767e190 May 27, 2025 10:10

dor-forer added 2 commits May 27, 2025 13:19

format

44be275

chnage diff

3a956bf

GuyAv46 reviewed Jun 5, 2025

dor-forer added 2 commits June 5, 2025 17:31

Remove align from tests improve sse4

5840e3f

format

2a89dd8

dor-forer requested a review from GuyAv46 June 5, 2025 15:09

dor-forer added 2 commits June 8, 2025 18:09

applied to l2

e562a86

format

2a0b4e6

GuyAv46 reviewed Jun 8, 2025

Remove alignment l2

ab18690

dor-forer requested review from GuyAv46 and lerman25 June 8, 2025 16:36

GuyAv46 approved these changes Jun 9, 2025

dor-forer added this pull request to the merge queue Jun 9, 2025

Merged via the queue into main with commit 6a84603 Jun 9, 2025
23 checks passed

dor-forer deleted the dorer-sq8-dist-functions branch June 9, 2025 10:53
