-
Notifications
You must be signed in to change notification settings - Fork 21
Optimize Distance Functions - [MOD-5434] #395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize Distance Functions - [MOD-5434] #395
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## feature_hnsw_vector_blocks #395 +/- ##
==============================================================
- Coverage 96.28% 95.48% -0.81%
==============================================================
Files 67 66 -1
Lines 4689 4120 -569
==============================================================
- Hits 4515 3934 -581
- Misses 174 186 +12
☔ View full report in Codecov by Sentry. |
5030493
to
e9c7f93
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks very good and clean ! Well done
Please verify we still need all the compilation & hardware flags for DQ
ASSERT_EQ(L2_FP32_GetDistFunc(dim, ARCH_OPT_AVX512_F), FP32_L2Sqr); | ||
ASSERT_EQ(L2_FP64_GetDistFunc(dim, ARCH_OPT_AVX512_F), FP64_L2Sqr); | ||
ASSERT_EQ(IP_FP32_GetDistFunc(dim, ARCH_OPT_AVX512_F), FP32_InnerProduct); | ||
ASSERT_EQ(IP_FP64_GetDistFunc(dim, ARCH_OPT_AVX512_F), FP64_InnerProduct); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will always work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we make some optimization for dim<8/16
#pragma once | ||
|
||
/* | ||
* Include this file after done using the implementation chooser. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably nothing unless more X definitions will be presented
* We assume that we are dealing with 512-bit blocks, so we define a chunk size of 16 for 32-bit | ||
* elements, and a chunk size of 8 for 64-bit elements. The main macro is CHOOSE_IMPLEMENTATION, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we are assuming this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was the way the functions were implemented in the first place. we can consider changing that but today it simply means X2 more condition checks for AVX and X4 for SSE. also it makes choosing the implementation simpler
// Macro for 4 cases. Used to collapse the switch statement. For a given N, expands to 4 X macros | ||
// of 4N, 4N+1, 4N+2, 4N+3. | ||
#define C4(X, func, N) X(4 * N, func) X(4 * N + 1, func) X(4 * N + 2, func) X(4 * N + 3, func) | ||
|
||
// Macros for 8 and 16 cases. Used to collapse the switch statement. Expands into 0-15 or 0-7 cases. | ||
#define CASES16(X, func) C4(X, func, 0) C4(X, func, 1) C4(X, func, 2) C4(X, func, 3) | ||
#define CASES8(X, func) C4(X, func, 0) C4(X, func, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use C8 instead of C4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. This was done for readability (so each definition will fit one line). I can change that
target_link_libraries(VectorSimilaritySpaces cpu_features) | ||
endif() | ||
|
||
if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the removal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as all the functions are now templated and written in H files, we no longer have files that are compiled only sometimes
#include "space_includes.h" | ||
|
||
template <__mmask8 mask> // (2^n)-1, where n is in 1..7 (1, 4, ..., 127) | ||
static inline __m256 my_mm256_maskz_loadu_ps(const float *p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a need for those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make them not templated. Probably won't affect the performance
* HNSW refactor 3 - [MOD-5302] (#389) * moved vector_block * make DataBlock available to use in vectors * some general improvements * implement data blocks in HNSW * disabled serializer benchmarks and some tests * more disable * enabled serialization (for current implementation) * added prefetch for range * move generic meta of element to a separated vector * added prefetch for metadata * enabled save and load from bindings * [TMP] return of the old HNSW REVERT ME * fix bm * some changes in prefetch * shortened BM * reverted prefetches to by as before * shorten BM * small improvement for range * packing structs * unrelated performance improvement * fix * revert adding origin hnsw, remove support for v1 and v2 serialization * update for BM file * some fixes and test updates * [TEMP] disable 2 tests so coverage will run * fix flow test * more test fixes * improved tests * fix for bach iterator scan (needs a benchmark) * for benchmark * reverting some temporary changes * more reverting * fix for clang * file name update * another prefetch option * few improvements * some more trying * final change * another update to all but `hnsw.h` * returned `increaseCapacity` responsibility to `addVector` * make `hnsw.h` use blocks * BF comment fix * comment fix * fix some tests * fixed hnsw tests * fixed hnsw-multi tests * fixed almost all tiered HNSW tests * Fix memory bookkeeping tests * Fix memory bookkeeping tests 2 * fixed estimations and their tests * review fixes * fix review fix * more review fixes * move rounding up of initial capacity to a static function * added comments on data blocks * some optimizations (reduce the use of `getDataByInternalId`) --------- Co-authored-by: alon <alonreshef24@gmail.com> * HNSW blocks refactor - add lock to graph data struct (#390) * Improved serializing code - [MOD-5372] (#391) * improved serializing code * review fixes * HNSW Refactor - benchmarks - [MOD-5371] (#392) * update benchmark files * updated wget links * publish serialization script * another benchmark cleanup iteration * review fixes * Renaming "meta" variables (#394) * renaming "meta" variables * revert temp change * Optimize Distance Functions - [MOD-5434] (#395) * initial templated with masks implementations * format * tidy up * enabled spaces tests back * changed template type and handle residual first * re-enabled benchmarks (keeping old names) * download fix * improved unit testing * improved spaces benchmarks * verify correctness * some cleanup * give up optimizing dim<16 for safety * aligned serialization links * added lots of comments * added a test and small fix * include opts only on x86 machines * remove AVX512DQ references from the project (not in use) * rename qty to dimension * Update AVX_utils.h comments * Optimize - implement align allocation for vector alignment - [MOD-5433] (#399) * aligning query vector * implement aligned allocation * added alignment hing to VecSimIndexAbstract, used it in block allocation * test fix * review fixes * set default value to the alignment hint (1 - any address is valid) * refactor allocation header to have alignment flag, unify free function * use alignment only on vector blocks * changed default alignment value (0) * updated tests * added missing break * improved comment * removed alignment from allocator test
* HNSW refactor 3 - [MOD-5302] (#389) * moved vector_block * make DataBlock available to use in vectors * some general improvements * implement data blocks in HNSW * disabled serializer benchmarks and some tests * more disable * enabled serialization (for current implementation) * added prefetch for range * move generic meta of element to a separated vector * added prefetch for metadata * enabled save and load from bindings * [TMP] return of the old HNSW REVERT ME * fix bm * some changes in prefetch * shortened BM * reverted prefetches to by as before * shorten BM * small improvement for range * packing structs * unrelated performance improvement * fix * revert adding origin hnsw, remove support for v1 and v2 serialization * update for BM file * some fixes and test updates * [TEMP] disable 2 tests so coverage will run * fix flow test * more test fixes * improved tests * fix for bach iterator scan (needs a benchmark) * for benchmark * reverting some temporary changes * more reverting * fix for clang * file name update * another prefetch option * few improvements * some more trying * final change * another update to all but `hnsw.h` * returned `increaseCapacity` responsibility to `addVector` * make `hnsw.h` use blocks * BF comment fix * comment fix * fix some tests * fixed hnsw tests * fixed hnsw-multi tests * fixed almost all tiered HNSW tests * Fix memory bookkeeping tests * Fix memory bookkeeping tests 2 * fixed estimations and their tests * review fixes * fix review fix * more review fixes * move rounding up of initial capacity to a static function * added comments on data blocks * some optimizations (reduce the use of `getDataByInternalId`) --------- Co-authored-by: alon <alonreshef24@gmail.com> * HNSW blocks refactor - add lock to graph data struct (#390) * Improved serializing code - [MOD-5372] (#391) * improved serializing code * review fixes * HNSW Refactor - benchmarks - [MOD-5371] (#392) * update benchmark files * updated wget links * publish serialization script * another benchmark cleanup iteration * review fixes * Renaming "meta" variables (#394) * renaming "meta" variables * revert temp change * Optimize Distance Functions - [MOD-5434] (#395) * initial templated with masks implementations * format * tidy up * enabled spaces tests back * changed template type and handle residual first * re-enabled benchmarks (keeping old names) * download fix * improved unit testing * improved spaces benchmarks * verify correctness * some cleanup * give up optimizing dim<16 for safety * aligned serialization links * added lots of comments * added a test and small fix * include opts only on x86 machines * remove AVX512DQ references from the project (not in use) * rename qty to dimension * Update AVX_utils.h comments * Optimize - implement align allocation for vector alignment - [MOD-5433] (#399) * aligning query vector * implement aligned allocation * added alignment hing to VecSimIndexAbstract, used it in block allocation * test fix * review fixes * set default value to the alignment hint (1 - any address is valid) * refactor allocation header to have alignment flag, unify free function * use alignment only on vector blocks * changed default alignment value (0) * updated tests * added missing break * improved comment * removed alignment from allocator test
* HNSW refactor 3 - [MOD-5302] (#389) * moved vector_block * make DataBlock available to use in vectors * some general improvements * implement data blocks in HNSW * disabled serializer benchmarks and some tests * more disable * enabled serialization (for current implementation) * added prefetch for range * move generic meta of element to a separated vector * added prefetch for metadata * enabled save and load from bindings * [TMP] return of the old HNSW REVERT ME * fix bm * some changes in prefetch * shortened BM * reverted prefetches to by as before * shorten BM * small improvement for range * packing structs * unrelated performance improvement * fix * revert adding origin hnsw, remove support for v1 and v2 serialization * update for BM file * some fixes and test updates * [TEMP] disable 2 tests so coverage will run * fix flow test * more test fixes * improved tests * fix for bach iterator scan (needs a benchmark) * for benchmark * reverting some temporary changes * more reverting * fix for clang * file name update * another prefetch option * few improvements * some more trying * final change * another update to all but `hnsw.h` * returned `increaseCapacity` responsibility to `addVector` * make `hnsw.h` use blocks * BF comment fix * comment fix * fix some tests * fixed hnsw tests * fixed hnsw-multi tests * fixed almost all tiered HNSW tests * Fix memory bookkeeping tests * Fix memory bookkeeping tests 2 * fixed estimations and their tests * review fixes * fix review fix * more review fixes * move rounding up of initial capacity to a static function * added comments on data blocks * some optimizations (reduce the use of `getDataByInternalId`) --------- Co-authored-by: alon <alonreshef24@gmail.com> * HNSW blocks refactor - add lock to graph data struct (#390) * Improved serializing code - [MOD-5372] (#391) * improved serializing code * review fixes * HNSW Refactor - benchmarks - [MOD-5371] (#392) * update benchmark files * updated wget links * publish serialization script * another benchmark cleanup iteration * review fixes * Renaming "meta" variables (#394) * renaming "meta" variables * revert temp change * Optimize Distance Functions - [MOD-5434] (#395) * initial templated with masks implementations * format * tidy up * enabled spaces tests back * changed template type and handle residual first * re-enabled benchmarks (keeping old names) * download fix * improved unit testing * improved spaces benchmarks * verify correctness * some cleanup * give up optimizing dim<16 for safety * aligned serialization links * added lots of comments * added a test and small fix * include opts only on x86 machines * remove AVX512DQ references from the project (not in use) * rename qty to dimension * Update AVX_utils.h comments * Optimize - implement align allocation for vector alignment - [MOD-5433] (#399) * aligning query vector * implement aligned allocation * added alignment hing to VecSimIndexAbstract, used it in block allocation * test fix * review fixes * set default value to the alignment hint (1 - any address is valid) * refactor allocation header to have alignment flag, unify free function * use alignment only on vector blocks * changed default alignment value (0) * updated tests * added missing break * improved comment * removed alignment from allocator test
Describe the changes in the pull request
Refactor all distance functions to be templated by the residual of the dimension, so that each dimension gets a tailor-made implementation with the fewest condition checks possible and at the same time, having less code (one function for each architecture) to maintain.
The new implementation for each architecture uses mask loading of the given arch and therefore we don't have to handle "tails" of non-512-bit-multiple vectors. we handle the residual part first, so the mask loading can safely load some more data if the implementation requires it, and with the 512-bit chunks after. We assume we have enough owned data for the mask loading, so we give up optimizing dimensions smaller than 512-bit (16 for floats, 8 for doubles). This is fine because there is no too-big hit, and there are no real-life scenarios of small dimensions. We can implement spatial optimizations for small dimensions in the future.
Main objects this PR modified
Mark if applicable