-
Notifications
You must be signed in to change notification settings - Fork 19
[MOD-9557] Fix incorrect vector blob size calculation #665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
size_t main_vector_indexing_count;
format comment out new tests
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #665 +/- ##
==========================================
- Coverage 96.34% 96.24% -0.11%
==========================================
Files 112 112
Lines 6239 6278 +39
==========================================
+ Hits 6011 6042 +31
- Misses 228 236 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
increases seed every call so we will get different vectors. this required a change in metrics_test because now the score is not exactly 0. VecSimIndexAbstract CastToBruteForce()-> rename to GetFlatBufferIndex add BruteForceIndex CastToBruteForce() add getFlatBufferIndexAsBruteForce to tiered index to get BruteForceIndex functions added to hnsw and BF: Formalize getDataByLabel definition: populate the intput empty vector with the vectors data,not including any additional meta data that might present in the vector storage like the norm. add getStoredVectorDataByLabel returns a vector of char vectors. each vector contains the vector as it is stored in the vectors container, including the meta data if exists.
assert in FP32SpacesOptimizationTest::FP32InnerProductTest
use preprocessQuery with force_copy in batch iterator instead on preprocessQueryInPlace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses several improvements in testing, data processing, and memory management for vector similarity indexes. Key changes include:
- Updates to test cases in both INT8 and BruteForce tests to ensure const correctness and more precise floating‑point comparisons.
- Modifications to index and preprocessor APIs to allow forced data copying during query preprocessing, as well as the addition of helper functions for retrieving complete stored vector data.
- Minor documentation and spelling corrections across multiple modules.
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
tests/unit/test_int8.cpp | Adjusted random vector generation and improved float assertions. |
tests/unit/test_bruteforce.cpp | Introduced const correctness when fetching internal vector data. |
src/VecSim/vec_sim_tiered_index.h | Added new helper to access the brute-force view of the flat buffer. |
src/VecSim/vec_sim_index.h | Updated preprocessQuery signature to support forced copying. |
src/VecSim/spaces/computer/preprocessors.h | Added TODO comments regarding blob size mismatches. |
src/VecSim/spaces/computer/preprocessor_container.h | Updated method signatures to support a force_copy option. |
src/VecSim/spaces/computer/preprocessor_container.cpp | Updated memory copy logic with force_copy and added TODO comments. |
src/VecSim/algorithms/svs/svs.h | Modified query blob handling in newBatchIterator for forced copying. |
src/VecSim/algorithms/hnsw/*.h | Improved vector data copying to exclude metadata and fixed spelling. |
src/VecSim/algorithms/brute_force/*.h | Updated tests to return complete vector data with const correctness. |
.github/workflows/event-pull_request.yml | Minor updates to build flags and commented out an unused workflow dependency. |
Comments suppressed due to low confidence (1)
tests/unit/test_int8.cpp:396
- [nitpick] Verify that the tolerance value (1e-6) used in ASSERT_NEAR is appropriate for the expected precision of floating‑point computations in this test scenario.
ASSERT_NEAR(score, expected_score, 1e-6f) << "failed at vector id:" << id;
@@ -55,9 +55,11 @@ class CosinePreprocessor : public PreprocessorInterface { | |||
// If one of them is null, allocate memory for it and copy the original_blob to it. | |||
if (storage_blob == nullptr) { | |||
storage_blob = this->allocator->allocate(processed_bytes_count); | |||
// TODO: handle original_blob_size != processed_bytes_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider elaborating on how to handle cases where the original blob size differs from processed_bytes_count to improve future maintainability.
Copilot uses AI. Check for mistakes.
this->allocator->allocate_aligned(this->dataSize, this->preprocessors->getAlignment()); | ||
memcpy(queryBlobCopy, queryBlob, this->dim * sizeof(DataType)); | ||
this->preprocessQueryInPlace(queryBlobCopy); | ||
// force_copy == true. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Clarify in documentation the rationale for forcing a copy (force_copy == true) of the query blob in the newBatchIterator to make the memory ownership semantics explicit.
// force_copy == true. | |
// force_copy == true. Forcing a copy ensures that the VecSimBatchIterator takes ownership | |
// of the query blob's memory and is responsible for freeing it. This avoids potential | |
// memory management issues and makes ownership semantics explicit. |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch and nice tests!
* @return A vector containing the complete vector data (elements + metadata) for the given | ||
* label | ||
*/ | ||
virtual std::vector<std::vector<char>> getStoredVectorDataByLabel(labelType label) const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can we move the two helpers under
VecSimIndexAbstract
(test only), so we don't need to redefine it inHNSW
? - Consider renaming - I'd expect
getDataByLabel
to return all the data. MaybegetDataByLabel
andgetVectorByLabel
, or something similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Agreed —
getDataByInternalId
should also be unified, as it's currently declared separately in each algorithm with different return types for BruteForce and HNSW. - I also agree that getDataByLabel isn’t a very intuitive name. getStoredVectorDataByLabel is more descriptive for the full blob, including metadata like norms.
That said, to avoid overloading in this PR, I suggest we address the renaming and unification in a separate follow-up. That would also be a good opportunity to review existing usages of getDataByLabel and see where getStoredVectorDataByLabel would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
|
||
if (needs_copy) { | ||
auto aligned_mem = this->allocator->allocate_aligned(blob_bytes_count, this->alignment); | ||
// TODO: handle original_blob_size != processed_bytes_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When? Is there something that will fail a PR if we forget to support it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASAP. Reading beyond an allocated memory is an undefined behaviour that potentially can lead to a crash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
src/VecSim/vec_sim_tiered_index.h
Outdated
@@ -110,6 +110,9 @@ class VecSimTieredIndex : public VecSimIndexInterface { | |||
inline VecSimIndexAbstract<DataType, DistType> *getFlatBufferIndex() { | |||
return this->frontendIndex; | |||
} | |||
inline BruteForceIndex<DataType, DistType> *getFlatBufferIndexAsBruteForce() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider changing getFlatBufferIndex
to return BruteForceIndex
insread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense & done
VecSimQueryParams *queryParams, std::shared_ptr<VecSimAllocator> allocator) | ||
: VecSimBatchIterator(query_vector, queryParams ? queryParams->timeoutCtx : nullptr, | ||
// Tiered batch iterator doesn't hold its own copy of the query vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that? Please extend the document here since it is not trivial..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
currently testing get_stored_vector_data_single_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes an issue with vector blob copying by switching from a size computed as (dim * sizeof(DataType)) to the correct size provided by getDataSize(), thereby ensuring that additional metadata (e.g. norms for int8 vectors with cosine similarity) is correctly copied. It also removes redundant memory allocation in the batch iterator and introduces new test utilities for retrieving the full stored vector data.
Reviewed Changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
tests/utils/tests_utils.h | Adds a helper to populate float vectors for testing. |
tests/unit/test_int8.cpp | Updates random vector population and adjusts float comparison in search tests. |
tests/unit/test_index_test_utils.cpp | Improves test utilities for vector data verification. |
tests/unit/test_bruteforce.cpp | Applies const-correctness to getDataByInternalId calls. |
src/VecSim/vec_sim_tiered_index.h | Updates getFlatBufferIndex to return a BruteForceIndex pointer. |
src/VecSim/vec_sim_index.h | Refactors preprocessQuery to incorporate a force_copy parameter. |
src/VecSim/spaces/computer/* | Propagates force_copy changes and adds TODOs regarding blob size mismatches. |
src/VecSim/algorithms/* | Updates vector copy logic to copy only the vector elements where appropriate and adds methods for retrieving full stored data. |
Files not reviewed (1)
- tests/unit/CMakeLists.txt: Language not supported
this->frontendIndex->getDim() * sizeof(DataType)); | ||
size_t data_size = this->frontendIndex->getDataSize(); | ||
auto blob_copy = this->getAllocator()->allocate_unique(data_size); | ||
// Assuming the size of the blob stored in the frontend index matches the size of the blob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider adding an assertion or runtime check to verify that the blob size from the frontend index (getDataSize()) matches the expected size in the HNSW index. This extra check would help catch unexpected mismatches early.
Copilot uses AI. Check for mistakes.
// TODO: handle original_blob_size != processed_bytes_count | ||
memcpy(storage_blob, original_blob, processed_bytes_count); | ||
} else if (query_blob == nullptr) { | ||
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment); | ||
// TODO: handle original_blob_size != processed_bytes_count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a check or assertion to ensure that the original blob size is at least as large as processed_bytes_count before calling memcpy. This can help prevent potential out-of-bounds reads if the assumption is violated.
// TODO: handle original_blob_size != processed_bytes_count | |
memcpy(storage_blob, original_blob, processed_bytes_count); | |
} else if (query_blob == nullptr) { | |
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment); | |
// TODO: handle original_blob_size != processed_bytes_count | |
assert(original_blob_size >= processed_bytes_count && "original_blob is too small"); | |
memcpy(storage_blob, original_blob, processed_bytes_count); | |
} else if (query_blob == nullptr) { | |
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment); | |
assert(original_blob_size >= processed_bytes_count && "original_blob is too small"); |
Copilot uses AI. Check for mistakes.
virtual void fitMemory() {}; | ||
void fitMemory() override {} | ||
std::vector<std::vector<char>> getStoredVectorDataByLabel(labelType label) const override { | ||
assert(nullptr && "Not implemented"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] If getStoredVectorDataByLabel in SVSIndex is likely to be invoked during tests or in production, consider providing a safe default implementation instead of an assert, or clearly document that it should not be used.
assert(nullptr && "Not implemented"); | |
// This method is not implemented. Returning an empty vector as a safe default. |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done!
Lets not forget the renaming of the getData
helpers
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.0
git worktree add -d .worktree/backport-665-to-8.0 origin/8.0
cd .worktree/backport-665-to-8.0
git switch --create backport-665-to-8.0
git cherry-pick -x 98cf0a1174a6ffdd91e864aa6e29a11be90baf6d |
* add size_t bg_vector_indexing_count; size_t main_vector_indexing_count; * fix datasize in executeInsertJob * disable neighbours print * replace manual blob size calcaulations format comment out new tests * INT8Test::PopulateRandomVector increases seed every call so we will get different vectors. this required a change in metrics_test because now the score is not exactly 0. VecSimIndexAbstract CastToBruteForce()-> rename to GetFlatBufferIndex add BruteForceIndex CastToBruteForce() add getFlatBufferIndexAsBruteForce to tiered index to get BruteForceIndex functions added to hnsw and BF: Formalize getDataByLabel definition: populate the intput empty vector with the vectors data,not including any additional meta data that might present in the vector storage like the norm. add getStoredVectorDataByLabel returns a vector of char vectors. each vector contains the vector as it is stored in the vectors container, including the meta data if exists. * build test with VERBOSE=1 * verbose cov * verbose cov fix assert in FP32SpacesOptimizationTest::FP32InnerProductTest * fix * add force_copy to maybeCopyToAlignedMem and preprocessQuery use preprocessQuery with force_copy in batch iterator instead on preprocessQueryInPlace * also batch in svs * revert redundancies * tiered batch iterator doesn't copy the blob * add // TODO: handle original_blob_size != processed_bytes_count * revert pull request changes * move getDataByLabel and getStoredVectorDataByLabel to VecSimIndexAbstract * add test_index_test_utils currently testing get_stored_vector_data_single_test * getFlatBufferIndex returns BruteForceIndex * little test name fix * small fix * fix range (cherry picked from commit 98cf0a1)
[MOD-9557] Fix incorrect vector blob size calculation (#665) * add size_t bg_vector_indexing_count; size_t main_vector_indexing_count; * fix datasize in executeInsertJob * disable neighbours print * replace manual blob size calcaulations format comment out new tests * INT8Test::PopulateRandomVector increases seed every call so we will get different vectors. this required a change in metrics_test because now the score is not exactly 0. VecSimIndexAbstract CastToBruteForce()-> rename to GetFlatBufferIndex add BruteForceIndex CastToBruteForce() add getFlatBufferIndexAsBruteForce to tiered index to get BruteForceIndex functions added to hnsw and BF: Formalize getDataByLabel definition: populate the intput empty vector with the vectors data,not including any additional meta data that might present in the vector storage like the norm. add getStoredVectorDataByLabel returns a vector of char vectors. each vector contains the vector as it is stored in the vectors container, including the meta data if exists. * build test with VERBOSE=1 * verbose cov * verbose cov fix assert in FP32SpacesOptimizationTest::FP32InnerProductTest * fix * add force_copy to maybeCopyToAlignedMem and preprocessQuery use preprocessQuery with force_copy in batch iterator instead on preprocessQueryInPlace * also batch in svs * revert redundancies * tiered batch iterator doesn't copy the blob * add // TODO: handle original_blob_size != processed_bytes_count * revert pull request changes * move getDataByLabel and getStoredVectorDataByLabel to VecSimIndexAbstract * add test_index_test_utils currently testing get_stored_vector_data_single_test * getFlatBufferIndex returns BruteForceIndex * little test name fix * small fix * fix range (cherry picked from commit 98cf0a1)
Description:
This PR fixes a bug where vector data was copied using an incorrect size, leading to truncated vectors and corrupted distance calculations in certain configurations.
Bug Summary:
Previously, when copying a vector blob, we computed the size manually using:
memcpy(blob_copy.get(), this->frontendIndex->getDataByInternalId(job->id), this->frontendIndex->getDim() * sizeof(DataType));
This approach ignores additional data stored alongside the vector. In particular, when using
int8
withcosine
distance, afloat
-typed norm (4 bytes) is appended to the end of each vector. By not including this in the copy, the norm was lost.This led to:
-inf
similarity scores due to division by invalid norms.Fix:
We now use the correct size returned by
getDataSize()
for the copy:Miscellaneous Improvements
Eliminated unnecessary memory allocation and copying in
TieredHNSW_BatchIterator
TieredHNSW_BatchIterator
allocated memory and copied the original query blob, but never used this copy directly.flat_iterator
andhnsw_iterator
) create their own copies:flat_iterator
copy is created duringTieredHNSW_BatchIterator
constructionTieredHNSW_BatchIterator::getNextResults()
is called andhnsw_iterator
is not initialized, it takes the blob fromflat_iterator
Test utilities
getStoredVectorDataByLabel
Method: Introduced a new method inBruteForceIndex
,HNSWIndex
, and their derived classes. This new method retrieves the full stored vector, including appended metadata (e.g., norms).getDataByLabel
method, which retrieves only the vector elements.📝 Note:
This fix also reveals a bug in vector preprocessing where the code assumes the input blob and processed blob have the same size:
This is problematic because:
processed_bytes_count
includes space for the normoriginal_blob
might only contain the vector elements(dim * sizeof(DataType))
memcpy
reads.This issue will be fully addressed in a follow-up PR, see MOD-9576.
In this PR, preliminary steps have been taken:
TODO added: A comment has been placed in the preprocessor logic to flag the need for handling size mismatches:
// TODO: handle original_blob_size != processed_bytes_count
🔄 Batch Iterator Modification:
To eliminate duplicated, potentially unsafe copy logic and ensure consistent handling of blob size and memory allocation, the responsibility for copying the query blob has been delegated from the batch iterator to the preprocessor.
Previously, the batch iterator manually allocated memory and copied the input before invoking
preprocessQueryInPlace()
:This has now been replaced with a call to the existing
preprocessQuery()
method withforce_copy = true
: