Skip to content

[MOD-9557] Fix incorrect vector blob size calculation #665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
May 7, 2025

Conversation

meiravgri
Copy link
Collaborator

@meiravgri meiravgri commented Apr 27, 2025

Description:

This PR fixes a bug where vector data was copied using an incorrect size, leading to truncated vectors and corrupted distance calculations in certain configurations.

Bug Summary:

Previously, when copying a vector blob, we computed the size manually using:

memcpy(blob_copy.get(), this->frontendIndex->getDataByInternalId(job->id),
       this->frontendIndex->getDim() * sizeof(DataType));

This approach ignores additional data stored alongside the vector. In particular, when using int8 with cosine distance, a float-typed norm (4 bytes) is appended to the end of each vector. By not including this in the copy, the norm was lost.

This led to:

  • Invalid norm values (0 or garbage) during distance computation.
  • -inf similarity scores due to division by invalid norms.
  • Many vectors selecting the same neighbors.
  • Lock contention on shared nodes in multithreaded environments, significantly impacting performance.

Fix:

We now use the correct size returned by getDataSize() for the copy:

auto blob_copy = this->getAllocator()->allocate_unique(this->frontendIndex->getDataSize());

memcpy(blob_copy.get(), this->frontendIndex->getDataByInternalId(job->id),
       this->frontendIndex->getDataSize());

Miscellaneous Improvements

Eliminated unnecessary memory allocation and copying in TieredHNSW_BatchIterator

  • Previously, TieredHNSW_BatchIterator allocated memory and copied the original query blob, but never used this copy directly.
  • The internal batch iterators (flat_iterator and hnsw_iterator) create their own copies:
    • flat_iterator copy is created during TieredHNSW_BatchIterator construction
    • When TieredHNSW_BatchIterator::getNextResults() is called and hnsw_iterator is not initialized, it takes the blob from flat_iterator

Test utilities

  • Added getStoredVectorDataByLabel Method: Introduced a new method in BruteForceIndex, HNSWIndex, and their derived classes. This new method retrieves the full stored vector, including appended metadata (e.g., norms).
  • This complements the existing getDataByLabel method, which retrieves only the vector elements.

📝 Note:

This fix also reveals a bug in vector preprocessing where the code assumes the input blob and processed blob have the same size:

storage_blob = this->allocator->allocate(processed_bytes_count);
memcpy(storage_blob, original_blob, processed_bytes_count);

This is problematic because:

  • For int8 vectors with cosine similarity, processed_bytes_count includes space for the norm
  • But original_blob might only contain the vector elements (dim * sizeof(DataType))
  • This may result in out-of-bounds memcpy reads.

This issue will be fully addressed in a follow-up PR, see MOD-9576.

In this PR, preliminary steps have been taken:

  1. TODO added: A comment has been placed in the preprocessor logic to flag the need for handling size mismatches:
    // TODO: handle original_blob_size != processed_bytes_count

  2. 🔄 Batch Iterator Modification:

To eliminate duplicated, potentially unsafe copy logic and ensure consistent handling of blob size and memory allocation, the responsibility for copying the query blob has been delegated from the batch iterator to the preprocessor.

Previously, the batch iterator manually allocated memory and copied the input before invoking preprocessQueryInPlace():

auto *queryBlobCopy =
    this->allocator->allocate_aligned(this->dataSize, this->preprocessors->getAlignment());
memcpy(queryBlobCopy, queryBlob, this->dim * sizeof(DataType));
this->preprocessQueryInPlace(queryBlobCopy);

This has now been replaced with a call to the existing preprocessQuery() method with force_copy = true:

auto queryBlobCopy = this->preprocessQuery(queryBlob, true);
auto *queryBlobCopyPtr = queryBlobCopy.release();  // Take ownership

Copy link

codecov bot commented Apr 28, 2025

Codecov Report

Attention: Patch coverage is 91.25000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 96.24%. Comparing base (b22e5c2) to head (1747911).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/VecSim/algorithms/svs/svs.h 60.00% 4 Missing ⚠️
src/VecSim/algorithms/hnsw/hnsw_multi.h 85.71% 2 Missing ⚠️
.../VecSim/algorithms/brute_force/brute_force_multi.h 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #665      +/-   ##
==========================================
- Coverage   96.34%   96.24%   -0.11%     
==========================================
  Files         112      112              
  Lines        6239     6278      +39     
==========================================
+ Hits         6011     6042      +31     
- Misses        228      236       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

increases seed every call so we will get different vectors.
this required a change in metrics_test because now the score is not exactly 0.

VecSimIndexAbstract CastToBruteForce()-> rename to GetFlatBufferIndex

add BruteForceIndex CastToBruteForce()

add getFlatBufferIndexAsBruteForce to tiered index to get BruteForceIndex

functions added to hnsw and BF:
Formalize getDataByLabel definition: populate the intput empty vector with the vectors data,not including any additional meta data that might present in the vector storage like the norm.
add getStoredVectorDataByLabel returns a vector of char vectors. each vector contains the vector as it is stored in the vectors container, including the meta data if exists.
assert in FP32SpacesOptimizationTest::FP32InnerProductTest
use preprocessQuery with force_copy in batch iterator instead on preprocessQueryInPlace
                // TODO: handle original_blob_size != processed_bytes_count
@meiravgri meiravgri requested a review from Copilot May 4, 2025 13:13
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses several improvements in testing, data processing, and memory management for vector similarity indexes. Key changes include:

  • Updates to test cases in both INT8 and BruteForce tests to ensure const correctness and more precise floating‑point comparisons.
  • Modifications to index and preprocessor APIs to allow forced data copying during query preprocessing, as well as the addition of helper functions for retrieving complete stored vector data.
  • Minor documentation and spelling corrections across multiple modules.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/test_int8.cpp Adjusted random vector generation and improved float assertions.
tests/unit/test_bruteforce.cpp Introduced const correctness when fetching internal vector data.
src/VecSim/vec_sim_tiered_index.h Added new helper to access the brute-force view of the flat buffer.
src/VecSim/vec_sim_index.h Updated preprocessQuery signature to support forced copying.
src/VecSim/spaces/computer/preprocessors.h Added TODO comments regarding blob size mismatches.
src/VecSim/spaces/computer/preprocessor_container.h Updated method signatures to support a force_copy option.
src/VecSim/spaces/computer/preprocessor_container.cpp Updated memory copy logic with force_copy and added TODO comments.
src/VecSim/algorithms/svs/svs.h Modified query blob handling in newBatchIterator for forced copying.
src/VecSim/algorithms/hnsw/*.h Improved vector data copying to exclude metadata and fixed spelling.
src/VecSim/algorithms/brute_force/*.h Updated tests to return complete vector data with const correctness.
.github/workflows/event-pull_request.yml Minor updates to build flags and commented out an unused workflow dependency.
Comments suppressed due to low confidence (1)

tests/unit/test_int8.cpp:396

  • [nitpick] Verify that the tolerance value (1e-6) used in ASSERT_NEAR is appropriate for the expected precision of floating‑point computations in this test scenario.
        ASSERT_NEAR(score, expected_score, 1e-6f) << "failed at vector id:" << id;

@@ -55,9 +55,11 @@ class CosinePreprocessor : public PreprocessorInterface {
// If one of them is null, allocate memory for it and copy the original_blob to it.
if (storage_blob == nullptr) {
storage_blob = this->allocator->allocate(processed_bytes_count);
// TODO: handle original_blob_size != processed_bytes_count
Copy link
Preview

Copilot AI May 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider elaborating on how to handle cases where the original blob size differs from processed_bytes_count to improve future maintainability.

Copilot uses AI. Check for mistakes.

this->allocator->allocate_aligned(this->dataSize, this->preprocessors->getAlignment());
memcpy(queryBlobCopy, queryBlob, this->dim * sizeof(DataType));
this->preprocessQueryInPlace(queryBlobCopy);
// force_copy == true.
Copy link
Preview

Copilot AI May 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Clarify in documentation the rationale for forcing a copy (force_copy == true) of the query blob in the newBatchIterator to make the memory ownership semantics explicit.

Suggested change
// force_copy == true.
// force_copy == true. Forcing a copy ensures that the VecSimBatchIterator takes ownership
// of the query blob's memory and is responsible for freeing it. This avoids potential
// memory management issues and makes ownership semantics explicit.

Copilot uses AI. Check for mistakes.

@meiravgri meiravgri changed the title Meiravg_investigate_int8_anamaly [MOD-9557] Fix blob allocations size May 4, 2025
@meiravgri meiravgri changed the title [MOD-9557] Fix blob allocations size [MOD-9557] Fix incorrect vector blob size calculation May 4, 2025
@meiravgri meiravgri requested review from GuyAv46 and alonre24 May 4, 2025 14:27
Copy link
Collaborator

@GuyAv46 GuyAv46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch and nice tests!

* @return A vector containing the complete vector data (elements + metadata) for the given
* label
*/
virtual std::vector<std::vector<char>> getStoredVectorDataByLabel(labelType label) const = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can we move the two helpers under VecSimIndexAbstract (test only), so we don't need to redefine it in HNSW?
  2. Consider renaming - I'd expect getDataByLabel to return all the data. Maybe getDataByLabel and getVectorByLabel, or something similar

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agreed — getDataByInternalId should also be unified, as it's currently declared separately in each algorithm with different return types for BruteForce and HNSW.
  2. I also agree that getDataByLabel isn’t a very intuitive name. getStoredVectorDataByLabel is more descriptive for the full blob, including metadata like norms.

That said, to avoid overloading in this PR, I suggest we address the renaming and unification in a separate follow-up. That would also be a good opportunity to review existing usages of getDataByLabel and see where getStoredVectorDataByLabel would be more appropriate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved


if (needs_copy) {
auto aligned_mem = this->allocator->allocate_aligned(blob_bytes_count, this->alignment);
// TODO: handle original_blob_size != processed_bytes_count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When? Is there something that will fail a PR if we forget to support it?

Copy link
Collaborator Author

@meiravgri meiravgri May 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASAP. Reading beyond an allocated memory is an undefined behaviour that potentially can lead to a crash

Copy link
Collaborator

@alonre24 alonre24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -110,6 +110,9 @@ class VecSimTieredIndex : public VecSimIndexInterface {
inline VecSimIndexAbstract<DataType, DistType> *getFlatBufferIndex() {
return this->frontendIndex;
}
inline BruteForceIndex<DataType, DistType> *getFlatBufferIndexAsBruteForce() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider changing getFlatBufferIndex to return BruteForceIndex insread

Copy link
Collaborator Author

@meiravgri meiravgri May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense & done

VecSimQueryParams *queryParams, std::shared_ptr<VecSimAllocator> allocator)
: VecSimBatchIterator(query_vector, queryParams ? queryParams->timeoutCtx : nullptr,
// Tiered batch iterator doesn't hold its own copy of the query vector.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? Please extend the document here since it is not trivial..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

meiravgri added 2 commits May 6, 2025 12:42
currently testing get_stored_vector_data_single_test
@meiravgri meiravgri requested review from alonre24, GuyAv46 and Copilot May 6, 2025 12:52
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue with vector blob copying by switching from a size computed as (dim * sizeof(DataType)) to the correct size provided by getDataSize(), thereby ensuring that additional metadata (e.g. norms for int8 vectors with cosine similarity) is correctly copied. It also removes redundant memory allocation in the batch iterator and introduces new test utilities for retrieving the full stored vector data.

Reviewed Changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/utils/tests_utils.h Adds a helper to populate float vectors for testing.
tests/unit/test_int8.cpp Updates random vector population and adjusts float comparison in search tests.
tests/unit/test_index_test_utils.cpp Improves test utilities for vector data verification.
tests/unit/test_bruteforce.cpp Applies const-correctness to getDataByInternalId calls.
src/VecSim/vec_sim_tiered_index.h Updates getFlatBufferIndex to return a BruteForceIndex pointer.
src/VecSim/vec_sim_index.h Refactors preprocessQuery to incorporate a force_copy parameter.
src/VecSim/spaces/computer/* Propagates force_copy changes and adds TODOs regarding blob size mismatches.
src/VecSim/algorithms/* Updates vector copy logic to copy only the vector elements where appropriate and adds methods for retrieving full stored data.
Files not reviewed (1)
  • tests/unit/CMakeLists.txt: Language not supported

this->frontendIndex->getDim() * sizeof(DataType));
size_t data_size = this->frontendIndex->getDataSize();
auto blob_copy = this->getAllocator()->allocate_unique(data_size);
// Assuming the size of the blob stored in the frontend index matches the size of the blob
Copy link
Preview

Copilot AI May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider adding an assertion or runtime check to verify that the blob size from the frontend index (getDataSize()) matches the expected size in the HNSW index. This extra check would help catch unexpected mismatches early.

Copilot uses AI. Check for mistakes.

Comment on lines +62 to +66
// TODO: handle original_blob_size != processed_bytes_count
memcpy(storage_blob, original_blob, processed_bytes_count);
} else if (query_blob == nullptr) {
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
// TODO: handle original_blob_size != processed_bytes_count
Copy link
Preview

Copilot AI May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a check or assertion to ensure that the original blob size is at least as large as processed_bytes_count before calling memcpy. This can help prevent potential out-of-bounds reads if the assumption is violated.

Suggested change
// TODO: handle original_blob_size != processed_bytes_count
memcpy(storage_blob, original_blob, processed_bytes_count);
} else if (query_blob == nullptr) {
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
// TODO: handle original_blob_size != processed_bytes_count
assert(original_blob_size >= processed_bytes_count && "original_blob is too small");
memcpy(storage_blob, original_blob, processed_bytes_count);
} else if (query_blob == nullptr) {
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
assert(original_blob_size >= processed_bytes_count && "original_blob is too small");

Copilot uses AI. Check for mistakes.

virtual void fitMemory() {};
void fitMemory() override {}
std::vector<std::vector<char>> getStoredVectorDataByLabel(labelType label) const override {
assert(nullptr && "Not implemented");
Copy link
Preview

Copilot AI May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] If getStoredVectorDataByLabel in SVSIndex is likely to be invoked during tests or in production, consider providing a safe default implementation instead of an assert, or clearly document that it should not be used.

Suggested change
assert(nullptr && "Not implemented");
// This method is not implemented. Returning an empty vector as a safe default.

Copilot uses AI. Check for mistakes.

GuyAv46
GuyAv46 previously approved these changes May 6, 2025
Copy link
Collaborator

@GuyAv46 GuyAv46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done!
Lets not forget the renaming of the getData helpers

alonre24
alonre24 previously approved these changes May 6, 2025
@meiravgri meiravgri added this pull request to the merge queue May 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 6, 2025
@meiravgri meiravgri dismissed stale reviews from alonre24 and GuyAv46 via 1747911 May 7, 2025 04:42
@meiravgri meiravgri requested review from GuyAv46 and alonre24 May 7, 2025 04:44
@meiravgri meiravgri enabled auto-merge May 7, 2025 04:45
@meiravgri meiravgri added this pull request to the merge queue May 7, 2025
Merged via the queue into main with commit 98cf0a1 May 7, 2025
15 of 17 checks passed
@meiravgri meiravgri deleted the meiravg_investigate_int8_anamaly branch May 7, 2025 06:55
Copy link

github-actions bot commented May 7, 2025

Backport failed for 8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.0
git worktree add -d .worktree/backport-665-to-8.0 origin/8.0
cd .worktree/backport-665-to-8.0
git switch --create backport-665-to-8.0
git cherry-pick -x 98cf0a1174a6ffdd91e864aa6e29a11be90baf6d

meiravgri added a commit that referenced this pull request May 11, 2025
* add     size_t bg_vector_indexing_count;
    size_t main_vector_indexing_count;

* fix datasize in executeInsertJob

* disable neighbours print

* replace manual blob size calcaulations

format

comment out new tests

* INT8Test::PopulateRandomVector
increases seed every call so we will get different vectors.
this required a change in metrics_test because now the score is not exactly 0.

VecSimIndexAbstract CastToBruteForce()-> rename to GetFlatBufferIndex

add BruteForceIndex CastToBruteForce()

add getFlatBufferIndexAsBruteForce to tiered index to get BruteForceIndex

functions added to hnsw and BF:
Formalize getDataByLabel definition: populate the intput empty vector with the vectors data,not including any additional meta data that might present in the vector storage like the norm.
add getStoredVectorDataByLabel returns a vector of char vectors. each vector contains the vector as it is stored in the vectors container, including the meta data if exists.

* build  test with VERBOSE=1

* verbose cov

* verbose cov fix

assert in FP32SpacesOptimizationTest::FP32InnerProductTest

* fix

* add force_copy to maybeCopyToAlignedMem and preprocessQuery

use preprocessQuery with force_copy in batch iterator instead on preprocessQueryInPlace

* also batch in svs

* revert redundancies

* tiered batch iterator doesn't copy the blob

* add
                // TODO: handle original_blob_size != processed_bytes_count

* revert pull request changes

* move getDataByLabel and getStoredVectorDataByLabel to VecSimIndexAbstract

* add test_index_test_utils
currently testing get_stored_vector_data_single_test

* getFlatBufferIndex returns BruteForceIndex

* little test name fix

* small fix

* fix range

(cherry picked from commit 98cf0a1)
github-merge-queue bot pushed a commit that referenced this pull request May 11, 2025
[MOD-9557]  Fix incorrect vector blob size calculation (#665)

* add     size_t bg_vector_indexing_count;
    size_t main_vector_indexing_count;

* fix datasize in executeInsertJob

* disable neighbours print

* replace manual blob size calcaulations

format

comment out new tests

* INT8Test::PopulateRandomVector
increases seed every call so we will get different vectors.
this required a change in metrics_test because now the score is not exactly 0.

VecSimIndexAbstract CastToBruteForce()-> rename to GetFlatBufferIndex

add BruteForceIndex CastToBruteForce()

add getFlatBufferIndexAsBruteForce to tiered index to get BruteForceIndex

functions added to hnsw and BF:
Formalize getDataByLabel definition: populate the intput empty vector with the vectors data,not including any additional meta data that might present in the vector storage like the norm.
add getStoredVectorDataByLabel returns a vector of char vectors. each vector contains the vector as it is stored in the vectors container, including the meta data if exists.

* build  test with VERBOSE=1

* verbose cov

* verbose cov fix

assert in FP32SpacesOptimizationTest::FP32InnerProductTest

* fix

* add force_copy to maybeCopyToAlignedMem and preprocessQuery

use preprocessQuery with force_copy in batch iterator instead on preprocessQueryInPlace

* also batch in svs

* revert redundancies

* tiered batch iterator doesn't copy the blob

* add
                // TODO: handle original_blob_size != processed_bytes_count

* revert pull request changes

* move getDataByLabel and getStoredVectorDataByLabel to VecSimIndexAbstract

* add test_index_test_utils
currently testing get_stored_vector_data_single_test

* getFlatBufferIndex returns BruteForceIndex

* little test name fix

* small fix

* fix range

(cherry picked from commit 98cf0a1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants