Skip to content

[MOD-9576] Fix memory safety in cosine preprocessing #670

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 27, 2025

Conversation

meiravgri
Copy link
Collaborator

@meiravgri meiravgri commented May 8, 2025

This PR fixes a memory safety issue in our vector preprocessing pipeline, particularly affecting int8_t and uint8_t vectors with cosine similarity.

The Bug

Previously, CosinePreprocessor copied processed_bytes_count bytes from the source blob.
For int8_t / uint8_t + cosine similarity that value is input bytes + 4 (extra norm), so

memcpy(blob, original_blob, processed_bytes_count);

could touch uninitialised memory—undefined behaviour.

Key Changes

Preprocessors API

  • Blob‑size propagation – each preprocess* function now receives input_blob_size by reference and can write back the new size after it reallocates or extends the blob.
  • Fixed destination size – processed_bytes_count is stored once in the cosine‑preprocessor constructor.
    It is not taken from the preprocess* arguments, so we no longer rely on processed_bytes_count == input_blob_size
  • Added assertions to ensure any pre‑allocated storage_blob or query_blob already equals processed_bytes_count, enabling safe in‑place processing; dynamic resizing wasn’t implemented now to avoid added complexity and performance costs—if needed in the future, these checks can be removed and proper allocation logic added.

Factory

  • Added processed_bytes_count to PreprocessorsContainerParams.
  • Converted CreatePreprocessorsContainerParams from a regular function to a template function to properly handle type-specific size calculations

Impact

  • Eliminates out‑of‑bounds reads for int8_t / uint8_t cosine vectors.
  • API change: downstream code must propagate the (possibly updated) input_blob_size.
  • No functional change for other datatype‑metric pairs.
  • Negligible perf hit: at most 4 bytes of extra memset.

…er file

add logic to determine the cosine processed_bytes_count

add processed_bytes_count to the cosine component

change blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions.
Copy link

codecov bot commented May 8, 2025

Codecov Report

Attention: Patch coverage is 89.36170% with 5 lines in your changes missing coverage. Please review.

Project coverage is 96.29%. Comparing base (fbff84f) to head (d529f5e).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
.../VecSim/spaces/computer/preprocessor_container.cpp 75.00% 2 Missing ⚠️
src/VecSim/spaces/computer/preprocessors.h 89.47% 2 Missing ⚠️
...rc/VecSim/spaces/computer/preprocessor_container.h 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   96.24%   96.29%   +0.05%     
==========================================
  Files         112      111       -1     
  Lines        6278     6288      +10     
==========================================
+ Hits         6042     6055      +13     
+ Misses        236      233       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@meiravgri meiravgri changed the title make CreatePreprocessorsContainerParams templated and move it to header file [MOD-9576] make CreatePreprocessorsContainerParams templated and move it to header file May 11, 2025
meiravgri added 5 commits May 11, 2025 16:35
add DummyChangeAllocSizePreprocessor class to test pp that changes the size

add tests
one will fail because the input blob size is not updated

for (auto pp : preprocessors) {
if (!pp)
break;
// modifies the memory in place
pp->preprocessStorageInPlace(blob, processed_bytes_count);
pp->preprocessStorageInPlace(blob, input_blob_size);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will be covered by svs code

meiravgri added 3 commits May 17, 2025 04:50
…tes_count

add valgrind macro

value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int
@meiravgri meiravgri changed the title [MOD-9576] make CreatePreprocessorsContainerParams templated and move it to header file [MOD-9576] Fix memory safety in cosine preprocessing May 18, 2025
@meiravgri meiravgri requested review from alonre24, GuyAv46, Copilot and dor-forer and removed request for GuyAv46 May 18, 2025 12:23
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the preprocessor pipeline API to propagate and fix the blob size across all stages, and it ensures safe in-place processing for cosine similarity by storing the expected processed_bytes_count. It also updates the factory to calculate and pass the new size, and adds a Valgrind build option.

  • Switch from a processed_bytes_count value to a size_t &input_blob_size reference throughout all preprocessors.
  • Store processed_bytes_count in CosinePreprocessor and add assertions to guard in-place processing.
  • Update factories to compute and forward the correct blob size; add USE_VALGRIND option in tests.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/unit/test_hnsw_tiered.cpp Adapted test preprocessor implementations to new input_blob_size API
tests/unit/CMakeLists.txt Added USE_VALGRIND CMake option
src/VecSim/spaces/computer/preprocessors.h Changed preprocess interfaces to use size reference; updated CosinePreprocessor
src/VecSim/spaces/computer/preprocessor_container.h Propagate size reference through container API
src/VecSim/spaces/computer/preprocessor_container.cpp Updated container implementation to use new size
src/VecSim/index_factories/components/preprocessors_factory.h Added processed_bytes_count to params
src/VecSim/index_factories/components/components_factory.h Created templated CreatePreprocessorsContainerParams to compute new size
src/VecSim/CMakeLists.txt Removed deprecated components_factory.cpp from build
Makefile Enable USE_VALGRIND flag when requested
Comments suppressed due to low confidence (1)

src/VecSim/spaces/computer/preprocessors.h:61

  • The both-null case (when storage_blob and query_blob are both initially nullptr) is never entered because it's inside the if (storage_blob != query_blob) block. You need an else branch matching the original implementation to handle allocating and splitting a single shared nullptr blob.
if (storage_blob != query_blob) {

Comment on lines +65 to +68
memcpy(storage_blob, original_blob, input_blob_size);
} else if (query_blob == nullptr) {
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
// TODO: handle original_blob_size != processed_bytes_count
memcpy(query_blob, original_blob, processed_bytes_count);
memcpy(query_blob, original_blob, input_blob_size);
Copy link
Preview

Copilot AI May 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You allocate processed_bytes_count bytes but only copy input_blob_size bytes, leaving the extra norm space uninitialized. Consider initializing or writing the norm value explicitly before use to avoid undefined data in the trailing bytes.

Copilot uses AI. Check for mistakes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we assert that input_blob_size==processed_bytes_count. allocating input_blob_size would be more readable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation must use processed_bytes_count, since it's the actual size required (e.g., includes norm for int8/uint8 with cosine). The assertion is there to verify that if a blob was already allocated by a previous preprocessor, its size matches what we expect — ensuring it’s safe to process in-place. Dynamic resizing isn't supported now to avoid performance costs, and no current path needs it.

ASSERT_EQ(((const float *)storage_blob)[0], initial_value + value_to_add_storage);
ASSERT_EQ(((const float *)query_blob)[0], initial_value + value_to_add_query);
// Both blobs were processed.
ASSERT_EQ(((const int8_t *)storage_blob)[0], original_blob[0] + value_to_add_storage);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use static_cast so it will be more clear

Comment on lines 967 to 991
TEST(PreprocessorsTest, cosine_then_change_size) {
// cosine (not changing)
// pp that changes the blob size
}

TEST(PreprocessorsTest, cosine_change_then_pp_change) {
// cosine ( changing)
// pp that also changes the blob size
}

// TEST(PreprocessorsTest, multiPPContainerMixedThenCosinePreprocess) {
// // add cosine pp that changes the original blob size
// // add a pp that preprocesses the normalized blob (same size)
// // add a pp that changes the storage_blob size, but not changing the query_blob size
// }

// TEST(PreprocessorsTest, multiPPContainerMixedThenCosinePreprocess) {
// // add a pp that changes the storage_blob size, but not changing the query_blob size
// // add a pp that preprocesses the normalized blob (same size)
// // add cosine pp that changes the original blob size
// }

// TEST(PreprocessorsTest, multiPPContainerMixedThenCosinePreprocess) {
// // pp multi container where cosine is only needed for the query blob (not supported yet)
// }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thee is no need in these?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftovers

// normalize the original blob
VecSim_Normalize(expected_processed_blob, elements, VecSimType_INT8);
// add the values of the pp that increases the blob size
unsigned char added_value = DummyChangeAllocSizePreprocessor<int8_t>::getExcessValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is added_value an unsigned char and not an int8?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the excess value of DummyChangeAllocSizePreprocessor is always unsigned char, regardless of the class DataType, to represent a "one byte" entry.
Is there any reason that is should be int8?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because you are adding it to an int8 array (expected_processed_blob), and it performs an implicit conversion

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add an explicit cast
The purpose of this is to fill the extra space with some values to be validated in CompareVectors

Comment on lines +65 to +68
memcpy(storage_blob, original_blob, input_blob_size);
} else if (query_blob == nullptr) {
query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
// TODO: handle original_blob_size != processed_bytes_count
memcpy(query_blob, original_blob, processed_bytes_count);
memcpy(query_blob, original_blob, input_blob_size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we assert that input_blob_size==processed_bytes_count. allocating input_blob_size would be more readable

normalize_func(blob, this->dim);
}

private:
spaces::normalizeVector_f<DataType> normalize_func;
const size_t dim;
const size_t processed_bytes_count;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is here only for the assertion that the input blob matches the expected size right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above

@meiravgri meiravgri added this pull request to the merge queue May 27, 2025
Merged via the queue into main with commit 0929890 May 27, 2025
22 of 23 checks passed
@meiravgri meiravgri deleted the meiravg_fix_blob_copy_size branch May 27, 2025 13:00
Copy link

Backport failed for 8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.0
git worktree add -d .worktree/backport-670-to-8.0 origin/8.0
cd .worktree/backport-670-to-8.0
git switch --create backport-670-to-8.0
git cherry-pick -x 0929890d4fc1d4899443a2784045fa8febd472ae

meiravgri added a commit that referenced this pull request May 27, 2025
* make CreatePreprocessorsContainerParams templated and move it to header file

add logic to determine the cosine processed_bytes_count

add processed_bytes_count to the cosine component

change blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions.

* plan for the tests

* rename original_blob_size-> input_blob_size

add DummyChangeAllocSizePreprocessor class to test pp that changes the size

add tests
one will fail because the input blob size is not updated

* preprocessors now change the blob size

* fix test

* fix tiered test

* add assert storage_blob == nullptr || input_blob_size == processed_bytes_count

add valgrind macro

value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int

* enable assert only in debug

* use constexpr for blob size

* small docs changes

* review fixes

* ש

(cherry picked from commit 0929890)
github-merge-queue bot pushed a commit that referenced this pull request May 27, 2025
[MOD-9576] Fix memory safety in cosine preprocessing (#670)

* make CreatePreprocessorsContainerParams templated and move it to header file

add logic to determine the cosine processed_bytes_count

add processed_bytes_count to the cosine component

change blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions.

* plan for the tests

* rename original_blob_size-> input_blob_size

add DummyChangeAllocSizePreprocessor class to test pp that changes the size

add tests
one will fail because the input blob size is not updated

* preprocessors now change the blob size

* fix test

* fix tiered test

* add assert storage_blob == nullptr || input_blob_size == processed_bytes_count

add valgrind macro

value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int

* enable assert only in debug

* use constexpr for blob size

* small docs changes

* review fixes

* ש

(cherry picked from commit 0929890)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants