[MOD-9576] Fix memory safety in cosine preprocessing #670
Conversation
make CreatePreprocessorsContainerParams templated and move it to the header file; add logic to determine the cosine processed_bytes_count; add processed_bytes_count to the cosine component; change the blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #670      +/-  ##
==========================================
+ Coverage   96.24%   96.29%   +0.05%
==========================================
  Files         112      111       -1
  Lines        6278     6288      +10
==========================================
+ Hits         6042     6055      +13
+ Misses        236      233       -3

☔ View full report in Codecov by Sentry.
add DummyChangeAllocSizePreprocessor class to test a preprocessor that changes the size; add tests (one will fail because the input blob size is not updated)
    for (auto pp : preprocessors) {
        if (!pp)
            break;
        // modifies the memory in place
-       pp->preprocessStorageInPlace(blob, processed_bytes_count);
+       pp->preprocessStorageInPlace(blob, input_blob_size);
This will be covered by the SVS code.
add assert storage_blob == nullptr || input_blob_size == processed_bytes_count; add Valgrind macro; value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int
Pull Request Overview
This PR refactors the preprocessor pipeline API to propagate and fix the blob size across all stages, and it ensures safe in-place processing for cosine similarity by storing the expected processed_bytes_count. It also updates the factory to calculate and pass the new size, and adds a Valgrind build option.
- Switch from a processed_bytes_count value to a size_t &input_blob_size reference throughout all preprocessors.
- Store processed_bytes_count in CosinePreprocessor and add assertions to guard in-place processing.
- Update factories to compute and forward the correct blob size; add USE_VALGRIND option in tests.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description
--- | ---
tests/unit/test_hnsw_tiered.cpp | Adapted test preprocessor implementations to new input_blob_size API
tests/unit/CMakeLists.txt | Added USE_VALGRIND CMake option
src/VecSim/spaces/computer/preprocessors.h | Changed preprocess interfaces to use size reference; updated CosinePreprocessor
src/VecSim/spaces/computer/preprocessor_container.h | Propagate size reference through container API
src/VecSim/spaces/computer/preprocessor_container.cpp | Updated container implementation to use new size
src/VecSim/index_factories/components/preprocessors_factory.h | Added processed_bytes_count to params
src/VecSim/index_factories/components/components_factory.h | Created templated CreatePreprocessorsContainerParams to compute new size
src/VecSim/CMakeLists.txt | Removed deprecated components_factory.cpp from build
Makefile | Enable USE_VALGRIND flag when requested
Comments suppressed due to low confidence (1)
src/VecSim/spaces/computer/preprocessors.h:61
- The both-null case (when storage_blob and query_blob are both initially nullptr) is never entered because it's inside the if (storage_blob != query_blob) block. You need an else branch matching the original implementation to handle allocating and splitting a single shared nullptr blob.
    if (storage_blob != query_blob) {
        memcpy(storage_blob, original_blob, input_blob_size);
    } else if (query_blob == nullptr) {
        query_blob = this->allocator->allocate_aligned(processed_bytes_count, alignment);
        // TODO: handle original_blob_size != processed_bytes_count
-       memcpy(query_blob, original_blob, processed_bytes_count);
+       memcpy(query_blob, original_blob, input_blob_size);
You allocate processed_bytes_count bytes but only copy input_blob_size bytes, leaving the extra norm space uninitialized. Consider initializing or writing the norm value explicitly before use to avoid undefined data in the trailing bytes.
Though we assert that input_blob_size == processed_bytes_count, allocating input_blob_size would be more readable.
The allocation must use processed_bytes_count, since it's the actual size required (e.g., it includes the norm for int8/uint8 with cosine). The assertion is there to verify that if a blob was already allocated by a previous preprocessor, its size matches what we expect, ensuring it's safe to process in place. Dynamic resizing isn't supported now to avoid performance costs, and no current path needs it.
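For context, here is a minimal sketch of the pattern under discussion, simplified to a single blob; only the allocate/copy sizes, the assertion, and the names processed_bytes_count and input_blob_size mirror the diff above, the rest is illustrative:

    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // Simplified sketch: the expected processed size is fixed at construction,
    // the in-place path is guarded by an assertion, and the new size is reported
    // back to the caller through the reference parameter.
    struct CosinePreprocessorSketch {
        size_t dim;
        size_t processed_bytes_count; // e.g. dim * sizeof(int8_t) + sizeof(float)

        void preprocessQuery(const void *original_blob, void *&query_blob,
                             size_t &input_blob_size) const {
            if (query_blob == nullptr) {
                // Allocate the full processed size, which may exceed the input size.
                query_blob = std::malloc(processed_bytes_count);
                // Copy only the bytes the caller actually owns.
                std::memcpy(query_blob, original_blob, input_blob_size);
            } else {
                // In-place processing is only safe if the existing blob already
                // has the expected size; dynamic resizing is not supported.
                assert(input_blob_size == processed_bytes_count);
            }
            // ... normalize query_blob and write the norm into the trailing bytes ...
            input_blob_size = processed_bytes_count; // report the new size back
        }
    };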
tests/unit/test_components.cpp
Outdated
    ASSERT_EQ(((const float *)storage_blob)[0], initial_value + value_to_add_storage);
    ASSERT_EQ(((const float *)query_blob)[0], initial_value + value_to_add_query);
    // Both blobs were processed.
    ASSERT_EQ(((const int8_t *)storage_blob)[0], original_blob[0] + value_to_add_storage);
Maybe use static_cast so it will be clearer.
tests/unit/test_components.cpp
Outdated
    TEST(PreprocessorsTest, cosine_then_change_size) {
        // cosine (not changing)
        // pp that changes the blob size
    }

    TEST(PreprocessorsTest, cosine_change_then_pp_change) {
        // cosine (changing)
        // pp that also changes the blob size
    }

    // TEST(PreprocessorsTest, multiPPContainerMixedThenCosinePreprocess) {
    //     // add cosine pp that changes the original blob size
    //     // add a pp that preprocesses the normalized blob (same size)
    //     // add a pp that changes the storage_blob size, but not changing the query_blob size
    // }

    // TEST(PreprocessorsTest, multiPPContainerMixedThenCosinePreprocess) {
    //     // add a pp that changes the storage_blob size, but not changing the query_blob size
    //     // add a pp that preprocesses the normalized blob (same size)
    //     // add cosine pp that changes the original blob size
    // }

    // TEST(PreprocessorsTest, multiPPContainerMixedThenCosinePreprocess) {
    //     // pp multi container where cosine is only needed for the query blob (not supported yet)
    // }
There is no need for these?
leftovers
    // normalize the original blob
    VecSim_Normalize(expected_processed_blob, elements, VecSimType_INT8);
    // add the values of the pp that increases the blob size
    unsigned char added_value = DummyChangeAllocSizePreprocessor<int8_t>::getExcessValue();
Why is added_value an unsigned char and not an int8?
The excess value of DummyChangeAllocSizePreprocessor is always an unsigned char, regardless of the class DataType, to represent a "one byte" entry.
Is there any reason it should be int8_t?
Just because you are adding it to an int8_t array (expected_processed_blob), and that performs an implicit conversion.
I'll add an explicit cast
The purpose of this is to fill the extra space with some values to be validated in CompareVectors
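For illustration, the explicit cast could look like this; expected_processed_blob, elements, and getExcessValue come from the hunk above, and the index used here is illustrative:

    // getExcessValue() returns an unsigned char (a raw "one byte" filler), so
    // narrowing it into the int8_t expected blob is made explicit:
    unsigned char added_value = DummyChangeAllocSizePreprocessor<int8_t>::getExcessValue();
    expected_processed_blob[elements] = static_cast<int8_t>(added_value);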
        normalize_func(blob, this->dim);
    }

    private:
        spaces::normalizeVector_f<DataType> normalize_func;
        const size_t dim;
        const size_t processed_bytes_count;
This is here only for the assertion that the input blob matches the expected size, right?
see comment above
Backport failed for 8.0. Please cherry-pick the changes locally and resolve any conflicts:

    git fetch origin 8.0
    git worktree add -d .worktree/backport-670-to-8.0 origin/8.0
    cd .worktree/backport-670-to-8.0
    git switch --create backport-670-to-8.0
    git cherry-pick -x 0929890d4fc1d4899443a2784045fa8febd472ae
[MOD-9576] Fix memory safety in cosine preprocessing (#670)
* make CreatePreprocessorsContainerParams templated and move it to header file; add logic to determine the cosine processed_bytes_count; add processed_bytes_count to the cosine component; change blob_size param to the original blob size instead of processed_bytes_count in all preprocessing functions.
* plan for the tests
* rename original_blob_size -> input_blob_size; add DummyChangeAllocSizePreprocessor class to test a pp that changes the size; add tests (one will fail because the input blob size is not updated)
* preprocessors now change the blob size
* fix test
* fix tiered test
* add assert storage_blob == nullptr || input_blob_size == processed_bytes_count; add Valgrind macro; value_to_add of DummyQueryPreprocessor and DummyMixedPreprocessor is now DataType instead of int
* enable assert only in debug
* use constexpr for blob size
* small docs changes
* review fixes

(cherry picked from commit 0929890)
This PR fixes a memory safety issue in our vector preprocessing pipeline, particularly affecting int8_t and uint8_t vectors with cosine similarity.
The Bug
Previously, CosinePreprocessor copied processed_bytes_count bytes from the source blob. For int8_t/uint8_t with cosine similarity that value is the input size plus 4 bytes (the extra norm), so the copy could touch uninitialised memory, which is undefined behaviour.
Key Changes
Preprocessors API
- Each preprocess* function now receives input_blob_size by reference and can write back the new size after it reallocates or extends the blob (see the sketch after this list).
- processed_bytes_count is stored once in the cosine-preprocessor constructor. It is not taken from the preprocess* arguments, so we no longer rely on processed_bytes_count == input_blob_size.
- Assertions verify that the size of an already-allocated storage_blob or query_blob equals processed_bytes_count, enabling safe in-place processing. Dynamic resizing wasn't implemented now to avoid added complexity and performance costs; if needed in the future, these checks can be removed and proper allocation logic added.
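A sketch of what the reworked interface looks like; the parameter list is simplified and the exact signatures in preprocessors.h may differ:

    #include <cstddef>

    // Sketch: the blob size travels by reference, so a preprocessor that
    // reallocates or extends the blob reports the new size to later stages.
    struct PreprocessorInterfaceSketch {
        virtual void preprocess(const void *original_blob, void *&storage_blob,
                                void *&query_blob, size_t &input_blob_size,
                                unsigned char alignment) = 0;
        virtual void preprocessStorageInPlace(void *blob, size_t &input_blob_size) = 0;
        virtual ~PreprocessorInterfaceSketch() = default;
    };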
Factory
- Added processed_bytes_count to PreprocessorsContainerParams.
- Changed CreatePreprocessorsContainerParams from a regular function to a template function to properly handle type-specific size calculations (sketched below).
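A hypothetical sketch of that type-specific calculation, assuming the int8/uint8 cosine layout described above (raw vector bytes plus a trailing float norm); the helper name is illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <type_traits>

    // Illustrative helper: compute the processed size per data type. For int8/uint8
    // with cosine, the processed blob carries an appended float norm.
    template <typename DataType>
    size_t ComputeProcessedBytesCount(size_t dim, bool is_cosine) {
        size_t processed_bytes_count = dim * sizeof(DataType);
        if (is_cosine && (std::is_same_v<DataType, int8_t> ||
                          std::is_same_v<DataType, uint8_t>)) {
            processed_bytes_count += sizeof(float); // room for the norm
        }
        return processed_bytes_count;
    }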
Impact
- Fixes the memory-safety issue for int8_t/uint8_t cosine vectors.
- Every preprocessing stage now receives and reports the actual blob size through input_blob_size.
- … memset.