Text index: add internal bloom filter layer #85356
Merged
rschu1ze
reviewed
Aug 11, 2025
void GinIndexStoreDeserializer::readSegmentFST(GinSegmentDictionaryPtr segment_dictionary)
{
    std::scoped_lock lock(segment_dictionary->mutex_fst);
Member
We should use std::lock_guard here (SO).
More importantly, the caller readSegmentedPostingsLists accesses fst as well:
if (seg_dict.second->fst == nullptr)
TSAN will sooner or later complain. It technically comes down to the question of whether the pointer itself needs synchronization or only the pointed-to object. I would argue for the former: the pointer gets set once the FST is loaded.
Therefore, can we omit locking here and rewrite readSegmentedPostingsLists to something like this:
FST::FiniteStateTransducer::Output fst_output;
{
    std::lock_guard lock(segment_dictionary.second->fst_mutex);
    if (segment_dictionary.second->fst == nullptr)
    {
        /// Segment dictionary is not loaded. First check the bloom filter if we can avoid the load.
        if (segment_dictionary.second->bloom_filter && !segment_dictionary.second->bloom_filter->contains(term))
            continue;
        /// Term might be in segment dictionary
        readSegmentFST(segment_dictionary.second);
    }
    fst_output = segment_dictionary.second->fst->getOutput(term);
    if (!fst_output.found)
        continue;
}
(I also changed the return type of FiniteStateTransducer::getOutput a little bit:
struct Output {
    UInt64 offset;
    bool found;
};
Output getOutput(std::string_view term);
)
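For readers who want to try the suggested pattern in isolation, here is a minimal, self-contained sketch. Everything in it (ToyFst, ToyBloomFilter, ToySegmentDictionary, loadFst, lookup) is a hypothetical stand-in and not ClickHouse code; the "bloom filter" is an exact set used only to make the control flow testable. The point it illustrates is that the null check on the pointer, the lazy load, and the read through the pointer all happen under the same std::lock_guard.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the FST: maps a term to an offset.
struct ToyFst
{
    struct Output
    {
        uint64_t offset;
        bool found;
    };

    std::unordered_map<std::string, uint64_t> terms;

    Output getOutput(std::string_view term) const
    {
        auto it = terms.find(std::string(term));
        if (it == terms.end())
            return {0, false};
        return {it->second, true};
    }
};

// Hypothetical stand-in for the bloom filter. A real filter is probabilistic;
// a false positive is modelled here by inserting a term the FST does not hold.
struct ToyBloomFilter
{
    std::unordered_set<std::string> maybe_present;

    bool contains(std::string_view term) const
    {
        return maybe_present.count(std::string(term)) > 0;
    }
};

struct ToySegmentDictionary
{
    std::mutex fst_mutex;                          // guards the fst pointer and its target
    std::unique_ptr<ToyFst> fst;                   // null until lazily loaded
    std::unique_ptr<ToyBloomFilter> bloom_filter;  // optional pre-check
    int load_count = 0;                            // counts (simulated) expensive loads
};

// Simulates the expensive load. The caller must already hold dict.fst_mutex.
void loadFst(ToySegmentDictionary & dict)
{
    dict.fst = std::make_unique<ToyFst>();
    dict.fst->terms = {{"hello", 42}};
    ++dict.load_count;
}

std::optional<uint64_t> lookup(ToySegmentDictionary & dict, std::string_view term)
{
    std::lock_guard lock(dict.fst_mutex);
    if (dict.fst == nullptr)
    {
        // Not loaded yet: let the bloom filter veto the load if it can.
        if (dict.bloom_filter && !dict.bloom_filter->contains(term))
            return std::nullopt;
        loadFst(dict);
    }
    const auto output = dict.fst->getOutput(term);
    if (!output.found)
        return std::nullopt;
    return output.offset;
}
```

In this sketch a lookup for a term the filter rejects returns without loading anything, while a filter false positive triggers the load and then reports "not found" from the FST itself.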
rschu1ze
approved these changes
Aug 11, 2025
Member
02443_detach_attach_partition: #54748
test_ttl_replicated/test.py::test_ttl_compatibility[node_left2-node_right2-2]: #83789 (comment)
Similar to #83485 but better. Reverts #85271 which reverts #85054.
Adds an additional bloom filter layer on top of text index segments. This prevents loading the segment dictionary into memory when a term does not exist in the dictionary.
The bloom filter is built from the segment data when writeSegment is called.
Introduces a new text index parameter, bloom_filter_false_positive_rate, to control the false-positive rate used when building the bloom filter. The default is 0.1%.
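To give a feel for what a 0.1% false-positive rate costs, the sketch below applies the textbook bloom filter sizing formulas: bits per element = -ln(p) / (ln 2)^2 and number of hash functions = (bits per element) * ln 2. This is an illustration only: bloomParamsFor and BloomParams are hypothetical names, the rate is assumed to be passed as a fraction (0.001 for 0.1%), and the sizing logic actually used by the PR may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper type, not part of ClickHouse.
struct BloomParams
{
    std::size_t bits;    // total size of the bit array
    std::size_t hashes;  // number of hash functions
};

// Textbook sizing for a bloom filter holding num_terms elements at the
// requested false-positive rate (given as a fraction in (0, 1)).
BloomParams bloomParamsFor(std::size_t num_terms, double false_positive_rate)
{
    assert(num_terms > 0);
    assert(false_positive_rate > 0.0 && false_positive_rate < 1.0);

    const double ln2 = std::log(2.0);
    const double bits_per_term = -std::log(false_positive_rate) / (ln2 * ln2);

    const auto bits = static_cast<std::size_t>(
        std::ceil(bits_per_term * static_cast<double>(num_terms)));
    const auto hashes = static_cast<std::size_t>(
        std::max(1.0, std::round(bits_per_term * ln2)));

    return {bits, hashes};
}
```

Under these formulas a rate of 0.001 works out to roughly 14.4 bits per term and 10 hash functions, so a segment with 1000 distinct terms would need a filter of about 1.8 KB.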
Changelog category (leave one):