Skip to content

Text index: add internal bloom filter layer#85356

Merged
rschu1ze merged 4 commits intomasterfrom
ahmadov/fst-bloom-filter-thread-sanitizer
Aug 11, 2025
Merged

Text index: add internal bloom filter layer#85356
rschu1ze merged 4 commits intomasterfrom
ahmadov/fst-bloom-filter-thread-sanitizer

Conversation

@ahmadov
Copy link
Copy Markdown
Member

@ahmadov ahmadov commented Aug 11, 2025

Similar to #83485 but better. Reverts #85271 which reverts #85054.

Adds an additional bloom filter layer on top of text index segments. This prevents loading the segment dictionary into memory when a term does not exist in the dictionary.

The bloom filter is constructed from the segment data on writeSegment call.

Introduce a new text index parameter bloom_filter_false_positive_rate to control false-positive rate while building the bloom filter. By default it's set to 0.1%.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Aug 11, 2025

Workflow [PR], commit [afcf9a0]

Summary:

job_name test_name status info comment
Stateless tests (amd_debug, distributed plan, s3 storage, parallel) failure
02443_detach_attach_partition FAIL
Integration tests (asan, old analyzer, 3/6) failure
test_ttl_replicated/test.py::test_ttl_compatibility[node_left2-node_right2-2] FAIL

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 11, 2025
@ahmadov ahmadov changed the title Ahmadov/fst bloom filter thread sanitizer Text index: add internal bloom filter layer Aug 11, 2025
@ahmadov ahmadov requested review from Ergus and rschu1ze August 11, 2025 10:54
@rschu1ze rschu1ze self-assigned this Aug 11, 2025

void GinIndexStoreDeserializer::readSegmentFST(GinSegmentDictionaryPtr segment_dictionary)
{
std::scoped_lock lock(segment_dictionary->mutex_fst);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use std::lock_guard here (SO).

More importantly, the caller readSegmentedPostingsLists accesses fst as well:

if (seg_dict.second->fst == nullptr)

TSAN will sooner or later complain. It technically comes down to the question if the pointer itself needs synchronization or only the pointed-to object. I would argue for the former - the pointer gets set once the FST is loaded.

Therefore, can we omit locking here and rewrite readSegmentedPostingsLists to something like this:

FST::FiniteStateTransducer::Output fst_output;

{
    std::lock_guard lock(segment_dictionary.second->fst_mutex);

    if (segment_dictionary.second->fst == nullptr)
    {
        /// Segment dictionary is not loaded. First check the bloom filter if we can avoid the load.
        if (segment_dictionary.second->bloom_filter && !segment_dictionary.second->bloom_filter->contains(term))
            continue;

        /// Term might be in segment dictionary
        readSegmentFST(segment_dictionary.second);
    }

    fst_output = segment_dictionary.second->fst->getOutput(term);
    if (!fst_output.found)
        continue;
}

(I also changed the return type of FiniteStateTransducer::getOutput a little bit:

struct Output {
    UInt64 offset;
    bool found;
};
Output getOutput(std::string_view term);

)

@rschu1ze
Copy link
Copy Markdown
Member

02443_detach_attach_partition: #54748

test_ttl_replicated/test.py::test_ttl_compatibility[node_left2-node_right2-2]: #83789 (comment)

@rschu1ze rschu1ze added this pull request to the merge queue Aug 11, 2025
Merged via the queue into master with commit b99ad5a Aug 11, 2025
114 of 122 checks passed
@rschu1ze rschu1ze deleted the ahmadov/fst-bloom-filter-thread-sanitizer branch August 11, 2025 16:03
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants