Text index: Use text index to improve the performance of like queries #98149
Workflow [PR], commit [8d09073] Summary: ❌
AI Review
Summary
This PR adds text-index dictionary-scan acceleration for
Findings
ClickHouse Rules
Final Verdict
More supported patterns would follow.
Pull request overview
This PR extends the MergeTree inverted text index “direct read” path to support an opt-in optimization for like/notLike predicates (currently limited to the SplitByNonAlpha tokenizer) by scanning dictionary tokens against a LIKE-derived regex and reading/unioning the postings of matched tokens.
Changes:
- Add LIKE/NOT LIKE support to `MergeTreeIndexConditionText` and direct-read optimization plumbing, including a new `use_text_index_like_optimization` setting gate.
- Extend text-index granule deserialization/analysis to collect pattern-matched tokens (`pattern_tokens`) and read postings for them.
- Update the text-index reader to read postings for both exact tokens and pattern-matched tokens and to fill virtual columns accordingly.
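To make the overview concrete, here is a minimal sketch of the dictionary-scan idea, not the code from this PR. It assumes a toy dictionary modeled as a `std::map` from token to row ids; the `Dictionary` alias and `scanDictionary` are hypothetical names, and `std::regex` stands in for whatever regex engine the real implementation uses.

```cpp
#include <cstdint>
#include <map>
#include <regex>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a text-index dictionary: token -> row ids (postings).
using Dictionary = std::map<std::string, std::vector<uint32_t>>;

// Scan every dictionary token, keep the ones the LIKE-derived regex matches,
// and union their postings into one sorted set of row ids.
std::set<uint32_t> scanDictionary(const Dictionary & dictionary, const std::regex & pattern)
{
    std::set<uint32_t> rows;
    for (const auto & [token, postings] : dictionary)
        if (std::regex_search(token, pattern)) // substring match, like %needle%
            rows.insert(postings.begin(), postings.end());
    return rows;
}
```

With tokens `search`, `researching`, and `index`, the regex `search` matches the first two, so the result is the union of their postings. The scan touches each distinct token once rather than each row of column data.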
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/Storages/MergeTree/MergeTreeReaderTextIndex.cpp | Reads postings for pattern-matched tokens and fills virtual columns for LIKE-style queries. |
| src/Storages/MergeTree/MergeTreeIndexText.h | Extends granule API/state with pattern_tokens and adds pattern-query helpers. |
| src/Storages/MergeTree/MergeTreeIndexText.cpp | Implements dictionary scanning for LIKE patterns and registers rare-token postings for pattern tokens. |
| src/Storages/MergeTree/MergeTreeIndexConditionText.h | Extends TextSearchQuery to carry compiled patterns; adds LIKE/NOT LIKE RPN function kinds. |
| src/Storages/MergeTree/MergeTreeIndexConditionText.cpp | Parses eligible %token% LIKE patterns into regexes and wires LIKE/NOT LIKE into the condition/RPN evaluation. |
| src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp | Adjusts direct-read selection logic for pattern-based queries. |
| src/Core/SettingsChangesHistory.cpp | Records the new use_text_index_like_optimization setting in settings history. |
| src/Core/Settings.cpp | Declares the new use_text_index_like_optimization setting. |
just negate the result array
The spaces left and right of `support` make sure that the term can be extracted as a token.

Fortunately, there is a special case where ClickHouse can leverage the inverted index to speed up LIKE queries significantly.
The docs section references settings that do not exist anymore: it points to use_text_index_like_optimization, but the implemented setting is use_text_index_like_evaluation_by_dictionary_scan.
This leaves users with a broken setting name in docs and makes tuning instructions non-actionable.
Please update this paragraph and links to use the actual setting names (use_text_index_like_evaluation_by_dictionary_scan, text_index_like_min_pattern_length, text_index_like_max_postings_to_read).
### LIKE/ILIKE queries {#like-ilike-queries-perf}

When a LIKE/ILIKE query pattern is `%<alpha-numeric-characters-without-spaces>%` and the text index tokenizer is `splitByNonAlpha`, ClickHouse leverages the inverted index to speed up LIKE/ILIKE queries significantly. To achieve that, ClickHouse scans the inverted index dictionary instead of a full-table scan to find the matching pattern.
The docs say the optimized pattern is %<alpha-numeric-characters-without-spaces>%, but the implementation currently accepts only ASCII alphanumerics (isASCII + isAlphaNumericASCII in stringLikeToPatterns).
Without clarifying this, users can expect dictionary-scan acceleration for non-ASCII patterns and get surprising behavior/perf differences.
Could you either:
- tighten docs wording to explicitly say `ASCII` alphanumeric, or
- broaden the matcher to support non-ASCII alphanumeric needles?
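The restriction described in this comment can be illustrated with a small eligibility check. This is a sketch under assumptions, not the real `stringLikeToPatterns`: `extractAsciiNeedle` is a hypothetical helper that accepts only patterns of the shape `%needle%` whose needle consists purely of ASCII letters and digits.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not ClickHouse's stringLikeToPatterns): returns the
// needle if the pattern is %needle% with an ASCII-alphanumeric needle,
// otherwise std::nullopt (the pattern is not eligible for the dictionary scan).
std::optional<std::string> extractAsciiNeedle(std::string_view like_pattern)
{
    if (like_pattern.size() < 3 || like_pattern.front() != '%' || like_pattern.back() != '%')
        return std::nullopt;

    std::string_view needle = like_pattern.substr(1, like_pattern.size() - 2);
    for (unsigned char c : needle)
    {
        const bool is_alnum = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (!is_alnum) // rejects spaces, LIKE wildcards, and any non-ASCII byte
            return std::nullopt;
    }
    return std::string(needle);
}
```

Under this check `%search%` is eligible, while a needle containing a space, an inner `_` or `%`, or a UTF-8 encoded non-ASCII letter is not, which is the behavior the docs wording would need to spell out.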
    matched_tokens.push_back(String(token));

    if (!matched_tokens.empty())
        applyPostingsAny(column, postings, indices_buffer, matched_tokens, old_size, row_offset, num_rows);
fillColumn writes LIKE semantics for pattern queries but never inverts for notLike / notILike.
In this branch, applyPostingsAny marks rows that match at least one pattern token, then returns immediately. For TextIndexDirectReadMode::Exact, the original predicate is removed, so notLike can return matching rows instead of non-matching rows.
Please invert the just-filled range when search_query->function_name is notLike/notILike (or route these through fillColumnFallback).
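The requested inversion could look roughly like the sketch below. It assumes the virtual column can be modeled as one 0/1 byte per row; `invertRange` is a hypothetical helper, not a function from the PR, and the real column type and the handling of the function name would differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the virtual column: one 0/1 byte per row.
// Flips the rows in [begin, end), turning "matches LIKE" into "matches NOT LIKE".
void invertRange(std::vector<uint8_t> & column, size_t begin, size_t end)
{
    assert(begin <= end && end <= column.size());
    for (size_t i = begin; i < end; ++i)
        column[i] = column[i] ? 0 : 1;
}
```

Only the range filled by the current call would be flipped (from the old size to the new end), so rows written by earlier calls keep their values.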
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.
LLVM Coverage Report
Changed lines: 77.39% (493/637) | lost baseline coverage: 7 line(s)
Strip broken setting anchors in textindexes.md that reference settings not yet documented. Remove once core adds the setting entries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Excited to see this merged! I'm curious how much harder supporting the
@EmeraldShift, it should be fairly easy to enable for the array tokenizer. I'll have a look |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
This PR includes the perf optimization for `like` queries from #97723.
The idea is to scan the dictionary tokens and apply the LIKE pattern to them, collecting the matching tokens and their postings.
For a simple pattern such as `column LIKE '%search%'`, the union of the matching postings gives the same result as scanning the column data.
The optimization is only available for the `SplitByNonAlpha` tokenizer for now.
Note
Medium Risk
Touches core MergeTree text-index planning and read-time filtering, adding new code paths and a fallback reader; incorrect handling could cause wrong filtering or performance regressions under LIKE workloads.
Overview
Adds an opt-in optimization for `LIKE`/`NOT LIKE` and `ILIKE`/`NOT ILIKE` on `text` skip indexes by scanning the text-index dictionary for tokens matching simple `%needle%` patterns and using their postings to filter granules/rows.
Introduces new settings (`use_text_index_like_optimization`, `text_index_like_min_pattern_length`, `text_index_like_max_postings_to_read`) and extends the text-index condition/granule/reader pipeline to carry compiled patterns, read postings for pattern-matched tokens, and fall back to evaluating the original LIKE predicate on the base column when the dictionary scan is cut short (too many large postings). Adds stateless and performance tests covering correctness, EXPLAIN behavior, and fallback cases.
Written by Cursor Bugbot for commit c20269a.
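The fallback described in this overview can be sketched as a scan that gives up once it exceeds a budget. The exact semantics of the two limit settings are not stated here, so this sketch assumes that `min_pattern_length` is a minimum needle length and `max_postings_to_read` caps the number of postings lists read. Both parameters, the `Dictionary` alias, and `scanWithLimits` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a text-index dictionary: token -> row ids (postings).
using Dictionary = std::map<std::string, std::vector<uint32_t>>;

// Returns the matching row ids, or std::nullopt to signal that the caller
// should evaluate the original LIKE predicate on the base column instead.
// The meaning of both limits is an assumption made for this sketch.
std::optional<std::set<uint32_t>> scanWithLimits(
    const Dictionary & dictionary, const std::string & needle,
    size_t min_pattern_length, size_t max_postings_to_read)
{
    if (needle.size() < min_pattern_length)
        return std::nullopt; // needle too short to be worth a dictionary scan

    std::set<uint32_t> rows;
    size_t postings_read = 0;
    for (const auto & [token, postings] : dictionary)
    {
        if (token.find(needle) == std::string::npos)
            continue;
        if (++postings_read > max_postings_to_read)
            return std::nullopt; // scan cut short: fall back to the base column
        rows.insert(postings.begin(), postings.end());
    }
    return rows;
}
```

Returning `std::nullopt` rather than a partial set keeps the two outcomes distinct: a partial union would silently drop matching rows, whereas the fallback still produces a correct result.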