
Text index: Use text index to improve the performance of like queries #98149

Merged: ahmadov merged 68 commits into master from ahmadov/text-index-like-perf on Apr 7, 2026
ahmadov (Member) commented Feb 26, 2026

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

This PR includes the performance optimization for LIKE queries from #97723.

The idea is to scan the text-index dictionary and apply the LIKE pattern to each token, which yields the matching tokens and their postings.

For a simple pattern such as column LIKE '%search%', the union of the matching postings gives the same result as scanning the column data.

The optimization is only available for the SplitByNonAlpha tokenizer for now.
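
To illustrate why the union of postings is enough, here is a minimal sketch under stated assumptions, not the code from this PR. `ToyDictionary` and `rowsMatchingSubstring` are hypothetical names, and the dictionary is modelled as a plain in-memory map. With a tokenizer that splits on non-alphanumeric characters and a needle made only of alphanumeric characters, any occurrence of the needle must lie inside a single token, so checking tokens one by one should find every matching row.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a text-index dictionary:
// token -> sorted row numbers (postings) of the rows that contain the token.
using ToyDictionary = std::map<std::string, std::vector<uint32_t>>;

// Returns the sorted union of the postings of every token that contains `needle`,
// i.e. the rows selected by `column LIKE '%needle%'` in this toy model.
std::vector<uint32_t> rowsMatchingSubstring(const ToyDictionary & dictionary, std::string_view needle)
{
    std::vector<uint32_t> result;
    for (const auto & [token, postings] : dictionary)
    {
        // Apply the pattern to the dictionary token, not to the column data.
        if (token.find(needle) == std::string::npos)
            continue;

        std::vector<uint32_t> merged;
        std::set_union(result.begin(), result.end(), postings.begin(), postings.end(), std::back_inserter(merged));
        result = std::move(merged);
    }
    return result;
}
```

In this sketch a dictionary holding the tokens `research`, `search` and `support` would contribute the postings of the first two tokens for the needle `search`; the dictionary is usually far smaller than the column, which is where the speed-up would come from.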


Note

Medium Risk
Touches core MergeTree text-index planning and read-time filtering, adding new code paths and a fallback reader; incorrect handling could cause wrong filtering or performance regressions under LIKE workloads.

Overview
Adds an opt-in optimization for LIKE/NOT LIKE and ILIKE/NOT ILIKE on text skip indexes by scanning the text-index dictionary for tokens matching simple %needle% patterns and using their postings to filter granules/rows.

Introduces new settings (use_text_index_like_optimization, text_index_like_min_pattern_length, text_index_like_max_postings_to_read) and extends the text-index condition/granule/reader pipeline to carry compiled patterns, read postings for pattern-matched tokens, and fall back to evaluating the original LIKE predicate on the base column when the dictionary scan is cut short (too many large postings). Adds stateless and performance tests covering correctness, EXPLAIN behavior, and fallback cases.
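
The fallback described above can be sketched roughly as follows. This is an illustration only: `ToyLimits` and `scanDictionary` are hypothetical, the limit is assumed to count individual posting entries, and treating a too-short needle the same way as a cut-short scan is an assumption of the sketch rather than something the summary states.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a text-index dictionary: token -> postings.
using ToyDictionary = std::map<std::string, std::vector<uint32_t>>;

// Hypothetical knobs, loosely modelled on the settings named above.
struct ToyLimits
{
    size_t min_pattern_length = 3;
    size_t max_postings_to_read = 1000;
};

// Returns the matching tokens, or std::nullopt when the caller should fall back
// to evaluating the original LIKE predicate on the base column.
std::optional<std::vector<std::string>> scanDictionary(
    const ToyDictionary & dictionary, std::string_view needle, const ToyLimits & limits)
{
    // Assumption: needles shorter than the minimum are not worth a dictionary scan.
    if (needle.size() < limits.min_pattern_length)
        return std::nullopt;

    std::vector<std::string> matched;
    size_t postings_seen = 0;
    for (const auto & [token, postings] : dictionary)
    {
        if (token.find(needle) == std::string::npos)
            continue;

        postings_seen += postings.size();
        if (postings_seen > limits.max_postings_to_read)
            return std::nullopt; // Scan cut short: too many postings would have to be read.

        matched.push_back(token);
    }
    return matched;
}
```

Returning `std::nullopt` instead of a partial token list keeps the two outcomes distinct: an empty list means no row matches, while `std::nullopt` means the index could not answer and the predicate must be evaluated on the column.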

Written by Cursor Bugbot for commit c20269a. This will update automatically on new commits.

ahmadov changed the title from "Text index: Use text index dictionary for like/notLike queries" to "Text index: Use text index to improve the performance of like/notLike queries" on Feb 26, 2026
clickhouse-gh Bot (Contributor) commented Feb 26, 2026

Workflow [PR], commit [8d09073]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (arm_asan_ubsan, flaky check) | | failure | | |
| | 00078_group_by_arrays | FAIL | cidb | IGNORED |
| Stateless tests (amd_tsan, flaky check) | | failure | | |
| | 00169_contingency | FAIL | cidb, issue | ISSUE EXISTS |
| | 00169_contingency | FAIL | cidb, issue | ISSUE EXISTS |
| | 00169_contingency | FAIL | cidb, issue | ISSUE EXISTS |
| Stateless tests (amd_tsan, s3 storage, parallel, 1/2) | | failure | | |
| | 02859_replicated_db_name_zookeeper | FAIL | cidb | IGNORED |
| Stress test (arm_tsan) | | failure | | |
| | Server died | FAIL | cidb | IGNORED |
| | Logical error: Can't adjust last granule because it has A rows, but try to subtract B rows (num_read_rows = C, total_rows_per_granule = D, rows_per_granule = [E], debug: max_rows=F, rows_from_read=G, rows_from_finalize_loop=H, rows_from_finalize_post=I, ranges_processed=J, skipped_marks=K, use_query_condition_cache=L, can_read_incomplete_granules=M) (STID: 5258-4b5f) | FAIL | cidb | IGNORED |

AI Review

Summary

This PR adds text-index dictionary-scan acceleration for LIKE/ILIKE patterns and fallback behavior when dictionary scans are cut short. I found one actionable issue: documentation overstates supported pattern characters versus the actual implementation.

Findings
  • ⚠️ Majors
    • [docs/en/engines/table-engines/mergetree-family/textindexes.md:905] The docs describe eligible patterns as %<alpha-numeric-characters-without-spaces>%, but implementation in MergeTreeIndexConditionText::stringLikeToPatterns only accepts ASCII alphanumerics (isASCII + isAlphaNumericASCII). This can mislead users expecting non-ASCII letters/digits to be optimized.
    • Suggested fix: either clarify docs to %<ASCII alpha-numeric characters without spaces>%, or broaden matching to Unicode alphanumerics.
ClickHouse Rules

| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | |
| No magic constants | | |
| Backward compatibility | | |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | | |
| Compilation time | | |

Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required action: align the LIKE/ILIKE optimization documentation with the actual ASCII-only matcher semantics (or broaden matcher semantics to match docs).

clickhouse-gh Bot added the pr-performance (Pull request with some performance improvements) label Feb 26, 2026
ahmadov marked this pull request as ready for review March 3, 2026 20:52
ahmadov requested review from CurtizJ, Ergus and Copilot and removed request for CurtizJ March 4, 2026 08:44

Copilot AI (Contributor) left a comment

Pull request overview

This PR extends the MergeTree inverted text index “direct read” path to support an opt-in optimization for like/notLike predicates (currently limited to the SplitByNonAlpha tokenizer) by scanning dictionary tokens against a LIKE-derived regex and reading/unioning the postings of matched tokens.

Changes:

  • Add LIKE/NOT LIKE support to MergeTreeIndexConditionText and direct-read optimization plumbing, including a new use_text_index_like_optimization setting gate.
  • Extend text-index granule deserialization/analysis to collect pattern-matched tokens (pattern_tokens) and read postings for them.
  • Update the text-index reader to read postings for both exact tokens and pattern-matched tokens and to fill virtual columns accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| src/Storages/MergeTree/MergeTreeReaderTextIndex.cpp | Reads postings for pattern-matched tokens and fills virtual columns for LIKE-style queries. |
| src/Storages/MergeTree/MergeTreeIndexText.h | Extends granule API/state with pattern_tokens and adds pattern-query helpers. |
| src/Storages/MergeTree/MergeTreeIndexText.cpp | Implements dictionary scanning for LIKE patterns and registers rare-token postings for pattern tokens. |
| src/Storages/MergeTree/MergeTreeIndexConditionText.h | Extends TextSearchQuery to carry compiled patterns; adds LIKE/NOT LIKE RPN function kinds. |
| src/Storages/MergeTree/MergeTreeIndexConditionText.cpp | Parses eligible %token% LIKE patterns into regexes and wires LIKE/NOT LIKE into the condition/RPN evaluation. |
| src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp | Adjusts direct-read selection logic for pattern-based queries. |
| src/Core/SettingsChangesHistory.cpp | Records the new use_text_index_like_optimization setting in settings history. |
| src/Core/Settings.cpp | Declares the new use_text_index_like_optimization setting. |


The spaces left and right of `support` make sure that the term can be extracted as a token.

Fortunately, there is a special case where ClickHouse can leverage the inverted index to speed up LIKE queries significantly.

Contributor

The docs section references settings that do not exist anymore: it points to use_text_index_like_optimization, but the implemented setting is use_text_index_like_evaluation_by_dictionary_scan.

This leaves users with a broken setting name in docs and makes tuning instructions non-actionable.

Please update this paragraph and links to use the actual setting names (use_text_index_like_evaluation_by_dictionary_scan, text_index_like_min_pattern_length, text_index_like_max_postings_to_read).


### LIKE/ILIKE queries {#like-ilike-queries-perf}

When a LIKE/ILIKE query pattern is `%<alpha-numeric-characters-without-spaces>%` and the text index tokenizer is `splitByNonAlpha`, ClickHouse leverages the inverted index to speed up LIKE/ILIKE queries significantly. To achieve that, ClickHouse scans the inverted index dictionary instead of a full-table scan to find the matching pattern.

Contributor

The docs say the optimized pattern is %<alpha-numeric-characters-without-spaces>%, but the implementation currently accepts only ASCII alphanumerics (isASCII + isAlphaNumericASCII in stringLikeToPatterns).

Without clarifying this, users can expect dictionary-scan acceleration for non-ASCII patterns and get surprising behavior/perf differences.

Could you either:

  • tighten docs wording to explicitly say ASCII alphanumeric, or
  • broaden the matcher to support non-ASCII alphanumeric needles?
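
To make the ASCII-only restriction concrete, the following is a hedged sketch of such an eligibility check. It is not the actual `stringLikeToPatterns` implementation; `isAsciiAlphaNumeric` and `extractAsciiNeedle` are hypothetical helpers that only mirror the behaviour described in this comment.

```cpp
#include <optional>
#include <string>
#include <string_view>

// True only for ASCII letters and digits. Bytes of multi-byte UTF-8 sequences
// fall outside these ranges and are therefore rejected.
bool isAsciiAlphaNumeric(char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Hypothetical helper: extracts `needle` from a pattern of the exact form `%needle%`.
// Returns std::nullopt when the pattern is not eligible for the dictionary scan.
std::optional<std::string> extractAsciiNeedle(std::string_view pattern)
{
    if (pattern.size() < 3 || pattern.front() != '%' || pattern.back() != '%')
        return std::nullopt;

    std::string_view needle = pattern.substr(1, pattern.size() - 2);
    for (char c : needle)
        if (!isAsciiAlphaNumeric(c))
            return std::nullopt; // Spaces, LIKE wildcards and non-ASCII bytes are all rejected.

    return std::string(needle);
}
```

Under this check a pattern such as `%search%` qualifies, while a pattern containing a space, an inner `_` or `%` wildcard, or a non-ASCII letter does not, which is the gap between the docs wording and the matcher that this comment points out.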

ahmadov enabled auto-merge April 2, 2026 12:20
matched_tokens.push_back(String(token));

if (!matched_tokens.empty())
    applyPostingsAny(column, postings, indices_buffer, matched_tokens, old_size, row_offset, num_rows);

Contributor

fillColumn writes LIKE semantics for pattern queries but never inverts for notLike / notILike.

In this branch, applyPostingsAny marks rows that match at least one pattern token, then returns immediately. For TextIndexDirectReadMode::Exact, the original predicate is removed, so notLike can return matching rows instead of non-matching rows.

Please invert the just-filled range when search_query->function_name is notLike/notILike (or route these through fillColumnFallback).
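
A minimal sketch of the requested inversion, assuming a simple 0/1 filter per row range. `fillFilter` is a hypothetical function, not the real `fillColumn`, and its parameters only loosely follow the `row_offset` / `num_rows` arguments visible in the snippet above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Marks the rows from `postings` that fall into [row_offset, row_offset + num_rows)
// in a 0/1 filter, then inverts that range when the predicate is negated
// (NOT LIKE / NOT ILIKE).
std::vector<uint8_t> fillFilter(
    const std::vector<uint32_t> & postings, size_t row_offset, size_t num_rows, bool negated)
{
    std::vector<uint8_t> filter(num_rows, 0);

    // LIKE semantics: a row passes if it appears in the postings of any matched token.
    for (uint32_t row : postings)
        if (row >= row_offset && row - row_offset < num_rows)
            filter[row - row_offset] = 1;

    // NOT LIKE semantics: flip only the range that was just filled.
    if (negated)
        for (auto & value : filter)
            value = value ? 0 : 1;

    return filter;
}
```

Without the final loop, a negated predicate would keep exactly the rows it is supposed to drop, which matches the wrong-result scenario described for `TextIndexDirectReadMode::Exact`.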

alexey-milovidov (Member) commented:

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

clickhouse-gh Bot (Contributor) commented Apr 7, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.00% | 84.00% | +0.00% |
| Functions | 90.90% | 90.90% | +0.00% |
| Branches | 76.60% | 76.60% | +0.00% |

Changed lines: 77.39% (493/637) | lost baseline coverage: 7 line(s)

ahmadov added this pull request to the merge queue Apr 7, 2026
Merged via the queue into master with commit 639aa60 Apr 7, 2026
158 of 163 checks passed
ahmadov deleted the ahmadov/text-index-like-perf branch April 7, 2026 20:22
robot-ch-test-poll added the pr-synced-to-cloud (The PR is synced to the cloud repo) label Apr 7, 2026
dhtclk added a commit to ClickHouse/clickhouse-docs that referenced this pull request Apr 7, 2026
Strip broken setting anchors in textindexes.md that reference
settings not yet documented. Remove once core adds the setting
entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EmeraldShift (Contributor) commented:

Excited to see this merged! I'm curious how much harder supporting the array tokenizer is? I'm using array to keep the index size down for some really large tables

ahmadov (Member, Author) commented Apr 16, 2026

> Excited to see this merged! I'm curious how much harder supporting the array tokenizer is? I'm using array to keep the index size down for some really large tables

@EmeraldShift, it should be fairly easy to enable for the array tokenizer. I'll have a look

Labels

pr-performance (Pull request with some performance improvements), pr-synced-to-cloud (The PR is synced to the cloud repo)

8 participants