Text index: Use text index to improve the performance of `like` queries by ahmadov · Pull Request #98149 · ClickHouse/ClickHouse

ahmadov · 2026-02-26T23:51:44Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

This PR includes the perf optimization for like queries from #97723.

The idea is to scan the text index dictionary, apply the LIKE pattern to each token, and collect the matching tokens together with their postings.

For a simple pattern such as `column LIKE '%search%'`, the union of the matching tokens' postings gives the same result as scanning the column data.

For now, the optimization is only available for the SplitByNonAlpha tokenizer.
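The dictionary scan described above can be illustrated with a small standalone sketch. It is not the code from this PR: `Dictionary`, `Postings`, and `rowsMatchingNeedle` are hypothetical names, and the dictionary is modeled as a plain map from token to a sorted list of row ids. It assumes the needle is purely alphanumeric, so every occurrence in the text lies inside a single token.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a text index dictionary: token -> sorted row ids.
using Postings = std::vector<uint32_t>;
using Dictionary = std::map<std::string, Postings>;

// Rows matching `column LIKE '%needle%'`: every token that contains the
// needle contributes its postings, and the results are unioned.
Postings rowsMatchingNeedle(const Dictionary & dictionary, std::string_view needle)
{
    Postings result;
    for (const auto & [token, postings] : dictionary)
    {
        if (token.find(needle) == std::string::npos)
            continue; // token does not contain the needle

        Postings merged;
        std::set_union(result.begin(), result.end(),
                       postings.begin(), postings.end(),
                       std::back_inserter(merged));
        result = std::move(merged);
    }
    return result;
}
```

With tokens `research` and `search` in the dictionary, a needle of `search` matches both, and the returned rows are the union of their postings.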

Note

Medium Risk
Touches core MergeTree text-index planning and read-time filtering, adding new code paths and a fallback reader; incorrect handling could cause wrong filtering or performance regressions under LIKE workloads.

Overview
Adds an opt-in optimization for LIKE/NOT LIKE and ILIKE/NOT ILIKE on text skip indexes by scanning the text-index dictionary for tokens matching simple %needle% patterns and using their postings to filter granules/rows.

Introduces new settings (use_text_index_like_optimization, text_index_like_min_pattern_length, text_index_like_max_postings_to_read) and extends the text-index condition/granule/reader pipeline to carry compiled patterns, read postings for pattern-matched tokens, and fall back to evaluating the original LIKE predicate on the base column when the dictionary scan is cut short (too many large postings). Adds stateless and performance tests covering correctness, EXPLAIN behavior, and fallback cases.
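The fallback when the dictionary scan is cut short might look roughly like the sketch below. The exact rule behind `text_index_like_max_postings_to_read` is not spelled out in this thread, so the budget here (total number of posting entries read across matched tokens) is an assumption, and `scanWithBudget` is a hypothetical helper rather than a function from the PR. An empty result signals that the caller should evaluate the original LIKE predicate on the base column instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical dictionary model: token -> row ids.
using Dictionary = std::map<std::string, std::vector<uint32_t>>;

// Returns the matching row ids, or std::nullopt when the assumed budget is
// exceeded and the caller should fall back to evaluating LIKE on the column.
std::optional<std::set<uint32_t>> scanWithBudget(
    const Dictionary & dictionary, std::string_view needle, size_t max_postings_to_read)
{
    std::set<uint32_t> rows;
    size_t postings_read = 0;
    for (const auto & [token, postings] : dictionary)
    {
        if (token.find(needle) == std::string::npos)
            continue;

        postings_read += postings.size();
        if (postings_read > max_postings_to_read)
            return std::nullopt; // scan cut short: too many postings

        rows.insert(postings.begin(), postings.end());
    }
    return rows;
}
```

Returning an optional keeps the two outcomes distinct: an empty set means "no rows match", while `std::nullopt` means "the index could not answer cheaply".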

Written by Cursor Bugbot for commit c20269a. This will update automatically on new commits.

clickhouse-gh · 2026-02-26T23:52:22Z

Workflow [PR], commit [8d09073]

Summary: ❌

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (arm_asan_ubsan, flaky check) | | failure | | |
| | 00078_group_by_arrays | FAIL | cidb | IGNORED |
| Stateless tests (amd_tsan, flaky check) | | failure | | |
| | 00169_contingency | FAIL | cidb, issue | ISSUE EXISTS |
| | 00169_contingency | FAIL | cidb, issue | ISSUE EXISTS |
| | 00169_contingency | FAIL | cidb, issue | ISSUE EXISTS |
| Stateless tests (amd_tsan, s3 storage, parallel, 1/2) | | failure | | |
| | 02859_replicated_db_name_zookeeper | FAIL | cidb | IGNORED |
| Stress test (arm_tsan) | | failure | | |
| | Server died | FAIL | cidb | IGNORED |
| | Logical error: Can't adjust last granule because it has A rows, but try to subtract B rows (num_read_rows = C, total_rows_per_granule = D, rows_per_granule = [E], debug: max_rows=F, rows_from_read=G, rows_from_finalize_loop=H, rows_from_finalize_post=I, ranges_processed=J, skipped_marks=K, use_query_condition_cache=L, can_read_incomplete_granules=M) (STID: 5258-4b5f) | FAIL | cidb | IGNORED |

AI Review

Summary

This PR adds text-index dictionary-scan acceleration for LIKE/ILIKE patterns and fallback behavior when dictionary scans are cut short. I found one actionable issue: documentation overstates supported pattern characters versus the actual implementation.

Findings

⚠️ Majors
- [docs/en/engines/table-engines/mergetree-family/textindexes.md:905] The docs describe eligible patterns as %<alpha-numeric-characters-without-spaces>%, but implementation in MergeTreeIndexConditionText::stringLikeToPatterns only accepts ASCII alphanumerics (isASCII + isAlphaNumericASCII). This can mislead users expecting non-ASCII letters/digits to be optimized.
- Suggested fix: either clarify docs to %<ASCII alpha-numeric characters without spaces>%, or broaden matching to Unicode alphanumerics.

ClickHouse Rules

| Item | Status | Notes |
|---|---|---|
| Deletion logging | ➖ | |
| Serialization versioning | ➖ | |
| Core-area scrutiny | ✅ | |
| No test removal | ✅ | |
| Experimental gate | ➖ | |
| No magic constants | ✅ | |
| Backward compatibility | ✅ | |
| `SettingsChangesHistory.cpp` | ✅ | |
| PR metadata quality | ✅ | |
| Safe rollout | ✅ | |
| Compilation time | ✅ | |

Final Verdict

Status: ⚠️ Request changes
Minimum required action: align the LIKE/ILIKE optimization documentation with the actual ASCII-only matcher semantics (or broaden matcher semantics to match docs).

Copilot

Pull request overview

This PR extends the MergeTree inverted text index “direct read” path to support an opt-in optimization for like/notLike predicates (currently limited to the SplitByNonAlpha tokenizer) by scanning dictionary tokens against a LIKE-derived regex and reading/unioning the postings of matched tokens.

Changes:

- Add LIKE/NOT LIKE support to MergeTreeIndexConditionText and direct-read optimization plumbing, including a new use_text_index_like_optimization setting gate.
- Extend text-index granule deserialization/analysis to collect pattern-matched tokens (pattern_tokens) and read postings for them.
- Update the text-index reader to read postings for both exact tokens and pattern-matched tokens and to fill virtual columns accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| src/Storages/MergeTree/MergeTreeReaderTextIndex.cpp | Reads postings for pattern-matched tokens and fills virtual columns for LIKE-style queries. |
| src/Storages/MergeTree/MergeTreeIndexText.h | Extends granule API/state with `pattern_tokens` and adds pattern-query helpers. |
| src/Storages/MergeTree/MergeTreeIndexText.cpp | Implements dictionary scanning for LIKE patterns and registers rare-token postings for pattern tokens. |
| src/Storages/MergeTree/MergeTreeIndexConditionText.h | Extends `TextSearchQuery` to carry compiled patterns; adds LIKE/NOT LIKE RPN function kinds. |
| src/Storages/MergeTree/MergeTreeIndexConditionText.cpp | Parses eligible `%token%` LIKE patterns into regexes and wires LIKE/NOT LIKE into the condition/RPN evaluation. |
| src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp | Adjusts direct-read selection logic for pattern-based queries. |
| src/Core/SettingsChangesHistory.cpp | Records the new `use_text_index_like_optimization` setting in settings history. |
| src/Core/Settings.cpp | Declares the new `use_text_index_like_optimization` setting. |

clickhouse-gh · 2026-04-01T16:08:27Z


 The spaces left and right of `support` make sure that the term can be extracted as a token.

+Fortunately, there is a special case where ClickHouse can leverage the inverted index to speed up LIKE queries significantly.


The docs section references settings that do not exist anymore: it points to use_text_index_like_optimization, but the implemented setting is use_text_index_like_evaluation_by_dictionary_scan.

This leaves users with a broken setting name in docs and makes tuning instructions non-actionable.

Please update this paragraph and links to use the actual setting names (use_text_index_like_evaluation_by_dictionary_scan, text_index_like_min_pattern_length, text_index_like_max_postings_to_read).

clickhouse-gh · 2026-04-01T16:29:24Z


+### LIKE/ILIKE queries {#like-ilike-queries-perf}
+
+When a LIKE/ILIKE query pattern is `%<alpha-numeric-characters-without-spaces>%` and the text index tokenizer is `splitByNonAlpha`, ClickHouse leverages the inverted index to speed up LIKE/ILIKE queries significantly. To achieve that, ClickHouse scans the inverted index dictionary instead of a full-table scan to find the matching pattern.


The docs say the optimized pattern is %<alpha-numeric-characters-without-spaces>%, but the implementation currently accepts only ASCII alphanumerics (isASCII + isAlphaNumericASCII in stringLikeToPatterns).

Without clarifying this, users can expect dictionary-scan acceleration for non-ASCII patterns and get surprising behavior/perf differences.

Could you either:

- tighten docs wording to explicitly say ASCII alphanumeric, or
- broaden the matcher to support non-ASCII alphanumeric needles?
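To make the ASCII-only restriction concrete, here is a minimal sketch of an eligibility check for `%needle%` patterns. It is not the body of `stringLikeToPatterns`: `isAsciiAlnum` and `extractNeedle` are hypothetical helpers, and the `min_length` parameter only loosely mirrors `text_index_like_min_pattern_length`, whose exact semantics are assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// ASCII letters and digits only; bytes of multi-byte UTF-8 characters fail.
bool isAsciiAlnum(char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Returns the needle if `pattern` has the form %<ASCII alphanumerics>%
// and the needle has at least `min_length` characters; otherwise nullopt.
std::optional<std::string> extractNeedle(std::string_view pattern, size_t min_length = 1)
{
    if (pattern.size() < 3 || pattern.front() != '%' || pattern.back() != '%')
        return std::nullopt;

    std::string_view needle = pattern.substr(1, pattern.size() - 2);
    if (needle.size() < min_length)
        return std::nullopt;

    for (char c : needle)
        if (!isAsciiAlnum(c))
            return std::nullopt; // spaces, wildcards, and non-ASCII bytes are rejected

    return std::string(needle);
}
```

Under this check, `%search%` is eligible, while `%two words%` and a UTF-8 needle such as `%café%` are not, which is the behavior difference the review comment asks the docs to state.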

clickhouse-gh · 2026-04-02T15:30:32Z

+                matched_tokens.push_back(String(token));
+
+        if (!matched_tokens.empty())
+            applyPostingsAny(column, postings, indices_buffer, matched_tokens, old_size, row_offset, num_rows);


fillColumn writes LIKE semantics for pattern queries but never inverts for notLike / notILike.

In this branch, applyPostingsAny marks rows that match at least one pattern token, then returns immediately. For TextIndexDirectReadMode::Exact, the original predicate is removed, so notLike can return matching rows instead of non-matching rows.

Please invert the just-filled range when search_query->function_name is notLike/notILike (or route these through fillColumnFallback).
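The requested inversion can be sketched as follows. This is a simplified model, not the reader code from the PR: `fillMatches` is a hypothetical function, the result column is a plain byte vector, and the `negate` flag stands in for checking whether the function name is notLike/notILike. Only the range appended by the current call is inverted, so results written earlier stay untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `num_rows` result flags to `column`: 1 for rows whose index is in
// `matching_rows`, 0 otherwise. With `negate` set, only the newly appended
// range is inverted, which models NOT LIKE on top of LIKE semantics.
void fillMatches(std::vector<uint8_t> & column,
                 const std::vector<uint32_t> & matching_rows,
                 size_t num_rows,
                 bool negate)
{
    const size_t old_size = column.size();
    column.resize(old_size + num_rows, 0);

    for (uint32_t row : matching_rows)
        if (row < num_rows)
            column[old_size + row] = 1;

    if (negate)
        for (size_t i = old_size; i < column.size(); ++i)
            column[i] = column[i] ? 0 : 1;
}
```

Without the final loop, a notLike query in exact direct-read mode would return the rows that match the pattern, which is the bug described in the comment above.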

alexey-milovidov · 2026-04-07T00:22:35Z

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

clickhouse-gh · 2026-04-07T18:14:29Z

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.00% | 84.00% | +0.00% |
| Functions | 90.90% | 90.90% | +0.00% |
| Branches | 76.60% | 76.60% | +0.00% |

Changed lines: 77.39% (493/637) | lost baseline coverage: 7 line(s)

Strip broken setting anchors in textindexes.md that reference settings not yet documented. Remove once core adds the setting entries. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

EmeraldShift · 2026-04-07T21:53:38Z

Excited to see this merged! I'm curious how much harder supporting the array tokenizer is? I'm using array to keep the index size down for some really large tables

ahmadov · 2026-04-16T08:35:10Z

> Excited to see this merged! I'm curious how much harder supporting the array tokenizer is? I'm using array to keep the index size down for some really large tables

@EmeraldShift, it should be fairly easy to enable for the array tokenizer. I'll have a look

ahmadov added 3 commits February 27, 2026 00:41

Text index: Use text index dictionary for LIKE/NOT LIKE queries

e230469

fill virtual column with union of pattern matched postings

54c0dc8

Add performance tests

6127e81

ahmadov changed the title ~~Text index: Use text index dictionary for like/notLike queries~~ Text index: Use text index to improve the performance of like/notLike queries Feb 26, 2026

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Feb 26, 2026

ahmadov added 12 commits February 27, 2026 00:56

fix style check

ed722c5

Use optimization only when a direct read mode is applicable

370fbaa

More supported patterns would follow.

disable estimation for patterns

9649a4c

add new setting to enable/disable like optimization

4a742ea

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

5893261

…like-perf

add missing switch/cases for like/notLike functions

8f87ff0

re-order include headers

601ed39

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

786f1a3

…like-perf

fix after merging origin/master

9f856b3

fix fast tests

4c94933

fix style check

78eda7e

disable new setting by default

82ff6d7

ahmadov marked this pull request as ready for review March 3, 2026 20:52

ahmadov requested review from CurtizJ, Ergus and Copilot and removed request for CurtizJ March 4, 2026 08:44

Copilot started reviewing on behalf of ahmadov March 4, 2026 08:45

Copilot AI reviewed Mar 4, 2026

ahmadov added 2 commits March 4, 2026 11:19

remove redundant assignment

5536368

fix for notLike queries

ec105c5

just negate the result array

CurtizJ reviewed Mar 4, 2026

add tests

c433d62

clickhouse-gh Bot reviewed Mar 31, 2026

Comment thread src/Storages/MergeTree/MergeTreeReaderTextIndex.cpp

rschu1ze reviewed Mar 31, 2026

ahmadov added 5 commits April 1, 2026 16:25

address pr review comments

eb23e80

Rename setting to use_text_index_like_evaluation_by_dictionary_scan

09c3274

update docs

6aa2711

Make member variable names explicit

051d9fe

Add performance tuning section for LIKE/ILIKE queries

f285f8a

clickhouse-gh Bot reviewed Apr 1, 2026

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

da6cdc9

…like-perf

clickhouse-gh Bot reviewed Apr 1, 2026

ahmadov added 3 commits April 1, 2026 18:36

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

d0bba68

…like-perf

fix perf bench build

c5785a0

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

c0a0970

…like-perf

clickhouse-gh Bot reviewed Apr 2, 2026

Comment thread docs/en/engines/table-engines/mergetree-family/textindexes.md

ahmadov enabled auto-merge April 2, 2026 12:20

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

decfa76

…like-perf

clickhouse-gh Bot reviewed Apr 2, 2026

Merge remote-tracking branch 'origin/master' into ahmadov/text-index-…

8d09073

…like-perf

ahmadov added this pull request to the merge queue Apr 7, 2026

Merged via the queue into master with commit 639aa60 Apr 7, 2026
158 of 163 checks passed

ahmadov deleted the ahmadov/text-index-like-perf branch April 7, 2026 20:22

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 7, 2026

dhtclk mentioned this pull request Apr 7, 2026

Fix broken textindex anchors from core ClickHouse/clickhouse-docs#5964

Merged

1 task

ahmadov mentioned this pull request Apr 16, 2026

Text index: support array tokenizer for the LIKE optimization #102880

Merged

1 task


8 participants
