
Introduce new hasPhrase function#101997

Merged
ahmadov merged 9 commits intomasterfrom
ahmadov/match-phrase-function
Apr 10, 2026

Conversation

Member

@ahmadov ahmadov commented Apr 7, 2026

The first stage of #101473. The new function does not depend on how the positional data will be stored.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Adds a hasPhrase (alias matchPhrase) function for phrase search (continuous sequences of tokens). Search is brute-force, i.e. not supported by the text index yet.

Currently, phrase search is not supported, even without the text index. This new function adds support for it by applying a brute-force search.
Contributor

clickhouse-gh Bot commented Apr 7, 2026

Workflow [PR], commit [3b03b2d]

Summary:

| job_name | test_name | status | info | comment |
| --- | --- | --- | --- | --- |
| AST fuzzer (amd_debug, targeted) | | failure | | |
| | Assertion `px != 0' failed (STID: 0250-3d88) | FAIL | cidb | |
| AST fuzzer (amd_debug, targeted, old_compatibility) | | failure | | |
| | Assertion `px != 0' failed (STID: 0250-3d88) | FAIL | cidb | |

AI Review

Summary

This PR adds a new string-search function hasPhrase (with alias matchPhrase) that matches consecutive token sequences using tokenizer-aware tokenization and a KMP-style matcher, plus comprehensive stateless tests. I reviewed all changed files and did not find correctness, safety, concurrency, or performance blockers in the current patch.
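As an illustration of what a KMP-style matcher over tokens (rather than characters) might look like, here is a minimal sketch. It is not the code from this PR: `buildFailure` and `kmpTokenMatch` are hypothetical names, tokens are assumed to be already extracted into vectors, and treating an empty phrase as "no match" is an arbitrary choice made for the sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Prefix (failure) function of KMP, computed over whole tokens instead of characters.
// fail[i] is the length of the longest proper prefix of phrase[0..i] that is also its suffix.
std::vector<std::size_t> buildFailure(const std::vector<std::string> & phrase)
{
    std::vector<std::size_t> fail(phrase.size(), 0);
    for (std::size_t i = 1, k = 0; i < phrase.size(); ++i)
    {
        while (k > 0 && phrase[i] != phrase[k])
            k = fail[k - 1];
        if (phrase[i] == phrase[k])
            ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns true if `phrase` occurs as a run of consecutive tokens in `haystack`.
// Assumption of this sketch: an empty phrase never matches.
bool kmpTokenMatch(const std::vector<std::string> & haystack, const std::vector<std::string> & phrase)
{
    if (phrase.empty())
        return false;

    const std::vector<std::size_t> fail = buildFailure(phrase);
    std::size_t matched = 0; // number of phrase tokens matched so far

    for (const std::string & token : haystack)
    {
        // On mismatch, fall back to the longest shorter prefix that can still match.
        while (matched > 0 && token != phrase[matched])
            matched = fail[matched - 1];
        if (token == phrase[matched])
            ++matched;
        if (matched == phrase.size())
            return true;
    }
    return false;
}
```

Because the matcher only consumes one token at a time and never looks back at the input, the same state machine could be fed from a streaming tokenizer callback without materializing the haystack tokens.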

Missing context
  • ⚠️ Full CI signal is not yet available (PR status is still pending), so this review is code-focused.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Apr 7, 2026
Comment thread src/Functions/matchPhrase.cpp Outdated
@ahmadov ahmadov requested review from CurtizJ, Ergus and rschu1ze April 8, 2026 11:55
@rschu1ze rschu1ze self-assigned this Apr 9, 2026
Comment thread src/Functions/matchPhrase.cpp Outdated
Comment thread tests/queries/0_stateless/02346_function_matchPhrase.sql Outdated
Comment thread src/Functions/matchPhrase.cpp Outdated
Comment thread src/Functions/matchPhrase.cpp Outdated
Comment thread src/Functions/matchPhrase.cpp Outdated
Member

@Ergus Ergus left a comment


Overall LGTM, but I added a few comments that are worth addressing before merging.

My question with this is how is this different from preexisting functions like multiSearch or substring family of functions?

Comment thread src/Functions/matchPhrase.cpp Outdated
if (const auto * col_input_string = checkAndGetColumn<ColumnString>(col_input.get()))
    executeMatchPhrase(*col_input_string, col_result, input_rows_count, tokenizer, phrase_tokens);
else if (const auto * col_input_fixedstring = checkAndGetColumn<ColumnFixedString>(col_input.get()))
    executeMatchPhrase(*col_input_fixedstring, col_result, input_rows_count, tokenizer, phrase_tokens);
Member

@Ergus Ergus Apr 9, 2026


else 
   UNREACHABLE()

or

else
   throw Exception...

It is always safer to add a last-resort guard error.

Member Author


Actually, this cannot happen because of the validator. I removed similar checks in #101997 (comment).

Member


But you can change the validator and forget this, or the other way around ;)
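The guard discussed in this thread could look roughly like the sketch below. It is self-contained and does not use ClickHouse's `IColumn`, `checkAndGetColumn`, or `Exception` types; `IColumnSketch`, the derived structs, and `dispatchByColumnType` are hypothetical stand-ins, with `dynamic_cast` and `std::logic_error` playing the roles of the real helpers.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for a column hierarchy (not ClickHouse's real classes).
struct IColumnSketch
{
    virtual ~IColumnSketch() = default;
};
struct StringColumnSketch : IColumnSketch {};
struct FixedStringColumnSketch : IColumnSketch {};
struct NumberColumnSketch : IColumnSketch {};

// Dispatches on the dynamic column type and keeps a final `else` as a guard.
// Even if an argument validator rejects other types earlier, the guard turns a
// future mismatch between validator and dispatch into a loud error instead of
// a silently skipped branch.
std::string dispatchByColumnType(const IColumnSketch & column)
{
    if (dynamic_cast<const StringColumnSketch *>(&column))
        return "string path";
    else if (dynamic_cast<const FixedStringColumnSketch *>(&column))
        return "fixed string path";
    else
        throw std::logic_error("Unexpected column type: validator and dispatch are out of sync");
}
```

The trade-off raised by both sides still applies: the guard is dead code as long as the validator is correct, and it only pays off if the two places drift apart.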

Comment thread src/Functions/matchPhrase.cpp Outdated
Comment thread src/Functions/matchPhrase.cpp Outdated
Comment thread tests/queries/0_stateless/02346_function_matchPhrase.sql Outdated
template <typename OnMatchCallback>
auto operator()(OnMatchCallback && onMatchCallback)
{
    return [&](const char * token_start, size_t token_len)
Member


Some time ago I had an issue with code like this:

onMatchCallback is bound as a forwarding reference in the scope of operator(), but the lambda captures it by reference. So the reference variable conceptually goes out of scope when operator() returns; in principle this is a potential dangling reference.

C++ hides this with a rule: when a reference-type variable is captured by reference, the lambda captures the referent (not the reference variable itself). So the closure holds a reference directly to the original temporary callback, which stays alive until the end of the full expression.

Not a bug in this code, but it is a reason why I don't like complex lambdas ;)

Do we really need to return a lambda?
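To make the lifetime concern concrete, here is a small sketch contrasting the two capture styles. `makeHandlerByRef` and `makeHandlerByValue` are hypothetical names, not functions from this PR. The by-reference version is only valid while the callable it refers to is alive; the by-value version owns a copy (or a moved-in instance) and can safely be stored.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Captures the callback by reference. The closure refers to the caller's
// callable, so it must not outlive it. Passing a temporary is only safe if the
// returned lambda is used within the same full expression.
template <typename Callback>
auto makeHandlerByRef(Callback && on_match)
{
    return [&](std::string_view token) { on_match(token); };
}

// Captures the callback by value (moved for rvalues, copied for lvalues).
// The closure owns its callable, so it can be stored and called later.
template <typename Callback>
auto makeHandlerByValue(Callback && on_match)
{
    return [on_match = std::forward<Callback>(on_match)](std::string_view token) mutable { on_match(token); };
}
```

For example, `auto handler = makeHandlerByValue([&count](std::string_view) { ++count; });` can be called later, whereas the same call through `makeHandlerByRef` would leave `handler` referring to a destroyed temporary.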

Member Author

@ahmadov ahmadov Apr 9, 2026


forEachToken accepts a callback (lambda) to process each token. We could implement the matching directly there as:

for (size_t i = 0; i < input_rows_count; ++i)
{
    std::string_view input = col_input.getDataAt(i);
    col_result[i] = 0;

    forEachToken(*tokenizer, input.data(), input.size(), [...](...) {
       ...
    });
}

but I prefer to extract the matching logic to its own place and keep the implementation simpler. So, in any case we would need a lambda.
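A rough sketch of that separation, under several assumptions: `forEachTokenSketch` is a hypothetical stand-in for forEachToken that splits on non-alphanumeric ASCII bytes (the real tokenizers are configurable), and `PhraseWindowMatcher` uses a simple sliding window of the last N tokens instead of the KMP-style matcher from the PR. The point is only the shape: the tokenizer drives a callback, and the matching state lives in its own type.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <deque>
#include <string_view>
#include <vector>

// Hypothetical stand-in for forEachToken: calls on_token(start, length) for
// every maximal run of ASCII alphanumeric characters in `text`.
template <typename OnToken>
void forEachTokenSketch(std::string_view text, OnToken && on_token)
{
    std::size_t pos = 0;
    while (pos < text.size())
    {
        while (pos < text.size() && !std::isalnum(static_cast<unsigned char>(text[pos])))
            ++pos;
        const std::size_t start = pos;
        while (pos < text.size() && std::isalnum(static_cast<unsigned char>(text[pos])))
            ++pos;
        if (pos > start)
            on_token(text.data() + start, pos - start);
    }
}

// Matching state extracted into its own type: keeps the last N tokens and
// compares them with the phrase after every new token.
struct PhraseWindowMatcher
{
    std::vector<std::string_view> phrase;
    std::deque<std::string_view> window;
    bool matched = false;

    void feed(const char * token_start, std::size_t token_len)
    {
        if (matched || phrase.empty())
            return;
        window.emplace_back(token_start, token_len);
        if (window.size() > phrase.size())
            window.pop_front();
        matched = window.size() == phrase.size() && std::equal(window.begin(), window.end(), phrase.begin());
    }
};

// Per-row driver: the lambda passed to the tokenizer only forwards tokens.
bool rowHasPhrase(std::string_view input, const std::vector<std::string_view> & phrase_tokens)
{
    PhraseWindowMatcher matcher{phrase_tokens, {}, false};
    forEachTokenSketch(input, [&matcher](const char * start, std::size_t len) { matcher.feed(start, len); });
    return matcher.matched;
}
```

Either way a lambda is involved, as noted above; the difference is only whether it contains the matching logic or merely forwards each token to it.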

Member Author

ahmadov commented Apr 9, 2026

My question with this is how is this different from preexisting functions like multiSearch or substring family of functions?

@Ergus, neither multiSearch nor the substring search functions remove tokenizer separators between tokens. See these tests:

SELECT '-- tokenizer separators in phrase are removed before matching';
SELECT matchPhrase('error: connection refused', 'error---connection');
SELECT matchPhrase('error: connection refused', 'error:connection');
SELECT matchPhrase('one two three', 'one...two...three');
SELECT matchPhrase('one two three', 'one!@#two$%^three');

It is similar to hasAllTokens(input, 'token_1 token_2 ... token_n'), but it makes sure the tokens are consecutive.
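The distinction can be sketched in a few lines. This is an illustration only: `splitTokens` is a hypothetical tokenizer that splits on any non-alphanumeric ASCII byte (an assumption, the real tokenizers differ), and `containsAllTokens` / `containsPhrase` are simplified stand-ins for the semantics of hasAllTokens and hasPhrase, not their implementations. Returning false for an empty phrase is an arbitrary choice of the sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical tokenizer: every non-alphanumeric ASCII byte is a separator,
// so "error---connection" and "error:connection" yield the same tokens.
std::vector<std::string_view> splitTokens(std::string_view text)
{
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < text.size())
    {
        while (pos < text.size() && !std::isalnum(static_cast<unsigned char>(text[pos])))
            ++pos;
        const std::size_t start = pos;
        while (pos < text.size() && std::isalnum(static_cast<unsigned char>(text[pos])))
            ++pos;
        if (pos > start)
            tokens.push_back(text.substr(start, pos - start));
    }
    return tokens;
}

// All needle tokens occur somewhere in the input, in any order and position.
bool containsAllTokens(std::string_view input, std::string_view needles)
{
    const auto haystack = splitTokens(input);
    const auto wanted = splitTokens(needles);
    return std::all_of(wanted.begin(), wanted.end(), [&](std::string_view token)
    {
        return std::find(haystack.begin(), haystack.end(), token) != haystack.end();
    });
}

// The phrase tokens occur as one consecutive run, in the given order.
bool containsPhrase(std::string_view input, std::string_view phrase)
{
    const auto haystack = splitTokens(input);
    const auto wanted = splitTokens(phrase);
    if (wanted.empty())
        return false; // arbitrary choice for this sketch
    return std::search(haystack.begin(), haystack.end(), wanted.begin(), wanted.end()) != haystack.end();
}
```

With this sketch, `containsAllTokens("one two three", "three one")` holds while `containsPhrase("one two three", "three one")` does not, because the tokens are present but not consecutive in that order.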

Comment thread src/Functions/hasPhrase.cpp
@ahmadov ahmadov changed the title Introduce new matchPhrase function Introduce new hasPhrase function Apr 9, 2026
@ahmadov ahmadov enabled auto-merge April 9, 2026 21:49
Contributor

clickhouse-gh Bot commented Apr 10, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 98.81% (166/168) · Uncovered code

Full report · Diff report

@ahmadov ahmadov added this pull request to the merge queue Apr 10, 2026
Merged via the queue into master with commit 39834d0 Apr 10, 2026
160 of 163 checks passed
@ahmadov ahmadov deleted the ahmadov/match-phrase-function branch April 10, 2026 03:23
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 10, 2026

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo


4 participants