Conversation
Currently, the phrase search is not supported even without the text index. This new function adds the support for it by applying the brute-force search.
|
Workflow [PR], commit [3b03b2d] Summary: ❌
AI ReviewSummaryThis PR adds a new string-search function Missing context
ClickHouse Rules
Final Verdict
|
Ergus
left a comment
There was a problem hiding this comment.
Overall LGTM, but the added a few comment it worth address before merging.
My question with this is how is this different from preexisting functions like multiSearch or substring family of functions?
| if (const auto * col_input_string = checkAndGetColumn<ColumnString>(col_input.get())) | ||
| executeMatchPhrase(*col_input_string, col_result, input_rows_count, tokenizer, phrase_tokens); | ||
| else if (const auto * col_input_fixedstring = checkAndGetColumn<ColumnFixedString>(col_input.get())) | ||
| executeMatchPhrase(*col_input_fixedstring, col_result, input_rows_count, tokenizer, phrase_tokens); |
There was a problem hiding this comment.
else
UNREACHABLE()
or
else
throw Exception...
It is safe to always add a last resource guard error.
There was a problem hiding this comment.
actually, cannot happen because of the validator. I removed the similar checks #101997 (comment)
There was a problem hiding this comment.
But you can change the validator and forget this, or the other way around ;)
| template <typename OnMatchCallback> | ||
| auto operator()(OnMatchCallback && onMatchCallback) | ||
| { | ||
| return [&](const char * token_start, size_t token_len) |
There was a problem hiding this comment.
Some time ago I had an issue with code like this:
onMatchCallback is captured to the current scope in a move. But the lambda captures it by reference. So it conceptually goes OOS when this operator() function returns... in principle this is a potential dangling pointer.
C++ compilers hide this with a trick: when a reference-type variable is captured by reference, C++ captures the referent (not the reference variable itself). So the closure holds a reference directly to the original temporary callback, which stays alive for the full expression.
Not a bug in this code. But a reason why I don't like complex lambdas ;)
Do we really need to return a lambda?
There was a problem hiding this comment.
forEachToken accepts a callback (lambda) to process each tokenized token. We can directly do that implementation there as
for (size_t i = 0; i < input_rows_count; ++i)
{
std::string_view input = col_input.getDataAt(i);
col_result[i] = 0;
forEachToken(*tokenizer, input.data(), input.size(), [...](...) {
...
});
}but I prefer to extract the matching logic to its own place and keep the implementation simpler. So, in any case we would need a lambda.
@Ergus, both multiSearch or substring search functions do not remove tokenizer separators in between. see these tests: ClickHouse/tests/queries/0_stateless/02346_function_matchPhrase.sql Lines 49 to 53 in 31f5790 It is similar to |
Still uses matchPhrase as an alias of hasPhrase
matchPhrase functionhasPhrase function
LLVM Coverage Report
Changed lines: 98.81% (166/168) · Uncovered code |
The first stage for the #101473. The new function is independent of how the positional data would be stored.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Adds a
hasPhrase(aliasmatchPhrase) function for phrase search (continuous sequences of tokens). Search is brute-force, i.e. not supported by the text index yet.