fix SplitTokenExtractor::nextInStringLike skipping tokens #72264
ozcelgozde wants to merge 5 commits into ClickHouse:master
This is an automated comment for commit 2864374 with a description of existing statuses. It is updated for the latest CI run.
@ozcelgozde A test would be much appreciated. But these extractors are also available as functions, and I don't see a problem here: And the tests have failed...
@nikitamikhaylov Hey, thanks for the reply. This case only happens during bloom filter (tokenbf_v1) index calculation, so the regular tokens function does not hit it. I see that the tests related to that area are failing, so I will take a look at them. then tries to do a token search with like '%x%', but since tokens('hhhhhhhhhhhhhhhhhhhhhhhhhxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyy1342957354 this not returning any value is valid, wdyt? Before this PR this statement: was skipping x as a token; that's why the tests were returning all 1000 values.
@ozcelgozde Check the tests.
Dear @alexey-milovidov, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
This sounds invalid to me. In your original example, a search Recently I think there is a new function |
@EmeraldShift, yes, exactly! And we introduced these new functions recently. |
We use a lot of LIKE searches with a tokenbf index, and I noticed that the performance was really bad. After looking through the code and seeing a few more issues opened here, I noticed that SplitTokenExtractor::nextInStringLike skips some crucial tokens. For example, if I run a search like this: body like '%test%', test is completely skipped as a token and the query becomes a full db scan. See also issues 72065 and 68985. I'm not sure how to proceed from here, so some feedback would be appreciated.
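To make the behaviour under discussion concrete, here is a minimal sketch of a LIKE-pattern token extractor. It is not the ClickHouse implementation: `splitTokens` and `likePatternTokens` are hypothetical helper names, and the sketch assumes ASCII-only input and treats every non-alphanumeric character as a separator. It illustrates why a token that touches a wildcard is dropped: the wildcard may extend the word in the matched row, so the pattern fragment is not guaranteed to be a whole token there.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split plain text into alphanumeric tokens (ASCII only).
std::vector<std::string> splitTokens(const std::string & text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text)
    {
        if (std::isalnum(c))
        {
            current += static_cast<char>(c);
        }
        else if (!current.empty())
        {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Hypothetical helper: collect the tokens of a LIKE pattern that must appear
// as whole tokens in every matching string. A token that touches an unescaped
// wildcard (% or _) on either side is dropped.
std::vector<std::string> likePatternTokens(const std::string & pattern)
{
    std::vector<std::string> tokens;
    std::string current;
    bool touches_wildcard = false;  // a wildcard was seen since the last separator
    bool escaped = false;           // previous character was an unescaped backslash

    for (unsigned char c : pattern)
    {
        if (!escaped && (c == '%' || c == '_'))
        {
            // The wildcard may stand for more word characters: discard the token.
            current.clear();
            touches_wildcard = true;
        }
        else if (!escaped && c == '\\')
        {
            escaped = true;
        }
        else if (!std::isalnum(c))
        {
            // Separator (including an escaped % or _): the token is complete.
            if (!touches_wildcard && !current.empty())
                tokens.push_back(current);
            current.clear();
            touches_wildcard = false;
            escaped = false;
        }
        else
        {
            current += static_cast<char>(c);
            escaped = false;
        }
    }
    if (!touches_wildcard && !current.empty())
        tokens.push_back(current);
    return tokens;
}
```

Under these assumptions, `likePatternTokens("%test%")` yields no tokens, while `likePatternTokens("% test %")` yields `test`. A row containing `contest` matches `'%test%'`, yet `splitTokens` gives it the single token `contest`, so requiring `test` in a token bloom filter could skip a granule that holds a matching row.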
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Refactored token extraction logic so that valid tokens are not prematurely reset by wildcards (% or _).