Wildcard query support with suffix trie #2932
👍
Added some comments. My main concern is the assumption of NULL-terminated strings. Maybe we should just organise and document those assumptions.
Also, I did not see tests that verify it on FT.AGGREGATE and on the coordinator. I agree it should not make any difference, but let's verify.
Also, is there any documentation that we need to update?
* parser work * wip * TEXT field is working * wip * TAG works * benchmark * wip * working w/o params * with params * clean params * cleanup + streamline * added TM_WILDCARD_FIXED_LEN_MODE * fix tests * llvm warning clear * add ft.profile tests * change to Wildcard_MatchType * add unit tests * add comments * suffix support for TAG * wip * quit early with no stars * per meir review * update comment * wip * wildcard Suffix TEXT * clean * fix * cleanup * clean * add score by suffix token index * fix leak * better func names * add upper lower case test * per review comments * fallback from suffix to brute force * remove redundant function * remove extra environment * fix test
This PR complements PR #2886 and adds an optimization to wildcard queries, utilizing the suffix trie that was introduced in PR #2774 for contains queries.
The wildcard pattern is broken into tokens at the '*' character, and an estimate is made to find the token that would require the least processing.
The suffix trie is then iterated, and the payload of the relevant nodes, which contains a list of words, is returned.
Each word in the list is checked against the pattern, and all matches are added to the union iterator.
If a wildcard pattern cannot use the suffix trie (tokens must be 2 characters or longer), the brute-force function is used instead.
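To illustrate the step where each candidate word is checked against the pattern, here is a minimal sketch of a wildcard matcher. It is not the code from this PR: the function name `wildcard_match` is hypothetical, and it assumes that '*' matches any run of characters (including none) and '?' matches exactly one character. It takes explicit lengths, so it does not rely on NUL-terminated input.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_STAR ((size_t)-1)

/* Hypothetical helper, not the RediSearch implementation.
 * Returns true if word[0..wlen) matches pat[0..plen).
 * Assumed semantics: '*' = any run of chars (possibly empty), '?' = one char. */
static bool wildcard_match(const char *pat, size_t plen,
                           const char *word, size_t wlen) {
  size_t p = 0, w = 0;
  size_t star = NO_STAR; /* index of the most recent '*' in pat */
  size_t mark = 0;       /* word position that '*' currently extends to */

  while (w < wlen) {
    if (p < plen && pat[p] == '*') {
      /* Remember the star and first try matching it against nothing. */
      star = p++;
      mark = w;
    } else if (p < plen && (pat[p] == '?' || pat[p] == word[w])) {
      p++;
      w++;
    } else if (star != NO_STAR) {
      /* Mismatch: let the last '*' absorb one more character and retry. */
      p = star + 1;
      w = ++mark;
    } else {
      return false;
    }
  }
  /* Only trailing stars may remain in the pattern. */
  while (p < plen && pat[p] == '*') p++;
  return p == plen;
}
```

With a matcher of this shape, every word taken from a suffix trie payload would be tested once, and only the words for which it returns true would be added to the union iterator.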
Main changes:
* Suffix_StarBreak functions receive either a char or a rune pattern and break it into tokens at '*'. A heuristic is used to determine the best token to use on the suffix trie. The initial score is the length of the token. Each '?' subtracts one point, and a '*' at the end subtracts a further 5 points (since all children have to be iterated).
* SuffixCtx was added for ease of use.
* TrieRangeCallback, the callback of the Trie iterator, now returns the payload. It is used to access the payload, which contains the SuffixData struct with all matching words.
* recursiveAdd now returns an error. The error can be produced by the callback function if maxPrefixExpansions is reached.

related to #2886
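The token-scoring heuristic described for Suffix_StarBreak under "Main changes" could be sketched roughly as follows. This is an illustration, not the PR's code: `choose_token`, `TokenChoice`, and the two constants are hypothetical names. It assumes that "a '*' at the end" means the token is immediately followed by '*', that tokens shorter than 2 characters are simply skipped as candidates, and that ties go to the earliest token.

```c
#include <stdbool.h>
#include <stddef.h>

#define MIN_TOKEN_LEN 2         /* PR text: tokens must be 2 chars or more */
#define TRAILING_STAR_PENALTY 5 /* PR text: a '*' at the end costs 5 points */

/* Hypothetical result type: where the chosen token sits in the pattern. */
typedef struct {
  size_t off;  /* offset of the token inside the pattern */
  size_t len;  /* token length */
  int score;   /* heuristic score of the token */
  bool found;  /* false: no usable token, caller falls back to brute force */
} TokenChoice;

/* Splits pat[0..plen) at '*' and returns the highest-scoring token. */
static TokenChoice choose_token(const char *pat, size_t plen) {
  TokenChoice best = {0, 0, 0, false};
  size_t i = 0;
  while (i < plen) {
    if (pat[i] == '*') { i++; continue; } /* skip separators */
    size_t start = i;
    int qmarks = 0;
    while (i < plen && pat[i] != '*') {
      if (pat[i] == '?') qmarks++;
      i++;
    }
    size_t len = i - start;
    if (len < MIN_TOKEN_LEN) continue;    /* too short for the suffix trie */
    int score = (int)len - qmarks;        /* length, minus one per '?' */
    if (i < plen) score -= TRAILING_STAR_PENALTY; /* token is followed by '*' */
    if (!best.found || score > best.score) {
      best = (TokenChoice){start, len, score, true};
    }
  }
  return best;
}
```

For the pattern `foo*ba?*quux`, this sketch scores `foo` at 3 - 5 = -2, `ba?` at 3 - 1 - 5 = -3, and `quux` at 4, so `quux` is chosen. For `a*b` no token reaches the minimum length, so `found` stays false and a caller would use the brute-force path.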
MOD-3284
MOD-3752