Wildcard query support with suffix trie #2932

Merged 52 commits on Aug 8, 2022

@ashtul (Contributor) commented Jul 24, 2022

This PR complements PR #2886 by optimizing wildcard queries with the suffix trie that PR #2774 introduced for contains queries.
The wildcard pattern is split into tokens at each '*' character, and a heuristic estimates which token will require the least processing.
The suffix trie is then iterated with that token, and the payload of each relevant node, which contains a list of words, is returned.
Each word in the list is checked against the full pattern, and all matches are added to the union iterator.
If a wildcard pattern cannot use the suffix trie (tokens must be at least 2 characters long), the brute-force function is used instead.
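
To illustrate the "check each word against the pattern" step, here is a minimal sketch of a '*' / '?' matcher. It is not the code from this PR: `wildcard_match` and `count_matches` are hypothetical names, and the sketch assumes NUL-terminated byte strings, with '*' matching any run of characters (including none) and '?' matching exactly one character.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not the PR's function.
 * Assumes both arguments are NUL-terminated byte strings. */
bool wildcard_match(const char *pattern, const char *word) {
  const char *star = NULL;  /* most recent '*' seen in the pattern */
  const char *retry = NULL; /* word position to resume from on backtrack */

  while (*word != '\0') {
    if (*pattern == '*') {
      /* Remember the star and first try matching it against nothing. */
      star = pattern++;
      retry = word;
    } else if (*pattern == '?' || *pattern == *word) {
      /* Single-character match. */
      ++pattern;
      ++word;
    } else if (star != NULL) {
      /* Mismatch: let the last '*' absorb one more character. */
      pattern = star + 1;
      word = ++retry;
    } else {
      return false;
    }
  }

  /* Only trailing stars may remain once the word is consumed. */
  while (*pattern == '*') ++pattern;
  return *pattern == '\0';
}

/* Hypothetical stand-in for filtering a node's word list:
 * counts how many candidate words match the full pattern. */
size_t count_matches(const char *pattern, const char *const *words, size_t n) {
  size_t hits = 0;
  for (size_t i = 0; i < n; ++i) {
    if (wildcard_match(pattern, words[i])) ++hits;
  }
  return hits;
}
```

In this sketch, the candidate list would come from the suffix-trie payload, and each hit would be added to the union iterator instead of being counted.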

Main changes:

  • Suffix_StarBreak functions receive either a char or a rune pattern and break it into tokens at '*'. A heuristic determines the best token to use on the suffix trie: the initial score is the token's length, each '?' deducts one point, and a '*' at the end deducts a further 5 points (since all children have to be iterated).
  • SuffixCtx was added for ease of use.
  • TrieRangeCallback, the callback of the Trie iterator, now returns the payload. It is used to access the payload, which contains the SuffixData struct with all matching words.
  • Fixed recursiveAdd to return an error. The callback function can produce this error when maxPrefixExpansions is reached.
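
The token-selection heuristic from the first bullet can be sketched roughly as follows. This is an illustration based only on the description above, not the Suffix_StarBreak implementation: `score_token` and `choose_token` are hypothetical names, the pattern is assumed to be NUL-terminated, and "a '*' at the end" is read here as "the token is immediately followed by '*'".

```c
#include <limits.h>
#include <stddef.h>

#define MIN_TOKEN_LEN 2 /* per the description: tokens must be 2 chars or more */
#define STAR_PENALTY 5  /* per the description: a trailing '*' costs 5 points */

/* Hypothetical helper: score one token of `len` chars starting at `tok`.
 * `followed_by_star` is nonzero when a '*' comes right after the token. */
int score_token(const char *tok, size_t len, int followed_by_star) {
  int score = (int)len; /* initial score is the token length */
  for (size_t i = 0; i < len; ++i) {
    if (tok[i] == '?') --score; /* each '?' deducts one point */
  }
  if (followed_by_star) score -= STAR_PENALTY;
  return score;
}

/* Hypothetical helper: split a NUL-terminated pattern at '*' and pick the
 * highest-scoring token. Returns the token's offset in `pattern` and stores
 * its length in *out_len, or returns -1 when no token qualifies (in which
 * case a caller would fall back to brute force). Ties keep the earlier token. */
long choose_token(const char *pattern, size_t *out_len) {
  long best = -1;
  int best_score = INT_MIN;
  size_t i = 0;

  while (pattern[i] != '\0') {
    if (pattern[i] == '*') { /* skip separators */
      ++i;
      continue;
    }
    size_t start = i;
    while (pattern[i] != '\0' && pattern[i] != '*') ++i;
    size_t len = i - start;
    if (len < MIN_TOKEN_LEN) continue; /* too short for the suffix trie */

    int score = score_token(pattern + start, len, pattern[i] == '*');
    if (score > best_score) {
      best_score = score;
      best = (long)start;
      *out_len = len;
    }
  }
  return best;
}
```

For example, under these assumptions the pattern `ab*hello` yields the token `hello` (score 5) rather than `ab` (score 2 - 5 = -3), and `a*b` yields no usable token.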

related to #2886
MOD-3284
MOD-3752

@MeirShpilraien (Collaborator) left a comment

👍
Added some comments. My main concern is the assumption of NULL-terminated strings. Maybe we should just organize and document those assumptions.

Also, I did not see tests that verify this with FT.AGGREGATE and on the coordinator. I agree it should not make any difference, but let's verify.

Also, is there any documentation that we need to update?

MeirShpilraien previously approved these changes Aug 4, 2022
@ashtul ashtul merged commit 5e47970 into master Aug 8, 2022
@ashtul ashtul deleted the ariel_wildcard_suffix branch August 8, 2022 13:09
oshadmi pushed a commit that referenced this pull request Aug 9, 2022
* parser work

* wip

* TEXT field is working

* wip

* TAG works

* benchmark

* wip

* working w/o params

* with params

* clean params

* cleanup + streamline

* added TM_WILDCARD_FIXED_LEN_MODE

* fix tests

* llvm warning clear

* add ft.profile tests

* change to Wildcard_MatchType

* add unit tests

* add comments

* suffix support for TAG

* wip

* quit early with no stars

* per meir review

* update comment

* wip

* wildcard Suffix TEXT

* clean

* fix

* cleanup

* clean

* add score by suffix token index

* fix leak

* better func names

* add upper lower case test

* per review comments

* fallback from suffix to brute force

* remove redundant function

* remove extra environment

* fix test