
Index stems with a prefix #227
Merged: 3 commits merged into master from stem-prefix on Nov 29, 2017
Conversation

dvirsky (Contributor) commented Nov 28, 2017

This change ensures that searching with VERBATIM for a word that is both a stem and a valid term will not return documents that don't contain the verbatim term.

For example, suppose we index two documents, one containing the term "going" and one containing the term "go".
Before this PR, a verbatim search for "go" would also return the document containing "going".

After this PR this no longer happens.

This works because we now use two separate indexes for the term and its stem, even when they are identical: we prepend a + to the stem. So "going" is encoded into two indexes, going and +go, while "go" is encoded only into go. Thus in verbatim mode we search only for go, but in expanded mode we search for go OR +go.
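As a minimal sketch of that scheme (Python for illustration only; STEM_PREFIX, toy_stem, and the other names are hypothetical stand-ins, not RediSearch's actual C internals):

```python
# Toy illustration of the stem-prefix scheme described in this PR.
STEM_PREFIX = "+"

def toy_stem(term):
    # Stand-in stemmer: strips a trailing "ing" (the real code uses Snowball).
    # Returns None when the stem would equal the term itself.
    return term[:-3] if term.endswith("ing") else None

def index_terms(term):
    """Return the inverted-index keys a document term is encoded into."""
    keys = [term]
    stem = toy_stem(term)
    if stem is not None:               # stem differs from the term
        keys.append(STEM_PREFIX + stem)
    return keys

def query_terms(term, verbatim):
    """Return the index keys consulted for a query term."""
    if verbatim:
        return [term]                        # verbatim: exact term only
    return [term, STEM_PREFIX + term]        # expanded: term OR +term

assert index_terms("going") == ["going", "+go"]
assert index_terms("go") == ["go"]
assert query_terms("go", verbatim=True) == ["go"]       # cannot match "+go"
assert query_terms("go", verbatim=False) == ["go", "+go"]
```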

…stem with VERBATIM will not return documents not containing this word verbatim
dvirsky (Contributor, Author) commented Nov 28, 2017

@mnunberg the highlighting test fails now; please check and add the fix to this PR.

When prefix matching is used, stems are no longer considered part of their
parent terms, so they no longer share the same inverted index.
It's essentially a matter of chance which term gets selected first
(in this case, likely the first actual prefix expansion?)

The stemmer already ensures that the stem is not identical to the word
itself, so there is no need to check for that in the tokenizer.
dvirsky merged commit 745115c into master on Nov 29, 2017
tw-bert commented Dec 5, 2017

Doesn't this implementation conflict with #70?
A "+" might not be a tokenization character under all future circumstances.
Suggestion: use the prefix "S\x00", where S is a sentinel value with 255 (or even 256) possibilities; see the sketch below.

Sidenote: I didn't get around to actually testing this PR yet.
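For illustration, the sentinel idea might look like this (a hedged sketch; the particular sentinel byte and helper name are hypothetical):

```python
# Hypothetical sentinel-byte prefix per the suggestion above: unlike "+",
# a byte pair like b"\x01\x00" can never be produced by tokenizing text,
# so it stays safe even if "+" stops being a separator character.
STEM_SENTINEL = bytes([0x01, 0x00])  # "S\x00", with S free to vary

def stem_key(stem: str) -> bytes:
    """Build the inverted-index key for a stem (illustrative only)."""
    return STEM_SENTINEL + stem.encode("utf-8")

assert stem_key("go") == b"\x01\x00go"
```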

dvirsky (Contributor, Author) commented Dec 5, 2017

I'm not sure I'll ever do #70; it has too many limitations. I might allow custom tokenization instead. But we can always escape punctuation marks internally, so it won't be a problem.
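A sketch of what such internal escaping could look like (hypothetical helper; not RediSearch code):

```python
def escape_stem_prefix(term: str) -> str:
    # Hypothetical internal escaping: if "+" ever becomes legal inside a
    # user term, a leading "+" is escaped so the term's index key can
    # never collide with a "+stem" key.
    return "\\" + term if term.startswith("+") else term

assert escape_stem_prefix("+go") == "\\+go"   # user term, escaped
assert escape_stem_prefix("go") == "go"       # unaffected
```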

tw-bert commented Dec 5, 2017

Sounds good, thanks for the extra info. I hope custom tokenization and custom collation will be supported at some point; they have many use cases.

dvirsky (Contributor, Author) commented Dec 5, 2017

We're actually not that far off: when making the change to support Chinese tokenization, we did a little refactoring, and internally there is now a "pluggable" architecture for tokenizers. It's just a matter of exposing it in the extension API.
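In sketch form, such a pluggable arrangement might look like the following (Python for illustration; the real interface lives in RediSearch's C code, and every name here is hypothetical):

```python
from typing import Callable, Dict, Iterator

# A tokenizer maps raw text to a stream of terms.
Tokenizer = Callable[[str], Iterator[str]]

def default_tokenizer(text: str) -> Iterator[str]:
    # Stand-in for a built-in tokenizer: lowercase, split on whitespace.
    for tok in text.lower().split():
        yield tok

# The "pluggable" part: a registry the extension API could write into.
TOKENIZERS: Dict[str, Tokenizer] = {"default": default_tokenizer}

def register_tokenizer(name: str, tokenizer: Tokenizer) -> None:
    """What exposing this in the extension API might amount to."""
    TOKENIZERS[name] = tokenizer

# Example: a per-character tokenizer, as a CJK-style plug-in might use.
register_tokenizer("cjk", lambda text: iter(text.replace(" ", "")))
assert list(TOKENIZERS["cjk"]("你 好")) == ["你", "好"]
```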

tw-bert commented Dec 5, 2017

Interesting. And collation?

dvirsky (Contributor, Author) commented Dec 5, 2017

@tw-bert not a big change either. It's already an "interface", but right now there is just one hard-coded implementation.

tw-bert commented Dec 5, 2017

@dvirsky Great

mnunberg deleted the stem-prefix branch on May 14, 2018