-
Notifications
You must be signed in to change notification settings - Fork 515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index stems with a prefix #227
Conversation
…stem with VERBATIM will not return documents not containing this word verbatim
@mnunberg the highlighting test fails now, please check and add the fix to this PR. |
When prefix matching is used, because stems are no longer considered as part of their parents, they are not part of the same inverted index. It's essentially a matter of chance which is the first term that gets selected (in this case, likely the first actual prefix expansion?)
The stemmer already ensures that the stem is not similar to the word itself, so no need to have it within the tokenizer
Doesn't this implementation conflict with #70 ? Sidenote: didn't get around to actually testing this PM yet |
a. I'm not sure I'll ever do #70, it has too many limitations. I might allow custom tokenization. But we can always internally escape punctuation marks, so it won't be a problem. |
Sounds good, thanks for the extra info. I hope custom tokenization and custom collation will be at some time supported, they have many usecases. |
we're actually not that far off - when making the change to support Chinese tokenization, we did a little refactor and internally there is a "pluggable" architecture for tokenizers. It's just a matter of adding support for it in the extension API. |
Interesting. And collation? |
@tw-bert not a big chagned. It's already an "interface" but right now there is just one hard coded. |
@dvirsky Great |
so that searching for a word that's both a stem and a valid term with VERBATIM - will not return documents not containing the verbatim term.
For example, if we index two documents, one with the term "going" and one with the term "go".
Before this PR, searching for "go" verbatim, will return the doc containing "going".
After this PR this won't happen.
This is because we now use two separate indexes for the term and the stem even if they are identical. we prepend a
+
to the stem. So "going" would get encoded into two indexes:going
and+go
. Butgo
will get encoded only intogo
. Thus in verbatim mode, we search only forgo
, but in expanded mode we search forgo OR +go
.