
Index stems with a prefix #227
Merged: 3 commits merged into master from stem-prefix on Nov 29, 2017
Conversation

dvirsky (Contributor) commented Nov 28, 2017

This change ensures that searching with VERBATIM for a word that is both a stem and a valid term will not return documents that don't contain the verbatim term.

For example, suppose we index two documents, one containing the term "going" and one containing the term "go".
Before this PR, a verbatim search for "go" would also return the document containing "going".

After this PR this no longer happens.

This works because we now use two separate indexes for the term and its stem, even when they are identical: we prepend a + to the stem. So "going" is encoded into two indexes, going and +go, while "go" is encoded only into go. Thus in verbatim mode we search only for go, but in expanded mode we search for go OR +go.
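As a minimal sketch of that scheme (Python for illustration only; STEM_PREFIX, toy_stem, and the other names are hypothetical stand-ins, not RediSearch's actual C internals):

```python
# Toy illustration of the stem-prefix scheme described in this PR.
STEM_PREFIX = "+"

def toy_stem(term):
    # Stand-in stemmer: strips a trailing "ing" (the real code uses Snowball).
    # Returns None when the stem would equal the term itself.
    return term[:-3] if term.endswith("ing") else None

def index_terms(term):
    """Return the inverted-index keys a document term is encoded into."""
    keys = [term]
    stem = toy_stem(term)
    if stem is not None:               # stem differs from the term
        keys.append(STEM_PREFIX + stem)
    return keys

def query_terms(term, verbatim):
    """Return the index keys consulted for a query term."""
    if verbatim:
        return [term]                        # verbatim: exact term only
    return [term, STEM_PREFIX + term]        # expanded: term OR +term

assert index_terms("going") == ["going", "+go"]
assert index_terms("go") == ["go"]
assert query_terms("go", verbatim=True) == ["go"]       # cannot match "+go"
assert query_terms("go", verbatim=False) == ["go", "+go"]
```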

…stem with VERBATIM will not return documents not containing this word verbatim
dvirsky (Contributor, Author) commented Nov 28, 2017

@mnunberg the highlighting test fails now; please check and add the fix to this PR.

When prefix matching is used, stems are no longer considered part of their
parent terms, so they no longer share the same inverted index.
It's essentially a matter of chance which term gets selected first
(in this case, likely the first actual prefix expansion?)

The stemmer already ensures that the stem is not identical to the word
itself, so there is no need to check for that in the tokenizer.
dvirsky merged commit 745115c into master on Nov 29, 2017
tw-bert commented Dec 5, 2017

Doesn't this implementation conflict with #70?
A "+" might not be a tokenization character under all future circumstances.
Suggestion: use the prefix "S\x00", where S is a sentinel value with 255 (or even 256) possibilities; see the sketch below.

Sidenote: I didn't get around to actually testing this PR yet.
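For illustration, the sentinel idea might look like this (a hedged sketch; the particular sentinel byte and helper name are hypothetical):

```python
# Hypothetical sentinel-byte prefix per the suggestion above: unlike "+",
# a byte pair like b"\x01\x00" can never be produced by tokenizing text,
# so it stays safe even if "+" stops being a separator character.
STEM_SENTINEL = bytes([0x01, 0x00])  # "S\x00", with S free to vary

def stem_key(stem: str) -> bytes:
    """Build the inverted-index key for a stem (illustrative only)."""
    return STEM_SENTINEL + stem.encode("utf-8")

assert stem_key("go") == b"\x01\x00go"
```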

dvirsky (Contributor, Author) commented Dec 5, 2017

I'm not sure I'll ever do #70; it has too many limitations. I might allow custom tokenization instead. But we can always escape punctuation marks internally, so it won't be a problem.
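A sketch of what such internal escaping could look like (hypothetical helper; not RediSearch code):

```python
def escape_stem_prefix(term: str) -> str:
    # Hypothetical internal escaping: if "+" ever becomes legal inside a
    # user term, a leading "+" is escaped so the term's index key can
    # never collide with a "+stem" key.
    return "\\" + term if term.startswith("+") else term

assert escape_stem_prefix("+go") == "\\+go"   # user term, escaped
assert escape_stem_prefix("go") == "go"       # unaffected
```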

tw-bert commented Dec 5, 2017

Sounds good, thanks for the extra info. I hope custom tokenization and custom collation will be supported at some point; they have many use cases.

dvirsky (Contributor, Author) commented Dec 5, 2017

We're actually not that far off: when making the change to support Chinese tokenization, we did a little refactoring, and internally there is now a "pluggable" architecture for tokenizers. It's just a matter of exposing it in the extension API.
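In sketch form, such a pluggable arrangement might look like the following (Python for illustration; the real interface lives in RediSearch's C code, and every name here is hypothetical):

```python
from typing import Callable, Dict, Iterator

# A tokenizer maps raw text to a stream of terms.
Tokenizer = Callable[[str], Iterator[str]]

def default_tokenizer(text: str) -> Iterator[str]:
    # Stand-in for a built-in tokenizer: lowercase, split on whitespace.
    for tok in text.lower().split():
        yield tok

# The "pluggable" part: a registry the extension API could write into.
TOKENIZERS: Dict[str, Tokenizer] = {"default": default_tokenizer}

def register_tokenizer(name: str, tokenizer: Tokenizer) -> None:
    """What exposing this in the extension API might amount to."""
    TOKENIZERS[name] = tokenizer

# Example: a per-character tokenizer, as a CJK-style plug-in might use.
register_tokenizer("cjk", lambda text: iter(text.replace(" ", "")))
assert list(TOKENIZERS["cjk"]("你 好")) == ["你", "好"]
```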

tw-bert commented Dec 5, 2017

Interesting. And collation?

dvirsky (Contributor, Author) commented Dec 5, 2017

@tw-bert not a big change either. It's already an "interface", but right now there is just one hard-coded implementation.

tw-bert commented Dec 5, 2017

@dvirsky Great

mnunberg deleted the stem-prefix branch on May 14, 2018