Improvements to StemmerTokenFilter #6452
This looks great: I especially like the cleanup to the code organization and the doc links.
@@ -85,90 +89,116 @@ public TokenStream create(TokenStream tokenStream) {
            return new CzechStemFilter(tokenStream);
        } else if ("danish".equalsIgnoreCase(language)) {
            return new SnowballFilter(tokenStream, new DanishStemmer());

Can you remove this additional line? ;)
The extra lines are there to break up the language blocks, e.g. English vs. French. I've added a comment for each block that contains multiple languages.
@jpountz Please could you check that I'm using the index versions correctly here?
@clintongormley It looks good to me. I really like the fact that we now recommend a stemmer!
* `english` returned the slow snowball English stemmer
* `porter2` returned the snowball Porter stemmer (v1)
* `portuguese` was used twice, preventing the second version from working

Changes:

* `english` now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
* `porter2` now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
* `light_english` now returns the `kstem` stemmer (`kstem` still works)
* `portuguese_rslp` returns the PortugueseStemmer
* `dutch_kp` is a synonym for `kp`

Tests and docs updated

Fixes #6345
Fixes #6213
Fixes #6330
Elasticsearch 1.3.0 changed the stemmer implementation used when the "english" stemmer is requested. Instead of using the stemmer as revised by Martin Porter in 2001/2002 to avoid lots of incorrect over-stemming and other problems, elasticsearch now defaults to using the algorithm defined in 1980. This is an entirely inferior algorithm in terms of the output it produces. To get the better algorithm back, we now need to request the "porter2" algorithm.

This has made me grumpy. The reason that snowball returns the "english" stemmer instead of the "porter" stemmer when it's asked for a stemmer for "english" is exactly because the "porter" stemmer is inferior in every case where it differs from the "english" stemmer. However, elasticsearch elastic/elasticsearch#6452 changed the default, apparently because the implementation of the 1980 Porter stemmer available is faster.

In particular, this change has made a search for "news" on GOV.UK return anything with the word "new" in it.
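To make the "news" complaint concrete, here is a rough sketch of why the two algorithms can disagree on that word. It is not the Lucene or Snowball code: it implements only step 1a of the 1980 Porter algorithm (plural stripping) and, for the revised algorithm, only the list of words that, as far as I recall from the Snowball description of the English stemmer, are returned unchanged. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a fragment of each algorithm, not a full stemmer.
final class NewsStemSketch {

    // Words that the revised English (Porter2) stemmer is assumed to leave
    // untouched; this list is taken from memory of the Snowball description.
    private static final Set<String> PORTER2_INVARIANTS = new HashSet<>(
            Arrays.asList("sky", "news", "howe", "atlas", "cosmos", "bias", "andes"));

    // Step 1a of the 1980 Porter algorithm:
    // SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    static String porterStep1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ss")) {
            return word;
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Porter2 checks its exception list before any suffix rule runs.
    // The fallback to porterStep1a is a stand-in; the real Porter2 rules differ.
    static String porter2Sketch(String word) {
        if (PORTER2_INVARIANTS.contains(word)) {
            return word;
        }
        return porterStep1a(word);
    }

    public static void main(String[] args) {
        System.out.println("1980 Porter, step 1a: " + porterStep1a("news"));
        System.out.println("Porter2 sketch:       " + porter2Sketch("news"));
    }
}
```

Under these assumptions the 1980 rules strip the trailing "s" and collapse "news" into "new", while the revised algorithm keeps "news" as its own term, which is consistent with the search behaviour described above.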
The StemmerTokenFilter had a number of issues:

* `english` returned the slow snowball English stemmer
* `porter2` returned the snowball Porter stemmer (v1)
* `portuguese` was used twice, preventing the second version from working

Changes:

* `english` now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
* `porter2` now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
* `light_english` now returns the `kstem` stemmer (`kstem` still works)
* `portuguese_rslp` returns the PortugueseStemmer
* `dutch_kp` is a synonym for `kp`

Tests and docs updated
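
The name-to-stemmer mapping described above could be sketched roughly as follows. This is a simplified model, not the code in this PR: it returns descriptive labels instead of building Lucene filters, and the class name, the `resolve` method and the integer version encoding are all hypothetical. It only covers the names mentioned in the description.

```java
import java.util.Locale;

// Hypothetical model of the mapping; labels stand in for real stemmer classes.
final class StemmerSelectionSketch {

    // Assumed encoding for illustration: major * 10000 + minor * 100 + patch.
    static final int V_1_3_0 = 10300;

    static String resolve(String language, int indexCreatedVersion) {
        // The new behaviour only applies to indices created from v1.3.0 onwards,
        // so older indices keep the stemmer they were built with.
        boolean onOrAfter130 = indexCreatedVersion >= V_1_3_0;

        switch (language.toLowerCase(Locale.ROOT)) {
            case "english":
                return onOrAfter130 ? "PorterStemmer" : "snowball English stemmer";
            case "porter2":
                return onOrAfter130 ? "snowball English stemmer" : "snowball Porter stemmer";
            case "light_english":
            case "kstem":
                // light_english is the new name; kstem still works.
                return "kstem";
            case "dutch_kp":
            case "kp":
                // dutch_kp is a synonym for kp.
                return "kp";
            case "portuguese_rslp":
                return "PortugueseStemmer";
            default:
                throw new IllegalArgumentException("not covered by this sketch: " + language);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("english", 10200));
        System.out.println(resolve("english", 10300));
    }
}
```

Gating on the version the index was created with means an existing index keeps producing the same tokens at search time as it did at index time, and only new indices pick up the changed defaults.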