Improvements to StemmerTokenFilter #6452

clintongormley · 2014-06-10T15:18:04Z

The StemmerTokenFilter had a number of issues:

* `english` returned the slow snowball English stemmer
* `porter2` returned the snowball Porter stemmer (v1)
* `portuguese` was used twice, preventing the second version from working

Changes:

* `english` now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
* `porter2` now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
* `light_english` now returns the `kstem` stemmer (`kstem` still works)
* `portuguese_rslp` returns the PortugueseStemmer
* `dutch_kp` is a synonym for `kp`

Tests and docs updated
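
For readers who want the new name resolution at a glance, here is a minimal sketch of the behaviour listed above. It is not the code from `StemmerTokenFilterFactory.java`: the class name `StemmerChoiceSketch`, the method `resolve`, the boolean `createdOnOrAfter130`, and the returned labels are all hypothetical stand-ins, and the real factory returns Lucene token filters rather than strings.

```java
import java.util.Locale;

public class StemmerChoiceSketch {

    /**
     * Maps a stemmer name to a descriptive label for the stemmer this PR
     * says it resolves to. The labels are illustrative strings, not Lucene
     * class names. createdOnOrAfter130 is a hypothetical stand-in for the
     * check on the version the index was created with.
     */
    static String resolve(String language, boolean createdOnOrAfter130) {
        switch (language.toLowerCase(Locale.ROOT)) {
            case "english":
                // New behaviour only applies to indices created from v1.3.0 onwards.
                return createdOnOrAfter130 ? "porter" : "snowball-english";
            case "porter2":
                return createdOnOrAfter130 ? "snowball-english" : "snowball-porter-v1";
            case "light_english":
            case "kstem":
                // light_english now returns kstem; the kstem name still works.
                return "kstem";
            case "portuguese_rslp":
                return "portuguese-rslp";
            case "kp":
            case "dutch_kp":
                // dutch_kp is a synonym for kp.
                return "kp";
            default:
                throw new IllegalArgumentException("not covered by this sketch: " + language);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("english", true));
        System.out.println(resolve("english", false));
        System.out.println(resolve("porter2", true));
        System.out.println(resolve("dutch_kp", true));
    }
}
```

The point of the version flag is that existing indices keep the stemmer they were built with, so terms already in the index still match at query time.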

clintongormley · 2014-06-10T15:19:52Z

Fixes #6345
Fixes #6213
Fixes #6330

rmuir · 2014-06-10T15:27:36Z

This looks great: I especially like the cleanup to the code organization and the doc links.

javanna · 2014-06-10T15:30:16Z

src/main/java/org/elasticsearch/index/analysis/StemmerTokenFilterFactory.java

@@ -85,90 +89,116 @@ public TokenStream create(TokenStream tokenStream) {
            return new CzechStemFilter(tokenStream);
        } else if ("danish".equalsIgnoreCase(language)) {
            return new SnowballFilter(tokenStream, new DanishStemmer());
+


can you remove this additional line? ;)

The extra lines are there to break up the language blocks, e.g. English vs. French. I've added comments for each block with multiple languages.

clintongormley · 2014-06-10T15:50:01Z

@jpountz Please could you check that I'm using the index versions correctly here

jpountz · 2014-06-11T08:59:38Z

@clintongormley It looks good to me. I really like the fact that we now recommend a stemmer!


Elasticsearch 1.3.0 changed the stemmer implementation used when the "english" stemmer is requested. Instead of using the stemmer as revised by Martin Porter in 2001/2002 to avoid lots of incorrect over-stemming and other problems, Elasticsearch now defaults to using the algorithm defined in 1980. This is an entirely inferior algorithm in terms of the output it produces. To get the better algorithm back, we now need to request the "porter2" algorithm. This has made me grumpy.

The reason that snowball returns the "english" stemmer instead of the "porter" stemmer when it's asked for a stemmer for "english" is exactly because the "porter" stemmer is inferior in every case where it differs from the "english" stemmer. However, elastic/elasticsearch#6452 changed the default, apparently because the available implementation of the 1980 Porter stemmer is faster.

In particular, this change has made a search for "news" on GOV.UK return anything with the word "new" in it.
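
To make the "news" complaint concrete, the sketch below implements only step 1a of the 1980 Porter algorithm (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is a hand-written illustration, not the Lucene or Snowball implementation, and the class name `PorterStep1aSketch` and method `step1a` are hypothetical. The later steps of the algorithm are omitted; the plural rule alone is enough to turn "news" into "new".

```java
public class PorterStep1aSketch {

    /**
     * Step 1a of the 1980 Porter algorithm, applied to a lowercase word.
     * The first matching suffix rule wins, longest suffix first.
     */
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            // SSES -> SS
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ies")) {
            // IES -> I
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ss")) {
            // SS -> SS (unchanged)
            return word;
        }
        if (word.endsWith("s")) {
            // S -> (removed); this is the rule that hits "news"
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(step1a("news"));     // new
        System.out.println(step1a("caresses")); // caress
        System.out.println(step1a("ponies"));   // poni
    }
}
```

The snowball English (Porter2) stemmer, as far as I know, lists "news" among the words it leaves unchanged, which is why requesting `porter2` restores the old search behaviour.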

clintongormley added v1.3.0 labels Jun 10, 2014

javanna reviewed Jun 10, 2014
clintongormley mentioned this pull request Jun 10, 2014

Mapping API: Added portuguese stem token filter #6345

Closed

clintongormley merged commit 673ef3d into elastic:master Jun 11, 2014

clintongormley deleted the stemmer_changes branch June 11, 2014 10:33

s1monw removed the review label Jun 18, 2014

clintongormley added the :Search/Analysis label Jun 6, 2015

clintongormley changed the title ~~Analysis: Improvements to StemmerTokenFilter~~ Improvements to StemmerTokenFilter Jun 6, 2015

clintongormley removed >bug >enhancement labels Jun 7, 2015

This was referenced Jun 8, 2015

Actually use the modern English stemmer alphagov/search-api#447

Merged

Default english stemmer uses obselete algorithm from 1980 #11541

Closed
