Improvements to StemmerTokenFilter #6452

Merged
merged 1 commit into from Jun 11, 2014


clintongormley

The StemmerTokenFilter had a number of issues:

  • english returned the slow snowball English stemmer
  • porter2 returned the snowball Porter stemmer (v1)
  • portuguese was used twice, preventing the second version from working
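
The duplicate-name problem is easy to reproduce in isolation. Below is a minimal sketch, assuming the lookup is an `if`/`else if` chain over the language name like the one in the diff; the class name, method names and returned labels are hypothetical and are not taken from the Elasticsearch source.

```java
// Hypothetical sketch: class, method names and labels are illustrative only.
class DuplicateBranchSketch {

    // Before: the same name guards two branches, so the second can never run.
    static String resolveBefore(String language) {
        if ("portuguese".equalsIgnoreCase(language)) {
            return "first-variant";
        } else if ("portuguese".equalsIgnoreCase(language)) {
            return "second-variant"; // unreachable: the branch above always wins
        }
        return "unknown";
    }

    // After: the second variant gets its own name (portuguese_rslp in this PR).
    static String resolveAfter(String language) {
        if ("portuguese".equalsIgnoreCase(language)) {
            return "first-variant";
        } else if ("portuguese_rslp".equalsIgnoreCase(language)) {
            return "second-variant";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(resolveBefore("portuguese"));
        System.out.println(resolveAfter("portuguese_rslp"));
    }
}
```

The compiler accepts the duplicated condition without complaint, which is presumably why the second variant stayed silently unavailable until someone asked for it.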

Changes:

  • english now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
  • porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
  • light_english now returns the kstem stemmer (kstem still works)
  • portuguese_rslp returns the PortugueseStemmer
  • dutch_kp is a synonym for kp
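
The name changes listed above can be summarised as a lookup that depends on the version the index was created with. This is only a rough sketch: the integer version encoding, the class and method names, and the returned labels are all assumptions made for illustration, and the real filter returns Lucene token filters, not strings.

```java
import java.util.Locale;

// Hypothetical sketch: nothing here mirrors the real Elasticsearch classes.
class VersionGatedStemmerSketch {

    // Assumed encoding of v1.3.0 as a comparable integer (illustration only).
    static final int V_1_3_0 = 10300;

    // Maps a configured stemmer name to a label describing what it resolves to.
    static String resolve(String name, int indexCreatedVersion) {
        boolean newBehaviour = indexCreatedVersion >= V_1_3_0;
        switch (name.toLowerCase(Locale.ROOT)) {
            case "english":
                // Older indices keep the old mapping so existing data still matches.
                return newBehaviour ? "PorterStemmer" : "snowball English";
            case "porter2":
                return newBehaviour ? "snowball English" : "snowball Porter";
            case "light_english":
            case "kstem":
                return "kstem";
            case "dutch_kp":
            case "kp":
                return "kp";
            default:
                throw new IllegalArgumentException("unknown stemmer: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("english", 10200));
        System.out.println(resolve("english", 10300));
    }
}
```

Gating on the index-creation version rather than the node version means an index built before the upgrade keeps producing the same tokens at search time as it did at index time.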

Tests and docs updated

@clintongormley
Author

Fixes #6345
Fixes #6213
Fixes #6330

@rmuir
Contributor

rmuir commented Jun 10, 2014

This looks great: I especially like the cleanup to the code organization and the doc links.

@@ -85,90 +89,116 @@ public TokenStream create(TokenStream tokenStream) {
return new CzechStemFilter(tokenStream);
} else if ("danish".equalsIgnoreCase(language)) {
return new SnowballFilter(tokenStream, new DanishStemmer());

Member

can you remove this additional line? ;)

Author

The extra lines are there to break up the language blocks, e.g. English vs. French. I've added comments for each block that covers multiple languages.

@clintongormley
Author

@jpountz Please could you check that I'm using the index versions correctly here

@jpountz
Contributor

jpountz commented Jun 11, 2014

@clintongormley It looks good to me. I really like the fact that we now recommend a stemmer!

@clintongormley clintongormley merged commit 673ef3d into elastic:master Jun 11, 2014
@clintongormley clintongormley deleted the stemmer_changes branch June 11, 2014 10:33
@s1monw s1monw removed the review label Jun 18, 2014
@clintongormley clintongormley added the :Search/Analysis label Jun 6, 2015
@clintongormley clintongormley changed the title from "Analysis: Improvements to StemmerTokenFilter" to "Improvements to StemmerTokenFilter" Jun 6, 2015
rboulton pushed a commit to alphagov/search-api that referenced this pull request Jun 8, 2015
Elasticsearch 1.3.0 changed the stemmer implementation used when the
"english" stemmer is requested. Instead of using the stemmer as revised
by Martin Porter in 2001/2002 to avoid lots of incorrect over-stemming
and other problems, elasticsearch now defaults to using the algorithm
defined in 1980. This is an entirely inferior algorithm in terms of the
output it produces.

To get the better algorithm back, we now need to request the "porter2"
algorithm.

This has made me grumpy.  The reason that snowball returns the "english"
stemmer instead of the "porter" stemmer when it's asked for a stemmer
for "english" is exactly because the "porter" stemmer is inferior in
every case where it differs from the "english" stemmer. However,
elasticsearch elastic/elasticsearch#6452 changed
the default, apparently because the implementation of the 1980 Porter
stemmer available is faster.

In particular, this change has made a search for "news" on GOV.UK return
anything with the word "new" in it.
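
The "news" result follows from step 1a of the 1980 Porter algorithm, which strips a trailing "s" unless the word ends in "ss". The sketch below covers only that one step, plus an abbreviated stand-in for the Snowball English (Porter2) exception list, which to my understanding keeps "news" unchanged. The class and method names are hypothetical, and neither method is a complete stemmer.

```java
import java.util.Collections;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: partial illustrations, not full stemmer implementations.
class NewsStemSketch {

    // Step 1a of the original (1980) Porter algorithm; all later steps are omitted.
    static String porterStep1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        } else if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        } else if (word.endsWith("ss")) {
            return word;                                 // ss -> ss
        } else if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    // Abbreviated on purpose: the real Porter2 exception list has more entries.
    static final Set<String> PORTER2_INVARIANTS = Collections.singleton("news");

    // Returns the word unchanged if it is a known invariant; empty means the
    // normal Porter2 steps would run, and those are not shown here.
    static Optional<String> porter2Exception(String word) {
        return PORTER2_INVARIANTS.contains(word) ? Optional.of(word) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(porterStep1a("news"));           // prints "new"
        System.out.println(porter2Exception("news").get()); // prints "news"
    }
}
```

With the older algorithm "news" and "new" collapse to the same token, so a query for one matches documents containing the other.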
6 participants