Analysis: Default analyzer includes stopwords #5974

Closed
clintongormley opened this issue Apr 29, 2014 · 3 comments · Fixed by #6043

Comments

@clintongormley

Using the default analyzer:

GET /_analyze?text=The fox

Removes stopwords:

{
   "tokens": [
      {
         "token": "fox",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

Using the standard analyzer:

GET /_analyze?text=The fox&analyzer=standard

Keeps stopwords:

{
   "tokens": [
      {
         "token": "the",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "fox",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
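
Note that "fox" keeps position 2 in both responses, even when "the" is dropped. A minimal sketch of that behavior follows, assuming a whitespace tokenizer and lowercasing as a stand-in for the real standard tokenizer; the class and method names are hypothetical and none of this is Elasticsearch or Lucene code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy analyzer; it only imitates the fields shown in the responses above.
final class StopwordPositionSketch {

    /** One emitted token with the same fields as an _analyze response entry. */
    static final class Token {
        final String token;
        final int startOffset;
        final int endOffset;
        final int position;

        Token(String token, int startOffset, int endOffset, int position) {
            this.token = token;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        @Override
        public String toString() {
            return token + "[" + startOffset + "," + endOffset + ") @" + position;
        }
    }

    /** Splits on whitespace, lowercases, and skips stop words without renumbering positions. */
    static List<Token> analyze(String text, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            position++; // every term consumes a position, removed or not
            String term = m.group().toLowerCase(Locale.ROOT);
            if (stopwords.contains(term)) {
                continue; // dropped term leaves a gap in the positions
            }
            out.add(new Token(term, m.start(), m.end(), position));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The fox", Set.of("the")));      // stop words removed
        System.out.println(analyze("The fox", Set.<String>of()));   // stop words kept
    }
}
```

Passing an empty stop set mirrors the `CharArraySet.EMPTY_SET` configuration; passing a set containing "the" mirrors the behavior reported for the request without an analyzer name.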
@gmarz
Contributor

gmarz commented May 4, 2014

@clintongormley

Based on #4092 and the snippet below, the correct behavior appears to be to keep stop words for versions on or after 1.0.0.Beta1.

STANDARD(CachingStrategy.ELASTICSEARCH) { // we don't do stopwords anymore from 1.0Beta on
    @Override
    protected Analyzer create(Version version) {
        if (version.onOrAfter(Version.V_1_0_0_Beta1)) {
            return new StandardAnalyzer(version.luceneVersion, CharArraySet.EMPTY_SET);
        }
        return new StandardAnalyzer(version.luceneVersion);
    }
}

The inconsistency arises because, when no analyzer name is specified in the query params, the analyzer is resolved not from the pre-built analyzers but from Lucene.STANDARD_ANALYZER, which is configured to use stop words.

Perhaps it should be resolved from the pre-built analyzers instead.
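
To make the two resolution paths concrete, here is a rough sketch under stated assumptions. It models an analyzer only by its stop word set; `resolveObserved`, `resolveSuggested`, the registry map, and the abbreviated stop word list are all hypothetical stand-ins, not the actual analyze API code:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model: an "analyzer" is represented only by its stop word set.
final class AnalyzerResolutionSketch {

    // Stand-in for Lucene.STANDARD_ANALYZER, which is configured with stop words.
    // The list here is abbreviated and purely illustrative.
    static final Set<String> LUCENE_STANDARD_STOPWORDS = Set.of("the", "a", "an");

    // Stand-in for the pre-built analyzers: "standard" has an empty stop set
    // on or after 1.0.0.Beta1, as in the snippet quoted above.
    static final Map<String, Set<String>> PREBUILT =
            Map.of("standard", Set.<String>of());

    /** Behavior described in this issue: no name means the Lucene constant is used. */
    static Set<String> resolveObserved(String analyzerName) {
        if (analyzerName == null) {
            return LUCENE_STANDARD_STOPWORDS;
        }
        return PREBUILT.get(analyzerName); // null for unknown names in this sketch
    }

    /** Suggested behavior: no name falls back to the pre-built "standard" entry. */
    static Set<String> resolveSuggested(String analyzerName) {
        String name = (analyzerName == null) ? "standard" : analyzerName;
        return PREBUILT.get(name);
    }

    public static void main(String[] args) {
        System.out.println("observed, no name:  " + resolveObserved(null));
        System.out.println("suggested, no name: " + resolveSuggested(null));
        System.out.println("explicit standard:  " + resolveObserved("standard"));
    }
}
```

With a single fallback through the registry, the request without an analyzer name and the request with `analyzer=standard` would agree by construction.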

@spinscale spinscale self-assigned this May 5, 2014
spinscale added a commit to spinscale/elasticsearch that referenced this issue May 5, 2014
The analyze API used the standard analyzer from lucene and therefore removed
stopwords instead of using the elasticsearch default analyzer.

Closes elastic#5974
spinscale added a commit that referenced this issue May 5, 2014
The analyze API used the standard analyzer from lucene and therefore removed
stopwords instead of using the elasticsearch default analyzer.

Closes #5974
@s1monw
Contributor

s1monw commented May 18, 2014

I guess we should port this to 1.1.2 as well @spinscale

spinscale added a commit that referenced this issue May 18, 2014
The analyze API used the standard analyzer from lucene and therefore removed
stopwords instead of using the elasticsearch default analyzer.

Closes #5974
@spinscale
Contributor

done

@clintongormley clintongormley changed the title from "Default analyzer includes stopwords" to "Analysis: Default analyzer includes stopwords" Jul 16, 2014
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
The analyze API used the standard analyzer from lucene and therefore removed
stopwords instead of using the elasticsearch default analyzer.

Closes elastic#5974