Analysis: Default analyzer includes stopwords #5974

clintongormley · 2014-04-29T10:04:23Z

Using the default analyzer:

GET /_analyze?text=The fox

Removes stopwords:

{
   "tokens": [
      {
         "token": "fox",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

Using the standard analyzer:

GET /_analyze?text=The fox&analyzer=standard

Keeps stopwords:

{
   "tokens": [
      {
         "token": "the",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "fox",
         "start_offset": 4,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

gmarz · 2014-05-04T19:14:03Z

@clintongormley

Based on #4092 and the snippet below, the correct behavior appears to be to not remove stopwords for versions on or after 1.0.0.Beta1.

STANDARD(CachingStrategy.ELASTICSEARCH) { // we don't do stopwords anymore from 1.0Beta on
    @Override
    protected Analyzer create(Version version) {
        if (version.onOrAfter(Version.V_1_0_0_Beta1)) {
            return new StandardAnalyzer(version.luceneVersion, CharArraySet.EMPTY_SET);
        }
        return new StandardAnalyzer(version.luceneVersion);
    }
}

The inconsistency arises because, when no analyzer name is specified in the query params, the analyzer is not resolved from the pre-built analyzers but from Lucene.STANDARD_ANALYZER, which is configured with a stopword list and therefore removes stopwords.

Perhaps it should be resolved from the pre-built analyzers instead.
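
To make the difference between the two resolution paths concrete, here is a minimal, self-contained sketch. It is a toy model, not the Elasticsearch or Lucene code: `SimpleAnalyzer`, `resolveBuggy`, `resolveFixed`, `lookup`, and the three-word stopword list are all hypothetical names and values chosen for illustration. It only mimics the behavior described above, where a missing analyzer name falls back to a stopword-removing analyzer instead of the pre-built one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerResolutionSketch {

    // Hypothetical stand-in for an analyzer: lowercases the input, splits on
    // non-alphanumeric characters, and drops any token found in the stopword set.
    static final class SimpleAnalyzer {
        private final Set<String> stopwords;

        SimpleAnalyzer(Set<String> stopwords) {
            this.stopwords = stopwords;
        }

        List<String> analyze(String text) {
            List<String> tokens = new ArrayList<>();
            for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!raw.isEmpty() && !stopwords.contains(raw)) {
                    tokens.add(raw);
                }
            }
            return tokens;
        }
    }

    // Models an analyzer configured with a stopword list (illustrative list only).
    static final SimpleAnalyzer LUCENE_STANDARD =
            new SimpleAnalyzer(new HashSet<>(Arrays.asList("the", "a", "an")));

    // Models the pre-built "standard" analyzer created with an empty stopword set.
    static final SimpleAnalyzer PREBUILT_STANDARD =
            new SimpleAnalyzer(new HashSet<String>());

    // Looks up a named analyzer among the (hypothetical) pre-built analyzers.
    static SimpleAnalyzer lookup(String name) {
        if ("standard".equals(name)) {
            return PREBUILT_STANDARD;
        }
        throw new IllegalArgumentException("unknown analyzer: " + name);
    }

    // Reported behavior: no name given means falling back to the stopword-removing analyzer.
    static SimpleAnalyzer resolveBuggy(String name) {
        return name == null ? LUCENE_STANDARD : lookup(name);
    }

    // Suggested behavior: no name given means resolving "standard" from the pre-built analyzers.
    static SimpleAnalyzer resolveFixed(String name) {
        return lookup(name == null ? "standard" : name);
    }

    public static void main(String[] args) {
        System.out.println(resolveBuggy(null).analyze("The fox"));       // [fox]
        System.out.println(resolveBuggy("standard").analyze("The fox")); // [the, fox]
        System.out.println(resolveFixed(null).analyze("The fox"));       // [the, fox]
    }
}
```

In this model, routing the "no name given" case through the same lookup as the named case is what removes the inconsistency; whether the real fix takes exactly this shape is not shown in this thread.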

The analyze API used the standard analyzer from lucene and therefore removed stopwords instead of using the elasticsearch default analyzer. Closes elastic#5974

s1monw · 2014-05-18T10:06:09Z

I guess we should port this to 1.1.2 as well @spinscale

spinscale · 2014-05-18T15:58:02Z

done

clintongormley added bug labels Apr 29, 2014

spinscale self-assigned this May 5, 2014

spinscale mentioned this issue May 5, 2014

Analyze API: Default analyzer accidentally removed stopwords #6043

Merged

spinscale closed this as completed in #6043 May 5, 2014

spinscale added a commit that referenced this issue May 5, 2014

Analyze API: Default analyzer accidentally removed stopwords

879db9c

The analyze API used the standard analyzer from lucene and therefore removed stopwords instead of using the elasticsearch default analyzer. Closes #5974

spinscale added a commit that referenced this issue May 18, 2014

Analyze API: Default analyzer accidentally removed stopwords

7e091c4

The analyze API used the standard analyzer from lucene and therefore removed stopwords instead of using the elasticsearch default analyzer. Closes #5974

spinscale added the v1.1.2 label May 18, 2014

clintongormley changed the title ~~Default analyzer includes stopwords~~ Analysis: Default analyzer includes stopwords Jul 16, 2014
