
Remove lowercase_expanded_terms option #9978

Closed
rmuir opened this issue Mar 4, 2015 · 4 comments · Fixed by #10086 or #20208

@rmuir
Contributor

rmuir commented Mar 4, 2015

For the record, I think we should remove this lowercasing option completely from the parsers, disable it, and let the analysis chain take care of it. For multi-term queries it's a little tricky: which subset of the filters should be used? For example, lowercasing is reasonable; stemming is not.

But Lucene already annotates each analyzer component deemed "reasonable" for wildcards with a marker interface (`MultiTermAwareComponent`). Things like LowerCaseFilter have it and things like stemmers don't. This is enough to automatically build a chain from the query analyzer that acts reasonably for multi-term queries.

I know we don't use the Lucene factories (ES has its own), but we have a table that maps between them; I know because it's in a test I wrote. So the information is there :)

All query parsers have hooks (e.g. factory methods for prefix/wildcard/range) that make it possible to use this. For example, Solr does it by default; as soon as it did, people stopped complaining about confusing behavior, both for the unanalyzed and the analyzed case. It just works.
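The hook idea above can be sketched in miniature. This is a hypothetical Python stand-in (the real code is Java inside Lucene/Elasticsearch; `ToyQueryParser` and its methods are illustrative names only): the parser exposes one factory method per expanded-term query type, and each hook runs a reduced, multi-term-safe normalization instead of a blanket `lowercase_expanded_terms` flag.

```python
# Toy sketch: a query parser whose per-query-type factory methods apply a
# multi-term-safe normalization derived from the query analyzer. All names
# here are hypothetical, not Elasticsearch's actual classes.

class ToyQueryParser:
    def __init__(self, multi_term_normalize):
        # multi_term_normalize: the chain derived from the query analyzer,
        # restricted to multi-term-safe components (e.g. lowercasing only).
        self.multi_term_normalize = multi_term_normalize

    def new_prefix_query(self, field, prefix):
        # Hook for prefix queries: normalize the raw term, never stem it.
        return ("prefix", field, self.multi_term_normalize(prefix))

    def new_wildcard_query(self, field, pattern):
        # Hook for wildcard queries: same normalization, pattern kept intact.
        return ("wildcard", field, self.multi_term_normalize(pattern))

# Here the derived chain happens to be plain lowercasing.
parser = ToyQueryParser(str.lower)
print(parser.new_wildcard_query("title", "Foo*"))  # ('wildcard', 'title', 'foo*')
```

The point is that the normalization lives in one place (the hook), so both the analyzed and unanalyzed cases behave consistently without a user-facing option.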

Sorry for the long explanation.

@rmuir rmuir added the "help wanted" and "adoptme" labels Mar 4, 2015
@rmuir
Contributor Author

rmuir commented Mar 4, 2015

Background: #9973 (I copied my comment here).

@clintongormley clintongormley added the ":Search/Analysis (How text is split into tokens)" label Mar 9, 2015
@dakrone dakrone self-assigned this Mar 13, 2015
dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 13, 2015
The analysis chain should be used instead of relying on this, as it is
confusing when dealing with different per-field analyzers.

The `locale` option was only used for `lowercase_expanded_terms`; once that
is removed, `locale` is no longer needed, so it was removed as well.

Fixes elastic#9978
Relates to elastic#9973
@dakrone dakrone reopened this Mar 14, 2015
@dakrone
Member

dakrone commented Mar 31, 2015

@rmuir can you explain a little bit more how the MultiTermAwareComponent should be used? When I look for usages of it in the Lucene 4.x branch, the only users are tests and Solr.

Looking at the QueryBuilder, it looks like it uses the analyzer to get the tokenStream for the text and extract tokens, but this doesn't seem to play a part in choosing whether or not to do things like stemming. Do you think we should hold off on this until we figure out how to exclude things like stemming from the analysis chain when analyzing expanded terms, or should we let the user deal with it by specifying a separate query-time analyzer?

(I originally merged a change for this that did it in the generic way, but then reverted it when Ryan reminded me about the MultiTermAwareComponent part of this.)

@rmuir
Contributor Author

rmuir commented Sep 1, 2015

@rmuir can you explain a little bit more how the MultiTermAwareComponent should be used?

You create a separate chain (analyzer) for multi-term queries by iterating through the query analysis chain. Implicitly, the tokenizer should be KeywordTokenizer.

Imagine the query analysis chain is lowercase + snowball. You look at LowerCaseFilterFactory (http://svn.apache.org/viewvc/lucene/dev/branches/branch_5x/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseFilterFactory.java?view=markup), and with instanceof you see that it implements that interface, so you call getMultiTermComponent() and add the result to the chain. Snowball, on the other hand, does not implement it, so nothing gets added.
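The walk described here can be rendered as a minimal Python sketch. The classes below are stand-ins for Lucene's Java factories (the real interface is `MultiTermAwareComponent` with a `getMultiTermComponent()` method); only the shape of the loop, the instanceof check, and the exclusion of the stemmer mirror the actual mechanism.

```python
# Stand-in for Lucene's MultiTermAwareComponent marker interface: components
# implementing it are safe to apply to wildcard/prefix/fuzzy terms.
class MultiTermAwareComponent:
    def get_multi_term_component(self):
        # In Lucene this may return a related component; the trivial case
        # returns the factory itself.
        return self

class LowerCaseFilterFactory(MultiTermAwareComponent):
    name = "lowercase"

class SnowballFilterFactory:  # no marker interface: stemming is excluded
    name = "snowball"

def multi_term_chain(factories):
    """Iterate the query analysis chain, keeping only marked components."""
    chain = []
    for factory in factories:
        if isinstance(factory, MultiTermAwareComponent):  # the instanceof check
            chain.append(factory.get_multi_term_component())
    return chain

chain = multi_term_chain([LowerCaseFilterFactory(), SnowballFilterFactory()])
print([f.name for f in chain])  # ['lowercase']
```

So for the lowercase + snowball example, the derived multi-term chain contains only the lowercasing component, exactly as described above.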

@javanna
Member

javanna commented Sep 1, 2015

As a side note, if we remove the `lowercase_expanded_terms` option, we should be able to remove the `locale` parameter from `simple_query_string` and `query_string` as well, which would help with solving #13229.

@spinscale spinscale added v2.3.0 and removed v2.2.0 labels Dec 23, 2015
jpountz added a commit to jpountz/elasticsearch that referenced this issue Jun 23, 2016
…lastic#19028

This is the same as what Lucene does for its analysis factories, and we have
tests that make sure that the Elasticsearch factories are in sync with
Lucene's. This is a first step toward elastic#9978 and elastic#18064.
jpountz added a commit to jpountz/elasticsearch that referenced this issue Jun 24, 2016
…ple_)query_string`.

This pull request uses the `MultiTermAwareComponent` interface in order to
figure out how to deal with queries that match partial strings. This provides a
better out-of-the-box experience and makes it possible to remove the
`lowercase_expanded_terms` and `locale` (which was only used for lowercasing)
options.

Things are expected to work well for custom analyzers. However, built-in
analyzers make it challenging to know which components should be kept for
multi-term analysis. The way it is implemented today, there is a default
implementation that returns a lowercasing analyzer, which should be fine for
most analyzers for European languages. I did not want to go crazy
configuring the correct multi-term analyzer for those until we have a way
to test that we are in sync with what happens in Lucene, like we do for testing
which factories need to implement `MultiTermAwareComponent`.

In the future we could consider removing `analyze_wildcards` as well, but the
query parser currently has the ability to tokenize the input and generate a
term query for the first n-1 tokens and a wildcard query on the last token. I
suspect some users are relying on this behaviour, so I think it should be
explored in a separate change.

Closes elastic#9978
jpountz added a commit to jpountz/elasticsearch that referenced this issue Nov 2, 2016
…ons.

Lucene 6.2 introduces the new `Analyzer.normalize` API, which makes it possible
to apply only character-level normalization such as lowercasing or accent
folding. This is exactly what is needed to process queries that operate on
partial terms such as `prefix`, `wildcard` or `fuzzy` queries. As a consequence,
the `lowercase_expanded_terms` option is no longer necessary. Furthermore, the
`locale` option was only needed to know how to perform the lowercasing, so it
can be removed as well.

Closes elastic#9978
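The distinction the commit message draws, character-level normalization versus full analysis, can be illustrated with a small Python sketch. The `normalize`/`analyze` functions below are hypothetical stand-ins for the behavior of Lucene's Java `Analyzer.normalize` and `Analyzer.tokenStream`, not the actual API; accent folding is done with the standard library's `unicodedata`.

```python
import unicodedata

def fold_accents(text):
    # Canonical decomposition, then strip combining marks ("é" -> "e").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term):
    """Character-level steps only: safe for prefix/wildcard/fuzzy terms,
    since wildcard characters and term boundaries are preserved."""
    return fold_accents(term.lower())

def analyze(term):
    """Full analysis: normalization plus a toy stemming step, which would
    be wrong to apply to a partial term like a wildcard pattern."""
    return normalize(term).rstrip("s")

print(normalize("Café*"))  # "cafe*"  (lowercased and folded, not stemmed)
print(analyze("Cafés"))    # "cafe"
```

Running only `normalize` on expanded terms gives the lowercasing that `lowercase_expanded_terms` used to provide, derived from the field's own analyzer instead of a separate option and `locale` setting.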
jpountz added a commit that referenced this issue Nov 2, 2016
…ons. (#20208)

Lucene 6.2 introduces the new `Analyzer.normalize` API, which makes it possible
to apply only character-level normalization such as lowercasing or accent
folding. This is exactly what is needed to process queries that operate on
partial terms such as `prefix`, `wildcard` or `fuzzy` queries. As a consequence,
the `lowercase_expanded_terms` option is no longer necessary. Furthermore, the
`locale` option was only needed to know how to perform the lowercasing, so it
can be removed as well.

Closes #9978