Remove lowercase_expanded_terms option #9978
Background: #9973 (I copied my comment here).
The analysis chain should be used instead of relying on this option, as it is confusing when dealing with different per-field analyzers. The `locale` option was only used for `lowercase_expanded_terms`, so once that option is removed, `locale` is no longer needed and was removed as well. Fixes elastic#9978 Relates to elastic#9973
@rmuir can you explain a little bit more how the … Looking at the … (I originally merged a change for this that did it in the generic way but then reverted it when Ryan reminded me about the MultiTermAwareComponent part of this)
You create a separate chain (analyzer) for multi-term queries by iterating through the query analysis chain; implicitly, the tokenizer should be a KeywordTokenizer. Imagine the query analysis chain is: lowercase + snowball. You look at lowercase (http://svn.apache.org/viewvc/lucene/dev/branches/branch_5x/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/LowerCaseFilterFactory.java?view=markup) and with `instanceof` you see it has that interface, so you call `getMultiTermComponent()` and add that to the chain. On the other hand, snowball does not implement it, so nothing gets added.
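The procedure rmuir describes can be sketched as follows. This is a minimal, self-contained illustration of the pattern, not the actual Lucene API: `Component` and `MultiTermAware` here are hypothetical stand-ins for Lucene's analysis factories and its `MultiTermAwareComponent` marker interface, and the two filter classes only mimic the real `LowerCaseFilterFactory` and `SnowballPorterFilterFactory`.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiTermChain {
    // Stand-in for a Lucene analysis factory (e.g. a TokenFilterFactory).
    interface Component { String name(); }

    // Stand-in for Lucene's MultiTermAwareComponent marker interface:
    // components that are safe for wildcard/prefix/fuzzy queries implement it.
    interface MultiTermAware { Component getMultiTermComponent(); }

    // Lowercasing is character-level, so it is multi-term aware.
    static class LowerCaseFilter implements Component, MultiTermAware {
        public String name() { return "lowercase"; }
        public Component getMultiTermComponent() { return this; }
    }

    // Stemming rewrites whole terms, so it is NOT multi-term aware.
    static class SnowballFilter implements Component {
        public String name() { return "snowball"; }
    }

    // Walk the query analysis chain; keep only the multi-term-aware parts.
    static List<Component> buildMultiTermChain(List<Component> queryChain) {
        List<Component> out = new ArrayList<>();
        for (Component c : queryChain) {
            if (c instanceof MultiTermAware) {
                out.add(((MultiTermAware) c).getMultiTermComponent());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Component> chain = List.of(new LowerCaseFilter(), new SnowballFilter());
        List<Component> multiTerm = buildMultiTermChain(chain);
        // lowercase survives, snowball is dropped
        System.out.println(multiTerm.size() + " " + multiTerm.get(0).name());
    }
}
```

The key point is that the multi-term chain is derived automatically from the query analyzer, so no per-query lowercasing option is needed.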
As a side note, if we remove the …
…lastic#19028 This is the same as what Lucene does for its analysis factories, and we have tests that make sure that the Elasticsearch factories are in sync with Lucene's. This is a first step to move forward on elastic#9978 and elastic#18064.
…ple_)query_string`. This pull request uses the `MultiTermAwareComponent` interface in order to figure out how to deal with queries that match partial strings. This provides a better out-of-the-box experience and makes it possible to remove the `lowercase_expanded_terms` and `locale` (which was only used for lowercasing) options. Things are expected to work well for custom analyzers. However, built-in analyzers make it challenging to know which components should be kept for multi-term analysis. The way it is implemented today is that there is a default implementation that returns a lowercasing analyzer, which should be fine for most language analyzers for European languages. I did not want to go crazy with configuring the correct multi-term analyzer for those until we have a way to test that we are in sync with what happens in Lucene, like we do for testing which factories need to implement `MultiTermAwareComponent`. In the future we could consider removing `analyze_wildcards` as well, but the query parser currently has the ability to tokenize it and generate a term query for the n-1 first tokens and a wildcard query on the last token. I suspect some users are relying on this behaviour, so I think this should be explored in a separate change. Closes elastic#9978
…ons. Lucene 6.2 introduces the new `Analyzer.normalize` API, which makes it possible to apply only character-level normalization such as lowercasing or accent folding, which is exactly what is needed to process queries that operate on partial terms such as `prefix`, `wildcard` or `fuzzy` queries. As a consequence, the `lowercase_expanded_terms` option is not necessary anymore. Furthermore, the `locale` option was only needed in order to know how to perform the lowercasing, so this one can be removed as well. Closes elastic#9978
…ons. (#20208) Lucene 6.2 introduces the new `Analyzer.normalize` API, which makes it possible to apply only character-level normalization such as lowercasing or accent folding, which is exactly what is needed to process queries that operate on partial terms such as `prefix`, `wildcard` or `fuzzy` queries. As a consequence, the `lowercase_expanded_terms` option is not necessary anymore. Furthermore, the `locale` option was only needed in order to know how to perform the lowercasing, so this one can be removed as well. Closes #9978
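The idea behind `Analyzer.normalize` can be sketched like this. This is a hedged, self-contained illustration, not Lucene's API (the real method operates on `TokenStream`s and returns a `BytesRef`); it only shows the split between full analysis and character-level normalization, and why the `locale` option becomes unnecessary once lowercasing uses a fixed root locale.

```java
import java.text.Normalizer;
import java.util.Locale;

public class NormalizeSketch {
    // Full analysis (tokenization, stemming, etc.) is elided here; a wildcard
    // pattern like "Été*" must NOT go through it, or the '*' and the term
    // shape would be mangled.

    // Character-level normalization only: safe for prefix/wildcard/fuzzy
    // input, because it never splits, drops, or rewrites whole terms.
    static String normalize(String text) {
        String lowered = text.toLowerCase(Locale.ROOT);          // no user-facing locale option
        String folded = Normalizer.normalize(lowered, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}", "");     // strip accents (accent folding)
        return folded;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Été*")); // -> "ete*"
    }
}
```

With normalization defined per analyzer like this, there is nothing left for `lowercase_expanded_terms` or `locale` to configure at the query level.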
For the record, I think we should remove this lowercasing option completely in the parsers, disable it, and let the analysis chain take care of it. For multi-term queries it's a little tricky: which subset of the filters should be used? For example, lowercasing is reasonable; stemming is not.
But Lucene already annotates each analyzer component deemed "reasonable" for wildcards with a marker interface (`MultiTermAwareComponent`). Things like `LowerCaseFilter` have it and things like stemmers don't. This is enough to build a "chain" automatically from the query analyzer that acts reasonably for multi-term queries.
I know we don't use the Lucene factories (ES has its own), but we have a table that maps between them; I know because it's in a test I wrote. So the information is there :)
All query parsers have hooks (e.g. factory methods for prefix/wildcard/range) that make it possible to use this. Solr, for example, does it by default, and as soon as it did, people stopped complaining about confusing behavior, both for the unanalyzed and the analyzed case. It just works.
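The hook pattern mentioned above can be sketched as follows. All names here are hypothetical stand-ins: Lucene's real query parsers expose overridable factory methods (such as `getPrefixQuery`), and this sketch only shows how such a hook can route multi-term input through the normalization step instead of the full analysis chain.

```java
import java.util.Locale;

public class ParserHookSketch {
    // Minimal stand-in for a prefix query object.
    static class PrefixQuery {
        final String field;
        final String prefix;
        PrefixQuery(String field, String prefix) {
            this.field = field;
            this.prefix = prefix;
        }
    }

    // Stand-in parser: the prefix/wildcard/range factory methods are the
    // hooks where the multi-term normalization gets applied.
    static class Parser {
        // Character-level normalization; in Lucene this would delegate to
        // the analyzer's normalize step rather than lowercase directly.
        String normalize(String term) {
            return term.toLowerCase(Locale.ROOT);
        }

        // Hook: every prefix query passes through normalize(), so "Quick*"
        // matches terms indexed as "quick..." with no user-facing
        // lowercase_expanded_terms switch required.
        PrefixQuery newPrefixQuery(String field, String termStr) {
            return new PrefixQuery(field, normalize(termStr));
        }
    }

    public static void main(String[] args) {
        System.out.println(new Parser().newPrefixQuery("title", "Quick").prefix);
    }
}
```

Because the normalization lives inside the parser hook, it applies uniformly whether or not the user lowercased their input, which is the "it just works" behavior described above.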
Sorry for the long explanation.