Fixes many misbehaving user parameters #8028

alexksikes · 2014-10-08T21:07:25Z

Previously, the MLT API would create one MLT query per field and per value.
This made the parameters related to term selection and query formation, such
as max_query_terms, min_term_freq, minimum_should_match (previously
percent_terms_to_match) or boost_terms, behave in unexpected ways. Take the
common example of looking up similar documents with respect to a list of tag
names. Suppose these tags are modeled by a multi-value field with a keyword
analyzer. Performing an MLT request would therefore result in one MLT query
per tag, regardless of the value of max_query_terms or minimum_should_match.
The resulting query would be made of all the tag names if min_term_freq = 1
(no actual selection of terms takes place), or of no tags at all if
min_term_freq > 1 (note the default is 2). The boost_terms parameter would
also have unexpected effects, as it would depend on the frequency of the term
within each field value and, again, not within the whole field.
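
The difference between counting term frequencies per value and over the whole field can be sketched as follows. This is a minimal illustration, not Elasticsearch code: the class `MltTermSelectionSketch` and its methods are hypothetical, and the selection rule is reduced to a plain `min_term_freq` threshold, assuming a keyword analyzer so that each field value is exactly one term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
class MltTermSelectionSketch {

    // Count how often each term occurs in the given values.
    // Assumption: keyword analyzer, so one value == one term.
    static Map<String, Integer> termFreqs(List<String> values) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String value : values) {
            freqs.merge(value, 1, Integer::sum);
        }
        return freqs;
    }

    // Keep only the terms whose frequency reaches minTermFreq.
    static List<String> select(Map<String, Integer> freqs, int minTermFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : freqs.entrySet()) {
            if (entry.getValue() >= minTermFreq) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    // Behaviour described above: one selection per value, results concatenated.
    // Within a single value every term has frequency 1.
    static List<String> selectPerValue(List<String> values, int minTermFreq) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            kept.addAll(select(termFreqs(Collections.singletonList(value)), minTermFreq));
        }
        return kept;
    }

    // Frequencies counted once over all values of the field.
    static List<String> selectWholeField(List<String> values, int minTermFreq) {
        return select(termFreqs(values), minTermFreq);
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("search", "lucene", "search");
        System.out.println(selectPerValue(tags, 1));   // [search, lucene, search]
        System.out.println(selectPerValue(tags, 2));   // []
        System.out.println(selectWholeField(tags, 2)); // [search]
    }
}
```

With per-value counting, `min_term_freq` can only let everything through or nothing at all; counting over the whole field is what makes the threshold meaningful.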

This commit fixes these issues by calling the term vector API and passing the
response (the terms) directly to the MoreLikeThisQueryParser. Now both the API
and the query yield exactly the same results under any given set of
parameters, while the API keeps the added benefit of calling the TV API only
once.

Closes #2914

s1monw · 2014-10-09T19:53:52Z

src/main/java/org/elasticsearch/index/query/MoreLikeThisQueryParser.java

@@ -69,6 +71,7 @@
        public static final ParseField STOP_WORDS = new ParseField("stop_words");
        public static final ParseField DOCUMENT_IDS = new ParseField("ids");
        public static final ParseField DOCUMENTS = new ParseField("docs");
+        public static final ParseField TERM_VECTOR_RESPONSE = new ParseField("term_vector_response");


Can we name this term_vector?

The problem with that is that the response already has a field called term_vectors. But this might not be an issue, because it could fit into the new like parameter in such a way that any object that has the field "term_vectors" is treated as a term vector response.

{
  "query": {
    "more_like_this": {
      "like": {
        "_index": "index",
        "_type": "type",
        "_id": "id",
        "term_vectors": { "field_name": { ... } }
      }
    }
  }
}
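
The dispatch rule suggested in this comment could look roughly like the sketch below. It assumes the `like` item has already been parsed into a `Map`; the class `LikeItemSketch` and the method `carriesTermVectors` are hypothetical names for illustration and are not part of the MoreLikeThisQueryParser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual Elasticsearch parser.
class LikeItemSketch {

    // Treat a "like" item as a term vector response when it carries
    // an object-valued "term_vectors" field.
    static boolean carriesTermVectors(Map<String, Object> likeItem) {
        return likeItem.get("term_vectors") instanceof Map;
    }

    public static void main(String[] args) {
        Map<String, Object> item = new HashMap<>();
        item.put("_index", "index");
        item.put("_type", "type");
        item.put("_id", "id");
        System.out.println(carriesTermVectors(item)); // false

        item.put("term_vectors", new HashMap<String, Object>());
        System.out.println(carriesTermVectors(item)); // true
    }
}
```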

s1monw · 2014-10-09T20:02:18Z

Left some comments, but I like the idea; it simplifies things a bit :)

alexksikes · 2014-10-11T15:29:43Z

I have switched to using dummy Fields. However, there is an easier way, which would consist of making a TermVectorResponse object from the parsed JSON. Maybe that would be a cleaner implementation?

s1monw · 2014-10-13T11:06:03Z

@alexksikes I am not sure I understand what you mean. Can you clarify?

alexksikes · 2014-10-13T11:28:40Z

@s1monw Well, the idea was to simply overload TermVectorWriter#setFields to take parser.map(), and so not even have a TermVectorResponseParser class with a ParsedTermVectorResponse, but just a TermVectorResponse. But now that I think about it again, it seems fine the way it is.

alexksikes added v1.5.0 review v2.0.0-beta1 >enhancement >bug labels Oct 8, 2014

s1monw reviewed Oct 9, 2014
alexksikes added review and removed review labels Oct 11, 2014

clintongormley changed the title ~~MLT API: fixes many miss behaving user parameters~~ MLT API: fixes many misbehaving user parameters Oct 16, 2014

alexksikes removed the v1.5.0 label Oct 23, 2014

alexksikes added 3 commits October 29, 2014 11:58

use dummy Fields

2620eb3

rebased on master + adding breaking changes

91ac4c8

alexksikes force-pushed the feature/mlt-api-refactor branch from a970365 to 91ac4c8 October 29, 2014 11:27

clintongormley added the :MLT label Nov 29, 2014

alexksikes removed the review label Dec 5, 2014

alexksikes mentioned this pull request Dec 17, 2014

Support for shard level caching of term vectors #8395

Closed

drewr force-pushed the master branch from dcc3da0 to 7c20a8a February 20, 2015 16:48

alexksikes closed this Apr 15, 2015

clintongormley removed the >enhancement label Jun 7, 2015

clintongormley changed the title ~~MLT API: fixes many misbehaving user parameters~~ Fixes many misbehaving user parameters Jun 7, 2015

clintongormley added the :Search/Search label and removed the :More Like This label Feb 13, 2018
