Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes many misbehaving user parameters #8028

Closed

Conversation

alexksikes
Copy link
Contributor

Previously, the MLT API would create one MLT query per field and per value.
This would make the parameters related to the term selection and query
formation such as max_query_terms, min_term_freq, minimum_should_match
(previously percent_terms_to_match) or boost_terms behave in an unexpected
manner. Let's take the common example of looking up similar documents with
respect to a list of tag names. Suppose these tags are modeled by a multi-
value field with a keyword analyzer. Performing a MLT request would therefore
result in one MLT query per tag, regardless of the value of max_query_terms
or minimum_should_match. This would result in a query made of all the tag
names, if min_term_freq = 1 (no actual selection of terms is taking place),
or zero tag whatsoever, if min_term_freq > 1 (note the default is 2). The
boost_terms parameter would also have unexpected effects as it would depend
on the frequency of the term within field value and, again, not within the
whole field.

This commit fixes these issues by calling upon the term vector API and by
directly passing the response (the terms) to the MoreLikeThisQueryParser. Now
both the API and the query yield exactly the same results under any given set
of parameters, but while keeping the added benefit for the API of calling upon
the TV API only once.

Closes #2914

@@ -69,6 +71,7 @@
public static final ParseField STOP_WORDS = new ParseField("stop_words");
public static final ParseField DOCUMENT_IDS = new ParseField("ids");
public static final ParseField DOCUMENTS = new ParseField("docs");
public static final ParseField TERM_VECTOR_RESPONSE = new ParseField("term_vector_response");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we name this term_vector ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with that is that the response already has a field called term_vectors ... But this might not be an issue because it might fit into the new like parameter in such a way that any object that has the field "term_vectors" will be treated as a term vector response.

{
  "query": {
    "more_like_this": {
      "like": {
        "_index": "index",
        "_type": "type",
        "_id": "id",
        "term_vectors": {
            "field_name": {
                ...
            }
        }
      }
    }
  }
}

@s1monw
Copy link
Contributor

s1monw commented Oct 9, 2014

left some comments but I like the idea simplifies the thing a bit :)

@alexksikes
Copy link
Contributor Author

I have switched to using dummy Fields. However, there is an easier way which would consist of making a TermVectorResponse object from the parsed json. Maybe that would a cleaner implementation?

@alexksikes alexksikes added review and removed review labels Oct 11, 2014
@s1monw
Copy link
Contributor

s1monw commented Oct 13, 2014

@alexksikes I am not sure I understand what you mean can you clarify?

@alexksikes
Copy link
Contributor Author

@s1monw Well the idea was to simply overload TermVectorWriter#setFields to take parser.map() and so not even have a TermVectorResponseParser class with a ParsedTermVectorResponse, but just a TermVectorResponse. But now that I re-think of it, it seems fine the way it is.

@clintongormley clintongormley changed the title MLT API: fixes many miss behaving user parameters MLT API: fixes many misbehaving user parameters Oct 16, 2014
@alexksikes alexksikes removed the v1.5.0 label Oct 23, 2014
Previously, the MLT API would create one MLT query per field and per value.
This would make the parameters related to the term selection and query
formation such as `max_query_terms`, `min_term_freq`, `minimum_should_match`
(previously `percent_terms_to_match`) or `boost_terms` behave in an unexpected
manner. Let's take the common example of looking up similar documents with
respect to a list of tag names. Suppose these tags are modeled by a multi-
value field with a keyword analyzer. Performing a MLT request would therefore
result in one MLT query per tag, regardless of the value of `max_query_terms`
or `minimum_should_match`. This would result in a query made of all the tag
names, if `min_term_freq` = 1 (no actual selection of terms is taking place),
or zero tag whatsoever, if `min_term_freq` > 1 (note the default is 2). The
`boost_terms` parameter would also have unexpected effects as it would depend
on the frequency of the term within field value and, again, not within the
whole field.

This commit fixes these issues by calling upon the term vector API and by
directly passing the response (the terms) to the MoreLikeThisQueryParser. Now
both the API and the query yield exactly the same results under any given set
of parameters, but while keeping the added benefit for the API of calling upon
the TV API only once.

Closes elastic#2914
@alexksikes alexksikes removed the review label Dec 5, 2014
@alexksikes alexksikes closed this Apr 15, 2015
@clintongormley clintongormley changed the title MLT API: fixes many misbehaving user parameters Fixes many misbehaving user parameters Jun 7, 2015
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :More Like This labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MLT bug when source disabled?
3 participants