New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes many misbehaving user parameters #8028
Conversation
@@ -69,6 +71,7 @@ | |||
public static final ParseField STOP_WORDS = new ParseField("stop_words"); | |||
public static final ParseField DOCUMENT_IDS = new ParseField("ids"); | |||
public static final ParseField DOCUMENTS = new ParseField("docs"); | |||
public static final ParseField TERM_VECTOR_RESPONSE = new ParseField("term_vector_response"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we name this term_vector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem with that is that the response already has a field called term_vectors
... But this might not be an issue because it might fit into the new like
parameter in such a way that any object that has the field "term_vectors" will be treated as a term vector response.
{
"query": {
"more_like_this": {
"like": {
"_index": "index",
"_type": "type",
"_id": "id",
"term_vectors": {
"field_name": {
...
}
}
}
}
}
}
left some comments but I like the idea simplifies the thing a bit :) |
I have switched to using dummy Fields. However, there is an easier way which would consist of making a TermVectorResponse object from the parsed json. Maybe that would a cleaner implementation? |
@alexksikes I am not sure I understand what you mean can you clarify? |
@s1monw Well the idea was to simply overload TermVectorWriter#setFields to take parser.map() and so not even have a TermVectorResponseParser class with a ParsedTermVectorResponse, but just a TermVectorResponse. But now that I re-think of it, it seems fine the way it is. |
Previously, the MLT API would create one MLT query per field and per value. This would make the parameters related to the term selection and query formation such as `max_query_terms`, `min_term_freq`, `minimum_should_match` (previously `percent_terms_to_match`) or `boost_terms` behave in an unexpected manner. Let's take the common example of looking up similar documents with respect to a list of tag names. Suppose these tags are modeled by a multi- value field with a keyword analyzer. Performing a MLT request would therefore result in one MLT query per tag, regardless of the value of `max_query_terms` or `minimum_should_match`. This would result in a query made of all the tag names, if `min_term_freq` = 1 (no actual selection of terms is taking place), or zero tag whatsoever, if `min_term_freq` > 1 (note the default is 2). The `boost_terms` parameter would also have unexpected effects as it would depend on the frequency of the term within field value and, again, not within the whole field. This commit fixes these issues by calling upon the term vector API and by directly passing the response (the terms) to the MoreLikeThisQueryParser. Now both the API and the query yield exactly the same results under any given set of parameters, but while keeping the added benefit for the API of calling upon the TV API only once. Closes elastic#2914
a970365
to
91ac4c8
Compare
Previously, the MLT API would create one MLT query per field and per value.
This would make the parameters related to the term selection and query
formation such as
max_query_terms
,min_term_freq
,minimum_should_match
(previously
percent_terms_to_match
) orboost_terms
behave in an unexpectedmanner. Let's take the common example of looking up similar documents with
respect to a list of tag names. Suppose these tags are modeled by a multi-
value field with a keyword analyzer. Performing a MLT request would therefore
result in one MLT query per tag, regardless of the value of
max_query_terms
or
minimum_should_match
. This would result in a query made of all the tagnames, if
min_term_freq
= 1 (no actual selection of terms is taking place),or zero tag whatsoever, if
min_term_freq
> 1 (note the default is 2). Theboost_terms
parameter would also have unexpected effects as it would dependon the frequency of the term within field value and, again, not within the
whole field.
This commit fixes these issues by calling upon the term vector API and by
directly passing the response (the terms) to the MoreLikeThisQueryParser. Now
both the API and the query yield exactly the same results under any given set
of parameters, but while keeping the added benefit for the API of calling upon
the TV API only once.
Closes #2914