New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support terms filtering #9561
Support terms filtering #9561
Conversation
This adds a new feature to the Term Vectors API which allows for filtering of terms based on their tf-idf scores. With `dfs` option on, this could be useful for finding out a good characteric vector of a document or a set of documents. The parameters are similar to the ones used in the MLT Query.
topLevelTermsEnum = topLevelTerms.iterator(topLevelTermsEnum); | ||
while (termsEnum.next() != null) { | ||
BytesRef termBytesRef = termsEnum.term(); | ||
topLevelTermsEnum.seekExact(termBytesRef); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to ignore the return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just assert that it returns true?
Looks good overall, can you add some tests? |
@@ -86,6 +86,40 @@ or the field statistics of the entire index, and not just at the shard. Use it | |||
with caution as distributed frequencies can have a serious performance impact. | |||
|
|||
[float] | |||
==== Terms Filtering coming[2.0] | |||
|
|||
With the parameter `filter`, the terms returned could also be filtered based |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm nervous about using the word filter
here, as we try to reserve that purely for query DSL filters (although we do use it for fielddata filters as well http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/fielddata-formats.html#field-data-filtering). I'm struggling to come up with a better name though.
I'm wondering if we need this sub-level at all, or if we can just move the max_num_terms
etc parameters up to the top level? There doesn't seem to be any name clash (although I do like the fact that having them as sub-params indicates their action more clearly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nice thing about using filter
as a sub-level parameter is that filter : {}
with no parameter will perform filtering but using the default sub parameters. Another reason would be to hide the complexity of a feature which might be used only in some very specific cases. It also allows us to expand on this sub feature, in order, for example, to allow for filtering based on a script more cleanly I think. But I do agree, I am not sure if filter
should be the proper name for this feature.
@alexksikes are you picking this up again or can we close it? |
Just need to add the tests for this. We would need this for Item Query also for which I wanted to write a dev issue to explain the plan. |
LGTM, just left a minor comment that does not need further review. However before pushing can you make sure to settle on the naming of this feature with @clintongormley ? |
I can't think of a better name for this parameter, and it fits with fielddata filtering, so I think we should just stick to |
This adds a new feature to the Term Vectors API which allows for filtering of
terms based on their tf-idf scores. With
dfs
option on, this could be usefulfor finding out a good characteristic vector of a document or a set of documents.
The parameters are similar to the ones used in the MLT Query.