Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support terms filtering #9561

Closed

Conversation

alexksikes
Copy link
Contributor

This adds a new feature to the Term Vectors API which allows for filtering of
terms based on their tf-idf scores. With dfs option on, this could be useful
for finding out a good characteristic vector of a document or a set of documents.
The parameters are similar to the ones used in the MLT Query.

This adds a new feature to the Term Vectors API which allows for filtering of
terms based on their tf-idf scores. With `dfs` option on, this could be useful
for finding out a good characteric vector of a document or a set of documents.
The parameters are similar to the ones used in the MLT Query.
topLevelTermsEnum = topLevelTerms.iterator(topLevelTermsEnum);
while (termsEnum.next() != null) {
BytesRef termBytesRef = termsEnum.term();
topLevelTermsEnum.seekExact(termBytesRef);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to ignore the return value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just assert that it returns true?

@jpountz
Copy link
Contributor

jpountz commented Feb 4, 2015

Looks good overall, can you add some tests?

@@ -86,6 +86,40 @@ or the field statistics of the entire index, and not just at the shard. Use it
with caution as distributed frequencies can have a serious performance impact.

[float]
==== Terms Filtering coming[2.0]

With the parameter `filter`, the terms returned could also be filtered based

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nervous about using the word filter here, as we try to reserve that purely for query DSL filters (although we do use it for fielddata filters as well http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/fielddata-formats.html#field-data-filtering). I'm struggling to come up with a better name though.

I'm wondering if we need this sub-level at all, or if we can just move the max_num_terms etc parameters up to the top level? There doesn't seem to be any name clash (although I do like the fact that having them as sub-params indicates their action more clearly).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nice thing about using filter as a sub-level parameter is that filter : {} with no parameter will perform filtering but using the default sub parameters. Another reason would be to hide the complexity of a feature which might be used only in some very specific cases. It also allows us to expand on this sub feature, in order, for example, to allow for filtering based on a script more cleanly I think. But I do agree, I am not sure if filter should be the proper name for this feature.

@s1monw
Copy link
Contributor

s1monw commented Mar 20, 2015

@alexksikes are you picking this up again or can we close it?

@alexksikes
Copy link
Contributor Author

Just need to add the tests for this. We would need this for Item Query also for which I wanted to write a dev issue to explain the plan.

@jpountz jpountz self-assigned this Apr 13, 2015
@jpountz
Copy link
Contributor

jpountz commented Apr 14, 2015

LGTM, just left a minor comment that does not need further review. However before pushing can you make sure to settle on the naming of this feature with @clintongormley ?

@clintongormley
Copy link

I can't think of a better name for this parameter, and it fits with fielddata filtering, so I think we should just stick to filter here (especially in light of the fact that query DSL filters will kinda go away in the future)

@alexksikes alexksikes removed the review label Apr 14, 2015
@alexksikes alexksikes deleted the feature/tvs-terms-filtering branch April 14, 2015 17:18
@clintongormley clintongormley changed the title Term Vectors: terms filtering Support terms filtering Jun 6, 2015
@alexksikes alexksikes mentioned this pull request Jun 22, 2015
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Term Vectors labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants