Aggregations: new “Sampler” provides a filter for top-scoring docs #8191
Conversation
@jpountz Would appreciate you taking a look if you have time.
import org.elasticsearch.search.aggregations.bucket.SingleBucketAggregation;

/**
 * A {@code filter} aggregation. Defines a single bucket that holds all documents that match a specific filter.
Copy-paste left-over?
I'm wondering how we could allow for other modes of sampling in the future. For instance, this mode is focused on quality and replays the top documents to the sub-aggregators, but we could also imagine a sampling aggregation that would only forward every N-th document to the sub-aggs (for speed reasons, as it would certainly not help quality). So maybe we could rename it to something like the …
If the sampling is being done for reasons of speed rather than quality, then you'll want to cap the number of docs sampled with some accuracy. Taking every N-th doc is probably not the best way to achieve this, as we typically don't know how many docs will match a query. It's probably better to take a random sample of a fixed size - this can be achieved by using the existing priority queue impl with a fixed size and populating it with matches that have random scores.
The only alternative means of applying a quality-based filter is to tighten up the query using min_should_match or similar and this is not always easy to control or for users to understand.
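The fixed-size random sampling idea above can be sketched as follows (a hypothetical Python illustration of the technique, not the PR's Java code): give every matching doc a random score and keep the k highest-scored docs in a bounded min-heap, which yields a uniform fixed-size random sample regardless of how many docs match.

```python
import heapq
import random

def random_fixed_size_sample(doc_ids, k, seed=None):
    """Keep a uniform random sample of at most k docs from a stream of
    matches by assigning each doc a random score and retaining the k
    largest in a fixed-size min-heap (mirrors a bounded priority queue).

    Illustrative helper only - names and signature are invented for
    this sketch."""
    rng = random.Random(seed)
    heap = []  # (random_score, doc_id), smallest score at the root
    for doc_id in doc_ids:
        score = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Evict the current lowest-scored doc in favour of this one.
            heapq.heapreplace(heap, (score, doc_id))
    return [doc_id for _, doc_id in heap]
```

Seeding the generator also gives reproducible samples, which ties in with the seed discussion further down.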
The large deviation in scores produced by these factors could provide a useful way to separate the strongly and weakly matched users.
For the record, scoring (as in use of IDF, norms and coord) seems impossible if your content is numerics, as in my movie-lists recommendation example above. The fact that the field type is an integer means …
…ring docs used in child aggregations. The existing Aggregator base class support for deferring computation of child aggs is modified: DeferringBucketCollector is refactored to be abstract, the logic it had for "best bucket" trimming is pushed down into a new BestBucketsDeferringCollector, and the new logic for trimming based on score quality is in the subclass BestDocsDeferringCollector. As well as allowing a sample of top-scoring docs, the option of a random sample is also provided.

Closes #8108
Rebased on latest master. Removed option of "relative_to_max_score", added suggested option of random sampling.
shard_size:: Optional. The maximum number of documents sampled on each shard. Defaults to 100.

random_sample:: Optional. Set to true to take a random sample of documents rather than the top-scoring documents.
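A request using these options might look like the following sketch (illustrative only - the field name and sub-aggregation are assumptions, and the `random_sample` parameter was still under discussion in this PR, so the shipped syntax may differ):

```json
{
  "query": { "match": { "text": "iphone" } },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200,
        "random_sample": true
      },
      "aggs": {
        "keywords": { "significant_terms": { "field": "text" } }
      }
    }
  }
}
```

The sub-aggregation then only sees the (at most) `shard_size` sampled docs per shard rather than every match.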
What would you think about something like `sample: top|random` instead?
Or maybe a better way to expose such functionality would be to remove this parameter and allow this aggregation to take a query as a parameter; users interested in random sampling could then use a random function score (which would also automatically add support for reproducibility through seeds, etc.).
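The composition suggested here can be sketched with the existing `function_score` query's `random_score` function (an illustrative fragment - the wrapped query is an assumption, and exact option shapes may vary across versions):

```json
{
  "query": {
    "function_score": {
      "query": { "match": { "text": "iphone" } },
      "random_score": { "seed": 1234 },
      "boost_mode": "replace"
    }
  }
}
```

With `boost_mode: replace` the random value stands in for the relevance score, so sampling the "top" docs of this query amounts to taking a seeded random sample.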
I know I was the one who suggested random sampling, but it was really more to start a discussion than a recommendation, so if it ends up making everything more complicated I'd be happy to not have it in this PR :)
@jpountz Thanks for the review. I think adding the random sampling was beneficial, if only to refactor the deferring collector base class to allow for other forms of determining "top" collections.
I have a potentially more useful feature in mind to add - a "diversity filter" which ensures the selected sample is free of duplicate sources of information, e.g. you could de-dupe on the "original_author" field of an index full of tweets or the "email_thread_id" of an index of emails. The reason for using this would be to avoid a single voice dominating any analysis. The implementation would have to be a bit smart and select the best-scoring doc for each key in order for the key's doc to be competitive. This may be an interesting challenge to implement efficiently.

I'm not sure if this "diversity filter" should exist as part of the sampling agg or could provide a useful feature as a stand-alone agg. An example e-commerce requirement might be to query a product database for the best matches but bring back at most 3 products from any single manufacturer. You can get some of the way there using existing aggs (see this gist), but it has issues, and I expect pagination will always be a problem with this sort of logic. It's probably a mistake generally to mix pagination and aggs.
I'll save the "diversity" logic for another PR but it is worth considering how we design the Sampler DSL if we hope to add this sort of extension in future.
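The per-key selection described above can be sketched as follows (a hypothetical Python illustration; `diverse_top_docs`, `key_of` and `max_per_key` are invented names, and this is not the eventual diversified-sampler implementation): visit docs best-first so each key's limited slots go to its top-scoring docs.

```python
from collections import defaultdict

def diverse_top_docs(scored_docs, key_of, max_per_key, sample_size):
    """Illustrative 'diversity filter' sketch: keep the overall
    best-scoring docs while allowing at most max_per_key docs per
    de-duplication key (e.g. an author field or email thread id)."""
    per_key = defaultdict(int)
    selected = []
    # Best-first order ensures each key's slots hold its top-scoring docs.
    for score, doc in sorted(scored_docs, key=lambda sd: -sd[0]):
        key = key_of(doc)
        if per_key[key] < max_per_key:
            per_key[key] += 1
            selected.append((score, doc))
            if len(selected) == sample_size:
                break
    return selected
```

An efficient streaming version is the hard part hinted at above - this sketch sorts all matches up front, which a real collector would want to avoid.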
@markharwood Thanks for your explanations, this aggregation is making more and more sense to me. I left some comments about the configuration of this new agg.

The "diversity" feature I mentioned in comments above is dependent on this proposed Lucene feature: https://issues.apache.org/jira/browse/LUCENE-6066

Some use cases that are coming up: https://groups.google.com/forum/#!topic/elasticsearch/iNA75M3xkZ4

Added thought - as a result of poking around the time-checking features involved in #9156, it looks like it should also be possible to constrain a sample by max processing time rather than volume of documents, as suggested here. SearchContext includes a timer that can be used to check elapsed time efficiently.

@markharwood Now that #10221 is in, we can close this PR, no?
Pushed to master 63db34f |