Sampler aggregation #10221
import java.util.ArrayList;
import java.util.List;

public class BestBucketsDeferringCollector extends DeferringBucketCollector {
Now that we have multiple implementations of DeferringBucketCollector, could we have a class-level Javadoc on the implementation to describe briefly what each one is trying to achieve?
[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation used to limit any nested aggregations' processing to a sample of the top-scoring documents.
Can we say sub-aggregation here so we don't overload 'nested aggregation' which could cause confusion?
agree
@colings86 rebased on latest master if you get a chance to review
@@ -0,0 +1,160 @@
[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation
Not sure if it already is, but can't see it if so: we should mark this feature as experimental in the docs
@markharwood Left some comments
@colings86 Thanks for the review. I added a couple of comments above on execution_hint test coverage and updated the code based on your other comments.
@jpountz @clintongormley This PR allows users to do analytics on a sample, where you can also choose to diversify results on the basis of a particular field (e.g. analyse the top X tweets but no more than Y tweets from a single Twitter account on each shard). The question is what the least-worst thing to do on each shard is, given the unmapped problem, i.e. the chosen diversifying field doesn't exist on one of the indexes/shards being queried:
…ns' processing to a sample of the top-scoring documents. Optionally, a “diversify” setting can limit the number of collected matches that share a common value such as an "author". Closes #8108
…SamplerAggregator, added nestedSamples test.
…for this condition.
Took a decision with Colin on the 2 remaining questions:
Poke @colings86
LGTM
Pushed to master 63db34f
Used to limit any nested aggregations' processing to a sample of the top-scoring documents.
Optionally, a “diversify” setting can limit the number of collected matches that share a common value such as an "author".
The original "DeferringBucketCollector" has been abstracted: the bulk of the original code now lives in a new subclass, BestBucketsDeferringCollector, and the new alternative deferring policy is implemented in the BestDocsDeferringCollector subclass.
The diversifying logic relies on Lucene 5.1, which includes changes to support this specialized form of result collection.
Closes #8108
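For readers skimming the thread, here is a minimal sketch of the per-shard "diversify" idea described above (top X hits, no more than Y per author). It is illustrative only and is not the code in this PR: the real logic lives in BestDocsDeferringCollector and relies on Lucene 5.1 collectors. The `DiversifiedSampleSketch` class, the `Hit` type, and the `shardSize` / `maxPerAuthor` parameter names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Elasticsearch implementation.
public class DiversifiedSampleSketch {

    /** Hypothetical stand-in for one scored hit on a shard. */
    public static final class Hit {
        public final String id;
        public final String author;
        public final float score;

        public Hit(String id, String author, float score) {
            this.id = id;
            this.author = author;
            this.score = score;
        }
    }

    /**
     * Returns up to shardSize of the best-scoring hits, keeping at most
     * maxPerAuthor hits that share the same author value.
     */
    public static List<Hit> sample(List<Hit> hits, int shardSize, int maxPerAuthor) {
        // Sort a copy by descending score so the input list is left untouched.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        Map<String, Integer> perAuthor = new HashMap<>();
        List<Hit> sample = new ArrayList<>();
        for (Hit h : sorted) {
            if (sample.size() >= shardSize) {
                break; // sample is full
            }
            int seen = perAuthor.getOrDefault(h.author, 0);
            if (seen >= maxPerAuthor) {
                continue; // this author already has its quota in the sample
            }
            perAuthor.put(h.author, seen + 1);
            sample.add(h);
        }
        return sample;
    }
}
```

Sorting everything up front keeps the sketch short; a real collector would more likely maintain a bounded priority queue while collecting, so that only the sample is held in memory.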