Deferred aggregations prevent combinatorial explosion #6128
Conversation
@Override
public final void collect(int docId, long bucketOrdinal) throws IOException {
    int pos = Arrays.binarySearch(sortedOrds, bucketOrdinal);
Should it be a hash table instead to make the access constant-time? I think it wouldn't matter with the default size of 10, but maybe it would if the user sets shard_size to e.g. 1000?
We already make a split in the choice of collector impl for the case where the number of buckets is 1 or >1, so maybe there could be another break-point where we choose between a hash table and a sorted array?
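To illustrate the trade-off being discussed, here is a minimal, self-contained sketch (not the Elasticsearch implementation; the class and method names are hypothetical) of the two lookup strategies for testing whether a bucket ordinal survived pruning: binary search over a sorted array versus a hash-based set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: two strategies for membership checks on surviving
// bucket ordinals. Binary search is O(log n) but cache-friendly for small n
// (e.g. the default shard_size of 10); a hash set is O(1) and may win when
// shard_size is large (e.g. 1000).
public class OrdinalLookup {
    private final long[] sortedOrds;
    private final Set<Long> ordSet;

    OrdinalLookup(long[] ords) {
        this.sortedOrds = ords.clone();
        Arrays.sort(this.sortedOrds);
        this.ordSet = new HashSet<>();
        for (long o : ords) {
            ordSet.add(o);
        }
    }

    // Sorted-array strategy, as in the snippet under review.
    boolean containsBinary(long ord) {
        return Arrays.binarySearch(sortedOrds, ord) >= 0;
    }

    // Hash-table strategy suggested in the comment above.
    boolean containsHashed(long ord) {
        return ordSet.contains(ord);
    }

    public static void main(String[] args) {
        OrdinalLookup lookup = new OrdinalLookup(new long[] { 42, 3, 7 });
        System.out.println(lookup.containsBinary(7));   // present -> true
        System.out.println(lookup.containsHashed(8));   // absent  -> false
    }
}
```

A real break-point between the two would be chosen by benchmarking, since boxing `long` keys into a `HashSet<Long>` has its own overhead; a primitive-specialized hash table would be the more realistic candidate.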
I left a few comments, but I like the new per-segment buffering of documents/buckets. I also think we should remove
I quickly looked at the last changes and they look good! Before we pull that in, I think we should make sure users would get a meaningful error if they try to use scores while replaying doc IDs, and take another look at the formatting (some missing spaces around operators/brackets and lines with trailing spaces).
…prune_first strategies
…ow for more compact data structures downstream where heavy pruning reduces the number of buckets under consideration
Added 'Deferred Aggregation' to the TermsAggregationSearchBenchmark and created a new benchmark for testing nested aggregations with different combinations of collect mode at each level.
// A scorer used for the deferred collection mode to handle any child aggs asking for scores that are not
// recorded.
static final Scorer unavailableScorer=new Scorer(null){
    private final String MSG="A limitation of the "+SubAggCollectionMode.DEPTH_FIRST.parseField.getPreferredName()+
s/DEPTH_FIRST/BREADTH_FIRST/ ?
Another "deferred" use case to consider? https://groups.google.com/forum/#!topic/elasticsearch/CtDhs0HDK2Q

I don't think it could help: building buckets based on counts is not practical, as you would need the global counts to make a decision while a shard would only have shard-local knowledge.
// A scorer used for the deferred collection mode to handle any child aggs asking for scores that are not
// recorded.
static final Scorer unavailableScorer=new Scorer(null){
    private final String MSG="A limitation of the "+SubAggCollectionMode.BREADTH_FIRST.parseField.getPreferredName()+
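The "unavailable scorer" idea above can be shown with a small, dependency-free sketch (hypothetical class and message; the real version subclasses Lucene's Scorer and reads the mode name from SubAggCollectionMode): a stand-in scorer that fails loudly with a meaningful message if any child agg asks for scores during deferred replay.

```java
// Hypothetical sketch of the "unavailable scorer" pattern: instead of
// silently returning a bogus score during doc ID replay, throw an
// exception that names the limitation so users get a meaningful error.
public class UnavailableScorer {
    private static final String MSG =
        "A limitation of the breadth_first collection mode is that scores "
        + "are not available during post-collection replay of doc IDs";

    public float score() {
        throw new UnsupportedOperationException(MSG);
    }

    public static void main(String[] args) {
        try {
            new UnavailableScorer().score();
        } catch (UnsupportedOperationException e) {
            // The error message should identify the collection-mode limitation.
            System.out.println("meaningful: " + e.getMessage().startsWith("A limitation"));
        }
    }
}
```

This addresses the earlier review ask that users get a meaningful error rather than a NullPointerException when scores are requested while replaying doc IDs.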
Can you add spaces around '=' and '+'?
LGTM, I just left comments about formatting. Can you fix these before pushing?

LGTM
…regator class to support a new mode of deferred collection. A new "breadth_first" results collection mode allows upper branches of the aggregation tree to be calculated and then pruned to a smaller selection before advancing into executing collection on child branches. Closes #6128
New BucketCollector classes to aid the recording and subsequent playback of "collect" streams in aggs to reduce combinatorial explosions where pruning of parent buckets should occur before calculating child aggs.
Aggregator base class now wraps the subAgg BucketCollectors with any required caching of collect streams for sub aggregations that are indicated as being deferred. Aggregator subclasses should now override shouldDefer to indicate any aggs that are expensive to compute, and in the buildAggregation call should subsequently call runDeferredCollections with the subset of bucket ordinals that represent the pruned parent buckets of interest.
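The record-and-replay mechanism described above can be sketched in a few lines (hypothetical names; the real BucketCollector API records per-segment and handles Lucene readers): buffer (docId, bucketOrdinal) pairs during the first pass, then replay only the pairs whose parent bucket survived pruning into the deferred child collector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch of deferred collection: record the "collect" stream,
// prune parent buckets, then replay only the surviving ordinals so that
// expensive child aggs never see docs from pruned buckets.
public class DeferringCollector {
    // Each entry is { docId, bucketOrdinal }.
    private final List<long[]> buffer = new ArrayList<>();

    // First pass: record instead of forwarding to child aggs.
    void collect(int docId, long bucketOrdinal) {
        buffer.add(new long[] { docId, bucketOrdinal });
    }

    // Second pass: replay buffered hits for surviving ordinals only.
    void runDeferredCollections(long[] survivingOrds, BiConsumer<Integer, Long> downstream) {
        long[] sortedOrds = survivingOrds.clone();
        Arrays.sort(sortedOrds);
        for (long[] hit : buffer) {
            if (Arrays.binarySearch(sortedOrds, hit[1]) >= 0) {
                downstream.accept((int) hit[0], hit[1]);
            }
        }
    }

    public static void main(String[] args) {
        DeferringCollector collector = new DeferringCollector();
        collector.collect(1, 10);
        collector.collect(2, 11);  // bucket 11 will be pruned
        collector.collect(3, 10);
        // Only bucket ordinal 10 survives pruning.
        collector.runDeferredCollections(new long[] { 10 },
            (doc, ord) -> System.out.println(doc + "->" + ord));
    }
}
```

The pay-off is exactly the combinatorial-explosion fix the PR title describes: child aggs are computed only for the pruned-down set of parent buckets, at the cost of buffering the collect stream.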