This is a follow-up to #6477. Calls to setNextReader are currently centralized in the AggregationContext class. In #6477 the conclusion was that this behaviour was better than making setNextReader delegate to sub-aggregations, as it was more efficient for deeply nested aggregation trees (with several aggregators sharing the same source of values). However, I'm more and more convinced this is not the right approach:
- it makes aggregations hard to unit test
- it makes defer/replay sub-optimal, as the whole context is replayed instead of just what is needed
- it makes it hard to migrate to Lucene 5-style collectors (with one leaf collector per segment), which would allow us to have more optimized aggregations (especially for the single-valued case)
Ideally I would like aggregators to be as close as possible to Lucene collectors in terms of API. So even if it would initially be worse for deeply nested trees, I think we should revive #6477 and think about other ways to make deeply nested trees faster.
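To make the target concrete, here is a rough sketch of what a collector-like aggregator API could look like. All names here are illustrative, not the actual Elasticsearch or Lucene classes: each aggregator hands out one leaf collector per segment and delegates to its sub-aggregators at that point, instead of registering every instance with a central AggregationContext.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the real Elasticsearch API): aggregators that
// mirror Lucene 5's Collector, handing out one leaf collector per segment
// and delegating to sub-aggregators top-down.
public class LeafCollectorSketch {

    interface LeafCollector {
        void collect(int doc);
    }

    static abstract class Aggregator {
        final List<Aggregator> subAggregators = new ArrayList<>();

        // Called once per segment; delegation to sub-aggregators happens
        // here rather than through a central setNextReader() broadcast.
        LeafCollector getLeafCollector(int leafOrd) {
            final LeafCollector self = newLeafCollector(leafOrd);
            final List<LeafCollector> subs = new ArrayList<>();
            for (Aggregator sub : subAggregators) {
                subs.add(sub.getLeafCollector(leafOrd));
            }
            return doc -> {
                self.collect(doc);
                for (LeafCollector s : subs) {
                    s.collect(doc);
                }
            };
        }

        abstract LeafCollector newLeafCollector(int leafOrd);
    }

    // Trivial aggregator that counts collected docs, to exercise the sketch.
    static class CountAggregator extends Aggregator {
        long count;

        @Override
        LeafCollector newLeafCollector(int leafOrd) {
            return doc -> count++;
        }
    }

    static long[] run() {
        CountAggregator parent = new CountAggregator();
        CountAggregator child = new CountAggregator();
        parent.subAggregators.add(child);

        // Two segments, three docs each.
        for (int leaf = 0; leaf < 2; leaf++) {
            LeafCollector c = parent.getLeafCollector(leaf);
            for (int doc = 0; doc < 3; doc++) {
                c.collect(doc);
            }
        }
        return new long[] { parent.count, child.count };
    }

    public static void main(String[] args) {
        long[] counts = run();
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```

With this shape, segment transitions flow down the tree exactly once per segment, and nothing outside the tree needs to track which instances are reader-aware.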
The problem I found in #6477 was with aggregations (such as the terms aggregation) that create buckets dynamically. Because we have no idea up front how many buckets will be needed, we currently create a new instance of the aggregator for each bucket, and each instance needs to be reader and scorer aware. We use the anonymous Aggregator class in AggregatorFactories [1] to create and manage these instances. With the approaches I tried in #6477, this anonymous class always ends up iterating through a collection of Aggregator instances (one per bucket) on every call to setNextReader(), and it is this iteration that kills performance: in nested terms aggregations (terms aggregations with sub-terms aggregations) we end up creating an iterator for every parent term bucket just to iterate through its sub-term buckets. With the current way of registering instances with the AggregationContext, there is only a single list of instances on which setNextReader() needs to be called, so only one iterator is required rather than nested iterators.
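The iterator blow-up described above can be counted directly. This is a toy model of the per-bucket case, with hypothetical names: a delegating setNextReader() creates one iterator at the root plus one per parent bucket on every segment change, whereas the flat AggregationContext registry needs just one.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the per-bucket problem: when each bucket owns its own
// sub-aggregator instance, broadcasting setNextReader() down the tree
// creates an iterator per parent bucket on every segment change.
public class PerBucketCost {
    static int iteratorsCreated = 0;

    static class PerBucketAggregator {
        final List<PerBucketAggregator> perBucketSubs = new ArrayList<>();

        void setNextReader(int leafOrd) {
            if (!perBucketSubs.isEmpty()) {
                iteratorsCreated++; // one iterator per aggregator with subs
                for (PerBucketAggregator sub : perBucketSubs) {
                    sub.setNextReader(leafOrd);
                }
            }
        }
    }

    // Build a root with `parentBuckets` per-bucket children, each of which
    // owns `subBuckets` per-bucket sub-aggregators, then simulate one
    // segment change and report how many iterators were created.
    static int run(int parentBuckets, int subBuckets) {
        iteratorsCreated = 0;
        PerBucketAggregator root = new PerBucketAggregator();
        for (int i = 0; i < parentBuckets; i++) {
            PerBucketAggregator parent = new PerBucketAggregator();
            for (int j = 0; j < subBuckets; j++) {
                parent.perBucketSubs.add(new PerBucketAggregator());
            }
            root.perBucketSubs.add(parent);
        }
        root.setNextReader(0);
        return iteratorsCreated; // 1 + parentBuckets
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 10));
    }
}
```

With 1000 parent term buckets, a single segment change costs 1001 iterator allocations in the delegating model, versus 1 for the flat registry; multiply by the segment count and the overhead adds up.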
If we could somehow remove the need for BucketAggregationMode.PER_BUCKET aggregators, then we would only have one instance of the ReaderContextAware and ScorerAware classes (the Aggregator) regardless of the number of buckets the Aggregator creates.
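One possible shape for that, sketched here with hypothetical names: instead of one aggregator instance per bucket, a single instance keeps per-bucket state in arrays indexed by a bucket ordinal, so only one object ever needs to see reader/scorer changes no matter how many buckets are created on the fly.

```java
import java.util.Arrays;

// Hypothetical sketch: a single aggregator instance that counts docs per
// bucket, keyed by a bucket ordinal, instead of spawning a PER_BUCKET
// instance for every new bucket.
public class OrdinalCountAggregator {
    private long[] counts = new long[1];

    // collect(doc, bucketOrd) replaces per-bucket instances: the state
    // array grows lazily as new bucket ordinals appear.
    public void collect(int doc, long bucketOrd) {
        if (bucketOrd >= counts.length) {
            int newSize = Math.max((int) bucketOrd + 1, counts.length * 2);
            counts = Arrays.copyOf(counts, newSize);
        }
        counts[(int) bucketOrd]++;
    }

    public long count(long bucketOrd) {
        return bucketOrd < counts.length ? counts[(int) bucketOrd] : 0;
    }

    public static void main(String[] args) {
        OrdinalCountAggregator agg = new OrdinalCountAggregator();
        agg.collect(0, 0);
        agg.collect(1, 0);
        agg.collect(2, 5); // bucket created on the fly, same instance
        System.out.println(agg.count(0) + " " + agg.count(5));
    }
}
```

Since there is only one instance, registering it for reader/scorer changes (or giving it a leaf collector per segment) is O(1) per aggregation, independent of bucket count.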
cc @colings86