
Deferred aggregations prevent combinatorial explosion #6128

Closed
wants to merge 15 commits into from

Conversation

markharwood
Contributor

New BucketCollector classes aid the recording and subsequent playback of "collect" streams in aggregations, reducing the combinatorial explosion that occurs when expensive child aggregations are calculated before parent buckets have been pruned.
The Aggregator base class now wraps the sub-aggregation BucketCollectors with any required caching of collect streams for sub-aggregations that are marked as deferred. Aggregator subclasses should override shouldDefer to indicate any aggregations that are expensive to compute, and in the buildAggregation call should then call runDeferredCollections with the subset of bucket ordinals that represent the pruned parent buckets of interest.
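The record-then-replay idea described above can be sketched in isolation. This is a minimal, hypothetical stand-in for the actual BucketCollector/Aggregator machinery, not the code in this PR; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

// Minimal stand-in for a deferring bucket collector: the first pass buffers
// (docId, bucketOrdinal) pairs instead of driving an expensive sub-aggregation;
// after the parent buckets have been pruned, only hits belonging to the
// surviving ordinals are replayed downstream.
class DeferringCollector {
    private final List<long[]> recorded = new ArrayList<>(); // {docId, bucketOrd}

    // First pass: record the collect stream rather than forwarding it.
    void collect(int docId, long bucketOrd) {
        recorded.add(new long[] { docId, bucketOrd });
    }

    // Second pass: replay only hits whose parent bucket survived pruning,
    // returning how many hits were forwarded to the child aggregation.
    int replay(Set<Long> survivingOrds, BiConsumer<Integer, Long> downstream) {
        int replayed = 0;
        for (long[] hit : recorded) {
            if (survivingOrds.contains(hit[1])) {
                downstream.accept((int) hit[0], hit[1]);
                replayed++;
            }
        }
        return replayed;
    }
}
```

If, say, 1000 candidate parent buckets are pruned down to the top 10, only the hits for those 10 ordinals ever reach the child aggregation, which is the source of the savings.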

@jpountz jpountz added v1.2.0 and removed v1.2.0 labels May 12, 2014

@Override
public final void collect(int docId, long bucketOrdinal) throws IOException {
int pos = Arrays.binarySearch(sortedOrds, bucketOrdinal);
Contributor

Should it be a hash table instead to make the access constant-time? I think it wouldn't matter with the default size of 10, but maybe it would if the user sets shard_size to e.g. 1000?

Contributor Author

We already make a split in the choice of collector implementation for the case where the number of buckets is 1 or >1, so maybe there could be another break-point where we choose between a hash table and a sorted array?
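The break-point suggested here could look something like the following self-contained sketch. The cutoff value and the class name are made up for illustration and are not taken from the Elasticsearch code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Picks a membership-test implementation for a set of bucket ordinals:
// a sorted array with binary search for small sets (compact, no boxing),
// a hash set for large ones (constant-time lookup). The cutoff of 128 is
// an arbitrary placeholder, not a tuned value.
abstract class BucketOrdFilter {
    static final int CUTOFF = 128;

    abstract boolean contains(long bucketOrd);

    static BucketOrdFilter of(long[] ords) {
        if (ords.length <= CUTOFF) {
            final long[] sorted = ords.clone();
            Arrays.sort(sorted);
            return new BucketOrdFilter() {
                @Override
                boolean contains(long ord) {
                    // O(log n), but cache-friendly and allocation-free.
                    return Arrays.binarySearch(sorted, ord) >= 0;
                }
            };
        }
        final Set<Long> set = new HashSet<>();
        for (long o : ords) {
            set.add(o);
        }
        return new BucketOrdFilter() {
            @Override
            boolean contains(long ord) {
                // O(1) expected, at the cost of boxing and hashing.
                return set.contains(ord);
            }
        };
    }
}
```

Callers never see which strategy was chosen, so the break-point can be tuned later without touching call sites.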


jpountz commented May 30, 2014

I left a few comments, but I like the new per-segment buffering of documents/buckets. I also think we should remove FilteringSingleBucketCollector, I don't like having a specialization for something that would be used so rarely: it is only used when shard_size is 1.

jpountz commented May 30, 2014

I quickly looked at the last changes and they look good! Before we pull that in, I think we should make sure users would get a meaningful error if they try to use scores while replaying doc IDs and to take another look at the formatting (some missing spaces around operators/brackets and lines with trailing spaces).
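The "meaningful error" suggestion could be sketched as a stand-in scorer whose score() fails loudly when scores are requested during replay. This is an illustrative class with an invented message, not the actual Lucene Scorer subclass added in this PR:

```java
// A stand-in scorer for deferred (breadth_first) replay: doc IDs were
// recorded without their scores, so any child aggregation asking for a
// score gets an explanatory exception rather than a garbage value.
class UnavailableScorer {
    static final String MSG =
        "A limitation of breadth_first collection is that scores are not available "
        + "to child aggregations; either use depth_first collection or avoid "
        + "score-dependent child aggregations.";

    float score() {
        throw new UnsupportedOperationException(MSG);
    }
}
```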

markharwood and others added 5 commits June 2, 2014 11:25
…ow for more compact data structures downstream where heavy pruning reduces the numbers of buckets under consideration
Added 'Deferred Aggregation' to the TermsAggregationSearchBenchmark and created a new benchmark for testing nested aggregations with different combinations of collect mode at each level
// A scorer used for the deferred collection mode to handle any child aggs asking for scores that are not
// recorded.
static final Scorer unavailableScorer=new Scorer(null){
private final String MSG="A limitation of the "+SubAggCollectionMode.DEPTH_FIRST.parseField.getPreferredName()+
Contributor

s/DEPTH_FIRST/BREADTH_FIRST/ ?

@markharwood
Contributor Author

Another "deferred" use case to consider? https://groups.google.com/forum/#!topic/elasticsearch/CtDhs0HDK2Q

jpountz commented Jun 4, 2014

I don't think it could help: building buckets based on counts is not practical as you would need the global counts to make a decision while a shard would only have shard-local knowledge.

// A scorer used for the deferred collection mode to handle any child aggs asking for scores that are not
// recorded.
static final Scorer unavailableScorer=new Scorer(null){
private final String MSG="A limitation of the "+SubAggCollectionMode.BREADTH_FIRST.parseField.getPreferredName()+
Contributor

Can you add spaces around '=' and '+'?


jpountz commented Jun 4, 2014

LGTM, I just left comments about formatting. Can you fix these before pushing?


jpountz commented Jun 6, 2014

LGTM

markharwood added a commit that referenced this pull request Jun 6, 2014
…regator class to support a new mode of deferred collection.

A new "breadth_first" results collection mode allows upper branches of the aggregation tree to be calculated and then pruned
to a smaller selection before advancing into executing collection on child branches.

Closes #6128
@s1monw s1monw removed the review label Jun 18, 2014
@clintongormley clintongormley changed the title Deferred aggregation Aggregations: Deferred aggregations prevent combinatorial explosion Jul 16, 2014
@clintongormley clintongormley changed the title Aggregations: Deferred aggregations prevent combinatorial explosion Deferred aggregations prevent combinatorial explosion Jun 6, 2015