Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sampler aggregation #10221

Closed
wants to merge 3 commits into from
Closed

Sampler aggregation #10221

wants to merge 3 commits into from

Conversation

markharwood
Copy link
Contributor

Used to limit any nested aggregations' processing to a sample of the top-scoring documents.

Optionally, a “diversify” setting can limit the number of collected matches that share a common value such as an "author".

The original "DeferringBucketCollector" is now abstracted with the bulk of the original code in new subclass BestBucketsDeferringCollector and the new alternative policy for deferring is implemented in the BestDocsDeferringCollector subclass.

The diversifying logic is reliant on Lucene 5.1 which has changes to support this specialized form of result collection.

Closes #8108

import java.util.ArrayList;
import java.util.List;

public class BestBucketsDeferringCollector extends DeferringBucketCollector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have multiple implementations of DeferringBucketCollector, could we have a class-level Javadoc on the implementation to describe briefly what each one is trying to achieve?

[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation

A filtering aggregation used to limit any nested aggregations' processing to a sample of the top-scoring documents.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we say sub-aggregation here so we don't overload 'nested aggregation' which could cause confusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@markharwood
Copy link
Contributor Author

@colings86 rebased on latest master if you get a chance to review

@@ -0,0 +1,160 @@
[[search-aggregations-bucket-sampler-aggregation]]
=== Sampler Aggregation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it already is, but can't see it if so: we should mark this feature as experimental in the docs

@colings86
Copy link
Contributor

@markharwood Left some comments

@markharwood
Copy link
Contributor Author

@colings86 Thanks for the review. I added a couple of comments above on execution_hint test coverage and updated the code based on your other comments.

@markharwood
Copy link
Contributor Author

@jpountz @clintongormley This PR allows users to do analytics on a sample where you can also choose to diversify results on the basis of a particular field (e.g. analyse top X tweets but no more than Y tweets from a single Twitter account on each shard).

The question is what is the least-worst thing to do on each shard given the unmapped problem ie the choice of diversifying field doesn't exist on one of the indexes/shards being queried:

  1. Throw an error
  2. Return no results (because we can't guarantee diversification)
  3. Return top results but without applying any of the diversity constraints

…ns' processing to a sample of the top-scoring documents.

Optionally, a “diversify” setting can limit the number of collected matches that share a common value such as an "author".

Closes #8108
…SamplerAggregator, added nestedSamples test.
@markharwood
Copy link
Contributor Author

Took a decision with Colin on the 2 remaining questions:

  1. execution_hint follows the precedent set in terms agg - never errors e.g. when used in relation to a numeric field in which case is ignored
  2. Unmapped choices of diversifying field will return no results rather than a sample of undiversified results

@markharwood
Copy link
Contributor Author

Poke @colings86

@colings86
Copy link
Contributor

LGTM

@markharwood markharwood added :Analytics/Aggregations Aggregations and removed review labels Apr 21, 2015
@markharwood
Copy link
Contributor Author

Pushed to master 63db34f

@clintongormley clintongormley changed the title New feature - Sampler aggregation Sampler aggregation Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sampling aggregation to filter down to top-scoring docs
2 participants