
Aggregations: new “Sampler” provides a filter for top-scoring docs #8191

Status: Closed (wants to merge 1 commit)
Conversation

markharwood (Contributor):

Used to limit processing in child aggregations.

The existing Aggregator base class support for deferring computation of child aggs is modified (DeferringBucketCollector is refactored to be abstract and the logic it had for “best bucket” trimming is pushed down into a new BestBucketsDeferringCollector while the new logic for trimming based on doc score quality is in subclass BestDocsDeferringCollector).
Closes #8108

@markharwood (Contributor, Author):

@jpountz Would appreciate you taking a look if you have time

import org.elasticsearch.search.aggregations.bucket.SingleBucketAggregation;

/**
* A {@code filter} aggregation. Defines a single bucket that holds all documents that match a specific filter.
Review comment (Contributor):

Copy-paste left-over?

@jpountz (Contributor) commented Oct 26, 2014:

I'm wondering how we could allow for other modes of sampling in the future. For instance this mode is focused on quality and replays the top documents to the sub aggregators but we could also imagine having a sampling aggregation that would only forward every N-th document to the sub aggs (for speed reasons as it would certainly not help quality). So maybe we could rename it to something like top_docs_sampler or make the configuration allow for different modes of sampling in the future?

The relative_to_max_score option embarrasses me a bit since it is generally considered a bad practice to compare absolute values of scores?

@markharwood (Contributor, Author):

we could also imagine having a sampling aggregation that would only forward every N-th document to the sub aggs (for speed reasons as it would certainly not help quality)

If the sampling is being done for reasons of speed rather than quality then you'll want to cap the number of docs sampled with some accuracy. Taking every N-th doc is probably not the best way to achieve this as we typically don't know how many docs will match a query. It's probably better to take a random sample of a fixed size - this can be achieved by using the existing priority queue impl with a fixed size and populating it with matches that have random scores.
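The fixed-size priority queue with random scores described above is equivalent to reservoir sampling, and can be sketched as a small standalone class. This is an invented illustration (class and method names are hypothetical), not the PR's actual collector code:

```java
import java.util.PriorityQueue;

// Hypothetical sketch: keep a uniform random sample of fixed size by
// assigning every matching doc a random priority and retaining only the
// top `sampleSize` entries in a min-heap keyed on that random value.
public class RandomSampleCollector {
    private final int sampleSize;
    private final java.util.Random random = new java.util.Random(42); // seeded for reproducibility
    // min-heap ordered by random key; the entry with the smallest key is evicted first
    private final PriorityQueue<long[]> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));

    public RandomSampleCollector(int sampleSize) {
        this.sampleSize = sampleSize;
    }

    public void collect(int docId) {
        long randomKey = random.nextLong();
        if (queue.size() < sampleSize) {
            queue.add(new long[] { randomKey, docId });
        } else if (queue.peek()[0] < randomKey) {
            // evict the current weakest entry and keep this doc instead
            queue.poll();
            queue.add(new long[] { randomKey, docId });
        }
    }

    public int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        RandomSampleCollector sampler = new RandomSampleCollector(100);
        for (int doc = 0; doc < 10_000; doc++) {
            sampler.collect(doc);
        }
        // The sample never exceeds the configured size, however many docs match.
        System.out.println(sampler.size());
    }
}
```

Because the heap is ordered by the random key rather than relevance, every matching doc has an equal chance of surviving, and memory stays bounded by the sample size regardless of how many docs the query hits.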

The relative_to_max_score option embarrasses me a bit since it is generally considered a bad practice to compare absolute values of scores?

The only alternative means of applying a quality-based filter is to tighten up the query using min_should_match or similar, and this is not always easy to control or for users to understand.
Imagine a product-recommendation scenario where the search criteria are a list of movies you like and you are summarising matches on "user-profile" docs, each of which contains a user's viewed movies.
You would hit a lot of docs (potentially every one!) but would like to focus on the users with similar interests and see which of their movie choices are "uncommonly common" using a significant_terms agg.
If we filter based purely on score we get the following useful ranking factors:

  1. IDF means that we will focus on the users who share your interests in obscure movies rather than the movies everyone in the world tends to like (e.g. Star Wars or The Shawshank Redemption)
  2. coord means we will focus on users who match more of the movies you like
  3. norms means we will avoid users who list a huge number of movies

The large deviation in scores produced by these factors could provide a useful way to separate the strongly and weakly matched users.
If we don't filter based on score then we would have to find a way to "tighten up" the query to have a hard cut-off which trims the long-tail of weak matches. Criteria settings like min_should_match can be used to do this but they don't address all 3 of the ranking concerns above and setting the wrong value could mean you get no results at all.
There isn't a great answer here but I think the score-based filter is the least-worst option.

@markharwood (Contributor, Author):

For the record, scoring (as in use of IDF, norms and coord) seems impossible if your content is numeric, as in my movie-lists recommendation example above. The fact that the field type is an integer means match and terms queries always assume a ConstantScoreQuery is to be used. MLT doesn't work with numerics. To work around this I had to re-index my numeric values as strings.

Aggregations: new "Sampler" provides a filter for top-scoring docs used in child aggregations.

The existing Aggregator base class support for deferring computation of child aggs is modified (DeferringBucketCollector is refactored to be abstract and the logic it had for “best bucket” trimming is pushed down into a new BestBucketsDeferringCollector while the new logic for trimming based on score quality is in subclass BestDocsDeferringCollector).
As well as allowing a sample of top-scoring docs the option of a random sample is also provided.
Closes #8108
@markharwood (Contributor, Author):

Rebased on latest master. Removed option of "relative_to_max_score", added suggested option of random sampling.

shard_size:: Optional. The maximum number of documents sampled on each shard.
Defaults to 100

random_sample:: Optional. Set to true to take a random sample of documents rather than top-scoring documents.
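For illustration, a request combining these options with a child significant_terms aggregation might look like the following. This is a sketch of the syntax under discussion at this point in the PR, not the final DSL that shipped:

```json
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200,
        "random_sample": true
      },
      "aggs": {
        "keywords": {
          "significant_terms": { "field": "text" }
        }
      }
    }
  }
}
```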
Review comment (Contributor):

What would you think about something like sample: top|random instead?

Review comment (Contributor):

Or maybe a better way to expose such a functionality would be to remove this parameter and allow this aggregation to take a query as a parameter, and then users interested in random sampling could use a random function score? (which would also automatically add support for reproducibility through seeds, etc.)

Review comment (Contributor):

I know I was the one who suggested random sampling but it was really more to start a discussion than a recommendation, so if it ends up making everything more complicated I'd be happy to not have it in this PR :)

Review comment (Contributor, Author):

@jpountz Thanks for the review. I think adding the random sampling was beneficial if only to refactor deferring collector base class to allow for other forms of determining "top" collections.

I have a potentially more useful feature in mind to add: a "diversity filter" which ensures the selected sample is free of duplicate sources of information, e.g. you could de-dupe on the "original_author" field of an index full of tweets or the "email_thread_id" of an index of emails. The reason for using this would be to avoid a single voice dominating any analysis. The implementation would have to be a bit smart and select the best-scoring doc for each key in order for the key's doc to be competitive. This may be an interesting challenge to implement efficiently.
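The "best-scoring doc per key" selection described above could be sketched as follows. This is a hypothetical illustration with invented names, not the eventual implementation, and it ignores the extra work needed to keep each key's doc competitive inside a bounded sample:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "diversity filter" idea: for each de-dupe key
// (e.g. an author field), retain only the single best-scoring doc, so no one
// source can dominate the sample handed to child aggregations.
public class DiversityFilter {
    // key -> {score, docId} of the best doc seen so far for that key
    private final Map<String, double[]> bestPerKey = new HashMap<>();

    public void collect(String key, int docId, double score) {
        double[] best = bestPerKey.get(key);
        if (best == null || score > best[0]) {
            bestPerKey.put(key, new double[] { score, docId });
        }
    }

    public int distinctKeys() {
        return bestPerKey.size();
    }

    public static void main(String[] args) {
        DiversityFilter filter = new DiversityFilter();
        filter.collect("alice", 1, 0.9);
        filter.collect("alice", 2, 1.5); // replaces doc 1: same author, higher score
        filter.collect("bob", 3, 0.4);
        // One surviving doc per author key
        System.out.println(filter.distinctKeys());
    }
}
```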

I'm not sure if this "diversity filter" should exist as part of the sampling agg or could provide a useful feature as a stand-alone agg. An example e-commerce requirement might be to query a product database for the best matches but at most bring back 3 products from any single manufacturer. You can get some of the way there using existing aggs (see this gist), but it has issues and I expect pagination will always be a problem with this sort of logic. It's probably a mistake generally to mix pagination and aggs.

I'll save the "diversity" logic for another PR but it is worth considering how we design the Sampler DSL if we hope to add this sort of extension in future.

@jpountz (Contributor) commented Nov 9, 2014:

@markharwood Thanks for your explanations, this aggregation is making more and more sense to me. I left some comments about the configuration of this new agg.

@markharwood (Contributor, Author):

The "diversity" feature I mentioned in comments above is dependent on this proposed Lucene feature: https://issues.apache.org/jira/browse/LUCENE-6066

@markharwood (Contributor, Author):

Added thought - as a result of poking around the time-checking features involved in #9156 it looks like it should also be possible to constrain a sample by max processing time rather than volume of documents as suggested here. SearchContext includes a timer that can be used to check elapsed time efficiently.
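Time-bounded collection could be sketched like this. It is a hypothetical standalone illustration using System.nanoTime() directly; the actual implementation would use the SearchContext timer mentioned above to check elapsed time cheaply:

```java
// Hypothetical sketch: stop collecting once an elapsed-time budget is
// exhausted, instead of (or in addition to) capping the number of docs.
public class TimeBoundedCollector {
    private final long deadlineNanos;
    private int collected = 0;

    public TimeBoundedCollector(long budgetMillis) {
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    // Returns false once the time budget is spent; caller stops collecting.
    public boolean collect(int docId) {
        if (System.nanoTime() > deadlineNanos) {
            return false;
        }
        collected++;
        return true;
    }

    public int collected() {
        return collected;
    }

    public static void main(String[] args) {
        TimeBoundedCollector collector = new TimeBoundedCollector(50);
        int doc = 0;
        while (collector.collect(doc) && doc < 1_000_000) {
            doc++;
        }
        // Collection halted either by the 50ms budget or the demo's doc cap.
        System.out.println(collector.collected() <= 1_000_001);
    }
}
```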

@clintongormley:

@markharwood Now that #10221 is in, we can close this PR, no?

@markharwood (Contributor, Author):

Pushed to master 63db34f


Linked issue: Sampling aggregation to filter down to top-scoring docs
5 participants