
Aggregations: new “Sampler” provides a filter for top-scoring docs #8191

Status: Closed (wants to merge 1 commit)
Conversation

markharwood (Contributor):

Used to limit processing in child aggregations.

The existing Aggregator base class support for deferring computation of child aggs is modified (DeferringBucketCollector is refactored to be abstract and the logic it had for “best bucket” trimming is pushed down into a new BestBucketsDeferringCollector while the new logic for trimming based on doc score quality is in subclass BestDocsDeferringCollector).
Closes #8108

@markharwood (Contributor, Author):

@jpountz Would appreciate you taking a look if you have time

import org.elasticsearch.search.aggregations.bucket.SingleBucketAggregation;

/**
* A {@code filter} aggregation. Defines a single bucket that holds all documents that match a specific filter.
Review comment (Contributor):

Copy-paste left-over?

@jpountz (Contributor) commented Oct 26, 2014:

I'm wondering how we could allow for other modes of sampling in the future. For instance this mode is focused on quality and replays the top documents to the sub aggregators but we could also imagine having a sampling aggregation that would only forward every N-th document to the sub aggs (for speed reasons as it would certainly not help quality). So maybe we could rename it to something like top_docs_sampler or make the configuration allow for different modes of sampling in the future?

The relative_to_max_score option embarrasses me a bit since it is generally considered a bad practice to compare absolute values of scores?

@markharwood (Contributor, Author):

we could also imagine having a sampling aggregation that would only forward every N-th document to the sub aggs (for speed reasons as it would certainly not help quality)

If the sampling is being done for reasons of speed rather than quality then you'll want to cap the number of docs sampled with some accuracy. Taking every N-th doc is probably not the best way to achieve this as we typically don't know how many docs will match a query. It's probably better to take a random sample of a fixed size - this can be achieved by using the existing priority queue impl with a fixed size and populating it with matches that have random scores.
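The fixed-size priority queue with random scores described above is equivalent to reservoir sampling, and can be sketched as a small standalone class. This is an invented illustration (class and method names are hypothetical), not the PR's actual collector code:

```java
import java.util.PriorityQueue;

// Hypothetical sketch: keep a uniform random sample of fixed size by
// assigning every matching doc a random priority and retaining only the
// top `sampleSize` entries in a min-heap keyed on that random value.
public class RandomSampleCollector {
    private final int sampleSize;
    private final java.util.Random random = new java.util.Random(42); // seeded for reproducibility
    // min-heap ordered by random key; the entry with the smallest key is evicted first
    private final PriorityQueue<long[]> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));

    public RandomSampleCollector(int sampleSize) {
        this.sampleSize = sampleSize;
    }

    public void collect(int docId) {
        long randomKey = random.nextLong();
        if (queue.size() < sampleSize) {
            queue.add(new long[] { randomKey, docId });
        } else if (queue.peek()[0] < randomKey) {
            // evict the current weakest entry and keep this doc instead
            queue.poll();
            queue.add(new long[] { randomKey, docId });
        }
    }

    public int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        RandomSampleCollector sampler = new RandomSampleCollector(100);
        for (int doc = 0; doc < 10_000; doc++) {
            sampler.collect(doc);
        }
        // The sample never exceeds the configured size, however many docs match.
        System.out.println(sampler.size());
    }
}
```

Because the heap is ordered by the random key rather than relevance, every matching doc has an equal chance of surviving, and memory stays bounded by the sample size regardless of how many docs the query hits.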

The relative_to_max_score option embarrasses me a bit since it is generally considered a bad practice to compare absolute values of scores?

The only alternative means of applying a quality-based filter is to tighten up the query using min_should_match or similar, and this is not always easy to control or for users to understand.
Imagine a product-recommendation scenario where the search criteria are a list of movies you like and you are summarising matches on "user-profile" docs, each of which contains a user's viewed movies.
You would hit a lot of docs (potentially every one!) but would like to focus on the users with similar interests and see which of their movie choices are "uncommonly common" using a significant_terms agg.
If we filter based purely on score we get the following useful ranking factors:

  1. IDF means that we will focus on the users who share your interests in obscure movies rather than the movies everyone in the world tends to like (e.g. Star Wars or The Shawshank Redemption)
  2. coord means we will focus on users who match more of the movies you like
  3. norms means we will avoid users who list a huge number of movies

The large deviation in scores produced by these factors could provide a useful way to separate the strongly and weakly matched users.
If we don't filter based on score then we would have to find a way to "tighten up" the query to have a hard cut-off which trims the long-tail of weak matches. Criteria settings like min_should_match can be used to do this but they don't address all 3 of the ranking concerns above and setting the wrong value could mean you get no results at all.
There isn't a great answer here but I think the score-based filter is the least-worst option.

@markharwood (Contributor, Author):

For the record, scoring (as in use of IDF, norms and coord) seems impossible if your content is numeric, as in my movie-lists recommendation example above. The fact that the field type is an integer means match and terms queries always assume a ConstantScoreQuery is to be used. MLT doesn't work with numerics. To work around this I had to re-index my numeric values as strings.

Aggregations: new "Sampler" provides a filter for top-scoring docs used in child aggregations.

The existing Aggregator base class support for deferring computation of child aggs is modified (DeferringBucketCollector is refactored to be abstract and the logic it had for “best bucket” trimming is pushed down into a new BestBucketsDeferringCollector while the new logic for trimming based on score quality is in subclass BestDocsDeferringCollector).
As well as allowing a sample of top-scoring docs the option of a random sample is also provided.
Closes #8108
@markharwood (Contributor, Author):

Rebased on latest master. Removed option of "relative_to_max_score", added suggested option of random sampling.

shard_size:: Optional. The maximum number of documents sampled on each shard.
Defaults to 100

random_sample:: Optional. Set to true to take a random sample of documents rather than top-scoring documents.
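For illustration, a request combining these options with a child significant_terms aggregation might look like the following. This is a sketch of the syntax under discussion at this point in the PR, not the final DSL that shipped:

```json
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200,
        "random_sample": true
      },
      "aggs": {
        "keywords": {
          "significant_terms": { "field": "text" }
        }
      }
    }
  }
}
```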
Review comment (Contributor):

What would you think about something like sample: top|random instead?

Review comment (Contributor):

Or maybe a better way to expose such a functionality would be to remove this parameter and allow this aggregation to take a query as a parameter, and then users interested in random sampling could use a random function score? (which would also automatically add support for reproducibility through seeds, etc.)

Review comment (Contributor):

I know I was the one who suggested random sampling but it was really more to start a discussion than a recommendation, so if it ends up making everything more complicated I'd be happy to not have it in this PR :)

Review comment (Contributor, Author):

@jpountz Thanks for the review. I think adding the random sampling was beneficial if only to refactor deferring collector base class to allow for other forms of determining "top" collections.

I have a potentially more useful feature in mind to add: a "diversity filter" which ensures the selected sample is free of duplicate sources of information, e.g. you could de-dupe on the "original_author" field of an index full of tweets or the "email_thread_id" of an index of emails. The reason for using this would be to avoid a single voice dominating any analysis. The implementation would have to be a bit smart and select the best-scoring doc for each key in order for the key's doc to be competitive. This may be an interesting challenge to implement efficiently.
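The "best-scoring doc per key" selection described above could be sketched as follows. This is a hypothetical illustration with invented names, not the eventual implementation, and it ignores the extra work needed to keep each key's doc competitive inside a bounded sample:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "diversity filter" idea: for each de-dupe key
// (e.g. an author field), retain only the single best-scoring doc, so no one
// source can dominate the sample handed to child aggregations.
public class DiversityFilter {
    // key -> {score, docId} of the best doc seen so far for that key
    private final Map<String, double[]> bestPerKey = new HashMap<>();

    public void collect(String key, int docId, double score) {
        double[] best = bestPerKey.get(key);
        if (best == null || score > best[0]) {
            bestPerKey.put(key, new double[] { score, docId });
        }
    }

    public int distinctKeys() {
        return bestPerKey.size();
    }

    public static void main(String[] args) {
        DiversityFilter filter = new DiversityFilter();
        filter.collect("alice", 1, 0.9);
        filter.collect("alice", 2, 1.5); // replaces doc 1: same author, higher score
        filter.collect("bob", 3, 0.4);
        // One surviving doc per author key
        System.out.println(filter.distinctKeys());
    }
}
```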

I'm not sure if this "diversity filter" should exist as part of the sampling agg or could provide a useful feature as a stand-alone agg. An example e-commerce requirement might be to query a product database for the best matches but at most bring back 3 products from any single manufacturer. You can get some of the way there using existing aggs (see this gist), but it has issues and I expect pagination will always be a problem with this sort of logic. It's probably a mistake generally to mix pagination and aggs.

I'll save the "diversity" logic for another PR but it is worth considering how we design the Sampler DSL if we hope to add this sort of extension in future.

@jpountz (Contributor) commented Nov 9, 2014:

@markharwood Thanks for your explanations, this aggregation is making more and more sense to me. I left some comments about the configuration of this new agg.

@markharwood (Contributor, Author):

The "diversity" feature I mentioned in comments above is dependent on this proposed Lucene feature: https://issues.apache.org/jira/browse/LUCENE-6066

@markharwood (Contributor, Author):

Added thought - as a result of poking around the time-checking features involved in #9156 it looks like it should also be possible to constrain a sample by max processing time rather than volume of documents as suggested here. SearchContext includes a timer that can be used to check elapsed time efficiently.
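Time-bounded collection could be sketched like this. It is a hypothetical standalone illustration using System.nanoTime() directly; the actual implementation would use the SearchContext timer mentioned above to check elapsed time cheaply:

```java
// Hypothetical sketch: stop collecting once an elapsed-time budget is
// exhausted, instead of (or in addition to) capping the number of docs.
public class TimeBoundedCollector {
    private final long deadlineNanos;
    private int collected = 0;

    public TimeBoundedCollector(long budgetMillis) {
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    // Returns false once the time budget is spent; caller stops collecting.
    public boolean collect(int docId) {
        if (System.nanoTime() > deadlineNanos) {
            return false;
        }
        collected++;
        return true;
    }

    public int collected() {
        return collected;
    }

    public static void main(String[] args) {
        TimeBoundedCollector collector = new TimeBoundedCollector(50);
        int doc = 0;
        while (collector.collect(doc) && doc < 1_000_000) {
            doc++;
        }
        // Collection halted either by the 50ms budget or the demo's doc cap.
        System.out.println(collector.collected() <= 1_000_001);
    }
}
```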

@clintongormley:

@markharwood Now that #10221 is in, we can close this PR, no?

@markharwood (Contributor, Author):

Pushed to master 63db34f


Linked issue: Sampling aggregation to filter down to top-scoring docs
5 participants