Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added an option to show the upper bound of the error for the terms aggregation #6778

Closed
wants to merge 1 commit into from
Closed

Added an option to show the upper bound of the error for the terms aggregation #6778

wants to merge 1 commit into from

Conversation

colings86
Copy link
Contributor

...he terms aggregation.

This is only applicable when the order is set to _count. The upper bound of the error in the doc count is calculated by summing the doc count of the last term on each shard which did not return the term. The implementation calculates the error by summing the doc count for the last term on each shard for which the term IS returned and then subtracts this value from the sum of the doc counts for the last term from ALL shards.

Closes #6696

@@ -70,6 +70,9 @@ NOTE: `shard_size` cannot be smaller than `size` (as it doesn't make much sens
added[1.1.0] It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this
on high-cardinality fields as this will kill both your CPU since terms need to be return sorted, and your network.

coming[1.3.0] The `show_doc_count_error` parameter can be set to true when sorting by `doc_count` to show the upper bound
of the error in the document count for each term. This can be useful for deciding on an appropriate `shard_size` value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to have an option for it instead of returning it all the time? It doesn't seem to add much overhead to terms aggs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry, I just realized that you are returning the error per bucket and not only for the whole agg! Maybe we should also have the max error for the whole aggregation, and this one would not be optional? I think this would help raise awareness that this aggregation in not accurate all the time?

@jpountz
Copy link
Contributor

jpountz commented Jul 11, 2014

I just played with it and I think this is an interesting feature to raise awareness about the accuracy issues of the terms aggregation and although as a way to test the impact of the shard_size parameter. The per-term error is interesting, but I think the global error that you added is also interesting because it also gives information about terms that didn't make it to the top terms.

To move forward, I think it would be nice to have it on all sort orders (potentially by using a special value of eg. -1 when the maximum error cannot be estimated or would be so large that it would not be really useful).

@jpountz
Copy link
Contributor

jpountz commented Jul 11, 2014

On a side note, if it goes into release X, I think we should try to have another change in the same release that would change the default value of shard_size.

@@ -53,6 +56,10 @@ public long getDocCount() {
return docCount;
}

public long getDocCountError() {
return docCountError;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it throw an exception when show_term_doc_count is false?

@jpountz jpountz removed the review label Jul 11, 2014
@colings86
Copy link
Contributor Author

I largely agree although during my testing of this feature I have had quite a few situations where the error for the whole aggregation has been quite big relative to the doc count for the last returned term (e.g error of 3600 with a doc count for the last returned term of 5400) but the error on all of the terms was 0. This seems confusing for a user? Although maybe this just highlights the importance of clearly explaining the way the error is calculated and what it means?

Agree regarding your suggestions for moving forward and the issue around default shard size

this.order = InternalOrder.Streams.readOrder(in);
this.formatter = ValueFormatterStreams.readOptional(in);
this.requiredSize = readSize(in);
this.shardSize = in.readInt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be protected by a if (in.getVersion().onOrAfter(Version.V_1_3_0)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe it should use the same readSize method as requiredSize?

@jpountz jpountz removed the review label Jul 16, 2014
enough to put Product C into the top 5 list for that shard. Product Z was also returned only by 2 shards but the third shard does not contain the
term. There is no way of knowing, at the point of combining the results to produce the final list of terms, that there is an error in the
document count for Product C and not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final
list of terms because it did not make it into the top five terms on any of the shards.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That section is really great!

@jpountz
Copy link
Contributor

jpountz commented Jul 24, 2014

I think long is ok if we can avoid serializing the errors when show_term_doc_count_error is false.

I left some comments, but I think it's close!

collectExistingBucket(doc, bucketOrdinal);
} else {
collectBucket(doc, bucketOrdinal);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation issue

@jpountz
Copy link
Contributor

jpountz commented Jul 25, 2014

LGTM but can you fix indentation issues and trailing spaces that have been added on some lines before pushing?

@jpountz jpountz removed the review label Jul 25, 2014
…r the terms aggregation.

This is only applicable when the order is set to _count.  The upper bound of the error in the doc count is calculated by summing the doc count of the last term on each shard which did not return the term.  The implementation calculates the error by summing the doc count for the last term on each shard for which the term IS returned and then subtracts this value from the sum of the doc counts for the last term from ALL shards.

Closes #6696
@colings86
Copy link
Contributor Author

Merged into 1.x and master

@colings86 colings86 closed this Jul 25, 2014
@colings86 colings86 self-assigned this Aug 21, 2014
@clintongormley clintongormley changed the title Aggregations: Added an option to show the upper bound of the error for t... Aggregations: Added an option to show the upper bound of the error for the terms aggregation Sep 11, 2014
@clintongormley clintongormley changed the title Aggregations: Added an option to show the upper bound of the error for the terms aggregation Added an option to show the upper bound of the error for the terms aggregation Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggregations: Return an upper bound of the maximum error for terms
3 participants