Added an option to show the upper bound of the error for the terms aggregation #6778

colings86 · 2014-07-08T10:47:19Z

...he terms aggregation.

This is only applicable when the order is set to _count. The upper bound of the error in the doc count for a term is the sum of the doc counts of the last term on each shard that did not return the term. The implementation computes this by summing the doc count of the last term on each shard for which the term IS returned, and then subtracting this value from the sum of the doc counts of the last term from ALL shards.
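
The calculation described above can be sketched in a few lines of Java. This is a minimal illustration and not the code from this PR: the class `DocCountErrorSketch`, the method `errorUpperBound`, and the array-based inputs are hypothetical stand-ins for the per-shard results seen when shard responses are combined.

```java
/** Hypothetical sketch of the per-term error bound; not the PR's actual reduce code. */
final class DocCountErrorSketch {

    /**
     * @param lastTermDocCounts doc count of the last (smallest) term returned by each shard
     * @param shardReturnedTerm whether each shard included the term in its response
     * @return upper bound of the doc count error for the term
     */
    static long errorUpperBound(long[] lastTermDocCounts, boolean[] shardReturnedTerm) {
        long sumAllShards = 0;        // last-term doc counts over ALL shards
        long sumReturningShards = 0;  // last-term doc counts over shards that returned the term
        for (int i = 0; i < lastTermDocCounts.length; i++) {
            sumAllShards += lastTermDocCounts[i];
            if (shardReturnedTerm[i]) {
                sumReturningShards += lastTermDocCounts[i];
            }
        }
        // What remains is the contribution of the shards that did NOT return the term.
        return sumAllShards - sumReturningShards;
    }
}
```

For example, with last-term doc counts of 10, 5 and 8 on three shards and a term missing only from the second shard, the bound is 5: when sorting by descending count, the second shard could have held at most 5 documents for the term without returning it.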

Closes #6696

jpountz · 2014-07-10T12:40:11Z

docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc

@@ -70,6 +70,9 @@ NOTE:   `shard_size` cannot be smaller than `size` (as it doesn't make much sens
 added[1.1.0] It is possible to not limit the number of terms that are returned by setting `size` to `0`. Don't use this
 on high-cardinality fields as this will kill both your CPU since terms need to be return sorted, and your network.

+coming[1.3.0] The `show_doc_count_error` parameter can be set to true when sorting by `doc_count` to show the upper bound 
+of the error in the document count for each term.  This can be useful for deciding on an appropriate `shard_size` value.


Any reason to have an option for it instead of returning it all the time? It doesn't seem to add much overhead to terms aggs?

Oops sorry, I just realized that you are returning the error per bucket and not only for the whole agg! Maybe we should also have the max error for the whole aggregation, and this one would not be optional? I think this would help raise awareness that this aggregation is not accurate all the time?

jpountz · 2014-07-11T12:53:00Z

I just played with it and I think this is an interesting feature, both to raise awareness about the accuracy issues of the terms aggregation and as a way to test the impact of the shard_size parameter. The per-term error is interesting, but I think the global error that you added is also interesting because it gives information about terms that didn't make it into the top terms.

To move forward, I think it would be nice to have it on all sort orders (potentially by using a special value, e.g. -1, when the maximum error cannot be estimated or would be so large that it would not be really useful).
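
One way to read the suggestion above is sketched here. It is only an illustration under assumptions: the constant `UNKNOWN_ERROR`, the class `DocCountErrorSentinelSketch`, and the merge rule (any unknown shard error makes the merged error unknown) are hypothetical and are not taken from the PR.

```java
/** Hypothetical sketch of a sentinel value for "error cannot be estimated". */
final class DocCountErrorSentinelSketch {

    /** Special value reported when no meaningful upper bound exists. */
    static final long UNKNOWN_ERROR = -1L;

    /** A shard can only bound the error when its terms are sorted by descending doc count. */
    static long shardError(boolean sortedByCountDescending, long lastTermDocCount) {
        return sortedByCountDescending ? lastTermDocCount : UNKNOWN_ERROR;
    }

    /** Sums shard errors; a single unknown shard error makes the total unknown. */
    static long mergeErrors(long[] shardErrors) {
        long total = 0;
        for (long error : shardErrors) {
            if (error == UNKNOWN_ERROR) {
                return UNKNOWN_ERROR;
            }
            total += error;
        }
        return total;
    }
}
```

Keeping the sentinel negative means it can never be confused with a real bound, since doc counts are non-negative.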

jpountz · 2014-07-11T12:54:08Z

On a side note, if it goes into release X, I think we should try to have another change in the same release that would change the default value of shard_size.

jpountz · 2014-07-11T12:55:18Z

src/main/java/org/elasticsearch/search/aggregations/bucket/terms/InternalTerms.java

@@ -53,6 +56,10 @@ public long getDocCount() {
            return docCount;
        }

+        public long getDocCountError() {
+            return docCountError;


should it throw an exception when show_term_doc_count is false?

colings86 · 2014-07-11T13:03:18Z

I largely agree, although during my testing of this feature I have had quite a few situations where the error for the whole aggregation was quite big relative to the doc count of the last returned term (e.g. an error of 3600 with a doc count of 5400 for the last returned term) while the error on all of the terms was 0. This seems confusing for a user? Although maybe this just highlights the importance of clearly explaining how the error is calculated and what it means?

Agree regarding your suggestions for moving forward and the issue around the default shard size.
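
The situation described in this comment can be reproduced with a small sketch. The shard figures used in the note below are made up for illustration, and the class and method names are hypothetical. The sketch assumes that the aggregation-wide bound is the sum of the last-term doc counts from all shards, and that the per-term bound only counts shards that did not return the term.

```java
/** Hypothetical sketch contrasting the aggregation-wide bound with a per-term bound. */
final class GlobalVersusTermErrorSketch {

    /** Bound for terms that no shard returned: every shard may be hiding up to its last-term count. */
    static long globalError(long[] lastTermDocCounts) {
        long total = 0;
        for (long count : lastTermDocCounts) {
            total += count;
        }
        return total;
    }

    /** Bound for one returned term: only shards that omitted the term can contribute. */
    static long termError(long[] lastTermDocCounts, boolean[] shardReturnedTerm) {
        long total = 0;
        for (int i = 0; i < lastTermDocCounts.length; i++) {
            if (!shardReturnedTerm[i]) {
                total += lastTermDocCounts[i];
            }
        }
        return total;
    }
}
```

With three shards whose last returned term has 1200 documents each, `globalError` is 3600, while a term returned by all three shards has a `termError` of 0. Both numbers are consistent: the first bounds terms that were never returned, the second bounds a term whose count is already complete.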

jpountz · 2014-07-16T09:33:52Z

src/main/java/org/elasticsearch/search/aggregations/bucket/terms/DoubleTerms.java

        this.order = InternalOrder.Streams.readOrder(in);
        this.formatter = ValueFormatterStreams.readOptional(in);
        this.requiredSize = readSize(in);
+        this.shardSize = in.readInt();


This needs to be protected by an if (in.getVersion().onOrAfter(Version.V_1_3_0)) {

and maybe it should use the same readSize method as requiredSize?
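
The version guard requested here can be illustrated without the Elasticsearch stream classes. The sketch below is an assumption-laden stand-in: it uses `java.io` data streams instead of `StreamInput`, plain integers instead of `Version`, and it falls back to `requiredSize` when the sender is too old to send `shardSize`, which is a guess rather than what the PR does. It does not model the `readSize` helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of version-guarded (de)serialization; not Elasticsearch code. */
final class VersionGuardedReadSketch {

    // Hypothetical numeric ids standing in for an older release and Version.V_1_3_0.
    static final int V_1_2_0 = 10200;
    static final int V_1_3_0 = 10300;

    static void write(DataOutputStream out, int streamVersion, int requiredSize, int shardSize)
            throws IOException {
        out.writeInt(requiredSize);
        if (streamVersion >= V_1_3_0) {
            out.writeInt(shardSize); // only nodes on 1.3.0 or later understand this field
        }
    }

    static int[] read(DataInputStream in, int streamVersion) throws IOException {
        int requiredSize = in.readInt();
        // Assumption: reuse requiredSize when the field is absent from the stream.
        int shardSize = streamVersion >= V_1_3_0 ? in.readInt() : requiredSize;
        return new int[] {requiredSize, shardSize};
    }

    /** Writes and reads back the two sizes at the given stream version. */
    static int[] roundTrip(int streamVersion, int requiredSize, int shardSize) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            write(new DataOutputStream(bytes), streamVersion, requiredSize, shardSize);
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), streamVersion);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader and the writer must apply the same guard; an unguarded `readInt()` against an older stream would consume bytes that belong to the next field.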

jpountz · 2014-07-24T07:28:31Z

docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc

+enough to put Product C into the top 5 list for that shard. Product Z was also returned only by 2 shards but the third shard does not contain the 
+term. There is no way of knowing, at the point of combining the results to produce the final list of terms, that there is an error in the 
+document count for Product C and not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final 
+list of terms because it did not make it into the top five terms on any of the shards.


That section is really great!

jpountz · 2014-07-24T07:49:58Z

I think long is ok if we can avoid serializing the errors when show_term_doc_count_error is false.

I left some comments, but I think it's close!
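
The idea of not serializing the errors when show_term_doc_count_error is false can be sketched as follows. This is an illustration only: the wire layout, the class `OptionalErrorSerializationSketch`, and the `-1` placeholder returned for absent errors are all hypothetical, and `java.io` data streams stand in for the Elasticsearch stream classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch: per-bucket errors are written only when the flag is set. */
final class OptionalErrorSerializationSketch {

    static byte[] write(long[] docCounts, long[] docCountErrors, boolean showErrors) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeBoolean(showErrors);      // tells the reader whether errors follow
            out.writeInt(docCounts.length);
            for (int i = 0; i < docCounts.length; i++) {
                out.writeLong(docCounts[i]);
                if (showErrors) {
                    out.writeLong(docCountErrors[i]); // skipped entirely when the flag is off
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static long[] readErrors(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            boolean showErrors = in.readBoolean();
            int bucketCount = in.readInt();
            long[] errors = new long[bucketCount];
            for (int i = 0; i < bucketCount; i++) {
                in.readLong(); // doc count, not needed by this sketch
                errors[i] = showErrors ? in.readLong() : -1L; // -1 is a made-up placeholder
            }
            return errors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this layout each bucket costs 8 bytes less when the flag is off, which is the saving the comment is after.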

jpountz · 2014-07-25T09:05:47Z

src/main/java/org/elasticsearch/search/aggregations/bucket/terms/DoubleTermsAggregator.java

+                collectExistingBucket(doc, bucketOrdinal);
+            } else {
+                collectBucket(doc, bucketOrdinal);
+            }


indentation issue

jpountz · 2014-07-25T09:17:49Z

LGTM but can you fix indentation issues and trailing spaces that have been added on some lines before pushing?


colings86 · 2014-07-25T13:28:15Z

Merged into 1.x and master

colings86 added the review label Jul 8, 2014

jpountz reviewed Jul 10, 2014

jpountz reviewed Jul 11, 2014

jpountz removed the review label Jul 11, 2014

colings86 added the review label Jul 14, 2014

jpountz reviewed Jul 16, 2014

jpountz removed the review label Jul 16, 2014

jpountz reviewed Jul 24, 2014

jpountz reviewed Jul 25, 2014

jpountz removed the review label Jul 25, 2014

colings86 closed this Jul 25, 2014

colings86 self-assigned this Aug 21, 2014

clintongormley changed the title ~~Aggregations: Added an option to show the upper bound of the error for t...~~ Aggregations: Added an option to show the upper bound of the error for the terms aggregation Sep 11, 2014

clintongormley added >enhancement v1.4.0.Beta1 v2.0.0-beta1 labels Sep 11, 2014

clintongormley added the :Analytics/Aggregations Aggregations label Jun 7, 2015

clintongormley changed the title ~~Aggregations: Added an option to show the upper bound of the error for the terms aggregation~~ Added an option to show the upper bound of the error for the terms aggregation Jun 7, 2015

colings86 mentioned this pull request Aug 4, 2016

Should we remove/modify some of the experiment tags in the documentation #19798

Closed
