Add `shard_min_doc_count` parameter to terms aggregation #6143

brwe · 2014-05-13T08:01:54Z

This was discussed in issues #6041 and #5998. The parameter already exists for the significant terms aggregation.

There are also two refactoring commits:

I tried to extract the parsing of the parameters common to the significant terms and terms aggregations to get rid of some duplicate code.

Also, I refactored the terms and significant terms aggregations a little: requiredSize, shardSize, minDocCount and shardMinDocCount are now stored in a class called BucketCountThresholds. Before, every class using these parameters had its own members for these four values, which clutters the code. Because they are mostly needed together, it makes sense to group them. I hope that this makes the code more readable.
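A minimal sketch of the grouping idea described above, for readers who do not have the diff open. It is illustrative only and not the class from this pull request: the field types (long for document counts, int for bucket counts) and the `describe` helper are assumptions.

```java
// Illustrative sketch only; not the code from this pull request.
public class ThresholdsSketch {

    /** Groups the four thresholds shared by terms and significant terms. */
    static final class BucketCountThresholds {
        final long minDocCount;      // assumed long (a document count)
        final long shardMinDocCount; // assumed long (a document count)
        final int requiredSize;      // assumed int (a bucket count)
        final int shardSize;         // assumed int (a bucket count)

        BucketCountThresholds(long minDocCount, long shardMinDocCount,
                              int requiredSize, int shardSize) {
            this.minDocCount = minDocCount;
            this.shardMinDocCount = shardMinDocCount;
            this.requiredSize = requiredSize;
            this.shardSize = shardSize;
        }
    }

    /** Hypothetical consumer: one argument instead of four loose ones. */
    static String describe(BucketCountThresholds t) {
        return "size=" + t.requiredSize
                + ", shard_size=" + t.shardSize
                + ", min_doc_count=" + t.minDocCount
                + ", shard_min_doc_count=" + t.shardMinDocCount;
    }

    public static void main(String[] args) {
        System.out.println(describe(new BucketCountThresholds(1, 0, 10, 50)));
    }
}
```

Passing one object also makes it harder to swap an int and a long argument by accident, because the four values are named once, in one constructor.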

brwe · 2014-05-13T08:22:11Z

@markharwood Maybe we should hold off on the review here until you have pushed your changes and I have resolved all the conflicts?

markharwood · 2014-05-13T08:23:26Z

Would make sense I think. Just running the tests before pushing...

jpountz · 2014-05-13T08:31:27Z

...n/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTermsBuilder.java

-        if (requiredSize != SignificantTermsParser.DEFAULT_REQUIRED_SIZE) {
-            builder.field("size", requiredSize);
+        if (bucketCountThresholds.requiredSize >= 0) {
+            builder.field("size", bucketCountThresholds.requiredSize);
        }


The point of all the != checks is to not print parameters when they are equal to the default. But here, this changes to a >=0 while the default is 10 so it looks to me like size would be put all the time in the builder?

Yeah, I actually copied that from TermsBuilder. Checking for >= 0 makes sense, but not emitting the parameter when it equals the default makes sense too. Should I add both checks?

I just looked at the current TermsBuilder, and the reason it can do that is that it initializes all values to -1 (which is invalid) by default. So maybe this class should do the same and only emit values which have been set explicitly, so that the defaults are only handled on the parser side?

Oh, right! That simplifies things...
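The approach agreed on in this exchange can be sketched as follows. This is a hedged illustration, not TermsBuilder or SignificantTermsBuilder: `SizeOptions` and `toFields` are hypothetical names, and a plain Map stands in for the real builder output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; SizeOptions is a hypothetical stand-in for a request builder.
public class BuilderDefaultsSketch {

    static final class SizeOptions {
        private int requiredSize = -1; // -1 is invalid and means "not set by the caller"
        private int shardSize = -1;

        SizeOptions size(int size) {
            this.requiredSize = size;
            return this;
        }

        SizeOptions shardSize(int shardSize) {
            this.shardSize = shardSize;
            return this;
        }

        /** Emits only the values that were set; defaults are left to the parser. */
        Map<String, Object> toFields() {
            Map<String, Object> fields = new LinkedHashMap<>();
            if (requiredSize >= 0) {
                fields.put("size", requiredSize);
            }
            if (shardSize >= 0) {
                fields.put("shard_size", shardSize);
            }
            return fields;
        }
    }

    public static void main(String[] args) {
        System.out.println(new SizeOptions().toFields());          // nothing was set
        System.out.println(new SizeOptions().size(10).toFields()); // only "size" is emitted
    }
}
```

With an invalid sentinel as the initial value, the `>= 0` check doubles as the "was this set?" check, so the builder never needs to know what the parser's default is.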

jpountz · 2014-05-13T08:37:01Z

This looks good to me. I like the BucketCountThresholds refactoring and the documentation is clear. I just left a minor comment about the handling of default values in the builders.

Both need the requiredSize, shardSize, minDocCount and shardMinDocCount. Parsing should not be duplicated.

…unt a single parameter every class using these parameters has their own member where these four are stored. This clutters the code. Because they are mostly needed together it might make sense to group them. I hope that this makes the code more readable and also that it avoids errors such as mixing up the int/long parameters in the future.

This was discussed in issue elastic#6041 and elastic#5998 .

brwe · 2014-05-13T08:56:01Z

OK, I rebased on 1e560b0 and will work on @jpountz's comments next.

squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"

brwe · 2014-05-13T10:23:29Z

I added two commits to implement the comments. I was wondering: Would it make sense to add the logic in TermsParser to SignificantTermsParser as well, @markharwood ? I mean the

if (bucketCountThresholds.requiredSize == 0) {
    bucketCountThresholds.requiredSize = Integer.MAX_VALUE;
}

which is here

markharwood · 2014-05-13T11:04:00Z

src/main/java/org/elasticsearch/search/aggregations/bucket/terms/TermsAggregator.java

+            this.minDocCount = 1;
+            this.shardMinDocCount = 0;
+            this.requiredSize = 10;
+            this.shardSize = -1;


Consider use of an "UNDEFINED" constant (or overridable method) in place of -1 here?
Elsewhere, in https://github.com/elasticsearch/elasticsearch/pull/6143/files#diff-2cdb75b1d8a27a17cd3df51df6995ae6R55 there is a reference to testing for undefined values using shardSize on the default object returned by getDefaultBucketCountThresholds() - but shardSize is mutable - if anyone happens to call ensureValidity on this default object then its shardSize setting would change from -1 to 10 which seems dangerous.

I wonder if there is stuff that can be reused from o.e.common.Explicit - it holds values but remembers the differences between settings that were based purely on defaults and conscious decisions made by users ("explicit").

Ok, I added a commit to use Explicit. Take a look!

a reference to testing for undefined values using shardSize on the default object returned by getDefaultBucketCountThresholds() - but shardSize is mutable - if anyone happens to call ensureValidity on this default object then its shardSize setting would change from -1 to 10 which seems dangerous.

The default values are private and getDefaultBucketCountThresholds() creates a new instance. In addition I now check if the values were set (using .explicit()) when the default parameters are retrieved. The only way to mess with the default settings is now within SignificantTermsParametersParser or TermsParametersParser. Is that good enough?
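A rough sketch of the two safeguards discussed here: a value wrapper that remembers whether the user set it, and a defaults accessor that hands out a fresh instance each time. The `Explicit` class below is a simplified stand-in written for this example, not o.e.common.Explicit itself; `Thresholds`, `defaults()` and the body of `ensureValidity()` are assumptions.

```java
// Illustrative sketch only; Explicit here is a simplified stand-in, not o.e.common.Explicit.
public class ExplicitSketch {

    /** Holds a value plus a flag saying whether the user set it. */
    static final class Explicit<T> {
        private final T value;
        private final boolean explicit;

        Explicit(T value, boolean explicit) {
            this.value = value;
            this.explicit = explicit;
        }

        T value() {
            return value;
        }

        boolean explicit() {
            return explicit;
        }
    }

    static final class Thresholds {
        Explicit<Integer> requiredSize = new Explicit<>(10, false);
        Explicit<Integer> shardSize = new Explicit<>(-1, false);

        void setShardSize(int size) {
            shardSize = new Explicit<>(size, true);
        }

        /** Assumed behaviour: derive shardSize from requiredSize when the user gave none. */
        void ensureValidity() {
            if (!shardSize.explicit()) {
                shardSize = new Explicit<>(requiredSize.value(), false);
            }
        }
    }

    /** Returns a new instance per call, so mutating one result cannot corrupt the defaults. */
    static Thresholds defaults() {
        return new Thresholds();
    }

    public static void main(String[] args) {
        Thresholds first = defaults();
        first.ensureValidity();          // changes only this instance
        Thresholds second = defaults();  // still carries the untouched default
        System.out.println(first.shardSize.value() + " " + second.shardSize.value());
    }
}
```

Because the flag travels with the value, "was this set?" no longer has to be inferred by comparing against a magic number such as -1.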

jpountz · 2014-05-13T15:06:37Z

src/main/java/org/elasticsearch/search/aggregations/bucket/terms/TermsAggregator.java

+            this.shardMinDocCount = new Explicit<>(bucketCountThresholds.shardMinDocCount.value(), false);
+            this.requiredSize = new Explicit<>(bucketCountThresholds.requiredSize.value(), false);
+            this.shardSize = new Explicit<>(bucketCountThresholds.shardSize.value(), false);
+        }


can you refactor these 2 constructors so that they call the 1st one?
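The constructor chaining being asked for might look roughly like this. It is a simplified sketch with plain fields instead of Explicit wrappers; the `Thresholds` class name is hypothetical, and the default values are the ones quoted earlier in this thread.

```java
// Illustrative sketch only; plain fields are used instead of Explicit wrappers.
public class ConstructorChainSketch {

    static final class Thresholds {
        final long minDocCount;
        final long shardMinDocCount;
        final int requiredSize;
        final int shardSize;

        /** Primary constructor: the only place where fields are assigned. */
        Thresholds(long minDocCount, long shardMinDocCount, int requiredSize, int shardSize) {
            this.minDocCount = minDocCount;
            this.shardMinDocCount = shardMinDocCount;
            this.requiredSize = requiredSize;
            this.shardSize = shardSize;
        }

        /** Default constructor delegating to the primary one. */
        Thresholds() {
            this(1, 0, 10, -1);
        }

        /** Copy constructor delegating to the primary one. */
        Thresholds(Thresholds other) {
            this(other.minDocCount, other.shardMinDocCount, other.requiredSize, other.shardSize);
        }
    }

    public static void main(String[] args) {
        Thresholds copy = new Thresholds(new Thresholds());
        System.out.println(copy.requiredSize + " " + copy.shardSize);
    }
}
```

Routing every constructor through one place means a new field only has to be wired up once.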

jpountz · 2014-05-13T15:32:37Z

I was wondering: Would it make sense to add the logic in TermsParser to SignificantTermsParser as well

I think it would be nice to make it consistent.

…de to BucketCountThresholds

brwe · 2014-05-14T08:19:02Z

Updated with new commits.

jpountz · 2014-05-14T08:21:38Z

docs/reference/search/aggregations/bucket/terms-aggregation.asciidoc

+
+Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
+
+The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` per default and has no effect unless you explicitly set it.


can you add coming[1.2.0]?
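To make the effect described in the documentation excerpt above concrete, here is a small simulation. It is a sketch under assumptions, not Elasticsearch code: the term counts are invented, candidates are ordered by term (a choice made here so that rare terms can crowd out frequent ones), and the shard-level filter uses `>=` as its boundary.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative simulation only; data, ordering and boundary checks are assumptions.
public class ShardMinDocCountSketch {

    /** Picks up to shardSize candidates on one shard, in term order, after the local filter. */
    static Map<String, Long> shardCandidates(Map<String, Long> localCounts,
                                             int shardSize, long shardMinDocCount) {
        return localCounts.entrySet().stream()
                .filter(e -> e.getValue() >= shardMinDocCount)
                .sorted(Map.Entry.comparingByKey())
                .limit(shardSize)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        Long::sum, LinkedHashMap::new));
    }

    /** Merges shard candidates and drops terms whose summed count is below minDocCount. */
    static Map<String, Long> reduce(List<Map<String, Long>> shardResults, long minDocCount) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> shard : shardResults) {
            shard.forEach((term, count) -> merged.merge(term, count, Long::sum));
        }
        merged.values().removeIf(count -> count < minDocCount);
        return merged;
    }

    /** Runs the two-shard example with shard_size=2 and min_doc_count=10. */
    static Map<String, Long> run(long shardMinDocCount) {
        Map<String, Long> shardA = Map.of("abble", 1L, "aple", 1L, "apple", 40L, "banana", 30L);
        Map<String, Long> shardB = Map.of("aple", 1L, "apple", 35L, "banana", 25L);
        return reduce(List.of(
                shardCandidates(shardA, 2, shardMinDocCount),
                shardCandidates(shardB, 2, shardMinDocCount)), 10);
    }

    public static void main(String[] args) {
        System.out.println(run(0)); // {apple=35}: misspellings filled the candidate lists
        System.out.println(run(5)); // {apple=75, banana=55}
    }
}
```

With the filter off, the misspellings take the candidate slots, so "banana" is lost and "apple" is undercounted. Raising the shard-level threshold recovers both without increasing `shard_size`.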

jpountz · 2014-05-14T08:32:51Z

LGTM

markharwood · 2014-05-14T09:21:18Z

LGTM2

This was discussed in issue #6041 and #5998 . closes #6143

jpountz reviewed May 13, 2014

brwe added 5 commits May 13, 2014 10:37

refactor: unify terms and significant_terms parsing

fd80abc

Both need the requiredSize, shardSize, minDocCount and shardMinDocCount. Parsing should not be duplicated.

add failing test for shard_min_doc_count

d97f194

use shard_min_doc_count also in TermsAggregation

869f5d3

This was discussed in issue elastic#6041 and elastic#5998 .

add documentation of shard_min_doc_count

438c4f3

brwe added 2 commits May 13, 2014 12:17

handle defaults only in parser and use names from ParseFields

42c9539

squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"

Move validity checks to BucketCountThresholds and add >=0 check

cbdb9a1

squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"

markharwood reviewed May 13, 2014

use class Explicit in BucketCountThresholds

e65a245

brwe added the review label May 13, 2014

jpountz reviewed May 13, 2014

brwe added 4 commits May 13, 2014 17:32

refactor constructors of BucketCountThresholds

18d79bc

do not expose explicit() in BucketCountThresholds and move builder co…

56bd7d5

…de to BucketCountThresholds

set shard_size and size to Integer.MAX if they are 0

15af410

Update docs to explain default settings

2324876

jpountz reviewed May 14, 2014

add coming [1.2.0] to the changes

38166f2

add coming[1.2.0] also to significant terms

945482b

brwe added v1.2.0 and removed review labels May 14, 2014

brwe added a commit that referenced this pull request May 14, 2014

use shard_min_doc_count also in TermsAggregation

468a0d0

This was discussed in issue #6041 and #5998 . closes #6143

brwe closed this in 08e5789 May 14, 2014

clintongormley added >enhancement :Analytics/Aggregations Aggregations labels Jun 8, 2015
