Add shard_min_doc_count parameter to terms aggregation #6143

Closed
wants to merge 14 commits into from

brwe
Contributor

@brwe brwe commented May 13, 2014

This was discussed in issues #6041 and #5998. The parameter already exists for the significant terms aggregation.

There are also two refactoring commits:

I tried to extract the parsing of the parameters common to the significant terms and terms aggregations to get rid of some duplicate code.

Also, I refactored the terms and significant terms aggregations a little: requiredSize, shardSize, minDocCount and shardMinDocCount are now stored in a class called BucketCountThresholds. Before, every class using these parameters had its own members for these four values, which cluttered the code. Because they are mostly needed together, it makes sense to group them. I hope that this makes the code more readable.
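For illustration, the grouping described above could look roughly like the following. This is a minimal sketch, not the class from this pull request: the names ending in `Sketch` are hypothetical, and the choice of `long` for the document-count thresholds and `int` for the sizes is an assumption.

```java
// Illustrative sketch only; names ending in "Sketch" are hypothetical.
final class BucketCountThresholdsSketch {
    // Assumed types: document-count thresholds as long, sizes as int.
    long minDocCount;
    long shardMinDocCount;
    int requiredSize;
    int shardSize;

    BucketCountThresholdsSketch(long minDocCount, long shardMinDocCount,
                                int requiredSize, int shardSize) {
        this.minDocCount = minDocCount;
        this.shardMinDocCount = shardMinDocCount;
        this.requiredSize = requiredSize;
        this.shardSize = shardSize;
    }
}

// A consumer takes one parameter object instead of four positional arguments,
// so the int and long values can no longer be passed in the wrong order.
final class AggregatorSketch {
    private final BucketCountThresholdsSketch thresholds;

    AggregatorSketch(BucketCountThresholdsSketch thresholds) {
        this.thresholds = thresholds;
    }

    // True if a bucket with this global count reaches min_doc_count.
    boolean passesMinDocCount(long docCount) {
        return docCount >= thresholds.minDocCount;
    }
}
```

Any class that previously carried four separate members would hold a single `BucketCountThresholdsSketch` reference instead.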

@brwe
Contributor Author

brwe commented May 13, 2014

@markharwood Maybe we should hold off on the review here until you have pushed your changes and I have resolved all the conflicts?

@markharwood
Contributor

Would make sense I think. Just running the tests before pushing...

- if (requiredSize != SignificantTermsParser.DEFAULT_REQUIRED_SIZE) {
-     builder.field("size", requiredSize);
+ if (bucketCountThresholds.requiredSize >= 0) {
+     builder.field("size", bucketCountThresholds.requiredSize);
  }
Contributor

The point of all the != checks is to not print parameters when they are equal to the default. But here the check changes to >= 0 while the default is 10, so it looks to me like size would always be written to the builder?

Contributor Author

Yeah, I actually copied that from TermsBuilder. Checking for >= 0 makes sense, but so does not emitting the parameter when it equals the default. Shall I add both checks?

Contributor

I just looked at the current TermsBuilder, and the reason why it can do that is that it initializes all values to -1 (which is invalid) by default. So maybe this class should do the same and only emit values which have been set explicitly, so that the defaults are only handled on the parser side?

Contributor Author

Oh, right! That simplifies things...
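The approach agreed on in this thread can be sketched as follows: the builder starts every value at -1 and emits only what the caller set, leaving defaults to the parser. This is an illustration under assumptions, not the actual builder: `TermsBuilderSketch` is a hypothetical name and a plain `Map` stands in for the real output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a request builder; a Map replaces the real output.
final class TermsBuilderSketch {
    // -1 is never a valid value, so it doubles as "not set by the caller".
    private int size = -1;
    private int shardSize = -1;
    private long minDocCount = -1;
    private long shardMinDocCount = -1;

    TermsBuilderSketch size(int size) { this.size = size; return this; }
    TermsBuilderSketch shardSize(int shardSize) { this.shardSize = shardSize; return this; }
    TermsBuilderSketch minDocCount(long minDocCount) { this.minDocCount = minDocCount; return this; }
    TermsBuilderSketch shardMinDocCount(long v) { this.shardMinDocCount = v; return this; }

    // Emits only explicitly set parameters; the builder knows nothing about defaults.
    Map<String, Object> toMap() {
        Map<String, Object> out = new LinkedHashMap<>();
        if (size >= 0) out.put("size", size);
        if (shardSize >= 0) out.put("shard_size", shardSize);
        if (minDocCount >= 0) out.put("min_doc_count", minDocCount);
        if (shardMinDocCount >= 0) out.put("shard_min_doc_count", shardMinDocCount);
        return out;
    }
}
```

With this shape, a value that happens to equal the parser default is still emitted when the caller set it, and nothing is emitted otherwise, so the builder needs no `!= DEFAULT` comparisons.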

@jpountz
Contributor

jpountz commented May 13, 2014

This looks good to me. I like the BucketCountThresholds refactoring, and the documentation is clear. I just left a minor comment about the handling of default values in the builders.

brwe added 5 commits May 13, 2014 10:37
Both need the requiredSize, shardSize, minDocCount and shardMinDocCount.
Parsing should not be duplicated.
…unt a single parameter

every class using these parameters has its own member where these four
are stored. This clutters the code. Because they are mostly needed together
it might make sense to group them.
I hope that this makes the code more readable and also that it avoids
errors such as mixing up the int/long parameters in the future.
@brwe
Contributor Author

brwe commented May 13, 2014

OK, I rebased on 1e560b0 and will work on @jpountz's comments next.

brwe added 2 commits May 13, 2014 12:17
squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"
squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"
@brwe
Contributor Author

brwe commented May 13, 2014

I added two commits to address the review comments. I was wondering: Would it make sense to add the logic in TermsParser to SignificantTermsParser as well, @markharwood? I mean the

if (bucketCountThresholds.requiredSize == 0) {
    bucketCountThresholds.requiredSize = Integer.MAX_VALUE;
}

which is here

this.minDocCount = 1;
this.shardMinDocCount = 0;
this.requiredSize = 10;
this.shardSize = -1;
Contributor

Consider using an "UNDEFINED" constant (or an overridable method) in place of -1 here?
Elsewhere, in https://github.com/elasticsearch/elasticsearch/pull/6143/files#diff-2cdb75b1d8a27a17cd3df51df6995ae6R55 there is a reference to testing for undefined values using shardSize on the default object returned by getDefaultBucketCountThresholds() - but shardSize is mutable - if anyone happens to call ensureValidity on this default object then its shardSize setting would change from -1 to 10 which seems dangerous.

Contributor

I wonder if something can be reused from o.e.common.Explicit: it holds a value and remembers whether that value came purely from a default or from a conscious decision made by the user ("explicit").

Contributor Author

Ok, I added a commit to use Explicit. Take a look!

> a reference to testing for undefined values using shardSize on the default object returned by getDefaultBucketCountThresholds() - but shardSize is mutable - if anyone happens to call ensureValidity on this default object then its shardSize setting would change from -1 to 10 which seems dangerous.

The default values are private, and getDefaultBucketCountThresholds() creates a new instance. In addition, I now check whether the values were set (using .explicit()) when the default parameters are retrieved. The only way to mess with the default settings is now from within SignificantTermsParametersParser or TermsParametersParser. Is that good enough?
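To make the idea concrete, here is a minimal sketch of a value holder that remembers whether a setting was chosen by the user, combined with a defaults factory that hands out a fresh instance each time. `ExplicitSketch` is a hypothetical stand-in that mirrors only the calls visible in this review (`value()`, `explicit()` and a two-argument constructor); the body of `ensureValidity` is an assumption based on the -1 to 10 change mentioned above.

```java
// Hypothetical stand-in for a holder that records whether a value was user-set.
final class ExplicitSketch<T> {
    private final T value;
    private final boolean explicit;

    ExplicitSketch(T value, boolean explicit) {
        this.value = value;
        this.explicit = explicit;
    }

    T value() { return value; }
    boolean explicit() { return explicit; }
}

final class ThresholdsWithExplicitSketch {
    ExplicitSketch<Integer> requiredSize = new ExplicitSketch<>(10, false);
    ExplicitSketch<Integer> shardSize = new ExplicitSketch<>(-1, false);

    // Every call returns a fresh object, so mutating one copy cannot
    // corrupt the defaults seen by later callers.
    static ThresholdsWithExplicitSketch defaults() {
        return new ThresholdsWithExplicitSketch();
    }

    void setShardSize(int shardSize) {
        this.shardSize = new ExplicitSketch<>(shardSize, true);
    }

    // Assumed behavior: derive shardSize from requiredSize only when
    // the user did not choose a shard size.
    void ensureValidity() {
        if (!shardSize.explicit()) {
            shardSize = new ExplicitSketch<>(requiredSize.value(), false);
        }
    }
}
```

Testing `explicit()` instead of comparing against -1 means the "undefined" check keeps working even after a derived value has been filled in.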

@brwe brwe added the review label May 13, 2014
this.shardMinDocCount = new Explicit<>(bucketCountThresholds.shardMinDocCount.value(), false);
this.requiredSize = new Explicit<>(bucketCountThresholds.requiredSize.value(), false);
this.shardSize = new Explicit<>(bucketCountThresholds.shardSize.value(), false);
}
Contributor

Can you refactor these two constructors so that they call the first one?
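The requested refactoring is ordinary constructor chaining. A minimal sketch with plain fields, assuming the default values shown in the earlier diff fragment (1, 0, 10, -1); the class name is hypothetical:

```java
// Hypothetical class illustrating constructor chaining with this(...).
final class ChainedThresholdsSketch {
    final long minDocCount;
    final long shardMinDocCount;
    final int requiredSize;
    final int shardSize;

    // Primary constructor: the only place where fields are assigned.
    ChainedThresholdsSketch(long minDocCount, long shardMinDocCount,
                            int requiredSize, int shardSize) {
        this.minDocCount = minDocCount;
        this.shardMinDocCount = shardMinDocCount;
        this.requiredSize = requiredSize;
        this.shardSize = shardSize;
    }

    // Default constructor delegates to the primary one.
    ChainedThresholdsSketch() {
        this(1, 0, 10, -1);
    }

    // Copy constructor delegates as well, so no assignment is duplicated.
    ChainedThresholdsSketch(ChainedThresholdsSketch other) {
        this(other.minDocCount, other.shardMinDocCount, other.requiredSize, other.shardSize);
    }
}
```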

@jpountz
Contributor

jpountz commented May 13, 2014

> I was wondering: Would it make sense to add the logic in TermsParser to SignificantTermsParser as well

I think it would be nice to make it consistent.
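One way to keep the two parsers consistent would be to route both through a single helper. A small sketch, where the class and method names are hypothetical and only the `== 0` to `Integer.MAX_VALUE` rule is taken from the snippet quoted earlier:

```java
// Hypothetical shared helper that both parsers could call.
final class SizeNormalizationSketch {
    // A requested size of 0 is treated as "no limit".
    static int normalizeRequiredSize(int requiredSize) {
        return requiredSize == 0 ? Integer.MAX_VALUE : requiredSize;
    }
}
```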

@brwe
Contributor Author

brwe commented May 14, 2014

Updated with new commits.


Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have information about the global document count available. The decision whether a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add the term as a candidate is made without being very _certain_ that the term will actually reach the required `min_doc_count`. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

The parameter `shard_min_doc_count` regulates the _certainty_ a shard has about whether the term should actually be added to the candidate list with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will, with reasonable certainty, not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` by default and has no effect unless you explicitly set it.
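The effect described in these two paragraphs can be reproduced with a toy model of the two phases. This is a sketch under assumptions, not the aggregation code: all names are hypothetical, shards order candidates by ascending local count (one ordering in which rare terms fill the candidate list), and the strict comparison follows the wording "higher than" above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of shard-level collection followed by a global merge.
final class ShardMinDocCountSketch {

    // Shard phase: keep at most shardSize candidates, rarest terms first.
    static Map<String, Long> shardCandidates(Map<String, Long> localCounts,
                                             int shardSize, long shardMinDocCount) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(localCounts.entrySet());
        // Drop terms whose local count is not higher than the threshold.
        entries.removeIf(e -> e.getValue() <= shardMinDocCount);
        // Ascending local count, ties broken alphabetically.
        entries.sort((a, b) -> {
            int byCount = Long.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> candidates = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(shardSize, entries.size()))) {
            candidates.put(e.getKey(), e.getValue());
        }
        return candidates;
    }

    // Reduce phase: sum counts across shards, then apply min_doc_count globally.
    static Map<String, Long> merge(List<Map<String, Long>> perShard, long minDocCount) {
        Map<String, Long> global = new TreeMap<>();
        for (Map<String, Long> shard : perShard) {
            shard.forEach((term, count) -> global.merge(term, count, Long::sum));
        }
        global.values().removeIf(count -> count < minDocCount);
        return global;
    }

    // Two shards, shard_size = 2, min_doc_count = 3.
    static Map<String, Long> run(long shardMinDocCount) {
        Map<String, Long> shard1 = Map.of("a", 1L, "b", 1L, "c", 2L, "d", 9L);
        Map<String, Long> shard2 = Map.of("a", 1L, "e", 1L, "c", 2L, "d", 8L);
        return merge(List.of(
                shardCandidates(shard1, 2, shardMinDocCount),
                shardCandidates(shard2, 2, shardMinDocCount)), 3);
    }

    public static void main(String[] args) {
        System.out.println(run(0)); // rare terms crowd out "c": prints {}
        System.out.println(run(1)); // prints {c=4, d=17}
    }
}
```

With the threshold at 0, each shard spends its two candidate slots on terms that occur once, and none of them reaches `min_doc_count` after merging. Raising the threshold to 1 removes those terms on the shard, so `c` and `d` become candidates and survive the merge without increasing `shard_size`.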
Contributor

can you add coming[1.2.0]?

Contributor Author

done

@jpountz
Contributor

jpountz commented May 14, 2014

LGTM

@markharwood
Contributor

LGTM2

@brwe brwe added v1.2.0 and removed review labels May 14, 2014
brwe added a commit that referenced this pull request May 14, 2014
This was discussed in issues #6041 and #5998.

closes #6143
@brwe brwe closed this in 08e5789 May 14, 2014