New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add shard_min_doc_count
parameter to terms aggregation
#6143
Conversation
@markharwood Maybe we should wait with the review here until you pushed your changes and I resolved all the conflicts? |
Would make sense I think. Just running the tests before pushing... |
if (requiredSize != SignificantTermsParser.DEFAULT_REQUIRED_SIZE) { | ||
builder.field("size", requiredSize); | ||
if (bucketCountThresholds.requiredSize >= 0) { | ||
builder.field("size", bucketCountThresholds.requiredSize); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of all the !=
checks is to not print parameters when they are equal to the default. But here, this changes to a >=0
while the default is 10
so it looks to me like size
would be put all the time in the builder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I actually copied that from TermsBuilder. Checking for >=0 makes sense also but not building the parameter if it is the default makes sense too. I will add both checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked at the current TermsBuilder and the reason why it can do that is that it initializes all values to -1 (which are invalid) by default. So maybe this class should do the same and only emit values which have been set explicitely so that the defaults are only handled on the parser side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right! That simplifies things...
This looks good to me, I like the |
Both need the requiredSize, shardSize, minDocCount and shardMinDocCount. Parsing should not be duplicated.
…unt a single parameter every class using these parameters has their own member where these four are stored. This clutters the code. Because they mostly needed together it might make sense to group them. I hope that this makes the code more readable and also that if avoids errors such as mixing up the int/long parameters in the future.
This was discussed in issue elastic#6041 and elastic#5998 .
squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"
squash to commit "refactor: make requiredSize, shardSize, minDocCount and shardMinDocCount a single parameter"
I added two commits to implement the comments. I was wondering: Would it make sense to add the logic in TermsParser to SignificantTermsParser as well, @markharwood ? I mean the
which is here |
this.minDocCount = 1; | ||
this.shardMinDocCount = 0; | ||
this.requiredSize = 10; | ||
this.shardSize = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider use of an "UNDEFINED" constant (or overridable method) in place of -1 here?
Elsewhere, in https://github.com/elasticsearch/elasticsearch/pull/6143/files#diff-2cdb75b1d8a27a17cd3df51df6995ae6R55 there is a reference to testing for undefined values using shardSize
on the default object returned by getDefaultBucketCountThresholds() - but shardSize is mutable - if anyone happens to call ensureValidity
on this default object then its shardSize setting would change from -1 to 10 which seems dangerous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is stuff that can be reused from o.e.common.Explicit - it holds values but remembers the differences between settings that were based purely on defaults and conscious decisions made by users ("explicit").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added a commit to use Explicit. Take a look!
a reference to testing for undefined values using shardSize on the default object returned by getDefaultBucketCountThresholds() - but shardSize is mutable - if anyone happens to call ensureValidity on this default object then its shardSize setting would change from -1 to 10 which seems dangerous.
The default values are private
and getDefaultBucketCountThresholds()
creates a new instance. In addition I now check if the values were set (using .explicit()
) when the default parameters are retrieved. The only way to mess with the default settings is now within SignificantTermsParametersParser
or TermsParametersParser
. Is that good enough?
this.shardMinDocCount = new Explicit<>(bucketCountThresholds.shardMinDocCount.value(), false); | ||
this.requiredSize = new Explicit<>(bucketCountThresholds.requiredSize.value(), false); | ||
this.shardSize = new Explicit<>(bucketCountThresholds.shardSize.value(), false); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you refactor these 2 constructors so that they call the 1st one?
I think it would be nice to make it consistent. |
Updated with new commits. |
|
||
Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The `min_doc_count` criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very _certain_ about if the term will actually reach the required `min_doc_count`. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the `shard_size` parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic. | ||
|
||
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent terms and you are not interested in those (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a resonable certainty not reach the required `min_doc_count` even after merging the local counts. `shard_min_doc_count` is set to `0` per default and has no effect unless you explicitly set it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add coming[1.2.0]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
LGTM |
LGTM2 |
This was discussed in issue #6041 and #5998 . The parameter already exists for significant terms aggregation.
There are also two refactoring commits:
I tried to extract the parsing of common parameters for of significant terms and terms aggregation to get rid of some duplicate code.
Also, I refactored terms and significant terms aggregation a little:
requiredSize
,shardSize
,minDocCount
andshardMinDocCount
are stored in a class calledBucketCountThresholds
. Before, every class using these parameters had their own member where these four are stored. This clutters the code. Because they are mostly needed together it might make sense to group them. I hope that this makes the code more readable.