New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add shard_min_doc_count
parameter for significant terms similar to shard_size
#6041
Conversation
…`shard_size` Significant terms internally maintain a priority queue per shard with a size potentially lower than the number of terms. This queue uses the score as criterion to determine if a bucket is kept or not. If many terms with low subsetDF score very high but the `min_doc_count` is set high, this might result in no terms being returned because the pq is filled with low frequent terms which are all sorted out in the end. This can be avoided by increasing the `shard_size` parameter to a higher value. However, it is not immediately clear to which value this parameter must be set because we can not know how many terms with low frequency are scored higher that the high frequent terms that we are actually interested in. On the other hand, if there is no routing of docs to shards involved, we can maybe assume that the documents of classes and also the terms therein are distributed evenly across shards. In that case it might be easier to not add documents to the pq that have subsetDF <= `shard_min_doc_count` which can be set to something like `min_doc_count`/number of shards because we would assume that even when summing up the subsetDF across shards `min_doc_count` will not be reached. closes elastic#5998
I think this parameter would be useful to the terms aggregation as well (in a separate change). |
@@ -125,7 +134,7 @@ public AggregatorFactory parse(String aggregationName, XContentParser parser, Se | |||
} | |||
|
|||
IncludeExclude includeExclude = incExcParser.includeExclude(); | |||
return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), requiredSize, shardSize, minDocCount, includeExclude, executionHint, filter); | |||
return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), requiredSize, shardSize, minDocCount, shardMinDocCount, includeExclude, executionHint, filter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should validate that minDocCount >= shardMinDocCount
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Thanks for the review! I implemented the changes. |
LGTM!
A pull request is fine, thanks! |
LGTM - just added a comment on the docs re documenting side-effects of shard_min_doc_count settings. |
Thanks! I added a new commit to enhance the documentation. |
…`shard_size` Significant terms internally maintain a priority queue per shard with a size potentially lower than the number of terms. This queue uses the score as criterion to determine if a bucket is kept or not. If many terms with low subsetDF score very high but the `min_doc_count` is set high, this might result in no terms being returned because the pq is filled with low frequent terms which are all sorted out in the end. This can be avoided by increasing the `shard_size` parameter to a higher value. However, it is not immediately clear to which value this parameter must be set because we can not know how many terms with low frequency are scored higher that the high frequent terms that we are actually interested in. On the other hand, if there is no routing of docs to shards involved, we can maybe assume that the documents of classes and also the terms therein are distributed evenly across shards. In that case it might be easier to not add documents to the pq that have subsetDF <= `shard_min_doc_count` which can be set to something like `min_doc_count`/number of shards because we would assume that even when summing up the subsetDF across shards `min_doc_count` will not be reached. closes #5998 closes #6041
This was discussed in issue elastic#6041 and elastic#5998 .
Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the
min_doc_count
is set high, this might result in no terms beingreturned because the pq is filled with low frequent terms which are all sorted
out in the end.
This can be avoided by increasing the
shard_size
parameter to a higher value.However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.
On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <=
shard_min_doc_count
which can be set to something likemin_doc_count
/number of shards because we would assume that even when summing upthe subsetDF across shards
min_doc_count
will not be reached.