Add `shard_min_doc_count` parameter for significant terms similar to `shard_size` #6041

brwe · 2014-05-05T08:58:43Z

Significant terms internally maintain a priority queue per shard whose size is potentially
lower than the number of terms. This queue uses the score as the criterion for deciding whether
a bucket is kept. If many terms with a low subsetDF score very high
but min_doc_count is set high, this can result in no terms being
returned, because the pq fills up with low-frequency terms that are all filtered
out in the end.
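
The failure mode described above can be reproduced with a small simulation. This is only a sketch, not the Elasticsearch code: `Bucket`, `topByScore` and `applyMinDocCount` are hypothetical names, a single shard is used, and the final `min_doc_count` filter is applied directly to that shard's buckets instead of to counts summed across shards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

public class ShardQueueProblem {

    /** Hypothetical stand-in for one significant-terms bucket on a shard. */
    static final class Bucket {
        final String term;
        final long subsetDf;
        final double score;

        Bucket(String term, long subsetDf, double score) {
            this.term = term;
            this.subsetDf = subsetDf;
            this.score = score;
        }
    }

    /** Keeps only the shardSize highest-scoring buckets, using a bounded min-heap. */
    static List<Bucket> topByScore(List<Bucket> candidates, int shardSize) {
        PriorityQueue<Bucket> pq =
            new PriorityQueue<>(Comparator.comparingDouble((Bucket b) -> b.score));
        for (Bucket b : candidates) {
            pq.add(b);
            if (pq.size() > shardSize) {
                pq.poll(); // evict the current lowest-scoring bucket
            }
        }
        return new ArrayList<>(pq);
    }

    /** Final filter: drop every bucket whose subsetDf is below minDocCount. */
    static List<Bucket> applyMinDocCount(List<Bucket> buckets, long minDocCount) {
        return buckets.stream()
            .filter(b -> b.subsetDf >= minDocCount)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Three rare terms score higher than the one frequent term (made-up numbers).
        List<Bucket> shardTerms = Arrays.asList(
            new Bucket("rare1", 1, 0.99),
            new Bucket("rare2", 1, 0.98),
            new Bucket("rare3", 2, 0.97),
            new Bucket("frequent", 50, 0.60));

        List<Bucket> kept = topByScore(shardTerms, 3);      // queue holds only the rare terms
        List<Bucket> result = applyMinDocCount(kept, 10);   // all of them are filtered out
        System.out.println("buckets returned: " + result.size());
    }
}
```

With a queue size of 3, the frequent term is evicted before the frequency filter ever sees it, so nothing survives.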

This can be avoided by increasing the shard_size parameter.
However, it is not immediately clear what value to choose,
because we cannot know how many low-frequency terms score higher than
the high-frequency terms that we are actually interested in.

On the other hand, if no routing of docs to shards is involved, we can perhaps
assume that the documents of each class, and the terms in them, are distributed evenly
across shards. In that case it might be easier not to add terms to the pq that have
subsetDF <= shard_min_doc_count, which can be set to something like
min_doc_count/number of shards, because we would assume that even when summing up
the subsetDF across shards, min_doc_count will not be reached.
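
A minimal sketch of the proposed shard-level filter, under the same even-distribution assumption. All names here (`Bucket`, `suggestedShardMinDocCount`, `topByScore`) are hypothetical and not taken from the Elasticsearch code base. The `<=` comparison follows the wording of this description; the merged implementation may treat the boundary differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMinDocCountSketch {

    /** Hypothetical stand-in for one significant-terms bucket on a shard. */
    static final class Bucket {
        final String term;
        final long subsetDf;
        final double score;

        Bucket(String term, long subsetDf, double score) {
            this.term = term;
            this.subsetDf = subsetDf;
            this.score = score;
        }
    }

    /** Heuristic from the description: min_doc_count divided by the number of shards. */
    static long suggestedShardMinDocCount(long minDocCount, int numberOfShards) {
        return minDocCount / numberOfShards; // integer division, rounds down
    }

    /** Bounded min-heap by score that skips terms too rare on this shard. */
    static List<Bucket> topByScore(List<Bucket> candidates, int shardSize, long shardMinDocCount) {
        PriorityQueue<Bucket> pq =
            new PriorityQueue<>(Comparator.comparingDouble((Bucket b) -> b.score));
        for (Bucket b : candidates) {
            if (b.subsetDf <= shardMinDocCount) {
                continue; // never enters the queue, so it cannot crowd out frequent terms
            }
            pq.add(b);
            if (pq.size() > shardSize) {
                pq.poll(); // evict the current lowest-scoring bucket
            }
        }
        return new ArrayList<>(pq);
    }

    public static void main(String[] args) {
        long shardMinDocCount = suggestedShardMinDocCount(10, 5); // min_doc_count 10, 5 shards

        // Made-up per-shard statistics: rare terms score higher than the frequent one.
        List<Bucket> shardTerms = Arrays.asList(
            new Bucket("rare1", 1, 0.99),
            new Bucket("rare2", 1, 0.98),
            new Bucket("rare3", 2, 0.97),
            new Bucket("frequent", 50, 0.60));

        for (Bucket b : topByScore(shardTerms, 3, shardMinDocCount)) {
            System.out.println(b.term + " subsetDf=" + b.subsetDf);
        }
    }
}
```

Filtering before insertion keeps the queue size small while still letting the frequent term through, which is the trade-off against simply raising `shard_size`.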

…`shard_size` Significant terms internally maintain a priority queue per shard whose size is potentially lower than the number of terms. This queue uses the score as the criterion for deciding whether a bucket is kept. If many terms with a low subsetDF score very high but `min_doc_count` is set high, this can result in no terms being returned, because the pq fills up with low-frequency terms that are all filtered out in the end. This can be avoided by increasing the `shard_size` parameter. However, it is not immediately clear what value to choose, because we cannot know how many low-frequency terms score higher than the high-frequency terms that we are actually interested in. On the other hand, if no routing of docs to shards is involved, we can perhaps assume that the documents of each class, and the terms in them, are distributed evenly across shards. In that case it might be easier not to add terms to the pq that have subsetDF <= `shard_min_doc_count`, which can be set to something like `min_doc_count`/number of shards, because we would assume that even when summing up the subsetDF across shards, `min_doc_count` will not be reached. closes elastic#5998

jpountz · 2014-05-05T09:49:47Z

I think this parameter would be useful to the terms aggregation as well (in a separate change).

jpountz · 2014-05-05T09:54:28Z

...in/java/org/elasticsearch/search/aggregations/bucket/significant/SignificantTermsParser.java

@@ -125,7 +134,7 @@ public AggregatorFactory parse(String aggregationName, XContentParser parser, Se
        }

        IncludeExclude includeExclude = incExcParser.includeExclude();
-        return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), requiredSize, shardSize, minDocCount, includeExclude, executionHint, filter);
+        return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), requiredSize, shardSize, minDocCount, shardMinDocCount, includeExclude, executionHint, filter);


I think we should validate that minDocCount >= shardMinDocCount?
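
For illustration, the suggested check could look roughly like the following. This is a sketch only: the class name, method name and exception type are assumptions, and the actual parser may report the error differently.

```java
public class ThresholdValidation {

    /**
     * Rejects a shard-level threshold that exceeds the global one.
     * A term with fewer than shardMinDocCount docs on every shard could still
     * reach minDocCount in total, but the reverse setting makes no sense:
     * a shard filter stricter than the final filter only discards valid terms.
     */
    static void validate(long minDocCount, long shardMinDocCount) {
        if (shardMinDocCount > minDocCount) {
            // IllegalArgumentException is an assumption made for this sketch.
            throw new IllegalArgumentException(
                "shard_min_doc_count [" + shardMinDocCount
                    + "] must not be greater than min_doc_count [" + minDocCount + "]");
        }
    }

    public static void main(String[] args) {
        validate(10, 2); // accepted
        try {
            validate(2, 10); // rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```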

brwe · 2014-05-05T13:15:20Z

Thanks for the review! I implemented the changes.
I will make a separate pull request for the terms aggregation once this is pushed. Or do we need a separate issue for that also?

jpountz · 2014-05-05T16:48:13Z

LGTM!

> I will make a separate pull request for the terms aggregation once this is pushed. Or do we need a separate issue for that also?

A pull request is fine, thanks!

markharwood · 2014-05-06T09:36:27Z

LGTM - just added a comment on the docs re documenting side-effects of shard_min_doc_count settings.
Part of me wonders if some of the low-level settings we offer (shard/reducer PQ sizes, shard/reducer frequency filters) are a bit techy and could be abstracted to more user-friendly policy choices e.g. "find rare things".

brwe · 2014-05-06T10:25:45Z

Thanks! I added a new commit to enhance the documentation.

brwe added 3 commits May 5, 2014 10:29

test for missing significant terms

3f3eff5

add documentation for shard_min_doc_count

1c041b3

jpountz reviewed May 5, 2014

brwe added 3 commits May 5, 2014 14:57

validate shardMinDocCount <= minDocCount

d10f8d5

index random in test

b792d8d

add test that shows that without shardMinDocCount 0 results

67fbcfc

brwe added v1.2.0 labels May 5, 2014

jpountz removed the review label May 5, 2014

re add previous removed @Override

9cebee7

kevinkluge assigned brwe May 5, 2014

[doc] more explanation on shard_min_doc_count

a7f4e5f

brwe added the review label May 6, 2014

brwe added 4 commits May 7, 2014 16:17

more doc

c119df9

spell check and other chnages

201822d

more changes

b111981

minor change

e662bc4

brwe closed this in 7944369 May 7, 2014

brwe removed the review label May 7, 2014

brwe mentioned this pull request May 13, 2014

Add shard_min_doc_count parameter to terms aggregation #6143

Closed

brwe added a commit to brwe/elasticsearch that referenced this pull request May 13, 2014

use shard_min_doc_count also in TermsAggregation

869f5d3

This was discussed in issue elastic#6041 and elastic#5998 .

brwe added a commit that referenced this pull request May 14, 2014

use shard_min_doc_count also in TermsAggregation

468a0d0

This was discussed in issue #6041 and #5998 . closes #6143

brwe added a commit that referenced this pull request May 14, 2014

use shard_min_doc_count also in TermsAggregation

08e5789

This was discussed in issue #6041 and #5998 . closes #6143

clintongormley added >enhancement :Analytics/Aggregations Aggregations labels Jun 8, 2015
