Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shard_min_doc_count parameter for significant terms similar to shard_size #6041

Closed
wants to merge 12 commits into from

Conversation

brwe
Copy link
Contributor

@brwe brwe commented May 5, 2014

Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the min_doc_count is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.

This can be avoided by increasing the shard_size parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.

On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= shard_min_doc_count which can be set to something like
min_doc_count/number of shards because we would assume that even when summing up
the subsetDF across shards min_doc_count will not be reached.

brwe added 3 commits May 5, 2014 10:29
…`shard_size`

Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.

This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.

On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= `shard_min_doc_count` which can be set to something like
`min_doc_count`/number of shards  because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.

closes elastic#5998
@jpountz
Copy link
Contributor

jpountz commented May 5, 2014

I think this parameter would be useful to the terms aggregation as well (in a separate change).

@@ -125,7 +134,7 @@ public AggregatorFactory parse(String aggregationName, XContentParser parser, Se
}

IncludeExclude includeExclude = incExcParser.includeExclude();
return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), requiredSize, shardSize, minDocCount, includeExclude, executionHint, filter);
return new SignificantTermsAggregatorFactory(aggregationName, vsParser.config(), requiredSize, shardSize, minDocCount, shardMinDocCount, includeExclude, executionHint, filter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should validate that minDocCount >= shardMinDocCount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@brwe
Copy link
Contributor Author

brwe commented May 5, 2014

Thanks for the review! I implemented the changes.
I will make a separate pull request for the terms aggregation once this is pushed. Or do we need a separate issue for that also?

@jpountz
Copy link
Contributor

jpountz commented May 5, 2014

LGTM!

I will make a separate pull request for the terms aggregation once this is pushed. Or do we need a separate issue for that also?

A pull request is fine, thanks!

@jpountz jpountz removed the review label May 5, 2014
@markharwood
Copy link
Contributor

LGTM - just added a comment on the docs re documenting side-effects of shard_min_doc_count settings.
Part of me wonders if some of the low-level settings we offer (shard/reducer PQ sizes, shard/reducer frequency filters) are a bit techy and could be abstracted to more user-friendly policy choices e.g. "find rare things".

@brwe
Copy link
Contributor Author

brwe commented May 6, 2014

Thanks! I added a new commit to enhance the documentation.

@brwe brwe added the review label May 6, 2014
brwe added a commit that referenced this pull request May 7, 2014
…`shard_size`

Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as criterion to determine if
a bucket is kept or not. If many terms with low subsetDF score very high
but the `min_doc_count` is set high, this might result in no terms being
returned because the pq is filled with low frequent terms which are all sorted
out in the end.

This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear to which value this parameter must be set
because we can not know how many terms with low frequency are scored higher that
the high frequent terms that we are actually interested in.

On the other hand, if there is no routing of docs to shards involved, we can maybe
assume that the documents of classes and also the terms therein are distributed evenly
across shards. In that case it might be easier to not add documents to the pq that have
subsetDF <= `shard_min_doc_count` which can be set to something like
`min_doc_count`/number of shards  because we would assume that even when summing up
the subsetDF across shards `min_doc_count` will not be reached.

closes #5998
closes #6041
@brwe brwe closed this in 7944369 May 7, 2014
@brwe brwe removed the review label May 7, 2014
brwe added a commit to brwe/elasticsearch that referenced this pull request May 13, 2014
brwe added a commit that referenced this pull request May 14, 2014
This was discussed in issue #6041 and #5998 .

closes #6143
brwe added a commit that referenced this pull request May 14, 2014
This was discussed in issue #6041 and #5998 .

closes #6143
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants