
Significant terms misses some terms #5998

Closed
brwe opened this issue Apr 30, 2014 · 4 comments
@brwe (Contributor) commented Apr 30, 2014

When both a "min_doc_count" and a "size" are defined, significant terms might fail to return the right significant terms. The scenario is as follows:

  1. There are at least size * 2 terms that have a subsetDF < min_doc_count
  2. There are terms that have a subsetDF > min_doc_count
  3. Terms in 1. score higher than terms in 2.

Under these circumstances no significant terms are returned at all (a test is in the branches below). This makes it hard to use significant terms for processing natural text, which contains many low-frequency terms.

I think the reason for this behavior is that internally a priority queue with a maximum of 2*size entries is maintained. This queue uses the score as the criterion for deciding whether a bucket is kept. Because terms in 1. score higher than terms in 2., only terms with subsetDF < min_doc_count are kept, and therefore in the end no terms are returned at all.
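The failure mode described above can be sketched in a few lines of Python (a minimal simulation, not the actual Elasticsearch code; the function name, term names, and scoring values are made up for illustration):

```python
import heapq

def top_significant_terms(terms, size, min_doc_count):
    """Simulate the shard-local selection: keep the top 2 * size terms
    by score in a priority queue, then drop any whose subset_df is
    below min_doc_count. `terms` is a list of (name, score, subset_df).
    """
    queue_size = 2 * size
    # heapq is a min-heap, so pushing (score, ...) and capping the heap
    # size retains the highest-scoring terms and evicts the lowest.
    heap = []
    for name, score, subset_df in terms:
        heapq.heappush(heap, (score, name, subset_df))
        if len(heap) > queue_size:
            heapq.heappop(heap)  # evict the lowest-scoring candidate
    # The min_doc_count filter is only applied after selection.
    kept = [(name, score) for score, name, subset_df in heap
            if subset_df >= min_doc_count]
    return sorted(kept, key=lambda t: -t[1])[:size]

# 2 * size = 6 rare terms that score high but are too infrequent...
rare = [(f"rare{i}", 10.0 + i, 1) for i in range(6)]
# ...crowd out a frequent term that would pass min_doc_count.
frequent = [("frequent", 5.0, 50)]
print(top_significant_terms(rare + frequent, size=3, min_doc_count=10))
# → [] : the frequent term was evicted before the filter ran
```

The high-scoring rare terms fill the queue, the frequent term is evicted, and the final min_doc_count filter then removes everything that is left.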

It is unclear to me how to fix this. I prepared two potential fixes (not nearly ready for a PR, more proof of concept):

A. Assign score -MAX_FLOAT if subsetDf < minDocCount
This would cause terms with a too-low subsetDF to be added with a very low score. These terms would still get a chance to be returned if their subsetDF increases during merging. However, they might be evicted from the priority queue by higher-scoring terms before that and never be merged at all.

Branch: https://github.com/brwe/elasticsearch/tree/missing-sig-terms-1

B. Do not add terms to the priority queue if subsetDf < minDocCount
Easier to code, but unfortunately this means that terms that might meet the minDocCount criterion after merging are never added in the first place.

Branch: https://github.com/brwe/elasticsearch/tree/missing-sig-terms-2

Neither approach guarantees that the "correct" significant terms are returned.
I would prefer B.

@markharwood (Contributor) commented
Thanks for looking at this, Britta. The shard_size setting controls the size of the shard-local PriorityQueue and defaults to a multiple of the final size requested by the end user to help improve accuracy.
The terms aggregation employs a similar strategy but for the reasons you point out I use a larger multiplier for the default shard_size in significant_terms because that algo is not as certain as the terms aggregation in the selections it gathers from each shard.
Have you tried experimenting with a bigger shard_size to fix the accuracy issues? The best choice of shard_size will depend on your index sharding policy, the data you have stored, the chosen query/agg and the amount of RAM/network traffic you want to throw at the problem.
Before applying the tweaks you suggest, I think it would make sense to try to understand the problem better (although for the reasons I outlined above "the problem" can vary wildly). I think it would be useful to measure how many of "the right" terms found in a single-shard scenario are not found in a multi-shard scenario, where we perhaps vary the number of shards and the shard_size setting. If we find we have to increase shard_size to unreasonably large settings, we could employ some of the techniques you suggest and again measure the improvement based on how many of the correct results are returned.
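For reference, shard_size is set per aggregation. A sketch of such a request body follows (the query, field name, and the concrete values are hypothetical; only the parameter names come from this discussion):

```python
# Sketch of a significant_terms aggregation body with an explicit
# shard_size. Built as a Python dict for illustration; it would be
# sent as the JSON body of a search request.
request_body = {
    "query": {"match": {"title": "shoe"}},      # hypothetical query
    "aggs": {
        "sig": {
            "significant_terms": {
                "field": "body",                # hypothetical field
                "size": 100,            # final number of terms returned
                "shard_size": 3000,     # per-shard candidate queue, >> size
                "min_doc_count": 10,
            }
        }
    },
}
```

A larger shard_size widens the per-shard candidate queue so that frequent-but-lower-scoring terms survive long enough to be merged, at the cost of more RAM and network traffic during the reduce phase.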

@brwe (Contributor, author) commented May 2, 2014

My bad, I did not try the shard_size parameter. I can solve this problem for my case if I set shard_size to something like 30x the size parameter, which is 100.

A little more explanation: in my particular case I was trying to get the significant terms for Wikipedia articles (only partially indexed) that contain a particular term in the title, in this case "shoe". I have only one shard.
If I try getting the top 100 terms (min_doc_count is 10), then with the default settings only one term is returned. I actually get 100 terms if I set shard_size to 3k.

Not sure if 3k is unreasonable, and I also only tested this one case. In addition, I do not know the true significance of the returned terms. The documents (only 254 contained "shoe" in the title) did not contain many significant terms anyway, so most of the 100 terms did not make sense and also scored low.

I am still wondering if it makes sense to use the min_doc_count parameter to determine whether a candidate should be added to the pq. The reason is this:
I cannot know in advance how many terms with a low document frequency will score higher than the ones with a high document frequency. Therefore shard_size must be tuned, which might be tricky, since the right value can also depend on size and min_doc_count.

On the other hand, if no routing of docs to shards is involved, I can maybe assume that the documents of each class, and also the terms therein, are distributed evenly across shards. In that case it might be easier to not add terms to the pq that have subsetDF <= min_doc_count/number_of_shards (maybe not exactly that, but something similar), because even when summing up the subsetDF across shards, min_doc_count would presumably not be reached. However, I am not yet sure whether this would be easier than setting the shard_size parameter properly; it might depend on how rare the sought terms are and how random the distribution of docs across shards really is.
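The admission heuristic sketched above could look like this (a proof-of-concept illustration under the even-distribution assumption, not the actual implementation; the function names are made up):

```python
def shard_admission_threshold(min_doc_count, number_of_shards):
    """Heuristic from the discussion above: with even routing, a term
    that can reach min_doc_count globally should show up on each shard
    with roughly min_doc_count / number_of_shards documents. Terms
    below that per-shard count are unlikely to survive the reduce."""
    return max(1, min_doc_count // number_of_shards)

def admit_to_queue(subset_df, min_doc_count, number_of_shards):
    # Only add a candidate to the shard-local priority queue if its
    # local subsetDF could plausibly sum up to min_doc_count globally.
    return subset_df >= shard_admission_threshold(min_doc_count, number_of_shards)

# e.g. min_doc_count=10 spread over 5 shards → per-shard threshold of 2
print(admit_to_queue(1, min_doc_count=10, number_of_shards=5))   # → False
print(admit_to_queue(2, min_doc_count=10, number_of_shards=5))   # → True
```

The max(1, ...) floor keeps the threshold meaningful when min_doc_count is smaller than the shard count; how well the heuristic works depends entirely on how evenly the documents really are distributed.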

Anyway, this is just an idea. We could proceed with formal measurements, but this might be tricky and time consuming. If 3k is a reasonable pq size, we can also close this issue for now.

@markharwood (Contributor) commented
So if shard_size is the shard-local equivalent of the final size, then maybe we can also introduce shard_min_doc_count as an optional shard-level policy for controlling quality?

One thing to watch with Wikipedia: there tend to be a lot of very short docs that have no real wordy content on a theme but instead serve as disambiguation or synonym-type entries. It can be important to tune these out of this sort of analysis.

@brwe (Contributor, author) commented May 5, 2014

I like the shard_min_doc_count parameter. I'll prepare a pull request for that.
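For illustration, the proposed parameter would sit next to min_doc_count in the aggregation body (a sketch; the field name and the concrete values are hypothetical):

```python
# Sketch of a significant_terms aggregation using the proposed
# shard_min_doc_count, built as a Python dict for illustration.
agg = {
    "significant_terms": {
        "field": "body",              # hypothetical field name
        "min_doc_count": 10,          # global threshold, applied after the reduce
        "shard_min_doc_count": 2,     # per-shard admission threshold,
                                      # e.g. min_doc_count / number_of_shards
    }
}
```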

brwe added a commit to brwe/elasticsearch that referenced this issue May 5, 2014
…`shard_size`

Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as the criterion to determine
whether a bucket is kept. If many terms with low subsetDF score very high
but `min_doc_count` is set high, this might result in no terms being
returned, because the pq is filled with low-frequency terms which are all filtered
out in the end.

This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear which value to choose for this parameter,
because we cannot know how many low-frequency terms score higher than
the high-frequency terms we are actually interested in.

On the other hand, if no routing of docs to shards is involved, we can maybe
assume that the documents of each class, and also the terms therein, are distributed
evenly across shards. In that case it might be easier to not add terms to the pq that
have subsetDF <= `shard_min_doc_count`, which can be set to something like
`min_doc_count`/number of shards, because we would assume that even when summing up
the subsetDF across shards, `min_doc_count` will not be reached.

closes elastic#5998
brwe added a commit that referenced this issue May 7, 2014
…`shard_size` (same commit message as above)

closes #5998
closes #6041
@brwe brwe closed this as completed in 7944369 May 7, 2014
brwe added a commit to brwe/elasticsearch that referenced this issue May 13, 2014
brwe added a commit that referenced this issue May 14, 2014
This was discussed in issues #6041 and #5998.

closes #6143