
Significant terms misses some terms #5998

Closed
brwe opened this issue Apr 30, 2014 · 4 comments
@brwe (Contributor) commented Apr 30, 2014

When both a "min_doc_count" and a "size" are defined, significant terms might fail to return the right significant terms. The scenario is as follows:

  1. There are at least size * 2 terms that have a subsetDF < min_doc_count
  2. There are terms that have a subsetDF > min_doc_count
  3. Terms in 1. score higher than terms in 2.

Under these circumstances no significant terms are returned at all (a test is in the branches below). This makes it hard to use significant terms for processing natural text, which contains many low-frequency terms.

I think the reason for this behavior is that internally a priority queue with a maximum of 2*size entries is maintained. This queue uses the score as the criterion for deciding whether a bucket is kept. Because terms in 1. score higher than terms in 2., only terms with subsetDF < min_doc_count are kept, and therefore in the end no terms are returned at all.
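The failure mode described above can be sketched in a few lines of Python (a minimal simulation, not the actual Elasticsearch code; the function name, term names, and scoring values are made up for illustration):

```python
import heapq

def top_significant_terms(terms, size, min_doc_count):
    """Simulate the shard-local selection: keep the top 2 * size terms
    by score in a priority queue, then drop any whose subset_df is
    below min_doc_count. `terms` is a list of (name, score, subset_df).
    """
    queue_size = 2 * size
    # heapq is a min-heap, so pushing (score, ...) and capping the heap
    # size retains the highest-scoring terms and evicts the lowest.
    heap = []
    for name, score, subset_df in terms:
        heapq.heappush(heap, (score, name, subset_df))
        if len(heap) > queue_size:
            heapq.heappop(heap)  # evict the lowest-scoring candidate
    # The min_doc_count filter is only applied after selection.
    kept = [(name, score) for score, name, subset_df in heap
            if subset_df >= min_doc_count]
    return sorted(kept, key=lambda t: -t[1])[:size]

# 2 * size = 6 rare terms that score high but are too infrequent...
rare = [(f"rare{i}", 10.0 + i, 1) for i in range(6)]
# ...crowd out a frequent term that would pass min_doc_count.
frequent = [("frequent", 5.0, 50)]
print(top_significant_terms(rare + frequent, size=3, min_doc_count=10))
# → [] : the frequent term was evicted before the filter ran
```

The high-scoring rare terms fill the queue, the frequent term is evicted, and the final min_doc_count filter then removes everything that is left.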

It is unclear to me how to fix this. I prepared two potential fixes (not nearly ready for a PR, more proof of concept):

A. Assign score -MAX_FLOAT if subsetDf < minDocCount
This would cause terms with a too-low subsetDF to be added with a very low score. These terms would still get a chance to be returned if their subsetDF increases during merging. However, they might be evicted from the priority queue by higher-scoring terms before that and never be merged at all.

Branch: https://github.com/brwe/elasticsearch/tree/missing-sig-terms-1

B. Do not add terms to the priority queue if subsetDf < minDocCount
Easier to code, but unfortunately this means that terms that might meet the minDocCount criterion after merging are never added in the first place.

Branch: https://github.com/brwe/elasticsearch/tree/missing-sig-terms-2

Neither approach guarantees that the "correct" significant terms are returned.
I would prefer B.

@markharwood (Contributor) commented
Thanks for looking at this, Britta. The shard_size setting controls the size of the shard-local PriorityQueue and defaults to a multiple of the final size requested by the end user to help improve accuracy.
The terms aggregation employs a similar strategy but for the reasons you point out I use a larger multiplier for the default shard_size in significant_terms because that algo is not as certain as the terms aggregation in the selections it gathers from each shard.
Have you tried experimenting with a bigger shard_size to fix the accuracy issues? The best choice of shard_size will depend on your index sharding policy, the data you have stored, the chosen query/agg and the amount of RAM/network traffic you want to throw at the problem.
Before applying the tweaks you suggest, I think it would make sense to try to understand the problem better (although for the reasons I outlined above "the problem" can vary wildly). I think it would be useful to measure how many of "the right" terms found in a single-shard scenario are not found in a multi-shard scenario, where we perhaps vary the number of shards and the shard_size setting. If we find we have to increase shard_size to unreasonably large settings, we could employ some of the techniques you suggest and again measure the improvement based on how many of the correct results are returned.
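For reference, shard_size is set per aggregation. A sketch of such a request body follows (the query, field name, and the concrete values are hypothetical; only the parameter names come from this discussion):

```python
# Sketch of a significant_terms aggregation body with an explicit
# shard_size. Built as a Python dict for illustration; it would be
# sent as the JSON body of a search request.
request_body = {
    "query": {"match": {"title": "shoe"}},      # hypothetical query
    "aggs": {
        "sig": {
            "significant_terms": {
                "field": "body",                # hypothetical field
                "size": 100,            # final number of terms returned
                "shard_size": 3000,     # per-shard candidate queue, >> size
                "min_doc_count": 10,
            }
        }
    },
}
```

A larger shard_size widens the per-shard candidate queue so that frequent-but-lower-scoring terms survive long enough to be merged, at the cost of more RAM and network traffic during the reduce phase.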

@brwe (Contributor, author) commented May 2, 2014

My bad, I did not try the shard_size parameter. I can solve this problem for my case if I set shard_size to something like 30x the size parameter, which is 100.

A little more explanation: in my particular case I was trying to get the significant terms for Wikipedia articles (only partially indexed) that contain a particular term in the title, in this case "shoe". I have only one shard.
If I try getting the top 100 terms (min_doc_count is 10), then with the default settings only one term is returned. I actually get 100 terms if I set shard_size to 3k.

Not sure if 3k is unreasonable, and I also only tested this one case. In addition, I do not know the true significance of the returned terms. The documents (only 254 contained "shoe" in the title) did not contain many significant terms anyway, so most of the 100 terms did not make sense and also scored low.

I am still wondering if it makes sense to use the min_doc_count parameter to determine whether a candidate should be added to the pq. The reason is this:
I cannot know in advance how many terms with a low document frequency will score higher than the ones with a high document frequency. Therefore shard_size must be tuned, which might be tricky, since the right value can also depend on size and min_doc_count.

On the other hand, if no routing of docs to shards is involved, I can maybe assume that the documents of each class, and also the terms therein, are distributed evenly across shards. In that case it might be easier to not add terms to the pq that have subsetDF <= min_doc_count/number_of_shards (maybe not exactly that, but something similar), because even when summing up the subsetDF across shards, min_doc_count would presumably not be reached. However, I am not yet sure whether this would be easier than setting the shard_size parameter properly; it might depend on how rare the sought terms are and how random the distribution of docs across shards really is.
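The admission heuristic sketched above could look like this (a proof-of-concept illustration under the even-distribution assumption, not the actual implementation; the function names are made up):

```python
def shard_admission_threshold(min_doc_count, number_of_shards):
    """Heuristic from the discussion above: with even routing, a term
    that can reach min_doc_count globally should show up on each shard
    with roughly min_doc_count / number_of_shards documents. Terms
    below that per-shard count are unlikely to survive the reduce."""
    return max(1, min_doc_count // number_of_shards)

def admit_to_queue(subset_df, min_doc_count, number_of_shards):
    # Only add a candidate to the shard-local priority queue if its
    # local subsetDF could plausibly sum up to min_doc_count globally.
    return subset_df >= shard_admission_threshold(min_doc_count, number_of_shards)

# e.g. min_doc_count=10 spread over 5 shards → per-shard threshold of 2
print(admit_to_queue(1, min_doc_count=10, number_of_shards=5))   # → False
print(admit_to_queue(2, min_doc_count=10, number_of_shards=5))   # → True
```

The max(1, ...) floor keeps the threshold meaningful when min_doc_count is smaller than the shard count; how well the heuristic works depends entirely on how evenly the documents really are distributed.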

Anyway, this is just an idea. We could proceed with formal measurements, but this might be tricky and time consuming. If 3k is a reasonable pq size, we can also close this issue for now.

@markharwood (Contributor) commented
So if shard_size is the shard-local equivalent of the final size, then maybe we can also introduce shard_min_doc_count as an optional shard-level policy for controlling quality?

One thing to watch with Wikipedia: there tend to be a lot of very short docs that have no real wordy content on a theme but instead serve as disambiguation or synonym-type entries. It can be important to tune these out of this sort of analysis.

@brwe (Contributor, author) commented May 5, 2014

I like the shard_min_doc_count parameter. I'll prepare a pull request for that.
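For illustration, the proposed parameter would sit next to min_doc_count in the aggregation body (a sketch; the field name and the concrete values are hypothetical):

```python
# Sketch of a significant_terms aggregation using the proposed
# shard_min_doc_count, built as a Python dict for illustration.
agg = {
    "significant_terms": {
        "field": "body",              # hypothetical field name
        "min_doc_count": 10,          # global threshold, applied after the reduce
        "shard_min_doc_count": 2,     # per-shard admission threshold,
                                      # e.g. min_doc_count / number_of_shards
    }
}
```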

brwe added a commit to brwe/elasticsearch that referenced this issue May 5, 2014
…`shard_size`

Significant terms internally maintain a priority queue per shard with a size potentially
lower than the number of terms. This queue uses the score as the criterion to determine
whether a bucket is kept. If many terms with low subsetDF score very high
but `min_doc_count` is set high, this might result in no terms being
returned, because the pq is filled with low-frequency terms which are all filtered
out in the end.

This can be avoided by increasing the `shard_size` parameter to a higher value.
However, it is not immediately clear which value to choose for this parameter,
because we cannot know how many low-frequency terms score higher than
the high-frequency terms we are actually interested in.

On the other hand, if no routing of docs to shards is involved, we can maybe
assume that the documents of each class, and also the terms therein, are distributed
evenly across shards. In that case it might be easier to not add terms to the pq that
have subsetDF <= `shard_min_doc_count`, which can be set to something like
`min_doc_count`/number of shards, because we would assume that even when summing up
the subsetDF across shards, `min_doc_count` will not be reached.

closes elastic#5998
brwe added a commit that referenced this issue May 7, 2014
…`shard_size` (same commit message as above)

closes #5998
closes #6041
@brwe brwe closed this as completed in 7944369 May 7, 2014
brwe added a commit to brwe/elasticsearch that referenced this issue May 13, 2014
brwe added a commit that referenced this issue May 14, 2014
This was discussed in issues #6041 and #5998.

closes #6143