
terms facet gives wrong count with n_shards > 1 #1305

Closed
jmchambers opened this issue Sep 6, 2011 · 77 comments

@jmchambers

I'm working with nested documents and have noticed that my faceted search interface is giving the wrong counts when I have more than one shard. To be more specific, I'm working with RDF triples (entity > attribute > value) and I'm nesting the attributes (called predicates in my example):

{
  "_id" : "512a2c022f0b4e3daa341e6c8bcf6c2f",
  "url": "http://dbpedia.org/resource/Alan_Shepard",
  "predicates": [
    {
      "type": "type",
      "string_value": ["thing", "person", "astronaut"]
    }, {
      "type": "label",
      "string_value": ["Alan Shepard"]
    }, {
      "type": "time in space",
      "float_value": [216.950]
    },
    ... lots more
  ]
}

I've created a shell script (https://gist.github.com/1196986) that recreates the problem with a fresh index. The created data set has these totals:

  • thing (30)
  • creative work (20)
  • video game (10)
  • tv show (10)
  • people (10)

With only one shard the following query gives the correct counts no matter what the size parameter is set to:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_counts": {
      "terms": {
        "field": "string_value",
        "size": 5
      },
      "nested": "predicates",
      "facet_filter": {
        "term": {
          "type": "type"
        }
      }
    }
  }
}

However, with more than one shard the size parameter affects the accuracy of the counts. If it is equal to or greater than the number of distinct terms matched (5 in this case) then the counts are correct, but the terms at the bottom of the list start to display low counts as you reduce the size parameter:

With "size" : 4

  • thing (30)
  • creative work (20)
  • video game (10)
  • tv show (9)

With "size" : 3

  • thing (30)
  • creative work (15)
  • video game (9)

With "size" : 2

  • thing (30)
  • creative work (15)

So it looks like the sub-totals from some of the shards aren't being included for some reason. BTW I'm on Ubuntu and the problem seems to affect all versions of ES I've tried (0.17.0, 0.17.1 and 0.17.6). Any ideas...?
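
For reference, here is the sort of over-sized request I mean (the same facet as above, with the size bumped well past the number of distinct type values; the 50 is an arbitrary upper bound for this data set):

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_counts": {
      "terms": {
        "field": "string_value",
        "size": 50
      },
      "nested": "predicates",
      "facet_filter": {
        "term": {
          "type": "type"
        }
      }
    }
  }
}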

P.S. absolutely loving ES - it's made my life a lot easier :)

@losomo

losomo commented Oct 17, 2011

+1 for this bug. I have reproduced the problem using documents with just one field.
Complete test (as run on version 0.17.8):
as Perl script: https://gist.github.com/1292897
generated shell test: https://gist.github.com/1292912
The first error can be seen on line 951.
Expected:

"terms" : [
 {
 "count" : 10,
 "term" : "user 10"
 }
 ],

Got:

"terms" : [
 {
 "count" : 7,
 "term" : "user 9"
 }
 ],

@losomo

losomo commented Oct 17, 2011

After more experimenting I'd say that it's caused by naïve top-N facet merging. Something like Phase three here should be added:

http://wiki.apache.org/solr/DistributedSearchDesign#Phase_3:_REFINE_FACETS_.28only_for_faceted_search.29

Or something smarter: http://netcins.ceid.upatras.gr/papers/Klee_VLDB.pdf

@kimchy
Member

kimchy commented Oct 18, 2011

Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)
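
For example (made-up numbers, not from any of the gists above), with "size": 2 and two shards:

shard 1 returns: [ { "term": "thing", "count": 15 }, { "term": "creative work", "count": 12 } ]
                 ("video game" has 8 hits on this shard but is not in its top 2, so it is never reported)
shard 2 returns: [ { "term": "thing", "count": 15 }, { "term": "video game", "count": 9 } ]
                 ("creative work" has 8 hits on this shard but is not in its top 2)

merged response: [ { "term": "thing", "count": 30 }, { "term": "creative work", "count": 12 } ]
                 (the true count for "creative work" is 20)

The per-shard contributions that never made it into a shard's top N are simply lost in the merge.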

@losomo

losomo commented Oct 18, 2011

Right. The "phase 3 thingy" only solves the second problem:

  1. The recall is not 100% (some values may be simply missing)
  2. The numbers are often lower than they should be.

But while the first problem can easily go unnoticed, the second one leads to quite a lousy user experience in very common faceting scenarios: a web page displays the top 10 commenters with the number of their comments in parentheses, you click the last one, listed with 30 comments, and voilà, it shows her 48 comments. That is hard not to notice.

In the meantime, a slow and not-always-reliable workaround:
Set size to the value you actually want plus 150 (or another magic constant) when requesting the facets, then keep only the top size terms of the response on the client side.
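
Roughly like this (field name made up; you only want the top 10, so keep just the first 10 terms of the response on the client):

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "top_commenters": {
      "terms": {
        "field": "user",
        "size": 160
      }
    }
  }
}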

@karussell
Contributor

The phase 3 thingy is not really a solution

@kimchy: why do you think so? Isn't it a similar approach to query-then-fetch, only for facets and their counts instead of docs and their scores? (It could even be folded into the second query to avoid extra traffic, no?)

Another workaround would be 'routing': make sure every facet value goes to the same shard. But then the data needs a lot of facet values, none of them with too high a count, to avoid unbalanced shards ...
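
As a sketch (the type and field names are made up), something along these lines in the mapping would route every document by its user value, so a given user's documents all live on one shard and its per-shard count is already the global count:

{
  "comment": {
    "_routing": {
      "required": true,
      "path": "user"
    }
  }
}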

@agnellvj

Is this on the roadmap? We discovered this same problem with regular fields and large datasets with multiple nodes and shards. This is problematic for us since the faceted counts can fluctuate wildly based on what filters we apply. Is there a workaround?

@piskvorky

I still see this bug in 0.19.4, and the wrong counts are a show-stopper for us too.

Adding "150 or another magical constant" only helps for tiny datasets; is there a more robust workaround until this is fixed properly?

@jmchambers
Author

+1 for a better workaround or fix

The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch', but if accurate counts are critical to your app and your index isn't too big you might get away with it...

@tarunjangra

Any update on this? I have an application where counts are critically important. You suggest limiting to one shard for the entities that undergo such queries.
Does it make sense to route entities by entity type to corresponding shards?

@karmi
Contributor

karmi commented Jul 17, 2012

The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch' (...)

@jmchambers Not really -- you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation. You can use aliases (or wildcards, ...) to query them in a reasonable way from your application. Would multiple one-shard indices work around the facet problem?
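
As a sketch of what I mean (the index and alias names are just examples): create each weekly index with a single shard,

{
  "settings": {
    "index": {
      "number_of_shards": 1
    }
  }
}

and add every new index to the same alias so the application always queries one name:

{
  "actions": [
    { "add": { "index": "events-2012-29", "alias": "events" } }
  ]
}

(the first body goes with the index creation request, the second one to the _aliases endpoint)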

@piskvorky

@karmi: only if the indices are large enough. To my knowledge, the scoring stats (IDF etc) are computed per-index (actually, per-shard by default). So comparing relevancy scores across indices is like comparing apples to oranges, even with dfs_query_then_fetch.

But once the indices are large enough, this doesn't matter anymore as the stats converge (assuming identical doc/word distribution in each index, for the stats to actually converge).

@ajhalani

ajhalani commented Aug 9, 2012

+1 for a fix. The workarounds (one shard, multiple one-shard indices, etc.) don't sound convincing.

@Downchuck

I'm a bit confused by this issue: I have five shards and the top value is ranked #1 on every shard, by a large margin. Still, when I facet, I get a count 20% smaller than when I simply query for that value directly.

@jrydberg

jrydberg commented Sep 1, 2012

@karmi

you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation

Won't you run into the same problem when doing a faceted count over multiple indices?

@giamma

giamma commented Oct 29, 2012

Any news about this issue?

@tgruben

tgruben commented Nov 8, 2012

Any hope of this being fixed in the near future?

@danfairs

For what it's worth, the multiple index/single shard approach is working for us. Our application has a new index per week of data anyway, so it's not actually too painful for our current usage.

@markwaddle

+1 for a fix
My application has 7m+ docs of varying sizes (a 600GB index) and growing, so 1 shard is not feasible.
For my application I am willing to trade performance (and/or hardware resources) for accurate facet counts.

@webmusing

+1 for a fix

The terms stats facet seems to be affected by the same issue as well.

@piskvorky

It seems there is no fix forthcoming; how about at least making the most common scenario less annoying/more palatable?

What I mean by that: when ES returns, let's say, 20 facet terms (with the wrong counts...), it could automatically run a second request against all shards asking for an accurate count of only those exact 20 terms, then add up these accurate counts and return only that as the result. Would that make sense?
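
This is something the client can already do by hand today, at the cost of that extra request: take the terms from the first (inaccurate) response and ask for one exact filter facet per term in a second request. A sketch, ignoring the nested setup from the original report and reusing its field and terms:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "facets": {
    "thing": {
      "filter": { "term": { "string_value": "thing" } }
    },
    "creative work": {
      "filter": { "term": { "string_value": "creative work" } }
    }
  }
}

Filter facets just count matching documents, so these totals are exact across shards.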

@vincentpoon

+1 for a fix. Incorrect facet counts are resulting in a bad user experience.

@songday

songday commented Jul 5, 2013

It seems this bug is not completely fixed.

I did a facet query sorted by count, and the results were right:
"terms" : [ {
"term" : "AAA",
"count" : 59,
"total_count" : 59,
"min" : 1.0,
"max" : 54.0,
"total" : 391.0,
"mean" : 6.627118644067797
}, {
"term" : "BBB",
"count" : 55,
"total_count" : 55,
"min" : 1.0,
"max" : 17.0,
"total" : 154.0,
"mean" : 2.8
}]

but if I sort by total (same query), the results are not right:
"terms" : [ {
"term" : "AAA",
"count" : 56, //this is not right
"total_count" : 56,
"min" : 1.0,
"max" : 54.0,
"total" : 388.0,
"mean" : 6.928571428571429
}, {
"term" : "BBB",
"count" : 56, //this is not right
"total_count" : 56,
"min" : 1.0,
"max" : 17.0,
"total" : 171.0,
"mean" : 3.0535714285714284
}]

@tommymonk

+1

@HeyBillFinn

Hi,

I am trying to run a search query using multiple facets, and I want the facet numbers to update in the same way as the site Zappos does. For example, when I search for Nike in the men's shoe section, I get facets for Brand and Color.

When I narrow my search by Brand (by selecting 'Nike'), the numbers within the Brand facet do not change, but the numbers within the Color facet change to reflect the narrowed search results.

I can widen my search by selecting 'Nike Action', and still no numbers within the Brand facet are updated, but numbers in the Color facet have been updated to reflect the additional results.

I can see the same expected results if I select a term within the Color facet.

I can think of two ways to do this using Elasticsearch, and I'm looking for any guidance/suggestions as to the best way to implement this.

  1. Filtered query using fairly complex facet filters within each facet, including global = true flag.
  2. Top-level filter (which I understand does not affect the facet results) with slightly less complex facet filters within each facet.

Which of the two options would perform better? Is there a better option that I'm not thinking of? I can add example JSON if it will help explain my thoughts.
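
For instance, option 2 could look roughly like this once the user has selected the brand 'Nike' (field names are made up):

{
  "query": { "match": { "name": "nike" } },
  "filter": { "term": { "brand": "Nike" } },
  "facets": {
    "brands": {
      "terms": { "field": "brand" }
    },
    "colors": {
      "terms": { "field": "color" },
      "facet_filter": { "term": { "brand": "Nike" } }
    }
  }
}

The top-level filter narrows the hits without touching the facets, so the Brand facet keeps its counts, while the facet_filter on the Color facet applies the brand selection to the colors only.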

Thanks!

Bill

@danfairs

danfairs commented Aug 2, 2013

@finn1317, this is something you should ask on the elasticsearch mailing list. I don't think it's related to the issue being discussed here.

@peakx

peakx commented Aug 8, 2013

+1 for a fix.

@SanderDemeester

Question: could this bug be related to the following result I'm getting?
If I request the first JSON document where the value of field x is y, I get a result back.
But when I request all distinct values for field x, I get a list of values back, and 'y' is not included.

The query uses all_terms and I match using match_all.

@igal-getrailo
Contributor

@jpountz Thank you for the explanation. I agree with @piskvorky that a parameter like "favor_accuracy_over_performance" (or whatever name) would be a good idea moving forward.

But for now my question is: is there some sort of formula, or rule of thumb, as to what the shard_size parameter should be? As I specified in my question on the mailing list -- https://groups.google.com/d/msg/elasticsearch/xYRaJa04wfc/lcmEvP53rR4J -- I have 5 shards in my index, and I see this problem when my agg result is less than 50 -- is that because shard_size has a default of 10, so num_shards * shard_size?

Also, for some reason my size parameter is ignored and I get 10 values no matter what size I pass.

@megastef

megastef commented May 3, 2014

@igal-getrailo Did you read the post above from @karmi: "Check out the new cardinality aggregation added in #5426"? Did you try to use aggregations instead of facets? There is a nice presentation from the Elasticsearch team here: https://speakerdeck.com/bleskes/deep-dive-into-aggregations
As far as I understood, aggregations will replace facets in the future, so I think the ES team listens to its users.

@igal-getrailo
Contributor

@megastef yes, I am using the aggregations and am getting the same problem (didn't try the cardinality aggs yet but will check it out).

I only found this bug report here after trying to get a response in the mailing list for over a week, but as can be seen in the example I posted at -- https://groups.google.com/forum/#!msg/elasticsearch/xYRaJa04wfc -- I am passing "aggs": {"available_tags": {"terms": {"field": "tags"}}, "size": 20} with the request.

And I actually went over Boaz's slides a couple of days ago, but I guess without the audio of the presentation the slides are not that helpful.

Anyway, at first glance the cardinality aggregation looks promising. I will look deeper into it. Thanks!

@igal-getrailo
Contributor

@megastef It looks like the cardinality aggregation can do the trick when used as a sub-agg of the terms aggregation. In my case, however, I was placing the size param in the wrong scope, and it was therefore ignored. It should have been "aggs": {"available_tags": {"terms": {"field": "tags", "size": 20}}} instead.

The docs were updated since I last checked them, and apparently setting the size param properly to a value higher than shard_size also raises shard_size, so in my specific case (I don't have a high-cardinality data set) setting the size param properly resolves the issue.

Thank you and @jpountz for your help.

@jpountz
Contributor

jpountz commented May 4, 2014

but for now my question is: is there some sort of a formula, or rule of thumb as to what the shard_size parameter should be?

There is no formula, it depends on what your data looks like. Terms that are missing or have inaccurate counts in the final response are those that are not in the top shard_size terms on one shard or more. So if your shards are uniform, even a slight increase of the shard size might improve the accuracy of counts a lot.
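
For example, with a terms aggregation you can keep the final size at 20 while asking every shard for a larger candidate list (the 100 here is arbitrary; tune it to your data):

{
  "size": 0,
  "aggs": {
    "available_tags": {
      "terms": {
        "field": "tags",
        "size": 20,
        "shard_size": 100
      }
    }
  }
}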

@igal-getrailo
Contributor

@jpountz Thank you!

@speedplane
Contributor

It seems that aggregations are the future and facets are on the way out. However, I want faceting functionality and I am trying to determine whether I should use facets with shard_size set, or a cardinality aggregation with precision_threshold set. Is there any documentation comparing the performance and accuracy of those two related options? Is one always better than the other, and if not, what factors influence that design decision?
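
For clarity, these are the two knobs I am comparing (field name made up). A terms facet with shard_size raised on every shard:

{
  "facets": {
    "top_users": {
      "terms": { "field": "user", "size": 10, "shard_size": 100 }
    }
  }
}

versus a cardinality aggregation with its precision_threshold:

{
  "aggs": {
    "unique_users": {
      "cardinality": { "field": "user", "precision_threshold": 1000 }
    }
  }
}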

@eliasah
Contributor

eliasah commented Aug 25, 2014

Have you considered using the aggregation feature within the facets feature to fix this issue?

@ajhalani

It would be nice to know whether there are plans to review/work on this in the near future.
As mentioned in some of the previous updates, an option to force an extra round trip that returns correct values would be a nice compromise for many users.

@lloy0076

Is this issue likely to be "fixed" given that it appears facets, en masse, are deprecated? See the warning on http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets.html.

@clintongormley

@lloy0076 it definitely won't be fixed in facets. However, the same problem exists in aggregations, due to the nature of distributed systems. We'll leave this issue open as a reminder - at some stage we hope to provide a slower but more accurate version.

@jodok

jodok commented Feb 13, 2015

Yes, if you use the "scatter/gather" pattern you naturally hit the limits of a two-phase approach. When building Crate.IO* (a SQL database built on Elasticsearch, https://github.com/crate/crate) we decided to extend it with a distributed map/reduce algorithm to provide accurate aggregations.
*(disclaimer: I'm one of the Crate guys)

@otisg

otisg commented Feb 14, 2015

@jodok Because @clintongormley used the words "slower" and "more accurate" in "we hope to provide a slower but more accurate version", can you comment on the performance and accuracy of your implementation? Specifically, @clintongormley is implying the numbers will be closer to the truth, but not necessarily 100% accurate, and I'm wondering whether in the case of Crate.IO they are always 100% accurate. And, of course, I'm wondering whether you've compared the performance of the Crate.IO approach vs. the ES approach (at scale)? Thanks.

@jodok

jodok commented Feb 14, 2015

@otisg The performance of "normal" queries that don't need the two-phase map/reduce approach is the same, but exact aggregations need one more cluster roundtrip. Instead of 1. collect data (all shards), 2. merge result (one node), an additional step is added: 1. collect (all shards), 2. reshuffle data (to multiple nodes) and do the aggregation, 3. merge (one node). This requires one more network roundtrip in the cluster, but usually that costs less than a millisecond. The method scales horizontally; performance heavily depends on cardinality and the size of the result set. Happy to help run tests on sample data on a demo cluster.

@otisg

otisg commented Feb 15, 2015

@clintongormley @kimchy couldn't ES "borrow" the approach/idea that Crate or Solr(Cloud) uses? Both give 100% accurate counts.

@ppf2
Member

ppf2 commented Jul 14, 2015

Facets have been deprecated for a while now and have been removed from master/2.0 (#7337) with aggregations being the replacement. Time to close this ticket?

@clintongormley

Agreed

@thanodnl
Contributor

I think it is worth keeping this ticket open, as the aggregations framework suffers from the same issue with terms aggregations on more than one shard.

@clintongormley

I've opened #12316 instead
