Incorrect bar ordering for unique count with terms sub aggregation #3314

Closed
antoinebaudoux opened this issue Mar 11, 2015 · 16 comments · Fixed by #8397
Labels
bug Fixes for quality problems that affect the customer experience v5.1.1

Comments

@antoinebaudoux

screen shot 2015-03-10 at 17 45 15
screen shot 2015-03-10 at 17 45 58

@stormpython
Contributor

Adding notes to the above issue which was brought up at Elastic{ON}. Essentially, the issue is with the ordering of values in the bar chart for sub aggregations on unique count. The order should be descending by value, but due to the split, the bars are unordered by unique count.

I need to dive into the issue to debug.

@stormpython stormpython self-assigned this Mar 11, 2015
@stormpython stormpython changed the title Incorrect ordering of terms sub agg Incorrect ordering of terms for unique count with sub aggregation Mar 11, 2015
@stormpython stormpython added the bug Fixes for quality problems that affect the customer experience label Mar 11, 2015
@stormpython
Contributor

This seems to be a bug in vislib; I just reproduced it. The response from Elasticsearch appears to return the results in the correct order, but the chart displays the data out of order.

@zaakiy

zaakiy commented Mar 11, 2015

+1. I have reproduced this also.

/* sent while mobile */


From: Antoine Baudoux <notifications@github.com>
Sent: 11/03/2015 11:49 AM
To: elastic/kibana <kibana@noreply.github.com>
Subject: [kibana] Incorrect ordering of terms sub agg (#3314)

[screen shot 2015-03-10 at 17 45 15] https://cloud.githubusercontent.com/assets/5154448/6588348/a74418d8-c74d-11e4-8ca2-5e7283a67845.png
[screen shot 2015-03-10 at 17 45 58] https://cloud.githubusercontent.com/assets/5154448/6588347/a730c67a-c74d-11e4-9d9e-933dd8a4e6eb.png



@antoinebaudoux
Author

If you look at both screenshots, you can see that the ordering seems to be correct with the split, since it is identical to the ordering without the split. It's more the bar heights that are messed up.

@stormpython
Contributor

@ab-taktik yes, that is what I was referring to when I titled it ordering. By default, the bars should be ordered on the x axis in descending fashion.

@stormpython stormpython changed the title Incorrect ordering of terms for unique count with sub aggregation Incorrect bar ordering for unique count with terms sub aggregation Mar 13, 2015
@blop

blop commented Mar 16, 2015

+1

@ajrasch

ajrasch commented Mar 24, 2015

+1

@antoinebaudoux
Author

Hello, any news on this? Do you have any idea what the root cause is?

@antoinebaudoux
Author

Maybe this has to do with the approximate nature of count/cardinality aggregations, and also with the fact that we take only the top X terms rather than all terms.

@stormpython
Contributor

@ab-taktik I think you may be right. By default, Elasticsearch returns buckets in descending order of doc_count, so we have been rendering bar charts under that assumption. However, this is not always the case.

Take for example this dataset and this chart:

screen shot 2015-03-25 at 4 38 00 pm

As you can see, the second set of stacked bars in this example should go first. It is not returned first because the total doc_count is higher in the first bar, but when you subtract the sum_other_doc_count from the doc_count to get the value that is actually displayed, it's clear why the first set of stacked bars is smaller than the second.

Best solution: re-order the buckets returned from Elasticsearch based on doc_count - sum_other_doc_count. I will add an appropriate timetable for a fix.
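As a rough illustration of the re-ordering proposed here, the following Python sketch sorts x-axis buckets by the number of documents that are actually drawn. The response shape (a `doc_count` on each bucket and a `sum_other_doc_count` on the nested terms result) follows what is described in this thread; the sub-aggregation name `split` and the sample numbers are made up for the example. It only makes sense when the metric is a plain document count.

```python
def displayed_doc_count(bucket, sub_agg="split"):
    """Documents actually drawn for one x-axis bucket.

    `sub_agg` is a hypothetical name for the nested terms aggregation.
    """
    return bucket["doc_count"] - bucket[sub_agg]["sum_other_doc_count"]


def reorder_by_displayed(buckets, sub_agg="split"):
    """Return the buckets sorted by displayed value, largest first."""
    return sorted(
        buckets,
        key=lambda b: displayed_doc_count(b, sub_agg),
        reverse=True,
    )


# Invented sample data: "windows" has more documents overall, but most of
# them fall outside the sub-buckets that were kept.
buckets = [
    {"key": "windows", "doc_count": 100,
     "split": {"sum_other_doc_count": 60, "buckets": []}},
    {"key": "ios", "doc_count": 80,
     "split": {"sum_other_doc_count": 5, "buckets": []}},
]

ordered_keys = [b["key"] for b in reorder_by_displayed(buckets)]
print(ordered_keys)  # ['ios', 'windows'] (75 displayed vs. 40 displayed)
```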

@stormpython stormpython removed their assignment Mar 25, 2015
@spalger spalger added release_note:enhancement and removed v4.1.0 bug Fixes for quality problems that affect the customer experience labels Apr 2, 2015
@spalger
Contributor

spalger commented Apr 2, 2015

@stormpython @ab-taktik this is just the way that aggregations work. Here is a hypothetical step-by-step of what's happening in elasticsearch:

  1. the x-axis agg defines that the following happen
    1. takes the entire result set and splits it into buckets based on scheduleFull.raw
    2. the "the unique count of user.ids" is calculated for each bucket
    3. the buckets are sorted in descending order based on the "unique count of user.ids"
    4. the first 50 buckets are considered the source for the next phase
  2. a copy of the split-bars agg begins to execute inside each bucket from step 1(i), individually
    1. the bucket is split up into sub-buckets based on language.raw
    2. each sub-bucket calculates its "unique count of user.ids"
    3. the sub-buckets are sorted descending based on the "unique count of user.ids"
    4. the first 10 buckets are selected and returned in the elasticsearch response.

This process is precisely what we are visualizing in the second screenshot, and why we can't just subtract the sum_other_doc_count.

In the outlined steps, "unique count of user.ids" can be replaced with any metric, even "99.99th percentile", and therefore the sum_other_doc_count would not have any relevance.
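The two phases described above can be mimicked on a toy dataset. This is only a sketch: it uses exact Python sets rather than Elasticsearch's approximate cardinality, and the field names `schedule`, `language`, and `user` are simplified stand-ins for the fields mentioned in the steps. It shows how the sort value of a bar and its drawn height can diverge.

```python
from collections import defaultdict


def unique_users(docs):
    """Exact unique count of users (stand-in for the cardinality metric)."""
    return len({d["user"] for d in docs})


def group_by(docs, field):
    groups = defaultdict(list)
    for d in docs:
        groups[d[field]].append(d)
    return groups


def top_buckets(docs, field, size):
    """Split docs by `field`, rank buckets by unique users, keep `size`."""
    ranked = sorted(
        group_by(docs, field).items(),
        key=lambda kv: unique_users(kv[1]),
        reverse=True,
    )
    return ranked[:size]


def simulate(docs, x_size, split_size):
    bars = []
    # Phase 1: x-axis buckets, sorted by the metric over ALL their documents.
    for x_key, x_docs in top_buckets(docs, "schedule", x_size):
        # Phase 2: inside each bucket, keep only the top sub-buckets.
        segments = [
            (key, unique_users(sub_docs))
            for key, sub_docs in top_buckets(x_docs, "language", split_size)
        ]
        bars.append({
            "key": x_key,
            "sort_value": unique_users(x_docs),          # used for ordering
            "height": sum(v for _, v in segments),       # what gets drawn
            "segments": segments,
        })
    return bars


# Schedule A: six users, each with a different language.
docs = [
    {"schedule": "A", "language": lang, "user": f"a{i}"}
    for i, lang in enumerate(["en", "fr", "de", "es", "it", "pt"])
]
# Schedule B: four users spread over two languages.
docs += [{"schedule": "B", "language": "en", "user": u} for u in ("b1", "b2", "b3")]
docs += [{"schedule": "B", "language": "fr", "user": "b4"}]

bars = simulate(docs, x_size=50, split_size=2)
for bar in bars:
    print(bar["key"], bar["sort_value"], bar["height"])
# A 6 2
# B 4 4
```

Bar A is placed first because it has six unique users, yet only two of its language sub-buckets are kept, so it is drawn shorter than bar B.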

@ab-taktik I think what you really want is for step 1(ii) to happen in a third phase, and for it to go more like "the sum of the 'unique count of user.ids' from the selected child buckets is calculated for each bucket", and then for 1(iii) and 1(iv) to use this new metric to sort and select the top 50 buckets. This functionality is something that the Elasticsearch 2.0 feature "bucket reducers" is aiming to solve. Until it is available, I don't think this is a feature Kibana 4 will support.
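To make the proposed "third phase" concrete, here is a small client-side sketch in Python. It is not the bucket reducers API; the function name and the input shape (a mapping from x-axis key to the metric values of the child buckets that were kept) are assumptions made for illustration.

```python
def rank_by_visible_sum(kept_children, size):
    """Sort x-axis keys by the sum of their kept child metrics, keep `size`.

    `kept_children` maps each x-axis key to the metric values of the child
    buckets that survived the sub-aggregation's own top-N selection.
    """
    ranked = sorted(
        kept_children.items(),
        key=lambda kv: sum(kv[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:size]]


# Invented values: sums are A=2, B=4, C=3.
kept_children = {"A": [1, 1], "B": [3, 1], "C": [2, 1]}
order = rank_by_visible_sum(kept_children, size=2)
print(order)  # ['B', 'C']
```

Note that selecting the top buckets this way needs the child buckets of every candidate parent before the cut is made, which is presumably why it has to be done where the aggregation runs rather than on an already truncated response.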

@spalger
Contributor

spalger commented Apr 2, 2015

Another way to think of this problem is that the buckets that create the bars are sorted based on the ordering parameters in the x-axis aggregation:

image

and the value used for that sorting includes documents that are excluded by the sub aggregation (grey area added to illustrate the excluded documents):

image

@bradvido

bradvido commented May 1, 2015

FWIW I've reproduced this issue without using unique count metrics in #3734

@driskell

Reading what @spalger says, it seems to me that the ordering is actually correct, but that the problem is the Terms Sub Aggregation for Split Bars incorrectly excluding data, creating what is unarguably a misleading representation of the data.
*Sorry about the "what @spalger is saying" - it was rude and badly phrased - I've rephrased! 👍 *

I just did a graph like this with Top 5 browser across operating systems, and all of a sudden it looked like iOS was the top operating system, but it wasn't... Windows was, it just had so many variations of browser it only showed the top 5.

There should be a part of the bar, which @spalger showed in grey, to show "Other" - this would fix both the ordering (which in my opinion is correct actually) and would fix the misleading representation of data. In my case Windows would jump up with a huge "Other" area, and the iOS would still be there at the end but much much tinier.

Summary: Ordering is fine, but what's happening is "Split Bar" + "Terms" is actually doing a "Filtered Split Bar" and filtering data, taking away all meaning from the original X-Axis aggregation. I can't see why somebody would only want to compare bars containing only the Top 5 entries...
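A sketch of what an "Other" segment could look like for a plain document-count metric, in Python. It assumes the nested terms result exposes `sum_other_doc_count` as discussed earlier in the thread; the sub-aggregation name `split` and the numbers are invented. It does not carry over to metrics such as unique count, where the remainder cannot be obtained by subtraction.

```python
def with_other_segment(bucket, sub_agg="split"):
    """Return (label, doc_count) segments for one bar, plus an 'Other' slice.

    `sub_agg` is a hypothetical name for the nested terms aggregation.
    """
    sub = bucket[sub_agg]
    segments = [(b["key"], b["doc_count"]) for b in sub["buckets"]]
    other = sub.get("sum_other_doc_count", 0)
    if other > 0:
        segments.append(("Other", other))
    return segments


windows = {
    "key": "windows", "doc_count": 100,
    "split": {
        "sum_other_doc_count": 45,
        "buckets": [{"key": "chrome", "doc_count": 30},
                    {"key": "ie", "doc_count": 25}],
    },
}
ios = {
    "key": "ios", "doc_count": 80,
    "split": {
        "sum_other_doc_count": 0,
        "buckets": [{"key": "safari", "doc_count": 70},
                    {"key": "chrome", "doc_count": 10}],
    },
}

for bucket in (windows, ios):
    segments = with_other_segment(bucket)
    print(bucket["key"], segments, sum(count for _, count in segments))
```

With the "Other" slice included, the Windows bar in this made-up data reaches its full height of 100 and is taller than the iOS bar at 80, matching the x-axis ordering.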

@spalger
Contributor

spalger commented Jun 11, 2015

@driskell I totally agree that we should be able to produce "other" buckets, but the feature must be implemented in elasticsearch first (see elastic/elasticsearch#5324 for progress). Once that is implemented this will be a far less confusing experience. For now, I recommend setting the size of the aggregation to something that makes the most sense for your data.
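For reference, a request of roughly this shape is where the aggregation sizes live. The sketch below builds the body as a Python dict; the aggregation names (`x_axis`, `split`, `unique_users`) and the metric field `user.id` are placeholders, and the body Kibana actually sends may differ.

```python
import json


def build_request(x_field, split_field, metric_field, x_size=50, split_size=10):
    """Build a terms / sub-terms request ordered by a cardinality metric."""
    metric = {"unique_users": {"cardinality": {"field": metric_field}}}
    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "aggs": {
            "x_axis": {
                "terms": {
                    "field": x_field,
                    "size": x_size,
                    "order": {"unique_users": "desc"},
                },
                "aggs": {
                    **metric,
                    "split": {
                        "terms": {
                            "field": split_field,
                            "size": split_size,
                            "order": {"unique_users": "desc"},
                        },
                        "aggs": metric,
                    },
                },
            }
        },
    }


# Raising split_size keeps more sub-buckets per bar, so less data is hidden.
body = build_request("scheduleFull.raw", "language.raw", "user.id", split_size=25)
print(json.dumps(body, indent=2))
```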

@spalger
Contributor

spalger commented Sep 9, 2015

Looks like elastic/elasticsearch#11042, so we can move forward with #1961.

9 participants