Percentage discrepancy when creating "quick values" pie chart with large data table. #2639

Closed
casepie opened this Issue Aug 9, 2016 · 3 comments

casepie commented Aug 9, 2016

When analyzing the flow logs from my firewall and building a graph of IDS alerts centered around "source_address" (source IP), I get a pie graph and a data table. The problem is that the query often returns 100 or more unique values for "source_address".

When you create a "Quick Values" chart, the pie graph is built from the values listed in the data table (maximum of 50 IPs), but the percentages in the data table are built from the entire query. So you can end up with your top IP showing up as 18% in the data table, but taking up roughly 70% of your pie graph.

Expected Behavior

One would expect the percentage shown visually for a given value (in my case, source_address) to be the same in the pie graph as in the data table below it.

Current Behavior

If you have more than 50 unique data values for the query in the field used to create your pie graph, then you'll have a discrepancy between the pie graph and the data table on the dashboard widget. The data table appears to still build its percentages from the entire query results (all 100+ IP addresses).

However, Graylog only shows 50 results for source_address in the data table. The pie graph appears to calculate the percentage for each value (in my case, source_address) from only the 50 source_addresses displayed in the data table, and not from the full query results.
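
To make the arithmetic concrete, here is a minimal Python sketch of the two calculations described above. The counts are made up to roughly match the 18% vs. 70% figures, and none of the names come from Graylog; it only illustrates how the same value gets two different percentages depending on the denominator.

```python
# Hypothetical message counts per source_address, sorted descending (not real data).
counts = [180] + [2] * 28 + [1] * 764

DISPLAY_LIMIT = 50  # the data table shows at most 50 values

total = sum(counts)                # every message matched by the query
shown = counts[:DISPLAY_LIMIT]     # only the values listed in the data table
shown_total = sum(shown)

table_pct = 100.0 * counts[0] / total        # percentage as the data table reports it
pie_pct = 100.0 * counts[0] / shown_total    # percentage the pie slice appears to use

print(f"table: {table_pct:.1f}%  pie: {pie_pct:.1f}%")
# prints "table: 18.0%  pie: 70.0%"
```

The long tail of single-hit addresses carries most of the messages here, which is why dropping it from the denominator inflates the top slice so much.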

Possible Solution

I would suggest that the pie graph also be calculated and drawn from the percentages of the full query results, so that it visually matches what is displayed in the data table (i.e. if the data table says that IP 10.10.16.1 accounted for 18% of the results, then that slice should take up about 18% of the pie graph).

Steps to Reproduce (for bugs)

  1. This will vary from system to system, but build a query that returns hundreds or thousands of results, with a key field (the one you're going to graph on) that has more than 50 unique values. Ideally, one or two of those values will be outliers, with many more appearances than the others. An ideal type of query for this is "top talkers" on a busy network.
  2. Build a "Quick Values" graph based on that key field (in my example, source_address or destination_address).
  3. Compare the percentage in the data table to the visual percentage of the pie graph displayed.

Context

Our use case is based on Juniper SRX firewall logs. We capture Intrusion Detection (IDS) logs and then build a dashboard item for "IDS alerts by Source IP". This is a "quick values" chart based on "source_address". It usually results in many hundreds of unique values for "source_address", with only a few that are statistically significant (above 3-5%). However, the pie graph looks very skewed when compared to the data table.

Your Environment

  • Graylog Version: 2.0.3
  • Elasticsearch Version: 2.3.5-1
  • MongoDB Version: 2.6.11-1.el7
  • Operating System: CentOS 7
  • Browser version: Chrome 51.0.2704

[Screenshot: graph_discrepancy]

Member

kroepke commented Aug 9, 2016

Good point, the result apparently ignores the sum of the long tail terms in the result set.
I'll see if we can quickly fix that.

Thanks for your comprehensive report!

@kroepke kroepke self-assigned this Aug 9, 2016

@kroepke kroepke added the bug label Aug 9, 2016

@kroepke kroepke added this to the 2.1.0 milestone Aug 9, 2016

Member

kroepke commented Aug 9, 2016

For reference, we need to take into account the sum_other_doc_count return value of the aggregation.
The "other" pie chart entry should then be the overflow (the values shown in the "other" table) plus the sum_other_doc_count.
Ideally we'd also show the sum_other_doc_count in the table somehow, e.g.:

Others (${sum_other_doc_count} values not shown)
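
As a rough illustration of that calculation, the Python sketch below folds both the overflow buckets and sum_other_doc_count into a single "Others" slice. It is not the Graylog implementation: the function name pie_slices, the top_n parameter, and the sample data are all assumptions, and the input is only assumed to have the shape of an Elasticsearch terms aggregation result (buckets with key and doc_count, plus sum_other_doc_count).

```python
def pie_slices(agg, top_n=5):
    """Return (key, count, percent) tuples: the top_n buckets plus one "Others" slice.

    `agg` is assumed to look like a terms aggregation result:
    {"buckets": [{"key": ..., "doc_count": ...}, ...], "sum_other_doc_count": int}
    """
    buckets = sorted(agg["buckets"], key=lambda b: b["doc_count"], reverse=True)
    top, overflow = buckets[:top_n], buckets[top_n:]

    # "Others" = returned buckets that do not get their own slice
    #            + documents in terms the aggregation did not return at all.
    others = sum(b["doc_count"] for b in overflow) + agg.get("sum_other_doc_count", 0)

    slices = [(b["key"], b["doc_count"]) for b in top]
    if others:
        slices.append(("Others", others))

    total = sum(count for _, count in slices)
    return [(key, count, 100.0 * count / total) for key, count in slices]


# Hypothetical result: 50 returned buckets, 743 more documents in the long tail.
agg = {
    "buckets": [{"key": "10.10.16.1", "doc_count": 180}]
    + [{"key": f"10.0.0.{i}", "doc_count": 2} for i in range(28)]
    + [{"key": f"10.0.1.{i}", "doc_count": 1} for i in range(21)],
    "sum_other_doc_count": 743,
}
slices = pie_slices(agg)
```

With the long tail included in the denominator, the first slice works out to 18% of the pie, matching what the data table would report for the same value.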

Member

kroepke commented Aug 11, 2016

Turns out this was purely a display bug in the pie chart; the data table was OK.
We already use the correct numbers in the data table, but rendered them incorrectly in the chart.

kroepke added a commit that referenced this issue Aug 11, 2016

use proper other count for pie chart slices
the others group did not take into account the other document groups which were outside of the first 45 "other" buckets
this led to incorrect rendering of pie chart slices

fixes #2639

edmundoa added a commit that referenced this issue Aug 12, 2016

use proper other count for pie chart slices (#2671)
the others group did not take into account the other document groups which were outside of the first 45 "other" buckets
this led to incorrect rendering of pie chart slices

fixes #2639