"histogram" is not a histogram, and mistake wrt timing metrics #349

Closed
Dieterbe opened this Issue Jan 21, 2013 · 3 comments

Hi,
looking at http://docs.datadoghq.com/guides/metrics/ ,
count/avg/median/max/percentile have AFAICT nothing to do with histograms: with a histogram you divide the value range into buckets and count how many values fall within each bucket.
I actually implemented histograms in statsd, and have a pending PR here:
etsy/statsd#162
(I also have a corresponding blogpost with more info but it may be TMI http://dieter.plaetinck.be/histogram-statsd-graphing-over-time-with-graphite.html)
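
To make the distinction concrete, here is a minimal sketch of a fixed-bucket histogram next to the kind of summary statistics the docs describe. It is an illustration only, not the code from etsy/statsd#162; the function name `bucket_counts`, the bucket bounds, and the sample values are all made up for the example.

```python
from bisect import bisect_left
from statistics import mean, median


def bucket_counts(values, upper_bounds):
    """Count how many values fall into each fixed bucket.

    upper_bounds must be sorted ascending. Bucket i holds values that are
    <= upper_bounds[i] and greater than the previous bound; one extra
    overflow bucket at the end catches everything above the last bound.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # bisect_left gives the index of the first bound that is >= v
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


# Hypothetical latencies in ms and hypothetical bucket bounds.
latencies = [3, 8, 12, 15, 40, 95, 250]
bounds = [10, 50, 100]

print(bucket_counts(latencies, bounds))    # [2, 3, 1, 1]
print(mean(latencies), median(latencies))  # summary stats: no buckets involved
```

The first line of output describes the shape of the distribution per bucket; the second only summarizes it.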

Related, at http://docs.datadoghq.com/guides/dogstatsd/ I read "Statsd only supports histograms for timing, not generic values (like the size of uploaded files or the number of rows returned from a query)."
this is not correct, although it is a common misconception because of the 'ms' code in the protocol and the fact that the metric type is called "timing".
I opened a ticket for this and plan to rename it to "aggregate" in statsd (see etsy/statsd#98), because this is essentially just computing aggregate statistics on the input set. The histogram feature is then just another set of computed aggregate statistics, just like mean/avg/percentiles/... are, and they are all in the scope of the same metric type.
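
As a rough sketch of the point about the 'ms' code: the statsd line protocol is `<name>:<value>|<type>`, and nothing in that format ties the `ms` type to durations. The metric names and values below are hypothetical, and the `send_metric` helper assumes a statsd daemon listening on UDP port 8125 on localhost.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the statsd line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# The same "ms" type carries a duration, a file size and a row count alike;
# the server just computes aggregate statistics over whatever values arrive.
timing = statsd_line("db.query.duration_ms", 320, "ms")
file_size = statsd_line("upload.size_bytes", 52133, "ms")
rows = statsd_line("db.query.rows_returned", 42, "ms")
```

Calling `send_metric(file_size)` would submit the file size exactly the way a timing is submitted.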

Owner

olidb2 commented Jan 21, 2013

Hi Dieter!

So - we actually thought quite a bit before calling this a Histogram -
there were a few heated arguments internally about it - but here's what it
boils down to for us:

  • Histograms are our semantics for users to say they're interested in the
    distribution of the values, not just an average/min/max.
  • We actually can produce a histogram out of the time-series we gather,
    since we have the sample sizes and the values determining the bins we
    deemed meaningful (median, 75pct, 95pct, etc..). The only caveat is that
    bins won't be even, and they will change with each time interval. This
    doesn't require any per-series configuration of bin ranges, though.
    • We limited ourselves to these summary stats as opposed to full deciles
      or quartiles because those were the most frequently used for ops metrics -
      in particular latency. So our "histogram" doesn't know much about the lower
      parts of the distribution. And it's quite low-res. Both are limitations.
  • We are planning to add some higher-definition (and more symmetrical)
    histogram data tracking, but will likely do so while keeping the same
    semantics. Behind the scenes, it would look quite similar to your statsd
    patch. The most annoying issue here - as you noted in your own patch - is
    configuring fixed-bins, which is something we'll have to experiment with a
    bit more.
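
One way to read the second bullet, as a sketch only: given the sample count and the percentile values for a time interval, each pair of neighbouring percentiles bounds a bin, and the difference between their ranks gives the approximate number of samples in it. This is an assumption about how such a histogram could be derived, not a description of the actual dd-agent code; `percentile_bins` and the numbers are invented for illustration.

```python
def percentile_bins(count, summary):
    """Derive uneven histogram bins from summary statistics.

    summary is a list of (fraction, value) pairs sorted by fraction, e.g.
    (0.5, median), (0.75, 75th percentile), ..., (1.0, max).
    Returns (lower, upper, approx_count) tuples. The lower edge of the first
    bin is None because these summary stats say nothing about the minimum.
    """
    bins = []
    prev_fraction, prev_value = 0.0, None
    for fraction, value in summary:
        approx = round(count * (fraction - prev_fraction))
        bins.append((prev_value, value, approx))
        prev_fraction, prev_value = fraction, value
    return bins


# Hypothetical interval: 200 samples, median 120, p75 180, p95 400, max 900.
summary = [(0.5, 120), (0.75, 180), (0.95, 400), (1.0, 900)]
for lower, upper, n in percentile_bins(200, summary):
    print(lower, upper, n)
```

The bin edges come from the data itself, so they are uneven and move with each interval, and the whole lower half of the distribution collapses into a single bin, which matches the limitations described above.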

Thanks also for the pointers to statsd - we'll take a look and fix the docs.

Happy to discuss this offline if you want.
O.


hmm that's interesting. I've never encountered histograms where the intervals are defined by percentiles but I guess it's technically ok to do.
I wonder how to best render such histograms over time. My gold standard so far has been http://imgur.com/P4Hu0: with its fixed intervals and color coding, it's easy to spot anomalies (such as deviations from a bell curve). Note how this example shows a decreased median/average even though there are more high outliers.

thinking about it, the datadog histogram graphs actually look relatively fine, but:

  • it's a bit confusing at first: one would think a histogram plot shows the buckets themselves, but since the percentiles are plotted as-is, each line covers the entire range from 0 to its value, so the lines are inherently "stacked" even though the graph doesn't look like a stacked graph. I definitely think it's better to draw it "stacked", maybe filling the areas under the lines to make it more obvious (I think that's sort of the unofficial convention). Maybe the intervals can be color-coded too, so you end up with bands colored from, say, green to red.
  • another thing is that the scale gets disproportionate and lower percentiles are harder to see (but my color-coding example has a similar problem: a color scale has limited use across a similarly wide spectrum)
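
To illustrate the "stacked" point in the first bullet: if each percentile line already spans from 0 up to its value, then the band to draw for it is only the part above the previous percentile. A minimal sketch, with a hypothetical helper name and made-up values for one time interval:

```python
def stacked_bands(percentile_values):
    """Turn ascending percentile values into band heights for a stacked plot.

    Each band is the gap between one percentile and the one below it; the
    first band starts at 0. The heights add up to the highest percentile.
    """
    bands = []
    floor = 0
    for value in percentile_values:
        bands.append(value - floor)
        floor = value
    return bands


# Hypothetical median, p75, p95 and max for one interval.
print(stacked_bands([120, 180, 400, 900]))  # [120, 60, 220, 500]
```

Feeding these heights to a stacked area chart, one series per band, gives filled bands that could then be color-coded from low to high.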

I def. like your idea of adding more bands in the lower part of the distribution, it's good to see trends there.
What's more, I'm starting to think my color-coding example doesn't do anything more or better than your current approach (assuming both have the same number of buckets, to keep it fair), and predefining the interval boundaries with "static" values is indeed no fun.

Contributor

alq666 commented Dec 23, 2013

Closing this as we will stick to the statsd terminology, however broken it is.

alq666 closed this Dec 23, 2013
