log(frequency) in histograms confusing #237

Open · karajan9 opened this issue Jul 16, 2021 · 16 comments
@karajan9

While I was benchmarking two very similar functions, I noticed that the histograms looked very different, which surprised me. This was because the second run had switched to a log display of the histogram.
[screenshot: two benchmark outputs, one with a linear-frequency histogram and one with a log-frequency histogram]
(I overwrote the function, so they share the same name.)

Having the option of a log histogram means I need to check which kind of histogram I'm getting every time. I personally also have trouble reading a log histogram without thinking through what it actually means for my numbers. If one time dominates the histogram, I would prefer to see it that way -- I assume the histogram is supposed to be a summary, and a single dominant bar would be a good summary in that case. Therefore I would suggest always plotting the histogram by frequency, perhaps with an option to display the log histogram.

If the behaviour is to stay the same, I would like an option to pick the histogram type, so that I can at least compare histograms across measurements easily.

@tecosaur (Collaborator)

Hmmm, yeah - this seems reasonable. I don't have a good idea right at this moment of how to make it optional, but I'll get back to this at some point. In the meantime, the shape/profile you're getting isn't as extreme as the cases I added the log-frequency histogram for, so the heuristic could maybe do with tweaking. The current logic is:

# if median size of (bins with >10% average data/bin) is less than 5% of max bin size, log the bin sizes
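A minimal sketch of that heuristic (illustrative names, not BenchmarkTools internals), where bincounts holds the number of samples in each histogram bin:

using Statistics

function logbins_heuristic(bincounts)
    avg = sum(bincounts) / length(bincounts)
    busy = filter(c -> c > 0.1 * avg, bincounts)  # bins holding >10% of the average count
    isempty(busy) && return false
    # log-scale when the typical busy bin is under 5% of the tallest bin
    return median(busy) < 0.05 * maximum(bincounts)
end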

@karajan9 (Author)

Sounds good.

If we are debating what a good default is (log plots are fine in general, of course; I just don't like them as part of the default here), maybe gathering a few more voices would be a good idea, to see whether my opinion is the rule or the exception.

@mkborregaard

I agree with @karajan9. Could you explain: what is the usefulness of log histograms for diagnosing issues with the code?

@tecosaur (Collaborator)

I added them because in some circumstances a non-log plot can push all the other values to round to 0 on the bar graph, which gives the false impression that the executions all took that one time to execute. However, if, as can happen, that "one time" is say 2 µs but 1/100th of the results were 1 ms, then that rare result, which is visually buried, can actually account for a large portion of the total time elapsed. This may sound overly contrived, but I have seen examples like that (though I cannot recall one on the spot).
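To put numbers on that scenario (values taken from the comment above):

times = [fill(2e-6, 99); 1e-3]   # 99 runs at 2 µs, one run at 1 ms
times[end] / sum(times)          # ≈ 0.83: the rare 1% of runs is >80% of the elapsed time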

@mkborregaard

Sure, but wouldn't having normal axes actually help in spotting this phenomenon, while making them automatically log would hide it?

@tecosaur (Collaborator)

I don't quite see what you mean. Normal axes hide this phenomenon, and log axes make it visible.

@mkborregaard

I guess that's hard to discuss without a sample distribution and a criterion for what diagnostic aspect of the benchmarking runs we are trying to isolate.

@timholy (Member)

timholy commented Sep 18, 2021

Manual control was essentially fixed by #255 (and made more readable in #257), since you can now specify :logbins=>false in an IOContext.

In most cases, I agree that the log scaling adds confusion rather than helping. However, I have been able to generate a case which I think illustrates @tecosaur's concern:

julia> ioctxfalse = IOContext(stdout, :logbins=>false);

julia> ioctxtrue  = IOContext(stdout, :logbins=>true);

julia> t = [fill(1, 100); 1:100]; # fake execution times (will be scaled so the total is 5s, just to ensure consistency with the parameters)

julia> b = BenchmarkTools.Trial(BenchmarkTools.DEFAULT_PARAMETERS, t/sum(t)*1e9*BenchmarkTools.DEFAULT_PARAMETERS.seconds, zeros(length(t)), 0, 0);

julia> show(ioctxfalse, MIME("text/plain"), b)
BenchmarkTools.Trial: 200 samples with 1 evaluation.
 Range (min … max):  970.874 μs … 97.087 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     970.874 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    25.000 ms ± 31.225 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  971 μs          Histogram: frequency by time         95.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> show(ioctxtrue, MIME("text/plain"), b)
BenchmarkTools.Trial: 200 samples with 1 evaluation.
 Range (min … max):  970.874 μs … 97.087 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     970.874 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    25.000 ms ± 31.225 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▅ ▅
  971 μs        Histogram: log(frequency) by time      95.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

So I see what @tecosaur means. But to be clear, I think the log scaling is also confusing, because there's no vertical scale; I can't tell if the taller bar is 10x taller or 2x taller or 100x taller than the shorter ones. From this output I would never guess that the shortest time was obtained in just over half the runs.

So if no better way of clarifying this can be found, I'd still favor dropping the log scaling. But perhaps someone has a good idea about how to fix it. Would the cumulative distribution simply be too confusing?
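For what it's worth, a cumulative view would make the mass hidden at the minimum explicit. A quick sketch using StatsBase's ecdf on the fake times t from above:

julia> using StatsBase

julia> F = ecdf(t);   # empirical CDF of the fake execution times

julia> F(1)           # proportion of samples at the minimum value
0.505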

@tecosaur (Collaborator)

tecosaur commented Sep 18, 2021

I'm thinking that now that there's manual control, with the demonstrated potential for confusion it would be worth dropping the "use log if..." logic and just having it off by default. Regarding the vertical scale, we could set the log base such that each line / full-height bar = 10x.
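A sketch of that scaling rule (the row cap and names are assumptions for illustration): with each full character row worth a factor of 10 in count, a bin's bar height in rows would be 1 + log10(count).

barheight(count; maxrows = 8) = count == 0 ? 0.0 : min(1 + log10(count), maxrows)

barheight(1)     # 1.0 row
barheight(10)    # 2.0 rows
barheight(1000)  # 4.0 rows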

@mkborregaard

FWIW I still think the arithmetic scale is more diagnostic than the log one in @timholy's example.

@mcabbott

Notice that in that example, min 970.874 μs, mean 25.000 ms, max 97.087 ms is quite revealing. Combined with the histogram showing many points at the minimum, it's clear something is going on. The median, 970.874 μs, misses this effect completely.

Agreed that the linear scale is more natural. Colouring in below a line on a log scale is pretty weird: where do you stop, and why is this an area?

@timholy (Member)

timholy commented Sep 19, 2021

Re the summary statistics: median(t) ± meanad(t) = (970 ± 24000) µs is just as revealing. Or more so, because unlike min..mean it reveals where the peak of the histogram lies.

I'm not disagreeing that we need a sense of the variability, but max is useless. (If my machine got busy checking email during one run, what does that have to do with the performance of my algorithm?) We probably shouldn't even display max because of the potential for misunderstanding. The combination of min and mean isn't too bad, though.
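A quick check of that summary on a vector of times t, with the mean absolute deviation written out via the Statistics stdlib rather than assuming a particular meanad export:

using Statistics

m = median(t)
spread = mean(abs.(t .- m))   # mean absolute deviation about the median
# report as m ± spread: centred on the histogram peak, with a tail-aware spread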

@mcabbott

Yes, I don't think I've ever learned anything from max. It's possible that other people do, if they care about real-time behaviour or network time-outs?

@vchuravy (Member)

I think that instead of max, what would be more interesting is the 90th/95th/99th percentile.

@mcabbott

A high percentile might be neat; we should try it out.

One nice thing about printing min, 50th, mean, 95th, or whatever is that all the numbers displayed are total times. When they are all similar, you need not care which one you read at first glance. Whereas with μ ± σ, one of them is a difference, which is on a different scale.
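Trying the percentile idea is a one-liner with the Statistics stdlib (t being a vector of times):

using Statistics

quantile(t, [0.5, 0.9, 0.95, 0.99])   # median plus the 90th/95th/99th percentiles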

@simonbyrne (Member)

I'd like to add that a log(frequency) histogram doesn't really make sense: the point of a histogram is that the area of a single bar divided by the area of all bars equals the proportion of samples in that bin. If you log-transform the vertical axis, that no longer applies.

It would make more sense to log-transform the horizontal axis: this would help capture long tails better.
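A sketch of log-spaced horizontal bins (an assumed illustration of this suggestion, not existing BenchmarkTools behaviour), for a vector of positive times t:

nbins = 30
edges = exp10.(range(log10(minimum(t)), log10(maximum(t)); length = nbins + 1))
counts = [count(x -> lo <= x < hi, t) for (lo, hi) in zip(edges[1:end-1], edges[2:end])]
counts[end] += count(==(maximum(t)), t)   # fold the maximum into the last half-open bin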
