log(frequency) in histograms confusing #237

Open · karajan9 opened this issue Jul 16, 2021 · 16 comments
@karajan9

While I was benchmarking two very similar functions, I noticed that the histograms looked very different, which surprised me. This was because the second run had switched to a log display of the histogram.
[screenshot: two benchmark outputs, one with a linear-frequency histogram and one with a log-frequency histogram]
(I overwrote the function, so they share the same name.)

Having the option of a log histogram means I need to check which kind of histogram I'm getting every time. I personally also have trouble reading a log histogram without thinking through what it actually means for my numbers. If one time dominates the histogram, I would prefer to see it that way -- I assume the histogram is supposed to be a summary, and a single dominant bar would be a good summary in that case. Therefore I would suggest always plotting the histogram by frequency, perhaps with an option to display the log histogram.

If the behaviour is to stay the same, I would like an option to pick the histogram type, so that I can at least compare histograms across measurements easily.

@tecosaur (Collaborator)

Hmmm, yeah - this seems reasonable. I don't have a good idea right at this moment of how to make it optional, but I'll get back to this at some point. In the meantime, the shape/profile you're getting isn't as extreme as the cases I added the log-frequency histogram for, so the heuristic could maybe do with tweaking. The current logic is:

# if median size of (bins with >10% average data/bin) is less than 5% of max bin size, log the bin sizes
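A minimal sketch of that heuristic (illustrative names, not BenchmarkTools internals), where bincounts holds the number of samples in each histogram bin:

using Statistics

function logbins_heuristic(bincounts)
    avg = sum(bincounts) / length(bincounts)
    busy = filter(c -> c > 0.1 * avg, bincounts)  # bins holding >10% of the average count
    isempty(busy) && return false
    # log-scale when the typical busy bin is under 5% of the tallest bin
    return median(busy) < 0.05 * maximum(bincounts)
end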

@karajan9 (Author)

Sounds good.

If we are debating what a good default is (log plots are fine in general, of course; I just don't like them as part of the default here), maybe gathering a few more voices would be a good idea, to see whether my opinion is the rule or the exception.

@mkborregaard

I agree with @karajan9. Could you explain: what is the usefulness of log histograms for diagnosing issues with the code?

@tecosaur (Collaborator)

I added them because in some circumstances a non-log plot can push all the other values to round to 0 on the bar graph, which gives the false impression that the executions all took that one time to execute. However, if, as can happen, that "one time" is say 2 µs but 1/100th of the results were 1 ms, then that rare result, which is visually buried, can actually account for a large portion of the total time elapsed. This may sound overly contrived, but I have seen examples like that (though I cannot recall one on the spot).
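To put numbers on that scenario (values taken from the comment above):

times = [fill(2e-6, 99); 1e-3]   # 99 runs at 2 µs, one run at 1 ms
times[end] / sum(times)          # ≈ 0.83: the rare 1% of runs is >80% of the elapsed time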

@mkborregaard

Sure, but wouldn't having normal axes actually help in spotting this phenomenon, while making them automatically log would hide it?

@tecosaur (Collaborator)

I don't quite see what you mean. Normal axes hide this phenomenon, and log axes make it visible.

@mkborregaard

I guess that's hard to discuss without a sample distribution and a criterion for what diagnostic aspect of the benchmarking runs we are trying to isolate.

@timholy (Member)

timholy commented Sep 18, 2021

Manual control was essentially fixed by #255 (and made more readable in #257), since you can now specify :logbins=>false in an IOContext.

In most cases, I agree that the log scaling adds confusion rather than helping. However, I have been able to generate a case which I think illustrates @tecosaur's concern:

julia> ioctxfalse = IOContext(stdout, :logbins=>false);

julia> ioctxtrue  = IOContext(stdout, :logbins=>true);

julia> t = [fill(1, 100); 1:100]; # fake execution times (will be scaled so the total is 5s, just to ensure consistency with the parameters)

julia> b = BenchmarkTools.Trial(BenchmarkTools.DEFAULT_PARAMETERS, t/sum(t)*1e9*BenchmarkTools.DEFAULT_PARAMETERS.seconds, zeros(length(t)), 0, 0);

julia> show(ioctxfalse, MIME("text/plain"), b)
BenchmarkTools.Trial: 200 samples with 1 evaluation.
 Range (min … max):  970.874 μs … 97.087 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     970.874 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    25.000 ms ± 31.225 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  971 μs          Histogram: frequency by time         95.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> show(ioctxtrue, MIME("text/plain"), b)
BenchmarkTools.Trial: 200 samples with 1 evaluation.
 Range (min … max):  970.874 μs … 97.087 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     970.874 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    25.000 ms ± 31.225 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                             
  █▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▃▅▅▃▅▅▃▅▃▅▅▃▅▅ ▅
  971 μs        Histogram: log(frequency) by time      95.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

So I see what @tecosaur means. But to be clear, I think the log scaling is also confusing, because there's no vertical scale; I can't tell if the taller bar is 10x taller or 2x taller or 100x taller than the shorter ones. From this output I would never guess that the shortest time was obtained in just over half the runs.

So if no better way of clarifying this can be found, I'd still favor dropping the log scaling. But perhaps someone has a good idea about how to fix it. Would the cumulative distribution simply be too confusing?
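For what it's worth, a cumulative view would make the mass hidden at the minimum explicit. A quick sketch using StatsBase's ecdf on the fake times t from above:

julia> using StatsBase

julia> F = ecdf(t);   # empirical CDF of the fake execution times

julia> F(1)           # proportion of samples at the minimum value
0.505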

@tecosaur (Collaborator)

tecosaur commented Sep 18, 2021

I'm thinking that now that there's manual control, with the demonstrated potential for confusion it would be worth dropping the "use log if..." logic and just having it off by default. Regarding the vertical scale, we could set the log base such that each line / full-height bar = 10x.
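A sketch of that scaling rule (the row cap and names are assumptions for illustration): with each full character row worth a factor of 10 in count, a bin's bar height in rows would be 1 + log10(count).

barheight(count; maxrows = 8) = count == 0 ? 0.0 : min(1 + log10(count), maxrows)

barheight(1)     # 1.0 row
barheight(10)    # 2.0 rows
barheight(1000)  # 4.0 rows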

@mkborregaard

FWIW I still think the arithmetic scale is more diagnostic than the log one in @timholy's example.

@mcabbott

Notice that in that example, min 970.874 μs, mean 25.000 ms, max 97.087 ms is quite revealing. Combined with the histogram showing many points at the minimum, it's clear something is going on. The median, 970.874 μs, misses this effect completely.

Agreed that the linear scale is more natural. Colouring in below a line on a log scale is pretty weird: where do you stop, and why is this an area?

@timholy (Member)

timholy commented Sep 19, 2021

Re the summary statistics: median(t) ± meanad(t) = (970 ± 24000) µs is just as revealing. Or more so, because unlike min..mean it reveals where the peak of the histogram lies.

I'm not disagreeing that we need a sense of the variability, but max is useless. (If my machine got busy checking email during one run, what does that have to do with the performance of my algorithm?) We probably shouldn't even display max because of the potential for misunderstanding. The combination of min and mean isn't too bad, though.
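A quick check of that summary on a vector of times t, with the mean absolute deviation written out via the Statistics stdlib rather than assuming a particular meanad export:

using Statistics

m = median(t)
spread = mean(abs.(t .- m))   # mean absolute deviation about the median
# report as m ± spread: centred on the histogram peak, with a tail-aware spread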

@mcabbott

Yes, I don't think I've ever learned anything from max. It's possible that other people do, if they care about real-time behaviour or network time-outs?

@vchuravy (Member)

I think that instead of max, what would be more interesting is the 90th/95th/99th percentile.

@mcabbott

A high percentile might be neat; we should try it out.

One nice thing about printing min, 50th, mean, 95th, or whatever is that all the numbers displayed are total times. When they are all similar, you need not care which one you read at first glance. Whereas with μ ± σ, one of them is a difference, which is on a different scale.
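Trying the percentile idea is a one-liner with the Statistics stdlib (t being a vector of times):

using Statistics

quantile(t, [0.5, 0.9, 0.95, 0.99])   # median plus the 90th/95th/99th percentiles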

@simonbyrne (Member)

I'd like to add that a log(frequency) histogram doesn't really make sense: the point of a histogram is that the area of a single bar divided by the area of all bars equals the proportion of samples in that bin. If you log-transform the vertical axis, that no longer applies.

It would make more sense to log-transform the horizontal axis: this would help capture long tails better.
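A sketch of log-spaced horizontal bins (an assumed illustration of this suggestion, not existing BenchmarkTools behaviour), for a vector of positive times t:

nbins = 30
edges = exp10.(range(log10(minimum(t)), log10(maximum(t)); length = nbins + 1))
counts = [count(x -> lo <= x < hi, t) for (lo, hi) in zip(edges[1:end-1], edges[2:end])]
counts[end] += count(==(maximum(t)), t)   # fold the maximum into the last half-open bin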
