Benchmarking should be faster #30

Closed
noahgibbs opened this issue Aug 9, 2021 · 14 comments
Labels
enhancement New feature or request

Comments

@noahgibbs
Contributor

Right now our benchmarking runs take longer than they need to. There are a few things we can do about that. For instance:

  • several benchmarks do setup stages that really only need to happen once (e.g. Railsbench doing bundle install and rails db:migrate). It should be possible to run the setup once and then skip it on following tries (see the sketch after this list). This will require an API change to benchmark.rb, but that change could be optional -- any benchmark.rb that always does setup would still be correct, just slower than necessary.
  • a lot of benchmarks do iterations-inside-iterations, where we'll repeat many times and call it one iteration (e.g. psych-load's simple "100.times" inner loop; Railsbench uses random routes, but also does something similar.) That helps the UI from the command line a bit since it's not really designed for very short (e.g. 5ms) benchmarks. And it helps reduce overhead of the harness's benchmarking loop, which is also not really designed for very short benchmarks. But we can fix this with something resembling Chris Seaton's continuous-adaptive iteration timing (Diagnostic harnesses for continuous-adaptive and BIPS yjit-bench#32) if we want to, and then the harness will get a lot more control over how long to run and when to stop.
  • right now we just use a fixed number of iterations of warmup for any benchmarks we run. There are easily fifty different ways we can improve how we handle warmup, and quite a lot of them result in a faster overall set of runs. e.g. we can have adaptive warmup until the graph is sufficiently flat in slope and low in variance; or we could have some method of tracking a "reasonable" level of warmup per-benchmark and per-ruby, which would allow quick warmup on non-TruffleRuby runs, and medium-quick warmup on TruffleRuby runs where only certain benchmarks were used. Right now mixing (e.g.) activerecord and psych-load for the same number of iterations means psych-load is warmed up vastly more than needed.
  • related: we don't currently support running a benchmark until some level of stability is reached, only for a fixed amount of time or number of iterations. This has to be done carefully to avoid bias, but we can do something reasonable here, I think. Extra-fancy would be running two sets of benchmarks until we've proven (to within some margin) they are the same or different, but even something simple would be better than where we're at now.
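
As a rough illustration of the first bullet, here's a minimal sketch of what an optional setup-skipping hook could look like. The SKIP_SETUP environment variable and the file-writing setup body are placeholders for illustration, not the actual yjit-bench benchmark.rb interface.

```ruby
# Hypothetical benchmark.rb layout: setup runs only when the harness
# hasn't already done it on a previous try of the same benchmark.
def setup
  # Stand-in for expensive one-time work such as `bundle install`
  # or `rails db:migrate` in Railsbench.
  File.write("setup_done.txt", Time.now.to_s)
end

# The harness would set SKIP_SETUP=1 on every run after the first.
# A benchmark.rb that ignores the variable stays correct, just slower.
setup unless ENV["SKIP_SETUP"] == "1"

# Per-iteration benchmark body goes below, unchanged.
```

An old-style benchmark.rb that never checks the variable keeps working, which is what makes the API change optional.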

Some of these things break compatibility with the yjit-bench harness. It would be nice to retain compatibility, so I'll try not to do anything to gratuitously break that without some correspondingly useful feature that's currently not supportable.

@maximecb
Contributor

Something to keep in mind is that having an adaptive harness could make the benchmarking results less stable. The nice thing about yjit-bench right now is that it's very simple and predictable. It delivers results fairly quickly. There are reasons to stick with a less "fancy" strategy.

Some of these things break compatibility with the yjit-bench harness. It would be nice to retain compatibility, so I'll try not to do anything to gratuitously break that without some correspondingly useful feature that's currently not supportable.

Appreciated 👍

What became obvious to me after we did the whole benchmarking for VMIL is that we should have started with one single run per benchmark, kept iterating until we were satisfied we had the "right" benchmarking strategy, and only then gone for multiple runs for more accuracy. As it pertains to yjit-metrics, maybe it's also OK to just do a single run per benchmark. Yes, that does introduce a little bit of noise, but maybe a few percent of noise is OK, because you can still smooth the data over multiple days.

Also worth considering whether or not we want to measure the perf of every engine every night. I was thinking we mostly cared about the performance of YJIT and the interpreter for this project. Since you're building the tools, it could make sense to also benchmark TruffleRuby, but does it have to be benchmarked every night? If yjit-metrics ends up taking on benchmarking TruffleRuby regularly as a goal, that will call for a different set of benchmarks and its own strict set of requirements.

@noahgibbs
Contributor Author

I definitely don't think every engine needs benchmarking every night, and I've tried to be careful to not make that assumption in the benchmarking tools. We could do a low-warmup run with just interpreter and YJIT, for instance, and make that the nightly.

But yeah, not benchmarking TruffleRuby nightly is a great idea because its settings need to be so different from what YJIT, MJIT or CRuby want.

we should have started with one single run per benchmark, and kept iterating until we were satisfied we had the "right" benchmarking strategy, and only then gone for multiple runs for more accuracy.

We could do that. I don't feel like we ever really nailed warmup given how late in the game we were making TruffleRuby-related changes. And long-term I definitely think it makes zero sense to warm up each benchmark for the same number of iterations (e.g. optcarrot vs activerecord). While warmup time might work as a substitute, it's still going to be a little rough around the edges.

Given how much you'd prefer keeping things simple (use shellscripts, avoid multiple runs, avoid keeping settings in the framework), though, we should probably plan to keep it at a fixed number of iterations. Then we'll just take the hit on how long it takes to run benchmarks.

So even if we do everything for a single run, that run is going to be very long. Being fancy (variable numbers of iterations for different benchmarks, etc.) allows you to skip work where it's not needed. Whereas if you (e.g.) hardcode the number of iterations for everything, you wind up getting 500 iterations of a 200ms benchmark, plus 500 iterations of a 10-second benchmark. The longer-running benchmark takes a lot longer, and has enough built-in noise that 500 iterations is usually not nearly enough. As our benchmarks get bigger (e.g. Discourse) that problem is going to grow a lot.
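(For scale: 500 iterations at 200 ms is roughly 100 seconds of measured time, while 500 iterations at 10 seconds per iteration is roughly 83 minutes -- and that's one run of one benchmark on one Ruby.)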

There are some ways we could move more control into the harness to reduce iteration counts, but those all add complexity. The simple versions (the shellscript says to run a fixed number of iterations) all wind up with the Discourse benchmark, or any other large long-running benchmark, needing days or more to run.

We're basically there already. Even the single-run and warmup data sets I did for VMIL were each about a day of computation. And with TruffleRuby we never really hit the point where it stopped warming up, even for the single-run data.

@maximecb
Contributor

I'm open to having an adaptive strategy for warmup. It's just that with TruffleRuby being so unpredictable, we could get wildly different results every time, as in we think TR is done warming up, but it isn't. In that context maybe doing 50 warmup iterations for everyone, but 200 for TR makes sense, or just having a minimum amount of time we give for warmup, like 5 or 10 minutes (with however many iterations each engine does in that time).

Maybe something that could make sense is: we guarantee a minimum of 20 warmup iterations, and also a minimum of 5 minutes of warmup (could be more than 20 iterations), then we begin doing timing iterations, and we guarantee that we get at least 10 timing iterations?
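
A minimal sketch of that policy, assuming those exact numbers (20 warmup iterations, 5 minutes, 10 timing iterations) rather than anything settled, and a hypothetical per-iteration block passed in by the harness:

```ruby
MIN_WARMUP_ITRS = 20
MIN_WARMUP_TIME = 5 * 60  # seconds
MIN_TIMING_ITRS = 10

def monotonic_time
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

def run_with_warmup_policy(&benchmark_body)
  warmup_times = []
  start = monotonic_time

  # Warm up until we have both enough iterations and enough elapsed time.
  until warmup_times.size >= MIN_WARMUP_ITRS &&
        (monotonic_time - start) >= MIN_WARMUP_TIME
    t0 = monotonic_time
    benchmark_body.call
    warmup_times << (monotonic_time - t0)
  end

  # Then guarantee at least the minimum number of timing iterations.
  timing_times = Array.new(MIN_TIMING_ITRS) do
    t0 = monotonic_time
    benchmark_body.call
    monotonic_time - t0
  end

  { warmup: warmup_times, timing: timing_times }
end
```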

@noahgibbs
Contributor Author

The tradeoff is the one you mentioned earlier: we can vary more, which will often give us better results faster. But then it's hard to guarantee we haven't done it wrong.

We can definitely do what you mention there. It'd be fine. I was considering having some kind of warmup "hint" built into each benchmark, or into the framework, because it's really hard to automatically deduce warmup from iteration times. Then we'd use that hint unless we specified a different number. But your way is more uniform -- it doesn't need any additional per-benchmark information, at the cost that sometimes it will do a bad job on a specific benchmark.

@maximecb
Contributor

My potentially controversial opinion is: if we give some minimum iteration count for warmup (eg: 20) and some minimum time budget for warmup (eg: 5 minutes), we're actually being very generous. It's not us who are doing a bad job at enabling JIT compilers to be sufficiently warmed up, it's Truffle that is doing a bad job at warming up rapidly and predictably (which is what users of such a JIT compiler would expect).

@noahgibbs
Contributor Author

Oh - something I should mention here: TruffleRuby has a tool for viewing its internal state that @chrisseaton showed me. Among other things, it can report, per method, whether that method is fully compiled. So while we can't fully prevent deoptimisation, it should be possible to determine whether TruffleRuby has any more known optimisations in the queue, if we can use the same interface that tool uses. So we would at least be able to tell when TruffleRuby thinks it's fully warmed up.

I get the impression that it can take a while for everything to fully shake out. But at a minimum we could accept TruffleRuby's estimate of whether it's fully method-compiled when it says yes, and have some maximum time cutoff if it's still saying "not yet." That might let us speed up the fast-warmup cases for TruffleRuby significantly, even if (e.g.) Railsbench still took a long time to warm up.

@chrisseaton

This is the GraalVM Thermometer.

oracle/graal#3198

It can tell you how much of your time you're spending in optimised code. If it's 95% or something, you can call that warmed up. You could measure the same thing your own way in MJIT and YJIT, and it could be a universal metric.

@noahgibbs
Contributor Author

We only monitor that in YJIT-capable builds when RUBY_DEBUG is true -- so for most of the builds we care about, you can't easily get that information at runtime. No clue how/whether MJIT does the same, but I don't think the information is currently exported from MJIT.

@noahgibbs
Contributor Author

My Java-fu is old and my Graal-fu is basically nonexistent. I'm not having any luck with Google tracking down a protocol that those components use to connect from outside the main TruffleRuby process. If I wanted to export that information in a way other than as dynamically-updating text on the console, can you give me a hint where to start looking?

I'm not even sure how I'd get the dynamically updating text on the console, but I've seen you do so, so I know it's possible. Also, that looks like a draft PR, so I can't tell if that would be usable on a released TruffleRuby yet.

Also: can you tell me if there's significant overhead to tracking that? It'd be great to know when TruffleRuby is done warming up, but if monitoring warmup would give worse performance numbers then we shouldn't do that for benchmarking.

@chrisseaton

If you tell me a way you want to access the data I'll try to give you that interface.

For example, would a call from within the Ruby code be useful? Such as percent_warmed_up or something?

There is an overhead - I'm trying to significantly reduce it at the moment.

@noahgibbs
Contributor Author

noahgibbs commented Aug 17, 2021

Being able to get percent_warmed_up from Ruby would do exactly what we want, yeah. The right model for us would be calling from inside the harness (the worker process running in TruffleRuby) to get the current warmed-up percentage. Then we could use that as a possible threshold for warmup.

The current plan is to keep iterating until we're past some minimum number of iterations, and then stop at a maximum number of iterations or total warmup time. This would add a third (TruffleRuby-only) condition that would terminate warmup and start the timed iterations. And that all happens inside the harness process.
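
In rough Ruby terms, that third condition could be bolted onto the existing limits like the sketch below. The percent_warmed_up call, the thresholds, and the RUBY_ENGINE check are all assumptions for illustration; the real interface is whatever ends up being exposed from TruffleRuby.

```ruby
MIN_WARMUP_ITRS = 20
MAX_WARMUP_ITRS = 500
MAX_WARMUP_TIME = 10 * 60  # seconds
WARMED_UP_PCT   = 95.0     # assumed threshold, not a measured value

def warmup_finished?(itrs, elapsed_secs)
  return false if itrs < MIN_WARMUP_ITRS
  return true  if itrs >= MAX_WARMUP_ITRS || elapsed_secs >= MAX_WARMUP_TIME

  # Third, TruffleRuby-only condition: stop early once the runtime says
  # it is (mostly) running optimised code. percent_warmed_up is the
  # hypothetical call discussed in this thread.
  RUBY_ENGINE == "truffleruby" &&
    respond_to?(:percent_warmed_up) &&
    percent_warmed_up >= WARMED_UP_PCT
end
```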

At a minimum we could use the percent_warmed_up metric to get rough early results in TruffleRuby faster, and then run without the overhead for final results. That shouldn't be hard to do, and would still speed up a lot of our runs significantly.

@noahgibbs added the enhancement (New feature or request) label on Sep 1, 2021
@eregon

eregon commented Oct 30, 2021

FYI something I did is a harness which tries to get the median absolute deviation (a robust estimator of variability) under a threshold, and slowly increases that threshold so we eventually finish benchmarking even if the variance between iterations never becomes fully stable.

This is an approach I used a while ago in the perfer benchmark harness: https://github.com/eregon/perfer/blob/98c4b23aa1884b3bbc45cde377c3d9c2f6260f6a/lib/perfer/job/iteration_job.rb#L138 and https://github.com/jruby/perfer/wiki/Methodology (this one computes the median absolute deviation of the last 10 iterations).

I've also tried that approach on yjit-bench benchmarks (https://github.com/eregon/yjit-bench/blob/harness-warmup-20211024/harness-warmup/harness.rb), computing the median absolute deviation of all iterations (since it's a robust estimator), and it seems to work fairly well.
I added the additional condition to run for at least 5 seconds (otherwise some micros might show a low median absolute deviation due to consistently fast early iterations without being warmed up yet, though those micros are basically optimized out by TruffleRuby anyway).
And also a max time limit, so if some benchmark is e.g. bimodal or very unstable we still stop at some point.
I think it works well, but it's not perfect either, and sometimes it feels like it does too many iterations; maybe I should increase the threshold more than linearly per iteration, or maybe the threshold should be based on the time elapsed since the start of the benchmark (since, as you mention, an iteration can be a few milliseconds on micros or ~10 seconds on some macros).
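
For reference, a stripped-down version of that stopping rule might look like the sketch below. The exact base threshold, its per-iteration growth, and the 5-second / max-time limits are placeholders; the real logic is in the harness-warmup branch linked above.

```ruby
MIN_TIME     = 5.0      # seconds: don't trust consistently fast early iterations
MAX_TIME     = 10 * 60  # seconds: stop eventually even on bimodal/unstable benchmarks
BASE_MAD_PCT = 1.0      # target MAD as a percentage of the median iteration time
MAD_GROWTH   = 0.05     # loosen the threshold a little on every iteration

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median absolute deviation: a robust estimator of variability.
def mad(values)
  med = median(values)
  median(values.map { |v| (v - med).abs })
end

def stable_enough?(iteration_times, elapsed_secs)
  return false if iteration_times.empty? || elapsed_secs < MIN_TIME
  return true  if elapsed_secs >= MAX_TIME

  threshold_pct = BASE_MAD_PCT + MAD_GROWTH * iteration_times.size
  100.0 * mad(iteration_times) / median(iteration_times) <= threshold_pct
end
```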

As we all know, it's hard to automatically detect warmup, even though when manually looking at iteration times (or warmup plots) it doesn't feel that hard (e.g., if the last 10 runs are all extremely close, it's a good sign it warmed up, and running more iterations and noticing they are stable increases that confidence).
It'd be great to have some kind of "is it warmed up" metric like @chrisseaton said, but I'm not sure that's easily portable to all Rubies being benchmarked.

@noahgibbs
Contributor Author

It's going to be very hard to do this in a robust way across all Rubies, I agree. One difficulty is that, as you say, it feels quite easy when looking at it. Yet "last ten iterations were all very close" is exactly the kind of heuristic where TruffleRuby often surprises me with its later behaviour. If we were going to benchmark TruffleRuby more regularly, I'd want some level of visibility into its internal state.

For now we're handling it with fixed numbers/times for iterations, recording the variability and the warmup iterations, and not benchmarking TruffleRuby regularly. Its internal state is so much more complex and unpredictable than CRuby's, MJIT's or YJIT's that it normally makes more sense to exclude it from regular runs rather than confidently make wrong assertions about it.

The approach you mention seems solid, well thought out, and (in the case of TruffleRuby) still error-prone. When/if I hear from Chris that the GraalVM Thermometer interface is available, I'll revisit including TruffleRuby.

I'll also probably benchmark TruffleRuby again when we next submit results for a paper or give a conference talk where it's relevant -- and in that case we'll handle it by warming up a lot and doing a lot of manual inspection to make sure there are no obvious flaws. That's not an approach we can scale to twice-daily CI runs, though. The VMIL results took a very long time to collect, primarily because of how much warmup we needed to do for TruffleRuby results.

@noahgibbs
Contributor Author

At this point I don't think we need to do these things, including variable warmup. If we revisit this later we can open a new bug.
