Benchmarking should be faster #30
Something to keep in mind is that having an adaptive harness could make the benchmarking results less stable. The nice thing about yjit-bench right now is that it's very simple and predictable. It delivers results fairly quickly. There's reasons to stick with a less "fancy" strategy.
Appreciated 👍 What became obvious to me after we did the whole benchmarking for VMIL is that we should have started with one single run per benchmark, kept iterating until we were satisfied we had the "right" benchmarking strategy, and only then gone for multiple runs for more accuracy.

As it pertains to yjit-metrics, maybe it's also OK to just do a single run per benchmark. Yes, that does introduce a little bit of noise, but maybe a few percent of noise is OK, because you can still smooth the data over multiple days.

Also worth considering is whether or not we want to measure the perf of every engine every night. I was thinking we mostly cared about the performance of YJIT and the interpreter for this project. Since you're building the tools, it could make sense to also benchmark TruffleRuby, but does it have to be benchmarked every night? If yjit-metrics ends up taking on benchmarking TruffleRuby regularly as a goal, they will want a different set of benchmarks and their own strict set of requirements.
I definitely don't think every engine needs benchmarking every night, and I've tried to be careful to not make that assumption in the benchmarking tools. We could do a low-warmup run with just interpreter and YJIT, for instance, and make that the nightly. But yeah, not benchmarking TruffleRuby nightly is a great idea because its settings need to be so different from what YJIT, MJIT or CRuby want.
We could do that. I don't feel like we ever really nailed warmup given how late in the game we were making TruffleRuby-related changes. And long-term I definitely think it makes zero sense to warm up each benchmark for the same number of iterations (e.g. optcarrot vs activerecord). While warmup time might work as a substitute, it's still going to be a little rough around the edges.

Given how much you'd prefer keeping things simple (use shellscripts, avoid multiple runs, avoid keeping settings in the framework), though, we should probably assume we'll keep it at a fixed number of iterations. Then we'll just take the hit on how long it takes to run benchmarks. So even if we do everything as a single run, that run is going to be very long.

Being fancy (variable numbers of iterations for different benchmarks, etc.) allows you to skip work where it's not needed. Whereas if you (e.g.) hardcode the number of iterations for everything, you wind up getting 500 iterations of a 200ms benchmark, plus 500 iterations of a 10-second benchmark. The longer-running benchmark takes a lot longer, and has enough built-in noise that 500 iterations is usually not nearly enough. As our benchmarks get bigger (e.g. Discourse) that problem is going to grow a lot.

There are some ways we could move more control into the harness to reduce iteration counts, but those all add complexity. The simple versions (the shellscript says to run a fixed number of iterations) all wind up with the Discourse benchmark, or any other large long-running benchmark, needing days or more to run. We're basically there already. Even the single-run and warmup data I did for VMIL was about a day of computation each. And with TruffleRuby we never really hit the point where it stopped warming up, even for the single-run data.
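To make the cost of a fixed iteration count concrete, here is a back-of-the-envelope sketch. The 500-iteration count comes from the example above; the per-iteration times are illustrative round numbers, not measured values from this thread.

```ruby
# Rough cost of running a fixed number of iterations for benchmarks of
# very different sizes. Numbers are illustrative assumptions.
ITERATIONS = 500

benchmarks = {
  "small benchmark (~200ms/iter)" => 0.2,   # seconds per iteration
  "large benchmark (~10s/iter)"   => 10.0,
}

benchmarks.each do |name, secs_per_iter|
  total = ITERATIONS * secs_per_iter
  puts format("%-32s %8.1f s (%.1f min)", name, total, total / 60.0)
end
```

At 500 iterations, the small benchmark finishes in under two minutes while the large one takes well over an hour per run, which is the asymmetry the comment above describes.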
I'm open to having an adaptive strategy for warmup. It's just that with TruffleRuby being so unpredictable, we could get wildly different results every time, as in: we think TR is done warming up, but it isn't. In that context maybe doing 50 warmup iterations for everyone but 200 for TR makes sense, or just having a minimum amount of time we give for warmup, like 5 or 10 minutes (with however many iterations each engine does in that time). Maybe something that could make sense is: we guarantee a minimum of 20 warmup iterations and also a minimum of 5 minutes of warmup (which could be more than 20 iterations), then we begin doing timing iterations, and we guarantee that we get at least 10 timing iterations?
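The proposal above (a floor on both warmup iterations and warmup time, then a floor on timed iterations) could be sketched as a harness loop. This is a minimal illustration; the method name and defaults mirror the numbers suggested in the comment, and none of it is existing yjit-bench or yjit-metrics code.

```ruby
# Sketch of the proposed warmup policy: warm up until BOTH a minimum
# iteration count and a minimum wall-clock time have elapsed, then
# collect a minimum number of timed iterations. Defaults follow the
# numbers proposed above (20 iterations, 5 minutes, 10 timed iterations).
def run_benchmark(min_warmup_iters: 20, min_warmup_secs: 300,
                  min_timed_iters: 10, &bench)
  warmup_times = []
  warmup_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  # Warmup phase: keep going until both floors are satisfied.
  loop do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    bench.call
    warmup_times << Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0

    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - warmup_start
    break if warmup_times.size >= min_warmup_iters && elapsed >= min_warmup_secs
  end

  # Timing phase: guarantee at least min_timed_iters measured iterations.
  timed_times = min_timed_iters.times.map do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    bench.call
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
  end

  { warmup: warmup_times, timed: timed_times }
end
```

A slow-warming engine like TruffleRuby naturally gets more warmup iterations under the time floor, while a fast engine is not forced to run for the full five minutes once its iteration floor alone would have been met later than the time floor.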
The tradeoff is the one you mentioned earlier: we can vary more, which will often give us better results faster, but then it's hard to guarantee we haven't done it wrong. We can definitely do what you mention there. It'd be fine. I was considering having some kind of warmup "hint" built into each benchmark, or into the framework, because it's really hard to automatically deduce warmup from iteration times. Then we'd get that number unless we specified one. But your way is more uniform -- it doesn't need any additional per-benchmark information, at the cost that sometimes it will do a bad job on a specific benchmark.
My potentially controversial opinion is: if we give some minimum iteration count for warmup (e.g. 20) and some minimum time budget for warmup (e.g. 5 minutes), we're actually being very generous. It's not us who are doing a bad job at enabling JIT compilers to be sufficiently warmed up; it's Truffle that is doing a bad job at warming up rapidly and predictably (which is what users of such a JIT compiler would expect).
Oh - something I should mention here: TruffleRuby has a tool for viewing its internal state that @chrisseaton showed me. Among other things, it determines whether a particular method is fully compiled, per method. So while we can't fully prevent deoptimisation, it should be possible to determine whether TruffleRuby has any more known optimisations in the queue, if we can use the same interface that tool uses. So we could at least tell when TruffleRuby thinks it's fully warmed up. I get the impression that it can take a while for everything to fully shake out. But at a minimum we could accept TruffleRuby's estimate of whether it's fully method-compiled when it says yes, and have some maximum time cutoff if it's still saying "not yet." That might let us speed up the fast-warmup cases for TruffleRuby significantly, even if (e.g.) Railsbench still took a long time to warm up.
This is the GraalVM Thermometer. It can tell you how much of your time you're spending in optimised code. If it's 95% or so, you can call that warmed up. You could measure the same thing your own way in MJIT and YJIT, and it could be a universal metric.
We only monitor that in YJIT-capable builds when RUBY_DEBUG is true -- so for most of the builds we care about, you can't easily get that information at runtime. No clue how/whether MJIT does the same, but I don't think the information is currently exported from MJIT. |
My Java-fu is old and my Graal-fu is basically nonexistent. I'm not having any luck with Google tracking down the protocol those components use to connect from outside the main TruffleRuby process. If I wanted to export that information in a way other than as dynamically-updating text on the console, can you give me a hint where to start looking? I'm not even sure how I'd get the dynamically updating text on the console, but I've seen you do so, so I know it's possible. Also, that looks like a draft PR, so I can't tell whether it would be usable on a released TruffleRuby yet.

Also: can you tell me if there's significant overhead to tracking that? It'd be great to know when TruffleRuby is done warming up, but if monitoring warmup would give worse performance numbers then we shouldn't do that for benchmarking.
If you tell me a way you want to access the data I'll try to give you that interface. For example, would a call from within the Ruby code be useful, such as a `percent_warmed_up` call? There is an overhead - I'm trying to significantly reduce it at the moment.
Being able to get percent_warmed_up from Ruby would do exactly what we want, yeah. The right model for us would be calling from inside the harness (the worker process running in TruffleRuby) to get the current warmed-up percentage.

The current plan is to keep iterating until we're past some minimum number of iterations, and then stop at a maximum number of iterations or total warmup time. This would add a third (TruffleRuby-only) condition that would terminate warmup and start the timed iterations. And that all happens inside the harness process.

At a minimum we could use the percent_warmed_up metric to get rough early results in TruffleRuby faster, and then run without the overhead for final results. That shouldn't be hard to do, and would still speed up a lot of our runs significantly.
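The stopping rule described above could be expressed as a single predicate. This is a sketch only: `percent_warmed_up` stands in for whatever interface TruffleRuby eventually exposes (it is a hypothetical name, not a real API), and the thresholds are assumptions.

```ruby
# Decide whether to stop warming up and begin timed iterations.
# Conditions, in order:
#   1. Never stop before a minimum iteration count.
#   2. Always stop at a maximum iteration count or time budget.
#   3. (TruffleRuby-only, hypothetical) stop early if the engine
#      reports itself sufficiently warmed up.
def stop_warmup?(iters:, elapsed:, min_iters: 20, max_iters: 200,
                 max_secs: 600, percent_warmed_up: nil)
  return false if iters < min_iters
  return true  if iters >= max_iters || elapsed >= max_secs
  # Third condition: trust the engine's own warmup estimate, when available.
  !percent_warmed_up.nil? && percent_warmed_up >= 95.0
end
```

On CRuby or YJIT the `percent_warmed_up` argument would simply stay `nil`, so the harness falls back to the iteration/time limits alone.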
FYI, something I did is a harness which tries to minimize the median absolute deviation (a robust estimator of variability), slowly relaxing the acceptance threshold so we eventually finish benchmarking even if the variance between iterations never fully stabilizes. This is an approach I used a while ago in the perfer benchmark harness: https://github.com/eregon/perfer/blob/98c4b23aa1884b3bbc45cde377c3d9c2f6260f6a/lib/perfer/job/iteration_job.rb#L138 and https://github.com/jruby/perfer/wiki/Methodology (this one computes the median absolute deviation of the last 10 iterations).

I've also tried that approach on yjit-bench benchmarks (https://github.com/eregon/yjit-bench/blob/harness-warmup-20211024/harness-warmup/harness.rb), computing the median absolute deviation of all iterations (since it's a robust estimator), and it seems to work fairly well. As we all know, it's hard to automatically detect warmup, even though when manually looking at iteration times (or warmup plots) it doesn't feel that hard (e.g., if the last 10 runs are all extremely close, it's a good sign it warmed up, and running more iterations and noticing they are stable increases that confidence).
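For readers unfamiliar with the statistic, the heuristic above can be sketched in a few lines. The 10-iteration window and the relative threshold are assumptions modeled on the perfer methodology linked above, not the exact code from either harness.

```ruby
# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Median absolute deviation: a robust estimator of variability,
# far less sensitive to outlier iterations than standard deviation.
def mad(values)
  m = median(values)
  median(values.map { |v| (v - m).abs })
end

# Warmup heuristic: consider the benchmark stable when the MAD of the
# last `window` iteration times is within `threshold` (here 1%) of
# their median. Window size and threshold are illustrative choices.
def stable?(times, window: 10, threshold: 0.01)
  return false if times.size < window
  recent = times.last(window)
  mad(recent) <= threshold * median(recent)
end
```

Because the median absorbs outliers, a single deoptimisation spike in the window does not necessarily reset the stability estimate, which is exactly why MAD is attractive for noisy JIT warmup curves.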
It's going to be very hard to do this in a robust way across all Rubies, I agree. One difficulty is that, as you say, it feels quite easy when looking at it. Yet "last ten iterations were all very close" is exactly the kind of heuristic where TruffleRuby often surprises me with its later behaviour. If we were going to benchmark TruffleRuby more regularly, I'd want some level of visibility into its internal state.

For now we're handling it with fixed numbers/times for iterations, recording the variability and the warmup iterations, and not benchmarking TruffleRuby regularly. Its internal state is so much more complex and unpredictable than CRuby, MJIT or YJIT that it normally makes more sense to exclude it from regular runs rather than confidently make wrong assertions about it. The approach you mention seems solid, well-thought-out and (in the case of TruffleRuby) still error-prone. When/if I hear from Chris that the GraalVM Thermometer interface is available, I'll revisit including TruffleRuby.

I'll also probably benchmark TruffleRuby again when we next submit results for a paper or give a conference talk where it's relevant -- and in that case we'll handle it by warming up a lot and doing a lot of manual inspection to make sure there are no obvious flaws. That's not an approach we can scale to twice-daily CI runs, though. The VMIL results took a very long time to collect, primarily because of how much warmup we needed to do for the TruffleRuby results.
At this point I don't think we need to do these things, including variable warmup. If we revisit this later we can open a new bug. |
There are a few things we can do about that. For instance:
Some of these things break compatibility with the yjit-bench harness. It would be nice to retain compatibility, so I'll try not to do anything to gratuitously break that without some correspondingly useful feature that's currently not supportable.