Benchmarking should be faster #30

Closed
@noahgibbs

Description

Benchmarking runs currently take longer than they need to. There are a few things we can do about that. For instance:

  • Several benchmarks do setup stages that really only need to happen once (e.g. Railsbench doing bundle install and rails db:migrate). It should be possible to run the benchmark once with setup and then skip the setup on subsequent runs. This will require an API change to benchmark.rb, but that change could be optional -- any benchmark.rb that always does setup would still be correct, just slower than necessary (see the setup sketch after this list).
  • A lot of benchmarks do iterations-inside-iterations, where we repeat the work many times and call it one iteration (e.g. psych-load's simple "100.times" inner loop; Railsbench uses random routes, but also does something similar). That helps the command-line UI a bit, since it's not really designed for very short (e.g. 5ms) benchmarks. It also reduces the overhead of the harness's benchmarking loop, which likewise isn't really designed for very short benchmarks. But we can fix this with something resembling Chris Seaton's continuous-adaptive iteration timing (Diagnostic harnesses for continuous-adaptive and BIPS yjit-bench#32) if we want to, and then the harness gets a lot more control over how long to run and when to stop (see the batch-sizing sketch below).
  • Right now we use a fixed number of warmup iterations for any benchmarks we run. There are easily fifty different ways we could improve how we handle warmup, and quite a lot of them result in a faster overall set of runs. For example, we could warm up adaptively until the warmup graph is sufficiently flat in slope and low in variance (see the adaptive-warmup sketch below); or we could track a "reasonable" level of warmup per-benchmark and per-Ruby, which would allow quick warmup on non-TruffleRuby runs and medium-quick warmup on TruffleRuby runs where only certain benchmarks were used. Right now, mixing (e.g.) activerecord and psych-load for the same number of iterations means psych-load is warmed up vastly more than it needs to be.
  • Related: we don't support running a benchmark until some level of stability is reached, only for a fixed amount of time or number of iterations. This has to be done carefully to avoid bias, but we can do something reasonable here, I think. Extra-fancy would be running two sets of benchmarks until we've proven (to within some margin) that they are the same or different, but even something simple would be better than where we are now.
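
Here's a minimal sketch of the optional setup split from the first bullet. The method names (setup, run_iteration), the SKIP_SETUP environment variable, and the toy YAML workload are all illustrative, not the existing yjit-bench or benchmark.rb API:

```ruby
require "yaml"

# Hypothetical benchmark.rb layout (names are illustrative, not the current
# API): one-time setup is separated from the work we actually want to time.
def setup
  # One-time work we don't want to repeat on every run of a series -- for
  # Railsbench this would be bundle install and rails db:migrate. Here it's
  # just a toy data file so the sketch is runnable.
  File.write("bench_data.yaml", { "items" => (1..1_000).to_a }.to_yaml)
end

def run_iteration
  YAML.load_file("bench_data.yaml")
end

# Harness side: do setup on the first run of a series, skip it afterwards.
# A benchmark.rb that defines no setup method and does everything inline is
# still correct; the harness just can't skip anything for it.
setup if ENV["SKIP_SETUP"] != "1" && respond_to?(:setup, true)
run_iteration
```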
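
For the second bullet, a rough sketch of adaptive batch sizing in the spirit of benchmark-ips: the harness decides how many inner iterations make up one timed sample, instead of the benchmark hard-coding "100.times". The target batch time and the calibration heuristic are made-up starting points, not anything the harness does today:

```ruby
# Pick how many inner iterations make up one timed sample so each sample
# lasts roughly TARGET_BATCH_TIME, rather than hard-coding an inner loop
# in the benchmark itself.
TARGET_BATCH_TIME = 0.1 # seconds per timed sample (made-up target)

def monotonic_time
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

def calibrate(&work)
  # Grow the batch until it's long enough to time reliably, then scale it
  # to the target batch time.
  iters = 1
  loop do
    t0 = monotonic_time
    iters.times(&work)
    elapsed = monotonic_time - t0
    return [(iters * TARGET_BATCH_TIME / elapsed).ceil, 1].max if elapsed > 0.01
    iters *= 2
  end
end

def run_samples(num_samples, &work)
  batch = calibrate(&work)
  Array.new(num_samples) do
    t0 = monotonic_time
    batch.times(&work)
    (monotonic_time - t0) / batch # mean per-iteration time for this sample
  end
end

# A very short workload gets a large batch automatically; a slow benchmark
# would get a batch of 1.
samples = run_samples(10) { Math.sqrt(12_345.678) }
puts "median per-iteration time: #{samples.sort[samples.size / 2]}s"
```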
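
And for the third and fourth bullets, one possible shape for adaptive warmup: keep running warmup iterations until a recent window of timings is flat in slope and low in variance, with a hard cap so a noisy benchmark can't warm up forever. The same kind of stability check could later gate the timed phase too. The window size and thresholds are invented placeholders, not tuned values:

```ruby
# Adaptive warmup sketch: run warmup iterations until the last WINDOW timings
# are flat (near-zero least-squares slope) and low-variance, or until we hit
# a hard cap. All thresholds are invented placeholders.
WINDOW      = 20     # iterations considered "recent"
MAX_WARMUPS = 1_000  # give up and start measuring after this many
SLOPE_LIMIT = 0.01   # |slope| per iteration, as a fraction of the mean time
CV_LIMIT    = 0.02   # coefficient of variation (stddev / mean)

def stable?(times)
  return false if times.size < WINDOW
  recent = times.last(WINDOW)
  mean = recent.sum / recent.size
  cv = Math.sqrt(recent.sum { |t| (t - mean)**2 } / recent.size) / mean

  # Least-squares slope of iteration time vs. iteration index.
  xs = (0...recent.size).to_a
  x_mean = xs.sum.to_f / xs.size
  slope = xs.zip(recent).sum { |x, y| (x - x_mean) * (y - mean) } /
          xs.sum { |x| (x - x_mean)**2 }

  cv < CV_LIMIT && (slope / mean).abs < SLOPE_LIMIT
end

def warm_up(&work)
  times = []
  until stable?(times) || times.size >= MAX_WARMUPS
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    work.call
    times << Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
  end
  times.size # how much warmup this benchmark/Ruby combination actually needed
end

# Example: a steady workload should stabilize in roughly WINDOW iterations.
puts "warmup iterations used: #{warm_up { 10_000.times { |i| i * i } }}"
```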

Some of these things break compatibility with the yjit-bench harness. It would be nice to retain compatibility, so I'll try not to break it gratuitously unless doing so enables some correspondingly useful feature that isn't currently supportable.
