
YJIT: Improved JIT trigger heuristics #520

Closed · maximecb opened this issue Jul 5, 2023 · 7 comments
Labels: enhancement (New feature or request)

maximecb commented Jul 5, 2023

At the moment we use a very simple heuristic, namely a call threshold, to decide when to start compiling. This seems to work fine for all our benchmarks, and also for SFR. However, with Core, which is a much bigger application, my understanding is that @casperisfine had to do a lot of manual tuning to avoid generating too much code. This prompts me to want to re-examine our JIT "trigger heuristic" and wonder if we could do better.

I asked around on Twitter and some people said "why don't you just compile everything?". I think that works fine for a smaller application, but with a large application there are two issues. The first is how fast we compile things: we can't start compiling everything immediately, otherwise we're going to hurt response times. The second is that Core is actually huge. There's probably a "long tail" of methods that are each called only a small percentage of the time, but over time they accumulate calls and will eventually hit the call threshold. Ideally, we want to start by compiling the hottest (most called) execution paths.

I've been doing some literature searching to see what kinds of heuristics other JITs use. The good news is that most of the time they are relatively simple. Pretty much everyone uses a call threshold, but they often combine it with additional heuristics. The less good news is that I couldn't really find a survey of such heuristics, or a paper with "real-world" industrial data. This topic is a bit understudied; most of the papers are from over 20 years ago, and realistically we'll have to do our own experiments and come to our own conclusions.

Method size

The Simplest Heuristics May Be the Best in Java JIT Compilers:
http://cse.unl.edu/~witty/research/repository/upload/25.pdf

This paper suggests taking into account the method size in addition to a call threshold. This is something that would be easy for us to do. The main idea is that bigger methods can have a lower call threshold. Presumably bigger methods have more complex logic, they execute for longer, they might even contain loops. They're also probably better places to start compiling from. So this might be a good place to start.
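A minimal sketch of what that could look like, in Ruby. The constants and the size-to-threshold scaling below are made-up illustrations, not YJIT's actual parameters (the real logic would live in YJIT's Rust code):

BASE_CALL_THRESHOLD = 30
MIN_CALL_THRESHOLD  = 5

# Larger ISEQs (more instructions) get a lower threshold, clamped so that
# small methods still need BASE_CALL_THRESHOLD calls before compiling.
def effective_threshold(iseq_size)
  scaled = BASE_CALL_THRESHOLD - (iseq_size / 20)
  [scaled, MIN_CALL_THRESHOLD].max
end

def should_compile?(call_count, iseq_size)
  call_count >= effective_threshold(iseq_size)
end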

Loop counters

Java JIT compiler inlining
https://techblug.wordpress.com/2013/08/19/java-jit-compiler-inlining/

"To determine which methods are hot, the JVM keeps counters to see how many times a method is called and how many loop iterations it has executed."

This blog post mentions something I've seen come up many other times. Namely, most JITs don't just use a call counter; they also count backwards branches, incrementing the same counter. This accounts for how long methods spend inside loops. It would be smart for us to implement this. However, it may also be tricky considering that Ruby loops are usually method calls.

We might be able to approximate this by scanning through an ISEQ and seeing if it contains calls that look like they could be loops. Then we could increment the call counter by a larger amount when these loop-containing methods are called? Ideas welcome.
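A rough sketch of that approximation, assuming we can get a flat list of (opcode, callee) pairs out of an ISEQ; the scanning interface, callee list, and increment values here are hypothetical:

LOOPY_CALLEES = %i[each map times upto loop].freeze

# instructions: array of [opcode, callee_name] pairs from a (hypothetical)
# ISEQ scan, e.g. [[:putself, nil], [:send, :times]]
def looks_loopy?(instructions)
  instructions.any? do |opcode, callee|
    opcode == :send && LOOPY_CALLEES.include?(callee)
  end
end

# Methods that appear to contain loops count 10x per call, so they reach
# the compilation threshold sooner.
def counter_increment(instructions)
  looks_loopy?(instructions) ? 10 : 1
end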

JavaScriptCore also uses loop counters. This is an excellent blog post full of technical details about JSC:
https://webkit.org/blog/10308/speculation-in-javascriptcore/

Stack sampling

Graal uses stack sampling to determine which methods are hot:
http://lafo.ssw.uni-linz.ac.at/papers/2012_VMIL_Graal.pdf

AFAIK modern JVMs do this as well, but I don't love this because it's asynchronous, and it's also non-deterministic. It would make our JIT compiler less predictable IMO. Without wanting to sound mean, I don't necessarily think we should imitate Graal's warmup strategy.

  • Disadvantage: this approach is non-deterministic
  • We may want to consider stack depth as well?
    • How deep in the stack is this function when it gets called?
  • Execution counting section of this blog post:
    => what is the proper term for this?

Looking at the call stack

The Java HotSpot (TM) Server Compiler:
https://www.researchgate.net/publication/220817735_The_Java_HotSpot_Server_Compiler/link/5978f707aca27203ecc63127/download

"The runtime environment uses adaptive optimization
to focus compilation efforts on performance critical
methods [HU96]. These methods are identified using
method−entry and backward−branch counters with
additional heuristics that investigate the caller of a
triggering method. When the combined method−entry
and backward−branch counters for a method exceed
the CompileThreshold, the runtime system determines
which method to make the root of a compilation by
examining the call stack. If the caller frequently in−
vokes the callee, the recompilation policy may decide
to start compiling at the caller. This upwards traversal
may continue multiple times relying upon the compiler
to inline the path to the triggering method."

This seems to be a bit more tricky because we'd have to determine how often method X calls Y. Though in theory, we could look at the call counters for X and Y, and if both counts are similar, we could start compiling from X instead of Y. This could be relatively simple to implement.
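As a rough sketch of that comparison (the 0.75 similarity ratio below is an arbitrary illustration, not a tuned value):

SIMILARITY = 0.75

# When a callee hits its threshold, decide whether to root compilation at
# the caller instead, based on how close the two call counts are.
def compilation_root(callee_count, caller_count)
  return :callee if caller_count.nil? || callee_count.zero?
  (caller_count.to_f / callee_count) >= SIMILARITY ? :caller : :callee
end

compilation_root(1000, 900)  # => :caller
compilation_root(1000, 50)   # => :callee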

Another possibility is to take stack depth into account. In theory, to keep the percentage of time spent in YJIT high, it's to our advantage to start compiling from methods that are at the bottom of the stack rather than from leaf methods, I would think. Though, to some extent, taking method size into account may do some of that for us as well. If we give bigger methods more weight (effectively a lower threshold), then that should minimize how often we start compiling from leaves.

Decay rate

Some people have suggested the use of a decay rate. This means periodically decrementing or resetting call counters. That has the benefit that less hot methods may never get compiled. The tricky part is that I'm not sure how often this decrementing should happen. Should there be a parallel thread that goes and decrements the call counters? How do we go about deciding what rate this happens at?
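One way to avoid a parallel thread would be to tie the decay cadence to execution rather than time, e.g. run a decay pass every N method calls. A minimal sketch, with made-up numbers:

DECAY_INTERVAL = 100_000  # total method calls between decay passes

# counters: Hash mapping each ISEQ to its current call count.
# Halving (rather than resetting) keeps some memory of past hotness.
def maybe_decay(total_calls, counters)
  return unless (total_calls % DECAY_INTERVAL).zero?
  counters.transform_values! { |count| count / 2 }
end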

Other ideas

It seems to me that we should be able to keep track of the total number of ISEQs fairly easily (does CRuby already have a way to do this?). This can give us a proxy for the size of the application. Possibly, we could automatically adjust the call threshold as a function of how many ISEQs are loaded. Tiny applications such as benchmarks should have low call thresholds. Massive applications like Core should have much higher values (what are we currently at, 200?). Other applications could be somewhere in between.
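For illustration, the bucketing could be as simple as this (the cutoffs and threshold values below are placeholders, not measured numbers):

# Pick a call threshold from the number of loaded ISEQs, used as a proxy
# for application size.
def call_threshold_for(iseq_count)
  case iseq_count
  when 0...1_000      then 10   # tiny apps and benchmarks
  when 1_000...50_000 then 50   # mid-sized apps
  else                     200  # very large apps
  end
end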

We could also try to force bottom-up compilation. In theory it should be easy to know how many call frames deep in the Ruby stack we are. We could have a limit to prevent compiling methods that are too deep (more likely to be leaves). We start with a very low limit, and then gradually increase it over time. This would effectively force compilation to start with methods that are closer to the bottom of the stack, and gradually allow methods deeper in the stack to be compiled.
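A small sketch of such a depth gate (the initial limit, step, and cadence are arbitrary placeholders):

class DepthGate
  def initialize(initial_limit: 2, step: 2, every: 50)
    @limit = initial_limit  # max stack depth currently eligible for compilation
    @step  = step           # how much to raise the limit each time
    @every = every          # raise the limit after this many compilations
    @compiled = 0
  end

  def allow_compile?(stack_depth)
    stack_depth <= @limit
  end

  def record_compilation
    @compiled += 1
    @limit += @step if (@compiled % @every).zero?
  end
end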

Summary / Next Steps

As a first step, I would like to add a counter to count how many JIT method entry stubs we actually hit. I think this may prove to be an interesting metric.

I think it would also be nice to have an easy way to know how many ISEQs are loaded. Might be even better if we could count, say, how many ISEQs have been called at least once. Maybe we could reuse our existing call counter mechanism for this. The first time a method gets called, we could increment the count of how many methods were called at least once. The trick is how to do this without affecting interpreter performance. Maybe we could implement that by having a call threshold for each ISEQ, and start with a value of 1. Then, once that value gets hit, instead of immediately triggering JIT compilation, we increment the count of methods called at least once, and then set the actual call threshold value to something higher?
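A sketch of that two-stage counter idea; the struct, the threshold value, and the global counter here are hypothetical illustrations rather than YJIT's actual data structures:

IseqEntry = Struct.new(:name, :call_threshold)

REAL_CALL_THRESHOLD = 30
$iseqs_called_once = 0

# Every ISEQ starts with call_threshold = 1, so the first "threshold hit"
# only records that the ISEQ has run at least once and arms the real
# threshold; only a later hit would actually trigger compilation.
def on_threshold_hit(iseq)
  if iseq.call_threshold == 1
    $iseqs_called_once += 1
    iseq.call_threshold = REAL_CALL_THRESHOLD
    false  # not compiled yet
  else
    true   # hand off to the JIT here
  end
end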

Another step would be to think about how to incorporate method size into our JIT heuristic. We want to give higher priority to larger methods (longer ISEQ lengths). The "Method size" paper linked above discusses that, but it wasn't fully clear to me how it works.

Then we need to think about experiments and how we would go about evaluating this. It might be kind of awkward to run experiments directly on Core? At the very minimum our logic needs to make sense for all of our benchmarks. Then we can validate on SFR. For Core we would need to do back-porting.

maximecb added the enhancement label on Jul 5, 2023
casperisfine commented

Decay rate

A somewhat similar idea I had was to increase the call threshold as more methods are compiled, e.g. if you start at call-threshold=100, once you start using say 10% of exec-memory, you double the call threshold, etc.

Either way, I'd recommend against a "wall clock" based decay method, otherwise the result would be wildly different when you are over-provisioned vs under-provisioned. A web worker may spend a very significant amount of time idle, just blocked on IO, waiting for work. So ISEQs compiled/executed is IMHO a better way to define the "cadence" than elapsed time.
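A tiny sketch of the exec-memory-based scaling described above (the 10% step and the doubling factor are just the numbers from the example):

# Double the call threshold for every additional 10% of the executable
# memory budget that has been used.
def adjusted_threshold(base_threshold, code_bytes_used, exec_mem_bytes)
  used_fraction = code_bytes_used.to_f / exec_mem_bytes
  doublings = (used_fraction / 0.10).floor
  base_threshold * (2 ** doublings)
end

adjusted_threshold(100, 7 * 2**20, 64 * 2**20)  # => 200 (about 11% of 64 MiB used)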


maximecb commented Jul 5, 2023

A somewhat similar idea I had was to increase the call threshold as more methods are compiled, e.g. if you start at call-threshold=100, once you start using say 10% of exec-memory, you double the call threshold, etc.

This could make sense as a way to scale the call threshold with application size 🤔

Either way, I'd recommend against a "wall clock" based decay method, otherwise the result would be wildly different when you are over-provisioned vs under-provisioned. A web worker may spend a very significant amount of time idle, just blocked on IO, waiting for work. So ISEQs compiled/executed is IMHO a better way to define the "cadence" than elapsed time.

I agree. In general I much prefer to keep everything deterministic. I like the idea of having repeatable and predictable warmup behavior.


k0kubun commented Jul 5, 2023

Thank you for writing this up 👍

my understanding is that @casperisfine had to do a lot of manual tuning to avoid generating too much code. This prompts me to want to re-examine our JIT "trigger heuristic" and wonder if we could do better.

To provide more context, I think we had two different problems with the initial parameters in Core:

  1. --yjit-call-threshold=1000 was too big to warm up the application, given that a unicorn worker in Core only serves 1000~2000 requests before dying.
  2. --yjit-exec-mem-size=64 was too small. Overly frequent code GC incurs too much overhead and limits ratio_in_yjit.

Improved JIT trigger heuristics would be particularly helpful for (1). For such short-living processes, we need the threshold to be as small as possible while avoiding compilation of initialization code and code size bloat.

No matter how good the heuristics are, the size of the application might still be a problem for (2). I think we should bump the default to 128 (compatible with ruby#7671) or 256 (what Core ended up using).


maximecb commented Jul 5, 2023

while a unicorn worker in Core only serves 1000~2000 requests before dying.

That's really not a lot. Why do Core workers serve so few requests? Is it because of constant deployments @casperisfine ?

What call threshold did you end up setting for Core?

No matter how good the heuristics are, the size of the application might still be a problem for (2). I think we should bump the default to 128 (compatible with ruby#7671) or 256 (what Core ended up using).

I'd be in favor of bumping the default to 128. The good thing is that really small applications will not use that much memory.


k0kubun commented Jul 5, 2023

That's really not a lot. Why do Core workers serve so few requests? Is it because of constant deployments @casperisfine ?

The number comes from the fact that Jean picked a unicorn worker with 1840 requests when he looked at YJIT stats, assuming it was a relatively old worker. I added a unicorn request count graph to Jean's dashboard for Core, and it often averages 2000~4000 requests during work hours.

I also suspect YJIT-enabled processes have even lower numbers than 2000~4000, because I can see they get killed more often than YJIT-disabled processes by the fat unicorn killer. I have opened a PR to add YJIT tags to the metric so that we can check the count specifically for YJIT-enabled processes.

We'd need to try reforking, relax the kill threshold for YJIT-enabled processes, and/or wait for Ruby 3.3 to improve the memory usage.

I'd be in favor of bumping the default to 128. The good thing is that really small applications will not use that much memory.

👍 will file a PR.

casperisfine commented

Yeah, during work hours the main thing limiting worker age is definitely deploys, but we do hit the memory limit disproportionately often with YJIT, so we should be able to improve worker lifetime significantly with reforking.

Another reason for exit, though a rarer one, is a request timing out. But here too reforking would help, as the reforked worker would start warmed up rather than cold, so it would effectively already have processed a large number of requests instead of 0.

I don't know for sure what the usual number of requests is, so 👍 to tagging that metric.


eregon commented Jul 6, 2023

As C. Humer pointed out on Twitter, Truffle, and therefore TruffleRuby, uses this: https://github.com/oracle/graal/blob/master/truffle/docs/TraversingCompilationQueue.md, which is based on HotSpot's compilation queue.
So it's still call + loop count, but the threshold changes over time, and it's smarter about what is more important to compile sooner in the compilation queue.

Before that, Truffle used call count + loop count thresholds. Truffle always counted loop iterations IIRC, and the weight is just 1/1, i.e., loop iterations and calls are simply added together per method. The threshold changed quite a bit once we introduced 2-tier Truffle compilation.

YJIT compiles synchronously though, right? So I guess there's no compilation queue, and it needs to decide as soon as the threshold/condition is reached.


Graal uses stack sampling to determine which methods are hot

The paper says on page 3:

The Graal VM uses invocation counters to detect hot methods, similar to the HotSpotTM VM, on which it is based.

So it seems Graal does not use stack sampling (but I haven't read the full paper recently).
IIRC Jikes used stack sampling; HotSpot doesn't (or rarely, maybe for decay purposes).


However, it's maybe also tricky considering that Ruby loops are usually method calls.
We might be able to approximate this by scanning through an ISEQ and seeing if it contains calls that look like they could be loops.

The block given to the loop method will still be executed N times, so that will reach the threshold naturally.

The Ruby methods which loop have a while loop inside, so those can count iterations naturally.
Of course if the loop is in a C method then it's trickier for YJIT.
(In TruffleRuby whether the loop is in Ruby or Java we count it via LoopNode.reportLoopCount()).

It can be valuable to compile the caller of the method containing the loop if you have inlining, as this notably makes it possible to avoid allocating the block passed to it, inline everything, and compile it as if it were a single method. Example:

def m(word)
  5.times do
    p word
  end
end

m("hello")

The block or Integer#times becomes hot first, but the really interesting one to compile is m, inlining Integer#times and the block:

def m(word)
  _i = 0;
  while _i < 5;
    _i += 1
    p word
  end
end

I talk about this in https://rubykaigi.org/2023/presentations/eregontp.html#day2
