Ability to discard/free machine code (code GC) #87
@XrXr feel free to add your thoughts here. Let's take some time and try to think of creative solutions to exit the running code in a way that's performant if possible. Presumably, all executing threads/ractors will eventually hit an interrupt check. There might be some way to catch them at that point, make them side-exit to the interpreter, and count how many threads have exited vs how many threads we have in total. Presumably, the interpreter already handles the issue of tracepoints modifying code that is executing in ractors. How do they do it? It might make some sense to implement code GC first, separately from tracepoint code invalidation, though there could be some design constraints imposed by tracepoints as well.
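The counting scheme sketched above (let every thread side-exit at its next interrupt check, then wait until the exited count matches the thread total) can be modeled in plain Ruby. Everything below is illustrative; none of these names are actual YJIT/CRuby internals.

```ruby
# A toy model of the proposed scheme: worker threads poll a flag at their
# simulated "interrupt checks" and side-exit; the invalidating thread
# waits until every thread has exited before touching the machine code.
class ExitBarrier
  def initialize(total)
    @total = total
    @exited = 0
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @exit_requested = false
  end

  # Invalidator: flip the flag, then sleep until everyone has side-exited.
  def request_exit_and_wait
    @mutex.synchronize do
      @exit_requested = true
      @cond.wait(@mutex) until @exited == @total
    end
  end

  # Workers poll this at every simulated interrupt check.
  def exit_requested?
    @mutex.synchronize { @exit_requested }
  end

  # Workers call this once, as they "side-exit to the interpreter".
  def note_exit
    @mutex.synchronize do
      @exited += 1
      @cond.signal
    end
  end

  def exited
    @mutex.synchronize { @exited }
  end
end

barrier = ExitBarrier.new(4)
workers = 4.times.map do
  Thread.new do
    sleep(0.001) until barrier.exit_requested? # "running jitted code"
    barrier.note_exit
  end
end
barrier.request_exit_and_wait # only safe to free machine code after this
workers.each(&:join)
puts barrier.exited # 4
```

The real problem is harder because, as discussed below, code can remain on native call stacks (C calls, Fibers) even after every thread has passed an interrupt check.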
At the moment, when we invalidate a block, we don't immediately reset the code belonging to the block. Consider:

```ruby
class Foo
  def adversary
    ::Foo.remove_method(:adversary)
  end
end
```

The output for the call to `remove_method` looks something like:

```
; push a new control frame
; save yjit registers
call rb_mod_remove_method
; restore yjit registers
; (1) pop control frame
; write the return value
```

rb_mod_remove_method() calls invalidate_block_version() on the block for Foo#adversary. If we reset the output to a bunch of int3 instructions at that point, we would crash once rb_mod_remove_method() returns to (1). I bring up this example to show the more general issue that the lifetime of our output code does not end when we invalidate it. The code may still be on the native call stack, and functions may return to it. Related to this lifetime problem, a similar situation exists with branch stubs: one thread can be inside branch_stub_hit() holding a branch_t pointer while another thread invalidates the block that owns the branch.
This is an abridged version of the events for clarity, but you get the idea. Immediately freeing branch entries in invalidate_block_version() is unsound because some other threads might be running code that references the branch entry. The HotSpot JVM has to deal with similar lifetime problems when it comes to native code, and here is my understanding of what it does:
cc @chrisseaton to keep me honest about HotSpot. We already do (1), and we can get (2) and (3) by letting the GC manage branch entries. Moving branch entries to the GC heap won't change our native code output at all, but it'll use a bit more memory, because each GC-managed object occupies a full 40-byte slot.
@XrXr Thanks for taking the time to detail all this. This is probably the biggest flaw in our current implementation. I think we should address code GC issues in separate PRs from the call threaded refactoring. I think we should start by creating a collection of failing tests representing all the problems that we can find, or adversarial tests that should not fail. You've identified two:
You suggest having the Ruby GC manage the machine code lifetime. When it comes to freeing machine code, the challenge I see is that all the cores/ractors have to be outside of that machine code when it gets freed. In fact, they have to be outside of any machine code that could jump to the machine code that gets freed. Also, ideally, we want to free/overwrite all the machine code at once, because we rely on a scheme where we essentially do bump-pointer allocation for machine code.

Is it fair to assume that when the GC is running, everyone has exited to the interpreter, because all ractors must have taken an exit caused by an interrupt? If we can assume that at GC time everyone has exited the machine code, then we could take that opportunity, if a certain condition is met (inline/outlined code heap more than 75% full?), to go and remove all entry points and free/clear all our data structures. We'd have to scan the stack and check whether anyone still has a pointer into the machine code.
I'm not sure what you mean by this. Can you go into more detail? In the situation I described, the branch_t pointer is passed into branch_stub_hit() as an argument. We can't really set a local variable from a different thread. Assuming we could set the pointer from a different thread, how long does the storage for the pointer live? I guess it's possible to do a bespoke ref-counting setup.
Not necessarily. The GC scans the native stack, so as long as there are any branch_t pointers on the stack, the branch_t would be considered alive. When we invalidate and free the block that holds a branch_t pointer, the native call stacks of the various threads become what keeps the branch_t alive. When all the C functions that have the branch_t pointer as a local/argument return, the GC can then decide that the branch_t is dead.
No. The ractor thread that runs the GC might have output code on the stack. For example, when we call a C method and that C method allocates a Ruby object, the allocation might enter the GC.
One possible solution is to allocate one GC object that represents a lease on the entire executable region and have all the objects that own machine code reference it. When the GC decides that the lease object is dead, we know that everyone has exited. We can kick the system into a mode where we prevent future entries, and the GC will eventually decide that everyone has exited.
Having one single chunk of executable region might be a bad fit for Ruby because of Fibers. Fiber users can suspend execution in the middle of a C method like this:

```ruby
fiber = Fiber.new do
  # imagine we compiled this whole Ruby block
  puts "In Fiber 1"
  Fiber.yield # in a real app the stack might have multiple JIT frames when this call happens
  puts "In Fiber 1 again"
end

fiber.resume
loop do
  # other logic. Maybe never resume `fiber` again, but `fiber` is alive.
end
```

So the waiting period for all pieces of code to exit can be arbitrarily long. Yes, the app has a leak in this case, but maybe we shouldn't add to its problems by barring it from the JIT. I think the main reason for having just a single chunk of memory is simplicity, but it seems that because of multi-threading and other factors, it doesn't really make our lives easier.
That pointer comes from the stub itself. We could essentially pass a pointer to a struct we store in the stub, which contains a pointer to the branch_t.
I see, so it does a conservative GC scan of the native stack, that makes sense.
Another way to solve the problem, in that case, is making sure that we don't return to dead code. There are a few ways we could achieve that.
That's true. When researching ARM64, I actually ran across the fact that it can only do short relative jumps efficiently, ±4KiB IIRC. Longer relative jumps take more space to encode. If we want to support this, we might actually be forced to have smaller code allocation pools on a per-method basis, or a system with small code "pages". This also seems to point towards having the GC manage code ownership and collection. Not sure what the best way to go about this is 🤔 Presumably the code pages/chunks would belong to individual methods and die with them.
@tenderlove and I sketched some code to allocate "code pages" out of a pool. Migrating to using that seems fairly straightforward. Collecting all the code pages for an ISEQ when it dies also seems not too difficult. However, there's a use case that we aren't addressing with that: sometimes we probably do want to discard all the code and start from scratch. For example, if we enable tracepoints and invalidate a ton of code, we could end up with a lot of dead machine code taking up space for no reason. If this machine code can't be collected until the respective ISEQs are completely unreachable, that's bad, because it seems obvious that a lot of global classes will remain reachable until the program is done running.
I think with branches/blocks referencing the code page instead of the iseq, we could throw away most of the code correctly by doing a full GC run.
When we invalidate blocks, we remove them from the iseq; once that happens, the only thing preventing collection would be running frames. So code pages can die sooner than iseqs in this setup. It seems that the GC handles this case naturally. I think doing a full GC run covers the use case you hinted at?
I think you're right. If the blocks point to code pages, then we can have a more granular system for collecting dead pages than if ISEQs own the pages 👍 However, multiple blocks can reside in a single code page. That seems to imply that we need a GC object that "owns" each individual code page. How do you go about creating an object that belongs to the GC? Do you make sure to store an RBasic as the first field and call the appropriate functions to mark/traverse it?
There is a public C API for creating objects that belong to the GC; see line 1053 and lines 319 to 323 in cdc3115.
In discussions this week, we identified two challenges with code GC.
Potential solution for (1): a different approach would be to use the return address that's already on the native stack. EDIT: it does still require writing jit_return before calling into C functions; we can't reliably get the return address of a native function while it's running. For (2), we can address it by adding fields to the relevant structs.
I think using the approach above could work. To expand further on this: right now each iseq has an associated list of block_t structures. We probably want a GC marking strategy that more accurately reflects the fact that code (blocks) can die before the iseq does. This week we're finishing the paper, and I'll be away Monday & Tuesday, but I want to get back to this after that and try to think of a better solution.
@XrXr I had some thoughts for the code GC: I think we could allocate a chunk of executable memory all at once, something like 64MiB, and then split that into code pages, say 8KiB each. That leaves us with a total of 8192 code pages. Given that this number is fairly small, we could use a simple bitmap scheme to keep track of which code pages are live. That is, we start with all zeroes, and then, as we traverse the live blocks during marking, we set the bit for each code page we reach.

When the GC run is completed, we'd need to go through the bitmap, find all the pages that were never reached, and add them back to a free list that we manage. So my question is: is there a GC callback to do that? Also note that this would only work for a full GC run. We couldn't collect code pages in a minor GC.
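As a sanity check of the arithmetic and the scheme, here is a toy Ruby model of the bitmap idea; the class and method names are made up, not actual YJIT code.

```ruby
# 64 MiB region split into 8 KiB pages gives 8192 pages, tracked with
# one bit per page. Ruby Integers make a cheap arbitrary-width bitmap.
REGION_SIZE = 64 * 1024 * 1024
PAGE_SIZE   = 8 * 1024
NUM_PAGES   = REGION_SIZE / PAGE_SIZE # 8192

class CodePageBitmap
  def initialize
    @bits = 0 # all zeroes: nothing marked yet
  end

  # Called while traversing live blocks during a full GC run.
  def mark(page_index)
    @bits |= (1 << page_index)
  end

  def marked?(page_index)
    @bits[page_index] == 1
  end

  # After marking, collect the indexes of unreached pages for the free
  # list, then reset the bitmap for the next full GC cycle.
  def sweep
    freed = (0...NUM_PAGES).reject { |i| marked?(i) }
    @bits = 0
    freed
  end
end

bitmap = CodePageBitmap.new
[0, 5, 8191].each { |i| bitmap.mark(i) } # pretend three pages are live
freed = bitmap.sweep
puts freed.size # 8192 - 3 = 8189
```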
No, the GC doesn't offer an API to do this. It's very different from how the GC treats objects.
That works, but it would require adding jumps to the code to jump over the data. For example, when we call a C func method, we would need to write jit_return to the block, and the block_t pointer would need to be in the block's generated code. If the data goes at the end, the code would stop falling through after the C func returns. Putting it in the middle of the block similarly requires an additional jump.
It's basically just a simple mark and sweep algorithm with a bitmap. I think it could work pretty well, but if we can't do it that way, then we'd want to have a pointer from each block to its code page, and also a pointer from each stub to its code page 🤔
Fair enough, I guess we might be better served with a good old hash map. It also opens the possibility of associating other info with return addresses if we ever need to.
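A sketch of what that hash map might look like; the addresses and metadata fields below are invented for illustration.

```ruby
# Map native return addresses (the values written by jit_return) to
# per-block metadata, so a stack scan can identify live jitted code.
return_addr_map = {}

# When compiling a call that can return into jitted code, record it:
return_addr_map[0x7f00_1000] = { block: "Foo#adversary block 0", page: 3 }
return_addr_map[0x7f00_2040] = { block: "Bar#baz block 2",       page: 7 }

# While scanning a native stack, any word that hits the map identifies
# live code (plus whatever other info we chose to associate with it):
stack_words = [0x12345, 0x7f00_2040, 0xdead]
live = stack_words.filter_map { |w| return_addr_map[w] }
puts live.first[:page] # 7
```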
Assuming we go with a scheme that maps return addresses to extra info, that covers finding blocks. For finding the code page GC object given a pointer into the code page, we could store the code page object pointer at the start of the page and do a bit of pointer arithmetic (mask off the low bits of the pointer to get to the page start).
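The mask trick can be shown concretely. This assumes 8 KiB pages aligned to a power-of-two boundary; the addresses are hypothetical.

```ruby
PAGE_SIZE = 8 * 1024 # must be a power of two for the mask trick to work

# Given any pointer into a code page, masking off the low bits recovers
# the page's base address, where the code page object pointer would be
# stored.
def page_base(code_ptr)
  code_ptr & ~(PAGE_SIZE - 1)
end

base = 0x7f3a_0000_4000 # hypothetical 8 KiB-aligned page start
ptr  = base + 0x1a3     # some instruction inside the page
puts page_base(ptr) == base               # true
puts page_base(base + PAGE_SIZE - 1) == base # true: last byte maps back too
```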
Instead of (or in addition to) keeping the bitmap, could we keep a wrapper object for each code page and have its finalizer add the page back to the free list? When blocks are invalidated, or when an iseq is collected, the finalizer for the code page wrapper could add the code page back to the freelist. Pushing block_t wrapper objects on the stack could keep the code page alive until everything returns (as @XrXr says).
Yes, we can use wrapper objects for code pages instead of a bitmap. We may want some kind of mapping from code addresses to code pages; that could potentially be done with a fixed-size array. I would rather avoid having to push block_t objects to the stack for each call, because we're already in a situation where CFP objects are huge and calls are very costly. Ideally we need to work towards reducing the overhead per call; pushing more data to the stack for each call would go in the other direction.
An alternative to writing jit_return when calling C methods and routines is mapping
I've been messing around a bit with code GC today. The plan is that we'll allocate a chunk of executable memory, then subdivide the executable memory into code pages (which will form a linked list). The compiler will further subdivide the code pages by allocating block_t structures from the code page. Why do we need the intermediate code page structure? I think the answer is "code pages are fixed size so they're easier to manage", but I wanted to write it down. Also, I'd like to give a name to the executable memory we allocate so that it's easier to talk about. I suggest something like "code page pool". WDYT?
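A minimal Ruby model of such a "code page pool" with a free list; all names here are placeholders, not actual YJIT structures.

```ruby
# One big executable region subdivided into fixed-size pages, tracked by
# a free list of page offsets. Fixed-size pages make alloc/free trivial.
class CodePagePool
  PAGE_SIZE = 8 * 1024

  def initialize(region_size)
    @free_list = (region_size / PAGE_SIZE).times.map { |i| i * PAGE_SIZE }
  end

  # Hand out the offset of a free page, or nil when the pool is exhausted.
  def alloc_page
    @free_list.shift
  end

  # Invalidation / iseq death eventually returns pages here.
  def free_page(offset)
    @free_list.push(offset)
  end

  def free_count
    @free_list.size
  end
end

pool = CodePagePool.new(64 * 1024) # a tiny 8-page pool for illustration
a = pool.alloc_page                # 0
b = pool.alloc_page                # 8192
pool.free_page(a)
puts pool.free_count # 7: eight pages, two handed out, one returned
```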
Sure, that sounds good. Another thing to keep in mind is that we need a way to map pointers into machine code to the code pages that contain those addresses. Alan suggested just writing a pointer to the code page object at the beginning of each code page; that way you can use bitwise arithmetic to recover the code page from any address inside it. Alternatively, I suggested that we could have an array of pointers to code page objects and use bitwise arithmetic to go from a code pointer to its code page. Both are equivalent.

Each code page should be split in two so that there's an inlined and an outlined section (for things like side-exits). However, the code that does the allocation and freeing of code pages probably doesn't need to be concerned with this split.
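A toy model of the inlined/outlined split within a page, assuming a 50/50 split at the midpoint (the actual split policy is an open question here; the names are illustrative).

```ruby
# Each page holds two bump allocators: fast-path code grows from the
# front of the page, outlined code (side exits etc.) from the midpoint.
class CodePage
  SIZE  = 8 * 1024
  SPLIT = SIZE / 2

  def initialize
    @inlined_pos  = 0     # bump pointer for the hot path
    @outlined_pos = SPLIT # bump pointer for side exits and cold paths
  end

  # Both return the offset where code would be written, or nil when the
  # section is full.
  def alloc_inlined(nbytes)
    return nil if @inlined_pos + nbytes > SPLIT
    pos = @inlined_pos
    @inlined_pos += nbytes
    pos
  end

  def alloc_outlined(nbytes)
    return nil if @outlined_pos + nbytes > SIZE
    pos = @outlined_pos
    @outlined_pos += nbytes
    pos
  end
end

page = CodePage.new
i1 = page.alloc_inlined(64)  # 0
o1 = page.alloc_outlined(32) # 4096
i2 = page.alloc_inlined(100) # 64
puts [i1, o1, i2].inspect # [0, 4096, 64]
```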
Ya, this makes more sense than making the iseqs own the code pages. When marking iseqs, we can just iterate the block_t objects, then do arithmetic on the address to get back to the code page. I have it kind of working now, but I think I might need some help if either of you are free at some point. I'm hitting some assertion errors that I don't know how to fix.
We should be able to get a code page from a jit_return address. We can't directly get the block_t, though. Not sure if that matters (can a block_t live longer than its parent iseq? Maybe after some code invalidation event?)
Not quite tho. If we just iterate through all the blocks, that will keep all the blocks for an iseq alive forever. However, you can have a situation where blocks jump A -> B -> C -> D. Then you invalidate B; technically C and D are now dead too, because they've become unreachable. Ideally the blocks need to be traversed as a graph of their own: each block has a list of outgoing branches, and each outgoing branch can point to 0, 1, or 2 target blocks.
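The traversal described can be sketched as a small graph walk; `Block` and its fields are stand-ins for block_t, not real YJIT types.

```ruby
# Each block has outgoing branches to 0, 1, or 2 target blocks; only
# blocks reachable from the iseq's entry points are live.
Block = Struct.new(:name, :targets)

def live_blocks(entry_blocks)
  live = {}
  stack = entry_blocks.dup
  until stack.empty?
    block = stack.pop
    next if live[block]
    live[block] = true
    block.targets.each { |t| stack << t if t }
  end
  live.keys
end

d = Block.new("D", [])
c = Block.new("C", [d])
b = Block.new("B", [c])
a = Block.new("A", [b])

before = live_blocks([a]).size
puts before # 4: A -> B -> C -> D are all reachable

# Invalidating B severs the chain: C and D become unreachable too.
a.targets.delete(b)
after = live_blocks([a]).size
puts after # 1: only A remains live
```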
I can pair with you tomorrow afternoon if that works for you. Free at 1PM and 2PM eastern.
Ah right 🤦🏻♂️. Can we make an iterator that traverses the live blocks? I'm not sure how many blocks per method we have, but I'm somewhat worried about creating a bunch of Ruby objects when we don't need to.
I've got a meeting from 1:30pm to 2:30pm eastern. Maybe I could work on a graph traversal function tomorrow and we could work on GC on Friday? (Or just ping me after 2:30pm EST?)
Currently this simple test fails, but we should be able to pass it once the code GC is working:
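The test snippet itself didn't survive the copy here. As a stand-in, this is the kind of stress test presumably meant: define and call many distinct methods so that, with a deliberately small executable region, the JIT would run out of space unless code GC can recycle pages. The class name and counts are made up, and the script also runs under the plain interpreter.

```ruby
# Generate 200 distinct methods; under a small executable region, the
# compiled code for these would overflow unless old pages are collected.
class CodeGcStress
  200.times do |i|
    define_method(:"method_#{i}") { i * 2 }
  end
end

obj = CodeGcStress.new
sum = (0...200).sum { |i| obj.public_send(:"method_#{i}") }
puts sum # 2 * (0 + 1 + ... + 199) = 39800
```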
Really? Do you mean on its thread, it called you/interpreter, or cross thread? |
It's certainly possible to try to unwind, but it seems pretty complex to do (it involves parsing DWARF on Linux, and maybe COFF on Windows? See libunwind). Because CRuby supports arbitrary native extensions that don't necessarily include all the required metadata, in the worst case we'd need to stack scan, as you said. It does seem like it could be a win for marking perf if we tried, but it seems a bit too hard to do for the first iteration.
On Windows, it is just RtlLookupFunctionEntry + RtlVirtualUnwind. On Linux, yeah, I don't know if the data is always present, but I'd still hope to "just" call libunwind. If calls to native code are rare enough, you can set a thread-local on transitions to/from native. Terrible performance, I realize; it must be done rarely. Maybe that is what is described above. I don't know anything about Ruby yet.
I think it's fair to close this issue today. Closing.
At the moment, we generate code into a linear region of executable memory until we run out of space. In practice, this isn't a problem for most applications, because we can set the limit quite high and it can be increased from the command line. However, it's a limitation of our system that eventually needs to be addressed.
Potentially, the easiest solution might be to implement a mechanism that allows us to throw away all of the generated code, as well as the accompanying blocks and various assumptions that we made during code generation.
Benefits:
Something to think about is that there might be some complexity in terms of exiting compiled code. That is, if we decide to throw away all generated code, we need to correctly exit to the interpreter, which is currently done by generating a side-exit. For tracepoints, if we discard the generated code while inside a C function that may have been called by JITted code, we need to make sure that we can exit from said C function correctly (could we use `longjmp` to do that?). There are also ractors to think about.

We should probably reinitialize the executable block to `INT3` (0xCC) once we're done collecting the executable code.

It may be useful to have a `YJIT.discard_code!` method for testing.