
Ability to discard/free machine code (code GC) #87

Closed

maximecb opened this issue Jun 15, 2021 · 28 comments

@maximecb
Contributor

maximecb commented Jun 15, 2021

At the moment, we generate code into a linear array of executable memory until we run out of space. In practice, this isn't a problem for most applications, because we can set the limit quite high and it can be increased from the command line. However, it's a limitation of our system that eventually needs to be addressed.

Potentially, the easiest solution might be to implement a mechanism that allows us to throw away all of the generated code, as well as the accompanying blocks and various assumptions that we made during code generation.

Benefits:

Something to think about is that there might be some complexity in terms of exiting compiled code. That is, if we decide to throw away all generated code, we need to correctly exit to the interpreter, which is currently done by generating a side-exit. For tracepoints, if we discard the generated code while inside a C function that may have been called by JITted code, we need to make sure that we can exit from said C function correctly (could we use longjmp to do that?). There's also ractors to think about.

We should probably reinitialize the executable block to INT3 (0xCC) once we're done collecting the executable code.
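The reset described above can be sketched in plain Ruby, modeling the executable block as a byte array (an illustration only; the `reset_code_region` helper is hypothetical, not YJIT's API):

```ruby
# Illustration: model the executable block as an array of bytes and
# reinitialize it to INT3 (0xCC) after collecting the code. Executing
# any stale byte would then trap immediately instead of running garbage.
INT3 = 0xCC

def reset_code_region(bytes)
  bytes.fill(INT3)
end

code_region = Array.new(16, 0x90) # pretend this holds old machine code
reset_code_region(code_region)
```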

It may be useful to have a YJIT.discard_code! method for testing.

@maximecb maximecb added this to TODO in YJIT via automation Jun 15, 2021
@maximecb maximecb changed the title Ability to discard generated code (code GC) Ability to discard/free all generated code (code GC) Jun 21, 2021
@maximecb
Contributor Author

maximecb commented Jun 21, 2021

@XrXr feel free to add your thoughts here. Let's take some time and try to think of creative solutions to exit the running code in a way that's performant if possible.

Presumably, all executing threads/ractors will eventually hit an interrupt check. There might be some way to catch them at that point, make them side-exit to the interpreter, and count how many threads have exited vs how many threads we have in total. Presumably, the interpreter already handles the issue of tracepoints modifying code that is executing in ractors. How do they do it?

It might make some sense to implement code GC first, separately from tracepoints code invalidation, though there could be some design constraints imposed by tracepoints as well.

@XrXr
Contributor

XrXr commented Jun 30, 2021

At the moment, when we invalidate a block, we don't immediately reset the code belonging to the block to int3. If we tried to do that, we would run into problems with Ruby code that invalidates YJIT blocks that are on the stack. Here is one such example:

class Foo
  def adversary
    ::Foo.remove_method(:adversary)
  end
end

The output for the call to remove_method looks like:

; push a new control frame
; save yjit registers
  call rb_mod_remove_method
; restore yjit registers
; (1) pop control frame
; write 

rb_mod_remove_method() calls invalidate_block_version() on the block for Foo#adversary. If we reset the output to a bunch of int3 immediately in invalidate_block_version(), rb_mod_remove_method() would return to a sea of int3. That's undesirable.

I bring up this example to show the more general issue that the lifetime of our output code does not end when we invalidate it. The code may still be on the native call stack, and functions may return to it.

Related to this lifetime problem, struct yjit_branch_entry should live at least as long as the branch stub that references it. Interestingly, we already fail to maintain this lifetime constraint, and I ran into the following race condition with ractors on my branch that has higher coverage:

  • Ractor A and B hit same branch stub
  • A wins the race to acquire the VM lock. B sleeps.
  • A compiles the block
  • A runs the block it compiled, invalidating the block containing the stub that A just compiled
  • A hits an interrupt check and goes to sleep. B wakes up with a freed pointer to the branch

This is an abridged version of the events for clarity, but you get the idea. Immediately freeing branch entries in invalidate_block_version() is unsound because some other threads might be running code that references the branch entry.

The HotSpot JVM has to deal with similar lifetime problems when it comes to native code, and here is my understanding of what it does:

  1. When a piece of code is invalidated, it is marked as "not_entrant" and immediately patched to prevent any future entries.
  2. Stack scanning is used to detect when all stack frames using the code are gone, at which point the piece of code becomes a "zombie".
  3. Later, the memory region the zombie owns is released back to the system for future use.

cc @chrisseaton to keep me honest about HotSpot.

We already do (1), and we can get (2) and (3) by letting the GC manage branch entries. Moving branch entries to the GC heap won't change our native code output at all, but it'll use a bit more memory, because of the 40-byte restriction.

Related to this, the c_return TracePoint will need a similar unwinding arrangement (ref #54). Consider a modified version of the Ruby example I gave that calls TracePoint#enable instead of remove_method. Just like remove_method, it invalidates the code of the caller, but unlike remove_method, when the C function returns, our output code needs to fire the c_return TracePoint. It needs to do this right before (1) in the assembly listing above, before the control frame for the C function is popped. One way to solve this is to have a branch entry own (1) and onward in the listing, so that those instructions can be patched to fire the TracePoint when it is enabled. The naming for branch entries becomes kind of weird, though. It's not a fork in control flow at all when no TracePoints are used. I digress.

@maximecb
Contributor Author

@XrXr Thanks for taking the time to detail all this. This is probably the biggest flaw in our current implementation.

I think we should address code GC issues in separate PRs from the call-threading refactoring. I think we should start by creating a collection of failing tests representing all the problems that we can find, or adversarial tests that should not fail. Maybe in bootstraptest/test_yjit_codegc.rb.

You've identified two:

  • Returning from C code to code that has been invalidated (which now contains a jump to a stub)
    • We can also probably return from Ruby code to Ruby code that has been invalidated
  • Branch entries may get freed, and then ractors can try to go modify them

You suggest having the Ruby GC manage the machine code lifetime and the branch_t lifetime. This isn't necessarily a bad idea, because the Ruby GC already manages the lifetime of our block_t objects when methods are collected. I'm not opposed to that solution. However, maybe it isn't 100% necessary for branch entries and stubs. It could be possible, when a branch entry dies, to go and update the stubs: maybe turn their branch_t pointer to NULL and have the stubs return after they get the lock. If we have the GC manage the lifetime of branch_t objects, I'm guessing that will mean the GC would free those when a method is freed?

When it comes to freeing machine code, the challenge I see is that all the cores/ractors have to be outside of that machine code when it gets freed. In fact, they have to be outside of any machine code that could jump to the machine code that gets freed. Also, ideally, we want to free/overwrite all the machine code at once, because we rely on a scheme where we essentially do bump-pointer allocation for machine code. Is it fair to assume that when the GC is running, everyone has exited to the interpreter, because all ractors must have taken an exit caused by an interrupt?

If we can assume that at GC time everyone has exited the machine code, then we could take that opportunity, if a certain condition is met (inline/outlined code heap more than 75% full?), to go and remove all entry points and free/clear all our data structures. We'd have to scan the stack, and if anyone has a jit_return set, we set it to NULL. Technically, if we want to be lazy, we could set the jit_return to NULL on every GC (or just every major GC?).

@XrXr
Contributor

XrXr commented Jun 30, 2021

Maybe turn their branch_t pointer to NULL and have the stubs return after they get the lock.

I'm not sure what you mean by this. Can you go into more detail? In the situation I described, the branch_t pointer is passed into branch_stub_hit() as an argument. We can't really set a local variable from a different thread. Assuming we could set the pointer from a different thread, how long does the storage for the pointer live? I guess it's possible to do a bespoke ref-counting setup.

I'm guessing that will mean that the GC would free those when a method is freed?

Not necessarily. The GC scans the native stack, so as long as there are any branch_t pointers on the stack, the branch_t would be considered alive. When we invalidate and free the block that holds a branch_t pointer, the native call stacks from various threads become what keeps the branch_t alive. When all the C functions that have the branch_t pointer as locals/arguments return, the GC can then decide that the branch_t is dead.

Is it fair to assume that when the GC is running, everyone has exited to the interpreter

No. The ractor thread that runs the GC might have output code on stack. For example when we call a C method, and that C method allocates a Ruby object, that might enter GC.

When it comes to freeing machine code, the challenge I see is that all the cores/ractors have to be outside of that machine code when it gets freed.

One possible solution is to allocate one GC object that represents a lease on the entire executable region and have all the objects that own machine code reference it. When the GC decides that the lease object is dead, then we know that everyone has exited. We can kick the system into a mode where we prevent future entries, and the GC will eventually decide that everyone has exited.

Also, ideally, we want to free/overwrite all the machine code at once

Having one single chunk of executable region might be a bad fit for Ruby because of Fibers. Fiber users can suspend execution in the middle of a C method like this:

fiber = Fiber.new do
  # imagine we compiled this whole Ruby block
  puts "In Fiber 1"
  Fiber.yield # in a real app the stack might have multiple JIT frames when this call happens
  puts "In Fiber 1 again"
end

fiber.resume

loop do
  # other logic. Maybe never resume `fiber` again, but `fiber` is alive.
end

So the waiting period for all pieces of code to exit can be arbitrarily long. Yes, the app has a leak in this case, but maybe we shouldn't add to its problems by barring it from the JIT. I think the main reason for having just a single chunk of memory is simplicity, but it seems that, because of multi-threading and other factors, it doesn't really make our lives easier.

@maximecb
Contributor Author

I'm not sure what you mean by this. Can you go into more detail? In the situation I described, the branch_t pointer is passed into branch_stub_hit() as an argument. We can't really set a local variable from a different thread. Assuming we could set the pointer from a different thread, how long does the storage for the pointer live? I guess it's possible to do a bespoke ref-counting setup.

That pointer comes from the stub itself. We could essentially pass a pointer to a struct we store in the stub, which contains a pointer to the branch_t. That makes it possible to go and nullify that pointer on the stubs, or to set a bit inside the stubs to tell them they are no longer valid. It adds complexity, but it's doable. That being said, if you think moving the branch_t objects to the GC heap is easier, like I said, I'm not opposed.

Not necessarily. The GC scans the native stack, so as long as there are any branch_t pointers on the stack, the branch_t would be considered alive. When we invalidate and free the block that holds a branch_t pointer, the native call stacks from various threads become what keeps the branch_t alive. When all the C functions that have the branch_t pointer as locals/arguments return, the GC can then decide that the branch_t is dead.

I see, so it does a conservative GC scan of the native stack. That makes sense.

No. The ractor thread that runs the GC might have output code on stack. For example when we call a C method, and that C method allocates a Ruby object, that might enter GC.

Another way to solve the problem, in that case, is making sure that we don't return to dead code. There are a few ways we could achieve that. Maybe we could use longjmp to exit the GC instead of a normal return. The other thing is, when we run a "code GC", it's not necessarily a normal GC pass that can happen anywhere. Potentially, we control exactly when we tell the GC to collect code. That makes it easy for the thread entering the GC to have special handling to exit to the interpreter after collecting the code. Presumably, in most cases, we'll be in the JIT when we run out of machine code space and decide to collect it.
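The non-local exit idea can be sketched with Ruby's throw/catch, which plays the role setjmp/longjmp would play in the C implementation (a hypothetical illustration, not actual YJIT code):

```ruby
# Hypothetical sketch: instead of returning normally through frames
# whose machine code is about to be freed, the code GC unwinds straight
# back to a recovery point established before entering the JIT.
def collect_code_and_unwind
  # ... free machine code here ...
  throw :exit_to_interpreter, :code_collected
end

result = catch(:exit_to_interpreter) do
  # imagine JIT-entered frames between here and the GC trigger
  collect_code_and_unwind
  :normal_return # never reached when the code GC unwinds
end
```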

I think the main reason for having just a single chunk of memory is simplicity, but it seems that, because of multi-threading and other factors, it doesn't really make our lives easier.

That's true. When researching ARM64, I ran across the fact that it can only do short relative jumps efficiently, ±4KiB IIRC. Longer relative jumps take more space to encode. If we want to support this, we might actually be forced to have smaller code allocation pools on a per-method basis, or a system with small code "pages". This also seems to point towards the notion of having the GC manage code ownership and collection. Not sure what's the best way to go about this 🤔 Presumably the code pages/chunks would belong to individual methods and die with them.

@maximecb maximecb changed the title Ability to discard/free all generated code (code GC) Ability to discard/free machine code (code GC) Jul 6, 2021
@maximecb
Contributor Author

maximecb commented Jul 7, 2021

@XrXr

@tenderlove and I sketched some code to allocate "code pages" out of a pool. Migrating to using that seems fairly straightforward. Collecting all the code pages for an ISEQ when it dies also seems not too difficult.

However, there's a use case that we aren't addressing with that, which is that sometimes we probably do want to actually discard all the code and start from scratch. For example, if we enable tracepoints and invalidate a ton of code, we could end up with a lot of dead machine code taking up space for no reason. If this machine code can't be collected until the respective ISEQs are completely unreachable, that's bad, because it seems obvious that a lot of global classes will remain reachable until the program is done running.

I was wondering if simply setting the jit_return addresses in CFPs to NULL in the GC (or only when doing a major GC) would actually be sufficient to ensure we can safely discard all our generated code. If we're in the GC, then we should have exited JITted code on all ractors through interrupts. Then, if we set jit_return to NULL in every CFP, that ensures that no interpreter stack frame can return into JITted code. The only challenge that remains is making sure that the GC isn't returning to something that was called by the JIT? Technically, we might be able to do a conservative scan of the C stack for JIT address ranges, and not free those specific pages. I'm sure we can think of other creative solutions as well.

@XrXr
Contributor

XrXr commented Jul 8, 2021

I think with branches/blocks referencing the code page instead of the iseq, we could throw away most of the code correctly by doing a full GC run.

If this machine code can't be collected until the respective ISEQs are completely unreachable, that's bad, because it seems obvious that a lot of global classes will remain reachable until the program is done running.

When we invalidate the blocks, we remove them from the iseq; once that happens, the only thing preventing collection would be running frames. So code pages can die sooner than iseqs in this setup. It seems that the GC handles this case naturally?

I think doing a full GC run covers the use case you hint at?

@maximecb
Contributor Author

maximecb commented Jul 8, 2021

I think you're right. If the blocks point to code pages, then we can have a more granular system for collecting dead pages than if ISEQs own the pages 👍

However, multiple blocks can reside in a single code page. That seems to imply that we need a GC object that "owns" each individual code page. How do you go about creating an object that belongs to the GC? Do you make sure to store an RBasic as the first field and call the appropriate functions to mark/traverse it?

@XrXr
Contributor

XrXr commented Jul 9, 2021

There is a public C API TypedData_Make_Struct() that lets us make objects with custom mark, free, and compaction callbacks. We use it for our dependency table:

yjit/yjit_iface.c, line 1053 (at cdc3115):

VALUE yjit_root = TypedData_Make_Struct(0, struct yjit_root_struct, &yjit_root_type, root);

yjit/yjit_iface.c, lines 319 to 323 (at cdc3115):

static const rb_data_type_t yjit_root_type = {
    "yjit_root",
    {yjit_root_mark, yjit_root_free, yjit_root_memsize, yjit_root_update_references},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};

@maximecb maximecb moved this from TODO to In Progress in YJIT Jul 15, 2021
@XrXr
Contributor

XrXr commented Jul 28, 2021

In discussions this week, we identified two challenges with code GC.

  1. Invalidated blocks need special handling to be reachable by the GC. In situations where a method invalidates its caller, the caller's block_t needs to stay reachable until it leaves the stack. During normal execution, the calling block_t is reachable through its cfp->iseq, but when we invalidate the block, we remove it from the iseq, leaving no way to reach the block_t.
  2. We generate stubs for incoming branches when we invalidate blocks, and the new stubs we generate can be on a different code page than the one the stub itself is on. We currently only mark the code page that the branch is on.

Potential solution for (1): assuming block_t is GC managed, we could push a pointer to the block_t onto the native stack or the cfp whenever generated code makes calls. The stack reference would keep the block alive in situations where the iseq would not.

A different approach would be to use the jit_return address to figure out the block that owns it. This would require some additional metadata bookkeeping, but doesn't require adding overhead to output code. It seems to be what HotSpot does.

EDIT: it does still require writing jit_return before calling into C functions. We can't reliably get the return address of a native function while it's running.

For (2), we can address it by adding fields to branch_t for marking the branch target. It can be a GC-managed block_t, or a code page object in case the target is a stub.

@maximecb
Contributor Author

maximecb commented Aug 3, 2021

I think using the jit_return to keep track of which block is on the stack is the right way to go. There's a potentially simple trick that we can use here, which is to write the block_t pointer right before the address jit_return points to.

To expand further on this: right now each iseq has an associated list of block_t objects, and we use that to mark all the blocks attached to the iseq. I think this is somewhat "wrong" because it doesn't really take into account whether blocks are reachable or not. We could end up in situations where code invalidation "kills" a block in the middle of a method, and this effectively makes many other blocks unreachable. However, we'd still keep considering all these other blocks as live.

We probably want a GC marking strategy that more accurately reflects the fact that code (block_t objects and stubs) form a call graph. Every block has a list of outgoing branches, and each branch has target jump addresses. It would be super convenient if we could traverse the graph of blocks (make block_t GC objects), and automatically derive which code pages are live as a function of where the blocks jump to.

This week we're finishing the paper, and I'll be away Monday & Tuesday, but I want to get back to this after and try to think of a better solution.

@maximecb
Contributor Author

maximecb commented Aug 4, 2021

@XrXr I had some thoughts for the code GC:

I think we could allocate a chunk of executable memory all at once, something like 64MiB, and then split that in code pages, say 8KiB. That leaves us with a total of 8192 code pages. Given that this number is fairly small, we could use a simple bitmap scheme to keep track of which code pages are live. That is, we start with all zeroes, and then, as we traverse block_t objects, we look at their outgoing branches, and for each target address, we mark the bit for the corresponding page in our bitmap. This takes care both of normal jumps and jumps to outlined stubs.

When the GC run is completed, we'd need to go through the bitmap, find all the pages that were never reached, and add them back to a free list that we manage. So my questions are: is there a GC callback to do that? Also note that this would only work for a full GC run. We couldn't collect code pages in a minor GC.
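The bitmap scheme above can be sketched in a few lines (hypothetical constants and helpers, not actual YJIT code):

```ruby
# Sketch of the bitmap mark/sweep described above. 64 MiB of executable
# memory split into 8 KiB code pages gives 8192 pages to track.
PAGE_SIZE = 8 * 1024
NUM_PAGES = (64 * 1024 * 1024) / PAGE_SIZE # 8192 pages

marked = Array.new(NUM_PAGES, false)

# Mark phase: for each branch target address, set the bit for the page
# that contains it. This covers normal jumps and jumps to outlined stubs.
mark_target = ->(addr) { marked[addr / PAGE_SIZE] = true }

# Pretend traversing block_t objects yielded these jump targets.
[0x0000, 0x2100, 0x2FFF, 0x4000].each(&mark_target)

# Sweep phase: pages never reached go back on the free list we manage.
free_list = (0...NUM_PAGES).reject { |page| marked[page] }
```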

For the block_t, I think we should make them GC objects. We'd traverse them starting with the entry point(s) for a given iseq, and follow the outgoing branches to mark the successor blocks. Then, when a block is finalized, we would remove it from the list associated with a given iseq. As you mentioned, we also need to keep track of which blocks are live based on jit_return addresses. For that, we could write the block_t associated with a given address just before said address in the machine code.

@XrXr
Contributor

XrXr commented Aug 4, 2021

is there a GC callback to do that? Also note that this would only work for a full GC run. We couldn't collect code pages in a minor GC.

No, the GC doesn't offer an API to do this. It's very different from how the GC treats objects.

For that, we could write the block_t associated with a given address just before said address in the machine code.

That works, but it would require adding jumps to the code to jump over the data. For example, when we call a C func method, we would need to write jit_return to the block, and the block_t pointer would need to be in the block's generated code. If the data goes at the end, it would prevent falling through after the C func returns. Putting it in the middle of the block similarly requires an additional jump.

@maximecb
Contributor Author

maximecb commented Aug 4, 2021

No, the GC doesn't offer an API to do this. It's very different from how the GC treats objects.

It's basically just a simple mark and sweep algorithm with a bitmap. I think it could work pretty well, but if we can't do it that way, then we'd want to have a pointer from each block to its code page, and also a pointer from each stub to its code page 🤔

That works, but it would require adding jumps to the code to jump over the data. For example, when we call a C func method, we would need to write jit_return to the block, and the block_t pointer would need to be in the block's generated code. If the data goes at the end, it would prevent falling through after the C func returns. Putting it in the middle of the block similarly requires an additional jump.

Fair enough, I guess we might be better served with a good old hash map. Also opens the possibility of associating other info with return addresses if we ever need to.
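The hash-map approach might look like this in outline (a purely illustrative sketch; the map contents are made up):

```ruby
# Illustrative sketch: map each jit_return address to metadata about the
# block that owns it, instead of embedding a block_t pointer in the
# machine code itself.
return_address_info = {}

# When generating a call, record the association for the return address.
return_address_info[0x402100] = { block: :block_a, extra: :anything_else }

# When scanning the stack and finding a jit_return address, look up the
# owning block (and whatever other info we associated with it).
owner = return_address_info[0x402100]
```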

@XrXr
Contributor

XrXr commented Aug 11, 2021

Assuming we go with a scheme that maps jit_return address to GC object, we'll need to take care to write jit_return when we push the CFP for cfunc calls, to have something referencing the calling block_t. Additionally, when we call routines that could allocate without pushing a CFP, such as when we call rb_vm_setinstancevariable(), we need to push the block_t pointer onto the native stack. Without doing so, nothing would reference the calling block when the GC runs inside the routine.

For finding the code page GC object given a pointer into the code page, we could store the code page object pointer at the start of the page and do VALUE code_page_gc_object = *(code_address - (code_address % CODE_PAGE_SIZE)). This way, we don't need to store additional data on block_t or stubs as to which page they reference. Note that when CODE_PAGE_SIZE is a power of two, the calculation can be a bitwise AND.
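The page-start arithmetic can be checked with plain integers (a sketch; the CODE_PAGE_SIZE value here is an assumption):

```ruby
# Sketch of the calculation above: round a code address down to the
# start of its page. With a power-of-two CODE_PAGE_SIZE, the modulo
# form and the bitwise-AND form are equivalent.
CODE_PAGE_SIZE = 8 * 1024 # assumed page size; must be a power of two

def page_start(code_address)
  code_address & ~(CODE_PAGE_SIZE - 1)
end

def page_start_modulo(code_address)
  code_address - (code_address % CODE_PAGE_SIZE)
end
```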

@tenderlove
Contributor

No, the GC doesn't offer an API to do this. It's very different from how the GC treats objects.

It's basically just a simple mark and sweep algorithm with a bitmap. I think it could work pretty well, but if we can't do it that way, then we'd want to have a pointer from each block to its code page, and also a pointer from each stub to its code page 🤔

Instead of (or in addition to) keeping the bitmap, could we keep a wrapper object for each code page and have its finalizer add the page back to the free list? The object relationship could look like this:

[diagram 001: proposed object relationship]

When blocks are invalidated, or when an iseq is collected, the finalizer for the code page wrapper could add the code page back to the freelist:

[animation: code page finalizer returning the page to the freelist]

Pushing block_t wrapper objects on the stack could keep the code page alive until everything returns (as @XrXr says)

@maximecb
Contributor Author

maximecb commented Aug 17, 2021

Yes we can use wrapper objects for code pages instead of using a bitmap. We may want to have some kind of mapping of code addresses to code pages. That could potentially be done with a fixed-size array.

I would rather avoid having to push block_t objects to the stack for each call because we're already in a situation where CFP objects are huge, and calls are very costly. Ideally we need to work towards reducing the overhead per call. Pushing more data to the stack for each call would go in the other direction.

@XrXr
Contributor

XrXr commented Aug 19, 2021

An alternative to writing jit_return when calling C methods and routines is mapping cfp->pc to block_t when marking the stack. We already write to cfp->pc before making these calls. The caveat is that if the interpreter happens to call the same method, that would also keep generated code alive. Not a huge problem, and if we're willing to do some extra work on generated code entry and exit, this accuracy problem could be solved too.

@tenderlove
Contributor

I've been messing a bit with code GC today. The plan is that we'll allocate a chunk of executable memory, then subdivide the executable memory into code pages (which will form a linked list). The compiler will further subdivide the code pages by allocating block_t structures from the code page.

Why do we need the intermediate code page structure? I think the answer is "code pages are fixed size so they're easier to manage", but I wanted to write it down.

Also I'd like to give a name to the executable memory we allocate so that it's easier to talk about. I suggest something like "code page pool". WDYT?

@maximecb
Contributor Author

Sure that sounds good.

Another thing to keep in mind is that we need a way to map pointers into machine code to the code pages that contain those addresses. Alan suggested just writing a pointer to the code page object at the beginning of each code page. That way, you can use bitwise arithmetic on a code pointer to recover the code page object. Alternatively, I suggested that we could have an array of pointers to code page objects and use bitwise arithmetic to go from code pointer to code page. Both are equivalent.

Each code page should be split into two so that there's an inline and outlined section (for things like side-exits). However, the code that does the allocation and freeing of code pages probably doesn't need to be concerned with this split.

@tenderlove
Contributor

Alan suggested just writing a pointer to the code page object at the beginning of each code page. That way, you can use bitwise arithmetic on a code pointer to recover the code page object.

Ya, this makes more sense than making block_t wrappers. I'm making the relationship look something like this:

[diagram: yjit-code-pages, 2021-08-25]

When marking iseqs, we can just iterate the block_t objects, then do arithmetic on the address to get back to the code_page_t object which has a pointer to the code_page_t Ruby wrapper object. We can do something similar when scanning the stack, though I'm not 100% sure what that code will look like.

I have it kind of working now, but I think I might need some help if either of you are free at some point. I'm hitting some assertion errors that I don't know how to fix.

@maximecb
Contributor Author

We can do something similar when scanning the stack, though I'm not 100% sure what that code will look like.

We should be able to get a code page from a jit_return address. We can't directly get the block_t, though. Not sure if that matters (can a block_t live longer than its parent iseq? Maybe after some code invalidation event?)

When marking iseqs, we can just iterate the block_t objects, then do arithmetic on the address to get back to the code_page_t object which has a pointer to the code_page_t Ruby wrapper object.

Not quite, though. If we just iterate through all the blocks, that will keep all the blocks for an iseq alive forever. However, you can have a situation where blocks jump A -> B -> C -> D. Then if you invalidate B, technically C and D are now dead too, because they become unreachable. Ideally, the blocks need to be traversed as a graph of their own. Each block has a list of outgoing branches, and each outgoing branch can point to 0, 1, or 2 target blocks.
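Traversing the blocks as a graph could be sketched like this (hypothetical data shapes; real block_t objects hold branch structs, not strings):

```ruby
# Sketch: blocks are nodes, outgoing branches are edges. Marking starts
# from the entry block, so invalidating B leaves C and D unreachable.
successors = {
  "A" => ["B"],
  "B" => ["C"],
  "C" => ["D"],
  "D" => [],
}

def live_blocks(successors, entry)
  live = {}
  work = [entry]
  until work.empty?
    block = work.pop
    next if live[block] || !successors.key?(block)
    live[block] = true
    work.concat(successors[block])
  end
  live.keys.sort
end

before = live_blocks(successors, "A") # all four blocks reachable
successors["A"] = []                  # invalidate B: sever the edge A -> B
successors.delete("B")
after = live_blocks(successors, "A")  # only A remains live
```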

I have it kind of working now, but I think I might need some help if either of you are free at some point. I'm hitting some assertion errors that I don't know how to fix.

I can pair with you tomorrow afternoon if that works for you. Free at 1PM and 2PM eastern.

@tenderlove
Contributor

tenderlove commented Aug 25, 2021

Not quite, though. If we just iterate through all the blocks, that will keep all the blocks for an iseq alive forever. However, you can have a situation where blocks jump A -> B -> C -> D. Then if you invalidate B, technically C and D are now dead too, because they become unreachable. Ideally, the blocks need to be traversed as a graph of their own. Each block has a list of outgoing branches, and each outgoing branch can point to 0, 1, or 2 target blocks.

Ah right 🤦🏻‍♂️. Can we make an iterator that traverses the live blocks? I'm not sure how many blocks per method we have, but I'm somewhat worried about creating a bunch of Ruby objects when we don't need to.

Can't directly get the block_t though.

block_t points inside the code page via start_pos. We should be able to do math on that address to find the head of the code page, I think. Then we wouldn't need the extra wrapper objects.

I can pair with you tomorrow afternoon if that works for you. Free at 1PM and 2PM eastern.

I've got a meeting from 1:30pm to 2:30pm eastern. Maybe I could work on a graph traversal function tomorrow and we could work on GC on Friday? (Or just ping me after 2:30pm EST?)

@maximecb
Contributor Author

Currently this simple test fails, but we should be able to pass it once the code GC is working:

500_000.times do |i|
    eval("def test; 1; end", TOPLEVEL_BINDING)
    test()
end

@Shopify Shopify deleted a comment from noahgibbs Nov 1, 2021
@Shopify Shopify deleted a comment from noahgibbs Nov 1, 2021
@jaykrell

We can't reliably get the return address of a native function while it's running.

Really? Do you mean on its own thread, after it called you/the interpreter, or cross-thread?
On Windows (maybe irrelevant here), the ABI is designed so that stack walking is possible from every instruction, i.e. even in prologues/epilogues, anywhere. (Prologues/epilogues aren't what people think, as well; somewhat separate matter.)
I realize other ABIs don't care about walking from prologues/epilogues, but I'd suspect you can still win by being able to walk them forward/backward until you are not in them, e.g. with runtime disassembly and partial simulation (i.e. at least to discover stack adjustments/restores).

@XrXr
Contributor

XrXr commented May 13, 2022

It's certainly possible to try to unwind, but it seems pretty complex to do (it involves parsing DWARF on Linux and maybe COFF on Windows? See libunwind). Because CRuby supports arbitrary native extensions that don't necessarily include all the required metadata, in the worst-case scenario we'd need to stack scan, as you said. It does seem like it could be a win for marking perf if we tried, but it seems a bit too hard to do for the first iteration.

@jaykrell

On Windows, it is just RtlLookupFunctionEntry + RtlVirtualUnwind.
The data is always there and you don't have to parse it.
Stack walkability is not optional.
It's not quite always PE/COFF, as JITs provide data too;
RtlLookupFunctionEntry handles that.

On Linux, yeah, I don't know if the data is always present, but I'd still hope to "just" call libunwind.

If calls to native are rare enough... you can set a thread-local on transitions to/from native. Terrible performance, I realize; it must be done rarely. Maybe that is what is described above -- I don't know anything about Ruby yet.

@k0kubun
Member

k0kubun commented Nov 7, 2022

I think it's fair to close this issue today. Closing.

@k0kubun k0kubun closed this as completed Nov 7, 2022