Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idea: "Multi-return" functions for faster Rust `Result` and/or simpler exceptions #553

Open
Ericson2314 opened this Issue Oct 9, 2018 · 20 comments

Comments

3 participants
@Ericson2314
Copy link

Ericson2314 commented Oct 9, 2018

From the paper "Multi-return Function Call" (http://www.ccs.neu.edu/home/shivers/papers/mrlc-jfp.pdf).
The basic idea from the perspective of compiled code is to include multiple return pointers in a stack frame so functions can return to different places.

Compared to Result<T,E>

This is denotationally the same as return a value of a Rust enum with 1 variant according to each of the return pointer slots, with fields according to the returned data associated with that slot (registers, spilled stack slots, etc). But with the naive enum calling convention of adding a tag field, the caller needs to branch on the tag field, even if the enum value was just created before the return so nothing is in principle unpredictable. In the common case of a function "rethrowing" a Err, (Err(e0) => ... return Err(e1) ... math arm), the native way results results on O(n) branches (one per stack frame) one each of the Err tags, while this way allows the error return pointer to point to disjoint control flow for the failure case, catching and rethrowing without additional branches, so the only 1 branch is the original failure condition.

Compared to unwinding

Success and failure control flow is implemented identically, avoiding significant effort on the part of compiler writers in maintaining a completely separate implementation concepts while optimizations can work with both, and don't get stuck on the success failure boundary. At run time, the lack of any DWARF-like interpreters reduces dependencies and simplifies things too.

In short, we have the asymptotic efficiency of unwinding with the implementation (compiler and run-time) efficiency of enum return.


I talked to @eddyb about this once and he said to talk to someone on #cranelift, but alas I am on IRC less these days and I forgot their nick. Opening this to make sure the idea isn't lost completely due to my negligence. Cheers.

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 9, 2018

This is a cool idea! It's very interesting to think about using Cranelift to explore custom calling conventions for Rust.

(For actual unwinding, Rust's hands are tied in practice by the practical need to interoperate with existing C/C++ ABIs. And the DWARF unwinding logic is already provided, so all we have to do is provide the appropriate tables. But for Result<T, E> we're of course free to do whatever we want.)

One of my ideas in this space is to use the CF flag on x86 (other architectures have other architecture-specific things we could look at) to hold the discriminant of Result<T, E>, sketched out in a little more detail here. One thing that makes it similar to multi-return functions is that it would also benefit from the concept of a tuple of return types. It seems like there'd be slightly different performance characteristics, but it'd be good to experiment to get a better sense of how it plays out in practice.

If anyone is interested in laying the foundations for this kind of exploration in Cranelift, I think the first step here is to generalize Cranelift IR's concept of a function signature to permit a tuple of return types.

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 9, 2018

If anyone is interested in laying the foundations for this kind of exploration in Cranelift, I think the first step here is to generalize Cranelift IR's concept of a function signature to permit a tuple of return types.

Glad to here it! That's what i was thinking / hoping wouldn't get in the way of any other things. Sounds like a nice fun refactor.

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 9, 2018

One further thought here: I think Cranelift IR wants a concept of a primary return type, optionally with secondary return types added. This might look a little unusual from the abstract perspective where it's just a tuple of return types, however I think it makes sense at the level of Cranelift IR.

When we get the the point where we're adding a version of the call instruction which supports multiple return destinations, the primary return can resume the EBB containing the call, and the secondary returns can target other EBBs. And, the lowering for secondary returns will look a little different than for primary returns, because the primary return can just use the hardware's normal return mechanism.

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 9, 2018

Well next like to do the empty list for diverging functions, since they don't return at all. Couldn't a convention that the first return type is the normal one suffice? For your CF example, the ABI is the same (flag must be checked either way). For mine, the only loss is short-hands like the ret instruction. The conceptual two steps of IP <- SP[N]; SP += M are the same, just N != M - 1 in general. For yours, the flag must be checked in all cases so the ABIs are the same!

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 11, 2018

Oh, cool. It hadn't occurred to me that diverging functions would just have an empty tuple. But I see how that makes sense.

Currently in Cranelift IR, the last instruction in an EBB is required to use one of the opcodes which is specially designated as a "terminator". Regular calls aren't terminators, so for a call to a diverging function, maybe we could use a special call_diverging family of instructions. That would get awkward as we gain more kinds of calls, that each need a "diverging" variant.

There might be ways we could fix that, though this may take a bit more design work.

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 11, 2018

I went with a call_table terminator, which can deal with any number of return slots. Perhaps regular call can be extended to, but this plays nice with the existing surface syntax and was easy to do.

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 11, 2018

I've started implementing things. https://github.com/Ericson2314/cranelift/tree/multi-fun-sig is the signature generalization. https://github.com/Ericson2314/cranelift/tree/call_table is the boilerplate to add call_table and call (which would come later).

For the former, most of the work is boring returns -> multi_returns.get_single_return_unwrap(), in lieu of any instructions supporting multi-return. But one issue popped up already for legalization:

legalize_signature's current param sketches me out. I can't decide whether the regs/stack slots for additional return pointers should always be allocated (like args) or only sometimes allocated (like the link). I'd understand current better if this was just a caller- vs callee- saved distinction, but it isn't.

From the limited amount I know so far, I'd like to make link unconditionally part of the legalized signature:

  1. Even on x86 we can say the first stack slot is the "link", AbiParam should support that.
  2. For no-return functions, and custom calling conventions, we could reuse the link register/stack slot for something else if we wanted. So it strictly speaking isn't an unconditional constraint.
@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 26, 2018

Catching up here: Yes, making the x86 return value on the stack a "link" argument (when it needs one) sounds reasonable to me.

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 26, 2018

Thanks. How about the current parameter from

fn legalize_signature(&self, sig: &mut ir::Signature, current: bool);
? What makes link regs/stack slots --- caller-visible parts of the ABI for sure---only done in the "current" case?

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 26, 2018

For the link register, Cranelift certainly does need to be aware of the link register Value passed into a function on the callee side, so that it can do the return. But the outgoing link register to a call is set by the call instruction itself in most ISAs (for traditional single-return calls). It's awkward to model this precisely, because the call would need to be a single instruction that defines the return address Value and also uses that Value itself as an argument to the call. Normally, instructions can't read values they produce.

For stack slots, where are they handled differently for the "current" case?

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 26, 2018

For stack slots, where are they handled differently for the "current" case?

I forget in the 2 weeks since I last touched the code :). I'm guess-remembering they may have been just implicit in both cases.

It's awkward to model this precisely...Normally, instructions can't read values they produce.

I think the solution is to get rid of the current and always model the link.

First, the machine's operation: Since the callee consumes the link (register or stack slock)---i.e after return the stack poiner is bumped or the register can contain something else---the call instruction indeed shouldn't produce or consume the return address. But is important that the callee have that register in it's ABI: a no-return function should be "called" with an unconditional jump and not a call instruction if the link stack/reg is to be repurposed for some other use.

If one thinks thinks of ABIs as types like then we have something like:

call :  reg
     -> (a = (pre-constraints + link-with-addr => post-constraints))
     -> ((reg = fun(a)) + pre-constraints => post-constraints + link-clobbered)
jump :  reg
     -> (pre-constraints => truely-empty-constraint-meaning-unreachable) 
     -> (reg = fun(a)) + caller-pre-constraints => truely-empty-constraint-meaning-unreachable)

So it's sort of like a higher order function (-> is the syntactic function type for assembly code operators, => is the "semantic" function type for instructions). The point being is that call's constraints to not precisely correspond to the constraints of the function being called. The finally call instruction has the same rough constraints, but those don't correspond exactly to constraints of the function being called.

My guess is current is sort of hack to recalculate the function's ABI so it more closely matches the call instruction's ABI, but that ignores the fact that call and jump require different things of their callee / destination address as I described first.

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 26, 2018

I think you're right, and I agree that it would be nice to make caller and callee work the same way here. However I don't yet see how we can do that cleanly.

If we put the link in the callee's signature, we'll need to also add a Value as an actual argument to the call, because basic IR invariants require the number of actual arguments to match the number of formal parameters. The problem is, we don't have such a Value. It won't exist until the call instruction itself produces it. Basic IR invariants require that Value uses be dominated by their definitions, and an instruction's outputs don't dominate its inputs.

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 26, 2018

Where are these "Basic IR invariants" in the code? I'm very much willing to try teach them the higher-order nature of this stuff :).

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 26, 2018

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 26, 2018

Thanks!

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 27, 2018

Sweet! So I think the first bit can stay the same. The second bit I'd change by adding some extra link args to match against the inferred virtual parms, so everything verifies up. Those args wouldn't come from any particular instruction per the normal way, but rather be made up out of thin air since they come from the "first half" of the call instruction. Initially, they'd be the same for every call instruction (for a given ISA), but once we get multi return, different call instructions would yield different virtual args.

[BTW, this also gets us closer to using br for tail calls. In that case the old link value needs to be routed to the right place like a normal argument. br wouldn't create any special arguments, so the currently-current-only link parameter and normally-routed link argument would need to match up just as the spoofed call-provided arguments would.]

Looking at that code reminds me, is there any issue to add argument/returns for br_table? I'd say it would make sense to so generalize br_table and remove current, then clean up and finish what I already started (call_table and multiple ret in sigs).

@Ericson2314

This comment has been minimized.

Copy link
Author

Ericson2314 commented Oct 27, 2018

[Argument/returns for br_table also gets us closer to your trick for storing the tag in condition flags + scalar replacement of aggregates. (I realize now mine is fundamentally different as it is getting rid of the tag altogether rather than storing it differently.) In that case, there are some opaquely used registers, and based on the branch taken those registers are revealed to be used for something or actually unused, based on the revealed variant.]

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 27, 2018

Ok, I'll be interested to see what this ends up looking like :-).

Adding arguments to br_table is theoretically doable, though it's not trivial. The big issue is the interaction with critical edges. For example, if we have a CFG like this:

     A     B    C    D
       \  /  \ /  \ /
        E     F    G

Suppose all these edges are branched via br_table, and for whatever reason we decide we need an argument on the branch from A->E. Now E has a parameter, so B needs an argument. But that means F needs a corresponding parameter, then C needs a corresponding argument, then G needs a corresponding parameter, then D needs a corresponding argument. This is pretty obscure, but this kind of situation where what should have been a localized transformation ends up needing us to ping pong around the IR to keep everything in sync is a corner case we'd like to avoid worrying about.

One idea we've tossed around is to change br_table's arguments from being a single sequence of values to be a sequence of values for each destination. For example, instead of:

     br_table v0, ebb7, jt0(v2, v3)

we might have:

    br_table v0, ebb7(v2, v3), jt0(ebb8(v2, v3), ebb9(v2, v3))

or so. Now we can do things between this branch and say ebb8 without affecting the other destinations.

All that said, if you have other ideas here, I'm interested!

@sunfishcode

This comment has been minimized.

Copy link
Member

sunfishcode commented Oct 27, 2018

A few more notes:

  • Traditional SSA form with phis rather than branch arguments avoids this problem because the phis hold the actual arguments in the successors. This is why the approach sketched above works, because it provides the same information that phis would provide.
  • The reason we don't currently have this problem is that our regular conditional branches use a fallthrough when the condition is false, and the fallthrough path doesn't have parameters. br_table and indirect_jump_table_br (which br_table lowers to) are the only two branch instructions which have multiple successors that have parameters.

But also, thinking about this more, I think we may not actually need br_table for here. If you're just picking between a small number of return paths, a conditional branch or two will be much more efficient than a jump table.

@gnzlbg

This comment has been minimized.

Copy link

gnzlbg commented Mar 14, 2019

For actual unwinding, Rust's hands are tied in practice by the practical need to interoperate with existing C/C++ ABIs.

Note that if a foreign extern "C" { } (or extern { }) function declaration unwinds the behavior is undefined. Also, an uncaught panic in extern "C" fn aborts the process. That is, an implementation can assume that C foreign functions and extern "C" functions do not unwind.

Being able to interoperate with C++ ABIs that unwind would require extern "c++", and handling those ABIs will probably require inserting shims that translate Rust panics to/from C++ exceptions for the particular target (unless the extern "c++" fn or the extern "c++" { } declaration are explicitly specified to never unwind).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.