Pointer alignment hack setting #209

Closed
SoniEx2 opened this issue Aug 1, 2016 · 10 comments

@SoniEx2

SoniEx2 commented Aug 1, 2016

LuaJIT could be allowed to use more RAM by adding a pointer alignment setting, like JVMs have. With N-bit alignment the lower N bits of every pointer are always zero, so the VM does not need to store them (e.g. it can shift them out) and the same pointer width reaches further.

Pointers and offsets are separate so by using e.g. 16-byte alignment you free up 4 bits, letting you use up to 2^35 bytes (32 GB) of RAM instead of 2^31 bytes (2 GB).
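A minimal sketch of the arithmetic above, assuming references are stored as 31-bit offsets from a heap base and shifted by the alignment. The names `ALIGN_SHIFT`, `ref_from_offset`, `offset_from_ref` and `addressable_bytes` are hypothetical and are not part of LuaJIT.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical setting: every object is aligned to 2^ALIGN_SHIFT bytes. */
#define ALIGN_SHIFT 4u  /* 16-byte alignment frees the low 4 bits */

/* Compress a byte offset from the heap base into a 31-bit reference. */
static uint32_t ref_from_offset(uint64_t offset)
{
  /* The low bits must be zero, otherwise shifting would lose information. */
  assert((offset & ((1u << ALIGN_SHIFT) - 1u)) == 0);
  /* The shifted value must still fit into 31 bits. */
  assert((offset >> ALIGN_SHIFT) < (UINT64_C(1) << 31));
  return (uint32_t)(offset >> ALIGN_SHIFT);
}

/* Expand a reference back into a byte offset from the heap base. */
static uint64_t offset_from_ref(uint32_t ref)
{
  return (uint64_t)ref << ALIGN_SHIFT;
}

/* Bytes reachable with a 31-bit reference under this alignment. */
static uint64_t addressable_bytes(void)
{
  return UINT64_C(1) << (31 + ALIGN_SHIFT);
}
```

With `ALIGN_SHIFT` set to 4 the reachable range is 2^35 bytes. Every dereference pays for one extra shift, which is the cost being traded for the larger range.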

Strings and stuff would still be limited to a max size of 2 GB but that's a non-issue I think.

This is one way to maintain bytecode compatibility between 32-bit and 64-bit modes (would be really nice to have bytecode compatibility between them).

@MageSlayer

Hm.
Looks like a potential possibility to ease pre-GC64 pain.
Unfortunately I cannot estimate if it's easy/difficult to implement, but I am definitely interested.
Do you have something to share?

@SoniEx2
Author

SoniEx2 commented Aug 3, 2016

It's very hard. But it'd be pretty nice to have.

@corsix

corsix commented Aug 3, 2016

  • x86 addressing modes make exploiting 8-byte alignment preferable to exploiting 16-byte alignment on grounds of efficiency (but then you only get 2^34 bytes of RAM).
  • LuaJIT already exploits alignment in a few places, and you can't exploit alignment twice (the place which comes to mind is call frames - adding three or four bits to a call frame would probably necessitate FR2 mode, which is precisely the bytecode incompatibility problem you cite as wanting to avoid).
  • This would be as much work as LJ_GC64 mode, for 1/8192th of the benefit.

Point 2 rules this out for technical reasons, and point 3 rules it out for effort and maintenance reasons.
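The 1/8192 figure can be checked with a short sketch. It assumes that LJ_GC64 mode offers a 47-bit address space (this number is not stated in the thread) and that 8-byte alignment on a 31-bit reference gives 34 bits. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Address-space size in bytes for a given number of usable address bits. */
static uint64_t space_bytes(unsigned bits)
{
  assert(bits < 64);
  return UINT64_C(1) << bits;
}

/* How many times larger a big_bits space is than a small_bits space. */
static uint64_t space_ratio(unsigned big_bits, unsigned small_bits)
{
  assert(big_bits >= small_bits);
  return space_bytes(big_bits) / space_bytes(small_bits);
}
```

Under those assumptions `space_ratio(47, 34)` is 2^13, so the alignment approach reaches 1/8192 of what GC64 reaches.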

corsix closed this as completed Aug 3, 2016
corsix added the wontfix label Aug 3, 2016
@SoniEx2
Author

SoniEx2 commented Aug 3, 2016

Can you explain point 2?

@corsix

corsix commented Aug 4, 2016

Not considering continuation frames, the metadata for a call frame contains three things:

  1. The function being called (a GC object pointer).
  2. The kind of the previous frame (one of seven values).
  3. How to return to the previous frame, meaning either:
    1. A bytecode instruction pointer (a pointer into GC memory).
    2. The size of the previous frame's stack.

This metadata has to fit into a number of TValues: when LJ_FR2=0, it fits into one TValue, and when LJ_FR2=1, it fits into two TValues. Fitting it into one TValue is a squeeze: at first sight, a TValue is 64 bits, and items 1 and 3i (the instruction pointer) are both 32 bits, leaving no space for item 2. Space for item 2 is found by exploiting the alignment of item 3: item 3i is 4-byte aligned, so its low two bits are always zero, and item 3ii (the frame size) is a multiple of 8, so its low three bits are always zero. This leaves two or three bits to re-purpose, which is precisely enough to fit item 2.
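The packing described above can be sketched as follows. This is an illustrative encoding under stated assumptions, not LuaJIT's actual definitions: the `FrameSlot` type, the helper names and the choice of kind numbers are all hypothetical. Kind 0 stands for "return via bytecode instruction pointer", and the six other kinds are assumed to have a nonzero value in their low two bits so that they can never be mistaken for a 4-byte-aligned pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-bit frame slot made of two 32-bit halves. */
typedef struct {
  uint32_t func;  /* item 1: reference to the function being called */
  uint32_t link;  /* item 3, with item 2 folded into its low bits */
} FrameSlot;

/* Kind 0: the link half is a bytecode instruction pointer. */
enum { KIND_PC = 0 };

/* The other six kinds are 1, 2, 3, 5, 6, 7: low two bits never both zero. */
static int kind_is_valid_delta(unsigned kind)
{
  return kind <= 7u && (kind & 3u) != 0;
}

/* Item 3i: a 4-byte-aligned instruction pointer; the kind is implied. */
static FrameSlot frame_with_pc(uint32_t func, uint32_t pc)
{
  FrameSlot s;
  assert((pc & 3u) == 0);
  s.func = func;
  s.link = pc;
  return s;
}

/* Item 3ii: a frame size that is a multiple of 8, tagged with the kind. */
static FrameSlot frame_with_size(uint32_t func, uint32_t size, unsigned kind)
{
  FrameSlot s;
  assert((size & 7u) == 0);
  assert(kind_is_valid_delta(kind));
  s.func = func;
  s.link = size | (uint32_t)kind;
  return s;
}

static int frame_returns_via_pc(FrameSlot s)
{
  return (s.link & 3u) == 0;
}

static unsigned frame_kind(FrameSlot s)
{
  return frame_returns_via_pc(s) ? (unsigned)KIND_PC : (unsigned)(s.link & 7u);
}

static uint32_t frame_size(FrameSlot s)
{
  assert(!frame_returns_via_pc(s));
  return s.link & ~7u;
}
```

Both halves are already full, which is why shifting extra alignment bits out of the pointers would leave no room for the kind tag without a second slot.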

@SoniEx2
Author

SoniEx2 commented Aug 4, 2016

So why's that encoded in the bytecode?

@corsix

corsix commented Aug 4, 2016

Because the bytecode constructs the frame: when FR2=0, f(x,y,z) compiles to bytecode something like:

mov Reg(N), f
mov Reg(N+1), x
mov Reg(N+2), y
mov Reg(N+3), z
call Reg(N), 3

In this case, as part of the function call, Reg(N) is converted into 64 bits of frame metadata (32 bits for the function being called are already in place, so all that has to happen is storing the current bytecode instruction pointer in the other 32 bits).

When FR2=1, f(x,y,z) instead compiles to bytecode something like:

mov Reg(N), f
mov Reg(N+2), x
mov Reg(N+3), y
mov Reg(N+4), z
call Reg(N), 3

In this case, as part of the function call, Reg(N) and Reg(N+1) are converted into 128 bits of frame metadata (by storing the current bytecode instruction pointer in Reg(N+1)). The alternative (which happens for calls done through the C API) would be to compile as for FR2=0, and have the call instruction shift all of the arguments up by one slot, but this would come at a performance cost.
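The difference between the two register layouts can be summarised in a small sketch. The function names are hypothetical, and `fr2` stands in for the value of LJ_FR2 (0 or 1).

```c
#include <assert.h>

/* Number of TValue slots that become frame metadata at the call. */
static int frame_metadata_slots(int fr2)
{
  assert(fr2 == 0 || fr2 == 1);
  return 1 + fr2;
}

/* Register that holds argument arg_index (0-based) for a call whose
   function sits in register func_slot. */
static int call_arg_slot(int func_slot, int arg_index, int fr2)
{
  assert(arg_index >= 0);
  return func_slot + frame_metadata_slots(fr2) + arg_index;
}
```

With the function in Reg(N), the first argument lands in Reg(N+1) when FR2=0 and in Reg(N+2) when FR2=1. Because the compiler bakes these register numbers into the emitted instructions, bytecode produced for one mode does not match the other.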

@SoniEx2
Author

SoniEx2 commented Aug 4, 2016

Well can you add another IR that optimizes for the specific system it's running on and cannot be turned off/compiled without?

And then redesign the bytecode entirely so it can stay the same between 32-bit mode and 64-bit mode.

@corsix

corsix commented Aug 4, 2016

Such things would come with a maintenance cost and a performance cost, which nobody is prepared to pay.

@SoniEx2
Author

SoniEx2 commented Aug 4, 2016

Slower code loading is fine by me (because you shouldn't be loading code all the time). Slower first run is also fine by me (because after the first run it'll run like a charm). Incompatibilities are not fine by me because I rely on bytecode compatibility for my thing.
