
47 bit address space restriction on ARM64 #49

Closed
dodiadodia opened this issue Aug 19, 2015 · 30 comments

@dodiadodia commented Aug 19, 2015

Hi, I've run into a problem running LuaJIT 2.1 on an ARM64 platform.
My kernel enables the "AArch64 Linux memory layout with 64KB pages + 3 levels" shown at https://www.kernel.org/doc/Documentation/arm64/memory.txt, so user space has 48-bit virtual addresses.
But LuaJIT's lj_obj.c says "64 bit platform, 47 bit pointers", and when the code applies LJ_GCVMASK to recover the real pointer, the result is wrong: the 48-bit virtual address is truncated to 47 bits, and the LuaJIT process receives SIGSEGV.
Switching our kernel to "AArch64 Linux memory layout with 64KB pages + 2 levels", which has 42-bit virtual addresses, would avoid the problem, but due to other limitations we can't use that mode.
How can I resolve this?

@MikePall (Member) commented Aug 19, 2015

Sorry, but I don't think there's an easy way to solve this. The 47 bit restriction is pretty much hardcoded due to the use of NaN-tagging: 13 bits to indicate NaN, 4 bits for the tag, 47 bits left.

However, I doubt that it makes much of a difference for your user mode whether it gets 2^47 or 2^48 bytes of address space. :-) So, your kernel could just hand out pages in the lowest 2^47 bytes of the address space.

AFAIR there are memory region settings that can be tuned somewhere in the arch section, but this is more a question for Linux kernel experts, then.
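
For concreteness, a minimal C sketch of what this layout implies for pointer extraction (the mask matches LuaJIT 2.1's LJ_GCVMASK; the helper function and its name are made up for illustration):

#include <stdint.h>

/* Tagged value: the top 13 bits signal NaN, the next 4 bits are the
 * type tag, and the low 47 bits carry the GC pointer. */
#define LJ_GCVMASK  (((uint64_t)1 << 47) - 1)   /* low 47 bits */

static void *gcval_sketch(uint64_t tv)
{
  /* Any address bit at position 47 or above is cleared here, which is
   * exactly the truncation the reporter is seeing on a 48-bit VA kernel. */
  return (void *)(uintptr_t)(tv & LJ_GCVMASK);
}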

@MikePall MikePall added the ARM64 label Aug 19, 2015
@MikePall MikePall changed the title from "A SIGSEGV problem of v2.1 branch on ARM64" to "47 bit address space restriction on ARM64" Aug 19, 2015
@dodiadodia (Author) commented Aug 19, 2015

I traced it into src/lj_api.c, to the getcurrenv function.
It obtains a GCfunc *fn from the lua_State *L. The address of L is 0xffff7a190378, but the derived fn address is 0x7fff7a190378: the 48th bit of L has been cleared. (When we debug with the 2-level memory layout, the fn address equals L, as expected.) So the fn address is invalid, and when we access fn the program receives SIGSEGV and exits.
curr_func(L) uses the gcval macro, which applies LJ_GCVMASK and therefore yields only a 47-bit address.

@MikePall (Member) commented Aug 19, 2015

Yes, this is expected. In LJ_GC64 mode, all addresses need to be in the range from 0 to 0x7fff_ffff_ffff, i.e. the lowest 2^47 bytes of the address space.

@DemiMarie commented Aug 24, 2015

One solution would be to call mmap with MAP_FIXED, to force the kernel to give out addresses in the right range (or fail, which would cause a LuaJIT panic). This would require some non-portable hacks to find a viable starting address, or using /proc/self/maps to find a free area. Steel Bank Common Lisp (SBCL) uses a fixed address on all supported platforms, though for a different reason (SBCL core images are not position independent and must be loaded to the same address every time).

@dodiadodia (Author) commented Aug 25, 2015

Yes, I tried it, and that method can temporarily work around the problem.
Thank you.

@MikePall MikePall closed this Aug 26, 2015
@apinski-cavium commented Feb 25, 2016

The stack is allocated with the 48th bit set, so nothing here is a viable option. Please reopen this bug; it is important. We need to use the 48-bit VA, as our PA uses the full 48 bits.

@apinski-cavium commented Feb 25, 2016

One more point: ARMv8.2 will bring the ability to do a 52-bit VA. This is going to be a mess in the server business if it is not supported. I am going to try to hack this up by removing this whole overloaded encoding. Compression is just a joke when it comes to small and fast languages.

@DemiMarie commented Feb 28, 2016

I can see a couple of options:

  • Give up NaN-boxing (but that is a perf loss, as well as requiring major changes).
  • Use mmap to reserve a massive address space at startup, then allocate everything from that address space.

I strongly suspect that the second approach will work. Lua code in LuaJIT does not use the C stack, and I believe SpiderMonkey has the same problems as LuaJIT.

@wookey commented Mar 22, 2016

Now that there is arm64 support in LuaJIT 2.1, this came up in Debian. Currently LuaJIT just segfaults immediately because the pointers get their top bit clipped. I hacked about a bit before understanding that we can't just move everything up by one bit without losing tag space or the NaN-boxing. There is a Debian bug tracking the issue:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818616

As apinski says, we need to allow for 52-bit pointers soon (and so will x86), so a long-term solution involves changing the layout.
I see from https://bugzilla.mozilla.org/show_bug.cgi?id=1143022 that mmap hinting was used to make it work there for the time being. We could do the same thing for now.

@MikePall (Member) commented Mar 22, 2016

The mmap hinting approach described in the Mozilla bug report does not work properly with a (recent) Linux kernel. Neither does their code.

The address passed to mmap is only accepted as a hint if the requested address is not yet allocated.

But if it's already taken, then the hint is ignored and a random base address from the designated mmap VM area is allocated (i.e. outside of the acceptable addresses). The Linux kernel does not perform a linear search starting at the hint address (unlike BSD kernels etc.).

In other words: the user-space memory allocator would need to get a lot more clever about hinting. Otherwise it'll a) fail randomly (especially with different VMs in multiple threads), or b) perform very expensive linear address space searches, and/or c) kill performance due to overly spread-out randomized addresses (think about page tables & TLBs).
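
To illustrate the constraint, here is a rough sketch of a hint-and-check probe loop (illustrative only: the retry cap, step size, and limit are assumptions, and as noted above this kind of loop can still fail randomly with multiple threads or scatter mappings across the address space):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ADDR_LIMIT  ((uintptr_t)1 << 47)  /* LJ_GC64 pointers must stay below this */

static void *probe_alloc(size_t sz, uintptr_t hint)
{
  int i;
  for (i = 0; i < 32; i++) {              /* arbitrary retry cap */
    void *p = mmap((void *)hint, sz, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    if ((uintptr_t)p + sz <= ADDR_LIMIT)
      return p;                           /* landed in the usable range */
    /* The hint was ignored and the kernel returned an out-of-range
     * address: release it and try another candidate region. */
    munmap(p, sz);
    hint += (uintptr_t)64 << 20;          /* step by 64MB, arbitrarily */
  }
  return NULL;
}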

@daurnimator commented Mar 23, 2016

I don't suppose it's possible to remove one of the type bits so that we have 48 bits available for the GCRef? And then store the GCRef subtype in the GC object itself?

e.g.

** number           -----------------double------------------
** nil              |1111111111111|000|1...................1|
** false            |1111111111111|001|1...................1|
** true             |1111111111111|010|1...................1|
** int (LJ_DUALNUM) |1111111111111|011|0..0|------int-------|
** lightuserdata    |1111111111111|101|----48 bit pointer---|
** GC objects       |1111111111111|111|-----48 bit GCRef----|

This has one bit available for lightuserdata and GC objects. Could extend to 49 bit pointers, or keep the bit free for future/other use.

@DemiMarie commented Mar 28, 2016

One approach is to use /proc/self/maps or the like to try to find a suitable unmapped address. Then call mmap with MAP_FIXED to force the kernel to either give you the address you asked for, or fail. Loop on failure. In a single-threaded process, this is guaranteed to terminate the first time. In a multithreaded process, the probability that the loop is still going after n iterations should decrease exponentially with n, provided that the addresses are chosen randomly among the free addresses available (just use /dev/urandom).

The Linux kernel allocates physical memory lazily, and I suspect other kernels do as well. As a result, it is perfectly OK to allocate in GiB-sized chunks – it is routine for programs to allocate 1TB of virtual memory on 64-bit systems.

To mark a page as unused, use mprotect with PROT_NONE.

Finally, I would file a feature request with the kernel about changing the behavior, perhaps with a new flag to mmap.
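
A sketch of the reserve-then-commit part of this idea (sizes, names, and flags chosen for illustration; the reservation's base address would still have to land below 2^47, which is the hard part discussed above):

#include <stddef.h>
#include <sys/mman.h>

/* Reserve a large arena of virtual memory without committing RAM:
 * PROT_NONE pages cost only bookkeeping until they are first used. */
static void *arena_reserve(size_t sz)
{
  void *p = mmap(NULL, sz, PROT_NONE,
                 MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}

/* Commit a chunk when the allocator actually needs it. */
static int arena_commit(void *p, size_t sz)
{
  return mprotect(p, sz, PROT_READ|PROT_WRITE);
}

/* Mark a chunk unused again: drop the backing pages and make it
 * inaccessible, as suggested above with mprotect + PROT_NONE. */
static int arena_decommit(void *p, size_t sz)
{
  madvise(p, sz, MADV_DONTNEED);
  return mprotect(p, sz, PROT_NONE);
}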

@DemiMarie commented Mar 28, 2016

@MikePall Did the Linux kernel use to do the linear search? If so, this is a regression.

@MikePall (Member) commented Mar 28, 2016

@daurnimator Unacceptable slow-down. There's a reason the tag is part of the tagged value and not (just) part of the box.

@drbo There's no way to combine /proc/self/maps with MAP_FIXED, since this cannot be made thread-safe. Allocating a giant amount of memory from within a library, which is intended to be embedded, is unacceptable. Forget about the kernel, they will never change a thing.

@drbo I'm pretty sure the behavior of Linux/x64 was always that way. Maybe Mozilla never tested their code in the real world (the #ifdefs are for IA64 ;-) ).

@daurnimator commented Mar 29, 2016

@daurnimator Unacceptable slow-down. There's a reason the tag is part of the tagged value and not (just) part of the box.

Is it, though?
Are there hot paths where the type is accessed but not the value? Usually when you ask for the type of something, you're about to do something with it, so the cost of a cache miss due to the dereference shouldn't be high. (A sketch contrasting the two styles of check follows the list below.)

Quick survey:

  • The obvious: lua_type(): this function will suffer from an extra pointer lookup
  • via the macro checktp (and friends such as checktab, checkstr)
    • commonly in fast funcs to branch to the fallback path (jumps to fff_fallback).
    • e.g. in BC_TGETV, checktab is used to figure out whether to look in the table before going via __index.
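
To make the comparison concrete, a hypothetical pair of type checks (none of this is LuaJIT code; tag values, positions, and field names are invented):

#include <stdint.h>

#define TAG_SHIFT  47
#define TAG_MASK   0xf
#define TAG_TAB    11                     /* invented tag value for tables */
#define PTR_MASK   (((uint64_t)1 << 48) - 1)

typedef struct GCheader { uint8_t gct; } GCheader;  /* invented object header */

/* Tag in the value: a shift and a compare, no memory access. */
static int is_table_invalue(uint64_t tv)
{
  return ((tv >> TAG_SHIFT) & TAG_MASK) == TAG_TAB;
}

/* Tag in the object: every check costs a dependent load (a potential
 * cache miss), and it cannot even be attempted on non-pointer values
 * such as numbers without a separate pointer/non-pointer test first. */
static int is_table_boxed(uint64_t tv)
{
  const GCheader *h = (const GCheader *)(uintptr_t)(tv & PTR_MASK);
  return h->gct == TAG_TAB;
}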
@daurnimator commented Mar 29, 2016

Actually... don't we know that GC objects are aligned to 8 bytes, and hence that we have an extra 3 bits to work with?
Lightuserdata is user-provided, so we have to store the full 48 bits for it. That demands a variable-length encoding such as:

** number           -----------------double------------------
** nil              |1111111111111|000|1...................1|
** false            |1111111111111|001|1...................1|
** true             |1111111111111|010|1...................1|
** int (LJ_DUALNUM) |1111111111111|011|0..0|------int-------|
** lightuserdata    |1111111111111|111|----48 bit pointer---|
** GC objects       |1111111111111|1??|---45 bit GCRef--|???|
@wookey commented Mar 29, 2016

+++ daurnimator [2016-03-29 00:07 -0700]:

Actually... (stupid?) idea: we know the alignment of GC objects is to 8 bytes; so we have an extra 3 bits.

Right. I believe this is a good way to go if you want to keep the tagging in-value. You have control over the alignment of objects, and thus over those bottom bits, and they won't go away when architectures adjust for larger address spaces.

I assume all 4 bits of the tag are currently being used? So you would either need to shrink the tag down to 3 bits (which would also mean the tags can stay where they are for now), or keep 3 bits at the bottom and 1 in the current spot.

@corsix commented Mar 29, 2016

Tag bits toward the MSB are far more convenient than tag bits at the LSB, and keeping all the tag bits together is also convenient.

I'd be tempted to keep 4 bits of tag in the current position, give lightuserdata two tag values (i.e. encode their 48th address bit in the tag), and fit GC object pointers into 47 bits by dropping the least significant bit of their address (which, as already noted, we can do thanks to alignment). That said, this is not a decision to be taken lightly, and 52-bit addresses are a whole other world of pain.
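
A hypothetical encode/decode under that scheme (tag names and helpers invented; only the bit manipulation is the point):

#include <stdint.h>

#define PAYLOAD_MASK  (((uint64_t)1 << 47) - 1)
#define TAG_LIGHTUD0  5   /* invented: lightuserdata with address bit 47 == 0 */
#define TAG_LIGHTUD1  6   /* invented: lightuserdata with address bit 47 == 1 */

/* Lightuserdata: the 48th address bit moves into the choice of tag. */
static uint64_t encode_lightud(void *p, unsigned *tag)
{
  uint64_t a = (uint64_t)(uintptr_t)p;
  *tag = (a >> 47) ? TAG_LIGHTUD1 : TAG_LIGHTUD0;
  return a & PAYLOAD_MASK;
}

static void *decode_lightud(uint64_t payload, unsigned tag)
{
  uint64_t hi = (tag == TAG_LIGHTUD1) ? ((uint64_t)1 << 47) : 0;
  return (void *)(uintptr_t)(payload | hi);
}

/* GC pointers are 8-byte aligned, so the low bit is always zero:
 * shift it away to fit a 48-bit address into a 47-bit payload. */
static uint64_t encode_gcptr(void *p) { return (uint64_t)(uintptr_t)p >> 1; }
static void *decode_gcptr(uint64_t v) { return (void *)(uintptr_t)(v << 1); }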


@daurnimator commented Mar 30, 2016

I'd be tempted to keep 4 bits of tag in the current position, give lightuserdata two tag values (i.e. encode their 48th address bit in the tag), and fit GC object pointers into 47 bits by dropping the least significant bit of their address (which, as already noted, we can do thanks to alignment).

Sounds good to me :)

That said, not a decision to be taken lightly, and 52-bit addresses are a whole other world of pain.

I'm probably not caught up enough with it, but I know that even HP's "The Machine" was still stuck with 48 bits of virtual address space. How far away is 52 bits for "normal" developers?

Obviously we can never pack a 52-bit lightuserdata into a NaN (as we only have 51 bits to play with (or do we have 52? there's the NaN sign bit too)), so any changes will be significant.

@DemiMarie commented Mar 30, 2016

@MikePall Yes, using /proc/self/maps and MAP_FIXED is racy.

My preferred approach is probabilistic: on any reasonable machine, and in the vast majority of cases, there is enough unmapped usable address space that one can keep trying random addresses in the usable range until one succeeds, and one will succeed quickly with high probability. The failure probability drops off exponentially and steeply: with 2^47 bytes of usable address space and at most a generous 2^40 bytes actually mapped, each random try fails with probability at most 2^40 / 2^47 = 2^-7, so after 10 iterations the overall failure probability is about 2^-70 and success is virtually guaranteed. I can't think of any other way to make LuaJIT work, short of the kernel getting a new API. Yes, failure can theoretically happen, but cryptographers consider a 2^-70 chance (roughly one in a billion trillion, which no attacker can brute-force) good enough, so it is good enough for me.

Also, note that the large amount of memory is virtual memory, which is essentially free.

Finally, I have also filed a feature request against the kernel.

@daurnimator commented Apr 12, 2016

I'd be tempted to keep 4 bits of tag in the current position, give lightuserdata two tag values (i.e. encode their 48th address bit in the tag), and fit GC object pointers into 47 bits by dropping the least significant bit of their address (which, as already noted, we can do thanks to alignment).

Sounds good to me :)

@corsix is this something you'd be interested in taking on? Should we create a new issue for it?

@corsix commented Apr 12, 2016

I have no intent to play with this in the immediate future.

@wookey commented Apr 12, 2016

+++ daurnimator [2016-03-29 20:18 -0700]:

I'm probably not caught up enough with it, but HP's "The Machine" is the only working system I know of that has more than 48 bits of virtual address space. How far away are 52-bit virtual addresses for "normal" developers?

You are right that 'The Machine' is the only current implementation, but the tech it is a test-bed for, '3D XPoint' (disk-sized non-volatile 'RAM'), is coming, so over the next 5 years 52-bit addressing will become a thing we have to deal with. Which means a major redesign of these JITs that use NaN-boxing and in-word tags.

I'm not following the kernel development, but I guess this will get into the kernel quite soon (marked experimental). I have no idea how long before people start getting hardware with this available. Not this year, but maybe next. So it will need dealing with in the not-too-distant future.

Obviously we can never pack a 52-bit lightuserdata into a NaN (as we only have 51 bits to play with), so any changes there will be significant.

Exactly. As the memory guru in my office just said: 'you have to use kernel values as-supplied'. You can't just go doubling file handles before use, and you can't change the pointers provided (except the bottom bits, because you control alignment); the kernel+hardware controls pointer size and memory allocation.

@DemiMarie commented Apr 13, 2016

That is what MAP_FIXED is for: it tells the kernel to give you the exact address you asked for, or fail. If you get an error, retry with a different, randomly chosen address in the acceptable part of the address space.

The alternatives are:

  • Box everything (slow -- probably requires a generational moving collector for "immediate" objects to get enough perf).
  • Use double-word representations for everything: 1 word object + 1 word metadata.
  • Use an auxiliary data structure to distinguish pointers from non-pointers.
  • Use a 1-bit LSB tag to distinguish pointers from non-pointers.
  • Others I have not figured out.

@MikePall (Member) commented Apr 18, 2016

Resolved in v2.1 branch as 0c6fdc1.

akopytov pushed a commit to akopytov/LuaJIT that referenced this issue Oct 15, 2016
@zhongweiy commented Oct 22, 2016

Hi all, I've created a patch to fix the lightuserdata (lightud) issue with 48-bit addresses on ARM64:
https://www.freelists.org/post/luajit/fix-lightud-type-for-48bit-virtual-address

Could you help review it? Thanks!

@daurnimator commented Feb 23, 2017

so over the next 5 years 52-bit addressing will become a thing we have to deal with. Which means a major redesign of these JITs that use NaN-boxing and in-word tags.

I noticed that x86-64 is now going to a 57-bit virtual address space with 5-level paging: https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf

Do we have a plan?

@vcunat commented Mar 14, 2019

Has anyone considered switching from unsigned 47-bit to signed 47-bit addresses? (I.e. all the bits above the stored 47 would have to be equal to the top stored bit.) I know basically nothing about the LuaJIT implementation or what kinds of address layouts are commonly in use, but I strongly suspect the signed case would cover a significantly larger fraction of addresses used in practice, and conversion back still seems relatively cheap.
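
For illustration, the decode would be a sign extension, roughly (a sketch of the idea, not LuaJIT code):

#include <stdint.h>

/* Treat the stored 47 bits as signed: replicate the top stored bit
 * upward on decode. This covers the lowest 2^46 bytes plus the highest
 * 2^46 bytes of the address space, instead of only the lowest 2^47. */
static void *decode_signed47(uint64_t payload47)
{
  int64_t v = (int64_t)(payload47 << 17);  /* move bit 46 up to bit 63 */
  return (void *)(intptr_t)(v >> 17);      /* arithmetic shift sign-extends */
}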

@DemiMarie commented Apr 6, 2019

@MikePall Would it be possible to box and intern lightuserdata? Obviously it would be slow, but the Lua/C API is not known for its speed anyway, and LuaJIT is not robust against allocation failures.

@vcunat commented Apr 7, 2019

In some cases I now use a full userdata as a replacement for passing C pointers to Lua. That's heavier (possibly inevitable), but the real complication of the approach is that the "implicit conversions" yield a pointer-to-pointer instead of the pointer itself, so some Lua code around it has to be tweaked.

Details:

/* C side: box the raw pointer inside a full userdata. */
static inline void lua_pushpointer(lua_State *L, void *p)
{
       void *addr = lua_newuserdata(L, sizeof(void *));
       memcpy(addr, &p, sizeof(void *));
}

-- Lua side: the userdata holds a pointer *to* the pointer,
-- so cast to a double pointer and dereference once.
local foo_ptr = ffi.cast('foo_type **', foo_userdata)[0]

In my case I needed cdata in the end anyway, but there's no way to create those from the C side yet. Your use cases might differ.
