Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix arm64 register allocation issue for XLOAD. #438

Conversation

pgalizia-qdt
Copy link

For the arm64 implementation of asm_xload(), it is possible for
the dest register selected to be the same as one of the source
registers generated in the asm_fusexref() call. To prevent this,
exclude the dest register from the list of allowed registers for
that call.

Thanks to Javier for guidance as well as his script to replicate
the issue.

For the arm64 implementation of asm_xload(), it is possible for
the dest register selected to be the same as one of the source
registers generated in the asm_fusexref() call.  To prevent this,
exclude the dest register from the list of allowed registers for
that call.

Thanks to Javier for guidance as well as his script to replicate
the issue.
@sindrom91
Copy link

Could you share your testcase?

@pgalizia-qdt
Copy link
Author

Certainly. This was pointed out by @javierguerragiraldez

local empty = {}
local _a, _b, _c = nil, nil, nil


local function scan(vs)
   for _, v in ipairs(vs) do
      local sep = v:find("@", 1, true)
      if v:sub(sep+2, -2):byte() == 0x3f then -- 0x3f == ?
      end

      local int = empty[v]

      v:find(":", 1, true)
   end
end


local function w()
   for _ = 1, 500 do
      scan({"ab@xyz"})
   end
end
w()

@sindrom91
Copy link

It doesn't seem to me that you fixed the real issue. Destination and operand registers don't necesaarily need to be different.

I've taken a look at this and it seems that constant rematerialization uses one of the operands (which has the same register as destination) to rematerialize a constant. That is probably what the real issue is. Constant rematerialization shouldn't use modified registers.

@javierguerragiraldez
Copy link

hi @sindrom91 if I follow you, in asm_fusexref there's no need to totally exclude rd from allow parameter, since it's used for several temporary register allocations. but only on line 298, where it allocates rn, so replacing Reg rn = ra_alloc1(as, ir->op1, allow); with Reg rn = ra_alloc1(as, ir->op1, rset_exclude(allow, rd)); am i right?

if so, can you elaborate on why asm_xstore() does exclude the src reigster? I assumed it would be better to make asm_sload() similar to that, since these two are the only uses of asm_fusexref()

@pgalizia-qdt
Copy link
Author

Or should we add the fix in emit_kdelta() itself (changing the list of registers generated by ~as->freeset & RSET_GPR to be (~as->freeset & RSET_GPR) & ~as->modset to exclude the modified registers from the list? Now I'm wondering if there are additional cases where we could hit this other than the testcase above...

@javierguerragiraldez
Copy link

hum... shouldn't the dest register had its ra->cost[] entry accordingly modified, and thus wouldn't be picked by emit_kdelta() ? probably not in every case...

@sindrom91
Copy link

if so, can you elaborate on why asm_xstore() does exclude the src reigster? I assumed it would be better to make asm_sload() similar to that, since these two are the only uses of asm_fusexref()

I think it gets excluded because it holds data and we don't want it to get evicted and lose whatever data it holds. On the other hand, this doesn't apply to dest registers because, in general, they don't hold any data, so they can be reused for operands.

Or should we add the fix in emit_kdelta() itself (changing the list of registers generated by ~as->freeset & RSET_GPR to be (~as->freeset & RSET_GPR) & ~as->modset to exclude the modified registers from the list?

Yes, I was referring to emit_kdelta. When picking base register for constant rematerialization, it picks the one that has been modified in the mean time. I'm not sure if using as->modset is the right way, I don't understand it well. You have to check it for yourself.

Now I'm wondering if there are additional cases where we could hit this other than the testcase above...

This crossed my mind as well.

@pgalizia-qdt
Copy link
Author

@sindrom91 Thanks for that info. The modset change does seem to fix the problem, but I want to look further into whether that's the right way to fix it, or if there's a better/preferred way.

@MikePall
Copy link
Member

Yes, there's a more general problem behind that. Constant rematerialization must not use other registers that contain constants, if the register is in-flight. This is an issue for many IR instructions, not just XLOAD.

What happens here: The assembly of an IR instruction allocates a constant into a free register. Then it spills another register (due to high register pressure), which is rematerialized using the same constant (which it assumes is now in the allocated register). If the first register also happens to be the destination register, then it's modified before the rematerialization. Boom.

rn = const
...
[encode] rd = op(rn, rm)  # rd is modified here
[rematk] rm = rn + k      # must not use rd, i.e. rd != rn
[alloc]  rm               # causes eviction and rematk
[alloc]  rn
^^^^^^^^ read from bottom up

Note that it's quite common to use the same register for the destination and one of the sources. This is very helpful for PHIs in loops, too.

And the particular problem only happens when the rematerialization can make use of the same constant as one of the sources of the IR instruction.

What I'm wondering though: why doesn't this happen more often and on more architectures? There's no guard in cross-register constant rematerialization against that. Even something as basic as asm_intop() is affected. ARM (32 bit) uses the same rematerialization and has a lot higher register pressure.

Sigh ... running out of time.

[as->modset has loop scope, this is not what you want.]

@pgalizia-qdt
Copy link
Author

Thanks @MikePall for that explanation. It sounds like my initial proposed solution is backwards -- instead of prohibiting rd to be used in the rematk, I should be selecting rd in such a way that it doesn't collide with the rematk. And the as->modset solution is just wrong.

What I'm wondering though: why doesn't this happen more often and on more architectures? There's no guard in cross-register constant rematerialization against that. Even something as basic as asm_intop() is affected. ARM (32 bit) uses the same rematerialization and has a lot higher register pressure.

That's a good question... I'm do some digging to see if I can figure that part out. I'd rather fix the root issue here than slap on a (wrong) workaround...

@pgalizia-qdt
Copy link
Author

Did some more digging into this issue and did some parallel gdb walkthroughs between arm64, armv7, and ppc.

First of all, all three platforms generate the same/similar instruction sequence (load byte, rematerialize the string, compare the loaded-byte value), but armv7 and ppc never seem to hit the same register issue as arm64. It looks like this is due to when both of those platforms call ra_rematk(), compared to when arm64 does.

The armv7 and ppc architectures work -- and arm64 fails -- because the ra_rematk() function is called for armv7/ppc before the r4 register is cleared in as->freeset, while for arm64 the ra_rematk() function is called after the x4 register is cleared in as->freeset.

In ra_allocref(), both armv7 and ppc pass the ra_hashint() call, which results in a call to ra_rematk(), whereas for arm64 the ra_hashint() call fails, so we drop further down the function and eventually end up with x4 selected via rset_pickbot(). This returns, and x4 is cleared in as->freeset, which means that when we later do hit ra_rematk() later on, x4 is still seen as an eligible register for the source.

The reason armv7 and ppc both pass ra_hashint() is because in their asm_fusexref() functions, they pass the irref_isk(ir->op1)) check, which means ra_alloc1() gets called with ir->op2, which has the hint set.

But arm64 -- on the other hand -- uses asm_isk32() to check both ir->op1 and ir->op2. The asm_isk32() call checks irref_isk(), then also checks if this passes checki32(). The irref_isk() fails immediately for ir->op2, but passes for ir->op1 However, the checki32() call fails for ir->op1, so it fails the overall asm_isk32() call.

At that point, arm64 drops into the "else" clause at lj_asm_arm64.h:297. This code handles the checki32() failure case. However, this code assumes we want to use ir->op1 for the ra_alloc1() call, but ir->op1 doesn't have the hint -- ir->op2 does. That's causing the problem with ra_rematk() being called late as opposed to earlier in the callstack. (and results in x4 being selected due to rset_pickbot() grabbing it).

My proposal to fix this for arm64 is to modify the "else" clause at lj_asm_arm64.h:297 to invoke irref_isk() again on ir->op1 and ir->op2 (but not checki32() -- we're already in the block that deals with that). That passes the IRRef with the hint set to ra_alloc1(), which would cause the ra_rematk() call to be done earlier rather than later.

I've made a code change locally, and I'm currently testing it to make sure I didn't break anything else obvious, but figured I'd check to make sure my reasoning was correct.

@javierguerragiraldez
Copy link

while i still don't grasp all the details, the base explanation seems solid (the last check is done only on ir->op1 and not on ir->op2, which this case, on this architectures is one of the few (maybe only?) to actually use).
how are your tests going? I know that some kinds of fixes can be pretty stable because they steer the code into a more common case, while others open a whole new can of worms...

@pgalizia-qdt
Copy link
Author

Hi Javier,

Testing is coming around pretty good. I ran this patch through LuaJIT-test-cleanup bench and test (with some of the bench tests expanded to hit the JIT a bit harder), and haven't seen any regressions yet. Doesn't mean they're not in there.

Will run this against more scripts as I find them in an attempt to see if I can break it.

For arm64, it's possible for both IRRefs to fail asm_isk32(), but
one of them pass irref_isk().  Add a secondary check for the latter
call if both asm_isk32() calls fail.
siddhesh added a commit to moonjit/moonjit that referenced this pull request Feb 22, 2019
PR LuaJIT#438 describes an issue with constant rematerialization where a
rematerializaed constant could use a register it thought was constant
but was in fact clobbered in the same instruction that it was defined
in.

Fix the problem by adding a mask and a safe entry point for the
allocator that marks constant holding registers that are unsafe for
rematerialization due to them being the same as the destination
register for an instruction it is used in.  While this is only
implemented in asm_fusexref, it should ideally be implemented for all
instructions where the destination is likely to clobber one of the
source registers and if one or more of the source registers hold
constants.
@siddhesh
Copy link

I spent some time on this and created a new PR to fix this problem here:

#479

However I thought about this a bit more and it looks like your approach (the patch still needs a minor fix-up; I'll post it soon) might be the most minimal way to fix this.

This problem of remat not getting things right when a const register is reused for the destination can only occur when the const register is allocated before the other operand because this is the only case when ra_rematk sees the const register. This might also explain why the problem is not seen more commonly; we just need to make sure that constants are allocated last. This won't happen when there are two const operands, nor will it happen with two variants.

I'll leave my PR as is for now in case there are comments but otherwise I'll also post a small fixup for this patch.

siddhesh added a commit to moonjit/moonjit that referenced this pull request Feb 22, 2019
Even if the offset is a constant, it is not 32-bit since it failed
that check earlier before it came here.  The code is thus useless and
hence removed.  This also fixes inconsistencies with op1/op2 renaming
that were introduced in PR LuaJIT#438.  They were never triggered because
the code path is effectively dead for arm64.
siddhesh added a commit to moonjit/moonjit that referenced this pull request Mar 1, 2019
Even if the offset is a constant, it is not 32-bit since it failed
that check earlier before it came here.  The code is thus useless and
hence removed.  This also fixes inconsistencies with op1/op2 renaming
that were introduced in PR LuaJIT#438.  They were never triggered because
the code path is effectively dead for arm64.
@MikePall
Copy link
Member

MikePall commented Dec 8, 2019

Fixed. Thanks!

@MikePall MikePall closed this Dec 8, 2019
@gentle-king
Copy link

gentle-king commented Mar 20, 2021

@MikePall @siddhesh
I have one issue which is related to register selection, randomly on ARM64, not frequency.
a 32bit register is used to pass a 64bit pointer, finally lost some bits, and coredump.
and the version is before 20190507 in openresty bundle.

I have spent lots of time to check, and find your merge here.
could you help check if my issue is known issue, and fixed by your merge?

openresty/openresty#711 (comment)

Brugarolas pushed a commit to Brugarolas/luajit-experimental that referenced this pull request Jan 7, 2024
Even if the offset is a constant, it is not 32-bit since it failed
that check earlier before it came here.  The code is thus useless and
hence removed.  This also fixes inconsistencies with op1/op2 renaming
that were introduced in PR LuaJIT#438.  They were never triggered because
the code path is effectively dead for arm64.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants