JIT: Support bitwise field insertions for call arguments #115977
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
src/coreclr/jit/lower.cpp (outdated)
```diff
 if (arg->OperIs(GT_FIELD_LIST))
 {
-    unsigned int regIndex = 0;
-    for (GenTreeFieldList::Use& use : arg->AsFieldList()->Uses())
-    {
-        const ABIPassingSegment& segment = abiInfo.Segment(regIndex);
-        InsertPutArgReg(&use.NodeRef(), segment);
-
-        regIndex++;
-    }
+    LowerArgFieldList(callArg, arg->AsFieldList());
+    arg = *ppArg;
```
This is enabled everywhere now since even single-reg args can be structs with multiple fields.
Add support for using bitwise operations to reconstruct registers passed into calls from multiple promoted fields. Remove the IR invariant that `FIELD_LIST` args must always map cleanly to registers; instead, any `FIELD_LIST` is allowed for register-only arguments before lowering, and lowering takes care to normalize them into a handled shape. `fgTryMorphStructArg` is changed to take advantage of this by now producing `FIELD_LIST` even when a promoted arg does not match the target ABI. Support in physical promotion will be added in a follow-up.

Split arguments are not handled and retain the old IR invariant of requiring registers and stack slots to map cleanly from `FIELD_LIST`.

win-x64 examples:

```csharp
static void Foo(int x)
{
    Use<int?>(x);
    Use<int?>(5);
    Use<int?>(null);
}
```

```diff
 G_M7200_IG02:  ;; offset=0x0004
-       mov      byte ptr [rsp+0x20], 1
-       mov      dword ptr [rsp+0x24], ecx
-       mov      rcx, qword ptr [rsp+0x20]
+       mov      ecx, ecx
+       shl      rcx, 32
+       or       rcx, 1
        call     [Program:Bar(System.Nullable`1[int])]
-       mov      dword ptr [rsp+0x24], 5
-       mov      rcx, qword ptr [rsp+0x20]
+       mov      rcx, 0x500000001
        call     [Program:Bar(System.Nullable`1[int])]
-       mov      byte ptr [rsp+0x20], 0
        xor      ecx, ecx
-       mov      dword ptr [rsp+0x24], ecx
-       mov      rcx, qword ptr [rsp+0x20]
-       ;; size=55 bbWeight=1 PerfScore 14.25
+       ;; size=34 bbWeight=1 PerfScore 7.50
 G_M7200_IG03:
        add      rsp, 40
        tail.jmp [Program:Bar(System.Nullable`1[int])]
        ;; size=10 bbWeight=1 PerfScore 2.25
```

```csharp
static void Foo(int x, float y)
{
    Use((x, y));
}
```

```diff
 G_M42652_IG01:  ;; offset=0x0000
-       push     rax
-       ;; size=1 bbWeight=1 PerfScore 1.00
+       ;; size=0 bbWeight=1 PerfScore 0.00
 G_M42652_IG02:
-       mov      dword ptr [rsp], ecx
-       vmovss   dword ptr [rsp+0x04], xmm1
-       mov      rcx, qword ptr [rsp]
+       vmovd    eax, xmm1
+       shl      rax, 32
+       mov      ecx, ecx
+       or       rcx, rax
        ;; size=13 bbWeight=1 PerfScore 3.00
 G_M42652_IG03:
-       add      rsp, 8
        tail.jmp [Program:Use[System.ValueTuple`2[int,float]](System.ValueTuple`2[int,float])]
```

A win-arm64 example:

```diff
 G_M33990_IG01:
-       stp      fp, lr, [sp, #-0x20]!
+       stp      fp, lr, [sp, #-0x10]!
        mov      fp, sp
-       str      xzr, [fp, #0x10]    // [V03 tmp2]
-       ;; size=12 bbWeight=1 PerfScore 2.50
+       ;; size=8 bbWeight=1 PerfScore 1.50
 G_M33990_IG02:
        cbz      x0, G_M33990_IG04
        ;; size=4 bbWeight=1 PerfScore 1.00
 G_M33990_IG03:
-       str      x0, [fp, #0x10]     // [V07 tmp6]
-       str      wzr, [fp, #0x18]    // [V08 tmp7]
-       ldr      w0, [x0, #0x08]
-       str      w0, [fp, #0x1C]     // [V09 tmp8]
+       ldr      w1, [x0, #0x08]
        b        G_M33990_IG05
-       ;; size=20 bbWeight=0.50 PerfScore 3.50
+       ;; size=8 bbWeight=0.50 PerfScore 2.00
 G_M33990_IG04:
-       str      xzr, [fp, #0x10]    // [V07 tmp6]
-       str      xzr, [fp, #0x18]
-       ;; size=8 bbWeight=0.50 PerfScore 1.00
+       mov      x0, xzr
+       mov      w1, wzr
+       ;; size=8 bbWeight=0.50 PerfScore 0.50
 G_M33990_IG05:
-       ldp      x0, x1, [fp, #0x10] // [V03 tmp2], [V03 tmp2+0x08]
-       movz     x2, #0xD920         // code for Program:Use[System.Memory`1[int]](System.Memory`1[int])
-       movk     x2, #0x4590 LSL #16
+       mov      w1, w1
+       lsl      x1, x1, #32
+       movz     x2, #0xD950         // code for Program:Use[System.Memory`1[int]](System.Memory`1[int])
+       movk     x2, #0x4592 LSL #16
        movk     x2, #0x7FFE LSL #32
        ldr      x2, [x2]
-       ;; size=20 bbWeight=1 PerfScore 7.50
+       ;; size=24 bbWeight=1 PerfScore 6.00
 G_M33990_IG06:
-       ldp      fp, lr, [sp], #0x20
+       ldp      fp, lr, [sp], #0x10
        br       x2
        ;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 72
+; Total bytes of code: 60
```
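For reference, the value the `shl`/`or` sequence above computes for the `int?` case can be written as a minimal C# sketch (illustrative names, not code from this PR): `hasValue` occupies byte 0 of the argument register and `value` occupies bytes 4-7.

```csharp
// Minimal sketch of the bitwise field insertion for a win-x64 Nullable<int>
// argument register: hasValue at offset 0, value at offset 4.
static ulong BuildNullableIntReg(bool hasValue, int value)
{
    ulong reg = hasValue ? 1UL : 0UL;   // byte 0: hasValue
    reg |= (ulong)(uint)value << 32;    // bytes 4-7: value (the shl 32 + or)
    return reg;
}
```

For example, `BuildNullableIntReg(true, 5)` yields `0x500000001`, matching the `mov rcx, 0x500000001` in the first diff.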
Force-pushed from `5748cbf` to `0aa3197` (Compare): "Wrong for floats..."
It feels like some of the above examples could be optimized slightly, by relying on the fact that 32-bit operations implicitly zero the upper 32 bits of the 64-bit register. For example, any place where the lower bits aren't coming from a register, we can elide the mov:

```asm
mov ecx, ecx ; This is unnecessary as the shl causes the upper 32-bits to contain the lower 32-bits and lower 32-bits to be zero
shl rcx, 32
or  rcx, 1
```

Likewise for the

```asm
shl   rcx, 32
vmovd eax, xmm1
or    rcx, rax
```

I'm then not sure I understand the Arm64 example. I'd expect for the …
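The first elision the comment describes can be checked in C# terms (a hedged illustration, not JIT code): masking the low 32 bits before a left shift by 32 never changes the result, because the shift discards the upper bits anyway.

```csharp
using System.Diagnostics;

static void CheckMovIsRedundant(ulong rcx)
{
    // mov ecx, ecx; shl rcx, 32; or rcx, 1
    ulong withMov    = ((rcx & 0xFFFF_FFFFUL) << 32) | 1;
    // shl rcx, 32; or rcx, 1
    ulong withoutMov = (rcx << 32) | 1;
    Debug.Assert(withMov == withoutMov); // holds for every possible input
}
```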
Changes look good/correct to me.
Might be worth seeing if the better codegen is being missed because we're generally missing an optimization, or if there is slightly different IR we should generate.
Right, I also pointed at this case back in #113178. It is not that simple to optimize because it is a necessary type-changing cast in the underlying IR. The simplest way to optimize it would probably be an emitter peephole, or containing the int -> long cast when used in the …
I had forgotten to include the corresponding C# code. It is actually conversion of an …
Gotcha. I think we're missing a peephole or containment opportunity for the …
It seems that we are still spilling the return value before passing it to a use method call:

```csharp
UseValue(GetValue());

[MethodImpl(MethodImplOptions.NoInlining)]
static int? GetValue()
{
    return 42;
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void UseValue(int? value) { }
```

Codegen:

```asm
G_M27646_IG02:  ;; offset=0x0004
       call     [Program:GetValue():System.Nullable`1[int]]
       mov      qword ptr [rsp+0x20], rax
       mov      rcx, qword ptr [rsp+0x20]
       call     [Program:UseValue(System.Nullable`1[int])]
       nop
```

I think it can be improved by adding support in physical promotion?
There is still no bitwise handling for return values from calls. I may add support for the single-reg case in .NET 10, but it's unlikely that support will come for the multireg case in .NET 10, as that requires a new IR representation (it's the two missing items under ABI handling in #109635). Even if support for the single-reg case is added in .NET 10, it will likely still result in unnecessary bitwise operations in this case; it will require work to remove old promotion before we can get rid of that, I think.
```cpp
if (!argNode->TypeIs(TYP_STRUCT) && arg->AbiInfo.HasExactlyOneRegisterSegment())
{
```
All independently promoted structs passed in one register are converted to `FIELD_LIST` above with this change, so this condition is always false now.
(This is referring to deleted code above this line)
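As a hypothetical illustration (example code by the editor, not from the PR), this is the kind of argument the comment is about: on win-x64 the struct below is passed in a single 8-byte register, yet independent promotion splits it into two `int` locals, which morph now wraps in a `FIELD_LIST`.

```csharp
struct Pair
{
    public int A; // offset 0 -> low 32 bits of the argument register
    public int B; // offset 4 -> high 32 bits of the argument register
}

static int Sum(Pair p) => p.A + p.B;

static void Caller()
{
    // With independent promotion, p.A and p.B live as separate locals; the
    // call argument becomes a FIELD_LIST of the two fields, and lowering
    // reassembles the register with shift/or as in the diffs above.
    Pair p = new Pair { A = 1, B = 2 };
    Sum(p);
}
```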
```cpp
//-----------------------------------------------------------------------------
// FieldsMatchAbi:
//   Check if the fields of a local map cleanly (in terms of offsets) to the
//   specified ABI info.
//
// Arguments:
//   varDsc  - promoted local
//   abiInfo - ABI information
//
// Returns:
//   True if it does. In that case FIELD_LIST usage is allowed for split args
//   by the backend.
//
bool Compiler::FieldsMatchAbi(LclVarDsc* varDsc, const ABIPassingInformation& abiInfo)
{
    if (varDsc->lvFieldCnt != abiInfo.CountRegsAndStackSlots())
    {
        return false;
    }

    for (const ABIPassingSegment& seg : abiInfo.Segments())
    {
        if (seg.IsPassedInRegister())
        {
            unsigned fieldLclNum = lvaGetFieldLocal(varDsc, seg.Offset);
            if (fieldLclNum == BAD_VAR_NUM)
            {
                return false;
            }
        }
        else
        {
            for (unsigned offset = 0; offset < seg.Size; offset += TARGET_POINTER_SIZE)
            {
                if (lvaGetFieldLocal(varDsc, seg.Offset + offset) == BAD_VAR_NUM)
                {
                    return false;
                }
            }
        }
    }

    return true;
}
```
This is only used for args split across registers and stack (arm32 mainly). These args retain the old IR requirements of needing fields to map cleanly to registers and stack slots.
I might clean that up as a separate future PR, but I have a long-term ambition to remove special handling of split arguments in the backend entirely by physically splitting the arg in the call node into multiple args instead, as part of lowering. In fact, I think multireg args could also be handled this way, which would make the backend much more uniform around this.
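To make "map cleanly" concrete, here is a hedged arm32-flavored illustration (struct names and the exact split are assumptions for the example): a large struct argument can be split across `r0`-`r3` and the stack in 4-byte units, and `FieldsMatchAbi` asks whether every such unit starts exactly at a promoted field's offset.

```csharp
// Every 4-byte register/stack slot lines up with a promoted int field at
// the same offset, so FieldsMatchAbi would return true and FIELD_LIST is
// allowed for the split argument:
struct CleanSplit { public int A, B, C, D, E; }

// Alignment padding means no field starts at offset 4 (B starts at offset
// 8), so FieldsMatchAbi would return false and the argument keeps the old
// non-FIELD_LIST handling:
struct MisalignedSplit { public byte A; public long B; }
```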
cc @dotnet/jit-contrib PTAL @EgorBo

Diffs. Overall perfscore and size improvements across the board, with some regressions here and there.

The size improvements on arm64 are generally not as pronounced, since the existing pattern uses loads which are quite compact (e.g. `ldp`):

```diff
-       ldp      x2, x3, [fp, #0x40]    // [V08 tmp4], [V08 tmp4+0x08]
-       ldp      x0, x1, [fp, #0x30]    // [V09 tmp5], [V09 tmp5+0x08]
+       mov      x1, x21
+       mov      w0, w20
+       mov      w2, w19
+       orr      x0, x2, x0, LSL #32
+       ldp      x2, x3, [fp, #0x28]    // [V08 tmp4], [V08 tmp4+0x08]
```

We should be able to eliminate at least one of those `mov`s.

One case that is clearly a regression -- sometimes we transform a promoted local into `FIELD_LIST` …

There are some throughput regressions. The big ones are in collections with very few MinOpts contexts though, so generally it is just around ~0.1% in MinOpts. The detailed TP diff is:

```
Base: 17274781343, Diff: 17290066161, +0.0885%

 39109000 : NA       : 60.61% : +0.2264% : private: void __cdecl Lowering::InsertPutArgReg(struct GenTree **, class ABIPassingSegment const &)
   795360 : NA       :  1.23% : +0.0046% : `Compiler::fgTryMorphStructArg'::`39'::<lambda_1>::operator()
   -96207 : -5.45%   :  0.15% : -0.0006% : public: bool __cdecl Compiler::fgTryMorphStructArg(class CallArg *)
  -795360 : -100.00% :  1.23% : -0.0046% : `Compiler::fgTryMorphStructArg'::`63'::<lambda_1>::operator()
-23725022 : -40.22%  : 36.77% : -0.1373% : private: void __cdecl Lowering::LowerArg(struct GenTreeCall *, class CallArg *)
```

so it looks like MSVC just stopped inlining `InsertPutArgReg` into `LowerArg`.
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress
Azure Pipelines successfully started running 2 pipeline(s).
```cpp
// GetRegisterType:
//   Return the smallest type larger or equal to Size that most naturally
//   represents the register this segment is passed in, taking into account the
//   GC info of the specified layout.
```
nit: the formatting is a bit off (2 spaces vs 3 spaces for the Returns section; also, the Arguments section is missing)
Let me address that in a follow-up.
```cpp
if ((i == numRegs - 1) && !varTypeUsesFloatReg(regType))
{
    // Allow tail end to pass undefined bits into the register
    regEnd = regStart + REGSIZE_BYTES;
```
I'm just curious: do we have a guarantee that we won't load the undefined bits as a payload somehow?
Yes, the upper bits of registers containing struct values are undefined in the managed ABI, so this is OK. (This is just ensuring the handling matches what we had before; without this there were a bunch of regressions.)

I think the actual source that creates these loads is morph (runtime/src/coreclr/jit/morph.cpp, lines 2442 to 2449 at 7e41ae6):

```cpp
#ifdef TARGET_ARM64
        if ((offset > 0) && argNode->OperIsLocalRead())
        {
            // For arm64 it's beneficial to consider all tails to
            // be TYP_I_IMPL to allow more ldp's.
            type = TYP_I_IMPL;
        }
#endif
```
I'm not sure it's worth it -- this means we can sometimes generate 2x 8-byte loads => `ldp` instead of one 8-byte + one <8-byte load, so the code is slightly smaller. But then maybe that may result in a stall? Probably worth seeing if we can come up with an example, and then maybe remove it if we can.
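For what it's worth, a hedged example of the shape where this should kick in (constructed by the editor, not taken from the thread): a 12-byte struct whose 4-byte tail sits at a nonzero offset.

```csharp
struct S
{
    public int A; // offset 0
    public int B; // offset 4
    public int C; // offset 8; the 4-byte tail at a nonzero offset
}

static void Use(S s) { }

// Passing a stack-allocated `s` needs two register loads: bytes 0-7 and the
// 4-byte tail at offset 8. The arm64 code above widens the tail read to 8
// bytes (TYP_I_IMPL); the extra bytes are undefined but tolerated by the
// managed ABI, and the two 8-byte loads can then combine into a single ldp.
```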
That's pretty cool!