JIT: Support bitwise field insertions for call arguments #115977

Merged
merged 5 commits on May 28, 2025

Conversation

jakobbotsch
Member

@jakobbotsch jakobbotsch commented May 25, 2025

Add support for using bitwise operations to reconstruct registers passed into calls from multiple promoted fields. Remove the IR invariant that `FIELD_LIST` args must always map cleanly to registers; instead, any `FIELD_LIST` is allowed for register-only arguments before lowering, and lowering takes care of normalizing them into a handled shape.

`fgTryMorphStructArg` is changed to take advantage of this by now producing `FIELD_LIST` even when a promoted arg does not match the target ABI. Support in physical promotion will be added in a follow-up.

Split arguments are not handled and retain the old IR invariant of requiring registers and stack slots to map cleanly from `FIELD_LIST`.
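Conceptually, lowering rebuilds each register value by zero-extending every promoted field, shifting it to its byte offset within the register, and OR-ing the pieces together. Below is a minimal standalone C++ sketch of the idea (illustrative only, not the JIT's implementation; the field layout assumed here matches the Nullable<int> example that follows):

```cpp
// Illustrative-only sketch (not JIT code) of the bitwise insertion: each
// promoted field is zero-extended, shifted to its byte offset in the
// register, and OR'd into the result.
#include <cstdint>
#include <cstdio>

// Build the single 8-byte register used to pass a Nullable<int> on win-x64:
// a 1-byte hasValue flag at offset 0 and a 4-byte value at offset 4.
static uint64_t BuildNullableIntArg(bool hasValue, int32_t value)
{
    uint64_t reg = 0;
    reg |= (uint64_t)(uint8_t)hasValue << (8 * 0); // field at offset 0, 1 byte
    reg |= (uint64_t)(uint32_t)value << (8 * 4);   // field at offset 4, 4 bytes
    return reg;
}

int main()
{
    // Matches the constant the diff below materializes for Use<int?>(5).
    printf("0x%llx\n", (unsigned long long)BuildNullableIntArg(true, 5)); // prints 0x500000001
    return 0;
}
```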

win-x64 examples:

```csharp
static void Foo(int x)
{
    Use<int?>(x);
    Use<int?>(5);
    Use<int?>(null);
}
```

```diff
 G_M7200_IG02:  ;; offset=0x0004
-       mov      byte  ptr [rsp+0x20], 1
-       mov      dword ptr [rsp+0x24], ecx
-       mov      rcx, qword ptr [rsp+0x20]
+       mov      ecx, ecx
+       shl      rcx, 32
+       or       rcx, 1
        call     [Program:Use(System.Nullable`1[int])]
-       mov      dword ptr [rsp+0x24], 5
-       mov      rcx, qword ptr [rsp+0x20]
+       mov      rcx, 0x500000001
        call     [Program:Use(System.Nullable`1[int])]
-       mov      byte  ptr [rsp+0x20], 0
        xor      ecx, ecx
-       mov      dword ptr [rsp+0x24], ecx
-       mov      rcx, qword ptr [rsp+0x20]
-						;; size=55 bbWeight=1 PerfScore 14.25
+						;; size=34 bbWeight=1 PerfScore 7.50

 G_M7200_IG03:
        add      rsp, 40
        tail.jmp [Program:Use(System.Nullable`1[int])]
 						;; size=10 bbWeight=1 PerfScore 2.25
```

```csharp
static void Foo(int x, float y)
{
    Use((x, y));
}
```

```diff
 G_M42652_IG01:  ;; offset=0x0000
-       push     rax
-						;; size=1 bbWeight=1 PerfScore 1.00
+						;; size=0 bbWeight=1 PerfScore 0.00

 G_M42652_IG02:
-       mov      dword ptr [rsp], ecx
-       vmovss   dword ptr [rsp+0x04], xmm1
-       mov      rcx, qword ptr [rsp]
+       vmovd    eax, xmm1
+       shl      rax, 32
+       mov      ecx, ecx
+       or       rcx, rax
 						;; size=13 bbWeight=1 PerfScore 3.00

 G_M42652_IG03:
-       add      rsp, 8
        tail.jmp [Program:Use[System.ValueTuple`2[int,float]](System.ValueTuple`2[int,float])]
```

A win-arm64 example:

```csharp
static void Foo(int[] arr)
{
    Use(arr.AsMemory());
}
```

```diff
 G_M33990_IG01:
-            stp     fp, lr, [sp, #-0x20]!
+            stp     fp, lr, [sp, #-0x10]!
             mov     fp, sp
-            str     xzr, [fp, #0x10]	// [V03 tmp2]
-						;; size=12 bbWeight=1 PerfScore 2.50
+						;; size=8 bbWeight=1 PerfScore 1.50

 G_M33990_IG02:
             cbz     x0, G_M33990_IG04
 						;; size=4 bbWeight=1 PerfScore 1.00

 G_M33990_IG03:
-            str     x0, [fp, #0x10]	// [V07 tmp6]
-            str     wzr, [fp, #0x18]	// [V08 tmp7]
-            ldr     w0, [x0, #0x08]
-            str     w0, [fp, #0x1C]	// [V09 tmp8]
+            ldr     w1, [x0, #0x08]
             b       G_M33990_IG05
-						;; size=20 bbWeight=0.50 PerfScore 3.50
+						;; size=8 bbWeight=0.50 PerfScore 2.00

 G_M33990_IG04:
-            str     xzr, [fp, #0x10]	// [V07 tmp6]
-            str     xzr, [fp, #0x18]
-						;; size=8 bbWeight=0.50 PerfScore 1.00
+            mov     x0, xzr
+            mov     w1, wzr
+						;; size=8 bbWeight=0.50 PerfScore 0.50

 G_M33990_IG05:
-            ldp     x0, x1, [fp, #0x10]	// [V03 tmp2], [V03 tmp2+0x08]
-            movz    x2, #0xD920      // code for Program:Use[System.Memory`1[int]](System.Memory`1[int])
-            movk    x2, #0x4590 LSL #16
+            mov     w1, w1
+            lsl     x1, x1, #32
+            movz    x2, #0xD950      // code for Program:Use[System.Memory`1[int]](System.Memory`1[int])
+            movk    x2, #0x4592 LSL #16
             movk    x2, #0x7FFE LSL #32
             ldr     x2, [x2]
-						;; size=20 bbWeight=1 PerfScore 7.50
+						;; size=24 bbWeight=1 PerfScore 6.00

 G_M33990_IG06:
-            ldp     fp, lr, [sp], #0x20
+            ldp     fp, lr, [sp], #0x10
             br      x2
 						;; size=8 bbWeight=1 PerfScore 2.00
-; Total bytes of code: 72
+; Total bytes of code: 60
```

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 25, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Comment on lines 1728 to 1731
```cpp
if (arg->OperIs(GT_FIELD_LIST))
{
    unsigned int regIndex = 0;
    for (GenTreeFieldList::Use& use : arg->AsFieldList()->Uses())
    {
        const ABIPassingSegment& segment = abiInfo.Segment(regIndex);
        InsertPutArgReg(&use.NodeRef(), segment);

        regIndex++;
    }
    LowerArgFieldList(callArg, arg->AsFieldList());
    arg = *ppArg;
```
Member Author

This is enabled everywhere now since even single-reg args can be structs with multiple fields.

@tannergooding
Member

tannergooding commented May 26, 2025

It feels like some of the above examples could be optimized slightly by relying on the fact that 32-bit operations implicitly zero the upper 32 bits of the 64-bit register.

For example, any place where the lower bits aren't coming from a register, we can elide a mov:

```asm
mov      ecx, ecx    ; unnecessary: the shl leaves the upper 32 bits holding the lower 32 bits and the lower 32 bits zero
shl      rcx, 32
or       rcx, 1
```

Likewise, while the (int, float) example looks optimal, I would expect that for (float, int) you could also do 3 instructions instead:

```asm
shl      rcx, 32
vmovd    eax, xmm1
or       rcx, rax
```

I'm then not sure I understand the Arm64 example.

I'd expect that for the (int, int) case we would generate uxtw dst, src (or mov dst, src) followed by orr dst, src1, src2, lsl 32. Similarly, for the (float, int) example I'd expect fmov dst, src followed by orr dst, src1, src2, lsl 32, and for the (int, float) case fmov dst, src, then uxtw dst, src (or mov dst, src), followed by orr dst, src1, src2, lsl 32. If a 32-bit constant is involved, then the uxtw/mov at worst becomes a mov+movk.

Member

@tannergooding tannergooding left a comment

Changes look good/correct to me.

Might be worth seeing if the better codegen is being missed because we're generally missing an optimization, or if there is slightly different IR we should generate.

@jakobbotsch
Member Author

> It feels like some of the above examples could be optimized slightly by relying on the fact that 32-bit operations implicitly zero the upper 32 bits of the 64-bit register.

Right, I also pointed at this case back in #113178. It is not that simple to optimize because it is a necessary type-changing cast in the underlying IR. The simplest way to optimize it would probably be an emitter peephole, or containing the int -> long cast when it is used in the `x << 32` context.
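A tiny standalone C++ check of the observation (illustrative only, not JIT IR): once the shift amount is 32 or more, only the low 32 bits of the source reach the result, so zero- versus sign-extension in the preceding cast cannot change the value, which is why the explicit zero-extension is redundant.

```cpp
// Illustrative-only check: with a shift of 32 or more, only the low 32 bits of
// the source survive, so zero- vs sign-extension of the preceding cast cannot
// change the result -- which is why the explicit zero-extension (mov ecx, ecx)
// before `shl rcx, 32` is removable.
#include <cassert>
#include <cstdint>

int main()
{
    for (int32_t x : {0, 1, -1, 5, INT32_MIN, INT32_MAX})
    {
        uint64_t viaZeroExtend = (uint64_t)(uint32_t)x << 32; // explicit zero-extension first
        uint64_t viaSignExtend = (uint64_t)(int64_t)x << 32;  // sign-extension instead
        assert(viaZeroExtend == viaSignExtend);               // identical results
    }
    return 0;
}
```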

> I'm then not sure I understand the Arm64 example.

I had forgotten to include the corresponding C# code. It is actually a conversion of an int[] to a Memory<int>, which is then passed as an argument.

@tannergooding
Member

> I had forgotten to include the corresponding C# code. It is actually a conversion of an int[] to a Memory<int>, which is then passed as an argument.

Gotcha. I think we're missing a peephole or containment opportunity for the lsl here then. Nothing we should block on, of course, but I think we can remove 1 or 2 of the instructions here and just get orr dst, src1, src2, lsl 32 generated instead.

@hez2010
Contributor

hez2010 commented May 27, 2025

It seems that we are still spilling the return value before passing it to a use method call:

```csharp
UseValue(GetValue());

[MethodImpl(MethodImplOptions.NoInlining)]
static int? GetValue()
{
    return 42;
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void UseValue(int? value) { }
```

Codegen:

```asm
G_M27646_IG02:  ;; offset=0x0004
       call     [Program:GetValue():System.Nullable`1[int]]
       mov      qword ptr [rsp+0x20], rax
       mov      rcx, qword ptr [rsp+0x20]
       call     [Program:UseValue(System.Nullable`1[int])]
       nop
```

I think it can be improved by adding support in physical promotion?

@jakobbotsch
Member Author

jakobbotsch commented May 27, 2025

> It seems that we are still spilling the return value before passing it to a use method call. I think it can be improved by adding support in physical promotion?

There is still no bitwise handling for return values from calls. I may add support for that for the single-reg case in .NET 10, but it's unlikely that support will come for the multireg case in .NET 10, as that requires a new IR representation (these are the two missing items under ABI handling in #109635).

Even if support for the single-reg case is added in .NET 10, it will likely result in unnecessary bitwise operations in this case. It will require work to remove old promotion before we can get rid of that, I think.

Comment on lines +2327 to 2328
```cpp
if (!argNode->TypeIs(TYP_STRUCT) && arg->AbiInfo.HasExactlyOneRegisterSegment())
{
```
Member Author

@jakobbotsch jakobbotsch May 27, 2025

All independently promoted structs passed in one register are converted to FIELD_LIST above with this change, so this condition is always false now.

(This is referring to deleted code above this line)

Comment on lines +2505 to +2548
```cpp
//-----------------------------------------------------------------------------
// FieldsMatchAbi:
//   Check if the fields of a local map cleanly (in terms of offsets) to the
//   specified ABI info.
//
// Arguments:
//   varDsc  - promoted local
//   abiInfo - ABI information
//
// Returns:
//   True if it does. In that case FIELD_LIST usage is allowed for split args
//   by the backend.
//
bool Compiler::FieldsMatchAbi(LclVarDsc* varDsc, const ABIPassingInformation& abiInfo)
{
    if (varDsc->lvFieldCnt != abiInfo.CountRegsAndStackSlots())
    {
        return false;
    }

    for (const ABIPassingSegment& seg : abiInfo.Segments())
    {
        if (seg.IsPassedInRegister())
        {
            unsigned fieldLclNum = lvaGetFieldLocal(varDsc, seg.Offset);
            if (fieldLclNum == BAD_VAR_NUM)
            {
                return false;
            }
        }
        else
        {
            for (unsigned offset = 0; offset < seg.Size; offset += TARGET_POINTER_SIZE)
            {
                if (lvaGetFieldLocal(varDsc, seg.Offset + offset) == BAD_VAR_NUM)
                {
                    return false;
                }
            }
        }
    }

    return true;
}
```
Member Author

@jakobbotsch jakobbotsch May 27, 2025

This is only used for args split across registers and stack (arm32 mainly). These args retain the old IR requirements of needing fields to map cleanly to registers and stack slots.

I might clean that up in a separate future PR, but I have a long-term ambition to remove special handling of split arguments in the backend entirely by physically splitting the arg in the call node into multiple args instead, as part of lowering. In fact, I think multireg args could also be handled this way, which would make the backend much more uniform around this.

@jakobbotsch jakobbotsch marked this pull request as ready for review May 28, 2025 09:33
@jakobbotsch
Member Author

jakobbotsch commented May 28, 2025

cc @dotnet/jit-contrib PTAL @EgorBo

Diffs. Overall perfscore and size improvements across the board, with some regressions here and there.

The size improvements on arm64 are generally not as pronounced, since the existing pattern uses loads which are quite compact (ldp) while the new pattern is quite large on arm64. For example, a common diff in the size regressions looks like:

```diff
-            ldp     x2, x3, [fp, #0x40]	// [V08 tmp4], [V08 tmp4+0x08]
-            ldp     x0, x1, [fp, #0x30]	// [V09 tmp5], [V09 tmp5+0x08]
+            mov     x1, x21
+            mov     w0, w20
+            mov     w2, w19
+            orr     x0, x2, x0,  LSL #32
+            ldp     x2, x3, [fp, #0x28]	// [V08 tmp4], [V08 tmp4+0x08]
```

We should be able to eliminate at least one of those movs as discussed above, and perhaps ubfiz/sbfiz can be used in some other cases to improve things even more. Perf-wise I am not that worried in these cases; I would still expect the bitwise version to do better.

One case that is clearly a regression: sometimes we transform a promoted local into FIELD_LIST(LCL_VAR<long>, LCL_VAR<int>, LCL_VAR<int>) fields in morph, but then still end up with dependent promotion of that local for other reasons (typically because it is stored from a call). In that case it would have been better to pass it as FIELD_LIST(LCL_FLD<long>, LCL_FLD<long>) like before. We could identify and recombine fields in this case during lowering, but I expect to continue removing dependent promotion cases in the future, so this is a point-in-time problem.

There are some throughput regressions. The big ones are in collections with very few MinOpts contexts, though, so generally it is just around ~0.1% in MinOpts. The detailed TP diff is:

```
Base: 17274781343, Diff: 17290066161, +0.0885%

39109000  : NA       : 60.61% : +0.2264% : private: void __cdecl Lowering::InsertPutArgReg(struct GenTree **, class ABIPassingSegment const &)
795360    : NA       : 1.23%  : +0.0046% : `Compiler::fgTryMorphStructArg'::`39'::<lambda_1>::operator()
-96207    : -5.45%   : 0.15%  : -0.0006% : public: bool __cdecl Compiler::fgTryMorphStructArg(class CallArg *)
-795360   : -100.00% : 1.23%  : -0.0046% : `Compiler::fgTryMorphStructArg'::`63'::<lambda_1>::operator()
-23725022 : -40.22%  : 36.77% : -0.1373% : private: void __cdecl Lowering::LowerArg(struct GenTreeCall *, class CallArg *)
```

So it looks like MSVC just stopped inlining InsertPutArgReg into LowerArg. Maybe native PGO will fix that.

@jakobbotsch jakobbotsch requested a review from EgorBo May 28, 2025 09:52
@jakobbotsch
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Member Author

Failures are #116041, #116060, #110988

```cpp
// GetRegisterType:
// Return the smallest type larger or equal to Size that most naturally
// represents the register this segment is passed in, taking into account the
// GC info of the specified layout.
```
Member

nit: the formatting is a bit off (2 spaces vs 3 spaces for Return value, also, missing arg section)

Member Author

Let me address that in a follow-up

```cpp
if ((i == numRegs - 1) && !varTypeUsesFloatReg(regType))
{
    // Allow tail end to pass undefined bits into the register
    regEnd = regStart + REGSIZE_BYTES;
```
Member

I'm just curious, do we have a guarantee that we won't load the undefined bits as a payload somehow?

Member Author

Yes, the upper bits of registers containing struct values are undefined in the managed ABI, so this is ok. (This is just ensuring the handling matches what we had before; without this there were a bunch of regressions.)

I think the actual source that creates these loads is morph:

```cpp
#ifdef TARGET_ARM64
    if ((offset > 0) && argNode->OperIsLocalRead())
    {
        // For arm64 it's beneficial to consider all tails to
        // be TYP_I_IMPL to allow more ldp's.
        type = TYP_I_IMPL;
    }
#endif
```

I'm not sure it's worth it -- this means we can sometimes generate 2x 8-byte loads => ldp instead of one 8-byte + one smaller load, so the code is slightly smaller. But then maybe that results in a stall? Probably worth seeing if we can come up with an example and then maybe removing it if we can.

Member

@EgorBo EgorBo left a comment

That's pretty cool!
