
Consider instruction input size when considering embedded mask optimization #115074


Closed
wants to merge 2 commits

Conversation

@BruceForstall (Contributor) commented Apr 26, 2025

Some instructions, such as vinserti32x4, vinserti32x8, and vinserti64x2,
assume specific mask sizes that don't match the SIMD base type size of the
intrinsics that generate them. Check the instruction "input size" and reject
the embedded mask optimization if it is mismatched.

There are a few diffs in the HWINTRINSIC test cases where this optimization
no longer kicks in. The diffs look correct.

Fixes #114921
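
For context, the wrong-result shape from the linked issue looks roughly like the following. This is a minimal C# sketch assuming the pattern of the HWINTRINSIC test template quoted later in this thread; the method and values are illustrative, not the exact Runtime_114921 test.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Repro
{
    // Illustrative only: with long elements, InsertVector128 is emitted as
    // vinserti32x4, whose embedded mask is 32-bit granular. Folding this
    // 64-bit-element ConditionalSelect into the instruction's embedded mask
    // applies the mask at the wrong granularity and produces wrong results.
    static Vector512<long> Problem(Vector512<long> op1, Vector128<long> op2)
    {
        return Vector512.ConditionalSelect(
            Vector512.Equals(op1, Vector512<long>.Zero),
            Avx512F.InsertVector128(op1, op2, 1),
            Vector512.Create(1L));
    }
}
```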

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 26, 2025
Copilot AI left a comment

Pull Request Overview

This PR introduces special-case handling for the vinserti* intrinsics to account for differences between the embedded mask size and the SIMD base type size, and adds tests to validate the behavior.

  • Updated lowerxarch.cpp to disable embedded mask optimization when the mask and operand sizes do not match, with special-case logic for vinserti128, vinserti32x8, and vinserti64x2.
  • Added regression tests in Runtime_114921.cs to validate both debug and release outputs under various intrinsic scenarios.

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

  • src/tests/JIT/Regression/JitBlue/Runtime_114921/Runtime_114921.cs: Added tests exercising the special-case intrinsic behavior.
  • src/coreclr/jit/lowerxarch.cpp: Adjusted the embedded mask check logic with new special-case handling for specific vinserti* instructions.
Files not reviewed (1)
  • src/tests/JIT/Regression/JitBlue/Runtime_114921/Runtime_114921.csproj: Language not supported
Comments suppressed due to low confidence (1)

src/coreclr/jit/lowerxarch.cpp:10636

  • Verify that deferring the invariant range check until later in the code does not inadvertently disable valid embedded mask optimizations when maskSize equals operSize. Consider adding an inline comment to clarify how and why the invariant check is applied after this condition.

    if (maskSize != operSize)

assert(ins != INS_invalid);
switch (ins)
{
    case INS_vinserti128:
Copilot AI commented Apr 26, 2025

Double-check that using the operand size (operSize) for determining mask validity correctly reflects the special-case assumptions for these intrinsics. A brief comment describing the rationale for the comparison against 4 (or 8 for vinserti64x2) would enhance maintainability.


@BruceForstall (Author)

@dotnet/intel PTAL

{
    case INS_vinserti128:
    case INS_vinserti32x8:
        if (operSize != 4)
Member

I wouldn't be surprised if we had more cases where SimdBaseType ends up not matching the actual instruction input size.

Anything that supports EVEX encoding will have its input size defined in the instruction attributes, so it would probably be better to resolve it from there. I added a helper recently for this: CodeGenInterface::instInputSize.

@BruceForstall changed the title from "Special-case vinserti* when considering embedded mask optimization" to "Consider instruction input size when considering embedded mask optimization" on May 3, 2025
@BruceForstall (Author)

/azp run runtime-coreclr outerloop, Fuzzlyn


Azure Pipelines successfully started running 2 pipeline(s).

@BruceForstall (Author)

@saucecontrol @tannergooding I changed this to use the 'input size', as suggested. This fixes the issue, and leads to a few other diffs which appear correct to me. PTAL.

cc @dotnet/jit-contrib @dotnet/intel

@saucecontrol (Member) commented May 4, 2025

Interesting, the other instructions that show up in the diffs actually look like they had good codegen before the change. The test methods are comparing and masking based on the instruction result type:

var result = {RetVectorType}.ConditionalSelect(
    {RetVectorType}.Equals<{RetBaseType}>(op1, {RetVectorType}<{RetBaseType}>.Zero),
    {Isa}.{Method}(
        op1,
        Unsafe.Read<{Op2VectorType}<{Op2BaseType}>>(_dataTable.inArray2Ptr)
    ),
    {RetVectorType}.Create<{RetBaseType}>(1)
);

Both vpmaddwd and vpmaddubsw are widening instructions, and their embedded mask size is the same as the (widened) result size. We have the input size set to... well... the input size, rather than the result size, in the table. Since neither of those instructions supports embedded broadcast (they're both tuple type Full Mem), input size doesn't mean anything in terms of load containment, so I think it would be more correct to change them to match the mask element size.
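
For reference, these two instructions surface in C# as the MultiplyAddAdjacent overloads; here is a small sketch of their widening shape (standard Avx2 APIs, shown purely for illustration):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class WideningShape
{
    // vpmaddwd: multiplies adjacent pairs of 16-bit elements and sums each
    // pair into a 32-bit result, so its embedded mask operates at the widened
    // 32-bit output granularity, not the 16-bit input granularity.
    static Vector256<int> MaddWd(Vector256<short> a, Vector256<short> b)
        => Avx2.MultiplyAddAdjacent(a, b);

    // vpmaddubsw: unsigned 8-bit x signed 8-bit products, summed pairwise
    // with saturation into 16-bit results; the mask granularity is likewise
    // the 16-bit output, not the 8-bit input.
    static Vector256<short> MaddUbSw(Vector256<byte> a, Vector256<sbyte> b)
        => Avx2.MultiplyAddAdjacent(a, b);
}
```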

Comment on lines +10603 to +10633
uint32_t  maskSize        = genTypeSize(simdBaseType);
var_types op2SimdBaseType = op2->AsHWIntrinsic()->GetSimdBaseType();
uint32_t  operSize        = genTypeSize(op2SimdBaseType);

if (maskSize != operSize)
{
    isEmbeddedMask = false;
}
else
{
    // Check the op2 instruction input size to see if it's the same as the
    // mask size.

    NamedIntrinsic op2IntrinsicId = op2->AsHWIntrinsic()->GetHWIntrinsicId();
    instruction    ins            = HWIntrinsicInfo::lookupIns(op2IntrinsicId, op2SimdBaseType);
    assert(ins != INS_invalid);

    unsigned inputSize = CodeGenInterface::instInputSize(ins);
    if (maskSize != inputSize)
    {
        isEmbeddedMask = false;
    }
}
}
Member

Carrying over @saucecontrol's comment

Both vpmaddwd and vpmaddubsw are widening instructions, and their embedded mask size is the same as the (widened) result size. We have the input size set to... well... the input size, rather than the result size, in the table. Since neither of those instructions supports embedded broadcast (they're both tuple type Full Mem), input size doesn't mean anything in terms of load containment, so I think it would be more correct to change them to match the mask element size.

Basically what's being said is that for most instructions the size and type of the input/output is the same. So if it takes in a V128, it returns a V128. If it takes in 8-bit elements, it produces 8-bit elements. However, some instructions deviate from this and so may take V128 and return V256 or may take 8-bit elements and produce 16-bit elements.

For the purposes of loads, this is always an input and so INS_TT_* (tuple type) is sufficient for determining if embedded broadcast can be done. On the other hand loads need INS_TT_* + Input_* to determine if containment can be done (as some tuple types, like TUPLE8 indicate that 8 * Input_* elements are loaded).

For the purposes of embedded masking, however, it always impacts the output instead, and so we rather need some Output_* to ensure that instructions like vpmaddwd which take in V128<ushort> (Input_16bit) but return V128<uint> (Output_32bit) can have the correct mask type.


The simpler immediate fix here is to mark InsertVector128 with HW_Flag_NormalizeSmallTypeToInt like we're doing for Sse2_And or other operations that don't have Input_8bit or Input_16bit variants.

However, it won't necessarily get everything and I think we need to add Output_* in a separate PR and then consume it here in lowering to ensure everything is correct.

Notably the meaning of certain fields on GenTreeHWIntrinsic are not entirely consistent today either. That is, unlike gtType which is always the type the node produces; we have some cases where gtSimdSize/gtSimdBaseJitType is sometimes for the input and sometimes the output. We should probably normalize this to reduce issues as well.

Namely I think that since gtType will typically be a TYP_SIMD and therefore already tracks the output simd size, and since C# APIs cannot overload based on return type but we do sometimes need to track a type for "overload resolution" purposes in picking the right instruction, I think we should guarantee that gtSimdSize and gtSimdBaseJitType track the input information (so we always know this regardless of whether it's a local, a bitcast vector that produced a different base type, etc.). We should then have gtAuxiliaryJitType always be the base type of the output (maybe we should rename them gtSimdInputBaseJitType and gtSimdOutputBaseJitType, respectively).

Under such a setup we can then safely use gtSimdOutputBaseJitType to determine if a mask can be used and gtSimdInputBaseJitType to determine if broadcast can be done. This also gives a clean mapping to instructions and a way to assert that gtSimdInputBaseJitType matches Input_*bit and gtSimdOutputBaseJitType matches Output_*bit for the selected instruction.

@BruceForstall (Author) commented May 5, 2025

Thanks for the detailed reply.

Can you clarify one thing? You wrote:

For the purposes of loads, this is always an input and so INS_TT_* (tuple type) is sufficient for determining if embedded broadcast can be done. On the other hand loads need INS_TT_* + Input_* to determine if containment can be done (as some tuple types, like TUPLE8 indicate that 8 * Input_* elements are loaded).

Namely, "For the purposes of loads...", then "On the other hand loads...'. Is one of those supposed to be different (i.e., not "loads")?

The idea to add Output_* makes sense. How would we add the correct Output_* for each instruction (semi-automatically)? Presumably we could start with explicitly setting Output_* to Input_*, or else assuming an empty/non-existent Output_* means use the Input_* value for the Output_* value; then, just add explicit Output_* values for those that differ.

Member

Namely, "For the purposes of loads...", then "On the other hand loads...'. Is one of those supposed to be different (i.e., not "loads")?

It was meant to be two parts. The first part was covering loads that can be represented as embedded broadcasts, and the second part those that can be represented as regular containment. All load containment needs to consider at least INS_TT_*, while "regular" containment needs to additionally consider the Input_*.

The idea to add Output_* makes sense. How would we add the correct Output_* for each instruction (semi-automatically)?

I think the easiest default would be to define the 4 new Output_* flags and do a regex find/replace for Input_(8Bit|16Bit|32Bit|64Bit) to Input_$1 | Output_$1. There'd then be a little bit of adjusting the column alignment, and then a handful of instructions that need to be fixed up where the outputs differ.
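
A sketch of that mechanical step (hypothetical: the real change would be an editor find/replace over the instruction table rather than code, and the hand-fix list is illustrative):

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical one-shot rewrite of the instruction table: give every Input_*
// flag a matching Output_* flag, then hand-fix the entries whose output size
// differs (vpmaddwd, vpmaddubsw, vcvtps2qq, ...).
string table = File.ReadAllText("src/coreclr/jit/instrsxarch.h");
string rewritten = Regex.Replace(
    table,
    @"Input_(8Bit|16Bit|32Bit|64Bit)",
    "Input_$1 | Output_$1");
File.WriteAllText("src/coreclr/jit/instrsxarch.h", rewritten);
```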

Member

Ah, on first glance I didn't see any instructions that had different element sizes for embedded broadcast vs embedded masking, but I see that is a possibility now. A good example is vcvtps2qq, which has 32-bit broadcast size and 64-bit mask element size.

It seems we're probably pessimizing codegen in a few places because we currently use SimdBaseType to make masking decisions, but maybe having a separate flag would allow that to be cleaned up.
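
To make the vcvtps2qq example concrete, here is a sketch assuming Avx512DQ.ConvertToVector512Int64 as the managed entry point for that instruction:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SizeSplit
{
    // vcvtps2qq reads 32-bit float elements (so embedded broadcast works at
    // 32-bit granularity) but writes 64-bit integer elements (so the embedded
    // mask works at 64-bit granularity); a single "input size" attribute
    // cannot describe both sides.
    static Vector512<long> FloatsToLongs(Vector256<float> src)
        => Avx512DQ.ConvertToVector512Int64(src);
}
```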

@BruceForstall (Author)

It seems a little weird that we have InsertVector128 APIs that take as second argument (the insertion data) Vector128<T> with T != Int32 or UInt32. i.e., Avx512F.InsertVector128(Vector512<Int64>, Vector128<Int64>, Byte) => VINSERTI32x4 doesn't make sense, because of the 32x4 aspect of the instruction.

@BruceForstall (Author)

The simpler immediate fix here is to mark InsertVector128 with HW_Flag_NormalizeSmallTypeToInt like we're doing for Sse2_And or other operations that don't have Input_8bit or Input_16bit variants.

I don't understand this suggestion (and it doesn't fix the problem). There are no small types in the failure case, so I don't think it does anything.

(The test cases are using Vector128<long>. Maybe if they were using Vector128<byte> or Vector128<short> it would matter? We don't have a HW_Flag_NormalizeAllTypesToInt, say.)

I'm not sure about the Output_x suggestion, either. For this problem, we would set INS_vinserti128 (which is actually vinserti32x4) to Output_32bit, which is the same as the current Input_32bit. It would remain 32bit in both cases. Perhaps the idea would be that there would be some instructions for which Input_x is not the same size as Output_x, and maybe changing those instructions would prevent this bug fix from causing asm diffs with those instructions? I guess the example above, vcvtps2qq, is one case, also vpmaddwd and vpmaddubsw?

@tannergooding (Member)

It seems a little weird that we have InsertVector128 APIs that take as second argument (the insertion data) Vector128<T> with T != Int32 or UInt32. i.e., Avx512F.InsertVector128(Vector512<Int64>, Vector128<Int64>, Byte) => VINSERTI32x4 doesn't make sense, because of the 32x4 aspect of the instruction.

For AVX512, this should be generating VINSERTI64x2 if the type is long or ulong: https://www.felixcloutier.com/x86/vinserti128:vinserti32x4:vinserti64x2:vinserti32x8:vinserti64x4

Perhaps the idea would be that there would be some instructions for which Input_x is not the same size as Output_x, and maybe changing those instructions would prevent this bug fix from causing asm diffs with those instructions? I guess the example above, vcvtps2qq, is one case, also vpmaddwd and vpmaddubsw?

Right, there are several instructions where the Input_* does not match the theoretical Output_*. Input_* is used for cases like broadcasting while Output_* is required for cases like masking.

@BruceForstall (Author)

Someone will need to pick this up; I am no longer working on this. @tannergooding @JulieLeeMSFT

@tannergooding (Member)

I'll pick this up and do something along the lines of what I mentioned above.

Thanks @BruceForstall

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues:

  • JIT: wrong result with Avx512F.BlendVariable (#114921)