
Consider instruction input size when considering embedded mask optimization #115074


Closed
wants to merge 2 commits

Conversation

@BruceForstall (Contributor) commented Apr 26, 2025

Some instructions, such as vinserti32x4, vinserti32x8, and vinserti64x2,
assume specific mask sizes that don't match the SIMD base type size of the
intrinsics that generate them. Check the instruction "input size" and reject
the embedded mask optimization if it is mismatched.

There are a few diffs in the HWINTRINSIC test cases where this optimization
no longer kicks in. The diffs look correct.

Fixes #114921
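
For context, the wrong-result shape from the linked issue looks roughly like the following. This is a minimal C# sketch assuming the pattern of the HWINTRINSIC test template quoted later in this thread; the method and values are illustrative, not the exact Runtime_114921 test.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Repro
{
    // Illustrative only: with long elements, InsertVector128 is emitted as
    // vinserti32x4, whose embedded mask is 32-bit granular. Folding this
    // 64-bit-element ConditionalSelect into the instruction's embedded mask
    // applies the mask at the wrong granularity and produces wrong results.
    static Vector512<long> Problem(Vector512<long> op1, Vector128<long> op2)
    {
        return Vector512.ConditionalSelect(
            Vector512.Equals(op1, Vector512<long>.Zero),
            Avx512F.InsertVector128(op1, op2, 1),
            Vector512.Create(1L));
    }
}
```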

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 26, 2025
Copilot AI left a comment

Pull Request Overview

This PR introduces special-case handling for the vinserti* intrinsics to account for differences between the embedded mask size and the SIMD base type size, and adds tests to validate the behavior.

  • Updated lowerxarch.cpp to disable embedded mask optimization when the mask and operand sizes do not match, with special-case logic for vinserti128, vinserti32x8, and vinserti64x2.
  • Added regression tests in Runtime_114921.cs to validate both debug and release outputs under various intrinsic scenarios.

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

  • src/tests/JIT/Regression/JitBlue/Runtime_114921/Runtime_114921.cs: Added tests exercising the special-case intrinsic behavior.
  • src/coreclr/jit/lowerxarch.cpp: Adjusted the embedded mask check logic with new special-case handling for specific vinserti* instructions.
Files not reviewed (1)
  • src/tests/JIT/Regression/JitBlue/Runtime_114921/Runtime_114921.csproj: Language not supported
Comments suppressed due to low confidence (1)

src/coreclr/jit/lowerxarch.cpp:10636

  • Verify that deferring the invariant range check until later in the code does not inadvertently disable valid embedded mask optimizations when maskSize equals operSize. Consider adding an inline comment to clarify how and why the invariant check is applied after this condition.

    if (maskSize != operSize)

assert(ins != INS_invalid);
switch (ins)
{
    case INS_vinserti128:
Copilot AI commented Apr 26, 2025

Double-check that using the operand size (operSize) for determining mask validity correctly reflects the special-case assumptions for these intrinsics. A brief comment describing the rationale for the comparison against 4 (or 8 for vinserti64x2) would enhance maintainability.


@BruceForstall (Author)

@dotnet/intel PTAL

{
    case INS_vinserti128:
    case INS_vinserti32x8:
        if (operSize != 4)
Member

I wouldn't be surprised if we had more cases where SimdBaseType ends up not matching the actual instruction input size.

Anything that supports EVEX encoding will have its input size defined in the instruction attributes, so it would probably be better to resolve it from there. I added a helper recently for this: CodeGenInterface::instInputSize.

@BruceForstall changed the title from "Special-case vinserti* when considering embedded mask optimization" to "Consider instruction input size when considering embedded mask optimization" on May 3, 2025
@BruceForstall (Author)

/azp run runtime-coreclr outerloop, Fuzzlyn


Azure Pipelines successfully started running 2 pipeline(s).

@BruceForstall (Author)

@saucecontrol @tannergooding I changed this to use the 'input size', as suggested. This fixes the issue, and leads to a few other diffs which appear correct to me. PTAL.

cc @dotnet/jit-contrib @dotnet/intel

@saucecontrol (Member) commented May 4, 2025

Interesting, the other instructions that show up in the diffs actually look like they had good codegen before the change. The test methods are comparing and masking based on the instruction result type:

var result = {RetVectorType}.ConditionalSelect(
    {RetVectorType}.Equals<{RetBaseType}>(op1, {RetVectorType}<{RetBaseType}>.Zero),
    {Isa}.{Method}(
        op1,
        Unsafe.Read<{Op2VectorType}<{Op2BaseType}>>(_dataTable.inArray2Ptr)
    ),
    {RetVectorType}.Create<{RetBaseType}>(1)
);

Both vpmaddwd and vpmaddubsw are widening instructions, and their embedded mask size is the same as the (widened) result size. We have the input size set to... well... the input size, rather than the result size, in the table. Since neither of those instructions supports embedded broadcast (they're both tuple type Full Mem), input size doesn't mean anything in terms of load containment, so I think it would be more correct to change them to match the mask element size.
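
For reference, these two instructions surface in C# as the MultiplyAddAdjacent overloads; here is a small sketch of their widening shape (standard Avx2 APIs, shown purely for illustration):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class WideningShape
{
    // vpmaddwd: multiplies adjacent pairs of 16-bit elements and sums each
    // pair into a 32-bit result, so its embedded mask operates at the widened
    // 32-bit output granularity, not the 16-bit input granularity.
    static Vector256<int> MaddWd(Vector256<short> a, Vector256<short> b)
        => Avx2.MultiplyAddAdjacent(a, b);

    // vpmaddubsw: unsigned 8-bit x signed 8-bit products, summed pairwise
    // with saturation into 16-bit results; the mask granularity is likewise
    // the 16-bit output, not the 8-bit input.
    static Vector256<short> MaddUbSw(Vector256<byte> a, Vector256<sbyte> b)
        => Avx2.MultiplyAddAdjacent(a, b);
}
```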

Comment on lines +10603 to +10633
uint32_t  maskSize        = genTypeSize(simdBaseType);
var_types op2SimdBaseType = op2->AsHWIntrinsic()->GetSimdBaseType();
uint32_t  operSize        = genTypeSize(op2SimdBaseType);

if (maskSize != operSize)
{
    isEmbeddedMask = false;
}
else
{
    // Check the op2 instruction input size to see if it's the same as the
    // mask size.

    NamedIntrinsic op2IntrinsicId = op2->AsHWIntrinsic()->GetHWIntrinsicId();
    instruction    ins            = HWIntrinsicInfo::lookupIns(op2IntrinsicId, op2SimdBaseType);
    assert(ins != INS_invalid);

    unsigned inputSize = CodeGenInterface::instInputSize(ins);
    if (maskSize != inputSize)
    {
        isEmbeddedMask = false;
    }
}
}
Member

Carrying over @saucecontrol's comment

Both vpmaddwd and vpmaddubsw are widening instructions, and their embedded mask size is the same as the (widened) result size. We have the input size set to... well... the input size, rather than the result size, in the table. Since neither of those instructions supports embedded broadcast (they're both tuple type Full Mem), input size doesn't mean anything in terms of load containment, so I think it would be more correct to change them to match the mask element size.

Basically what's being said is that for most instructions the size and type of the input/output is the same. So if it takes in a V128, it returns a V128. If it takes in 8-bit elements, it produces 8-bit elements. However, some instructions deviate from this and so may take V128 and return V256 or may take 8-bit elements and produce 16-bit elements.

For the purposes of loads, this is always an input and so INS_TT_* (tuple type) is sufficient for determining if embedded broadcast can be done. On the other hand loads need INS_TT_* + Input_* to determine if containment can be done (as some tuple types, like TUPLE8 indicate that 8 * Input_* elements are loaded).

For the purposes of embedded masking, however, it always impacts the output instead, and so we rather need some Output_* to ensure that instructions like vpmaddwd which take in V128<ushort> (Input_16bit) but return V128<uint> (Output_32bit) can have the correct mask type.


The simpler immediate fix here is to mark InsertVector128 with HW_Flag_NormalizeSmallTypeToInt like we're doing for Sse2_And or other operations that don't have Input_8bit or Input_16bit variants.

However, it won't necessarily get everything and I think we need to add Output_* in a separate PR and then consume it here in lowering to ensure everything is correct.

Notably the meaning of certain fields on GenTreeHWIntrinsic are not entirely consistent today either. That is, unlike gtType which is always the type the node produces; we have some cases where gtSimdSize/gtSimdBaseJitType is sometimes for the input and sometimes the output. We should probably normalize this to reduce issues as well.

Namely I think that since gtType will typically be a TYP_SIMD and therefore already tracks the output simd size, and since C# APIs cannot overload based on return type but we do sometimes need to track a type for "overload resolution" purposes in picking the right instruction, I think we should guarantee that gtSimdSize and gtSimdBaseJitType track the input information (so we always know this regardless of whether it's a local, a bitcast vector that produced a different base type, etc.). We should then have gtAuxiliaryJitType always be the base type of the output (maybe we should rename them gtSimdInputBaseJitType and gtSimdOutputBaseJitType, respectively).

Under such a setup we can then safely use gtSimdOutputBaseJitType to determine if a mask can be used and gtSimdInputBaseJitType to determine if broadcast can be done. This also gives a clean mapping to instructions and a way to assert that gtSimdInputBaseJitType matches Input_*bit and gtSimdOutputBaseJitType matches Output_*bit for the selected instruction.

@BruceForstall (Author) commented May 5, 2025

Thanks for the detailed reply.

Can you clarify one thing? You wrote:

For the purposes of loads, this is always an input and so INS_TT_* (tuple type) is sufficient for determining if embedded broadcast can be done. On the other hand loads need INS_TT_* + Input_* to determine if containment can be done (as some tuple types, like TUPLE8 indicate that 8 * Input_* elements are loaded).

Namely, "For the purposes of loads...", then "On the other hand loads...'. Is one of those supposed to be different (i.e., not "loads")?

The idea to add Output_* makes sense. How would we add the correct Output_* for each instruction (semi-automatically)? Presumably we could start with explicitly setting Output_* to Input_*, or else assuming an empty/non-existent Output_* means use the Input_* value for the Output_* value; then, just add explicit Output_* values for those that differ.

Member

Namely, "For the purposes of loads...", then "On the other hand loads...'. Is one of those supposed to be different (i.e., not "loads")?

It was meant to be two parts. The first part was covering loads that can be represented as embedded broadcasts, and the second part those that can be represented as regular containment. All load containment needs to consider at least INS_TT_*, while "regular" containment needs to additionally consider the Input_*.

The idea to add Output_* makes sense. How would we add the correct Output_* for each instruction (semi-automatically)?

I think the easiest default would be to define the 4 new Output_* flags and do a regex find/replace for Input_(8Bit|16Bit|32Bit|64Bit) to Input_$1 | Output_$1. There'd then be a little bit of adjusting the column alignment, and then a handful of instructions that need to be fixed up where the outputs differ.
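
A sketch of that mechanical step (hypothetical: the real change would be an editor find/replace over the instruction table rather than code, and the hand-fix list is illustrative):

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical one-shot rewrite of the instruction table: give every Input_*
// flag a matching Output_* flag, then hand-fix the entries whose output size
// differs (vpmaddwd, vpmaddubsw, vcvtps2qq, ...).
string table = File.ReadAllText("src/coreclr/jit/instrsxarch.h");
string rewritten = Regex.Replace(
    table,
    @"Input_(8Bit|16Bit|32Bit|64Bit)",
    "Input_$1 | Output_$1");
File.WriteAllText("src/coreclr/jit/instrsxarch.h", rewritten);
```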

Member

Ah, on first glance I didn't see any instructions that had different element sizes for embedded broadcast vs embedded masking, but I see that is a possibility now. A good example is vcvtps2qq, which has 32-bit broadcast size and 64-bit mask element size.

It seems we're probably pessimizing codegen in a few places because we currently use SimdBaseType to make masking decisions, but maybe having a separate flag would allow that to be cleaned up.
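
To make the vcvtps2qq example concrete, here is a sketch assuming Avx512DQ.ConvertToVector512Int64 as the managed entry point for that instruction:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SizeSplit
{
    // vcvtps2qq reads 32-bit float elements (so embedded broadcast works at
    // 32-bit granularity) but writes 64-bit integer elements (so the embedded
    // mask works at 64-bit granularity); a single "input size" attribute
    // cannot describe both sides.
    static Vector512<long> FloatsToLongs(Vector256<float> src)
        => Avx512DQ.ConvertToVector512Int64(src);
}
```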

@BruceForstall (Author)

It seems a little weird that we have InsertVector128 APIs that take as second argument (the insertion data) Vector128<T> with T != Int32 or UInt32. i.e., Avx512F.InsertVector128(Vector512<Int64>, Vector128<Int64>, Byte) => VINSERTI32x4 doesn't make sense, because of the 32x4 aspect of the instruction.

@BruceForstall (Author)

The simpler immediate fix here is to mark InsertVector128 with HW_Flag_NormalizeSmallTypeToInt like we're doing for Sse2_And or other operations that don't have Input_8bit or Input_16bit variants.

I don't understand this suggestion (and it doesn't fix the problem). There are no small types in the failure case, so I don't think it does anything.

(The test cases are using Vector128<long>. Maybe if they were using Vector128<byte> or Vector128<short> it would matter? We don't have a HW_Flag_NormalizeAllTypesToInt, say.)

I'm not sure about the Output_x suggestion, either. For this problem, we would set INS_vinserti128 (which is actually vinserti32x4) to Output_32bit, which is the same as the current Input_32bit. It would remain 32bit in both cases. Perhaps the idea would be that there would be some instructions for which Input_x is not the same size as Output_x, and maybe changing those instructions would prevent this bug fix from causing asm diffs with those instructions? I guess the example above, vcvtps2qq, is one case, also vpmaddwd and vpmaddubsw?

@tannergooding (Member)

It seems a little weird that we have InsertVector128 APIs that take as second argument (the insertion data) Vector128<T> with T != Int32 or UInt32. i.e., Avx512F.InsertVector128(Vector512<Int64>, Vector128<Int64>, Byte) => VINSERTI32x4 doesn't make sense, because of the 32x4 aspect of the instruction.

For AVX512, this should be generating VINSERTI64x2 if the type is long or ulong: https://www.felixcloutier.com/x86/vinserti128:vinserti32x4:vinserti64x2:vinserti32x8:vinserti64x4

Perhaps the idea would be that there would be some instructions for which Input_x is not the same size as Output_x, and maybe changing those instructions would prevent this bug fix from causing asm diffs with those instructions? I guess the example above, vcvtps2qq, is one case, also vpmaddwd and vpmaddubsw?

Right, there are several instructions where the Input_* does not match the theoretical Output_*. Input_* is used for cases like broadcasting while Output_* is required for cases like masking.

@BruceForstall (Author)

Someone will need to pick this up; I am no longer working on this. @tannergooding @JulieLeeMSFT

@tannergooding (Member)

I'll pick this up and do something along the lines of what I mentioned above.

Thanks @BruceForstall

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Development

Successfully merging this pull request may close these issues:

  • JIT: wrong result with Avx512F.BlendVariable (#114921)