Consider instruction input size when considering embedded mask optimization #115074
Conversation
Pull Request Overview
This PR introduces special-case handling for the vinserti* intrinsics to account for differences between the embedded mask size and the SIMD base type size, and adds tests to validate the behavior.
- Updated lowerxarch.cpp to disable embedded mask optimization when the mask and operand sizes do not match, with special-case logic for vinserti128, vinserti32x8, and vinserti64x2.
- Added regression tests in Runtime_114921.cs to validate both debug and release outputs under various intrinsic scenarios.
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/tests/JIT/Regression/JitBlue/Runtime_114921/Runtime_114921.cs | Added tests exercising the special-case intrinsic behavior. |
| src/coreclr/jit/lowerxarch.cpp | Adjusted the embedded mask check logic with new special-case handling for specific vinserti* instructions. |
Files not reviewed (1)
- src/tests/JIT/Regression/JitBlue/Runtime_114921/Runtime_114921.csproj: Language not supported
Comments suppressed due to low confidence (1)
src/coreclr/jit/lowerxarch.cpp:10636
- Verify that deferring the invariant range check until later in the code does not inadvertently disable valid embedded mask optimizations when maskSize equals operSize. Consider adding an inline comment to clarify how and why the invariant check is applied after this condition.
if (maskSize != operSize)
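One possible shape for the requested inline comment (a sketch only; the surrounding control flow is assumed from the excerpts in this review):

```cpp
// The embedded mask operates at simdBaseType granularity while the contained
// op2 node operates at its own base-type granularity. If the two element
// sizes disagree, masking would cover the wrong lanes, so bail out of the
// embedded-mask optimization before any further (invariant) checks run.
if (maskSize != operSize)
{
    isEmbeddedMask = false;
}
```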
src/coreclr/jit/lowerxarch.cpp (Outdated)

```cpp
assert(ins != INS_invalid);
switch (ins)
{
    case INS_vinserti128:
```
Double-check that using the operand size (operSize) for determining mask validity correctly reflects the special-case assumptions for these intrinsics. A brief comment describing the rationale for the comparison against 4 (or 8 for vinserti64x2) would enhance maintainability.
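For illustration, a hedged sketch of how such rationale comments might read (the switch shape follows the excerpts in this review; the constants reflect the 32-bit and 64-bit mask granularities discussed here, and the surrounding variables are assumptions):

```cpp
switch (ins)
{
    // These insert forms mask at 32-bit element granularity, so the
    // operand's base type must be 4 bytes for the embedded mask to be valid.
    case INS_vinserti128:
    case INS_vinserti32x8:
        if (operSize != 4)
        {
            isEmbeddedMask = false;
        }
        break;

    // vinserti64x2 masks at 64-bit granularity; require 8-byte elements.
    case INS_vinserti64x2:
        if (operSize != 8)
        {
            isEmbeddedMask = false;
        }
        break;

    default:
        break;
}
```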
@dotnet/intel PTAL
src/coreclr/jit/lowerxarch.cpp (Outdated)

```cpp
{
    case INS_vinserti128:
    case INS_vinserti32x8:
        if (operSize != 4)
```
I wouldn't be surprised if we had more cases where SimdBaseType ends up not matching the actual instruction input size.

Anything that supports EVEX encoding will have its input size defined in the instruction attributes, so it would probably be better to resolve it from there. I added a helper recently for this: `CodeGenInterface::instInputSize`.
(The PR title was changed from referencing `vinserti*` to "Consider instruction input size when considering embedded mask optimization".)
/azp run runtime-coreclr outerloop, Fuzzlyn

Azure Pipelines successfully started running 2 pipeline(s).
@saucecontrol @tannergooding I changed this to use the 'input size', as suggested. This fixes the issue, and leads to a few other diffs which appear correct to me. PTAL. cc @dotnet/jit-contrib @dotnet/intel
Interesting, the other instructions that show up in the diffs actually look like they had good codegen before the change. The test methods are comparing and masking based on the instruction result type: runtime/src/tests/JIT/HardwareIntrinsics/General/Shared/VectorBinaryOpTest.template, lines 254 to 261 in 4a75bea.
Both `vpmaddwd` and `vpmaddubsw` are widening instructions, and their embedded mask size is the same as the (widened) result size. We have the input size set to... well... the input size, rather than the result size, in the table. Since neither of those instructions supports embedded broadcast (they're both tuple type Full Mem), input size doesn't mean anything in terms of load containment, so I think it would be more correct to change them to match the mask element size.
src/coreclr/jit/lowerxarch.cpp

```cpp
    uint32_t  maskSize        = genTypeSize(simdBaseType);
    var_types op2SimdBaseType = op2->AsHWIntrinsic()->GetSimdBaseType();
    uint32_t  operSize        = genTypeSize(op2SimdBaseType);

    if (maskSize != operSize)
    {
        isEmbeddedMask = false;
    }
    else
    {
        // Check the op2 instruction input size to see if it's the same as the
        // mask size.

        NamedIntrinsic op2IntrinsicId = op2->AsHWIntrinsic()->GetHWIntrinsicId();
        instruction    ins            = HWIntrinsicInfo::lookupIns(op2IntrinsicId, op2SimdBaseType);
        assert(ins != INS_invalid);

        unsigned inputSize = CodeGenInterface::instInputSize(ins);
        if (maskSize != inputSize)
        {
            isEmbeddedMask = false;
        }
    }
}
```
Carrying over @saucecontrol's comment:

> Both `vpmaddwd` and `vpmaddubsw` are widening instructions, and their embedded mask size is the same as the (widened) result size. We have the input size set to... well... the input size, rather than the result size, in the table. Since neither of those instructions supports embedded broadcast (they're both tuple type Full Mem), input size doesn't mean anything in terms of load containment, so I think it would be more correct to change them to match the mask element size.
Basically what's being said is that for most instructions the size and type of the input/output is the same. So if it takes in a V128, it returns a V128. If it takes in 8-bit elements, it produces 8-bit elements. However, some instructions deviate from this and so may take V128 and return V256, or may take 8-bit elements and produce 16-bit elements.

For the purposes of loads, this is always an input and so `INS_TT_*` (tuple type) is sufficient for determining if embedded broadcast can be done. On the other hand, loads need `INS_TT_*` + `Input_*` to determine if containment can be done (as some tuple types, like `TUPLE8`, indicate that 8 * `Input_*` elements are loaded).

For the purposes of embedded masking, however, it instead always impacts the output, and so we rather need some `Output_*` instead to ensure that instructions like `vpmaddwd`, which take in `V128<ushort>` (`Input_16bit`) but return `V128<uint>` (`Output_32bit`), can have the correct mask type.
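To make the widening concrete, here is a small standalone illustration using the C intrinsic that maps to `vpmaddwd` (`_mm_madd_epi16`): it consumes 16-bit lanes but produces 32-bit lanes, which is why a mask sized from the input element type would be wrong:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    // Eight 16-bit inputs per operand...
    __m128i a = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
    __m128i b = _mm_set1_epi16(1);

    // ...but only four 32-bit outputs: pmaddwd multiplies adjacent 16-bit
    // pairs and sums each pair into a 32-bit lane, so an embedded mask on
    // this instruction applies at 32-bit (output) granularity.
    __m128i r = _mm_madd_epi16(a, b);

    alignas(16) int32_t out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // prints: 3 7 11 15
    return 0;
}
```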
The simpler immediate fix here is to mark `InsertVector128` with `HW_Flag_NormalizeSmallTypeToInt`, like we're doing for `Sse2_And` or other operations that don't have `Input_8bit` or `Input_16bit` variants.

However, it won't necessarily get everything, and I think we need to add `Output_*` in a separate PR and then consume it here in lowering to ensure everything is correct.
Notably, the meaning of certain fields on `GenTreeHWIntrinsic` is not entirely consistent today either. That is, unlike `gtType`, which is always the type the node produces, we have some cases where `gtSimdSize`/`gtSimdBaseJitType` is sometimes for the input and sometimes the output. We should probably normalize this to reduce issues as well.

Namely, I think that since `gtType` will typically be a `TYP_SIMD` and therefore already tracks the output simd size, and since C# APIs cannot overload based on return type but we do sometimes need to track a type for "overload resolution" purposes in picking the right instruction, we should guarantee that `gtSimdSize` and `gtSimdBaseJitType` are tracking the input information (so we always know this regardless of whether it's a local, a bitcast vector that produced a different base type, etc.). We should then have `gtAuxiliaryJitType` always be the base type of the output (maybe we should rename them `gtSimdInputBaseJitType` and `gtSimdOutputBaseJitType`, respectively).

Under such a setup we can then safely use `gtSimdOutputBaseJitType` to determine if a mask can be used and `gtSimdInputBaseJitType` to determine if broadcast can be done. This also gives a clean mapping to instructions and a way to assert that `gtSimdInputBaseJitType` matches `Input_*bit` and `gtSimdOutputBaseJitType` matches `Output_*bit` for the selected instruction.
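A minimal sketch of how lowering might consume that split, assuming the hypothetical renamed accessors from the proposal above (none of these exist today):

```cpp
// Hypothetical: once input and output base types are tracked separately,
// mask legality keys off the output element size, while embedded broadcast
// legality keys off the input element size.
var_types outBaseType = node->GetSimdOutputBaseType(); // hypothetical accessor
var_types inBaseType  = node->GetSimdInputBaseType();  // hypothetical accessor

bool canEmbedMask      = (maskSize == genTypeSize(outBaseType));
bool canEmbedBroadcast = (broadcastElemSize == genTypeSize(inBaseType));
```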
Thanks for the detailed reply.
Can you clarify one thing? You wrote:

> For the purposes of loads, this is always an input and so INS_TT_* (tuple type) is sufficient for determining if embedded broadcast can be done. On the other hand, loads need INS_TT_* + Input_* to determine if containment can be done (as some tuple types, like TUPLE8, indicate that 8 * Input_* elements are loaded).

Namely, "For the purposes of loads...", then "On the other hand loads...". Is one of those supposed to be different (i.e., not "loads")?

The idea to add `Output_*` makes sense. How would we add the correct `Output_*` for each instruction (semi-automatically)? Presumably we could start with explicitly setting `Output_*` to `Input_*`, or else assuming an empty/non-existent `Output_*` means use the `Input_*` value for the `Output_*` value; then, just add explicit `Output_*` values for those that differ.
Namely, "For the purposes of loads...", then "On the other hand loads...'. Is one of those supposed to be different (i.e., not "loads")?
It was meant to be two parts. The first part was covering loads that can be represented as embedded broadcasts and the second part that can be represented as regular containment. All load containment needs to consider at least INS_TT_*
while "regular" containment needs to additionally consider the Input_*
.
The idea to add Output_* makes sense. How would we add the correct Output_* for each instruction (semi-automatically)?
I think the easiest default would be to define the 4 new Output_*
flags and do a regex find/replace for Input_(8Bit |16Bit|32Bit|64Bit)
to Input_$1 | Output_$1
. There'd then be a little bit of adjusting the column alignment and then a handful of instructions that need to be fixed up where the outputs differ.
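For illustration, the four new flags might be declared alongside the existing `Input_*` bits roughly like this (the names mirror the discussion, but the enclosing enum and bit positions are assumptions; real values would have to fit the actual instruction-attribute bit layout):

```cpp
// Hypothetical additions mirroring the existing Input_* attribute bits.
enum insOutputSize : unsigned long long
{
    Output_8Bit  = 1ULL << 60,
    Output_16Bit = 1ULL << 61,
    Output_32Bit = 1ULL << 62,
    Output_64Bit = 1ULL << 63,
};
```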
Ah, on first glance I didn't see any instructions that had different element sizes for embedded broadcast vs embedded masking, but I see that is a possibility now. A good example is `vcvtps2qq`, which has 32-bit broadcast size and 64-bit mask element size.
It seems we're probably pessimizing codegen in a few places because we currently use SimdBaseType to make masking decisions, but maybe having a separate flag would allow that to be cleaned up.
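As a standalone illustration (assuming AVX-512DQ hardware; the zero-masked form of `vcvtps2qq` is exposed in C as `_mm512_maskz_cvtps_epi64`): the instruction reads eight 32-bit floats, so an embedded broadcast would replicate a 4-byte element, yet the write-mask selects among eight 64-bit results:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    // Eight 32-bit float inputs (the embedded-broadcast element size).
    __m256 src = _mm256_setr_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);

    // The write-mask applies per 64-bit output lane: keep the low four
    // converted results, zero the upper four.
    __mmask8 k = 0x0F;
    __m512i  r = _mm512_maskz_cvtps_epi64(k, src);

    alignas(64) int64_t out[8];
    _mm512_store_si512(out, r);
    for (int i = 0; i < 8; i++)
    {
        printf("%lld ", (long long)out[i]); // prints: 1 2 3 4 0 0 0 0
    }
    printf("\n");
    return 0;
}
```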
Commit: "Consider instruction input size when considering embedded mask optimization"

Some instructions, such as `vinserti32x4`, `vinserti32x8`, and `vinserti64x2`, have specific mask sizes assumed, which don't match the SIMD base type size of the intrinsics that generate them. Check the instruction "input size" and reject embedded mask optimization if it is mismatched. There are a few diffs in the HWINTRINSIC test cases, where this optimization doesn't kick in anymore. The diffs look correct. Fixes dotnet#114921
It seems a little weird that we have …

I don't understand this suggestion (and it doesn't fix the problem). There are no small types in the failure case, so I don't think it does anything. (The test cases are using …) I'm not sure about the …

For AVX512, this should be generating …

Right, there are several instructions where the …
Someone will need to pick this up; I am no longer working on this. @tannergooding @JulieLeeMSFT

I'll pick this up and do something along the lines of what I mentioned above. Thanks @BruceForstall
Some instructions, such as `vinserti32x4`, `vinserti32x8`, and `vinserti64x2`, have specific mask sizes assumed, which don't match the SIMD base type size of the intrinsics that generate them. Check the instruction "input size" and reject embedded mask optimization if it is mismatched.

There are a few diffs in the HWINTRINSIC test cases, where this optimization doesn't kick in anymore. The diffs look correct.

Fixes #114921