Fixup how xarch instructions check for embedded broadcast and masking support #115704

tannergooding · 2025-05-19T01:18:14Z

This resolves #114921

dotnet-policy-service · 2025-05-19T01:19:04Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding · 2025-05-19T01:22:47Z

src/coreclr/jit/instrsxarch.h

-INST3(unpckhps,         "unpckhps",         IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x15),                            INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction | INS_Flags_EmbeddedBroadcastSupported)
-INST3(unpcklps,         "unpcklps",         IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x14),                            INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction | INS_Flags_EmbeddedBroadcastSupported)
-INST3(xorps,            "xorps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x57),                            INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction | INS_Flags_EmbeddedBroadcastSupported)                                                    // XOR packed singles
+INST3(addps,            "addps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x58),                            INS_TT_FULL,                         Input_32Bit    | KMask_Base4     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Add packed singles


Most instructions are fairly straightforward because the input and output kinds match, both in SIMD size and in the base type being considered.

So something like addps takes in float and returns float. If it takes in a V128, it returns a V128.
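
As a rough illustration of the matching case, the sketch below computes how many mask bits a packed operation needs. The helper names are hypothetical and are not JIT code. It also assumes that `KMask_BaseN` names the element count for a 128-bit vector and that wider vectors scale that count, which the thread does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not JIT code): number of elements of elemBytes that fit
// in a vector of simdBytes. A masked packed operation needs one bit per element.
constexpr std::size_t MaskBitsNeeded(std::size_t simdBytes, std::size_t elemBytes)
{
    return simdBytes / elemBytes;
}

// Assumption: KMask_BaseN is the count for a 128-bit (16-byte) vector and
// wider vectors scale it by simdBytes / 16.
constexpr std::size_t ScaledMaskBits(std::size_t baseBits, std::size_t simdBytes)
{
    return baseBits * (simdBytes / 16);
}

static_assert(MaskBitsNeeded(16, sizeof(float)) == 4, "addps on V128<float> lines up with KMask_Base4");
static_assert(ScaledMaskBits(4, 32) == MaskBitsNeeded(32, sizeof(float)), "V256<float> would need 8 bits");
static_assert(ScaledMaskBits(4, 64) == MaskBitsNeeded(64, sizeof(float)), "V512<float> would need 16 bits");
```

Because input and output share the same element size here, a single count describes both sides.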

tannergooding · 2025-05-19T01:23:46Z

src/coreclr/jit/instrsxarch.h

-INST3(unpcklps,         "unpcklps",         IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x14),                            INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction | INS_Flags_EmbeddedBroadcastSupported)
-INST3(xorps,            "xorps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x57),                            INS_TT_FULL,                         Input_32Bit    | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction | INS_Flags_EmbeddedBroadcastSupported)                                                    // XOR packed singles
+INST3(addps,            "addps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x58),                            INS_TT_FULL,                         Input_32Bit    | KMask_Base4     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Add packed singles
+INST3(addss,            "addss",            IUM_WR, BAD_CODE,     BAD_CODE,     SSEFLT(0x58),                            INS_TT_TUPLE1_SCALAR,                Input_32Bit    | KMask_Base1     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Add scalar singles


The scalar instructions are an example of ones that differ. They take in a V128<T> and return a V128<T>, but the mask only ends up using 1 bit rather than 4 bits.
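
The difference can be sketched with a plain C++ simulation of merge-masked adds. This models the architectural behavior as I understand it and is not how the JIT represents it. `V128F`, `MaskedAddPacked`, and `MaskedAddScalar` are hypothetical names introduced for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V128F = std::array<float, 4>;

// Packed form (addps-style): each of the low 4 mask bits selects between the
// computed sum and the corresponding element of the merge source.
V128F MaskedAddPacked(const V128F& merge, std::uint8_t mask, const V128F& a, const V128F& b)
{
    V128F result{};
    for (std::size_t i = 0; i < result.size(); i++)
    {
        result[i] = ((mask >> i) & 1) ? (a[i] + b[i]) : merge[i];
    }
    return result;
}

// Scalar form (addss-style): only mask bit 0 is consulted. Element 0 is the sum
// or the merge value; elements 1..3 are copied from a regardless of the mask.
V128F MaskedAddScalar(const V128F& merge, std::uint8_t mask, const V128F& a, const V128F& b)
{
    V128F result = a;
    result[0]    = (mask & 1) ? (a[0] + b[0]) : merge[0];
    return result;
}
```

Both functions take and return a full 128-bit value, yet the scalar one ignores every mask bit above bit 0, which is the distinction `KMask_Base1` versus `KMask_Base4` appears to capture.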

tannergooding · 2025-05-19T01:25:49Z

src/coreclr/jit/instrsxarch.h

+INST3(andnps,           "andnps",           IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x55),                            INS_TT_FULL,                         Input_32Bit    | KMask_Base4     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // And-Not packed singles
+INST3(andps,            "andps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x54),                            INS_TT_FULL,                         Input_32Bit    | KMask_Base4     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // AND packed singles
+INST3(cmpps,            "cmpps",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0xC2),                            INS_TT_FULL,                                                            REX_WIG      | Encoding_VEX                   | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // compare packed singles
+INST3(cmpss,            "cmpss",            IUM_WR, BAD_CODE,     BAD_CODE,     SSEFLT(0xC2),                            INS_TT_TUPLE1_SCALAR,                Input_32Bit                      | REX_WIG      | Encoding_VEX                   | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // compare scalar singles


Instructions which don't support EVEX continue having the Input_*Bit flag because it can still be used for determining that INS_TT_TUPLE1_SCALAR means it's going to read 32-bits.

tannergooding · 2025-05-19T01:26:46Z

src/coreclr/jit/instrsxarch.h

+INST3(movhps,           "movhps",           IUM_WR, PCKFLT(0x17), BAD_CODE,     PCKFLT(0x16),                            INS_TT_TUPLE2,                       Input_32Bit                      | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstSrcSrcAVXInstruction)
+INST3(movlhps,          "movlhps",          IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x16),                            INS_TT_NONE,                                                            REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)
+INST3(movlps,           "movlps",           IUM_WR, PCKFLT(0x13), BAD_CODE,     PCKFLT(0x12),                            INS_TT_TUPLE2,                       Input_32Bit                      | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstSrcSrcAVXInstruction)
+INST3(movmskps,         "movmskps",         IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x50),                            INS_TT_NONE,                                                            REX_WIG      | Encoding_VEX)


Instructions which have INS_TT_NONE don't touch memory, so they don't have an Input_*Bit flag, since it can't be used for anything.

tannergooding · 2025-05-19T01:28:36Z

src/coreclr/jit/instrsxarch.h

+INST3(cmppd,            "cmppd",            IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xC2),                            INS_TT_FULL,                                                            REX_WIG      | Encoding_VEX                   | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // compare packed doubles
+INST3(cmpsd,            "cmpsd",            IUM_WR, BAD_CODE,     BAD_CODE,     SSEDBL(0xC2),                            INS_TT_TUPLE1_SCALAR,                Input_64Bit                      | REX_WIG      | Encoding_VEX                   | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // compare scalar doubles
+INST3(comisd,           "comisd",           IUM_RD, BAD_CODE,     BAD_CODE,     PCKDBL(0x2F),                            INS_TT_TUPLE1_SCALAR,                Input_64Bit                      | REX_W1_EVEX  | Encoding_VEX  | Encoding_EVEX                                        | Resets_OF    | Resets_SF    | Writes_ZF    | Resets_AF    | Writes_PF    | Writes_CF)    // ordered compare doubles
+INST3(cvtdq2pd,         "cvtdq2pd",         IUM_WR, BAD_CODE,     BAD_CODE,     SSEFLT(0xE6),                            INS_TT_HALF,                         Input_32Bit    | KMask_Base2     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // cvt packed DWORDs to doubles


Instructions like cvtdq2pd are an example where the input is one type (Vector128<int>), while the output is a different type (Vector128<double>).

Generally speaking, if the instruction takes in 1 input, then the mask size will be the smaller of the two counts. So in this case it's KMask_Base2 because V128<double>.Count == 2 while V128<int>.Count == 4.

tannergooding · 2025-05-19T01:29:40Z

src/coreclr/jit/instrsxarch.h

+INST3(comisd,           "comisd",           IUM_RD, BAD_CODE,     BAD_CODE,     PCKDBL(0x2F),                            INS_TT_TUPLE1_SCALAR,                Input_64Bit                      | REX_W1_EVEX  | Encoding_VEX  | Encoding_EVEX                                        | Resets_OF    | Resets_SF    | Writes_ZF    | Resets_AF    | Writes_PF    | Writes_CF)    // ordered compare doubles
+INST3(cvtdq2pd,         "cvtdq2pd",         IUM_WR, BAD_CODE,     BAD_CODE,     SSEFLT(0xE6),                            INS_TT_HALF,                         Input_32Bit    | KMask_Base2     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // cvt packed DWORDs to doubles
+INST3(cvtdq2ps,         "cvtdq2ps",         IUM_WR, BAD_CODE,     BAD_CODE,     PCKFLT(0x5B),                            INS_TT_FULL,                         Input_32Bit    | KMask_Base4     | REX_W0_EVEX  | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // cvt packed DWORDs to singles
+INST3(cvtpd2dq,         "cvtpd2dq",         IUM_WR, BAD_CODE,     BAD_CODE,     SSEDBL(0xE6),                            INS_TT_FULL,                         Input_64Bit    | KMask_Base2     | REX_W1_EVEX  | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // cvt packed doubles to DWORDs


You can see the mirrored example to cvtdq2pd here in cvtpd2dq. It is still KMask_Base2 since it can only process 2 elements of the input in order to produce the output.
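
The "smaller of the two counts" rule for one-input conversions can be sketched as follows. `ConversionMaskBits` is a hypothetical helper for a 128-bit vector, not something from the JIT sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: mask bits for a one-input conversion on a 128-bit
// (16-byte) vector, taken as the smaller of the input and output element counts.
constexpr std::size_t ConversionMaskBits(std::size_t inputElemBytes, std::size_t outputElemBytes)
{
    return std::min<std::size_t>(16 / inputElemBytes, 16 / outputElemBytes);
}

static_assert(ConversionMaskBits(4, 8) == 2, "cvtdq2pd: int to double, KMask_Base2");
static_assert(ConversionMaskBits(8, 4) == 2, "cvtpd2dq: double to int, KMask_Base2");
static_assert(ConversionMaskBits(4, 4) == 4, "cvtdq2ps: int to float, KMask_Base4");
```

The widening and narrowing conversions land on the same count because the limiting side is always the 64-bit element, whichever side it is on.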

tannergooding · 2025-05-19T01:33:33Z

src/coreclr/jit/instrsxarch.h

+INST3(pextrw,           "pextrw",           IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xC5),                            INS_TT_TUPLE1_SCALAR,                Input_16Bit                      | REX_W0       | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // Extract 16-bit value into a r32 with zero extended to 32-bits
+INST3(pinsrw,           "pinsrw",           IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xC4),                            INS_TT_TUPLE1_SCALAR,                Input_16Bit                      | REX_W0       | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Insert word at index


Instructions like pextrw which extract to a general purpose register or like pinsrw which insert from a general-purpose register don't take masks.

tannergooding · 2025-05-19T01:35:02Z

src/coreclr/jit/instrsxarch.h

+INST3(pcmpgtw,          "pcmpgtw",          IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0x65),                            INS_TT_FULL_MEM,                                                        REX_WIG      | Encoding_VEX                   | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Packed compare 16-bit signed integers for greater than
+INST3(pextrw,           "pextrw",           IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xC5),                            INS_TT_TUPLE1_SCALAR,                Input_16Bit                      | REX_W0       | Encoding_VEX  | Encoding_EVEX)                                                                                                                                  // Extract 16-bit value into a r32 with zero extended to 32-bits
+INST3(pinsrw,           "pinsrw",           IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xC4),                            INS_TT_TUPLE1_SCALAR,                Input_16Bit                      | REX_W0       | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Insert word at index
+INST3(pmaddwd,          "pmaddwd",          IUM_WR, BAD_CODE,     BAD_CODE,     PCKDBL(0xF5),                            INS_TT_FULL_MEM,                                      KMask_Base4     | REX_WIG      | Encoding_VEX  | Encoding_EVEX  | INS_Flags_IsDstDstSrcAVXInstruction)                                                                                           // Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst


Instructions like pmaddwd which are INS_TT_FULL_MEM don't have the Input_*Bit flag since they cannot support embedded broadcast and only ever take the full SIMD size.
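
To tie the tuple-type comments together, here is a minimal sketch of how a tuple type and an input size could determine the bytes read from a memory operand. The enum and function are hypothetical stand-ins, not the JIT's definitions, and the sizes follow my understanding of the EVEX tuple types rather than anything stated in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the tuple types named in the thread; not the JIT's enum.
enum class TupleType { None, Full, FullMem, Half, Tuple1Scalar, Tuple2 };

// Bytes read from a memory operand.
// vectorBytes: SIMD size; inputBytes: size implied by Input_*Bit (0 if absent);
// broadcast: whether embedded broadcast is in use.
constexpr std::size_t MemoryReadBytes(TupleType tt, std::size_t vectorBytes, std::size_t inputBytes, bool broadcast)
{
    switch (tt)
    {
        case TupleType::None:         return 0;                                     // register-only
        case TupleType::Full:         return broadcast ? inputBytes : vectorBytes;
        case TupleType::Half:         return broadcast ? inputBytes : vectorBytes / 2;
        case TupleType::FullMem:      return vectorBytes;                           // no broadcast form
        case TupleType::Tuple1Scalar: return inputBytes;
        case TupleType::Tuple2:       return 2 * inputBytes;
    }
    return 0;
}

static_assert(MemoryReadBytes(TupleType::Tuple1Scalar, 16, 4, false) == 4, "addss reads 32 bits");
static_assert(MemoryReadBytes(TupleType::Tuple2, 16, 4, false) == 8, "movhps reads 64 bits");
static_assert(MemoryReadBytes(TupleType::FullMem, 16, 0, false) == 16, "pmaddwd reads the full vector");
static_assert(MemoryReadBytes(TupleType::Full, 16, 4, true) == 4, "addps with broadcast reads one element");
static_assert(MemoryReadBytes(TupleType::Half, 16, 4, false) == 8, "cvtdq2pd reads half the vector");
static_assert(MemoryReadBytes(TupleType::None, 16, 0, false) == 0, "movlhps has no memory operand");
```

Only the `Full`, `Half`, `Tuple1Scalar`, and `Tuple2` cases consult `inputBytes`, which is consistent with the comments that `INS_TT_NONE` and `INS_TT_FULL_MEM` entries carry no `Input_*Bit` flag.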

jakobbotsch · 2025-05-25T19:52:41Z

The Fuzzlyn failures look related to mismatched native dependencies. I pushed jakobbotsch/Fuzzlyn@ac04f5e which should hopefully fix that.

tannergooding · 2025-05-25T23:01:55Z

/azp run runtime-coreclr jitstress-isas-x86, Fuzzlyn, Antigen, runtime-coreclr jitstress, runtime-coreclr jitstressregs

azure-pipelines · 2025-05-25T23:02:17Z

Azure Pipelines successfully started running 5 pipeline(s).

tannergooding · 2025-05-26T14:09:47Z

jitstress and jitstress-isas failures are #110173

Looking at the fuzzlyn/antigen failures

tannergooding · 2025-05-26T14:36:33Z

Antigen failures have a common theme of using return string.Join(Environment.NewLine, toPrint).GetHashCode(); which causes different results run to run since hash codes aren't guaranteed to be stable.

tannergooding · 2025-05-26T14:37:58Z

Fuzzlyn failures are also unrelated:

Unhandled exception. System.IO.IOException: Read-only file system : '/root/helix/work/correlation/exploratory/ExecutionServer'
at System.IO.FileSystem.CreateDirectory(String fullPath, UnixFileMode unixCreateMode)
at System.IO.Directory.CreateDirectory(String path)
at Fuzzlyn.Program.CreateExecutionServerPool(FuzzlynOptions options)
at Fuzzlyn.Program.Main(String[] args)

jakobbotsch · 2025-05-26T15:33:39Z

> Fuzzlyn failures are also unrelated:
>
> Unhandled exception. System.IO.IOException: Read-only file system : '/root/helix/work/correlation/exploratory/ExecutionServer'
> at System.IO.FileSystem.CreateDirectory(String fullPath, UnixFileMode unixCreateMode)
> at System.IO.Directory.CreateDirectory(String path)
> at Fuzzlyn.Program.CreateExecutionServerPool(FuzzlynOptions options)
> at Fuzzlyn.Program.Main(String[] args)

Try 2 at fixing this at jakobbotsch/Fuzzlyn@a766adc ...

tannergooding · 2025-05-26T15:58:55Z

/azp run Fuzzlyn

azure-pipelines · 2025-05-26T15:59:08Z

Azure Pipelines successfully started running 1 pipeline(s).

tannergooding · 2025-05-26T18:38:45Z

Fuzzlyn completed. Failures are the do...while(...) intrinsic issue: #115109 and the unused value issue: #115202

EgorBo · 2025-05-27T13:23:21Z

are the size regressions in the diffs expected? I presume it's a correctness fix?

tannergooding · 2025-05-27T13:27:52Z

> are the size regressions in the diffs expected? I presume it's a correctness fix?

Right. There is a chance to peephole a couple of instructions from the EVEX-only version to a VEX-compatible variant instead, but I’d rather do this separately from the correctness changes here.

EgorBo

LGTM then given outerloop passed

github-actions bot added the area-CodeGen-coreclr label May 19, 2025

dotnet-policy-service bot assigned tannergooding May 19, 2025

tannergooding force-pushed the fix-114921 branch 2 times, most recently from b4dd4bd to c6b2e21 Compare May 25, 2025 04:31

Fixup how xarch instructions check for embedded broadcast and masking…

39defab

… support

tannergooding force-pushed the fix-114921 branch from c6b2e21 to 39defab Compare May 25, 2025 04:57

This was referenced May 25, 2025

Test failure: baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.cmd #110173

Open

CI flakiness: mono interpreter build getting killed #114123

Open

tannergooding marked this pull request as ready for review May 25, 2025 15:40

Ensure the lastOp is still consumed, since it's no longer part of node

f5a184e

build-analysis bot mentioned this pull request May 26, 2025

SmtpClientSendMailTest_SendAsync.MultipleRecipients_Failure_All test failure #115070

Closed

EgorBo approved these changes May 27, 2025

tannergooding merged commit c972a60 into dotnet:main May 27, 2025
169 of 181 checks passed
