[AArch64] Use dupq (SVE2.1) for segmented lane splats #144482


Open · wants to merge 8 commits into main
Conversation

huntergr-arm (Collaborator)

Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.

github-actions bot commented Jun 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -13430,6 +13430,28 @@ static bool isUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
return true;
}

/// isDUPQMask - matches a splat of equivalent lanes within 128b segments
static bool isDUPQMask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
Collaborator:

nit: return a std::optional<unsigned> instead.

Collaborator (Author):

I was going to do that, but opted to match the style of the surrounding code. Will switch back to the optional.

Collaborator (Author):

done

Comment on lines 12 to 13
%splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11,
i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
Collaborator:

Maybe choose 15 and 31 here as indices to test boundaries?

Collaborator (Author):

I figured a mix would work, as I have the 3,7 indices for i32 values, but sure.

Collaborator (Author):

done

Comment on lines +77 to +78
; CHECK-NEXT: dup v0.8h, v0.h[2]
; CHECK-NEXT: dup v1.8h, v1.h[2]
Collaborator:

Why does this not work for bfloat?

Collaborator (Author):

There's not enough bfloat support in general, so this can be fixed up later in another patch.

return false;

for (unsigned I = 0; I < Segments; ++I) {
unsigned Broadcast = (unsigned)M[I * SegmentElts];
Collaborator:

Do we want to take 'undef/poison' indices into account?

Collaborator (Author):

Eventually, sure. I'd prefer to land the simple form first, as handling arbitrary undef slots complicates the edge cases (e.g. a single defined lane with the rest undefined could still match; do we want that?).

Collaborator:

Makes sense to me, but the answer to the question is ultimately "yes, we do". It'll be up to the caller to order their code correctly so that the preferred shuffle masks are matched earlier.
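For illustration, a hypothetical undef-tolerant matcher could compare only the defined lanes against a common in-segment index. This is a sketch of the idea only; the helper name and exact shape are assumptions, not code from this patch. Note that it would accept the single-defined-lane case raised above.

  // Sketch (not in this patch): tolerate undef/poison (-1) mask entries by
  // requiring all *defined* lanes to agree on one in-segment index. Returns
  // std::nullopt if defined lanes disagree or the mask is entirely undef.
  static std::optional<unsigned>
  isDUPQMaskAllowingUndef(ArrayRef<int> M, unsigned SegmentElts) {
    std::optional<unsigned> Lane;
    for (unsigned I = 0; I < M.size(); ++I) {
      if (M[I] < 0)
        continue; // Undef/poison lane matches anything.
      // In-segment index; unsigned wraparound also rejects indices that
      // refer to a lane outside this element's own segment.
      unsigned InSegment = (unsigned)M[I] - (I / SegmentElts) * SegmentElts;
      if (InSegment >= SegmentElts)
        return std::nullopt;
      if (Lane && *Lane != InSegment)
        return std::nullopt; // Defined lanes disagree.
      Lane = InSegment;
    }
    return Lane;
  }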

Comment on lines 858 to 862
// Scalar-to-vector segmented duplication
def AArch64duplaneq8 : SDNode<"AArch64ISD::DUPLANEQ8", SDT_AArch64DupLane>;
def AArch64duplaneq16 : SDNode<"AArch64ISD::DUPLANEQ16", SDT_AArch64DupLane>;
def AArch64duplaneq32 : SDNode<"AArch64ISD::DUPLANEQ32", SDT_AArch64DupLane>;
def AArch64duplaneq64 : SDNode<"AArch64ISD::DUPLANEQ64", SDT_AArch64DupLane>;
Collaborator:

Do we need dedicated nodes for each element size? The isel patterns only match B->8-bit, H->16-bit, etc., which implies we can have a single AArch64ISD::DUPQ and use the operand type to identify the specific variant. This matches the intrinsic, which brings me nicely to...

I'm trying to avoid situations where we have duplicate patterns for intrinsics and ISD nodes. With the above suggestion this becomes the case, so I'd rather we pick one. If we expect to have DAG combines, then a dedicated ISD node is more convenient (with the intrinsic lowered to it); otherwise you may as well just create a call to the intrinsic. This is not a firm rule, so I'm generally happy either way, just as long as we don't have two ways to represent the same thing.
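For concreteness, the single-node alternative might look like this sketch (AArch64ISD::DUPQ here is hypothetical, named only in this suggestion; the patch ultimately removed the dedicated nodes instead):

  // One opcode for all element sizes; the variant is implied by the
  // operand/result type, mirroring the aarch64_sve_dup_laneq intrinsic.
  SDValue DupQ = DAG.getNode(AArch64ISD::DUPQ, DL, ContainerVT, Op1,
                             DAG.getConstant(Lane, DL, MVT::i64));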

Collaborator (Author):

Removed the sdnodes.

Comment on lines 13442 to 13451
for (unsigned I = 0; I < Segments; ++I) {
unsigned Broadcast = (unsigned)M[I * SegmentElts];
if (Broadcast - (I * SegmentElts) > SegmentElts)
return std::nullopt;
for (unsigned J = 0; J < SegmentElts; ++J) {
int Idx = M[(I * SegmentElts) + J];
if ((unsigned)Idx != Broadcast)
return std::nullopt;
}
}
Collaborator:

nit: it might be easier to follow if we wrote this with a single loop instead, using something like this:

  for (unsigned I = 0; I < M.size(); ++I) {
    if (M[I] != (M[0] + ((I / SegmentElts) * SegmentElts)))
      return std::nullopt;
  }
  return M[0];

what do you think?

Collaborator (Author):

That discards the range check (does the splatted index value refer to a lane within the current segment?). I guess we could have a second loop to check that afterwards.

Honestly, I wanted to use all_equal() on a std::span instead of the inner loop, but we don't have span yet (C++20). I could perhaps make a second ArrayRef instead, since it's just a pair of iterators.
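For reference, a sketch of that ArrayRef-based inner loop (assuming llvm::all_equal from llvm/ADT/STLExtras.h; this is not the code that landed):

  for (unsigned I = 0; I < Segments; ++I) {
    // View each 128b segment as its own ArrayRef; every lane in the
    // segment must equal the others for a dupq match.
    ArrayRef<int> Seg = M.slice(I * SegmentElts, SegmentElts);
    if (!llvm::all_equal(Seg))
      return std::nullopt;
  }

A separate check that each segment broadcasts the same relative lane, and that the lane is in range, would still be needed, as noted above.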

Collaborator:

For the range check, I guess you'd only need to check M[0] (because we're not supporting poison values)?

Collaborator (Author):

done

llvmbot (Member) commented Jun 17, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.


Full diff: https://github.com/llvm/llvm-project/pull/144482.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+37)
  • (added) llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll (+115)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7519ac5260a64..1e7e40ab663be 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -13430,6 +13430,30 @@ static bool isUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
   return true;
 }
 
+/// isDUPQMask - matches a splat of equivalent lanes within 128b segments in
+/// the first vector operand.
+static std::optional<unsigned> isDUPQMask(ArrayRef<int> M, EVT VT) {
+  unsigned Lane = (unsigned)M[0];
+  unsigned Segments = VT.getFixedSizeInBits() / 128;
+  unsigned SegmentElts = VT.getVectorNumElements() / Segments;
+
+  // Make sure there are no size changes.
+  if (SegmentElts * Segments != M.size())
+    return std::nullopt;
+
+  // Check that the first index corresponds to one of the lanes in the first
+  // segment.
+  if ((unsigned)M[0] >= SegmentElts)
+    return std::nullopt;
+
+  // Check that all lanes match the first, adjusted for segment.
+  for (unsigned I = 0; I < M.size(); ++I)
+    if ((unsigned)M[I] != ((unsigned)M[0] + ((I / SegmentElts) * SegmentElts)))
+      return std::nullopt;
+
+  return Lane;
+}
+
 /// isTRN_v_undef_Mask - Special case of isTRNMask for canonical form of
 /// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
 /// Mask is e.g., <0, 0, 2, 2> instead of <0, 4, 2, 6>.
@@ -30013,6 +30037,19 @@ SDValue AArch64TargetLowering::LowerFixedLengthVECTOR_SHUFFLEToSVE(
       return convertFromScalableVector(
           DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));
     }
+
+    if (Subtarget->hasSVE2p1()) {
+      if (std::optional<unsigned> Lane = isDUPQMask(ShuffleMask, VT)) {
+        SDValue IID =
+            DAG.getConstant(Intrinsic::aarch64_sve_dup_laneq, DL, MVT::i64);
+        return convertFromScalableVector(
+            DAG, VT,
+            DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, {ContainerVT, MVT::i64},
+                        {IID, Op1,
+                         DAG.getConstant(*Lane, DL, MVT::i64,
+                                         /*isTarget=*/true)}));
+      }
+    }
   }
 
   // Try to widen the shuffle before generating a possibly expensive SVE TBL.
diff --git a/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
new file mode 100644
index 0000000000000..40d4d0ff60148
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
@@ -0,0 +1,115 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
+define void @dupq_i8_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i8_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.b, z0.b[15]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <32 x i8>, ptr %addr
+  %splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15,
+                                                                              i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31>
+  store <32 x i8> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_i16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i16_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.h, z0.h[2]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <16 x i16>, ptr %addr
+  %splat.lanes = shufflevector <16 x i16> %load, <16 x i16> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                                i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  store <16 x i16> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_i32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i32_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.s, z0.s[3]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <8 x i32>, ptr %addr
+  %splat.lanes = shufflevector <8 x i32> %load, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+                                                                             i32 7, i32 7, i32 7, i32 7>
+  store <8 x i32> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_i64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i64_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <4 x i64>, ptr %addr
+  %splat.lanes = shufflevector <4 x i64> %load, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+  store <4 x i64> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_f16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f16_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.h, z0.h[2]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <16 x half>, ptr %addr
+  %splat.lanes = shufflevector <16 x half> %load, <16 x half> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                                  i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  store <16 x half> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_bf16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_bf16_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldp q0, q1, [x0]
+; CHECK-NEXT:    dup v0.8h, v0.h[2]
+; CHECK-NEXT:    dup v1.8h, v1.h[2]
+; CHECK-NEXT:    stp q0, q1, [x0]
+; CHECK-NEXT:    ret
+  %load = load <16 x bfloat>, ptr %addr
+  %splat.lanes = shufflevector <16 x bfloat> %load, <16 x bfloat> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                                      i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  store <16 x bfloat> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_f32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f32_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.s, z0.s[3]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <8 x float>, ptr %addr
+  %splat.lanes = shufflevector <8 x float> %load, <8 x float> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+                                                                                 i32 7, i32 7, i32 7, i32 7>
+  store <8 x float> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_f64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f64_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <4 x double>, ptr %addr
+  %splat.lanes = shufflevector <4 x double> %load, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+  store <4 x double> %splat.lanes, ptr %addr
+  ret void
+}
+
+attributes #0 = { noinline vscale_range(2,2) "target-features"="+sve2p1,+bf16" }

Comment on lines 30047 to 30050
DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, {ContainerVT, MVT::i64},
{IID, Op1,
DAG.getConstant(*Lane, DL, MVT::i64,
/*isTarget=*/true)}));
Collaborator:

getNode()'s third parameter is the result type? Does aarch64_sve_dup_laneq return two results?
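If the intrinsic produces only the one vector result, a minimal sketch of the suggested fix (an assumption, not code from the PR) would drop the VT list:

  // Sketch: pass a single result type rather than {ContainerVT, MVT::i64},
  // assuming aarch64_sve_dup_laneq yields just the vector result.
  return convertFromScalableVector(
      DAG, VT,
      DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
                  {IID, Op1,
                   DAG.getConstant(*Lane, DL, MVT::i64,
                                   /*isTarget=*/true)}));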
