[AArch64] Use dupq (SVE2.1) for segmented lane splats #144482


Open · wants to merge 8 commits into main
Conversation

huntergr-arm (Collaborator)

Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.

github-actions bot commented Jun 17, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -13430,6 +13430,28 @@ static bool isUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
return true;
}

/// isDUPQMask - matches a splat of equivalent lanes within 128b segments
static bool isDUPQMask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
Collaborator:

nit: return a std::optional<unsigned> instead.

Collaborator (Author):

I was going to do that, but opted to match the style of the surrounding code. Will switch back to the optional.

Collaborator (Author):

done

Comment on lines 12 to 13
%splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11,
i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
Collaborator:

Maybe choose 15 and 31 here as indices to test boundaries?

Collaborator (Author):

I figured a mix would work, as I have the 3,7 indices for i32 values, but sure.

Collaborator (Author):

done

Comment on lines +77 to +78
; CHECK-NEXT: dup v0.8h, v0.h[2]
; CHECK-NEXT: dup v1.8h, v1.h[2]
Collaborator:

Why does this not work for bfloat?

Collaborator (Author):

There's not enough bfloat support in general, so this can be fixed up later in another patch.

return false;

for (unsigned I = 0; I < Segments; ++I) {
unsigned Broadcast = (unsigned)M[I * SegmentElts];
Collaborator:

Do we want to take 'undef/poison' indices into account?

Collaborator (Author):

Eventually, sure. I'd prefer to land the simple form first, as handling arbitrary undef slots complicates the edge cases (e.g. a single defined lane with the rest undefined could still match; do we want that?).

Collaborator:

Makes sense to me, but the answer to the question is ultimately "yes, we do". It'll be up to the caller to order their code correctly so that the preferred shuffle masks are matched earlier.
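For illustration, a hypothetical undef-tolerant matcher could compare only the defined lanes against a common in-segment index. This is a sketch of the idea only; the helper name and exact shape are assumptions, not code from this patch. Note that it would accept the single-defined-lane case raised above.

  // Sketch (not in this patch): tolerate undef/poison (-1) mask entries by
  // requiring all *defined* lanes to agree on one in-segment index. Returns
  // std::nullopt if defined lanes disagree or the mask is entirely undef.
  static std::optional<unsigned>
  isDUPQMaskAllowingUndef(ArrayRef<int> M, unsigned SegmentElts) {
    std::optional<unsigned> Lane;
    for (unsigned I = 0; I < M.size(); ++I) {
      if (M[I] < 0)
        continue; // Undef/poison lane matches anything.
      // In-segment index; unsigned wraparound also rejects indices that
      // refer to a lane outside this element's own segment.
      unsigned InSegment = (unsigned)M[I] - (I / SegmentElts) * SegmentElts;
      if (InSegment >= SegmentElts)
        return std::nullopt;
      if (Lane && *Lane != InSegment)
        return std::nullopt; // Defined lanes disagree.
      Lane = InSegment;
    }
    return Lane;
  }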

Comment on lines 858 to 862
// Scalar-to-vector segmented duplication
def AArch64duplaneq8 : SDNode<"AArch64ISD::DUPLANEQ8", SDT_AArch64DupLane>;
def AArch64duplaneq16 : SDNode<"AArch64ISD::DUPLANEQ16", SDT_AArch64DupLane>;
def AArch64duplaneq32 : SDNode<"AArch64ISD::DUPLANEQ32", SDT_AArch64DupLane>;
def AArch64duplaneq64 : SDNode<"AArch64ISD::DUPLANEQ64", SDT_AArch64DupLane>;
Collaborator:

Do we need dedicated nodes for each element size? The isel patterns only match B->8-bit, H->16-bit, etc., which implies we can have a single AArch64ISD::DUPQ and use the operand type to identify the specific variant. This matches the intrinsic, which brings me nicely to...

I'm trying to avoid situations where we have duplicate patterns for intrinsics and ISD nodes. With the above suggestion this becomes the case, so I'd rather we pick one. If we expect to have DAG combines, then a dedicated ISD node is more convenient (with the intrinsic lowered to it); otherwise you may as well just create a call to the intrinsic. This is not a firm rule, so I'm generally happy either way, just as long as we don't have two ways to represent the same thing.
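For concreteness, the single-node alternative might look like this sketch (AArch64ISD::DUPQ here is hypothetical, named only in this suggestion; the patch ultimately removed the dedicated nodes instead):

  // One opcode for all element sizes; the variant is implied by the
  // operand/result type, mirroring the aarch64_sve_dup_laneq intrinsic.
  SDValue DupQ = DAG.getNode(AArch64ISD::DUPQ, DL, ContainerVT, Op1,
                             DAG.getConstant(Lane, DL, MVT::i64));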

Collaborator (Author):

Removed the sdnodes.

Comment on lines 13442 to 13451
for (unsigned I = 0; I < Segments; ++I) {
unsigned Broadcast = (unsigned)M[I * SegmentElts];
if (Broadcast - (I * SegmentElts) > SegmentElts)
return std::nullopt;
for (unsigned J = 0; J < SegmentElts; ++J) {
int Idx = M[(I * SegmentElts) + J];
if ((unsigned)Idx != Broadcast)
return std::nullopt;
}
}
Collaborator:

nit: it might be easier to follow if we wrote this with a single loop instead, using something like this:

  for (unsigned I = 0; I < M.size(); ++I) {
    if (M[I] != (M[0] + ((I / SegmentElts) * SegmentElts)))
      return std::nullopt;
  }
  return M[0];

what do you think?

Collaborator (Author):

That discards the range check (does the splatted index value refer to a lane within the current segment?). I guess we could have a second loop to check that afterwards.

Honestly, I wanted to use all_equal() on a std::span instead of the inner loop, but we don't have span yet (C++20). I could perhaps make a second ArrayRef instead, since it's just a pair of iterators.
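For reference, a sketch of that ArrayRef-based inner loop (assuming llvm::all_equal from llvm/ADT/STLExtras.h; this is not the code that landed):

  for (unsigned I = 0; I < Segments; ++I) {
    // View each 128b segment as its own ArrayRef; every lane in the
    // segment must equal the others for a dupq match.
    ArrayRef<int> Seg = M.slice(I * SegmentElts, SegmentElts);
    if (!llvm::all_equal(Seg))
      return std::nullopt;
  }

A separate check that each segment broadcasts the same relative lane, and that the lane is in range, would still be needed, as noted above.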

Collaborator:

For the range check, I guess you'd only need to check M[0] (because we're not supporting poison values)?

Collaborator (Author):

done

llvmbot (Member) commented Jun 17, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

Use the dupq instructions (when available) to represent a splat of the same lane within each 128b segment of a wider fixed vector.


Full diff: https://github.com/llvm/llvm-project/pull/144482.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+37)
  • (added) llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll (+115)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7519ac5260a64..1e7e40ab663be 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -13430,6 +13430,30 @@ static bool isUZP_v_undef_Mask(ArrayRef<int> M, EVT VT, unsigned &WhichResult) {
   return true;
 }
 
+/// isDUPQMask - matches a splat of equivalent lanes within 128b segments in
+/// the first vector operand.
+static std::optional<unsigned> isDUPQMask(ArrayRef<int> M, EVT VT) {
+  unsigned Lane = (unsigned)M[0];
+  unsigned Segments = VT.getFixedSizeInBits() / 128;
+  unsigned SegmentElts = VT.getVectorNumElements() / Segments;
+
+  // Make sure there are no size changes.
+  if (SegmentElts * Segments != M.size())
+    return std::nullopt;
+
+  // Check that the first index corresponds to one of the lanes in the first
+  // segment.
+  if ((unsigned)M[0] >= SegmentElts)
+    return std::nullopt;
+
+  // Check that all lanes match the first, adjusted for segment.
+  for (unsigned I = 0; I < M.size(); ++I)
+    if ((unsigned)M[I] != ((unsigned)M[0] + ((I / SegmentElts) * SegmentElts)))
+      return std::nullopt;
+
+  return Lane;
+}
+
 /// isTRN_v_undef_Mask - Special case of isTRNMask for canonical form of
 /// "vector_shuffle v, v", i.e., "vector_shuffle v, undef".
 /// Mask is e.g., <0, 0, 2, 2> instead of <0, 4, 2, 6>.
@@ -30013,6 +30037,19 @@ SDValue AArch64TargetLowering::LowerFixedLengthVECTOR_SHUFFLEToSVE(
       return convertFromScalableVector(
           DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));
     }
+
+    if (Subtarget->hasSVE2p1()) {
+      if (std::optional<unsigned> Lane = isDUPQMask(ShuffleMask, VT)) {
+        SDValue IID =
+            DAG.getConstant(Intrinsic::aarch64_sve_dup_laneq, DL, MVT::i64);
+        return convertFromScalableVector(
+            DAG, VT,
+            DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, {ContainerVT, MVT::i64},
+                        {IID, Op1,
+                         DAG.getConstant(*Lane, DL, MVT::i64,
+                                         /*isTarget=*/true)}));
+      }
+    }
   }
 
   // Try to widen the shuffle before generating a possibly expensive SVE TBL.
diff --git a/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
new file mode 100644
index 0000000000000..40d4d0ff60148
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve2p1-vector-shuffles.ll
@@ -0,0 +1,115 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
+define void @dupq_i8_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i8_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.b, z0.b[15]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <32 x i8>, ptr %addr
+  %splat.lanes = shufflevector <32 x i8> %load, <32 x i8> poison, <32 x i32> <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15,
+                                                                              i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31, i32 31>
+  store <32 x i8> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_i16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i16_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.h, z0.h[2]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <16 x i16>, ptr %addr
+  %splat.lanes = shufflevector <16 x i16> %load, <16 x i16> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                                i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  store <16 x i16> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_i32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i32_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.s, z0.s[3]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <8 x i32>, ptr %addr
+  %splat.lanes = shufflevector <8 x i32> %load, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+                                                                             i32 7, i32 7, i32 7, i32 7>
+  store <8 x i32> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_i64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_i64_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <4 x i64>, ptr %addr
+  %splat.lanes = shufflevector <4 x i64> %load, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+  store <4 x i64> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_f16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f16_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.h, z0.h[2]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <16 x half>, ptr %addr
+  %splat.lanes = shufflevector <16 x half> %load, <16 x half> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                                  i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  store <16 x half> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_bf16_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_bf16_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldp q0, q1, [x0]
+; CHECK-NEXT:    dup v0.8h, v0.h[2]
+; CHECK-NEXT:    dup v1.8h, v1.h[2]
+; CHECK-NEXT:    stp q0, q1, [x0]
+; CHECK-NEXT:    ret
+  %load = load <16 x bfloat>, ptr %addr
+  %splat.lanes = shufflevector <16 x bfloat> %load, <16 x bfloat> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                                      i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  store <16 x bfloat> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_f32_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f32_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    dupq z0.s, z0.s[3]
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <8 x float>, ptr %addr
+  %splat.lanes = shufflevector <8 x float> %load, <8 x float> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+                                                                                 i32 7, i32 7, i32 7, i32 7>
+  store <8 x float> %splat.lanes, ptr %addr
+  ret void
+}
+
+define void @dupq_f64_256b(ptr %addr) #0 {
+; CHECK-LABEL: dupq_f64_256b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ldr z0, [x0]
+; CHECK-NEXT:    trn1 z0.d, z0.d, z0.d
+; CHECK-NEXT:    str z0, [x0]
+; CHECK-NEXT:    ret
+  %load = load <4 x double>, ptr %addr
+  %splat.lanes = shufflevector <4 x double> %load, <4 x double> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+  store <4 x double> %splat.lanes, ptr %addr
+  ret void
+}
+
+attributes #0 = { noinline vscale_range(2,2) "target-features"="+sve2p1,+bf16" }

Comment on lines 30047 to 30050
DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, {ContainerVT, MVT::i64},
{IID, Op1,
DAG.getConstant(*Lane, DL, MVT::i64,
/*isTarget=*/true)}));
Collaborator:

getNode()'s third parameter is the result type? Does aarch64_sve_dup_laneq return two results?
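If the intrinsic produces only the one vector result, a minimal sketch of the suggested fix (an assumption, not code from the PR) would drop the VT list:

  // Sketch: pass a single result type rather than {ContainerVT, MVT::i64},
  // assuming aarch64_sve_dup_laneq yields just the vector result.
  return convertFromScalableVector(
      DAG, VT,
      DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
                  {IID, Op1,
                   DAG.getConstant(*Lane, DL, MVT::i64,
                                   /*isTarget=*/true)}));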
