[SLPVectorizer] Use accurate cost for external users of resize shuffles #137419


Merged
merged 3 commits into llvm:main on Jun 17, 2025

Conversation

jrbyrnes
Contributor

When implementing the vectorization, we potentially need to add shuffles for external users. In such cases, we may be shuffling a smaller vector into a larger vector. When this happens, ResizeToVF will just build a poison-padded identity vector. Then, to build the final shuffle, we just use the SK_InsertSubvector mask.

This is perhaps clearest in the included test, SLPVectorizer/AMDGPU/external-shuffle.ll.

In the exit block we have a bunch of shuffles to glue the vectorized tree to the InsertElement users. TMP25 holds the result of resizing the v2i16 vectorized sequence to match the InsertElement size, v16i16. TMP26 is then the final shuffle which replaces the InsertElement sequence; it is just an insertsubvector.

However, when calculating the cost for these shuffles, we aren't modelling this correctly. ResizeToVF will indicate to performExtractsShuffleAction that we cannot use the original mask due to the resize shuffle. The consequence is that the cost calculation uses a different shuffle mask than what is ultimately used.

Going back to the included test, consider TMP26 again. Clearly the shuffle uses the mask {0, 1, 2, 3, 16, 17, poison, ...}. However, we currently calculate the cost with the mask {0, 1, 2, 3, 20, 21, ...}: we have replaced 16 and 17 with 20 and 21 (Index + Vector Size). Queries like BasicTTIImpl::improveShuffleKindFromMask will not recognize this as an SK_InsertSubvector mask, so targets which have reduced costs for SK_InsertSubvector will not accurately calculate the cost.
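
As a concrete illustration of the mismatch, here is a small standalone C++ sketch. It is my own simplified insert-subvector check, not LLVM's actual BasicTTIImpl::improveShuffleKindFromMask; the function name and the main driver are assumptions for illustration only. It shows why the mask codegen actually emits is recognizable as an insert-subvector pattern while the mask used for costing is not.

// Simplified, standalone illustration (not the in-tree LLVM logic) of an
// insert-subvector mask check: second-source elements (indices >= VF) must be
// VF, VF+1, ... in order and land in consecutive destination lanes, while all
// other lanes are poison (-1) or identity lanes of the first source.
#include <cstdio>
#include <vector>

static bool looksLikeInsertSubvector(const std::vector<int> &Mask, int VF) {
  int NextSrc2Elt = VF;   // next expected second-source index (VF, VF+1, ...)
  int LastSrc2Pos = -2;   // destination position of the previous src2 lane
  bool SawSrc2 = false;
  for (int I = 0, E = (int)Mask.size(); I != E; ++I) {
    int M = Mask[I];
    if (M < 0)
      continue;                        // poison lane, ignore
    if (M < VF) {
      if (M != I)
        return false;                  // first-source lane must be identity
    } else {
      if (M != NextSrc2Elt++)
        return false;                  // src2 elements must be in order
      if (SawSrc2 && I != LastSrc2Pos + 1)
        return false;                  // and occupy consecutive destination lanes
      LastSrc2Pos = I;
      SawSrc2 = true;
    }
  }
  return SawSrc2;
}

int main() {
  const int VF = 16;
  std::vector<int> Emitted = {0, 1, 2, 3, 16, 17}; // mask TMP26 actually uses
  std::vector<int> Costed = {0, 1, 2, 3, 20, 21};  // mask used for costing
  Emitted.resize(VF, -1);                          // pad remaining lanes with poison
  Costed.resize(VF, -1);
  std::printf("emitted: %d  costed: %d\n",
              looksLikeInsertSubvector(Emitted, VF),  // 1 (insert-subvector)
              looksLikeInsertSubvector(Costed, VF));  // 0 (generic permute)
}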

@llvmbot
Member

llvmbot commented Apr 25, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Jeffrey Byrnes (jrbyrnes)

Changes



Patch is 28.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137419.diff

4 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+34-1)
  • (added) llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll (+222)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/extractelement-single-use-many-nodes.ll (+7-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/insertelements-with-reused-indices.ll (+5-5)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 53da78ee599b7..8c261930c0a4c 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -14344,6 +14344,7 @@ static T *performExtractsShuffleAction(
     [[maybe_unused]] auto *V = ValueSelect::get<T *>(Base);
     assert((!V || GetVF(V) == Mask.size()) &&
            "Expected base vector of VF number of elements.");
+
     Prev = Action(Mask, {nullptr, Res.first});
   } else if (ShuffleMask.size() == 1) {
     // Base is undef and only 1 vector is shuffled - perform the action only for
@@ -14802,7 +14803,39 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals,
                  << " for final shuffle of insertelement external users.\n";
           TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
       Cost += C;
-      return std::make_pair(TE, true);
+
+      bool HasLargeIndex =
+          any_of(Mask, [VF](int Idx) { return Idx >= static_cast<int>(VF); });
+
+      // If the resize source is just an identity vector, then it will produce
+      // an insert subvector shufflevector.
+      bool NeedsResizeExtract = true;
+      if ((VecVF < VF) && !HasLargeIndex) {
+        NeedsResizeExtract = false;
+        SmallVector<int> ResizeMask(VF, PoisonMaskElem);
+        for (unsigned I = 0; I < VF; ++I) {
+          if (Mask[I] != PoisonMaskElem) {
+            assert((size_t)Mask[I] < ResizeMask.size());
+            ResizeMask[Mask[I]] = Mask[I];
+          }
+        }
+
+        unsigned MinVF = std::min(VF, VecVF);
+        // Check if our mask is a padded identity mask with non-poison
+        // intermediaries. This implies that our resize shuffle is just a
+        // contiguous insertsubvector.
+        for (auto [Position, Mask] : enumerate(ResizeMask)) {
+          if (Position >= MinVF)
+            break;
+
+          if ((int)Position != Mask) {
+            NeedsResizeExtract = true;
+            break;
+          }
+        }
+      }
+
+      return std::make_pair(TE, NeedsResizeExtract);
     }
     return std::make_pair(TE, false);
   };
diff --git a/llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll b/llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll
new file mode 100644
index 0000000000000..1886606ecb12d
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll
@@ -0,0 +1,222 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -passes=slp-vectorizer,dce < %s | FileCheck -check-prefixes=GCN %s
+
+define protected  void @phi_4(ptr addrspace(3) %inptr0, ptr addrspace(3) %inptr1, ptr %out, ptr %out1, ptr %out2, i32 %flag) {
+; GCN-LABEL: define protected void @phi_4(
+; GCN-SAME: ptr addrspace(3) [[INPTR0:%.*]], ptr addrspace(3) [[INPTR1:%.*]], ptr [[OUT:%.*]], ptr [[OUT1:%.*]], ptr [[OUT2:%.*]], i32 [[FLAG:%.*]]) #[[ATTR0:[0-9]+]] {
+; GCN-NEXT:  [[ENTRY:.*]]:
+; GCN-NEXT:    [[GEP0:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 0
+; GCN-NEXT:    [[TMP0:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP0]], align 8
+; GCN-NEXT:    [[GEP2:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 2
+; GCN-NEXT:    [[TMP1:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP2]], align 2
+; GCN-NEXT:    [[GEP4:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 4
+; GCN-NEXT:    [[TMP2:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP4]], align 8
+; GCN-NEXT:    [[GEP6:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 6
+; GCN-NEXT:    [[TMP3:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP6]], align 2
+; GCN-NEXT:    [[GEP8:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 8
+; GCN-NEXT:    [[TMP4:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP8]], align 8
+; GCN-NEXT:    [[GEP10:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 10
+; GCN-NEXT:    [[TMP5:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP10]], align 2
+; GCN-NEXT:    [[GEP12:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 12
+; GCN-NEXT:    [[TMP6:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP12]], align 8
+; GCN-NEXT:    [[GEP14:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 14
+; GCN-NEXT:    [[TMP7:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP14]], align 2
+; GCN-NEXT:    br label %[[DO_BODY:.*]]
+; GCN:       [[DO_BODY]]:
+; GCN-NEXT:    [[TMP8:%.*]] = phi <2 x i16> [ [[TMP0]], %[[ENTRY]] ], [ [[TMP16:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP9:%.*]] = phi <2 x i16> [ [[TMP1]], %[[ENTRY]] ], [ [[TMP17:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP10:%.*]] = phi <2 x i16> [ [[TMP2]], %[[ENTRY]] ], [ [[TMP18:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP11:%.*]] = phi <2 x i16> [ [[TMP3]], %[[ENTRY]] ], [ [[TMP19:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP12:%.*]] = phi <2 x i16> [ [[TMP4]], %[[ENTRY]] ], [ [[TMP20:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP13:%.*]] = phi <2 x i16> [ [[TMP5]], %[[ENTRY]] ], [ [[TMP21:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP14:%.*]] = phi <2 x i16> [ [[TMP6]], %[[ENTRY]] ], [ [[TMP22:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP42:%.*]] = phi <2 x i16> [ [[TMP7]], %[[ENTRY]] ], [ [[TMP23:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT:    [[TMP16]] = load <2 x i16>, ptr addrspace(3) [[GEP0]], align 8
+; GCN-NEXT:    [[TMP17]] = load <2 x i16>, ptr addrspace(3) [[GEP2]], align 2
+; GCN-NEXT:    [[TMP18]] = load <2 x i16>, ptr addrspace(3) [[GEP4]], align 8
+; GCN-NEXT:    [[TMP19]] = load <2 x i16>, ptr addrspace(3) [[GEP6]], align 2
+; GCN-NEXT:    [[TMP20]] = load <2 x i16>, ptr addrspace(3) [[GEP8]], align 8
+; GCN-NEXT:    [[TMP21]] = load <2 x i16>, ptr addrspace(3) [[GEP10]], align 2
+; GCN-NEXT:    [[TMP22]] = load <2 x i16>, ptr addrspace(3) [[GEP12]], align 8
+; GCN-NEXT:    [[TMP23]] = load <2 x i16>, ptr addrspace(3) [[GEP14]], align 2
+; GCN-NEXT:    [[CMP:%.*]] = icmp eq i32 [[FLAG]], 0
+; GCN-NEXT:    br i1 [[CMP]], label %[[EXIT:.*]], label %[[DO_BODY]]
+; GCN:       [[EXIT]]:
+; GCN-NEXT:    [[TMP24:%.*]] = shufflevector <2 x i16> [[TMP16]], <2 x i16> [[TMP17]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP25:%.*]] = shufflevector <2 x i16> [[TMP18]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP26:%.*]] = shufflevector <16 x i16> [[TMP24]], <16 x i16> [[TMP25]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP27:%.*]] = shufflevector <2 x i16> [[TMP19]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP28:%.*]] = shufflevector <16 x i16> [[TMP26]], <16 x i16> [[TMP27]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP29:%.*]] = shufflevector <2 x i16> [[TMP20]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP30:%.*]] = shufflevector <16 x i16> [[TMP28]], <16 x i16> [[TMP29]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP31:%.*]] = shufflevector <2 x i16> [[TMP21]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP32:%.*]] = shufflevector <16 x i16> [[TMP30]], <16 x i16> [[TMP31]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP33:%.*]] = shufflevector <2 x i16> [[TMP22]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP47:%.*]] = shufflevector <16 x i16> [[TMP32]], <16 x i16> [[TMP33]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP48:%.*]] = shufflevector <2 x i16> [[TMP23]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP49:%.*]] = shufflevector <16 x i16> [[TMP47]], <16 x i16> [[TMP48]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 16, i32 17>
+; GCN-NEXT:    [[TMP37:%.*]] = shufflevector <2 x i16> [[TMP0]], <2 x i16> [[TMP1]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP38:%.*]] = shufflevector <2 x i16> [[TMP2]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP39:%.*]] = shufflevector <16 x i16> [[TMP37]], <16 x i16> [[TMP38]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP40:%.*]] = shufflevector <2 x i16> [[TMP3]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP41:%.*]] = shufflevector <16 x i16> [[TMP39]], <16 x i16> [[TMP40]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP57:%.*]] = shufflevector <2 x i16> [[TMP4]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP43:%.*]] = shufflevector <16 x i16> [[TMP41]], <16 x i16> [[TMP57]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP44:%.*]] = shufflevector <2 x i16> [[TMP5]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP45:%.*]] = shufflevector <16 x i16> [[TMP43]], <16 x i16> [[TMP44]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP46:%.*]] = shufflevector <2 x i16> [[TMP6]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP58:%.*]] = shufflevector <16 x i16> [[TMP45]], <16 x i16> [[TMP46]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP60:%.*]] = shufflevector <2 x i16> [[TMP7]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC2157:%.*]] = shufflevector <16 x i16> [[TMP58]], <16 x i16> [[TMP60]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 16, i32 17>
+; GCN-NEXT:    [[TMP50:%.*]] = shufflevector <2 x i16> [[TMP8]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[TMP51:%.*]] = shufflevector <2 x i16> [[TMP9]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC231:%.*]] = shufflevector <16 x i16> [[TMP50]], <16 x i16> [[TMP51]], <16 x i32> <i32 0, i32 1, i32 16, i32 17, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT:    [[TMP52:%.*]] = shufflevector <2 x i16> [[TMP10]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC252:%.*]] = shufflevector <16 x i16> [[VEC231]], <16 x i16> [[TMP52]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 17, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT:    [[TMP53:%.*]] = shufflevector <2 x i16> [[TMP11]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC273:%.*]] = shufflevector <16 x i16> [[VEC252]], <16 x i16> [[TMP53]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 16, i32 17, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT:    [[TMP54:%.*]] = shufflevector <2 x i16> [[TMP12]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC294:%.*]] = shufflevector <16 x i16> [[VEC273]], <16 x i16> [[TMP54]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT:    [[TMP55:%.*]] = shufflevector <2 x i16> [[TMP13]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC2115:%.*]] = shufflevector <16 x i16> [[VEC294]], <16 x i16> [[TMP55]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 17, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT:    [[TMP56:%.*]] = shufflevector <2 x i16> [[TMP14]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC2136:%.*]] = shufflevector <16 x i16> [[VEC2115]], <16 x i16> [[TMP56]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 14, i32 15>
+; GCN-NEXT:    [[TMP59:%.*]] = shufflevector <2 x i16> [[TMP42]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT:    [[VEC2151:%.*]] = shufflevector <16 x i16> [[VEC2136]], <16 x i16> [[TMP59]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 16, i32 17>
+; GCN-NEXT:    store <16 x i16> [[VEC2157]], ptr [[OUT]], align 32
+; GCN-NEXT:    store <16 x i16> [[TMP49]], ptr [[OUT1]], align 32
+; GCN-NEXT:    store <16 x i16> [[VEC2151]], ptr [[OUT2]], align 32
+; GCN-NEXT:    ret void
+;
+entry:
+  %gep0 = getelementptr i16, ptr addrspace(3) %inptr0, i32 0
+  %ele0 = load i16, ptr addrspace(3) %gep0, align 8
+  %gep1 = getelementptr i16, ptr addrspace(3) %inptr0, i32 1
+  %ele1 = load i16, ptr addrspace(3) %gep1, align 1
+  %gep2 = getelementptr i16, ptr addrspace(3) %inptr0, i32 2
+  %ele2 = load i16, ptr addrspace(3) %gep2, align 2
+  %gep3 = getelementptr i16, ptr addrspace(3) %inptr0, i32 3
+  %ele3 = load i16, ptr addrspace(3) %gep3, align 1
+  %gep4 = getelementptr i16, ptr addrspace(3) %inptr0, i32 4
+  %ele4 = load i16, ptr addrspace(3) %gep4, align 8
+  %gep5 = getelementptr i16, ptr addrspace(3) %inptr0, i32 5
+  %ele5 = load i16, ptr addrspace(3) %gep5, align 1
+  %gep6 = getelementptr i16, ptr addrspace(3) %inptr0, i32 6
+  %ele6 = load i16, ptr addrspace(3) %gep6, align 2
+  %gep7 = getelementptr i16, ptr addrspace(3) %inptr0, i32 7
+  %ele7 = load i16, ptr addrspace(3) %gep7, align 1
+  %gep8 = getelementptr i16, ptr addrspace(3) %inptr0, i32 8
+  %ele8 = load i16, ptr addrspace(3) %gep8, align 8
+  %gep9 = getelementptr i16, ptr addrspace(3) %inptr0, i32 9
+  %ele9 = load i16, ptr addrspace(3) %gep9, align 1
+  %gep10 = getelementptr i16, ptr addrspace(3) %inptr0, i32 10
+  %ele10 = load i16, ptr addrspace(3) %gep10, align 2
+  %gep11 = getelementptr i16, ptr addrspace(3) %inptr0, i32 11
+  %ele11 = load i16, ptr addrspace(3) %gep11, align 1
+  %gep12 = getelementptr i16, ptr addrspace(3) %inptr0, i32 12
+  %ele12 = load i16, ptr addrspace(3) %gep12, align 8
+  %gep13 = getelementptr i16, ptr addrspace(3) %inptr0, i32 13
+  %ele13 = load i16, ptr addrspace(3) %gep13, align 1
+  %gep14 = getelementptr i16, ptr addrspace(3) %inptr0, i32 14
+  %ele14 = load i16, ptr addrspace(3) %gep14, align 2
+  %gep15 = getelementptr i16, ptr addrspace(3) %inptr0, i32 15
+  %ele15 = load i16, ptr addrspace(3) %gep15, align 1
+  br label %do.body
+
+do.body:
+  %phi0 = phi i16 [ %ele0, %entry ], [ %otherele0, %do.body ]
+  %phi1 = phi i16 [ %ele1, %entry ], [ %otherele1, %do.body ]
+  %phi2 = phi i16 [ %ele2, %entry ], [ %otherele2, %do.body ]
+  %phi3 = phi i16 [ %ele3, %entry ], [ %otherele3, %do.body ]
+  %phi4 = phi i16 [ %ele4, %entry ], [ %otherele4, %do.body ]
+  %phi5 = phi i16 [ %ele5, %entry ], [ %otherele5, %do.body ]
+  %phi6 = phi i16 [ %ele6, %entry ], [ %otherele6, %do.body ]
+  %phi7 = phi i16 [ %ele7, %entry ], [ %otherele7, %do.body ]
+  %phi8 = phi i16 [ %ele8, %entry ], [ %otherele8, %do.body ]
+  %phi9 = phi i16 [ %ele9, %entry ], [ %otherele9, %do.body ]
+  %phi10 = phi i16 [ %ele10, %entry ], [ %otherele10, %do.body ]
+  %phi11 = phi i16 [ %ele11, %entry ], [ %otherele11, %do.body ]
+  %phi12 = phi i16 [ %ele12, %entry ], [ %otherele12, %do.body ]
+  %phi13 = phi i16 [ %ele13, %entry ], [ %otherele13, %do.body ]
+  %phi14 = phi i16 [ %ele14, %entry ], [ %otherele14, %do.body ]
+  %phi15 = phi i16 [ %ele15, %entry ], [ %otherele15, %do.body ]
+
+  %otherele0 = load i16, ptr addrspace(3) %gep0, align 8
+  %otherele1 = load i16, ptr addrspace(3) %gep1, align 1
+  %otherele2 = load i16, ptr addrspace(3) %gep2, align 2
+  %otherele3 = load i16, ptr addrspace(3) %gep3, align 1
+  %otherele4 = load i16, ptr addrspace(3) %gep4, align 8
+  %otherele5 = load i16, ptr addrspace(3) %gep5, align 1
+  %otherele6 = load i16, ptr addrspace(3) %gep6, align 2
+  %otherele7 = load i16, ptr addrspace(3) %gep7, align 1
+  %otherele8 = load i16, ptr addrspace(3) %gep8, align 8
+  %otherele9 = load i16, ptr addrspace(3) %gep9, align 1
+  %otherele10 = load i16, ptr ...
[truncated]

@@ -14802,7 +14803,39 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals,
<< " for final shuffle of insertelement external users.\n";
TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
Cost += C;
return std::make_pair(TE, true);

alexey-bataev
Member


Thanks for the patch! What about something like this, which is closer to the codegen:

diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index afba099f5154..c269a1efa415 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -14757,25 +14757,43 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals,

   Cost += ExtractCost;
   auto &&ResizeToVF = [this, &Cost](const TreeEntry *TE, ArrayRef<int> Mask,
-                                    bool) {
+                                    bool ForSingleMask) {
     InstructionCost C = 0;
     unsigned VF = Mask.size();
     unsigned VecVF = TE->getVectorFactor();
     if (VF != VecVF &&
         (any_of(Mask, [VF](int Idx) { return Idx >= static_cast<int>(VF); }) ||
          !ShuffleVectorInst::isIdentityMask(Mask, VF))) {
-      SmallVector<int> OrigMask(VecVF, PoisonMaskElem);
-      std::copy(Mask.begin(), std::next(Mask.begin(), std::min(VF, VecVF)),
-                OrigMask.begin());
-      C = ::getShuffleCost(*TTI, TTI::SK_PermuteSingleSrc,
-                           getWidenedType(TE->getMainOp()->getType(), VecVF),
-                           OrigMask);
-      LLVM_DEBUG(
-          dbgs() << "SLP: Adding cost " << C
-                 << " for final shuffle of insertelement external users.\n";
-          TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
-      Cost += C;
-      return std::make_pair(TE, true);
+      if (any_of(Mask, [VF](int Idx) { return Idx >= static_cast<int>(VF); })) {
+        SmallVector<int> OrigMask(VecVF, PoisonMaskElem);
+        std::copy(Mask.begin(), std::next(Mask.begin(), std::min(VF, VecVF)),
+                  OrigMask.begin());
+        C = ::getShuffleCost(*TTI, TTI::SK_PermuteSingleSrc,
+                             getWidenedType(TE->getMainOp()->getType(), VecVF),
+                             OrigMask);
+        LLVM_DEBUG(
+            dbgs() << "SLP: Adding cost " << C
+                   << " for final shuffle of insertelement external users.\n";
+            TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
+        Cost += C;
+        return std::make_pair(TE, true);
+      }
+      if (!ForSingleMask) {
+        SmallVector<int> ResizeMask(VF, PoisonMaskElem);
+        for (unsigned I = 0; I < VF; ++I) {
+          if (Mask[I] != PoisonMaskElem)
+            ResizeMask[Mask[I]] = Mask[I];
+        }
+        if (!ShuffleVectorInst::isIdentityMask(Mask, VF))
+          C = ::getShuffleCost(
+              *TTI, TTI::SK_PermuteSingleSrc,
+              getWidenedType(TE->getMainOp()->getType(), VecVF), ResizeMask);
+        LLVM_DEBUG(
+            dbgs() << "SLP: Adding cost " << C
+                   << " for final shuffle of insertelement external users.\n";
+            TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
+        Cost += C;
+      }
     }
     return std::make_pair(TE, false);
   };

Contributor Author


Thanks -- I'll try to adopt it, but it's causing some concerning-looking diffs in the lit tests that I need to work through.

Contributor Author


@alexey-bataev Sorry for the delay, I've updated the PR to include your suggestion. It causes some changes in the lit tests that don't seem to be beneficial. I'm looking into them, but thought I'd post them as well to see if you had any insight.

@@ -0,0 +1,222 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Member


The test should be precommitted in a separate NFC patch

; CHECK-NEXT: [[TMP10:%.*]] = add nsw <4 x i32> [[TMP5]], [[TMP9]]
; CHECK-NEXT: [[TMP11:%.*]] = mul nsw <4 x i32> [[TMP5]], [[TMP9]]
; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <4 x i32> [[TMP10]], <4 x i32> [[TMP11]], <4 x i32> <i32 0, i32 1, i32 6, i32 3>
; CHECK-NEXT: [[T701:%.*]] = shufflevector <4 x i32> [[TMP12]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 poison, i32 3>
Contributor Author

@jrbyrnes jrbyrnes Jun 11, 2025


I think the effect this PR has on this test is fine.

We have a VF of 4 feeding an insertelement user of width 8 (our vectorizable portion starts at insertelement %t65).

The external use mask for our vectorized tree is <-1, -1, -1, -1, 0, 1, -1, 3>.

Previously, we used a resize mask of <-1, -1, -1, -1> to calculate the resize cost and a mask of <0, 1, 2, 3, 12, 13, -1, 15> to calculate the external shuffle cost. These shuffles both have a cost of 1, which means we do not vectorize. Our previous calculations seem to be based on the assumption that we will prepad our vector with poison to resize it; however, that isn't how the resize + shuffle is actually implemented.

In actuality, we resize using the mask <0, 1, -1, 3, -1, -1, -1, -1>, which is just identity with poison. Then, we shuffle using <0, 1, 2, 3, 0, 1, -1, 3>. The new version is much more accurate: we use <0, 1, -1, 3, -1, -1, -1, -1> to calculate the resize cost (identity with poison) and <0, 1, 2, 3, 8, 9, -1, 11> to calculate the shuffle cost. These have costs 0 and 1 respectively, which keeps us within the cost threshold, so we vectorize.

There is a difference in the generated assembly before / after.
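
For reference, here is a minimal standalone sketch of the resize-mask construction described above. The scatter loop mirrors the approach in the suggested diff; the variable names, the use of std::vector, and the printing are my own additions for illustration and are not the in-tree code.

// Standalone sketch of the resize-mask scatter: each used lane of the
// external-use mask is written back to its own index, yielding the
// identity-with-poison mask that the resize shufflevector actually uses.
#include <cstdio>
#include <vector>

int main() {
  const int PoisonMaskElem = -1;
  // External use mask for the vectorized tree in this test.
  std::vector<int> UseMask = {-1, -1, -1, -1, 0, 1, -1, 3};
  std::vector<int> ResizeMask(UseMask.size(), PoisonMaskElem);
  for (int M : UseMask)
    if (M != PoisonMaskElem)
      ResizeMask[M] = M;            // identity on used lanes, poison elsewhere
  for (int M : ResizeMask)
    std::printf("%d ", M);          // prints: 0 1 -1 3 -1 -1 -1 -1
  std::printf("\n");
  return 0;
}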

; CHECK-NEXT: [[I118:%.*]] = fadd double [[TMP19]], [[TMP20]]
; CHECK-NEXT: [[TMP21:%.*]] = fmul <4 x double> zeroinitializer, [[TMP1]]
; CHECK-NEXT: [[TMP22:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison>
; CHECK-NEXT: [[TMP23:%.*]] = shufflevector <4 x double> <double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double poison>, <4 x double> [[TMP22]], <4 x i32> <i32 0, i32 1, i32 2, i32 5>
; CHECK-NEXT: [[TMP23:%.*]] = insertelement <4 x double> <double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double poison>, double [[I82]], i32 3
Contributor Author


The effect this PR has on this test also looks good to me.

The effect is less immediately obvious on this test compared to the last, because we are evaluating vectorizing a list after already vectorizing several others. In the partially vectorized version, we have some insertelements that are not present in the original IR. These insertelements are external users for our tree.

We have a VF of 2 and are inserting the 2nd element into the 4th position of the external insertelement vector (the external user width is 4). The vector we are inserting into is a literal 0.0 vector, and it is the other operand in any external shuffles.

The external use mask for our vectorized tree is <-1, -1, -1, 1>.

Previously, we used a resize mask of <-1, -1> to calculate the resize cost and a mask of <0, 1, 2, 7> to calculate the external shuffle cost. These shuffles both have a cost of 1, so we vectorize. However, this isn't how the SLP codegen works -- we don't shuffle during the resize (though that may be more optimal in this case). In actuality, we resize using the mask <0, 1, -1, -1>, which is just identity with poison. Then, we shuffle using <0, 1, 2, 5>. The new version is much more accurate: we use <-1, 1, -1, -1> to calculate the resize cost (identity with poison) and <0, 1, 2, 5> to calculate the shuffle cost. These have costs 0 and 3 respectively, which pushes us over the cost threshold, so we do not vectorize this part.

There is a difference in the generated assembly before / after.
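
To spell out where the two external shuffle masks come from, here is a sketch based only on the numbers quoted above (it is not the in-tree code, and the loop structure is my own): the old costing rebases a used lane to VF plus its destination position, while the mask actually emitted after an identity-with-poison resize rebases it to VF plus its source element index.

// Mask-rebasing difference for this test (VF = 4, use mask <-1, -1, -1, 1>,
// the base operand is the literal zero vector).
#include <cstdio>
#include <vector>

int main() {
  const int VF = 4;
  std::vector<int> UseMask = {-1, -1, -1, 1};
  std::vector<int> OldCosted, Emitted;
  for (int Pos = 0, E = (int)UseMask.size(); Pos != E; ++Pos) {
    if (UseMask[Pos] < 0) {
      OldCosted.push_back(Pos);               // lane taken from the base vector
      Emitted.push_back(Pos);
    } else {
      OldCosted.push_back(VF + Pos);          // old costing: VF + position -> 7
      Emitted.push_back(VF + UseMask[Pos]);   // actual codegen: VF + element -> 5
    }
  }
  for (int M : OldCosted)
    std::printf("%d ", M);                    // prints: 0 1 2 7
  std::printf("| ");
  for (int M : Emitted)
    std::printf("%d ", M);                    // prints: 0 1 2 5
  std::printf("\n");
}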

@jrbyrnes
Contributor Author

jrbyrnes commented Jun 11, 2025

I took a look at the lit changes and my conclusion is that they reflect our cost model being more accurate -- so I am happy with the current changes.

jrbyrnes added a commit that referenced this pull request Jun 12, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jun 12, 2025
jrbyrnes added 3 commits June 12, 2025 07:57
Change-Id: I7620d2bd3d65be994bd290b84267832fdb4f1bb4
Change-Id: Ie1209bb7f6c49992d41e3fec8a195a81d04f34d4
Change-Id: I9c4865641974152f6289df79df4aa057cdcdb8ed
@jrbyrnes
Contributor Author

Force push for test precommit

@jrbyrnes jrbyrnes requested a review from RKSimon June 13, 2025 21:37
@jrbyrnes
Contributor Author

Adding @RKSimon for any comments on the x86 test changes. I think they're fine, but it might be good to double-check.

tomtor pushed a commit to tomtor/llvm-project that referenced this pull request Jun 14, 2025
@jrbyrnes jrbyrnes merged commit c9a87a5 into llvm:main Jun 17, 2025
7 checks passed
ajaden-codes pushed a commit to Jaddyen/llvm-project that referenced this pull request Jun 17, 2025
Collaborator

@RKSimon RKSimon left a comment


x86 LGTM (sorry for late reply)

@jrbyrnes
Contributor Author

x86 LGTM (sorry for late reply)

Thanks for taking a look!
