-
Notifications
You must be signed in to change notification settings - Fork 14k
[SLPVectorizer] Use accurate cost for external users of resize shuffles #137419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms Author: Jeffrey Byrnes (jrbyrnes) ChangesWhen implementing the vectorization, we potentially need to add shuffles for external users. In such cases, we may be shuffling a smaller vector into a larger vector. When this happens This is possibly clearer by looking at the included test in SLPVectorizer/AMDGPU/external-shuffle.ll In the exit block we have a bunch of shuffles to glue the vectorized tree match the However, when calculating the cost for these shuffles, we aren't modelling this correctly. Going back to the included test, we can consider again Patch is 28.35 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137419.diff 4 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 53da78ee599b7..8c261930c0a4c 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -14344,6 +14344,7 @@ static T *performExtractsShuffleAction(
[[maybe_unused]] auto *V = ValueSelect::get<T *>(Base);
assert((!V || GetVF(V) == Mask.size()) &&
"Expected base vector of VF number of elements.");
+
Prev = Action(Mask, {nullptr, Res.first});
} else if (ShuffleMask.size() == 1) {
// Base is undef and only 1 vector is shuffled - perform the action only for
@@ -14802,7 +14803,39 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals,
<< " for final shuffle of insertelement external users.\n";
TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
Cost += C;
- return std::make_pair(TE, true);
+
+ bool HasLargeIndex =
+ any_of(Mask, [VF](int Idx) { return Idx >= static_cast<int>(VF); });
+
+ // If the resize source is just an identity vector, then will produce an
+ // insert subvector shufflevector
+ bool NeedsResizeExtract = true;
+ if ((VecVF < VF) && !HasLargeIndex) {
+ NeedsResizeExtract = false;
+ SmallVector<int> ResizeMask(VF, PoisonMaskElem);
+ for (unsigned I = 0; I < VF; ++I) {
+ if (Mask[I] != PoisonMaskElem) {
+ assert((size_t)Mask[I] < ResizeMask.size());
+ ResizeMask[Mask[I]] = Mask[I];
+ }
+ }
+
+ unsigned MinVF = std::min(VF, VecVF);
+ // Check if our mask is a a padded identity mask with non poision
+ // intermediaries. This implies that our resize shuffle is just a
+ // contiguous insertsubvector
+ for (auto [Position, Mask] : enumerate(ResizeMask)) {
+ if (Position >= MinVF)
+ break;
+
+ if ((int)Position != Mask) {
+ NeedsResizeExtract = true;
+ break;
+ }
+ }
+ }
+
+ return std::make_pair(TE, NeedsResizeExtract);
}
return std::make_pair(TE, false);
};
diff --git a/llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll b/llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll
new file mode 100644
index 0000000000000..1886606ecb12d
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/AMDGPU/external-shuffle.ll
@@ -0,0 +1,222 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -passes=slp-vectorizer,dce < %s | FileCheck -check-prefixes=GCN %s
+
+define protected void @phi_4(ptr addrspace(3) %inptr0, ptr addrspace(3) %inptr1, ptr %out, ptr %out1, ptr %out2, i32 %flag) {
+; GCN-LABEL: define protected void @phi_4(
+; GCN-SAME: ptr addrspace(3) [[INPTR0:%.*]], ptr addrspace(3) [[INPTR1:%.*]], ptr [[OUT:%.*]], ptr [[OUT1:%.*]], ptr [[OUT2:%.*]], i32 [[FLAG:%.*]]) #[[ATTR0:[0-9]+]] {
+; GCN-NEXT: [[ENTRY:.*]]:
+; GCN-NEXT: [[GEP0:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 0
+; GCN-NEXT: [[TMP0:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP0]], align 8
+; GCN-NEXT: [[GEP2:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 2
+; GCN-NEXT: [[TMP1:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP2]], align 2
+; GCN-NEXT: [[GEP4:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 4
+; GCN-NEXT: [[TMP2:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP4]], align 8
+; GCN-NEXT: [[GEP6:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 6
+; GCN-NEXT: [[TMP3:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP6]], align 2
+; GCN-NEXT: [[GEP8:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 8
+; GCN-NEXT: [[TMP4:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP8]], align 8
+; GCN-NEXT: [[GEP10:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 10
+; GCN-NEXT: [[TMP5:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP10]], align 2
+; GCN-NEXT: [[GEP12:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 12
+; GCN-NEXT: [[TMP6:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP12]], align 8
+; GCN-NEXT: [[GEP14:%.*]] = getelementptr i16, ptr addrspace(3) [[INPTR0]], i32 14
+; GCN-NEXT: [[TMP7:%.*]] = load <2 x i16>, ptr addrspace(3) [[GEP14]], align 2
+; GCN-NEXT: br label %[[DO_BODY:.*]]
+; GCN: [[DO_BODY]]:
+; GCN-NEXT: [[TMP8:%.*]] = phi <2 x i16> [ [[TMP0]], %[[ENTRY]] ], [ [[TMP16:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP9:%.*]] = phi <2 x i16> [ [[TMP1]], %[[ENTRY]] ], [ [[TMP17:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP10:%.*]] = phi <2 x i16> [ [[TMP2]], %[[ENTRY]] ], [ [[TMP18:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP11:%.*]] = phi <2 x i16> [ [[TMP3]], %[[ENTRY]] ], [ [[TMP19:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP12:%.*]] = phi <2 x i16> [ [[TMP4]], %[[ENTRY]] ], [ [[TMP20:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP13:%.*]] = phi <2 x i16> [ [[TMP5]], %[[ENTRY]] ], [ [[TMP21:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP14:%.*]] = phi <2 x i16> [ [[TMP6]], %[[ENTRY]] ], [ [[TMP22:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP42:%.*]] = phi <2 x i16> [ [[TMP7]], %[[ENTRY]] ], [ [[TMP23:%.*]], %[[DO_BODY]] ]
+; GCN-NEXT: [[TMP16]] = load <2 x i16>, ptr addrspace(3) [[GEP0]], align 8
+; GCN-NEXT: [[TMP17]] = load <2 x i16>, ptr addrspace(3) [[GEP2]], align 2
+; GCN-NEXT: [[TMP18]] = load <2 x i16>, ptr addrspace(3) [[GEP4]], align 8
+; GCN-NEXT: [[TMP19]] = load <2 x i16>, ptr addrspace(3) [[GEP6]], align 2
+; GCN-NEXT: [[TMP20]] = load <2 x i16>, ptr addrspace(3) [[GEP8]], align 8
+; GCN-NEXT: [[TMP21]] = load <2 x i16>, ptr addrspace(3) [[GEP10]], align 2
+; GCN-NEXT: [[TMP22]] = load <2 x i16>, ptr addrspace(3) [[GEP12]], align 8
+; GCN-NEXT: [[TMP23]] = load <2 x i16>, ptr addrspace(3) [[GEP14]], align 2
+; GCN-NEXT: [[CMP:%.*]] = icmp eq i32 [[FLAG]], 0
+; GCN-NEXT: br i1 [[CMP]], label %[[EXIT:.*]], label %[[DO_BODY]]
+; GCN: [[EXIT]]:
+; GCN-NEXT: [[TMP24:%.*]] = shufflevector <2 x i16> [[TMP16]], <2 x i16> [[TMP17]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP25:%.*]] = shufflevector <2 x i16> [[TMP18]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP26:%.*]] = shufflevector <16 x i16> [[TMP24]], <16 x i16> [[TMP25]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP27:%.*]] = shufflevector <2 x i16> [[TMP19]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP28:%.*]] = shufflevector <16 x i16> [[TMP26]], <16 x i16> [[TMP27]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP29:%.*]] = shufflevector <2 x i16> [[TMP20]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP30:%.*]] = shufflevector <16 x i16> [[TMP28]], <16 x i16> [[TMP29]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP31:%.*]] = shufflevector <2 x i16> [[TMP21]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP32:%.*]] = shufflevector <16 x i16> [[TMP30]], <16 x i16> [[TMP31]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP33:%.*]] = shufflevector <2 x i16> [[TMP22]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP47:%.*]] = shufflevector <16 x i16> [[TMP32]], <16 x i16> [[TMP33]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP48:%.*]] = shufflevector <2 x i16> [[TMP23]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP49:%.*]] = shufflevector <16 x i16> [[TMP47]], <16 x i16> [[TMP48]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 16, i32 17>
+; GCN-NEXT: [[TMP37:%.*]] = shufflevector <2 x i16> [[TMP0]], <2 x i16> [[TMP1]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP38:%.*]] = shufflevector <2 x i16> [[TMP2]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP39:%.*]] = shufflevector <16 x i16> [[TMP37]], <16 x i16> [[TMP38]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP40:%.*]] = shufflevector <2 x i16> [[TMP3]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP41:%.*]] = shufflevector <16 x i16> [[TMP39]], <16 x i16> [[TMP40]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP57:%.*]] = shufflevector <2 x i16> [[TMP4]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP43:%.*]] = shufflevector <16 x i16> [[TMP41]], <16 x i16> [[TMP57]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP44:%.*]] = shufflevector <2 x i16> [[TMP5]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP45:%.*]] = shufflevector <16 x i16> [[TMP43]], <16 x i16> [[TMP44]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 17, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP46:%.*]] = shufflevector <2 x i16> [[TMP6]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP58:%.*]] = shufflevector <16 x i16> [[TMP45]], <16 x i16> [[TMP46]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP60:%.*]] = shufflevector <2 x i16> [[TMP7]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC2157:%.*]] = shufflevector <16 x i16> [[TMP58]], <16 x i16> [[TMP60]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 16, i32 17>
+; GCN-NEXT: [[TMP50:%.*]] = shufflevector <2 x i16> [[TMP8]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[TMP51:%.*]] = shufflevector <2 x i16> [[TMP9]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC231:%.*]] = shufflevector <16 x i16> [[TMP50]], <16 x i16> [[TMP51]], <16 x i32> <i32 0, i32 1, i32 16, i32 17, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT: [[TMP52:%.*]] = shufflevector <2 x i16> [[TMP10]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC252:%.*]] = shufflevector <16 x i16> [[VEC231]], <16 x i16> [[TMP52]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 16, i32 17, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT: [[TMP53:%.*]] = shufflevector <2 x i16> [[TMP11]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC273:%.*]] = shufflevector <16 x i16> [[VEC252]], <16 x i16> [[TMP53]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 16, i32 17, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT: [[TMP54:%.*]] = shufflevector <2 x i16> [[TMP12]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC294:%.*]] = shufflevector <16 x i16> [[VEC273]], <16 x i16> [[TMP54]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT: [[TMP55:%.*]] = shufflevector <2 x i16> [[TMP13]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC2115:%.*]] = shufflevector <16 x i16> [[VEC294]], <16 x i16> [[TMP55]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 17, i32 12, i32 13, i32 14, i32 15>
+; GCN-NEXT: [[TMP56:%.*]] = shufflevector <2 x i16> [[TMP14]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC2136:%.*]] = shufflevector <16 x i16> [[VEC2115]], <16 x i16> [[TMP56]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 16, i32 17, i32 14, i32 15>
+; GCN-NEXT: [[TMP59:%.*]] = shufflevector <2 x i16> [[TMP42]], <2 x i16> poison, <16 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; GCN-NEXT: [[VEC2151:%.*]] = shufflevector <16 x i16> [[VEC2136]], <16 x i16> [[TMP59]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 16, i32 17>
+; GCN-NEXT: store <16 x i16> [[VEC2157]], ptr [[OUT]], align 32
+; GCN-NEXT: store <16 x i16> [[TMP49]], ptr [[OUT1]], align 32
+; GCN-NEXT: store <16 x i16> [[VEC2151]], ptr [[OUT2]], align 32
+; GCN-NEXT: ret void
+;
+entry:
+ %gep0 = getelementptr i16, ptr addrspace(3) %inptr0, i32 0
+ %ele0 = load i16, ptr addrspace(3) %gep0, align 8
+ %gep1 = getelementptr i16, ptr addrspace(3) %inptr0, i32 1
+ %ele1 = load i16, ptr addrspace(3) %gep1, align 1
+ %gep2 = getelementptr i16, ptr addrspace(3) %inptr0, i32 2
+ %ele2 = load i16, ptr addrspace(3) %gep2, align 2
+ %gep3 = getelementptr i16, ptr addrspace(3) %inptr0, i32 3
+ %ele3 = load i16, ptr addrspace(3) %gep3, align 1
+ %gep4 = getelementptr i16, ptr addrspace(3) %inptr0, i32 4
+ %ele4 = load i16, ptr addrspace(3) %gep4, align 8
+ %gep5 = getelementptr i16, ptr addrspace(3) %inptr0, i32 5
+ %ele5 = load i16, ptr addrspace(3) %gep5, align 1
+ %gep6 = getelementptr i16, ptr addrspace(3) %inptr0, i32 6
+ %ele6 = load i16, ptr addrspace(3) %gep6, align 2
+ %gep7 = getelementptr i16, ptr addrspace(3) %inptr0, i32 7
+ %ele7 = load i16, ptr addrspace(3) %gep7, align 1
+ %gep8 = getelementptr i16, ptr addrspace(3) %inptr0, i32 8
+ %ele8 = load i16, ptr addrspace(3) %gep8, align 8
+ %gep9 = getelementptr i16, ptr addrspace(3) %inptr0, i32 9
+ %ele9 = load i16, ptr addrspace(3) %gep9, align 1
+ %gep10 = getelementptr i16, ptr addrspace(3) %inptr0, i32 10
+ %ele10 = load i16, ptr addrspace(3) %gep10, align 2
+ %gep11 = getelementptr i16, ptr addrspace(3) %inptr0, i32 11
+ %ele11 = load i16, ptr addrspace(3) %gep11, align 1
+ %gep12 = getelementptr i16, ptr addrspace(3) %inptr0, i32 12
+ %ele12 = load i16, ptr addrspace(3) %gep12, align 8
+ %gep13 = getelementptr i16, ptr addrspace(3) %inptr0, i32 13
+ %ele13 = load i16, ptr addrspace(3) %gep13, align 1
+ %gep14 = getelementptr i16, ptr addrspace(3) %inptr0, i32 14
+ %ele14 = load i16, ptr addrspace(3) %gep14, align 2
+ %gep15 = getelementptr i16, ptr addrspace(3) %inptr0, i32 15
+ %ele15 = load i16, ptr addrspace(3) %gep15, align 1
+ br label %do.body
+
+do.body:
+ %phi0 = phi i16 [ %ele0, %entry ], [ %otherele0, %do.body ]
+ %phi1 = phi i16 [ %ele1, %entry ], [ %otherele1, %do.body ]
+ %phi2 = phi i16 [ %ele2, %entry ], [ %otherele2, %do.body ]
+ %phi3 = phi i16 [ %ele3, %entry ], [ %otherele3, %do.body ]
+ %phi4 = phi i16 [ %ele4, %entry ], [ %otherele4, %do.body ]
+ %phi5 = phi i16 [ %ele5, %entry ], [ %otherele5, %do.body ]
+ %phi6 = phi i16 [ %ele6, %entry ], [ %otherele6, %do.body ]
+ %phi7 = phi i16 [ %ele7, %entry ], [ %otherele7, %do.body ]
+ %phi8 = phi i16 [ %ele8, %entry ], [ %otherele8, %do.body ]
+ %phi9 = phi i16 [ %ele9, %entry ], [ %otherele9, %do.body ]
+ %phi10 = phi i16 [ %ele10, %entry ], [ %otherele10, %do.body ]
+ %phi11 = phi i16 [ %ele11, %entry ], [ %otherele11, %do.body ]
+ %phi12 = phi i16 [ %ele12, %entry ], [ %otherele12, %do.body ]
+ %phi13 = phi i16 [ %ele13, %entry ], [ %otherele13, %do.body ]
+ %phi14 = phi i16 [ %ele14, %entry ], [ %otherele14, %do.body ]
+ %phi15 = phi i16 [ %ele15, %entry ], [ %otherele15, %do.body ]
+
+ %otherele0 = load i16, ptr addrspace(3) %gep0, align 8
+ %otherele1 = load i16, ptr addrspace(3) %gep1, align 1
+ %otherele2 = load i16, ptr addrspace(3) %gep2, align 2
+ %otherele3 = load i16, ptr addrspace(3) %gep3, align 1
+ %otherele4 = load i16, ptr addrspace(3) %gep4, align 8
+ %otherele5 = load i16, ptr addrspace(3) %gep5, align 1
+ %otherele6 = load i16, ptr addrspace(3) %gep6, align 2
+ %otherele7 = load i16, ptr addrspace(3) %gep7, align 1
+ %otherele8 = load i16, ptr addrspace(3) %gep8, align 8
+ %otherele9 = load i16, ptr addrspace(3) %gep9, align 1
+ %otherele10 = load i16, ptr ...
[truncated]
|
@@ -14802,7 +14803,39 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals, | |||
<< " for final shuffle of insertelement external users.\n"; | |||
TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n"); | |||
Cost += C; | |||
return std::make_pair(TE, true); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! What about something like this, which is more close to the codegen:
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index afba099f5154..c269a1efa415 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -14757,25 +14757,43 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals,
Cost += ExtractCost;
auto &&ResizeToVF = [this, &Cost](const TreeEntry *TE, ArrayRef<int> Mask,
- bool) {
+ bool ForSingleMask) {
InstructionCost C = 0;
unsigned VF = Mask.size();
unsigned VecVF = TE->getVectorFactor();
if (VF != VecVF &&
(any_of(Mask, [VF](int Idx) { return Idx >= static_cast<int>(VF); }) ||
!ShuffleVectorInst::isIdentityMask(Mask, VF))) {
- SmallVector<int> OrigMask(VecVF, PoisonMaskElem);
- std::copy(Mask.begin(), std::next(Mask.begin(), std::min(VF, VecVF)),
- OrigMask.begin());
- C = ::getShuffleCost(*TTI, TTI::SK_PermuteSingleSrc,
- getWidenedType(TE->getMainOp()->getType(), VecVF),
- OrigMask);
- LLVM_DEBUG(
- dbgs() << "SLP: Adding cost " << C
- << " for final shuffle of insertelement external users.\n";
- TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
- Cost += C;
- return std::make_pair(TE, true);
+ if (any_of(Mask, [VF](int Idx) { return Idx >= static_cast<int>(VF); })) {
+ SmallVector<int> OrigMask(VecVF, PoisonMaskElem);
+ std::copy(Mask.begin(), std::next(Mask.begin(), std::min(VF, VecVF)),
+ OrigMask.begin());
+ C = ::getShuffleCost(*TTI, TTI::SK_PermuteSingleSrc,
+ getWidenedType(TE->getMainOp()->getType(), VecVF),
+ OrigMask);
+ LLVM_DEBUG(
+ dbgs() << "SLP: Adding cost " << C
+ << " for final shuffle of insertelement external users.\n";
+ TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
+ Cost += C;
+ return std::make_pair(TE, true);
+ }
+ if (!ForSingleMask) {
+ SmallVector<int> ResizeMask(VF, PoisonMaskElem);
+ for (unsigned I = 0; I < VF; ++I) {
+ if (Mask[I] != PoisonMaskElem)
+ ResizeMask[Mask[I]] = Mask[I];
+ }
+ if (!ShuffleVectorInst::isIdentityMask(Mask, VF))
+ C = ::getShuffleCost(
+ *TTI, TTI::SK_PermuteSingleSrc,
+ getWidenedType(TE->getMainOp()->getType(), VecVF), ResizeMask);
+ LLVM_DEBUG(
+ dbgs() << "SLP: Adding cost " << C
+ << " for final shuffle of insertelement external users.\n";
+ TE->dump(); dbgs() << "SLP: Current total cost = " << Cost << "\n");
+ Cost += C;
+ }
}
return std::make_pair(TE, false);
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks -- I'll try to adopt it, but it's causing some concerning looking diffs in the lit tests that I need to work through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexey-bataev Sorry for the delay, I've updated the PR to include your suggestion. It causes some changes in the lit tests that don't seem to be beneficial. I'm looking into them, but thought I'd post them as well to see if you had any insight.
@@ -0,0 +1,222 @@ | |||
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should be precommitted in a separate NFC patch
llvm/test/Transforms/SLPVectorizer/X86/insertelements-with-reused-indices.ll
Outdated
Show resolved
Hide resolved
; CHECK-NEXT: [[TMP10:%.*]] = add nsw <4 x i32> [[TMP5]], [[TMP9]] | ||
; CHECK-NEXT: [[TMP11:%.*]] = mul nsw <4 x i32> [[TMP5]], [[TMP9]] | ||
; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <4 x i32> [[TMP10]], <4 x i32> [[TMP11]], <4 x i32> <i32 0, i32 1, i32 6, i32 3> | ||
; CHECK-NEXT: [[T701:%.*]] = shufflevector <4 x i32> [[TMP12]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 poison, i32 3> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The effect this PR has had on this test I think is fine.
We have VF of 4 to the insertelement user of width 8 (our vectorizable portion starts at insertelement %t65).
The external use mask for our vectorized tree is <-1, -1, -1, -1, 0, 1, -1, 3>
Previously, we used a resize mask of <-1, -1, -1, -1> to calculate the resize cost and a mask of <0, 1, 2, 3, 12, 13, -1, 15> to calculate the external shuffle cost. These shuffles both have costs of 1, this makes it so that we do not vectorize. It seems our previous calculations are based on the assumption that we will prepad our vector with poison to resize it, however, that isn't how the resize + shuffle is actually implemented.
In actuality, we resize using mask <0, 1, -1, 3, -1, -1, -1, -1> which is just identity with poison. Then, we shuffle using <0, 1, 2, 3, 0, 1, -1, 3>. In the new version, we are much more accurate, and use <0, 1, -1, 3, -1, -1, -1, -1> to caculate the resize cost (identity with poison) and <0, 1, 2, 3, 8, 9, -1, 11> to calculate the shuffle cost. These have costs 0 and 1 respectively which keeps up within the cost threshold, so we vectorize.
There is difference in the generated assembly before / after.
; CHECK-NEXT: [[I118:%.*]] = fadd double [[TMP19]], [[TMP20]] | ||
; CHECK-NEXT: [[TMP21:%.*]] = fmul <4 x double> zeroinitializer, [[TMP1]] | ||
; CHECK-NEXT: [[TMP22:%.*]] = shufflevector <2 x double> [[TMP4]], <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 poison, i32 poison> | ||
; CHECK-NEXT: [[TMP23:%.*]] = shufflevector <4 x double> <double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double poison>, <4 x double> [[TMP22]], <4 x i32> <i32 0, i32 1, i32 2, i32 5> | ||
; CHECK-NEXT: [[TMP23:%.*]] = insertelement <4 x double> <double 0.000000e+00, double 0.000000e+00, double 0.000000e+00, double poison>, double [[I82]], i32 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The effect this PR has had on this test also looks good to me.
The effect is less immediately obvious on this test compared to the last, because we are evaluating vectorizing a list after already vectorizing several others. In the partially vectorized version, we have some insertelements that are not present in the original IR. These insertelements are external users for our tree.
We have a VF of 2 and are inserting the 2nd element into 4th position of external insertelement vector (external user width is 4). The vector we are inserting into is a literal 0.0 vector, and is the other operand in any external shuffles.
The external use mask for our vectorized tree is <-1, -1, -1, 1>
Previously, we used a resize mask of <-1, -1> to calculate the resize cost and a mask of <0, 1, 2, 7> to calculate the external shuffle cost. These shuffles both have costs of 1 so we vectorize. However, this isn't how the SLP codegen works -- we don't shuffle during the resize (though that may be more optimal in this case). In actuality, we resize using mask <0, 1, -1 -1> which is just identity with poison. Then, we shuffle using <0, 1, 2, 5>. In the new version, we are much more accurate, and use <-1, 1, -1 -1> to calculate the resize cost (identity with poison), and <0, 1, 2, 5> to calculate the shuffle cost. These have costs 0 and 3 respectively, and push us over the cost threshold, so we do not vectorize this part.
There is difference in the generated assembly before / after.
I took a look at the lit changes and my conclusion is that they are reflecting our cost model being more accurate -- so I am happy with the current changes. |
Change-Id: I7620d2bd3d65be994bd290b84267832fdb4f1bb4
Change-Id: Ie1209bb7f6c49992d41e3fec8a195a81d04f34d4
Change-Id: I9c4865641974152f6289df79df4aa057cdcdb8ed
Force push for test precommit |
Add @RKSimon for any comment on the x86 test changes. I think they're fine, but might be good to double check. |
…es (llvm#137419) When implementing the vectorization, we potentially need to add shuffles for external users. In such cases, we may be shuffling a smaller vector into a larger vector. When this happens `ResizeToVF` will just build a poison padded identity vector. Then the to build the final shuffle, we just use the `SK_InsertSubvector` mask. This is possibly clearer by looking at the included test in SLPVectorizer/AMDGPU/external-shuffle.ll In the exit block we have a bunch of shuffles to glue the vectorized tree match the `InsertElement` users. `TMP25` holds the result of resizing the v2i16 vectorized sequence to match the `InsertElement` size v16i16. Then `TMP26` is the final shuffle which replaces the `InsertElement` sequence. This is just an insertsubvector. However, when calculating the cost for these shuffles, we aren't modelling this correctly. `ResizeToVF` will indicate to `performExtractsShuffleAction` that we cannot use the original mask due to the resize shuffle. The consequence is that the cost calculation uses a different shuffle mask than what is ultimately used. Going back to the included test, we can consider again `TMP26`. Clearly we can see the shuffle uses a mask {0, 1, 2, 3, 16, 17, poison ..}. However, we will currently calculate the cost with a mask {0, 1, 2, 3, 20, 21, ...} we have replaced 16 and 17 with 20 and 21 (Index + Vector Size). Queries like BasicTTImpl::improveShuffleKindFromMask will not recognize this as an `SK_InsertSubvector` mask, and targets which have reduced costs for `SK_InsertSubvector` will not accurately calculate the cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x86 LGTM (sorry for late reply)
Thanks for taking a look! |
When implementing the vectorization, we potentially need to add shuffles for external users. In such cases, we may be shuffling a smaller vector into a larger vector. When this happens
ResizeToVF
will just build a poison padded identity vector. Then the to build the final shuffle, we just use theSK_InsertSubvector
mask.This is possibly clearer by looking at the included test in SLPVectorizer/AMDGPU/external-shuffle.ll
In the exit block we have a bunch of shuffles to glue the vectorized tree match the
InsertElement
users.TMP25
holds the result of resizing the v2i16 vectorized sequence to match theInsertElement
size v16i16. ThenTMP26
is the final shuffle which replaces theInsertElement
sequence. This is just an insertsubvector.However, when calculating the cost for these shuffles, we aren't modelling this correctly.
ResizeToVF
will indicate toperformExtractsShuffleAction
that we cannot use the original mask due to the resize shuffle. The consequence is that the cost calculation uses a different shuffle mask than what is ultimately used.Going back to the included test, we can consider again
TMP26
. Clearly we can see the shuffle uses a mask {0, 1, 2, 3, 16, 17, poison ..}. However, we will currently calculate the cost with a mask {0, 1, 2, 3, 20, 21, ...} we have replaced 16 and 17 with 20 and 21 (Index + Vector Size). Queries like BasicTTImpl::improveShuffleKindFromMask will not recognize this as anSK_InsertSubvector
mask, and targets which have reduced costs forSK_InsertSubvector
will not accurately calculate the cost.