[LV] Use vscale for tuning when updating profile information #143690

Merged: david-arm merged 1 commit into llvm:main on Jun 16, 2025

david-arm

In fixVectorizedLoop we call setProfileInfoAfterUnrolling to update the profile information after vectorising. However, for scalable VFs we pessimistically assume vscale=1. We can improve upon this by using the value of vscale used for tuning, e.g. when targeting neoverse-v1 the expected value is 2.
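
To illustrate the estimate described above, here is a minimal standalone sketch of the helper's logic. `ElementCountSketch` and `estimatedRuntimeVF` are hypothetical stand-ins introduced only for this example (the patch itself uses `llvm::ElementCount` and `getEstimatedRuntimeVF`), and the values in the usage note are assumptions chosen to match the neoverse-v1 case mentioned in the description.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for llvm::ElementCount, reduced to the two
// properties that the helper in the patch reads.
struct ElementCountSketch {
  unsigned KnownMin; // minimum number of elements
  bool Scalable;     // true if the runtime count is KnownMin * vscale
};

// Same logic as getEstimatedRuntimeVF in the patch: fixed-width counts are
// returned unchanged; scalable counts are multiplied by the tuning vscale
// when the target provides one.
unsigned estimatedRuntimeVF(ElementCountSketch VF,
                            std::optional<unsigned> VScale) {
  unsigned Estimated = VF.KnownMin;
  if (VF.Scalable && VScale)
    Estimated *= *VScale;
  assert(Estimated >= 1 && "Estimated VF shouldn't be less than 1");
  return Estimated;
}
```

Under these assumptions, a scalable count of 4 with a tuning vscale of 2 yields 8, whereas the previous behaviour (treating vscale as 1) corresponds to 4. When no tuning value is available the sketch falls back to the known minimum, as the helper in the patch does.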

llvmbot commented Jun 11, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: David Sherwood (david-arm)

Patch is 21.71 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/143690.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+17-17)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/check-prof-info.ll (+123)
  • (modified) llvm/test/Transforms/LoopVectorize/check-prof-info.ll (+135-25)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 333e50ee98418..eeea1cad6abff 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2688,6 +2688,20 @@ static void cse(BasicBlock *BB) {
   }
 }
 
+/// This function attempts to return a value that represents the vectorization
+/// factor at runtime. For fixed-width VFs we know this precisely at compile
+/// time, but for scalable VFs we calculate it based on an estimate of the
+/// vscale value.
+static unsigned getEstimatedRuntimeVF(ElementCount VF,
+                                      std::optional<unsigned> VScale) {
+  unsigned EstimatedVF = VF.getKnownMinValue();
+  if (VF.isScalable())
+    if (VScale)
+      EstimatedVF *= *VScale;
+  assert(EstimatedVF >= 1 && "Estimated VF shouldn't be less than 1");
+  return EstimatedVF;
+}
+
 InstructionCost
 LoopVectorizationCostModel::getVectorCallCost(CallInst *CI,
                                               ElementCount VF) const {
@@ -2787,10 +2801,10 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State) {
   //
   // For scalable vectorization we can't know at compile time how many
   // iterations of the loop are handled in one vector iteration, so instead
-  // assume a pessimistic vscale of '1'.
+  // use the value of vscale used for tuning.
   Loop *VectorLoop = LI->getLoopFor(HeaderBB);
-  setProfileInfoAfterUnrolling(OrigLoop, VectorLoop, OrigLoop,
-                               VF.getKnownMinValue() * UF);
+  unsigned VFxUF = getEstimatedRuntimeVF(VF * UF, Cost->getVScaleForTuning());
+  setProfileInfoAfterUnrolling(OrigLoop, VectorLoop, OrigLoop, VFxUF);
 }
 
 void InnerLoopVectorizer::fixNonInductionPHIs(VPTransformState &State) {
@@ -4017,20 +4031,6 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
   return MaxVF;
 }
 
-/// This function attempts to return a value that represents the vectorization
-/// factor at runtime. For fixed-width VFs we know this precisely at compile
-/// time, but for scalable VFs we calculate it based on an estimate of the
-/// vscale value.
-static unsigned getEstimatedRuntimeVF(ElementCount VF,
-                                      std::optional<unsigned> VScale) {
-  unsigned EstimatedVF = VF.getKnownMinValue();
-  if (VF.isScalable())
-    if (VScale)
-      EstimatedVF *= *VScale;
-  assert(EstimatedVF >= 1 && "Estimated VF shouldn't be less than 1");
-  return EstimatedVF;
-}
-
 bool LoopVectorizationPlanner::isMoreProfitable(const VectorizationFactor &A,
                                                 const VectorizationFactor &B,
                                                 const unsigned MaxTripCount,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/check-prof-info.ll b/llvm/test/Transforms/LoopVectorize/AArch64/check-prof-info.ll
new file mode 100644
index 0000000000000..9661f1b3b6641
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/check-prof-info.ll
@@ -0,0 +1,123 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "br" --filter "^.*:" --version 5
+; RUN: opt -passes="print<block-freq>,loop-vectorize" -mcpu=neoverse-v1 -force-vector-interleave=1 -S < %s |  FileCheck %s -check-prefix=CHECK-V1-IC1
+; RUN: opt -passes="print<block-freq>,loop-vectorize" -mcpu=neoverse-v2 -force-vector-interleave=1 -S < %s |  FileCheck %s -check-prefix=CHECK-V2-IC1
+; RUN: opt -passes="print<block-freq>,loop-vectorize" -mcpu=neoverse-v2 -force-vector-interleave=4 -S < %s |  FileCheck %s -check-prefix=CHECK-V2-IC4
+
+target triple = "aarch64-unknown-linux-gnu"
+
+@a = dso_local global [1024 x i32] zeroinitializer, align 16
+@b = dso_local global [1024 x i32] zeroinitializer, align 16
+
+; Check correctness of profile info for vectorization without epilog.
+; Function Attrs: nofree norecurse nounwind uwtable
+define dso_local void @_Z3foov() local_unnamed_addr #0 {
+; CHECK-V1-IC1-LABEL: define dso_local void @_Z3foov(
+; CHECK-V1-IC1-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-V1-IC1:  [[ENTRY:.*:]]
+; CHECK-V1-IC1:    br i1 [[MIN_ITERS_CHECK:%.*]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF0:![0-9]+]]
+; CHECK-V1-IC1:  [[VECTOR_PH]]:
+; CHECK-V1-IC1:    br label %[[VECTOR_BODY:.*]]
+; CHECK-V1-IC1:  [[VECTOR_BODY]]:
+; CHECK-V1-IC1:    br i1 [[TMP16:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF0]], !llvm.loop [[LOOP1:![0-9]+]]
+; CHECK-V1-IC1:  [[MIDDLE_BLOCK]]:
+; CHECK-V1-IC1:    br i1 [[CMP_N:%.*]], label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF4:![0-9]+]]
+; CHECK-V1-IC1:  [[SCALAR_PH]]:
+; CHECK-V1-IC1:    br label %[[FOR_BODY:.*]]
+; CHECK-V1-IC1:  [[FOR_BODY]]:
+; CHECK-V1-IC1:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF5:![0-9]+]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-V1-IC1:  [[FOR_COND_CLEANUP]]:
+;
+; CHECK-V2-IC1-LABEL: define dso_local void @_Z3foov(
+; CHECK-V2-IC1-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-V2-IC1:  [[ENTRY:.*:]]
+; CHECK-V2-IC1:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF0:![0-9]+]]
+; CHECK-V2-IC1:  [[VECTOR_PH]]:
+; CHECK-V2-IC1:    br label %[[VECTOR_BODY:.*]]
+; CHECK-V2-IC1:  [[VECTOR_BODY]]:
+; CHECK-V2-IC1:    br i1 [[TMP6:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF1:![0-9]+]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-V2-IC1:  [[MIDDLE_BLOCK]]:
+; CHECK-V2-IC1:    br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF5:![0-9]+]]
+; CHECK-V2-IC1:  [[SCALAR_PH]]:
+; CHECK-V2-IC1:    br label %[[FOR_BODY:.*]]
+; CHECK-V2-IC1:  [[FOR_BODY]]:
+; CHECK-V2-IC1:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF6:![0-9]+]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-V2-IC1:  [[FOR_COND_CLEANUP]]:
+;
+; CHECK-V2-IC4-LABEL: define dso_local void @_Z3foov(
+; CHECK-V2-IC4-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-V2-IC4:  [[VEC_EPILOG_VECTOR_BODY1:.*:]]
+; CHECK-V2-IC4:    br i1 [[MIN_ITERS_CHECK:%.*]], label %[[VEC_EPILOG_SCALAR_PH:.*]], label %[[VECTOR_MAIN_LOOP_ITER_CHECK:.*]], !prof [[PROF0:![0-9]+]]
+; CHECK-V2-IC4:  [[VECTOR_MAIN_LOOP_ITER_CHECK]]:
+; CHECK-V2-IC4:    br i1 false, label %[[VEC_EPILOG_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF0]]
+; CHECK-V2-IC4:  [[VECTOR_PH]]:
+; CHECK-V2-IC4:    br label %[[VECTOR_BODY:.*]]
+; CHECK-V2-IC4:  [[VECTOR_BODY]]:
+; CHECK-V2-IC4:    br i1 [[TMP20:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF1:![0-9]+]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-V2-IC4:  [[MIDDLE_BLOCK]]:
+; CHECK-V2-IC4:    br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[VEC_EPILOG_ITER_CHECK:.*]], !prof [[PROF5:![0-9]+]]
+; CHECK-V2-IC4:  [[VEC_EPILOG_ITER_CHECK]]:
+; CHECK-V2-IC4:    br i1 [[MIN_EPILOG_ITERS_CHECK:%.*]], label %[[VEC_EPILOG_SCALAR_PH]], label %[[VEC_EPILOG_PH]], !prof [[PROF6:![0-9]+]]
+; CHECK-V2-IC4:  [[VEC_EPILOG_PH]]:
+; CHECK-V2-IC4:    br label %[[VEC_EPILOG_VECTOR_BODY:.*]]
+; CHECK-V2-IC4:  [[VEC_EPILOG_VECTOR_BODY]]:
+; CHECK-V2-IC4:    br i1 [[TMP38:%.*]], label %[[VEC_EPILOG_MIDDLE_BLOCK:.*]], label %[[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
+; CHECK-V2-IC4:  [[VEC_EPILOG_MIDDLE_BLOCK]]:
+; CHECK-V2-IC4:    br i1 [[CMP_N:%.*]], label %[[FOR_COND_CLEANUP]], label %[[VEC_EPILOG_SCALAR_PH]], !prof [[PROF8:![0-9]+]]
+; CHECK-V2-IC4:  [[VEC_EPILOG_SCALAR_PH]]:
+; CHECK-V2-IC4:    br label %[[FOR_BODY:.*]]
+; CHECK-V2-IC4:  [[FOR_BODY]]:
+; CHECK-V2-IC4:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF9:![0-9]+]], !llvm.loop [[LOOP10:![0-9]+]]
+; CHECK-V2-IC4:  [[FOR_COND_CLEANUP]]:
+;
+entry:
+  br label %for.body
+
+for.body:                                         ; preds = %for.body, %entry
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds [1024 x i32], ptr @b, i64 0, i64 %indvars.iv
+  %0 = load i32, ptr %arrayidx, align 4
+  %1 = trunc i64 %indvars.iv to i32
+  %mul = mul nsw i32 %0, %1
+  %arrayidx2 = getelementptr inbounds [1024 x i32], ptr @a, i64 0, i64 %indvars.iv
+  %2 = load i32, ptr %arrayidx2, align 4
+  %add = add nsw i32 %2, %mul
+  store i32 %add, ptr %arrayidx2, align 4
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp eq i64 %indvars.iv.next, 1024
+  br i1 %exitcond, label %for.cond.cleanup, label %for.body, !prof !0
+
+for.cond.cleanup:                                 ; preds = %for.body
+  ret void
+}
+
+!0 = !{!"branch_weights", i32 1, i32 1023}
+;.
+; CHECK-V1-IC1: [[PROF0]] = !{!"branch_weights", i32 1, i32 127}
+; CHECK-V1-IC1: [[LOOP1]] = distinct !{[[LOOP1]], [[META2:![0-9]+]], [[META3:![0-9]+]]}
+; CHECK-V1-IC1: [[META2]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK-V1-IC1: [[META3]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK-V1-IC1: [[PROF4]] = !{!"branch_weights", i32 1, i32 3}
+; CHECK-V1-IC1: [[PROF5]] = !{!"branch_weights", i32 0, i32 0}
+; CHECK-V1-IC1: [[LOOP6]] = distinct !{[[LOOP6]], [[META3]], [[META2]]}
+;.
+; CHECK-V2-IC1: [[PROF0]] = !{!"branch_weights", i32 1, i32 127}
+; CHECK-V2-IC1: [[PROF1]] = !{!"branch_weights", i32 1, i32 255}
+; CHECK-V2-IC1: [[LOOP2]] = distinct !{[[LOOP2]], [[META3:![0-9]+]], [[META4:![0-9]+]]}
+; CHECK-V2-IC1: [[META3]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK-V2-IC1: [[META4]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK-V2-IC1: [[PROF5]] = !{!"branch_weights", i32 1, i32 3}
+; CHECK-V2-IC1: [[PROF6]] = !{!"branch_weights", i32 0, i32 0}
+; CHECK-V2-IC1: [[LOOP7]] = distinct !{[[LOOP7]], [[META4]], [[META3]]}
+;.
+; CHECK-V2-IC4: [[PROF0]] = !{!"branch_weights", i32 1, i32 127}
+; CHECK-V2-IC4: [[PROF1]] = !{!"branch_weights", i32 1, i32 63}
+; CHECK-V2-IC4: [[LOOP2]] = distinct !{[[LOOP2]], [[META3:![0-9]+]], [[META4:![0-9]+]]}
+; CHECK-V2-IC4: [[META3]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK-V2-IC4: [[META4]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK-V2-IC4: [[PROF5]] = !{!"branch_weights", i32 1, i32 15}
+; CHECK-V2-IC4: [[PROF6]] = !{!"branch_weights", i32 2, i32 0}
+; CHECK-V2-IC4: [[LOOP7]] = distinct !{[[LOOP7]], [[META3]], [[META4]]}
+; CHECK-V2-IC4: [[PROF8]] = !{!"branch_weights", i32 1, i32 1}
+; CHECK-V2-IC4: [[PROF9]] = !{!"branch_weights", i32 0, i32 0}
+; CHECK-V2-IC4: [[LOOP10]] = distinct !{[[LOOP10]], [[META4]], [[META3]]}
+;.
diff --git a/llvm/test/Transforms/LoopVectorize/check-prof-info.ll b/llvm/test/Transforms/LoopVectorize/check-prof-info.ll
index 17013c5908065..0e1e4dfecd1e6 100644
--- a/llvm/test/Transforms/LoopVectorize/check-prof-info.ll
+++ b/llvm/test/Transforms/LoopVectorize/check-prof-info.ll
@@ -1,6 +1,8 @@
-; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "br" --filter "^.*:" --version 5
 ; RUN: opt -passes="print<block-freq>,loop-vectorize" -force-vector-width=4 -force-vector-interleave=1 -S < %s |  FileCheck %s
-; RUN: opt -passes="print<block-freq>,loop-vectorize" -force-vector-width=4 -force-vector-interleave=4 -S < %s |  FileCheck %s -check-prefix=CHECK-MASKED
+; RUN: opt -passes="print<block-freq>,loop-vectorize" -force-vector-width=4 -force-vector-interleave=4 -S < %s |  FileCheck %s -check-prefix=CHECK-IC4
+; RUN: opt -passes="print<block-freq>,loop-vectorize" -force-vector-width=4 -force-vector-interleave=1 \
+; RUN:   -scalable-vectorization=on -force-target-supports-scalable-vectors -S < %s |  FileCheck %s -check-prefix=CHECK-SCALABLE
 
 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 
@@ -10,15 +12,53 @@ target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 ; Check correctness of profile info for vectorization without epilog.
 ; Function Attrs: nofree norecurse nounwind uwtable
 define dso_local void @_Z3foov() local_unnamed_addr #0 {
-; CHECK-LABEL: @_Z3foov(
-; CHECK:  [[VECTOR_BODY:vector\.body]]:
-; CHECK:    br i1 [[TMP:%.*]], label [[MIDDLE_BLOCK:%.*]], label %[[VECTOR_BODY]], !prof [[LP1_255:\!.*]],
-; CHECK:  [[FOR_BODY:for\.body]]:
-; CHECK:    br i1 [[EXITCOND:%.*]], label [[FOR_END_LOOPEXIT:%.*]], label %[[FOR_BODY]], !prof [[LP0_0:\!.*]],
-; CHECK-MASKED:  [[VECTOR_BODY:vector\.body]]:
-; CHECK-MASKED:    br i1 [[TMP:%.*]], label [[MIDDLE_BLOCK:%.*]], label %[[VECTOR_BODY]], !prof [[LP1_63:\!.*]],
-; CHECK-MASKED:  [[FOR_BODY:for\.body]]:
-; CHECK-MASKED:    br i1 [[EXITCOND:%.*]], label [[FOR_END_LOOPEXIT:%.*]], label %[[FOR_BODY]], !prof [[LP0_0:\!.*]],
+; CHECK-LABEL: define dso_local void @_Z3foov(
+; CHECK-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK:  [[ENTRY:.*:]]
+; CHECK:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF2:![0-9]+]]
+; CHECK:  [[VECTOR_PH]]:
+; CHECK:    br label %[[VECTOR_BODY:.*]]
+; CHECK:  [[VECTOR_BODY]]:
+; CHECK:    br i1 [[TMP6:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF7:![0-9]+]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK:  [[MIDDLE_BLOCK]]:
+; CHECK:    br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF11:![0-9]+]]
+; CHECK:  [[SCALAR_PH]]:
+; CHECK:    br label %[[FOR_BODY:.*]]
+; CHECK:  [[FOR_COND_CLEANUP]]:
+; CHECK:  [[FOR_BODY]]:
+; CHECK:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF12:![0-9]+]], !llvm.loop [[LOOP13:![0-9]+]]
+;
+; CHECK-IC4-LABEL: define dso_local void @_Z3foov(
+; CHECK-IC4-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-IC4:  [[ENTRY:.*:]]
+; CHECK-IC4:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF2:![0-9]+]]
+; CHECK-IC4:  [[VECTOR_PH]]:
+; CHECK-IC4:    br label %[[VECTOR_BODY:.*]]
+; CHECK-IC4:  [[VECTOR_BODY]]:
+; CHECK-IC4:    br i1 [[TMP18:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF7:![0-9]+]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-IC4:  [[MIDDLE_BLOCK]]:
+; CHECK-IC4:    br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF11:![0-9]+]]
+; CHECK-IC4:  [[SCALAR_PH]]:
+; CHECK-IC4:    br label %[[FOR_BODY:.*]]
+; CHECK-IC4:  [[FOR_COND_CLEANUP]]:
+; CHECK-IC4:  [[FOR_BODY]]:
+; CHECK-IC4:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF12:![0-9]+]], !llvm.loop [[LOOP13:![0-9]+]]
+;
+; CHECK-SCALABLE-LABEL: define dso_local void @_Z3foov(
+; CHECK-SCALABLE-SAME: ) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; CHECK-SCALABLE:  [[ENTRY:.*:]]
+; CHECK-SCALABLE:    br i1 [[MIN_ITERS_CHECK:%.*]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF2:![0-9]+]]
+; CHECK-SCALABLE:  [[VECTOR_PH]]:
+; CHECK-SCALABLE:    br label %[[VECTOR_BODY:.*]]
+; CHECK-SCALABLE:  [[VECTOR_BODY]]:
+; CHECK-SCALABLE:    br i1 [[TMP16:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF7:![0-9]+]], !llvm.loop [[LOOP8:![0-9]+]]
+; CHECK-SCALABLE:  [[MIDDLE_BLOCK]]:
+; CHECK-SCALABLE:    br i1 [[CMP_N:%.*]], label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF11:![0-9]+]]
+; CHECK-SCALABLE:  [[SCALAR_PH]]:
+; CHECK-SCALABLE:    br label %[[FOR_BODY:.*]]
+; CHECK-SCALABLE:  [[FOR_COND_CLEANUP]]:
+; CHECK-SCALABLE:  [[FOR_BODY]]:
+; CHECK-SCALABLE:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF12:![0-9]+]], !llvm.loop [[LOOP13:![0-9]+]]
 ;
 entry:
   br label %for.body
@@ -44,15 +84,53 @@ for.body:                                         ; preds = %for.body, %entry
 ; Check correctness of profile info for vectorization with epilog.
 ; Function Attrs: nofree norecurse nounwind uwtable
 define dso_local void @_Z3foo2v() local_unnamed_addr #0 {
-; CHECK-LABEL: @_Z3foo2v(
-; CHECK:  [[VECTOR_BODY:vector\.body]]:
-; CHECK:    br i1 [[TMP:%.*]], label [[MIDDLE_BLOCK:%.*]], label %[[VECTOR_BODY]], !prof [[LP1_255:\!.*]],
-; CHECK:  [[FOR_BODY:for\.body]]:
-; CHECK:    br i1 [[EXITCOND:%.*]], label [[FOR_END_LOOPEXIT:%.*]], label %[[FOR_BODY]], !prof [[LP1_2:\!.*]],
-; CHECK-MASKED:  [[VECTOR_BODY:vector\.body]]:
-; CHECK-MASKED:    br i1 [[TMP:%.*]], label [[MIDDLE_BLOCK:%.*]], label %[[VECTOR_BODY]], !prof [[LP1_63:\!.*]],
-; CHECK-MASKED:  [[FOR_BODY:for\.body]]:
-; CHECK-MASKED:    br i1 [[EXITCOND:%.*]], label [[FOR_END_LOOPEXIT:%.*]], label %[[FOR_BODY]], !prof [[LP1_2:\!.*]],
+; CHECK-LABEL: define dso_local void @_Z3foo2v(
+; CHECK-SAME: ) local_unnamed_addr #[[ATTR0]] {
+; CHECK:  [[ENTRY:.*:]]
+; CHECK:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF2]]
+; CHECK:  [[VECTOR_PH]]:
+; CHECK:    br label %[[VECTOR_BODY:.*]]
+; CHECK:  [[VECTOR_BODY]]:
+; CHECK:    br i1 [[TMP6:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF7]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK:  [[MIDDLE_BLOCK]]:
+; CHECK:    br i1 false, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF11]]
+; CHECK:  [[SCALAR_PH]]:
+; CHECK:    br label %[[FOR_BODY:.*]]
+; CHECK:  [[FOR_COND_CLEANUP]]:
+; CHECK:  [[FOR_BODY]]:
+; CHECK:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF15:![0-9]+]], !llvm.loop [[LOOP16:![0-9]+]]
+;
+; CHECK-IC4-LABEL: define dso_local void @_Z3foo2v(
+; CHECK-IC4-SAME: ) local_unnamed_addr #[[ATTR0]] {
+; CHECK-IC4:  [[ENTRY:.*:]]
+; CHECK-IC4:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF2]]
+; CHECK-IC4:  [[VECTOR_PH]]:
+; CHECK-IC4:    br label %[[VECTOR_BODY:.*]]
+; CHECK-IC4:  [[VECTOR_BODY]]:
+; CHECK-IC4:    br i1 [[TMP18:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF7]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK-IC4:  [[MIDDLE_BLOCK]]:
+; CHECK-IC4:    br i1 false, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF11]]
+; CHECK-IC4:  [[SCALAR_PH]]:
+; CHECK-IC4:    br label %[[FOR_BODY:.*]]
+; CHECK-IC4:  [[FOR_COND_CLEANUP]]:
+; CHECK-IC4:  [[FOR_BODY]]:
+; CHECK-IC4:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF15:![0-9]+]], !llvm.loop [[LOOP16:![0-9]+]]
+;
+; CHECK-SCALABLE-LABEL: define dso_local void @_Z3foo2v(
+; CHECK-SCALABLE-SAME: ) local_unnamed_addr #[[ATTR0]] {
+; CHECK-SCALABLE:  [[ENTRY:.*:]]
+; CHECK-SCALABLE:    br i1 [[MIN_ITERS_CHECK:%.*]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]], !prof [[PROF2]]
+; CHECK-SCALABLE:  [[VECTOR_PH]]:
+; CHECK-SCALABLE:    br label %[[VECTOR_BODY:.*]]
+; CHECK-SCALABLE:  [[VECTOR_BODY]]:
+; CHECK-SCALABLE:    br i1 [[TMP16:%.*]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !prof [[PROF7]], !llvm.loop [[LOOP14:![0-9]+]]
+; CHECK-SCALABLE:  [[MIDDLE_BLOCK]]:
+; CHECK-SCALABLE:    br i1 [[CMP_N:%.*]], label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]], !prof [[PROF11]]
+; CHECK-SCALABLE:  [[SCALAR_PH]]:
+; CHECK-SCALABLE:    br label %[[FOR_BODY:.*]]
+; CHECK-SCALABLE:  [[FOR_COND_CLEANUP]]:
+; CHECK-SCALABLE:  [[FOR_BODY]]:
+; CHECK-SCALABLE:    br i1 [[EXITCOND:%.*]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF15:![0-9]+]], !llvm.loop [[LOOP16:![0-9]+]]
 ;
 entry:
   br label %for.body
@@ -80,11 +158,6 @@ attributes #0 = { "use-soft-float"="false" }
 !llvm.module.flags = !{!0}
 !llvm.ident = !{!1}
 
-; CHECK: [[LP1_255]] = !{!"branch_weights", i32 1, i32 255}
-; CHECK: [[LP0_0]] = !{!"branch_weights", i32 0, i32 0}
-; CHECK-MASKED: [[LP1_63]] = !{!"branch_weights", i32 1, i32 63}
-; CHECK-MASKED: [[LP0_0]] = !{!"branch_weights", i32 0, i32 0}
-; CHECK: [[LP1_2]] = !{!"branch_weights", i32 1, i32 2}
 
 !0 = !{i32 1, !"wchar_size", i32 4}
 !1 = !{!"clang version 10.0.0 (https://github.com/llvm/llvm-project c292b5b5e059e6ce3e6449e6827ef7e1037c21c4)"}
@@ -94,3 +167,40 @@ attributes #0 = { "use-soft-float"="false" }
 !5 = !{!"Simple C++ TBAA"}
 !6 = !{!"branch_weights", i32 1, i32 1023}
 !7 = !{!"branch_weights", i32 1, i32 1026}
+;.
+; CHECK: [[PROF2]] = !{!"branch_weights", i32 1, i32 127}
+; CHECK: [[PROF7]] = !{!"branch_weights", i32 1, i32 255}
+; CHECK: [[LOOP8]] = distinct !{[[LOOP8]], [[META9:![0-9]+]], [[META10:![0-9]+]]}
+; CHECK: [[META9]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK: [[META10]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK: [[PROF11]] = !{!"branch_weights", i32 1, i32 3}
+; CHECK: [[PROF12]] = !{!"branch_weights", i32 0, i32 0}
+; CHECK: [[LOOP13]] = distinct !{[[LOOP13]], [[META10]], [[META9]]}
+; C...
[truncated]
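
The expected branch weights in the new AArch64 test can be sanity-checked with a little arithmetic. The sketch below is a simplified model and not the actual setProfileInfoAfterUnrolling implementation: it assumes an exit weight of 1, reads weights {1, 1023} as an estimated trip count of 1024, and divides that by the estimated VFxUF. `BranchWeights` and `scaledVectorLoopWeights` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair mirroring !{"branch_weights", i32 Exit, i32 BackEdge}.
struct BranchWeights {
  uint32_t Exit;
  uint32_t BackEdge;
};

// Simplified model (an assumption, not LLVM's code): with an exit weight of
// 1, the loop runs BackEdge + 1 iterations per entry. The vector loop then
// runs that many iterations divided by the estimated VFxUF.
BranchWeights scaledVectorLoopWeights(BranchWeights Orig,
                                      unsigned EstimatedVFxUF) {
  assert(Orig.Exit == 1 && "model only covers an exit weight of 1");
  assert(EstimatedVFxUF >= 1 && "estimate must be at least 1");
  uint32_t OrigTripCount = Orig.BackEdge + 1;
  uint32_t VectorTripCount = OrigTripCount / EstimatedVFxUF;
  assert(VectorTripCount >= 1 && "model assumes at least one vector iteration");
  return {1, VectorTripCount - 1};
}
```

Under this model the vector-body weights in the test correspond to an estimated VFxUF of 8 for CHECK-V1-IC1 ({1, 127}), 4 for CHECK-V2-IC1 ({1, 255}) and 16 for CHECK-V2-IC4 ({1, 63}). The value 8 would be consistent with a known-minimum VF of 4 and the tuning vscale of 2 that the description gives for neoverse-v1, although the chosen VF is not visible in the CHECK lines.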

@lukel97 lukel97 left a comment

LGTM


; Check correctness of profile info for vectorization without epilog.
; Function Attrs: nofree norecurse nounwind uwtable
define dso_local void @_Z3foov() local_unnamed_addr #0 {

Would be good to clean up the test a bit, dropping dso_local, local_unnamed_addr, #0

Contributor Author

I just copied this from the existing test, which means we should clean up the existing test too. Happy to do that in this patch.

br label %for.body

for.body: ; preds = %for.body, %entry
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]

Suggested change
-%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]

for consistency with other, newer tests

Comment on lines 77 to 84
%arrayidx = getelementptr inbounds [1024 x i32], ptr @b, i64 0, i64 %indvars.iv
%0 = load i32, ptr %arrayidx, align 4
%1 = trunc i64 %indvars.iv to i32
%mul = mul nsw i32 %0, %1
%arrayidx2 = getelementptr inbounds [1024 x i32], ptr @a, i64 0, i64 %indvars.iv
%2 = load i32, ptr %arrayidx2, align 4
%add = add nsw i32 %2, %mul
store i32 %add, ptr %arrayidx2, align 4

We don't really care about the body here, right? Might be worth making it a bit simpler, maybe just a store of the IV or a load/store?

 Loop *VectorLoop = LI->getLoopFor(HeaderBB);
-setProfileInfoAfterUnrolling(OrigLoop, VectorLoop, OrigLoop,
-                             VF.getKnownMinValue() * UF);
+unsigned VFxUF = getEstimatedRuntimeVF(VF * UF, Cost->getVScaleForTuning());

Suggested change
-unsigned VFxUF = getEstimatedRuntimeVF(VF * UF, Cost->getVScaleForTuning());
+unsigned EstimatedVFxUF = getEstimatedRuntimeVF(VF * UF, Cost->getVScaleForTuning());

Might be worth updating the name as well?

@b = dso_local global [1024 x i32] zeroinitializer, align 16

; Check correctness of profile info for vectorization without epilog.
; Function Attrs: nofree norecurse nounwind uwtable

Suggested change
-; Function Attrs: nofree norecurse nounwind uwtable

@a = dso_local global [1024 x i32] zeroinitializer, align 16
@b = dso_local global [1024 x i32] zeroinitializer, align 16

; Check correctness of profile info for vectorization without epilog.

Might be good to spell out that we expect the branch weight computations to use vscale = 1 for neoverse-v1 and vscale = 2 for neoverse-v2?


@fhahn fhahn left a comment

LGTM, thanks!

@david-arm david-arm merged commit a75e062 into llvm:main Jun 16, 2025
7 checks passed