[AArch64][SVE] Reduce MaxInterleaveFactor for A510 and A520 #132246
base: main
Conversation
The default MaxInterleaveFactor for AArch64 targets is 2. This produces inefficient codegen on at least two in-order cores, namely Cortex-A510 and Cortex-A520. For example, a simple vector add

```c
void foo(float *a, float *b, float *dst, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    dst[i] = a[i] + b[i];
}
```

vectorizes the inner loop into the following interleaved sequence of instructions

```
add   x12, x1, x10
ld1b  { z0.b }, p0/z, [x1, x10]
add   x13, x2, x10
ld1b  { z1.b }, p0/z, [x2, x10]
ldr   z2, [x12, #1, mul vl]
ldr   z3, [x13, #1, mul vl]
dech  x11
add   x12, x0, x10
fadd  z0.s, z1.s, z0.s
fadd  z1.s, z3.s, z2.s
st1b  { z0.b }, p0, [x0, x10]
addvl x10, x10, #2
str   z1, [x12, #1, mul vl]
```

while reducing MaxInterleaveFactor to 1 gives the following

```
.LBB0_13:                              // %vector.body
                                       // =>This Inner Loop Header: Depth=1
ld1w  { z0.s }, p0/z, [x1, x10, lsl #2]
ld1w  { z1.s }, p0/z, [x2, x10, lsl #2]
fadd  z0.s, z1.s, z0.s
st1w  { z0.s }, p0, [x0, x10, lsl #2]
incw  x10
```

This patch also introduces IR tests to showcase this.

Change-Id: Ie1e862f6a1db851182a95534b3b987feb670d7ca
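For reference, both behaviours can be reproduced with `opt` invocations modelled directly on the RUN lines of the test added by this patch; the input path and every flag below come straight from those RUN lines:

```sh
# New default on Cortex-A510: the loop vectorizer no longer interleaves.
opt < llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll \
    -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve \
    -passes=loop-vectorize -S

# Previous behaviour, forced back for comparison.
opt < llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll \
    -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve \
    -passes=loop-vectorize -force-target-max-vector-interleave=2 -S
```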
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-aarch64

Author: Nashe Mncube (nasherm)

Changes: see the patch description above.

Patch is 30.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/132246.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index bb36af8fce5cc..57ae4dfb71c36 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -181,6 +181,7 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
VScaleForTuning = 1;
PrefLoopAlignment = Align(16);
MaxBytesForLoopAlignment = 8;
+ MaxInterleaveFactor = 1;
break;
case CortexA710:
case CortexA715:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll
new file mode 100644
index 0000000000000..a3bf37726943f
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleave-inorder-core.ll
@@ -0,0 +1,360 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA510-NOINTERLEAVE
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve -passes=loop-vectorize -force-target-max-vector-interleave=2 -S | FileCheck %s --check-prefix=CHECK-CA510-INTERLEAVE
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a520 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA520-NOINTERLEAVE
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a520 -mattr=+sve -passes=loop-vectorize -force-target-max-vector-interleave=2 -S | FileCheck %s --check-prefix=CHECK-CA520-INTERLEAVE
+
+define void @sve_add(ptr %dst, ptr %a, ptr %b, i64 %n) {
+; CHECK-CA510-NOINTERLEAVE-LABEL: define void @sve_add(
+; CHECK-CA510-NOINTERLEAVE-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ENTRY:.*:]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 12, i64 [[TMP1]])
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[VECTOR_MEMCHECK]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP8:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[VECTOR_PH]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP10]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[VECTOR_BODY]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP14:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[TMP13]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP15:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i32 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP15]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP16:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[TMP13]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP17:%.*]] = getelementptr inbounds nuw float, ptr [[TMP16]], i32 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 4 x float>, ptr [[TMP17]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP18:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP19:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[TMP13]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP20:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i32 0
+; CHECK-CA510-NOINTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP18]], ptr [[TMP20]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA510-NOINTERLEAVE: [[MIDDLE_BLOCK]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA510-NOINTERLEAVE: [[SCALAR_PH]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_BODY]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDVARS_IV]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP22:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[TMP23:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ADD:%.*]] = fadd fast float [[TMP23]], [[TMP22]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[INDVARS_IV]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: store float [[ADD]], ptr [[ARRAYIDX4]], align 4
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-CA510-NOINTERLEAVE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; CHECK-CA510-NOINTERLEAVE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: br label %[[FOR_COND_CLEANUP]]
+; CHECK-CA510-NOINTERLEAVE: [[FOR_COND_CLEANUP]]:
+; CHECK-CA510-NOINTERLEAVE-NEXT: ret void
+;
+; CHECK-CA510-INTERLEAVE-LABEL: define void @sve_add(
+; CHECK-CA510-INTERLEAVE-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA510-INTERLEAVE-NEXT: [[ENTRY:.*:]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA510-INTERLEAVE-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA510-INTERLEAVE-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA510-INTERLEAVE-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA510-INTERLEAVE: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 12, i64 [[TMP1]])
+; CHECK-CA510-INTERLEAVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA510-INTERLEAVE: [[VECTOR_MEMCHECK]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP8:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA510-INTERLEAVE: [[VECTOR_PH]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP10]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 8
+; CHECK-CA510-INTERLEAVE-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA510-INTERLEAVE: [[VECTOR_BODY]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP14:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[TMP13]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP15:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i32 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP18:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i64 [[TMP17]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP15]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 4 x float>, ptr [[TMP18]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP19:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[TMP13]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP20:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i32 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP23:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i64 [[TMP22]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD6:%.*]] = load <vscale x 4 x float>, ptr [[TMP20]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[WIDE_LOAD7:%.*]] = load <vscale x 4 x float>, ptr [[TMP23]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP24:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD6]], [[WIDE_LOAD]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP25:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD7]], [[WIDE_LOAD5]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP26:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[TMP13]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP27:%.*]] = getelementptr inbounds nuw float, ptr [[TMP26]], i32 0
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP28:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP29:%.*]] = mul i64 [[TMP28]], 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP30:%.*]] = getelementptr inbounds nuw float, ptr [[TMP26]], i64 [[TMP29]]
+; CHECK-CA510-INTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP24]], ptr [[TMP27]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP25]], ptr [[TMP30]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP31:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[TMP31]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA510-INTERLEAVE: [[MIDDLE_BLOCK]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA510-INTERLEAVE: [[SCALAR_PH]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA510-INTERLEAVE-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA510-INTERLEAVE: [[FOR_BODY]]:
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA510-INTERLEAVE-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDVARS_IV]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP32:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[TMP33:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[ADD:%.*]] = fadd fast float [[TMP33]], [[TMP32]]
+; CHECK-CA510-INTERLEAVE-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[INDVARS_IV]]
+; CHECK-CA510-INTERLEAVE-NEXT: store float [[ADD]], ptr [[ARRAYIDX4]], align 4
+; CHECK-CA510-INTERLEAVE-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-CA510-INTERLEAVE-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; CHECK-CA510-INTERLEAVE-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-CA510-INTERLEAVE: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; CHECK-CA510-INTERLEAVE-NEXT: br label %[[FOR_COND_CLEANUP]]
+; CHECK-CA510-INTERLEAVE: [[FOR_COND_CLEANUP]]:
+; CHECK-CA510-INTERLEAVE-NEXT: ret void
+;
+; CHECK-CA520-NOINTERLEAVE-LABEL: define void @sve_add(
+; CHECK-CA520-NOINTERLEAVE-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[ENTRY:.*:]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 12, i64 [[TMP1]])
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[VECTOR_MEMCHECK]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP6:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP8:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[VECTOR_PH]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP10]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[VECTOR_BODY]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP13:%.*]] = add i64 [[INDEX]], 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP14:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[TMP13]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP15:%.*]] = getelementptr inbounds nuw float, ptr [[TMP14]], i32 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x float>, ptr [[TMP15]], align 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP16:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[TMP13]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP17:%.*]] = getelementptr inbounds nuw float, ptr [[TMP16]], i32 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 4 x float>, ptr [[TMP17]], align 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP18:%.*]] = fadd fast <vscale x 4 x float> [[WIDE_LOAD5]], [[WIDE_LOAD]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP19:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[TMP13]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP20:%.*]] = getelementptr inbounds nuw float, ptr [[TMP19]], i32 0
+; CHECK-CA520-NOINTERLEAVE-NEXT: store <vscale x 4 x float> [[TMP18]], ptr [[TMP20]], align 4
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP12]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[TMP21:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[TMP21]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA520-NOINTERLEAVE: [[MIDDLE_BLOCK]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA520-NOINTERLEAVE: [[SCALAR_PH]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA520-NOINTERLEAVE-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA520-NOINTERLEAVE: [[FOR_BODY]]:
+; CHECK-CA520-NOINTERLEAVE-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA520-NOINTERLEAVE...
[truncated]
I would expect in-order cores to like unrolling, as it enables more hiding of latency hazards. (There are always cases, depending on the trip count of the loop or the overheads, where it ends up making things worse, but I would expect at least some level of interleaving to usually be useful overall.) It looks from the example code that the addressing modes in the loops are not doing very well. They are usually calculated in LSR. Could they do better, and then get the benefit of interleaving without the cost of the inefficient addressing-mode calculations?
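To make the addressing-mode point concrete, here are the two forms side by side, lifted from the codegen quoted in the patch description (no new codegen, just the relevant lines):

```
// Interleaved body: the second part's address is materialised with a
// separate add feeding an unpredicated ldr.
add  x12, x1, x10
ldr  z2, [x12, #1, mul vl]

// Non-interleaved body: the reg+reg addressing mode folds into the load.
ld1w { z0.s }, p0/z, [x1, x10, lsl #2]
```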
Also - I think this controls Neon too, and Neon will prefer interleaving at least a bit to make use of LDP/STP.
The other alternative might be to prefer fixed-width over scalable vectors when the costs in the vectorizer are equal, if that is more beneficial on these cores. That is controlled via the FeatureUseFixedOverScalableIfEqualCost subtarget feature.
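For illustration, a minimal sketch of that alternative: adding the feature to a core's tuning features in AArch64Processors.td. Only FeatureUseFixedOverScalableIfEqualCost (named above) is the point here; the def shape and the other listed features are placeholders, not the actual in-tree definition:

```
// Hypothetical tuning entry for Cortex-A510; the surrounding feature
// list is illustrative only.
def TuneA510 : SubtargetFeature<"a510", "ARMProcFamily", "CortexA510",
                                "Cortex-A510 ARM processors",
                                [FeaturePostRAScheduler,
                                 FeatureUseFixedOverScalableIfEqualCost]>;
```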