[RISCV] Set AllocationPriority in line with LMUL #131176
Conversation
This mechanism causes the greedy register allocator to prefer allocating register classes with higher priority first. This helps to ensure that high LMUL registers obtain a register without having to go through the eviction mechanism. In practice, it seems to cause a bunch of code churn, and some minor improvement around widening and narrowing operations.

In a few of the widening tests, we have what look like code size regressions because we end up with two smaller register class copies instead of one larger one after the instruction. However, in any larger code sequence, these are likely to be folded into the producing instructions. (But so were the wider copies after the operation.)

Two observations:

1) We're not setting the greedy-regclass-priority-trumps-globalness flag on the register class, so this doesn't help long mask ranges. I thought about doing that, but the benefit is non-obvious, so I decided it was worth a separate change at minimum.

2) We could arguably set the priority higher for the register classes that exclude v0. I tried that, and it caused a whole bunch of further churn. I may return to it in a separate patch.
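For context, the knob being set in this patch is the AllocationPriority field that TableGen's RegisterClass exposes (in llvm/include/llvm/Target/Target.td); the greedy allocator ranks live ranges from higher-priority classes ahead of lower-priority ones. Below is a trimmed, paraphrased sketch of that hook, not a verbatim excerpt: the comment wording is mine, and the per-class "trumps globalness" setting mentioned in observation 1 is a separate hook that isn't shown.

```tablegen
// Paraphrased, heavily trimmed view of Target.td's RegisterClass;
// most fields and template arguments are omitted.
class RegisterClass<string namespace, list<ValueType> regTypes, int alignment,
                    dag regList> {
  // ...
  // Live ranges from classes with a higher AllocationPriority are allocated
  // first by the greedy register allocator; 0 is the default.  The RISC-V
  // vector classes in the diff below override this per LMUL.
  int AllocationPriority = 0;
  // ...
}
```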
@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes: (same as the description above)
Patch is 1.47 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/131176.diff 179 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.td b/llvm/lib/Target/RISCV/RISCVRegisterInfo.td
index a5dfb5ba1a2fc..1e0541e667895 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.td
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.td
@@ -752,18 +752,24 @@ def VR : VReg<!listconcat(VM1VTs, VMaskVTs),
def VRNoV0 : VReg<!listconcat(VM1VTs, VMaskVTs), (sub VR, V0), 1>;
+let AllocationPriority = 2 in
def VRM2 : VReg<VM2VTs, (add (sequence "V%uM2", 8, 31, 2),
(sequence "V%uM2", 6, 0, 2)), 2>;
+let AllocationPriority = 2 in
def VRM2NoV0 : VReg<VM2VTs, (sub VRM2, V0M2), 2>;
+let AllocationPriority = 4 in
def VRM4 : VReg<VM4VTs, (add V8M4, V12M4, V16M4, V20M4,
V24M4, V28M4, V4M4, V0M4), 4>;
+let AllocationPriority = 4 in
def VRM4NoV0 : VReg<VM4VTs, (sub VRM4, V0M4), 4>;
+let AllocationPriority = 8 in
def VRM8 : VReg<VM8VTs, (add V8M8, V16M8, V24M8, V0M8), 8>;
+let AllocationPriority = 8 in
def VRM8NoV0 : VReg<VM8VTs, (sub VRM8, V0M8), 8>;
def VMV0 : VReg<VMaskVTs, (add V0), 1>;
diff --git a/llvm/test/CodeGen/RISCV/redundant-copy-from-tail-duplicate.ll b/llvm/test/CodeGen/RISCV/redundant-copy-from-tail-duplicate.ll
index 5d588ad66b9ca..15b5698c22e81 100644
--- a/llvm/test/CodeGen/RISCV/redundant-copy-from-tail-duplicate.ll
+++ b/llvm/test/CodeGen/RISCV/redundant-copy-from-tail-duplicate.ll
@@ -20,10 +20,10 @@ define signext i32 @sum(ptr %a, i32 signext %n, i1 %prof.min.iters.check, <vscal
; CHECK-NEXT: ret
; CHECK-NEXT: .LBB0_4: # %vector.ph
; CHECK-NEXT: vsetivli zero, 1, e32, m1, ta, ma
-; CHECK-NEXT: vmv.s.x v8, zero
-; CHECK-NEXT: vmv.v.i v12, 0
+; CHECK-NEXT: vmv.s.x v12, zero
+; CHECK-NEXT: vmv.v.i v8, 0
; CHECK-NEXT: vsetivli zero, 1, e32, m4, ta, ma
-; CHECK-NEXT: vredsum.vs v8, v12, v8, v0.t
+; CHECK-NEXT: vredsum.vs v8, v8, v12, v0.t
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: ret
entry:
diff --git a/llvm/test/CodeGen/RISCV/rvv/active_lane_mask.ll b/llvm/test/CodeGen/RISCV/rvv/active_lane_mask.ll
index 4ade6c09fe43d..ec422a8fbb928 100644
--- a/llvm/test/CodeGen/RISCV/rvv/active_lane_mask.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/active_lane_mask.ll
@@ -106,12 +106,12 @@ define <32 x i1> @fv32(ptr %p, i64 %index, i64 %tc) {
; CHECK-NEXT: lui a0, %hi(.LCPI8_0)
; CHECK-NEXT: addi a0, a0, %lo(.LCPI8_0)
; CHECK-NEXT: vsetivli zero, 16, e64, m8, ta, ma
-; CHECK-NEXT: vle8.v v8, (a0)
-; CHECK-NEXT: vid.v v16
-; CHECK-NEXT: vsaddu.vx v16, v16, a1
-; CHECK-NEXT: vmsltu.vx v0, v16, a2
-; CHECK-NEXT: vsext.vf8 v16, v8
-; CHECK-NEXT: vsaddu.vx v8, v16, a1
+; CHECK-NEXT: vle8.v v16, (a0)
+; CHECK-NEXT: vid.v v8
+; CHECK-NEXT: vsaddu.vx v8, v8, a1
+; CHECK-NEXT: vmsltu.vx v0, v8, a2
+; CHECK-NEXT: vsext.vf8 v8, v16
+; CHECK-NEXT: vsaddu.vx v8, v8, a1
; CHECK-NEXT: vmsltu.vx v16, v8, a2
; CHECK-NEXT: vsetivli zero, 4, e8, mf4, ta, ma
; CHECK-NEXT: vslideup.vi v0, v16, 2
diff --git a/llvm/test/CodeGen/RISCV/rvv/combine-store-extract-crash.ll b/llvm/test/CodeGen/RISCV/rvv/combine-store-extract-crash.ll
index 482cf83d540c4..496755738e6fa 100644
--- a/llvm/test/CodeGen/RISCV/rvv/combine-store-extract-crash.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/combine-store-extract-crash.ll
@@ -9,21 +9,21 @@ define void @test(ptr %ref_array, ptr %sad_array) {
; RV32: # %bb.0: # %entry
; RV32-NEXT: th.lwd a2, a3, (a0), 0, 3
; RV32-NEXT: vsetivli zero, 4, e8, mf4, ta, ma
-; RV32-NEXT: vle8.v v8, (a2)
+; RV32-NEXT: vle8.v v12, (a2)
; RV32-NEXT: vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT: vzext.vf4 v12, v8
-; RV32-NEXT: vmv.s.x v8, zero
-; RV32-NEXT: vredsum.vs v9, v12, v8
-; RV32-NEXT: vmv.x.s a0, v9
+; RV32-NEXT: vzext.vf4 v8, v12
+; RV32-NEXT: vmv.s.x v12, zero
+; RV32-NEXT: vredsum.vs v8, v8, v12
+; RV32-NEXT: vmv.x.s a0, v8
; RV32-NEXT: th.swia a0, (a1), 4, 0
; RV32-NEXT: vsetivli zero, 4, e8, mf4, ta, ma
-; RV32-NEXT: vle8.v v9, (a3)
-; RV32-NEXT: vmv.v.i v10, 0
+; RV32-NEXT: vle8.v v13, (a3)
+; RV32-NEXT: vmv.v.i v8, 0
; RV32-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
-; RV32-NEXT: vslideup.vi v9, v10, 4
+; RV32-NEXT: vslideup.vi v13, v8, 4
; RV32-NEXT: vsetivli zero, 16, e32, m4, ta, ma
-; RV32-NEXT: vzext.vf4 v12, v9
-; RV32-NEXT: vredsum.vs v8, v12, v8
+; RV32-NEXT: vzext.vf4 v8, v13
+; RV32-NEXT: vredsum.vs v8, v8, v12
; RV32-NEXT: vsetivli zero, 1, e32, m1, ta, ma
; RV32-NEXT: vse32.v v8, (a1)
; RV32-NEXT: ret
@@ -32,21 +32,21 @@ define void @test(ptr %ref_array, ptr %sad_array) {
; RV64: # %bb.0: # %entry
; RV64-NEXT: th.ldd a2, a3, (a0), 0, 4
; RV64-NEXT: vsetivli zero, 4, e8, mf4, ta, ma
-; RV64-NEXT: vle8.v v8, (a2)
+; RV64-NEXT: vle8.v v12, (a2)
; RV64-NEXT: vsetivli zero, 16, e32, m4, ta, ma
-; RV64-NEXT: vzext.vf4 v12, v8
-; RV64-NEXT: vmv.s.x v8, zero
-; RV64-NEXT: vredsum.vs v9, v12, v8
-; RV64-NEXT: vmv.x.s a0, v9
+; RV64-NEXT: vzext.vf4 v8, v12
+; RV64-NEXT: vmv.s.x v12, zero
+; RV64-NEXT: vredsum.vs v8, v8, v12
+; RV64-NEXT: vmv.x.s a0, v8
; RV64-NEXT: th.swia a0, (a1), 4, 0
; RV64-NEXT: vsetivli zero, 4, e8, mf4, ta, ma
-; RV64-NEXT: vle8.v v9, (a3)
-; RV64-NEXT: vmv.v.i v10, 0
+; RV64-NEXT: vle8.v v13, (a3)
+; RV64-NEXT: vmv.v.i v8, 0
; RV64-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
-; RV64-NEXT: vslideup.vi v9, v10, 4
+; RV64-NEXT: vslideup.vi v13, v8, 4
; RV64-NEXT: vsetivli zero, 16, e32, m4, ta, ma
-; RV64-NEXT: vzext.vf4 v12, v9
-; RV64-NEXT: vredsum.vs v8, v12, v8
+; RV64-NEXT: vzext.vf4 v8, v13
+; RV64-NEXT: vredsum.vs v8, v8, v12
; RV64-NEXT: vsetivli zero, 1, e32, m1, ta, ma
; RV64-NEXT: vse32.v v8, (a1)
; RV64-NEXT: ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/common-shuffle-patterns.ll b/llvm/test/CodeGen/RISCV/rvv/common-shuffle-patterns.ll
index 1845c0e4bd3b6..7649d9ad6059f 100644
--- a/llvm/test/CodeGen/RISCV/rvv/common-shuffle-patterns.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/common-shuffle-patterns.ll
@@ -8,10 +8,11 @@ define dso_local <16 x i16> @interleave(<8 x i16> %v0, <8 x i16> %v1) {
; CHECK-LABEL: interleave:
; CHECK: # %bb.0: # %entry
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
-; CHECK-NEXT: vwaddu.vv v10, v8, v9
+; CHECK-NEXT: vmv1r.v v10, v9
+; CHECK-NEXT: vmv1r.v v11, v8
+; CHECK-NEXT: vwaddu.vv v8, v11, v10
; CHECK-NEXT: li a0, -1
-; CHECK-NEXT: vwmaccu.vx v10, a0, v9
-; CHECK-NEXT: vmv2r.v v8, v10
+; CHECK-NEXT: vwmaccu.vx v8, a0, v10
; CHECK-NEXT: ret
entry:
%v2 = shufflevector <8 x i16> %v0, <8 x i16> poison, <16 x i32> <i32 0, i32 undef, i32 1, i32 undef, i32 2, i32 undef, i32 3, i32 undef, i32 4, i32 undef, i32 5, i32 undef, i32 6, i32 undef, i32 7, i32 undef>
diff --git a/llvm/test/CodeGen/RISCV/rvv/compressstore.ll b/llvm/test/CodeGen/RISCV/rvv/compressstore.ll
index 61fb457a7eb65..69822e9d9d2e3 100644
--- a/llvm/test/CodeGen/RISCV/rvv/compressstore.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/compressstore.ll
@@ -200,12 +200,12 @@ define void @test_compresstore_v256i8(ptr %p, <256 x i1> %mask, <256 x i8> %data
; RV64-NEXT: vsetivli zero, 1, e64, m1, ta, ma
; RV64-NEXT: vmv1r.v v7, v8
; RV64-NEXT: li a2, 128
-; RV64-NEXT: vslidedown.vi v9, v0, 1
+; RV64-NEXT: vslidedown.vi v8, v0, 1
; RV64-NEXT: vmv.x.s a3, v0
; RV64-NEXT: vsetvli zero, a2, e8, m8, ta, ma
; RV64-NEXT: vle8.v v24, (a1)
; RV64-NEXT: vsetvli zero, a2, e64, m1, ta, ma
-; RV64-NEXT: vmv.x.s a1, v9
+; RV64-NEXT: vmv.x.s a1, v8
; RV64-NEXT: vsetvli zero, a2, e8, m8, ta, ma
; RV64-NEXT: vcompress.vm v8, v16, v0
; RV64-NEXT: vcpop.m a4, v0
@@ -227,14 +227,14 @@ define void @test_compresstore_v256i8(ptr %p, <256 x i1> %mask, <256 x i8> %data
; RV32-NEXT: vsetivli zero, 1, e64, m1, ta, ma
; RV32-NEXT: vmv1r.v v7, v8
; RV32-NEXT: li a2, 128
-; RV32-NEXT: vslidedown.vi v9, v0, 1
+; RV32-NEXT: vslidedown.vi v8, v0, 1
; RV32-NEXT: li a3, 32
; RV32-NEXT: vmv.x.s a4, v0
; RV32-NEXT: vsetvli zero, a2, e8, m8, ta, ma
; RV32-NEXT: vle8.v v24, (a1)
; RV32-NEXT: vsetivli zero, 1, e64, m1, ta, ma
-; RV32-NEXT: vsrl.vx v6, v9, a3
-; RV32-NEXT: vmv.x.s a1, v9
+; RV32-NEXT: vsrl.vx v6, v8, a3
+; RV32-NEXT: vmv.x.s a1, v8
; RV32-NEXT: vsrl.vx v5, v0, a3
; RV32-NEXT: vsetvli zero, a2, e8, m8, ta, ma
; RV32-NEXT: vcompress.vm v8, v16, v0
@@ -438,16 +438,16 @@ define void @test_compresstore_v128i16(ptr %p, <128 x i1> %mask, <128 x i16> %da
; RV64-NEXT: vcompress.vm v24, v8, v0
; RV64-NEXT: vcpop.m a2, v0
; RV64-NEXT: vsetivli zero, 8, e8, m1, ta, ma
-; RV64-NEXT: vslidedown.vi v8, v0, 8
+; RV64-NEXT: vslidedown.vi v7, v0, 8
; RV64-NEXT: vsetvli zero, a1, e16, m8, ta, ma
-; RV64-NEXT: vcompress.vm v0, v16, v8
-; RV64-NEXT: vcpop.m a1, v8
+; RV64-NEXT: vcompress.vm v8, v16, v7
+; RV64-NEXT: vcpop.m a1, v7
; RV64-NEXT: vsetvli zero, a2, e16, m8, ta, ma
; RV64-NEXT: vse16.v v24, (a0)
; RV64-NEXT: slli a2, a2, 1
; RV64-NEXT: add a0, a0, a2
; RV64-NEXT: vsetvli zero, a1, e16, m8, ta, ma
-; RV64-NEXT: vse16.v v0, (a0)
+; RV64-NEXT: vse16.v v8, (a0)
; RV64-NEXT: ret
;
; RV32-LABEL: test_compresstore_v128i16:
@@ -635,16 +635,16 @@ define void @test_compresstore_v64i32(ptr %p, <64 x i1> %mask, <64 x i32> %data)
; RV64-NEXT: vsetvli zero, a2, e32, m8, ta, ma
; RV64-NEXT: vse32.v v24, (a0)
; RV64-NEXT: vsetivli zero, 4, e8, mf2, ta, ma
-; RV64-NEXT: vslidedown.vi v8, v0, 4
+; RV64-NEXT: vslidedown.vi v24, v0, 4
; RV64-NEXT: vsetvli zero, a1, e32, m8, ta, ma
; RV64-NEXT: vmv.x.s a1, v0
-; RV64-NEXT: vcompress.vm v24, v16, v8
-; RV64-NEXT: vcpop.m a2, v8
+; RV64-NEXT: vcompress.vm v8, v16, v24
+; RV64-NEXT: vcpop.m a2, v24
; RV64-NEXT: cpopw a1, a1
; RV64-NEXT: slli a1, a1, 2
; RV64-NEXT: add a0, a0, a1
; RV64-NEXT: vsetvli zero, a2, e32, m8, ta, ma
-; RV64-NEXT: vse32.v v24, (a0)
+; RV64-NEXT: vse32.v v8, (a0)
; RV64-NEXT: ret
;
; RV32-LABEL: test_compresstore_v64i32:
@@ -654,16 +654,16 @@ define void @test_compresstore_v64i32(ptr %p, <64 x i1> %mask, <64 x i32> %data)
; RV32-NEXT: vcompress.vm v24, v8, v0
; RV32-NEXT: vcpop.m a2, v0
; RV32-NEXT: vsetivli zero, 4, e8, mf2, ta, ma
-; RV32-NEXT: vslidedown.vi v8, v0, 4
+; RV32-NEXT: vslidedown.vi v7, v0, 4
; RV32-NEXT: vsetvli zero, a1, e32, m8, ta, ma
-; RV32-NEXT: vcompress.vm v0, v16, v8
-; RV32-NEXT: vcpop.m a1, v8
+; RV32-NEXT: vcompress.vm v8, v16, v7
+; RV32-NEXT: vcpop.m a1, v7
; RV32-NEXT: vsetvli zero, a2, e32, m8, ta, ma
; RV32-NEXT: vse32.v v24, (a0)
; RV32-NEXT: slli a2, a2, 2
; RV32-NEXT: add a0, a0, a2
; RV32-NEXT: vsetvli zero, a1, e32, m8, ta, ma
-; RV32-NEXT: vse32.v v0, (a0)
+; RV32-NEXT: vse32.v v8, (a0)
; RV32-NEXT: ret
entry:
tail call void @llvm.masked.compressstore.v64i32(<64 x i32> %data, ptr align 4 %p, <64 x i1> %mask)
@@ -796,18 +796,18 @@ define void @test_compresstore_v32i64(ptr %p, <32 x i1> %mask, <32 x i64> %data)
; RV64-NEXT: vsetvli zero, a1, e64, m8, ta, ma
; RV64-NEXT: vse64.v v24, (a0)
; RV64-NEXT: vsetivli zero, 2, e8, mf4, ta, ma
-; RV64-NEXT: vslidedown.vi v8, v0, 2
+; RV64-NEXT: vslidedown.vi v24, v0, 2
; RV64-NEXT: vsetvli zero, zero, e16, mf2, ta, ma
; RV64-NEXT: vmv.x.s a1, v0
; RV64-NEXT: vsetivli zero, 16, e64, m8, ta, ma
-; RV64-NEXT: vcompress.vm v24, v16, v8
+; RV64-NEXT: vcompress.vm v8, v16, v24
; RV64-NEXT: zext.h a1, a1
; RV64-NEXT: cpopw a1, a1
; RV64-NEXT: slli a1, a1, 3
; RV64-NEXT: add a0, a0, a1
-; RV64-NEXT: vcpop.m a1, v8
+; RV64-NEXT: vcpop.m a1, v24
; RV64-NEXT: vsetvli zero, a1, e64, m8, ta, ma
-; RV64-NEXT: vse64.v v24, (a0)
+; RV64-NEXT: vse64.v v8, (a0)
; RV64-NEXT: ret
;
; RV32-LABEL: test_compresstore_v32i64:
@@ -818,18 +818,18 @@ define void @test_compresstore_v32i64(ptr %p, <32 x i1> %mask, <32 x i64> %data)
; RV32-NEXT: vsetvli zero, a1, e64, m8, ta, ma
; RV32-NEXT: vse64.v v24, (a0)
; RV32-NEXT: vsetivli zero, 2, e8, mf4, ta, ma
-; RV32-NEXT: vslidedown.vi v8, v0, 2
+; RV32-NEXT: vslidedown.vi v24, v0, 2
; RV32-NEXT: vsetvli zero, zero, e16, mf2, ta, ma
; RV32-NEXT: vmv.x.s a1, v0
; RV32-NEXT: vsetivli zero, 16, e64, m8, ta, ma
-; RV32-NEXT: vcompress.vm v24, v16, v8
+; RV32-NEXT: vcompress.vm v8, v16, v24
; RV32-NEXT: zext.h a1, a1
; RV32-NEXT: cpop a1, a1
; RV32-NEXT: slli a1, a1, 3
; RV32-NEXT: add a0, a0, a1
-; RV32-NEXT: vcpop.m a1, v8
+; RV32-NEXT: vcpop.m a1, v24
; RV32-NEXT: vsetvli zero, a1, e64, m8, ta, ma
-; RV32-NEXT: vse64.v v24, (a0)
+; RV32-NEXT: vse64.v v8, (a0)
; RV32-NEXT: ret
entry:
tail call void @llvm.masked.compressstore.v32i64(<32 x i64> %data, ptr align 8 %p, <32 x i1> %mask)
diff --git a/llvm/test/CodeGen/RISCV/rvv/ctlz-sdnode.ll b/llvm/test/CodeGen/RISCV/rvv/ctlz-sdnode.ll
index 208735b18cbab..97e1a7f41b92f 100644
--- a/llvm/test/CodeGen/RISCV/rvv/ctlz-sdnode.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/ctlz-sdnode.ll
@@ -162,12 +162,12 @@ define <vscale x 4 x i8> @ctlz_nxv4i8(<vscale x 4 x i8> %va) {
; CHECK-F-LABEL: ctlz_nxv4i8:
; CHECK-F: # %bb.0:
; CHECK-F-NEXT: vsetvli a0, zero, e16, m1, ta, ma
-; CHECK-F-NEXT: vzext.vf2 v9, v8
+; CHECK-F-NEXT: vzext.vf2 v10, v8
; CHECK-F-NEXT: li a0, 134
-; CHECK-F-NEXT: vfwcvt.f.xu.v v10, v9
-; CHECK-F-NEXT: vnsrl.wi v8, v10, 23
+; CHECK-F-NEXT: vfwcvt.f.xu.v v8, v10
+; CHECK-F-NEXT: vnsrl.wi v10, v8, 23
; CHECK-F-NEXT: vsetvli zero, zero, e8, mf2, ta, ma
-; CHECK-F-NEXT: vnsrl.wi v8, v8, 0
+; CHECK-F-NEXT: vnsrl.wi v8, v10, 0
; CHECK-F-NEXT: vrsub.vx v8, v8, a0
; CHECK-F-NEXT: li a0, 8
; CHECK-F-NEXT: vminu.vx v8, v8, a0
@@ -176,12 +176,12 @@ define <vscale x 4 x i8> @ctlz_nxv4i8(<vscale x 4 x i8> %va) {
; CHECK-D-LABEL: ctlz_nxv4i8:
; CHECK-D: # %bb.0:
; CHECK-D-NEXT: vsetvli a0, zero, e16, m1, ta, ma
-; CHECK-D-NEXT: vzext.vf2 v9, v8
+; CHECK-D-NEXT: vzext.vf2 v10, v8
; CHECK-D-NEXT: li a0, 134
-; CHECK-D-NEXT: vfwcvt.f.xu.v v10, v9
-; CHECK-D-NEXT: vnsrl.wi v8, v10, 23
+; CHECK-D-NEXT: vfwcvt.f.xu.v v8, v10
+; CHECK-D-NEXT: vnsrl.wi v10, v8, 23
; CHECK-D-NEXT: vsetvli zero, zero, e8, mf2, ta, ma
-; CHECK-D-NEXT: vnsrl.wi v8, v8, 0
+; CHECK-D-NEXT: vnsrl.wi v8, v10, 0
; CHECK-D-NEXT: vrsub.vx v8, v8, a0
; CHECK-D-NEXT: li a0, 8
; CHECK-D-NEXT: vminu.vx v8, v8, a0
@@ -225,13 +225,13 @@ define <vscale x 8 x i8> @ctlz_nxv8i8(<vscale x 8 x i8> %va) {
; CHECK-F-LABEL: ctlz_nxv8i8:
; CHECK-F: # %bb.0:
; CHECK-F-NEXT: vsetvli a0, zero, e16, m2, ta, ma
-; CHECK-F-NEXT: vzext.vf2 v10, v8
+; CHECK-F-NEXT: vzext.vf2 v12, v8
; CHECK-F-NEXT: li a0, 134
-; CHECK-F-NEXT: vfwcvt.f.xu.v v12, v10
-; CHECK-F-NEXT: vnsrl.wi v8, v12, 23
+; CHECK-F-NEXT: vfwcvt.f.xu.v v8, v12
+; CHECK-F-NEXT: vnsrl.wi v12, v8, 23
; CHECK-F-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-F-NEXT: vnsrl.wi v10, v8, 0
-; CHECK-F-NEXT: vrsub.vx v8, v10, a0
+; CHECK-F-NEXT: vnsrl.wi v8, v12, 0
+; CHECK-F-NEXT: vrsub.vx v8, v8, a0
; CHECK-F-NEXT: li a0, 8
; CHECK-F-NEXT: vminu.vx v8, v8, a0
; CHECK-F-NEXT: ret
@@ -239,13 +239,13 @@ define <vscale x 8 x i8> @ctlz_nxv8i8(<vscale x 8 x i8> %va) {
; CHECK-D-LABEL: ctlz_nxv8i8:
; CHECK-D: # %bb.0:
; CHECK-D-NEXT: vsetvli a0, zero, e16, m2, ta, ma
-; CHECK-D-NEXT: vzext.vf2 v10, v8
+; CHECK-D-NEXT: vzext.vf2 v12, v8
; CHECK-D-NEXT: li a0, 134
-; CHECK-D-NEXT: vfwcvt.f.xu.v v12, v10
-; CHECK-D-NEXT: vnsrl.wi v8, v12, 23
+; CHECK-D-NEXT: vfwcvt.f.xu.v v8, v12
+; CHECK-D-NEXT: vnsrl.wi v12, v8, 23
; CHECK-D-NEXT: vsetvli zero, zero, e8, m1, ta, ma
-; CHECK-D-NEXT: vnsrl.wi v10, v8, 0
-; CHECK-D-NEXT: vrsub.vx v8, v10, a0
+; CHECK-D-NEXT: vnsrl.wi v8, v12, 0
+; CHECK-D-NEXT: vrsub.vx v8, v8, a0
; CHECK-D-NEXT: li a0, 8
; CHECK-D-NEXT: vminu.vx v8, v8, a0
; CHECK-D-NEXT: ret
@@ -288,13 +288,13 @@ define <vscale x 16 x i8> @ctlz_nxv16i8(<vscale x 16 x i8> %va) {
; CHECK-F-LABEL: ctlz_nxv16i8:
; CHECK-F: # %bb.0:
; CHECK-F-NEXT: vsetvli a0, zero, e16, m4, ta, ma
-; CHECK-F-NEXT: vzext.vf2 v12, v8
+; CHECK-F-NEXT: vzext.vf2 v16, v8
; CHECK-F-NEXT: li a0, 134
-; CHECK-F-NEXT: vfwcvt.f.xu.v v16, v12
-; CHECK-F-NEXT: vnsrl.wi v8, v16, 23
+; CHECK-F-NEXT: vfwcvt.f.xu.v v8, v16
+; CHECK-F-NEXT: vnsrl.wi v16, v8, 23
; CHECK-F-NEXT: vsetvli zero, zero, e8, m2, ta, ma
-; CHECK-F-NEXT: vnsrl.wi v12, v8, 0
-; CHECK-F-NEXT: vrsub.vx v8, v12, a0
+; CHECK-F-NEXT: vnsrl.wi v8, v16, 0
+; CHECK-F-NEXT: vrsub.vx v8, v8, a0
; CHECK-F-NEXT: li a0, 8
; CHECK-F-NEXT: vminu.vx v8, v8, a0
; CHECK-F-NEXT: ret
@@ -302,13 +302,13 @@ define <vscale x 16 x i8> @ctlz_nxv16i8(<vscale x 16 x i8> %va) {
; CHECK-D-LABEL: ctlz_nxv16i8:
; CHECK-D: # %bb.0:
; CHECK-D-NEXT: vsetvli a0, zero, e16, m4, ta, ma
-; CHECK-D-NEXT: vzext.vf2 v12, v8
+; CHECK-D-NEXT: vzext.vf2 v16, v8
; CHECK-D-NEXT: li a0, 134
-; CHECK-D-NEXT: vfwcvt.f.xu.v v16, v12
-; CHECK-D-NEXT: vnsrl.wi v8, v16, 23
+; CHECK-D-NEXT: vfwcvt.f.xu.v v8, v16
+; CHECK-D-NEXT: vnsrl.wi v16, v8, 23
; CHECK-D-NEXT: vsetvli zero, zero, e8, m2, ta, ma
-; CHECK-D-NEXT: vnsrl.wi v12, v8, 0
-; CHECK-D-NEXT: vrsub.vx v8, v12, a0
+; CHECK-D-NEXT: vnsrl.wi v8, v16, 0
+; CHECK-D-NEXT: vrsub.vx v8, v8, a0
; CHECK-D-NEXT: li a0, 8
; CHECK-D-NEXT: vminu.vx v8, v8, a0
; CHECK-D-NEXT: ret
@@ -1375,12 +1375,12 @@ define <vscale x 2 x i64> @ctlz_nxv2i64(<vscale x 2 x i64> %va) {
; CHECK-F-NEXT: fsrmi a1, 1
; CHECK-F-NEXT: vsetvli a2, zero, e32, m1, ta, ma
; CHECK-F-NEXT: vfncvt.f.xu.w v10, v8
-; CHECK-F-NEXT: vmv.v.x v8, a0
-; CHECK-F-NEXT: vsrl.vi v9, v10, 23
-; CHECK-F-NEXT: vwsubu.vv v10, v8, v9
+; CHECK-F-NEXT: vmv.v.x v11, a0
+; CHECK-F-NEXT: vsrl.vi v10, v10, 23
+; CHECK-F-NEXT: vwsubu.vv v8, v11, v10
; CHECK-F-NEXT: li a0, 64
; CHECK-F-NEXT: vsetvli zero, zero, e64, m2, ta, ma
-; CHECK-F-NEXT: vminu.vx v8, v10, a0
+; CHECK-F-NEXT: vminu.vx v8, v8, a0
; CHECK-F-NEXT: fsrm a1
; CHECK-F-NEXT: ret
;
@@ -1515,12 +1515,12 @@ define <vscale x 4 x i64> @ctlz_nxv4i64(<vscale x 4 x i64> %va) {
; CHECK-F-NEXT: fsrmi a1, 1
; CHECK-F-NEXT: vsetvli a2, zero, e32, m2, ta, ma
; CHECK-F-NEXT: vfncvt.f.xu.w v12, v8
-; CHECK-F-NEXT: vmv.v.x v8, a0
-; CHECK-F-NEXT: vsrl.vi v10, v12, 23
-; CHECK-F-NEXT: vwsubu.vv v12, v8, v10
+; CHECK-F-NEXT: vmv.v.x v14, a0
+; CHECK-F-NEXT: vsrl.vi v12, v12, 23
+; CHECK-F-NEXT: vwsubu.vv v8, v14, v12
; CHECK-F-NEXT: li a0, 64
; CHECK-F-NEXT: vsetvli zero, zero, e64, m4, ta, ma
-; CHECK-F-NEXT: vminu.vx v8, v12, a0
+; CHECK-F-NEXT: vminu.vx v8, v8, a0
; CHECK-F-NEXT: fsrm a1
; CHECK-F-NEXT: ret
;
@@ -1655,12 +1655,12 @@ define <vscale x 8 x i64> @ctlz_nxv8i64(<vscale x 8 x i64> %va) {
; CHECK-F-NEXT: fsrmi a1, 1
; CHECK-F-NEXT: vsetvli a2, zero, e32, m4, ta, ma
; CHECK-F-NEXT: vfncvt.f.xu.w v16, v8
-; CHECK-F-NEXT: vmv.v.x v8, a0
-; CHECK-F-NEXT: vsrl.vi v12, v16, 23
-; CHECK-F-NEXT: vwsubu.vv v16, v8, v12
+; CHECK-F-NEXT: vmv.v.x v20, a0
+; CHECK-F-NEXT: vsrl.vi v16, v16, 23
+; CHECK-F-NEXT: vwsubu.vv v8, v20, v16
; CHECK-F-NEXT: li a0, 64
; CHECK-F-NEXT: vsetvli zero, zero, e64, m8, ta, ma
-; CHECK-F-NEXT: vminu.vx v8, v16, a0
+; CHECK-F-NEXT: vminu.vx v8, v8, a0
; CHECK-F-NEXT: fsrm a1
; CHECK-F-NEXT: ret
;
@@ -1832,11 +1832,11 @@ define <vscale x 4 x i8> @ctlz_zero_undef_nxv4i8(<vscale x 4 x i8> %va) {
; CHECK-F-LABEL: ctlz_zero_undef_nxv4i8:
; CHECK-F: # %bb.0:
; CHECK-F-NEXT: vsetvli a0, zero, e16, m1, ta, ma
-; CHECK-F-NEXT: vzext.vf2 v9, v8
-; CHECK-F-NEXT: vfwcvt.f.xu.v v10, v9
-; CHECK-F-NEXT: vnsrl.wi v8, v10, 23
+; CHECK-F-NEXT: vzext.vf2 v10, v8
+; CHECK-F-NEXT: vfwcvt.f.xu.v v8, v10
+; CHECK-F-NEXT: vnsrl.wi v10, v8, 23
; CHECK-F-NEXT: vsetvli zero, zero, e8, mf2, ta, ma
-; CHECK-F-NEXT: vnsrl.wi v8, v8, 0
+; CHECK-F-NEXT: vnsrl.wi v8, v10, 0
; CHECK-F-NEXT: li a0, 134
; CHECK-F-NEXT: vrsu...
[truncated]
@@ -752,18 +752,24 @@ def VR : VReg<!listconcat(VM1VTs, VMaskVTs),

def VRNoV0 : VReg<!listconcat(VM1VTs, VMaskVTs), (sub VR, V0), 1>;

let AllocationPriority = 2 in
Can we do this inside of VReg using the lmul?
In the current change, yes. I'd done it this way because I'd originally planned to have the NoV0 cases have different values. I'll switch.
Doing this revealed that I hadn't handled the segment tuple register classes. I decided to treat those as having the same priority as their lmul component; that is, I ignored NF. There might be a better heuristic here.
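To make the discussion above concrete, here is a minimal sketch of what deriving the priority inside VReg from LMUL could look like. The base class and field set are simplified (the real class derives from a RISC-V-specific RegisterClass wrapper), and the VRegTuple class is purely hypothetical (the actual segment register classes are defined differently), but the priority values match the ones in the posted diff and the NF-is-ignored choice described above.

```tablegen
// Sketch: compute the priority once from LMUL instead of repeating
// "let AllocationPriority = N in" at every def.  Base class simplified.
class VReg<list<ValueType> regTypes, dag regList, int Vlmul>
    : RegisterClass<"RISCV", regTypes, 64, regList> {
  int VLMul = Vlmul;
  // LMUL 1 keeps the default priority of 0; LMUL 2/4/8 become 2/4/8,
  // matching the values in the diff above.
  let AllocationPriority = !if(!gt(Vlmul, 1), Vlmul, 0);
}

// Hypothetical segment-tuple class: the priority follows the LMUL of each
// component register and ignores NF, as described in the comment above.
class VRegTuple<list<ValueType> regTypes, dag regList, int NF, int Vlmul>
    : RegisterClass<"RISCV", regTypes, 64, regList> {
  let AllocationPriority = !if(!gt(Vlmul, 1), Vlmul, 0);
}
```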
Thanks for looking at this! I tried this before but I can't remember why I dropped it (maybe it was simply because setting AllocationPriority couldn't fix #113489).
I do see some improvements and regressions, but I don't think they are significant; this PR's value may lie in improving compile time, since it avoids the later eviction? @lukel97 Can you also evaluate the performance please?
; CHECK-RV32-NEXT: sub sp, sp, a1
; CHECK-RV32-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x18, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 24 * vlenb
; CHECK-RV32-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x20, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 32 * vlenb |
Regression.
; RV32-NEXT: mul a2, a2, a3
; RV32-NEXT: sub sp, sp, a2
; RV32-NEXT: .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0xe4, 0x00, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 100 * vlenb
; RV32-NEXT: .cfi_escape 0x0f, 0x0e, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0xe0, 0x00, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 96 * vlenb |
Improvement.
; RV32-NEXT: mul a2, a2, a3
; RV32-NEXT: add a2, sp, a2
; RV32-NEXT: addi a2, a2, 16
; RV32-NEXT: vs2r.v v14, (a2) # Unknown-size Folded Spill |
Uses less memory but causes more spills?
I've kicked off a run on a Banana Pi now; it should be done over the weekend.
Results on rva22u64_v -O3 -flto: https://lnt.lukelau.me/db_default/v4/nts/311
Looks like we saw a small improvement in x264 and not much else. That's actually a bit better than I'd expected; definitely nothing problematic at least.
I'd like to see this landed.
LGTM, but please wait for one more approval since this changes a lot.
The cost of a vector spill/reload may vary highly depending on the size of the vector register being spilled, i.e. LMUL, so the usual regalloc.NumSpills/regalloc.NumReloads statistics may not be an accurate reflection of the total cost. This adds two new statistics for RISCVInstrInfo that collect the total LMUL for vector register spills and reloads. It can be used to get a better idea of regalloc changes in e.g. llvm#131176 llvm#113675
This seems to increase the overall LMUL spilled on 538.imagick_r:
Program riscv-instr-info.TotalLMULSpilled riscv-instr-info.TotalLMULReloaded
lhs rhs diff lhs rhs diff
FP2017rate/538.imagick_r/538.imagick_r 4239.00 5082.00 19.9% 6697.00 7321.00 9.3%
FP2017speed/638.imagick_s/638.imagick_s 4239.00 5082.00 19.9% 6697.00 7321.00 9.3%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s 132.00 134.00 1.5% 274.00 248.00 -9.5%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r 132.00 134.00 1.5% 274.00 248.00 -9.5%
INT2017rate/520.omnetpp_r/520.omnetpp_r 4.00 4.00 0.0% 5.00 5.00 0.0%
INT2017speed/625.x264_s/625.x264_s 83.00 83.00 0.0% 93.00 93.00 0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s 6.00 6.00 0.0% 6.00 6.00 0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s 4.00 4.00 0.0% 5.00 5.00 0.0%
INT2017speed/602.gcc_s/602.gcc_s 85.00 85.00 0.0% 91.00 91.00 0.0%
INT2017spe...00.perlbench_s/600.perlbench_s 4.00 4.00 0.0% 8.00 8.00 0.0%
INT2017rate/525.x264_r/525.x264_r 83.00 83.00 0.0% 93.00 93.00 0.0%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r 6.00 6.00 0.0% 6.00 6.00 0.0%
FP2017rate/508.namd_r/508.namd_r 1.00 1.00 0.0% 6.00 6.00 0.0%
FP2017rate/510.parest_r/510.parest_r 1084.00 1084.00 0.0% 1368.00 1368.00 0.0%
INT2017rat...00.perlbench_r/500.perlbench_r 4.00 4.00 0.0% 8.00 8.00 0.0%
FP2017speed/644.nab_s/644.nab_s 25.00 25.00 0.0% 25.00 25.00 0.0%
FP2017speed/619.lbm_s/619.lbm_s 38.00 38.00 0.0% 38.00 38.00 0.0%
FP2017rate/544.nab_r/544.nab_r 25.00 25.00 0.0% 25.00 25.00 0.0%
FP2017rate/519.lbm_r/519.lbm_r 38.00 38.00 0.0% 38.00 38.00 0.0%
FP2017rate/511.povray_r/511.povray_r 121.00 121.00 0.0% 138.00 138.00 0.0%
INT2017rate/502.gcc_r/502.gcc_r 85.00 85.00 0.0% 91.00 91.00 0.0%
FP2017rate/526.blender_r/526.blender_r 1159.00 1155.00 -0.3% 1292.00 1298.00 0.5%
With that said, though, there's no actual impact on the runtime performance of that benchmark, so I presume these spills are on the cold path. I looked at the codegen changes and the code there is pretty bad to begin with anyway.
I'm just flagging this FYI; I'm happy for this to land in any case.
I wrote this down to follow up on. At a minimum, it may be an interesting register allocation case I can learn something from.
The cost of a vector spill/reload may vary highly depending on the size of the vector register being spilled, i.e. LMUL, so the usual regalloc.NumSpills/regalloc.NumReloads statistics may not be an accurate reflection of the total cost. This adds two new statistics for RISCVInstrInfo that collect the total number of vector registers spilled/reloaded within groups. It can be used to get a better idea of regalloc changes in e.g. #131176 #113675
This mechanism causes the greedy register allocator to prefer allocating register classes with higher priority first. This helps to ensure that high LMUL registers obtain a register without having to go through the eviction mechanism. In practice, it seems to cause a bunch of code churn, and some minor improvement around widening and narrowing operations.
In a few of the widening tests, we have what look like code size regressions because we end up with two smaller register class copies instead of one larger one after the instruction. However, in any larger code sequence, these are likely to be folded into the producing instructions. (But so were the wider copies after the operation.)
Two observations:
1) We're not setting the greedy-regclass-priority-trumps-globalness flag
   on the register class, so this doesn't help long mask ranges. I
   thought about doing that, but the benefit is non-obvious, so I
   decided it was worth a separate change at minimum.
2) We could arguably set the priority higher for the register classes
   that exclude v0 (sketched below). I tried that, and it caused a whole bunch of
   further churn. I may return to it in a separate patch.
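As a purely hypothetical illustration of observation 2 (not something this patch does), bumping the NoV0 classes slightly above their unconstrained counterparts, on top of the per-def form in the posted diff, might look like the following. The specific values (3/5/9) are arbitrary and only need to stay within the allocator's supported priority range.

```tablegen
// Hypothetical only: prefer the constrained NoV0 classes a little, so values
// that must avoid v0 get first pick at a register group.  This patch
// deliberately does not do this; trying it caused a lot of further churn.
let AllocationPriority = 3 in
def VRM2NoV0 : VReg<VM2VTs, (sub VRM2, V0M2), 2>;
let AllocationPriority = 5 in
def VRM4NoV0 : VReg<VM4VTs, (sub VRM4, V0M4), 4>;
let AllocationPriority = 9 in
def VRM8NoV0 : VReg<VM8VTs, (sub VRM8, V0M8), 8>;
```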