-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RISCV][EVL] Improve AnyOf reduction codegen #132180
Comments
@llvm/issue-subscribers-backend-risc-v Author: Luke Lau (lukel97)
On RISC-V, an AnyOf reduction is vectorized quite poorly with EVL tail folding:
int f(int *x, int y, int n) {
int z = 0;
for (int i = 0; i < n; i++)
if (x[i] == y)
z = 1;
return z;
} .LBB0_2: # %vector.body
# =>This Inner Loop Header: Depth=1
sub t0, a2, a3
sh2add a6, a3, a0
vsetvli t1, t0, e8, mf2, ta, ma
vsetvli a4, zero, e64, m4, ta, ma
vmv.v.x v16, t1
vmsleu.vv v9, v16, v12
vsetvli zero, t0, e32, m2, ta, ma
vle32.v v10, (a6)
sub a5, a5, a7
vsetvli a4, zero, e64, m4, ta, ma
vmsltu.vx v16, v12, t1
vmand.mm v9, v8, v9
vsetvli zero, zero, e32, m2, ta, ma
vmseq.vx v17, v10, a1
vmor.mm v8, v8, v17
vmand.mm v8, v8, v16
vmor.mm v8, v8, v9
add a3, a3, t1
bnez a5, .LBB0_2
# %bb.3: # %middle.block
vcpop.m a0, v8
snez a0, a0
ret Compare this to the non-EVL tail folded version: .LBB0_5: # %vector.body
# =>This Inner Loop Header: Depth=1
vl2re32.v v10, (a3)
add a3, a3, a4
vsetvli zero, zero, e32, m2, ta, ma
vmseq.vx v9, v10, a1
vmor.mm v8, v8, v9
bne a3, a5, .LBB0_5
# %bb.6: # %middle.block
vcpop.m a3, v8
# ... The issue is due to the i1 vp.merge used having poor lowering, in turn due to the lack of tail preserving mask instructions in the RISC-V ISA. A better lowering is likely to widen this to an i8 vector: vsetvli a5, zero, e8, m1, ta, ma
vmv.v.i v9, 0
loop:
vsetvli a5, a7, e32, m1, ta, ma
vle32.v v8, (a0)
add a0, a0, a5
vmseq.vx v0, v8, zero
vsetvli zero, zero, e8, mf4, ta, ma
vmerge.vim v10, v9, 1, v0
vor.vv v11, v11, v10
sub a7, a7, a5
bnez a7, loop
exit:
vmsne.vi v10, v11, 0
vcpop.m a1, v10 Performing this as an in-loop reduction was also discussed in #131830, but it's likely not as profitable as the widened out-of-loop reduction. |
I've hacked together something in the loop vectorizer to widen the reduction, the vp.merge actually ends up coming after the or, which means it gets folded into a masked vor.vi: %14 = xor <vscale x 4 x i1> %13, splat (i1 true)
%15 = zext <vscale x 4 x i1> %14 to <vscale x 4 x i8>
%16 = or <vscale x 4 x i8> %vec.phi, %15
%17 = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> %16, <vscale x 4 x i8> %vec.phi, i32 %9) vsetvli a5, zero, e8, mf2, ta, ma
vmv.v.i v8, 0
.LBB0_6: # %vector.body
# =>This Inner Loop Header: Depth=1
sub a5, a1, a3
slli a6, a3, 2
vsetvli a5, a5, e32, m2, ta, ma
add a6, a0, a6
vle32.v v10, (a6)
vmsne.vi v0, v10, 3
sub a4, a4, a2
vsetvli zero, zero, e8, mf2, tu, mu
vor.vi v8, v8, 1, v0.t
add a3, a5, a3
bnez a4, .LBB0_6
# %bb.7: # %middle.block
vsetvli a0, zero, e8, mf2, ta, ma
vand.vi v8, v8, 1
vmsne.vi v8, v8, 0
vcpop.m a0, v8
bnez a0, .LBB0_10 |
I've been considering whether to address this issue in codegen prepare since it might be a target-dependent problem. Another minor reason is that the vectorizer generally prefers smaller type as possible, making the type larger is less common. If we do need to perform type promotion in the vectorizer, it might be possible to add a merge-type promotion VPlanTransformation after EVL lowering. |
One reason to do it in the vectorizer is that it can model the cheaper cost of the vp.merge, i.e. 5 instructions now become 1. |
FYI, if we ever allow larger LMUL in the vectorizer, I think we are going to want the vcpop lowering. I'm not arguing against having the out-of-loop lowering for lower LMULs (the current default case), but for supporting both. At high LMUL, that vor.vi is going to be linear in LMUL whereas the vcpop is going to be constant. I suspect that at e.g. m8 on the BP3 the vcpop would be profitable. |
On RISC-V, an AnyOf reduction is vectorized quite poorly with EVL tail folding:
Compare this to the non-EVL tail folded version:
The issue is due to the i1 vp.merge used having poor lowering, in turn due to the lack of tail preserving mask instructions in the RISC-V ISA.
A better lowering is likely to widen this to an i8 vector:
Performing this as an in-loop reduction was also discussed in #131830, but it's likely not as profitable as the widened out-of-loop reduction.
The text was updated successfully, but these errors were encountered: