Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISCV][EVL] Improve AnyOf reduction codegen #132180

Open
lukel97 opened this issue Mar 20, 2025 · 5 comments
Open

[RISCV][EVL] Improve AnyOf reduction codegen #132180

lukel97 opened this issue Mar 20, 2025 · 5 comments

Comments

@lukel97
Copy link
Contributor

lukel97 commented Mar 20, 2025

On RISC-V, an AnyOf reduction is vectorized quite poorly with EVL tail folding:

int f(int *x, int y, int n) {
  int z = 0;
  for (int i = 0; i < n; i++)
    if (x[i] == y)
      z = 1;
  return z;
}
.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	t0, a2, a3
	sh2add	a6, a3, a0
	vsetvli	t1, t0, e8, mf2, ta, ma
	vsetvli	a4, zero, e64, m4, ta, ma
	vmv.v.x	v16, t1
	vmsleu.vv	v9, v16, v12
	vsetvli	zero, t0, e32, m2, ta, ma
	vle32.v	v10, (a6)
	sub	a5, a5, a7
	vsetvli	a4, zero, e64, m4, ta, ma
	vmsltu.vx	v16, v12, t1
	vmand.mm	v9, v8, v9
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v17, v10, a1
	vmor.mm	v8, v8, v17
	vmand.mm	v8, v8, v16
	vmor.mm	v8, v8, v9
	add	a3, a3, t1
	bnez	a5, .LBB0_2
# %bb.3:                                # %middle.block
	vcpop.m	a0, v8
	snez	a0, a0
	ret

Compare this to the non-EVL tail folded version:

.LBB0_5:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vl2re32.v	v10, (a3)
	add	a3, a3, a4
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v9, v10, a1
	vmor.mm	v8, v8, v9
	bne	a3, a5, .LBB0_5
# %bb.6:                                # %middle.block
	vcpop.m	a3, v8
	# ...

The issue is due to the i1 vp.merge used having poor lowering, in turn due to the lack of tail preserving mask instructions in the RISC-V ISA.

A better lowering is likely to widen this to an i8 vector:

	vsetvli a5, zero, e8, m1, ta, ma
	vmv.v.i	v9, 0
        vmv.v.i	v11, 0
loop:
	vsetvli	a5, a7, e32, m1, ta, ma
	vle32.v	v8, (a0)
	add	a0, a0, a5
	vmseq.vx	v0, v8, zero
	vsetvli	zero, zero, e8, mf4, ta, ma
	vmerge.vim	v10, v9, 1, v0
	vor.vv	v11, v11, v10
	sub	a7, a7, a5
	bnez	a7, loop
exit:
	vmsne.vi	v10, v11, 0
	vcpop.m	a1, v10

Performing this as an in-loop reduction was also discussed in #131830, but it's likely not as profitable as the widened out-of-loop reduction.

@llvmbot
Copy link
Member

llvmbot commented Mar 20, 2025

@llvm/issue-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

On RISC-V, an AnyOf reduction is vectorized quite poorly with EVL tail folding:
int f(int *x, int y, int n) {
  int z = 0;
  for (int i = 0; i &lt; n; i++)
    if (x[i] == y)
      z = 1;
  return z;
}
.LBB0_2:                                # %vector.body
                                        # =&gt;This Inner Loop Header: Depth=1
	sub	t0, a2, a3
	sh2add	a6, a3, a0
	vsetvli	t1, t0, e8, mf2, ta, ma
	vsetvli	a4, zero, e64, m4, ta, ma
	vmv.v.x	v16, t1
	vmsleu.vv	v9, v16, v12
	vsetvli	zero, t0, e32, m2, ta, ma
	vle32.v	v10, (a6)
	sub	a5, a5, a7
	vsetvli	a4, zero, e64, m4, ta, ma
	vmsltu.vx	v16, v12, t1
	vmand.mm	v9, v8, v9
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v17, v10, a1
	vmor.mm	v8, v8, v17
	vmand.mm	v8, v8, v16
	vmor.mm	v8, v8, v9
	add	a3, a3, t1
	bnez	a5, .LBB0_2
# %bb.3:                                # %middle.block
	vcpop.m	a0, v8
	snez	a0, a0
	ret

Compare this to the non-EVL tail folded version:

.LBB0_5:                                # %vector.body
                                        # =&gt;This Inner Loop Header: Depth=1
	vl2re32.v	v10, (a3)
	add	a3, a3, a4
	vsetvli	zero, zero, e32, m2, ta, ma
	vmseq.vx	v9, v10, a1
	vmor.mm	v8, v8, v9
	bne	a3, a5, .LBB0_5
# %bb.6:                                # %middle.block
	vcpop.m	a3, v8
	# ...

The issue is due to the i1 vp.merge used having poor lowering, in turn due to the lack of tail preserving mask instructions in the RISC-V ISA.

A better lowering is likely to widen this to an i8 vector:

	vsetvli a5, zero, e8, m1, ta, ma
	vmv.v.i	v9, 0
loop:
	vsetvli	a5, a7, e32, m1, ta, ma
	vle32.v	v8, (a0)
	add	a0, a0, a5
	vmseq.vx	v0, v8, zero
	vsetvli	zero, zero, e8, mf4, ta, ma
	vmerge.vim	v10, v9, 1, v0
	vor.vv	v11, v11, v10
	sub	a7, a7, a5
	bnez	a7, loop
exit:
	vmsne.vi	v10, v11, 0
	vcpop.m	a1, v10

Performing this as an in-loop reduction was also discussed in #131830, but it's likely not as profitable as the widened out-of-loop reduction.

@lukel97 lukel97 self-assigned this Mar 20, 2025
@lukel97
Copy link
Contributor Author

lukel97 commented Mar 20, 2025

I've hacked together something in the loop vectorizer to widen the reduction, the vp.merge actually ends up coming after the or, which means it gets folded into a masked vor.vi:

  %14 = xor <vscale x 4 x i1> %13, splat (i1 true)
  %15 = zext <vscale x 4 x i1> %14 to <vscale x 4 x i8>
  %16 = or <vscale x 4 x i8> %vec.phi, %15
  %17 = call <vscale x 4 x i8> @llvm.vp.merge.nxv4i8(<vscale x 4 x i1> splat (i1 true), <vscale x 4 x i8> %16, <vscale x 4 x i8> %vec.phi, i32 %9)
	vsetvli	a5, zero, e8, mf2, ta, ma
	vmv.v.i	v8, 0
.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	a5, a1, a3
	slli	a6, a3, 2
	vsetvli	a5, a5, e32, m2, ta, ma
	add	a6, a0, a6
	vle32.v	v10, (a6)
	vmsne.vi	v0, v10, 3
	sub	a4, a4, a2
	vsetvli	zero, zero, e8, mf2, tu, mu
	vor.vi	v8, v8, 1, v0.t
	add	a3, a5, a3
	bnez	a4, .LBB0_6
# %bb.7:                                # %middle.block
	vsetvli	a0, zero, e8, mf2, ta, ma
	vand.vi	v8, v8, 1
	vmsne.vi	v8, v8, 0
	vcpop.m	a0, v8
	bnez	a0, .LBB0_10

@Mel-Chen
Copy link
Contributor

I've been considering whether to address this issue in codegen prepare since it might be a target-dependent problem. Another minor reason is that the vectorizer generally prefers smaller type as possible, making the type larger is less common.

If we do need to perform type promotion in the vectorizer, it might be possible to add a merge-type promotion VPlanTransformation after EVL lowering.
cc: @fhahn @topperc

@lukel97
Copy link
Contributor Author

lukel97 commented Mar 20, 2025

One reason to do it in the vectorizer is that it can model the cheaper cost of the vp.merge, i.e. 5 instructions now become 1.

@preames
Copy link
Collaborator

preames commented Mar 21, 2025

FYI, if we ever allow larger LMUL in the vectorizer, I think we are going to want the vcpop lowering. I'm not arguing against having the out-of-loop lowering for lower LMULs (the current default case), but for supporting both. At high LMUL, that vor.vi is going to be linear in LMUL whereas the vcpop is going to be constant. I suspect that at e.g. m8 on the BP3 the vcpop would be profitable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants