
Code generation for packed math instructions #1369

Closed
masahi opened this issue Aug 25, 2021 · 10 comments

masahi commented Aug 25, 2021

Hi, I'm trying to generate packed math instructions like v_pk_fma_f16 via TVM for DL inference use cases. Using the LLVM AMDGPU backend directly (via rocm), I was able to generate asm containing the v_pk_fma_f16 instruction, like this: https://gist.github.com/masahi/2de1a7dc87e2068ffb50ba6135273f95#file-conv2d_nhwc_float16x2-s-L495-L496. But I couldn't get the equivalent asm when going through SPIR-V and AMDVLK: I only get v_fma_f16 etc., even though the generated SPIR-V looks good to me and contains instructions that operate on float16x2, like the ones below.

...
%223 = OpFMul %v2half %217 %222
%224 = OpFAdd %v2half %214 %223
...

After a quick investigation, I identified that the scalarization pass added at

passMgr.add(createScalarizerPass());

turns the vectorized llvm.fma.v2f16(...) intrinsic into scalar llvm.fma.f16 calls.
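
For reference, the transformation looks roughly like this in IR (a minimal sketch; value names are illustrative):

; Before the Scalarizer pass: one packed intrinsic call.
%r = call <2 x half> @llvm.fma.v2f16(<2 x half> %a, <2 x half> %b, <2 x half> %c)

; After the pass: each lane is peeled out, computed with a scalar
; intrinsic call, and reassembled into a vector.
%a0 = extractelement <2 x half> %a, i32 0
%b0 = extractelement <2 x half> %b, i32 0
%c0 = extractelement <2 x half> %c, i32 0
%r0 = call half @llvm.fma.f16(half %a0, half %b0, half %c0)
%a1 = extractelement <2 x half> %a, i32 1
%b1 = extractelement <2 x half> %b, i32 1
%c1 = extractelement <2 x half> %c, i32 1
%r1 = call half @llvm.fma.f16(half %a1, half %b1, half %c1)
%v0 = insertelement <2 x half> undef, half %r0, i32 0
%v1 = insertelement <2 x half> %v0, half %r1, i32 1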

Is the scalarization pass necessary? If so, is there a way for AMDVLK to support packed math instructions?

dstutt (Member) commented Sep 6, 2021

Hi,

As you've already discovered, this is due to the scalarization pass.
Does disabling the pass cause any performance regressions in your case?

Scalarization is still required in a lot of use cases: when some subregisters of a vector are dead (but not the whole vector), keeping the vector form inflates register use. Scalarizing is usually a benefit because it lowers register use and, as a consequence, raises occupancy.
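
To illustrate the dead-subregister case (a minimal sketch, assuming only lane 0 of the result is ever used; names are illustrative):

; Vector form: the whole packed result stays live even though
; only the low lane is consumed.
%v = call <2 x half> @llvm.fma.v2f16(<2 x half> %a, <2 x half> %b, <2 x half> %c)
%lo = extractelement <2 x half> %v, i32 0

; Scalarized form: the lane-1 fma is trivially dead and later
; cleanup deletes it, so no register holds the unused half.
%a0 = extractelement <2 x half> %a, i32 0
%b0 = extractelement <2 x half> %b, i32 0
%c0 = extractelement <2 x half> %c, i32 0
%lo = call half @llvm.fma.f16(half %a0, half %b0, half %c0)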

The only way at the moment to support the packed instructions is to disable scalarization.

masahi (Author) commented Sep 13, 2021

I confirmed that disabling the scalarization pass indeed enables packed instructions to be generated.

However, enabling packed instructions didn't result in any speedup: on a simple convolution workload I tested, runtimes with the original and the modified amdvlk are not too different.

In particular, the assembly generated by the modified build (scalarization disabled) is almost 2x the size of the original.

dstutt (Member) commented Sep 13, 2021

Are you able to share the SPIR-V for the shader?

masahi (Author) commented Sep 13, 2021

Sorry, my comment about the binary size was not correct: I ran auto-tuning on the conv2d workload separately for the two amdvlk versions, so the input SPIR-V differs between the two cases.

The attached zip is the SPIR-V used to generate the packed-instruction version. Using the modified amdvlk results in this asm, which contains the v_pk_fma_f16 instruction.

Shader_0x2F8225C574413D3E.zip

dstutt (Member) commented Sep 16, 2021

Sorry for the slow response; I've had a look at it now.
Some observations:

  • Stopping the scalarization results in code of the form:
ds_read_b32 v9, v8 offset:96
s_waitcnt lgkmcnt(0)
v_lshrrev_b32_e32 v10, 16, v9
v_fmac_f16_e32 v6, v5, v9
v_fmac_f16_e32 v7, v5, v10

being transformed into something like this:

ds_read_b32 v9, v8 offset:96
v_and_b32_e32 v6, 0xffff, v6
v_lshl_or_b32 v6, v7, 16, v6
s_waitcnt lgkmcnt(0)
v_pk_fma_f16 v6, v5, v9, v6 op_sel_hi:[0,1,1]
v_lshrrev_b32_e32 v7, 16, v6

which looks to be in line with what I'd expect. Nearly all the differences are of this form throughout the code (it looks like there's been some unrolling?).

  • The register use is identical - removing scalarization can make this worse, so it is good to see that in this case it makes no difference.
  • Bottom line is that we're swapping 2 fma operations for a single v_pk_fma plus a v_and and a v_lshl_or to manipulate the values into the vec2 form. It looks from the input llvm-ir that the fmul/fadd of the <2 x half> are surrounded by insertelement/extractelement pairs, so there's not a lot to do about that (I haven't looked more closely to work out whether these are redundant).
  %135 = insertelement <2 x half> undef, half %130, i32 0
  %136 = insertelement <2 x half> %135, half %128, i32 1
  %137 = fmul reassoc nnan nsz arcp contract afn <2 x half> %134, %136
  %138 = fadd reassoc nnan nsz arcp contract afn <2 x half> %132, %137
  %139 = extractelement <2 x half> %138, i32 0
  %140 = extractelement <2 x half> %138, i32 1

masahi (Author) commented Sep 16, 2021

Thank you for taking a look. Yes, currently the inputs to the convolution are scalar fp16 buffers, and I manually added 2-way vectorization in the inner loop to experiment with packed-instruction codegen. I believe we can remove those insert/extract pairs if TVM keeps float16x2 buffers for the input/output throughout, as in the sketch below.
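
For example (a minimal sketch, assuming <2 x half> loads from LDS and an accumulator already in packed form; pointer and value names are illustrative), the packed type would flow straight through:

; No insertelement/extractelement needed: the packed values go
; straight from the loads into the fmul/fadd, so the backend can
; select v_pk_fma_f16 directly.
%a = load <2 x half>, <2 x half> addrspace(3)* %pa
%b = load <2 x half>, <2 x half> addrspace(3)* %pb
%mul = fmul reassoc nnan nsz arcp contract afn <2 x half> %a, %b
%acc.next = fadd reassoc nnan nsz arcp contract afn <2 x half> %acc, %mul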

It is great to learn that packed instructions can be generated by amdvlk with a simple modification.

dstutt (Member) commented Sep 16, 2021

I'll take a look at some options for letting f16 vectors pass through scalarization untouched.

dstutt (Member) commented Sep 16, 2021

I've created an experimental change that disables scalarization, but only for v2half types: https://github.com/dstutt/llvm-project/tree/no-scalarize-v2f16

See if that works for you. It might be possible to upstream something based on this, with target hooks to allow disabling scalarization for specific types (e.g. v2f16 for us).

masahi (Author) commented Sep 17, 2021

Yes, I confirmed that this change enables v_pk_fma_f16 to be generated, thank you.

For now I'm happy with this solution, so I'll close this issue.

oscarbg commented Jul 18, 2022

Seeing GPUOpen-Drivers/AMDVLK#279, it seems a v2int16 tweak could also be nice for vkpeak, in addition to the v2f16 tweak in https://github.com/dstutt/llvm-project/tree/no-scalarize-v2f16 (something like the sketch below). @dstutt, can you help?
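
A minimal sketch (untested, illustrative) of the kind of op that would need to survive scalarization for the backend to pick a packed integer instruction:

; A packed 16-bit integer add; kept in vector form, the AMDGPU
; backend can select a packed instruction such as v_pk_add_u16.
define <2 x i16> @packed_add(<2 x i16> %a, <2 x i16> %b) {
  %sum = add <2 x i16> %a, %b
  ret <2 x i16> %sum
}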
