
Code generation for packed math instructions #1369

Closed
masahi opened this issue Aug 25, 2021 · 10 comments

masahi commented Aug 25, 2021

Hi, I'm trying to generate packed math instructions like v_pk_fma_f16 via TVM for DL inference use cases. Using the LLVM AMDGPU backend directly (via rocm), I was able to generate asm containing the v_pk_fma_f16 instruction, like this: https://gist.github.com/masahi/2de1a7dc87e2068ffb50ba6135273f95#file-conv2d_nhwc_float16x2-s-L495-L496. But I couldn't get the equivalent asm when going through SPIR-V and AMDVLK: I only get v_fma_f16 etc., even though the generated SPIR-V looks good to me and contains instructions that operate on float16x2, like the ones below.

...
%223 = OpFMul %v2half %217 %222
%224 = OpFAdd %v2half %214 %223
...

After a quick investigation, I identified that the scalarization pass added at

passMgr.add(createScalarizerPass());

turns the vectorized llvm.fma.v2f16(...) intrinsic into scalar llvm.fma.f16 calls.
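
For reference, the transformation looks roughly like this in IR (a minimal sketch; value names are illustrative):

; Before the Scalarizer pass: one packed intrinsic call.
%r = call <2 x half> @llvm.fma.v2f16(<2 x half> %a, <2 x half> %b, <2 x half> %c)

; After the pass: each lane is peeled out, computed with a scalar
; intrinsic call, and reassembled into a vector.
%a0 = extractelement <2 x half> %a, i32 0
%b0 = extractelement <2 x half> %b, i32 0
%c0 = extractelement <2 x half> %c, i32 0
%r0 = call half @llvm.fma.f16(half %a0, half %b0, half %c0)
%a1 = extractelement <2 x half> %a, i32 1
%b1 = extractelement <2 x half> %b, i32 1
%c1 = extractelement <2 x half> %c, i32 1
%r1 = call half @llvm.fma.f16(half %a1, half %b1, half %c1)
%v0 = insertelement <2 x half> undef, half %r0, i32 0
%v1 = insertelement <2 x half> %v0, half %r1, i32 1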

Is the scalarization pass necessary? If so, is there a way for AMDVLK to support packed math instructions?

dstutt (Member) commented Sep 6, 2021

Hi,

As you've already discovered, this is due to the scalarization pass.
Does disabling the pass cause any performance regressions in your case?

Scalarization is still required in a lot of use cases: when some subregisters of a vector are dead (but not the whole vector), keeping the vector form inflates register use. Scalarizing is usually a benefit because it lowers register use and, as a consequence, raises occupancy.
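
To illustrate the dead-subregister case (a minimal sketch, assuming only lane 0 of the result is ever used; names are illustrative):

; Vector form: the whole packed result stays live even though
; only the low lane is consumed.
%v = call <2 x half> @llvm.fma.v2f16(<2 x half> %a, <2 x half> %b, <2 x half> %c)
%lo = extractelement <2 x half> %v, i32 0

; Scalarized form: the lane-1 fma is trivially dead and later
; cleanup deletes it, so no register holds the unused half.
%a0 = extractelement <2 x half> %a, i32 0
%b0 = extractelement <2 x half> %b, i32 0
%c0 = extractelement <2 x half> %c, i32 0
%lo = call half @llvm.fma.f16(half %a0, half %b0, half %c0)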

The only way at the moment to support the packed instructions is to disable scalarization.

masahi (Author) commented Sep 13, 2021

I confirmed that disabling the scalarization pass indeed enables packed instructions to be generated.

However, enabling packed instructions didn't result in any speedup: on a simple convolution workload I tested, runtimes with the original and the modified amdvlk are not too different.

In particular, the assembly generated by the modified build (scalarization disabled) is almost 2x the size of the original.

dstutt (Member) commented Sep 13, 2021

Are you able to share the SPIR-V for the shader?

masahi (Author) commented Sep 13, 2021

Sorry, my comment about the binary size was not correct: I ran auto-tuning on the conv2d workload separately for the two amdvlk versions, so the input SPIR-V differs between the two cases.

The attached zip is the SPIR-V used to generate the packed-instruction version. Using the modified amdvlk results in this asm, which contains the v_pk_fma_f16 instruction.

Shader_0x2F8225C574413D3E.zip

dstutt (Member) commented Sep 16, 2021

Sorry for the slow response; I've had a look at it now.
Some observations:

  • Stopping the scalarization results in code of the form:
ds_read_b32 v9, v8 offset:96
s_waitcnt lgkmcnt(0)
v_lshrrev_b32_e32 v10, 16, v9
v_fmac_f16_e32 v6, v5, v9
v_fmac_f16_e32 v7, v5, v10

being transformed into something like this:

ds_read_b32 v9, v8 offset:96
v_and_b32_e32 v6, 0xffff, v6
v_lshl_or_b32 v6, v7, 16, v6
s_waitcnt lgkmcnt(0)
v_pk_fma_f16 v6, v5, v9, v6 op_sel_hi:[0,1,1]
v_lshrrev_b32_e32 v7, 16, v6

which looks to be in line with what I'd expect. Nearly all the differences are of this form throughout the code (it looks like there's been some unrolling?).

  • The register use is identical - removing scalarization can make this worse, so it is good to see that in this case it makes no difference.
  • Bottom line is that we're swapping 2 fma operations for a single v_pk_fma plus a v_and and a v_lshl_or to manipulate the values into the vec2 form. It looks from the input llvm-ir that the fmul/fadd of the <2 x half> are surrounded by insertelement/extractelement pairs, so there's not a lot to do about that (I haven't looked more closely to work out whether these are redundant).
  %135 = insertelement <2 x half> undef, half %130, i32 0
  %136 = insertelement <2 x half> %135, half %128, i32 1
  %137 = fmul reassoc nnan nsz arcp contract afn <2 x half> %134, %136
  %138 = fadd reassoc nnan nsz arcp contract afn <2 x half> %132, %137
  %139 = extractelement <2 x half> %138, i32 0
  %140 = extractelement <2 x half> %138, i32 1

masahi (Author) commented Sep 16, 2021

Thank you for taking a look. Yes, currently the inputs to the convolution are scalar fp16 buffers, and I manually added 2-way vectorization in the inner loop to experiment with packed-instruction codegen. I believe we can remove those insert/extract pairs if TVM keeps float16x2 buffers for the input/output throughout, as in the sketch below.
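
For example (a minimal sketch, assuming <2 x half> loads from LDS and an accumulator already in packed form; pointer and value names are illustrative), the packed type would flow straight through:

; No insertelement/extractelement needed: the packed values go
; straight from the loads into the fmul/fadd, so the backend can
; select v_pk_fma_f16 directly.
%a = load <2 x half>, <2 x half> addrspace(3)* %pa
%b = load <2 x half>, <2 x half> addrspace(3)* %pb
%mul = fmul reassoc nnan nsz arcp contract afn <2 x half> %a, %b
%acc.next = fadd reassoc nnan nsz arcp contract afn <2 x half> %acc, %mul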

It is great to learn that packed instructions can be generated by amdvlk with a simple modification.

dstutt (Member) commented Sep 16, 2021

I'll take a look at some options for letting f16 vectors pass through scalarization untouched.

dstutt (Member) commented Sep 16, 2021

I've created an experimental change that disables scalarization, but only for v2half types: https://github.com/dstutt/llvm-project/tree/no-scalarize-v2f16

See if that works for you. It might be possible to upstream something based on this, with target hooks to allow disabling scalarization for specific types (e.g. v2f16 for us).

masahi (Author) commented Sep 17, 2021

Yes, I confirmed that this change enables v_pk_fma_f16 to be generated, thank you.

For now I'm happy with this solution, so I'll close this issue.

oscarbg commented Jul 18, 2022

Seeing GPUOpen-Drivers/AMDVLK#279, it seems a v2int16 tweak could also be nice for vkpeak, in addition to the v2f16 tweak in https://github.com/dstutt/llvm-project/tree/no-scalarize-v2f16 (something like the sketch below). @dstutt, can you help?
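
A minimal sketch (untested, illustrative) of the kind of op that would need to survive scalarization for the backend to pick a packed integer instruction:

; A packed 16-bit integer add; kept in vector form, the AMDGPU
; backend can select a packed instruction such as v_pk_add_u16.
define <2 x i16> @packed_add(<2 x i16> %a, <2 x i16> %b) {
  %sum = add <2 x i16> %a, %b
  ret <2 x i16> %sum
}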
