simd vectorization regression starting with clang 14 #82936

Closed
nschaeff opened this issue Feb 25, 2024 · 3 comments · Fixed by #106441

nschaeff commented Feb 25, 2024:

A simple loop multiplying two arrays with different multiplicity fails to vectorize efficiently with clang 14+, while it worked with clang 13.0.1.
The loop is the following, where 4 consecutive values in data are multiplied by the same factor:

    for (int i=0; i<n; i++) {
        for (int k=0; k<4; k++) data[4*i+k] *= factor[i];
    }

See on godbolt the convoluted code generated by clang 14+, while clang 13.0.1 correctly uses vbroadcastsd:
https://godbolt.org/z/desh4E49o

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Feb 25, 2024
@EugeneZelenko EugeneZelenko added vectorizers and removed clang Clang issues not falling into any other category labels Feb 26, 2024
ZelinMa557 (Contributor) commented:

I think the code generated by clang 18.1.0 is good enough:

rescale_x4(double*, double const*, int):                   # @rescale_x4(double*, double const*, int)
        test    edx, edx
        jle     .LBB0_8
        mov     eax, edx
        cmp     edx, 4
        jae     .LBB0_3
        xor     ecx, ecx
        jmp     .LBB0_6
.LBB0_3:
        mov     ecx, eax
        and     ecx, 2147483644
        mov     edx, eax
        shr     edx, 2
        and     edx, 536870911
        shl     rdx, 5
        xor     r8d, r8d
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rsi + r8]
        vmovupd ymm1, ymmword ptr [rdi + 4*r8]
        vmovupd ymm2, ymmword ptr [rdi + 4*r8 + 32]
        vmovupd ymm3, ymmword ptr [rdi + 4*r8 + 64]
        vmovupd ymm4, ymmword ptr [rdi + 4*r8 + 96]
        vperm2f128      ymm5, ymm1, ymm3, 32    # ymm5 = ymm1[0,1],ymm3[0,1]
        vperm2f128      ymm6, ymm2, ymm4, 32    # ymm6 = ymm2[0,1],ymm4[0,1]
        vperm2f128      ymm1, ymm1, ymm3, 49    # ymm1 = ymm1[2,3],ymm3[2,3]
        vperm2f128      ymm2, ymm2, ymm4, 49    # ymm2 = ymm2[2,3],ymm4[2,3]
        vunpcklpd       ymm3, ymm5, ymm6        # ymm3 = ymm5[0],ymm6[0],ymm5[2],ymm6[2]
        vunpcklpd       ymm4, ymm1, ymm2        # ymm4 = ymm1[0],ymm2[0],ymm1[2],ymm2[2]
        vunpckhpd       ymm5, ymm5, ymm6        # ymm5 = ymm5[1],ymm6[1],ymm5[3],ymm6[3]
        vunpckhpd       ymm1, ymm1, ymm2        # ymm1 = ymm1[1],ymm2[1],ymm1[3],ymm2[3]
        vmulpd  ymm2, ymm0, ymm3
        vmulpd  ymm3, ymm0, ymm5
        vmulpd  ymm4, ymm0, ymm4
        vmulpd  ymm0, ymm0, ymm1
        vinsertf128     ymm1, ymm2, xmm4, 1
        vinsertf128     ymm5, ymm3, xmm0, 1
        vperm2f128      ymm2, ymm2, ymm4, 49    # ymm2 = ymm2[2,3],ymm4[2,3]
        vperm2f128      ymm0, ymm3, ymm0, 49    # ymm0 = ymm3[2,3],ymm0[2,3]
        vunpcklpd       ymm3, ymm1, ymm5        # ymm3 = ymm1[0],ymm5[0],ymm1[2],ymm5[2]
        vunpcklpd       ymm4, ymm2, ymm0        # ymm4 = ymm2[0],ymm0[0],ymm2[2],ymm0[2]
        vunpckhpd       ymm1, ymm1, ymm5        # ymm1 = ymm1[1],ymm5[1],ymm1[3],ymm5[3]
        vunpckhpd       ymm0, ymm2, ymm0        # ymm0 = ymm2[1],ymm0[1],ymm2[3],ymm0[3]
        vmovupd ymmword ptr [rdi + 4*r8 + 64], ymm4
        vmovupd ymmword ptr [rdi + 4*r8 + 96], ymm0
        vmovupd ymmword ptr [rdi + 4*r8], ymm3
        vmovupd ymmword ptr [rdi + 4*r8 + 32], ymm1
        add     r8, 32
        cmp     rdx, r8
        jne     .LBB0_4
        cmp     rcx, rax
        je      .LBB0_8
.LBB0_6:
        mov     rdx, rcx
        shl     rdx, 5
        add     rdi, rdx
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        vbroadcastsd    ymm0, qword ptr [rsi + 8*rcx]
        vmulpd  ymm0, ymm0, ymmword ptr [rdi]
        vmovupd ymmword ptr [rdi], ymm0
        inc     rcx
        add     rdi, 32
        cmp     rax, rcx
        jne     .LBB0_7
.LBB0_8:
        vzeroupper
        ret

It seems this problem is fixed in clang 15+.

nschaeff (Author) commented Apr 26, 2024:

This is exactly what I call bad: lots of useless shuffles, 16 shuffle instructions for 4 multiplications...
Clang 16+ is a little less bad than 14 and 15, but still much worse than clang 13.0.1, which I paste below: zero shuffles (only broadcasts directly from memory, which are free).
I think it is not unreasonable to expect newer versions to improve the generated code.

rescale_x4(double*, double const*, int):                   # @rescale_x4(double*, double const*, int)
        test    edx, edx
        jle     .LBB0_8
        mov     ecx, edx
        lea     rax, [rcx - 1]
        mov     r8d, ecx
        and     r8d, 3
        cmp     rax, 3
        jae     .LBB0_3
        xor     edx, edx
        jmp     .LBB0_5
.LBB0_3:
        and     ecx, -4
        lea     rax, [rdi + 96]
        xor     edx, edx
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        vbroadcastsd    ymm0, qword ptr [rsi + 8*rdx]
        vmulpd  ymm0, ymm0, ymmword ptr [rax - 96]
        vmovupd ymmword ptr [rax - 96], ymm0
        vbroadcastsd    ymm0, qword ptr [rsi + 8*rdx + 8]
        vmulpd  ymm0, ymm0, ymmword ptr [rax - 64]
        vmovupd ymmword ptr [rax - 64], ymm0
        vbroadcastsd    ymm0, qword ptr [rsi + 8*rdx + 16]
        vmulpd  ymm0, ymm0, ymmword ptr [rax - 32]
        vmovupd ymmword ptr [rax - 32], ymm0
        vbroadcastsd    ymm0, qword ptr [rsi + 8*rdx + 24]
        vmulpd  ymm0, ymm0, ymmword ptr [rax]
        vmovupd ymmword ptr [rax], ymm0
        add     rdx, 4
        sub     rax, -128
        cmp     rcx, rdx
        jne     .LBB0_4
.LBB0_5:
        test    r8, r8
        je      .LBB0_8
        lea     rax, [rsi + 8*rdx]
        shl     rdx, 5
        add     rdi, rdx
        shl     r8, 3
        xor     ecx, ecx
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        vbroadcastsd    ymm0, qword ptr [rax + rcx]
        vmulpd  ymm0, ymm0, ymmword ptr [rdi + 4*rcx]
        vmovupd ymmword ptr [rdi + 4*rcx], ymm0
        add     rcx, 8
        cmp     r8, rcx
        jne     .LBB0_7
.LBB0_8:
        vzeroupper
        ret

fhahn added a commit to fhahn/llvm-project that referenced this issue Aug 28, 2024
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on llvm#106431.

Fixes llvm#82936
fhahn (Contributor) commented Sep 12, 2024:

The reason for the change is that LoopVectorize is now able to vectorize. In this particular case, SLP vectorization would be sufficient to make full use of the available vector registers, which is why the codegen is worse now.

Sketched a potential improvement in #106441

fhahn added a commit to fhahn/llvm-project that referenced this issue Sep 25, 2024
fhahn added a commit that referenced this issue Jan 10, 2025
github-actions bot pushed a commit to arm/arm-toolchain that referenced this issue Jan 10, 2025
BaiXilin pushed a commit to BaiXilin/llvm-fix-vnni-instr-types that referenced this issue Jan 12, 2025
fhahn closed this as completed in dfa665f Mar 22, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this issue Mar 22, 2025
fhahn added a commit that referenced this issue Mar 25, 2025

This reverts commit ff3e2ba.

The recommitted version limits the transform to cases where no
interleaving is taking place, to avoid a miscompile when interleaving.

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this issue Mar 25, 2025