[X86] Delaying widening results in an unnecessary vpmovsxwd copy #144266

Open
@dzaima

This C code:

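#include <immintrin.h>
#include <stdint.h>

__m256i iter(int16_t* src1p) {
    __m256i ten = _mm256_set1_epi32(10);
    // Load eight int16_t values and sign-extend them to eight int32_t lanes.
    __m256i wload = _mm256_cvtepi16_epi32(_mm_loadu_si128((void*)src1p));
    // Each lane of mask is -1 where wload > 10, else 0.
    __m256i mask = _mm256_cmpgt_epi32(wload, ten);
    return _mm256_add_epi32(wload, mask);
}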

compiled with -O3 -march=haswell, results in:

iter:
        vmovdqu xmm0, xmmword ptr [rdi]
        vpmovsxwd       ymm1, xmm0
        vpcmpgtw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpmovsxwd       ymm0, xmm0
        vpaddd  ymm0, ymm0, ymm1
        ret

but it could be:

iter:
        vpmovsxwd       ymm0, xmmword ptr [rdi]
        vpbroadcastd    ymm1, dword ptr [rip + .LCPI0_0]
        vpcmpgtd        ymm1, ymm0, ymm1
        vpaddd  ymm0, ymm1, ymm0
        ret

This avoids emitting two vpmovsxwd instructions, and lets the remaining one take its input directly as a memory operand instead of needing a separate vmovdqu load.

https://godbolt.org/z/Ezrf9YbYn
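
As a rough illustration of why the two sequences above are interchangeable, here is a scalar sketch that models a single lane of each: compare at 16 bits and then sign-extend the mask (the emitted code), versus sign-extend first and compare at 32 bits (the suggested code). The helper names lane_narrow_compare, lane_wide_compare and count_lane_mismatches are hypothetical and exist only for this example; the sketch models the arithmetic only and says nothing about how the backend actually performs the transform.

```c
#include <stdint.h>

/* One lane of the emitted sequence: vpcmpgtw (16-bit compare),
   then vpmovsxwd (sign-extend the 16-bit mask), then vpaddd. */
static int32_t lane_narrow_compare(int16_t x) {
    int16_t mask16 = (x > 10) ? (int16_t)-1 : (int16_t)0;
    int32_t mask32 = (int32_t)mask16; /* sign extension keeps -1 as -1 */
    return (int32_t)x + mask32;
}

/* One lane of the suggested sequence: vpmovsxwd (widen the input),
   then vpcmpgtd (32-bit compare), then vpaddd. */
static int32_t lane_wide_compare(int16_t x) {
    int32_t wide = (int32_t)x;
    int32_t mask32 = (wide > 10) ? -1 : 0;
    return wide + mask32;
}

/* Counts the int16_t inputs for which the two models disagree. */
int count_lane_mismatches(void) {
    int mismatches = 0;
    for (int32_t v = INT16_MIN; v <= INT16_MAX; ++v) {
        int16_t x = (int16_t)v;
        if (lane_narrow_compare(x) != lane_wide_compare(x)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Since sign extension preserves signed ordering and maps an all-ones 16-bit mask to an all-ones 32-bit mask, the two models should agree on every input, so the only difference between the sequences is the instruction count and the extra load.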
