This C code:
#include <immintrin.h>
#include <stdint.h>

__m256i iter(int16_t* src1p) {
    __m256i ten = _mm256_set1_epi32(10);
    __m256i wload = _mm256_cvtepi16_epi32(_mm_loadu_si128((void*)src1p));
    __m256i mask = _mm256_cmpgt_epi32(wload, ten);
    return _mm256_add_epi32(wload, mask);
}
compiled with -O3 -march=haswell, results in:
iter:
vmovdqu xmm0, xmmword ptr [rdi]
vpmovsxwd ymm1, xmm0
vpcmpgtw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmovsxwd ymm0, xmm0
vpaddd ymm0, ymm0, ymm1
ret
but it could be
iter:
vpmovsxwd ymm0, xmmword ptr [rdi]
vpbroadcastd ymm1, dword ptr [rip + .LCPI0_0]
vpcmpgtd ymm1, ymm0, ymm1
vpaddd ymm0, ymm1, ymm0
ret
avoiding having two vpmovsxwds, and allowing the one that's left to have the memory operand inline.
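For reference, here is a scalar sketch of why both sequences are interchangeable. It models, lane by lane, the compare-at-32-bit form from the source and the compare-at-16-bit-then-sign-extend form the compiler currently emits. The function names (iter_wide_model, iter_narrow_model, models_agree, check_boundaries) are made up for this illustration and the code is plain C rather than intrinsics, so it only demonstrates value equivalence, not codegen.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the source as written: sign-extend each lane to 32 bits,
   compare against 10 at 32-bit width, then add the 0 / -1 mask. */
void iter_wide_model(const int16_t src[8], int32_t out[8]) {
    for (int i = 0; i < 8; i++) {
        int32_t w = (int32_t)src[i];
        int32_t mask = (w > 10) ? -1 : 0;
        out[i] = w + mask;
    }
}

/* Model of the emitted sequence: compare at 16-bit width, sign-extend
   the 16-bit mask to 32 bits, then add it to the sign-extended value. */
void iter_narrow_model(const int16_t src[8], int32_t out[8]) {
    for (int i = 0; i < 8; i++) {
        int16_t mask16 = (src[i] > 10) ? (int16_t)-1 : (int16_t)0;
        int32_t mask = (int32_t)mask16;
        out[i] = (int32_t)src[i] + mask;
    }
}

/* Returns nonzero when both models produce identical lanes. */
int models_agree(const int16_t src[8]) {
    int32_t a[8], b[8];
    iter_wide_model(src, a);
    iter_narrow_model(src, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Checks values around the threshold and at the int16_t extremes. */
void check_boundaries(void) {
    const int16_t src[8] = {9, 10, 11, 12, -1, 0, INT16_MAX, INT16_MIN};
    assert(models_agree(src));
}
```

Since the two forms agree on every lane, narrowing the compare is a legal transform; the complaint here is only that it costs a second vpmovsxwd and blocks folding the load.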