
Improve performance of various intrinsics #549

Merged: 7 commits into DLTcollab:master on Oct 31, 2022

Conversation

@AymenQ (Collaborator) commented Oct 29, 2022

This PR contains a set of minor commits that improve the performance of various intrinsics.

Performance was improved for the following intrinsics:

  • _mm_set_ss
  • _mm_madd_epi16
  • _mm_testc_si128
  • _mm_move[l,h]dup_ps
  • _mm_hadd_epi32
  • _mm_set_sd
  • _mm_mpsadbw_epu8

Please see each individual commit message for a brief summary and the resulting change in code generation.

Some compilers will not optimise vld1q_f32({a, 0, 0, 0}) to a movi +
mov, so explicitly implement _mm_set_ss as such.

Example codegen for _mm_set_ss(a) with GCC 11.2.0 (-O3):

Prior to commit:
    sub     sp, sp, 0x10
    stp     xzr, xzr, [sp]
    str     s0, [sp]
    ldr     q0, [sp]
    add     sp, sp, 0x10

After commit:
    movi    v1.4s, 0x0
    mov     v1.s[0], v0.s[0]
    mov     v0.16b, v1.16b
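
The explicit construction maps directly onto two Neon intrinsic calls. A minimal sketch, assuming plain Neon types and an illustrative helper name rather than sse2neon's actual __m128 wrappers:

    #include <arm_neon.h>

    /* Zero the vector (compiles to movi), then insert the scalar into
     * lane 0 (mov v.s[0], ...), matching the post-commit codegen. */
    static inline float32x4_t set_ss_sketch(float a)
    {
        return vsetq_lane_f32(a, vdupq_n_f32(0.0f), 0);
    }
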
vset[q]_lane_f64 is only supported on A64, so the improved _mm_set_sd implementation is limited to A64.

Example codegen for _mm_set_sd(a) with GCC 11.2.0 (-O3):

Prior to commit:
    sub     sp, sp, 0x10
    str     d0, [sp]
    str     xzr, [sp, 8]
    ldr     q0, [sp]
    add     sp, sp, 0x10

After commit:
    mov     d0, v0.d[0]
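
A minimal sketch of the A64-only path, again with an illustrative helper name and plain Neon types:

    #include <arm_neon.h>

    /* A64 only: vsetq_lane_f64 places the scalar in lane 0 of a zeroed
     * vector, which the compiler can reduce to a single move. */
    static inline float64x2_t set_sd_sketch(double a)
    {
        return vsetq_lane_f64(a, vdupq_n_f64(0.0), 0);
    }
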
Use 128-bit `smull`/`smull2` and `addp` for the _mm_madd_epi16
intrinsic if we are on A64.

Example codegen for _mm_madd_epi16(a, b) with GCC 11.2.0 (-O3):

Prior to commit:
    smull   v2.4s, v0.4h, v1.4h
    smull2  v0.4s, v0.8h, v1.8h
    mov     d3, v2.d[1]
    mov     d1, v0.d[1]
    addp    v2.2s, v2.2s, v3.2s
    addp    v1.2s, v0.2s, v1.2s
    mov     d0, v2.d[0]
    mov     v0.d[1], v1.d[0]

After commit:
    smull   v2.4s, v0.4h, v1.4h
    smull2  v0.4s, v0.8h, v1.8h
    addp    v0.4s, v2.4s, v0.4s
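
A sketch of the A64 approach described above (the helper name and direct use of Neon types are illustrative, not the library's actual code):

    #include <arm_neon.h>

    /* Widening multiplies of the low and high halves (smull/smull2),
     * then one 128-bit pairwise add (addp v.4s) of adjacent products. */
    static inline int32x4_t madd_epi16_sketch(int16x8_t a, int16x8_t b)
    {
        int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));
        int32x4_t hi = vmull_high_s16(a, b);
        return vpaddq_s32(lo, hi);
    }
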
The same single 128-bit `addp` also handles _mm_hadd_epi32.

Example codegen for _mm_hadd_epi32(a, b) with GCC 11.2.0 (-O3):

Prior to commit:
    mov     d3, v0.d[1]
    mov     d2, v1.d[1]
    addp    v0.2s, v0.2s, v3.2s
    addp    v1.2s, v1.2s, v2.2s
    mov     d0, v0.d[0]
    mov     v0.d[1], v1.d[0]

After commit:
    addp    v0.4s, v0.4s, v1.4s
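
The single instruction corresponds to the 128-bit pairwise-add intrinsic; a sketch with an illustrative helper name:

    #include <arm_neon.h>

    /* A64 only: one addp v.4s performs the horizontal add across both inputs. */
    static inline int32x4_t hadd_epi32_sketch(int32x4_t a, int32x4_t b)
    {
        return vpaddq_s32(a, b);
    }
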
Not all compilers fold the `mvn` + `and` sequence into a single `bic`, so explicitly call the bitwise-clear intrinsic in _mm_testc_si128.

Example codegen for _mm_testc_si128(a, b) with GCC 11.2.0 (-O3):

Prior to commit:
    mvn     v0.16b, v0.16b
    and     v1.16b, v1.16b, v0.16b
    fmov    d0, d1
    mov     d1, v1.d[1]
    orr     v0.8b, v0.8b, v1.8b
    fmov    x0, d0
    cmp     x0, 0x0
    cset    w0, eq  // eq = none

After commit:
    bic     v0.16b, v1.16b, v0.16b
    fmov    d1, d0
    mov     d0, v0.d[1]
    orr     v0.8b, v1.8b, v0.8b
    fmov    x0, d0
    cmp     x0, 0x0
    cset    w0, eq  // eq = none
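
A rough sketch of the bitwise-clear form (the helper name and plain Neon types are assumptions, not sse2neon's actual wrapper code):

    #include <arm_neon.h>

    /* _mm_testc_si128 returns 1 iff (~a & b) is all zero. vbicq_u64(b, a)
     * computes b & ~a with a single bic, avoiding the mvn + and pair. */
    static inline int testc_si128_sketch(uint64x2_t a, uint64x2_t b)
    {
        uint64x2_t cleared = vbicq_u64(b, a);
        return (vgetq_lane_u64(cleared, 0) | vgetq_lane_u64(cleared, 1)) == 0;
    }
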
This results in no code generation difference with GCC 11.2.0, since __builtin_shuffle support was added. However, it is worth adding an explicit trn1/trn2 implementation of _mm_move[l,h]dup_ps regardless, both for compilers that do not have __builtin_shuffle or __builtin_shufflevector and to avoid potentially suboptimal code generation.
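
A sketch of the explicit form for both variants, using the A64 trn intrinsics (helper names are illustrative):

    #include <arm_neon.h>

    /* trn2 with the same register in both operands duplicates the odd lanes,
     * i.e. {a1, a1, a3, a3}, which is _mm_movehdup_ps; trn1 duplicates the
     * even lanes, i.e. {a0, a0, a2, a2}, which is _mm_moveldup_ps. */
    static inline float32x4_t movehdup_ps_sketch(float32x4_t a)
    {
        return vtrn2q_f32(a, a);
    }

    static inline float32x4_t moveldup_ps_sketch(float32x4_t a)
    {
        return vtrn1q_f32(a, a);
    }
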
Use the `uabdl` instruction instead of `usubl` + `abs` for computing the absolute differences in _mm_mpsadbw_epu8. Additionally, eliminate a pseudo-dependency between the vector extracts.

Example codegen for _mm_mpsadbw_epu8(a, b, 0xb) with GCC 11.2.0 (-O3):

Prior to commit:
    dup     v1.4s, v1.s[3]
    ext     v2.16b, v0.16b, v0.16b, 1
    mov     v3.8b, v1.8b
    usubl   v1.8h, v0.8b, v1.8b
    ext     v0.16b, v2.16b, v2.16b, 1
    usubl   v2.8h, v2.8b, v3.8b
    abs     v1.8h, v1.8h
    ext     v4.16b, v0.16b, v0.16b, 1
    usubl   v0.8h, v0.8b, v3.8b
    abs     v2.8h, v2.8h
    usubl   v3.8h, v4.8b, v3.8b
    abs     v0.8h, v0.8h
    addp    v1.8h, v1.8h, v0.8h
    abs     v0.8h, v3.8h
    addp    v2.8h, v2.8h, v0.8h
    trn1    v0.4s, v1.4s, v2.4s
    trn2    v1.4s, v1.4s, v2.4s
    addp    v0.8h, v0.8h, v1.8h

After commit:
    dup     v1.4s, v1.s[3]
    ext     v4.16b, v0.16b, v0.16b, 2
    ext     v5.16b, v0.16b, v0.16b, 3
    ext     v2.16b, v0.16b, v0.16b, 1
    mov     v3.8b, v1.8b
    uabdl   v1.8h, v0.8b, v1.8b
    mov     v0.8b, v4.8b
    uabdl   v2.8h, v2.8b, v3.8b
    uabdl   v4.8h, v5.8b, v3.8b
    uabdl   v0.8h, v0.8b, v3.8b
    addp    v2.8h, v2.8h, v4.8h
    addp    v1.8h, v1.8h, v0.8h
    trn1    v0.4s, v1.4s, v2.4s
    trn2    v1.4s, v1.4s, v2.4s
    addp    v0.8h, v0.8h, v1.8h
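
A sketch of just the absolute-difference step that changed; the surrounding ext/addp/trn logic of _mm_mpsadbw_epu8 is omitted and the helper name is illustrative:

    #include <arm_neon.h>

    /* vabdl_u8 emits a single uabdl, fusing the widening subtract (usubl)
     * and absolute value (abs) into one instruction. */
    static inline uint16x8_t abs_diff_widen_sketch(uint8x8_t a, uint8x8_t b)
    {
        return vabdl_u8(a, b);
    }
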
Review comment on the _mm_movehdup_ps diff:

@@ -6627,7 +6637,10 @@ FORCE_INLINE __m128d _mm_movedup_pd(__m128d a)
// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_movehdup_ps
FORCE_INLINE __m128 _mm_movehdup_ps(__m128 a)
{
#ifdef _sse2neon_shuffle
#if defined(__aarch64__)
A Member commented:
This optimization looks great. I wonder if it can be applied to other intrinsic variants.

@AymenQ (Collaborator, Author) replied on Oct 31, 2022:

This in particular doesn't actually result in better codegen on gcc/clang (at least with a high enough optimisation level), since the __builtin_shuffle or __builtin_shufflevector call already compiles to a trn2; it is, however, definitely an improvement for compilers that don't support these builtins.

It may be possible to do something similar for other intrinsics. I haven't spotted anything obvious, but I have not looked at every intrinsic. We can go through other implementations that use __builtin_shufflevector and attempt to implement them via explicit Neon intrinsic calls instead, though I'm not sure whether that will always be beneficial or worth doing.

@jserv requested a review from howjmay on October 30, 2022 13:54
@jserv merged commit 9eeae30 into DLTcollab:master on Oct 31, 2022
@jserv (Member) commented Oct 31, 2022

Thank @AymenQ for contributing!
