Improve performance of various intrinsics #549
Conversation
Some compilers will not optimise `vld1q_f32({a, 0, 0, 0})` to a `movi` + `mov`, so explicitly implement `_mm_set_ss` as such.

Example codegen for `_mm_set_ss(a)` with GCC 11.2.0 (-O3):

Prior to commit:

    sub  sp, sp, 0x10
    stp  xzr, xzr, [sp]
    str  s0, [sp]
    ldr  q0, [sp]
    add  sp, sp, 0x10

After commit:

    movi v1.4s, 0x0
    mov  v1.s[0], v0.s[0]
    mov  v0.16b, v1.16b
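A minimal sketch of this approach in plain Neon C (the function name and the use of raw `float32x4_t` rather than sse2neon's `__m128` wrappers are illustrative):

```c
#include <arm_neon.h>

// Insert the scalar into lane 0 of a zeroed vector. Compilers lower this
// to a movi + lane mov instead of bouncing the value through the stack.
static inline float32x4_t set_ss_sketch(float a)
{
    return vsetq_lane_f32(a, vdupq_n_f32(0.0f), 0);
}
```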
`vset[q]_lane_f64` is only supported on A64, so use it for `_mm_set_sd` only when targeting A64.

Example codegen for `_mm_set_sd(a)` with GCC 11.2.0 (-O3):

Prior to commit:

    sub sp, sp, 0x10
    str d0, [sp]
    str xzr, [sp, 8]
    ldr q0, [sp]
    add sp, sp, 0x10

After commit:

    mov d0, v0.d[0]
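A minimal sketch of the A64-only form (illustrative name, raw Neon types rather than sse2neon's `__m128d` wrapper):

```c
#include <arm_neon.h>

#if defined(__aarch64__)
// vsetq_lane_f64 is only available on A64: insert the scalar into lane 0
// of a zeroed double vector, leaving lane 1 as zero.
static inline float64x2_t set_sd_sketch(double a)
{
    return vsetq_lane_f64(a, vdupq_n_f64(0.0), 0);
}
#endif
```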
Use 128-bit `smull`/`smull2` and `addp` for the `_mm_madd_epi16` intrinsic if we are on A64.

Example codegen for `_mm_madd_epi16(a, b)` with GCC 11.2.0 (-O3):

Prior to commit:

    smull  v2.4s, v0.4h, v1.4h
    smull2 v0.4s, v0.8h, v1.8h
    mov    d3, v2.d[1]
    mov    d1, v0.d[1]
    addp   v2.2s, v2.2s, v3.2s
    addp   v1.2s, v0.2s, v1.2s
    mov    d0, v2.d[0]
    mov    v0.d[1], v1.d[0]

After commit:

    smull  v2.4s, v0.4h, v1.4h
    smull2 v0.4s, v0.8h, v1.8h
    addp   v0.4s, v2.4s, v0.4s
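A rough equivalent of the A64 path in Neon C, assuming raw vector types (sse2neon additionally wraps these in its reinterpret casts):

```c
#include <arm_neon.h>

#if defined(__aarch64__)
// Widen-multiply the low and high halves, then pair-add adjacent products
// with a single 128-bit addp instead of two 64-bit halves.
static inline int32x4_t madd_epi16_sketch(int16x8_t a, int16x8_t b)
{
    int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b)); // smull
    int32x4_t hi = vmull_high_s16(a, b);                        // smull2
    return vpaddq_s32(lo, hi);                                  // addp
}
#endif
```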
Use a single 128-bit `addp` for the `_mm_hadd_epi32` intrinsic on A64.

Example codegen for `_mm_hadd_epi32(a, b)` with GCC 11.2.0 (-O3):

Prior to commit:

    mov  d3, v0.d[1]
    mov  d2, v1.d[1]
    addp v0.2s, v0.2s, v3.2s
    addp v1.2s, v1.2s, v2.2s
    mov  d0, v0.d[0]
    mov  v0.d[1], v1.d[0]

After commit:

    addp v0.4s, v0.4s, v1.4s
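A sketch of the corresponding Neon call (illustrative naming; `vpaddq_s32` is A64-only):

```c
#include <arm_neon.h>

#if defined(__aarch64__)
// One 128-bit pairwise add yields {a0+a1, a2+a3, b0+b1, b2+b3} directly.
static inline int32x4_t hadd_epi32_sketch(int32x4_t a, int32x4_t b)
{
    return vpaddq_s32(a, b);
}
#endif
```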
Not all compilers seem to fuse the `mvn` + `and` pair into a single `bic`, so explicitly call the bitwise-clear intrinsic in `_mm_testc_si128`.

Example codegen for `_mm_testc_si128(a, b)` with GCC 11.2.0 (-O3):

Prior to commit:

    mvn  v0.16b, v0.16b
    and  v1.16b, v1.16b, v0.16b
    fmov d0, d1
    mov  d1, v1.d[1]
    orr  v0.8b, v0.8b, v1.8b
    fmov x0, d0
    cmp  x0, 0x0
    cset w0, eq  // eq = none

After commit:

    bic  v0.16b, v1.16b, v0.16b
    fmov d1, d0
    mov  d0, v0.d[1]
    orr  v0.8b, v1.8b, v0.8b
    fmov x0, d0
    cmp  x0, 0x0
    cset w0, eq  // eq = none
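A sketch of the idea with a hypothetical helper name (sse2neon's actual implementation operates on its own vector typedefs):

```c
#include <arm_neon.h>

// _mm_testc_si128 semantics: return 1 iff (~a & b) has no bits set.
// Calling the bitwise-clear intrinsic emits a single bic instruction
// rather than relying on the compiler to fuse mvn + and.
static inline int testc_sketch(uint64x2_t a, uint64x2_t b)
{
    uint64x2_t cleared = vbicq_u64(b, a);  // b & ~a
    return !(vgetq_lane_u64(cleared, 0) | vgetq_lane_u64(cleared, 1));
}
```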
This results in no code-generation difference with GCC 11.2.0, since `__builtin_shuffle` support was added. However, it is still worth adding an explicit `trn1`/`trn2` implementation for compilers that do not have `__builtin_shuffle` or `__builtin_shufflevector`, and to avoid potentially suboptimal code generation.
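For reference, a minimal sketch of the explicit `trn1`/`trn2` forms (illustrative names; `vtrn1q_f32`/`vtrn2q_f32` are A64-only):

```c
#include <arm_neon.h>

#if defined(__aarch64__)
// Transposing a vector with itself duplicates its even (trn1) or odd (trn2)
// lanes, matching _mm_moveldup_ps and _mm_movehdup_ps respectively.
static inline float32x4_t moveldup_sketch(float32x4_t a)
{
    return vtrn1q_f32(a, a);  // {a0, a0, a2, a2}
}

static inline float32x4_t movehdup_sketch(float32x4_t a)
{
    return vtrn2q_f32(a, a);  // {a1, a1, a3, a3}
}
#endif
```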
Use the `uabdl` instruction instead of `usubl` + `abs` for computing the absolute differences in `_mm_mpsadbw_epu8`. Additionally, eliminate the pseudo-dependency between vector extracts.

Example codegen for `_mm_mpsadbw_epu8(a, b, 0xb)` with GCC 11.2.0 (-O3):

Prior to commit:

    dup   v1.4s, v1.s[3]
    ext   v2.16b, v0.16b, v0.16b, 1
    mov   v3.8b, v1.8b
    usubl v1.8h, v0.8b, v1.8b
    ext   v0.16b, v2.16b, v2.16b, 1
    usubl v2.8h, v2.8b, v3.8b
    abs   v1.8h, v1.8h
    ext   v4.16b, v0.16b, v0.16b, 1
    usubl v0.8h, v0.8b, v3.8b
    abs   v2.8h, v2.8h
    usubl v3.8h, v4.8b, v3.8b
    abs   v0.8h, v0.8h
    addp  v1.8h, v1.8h, v0.8h
    abs   v0.8h, v3.8h
    addp  v2.8h, v2.8h, v0.8h
    trn1  v0.4s, v1.4s, v2.4s
    trn2  v1.4s, v1.4s, v2.4s
    addp  v0.8h, v0.8h, v1.8h

After commit:

    dup   v1.4s, v1.s[3]
    ext   v4.16b, v0.16b, v0.16b, 2
    ext   v5.16b, v0.16b, v0.16b, 3
    ext   v2.16b, v0.16b, v0.16b, 1
    mov   v3.8b, v1.8b
    uabdl v1.8h, v0.8b, v1.8b
    mov   v0.8b, v4.8b
    uabdl v2.8h, v2.8b, v3.8b
    uabdl v4.8h, v5.8b, v3.8b
    uabdl v0.8h, v0.8b, v3.8b
    addp  v2.8h, v2.8h, v4.8h
    addp  v1.8h, v1.8h, v0.8h
    trn1  v0.4s, v1.4s, v2.4s
    trn2  v1.4s, v1.4s, v2.4s
    addp  v0.8h, v0.8h, v1.8h
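A small sketch of the core replacement only (the full intrinsic performs four such widened absolute-difference computations followed by pairwise adds; the helper name is hypothetical):

```c
#include <arm_neon.h>

// uabdl: widened absolute difference of unsigned bytes in one instruction,
// replacing the previous usubl + abs pair.
static inline uint16x8_t abs_diff_sketch(uint8x8_t a, uint8x8_t b)
{
    return vabdl_u8(a, b);
}
```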
    @@ -6627,7 +6637,10 @@ FORCE_INLINE __m128d _mm_movedup_pd(__m128d a)
    // https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_movehdup_ps
    FORCE_INLINE __m128 _mm_movehdup_ps(__m128 a)
    {
    #ifdef _sse2neon_shuffle
    #if defined(__aarch64__)
This optimization looks great. I wonder if it can be applied to other intrinsic variants.
This in particular doesn't actually result in better codegen on GCC/Clang (at least at a high enough optimisation level), since the `__builtin_shuffle` or `__builtin_shufflevector` call already compiles to a `trn2`, although this is definitely an improvement for compilers that don't support these builtins.

It may be possible to do something similar for other intrinsics. I haven't spotted anything obvious, but I have not looked at every intrinsic. We could go through the other implementations that use `__builtin_shufflevector` and attempt to implement them via explicit Neon intrinsic calls instead, though I'm not sure whether that would always be beneficial or worth doing.
Thanks @AymenQ for contributing!
This PR contains a set of minor commits that improve the performance of various intrinsics.
Performance is improved for the following intrinsics:
_mm_set_ss
_mm_madd_epi16
_mm_testc_si128
_mm_move[l,h]dup_ps
_mm_hadd_epi32
_mm_set_sd
_mm_mpsadbw_epu8
Please see each individual commit message for a brief summary and the corresponding change in code generation.