
Improve performance of various intrinsics #549

Merged: 7 commits into DLTcollab:master on Oct 31, 2022

Conversation

@AymenQ (Collaborator) commented Oct 29, 2022

This PR contains a set of minor commits that improve the performance of various intrinsics.

Performance was improved for the following intrinsics:

  • _mm_set_ss
  • _mm_madd_epi16
  • _mm_testc_si128
  • _mm_move[l,h]dup_ps
  • _mm_hadd_epi32
  • _mm_set_sd
  • _mm_mpsadbw_epu8

Please see each individual commit message for a brief summary and the resulting change in code generation.

Some compilers will not optimise vld1q_f32({a, 0, 0, 0}) to a movi +
mov, so explicitly implement _mm_set_ss as such.

Example codegen for _mm_set_ss(a) with GCC 11.2.0 (-O3):

Prior to commit:
    sub     sp, sp, 0x10
    stp     xzr, xzr, [sp]
    str     s0, [sp]
    ldr     q0, [sp]
    add     sp, sp, 0x10

After commit:
    movi    v1.4s, 0x0
    mov     v1.s[0], v0.s[0]
    mov     v0.16b, v1.16b
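
The explicit construction maps directly onto two Neon intrinsic calls. A minimal sketch, assuming plain Neon types and an illustrative helper name rather than sse2neon's actual __m128 wrappers:

    #include <arm_neon.h>

    /* Zero the vector (compiles to movi), then insert the scalar into
     * lane 0 (mov v.s[0], ...), matching the post-commit codegen. */
    static inline float32x4_t set_ss_sketch(float a)
    {
        return vsetq_lane_f32(a, vdupq_n_f32(0.0f), 0);
    }
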
vset[q]_lane_f64 is only supported on A64, so the improved _mm_set_sd implementation is limited to A64.

Example codegen for _mm_set_sd(a) with GCC 11.2.0 (-O3):

Prior to commit:
    sub     sp, sp, 0x10
    str     d0, [sp]
    str     xzr, [sp, 8]
    ldr     q0, [sp]
    add     sp, sp, 0x10

After commit:
    mov     d0, v0.d[0]
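
A minimal sketch of the A64-only path, again with an illustrative helper name and plain Neon types:

    #include <arm_neon.h>

    /* A64 only: vsetq_lane_f64 places the scalar in lane 0 of a zeroed
     * vector, which the compiler can reduce to a single move. */
    static inline float64x2_t set_sd_sketch(double a)
    {
        return vsetq_lane_f64(a, vdupq_n_f64(0.0), 0);
    }
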
Use 128-bit `smull`/`smull2` and `addp` for the _mm_madd_epi16
intrinsic if we are on A64.

Example codegen for _mm_madd_epi16(a, b) with GCC 11.2.0 (-O3):

Prior to commit:
    smull   v2.4s, v0.4h, v1.4h
    smull2  v0.4s, v0.8h, v1.8h
    mov     d3, v2.d[1]
    mov     d1, v0.d[1]
    addp    v2.2s, v2.2s, v3.2s
    addp    v1.2s, v0.2s, v1.2s
    mov     d0, v2.d[0]
    mov     v0.d[1], v1.d[0]

After commit:
    smull   v2.4s, v0.4h, v1.4h
    smull2  v0.4s, v0.8h, v1.8h
    addp    v0.4s, v2.4s, v0.4s
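
A sketch of the A64 approach described above (the helper name and direct use of Neon types are illustrative, not the library's actual code):

    #include <arm_neon.h>

    /* Widening multiplies of the low and high halves (smull/smull2),
     * then one 128-bit pairwise add (addp v.4s) of adjacent products. */
    static inline int32x4_t madd_epi16_sketch(int16x8_t a, int16x8_t b)
    {
        int32x4_t lo = vmull_s16(vget_low_s16(a), vget_low_s16(b));
        int32x4_t hi = vmull_high_s16(a, b);
        return vpaddq_s32(lo, hi);
    }
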
The same single 128-bit `addp` also handles _mm_hadd_epi32.

Example codegen for _mm_hadd_epi32(a, b) with GCC 11.2.0 (-O3):

Prior to commit:
    mov     d3, v0.d[1]
    mov     d2, v1.d[1]
    addp    v0.2s, v0.2s, v3.2s
    addp    v1.2s, v1.2s, v2.2s
    mov     d0, v0.d[0]
    mov     v0.d[1], v1.d[0]

After commit:
    addp    v0.4s, v0.4s, v1.4s
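
The single instruction corresponds to the 128-bit pairwise-add intrinsic; a sketch with an illustrative helper name:

    #include <arm_neon.h>

    /* A64 only: one addp v.4s performs the horizontal add across both inputs. */
    static inline int32x4_t hadd_epi32_sketch(int32x4_t a, int32x4_t b)
    {
        return vpaddq_s32(a, b);
    }
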
Not all compilers fold the `mvn` + `and` sequence into a single `bic`, so explicitly call the bitwise-clear intrinsic in _mm_testc_si128.

Example codegen for _mm_testc_si128(a, b) with GCC 11.2.0 (-O3):

Prior to commit:
    mvn     v0.16b, v0.16b
    and     v1.16b, v1.16b, v0.16b
    fmov    d0, d1
    mov     d1, v1.d[1]
    orr     v0.8b, v0.8b, v1.8b
    fmov    x0, d0
    cmp     x0, 0x0
    cset    w0, eq  // eq = none

After commit:
    bic     v0.16b, v1.16b, v0.16b
    fmov    d1, d0
    mov     d0, v0.d[1]
    orr     v0.8b, v1.8b, v0.8b
    fmov    x0, d0
    cmp     x0, 0x0
    cset    w0, eq  // eq = none
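
A rough sketch of the bitwise-clear form (the helper name and plain Neon types are assumptions, not sse2neon's actual wrapper code):

    #include <arm_neon.h>

    /* _mm_testc_si128 returns 1 iff (~a & b) is all zero. vbicq_u64(b, a)
     * computes b & ~a with a single bic, avoiding the mvn + and pair. */
    static inline int testc_si128_sketch(uint64x2_t a, uint64x2_t b)
    {
        uint64x2_t cleared = vbicq_u64(b, a);
        return (vgetq_lane_u64(cleared, 0) | vgetq_lane_u64(cleared, 1)) == 0;
    }
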
This results in no code generation difference with GCC 11.2.0, since __builtin_shuffle support was added. However, it is worth adding an explicit trn1/trn2 implementation of _mm_move[l,h]dup_ps regardless, both for compilers that do not have __builtin_shuffle or __builtin_shufflevector and to avoid potentially suboptimal code generation.
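
A sketch of the explicit form for both variants, using the A64 trn intrinsics (helper names are illustrative):

    #include <arm_neon.h>

    /* trn2 with the same register in both operands duplicates the odd lanes,
     * i.e. {a1, a1, a3, a3}, which is _mm_movehdup_ps; trn1 duplicates the
     * even lanes, i.e. {a0, a0, a2, a2}, which is _mm_moveldup_ps. */
    static inline float32x4_t movehdup_ps_sketch(float32x4_t a)
    {
        return vtrn2q_f32(a, a);
    }

    static inline float32x4_t moveldup_ps_sketch(float32x4_t a)
    {
        return vtrn1q_f32(a, a);
    }
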
Use the `uabdl` instruction instead of `usubl` + `abs` for computing the absolute differences in _mm_mpsadbw_epu8. Additionally, eliminate a pseudo-dependency between the vector extracts.

Example codegen for _mm_mpsadbw_epu8(a, b, 0xb) with GCC 11.2.0 (-O3):

Prior to commit:
    dup     v1.4s, v1.s[3]
    ext     v2.16b, v0.16b, v0.16b, 1
    mov     v3.8b, v1.8b
    usubl   v1.8h, v0.8b, v1.8b
    ext     v0.16b, v2.16b, v2.16b, 1
    usubl   v2.8h, v2.8b, v3.8b
    abs     v1.8h, v1.8h
    ext     v4.16b, v0.16b, v0.16b, 1
    usubl   v0.8h, v0.8b, v3.8b
    abs     v2.8h, v2.8h
    usubl   v3.8h, v4.8b, v3.8b
    abs     v0.8h, v0.8h
    addp    v1.8h, v1.8h, v0.8h
    abs     v0.8h, v3.8h
    addp    v2.8h, v2.8h, v0.8h
    trn1    v0.4s, v1.4s, v2.4s
    trn2    v1.4s, v1.4s, v2.4s
    addp    v0.8h, v0.8h, v1.8h

After commit:
    dup     v1.4s, v1.s[3]
    ext     v4.16b, v0.16b, v0.16b, 2
    ext     v5.16b, v0.16b, v0.16b, 3
    ext     v2.16b, v0.16b, v0.16b, 1
    mov     v3.8b, v1.8b
    uabdl   v1.8h, v0.8b, v1.8b
    mov     v0.8b, v4.8b
    uabdl   v2.8h, v2.8b, v3.8b
    uabdl   v4.8h, v5.8b, v3.8b
    uabdl   v0.8h, v0.8b, v3.8b
    addp    v2.8h, v2.8h, v4.8h
    addp    v1.8h, v1.8h, v0.8h
    trn1    v0.4s, v1.4s, v2.4s
    trn2    v1.4s, v1.4s, v2.4s
    addp    v0.8h, v0.8h, v1.8h
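
A sketch of just the absolute-difference step that changed; the surrounding ext/addp/trn logic of _mm_mpsadbw_epu8 is omitted and the helper name is illustrative:

    #include <arm_neon.h>

    /* vabdl_u8 emits a single uabdl, fusing the widening subtract (usubl)
     * and absolute value (abs) into one instruction. */
    static inline uint16x8_t abs_diff_widen_sketch(uint8x8_t a, uint8x8_t b)
    {
        return vabdl_u8(a, b);
    }
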
Review comment on the _mm_movehdup_ps diff:

@@ -6627,7 +6637,10 @@ FORCE_INLINE __m128d _mm_movedup_pd(__m128d a)
// https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_movehdup_ps
FORCE_INLINE __m128 _mm_movehdup_ps(__m128 a)
{
#ifdef _sse2neon_shuffle
#if defined(__aarch64__)
A Member commented:
This optimization looks great. I wonder if it can be applied to other intrinsic variants.

@AymenQ (Collaborator, Author) replied on Oct 31, 2022:

This in particular doesn't actually result in better codegen on gcc/clang (at least with a high enough optimisation level), since the __builtin_shuffle or __builtin_shufflevector call already compiles to a trn2; it is, however, definitely an improvement for compilers that don't support these builtins.

It may be possible to do something similar for other intrinsics. I haven't spotted anything obvious, but I have not looked at every intrinsic. We can go through other implementations that use __builtin_shufflevector and attempt to implement them via explicit Neon intrinsic calls instead, though I'm not sure whether that will always be beneficial or worth doing.

@jserv requested a review from howjmay on October 30, 2022 13:54
@jserv merged commit 9eeae30 into DLTcollab:master on Oct 31, 2022
@jserv (Member) commented Oct 31, 2022

Thank @AymenQ for contributing!
