
Reintroduce ext-based implementations for shift intrinsics #543

Merged: 2 commits into DLTcollab:master on Oct 20, 2022

Conversation

AymenQ (Collaborator) commented Oct 20, 2022

Use the vector extract instruction (ext) to implement _mm_srli_si128,
_mm_slli_si128 and _mm_alignr_epi8, significantly improving performance
by avoiding a round trip through memory.

These were originally changed in #483 and #484 to instead use a store
followed by a shifted load. This was done to resolve issue #482, where
the compiler would throw an error if an invalid immediate appeared in
the intrinsic arguments after macro expansion.

These commits avoid the recurrence of this issue by guarding the
immediate arguments with additional (redundant) ternary expressions.
These are evaluated at compile time and should cause no performance
loss; they only prevent the compiler from attempting to propagate
invalid immediates. This does now require that the immediate argument
to these intrinsics be a compile-time constant expression, but this
matches the requirement imposed by the original SSE intrinsics.

Sample codegen diffs are included in the individual commit messages.

Use a vector extract to implement _mm_s[l,r]li_si128 instead of storing
and re-loading from memory. This is very similar to the implementation
prior to DLTcollab#484.

Avoid compiler errors due to invalid immediates from macro expansion by
guarding the immediate arguments with extra redundant ternary
expressions (evaluated at compile time).

This performs significantly better than the memory approach (a single
vector instruction versus stores, loads, and potential stack management).

Example codegen for _mm_srli_si128(a, 11) with GCC 11.2.0 (-O3):

Prior to this commit:
    str     q0, [sp]
    stp     xzr, xzr, [sp, 16]
    ldur    q0, [sp, 11]

After this commit:
    movi    v1.4s, 0x0
    ext     v0.16b, v0.16b, v1.16b, 11
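
For reference, here is a minimal sketch of the guarded ext-based
pattern. This is illustrative only: the macro name srli_si128_sketch
and the use of a raw NEON int8x16_t instead of sse2neon's __m128i are
assumptions for the sake of a self-contained example, not the exact
code merged here.

    #include <arm_neon.h>

    /* Illustrative sketch of the guard pattern for a byte-wise right
     * shift.  vextq_s8 requires its last argument to be a
     * compile-time constant in [0, 15]; the redundant inner ternary
     * clamps imm on the (dead) out-of-range path, so no invalid
     * immediate can reach the intrinsic even if the compiler expands
     * both branches before folding the selection away. */
    #define srli_si128_sketch(a, imm)                          \
        (((imm) & ~15) ? vdupq_n_s8(0)                         \
                       : vextq_s8((a), vdupq_n_s8(0),          \
                                  (((imm) & ~15) ? 0 : (imm))))

When imm is a compile-time constant, both ternaries fold away, leaving
exactly the movi/ext pair shown in the codegen above.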
Use a vector extract to perform _mm_alignr_epi8 instead of storing and
loading from memory. Add additional ternary operators to safeguard
against compiler errors at lower optimisation levels (prevents
recurrence of DLTcollab#482).

This performs significantly better than the memory approach.

Example codegen for _mm_alignr_epi8(a, b, 11) with GCC 11.2.0 (-O3):

Prior to this commit:
    stp     q1, q0, [sp]
    ldur    q0, [sp, 11]

After this commit:
    ext     v0.16b, v1.16b, v0.16b, 11
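
Again purely as illustration, a sketch of the same guard pattern for
the alignr case (assumed name and simplified uint8x16_t types; not the
exact macro from this PR):

    #include <arm_neon.h>

    /* Illustrative sketch of a guarded ext-based alignr: conceptually
     * shift the 32-byte concatenation a:b right by imm bytes and keep
     * the low 16 bytes.  Each immediate handed to vextq_u8 is clamped
     * into [0, 15] by a redundant ternary, so an out-of-range value
     * never reaches the intrinsic through a dead branch. */
    #define alignr_epi8_sketch(a, b, imm)                             \
        (((imm) & ~31)                                                \
             ? vdupq_n_u8(0) /* imm > 31: result is all zeros */      \
             : ((imm) > 15                                            \
                    ? vextq_u8((a), vdupq_n_u8(0), /* bytes from a */ \
                               (((imm) > 15 && (imm) < 32)            \
                                    ? (imm) - 16                      \
                                    : 0))                             \
                    : vextq_u8((b), (a), /* bytes span b then a */    \
                               (((imm) & ~15) ? 0 : (imm)))))

For a constant imm such as 11, the guards and dead branches
constant-fold, leaving the single ext instruction shown above.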
jserv merged commit 82e2c97 into DLTcollab:master on Oct 20, 2022
jserv (Member) commented Oct 20, 2022

Thanks to @AymenQ for contributing! I appreciate the efforts from Arm.
