
Reintroduce ext-based implementations for shift intrinsics #543

Merged: 2 commits into DLTcollab:master on Oct 20, 2022

Conversation

AymenQ (Collaborator) commented Oct 20, 2022

Use the vector extract instruction (ext) to implement _mm_srli_si128,
_mm_slli_si128 and _mm_alignr_epi8, significantly improving performance
by avoiding a round trip through memory.

These were originally changed in #483 and #484 to instead use a store
followed by a shifted load. This was done to resolve issue #482, where
the compiler would throw an error if an invalid immediate appeared in
the intrinsic arguments after macro expansion.

These commits avoid the recurrence of this issue by guarding the
immediate arguments with additional (redundant) ternary expressions.
These are evaluated at compile time and should cause no performance
loss; they only prevent the compiler from attempting to propagate
invalid immediates. This does now require that the immediate argument
to these intrinsics be a compile-time constant expression, but this
matches the requirement imposed by the original SSE intrinsics.

Sample codegen diffs are included in the individual commit messages.

Use a vector extract to implement _mm_s[l,r]li_si128 instead of storing
and re-loading from memory. This is very similar to the implementation
prior to DLTcollab#484.

Avoid compiler errors due to invalid immediates from macro expansion by
guarding the immediate arguments with extra redundant ternary
expressions (evaluated at compile time).

This performs significantly better than the memory approach (a single
vector instruction versus stores, loads, and potential stack management).

Example codegen for _mm_srli_si128(a, 11) with GCC 11.2.0 (-O3):

Prior to this commit:
    str     q0, [sp]
    stp     xzr, xzr, [sp, 16]
    ldur    q0, [sp, 11]

After this commit:
    movi    v1.4s, 0x0
    ext     v0.16b, v0.16b, v1.16b, 11
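
For reference, here is a minimal sketch of the guarded ext-based
pattern. This is illustrative only: the macro name srli_si128_sketch
and the use of a raw NEON int8x16_t instead of sse2neon's __m128i are
assumptions for the sake of a self-contained example, not the exact
code merged here.

    #include <arm_neon.h>

    /* Illustrative sketch of the guard pattern for a byte-wise right
     * shift.  vextq_s8 requires its last argument to be a
     * compile-time constant in [0, 15]; the redundant inner ternary
     * clamps imm on the (dead) out-of-range path, so no invalid
     * immediate can reach the intrinsic even if the compiler expands
     * both branches before folding the selection away. */
    #define srli_si128_sketch(a, imm)                          \
        (((imm) & ~15) ? vdupq_n_s8(0)                         \
                       : vextq_s8((a), vdupq_n_s8(0),          \
                                  (((imm) & ~15) ? 0 : (imm))))

When imm is a compile-time constant, both ternaries fold away, leaving
exactly the movi/ext pair shown in the codegen above.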
Use a vector extract to perform _mm_alignr_epi8 instead of storing and
loading from memory. Add additional ternary operators to safeguard
against compiler errors at lower optimisation levels (prevents
recurrence of DLTcollab#482).

This performs significantly better than the memory approach.

Example codegen for _mm_alignr_epi8(a, b, 11) with GCC 11.2.0 (-O3):

Prior to this commit:
    stp     q1, q0, [sp]
    ldur    q0, [sp, 11]

After this commit:
    ext     v0.16b, v1.16b, v0.16b, 11
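
Again purely as illustration, a sketch of the same guard pattern for
the alignr case (assumed name and simplified uint8x16_t types; not the
exact macro from this PR):

    #include <arm_neon.h>

    /* Illustrative sketch of a guarded ext-based alignr: conceptually
     * shift the 32-byte concatenation a:b right by imm bytes and keep
     * the low 16 bytes.  Each immediate handed to vextq_u8 is clamped
     * into [0, 15] by a redundant ternary, so an out-of-range value
     * never reaches the intrinsic through a dead branch. */
    #define alignr_epi8_sketch(a, b, imm)                             \
        (((imm) & ~31)                                                \
             ? vdupq_n_u8(0) /* imm > 31: result is all zeros */      \
             : ((imm) > 15                                            \
                    ? vextq_u8((a), vdupq_n_u8(0), /* bytes from a */ \
                               (((imm) > 15 && (imm) < 32)            \
                                    ? (imm) - 16                      \
                                    : 0))                             \
                    : vextq_u8((b), (a), /* bytes span b then a */    \
                               (((imm) & ~15) ? 0 : (imm)))))

For a constant imm such as 11, the guards and dead branches
constant-fold, leaving the single ext instruction shown above.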
jserv merged commit 82e2c97 into DLTcollab:master on Oct 20, 2022
jserv (Member) commented Oct 20, 2022

Thanks to @AymenQ for contributing! I appreciate the efforts from Arm.
