[ARM/AArch64] Fix multiple GCC codegen problems #651
32-bit ARM changes:
- Force GCC to unroll XXH3_accumulate_512 on scalar ARM -> 20% faster on ARMv6
- Use `XXH_FORCE_MEMORY_ACCESS=1` when in ARM strict alignment mode, avoids calls to memcpy(?!???!)

XXH3_64bits on a Raspberry Pi 4B (Cortex-A72), GCC 10.2.1:
- Raspbian armhf (-march=armv6 -mfpu=vfp -mfloat-abi=hard -munaligned-access) 0.85 GB/s -> 1.2 GB/s. Note that there is still room; clang 11 gets 1.4 GB/s.
- ARMv6, no unaligned access (-march=armv6 -mno-unaligned-access) 0.3 GB/s -> 0.85 GB/s (no longer calls memcpy())

AArch64 changes:
- Moved the scalar loop above the NEON loop which allows GCC to interleave
- AArch64 GCC now uses raw casting instead of `vld1q` which was treated as an intrinsic instead of a load.
  - Also hides the vreinterprets
  - Clang and v7a still use the safer vld1q_u8
- Slight reordering of the NEON instructions

Pixel 4a (Cortex-A76), GCC 11.1.0: 9.8 GB/s -> 11.1 GB/s
Raspberry Pi 4B (Cortex-A72), GCC 10.2.1: 4.2 GB/s -> 4.3 GB/s

*GCC is now faster than Clang for aarch64.*
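As a rough illustration of the "force GCC to unroll" item, here is a minimal sketch of a scalar accumulate loop with an unroll hint. It is not the code from this PR: the names `ACC_NB`, `read64_le` and `accumulate_512_scalar` are hypothetical, the lane arithmetic is a simplified rendition of the XXH3 scalar round, and the use of `#pragma GCC unroll` (available in GCC 8 and later) is an assumption about how the unrolling could be requested.

```c
#include <stddef.h>
#include <stdint.h>

#define ACC_NB 8  /* hypothetical name: eight 64-bit lanes = one 64-byte stripe */

/* Endian-independent 64-bit little-endian read, one byte at a time. */
static uint64_t read64_le(const unsigned char *p)
{
    uint64_t v = 0;
    int i;
    for (i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Simplified scalar accumulate over one stripe (illustrative only). */
static void accumulate_512_scalar(uint64_t acc[ACC_NB],
                                  const unsigned char *input,
                                  const unsigned char *secret)
{
    size_t i;
    /* Assumption: ask GCC to fully unroll the 8-iteration loop. Other
     * compilers and older GCC versions fall back to their own heuristics. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 8)
#  pragma GCC unroll 8
#endif
    for (i = 0; i < ACC_NB; i++) {
        uint64_t const data_val = read64_le(input + 8 * i);
        uint64_t const data_key = data_val ^ read64_le(secret + 8 * i);
        acc[i ^ 1] += data_val;                               /* swap adjacent lanes */
        acc[i] += (uint64_t)(uint32_t)data_key * (data_key >> 32); /* 32x32 -> 64 multiply */
    }
}
```

Whether the hint actually helps depends on the target and compiler version; the 20% figure above was measured on ARMv6 with the real implementation, not with this sketch.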
Now, time to go microoptimize so clang is faster again. Joke aside, I think I am going to be looking for more GCC ARM/AArch64 optimizations since it is finally being competent. I wonder how much potential is left on the NEON path. I know GCC and Clang both emit extra instructions on the scalar path. On my phone, memcpy is 16 GB/s. I know AVX2 is already faster than RAM, but NEON is far from it. And perhaps an SVE XXH3 is possible?
This gets 11.35 GB/s, which is a hand-tweaked version of GCC 11's output for the main I used I can't think of much better than this unless there is some brand new approach I am missing.
As could be expected, there's a small conflict after merging #650, but nothing too complex to fix.
Err, hold up.
- Use memcpy on ARMv6 and lower when unaligned access is supported
- GCC has an internal conflict on whether unaligned access is available on ARMv6 so some parts do byteshift, some parts do not
- aligned(1) is better on everything else
- All this seems to be safe on even GCC 4.9.
- Leave out the alignment check if unaligned access is supported on ARM.
Done. Now it uses memcpy for pre-v7 ARM with unaligned access (as GCC has an internal conflict on whether unaligned access is available there). I tested it on GCC 4.9 and 10.1 and it still seems to be the best option.
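For readers unfamiliar with the three unaligned-read strategies mentioned in this thread (memcpy, byteshift, and `aligned(1)`), here is a minimal side-by-side sketch. The function and type names are hypothetical and this is not the xxHash source; the `may_alias` attribute is added here as an assumption so that the example stays well-defined when reading from a byte buffer.

```c
#include <stdint.h>
#include <string.h>

/* Strategy A: memcpy. Portable; compilers typically fold it into a single
 * load when they know the target allows unaligned access. */
static uint32_t read32_memcpy(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Strategy B: byteshift. Builds a little-endian value one byte at a time,
 * so it never performs an unaligned load, at the cost of more instructions. */
static uint32_t read32_le_byteshift(const void *p)
{
    const unsigned char *b = (const unsigned char *)p;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

#if defined(__GNUC__)
/* Strategy C: GCC/Clang extension. A typedef with alignment 1 tells the
 * compiler the pointer may be unaligned; may_alias (an assumption of this
 * sketch) permits reading it out of a char buffer. */
typedef uint32_t unaligned_u32 __attribute__((aligned(1), may_alias));

static uint32_t read32_aligned1(const void *p)
{
    return *(const unaligned_u32 *)p;
}
#endif
```

Strategies A and C return the value in native byte order, while B always decodes little-endian, so on a big-endian target B would differ from the other two unless a byte swap is added.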
Testing on my local smartphone (SnapDragon 855 - SM8150 - aarch64) using
As one can see, there is a very small performance penalty from When compiled in
Yes, that is expected and mirrors my results (both our phones are based on the A76). Clang is going to be very slightly slower, but GCC no longer emits stupid code. I plan to investigate how to get both GCC and Clang to interleave the scalar and NEON instructions properly, as Clang is used on basically every AArch64 target but GNU/Linux distros.
XXH_VECTOR doesn't currently affect the ARM headers. It just goes on the CPU features it detects. AArch64 is literally encoded as And these changes were only in the NEON path, scalar will have no major changes.
I think I am going to make the scalar lanes come first and rename the macro to Therefore defining it to 0 will disable it, and it will go in logical order.
Do you want to do that as part of this PR?
Eh, I'll do it separately as that itself isn't related to this.
For reference, the problem with Clang's codegen is that it is only partially interleaving the loop (I indented the vector instructions). I'm not sure how to force it to interleave without doing it manually.