
[ARM/AArch64] Fix multiple GCC codegen problems #651

Merged
merged 4 commits into Cyan4973:dev on Dec 8, 2021

Conversation

@easyaspi314
Contributor

easyaspi314 commented Dec 7, 2021

32-bit ARM changes:

  • Force GCC to unroll XXH3_accumulate_512 on scalar ARM -> 20% faster on
    ARMv6 (a sketch of the unroll hint follows this list)
  • Use XXH_FORCE_MEMORY_ACCESS=1 when in ARM strict alignment mode; this
    avoids calls to memcpy(?!???!)
  • Disable alignment checks if ARM unaligned access is supported
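
As a rough illustration of the unroll hint mentioned in the first bullet (a minimal sketch with simplified names, not the actual xxhash.h code), GCC 8+ accepts an unroll pragma directly in front of the lane loop:

```c
#include <stddef.h>
#include <stdint.h>

#define ACC_NB 8   /* 512 bits of accumulator = 8 x 64-bit lanes */

/* Simplified scalar accumulate-512 round; the real XXH3 round differs. */
static void accumulate_512_scalar(uint64_t* acc,
                                  const uint64_t* input,
                                  const uint64_t* secret)
{
    size_t i;
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 8)
#  pragma GCC unroll 8   /* ask GCC to fully unroll the 8-lane loop */
#endif
    for (i = 0; i < ACC_NB; i++) {
        uint64_t const data = input[i] ^ secret[i];
        acc[i ^ 1] += data;                               /* cross-lane add */
        acc[i]     += (data & 0xFFFFFFFF) * (data >> 32); /* 32x32->64 */
    }
}
```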

XXH3_64bits on a Raspberry Pi 4B (Cortex-A72), GCC 10.2.1:

  • Raspbian armhf (-march=armv6 -mfpu=vfp -mfloat-abi=hard -munaligned-access):
    0.85 GB/s -> 1.2 GB/s. Note that there is still room; Clang 11 gets 1.4 GB/s.
  • ARMv6, no unaligned access (-march=armv6 -mno-unaligned-access):
    0.3 GB/s -> 0.85 GB/s (no longer calls memcpy())

AArch64 changes:

  • Moved the scalar loop above the NEON loop, which allows GCC to interleave
    the scalar and NEON work
  • AArch64 GCC now uses a raw cast instead of vld1q, which it treated as an
    opaque intrinsic rather than a load (see the sketch after this list)
    • This also hides the vreinterprets
    • Clang and ARMv7-A still use the safer vld1q_u8
  • Slight reordering of the NEON instructions
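
The load change in the second bullet looks roughly like this (a hedged sketch; the real macro name and guards in xxhash.h may differ):

```c
#include <arm_neon.h>
#include <stdint.h>

/* AArch64 GCC treated vld1q_u8 as an opaque intrinsic rather than a plain
 * load, so a raw pointer cast gives it more freedom and hides the
 * vreinterpret.  Clang and 32-bit ARMv7-A keep the safer intrinsic form. */
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
#  define LOAD_U64x2(ptr) (*(uint64x2_t const*)(ptr))
#else
#  define LOAD_U64x2(ptr) vreinterpretq_u64_u8(vld1q_u8((uint8_t const*)(ptr)))
#endif
```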

Pixel 4a (Cortex-A76), GCC 11.1.0: 9.8 GB/s -> 11.1 GB/s
Raspberry Pi 4B (Cortex-A72), GCC 10.2.1: 4.2 GB/s -> 4.3 GB/s

GCC is now faster than Clang for aarch64.

@easyaspi314 changed the title from "Arm go brr" to "[ARM/AArch64] Fix multiple GCC codegen problems" on Dec 7, 2021
@easyaspi314
Contributor Author

easyaspi314 commented Dec 7, 2021

Now, time to go microoptimize so clang is faster again :trollface:

Joke aside, I think I am going to be looking for more GCC ARM/AArch64 optimizations since it is finally being competent.

I wonder how much potential is left on the NEON path. I know GCC and Clang both emit extra instructions on the scalar path. On my phone, memcpy is 16 GB/s.

I know AVX2 is already faster than RAM, but NEON is far from it.

And perhaps an SVE XXH3 is possible?

@easyaspi314
Contributor Author

easyaspi314 commented Dec 7, 2021

> I wonder how much potential is left on the NEON path.

The listing below gets 11.35 GB/s. It is a hand-tweaked version of GCC 11's output for the main accumulate_512 step of hashLong_64b_defaultSecret with the scalar/NEON order swapped (the untweaked version gets 11.25 GB/s).

I used umaddl to avoid two extra eor (GCC)/and (Clang) instructions, used register writeback on ldp to remove two adds, used a non-temporal load (ldnp), and moved the prefetch (see the note after the listing).

I can't think of much better than this unless there is some brand new approach I am missing.

```asm
.L32:
        ldnp    q2, q1, [x0, #16]
        ldp     q5, q4, [x1, #16]
        ldr     q0, [x0, #48]
        ldr     q3, [x1, #48]
        ldp     x4, x6, [x0], #64
        ldp     x2, x3, [x1], #8
        eor     v5.16b, v2.16b, v5.16b
        eor     v4.16b, v1.16b, v4.16b
        eor     v3.16b, v0.16b, v3.16b
        ext     v2.16b, v2.16b, v2.16b, #8
        xtn     v18.2s, v5.2d
        eor     x7, x4, x2
        xtn     v17.2s, v4.2d
        eor     x5, x6, x3
        xtn     v16.2s, v3.2d
        lsr     x2, x7, #32
        shrn    v5.2s, v5.2d, #32
        ext     v1.16b, v1.16b, v1.16b, #8
        shrn    v4.2s, v4.2d, #32
        lsr     x3, x5, #32
        ext     v0.16b, v0.16b, v0.16b, #8
        umaddl  x2, w2, w7, x6
        shrn    v3.2s, v3.2d, #32
        umlal   v2.2d, v18.2s, v5.2s
        prfm    PLDL1STRM, [x0, #320]
        umaddl  x3, w3, w5, x4
        umlal   v1.2d, v17.2s, v4.2s
        add     x8, x8, x2
        add     x9, x9, x3
        umlal   v0.2d, v16.2s, v3.2s
        add     v6.2d, v6.2d, v2.2d
        add     v7.2d, v7.2d, v1.2d
        add     v23.2d, v23.2d, v0.2d
        cmp     x0, x10
        bne     .L32
```
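
For context on the umaddl tweak: umaddl Xd, Wn, Wm, Xa computes Xa + Wn * Wm with the 32-bit operands zero-extended to 64 bits. The C shape that maps to it is roughly this (a minimal sketch, not xxhash.h code):

```c
#include <stdint.h>

/* 32x32->64 unsigned multiply added to a 64-bit accumulator: the pattern
 * that should map to a single umaddl.  Compilers often mask/shift into
 * full 64-bit registers and use madd instead, which is what the hand
 * tweak above avoids. */
static uint64_t mul32_accumulate(uint64_t acc, uint64_t mixed)
{
    uint32_t const lo = (uint32_t)mixed;
    uint32_t const hi = (uint32_t)(mixed >> 32);
    return acc + (uint64_t)lo * hi;
}
```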

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2021

As could be expected, there's a small conflict after merging #650, but nothing too complex to fix.

@easyaspi314
Contributor Author

Err, hold up.

 - Use memcpy on ARMv6 and lower when unaligned access is supported
   - GCC has an internal conflict on whether unaligned access is available
     on ARMv6, so some parts use byte shifts and some do not
   - aligned(1) is better on everything else (both strategies are sketched
     after this list)
   - All of this seems to be safe even on GCC 4.9.
 - Leave out the alignment check if unaligned access is supported on ARM.
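
For illustration, the two read strategies being weighed here look roughly like this (a simplified sketch, not the exact xxhash.h definitions):

```c
#include <stdint.h>
#include <string.h>

/* memcpy-based read: on pre-ARMv7 with unaligned access enabled, GCC lowers
 * this to a single unaligned load instead of byte shifts. */
static uint32_t read32_memcpy(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}

/* aligned(1) read (the XXH_FORCE_MEMORY_ACCESS==1 style): better everywhere
 * else, and it avoids an out-of-line memcpy call under strict alignment. */
static uint32_t read32_aligned1(const void* ptr)
{
    typedef __attribute__((aligned(1))) uint32_t unalign32;
    return *(const unalign32*)ptr;
}
```
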
@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

Done. Now it uses memcpy for pre-v7 ARM with unaligned access (as GCC has an internal conflict where aligned(1) emits byte shifts and memcpy does not), and disables the alignment check when unaligned access is available.

I tested it on GCC 4.9 and 10.1 and it still seems to be the best option.

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2021

Testing on my local smartphone (Snapdragon 855 - SM8150 - aarch64) using Termux:
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.0

| Algo   | v0.8.0    | dev       | this PR   |
|--------|-----------|-----------|-----------|
| XXH32  | 5.4 GB/s  | 5.4 GB/s  | 5.4 GB/s  |
| XXH64  | 3.6 GB/s  | 3.6 GB/s  | 3.6 GB/s  |
| XXH3   | 11.3 GB/s | 13.8 GB/s | 13.6 GB/s |
| XXH128 | 11.3 GB/s | 13.8 GB/s | 13.5 GB/s |

As one can see, there is a very small performance penalty from dev to this PR (which remains globally positive compared to v0.8.0).
The difference is very small, so it probably doesn't matter much. It is merely interesting to know (is that expected?).

When compiled in scalar mode, using CPPFLAGS=-DXXH_VECTOR=XXH_SCALAR, XXH3 and XXH128 are forced to use the scalar instruction set instead of NEON intrinsics.
In that case, speed for both variants and for all versions is exactly identical: 6.78 GB/s.
However, strangely, the detailed prompt still claims NEON support:
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.0 (?)

@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

Yes, that is expected and mirrors my results (both our phones are based on the A76). Clang is going to be very slightly slower, but GCC no longer emits stupid code.

I plan to investigate how to get both GCC and Clang to interleave the scalar and NEON instructions properly, as Clang is used on basically every AArch64 target but GNU/Linux distros.

@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

> When compiled in scalar mode, using CPPFLAGS=-DXXH_VECTOR=XXH_SCALAR, XXH3 and XXH128 are forced to use the scalar instruction set instead of NEON intrinsics.
> In that case, speed for both variants and for all versions is exactly identical: 6.78 GB/s.
> However, strangely, the detailed prompt still claims NEON support:
> compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.0 (?)

XXH_VECTOR doesn't currently affect that prompt; it just goes by the CPU features detected at compile time. AArch64 is literally encoded as "aarch64 + NEON" since it always has NEON (a sketch of the idea follows below).

And these changes were only in the NEON path; the scalar path sees no major changes.
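
Roughly, the banner string is chosen from compile-time architecture macros along these lines (a sketch of the idea; the actual xxhsum source uses different names):

```c
/* AdvSIMD (NEON) is mandatory on AArch64, so the reported string is fixed
 * regardless of XXH_VECTOR. */
#if defined(__aarch64__) || defined(_M_ARM64)
#  define ARCH_STRING "aarch64 + NEON"
#elif defined(__arm__) && defined(__ARM_NEON)
#  define ARCH_STRING "arm + NEON"
#elif defined(__arm__)
#  define ARCH_STRING "arm"
#else
#  define ARCH_STRING "unknown"
#endif
```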

@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

I think I am going to make the scalar lanes come first and rename the macro to XXH3_NEON_SCALAR_LANES.

That way, defining it to 0 will disable the scalar lanes, and the loop will go in logical order.
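
A rough sketch of the shape this would take (illustrative only, not final code):

```c
#include <stddef.h>

#ifndef XXH3_NEON_SCALAR_LANES
#  define XXH3_NEON_SCALAR_LANES 2   /* lanes handled by scalar code, first */
#endif

static void accumulate_512_shape(void)
{
    size_t lane = 0;
    /* Scalar lanes first, in logical order... */
    for (; lane < XXH3_NEON_SCALAR_LANES; lane++) {
        /* scalar lane body (elided) */
    }
    /* ...then the remaining lanes in NEON, two 64-bit lanes per vector.
     * Defining XXH3_NEON_SCALAR_LANES to 0 makes this loop cover all
     * 8 lanes, i.e. it disables the scalar part. */
    for (; lane < 8; lane += 2) {
        /* NEON lane-pair body (elided) */
    }
}
```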

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2021

> I think I am going to make the scalar lanes come first and rename the macro to XXH3_NEON_SCALAR_LANES.
>
> That way, defining it to 0 will disable the scalar lanes, and the loop will go in logical order.

Do you want to do that as part of this PR?
Or would you prefer we merge this PR first and do the proposed update later?

@easyaspi314
Contributor Author

Eh, I'll do it separately as that itself isn't related to this.

Cyan4973 merged commit 3fa7300 into Cyan4973:dev on Dec 8, 2021
@easyaspi314
Contributor Author

easyaspi314 commented Dec 9, 2021

For reference, the problem with Clang's codegen is that it only partially interleaves the loop (I have indented the vector instructions in the listing below). I'm not sure how to force it to interleave fully without doing it manually.

```asm
.LBB20_4:
        add     x5, x2, x3
        prfm    pldl1keep, [x5, #320]
         ldp    q7, q17, [x5, #16]
         ldur   q16, [x4, #-16]
        ldp     x6, x7, [x5]
        add     x3, x3, #64
         eor    v16.16b, v16.16b, v7.16b
         ext    v7.16b, v7.16b, v7.16b, #8
         add    v2.2d, v7.2d, v2.2d
         xtn    v7.2s, v16.2d
         shrn   v16.2s, v16.2d, #32
         umlal  v2.2d, v7.2s, v16.2s
         ldp    q7, q16, [x4]
        eor     x19, x19, x6
        and     x21, x19, #0xffffffff // unnecessary, use umaddl
        lsr     x19, x19, #32
         eor    v7.16b, v7.16b, v17.16b
         ext    v17.16b, v17.16b, v17.16b, #8
         add    v1.2d, v17.2d, v1.2d
         xtn    v17.2s, v7.2d
         shrn   v7.2s, v7.2d, #32
         umlal  v1.2d, v17.2s, v7.2s
         ldr    q7, [x5, #48]
        eor     x5, x20, x7
        madd    x9, x21, x19, x9
        add     x10, x6, x10
         ext    v17.16b, v7.16b, v7.16b, #8
         eor    v7.16b, v16.16b, v7.16b
        and     x6, x5, #0xffffffff // unnecessary, use umaddl
        lsr     x5, x5, #32
         xtn    v16.2s, v7.2d
         shrn   v7.2s, v7.2d, #32
         add    v0.2d, v17.2d, v0.2d
        cmp     x3, #1024
        madd    x10, x6, x5, x10
         umlal  v0.2d, v16.2s, v7.2s
        add     x9, x9, x7
        add     x4, x4, #8
        b.ne    .LBB20_4
```

@easyaspi314
Contributor Author

easyaspi314 commented Dec 10, 2021

I can just barely get 12 GB/s with GCC 10.3 (the dumb one, ironically), but that is with manual unrolling and interleaving in XXH3_accumulate().

If I manually do the interleaving, 4 NEON lanes are slightly beneficial, but it hinders the two-loop codegen.

Hopefully I can find a way to do it without manual interleaving (and still, Clang reorders it to its own tempo).

I'd love to get a natural umaddl from them though; that is the biggest problem with the plain codegen aside from ordering. madd is pretty slow on most chips.
