
[ARM/AArch64] Fix multiple GCC codegen problems #651

Merged
merged 4 commits into Cyan4973:dev on Dec 8, 2021

Conversation

@easyaspi314
Contributor

easyaspi314 commented Dec 7, 2021

32-bit ARM changes:

  • Force GCC to unroll XXH3_accumulate_512 on scalar ARM -> 20% faster on
    ARMv6 (a sketch of the unroll hint follows this list)
  • Use XXH_FORCE_MEMORY_ACCESS=1 when in ARM strict alignment mode; this
    avoids calls to memcpy(?!???!)
  • Disable alignment checks if ARM unaligned access is supported
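
As a rough illustration of the unroll hint mentioned in the first bullet (a minimal sketch with simplified names, not the actual xxhash.h code), GCC 8+ accepts an unroll pragma directly in front of the lane loop:

```c
#include <stddef.h>
#include <stdint.h>

#define ACC_NB 8   /* 512 bits of accumulator = 8 x 64-bit lanes */

/* Simplified scalar accumulate-512 round; the real XXH3 round differs. */
static void accumulate_512_scalar(uint64_t* acc,
                                  const uint64_t* input,
                                  const uint64_t* secret)
{
    size_t i;
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 8)
#  pragma GCC unroll 8   /* ask GCC to fully unroll the 8-lane loop */
#endif
    for (i = 0; i < ACC_NB; i++) {
        uint64_t const data = input[i] ^ secret[i];
        acc[i ^ 1] += data;                               /* cross-lane add */
        acc[i]     += (data & 0xFFFFFFFF) * (data >> 32); /* 32x32->64 */
    }
}
```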

XXH3_64bits on a Raspberry Pi 4B (Cortex-A72), GCC 10.2.1:

  • Raspbian armhf (-march=armv6 -mfpu=vfp -mfloat-abi=hard -munaligned-access):
    0.85 GB/s -> 1.2 GB/s. Note that there is still room; Clang 11 gets 1.4 GB/s.
  • ARMv6, no unaligned access (-march=armv6 -mno-unaligned-access):
    0.3 GB/s -> 0.85 GB/s (no longer calls memcpy())

AArch64 changes:

  • Moved the scalar loop above the NEON loop, which allows GCC to interleave
    the scalar and NEON work
  • AArch64 GCC now uses a raw cast instead of vld1q, which it treated as an
    opaque intrinsic rather than a load (see the sketch after this list)
    • This also hides the vreinterprets
    • Clang and ARMv7-A still use the safer vld1q_u8
  • Slight reordering of the NEON instructions
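
The load change in the second bullet looks roughly like this (a hedged sketch; the real macro name and guards in xxhash.h may differ):

```c
#include <arm_neon.h>
#include <stdint.h>

/* AArch64 GCC treated vld1q_u8 as an opaque intrinsic rather than a plain
 * load, so a raw pointer cast gives it more freedom and hides the
 * vreinterpret.  Clang and 32-bit ARMv7-A keep the safer intrinsic form. */
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
#  define LOAD_U64x2(ptr) (*(uint64x2_t const*)(ptr))
#else
#  define LOAD_U64x2(ptr) vreinterpretq_u64_u8(vld1q_u8((uint8_t const*)(ptr)))
#endif
```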

Pixel 4a (Cortex-A76), GCC 11.1.0: 9.8 GB/s -> 11.1 GB/s
Raspberry Pi 4B (Cortex-A72), GCC 10.2.1: 4.2 GB/s -> 4.3 GB/s

GCC is now faster than Clang for aarch64.

@easyaspi314 changed the title from "Arm go brr" to "[ARM/AArch64] Fix multiple GCC codegen problems" on Dec 7, 2021
@easyaspi314
Contributor Author

easyaspi314 commented Dec 7, 2021

Now, time to go microoptimize so clang is faster again :trollface:

Joke aside, I think I am going to be looking for more GCC ARM/AArch64 optimizations since it is finally being competent.

I wonder how much potential is left on the NEON path. I know GCC and Clang both emit extra instructions on the scalar path. On my phone, memcpy is 16 GB/s.

I know AVX2 is already faster than RAM, but NEON is far from it.

And perhaps an SVE XXH3 is possible?

@easyaspi314
Contributor Author

easyaspi314 commented Dec 7, 2021

> I wonder how much potential is left on the NEON path.

The listing below gets 11.35 GB/s. It is a hand-tweaked version of GCC 11's output for the main accumulate_512 step of hashLong_64b_defaultSecret with the scalar/NEON order swapped (the untweaked version gets 11.25 GB/s).

I used umaddl to avoid two extra eor (GCC)/and (Clang) instructions, used register writeback on ldp to remove two adds, used a non-temporal load (ldnp), and moved the prefetch (see the note after the listing).

I can't think of much better than this unless there is some brand new approach I am missing.

```asm
.L32:
        ldnp    q2, q1, [x0, #16]
        ldp     q5, q4, [x1, #16]
        ldr     q0, [x0, #48]
        ldr     q3, [x1, #48]
        ldp     x4, x6, [x0], #64
        ldp     x2, x3, [x1], #8
        eor     v5.16b, v2.16b, v5.16b
        eor     v4.16b, v1.16b, v4.16b
        eor     v3.16b, v0.16b, v3.16b
        ext     v2.16b, v2.16b, v2.16b, #8
        xtn     v18.2s, v5.2d
        eor     x7, x4, x2
        xtn     v17.2s, v4.2d
        eor     x5, x6, x3
        xtn     v16.2s, v3.2d
        lsr     x2, x7, #32
        shrn    v5.2s, v5.2d, #32
        ext     v1.16b, v1.16b, v1.16b, #8
        shrn    v4.2s, v4.2d, #32
        lsr     x3, x5, #32
        ext     v0.16b, v0.16b, v0.16b, #8
        umaddl  x2, w2, w7, x6
        shrn    v3.2s, v3.2d, #32
        umlal   v2.2d, v18.2s, v5.2s
        prfm    PLDL1STRM, [x0, #320]
        umaddl  x3, w3, w5, x4
        umlal   v1.2d, v17.2s, v4.2s
        add     x8, x8, x2
        add     x9, x9, x3
        umlal   v0.2d, v16.2s, v3.2s
        add     v6.2d, v6.2d, v2.2d
        add     v7.2d, v7.2d, v1.2d
        add     v23.2d, v23.2d, v0.2d
        cmp     x0, x10
        bne     .L32
```
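
For context on the umaddl tweak: umaddl Xd, Wn, Wm, Xa computes Xa + Wn * Wm with the 32-bit operands zero-extended to 64 bits. The C shape that maps to it is roughly this (a minimal sketch, not xxhash.h code):

```c
#include <stdint.h>

/* 32x32->64 unsigned multiply added to a 64-bit accumulator: the pattern
 * that should map to a single umaddl.  Compilers often mask/shift into
 * full 64-bit registers and use madd instead, which is what the hand
 * tweak above avoids. */
static uint64_t mul32_accumulate(uint64_t acc, uint64_t mixed)
{
    uint32_t const lo = (uint32_t)mixed;
    uint32_t const hi = (uint32_t)(mixed >> 32);
    return acc + (uint64_t)lo * hi;
}
```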

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2021

As could be expected, there's a small conflict after merging #650, but nothing too complex to fix.

@easyaspi314
Contributor Author

Err, hold up.

 - Use memcpy on ARMv6 and lower when unaligned access is supported
   - GCC has an internal conflict on whether unaligned access is available
     on ARMv6, so some parts use byte shifts and some do not
   - aligned(1) is better on everything else (both strategies are sketched
     after this list)
   - All of this seems to be safe even on GCC 4.9.
 - Leave out the alignment check if unaligned access is supported on ARM.
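
For illustration, the two read strategies being weighed here look roughly like this (a simplified sketch, not the exact xxhash.h definitions):

```c
#include <stdint.h>
#include <string.h>

/* memcpy-based read: on pre-ARMv7 with unaligned access enabled, GCC lowers
 * this to a single unaligned load instead of byte shifts. */
static uint32_t read32_memcpy(const void* ptr)
{
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
}

/* aligned(1) read (the XXH_FORCE_MEMORY_ACCESS==1 style): better everywhere
 * else, and it avoids an out-of-line memcpy call under strict alignment. */
static uint32_t read32_aligned1(const void* ptr)
{
    typedef __attribute__((aligned(1))) uint32_t unalign32;
    return *(const unalign32*)ptr;
}
```
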
@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

Done. Now it uses memcpy for pre-v7 ARM with unaligned access (as GCC has an internal conflict where aligned(1) emits byte shifts and memcpy does not), and disables the alignment check when unaligned access is available.

I tested it on GCC 4.9 and 10.1 and it still seems to be the best option.

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2021

Testing on my local smartphone (Snapdragon 855 - SM8150 - aarch64) using Termux:
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.0

| Algo   | v0.8.0    | dev       | this PR   |
|--------|-----------|-----------|-----------|
| XXH32  | 5.4 GB/s  | 5.4 GB/s  | 5.4 GB/s  |
| XXH64  | 3.6 GB/s  | 3.6 GB/s  | 3.6 GB/s  |
| XXH3   | 11.3 GB/s | 13.8 GB/s | 13.6 GB/s |
| XXH128 | 11.3 GB/s | 13.8 GB/s | 13.5 GB/s |

As one can see, there is a very small performance penalty from dev to this PR (which remains globally positive compared to v0.8.0).
The difference is very small, so it probably doesn't matter much. It is merely interesting to know (is that expected?).

When compiled in scalar mode, using CPPFLAGS=-DXXH_VECTOR=XXH_SCALAR, XXH3 and XXH128 are forced to use the scalar instruction set instead of NEON intrinsics.
In that case, speed for both variants and for all versions is exactly identical: 6.78 GB/s.
However, strangely, the detailed prompt still claims NEON support:
compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.0 (?)

@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

Yes, that is expected and mirrors my results (both our phones are based on the A76). Clang is going to be very slightly slower, but GCC no longer emits stupid code.

I plan to investigate how to get both GCC and Clang to interleave the scalar and NEON instructions properly, as Clang is used on basically every AArch64 target but GNU/Linux distros.

@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

> When compiled in scalar mode, using CPPFLAGS=-DXXH_VECTOR=XXH_SCALAR, XXH3 and XXH128 are forced to use the scalar instruction set instead of NEON intrinsics.
> In that case, speed for both variants and for all versions is exactly identical: 6.78 GB/s.
> However, strangely, the detailed prompt still claims NEON support:
> compiled as 64-bit aarch64 + NEON little endian with Clang 13.0.0 (?)

XXH_VECTOR doesn't currently affect that prompt; it just goes by the CPU features detected at compile time. AArch64 is literally encoded as "aarch64 + NEON" since it always has NEON (a sketch of the idea follows below).

And these changes were only in the NEON path; the scalar path sees no major changes.
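
Roughly, the banner string is chosen from compile-time architecture macros along these lines (a sketch of the idea; the actual xxhsum source uses different names):

```c
/* AdvSIMD (NEON) is mandatory on AArch64, so the reported string is fixed
 * regardless of XXH_VECTOR. */
#if defined(__aarch64__) || defined(_M_ARM64)
#  define ARCH_STRING "aarch64 + NEON"
#elif defined(__arm__) && defined(__ARM_NEON)
#  define ARCH_STRING "arm + NEON"
#elif defined(__arm__)
#  define ARCH_STRING "arm"
#else
#  define ARCH_STRING "unknown"
#endif
```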

@easyaspi314
Contributor Author

easyaspi314 commented Dec 8, 2021

I think I am going to make the scalar lanes come first and rename the macro to XXH3_NEON_SCALAR_LANES.

That way, defining it to 0 will disable the scalar lanes, and the loop will go in logical order.
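
A rough sketch of the shape this would take (illustrative only, not final code):

```c
#include <stddef.h>

#ifndef XXH3_NEON_SCALAR_LANES
#  define XXH3_NEON_SCALAR_LANES 2   /* lanes handled by scalar code, first */
#endif

static void accumulate_512_shape(void)
{
    size_t lane = 0;
    /* Scalar lanes first, in logical order... */
    for (; lane < XXH3_NEON_SCALAR_LANES; lane++) {
        /* scalar lane body (elided) */
    }
    /* ...then the remaining lanes in NEON, two 64-bit lanes per vector.
     * Defining XXH3_NEON_SCALAR_LANES to 0 makes this loop cover all
     * 8 lanes, i.e. it disables the scalar part. */
    for (; lane < 8; lane += 2) {
        /* NEON lane-pair body (elided) */
    }
}
```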

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2021

> I think I am going to make the scalar lanes come first and rename the macro to XXH3_NEON_SCALAR_LANES.
>
> That way, defining it to 0 will disable the scalar lanes, and the loop will go in logical order.

Do you want to do that as part of this PR?
Or would you prefer we merge this PR first and do the proposed update later?

@easyaspi314
Contributor Author

Eh, I'll do it separately as that itself isn't related to this.

Cyan4973 merged commit 3fa7300 into Cyan4973:dev on Dec 8, 2021
@easyaspi314
Contributor Author

easyaspi314 commented Dec 9, 2021

For reference, the problem with Clang's codegen is that it only partially interleaves the loop (I have indented the vector instructions in the listing below). I'm not sure how to force it to interleave fully without doing it manually.

```asm
.LBB20_4:
        add     x5, x2, x3
        prfm    pldl1keep, [x5, #320]
         ldp    q7, q17, [x5, #16]
         ldur   q16, [x4, #-16]
        ldp     x6, x7, [x5]
        add     x3, x3, #64
         eor    v16.16b, v16.16b, v7.16b
         ext    v7.16b, v7.16b, v7.16b, #8
         add    v2.2d, v7.2d, v2.2d
         xtn    v7.2s, v16.2d
         shrn   v16.2s, v16.2d, #32
         umlal  v2.2d, v7.2s, v16.2s
         ldp    q7, q16, [x4]
        eor     x19, x19, x6
        and     x21, x19, #0xffffffff // unnecessary, use umaddl
        lsr     x19, x19, #32
         eor    v7.16b, v7.16b, v17.16b
         ext    v17.16b, v17.16b, v17.16b, #8
         add    v1.2d, v17.2d, v1.2d
         xtn    v17.2s, v7.2d
         shrn   v7.2s, v7.2d, #32
         umlal  v1.2d, v17.2s, v7.2s
         ldr    q7, [x5, #48]
        eor     x5, x20, x7
        madd    x9, x21, x19, x9
        add     x10, x6, x10
         ext    v17.16b, v7.16b, v7.16b, #8
         eor    v7.16b, v16.16b, v7.16b
        and     x6, x5, #0xffffffff // unnecessary, use umaddl
        lsr     x5, x5, #32
         xtn    v16.2s, v7.2d
         shrn   v7.2s, v7.2d, #32
         add    v0.2d, v17.2d, v0.2d
        cmp     x3, #1024
        madd    x10, x6, x5, x10
         umlal  v0.2d, v16.2s, v7.2s
        add     x9, x9, x7
        add     x4, x4, #8
        b.ne    .LBB20_4
```

@easyaspi314
Contributor Author

easyaspi314 commented Dec 10, 2021

I can just barely get 12 GB/s with GCC 10.3 (the dumb one, ironically), but that is with manual unrolling and interleaving in XXH3_accumulate().

If I manually do the interleaving, 4 NEON lanes are slightly beneficial, but it hinders the two-loop codegen.

Hopefully I can find a way to do it without manual interleaving (and still, Clang reorders it to its own tempo).

I'd love to get a natural umaddl from them though; that is the biggest problem with the plain codegen aside from ordering. madd is pretty slow on most chips.
