
[NEON] Refactor NEON code #787

Merged: 3 commits into Cyan4973:dev on Jan 25, 2023

Conversation

@easyaspi314 (Contributor) commented Jan 22, 2023

  • Remove VZIP hack - ARMv7a is going to be using the 4 lane path, so there is no benefit to having the increased complexity.
    • The comments focus more on AArch64 now.
  • Reorder the 4 lane path to clarify the paired operations. This also seems to slightly improve performance on Clang (20->21 GB/s on a Google Tensor/Cortex-X1, not tested on GCC), but we all know how inconsistent benchmarking is on Android.
  • Rename variables to match SSE2 and be more consistent
  • Document how the VUZP trick works
  • Update XXH3_NEON_LANES comment
  • Make the compiler guard a no-op on GCC (it only benefits Clang); see the sketch below.
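
A rough sketch of the compiler-guard idea from the last item above (the macro name and the exact conditions are my own illustration, not the xxHash macro): an empty inline-asm "guard" pins a NEON value in a register so Clang cannot reassociate the loop, and it compiles to nothing on GCC, where it only hurts code generation.

    /* Illustrative only, not the actual xxHash macro. */
    #if defined(__clang__) && (defined(__aarch64__) || defined(__arm__))
    #  define COMPILER_GUARD_W(var) __asm__("" : "+w" (var))  /* 'w' = SIMD/FP register */
    #else
    #  define COMPILER_GUARD_W(var) ((void)0)                 /* no-op on GCC and others */
    #endif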

@easyaspi314 (Contributor, Author) commented:

The s390x failure seems to be because I was working on an older commit.

@Cyan4973 (Owner) commented:

OK, so to summarize: no more mixed NEON + scalar code path, NEON just employs NEON now, and let's reap the benefits of this decision with a simplified code path, possibly extending to better performance through better code gen, because the compiler is more likely to understand the work to do.

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

No, it is still mixed NEON and scalar; it is a solid benefit of up to 15% on higher-end Cortex designs due to how the dispatcher works (it is annoying that it has to be that complex, but that's how it is 🤷‍♀️). I am basically playing a game of FreeCell with the CPU, putting some extra scalar cards in the available slots.

| Target | NEON lanes | Scalar lanes | Reason |
| --- | --- | --- | --- |
| 64-bit Cortex | 6 | 2 | Pure NEON instructions can't make full use of the pipeline, making it beneficial to offload some of the work to the integer pipeline. |
| 64-bit Apple | 8 | 0 | Apple uses an entirely different design than Cortex, so the above does not apply. |
| 32-bit (any) | 8 | 0 | While the pipeline optimization also applies to 32-bit Cortexes (although most of the true 32-bit chips are dual-issue), the scalar path with 32-bit instructions is not fast enough to keep up. |

This hasn't changed.

What has changed is that I removed the ugly XXH_SPLIT_IN_PLACE macro. It only benefited 32-bit targets on the 2-lane NEON path, which is only executed when XXH3_NEON_LANES % 4 != 0. However, 32-bit uses 8 lanes by default, meaning it runs the 4-lane NEON path twice. Therefore this optimization traded a slight benefit on manual configurations and in the cold scrambleAcc loop for a lot of ugliness.
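
As a concrete example, the kind of manual configuration in question would be overriding the tunable before including the header (purely illustrative; XXH3_NEON_LANES is overridable, and 6 is a value that hits the 2-lane remainder path since 6 % 4 != 0):

    /* Force the 6 NEON + 2 scalar split even where it is not the default.
     * 6 % 4 != 0, so the last 2 lanes go through the 2-lane NEON path. */
    #define XXH3_NEON_LANES 6
    #define XXH_INLINE_ALL
    #include "xxhash.h"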

I just replaced it with vmovn/vshrn, which is what it expanded to on AArch64.
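
For reference, a minimal sketch of that split (the helper name is mine, not from the library): vmovn keeps the low 32 bits of each 64-bit lane, and vshrn by 32 keeps the high 32 bits.

    #include <arm_neon.h>

    /* Split each 64-bit lane into its 32-bit halves; the same intrinsics
     * exist on ARMv7-A NEON and AArch64. */
    static inline void split_lanes_u64(uint64x2_t v, uint32x2_t *lo, uint32x2_t *hi)
    {
        *lo = vmovn_u64(v);        /* low 32 bits of each lane  */
        *hi = vshrn_n_u64(v, 32);  /* high 32 bits of each lane */
    }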

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

As for the reordering, I just put the operations as

    // comment about foo
    foo(x[0]);
    foo(x[1]);
    // comment about bar
    bar(x[0]);
    bar(x[1]);

instead of

    // comment about foo [0]
    foo(x[0]);
    // comment about bar [0]
    bar(x[0]);
    // comment about foo [1]
    foo(x[1]);
    // comment about bar [1]
    bar(x[1]);

which makes it much clearer and seems to optimize better on Clang (likely because the loads are now sequential, making it obvious that they can be converted to LDP).
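
In intrinsics terms, a hedged illustration of that pairing (the names and the XOR arithmetic are stand-ins, not the actual xxHash loop body): issuing both adjacent loads back to back is what lets the compiler fuse them into a single LDP on AArch64.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Paired ordering: both loads of a step first, then both uses. The two
     * vld1q_u64 calls read adjacent 128-bit chunks, which the compiler can
     * combine into one LDP. The XORs are placeholder arithmetic. */
    static void pair_step(uint64x2_t acc[2], const uint64_t *input, const uint64_t *secret)
    {
        uint64x2_t data0 = vld1q_u64(input);
        uint64x2_t data1 = vld1q_u64(input + 2);
        uint64x2_t key0  = vld1q_u64(secret);
        uint64x2_t key1  = vld1q_u64(secret + 2);
        acc[0] = veorq_u64(acc[0], veorq_u64(data0, key0));
        acc[1] = veorq_u64(acc[1], veorq_u64(data1, key1));
    }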

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

I should put that table in the comment as it is confusing.

 - Remove acc_vec variable in favor of directly indexing
 - Removing the compiler guard for GCC 11 allows it to get to 23 GB/s
   from 20 GB/s
 - Slight cosmetic tweaks for the 4 lane loop allowing it to be commented
   out.
@easyaspi314 (Contributor, Author) commented Jan 23, 2023

Did some testing on GCC, and it is very much beneficial not to compiler guard on it: GCC actually gets better performance than Clang, at 23 GB/s.

This primarily seems to come from breaking a lot of dependency chains, something Clang is comedically bad at.

@Cyan4973 (Owner) commented:

Looks good, are there further changes or tests intended for this PR?

@easyaspi314 (Contributor, Author) commented:

I think this is good. The performance benefit on GCC alone is enough.

I saw a max of 23.8 GB/s with these changes vs 20.6 GB/s on dev.

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

I definitely want to look into Clang's dependency issues in a future PR though.

It literally does

    ldp     q7, q16, [x5]
    ext     v19.16b, v7.16b, v7.16b, #8 // dependency on q7
    ldp     q17, q18, [x6]
    eor     v7.16b, v17.16b, v7.16b     // dependency on q17

This is ridiculous: it takes a 6-cycle memory access instruction and puts a dependency on it immediately afterwards, twice.

Nearly every Cortex big.LITTLE cluster relies on in-order efficiency cores, and they will stall for 12 cycles waiting for that result.

@Cyan4973 merged commit 22a9afe into Cyan4973:dev on Jan 25, 2023