
[NEON] Refactor NEON code #787

Merged: 3 commits into Cyan4973:dev on Jan 25, 2023

Conversation

@easyaspi314 (Contributor) commented Jan 22, 2023

  • Remove VZIP hack - ARMv7a is going to be using the 4 lane path, so there is no benefit to having the increased complexity.
    • The comments focus more on AArch64 now.
  • Reorder the 4 lane path to clarify the paired operations. This also seems to slightly improve performance on Clang (20->21 GB/s on a Google Tensor/Cortex-X1, not tested on GCC), but we all know how inconsistent benchmarking is on Android.
  • Rename variables to match SSE2 and be more consistent
  • Document how the VUZP trick works
  • Update XXH3_NEON_LANES comment
  • Make the compiler guard a no-op on GCC (it only benefits Clang); see the sketch below.
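
A rough sketch of the compiler-guard idea from the last item above (the macro name and the exact conditions are my own illustration, not the xxHash macro): an empty inline-asm "guard" pins a NEON value in a register so Clang cannot reassociate the loop, and it compiles to nothing on GCC, where it only hurts code generation.

    /* Illustrative only, not the actual xxHash macro. */
    #if defined(__clang__) && (defined(__aarch64__) || defined(__arm__))
    #  define COMPILER_GUARD_W(var) __asm__("" : "+w" (var))  /* 'w' = SIMD/FP register */
    #else
    #  define COMPILER_GUARD_W(var) ((void)0)                 /* no-op on GCC and others */
    #endif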

@easyaspi314 (Contributor, Author) commented:

The s390x failure seems to be because I was working on an older commit.

@Cyan4973 (Owner) commented:

OK, so to summarize: no more mixed NEON + scalar code path, NEON just employs NEON now, and let's reap the benefits of this decision with a simplified code path, possibly extending to better performance through better code gen, because the compiler is more likely to understand the work to do.

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

No, it is still mixed NEON and scalar; it is a solid benefit of up to 15% on higher-end Cortex designs due to how the dispatcher works (it is annoying that it has to be that complex, but that's how it is 🤷‍♀️). I am basically playing a game of FreeCell with the CPU, putting some extra scalar cards in the available slots.

| Target | NEON lanes | Scalar lanes | Reason |
| --- | --- | --- | --- |
| 64-bit Cortex | 6 | 2 | Pure NEON instructions can't make full use of the pipeline, making it beneficial to offload some of the work to the integer pipeline. |
| 64-bit Apple | 8 | 0 | Apple uses an entirely different design than Cortex, so the above does not apply. |
| 32-bit (any) | 8 | 0 | While the pipeline optimization also applies to 32-bit Cortexes (although most of the true 32-bit chips are dual-issue), the scalar path with 32-bit instructions is not fast enough to keep up. |

This hasn't changed.

What has changed is that I removed the ugly XXH_SPLIT_IN_PLACE macro. It only benefited 32-bit targets on the 2-lane NEON path, which is only executed when XXH3_NEON_LANES % 4 != 0. However, 32-bit uses 8 lanes by default, meaning it runs the 4-lane NEON path twice. Therefore this optimization traded a slight benefit on manual configurations and in the cold scrambleAcc loop for a lot of ugliness.
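
As a concrete example, the kind of manual configuration in question would be overriding the tunable before including the header (purely illustrative; XXH3_NEON_LANES is overridable, and 6 is a value that hits the 2-lane remainder path since 6 % 4 != 0):

    /* Force the 6 NEON + 2 scalar split even where it is not the default.
     * 6 % 4 != 0, so the last 2 lanes go through the 2-lane NEON path. */
    #define XXH3_NEON_LANES 6
    #define XXH_INLINE_ALL
    #include "xxhash.h"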

I just replaced it with vmovn/vshrn, which is what it expanded to on AArch64.
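
For reference, a minimal sketch of that split (the helper name is mine, not from the library): vmovn keeps the low 32 bits of each 64-bit lane, and vshrn by 32 keeps the high 32 bits.

    #include <arm_neon.h>

    /* Split each 64-bit lane into its 32-bit halves; the same intrinsics
     * exist on ARMv7-A NEON and AArch64. */
    static inline void split_lanes_u64(uint64x2_t v, uint32x2_t *lo, uint32x2_t *hi)
    {
        *lo = vmovn_u64(v);        /* low 32 bits of each lane  */
        *hi = vshrn_n_u64(v, 32);  /* high 32 bits of each lane */
    }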

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

As for the reordering, I just put the operations as

    // comment about foo
    foo(x[0]);
    foo(x[1]);
    // comment about bar
    bar(x[0]);
    bar(x[1]);

instead of

    // comment about foo [0]
    foo(x[0]);
    // comment about bar [0]
    bar(x[0]);
    // comment about foo [1]
    foo(x[1]);
    // comment about bar [1]
    bar(x[1]);

which makes it much clearer and seems to optimize better on Clang (likely because the loads are now sequential, making it obvious that they can be converted to LDP).
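
In intrinsics terms, a hedged illustration of that pairing (the names and the XOR arithmetic are stand-ins, not the actual xxHash loop body): issuing both adjacent loads back to back is what lets the compiler fuse them into a single LDP on AArch64.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Paired ordering: both loads of a step first, then both uses. The two
     * vld1q_u64 calls read adjacent 128-bit chunks, which the compiler can
     * combine into one LDP. The XORs are placeholder arithmetic. */
    static void pair_step(uint64x2_t acc[2], const uint64_t *input, const uint64_t *secret)
    {
        uint64x2_t data0 = vld1q_u64(input);
        uint64x2_t data1 = vld1q_u64(input + 2);
        uint64x2_t key0  = vld1q_u64(secret);
        uint64x2_t key1  = vld1q_u64(secret + 2);
        acc[0] = veorq_u64(acc[0], veorq_u64(data0, key0));
        acc[1] = veorq_u64(acc[1], veorq_u64(data1, key1));
    }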

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

I should put that table in the comment as it is confusing.

 - Remove acc_vec variable in favor of directly indexing
 - Removing the compiler guard for GCC 11 allows it to get to 23 GB/s
   from 20 GB/s
 - Slight cosmetic tweaks for the 4 lane loop allowing it to be commented
   out.
@easyaspi314 (Contributor, Author) commented Jan 23, 2023

Did some testing on GCC, and it is very much beneficial not to compiler guard on it: GCC actually gets better performance than Clang, at 23 GB/s.

This primarily seems to come from breaking a lot of dependency chains, something Clang is comedically bad at.

@Cyan4973 (Owner) commented:

Looks good, are there further changes or tests intended for this PR?

@easyaspi314 (Contributor, Author) commented:

I think this is good. The performance benefit on GCC alone is enough.

I saw a max of 23.8 GB/s with these changes vs 20.6 GB/s on dev.

@easyaspi314 (Contributor, Author) commented Jan 23, 2023

I definitely want to look into Clang's dependency issues in a future PR though.

It literally does

    ldp     q7, q16, [x5]
    ext     v19.16b, v7.16b, v7.16b, #8 // dependency on q7
    ldp     q17, q18, [x6]
    eor     v7.16b, v17.16b, v7.16b     // dependency on q17

This is ridiculous: it takes a 6-cycle memory access instruction and puts a dependency on it immediately afterwards, twice.

Nearly every Cortex big.LITTLE cluster relies on in-order efficiency cores, and they will stall for 12 cycles waiting for that result.

@Cyan4973 merged commit 22a9afe into Cyan4973:dev on Jan 25, 2023