[NEON] Refactor NEON code #787
Conversation
easyaspi314
commented
Jan 22, 2023
edited
- Remove VZIP hack - ARMv7a is going to be using the 4 lane path, so there is no benefit to having the increased complexity.
- The comments focus more on AArch64 now.
- Reorder the 4 lane path to clarify the paired operations. This also seems to slightly improve performance on Clang (20->21 GB/s on a Google Tensor/Cortex-X1, not tested on GCC), but we all know how inconsistent benchmarking is on Android.
- Rename variables to match SSE2 and be more consistent
- Document how the VUZP trick works
- Update XXH3_NEON_LANES comment
- Make the compiler guard a no-op on GCC (it only benefits Clang)
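For reference, here is a rough scalar model of what the VUZP/UZP deinterleave accomplishes. The names and the little-endian lane layout are illustrative, not the actual xxHash identifiers:

```c
#include <stdint.h>

/* Scalar model of the VUZP trick (illustrative names, little-endian
 * lane order assumed). On NEON, two uint64x2_t vectors are reinterpreted
 * as uint32x4_t and deinterleaved so that one vector holds every lane's
 * low 32 bits and another holds every lane's high 32 bits, ready for a
 * single widening multiply (vmlal_u32). */
static void uzp_model(const uint64_t in[2], uint32_t lo[2], uint32_t hi[2])
{
    for (int i = 0; i < 2; i++) {
        lo[i] = (uint32_t)(in[i] & 0xFFFFFFFF); /* even 32-bit words */
        hi[i] = (uint32_t)(in[i] >> 32);        /* odd 32-bit words */
    }
}

/* The accumulate step can then widen-multiply per lane: acc += lo * hi. */
static uint64_t mul32to64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

The point of the trick is that the deinterleave happens once per pair of vectors, instead of extracting low/high halves lane by lane.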
The s390x failure seems to be because I was working on an older commit.
OK, so to summarize:
No, it is still mixed NEON and scalar; it gives a solid benefit of up to 15% on higher end Cortex designs due to how the dispatcher works (which is annoyingly complex, but that's how it is 🤷‍♀️). I am basically playing a game of FreeCell with the CPU, putting some extra scalar cards in the available slots.
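Roughly, the lane split looks like this scalar sketch (the `XXH3_NEON_LANES` value and helper names here are illustrative, and the "vector" path is only modeled, not real NEON):

```c
#include <stdint.h>

#define ACC_NB 8      /* xxHash's 8 64-bit accumulator lanes */
#define NEON_LANES 6  /* illustrative stand-in for XXH3_NEON_LANES */

/* Hypothetical sketch of the mixed dispatch: lanes below NEON_LANES take
 * the (modeled) vector path two at a time, the remainder take the scalar
 * path, keeping both the SIMD and integer pipelines busy. */
static void accumulate_model(uint64_t acc[ACC_NB], const uint64_t in[ACC_NB],
                             int *vector_hits, int *scalar_hits)
{
    int i;
    for (i = 0; i < NEON_LANES; i += 2) { /* "NEON" path: paired lanes */
        acc[i]     += in[i];
        acc[i + 1] += in[i + 1];
        (*vector_hits)++;
    }
    for (; i < ACC_NB; i++) {             /* scalar path: leftover lanes */
        acc[i] += in[i];
        (*scalar_hits)++;
    }
}
```

With `NEON_LANES` at 6, the sketch does three paired vector steps and two scalar steps per round; those scalar steps are the "extra cards" filling the otherwise idle integer slots.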
This hasn't changed. What has changed is that I removed the ugly VZIP hack; ARMv7a just uses the plain 4 lane path now.
As for the reordering, I just put the operations as

```c
// comment about foo
foo(x[0]);
foo(x[1]);
// comment about bar
bar(x[0]);
bar(x[1]);
```

instead of

```c
// comment about foo [0]
foo(x[0]);
// comment about bar [0]
bar(x[0]);
// comment about foo [1]
foo(x[1]);
// comment about bar [1]
bar(x[1]);
```

which makes it much clearer and seems to optimize better on Clang (likely because the loads being sequential makes it obvious they can be converted to paired loads).
I should put that table in the comment, as it is confusing.
- Remove acc_vec variable in favor of directly indexing
- Removing the compiler guard for GCC 11 allows it to get to 23 GB/s from 20 GB/s
- Slight cosmetic tweaks for the 4 lane loop allowing it to be commented out
Did some testing on GCC, and it is very much beneficial not to apply the compiler guard there: GCC actually gets better performance than Clang, at 23 GB/s. This primarily seems to come from breaking a lot of dependency chains, which Clang handles comedically badly.
Looks good,
I think this is good. Just the performance benefit on GCC is enough. I saw a max of 23.8 GB/s with these changes vs 20.6 GB/s on dev.
I definitely want to look into Clang's dependency issues in a future PR though. It literally takes a 6 cycle memory access instruction and puts a dependency on it immediately afterwards, twice, which is ridiculous. Nearly every Cortex big.LITTLE cluster relies on in-order efficiency cores, and they will stall for 12 cycles waiting for that result.
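As a sketch of the kind of dependency-chain breaking involved (purely illustrative C, not the actual codegen of either compiler):

```c
#include <stdint.h>
#include <stddef.h>

/* One long chain: each add depends on the previous one, so an in-order
 * core stalls on every load before the dependent add can issue. */
static uint64_t sum_chained(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}

/* Two independent chains: the second load/add pair can issue while the
 * first one's result is still in flight, hiding the load latency. */
static uint64_t sum_split(const uint64_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        a += p[i];
        b += p[i + 1];
    }
    if (n & 1)
        a += p[n - 1];
    return a + b;
}
```

Both functions compute the same sum; the difference is only in how much instruction-level parallelism an in-order core can extract.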