Try to improve radix sort with memory prefetching#77029
Try to improve radix sort with memory prefetching#77029taiyang-li wants to merge 31 commits intoClickHouse:masterfrom
Conversation
Algunenano
left a comment
There was a problem hiding this comment.
Leaving many comments. One of the improvement shows a performance degradation in my machine and they are completely separate from one another, so it'd be much appreciated if this was separated in 3 separate PRs instead of one
|
Dear @Algunenano, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
Dear @Algunenano, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
I need to find time to do more in-depth tests with different CPUs, as this improvements seems to depend on it |
Done. |
|
Please merge with the master branch for changes to take effect. |
|
One query sped up on x86: https://s3.amazonaws.com/clickhouse-test-reports/PRs/77029/92f6117a9b5599fff86f3ac276e665e67a4927b0//performance_comparison_amd_release_master_head_3_3/report.html The overall results are unclear, but at least there are no slowdowns related to Radix Sort. |
|
We need to prove that there are stable speed-ups on different CPU models (e.g., test on Intel and AMD), and we can merge. |
|
If you'd like I can test it on six different architectures like here: #84576 |
|
Let's merge with master, and as a side-effect it will also re-run performance tests, so we will have more confidence. |
|
What about the slowed down tests?
Aren't those using radix sort? Or is it unrelated? |
|
not sure. need to verify it manually... |
|
@Algunenano the slowed down tests are related to current changes. I'll try to fix it. |
dc8c984 to
246272e
Compare
|
@Algunenano now there are no slowed down tests related to radix sort. Could you help review it again ? |
|
I still see a slowdown of AMD Ryzen 9 7950X3D:A 20% degradation MasterPRAMD EPYC 7R13 (AWS)A 3% improvement MasterPRI'm doing some more tests |
|
I've been doing more tests and at least in the 2 machines I have close on hand (the ones mentioned above) and calling prefetch directly is either bad or has no performance impact, even with different values of it, and even with batching it (doing multiple prefetches at the same time). OTOH, the simplification of the code (done here) and then manually unrolling of the position calculation helps first the compiler to use some vectorized operations, and then the CPU do prefetching itself (modern CPUs are pretty good at it). In my tests this leads to a 1.1x perf improvement in 7950X3D and 1.2x in EPYC 7R13 for I'll create a parallel PR so we can compare both approaches and maybe check other CPUs too |



Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Try to improve the performance of radix sort with memory prefetching