Try to improve radix sort with memory prefetching #77029

Closed
taiyang-li wants to merge 31 commits into ClickHouse:master from bigo-sg:opt_single_order_by

@taiyang-li
Contributor

@taiyang-li taiyang-li commented Mar 3, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Try to improve the performance of radix sort with memory prefetching

$ ./build_gcc/src/Common/benchmarks/radix_sort   

before
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
BM_RadixSort1    2593204 ns      2593052 ns          269

after
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
BM_RadixSort1    1985596 ns      1985545 ns          353
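
For readers unfamiliar with the idea, here is a minimal sketch of what prefetching in a radix sort scatter pass can look like. It is not the code from this PR or ClickHouse's `RadixSort`: the function name `radixSortWithPrefetch`, the macro `SKETCH_PREFETCH_WRITE`, and the default `prefetch_distance` of 8 are all assumptions made for illustration. It relies on the GCC/Clang builtin `__builtin_prefetch` and falls back to a no-op on other compilers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical macro: hint that `ptr` will be written soon (GCC/Clang only).
#if defined(__GNUC__) || defined(__clang__)
#    define SKETCH_PREFETCH_WRITE(ptr) __builtin_prefetch((ptr), 1)
#else
#    define SKETCH_PREFETCH_WRITE(ptr) ((void)(ptr))
#endif

// Hypothetical helper: LSD radix sort of 32-bit keys, 8 bits per pass.
inline void radixSortWithPrefetch(std::vector<uint32_t> & data, size_t prefetch_distance = 8)
{
    constexpr size_t bits = 8;
    constexpr size_t buckets = size_t{1} << bits;
    const size_t n = data.size();

    std::vector<uint32_t> buffer(n);
    uint32_t * src = data.data();
    uint32_t * dst = buffer.data();

    for (size_t shift = 0; shift < 32; shift += bits)
    {
        // Histogram of the current digit.
        std::array<size_t, buckets> pos{};
        for (size_t i = 0; i < n; ++i)
            ++pos[(src[i] >> shift) & (buckets - 1)];

        // Exclusive prefix sum: pos[b] becomes the next write index of bucket b.
        size_t sum = 0;
        for (size_t b = 0; b < buckets; ++b)
        {
            const size_t count = pos[b];
            pos[b] = sum;
            sum += count;
        }

        // Scatter. The destination of each element is data-dependent, so we
        // hint the cache line that a later element is likely to be written to.
        for (size_t i = 0; i < n; ++i)
        {
            if (i + prefetch_distance < n)
            {
                const size_t future_bucket = (src[i + prefetch_distance] >> shift) & (buckets - 1);
                SKETCH_PREFETCH_WRITE(dst + pos[future_bucket]);
            }
            const size_t bucket = (src[i] >> shift) & (buckets - 1);
            dst[pos[bucket]++] = src[i];
        }

        std::swap(src, dst);
    }

    // Four passes mean an even number of swaps, so the result is back in `data`.
    assert(src == data.data());
}
```

The prefetched address is only an estimate: by the time the later element is actually written, its bucket's write index may have advanced. Whether the hint pays off depends on the CPU, which is the question the rest of this conversation turns on.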

@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 3, 2025

Workflow [PR], commit [5a3b8a2]

@clickhouse-gh clickhouse-gh bot added the pr-performance (Pull request with some performance improvements) label Mar 4, 2025
@taiyang-li taiyang-li changed the title from "Optimize order by single nullable column" to "Optimize order by single nullable column or low cardinality column" Mar 4, 2025
@Algunenano Algunenano self-assigned this Mar 6, 2025
Member

@Algunenano Algunenano left a comment

Leaving many comments. One of the improvements shows a performance degradation on my machine, and they are completely separate from one another, so it'd be much appreciated if this was split into 3 separate PRs instead of one.

@taiyang-li taiyang-li changed the title from "Optimize order by single nullable column or low cardinality column" to "Try to improve radix sort with memory prefetching" Mar 18, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Apr 22, 2025

Dear @Algunenano, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@Algunenano Algunenano self-assigned this May 14, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 17, 2025

Dear @Algunenano, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@Algunenano Algunenano self-assigned this Jun 23, 2025
@Algunenano
Member

I need to find time to do more in-depth tests with different CPUs, as this improvement seems to depend on the CPU.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jul 3, 2025

Workflow [PR], commit [c04bcdd]

@taiyang-li
Contributor Author

> Also, add a changelog entry.

Done.

@alexey-milovidov
Member

Please merge with the master branch for changes to take effect.

@alexey-milovidov
Member

One query sped up on x86: https://s3.amazonaws.com/clickhouse-test-reports/PRs/77029/92f6117a9b5599fff86f3ac276e665e67a4927b0//performance_comparison_amd_release_master_head_3_3/report.html

The overall results are unclear, but at least there are no slowdowns related to Radix Sort.

@alexey-milovidov
Member

We need to prove that there are stable speed-ups on different CPU models (e.g., test on Intel and AMD), and we can merge.

@alexey-milovidov
Member

If you'd like I can test it on six different architectures like here: #84576

@alexey-milovidov
Member

Let's merge with master, and as a side-effect it will also re-run performance tests, so we will have more confidence.

@taiyang-li
Contributor Author

Found one related perf test change:
image

@Algunenano
Member

What about the slowed down tests?

image

Aren't those using radix sort? Or is it unrelated?

@taiyang-li
Contributor Author

not sure. need to verify it manually...

@taiyang-li
Contributor Author

@Algunenano the slowed-down tests are related to the current changes. I'll try to fix them.

@taiyang-li taiyang-li force-pushed the opt_single_order_by branch from dc8c984 to 246272e August 21, 2025 07:13
@taiyang-li
Contributor Author

taiyang-li commented Aug 22, 2025

@Algunenano now there are no slowed-down tests related to radix sort. Could you help review it again?
image

@Algunenano
Member

I still see a slowdown of optimize_sorting_for_input_stream in the perf tests and a slowdown on some machines:

AMD Ryzen 9 7950X3D:

A 20% degradation

Master

2025-08-28T13:21:50+02:00
Running ./src/Common/benchmarks/benchmark_radix_sort_master
Run on (32 X 4055.06 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 98304 KiB (x2)
Load Average: 2.22, 5.69, 4.51
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1            477379 ns       475352 ns         1393
BM_RadixSort1            472296 ns       470231 ns         1393
BM_RadixSort1            472134 ns       470163 ns         1393
BM_RadixSort1            475205 ns       473191 ns         1393
BM_RadixSort1            471224 ns       469266 ns         1393
BM_RadixSort1            477741 ns       475671 ns         1393
BM_RadixSort1            470137 ns       468253 ns         1393
BM_RadixSort1            479301 ns       477190 ns         1393
BM_RadixSort1            471139 ns       469251 ns         1393
BM_RadixSort1            478742 ns       476751 ns         1393
BM_RadixSort1_mean       474530 ns       472532 ns           10
BM_RadixSort1_median     473750 ns       471711 ns           10
BM_RadixSort1_stddev       3525 ns         3470 ns           10
BM_RadixSort1_cv           0.74 %          0.73 %            10

PR

2025-08-28T13:21:29+02:00
Running ./src/Common/benchmarks/benchmark_radix_sort
Run on (32 X 5039.07 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 98304 KiB (x2)
Load Average: 2.73, 6.02, 4.59
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1            565985 ns       563487 ns         1209
BM_RadixSort1            559447 ns       557086 ns         1209
BM_RadixSort1            569966 ns       567389 ns         1209
BM_RadixSort1            580683 ns       577609 ns         1209
BM_RadixSort1            567484 ns       564959 ns         1209
BM_RadixSort1            572217 ns       569712 ns         1209
BM_RadixSort1            567858 ns       565392 ns         1209
BM_RadixSort1            566612 ns       564121 ns         1209
BM_RadixSort1            563758 ns       561405 ns         1209
BM_RadixSort1            561171 ns       558821 ns         1209
BM_RadixSort1_mean       567518 ns       564998 ns           10
BM_RadixSort1_median     567048 ns       564540 ns           10
BM_RadixSort1_stddev       6002 ns         5810 ns           10
BM_RadixSort1_cv           1.06 %          1.03 %            10

AMD EPYC 7R13 (AWS)

A 3% improvement

Master

$ ./benchmark_radix_sort_master --benchmark_repetitions=10
2025-08-28T11:28:49+00:00
Running ./benchmark_radix_sort_master
Run on (32 X 1499.1 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.08, 0.05, 0.04
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1           1026726 ns      1026661 ns          680
BM_RadixSort1           1026205 ns      1026104 ns          680
BM_RadixSort1           1028298 ns      1028233 ns          680
BM_RadixSort1           1027260 ns      1027183 ns          680
BM_RadixSort1           1027485 ns      1027379 ns          680
BM_RadixSort1           1030539 ns      1030458 ns          680
BM_RadixSort1           1027567 ns      1027461 ns          680
BM_RadixSort1           1027641 ns      1027549 ns          680
BM_RadixSort1           1028176 ns      1028111 ns          680
BM_RadixSort1           1029456 ns      1029363 ns          680
BM_RadixSort1_mean      1027935 ns      1027850 ns           10
BM_RadixSort1_median    1027604 ns      1027505 ns           10
BM_RadixSort1_stddev       1275 ns         1277 ns           10
BM_RadixSort1_cv           0.12 %          0.12 %            10

PR

$ ./benchmark_radix_sort --benchmark_repetitions=10
2025-08-28T11:34:20+00:00
Running ./benchmark_radix_sort
Run on (32 X 1498.57 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.06, 0.04, 0.03
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1            999743 ns       999661 ns          700
BM_RadixSort1            999456 ns       999383 ns          700
BM_RadixSort1           1001320 ns      1001247 ns          700
BM_RadixSort1            999791 ns       999718 ns          700
BM_RadixSort1            999862 ns       999787 ns          700
BM_RadixSort1           1001359 ns      1001296 ns          700
BM_RadixSort1            999685 ns       999612 ns          700
BM_RadixSort1           1000026 ns       999935 ns          700
BM_RadixSort1            999987 ns       999898 ns          700
BM_RadixSort1            999681 ns       999609 ns          700
BM_RadixSort1_mean      1000091 ns      1000015 ns           10
BM_RadixSort1_median     999826 ns       999753 ns           10
BM_RadixSort1_stddev        678 ns          681 ns           10
BM_RadixSort1_cv           0.07 %          0.07 %            10

I'm doing some more tests
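
The two headline percentages above can be reproduced from the reported `BM_RadixSort1_mean` wall-clock times. The helper below is only a sketch for checking that arithmetic; the name `relativeChangePercent` is made up for this note and is not part of the benchmark tooling.

```cpp
#include <cassert>

// Hypothetical helper: relative change of a timing versus a baseline, in percent.
// A positive result means the candidate is slower; a negative one means it is faster.
inline double relativeChangePercent(double baseline_ns, double candidate_ns)
{
    assert(baseline_ns > 0.0);
    return (candidate_ns - baseline_ns) / baseline_ns * 100.0;
}
```

With the means listed above, `relativeChangePercent(474530.0, 567518.0)` is about +19.6 for the Ryzen 9 7950X3D (the "20% degradation"), and `relativeChangePercent(1027935.0, 1000091.0)` is about -2.7 for the EPYC 7R13 (the "3% improvement").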

@Algunenano
Member

I've been doing more tests, and at least on the 2 machines I have close at hand (the ones mentioned above), calling prefetch directly is either bad or has no performance impact, even with different values for it, and even with batching (issuing multiple prefetches at the same time).

OTOH, simplifying the code (done here) and then manually unrolling the position calculation helps: first the compiler can use some vectorized operations, and then the CPU does the prefetching itself (modern CPUs are pretty good at it). In my tests this leads to a 1.1x perf improvement on the 7950X3D and 1.2x on the EPYC 7R13 for benchmark_radix_sort.
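
As a rough illustration of what "manually unrolling the position calculation" could mean, here is one possible reading. It is not the code from the parallel PR: the function name `scatterUnrolled`, the unroll factor of 4, and the 8-bit digit are assumptions. The sketch computes the digits of four elements up front, with no explicit prefetch call, and only then performs the dependent writes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scatter pass for one 8-bit digit. `pos` must hold the exclusive
// prefix sums of the digit histogram (the next write index of each bucket).
inline void scatterUnrolled(
    const uint32_t * src, uint32_t * dst, size_t n, unsigned shift, std::array<size_t, 256> & pos)
{
    assert(shift < 32);

    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        // Step 1: four independent digit extractions, with no dependency on `pos`.
        const size_t b0 = (src[i + 0] >> shift) & 0xFF;
        const size_t b1 = (src[i + 1] >> shift) & 0xFF;
        const size_t b2 = (src[i + 2] >> shift) & 0xFF;
        const size_t b3 = (src[i + 3] >> shift) & 0xFF;

        // Step 2: writes in the original order, so the pass stays stable.
        dst[pos[b0]++] = src[i + 0];
        dst[pos[b1]++] = src[i + 1];
        dst[pos[b2]++] = src[i + 2];
        dst[pos[b3]++] = src[i + 3];
    }

    // Tail: remaining elements one at a time.
    for (; i < n; ++i)
        dst[pos[(src[i] >> shift) & 0xFF]++] = src[i];
}
```

Separating the digit extraction from the writes gives the compiler a block of independent loads and shifts to work with. Whether that actually vectorizes, and by how much it helps, has to be measured per compiler and CPU, as in the tests described above.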

I'll create a parallel PR so we can compare both approaches and maybe check other CPUs too

@taiyang-li taiyang-li closed this Sep 2, 2025