Try to improve radix sort with memory prefetching by taiyang-li · Pull Request #77029 · ClickHouse/ClickHouse

taiyang-li · 2025-03-03T09:03:33Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Try to improve the performance of radix sort with memory prefetching

$ ./build_gcc/src/Common/benchmarks/radix_sort   

before
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
BM_RadixSort1    2593204 ns      2593052 ns          269

after
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
BM_RadixSort1    1985596 ns      1985545 ns          353
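
For readers unfamiliar with the idea, here is a minimal sketch of prefetching in the scatter pass of an LSD radix sort. It is not the code from this PR or from src/Common/RadixSort.h: the function name `radixSortWithPrefetch` and the constant `PREFETCH_DISTANCE` are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin, so this assumes one of those compilers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical tuning knob: how many elements ahead of the current one to prefetch for.
constexpr std::size_t PREFETCH_DISTANCE = 8;

// LSD radix sort for 32-bit unsigned keys, 8 bits per pass (illustrative only).
inline void radixSortWithPrefetch(std::vector<std::uint32_t> & data)
{
    constexpr std::size_t BITS = 8;
    constexpr std::size_t BUCKETS = std::size_t{1} << BITS;
    constexpr std::size_t PASSES = 32 / BITS;

    const std::size_t n = data.size();
    std::vector<std::uint32_t> buffer(n);
    std::uint32_t * src = data.data();
    std::uint32_t * dst = buffer.data();

    for (std::size_t pass = 0; pass < PASSES; ++pass)
    {
        const std::size_t shift = pass * BITS;

        // 1. Count how many keys fall into each bucket for this digit.
        std::array<std::size_t, BUCKETS> positions{};
        for (std::size_t i = 0; i < n; ++i)
            ++positions[(src[i] >> shift) & (BUCKETS - 1)];

        // 2. Exclusive prefix sum: turn counts into the first write position of each bucket.
        std::size_t sum = 0;
        for (auto & p : positions)
        {
            const std::size_t count = p;
            p = sum;
            sum += count;
        }

        // 3. Scatter. The destination address is data-dependent, so hint the cache
        //    about the slot that an element a few iterations ahead is likely to use.
        for (std::size_t i = 0; i < n; ++i)
        {
            if (i + PREFETCH_DISTANCE < n)
            {
                const std::size_t future = (src[i + PREFETCH_DISTANCE] >> shift) & (BUCKETS - 1);
                __builtin_prefetch(&dst[positions[future]], 1 /* for write */, 3 /* keep in cache */);
            }
            const std::size_t bucket = (src[i] >> shift) & (BUCKETS - 1);
            dst[positions[bucket]++] = src[i];
        }

        std::swap(src, dst);
    }
    // PASSES is even, so after the last swap the sorted keys are back in `data`.
}
```

Whether the hint helps depends on the CPU and on the chosen distance, which is why a benchmark comparison such as the one above is needed.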

clickhouse-gh · 2025-03-03T09:10:21Z

Workflow [PR], commit [5a3b8a2]

Algunenano

Leaving many comments. One of the improvements shows a performance degradation on my machine, and the changes are completely independent of one another, so it would be much appreciated if this were split into 3 separate PRs instead of one.

src/Storages/StorageGenerateRandom.cpp

src/Common/RadixSort.h

src/Common/benchmarks/radix_sort.cpp

src/Core/SortCursor.h

clickhouse-gh · 2025-04-22T13:16:21Z

Dear @Algunenano, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

clickhouse-gh · 2025-06-17T13:20:44Z

Dear @Algunenano, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

Algunenano · 2025-06-23T09:17:16Z

I need to find time to do more in-depth tests with different CPUs, as this improvement seems to depend on the CPU model.

clickhouse-gh · 2025-07-03T10:32:11Z

Workflow [PR], commit [c04bcdd]

taiyang-li · 2025-07-28T09:43:09Z

Also, add a changelog entry.

Done.

alexey-milovidov · 2025-07-28T10:21:34Z

Please merge with the master branch for changes to take effect.

alexey-milovidov · 2025-07-28T23:16:39Z

One query sped up on x86: https://s3.amazonaws.com/clickhouse-test-reports/PRs/77029/92f6117a9b5599fff86f3ac276e665e67a4927b0//performance_comparison_amd_release_master_head_3_3/report.html

The overall results are unclear, but at least there are no slowdowns related to Radix Sort.

alexey-milovidov · 2025-07-28T23:18:03Z

We need to prove that there are stable speed-ups on different CPU models (e.g., test on Intel and AMD), and we can merge.

alexey-milovidov · 2025-08-04T22:44:22Z

If you'd like I can test it on six different architectures like here: #84576

alexey-milovidov · 2025-08-04T22:46:13Z

Let's merge with master, and as a side-effect it will also re-run performance tests, so we will have more confidence.

taiyang-li · 2025-08-05T02:53:35Z

Found one related perf test change:

Algunenano · 2025-08-05T09:08:16Z

What about the slowed down tests?

Aren't those using radix sort? Or is it unrelated?

taiyang-li · 2025-08-05T10:37:19Z

Not sure. I need to verify it manually...

taiyang-li · 2025-08-20T10:30:06Z

@Algunenano the slowed-down tests are related to the current changes. I'll try to fix them.

taiyang-li · 2025-08-22T07:23:12Z

@Algunenano now there are no slowed-down tests related to radix sort. Could you help review it again?

Algunenano · 2025-08-28T11:38:58Z

I still see a slowdown of optimize_sorting_for_input_stream in the perf tests and a slowdown on some machines:

AMD Ryzen 9 7950X3D:

A 20% degradation

Master

2025-08-28T13:21:50+02:00
Running ./src/Common/benchmarks/benchmark_radix_sort_master
Run on (32 X 4055.06 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 98304 KiB (x2)
Load Average: 2.22, 5.69, 4.51
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1            477379 ns       475352 ns         1393
BM_RadixSort1            472296 ns       470231 ns         1393
BM_RadixSort1            472134 ns       470163 ns         1393
BM_RadixSort1            475205 ns       473191 ns         1393
BM_RadixSort1            471224 ns       469266 ns         1393
BM_RadixSort1            477741 ns       475671 ns         1393
BM_RadixSort1            470137 ns       468253 ns         1393
BM_RadixSort1            479301 ns       477190 ns         1393
BM_RadixSort1            471139 ns       469251 ns         1393
BM_RadixSort1            478742 ns       476751 ns         1393
BM_RadixSort1_mean       474530 ns       472532 ns           10
BM_RadixSort1_median     473750 ns       471711 ns           10
BM_RadixSort1_stddev       3525 ns         3470 ns           10
BM_RadixSort1_cv           0.74 %          0.73 %            10

PR

2025-08-28T13:21:29+02:00
Running ./src/Common/benchmarks/benchmark_radix_sort
Run on (32 X 5039.07 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 98304 KiB (x2)
Load Average: 2.73, 6.02, 4.59
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1            565985 ns       563487 ns         1209
BM_RadixSort1            559447 ns       557086 ns         1209
BM_RadixSort1            569966 ns       567389 ns         1209
BM_RadixSort1            580683 ns       577609 ns         1209
BM_RadixSort1            567484 ns       564959 ns         1209
BM_RadixSort1            572217 ns       569712 ns         1209
BM_RadixSort1            567858 ns       565392 ns         1209
BM_RadixSort1            566612 ns       564121 ns         1209
BM_RadixSort1            563758 ns       561405 ns         1209
BM_RadixSort1            561171 ns       558821 ns         1209
BM_RadixSort1_mean       567518 ns       564998 ns           10
BM_RadixSort1_median     567048 ns       564540 ns           10
BM_RadixSort1_stddev       6002 ns         5810 ns           10
BM_RadixSort1_cv           1.06 %          1.03 %            10

AMD EPYC 7R13 (AWS)

A 3% improvement

Master

$ ./benchmark_radix_sort_master --benchmark_repetitions=10
2025-08-28T11:28:49+00:00
Running ./benchmark_radix_sort_master
Run on (32 X 1499.1 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.08, 0.05, 0.04
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1           1026726 ns      1026661 ns          680
BM_RadixSort1           1026205 ns      1026104 ns          680
BM_RadixSort1           1028298 ns      1028233 ns          680
BM_RadixSort1           1027260 ns      1027183 ns          680
BM_RadixSort1           1027485 ns      1027379 ns          680
BM_RadixSort1           1030539 ns      1030458 ns          680
BM_RadixSort1           1027567 ns      1027461 ns          680
BM_RadixSort1           1027641 ns      1027549 ns          680
BM_RadixSort1           1028176 ns      1028111 ns          680
BM_RadixSort1           1029456 ns      1029363 ns          680
BM_RadixSort1_mean      1027935 ns      1027850 ns           10
BM_RadixSort1_median    1027604 ns      1027505 ns           10
BM_RadixSort1_stddev       1275 ns         1277 ns           10
BM_RadixSort1_cv           0.12 %          0.12 %            10

PR

$ ./benchmark_radix_sort --benchmark_repetitions=10
2025-08-28T11:34:20+00:00
Running ./benchmark_radix_sort
Run on (32 X 1498.57 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.06, 0.04, 0.03
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_RadixSort1            999743 ns       999661 ns          700
BM_RadixSort1            999456 ns       999383 ns          700
BM_RadixSort1           1001320 ns      1001247 ns          700
BM_RadixSort1            999791 ns       999718 ns          700
BM_RadixSort1            999862 ns       999787 ns          700
BM_RadixSort1           1001359 ns      1001296 ns          700
BM_RadixSort1            999685 ns       999612 ns          700
BM_RadixSort1           1000026 ns       999935 ns          700
BM_RadixSort1            999987 ns       999898 ns          700
BM_RadixSort1            999681 ns       999609 ns          700
BM_RadixSort1_mean      1000091 ns      1000015 ns           10
BM_RadixSort1_median     999826 ns       999753 ns           10
BM_RadixSort1_stddev        678 ns          681 ns           10
BM_RadixSort1_cv           0.07 %          0.07 %            10

I'm doing some more tests
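
As a check on the two summary figures, the relative change can be recomputed from the `BM_RadixSort1_mean` wall times in the benchmark output above. The helper name `relativeChangePercent` is made up for this sketch.

```cpp
// Relative change of the PR build versus master, in percent.
// Positive means the PR is slower, negative means it is faster.
constexpr double relativeChangePercent(double master_ns, double pr_ns)
{
    return (pr_ns - master_ns) / master_ns * 100.0;
}

// Mean times (ns) copied from the runs above.
constexpr double ryzen_7950x3d = relativeChangePercent(474530, 567518);   // about +19.6 %
constexpr double epyc_7r13 = relativeChangePercent(1027935, 1000091);     // about -2.7 %
```

Both values are far larger than the reported coefficient of variation (around 1 % or less), so the difference between the two machines is unlikely to be run-to-run noise.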

Algunenano · 2025-08-28T15:27:41Z

I've been doing more tests, and at least on the 2 machines I have close at hand (the ones mentioned above), calling prefetch directly is either bad or has no performance impact, even with different values of the prefetch distance, and even with batching (doing multiple prefetches at the same time).

OTOH, the simplification of the code (done here) and then manually unrolling the position calculation helps, first, the compiler to use some vectorized operations and, second, the CPU to do the prefetching itself (modern CPUs are pretty good at it). In my tests this leads to a 1.1x perf improvement on the 7950X3D and 1.2x on the EPYC 7R13 for benchmark_radix_sort.

I'll create a parallel PR so we can compare both approaches and maybe check other CPUs too
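
One possible reading of the unrolling idea in this comment, sketched for a single scatter pass with 8-bit digits. This is an assumption about the general technique, not the code from the parallel PR: the function name `scatterUnrolled` and the unroll factor of 4 are hypothetical. The bucket indices for a small batch are computed first (independent shifts and masks that a compiler may vectorize), then the writes follow, with no explicit prefetch call.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scatter `n` keys from `src` to `dst` by the 8-bit digit at `shift`.
// `positions` must already hold the first write position of each bucket
// (the exclusive prefix sum of the digit histogram) and is advanced in place.
inline void scatterUnrolled(
    const std::uint32_t * src, std::uint32_t * dst, std::size_t n,
    std::size_t shift, std::array<std::size_t, 256> & positions)
{
    constexpr std::size_t UNROLL = 4; // hypothetical unroll factor

    std::size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL)
    {
        // Step 1: compute all bucket indices of the batch; no dependency between them.
        std::size_t buckets[UNROLL];
        for (std::size_t j = 0; j < UNROLL; ++j)
            buckets[j] = (src[i + j] >> shift) & 0xFF;

        // Step 2: perform the data-dependent writes in the original order (keeps the pass stable).
        for (std::size_t j = 0; j < UNROLL; ++j)
            dst[positions[buckets[j]]++] = src[i + j];
    }

    // Tail: remaining elements that do not fill a whole batch.
    for (; i < n; ++i)
        dst[positions[(src[i] >> shift) & 0xFF]++] = src[i];
}
```

Separating the index computation from the writes leaves the memory access pattern unchanged, so any gain would have to come from better code generation and the hardware prefetcher, which again needs to be measured per CPU.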

taiyang-li added 3 commits March 3, 2025 16:30

improve perf of order by single column

a639ed2

commit again

5856391

fix conflict

de073b8

opt2: binary search

97b37d8

clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 4, 2025

taiyang-li changed the title ~~Optimize order by single nullable column~~ Optimize order by single nullable column or low cardinality column Mar 4, 2025

taiyang-li added 6 commits March 4, 2025 18:17

improve radix sort

648bc91

Revert "improve radix sort" for no improvement

c1da119

This reverts commit 648bc91.

fix perf error

a3d064b

commit again

1446130

improve radix sort

697f21f

remove comments

a168495

Algunenano self-assigned this Mar 6, 2025

Algunenano requested changes Mar 10, 2025

taiyang-li added 4 commits March 12, 2025 09:47

change as request

db9ad81

Merge branch 'ClickHouse:master' into opt_single_order_by

98f6190

split prs

d01f8fb

Merge branch 'opt_single_order_by' of https://github.com/bigo-sg/Clic…

5a3b8a2

…kHouse into opt_single_order_by

taiyang-li changed the title ~~Optimize order by single nullable column or low cardinality column~~ Try to improve radix sort with memory prefetching Mar 18, 2025

clickhouse-gh bot unassigned Algunenano Apr 22, 2025

Algunenano self-assigned this May 14, 2025

clickhouse-gh bot unassigned Algunenano Jun 17, 2025

Algunenano self-assigned this Jun 23, 2025

Merge branch 'master' into opt_single_order_by

6f0b51a

Merge branch 'ClickHouse:master' into opt_single_order_by

ab38d6d

taiyang-li added 4 commits July 28, 2025 18:22

Merge branch 'ClickHouse:master' into opt_single_order_by

1ddc3ad

Merge remote-tracking branch 'origin/master' into opt_single_order_by

59fd82d

revert files

f459c30

Merge branch 'opt_single_order_by' of https://github.com/bigo-sg/clic…

92f6117

…khouse into opt_single_order_by

Merge branch 'ClickHouse:master' into opt_single_order_by

59abf58

taiyang-li added 2 commits August 8, 2025 09:44

fix building

45e9fd1

Merge remote-tracking branch 'origin/master' into opt_single_order_by

ebf72fd

taiyang-li added 3 commits August 20, 2025 20:25

improve performance

cbafd91

add prefetch distance

faff07b

improve radix sort

246272e

taiyang-li force-pushed the opt_single_order_by branch from dc8c984 to 246272e Compare August 21, 2025 07:13

taiyang-li and others added 2 commits August 21, 2025 16:31

revert allocation alignment

2b927d1

Merge branch 'ClickHouse:master' into opt_single_order_by

fe47e13

Merge branch 'ClickHouse:master' into opt_single_order_by

18081a3

Algunenano mentioned this pull request Aug 28, 2025

RadixSort: Help the compiler use SIMD and the CPU do better prefetching #86378

Merged

minimize CountType according to actual rows

c04bcdd

taiyang-li closed this Sep 2, 2025
