
New segmented sort algorithm #357

Merged
merged 16 commits into from Oct 22, 2021

Conversation

gevtushenko
Collaborator

This PR includes a new segmented sort facility. A few approaches to this problem exist.

Embed segment number into keys

This approach provides an elegant solution to the load-balancing issue but can lead to slowdowns. It is also not applicable if the number of bits needed to represent the segment number exceeds the number of spare bits available in the keys.
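As a rough illustration of the embedding idea (a hypothetical helper, not part of this PR), a 32-bit key can be widened so the segment index occupies the high bits and a single device-wide radix sort orders items by segment first, then by key within each segment:

```cpp
#include <cstdint>

// Hypothetical packing helper: works only while the segment index fits into
// the bits that can be prepended to the key (here, a 32-bit key widened to 64).
__host__ __device__ inline std::uint64_t
pack_key(std::uint32_t key, std::uint32_t segment_id)
{
  // The segment id occupies the high 32 bits, so sorting the packed values
  // groups items by segment and sorts by key within each segment.
  return (static_cast<std::uint64_t>(segment_id) << 32) | key;
}
```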

Modified merge sort approach

This idea is implemented in ModernGPU. I've used this approach as a reference for comparison with the new segmented sort algorithm. As I show below, it can be outperformed in most cases.

Kernel specialisation

The idea behind this approach is to partition the input segments into size groups. Specialised kernels can then process each size group. The LRB approach initially discussed in the issue falls into this category. Kernel specialisation is also the approach the new segmented sort algorithm relies on.

I'm going to briefly describe the genesis of the new segmented sort algorithm to justify some design decisions.

To minimise the number of kernel specialisations, I've benchmarked different approaches to sorting small segments (under a few hundred items): single-thread odd-even sorting, warp-scope bitonic sorting, and the newly added warp-scope merge sort. The warp-scope merge sort demonstrated some advantages: it can sort bigger segments and outperforms the other methods in the majority of cases.

Warp-scope merge sort is included in this PR as a separate facility. It's possible to partition an architectural warp into multiple virtual ones to sort multiple segments simultaneously. The warp-scope merge sort duplicated a significant part of the previously introduced block-scope merge sort, so I extracted the merge-sort strategy into a separate facility that both the warp and block sorts share.
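Here is a minimal sketch of the virtual-warp idea using the cub::WarpMergeSort facility added in this PR. The segment size, virtual warp size, and launch configuration are illustrative assumptions, and each segment is assumed to hold exactly threads_per_warp * items_per_thread keys:

```cpp
#include <cub/warp/warp_merge_sort.cuh>

struct CustomLess
{
  __device__ bool operator()(int lhs, int rhs) const { return lhs < rhs; }
};

// Each 16-thread virtual warp sorts one 32-item segment.
__global__ void sort_small_segments(int *d_keys)
{
  constexpr int items_per_thread = 2;
  constexpr int threads_per_warp = 16; // virtual warp size
  constexpr int warps_per_block  = 8;

  using WarpMergeSortT = cub::WarpMergeSort<int, items_per_thread, threads_per_warp>;
  __shared__ typename WarpMergeSortT::TempStorage temp_storage[warps_per_block];

  const int warp_id  = threadIdx.x / threads_per_warp;
  const int lane     = threadIdx.x % threads_per_warp;
  const int seg_base = (blockIdx.x * warps_per_block + warp_id) *
                       threads_per_warp * items_per_thread;

  // Blocked arrangement: each thread owns a contiguous run of keys.
  int thread_keys[items_per_thread];
  for (int i = 0; i < items_per_thread; ++i)
    thread_keys[i] = d_keys[seg_base + lane * items_per_thread + i];

  WarpMergeSortT(temp_storage[warp_id]).Sort(thread_keys, CustomLess{});

  for (int i = 0; i < items_per_thread; ++i)
    d_keys[seg_base + lane * items_per_thread + i] = thread_keys[i];
}
```

With these illustrative constants, a 128-thread block sorts eight 32-item segments at once.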

Here's the speedup of warp-scope merge sort over warp-bitonic sort:
[figure]

And the speedup of warp-scope merge sort over single-thread odd-even sort:
[figure]

In the figures above, I vary the segment size and the number of segments.

To further increase the performance of warp-scope merge sort, I needed to load and store data efficiently, which required warp-scope load, store and exchange facilities. These facilities are also provided in this PR.
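To show the load/store facilities in isolation, here is a sketch with illustrative sizes (not code from the PR). WARP_LOAD_TRANSPOSE performs coalesced striped reads and hands each thread its items in blocked order, which is what the merge sort expects; WARP_STORE_TRANSPOSE does the reverse:

```cpp
#include <cub/warp/warp_load.cuh>
#include <cub/warp/warp_store.cuh>

// Each 16-thread virtual warp loads, transforms, and stores 64 contiguous items.
__global__ void warp_load_store_example(const int *d_in, int *d_out)
{
  constexpr int items_per_thread = 4;
  constexpr int threads_per_warp = 16;
  constexpr int warps_per_block  = 8;

  using WarpLoadT  = cub::WarpLoad<int, items_per_thread,
                                   cub::WARP_LOAD_TRANSPOSE, threads_per_warp>;
  using WarpStoreT = cub::WarpStore<int, items_per_thread,
                                    cub::WARP_STORE_TRANSPOSE, threads_per_warp>;

  __shared__ typename WarpLoadT::TempStorage  load_storage[warps_per_block];
  __shared__ typename WarpStoreT::TempStorage store_storage[warps_per_block];

  const int warp_id   = threadIdx.x / threads_per_warp;
  const int tile_base = (blockIdx.x * warps_per_block + warp_id) *
                        threads_per_warp * items_per_thread;

  int items[items_per_thread];
  WarpLoadT(load_storage[warp_id]).Load(d_in + tile_base, items);

  // ... sort or otherwise transform `items` here ...

  WarpStoreT(store_storage[warp_id]).Store(d_out + tile_base, items);
}
```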

Using a proper sorting algorithm was not enough. Initially, I assigned a CUDA thread block to each segment. Although this approach demonstrated a speedup over the existing one, it led to inefficient resource utilisation because most threads were idle. Nonetheless, a kernel like this is used as a fallback solution when there are not enough segments: if idle threads don't block other CTAs from execution, there's no reason to spend time on segment partitioning. The fallback kernel helped to eliminate cases where the partitioning stage led to an overall slowdown of the new algorithm.

[figure]
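A host-side sketch of this dispatch idea; the threshold, kernel names and grid sizes below are placeholders rather than the actual implementation:

```cpp
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real ones in this PR.
__global__ void fallback_kernel()                  {} // one CTA per segment
__global__ void large_segments_kernel()            {}
__global__ void small_and_medium_segments_kernel() {}

cudaError_t dispatch_segmented_sort(int num_segments,
                                    int max_resident_ctas,
                                    cudaStream_t stream)
{
  constexpr int block_threads = 256;

  if (num_segments <= max_resident_ctas)
  {
    // Too few segments to saturate the GPU: idle threads can't block other
    // CTAs anyway, so skip the partitioning stage entirely.
    fallback_kernel<<<num_segments, block_threads, 0, stream>>>();
  }
  else
  {
    // Partition segments into size groups first (see the three-way partition
    // below), then launch the specialised kernels. Grid sizes would be
    // derived from the per-group segment counts.
    large_segments_kernel<<<1, block_threads, 0, stream>>>();
    small_and_medium_segments_kernel<<<1, block_threads, 0, stream>>>();
  }

  return cudaPeekAtLastError();
}
```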

Initially, I implemented a single kernel for all size groups. Depending on the CTA index, I allocated a different number of threads per segment. That is, if the segment size exceeded a few thousand items, I used the slow block-scope radix sort. If the data was around a thousand items and fit into shared memory, I used the in-shared-memory block-scope radix sort. In both of these cases, the whole CTA was assigned to a single segment. If the CTA index exceeded the number of large segments, I partitioned the CTA into multiple warp groups, each processing a separate segment. It turned out that the large-segment branch limited the occupancy of the small-segment one, so I split this kernel in two. One kernel processes large segments and contains two branches - in-shared-memory sort and slow block-scope sort. Another kernel processes medium and small segments.
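A device-side sketch of the structure of such a large-segment kernel, with made-up policy constants and a keys-only tile; the actual agent in this PR differs in detail:

```cpp
#include <cub/block/block_load.cuh>
#include <cub/block/block_radix_sort.cuh>
#include <cub/block/block_store.cuh>

// Sketch of the large-segments kernel (illustrative constants, keys only).
// Launched with one CTA of block_threads threads per large segment.
__global__ void large_segments_kernel_sketch(const unsigned int *d_keys_in,
                                             unsigned int *d_keys_out,
                                             const int *d_begin_offsets,
                                             const int *d_end_offsets)
{
  constexpr int block_threads    = 256;
  constexpr int items_per_thread = 9;
  constexpr int tile_items       = block_threads * items_per_thread;

  const int segment_begin = d_begin_offsets[blockIdx.x];
  const int segment_end   = d_end_offsets[blockIdx.x];
  const int segment_size  = segment_end - segment_begin;

  using BlockLoadT      = cub::BlockLoad<unsigned int, block_threads, items_per_thread,
                                         cub::BLOCK_LOAD_TRANSPOSE>;
  using BlockRadixSortT = cub::BlockRadixSort<unsigned int, block_threads, items_per_thread>;
  using BlockStoreT     = cub::BlockStore<unsigned int, block_threads, items_per_thread,
                                          cub::BLOCK_STORE_TRANSPOSE>;

  __shared__ union
  {
    typename BlockLoadT::TempStorage      load;
    typename BlockRadixSortT::TempStorage sort;
    typename BlockStoreT::TempStorage     store;
  } temp_storage;

  if (segment_size <= tile_items)
  {
    // In-shared-memory branch: a single tile covers the whole segment.
    // Out-of-range items are padded with the maximal key so they sort last.
    unsigned int keys[items_per_thread];

    BlockLoadT(temp_storage.load)
      .Load(d_keys_in + segment_begin, keys, segment_size, 0xFFFFFFFFu);
    __syncthreads();

    BlockRadixSortT(temp_storage.sort).Sort(keys);
    __syncthreads();

    BlockStoreT(temp_storage.store)
      .Store(d_keys_out + segment_begin, keys, segment_size);
  }
  else
  {
    // Slow block-scope branch: a multi-pass radix sort over global memory
    // would go here (omitted in this sketch).
  }
}
```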

To overlap small- and large-segment processing, I've used concurrent kernels. This PR contains a single-stream implementation, though. A multi-stream API is out of scope for this PR and might be introduced later.

Segment partitioning

The LRB approach discussed in the initial issue balances segments in a non-deterministic way. In some cases, it led to slowdowns because consecutive sub-warps might not end up processing consecutive segments. Here's the speedup of the LRB approach compared to the partitioning-based approach.

[figure]

I've tried applying LRB only to the large-segments group. This approach also leads to mixed performance results. In rare cases, when an extremely large segment is located at the tail of the large-segments group, LRB leads to performance improvements. Otherwise, there are enough segments to overlap its processing, and the LRB stage leads to slowdowns. Therefore, we decided to make LRB opt-in. The API with pre-partitioned segments is going to be implemented later. Here's the speedup of the version where I apply LRB to the large-segments group.

[figure]

Instead of LRB, I've implemented a three-way partition facility, which is included in this PR. It leads to a deterministic partitioning of segments and might be used outside of the segmented sort. The three-way partitioning stage is also faster than the LRB stage.
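As an illustration, segment indices can be split into large/medium/small groups like this. The sketch assumes the three-way cub::DevicePartition::If overload that exposes this facility, and the size thresholds are made up rather than taken from the tuning policies:

```cpp
#include <cub/device/device_partition.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <cstddef>

// Selection functors classifying a segment index by its size. A segment goes
// to the second (medium) group only if the first (large) predicate is false.
struct LargeSegmentOp
{
  const int *sizes;
  __device__ bool operator()(int segment_id) const { return sizes[segment_id] > 1024; }
};

struct MediumSegmentOp
{
  const int *sizes;
  __device__ bool operator()(int segment_id) const { return sizes[segment_id] > 32; }
};

int main()
{
  const int num_segments = 1 << 20;

  thrust::device_vector<int> segment_sizes(num_segments, 64); // example sizes
  thrust::device_vector<int> segment_ids(num_segments);
  thrust::sequence(segment_ids.begin(), segment_ids.end());

  thrust::device_vector<int> large_ids(num_segments);
  thrust::device_vector<int> medium_ids(num_segments);
  thrust::device_vector<int> small_ids(num_segments);
  thrust::device_vector<int> group_sizes(2); // number of large and medium segments

  LargeSegmentOp  is_large {thrust::raw_pointer_cast(segment_sizes.data())};
  MediumSegmentOp is_medium{thrust::raw_pointer_cast(segment_sizes.data())};

  // Usual two-phase CUB pattern: query the temporary storage size, then run.
  void *d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;

  for (int pass = 0; pass < 2; ++pass)
  {
    cub::DevicePartition::If(d_temp_storage, temp_storage_bytes,
                             thrust::raw_pointer_cast(segment_ids.data()),
                             thrust::raw_pointer_cast(large_ids.data()),
                             thrust::raw_pointer_cast(medium_ids.data()),
                             thrust::raw_pointer_cast(small_ids.data()),
                             thrust::raw_pointer_cast(group_sizes.data()),
                             num_segments, is_large, is_medium);

    if (pass == 0)
      cudaMalloc(&d_temp_storage, temp_storage_bytes);
  }

  cudaFree(d_temp_storage);
}
```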

Temporary storage layout

The temporary storage layout of the new segmented sort is quite complex. To simplify the code and make it safer, I've introduced temporary storage layout wrappers, which can be found in this PR.
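The idea, in a simplified form (this is not the wrapper interface from the PR, just the underlying pattern it encapsulates): hand out aligned, typed slots within a single user-provided buffer and report the total size for the usual two-phase temp-storage query.

```cpp
#include <cstddef>

// Toy layout helper: reserve typed, aligned slots and report the total size.
class TempStorageLayout
{
  std::size_t bytes_ = 0;

public:
  // Reserve `count` elements of type T and return their byte offset.
  template <typename T>
  std::size_t reserve(std::size_t count)
  {
    // Round the current size up to T's alignment, then append the slot.
    bytes_ = (bytes_ + alignof(T) - 1) / alignof(T) * alignof(T);
    const std::size_t offset = bytes_;
    bytes_ += count * sizeof(T);
    return offset;
  }

  std::size_t required_bytes() const { return bytes_; }
};

// Usage sketch: compute the layout once, reuse the offsets in both phases.
// TempStorageLayout layout;
// auto keys_offset    = layout.reserve<unsigned int>(num_items);
// auto offsets_offset = layout.reserve<int>(num_segments + 1);
// temp_storage_bytes  = layout.required_bytes();
```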

Performance

To benchmark the new algorithm on real-world data, I've converted a collection of sparse matrices to segment sizes. The number of segments equals the number of rows in a sparse matrix, while each segment's size equals the number of non-zero values in the corresponding row. Here's the speedup of the new algorithm compared to cub::DeviceSegmentedRadixSort on an A100 GPU while sorting pairs of std::uint32_t and std::uint64_t.

[figure]

| speedup | new algorithm | ModernGPU |
|---------|---------------|-----------|
| min     | 1.11          | 0.16      |
| median  | 61.24         | 7.93      |
| max     | 5314.44       | 2219.73   |
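For reference, the new facility follows the usual two-phase CUB pattern. Here is a minimal cub::DeviceSegmentedSort::SortPairs sketch with a tiny illustrative input (error handling omitted):

```cpp
#include <cub/device/device_segmented_sort.cuh>
#include <thrust/device_vector.h>
#include <cstddef>
#include <cstdint>
#include <vector>

int main()
{
  // Two segments of three items each.
  const int num_items    = 6;
  const int num_segments = 2;

  std::vector<std::uint32_t> h_keys   {8, 6, 7, 5, 3, 0};
  std::vector<std::uint64_t> h_values {0, 1, 2, 3, 4, 5};
  std::vector<int>           h_offsets{0, 3, 6};

  thrust::device_vector<std::uint32_t> keys_in(h_keys), keys_out(num_items);
  thrust::device_vector<std::uint64_t> values_in(h_values), values_out(num_items);
  thrust::device_vector<int>           offsets(h_offsets);

  const int *d_offsets = thrust::raw_pointer_cast(offsets.data());

  // Phase 1 queries the temporary storage size; phase 2 runs the sort.
  void *d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;

  for (int pass = 0; pass < 2; ++pass)
  {
    cub::DeviceSegmentedSort::SortPairs(
      d_temp_storage, temp_storage_bytes,
      thrust::raw_pointer_cast(keys_in.data()),
      thrust::raw_pointer_cast(keys_out.data()),
      thrust::raw_pointer_cast(values_in.data()),
      thrust::raw_pointer_cast(values_out.data()),
      num_items, num_segments,
      d_offsets,       // segment begin offsets
      d_offsets + 1);  // segment end offsets

    if (pass == 0)
      cudaMalloc(&d_temp_storage, temp_storage_bytes);
  }

  cudaFree(d_temp_storage);
  // keys_out is now {6, 7, 8, 0, 3, 5}, with values permuted accordingly.
}
```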

The speedup depends on the number of segments:
[figure]

@gevtushenko gevtushenko added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Aug 17, 2021
@gevtushenko gevtushenko force-pushed the main-feature/github/seg_sort branch 2 times, most recently from e24e68f to 89f587c Compare August 18, 2021 09:08
@gevtushenko gevtushenko added testing: gpuCI passed Passed gpuCI testing. and removed testing: gpuCI in progress Started gpuCI testing. labels Aug 18, 2021
@zasdfgbnm
Contributor

Hi, do we know how much this PR will benefit a well-balanced segmented sort with a small number of segments? For example, 1–256 segments with 120000 elements in each segment.

@gevtushenko
Collaborator Author

Hi, do we know how much this PR will benefit a well-balanced segmented sort with a small number of segments? For example, 1–256 segments with 120000 elements in each segment.

Hello, @zasdfgbnm! There should be no significant speedup in this case. The ModernGPU version of the segmented sort might perform better in the case of very few large segments, but its speedup doesn't scale well in the range of segments that you've mentioned.

[figure]

@zasdfgbnm
Contributor

@senior-zero Noted, thanks for the info!

@gevtushenko
Collaborator Author

DVS: 30309327

@gevtushenko gevtushenko added testing: internal ci passed Passed internal NVIDIA CI (DVS). and removed testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Aug 25, 2021
Collaborator

@alliepiper alliepiper left a comment


Really happy to see this patch! These speedups are incredible.

Questions related to the PR description:

  • In the speedup plots for segment size vs. number, which axis is which?
  • Why were WarpLoad/Store needed? Could they be replaced with a strided thread load?
  • Should WarpExchange be WarpShuffle (to match existing BlockShuffle)? (I forgot about BlockExchange and the existing convention, ignore this :P)
  • How long does it take to compile this implementation compared to the current implementation with the same inputs? What about binary size?
  • When describing the single-kernel approach, does "CTA number" refer to "current CTA index" or "number of CTAs"?
  • In the speedup plot comparing LRB to partitioning, can you describe the input a bit more? How big were the segments? Were they uniformly sized? How many segments in total? Is "matrix id" the number of segments?
  • Same questions as above about the speedup plot comparing this PR to DeviceSegmentedRadixSort.

Note that I haven't reviewed the code yet, so some of these may be answered in the patch. Feel free to just point out the relevant code where applicable. I'll post another review of the code soon.

@gevtushenko
Collaborator Author

gevtushenko commented Aug 26, 2021

In the speedup plots for segment size vs. number, which axis is which?

The Y-axis represents the segment size. The X-axis represents the number of segments in the form 1 << (14 + x).

Why were WarpLoad/Store needed? Could they be replaced with a strided thread load?

In some combinations of GPU architecture and ITEMS_PER_{SMALL,MEDIUM}_THREAD, WarpLoadAlgorithm::WARP_LOAD_TRANSPOSE is faster than striped loads. For example, Policy800 allocates up to 11 items per warp thread, which is enough to make striped loads ineffective. So WarpLoad/WarpStore are used as customization points.

How long does it take to compile this implementation compared to the current implementation with the same inputs? What about binary size?

It's about six times slower to compile the new version. The binary size is about 5 times larger.

When describing the single-kernel approach, does "CTA number" refer to "current CTA index" or "number of CTAs"?

It's the current CTA index.

In the speedup plot comparing LRB to partitioning, can you describe the input a bit more? How big were the segments? Were they uniformly sized? How many segments in total? Is "matrix id" the number of segments?

Same questions as above about the speedup plot comparing this PR to DeviceSegmentedRadixSort.

"To benchmark the new algorithm on real-world data, I've converted sparse matrices collection to segment sizes. Segments number is equal to the number of rows in a sparse matrix, while segment size is equal to the number of non-zero values in this row." So the matrix id is just an identifier of a sparse matrix that was converted to segment sizes. The maximal number of segments was 226196185. The maximal segment size was 97373173.

@alliepiper alliepiper self-assigned this Sep 21, 2021
@alliepiper alliepiper added this to Inbox in PR Tracking via automation Sep 21, 2021
@alliepiper alliepiper moved this from Inbox to Need Review in PR Tracking Sep 21, 2021
@alliepiper alliepiper added this to the 1.15.0 milestone Sep 21, 2021
@gevtushenko
Collaborator Author

gevtushenko commented Oct 6, 2021

It's about six times slower to compile the new version. The binary size is about 5 times larger.

I've removed all the policies except one for both cub::DeviceSegmentedRadixSort and cub::DeviceSegmentedSort to have a more honest comparison. After this change, the compilation time is only 2x slower and the binary size is only 1.5x larger. So the main source of the compilation-time issues is the tuning for the wider set of GPU architectures.

@alliepiper
Collaborator

The compilation time is only 2x slower and the binary size is only 1.5x.

Nice, that's much better than 5-6x :) This will be further improved by the if-target branch when it lands.

@gevtushenko
Collaborator Author

@allisonvacanti, thank you for your comments! I've addressed them in the "Fix review notes" commit and also rebased this branch. I'm looking forward to your opinion on these changes.

@gevtushenko gevtushenko moved this from Drafts to Need Testing in PR Tracking Oct 15, 2021
@gevtushenko gevtushenko added testing: gpuCI in progress Started gpuCI testing. testing: internal ci in progress Currently testing on internal NVIDIA CI (DVS). labels Oct 15, 2021
@gevtushenko
Copy link
Collaborator Author

DVS CL: 30537984

@alliepiper alliepiper moved this from Need Testing to Tests Pending in PR Tracking Oct 15, 2021
@gevtushenko gevtushenko force-pushed the main-feature/github/seg_sort branch 2 times, most recently from 92e130a to 7ba073c Compare October 18, 2021 18:59