Add radix sort #44230

LilithHafner · 2022-02-17T21:51:52Z

I've implemented radix sort and integrated it into the Julia sorting architecture. Micro-benchmarking reports a substantial speedup in many cases and minor regressions in a few cases. I hope that with a bit of collaboration, we can minimize those regressions as well.

This grows out of a discussion on discourse and a PR to SortingAlgorithm.jl. It is inspired in part by the radix sort algorithm in SortingAlgorithms.jl. This is different from the radix sort algorithm in SortingAlgorithms.jl because it is substantially faster in most cases and more easily extensible.

This PR is based on a fast radix sort for unsigned integers in increasing order. To sort other types and orders—currently all signed and unsigned integers, Float32, Float64, and Char, for Forward and Reverse orderings are supported—we have a function Sort.Serial.serialize that converts objects to unsigned integers which are sorted and then the original objects are reconstructed via Sort.Serial.deserialize. These two functions come with a trait Sort.Serial.Serializable that indicates whether a type is serializable for a given ordering. If a package defines a custom type that can be sorted via radix sort, it may either a) do nothing and we will continue to dispatch to a comparison sort at compile time or b) define these three functions and we will dispatch to radix sort when appropriate.

I have left out many features and minor optimizations that have come up in discussion surrounding this algorithm, preferring simplicity over minor performance benefits. Those optimizations can be added now, later, or never.

A small sample of benchmarking vs. mar 1 master (3a5dc09)

@benchmark sort!(x) setup=(x=rand(1:1, 1)) evals=1 # 50%, 48% speedup
@benchmark sort!(x) setup=(x=rand(1:30, 30)) evals=1 # 36%, 28% speedup
@benchmark sort!(x) setup=(x=rand(1:1000, 1000)) evals=1 # 479%, 450% speedup
@benchmark sort!(x) setup=(x=rand(1:30000, 30000)) evals=1 # 635%, 793% speedup
@benchmark sort!(x) setup=(x=rand(1:1000000, 1000000)) evals=1 # 648%, 642% speedup

@benchmark sort!(x) setup=(x=rand(1)) evals=1 # 71%, 17% speedup
@benchmark sort!(x) setup=(x=rand(10)) evals=1 # 6%, 9% speedup
@benchmark sort!(x) setup=(x=rand(30)) evals=1 # 54%, 14% speedup
@benchmark sort!(x) setup=(x=rand(1000)) evals=1 # 76%, 73% speedup
@benchmark sort!(x) setup=(x=rand(30000)) evals=1 # 121%, 119% speedup
@benchmark sort!(x) setup=(x=rand(1000000)) evals=1 # 139%, 144% speedup

@benchmark sort!(x) setup=(x=rand(Char, 1)) evals=1 # 75%, 155% speedup
@benchmark sort!(x) setup=(x=rand(Char, 30)) evals=1 # 32%, 86% speedup
@benchmark sort!(x) setup=(x=rand(Char, 1000)) evals=1 # 272%, 284% speedup
@benchmark sort!(x) setup=(x=rand(Char, 30000)) evals=1 # 771%, 873% speedup
@benchmark sort!(x) setup=(x=rand(Char, 1000000)) evals=1 # 1002%, 965% speedup


@benchmark sort!(x) setup=(x=rand(Bool, 10^6)) evals=1 # 1843%, 2123% speedup
@benchmark sort!(x) setup=(x=rand(Int8, 10^6)) evals=1 # 5368%, 5272% speedup
@benchmark sort!(x) setup=(x=rand(UInt8, 10^6)) evals=1 # 5275%, 5162% speedup

@benchmark sort!(x) setup=(x=rand(Int16, 10^6)) evals=1 # 2204%, 2165% speedup
@benchmark sort!(x) setup=(x=rand(UInt16, 10^6)) evals=1 # 2197%, 2140% speedup
@benchmark sort!(x) setup=(x=rand(Float16, 10^6)) evals=1 # 2%, -1% speedup

@benchmark sort!(x) setup=(x=rand(Int32, 10^6)) evals=1 # 567%, 548% speedup
@benchmark sort!(x) setup=(x=rand(UInt32, 10^6)) evals=1 # 524%, 525% speedup
@benchmark sort!(x) setup=(x=rand(Char, 10^6)) evals=1 # 1001%, 944% speedup
@benchmark sort!(x) setup=(x=rand(Float32, 10^6)) evals=1 # 491%, 464% speedup

@benchmark sort!(x) setup=(x=rand(Int64, 10^6)) evals=1 # 120%, 118% speedup
@benchmark sort!(x) setup=(x=rand(UInt64, 10^6)) evals=1 # 104%, 106% speedup
@benchmark sort!(x) setup=(x=rand(Float64, 10^6)) evals=1 # 141%, 142% speedup

@benchmark sort!(x) setup=(x=rand(Int128, 10^6)) evals=1 # 12%, 6% speedup
@benchmark sort!(x) setup=(x=rand(UInt128, 10^6)) evals=1 # 1%, -2% speedup

(run 2 times, speedups computed by dividing minimum time on master by minimum time on ffcc043)

Problems & todos

@benchmark sort!(x) setup=(x=rand(1000)) is 300ns slower for radix sort than quick sort.
It should immediately dispatch to quick sort based only on length, so it is unclear why there
is such a substantial slowdown. This is probably related to the 100% slowdown for sorting
single floats. (addressed in Add radix sort #44230 (comment))
I've run millions of tests with SortMark.jl, but this still needs efficient comprehensive
testing in Base, and more hand crafted tests may still find bugs.
Sort.Serial is a bit of a name conflict with the existing package Serialization
Perhaps we should support passing in the temporary array. Is that performance feature used anywhere?
The heuristics right now are pretty primitive. Perhaps they can be improved or tuned.
Potentially rework integration with fpsort!
Perhaps we should integrate this with sortperm, right now it is incompatible, but with some changes, it could work.

base/sort.jl

LilithHafner · 2022-02-17T22:19:49Z

The build error is LoadError(at "compiler/compiler.jl" line 3: LoadError(at "sort.jl" line 672: UndefVarError(var=:Val))). Is this because Val is not available until after sort.jl in bootstrapping? Is it advisable to introduce RadixSort later on (I doubt we sort large arrays during bootstrapping)? (EDIT: resolved. This is my first time making a substantive addition to Base, and I forgot to add using ..Base: everything, i, use, from, base, even, if, exported. Val is defined very early in essentials.jl.)

quinnj · 2022-02-17T22:34:59Z

Yeah, this is exciting. I'll link to my own implementation that lives in the InlineStrings.jl package.

Some notable changes I made to the SortingAlgorithms.jl implementation:

It operates on any "primitive" type, so no specific restrictions on integer vs float vs anything else
I did some extensive benchmarking around the # of bits of the element type and at what point radix sort was suboptimal, findings here
I split out a Radix struct that makes it easy to play around with different radix sizes and stuff, which was helpful in benchmarking
Lots of code comments to explain what everything does! It's a trickier algorithm w/ lots of indirection, so I found it helpful to keep track of what's going on.

For InlineStrings.jl, there are fixed-size string types from 1-byte all the way up to 255-bytes, so there were a lot of permutations of performance to check; here's one sample graph of the String7 type.

Anyway, excited to see the progress in this direction!

LilithHafner · 2022-02-18T01:49:42Z

@quinnj's thresholds (64 bits -> 500 & 128 bits -> 2 million) line up okay with the thresholds I've chosen. I expect that the thresholds and heuristics in this PR could be tuned much more preciesly. If you would like to play around with thresholds, everything runs as is with Revise, I think it's just not building from source with the PR.

@quinnj, are there any changes you would recommend here based on your experience with InlineStrings?

LilithHafner · 2022-02-20T17:54:04Z

Hooray! Everything but buildkite passed on 7b665ef

LilithHafner · 2022-02-22T15:49:17Z

I'm giving up on understanding the ci errors and switching to blind bisection. I'm going to start by rebasing to the latest master and reverting everything.

nalimilan · 2022-03-31T08:13:08Z

Great! Looks good to me too.

It could make sense to use the same threshold for switching to insertion sort (currently 40) and for checking issorted (currently 200). If the overhead of issorted is < 1% for 200 elements, I guess it's < 5% for 40 elements? The advantage could be that 1) the rules are simpler, and 2) we are always fast for pre-sorted inputs, since InsertionSort is fast for them.

Adding a check for reverse sorting also sounds like a good idea to me, as long as we use a high enough threshold to keep the overhead < 1%. I imagine this kind of pattern can happen in real life if you sort data in one direction and then in the other one, and making this up to 7× slower than it could would be too bad.

BTW, it would be nice to keep these benchmarks somewhere. Maybe you make a PR against BaseBenchmarks after this one is merged?

test/sorting.jl

base/sort.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

LilithHafner · 2022-03-31T22:36:47Z

The advantage could be that 1) the rules are simpler, and 2) we are always fast for pre-sorted inputs, since InsertionSort is fast for them.

For adjusting the minimum presorted input threshold, I find your reasoning quite convincing. I like how you provide a theoretical consideration (1) and then back it up with a practical example of why this matters (2: instead of three cases with a bad middle, there are only two cases).

For reverse sorted input, this makes the algorithm marginally more complicated, though the runtime cost-benefit is huge. This also makes a nice pattern where very short (< 40) reverse-sorted input is pathological for insertion sort, medium-length (40 < len < 500) is handled decently by radix sort, and long (> 500) is handled perfectly with a reverse sorted check. Intuitively, as length increases, special-case detection improves. I added it.

LilithHafner · 2022-03-31T23:04:10Z

BTW, it would be nice to keep these benchmarks somewhere. Maybe you make a PR against BaseBenchmarks after this one is merged?

The benchmarks here are pretty similar to the benchmarks already in BaseBenchmarks/sort. The difference is that these span more input distributions, more types, more lengths, and longer lengths, and those at BaseBenchmarks cover sorting functions other than sort!. How do these tests cover a wider area? they take longer to run. These tests take a few hours to run. I might contribute a sparse cross product akin to what I've done in SortMark.jl, but it's not trivial to migrate these benchmarks to BaseBenchmarks without bringing along too much bloat.

LilithHafner · 2022-04-03T01:00:55Z

Okay, this looks good to me too. Thanks so much, @nalimilan, @oscardssmith, and everyone else who has had conversations and helped along the way!

oscardssmith · 2022-04-03T02:27:07Z

thanks so much for setting this through!

nalimilan · 2022-04-03T16:11:02Z

Thanks @LilithHafner! Don't hesitate if you feel like extending radix sort to Union{T, Missing} vectors. :-)

StefanKarpinski · 2022-04-04T21:37:13Z

Brava! Excellent work here.

Thought on the issorted check: we could introduce a sortcheck(v) utility function that returns 1 or 0 or -1 to indicate sorted, reverse sorted or neither. It would short-circuit, of course. I suspect that could be as fast as the issorted check.

LilithHafner · 2022-04-05T00:54:03Z

Thanks!

I'm having trouble coming up with an implementation of sortcheck that is substantially faster than two issorted checks.

StefanKarpinski · 2022-04-05T15:21:56Z

A little rough benchmarking seems to find that this can be substantially more efficient in some cases:

function sortcheck(v::Vector, lt=isless)
    cmp::UInt8 = 0b11
    if !isempty(v)
        y = v[1]
        @inbounds for i = 2:length(v)
            x = v[i]
            cmp &= 2*!lt(x, y) + !lt(y, x)
            iszero(cmp) && break
            y = x
        end
    end
    !iszero(cmp)
end

The worst case for issorted(v) || issorted(v, rev=true) is something like this:

julia> v = [fill(0, 1000); -1; 1];

julia> @btime issorted($v) || issorted($v, rev=true)
  990.154 ns (0 allocations: 0 bytes)
false

julia> @btime sortcheck($v)
  678.273 ns (0 allocations: 0 bytes)
false

In this case two issorted checks both have to scan through the entire array even though at the end of the first check it has already seen the writing on the wall (but not realized it). sortcheck does more work on each loop iteration but only scans the input once. It could probably be improved by eliminating that conditional break and allowing vectorization.

StefanKarpinski · 2022-04-05T15:27:31Z

Sure enough, this slightly incorrect 8x unrolling of the inner loop is 2x faster:

function sortcheck(v::Vector, lt=isless)
    cmp::UInt8 = 0b11
    if !isempty(v)
        y = v[1]
        @inbounds for i = 2:8:length(v)
            for j = 0:7
                x = v[i+j]
                cmp &= 2*!lt(x, y) + !lt(y, x)
                y = x
            end
            iszero(cmp) && break
        end
    end
    !iszero(cmp)
end

If the end of the loop was handled correctly, this could work. Paying attention to memory alignment might improve the speed more. I do wish the compiler could figure that transformation out based on the fact that the inner loop has no side effects and once cmp becomes zero it can't come back, but I suppose that's asking a lot of LLVM.

oscardssmith · 2022-04-05T15:30:22Z

The compiler problem is very similar to vectorizing findfirst automatically which would also be really nice.

LilithHafner · 2022-04-05T16:24:18Z

I'm more concerned with runtime on entirely unsorted input than on input with long sorted prefixes. On random data, the latest sortcheck is fairly expensive:

@belapsed((sort!(x); sort(x; rev=true)), setup=(x=rand(40)))/2 #1.6455882352941177e-7
@belapsed issorted(x) setup=(x=rand(40)) #3.382e-9    2%
@belapsed sortcheck(x) setup=(x=rand(40)) #1.292778335005015e-8  8%

StefanKarpinski · 2022-04-05T16:44:28Z

I was trying to make something faster than two issorted checks, which this is. But yes, it's still slower than just one issorted check, although only 6% of sort time on my system. Why are you sorting the data twice in your benchmark?

LilithHafner · 2022-04-06T00:00:16Z

I'm sorting twice and dividing time by 2 because otherwise it would be repeatedly resorting sorted data, giving an unrealistically fast sorting time. On random data on my system, I'm measuring sortcheck running 2.6x slower than two issorted checks.

@belapsed((sort!(x); sort(x; rev=true)), setup=(x=rand(40)))/2 #1.684660633484163e-7
@belapsed issorted(x) setup=(x=rand(40)) #3.383e-9.                 2%
@belapsed issorted(x) || issorted(x, Reverse) setup=(x=rand(40)) #4.9079079079079075e-9  3%
@belapsed sortcheck(x) setup=(x=rand(40)) #1.2937813440320962e-8    8%
@belapsed sort!(x) setup=(x=rand(40)) #1.1508637873754152e-7        70%

Add radix sort

ee3964b

oscardssmith reviewed Feb 17, 2022

View reviewed changes

base/sort.jl Outdated Show resolved Hide resolved

oscardssmith added the performance Must go faster label Feb 17, 2022

LilithHafner added 8 commits February 17, 2022 19:56

Try to fix build error (1)

569a5c1

Try to fix build error (2)

67b744c

Try to fix build error (3)

b8ad40f

Try to fix build error (4)

5fd8e42

Try to fix build error (5)

7cf473d

Try to fix build error (6)

d8363a5

@LilithHafner Try to fix build error (7)

44b8411

Try to fix build error (8)

a22b120

LilithHafner force-pushed the patch-9 branch from 0abb3d3 to a22b120 Compare February 19, 2022 17:50

LilithHafner added 6 commits February 19, 2022 11:52

Remove set default for debugging

f0d6536

Update sort_int_range! tests

c42ec18

put back update defalg after bootstrapping (ugly)

fc0e7bb

update usage of sort_int_range! in test to add lo, hi

7b665ef

whitespace

0c2e202

add RadixSort tests

44e4ea0

LilithHafner added 5 commits February 20, 2022 12:07

more radix sort tests

9f83339

fix typo

39216ad

maintain backwards compatability in sort_int_range

e045255

fix maybereverse use

3747e79

remove set defalg

2fc70fb

LilithHafner force-pushed the patch-9 branch from 2fc70fb to 548f561 Compare February 22, 2022 15:51

LilithHafner requested a review from DilumAluthge as a code owner February 22, 2022 15:51

nalimilan reviewed Mar 31, 2022

View reviewed changes

test/sorting.jl Outdated Show resolved Hide resolved

test/sorting.jl Outdated Show resolved Hide resolved

base/sort.jl Outdated Show resolved Hide resolved

base/sort.jl Outdated Show resolved Hide resolved

LilithHafner and others added 2 commits March 31, 2022 16:17

Apply suggestions from code review

c07fb05

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

adjust presorted checks

d6087d0

proof read

609abee

This was referenced Apr 3, 2022

radix sort fallback to Base on small input JuliaCollections/SortingAlgorithms.jl#51

Closed

A faster more extensible radix sort JuliaCollections/SortingAlgorithms.jl#53

Closed

oscardssmith merged commit 5dfd6c9 into JuliaLang:master Apr 3, 2022

nalimilan mentioned this pull request Apr 3, 2022

Make sortperm faster for large vectors JuliaCollections/SortingAlgorithms.jl#33

Closed

LilithHafner deleted the patch-9 branch April 4, 2022 04:03

LilithHafner mentioned this pull request Apr 5, 2022

document and update default choice between :dict and :radixsort JuliaStats/StatsBase.jl#720

Merged

vtjnash mentioned this pull request May 11, 2022

sorting an array of Union{} fails on nightly #45280

Closed

LilithHafner mentioned this pull request Jun 15, 2022

Core.Compiler calls into sort with undefined view #45522

Closed

LilithHafner added kind:potential benchmark Could make a good benchmark in BaseBenchmarks domain:sorting Put things in order labels Jul 19, 2022

ericphanson mentioned this pull request Oct 14, 2022

Optimize find_intersections for closed intervals invenia/Intervals.jl#203

Merged

LilithHafner mentioned this pull request Dec 4, 2022

Revise sort.md #47789

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add radix sort #44230

Add radix sort #44230

LilithHafner commented Feb 17, 2022 •

edited

Loading

LilithHafner commented Feb 17, 2022 •

edited

Loading

quinnj commented Feb 17, 2022

LilithHafner commented Feb 18, 2022

LilithHafner commented Feb 20, 2022 •

edited

Loading

LilithHafner commented Feb 22, 2022

nalimilan commented Mar 31, 2022

LilithHafner commented Mar 31, 2022

LilithHafner commented Mar 31, 2022

LilithHafner commented Apr 3, 2022

oscardssmith commented Apr 3, 2022

nalimilan commented Apr 3, 2022

StefanKarpinski commented Apr 4, 2022

LilithHafner commented Apr 5, 2022

StefanKarpinski commented Apr 5, 2022

StefanKarpinski commented Apr 5, 2022

oscardssmith commented Apr 5, 2022

LilithHafner commented Apr 5, 2022

StefanKarpinski commented Apr 5, 2022

LilithHafner commented Apr 6, 2022

Add radix sort #44230

Add radix sort #44230

Conversation

LilithHafner commented Feb 17, 2022 • edited Loading

LilithHafner commented Feb 17, 2022 • edited Loading

quinnj commented Feb 17, 2022

LilithHafner commented Feb 18, 2022

LilithHafner commented Feb 20, 2022 • edited Loading

LilithHafner commented Feb 22, 2022

nalimilan commented Mar 31, 2022

LilithHafner commented Mar 31, 2022

LilithHafner commented Mar 31, 2022

LilithHafner commented Apr 3, 2022

oscardssmith commented Apr 3, 2022

nalimilan commented Apr 3, 2022

StefanKarpinski commented Apr 4, 2022

LilithHafner commented Apr 5, 2022

StefanKarpinski commented Apr 5, 2022

StefanKarpinski commented Apr 5, 2022

oscardssmith commented Apr 5, 2022

LilithHafner commented Apr 5, 2022

StefanKarpinski commented Apr 5, 2022

LilithHafner commented Apr 6, 2022

LilithHafner commented Feb 17, 2022 •

edited

Loading

LilithHafner commented Feb 17, 2022 •

edited

Loading

LilithHafner commented Feb 20, 2022 •

edited

Loading