Add PatternDefeatingQuicksort #72

LSchwerdt · 2023-02-11T10:26:40Z

Add Pattern-defeating Quicksort, an unstable, O(1) space, guaranteed* O(n log n) time variant of quicksort. Both normal, and branchless (block-) partitioning schemes are implemented.

(*) by falling back to HeapSort for pathological inputs

Highlights (BranchlessPdqSort):

20% - 50% faster than current default when sorting Int128s and UInt128s (while using O(1) space instead of O(n))
Significantly faster than default for Int64s and UInt64s for special inputs:
- Almost presorted
- Almost reverse presorted
- Many duplicates, but wide value range
Edit: Enables sortperm using the same memory footprint as current default sortperm, by simultaneously sorting values and indices. This achieves 5x-20x speedup vs sortperm for 2 GiB of Int64s, depending on hardware. It is always faster than sortperm, except for
- very small problem sizes (that fit in L1$),
- medium problem sizes (that fit in L3$) when sortperm uses special optimizations (Floats, Missing).

Checklist

Open Question

BranchlessPdqSort is the main star of this PR. The only advantage of BranchyPdqSort is, that it uses zero allocations.
Maybe define const PdqSort = BranchlessPdqSort, or delete BranchyPdqSort entirely?

Notes

Uses randomization instead of deterministic pattern breaking after an unbalanced partition.
The implementations in Go and Rust do aswell, and the orginal author endorses that approach.
There are potentially faster QuickSort variants ([1],[2]), but implementing them in pure Julia is hard. /impossible?

Benchmarks

The benchmarks can be recreated using this Pluto notebook.
A script for very quick benchmarking is below at the end.

Full benchmark results (9d36ce4):

AMD 3900X, 2ch DDR4-3600 CL16 (≙ 8.9ns)
Intel 7820X, 4ch DDR4-2666 CL19 (≙14.3ns)
Intel 6700HQ, 2ch DDR4-2400 CL14 (≙11.7ns)

Here are the results from the 7820X (b603963):

Sort Int64

Sort Int128

Sort special inputs

SimultaneousSortperm

A package that implements sortperm using PdqSort is here.
The benchmarking code and more results are found here.

Results (Intel 7820x)

Quick benchmarking script

using BenchmarkTools, Random, SortingAlgorithms
versioninfo()
for i in 2:5
    n = 17^i
    println("sort!(rand(Int128, $n))")
    v = rand(Int128, n)
    print("Default:           "); @btime sort!($v) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end

function setupv!(v)
    v[1:end-1] = 1:length(v)-1
    v[end] = 10
end

for i in 2:5
    n = 17^i
    println("sort!([1:$n;10])")
    v = [1:n;10]
    print("Default:           "); @btime sort!($v) setup=(setupv!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(setupv!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(setupv!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(setupv!($v)) evals=1
end

for i in 2:5
    n = 17^i
    println("sort!(rand(Complex{Float64},$n), by=abs2)")
    v = rand(Complex{Float64},n)
    print("Default:           "); @btime sort!($v, by=abs2) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v, by=abs2; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v, by=abs2; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v, by=abs2; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end

Result on the 3900x (b603963)

Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores      
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
sort!(rand(Int128, 289))
Default:             5.400 μs (1 allocation: 4.69 KiB)
Quicksort:           6.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      6.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   4.700 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 4913))
Default:             119.900 μs (2 allocations: 76.86 KiB)
Quicksort:           165.500 μs (0 allocations: 0 bytes)
BranchyPdqSort:      171.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   103.500 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 83521))
Default:             2.711 ms (2 allocations: 1.27 MiB)
Quicksort:           4.190 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.260 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   2.294 ms (2 allocations: 1.06 KiB)
sort!(rand(Int128, 1419857))
Default:             60.424 ms (2 allocations: 21.67 MiB)
Quicksort:           89.065 ms (0 allocations: 0 bytes)
BranchyPdqSort:      91.504 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   46.587 ms (2 allocations: 1.06 KiB)
sort!([1:289;10])
Default:             1.200 μs (2 allocations: 2.77 KiB)
Quicksort:           1.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      1.000 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   1.200 μs (2 allocations: 1.06 KiB)
sort!([1:4913;10])
Default:             19.900 μs (3 allocations: 39.55 KiB)
Quicksort:           29.800 μs (0 allocations: 0 bytes)
BranchyPdqSort:      14.900 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   15.000 μs (2 allocations: 1.06 KiB)
sort!([1:83521;10])
Default:             448.400 μs (3 allocations: 656.80 KiB)
Quicksort:           651.300 μs (0 allocations: 0 bytes)
BranchyPdqSort:      257.300 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   257.700 μs (2 allocations: 1.06 KiB)
sort!([1:1419857;10])
Default:             13.690 ms (3 allocations: 10.83 MiB)
Quicksort:           13.954 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.400 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   4.178 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},289), by=abs2)
Default:             8.200 μs (1 allocation: 4.69 KiB)
Quicksort:           11.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      11.800 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   7.900 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},4913), by=abs2)
Default:             211.000 μs (2 allocations: 76.86 KiB)
Quicksort:           314.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      323.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   179.800 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},83521), by=abs2)
Default:             5.041 ms (2 allocations: 1.27 MiB)
Quicksort:           7.692 ms (0 allocations: 0 bytes)
BranchyPdqSort:      7.922 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   3.739 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},1419857), by=abs2)
Default:             109.978 ms (2 allocations: 21.67 MiB)
Quicksort:           166.103 ms (0 allocations: 0 bytes)
BranchyPdqSort:      172.113 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   75.914 ms (2 allocations: 1.06 KiB)

Additional correctness checks

No stable sorting guarantee -> simpler checks.

test script

using Base.Order
using Test
using SortingAlgorithms
for T in (Float64, Int, UInt8, UInt128)
    println(T)
    for alg in [BranchlessPdqSort, BranchyPdqSort]
        for order in [Forward, Reverse, By(identity)]
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n)
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n) .% 4
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = [9;sort(rand(T, n));8]
                @test sort(v; order) == sort(v; alg, order)
            end
        end
    end
end

codecov-commenter · 2023-02-11T11:51:41Z

Codecov Report

Merging #72 (8b44ef3) into master (2659334) will increase coverage by 0.96%.
The diff coverage is 98.37%.

@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   96.02%   96.98%   +0.96%     
==========================================
  Files           1        1              
  Lines         352      598     +246     
==========================================
+ Hits          338      580     +242     
- Misses         14       18       +4

Impacted Files	Coverage Δ
src/SortingAlgorithms.jl	`96.98% <98.37%> (+0.96%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

- remove selectpivot! function - use m instead of mid - use lo, m, hi in function signatures - midpoint instead of _midpoint

Test results: - 0.8.0: fail - 0.8.1: pass - 0.9.0: pass - 0.9.2: pass - 0.10.0: pass - 0.11.0: pass - 0.12.0: pass - 1.0.0: pass - 1.5.15: pass

... in case of a highly unbalanced partition. The randomization is inspired by ScratchQuickSort.

LSchwerdt mentioned this pull request Feb 12, 2023

Add PagedMergeSort, a merge sort using O(√n) space #71

Merged

9 tasks

LSchwerdt added 4 commits February 14, 2023 16:00

Add PatternDefeatingQuicksort

9d36ce4

Use backwards-compatible way to call insertionsort

a53f989

Return sorted vector

82a9d52

Add initial optimizations

bd4f351

LSchwerdt force-pushed the pdqsort branch from 0682a52 to bd4f351 Compare February 14, 2023 15:12

LSchwerdt added 5 commits February 15, 2023 14:37

Use HeapSort without initial optimization

fdd73c5

Add tests with special inputs for PdqSort

e6594ec

Refactor pivot selection

76efc09

- remove selectpivot! function - use m instead of mid - use lo, m, hi in function signatures - midpoint instead of _midpoint

Use name pivot_index to match ScratchQuickSort

8b44ef3

Add compat for StaticArrays

b603963

Test results: - 0.8.0: fail - 0.8.1: pass - 0.9.0: pass - 0.9.2: pass - 0.10.0: pass - 0.11.0: pass - 0.12.0: pass - 1.0.0: pass - 1.5.15: pass

LSchwerdt marked this pull request as ready for review February 16, 2023 17:36

Break patterns pseudorandomly

9816641

... in case of a highly unbalanced partition. The randomization is inspired by ScratchQuickSort.

LSchwerdt mentioned this pull request Feb 19, 2023

2x faster tuple sorting (for 2-Tuples of Integers) JuliaLang/julia#48724

Closed

LilithHafner added the new algorithm label Dec 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add PatternDefeatingQuicksort #72

Add PatternDefeatingQuicksort #72

LSchwerdt commented Feb 11, 2023 •

edited

Loading

codecov-commenter commented Feb 11, 2023 •

edited

Loading

Add PatternDefeatingQuicksort #72

Are you sure you want to change the base?

Add PatternDefeatingQuicksort #72

Conversation

LSchwerdt commented Feb 11, 2023 • edited Loading

Highlights (BranchlessPdqSort):

Checklist

Open Question

Notes

Benchmarks

Sort Int64

Sort Int128

Sort special inputs

SimultaneousSortperm

Quick benchmarking script

Additional correctness checks

codecov-commenter commented Feb 11, 2023 • edited Loading

Codecov Report

LSchwerdt commented Feb 11, 2023 •

edited

Loading

codecov-commenter commented Feb 11, 2023 •

edited

Loading