
Bitonic mergesort #62

Open · wants to merge 5 commits into main
Conversation

@nlw0 (Contributor) commented Oct 11, 2022

Implements Batcher's bitonic mergesort. This algorithm effectively implements a sorting network, but can also be understood as a sorting algorithm.

Based on the layout explained in this Wikipedia section:
https://en.wikipedia.org/wiki/Bitonic_sorter#Alternative_representation
it becomes simple to implement a version that works with inputs of arbitrary length. We just assume inf for all the complementary entries that would pad the array to a power-of-two size, and simply skip them in the comparisons.
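As an illustration of that padding trick (a minimal sketch with a hypothetical helper name, not the PR code): a compare-exchange simply returns early whenever its partner index falls past the real end, which is equivalent to comparing against a virtual inf entry.

```julia
# Compare-exchange v[i] and v[j] (1-based), treating indices beyond the real
# length as virtual +inf entries: those comparisons are skipped, which is
# equivalent to padding the input up to the next power-of-two length with inf.
function exchange_padded!(v, i, j)
    N = length(v)
    (i > N || j > N) && return v  # partner is a virtual inf: nothing to do
    if v[j] < v[i]
        v[i], v[j] = v[j], v[i]
    end
    v
end

v = [5, 1, 4]               # length 3, conceptually padded to [5, 1, 4, inf]
exchange_padded!(v, 3, 4)   # partner index 4 is virtual: no-op
exchange_padded!(v, 1, 2)   # real pair: swaps 5 and 1
```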

The functions have been implemented forcing some variables to be Val, so that the compiler is more likely to vectorize the loops. Vectorization seems hard to trigger otherwise, especially for the first stage with its backwards iteration.

I haven't run speed experiments yet, but good vectorized code seems to be produced in at least some cases, and the tests already pass, so I thought I'd go ahead and get some feedback.

This PR was inspired by the previous discussion with @LilithHafner in #54.

@LilithHafner (Member) left a comment

Neat! This algorithm is hot right now, but that doesn't mean that much. I'd still like to see a few benchmarks demonstrating that there exist cases where it is the best algorithm available.

This implementation assumes one-based indexing. Use firstindex, lastindex, and/or eachindex to make it agnostic to the indexing scheme and to keep the @inbounds lookups from segfaulting on OffsetArrays.
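For illustration, a sketch of index-agnostic code (hypothetical function, not from the PR): all offsets are derived from firstindex/lastindex, so the same @inbounds loop is correct for 1-based vectors and OffsetArrays alike.

```julia
using Base.Order: Forward, lt

# One pass of adjacent compare-exchanges, written without assuming that
# the first element lives at index 1.
function compare_adjacent!(v::AbstractVector, o=Forward)
    fi, li = firstindex(v), lastindex(v)
    for i in fi:li-1
        @inbounds if lt(o, v[i+1], v[i])
            v[i], v[i+1] = v[i+1], v[i]
        end
    end
    v
end

compare_adjacent!([3, 1, 2])  # [1, 2, 3] after this single pass
```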

This implementation also seems more complex than it needs to be, but I don't have a substantially simpler implementation off the top of my head.

@@ -619,4 +640,54 @@ function sort!(v::AbstractVector, lo::Int, hi::Int, ::CombSortAlg, o::Ordering)
return sort!(v, lo, hi, InsertionSort, o)
end

function sort!(v::AbstractVector, lo::Int, hi::Int, ::BitonicSortAlg, o::Ordering)
return bitonicsort!(view(v, lo:hi), o::Ordering)
Member:

Using views in this context sometimes comes with a runtime performance penalty.

Contributor Author:

Is there an alternative? This is only called once, btw, would it still be relevant?

Member:

The alternative used throughout SortingAlgorithms and Base.Sort is to carry around lo and hi everywhere. It is annoying but efficient. If you can get away with using views without performance penalty that would be great, but you'd need to benchmark both ways to find out.

Yes, it would still be relevant because all future indexing operations will be on a View rather than a Vector (or other input type)
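For illustration, here are the two conventions side by side on a toy insertion sort (hypothetical functions, not the PR code); only benchmarking can tell whether the view-based variant pays a penalty for a given input type.

```julia
using Base.Order: Ordering, Forward, lt

# (a) view-based: the algorithm body only ever sees indices 1:length(w),
# but every w[j] access goes through a SubArray wrapper.
function insertion_view!(v::AbstractVector, lo::Int, hi::Int, o::Ordering=Forward)
    w = view(v, lo:hi)
    for i in 2:length(w)
        j = i
        while j > 1 && lt(o, w[j], w[j-1])
            w[j], w[j-1] = w[j-1], w[j]
            j -= 1
        end
    end
    v
end

# (b) lo/hi-based: the bounds are threaded through explicitly and the
# parent array is indexed directly.
function insertion_lohi!(v::AbstractVector, lo::Int, hi::Int, o::Ordering=Forward)
    for i in lo+1:hi
        j = i
        while j > lo && lt(o, v[j], v[j-1])
            v[j], v[j-1] = v[j-1], v[j]
            j -= 1
        end
    end
    v
end
```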


function bitonicsort!(data, o::Ordering)
N = length(data)
for n in 1:ceil(Int, max(0, log2(N)))
Member:

using leading_zeros would be much more efficient here for BitIntegers. idk how pronounced the impact on the whole would be.
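For reference, the loop bound ceil(Int, log2(N)) can be computed with integer-only bit twiddling (a sketch; ceil_log2 is a hypothetical name):

```julia
# ceil(log2(N)) for N >= 1 without floating point: 8sizeof(N) is the bit
# width of N's type (64 for Int64), so this counts the bits needed to
# represent N - 1.
ceil_log2(N::Integer) = N <= 1 ? 0 : 8sizeof(N) - leading_zeros(N - 1)

@assert all(ceil_log2(N) == ceil(Int, log2(N)) for N in 1:10_000)
```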

Contributor Author:

This only runs once at the start, but thanks, I didn't know about that function!

function bitonicsort!(data, o::Ordering)
N = length(data)
for n in 1:ceil(Int, max(0, log2(N)))
bitonicfirststage!(data, Val(n), o::Ordering)
Member:

This is type-unstable because the type of Val(n) depends on something other than the input types of bitonicsort!. It is possible but unlikely that it is a good idea to include type instability in a good sorting algorithm.

I would start with type-stable code and only switch to type-unstable if you can find a convincing reason to do so (in the case of loop unrolling, that would require end to end benchmarks showing the unstable version is faster).

Type instabilities also have disadvantages other than runtime performance (e.g. precompilation efficacy & static analysis). In this case, the first time someone sorts a very large list, there will be new compilation for a new n.

Contributor Author:

Can we simply force N and n to be integers somehow? And what makes it unstable in the first place?

The idea is indeed to be forcing compilation with knowledge of the input size, because this seems necessary to trigger compiler optimizations. I saw a lot more vectorization in the machine code with that, and I believe it won't be really interesting unless we do this.

We might have a limit above which we don't use compile-time-known values, if that's an issue.

Member:

N and n are already integers (length is documented to return an Integer), but that is not the issue. What makes Val(n) unstable is that typeof(Val(4)) == Val{4} != typeof(Val(3)). The compiler only knows about types, not values, so it knows the type of Val(n::Int) is Val{???} but it does not know what value n holds, so it doesn't know the concrete type. This is in contrast with Set(n::Int) which is of type Set{Int} where the compiler can deduce the type of the result from the types of the inputs.
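The instability is easy to observe with Base.return_types (a small illustration; unstable and stable are hypothetical names):

```julia
# Val puts the value into the type: Val(3) isa Val{3}, Val(4) isa Val{4}.
unstable(n::Int) = Val(n)  # concrete return type depends on the *value* of n
stable(n::Int) = Set(n)    # return type Set{Int} depends only on the *type* of n

@assert typeof(Val(3)) != typeof(Val(4))
@assert !isconcretetype(only(Base.return_types(unstable, (Int,))))  # inferred as abstract Val
@assert isconcretetype(only(Base.return_types(stable, (Int,))))     # inferred as Set{Int}
```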

Forcing separate compilation for each step size is possible but should not be necessary to get SIMD. I originally took this approach when implementing radix sort in Julia Base, compiling separately for every chunk size, but eventually realized there wasn't any noticeable benefit from doing that and removed the optimization as it wasn't helpful. This is the commit where I removed it: JuliaLang/julia@17f45e8. The parent commit contains an example that avoids dynamic dispatch for small values by using explicit if statements. It is ugly IMO and I would only support it if it gave substantial speedups.

While it is possible to glean valuable information from @code_llvm and @code_native outputs, it takes a great deal of understanding of CPU architectures to make accurate value judgements from them. I highly recommend benchmarking with @time (or BenchmarkTools' @btime) and profiling with Profile's @profile (or VSCode's @profview), because those tools can quickly and easily point to potential problems and detect improvements or regressions.

For example, @profview for _ in 1:1000 sort!(rand(Int, 80); alg=BitonicSort) end reveals that 95% of sorting time is spent in the bitonicsecondstage! function and less than 1% in the bitonicfirststage! function. This indicates a substantial performance problem in the bitonicsecondstage! function.

@profview for _ in 1:200000 sort!(rand(Int, 15); alg=BitonicSort) end indicates that about 65% of the sorting time is spent in the Val function and in runtime dispatch, indicating a problem there, too.

src/SortingAlgorithms.jl (outdated, resolved)
Comment on lines 668 to 669
a, b = v[ia + 1], v[ib + 1]
v[ia+1], v[ib+1] = lt(o, b, a) ? (b, a) : (a, b)
Member:

This pattern appears thrice now (including once in combsort). It is probably a good idea to factor it out into a swap function.

Contributor Author:

Sure. A better name might perhaps be comparator.

I think in previous versions of the code it was worth reading the values beforehand and leaving the re-assignment for later, but thankfully it's all really concentrated now, so we should do this.

Member:

A comparator is a function that takes two (or sometimes three or more) inputs and determines which is larger. lt(order, _, _) and isless are comparators. This function also conditionally swaps the elements. There may be better names for it than swap, but I don't like comparator.

https://en.wikipedia.org/wiki/Comparator
https://clojure.org/guides/comparators
https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html

gap = 1 << n
for i in 0:gap:N-1
lastj = min(N - 1, N - (i + gap >> 1 + 1))
for j in 0:lastj
Member:

When n = 1, will this loop run Θ(N^2) times? IIUC, n will equal 1 Θ(log2(N)^2) times which would give this algorithm a runtime of Ω(log(N)^2*N^2). That seems bad.

Perhaps the bounds of this loop should be tighter.

Contributor Author:

I'm still trying to understand it all, because the sorting-network literature is all about the delay, or depth... The Wikipedia page, though, clearly states the network has n log(n)^2 comparators, so I guess that should be the complexity. Not n log(n), but not too shabby, and the main attraction is parallelism anyway.

By eyeballing the diagram, or unrolling these two stages, we have this:

first_stage(data, 1)

first_stage(data, 2)
second_stage(data, 1)

first_stage(data, 3)
second_stage(data, 2)
second_stage(data, 1)

first_stage(data, 4)
second_stage(data, 3)
second_stage(data, 2)
second_stage(data, 1)

So there are log(n) of these larger blocks, and they grow linearly, making O(log(n)^2) function calls. Each call is linear, so O(n log(n)^2). Do you agree? Notice each call is linear because, although it has two for-loops, the intervals are such that it is actually a linear traversal. We might even reshape the input to implement this.
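That count can be sanity-checked numerically (a sketch, assuming N = 2^k and that every stage call performs N/2 compare-exchanges): block s of the schedule above consists of one first stage plus s-1 second stages, so summing over blocks recovers Batcher's closed form.

```julia
# Total compare-exchanges for N = 2^k: block s contributes s stage calls,
# each doing N/2 comparisons, for s = 1, ..., k.
comparisons(k) = sum((2^k ÷ 2) * s for s in 1:k)

# Matches Batcher's closed form N * k * (k + 1) / 4 = Θ(N log(N)^2):
@assert all(comparisons(k) == 2^k * k * (k + 1) ÷ 4 for k in 1:12)
```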

Member:

I agree that that is what the runtime should be, but I think there is an error in the implementation that results in much larger runtimes.

@nlw0 (Contributor Author) commented Oct 14, 2022

Oops, you're right! There was an extra outer for-loop in the second stage... Hopefully that helps a bit, but I'm still skeptical this is as good as it gets. I mean, I'm not even sure it will really be worth it in the end, but I don't think we're at the best for this algorithm yet.

src/SortingAlgorithms.jl (outdated, resolved)
This algorithm performs a series of pre-determined comparisons, and tends to be very parallelizable.
The algorithm effectively implements a sorting network based on merging bitonic sequences.

Characteristics:
Member:

It would be nice to have some characteristics that help people decide whether this algorithm is appropriate for their use case (e.g. when could it be better than the default sorting algorithms)


## References
- Batcher, K.E., (1968). "Sorting networks and their applications", AFIPS '68, doi: https://doi.org/10.1145/1468075.1468121.
Member:

This source claims on page 1 that it is possible to achieve a sorting network with depth (log2 n)*((log2 n)+1)/2 and comparisons n(log2 n)^2/4. Does this implementation achieve those values?

Contributor Author:

I think so, following the reasoning for the complexity I gave previously.

- *parallelizable*: suitable for vectorization with SIMD instructions because
it performs many independent comparisons.

## References
Member:

https://en.wikipedia.org/wiki/Bitonic_sorter is a great reference for this algorithm.

Contributor Author:

I'm not really a big fan of bloating the References section, or of citing Wikipedia, but it's fine by me if you think so.

@LilithHafner (Member):

fwiw, here's a first draft of an implementation that uses bit-twiddling to have simpler control flow (i.e. fewer and longer inner loops):

using Base.Order                   # brings Ordering, Forward, Reverse, lt into scope
using OffsetArrays: OffsetVector   # used by the test loop below

function bitonic_merge_sort!(v::AbstractVector, o::Ordering=Forward)
    n = Int(length(v))
    fi = firstindex(v)
    for k in 0:8sizeof(n-1)-leading_zeros(n-1)-1
        lo_mask = 1<<k-1
        i = 0
        while true
            lo_bits = i & lo_mask
            hi_bits = (i & ~lo_mask) << 1
            hi = hi_bits + lo_bits + lo_mask + 1
            hi >= n && break
            lo = hi_bits + lo_bits ⊻ lo_mask  # ⊻ (xor) mirrors the low bits: the first stage compares mirrored pairs
            @inbounds swap!(v, lo+fi, hi+fi, o)
            i += 1
        end
        for j in k-1:-1:0
            lo_mask = 1<<j-1
            i = 0
            while true
                lo = i & lo_mask + (i & ~lo_mask) << 1
                hi = lo + lo_mask + 1
                hi >= n && break
                @inbounds swap!(v, lo+fi, hi+fi, o)
                i += 1
            end
        end
    end
    v
end

Base.@propagate_inbounds function swap!(v, i, j, o)
    a, b = v[i], v[j]
    v[i], v[j] = lt(o, b, a) ? (b, a) : (a, b)
end

for _ in 1:1000
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10,rand(1:10)),rand(-10:10)))) || error()
    issorted(bitonic_merge_sort!(OffsetVector(rand(1:10,rand(1:10)),rand(-10:10)), Reverse), rev=true) || error()
end
julia> @btime sort!(x) setup=(x=rand(Int, 80)) evals=1;
  1.543 μs (0 allocations: 0 bytes)

julia> @btime bitonic_merge_sort!(x) setup=(x=rand(Int, 80)) evals=1;
  1.752 μs (0 allocations: 0 bytes)

julia> @btime sort!(x; alg=BitonicSort) setup=(x=rand(Int, 80)) evals=1;
  73.077 μs (28 allocations: 1.31 KiB)

I have yet to find a case where it outperforms the default algorithm (though I happen to be running on a branch which has pretty good default algorithms)

@nlw0 (Contributor Author) commented Oct 12, 2022

@LilithHafner that's great! I think I saw something similar in examples out there, but I didn't really understand it until now... Have you checked whether the compiler's loop vectorizations are kicking in, though? How does it compare to the original code? Also, I would expect that at least for very short lists it should perform great...

@LilithHafner (Member):

Have you checked if the loop vectorizations from the compiler are kicking in, though?

I don't see any simd instructions in @code_llvm bitonic_merge_sort!([1,2,3], Forward).

How does it compare to the original code?

About 50x faster than the PR code according to the benchmarks I posted above. Results are similar for longer vectors, but the default algorithms have asymptotic runtimes of O(n log n) and O(n), so bitonic merge sort with its O(n (log n)^2) runtime will probably only shine for smallish inputs. The benchmarks are incredibly unfair, though, because the PR still has some major performance issues that can probably be fairly easily resolved (#62 (comment) & #62 (comment)).

@codecov-commenter commented Oct 13, 2022

Codecov Report

Merging #62 (d4c23d9) into master (80c14f5) will increase coverage by 0.33%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
+ Coverage   96.51%   96.85%   +0.33%     
==========================================
  Files           1        1              
  Lines         344      381      +37     
==========================================
+ Hits          332      369      +37     
  Misses         12       12              
Impacted Files Coverage Δ
src/SortingAlgorithms.jl 96.85% <100.00%> (+0.33%) ⬆️


@nlw0 (Contributor Author) commented Oct 13, 2022

I made some of the changes, hopefully I can look at your implementation later today in more detail.


@nlw0 (Contributor Author) commented Oct 13, 2022

I've tried some benchmarking, with nasty results, I guess similar to what you observed. I find it strange, though, that I couldn't get reasonable times even for small inputs. I'll try something like generated functions or "hand-crafted" implementations with SIMD, just to see if there's any potential and to find a clue about what's going on...

nlw0 and others added 2 commits October 14, 2022 08:11
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
@LilithHafner (Member):

@nlw0 (Contributor Author) commented Oct 25, 2022

Nice! I don't think I had seen that before. I guess generated functions are really the way to go.

In the comments they say only inputs up to a certain size are accepted because of the sub-optimal complexity, but I'd argue that because of parallelism it's hard to know for sure where that point is, which is also supported by combsort's good performance. Maybe we could keep working on a more generic version to see how it goes. We probably need generated functions to get good machine code, though.

I wonder if the literature already has some modification of combsort that makes it look more like bitonic merge sort. The main difference I see, apart from the sub-list sizes all being powers of two, is that the intervals grow first, then shrink, then grow again. If we could prove that any of these properties puts a hard limit on the maximum distance of an entry from its final position, that would be it, I think.
