Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

document and update default choice between :dict and :radixsort #720

Merged
merged 7 commits into from
Sep 3, 2023

Conversation

LilithHafner
Copy link
Contributor

@LilithHafner LilithHafner commented Oct 1, 2021

Add potential reasons to choose :dict over :radixsort as a counting algorithm to documentation, and update the default choice for small inputs.

Addresses #517

@LilithHafner LilithHafner marked this pull request as draft October 22, 2021 21:49
@LilithHafner
Copy link
Contributor Author

Hold pending JuliaCollections/SortingAlgorithms.jl#51

@LilithHafner
Copy link
Contributor Author

After JuliaLang/julia#44230 it makes sense to change the default behavior to use default sort, which also changes this PR, so now hold pending Julia-1.9.0.

@nalimilan
Copy link
Member

Do you want to update this now that JuliaLang/julia#44230 has been merged?

@LilithHafner LilithHafner marked this pull request as ready for review September 2, 2023 19:54
@LilithHafner
Copy link
Contributor Author

Here are some benchmarks that informed my commentary:

Screenshot 2023-09-02 at 2 54 57 PM

A value V > 1 indicates that :dict is V times faster than :radixsort. A value -V > 1 indicates that :radixsort is V times faster than :dict.

Benchmarking code:
julia> using BenchmarkTools, StatsBase, Plots

julia> function f(rng,len,alg)
           evals = max(1, 10^4÷(len+5))
           samples = max(1, 10^7÷(evals*(len+5)))
           @belapsed countmap(x; alg=$alg) setup=(x=rand(1:$rng, $len)) evals=evals samples=samples gcsample=false gctrial=false
       end
f (generic function with 1 method)

julia> function f(rng, len) 
           len >= 10^8 && rng >= 10^8 && return NaN # it would take too long
           log(f(rng,len,:radixsort)/f(rng,len,:dict))
       end
f (generic function with 2 methods)

julia> lengths = 10 .^ (0:8); ranges = 10 .^ (0:10);

julia> @time data = [f(r,l) for l in lengths, r in ranges];
 78.848502 seconds (263.01 M allocations: 135.172 GiB, 7.51% gc time, 2.73% compilation time)

julia> data2 = [copysign(exp(abs(x)), x) for x in data];

julia> heatmap(ranges, lengths, data2, xscale=:log10, yscale=:log10, xlabel="range", ylabel="length")

Separately, the use of SortingAlgorithms.RadixSort causes increased complexity because we have to work around its bugginess and lack of generality. It would be better to switch to Base's default sorting algorithm and also not call the algorithm :radixsort, but simply :sort. That's for another day, though.

@nalimilan
Copy link
Member

Thanks. So why did you revert the check that the length is higher than 100? Isn't :dict faster in that case?

Separately, the use of SortingAlgorithms.RadixSort causes increased complexity because we have to work around its bugginess and lack of generality. It would be better to switch to Base's default sorting algorithm and also not call the algorithm :radixsort, but simply :sort. That's for another day, though.

Actually we should probably do this soon as it's needed to move these functions to Statistics (JuliaStats/Statistics.jl#87 (comment)).

@LilithHafner
Copy link
Contributor Author

it used to be a 100x performance loss to use radixsort on small sizes, now it's only about 1.5x, not worth the added complexity and runtime variability imo.

@nalimilan
Copy link
Member

Ah OK, you mean that thanks to JuliaCollections/SortingAlgorithms.jl#63 we use the radix sort implementation from Base now.

@nalimilan nalimilan merged commit 7ca72c7 into JuliaStats:master Sep 3, 2023
12 of 18 checks passed
@LilithHafner LilithHafner deleted the count-radix-dict-choice branch September 3, 2023 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants