CUDA quicksort #431
|
Nice, thanks! I'll try to give this a proper look, but two quick questions:
|
|
Interestingly, it seems to speed up when the block size is halved. The trade-offs when you use 512 threads compared to 1024:
Using |
I've noticed the same in JuliaGPU/GPUArrays.jl#301 (comment), maximizing occupancy doesn't always guarantee best performance. |
e590272 to
c0548bf
Compare
|
This one looks related: As well as: |
maleadt
left a comment
Starting to look very good, thanks! :-)
|
@maleadt thanks for the feedback! I'll happily implement all the code-organization/testing ones in the next push. The ones closer to the actual sorting will take some thought |
|
I guess something went wrong here merging? Better to rebase anyway. |
That commit should be the result of a rebase. It looks like tests passed on build 303 and it failed because of missing NetworkOptions. |
|
Well, something still went wrong because there's a bunch of commits of mine in here: https://github.com/JuliaGPU/CUDA.jl/pull/431/commits |
|
How do you feel about me closing this PR and opening a new one :) |
|
Just |
5d52af4 to
a70b02c
Compare
|
Can we compare the speed to SortingAlgorithms.jl's radix sort? |
a70b02c to
74eec29
Compare
|
The update for Julia 1.6 forced me to make some major updates, which have made this work a lot better (as a standalone algorithm). I think this is up to date with master, so I'm not sure why this is failing in CI on a dependency issue. That said, I am getting an error when I try to call Because of this, I have not yet addressed all the code organization comments above. |
Sure! This is from an AWS EC2 p3.2xlarge instance. Radix sort's performance varies significantly with element type, but Int32 gives a roughly average comparison. This also validates that the sort is correct. |
|
I missed your comments -- are you still running into the |
Putting all of |
|
Disabling memcheck is only for incompatibilities, while this seems like a legitimate issue: |
|
I have now reproduced the memcheck error. It occurs when sorting a list with a very large number of duplicates. However, if I don't run the test through memcheck, the same test passes (not just runs, but runs correctly). The recursion can go pretty deep in this case, and memcheck makes all launches blocking. If I prevent kernel launches at depth > 24, which is the maximum sync depth, the error goes away. The same error occurs if we run the following MWE through memcheck: To prevent such launches from happening, I can make the kernel aware of when it has found a large section of identical values. |
So it seems important to honor this limit. |
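For readers following along, here is a minimal CPU-only sketch of the two ideas discussed above: capping recursion depth at 24 and short-circuiting sections of identical values. It is not the PR's kernel code. The names `MAX_DEPTH`, `partition!`, and `bounded_quicksort!` are hypothetical, and the fallback to `sort!` merely stands in for "do not launch another nested kernel".

```julia
# CPU-only illustration; every name here is hypothetical, not taken from the PR.
const MAX_DEPTH = 24  # mirrors the sync-depth limit mentioned in the discussion

# Lomuto partition around the last element; returns the pivot's final index.
function partition!(v::AbstractVector, lo::Int, hi::Int)
    pivot = v[hi]
    i = lo - 1
    for j in lo:hi-1
        if v[j] < pivot
            i += 1
            v[i], v[j] = v[j], v[i]
        end
    end
    v[i+1], v[hi] = v[hi], v[i+1]
    return i + 1
end

# Sorts v[lo:hi] in place and returns the deepest recursion level reached.
function bounded_quicksort!(v::AbstractVector, lo::Int=firstindex(v),
                            hi::Int=lastindex(v), depth::Int=0)
    lo >= hi && return depth
    # A section of identical values is already sorted, so stop here.
    all(==(v[lo]), view(v, lo:hi)) && return depth
    if depth >= MAX_DEPTH
        # Stand-in for refusing a nested launch: finish this section
        # with Base's sort instead of recursing any further.
        sort!(view(v, lo:hi))
        return depth
    end
    p = partition!(v, lo, hi)
    left = bounded_quicksort!(v, lo, p - 1, depth + 1)
    right = bounded_quicksort!(v, p + 1, hi, depth + 1)
    return max(left, right)
end
```

Calling `bounded_quicksort!(rand(1:10, 10_000))` sorts the vector in place and returns a depth that never exceeds `MAX_DEPTH`; without the identical-values check, a vector of all-equal elements would otherwise drive this partition scheme one level deeper per element.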
68d5f29 to
e527584
Compare
Codecov Report
@@            Coverage Diff             @@
##           master     #431      +/-   ##
==========================================
- Coverage   78.79%   77.76%   -1.03%
==========================================
  Files         116      117       +1
  Lines        6890     7035     +145
==========================================
+ Hits         5429     5471      +42
- Misses       1461     1564     +103
|
Ah, we're still running into this apparently. Now it happened on an RTX2080, so it doesn't seem compute capability-related. The issue seems to require running under |
|
Oh FFS this is another case of JuliaGPU/CUDAnative.jl#4 😭 Forcing a
using CUDA, Test
function main()
for i in 1:typemax(Int)
@show i
A = rand(1:10, (2, 100000))
d_A = CuArray(A)
B = sort(A; dims=2)
d_B = sort(d_A; dims=2)
for x in (B, Array(d_B))
@test issorted(x[1,:])
@test issorted(x[2,:])
end
@test B == Array(d_B)
end
end
isinteractive() || main()
(but which only reproduces on select hardware/driver combinations, here an RTX 2080 Ti on driver 450.80.2) Filed as NVIDIA bug #3231266 |
884ad81 to
92699b6
Compare
|
All green! Let's merge this 🚀 |
An implementation of quicksort to address: #93
The performance is solid; see src/sorting/usage.jl for quick performance tests. I intend to later add handling for lists with a large number of duplicates, which can currently stymie the partitioning method.
Hopefully the included tests help clarify the inner workings. Be warned that this is likely not the most "Julian" code that's ever been written in Julia.
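As an aside on the duplicates problem mentioned in the description, one well-known option on the CPU is a three-way ("Dutch national flag") partition, which groups all copies of the pivot in a single pass so they are never partitioned again. The sketch below is only an illustration under that assumption; it is not what this PR implements, and `partition3!` and `quicksort3!` are hypothetical names.

```julia
# CPU-only illustration of three-way partitioning; names are hypothetical.
# Rearranges v[lo:hi] into [< pivot | == pivot | > pivot] and returns the
# index range holding the values equal to the pivot.
function partition3!(v::AbstractVector, lo::Int, hi::Int)
    pivot = v[lo + (hi - lo) ÷ 2]
    lt, i, gt = lo, lo, hi
    while i <= gt
        if v[i] < pivot
            v[lt], v[i] = v[i], v[lt]
            lt += 1
            i += 1
        elseif v[i] > pivot
            v[i], v[gt] = v[gt], v[i]
            gt -= 1          # v[i] is unexamined after the swap, so i stays put
        else
            i += 1
        end
    end
    return lt:gt
end

# Quicksort that recurses only into the strictly-smaller and strictly-larger parts.
function quicksort3!(v::AbstractVector, lo::Int=firstindex(v), hi::Int=lastindex(v))
    lo >= hi && return v
    eq = partition3!(v, lo, hi)
    quicksort3!(v, lo, first(eq) - 1)
    quicksort3!(v, last(eq) + 1, hi)
    return v
end
```

With this scheme a section made up of one repeated value is finished after a single partitioning pass, since the "equal" range covers the whole section and both recursive calls receive empty ranges.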