Add a basic implementation of rand() for use inside kernels #772
Conversation
Nice!
I changed the seeding to happen purely inside the kernel via a …
The thread id could be mixed in when generating a random number.
Did some quick and dirty testing of the uniformity of …
Which seems about right to me. The only thing that's left is figuring out why the tests keep failing on debug Julia.
Squashed & rebased; looking into some simplifications now.
julia> broadcast!(rand, CUDA.zeros(10))
10-element CuArray{Float32, 1}:
0.015692137
0.031620786
0.047304787
0.06324157
0.07893371
0.09437306
0.11005706
0.12648316
0.14217122
 0.15810394

Getting closer :-) This is built on RandomNumbers.jl, so maybe you could add the functionality you want (…
An edge case I thought of: HostKernel represents a compilation, so it's possible that multiple instances are running on different streams, and swapping out the random state could break something there. The compute-sanitizer failures seem to be happening in different, non-rand-related tests, and I'm not sure why they only trigger on this PR.
This is interesting. So this …
This is …
This is device-side; you don't need …
Branch updated from 84d94d2 to 14bb519.
Codecov Report

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
+ Coverage   78.92%   78.93%   +0.01%
==========================================
  Files         123      123
  Lines        7510     7519       +9
==========================================
+ Hits         5927     5935       +8
- Misses       1583     1584       +1

Continue to review full report at Codecov.
Starting to look good. Note that the current implementation is slow, but not unusably so:

CURAND: …
CUDA.jl: …

The reason it's slow is also the reason it failed the last benchmark: the random state is stored in global memory and is unique per thread (which is why there was 4GB of random state being wasted). But we can start with this, and once #552 lands we can experiment more easily with storing random state in shared memory. The tricky part will be how …

Relevant resources for such an optimization: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application. A simple implementation like https://forums.developer.nvidia.com/t/random-numbers-inside-the-kernel/14222/10?u=tim5 should yield a large improvement.
Which branch should I clone to try this out? master will not precompile on my machine.
Thanks!
This PR adds a rand() function that is callable inside kernels. The RNG algorithm is based on https://discourse.julialang.org/t/generating-random-number-from-inside-kernel/8071/2. Documentation will be added once the code has been reviewed.