
Add a basic implementation of rand() for use inside kernels #772

Merged
merged 3 commits into JuliaGPU:master on Mar 24, 2021

Conversation

@S-D-R (Contributor) commented Mar 16, 2021

This PR adds a rand() function that is callable inside kernels. The RNG algorithm is based on https://discourse.julialang.org/t/generating-random-number-from-inside-kernel/8071/2. Documentation will be added once the code has been reviewed.
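For reference, the core of the linked approach is a Xorshift32 step. A minimal host-side sketch of that step (illustrative names, not this PR's actual API) might look like:

```julia
# Illustrative host-side sketch of Xorshift32, the RNG family the
# device-side rand() in this PR is based on. Names are hypothetical.
mutable struct Xorshift32
    state::UInt32   # must be nonzero; a zero state stays zero forever
end

function next!(rng::Xorshift32)
    x = rng.state
    x ⊻= x << 13
    x ⊻= x >> 17
    x ⊻= x << 5
    rng.state = x
    return x
end

# Map the top 24 bits of the state to a Float32 in [0, 1).
rand_f32(rng::Xorshift32) = Float32(next!(rng) >> 8) / Float32(1 << 24)
```

Each of the three xor-shift operations is an invertible linear map on the 32-bit state, so the step is a bijection with a single cycle of length 2^32 - 1 over the nonzero states.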

@maleadt (Member) left a comment:


Nice!

Review comments were left on src/compiler/execution.jl and src/device/intrinsics/random.jl (all threads resolved).
@maleadt added the "cuda kernels" (Stuff about writing CUDA kernels) and "enhancement" (New feature or request) labels on Mar 17, 2021
@S-D-R commented Mar 17, 2021

I changed seeding to happen purely inside the kernel via a seed_rng! function. While this makes kernel launches more efficient, I think the old API was a bit more elegant: when using a custom seed, the user now needs to remember to mix in the thread id, or else every thread will produce the same rand() output.

@maleadt commented Mar 17, 2021

> when using a custom seed, the user needs to remember to add the thread id

The thread id could be mixed in when generating a random number.
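One way to do that mixing, sketched here as a host-side function (hypothetical names, not this PR's code), is to hash the user's seed together with the thread index so that threads sharing a seed still get independent streams:

```julia
# Hypothetical sketch: derive a per-thread seed by hashing the user seed
# with the thread index. The finalizer-style mix (xor-shift, odd multiply,
# xor-shift) is a bijection, so distinct thread ids give distinct seeds.
function per_thread_seed(seed::UInt32, threadid::UInt32)
    x = seed ⊻ (threadid * 0x9e3779b9)  # Weyl-style mixing constant
    x ⊻= x >> 16
    x *= 0x85ebca6b                      # odd constant, invertible mod 2^32
    x ⊻= x >> 13
    return x
end
```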

@S-D-R S-D-R marked this pull request as ready for review March 17, 2021 19:02
@S-D-R commented Mar 18, 2021

Did some quick-and-dirty testing of the uniformity of rand() with 1024 random numbers:

function kernel(A::CuDeviceArray{Float32}, B::CuDeviceArray{Float32})
    tid = threadIdx().x
    CUDA.seed_rng!()
    A[tid] = CUDA.rand()
    B[tid] = CUDA.rand()
    return nothing
end

quantile(A, [0.0, 0.25, 0.5, 0.75, 1.0]) = [0.0003167390823364258, 0.25575685501098633, 0.4976065158843994, 0.73170006275177, 0.9974073171615601]
quantile(B, [0.0, 0.25, 0.5, 0.75, 1.0]) = [0.00041496753692626953, 0.23968505859375, 0.49785327911376953, 0.744731992483139, 0.9999158382415771]

function kernel(A::CuDeviceArray{Float32})
    CUDA.seed_rng!()
    for i = 1:N
        A[i] = CUDA.rand()
    end
    return nothing
end

quantile(A, [0.0, 0.25, 0.5, 0.75, 1.0]) = [0.0011812448501586914, 0.2613106966018677, 0.5093778371810913, 0.7636812925338745, 0.9997673034667969]

That seems about right to me. The only thing left is figuring out why the tests keep failing on a debug build of Julia.
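The same sanity check can be run on the host against a plain Xorshift32 stream, without a GPU (a sketch, not this PR's test suite):

```julia
using Statistics

# Host-side analogue of the uniformity check above: draw 1024 variates
# from a Xorshift32 stream and compare their empirical quantiles against
# the uniform ideal [0, 0.25, 0.5, 0.75, 1].
xorshift32(x::UInt32) = (x ⊻= x << 13; x ⊻= x >> 17; x ⊻= x << 5; x)

function uniform_sample(seed::UInt32, n::Int)
    vals = Vector{Float32}(undef, n)
    x = seed
    for i in 1:n
        x = xorshift32(x)
        vals[i] = Float32(x >> 8) / Float32(1 << 24)
    end
    return vals
end

quantile(uniform_sample(0xdeadbeef, 1024), [0.0, 0.25, 0.5, 0.75, 1.0])
```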

@maleadt commented Mar 19, 2021

Squashed & rebased; looking into some simplifications now.

@maleadt commented Mar 19, 2021

@simsurace

julia> broadcast!(rand, CUDA.zeros(10))
10-element CuArray{Float32, 1}:
 0.015692137
 0.031620786
 0.047304787
 0.06324157
 0.07893371
 0.09437306
 0.11005706
 0.12648316
 0.14217122
 0.15810394

Getting closer :-) This is built on RandomNumbers.jl, so maybe you could add the functionality you want (rand_binomial) over there? Or can we re-use Distributions.jl for that? The Xorshift32 RNG we're using here might also be insufficient for practical purposes. It would be good if you could chime in or improve those pieces here now, since you've been looking into all of that recently anyway.

@maleadt commented Mar 19, 2021

An edge case I thought of: HostKernel represents a compilation, so it's possible multiple instances are running on different streams. Swapping out the random state could break something there.

The compute-sanitizer failures seem to be happening in different, non-rand related tests, and I'm not sure why they only trigger on this PR.

@simsurace commented:

> Getting closer :-) This is built on RandomNumbers.jl, so maybe you could add the functionality you want (rand_binomial) over there? Or can we re-use Distributions.jl for that? The Xorshift32 RNG we're using here might also be insufficient for practical purposes. Would be good if you could chime in or improve those pieces here now, since you've been looking into all that recently anyway.

This is interesting. So this rand is basically an alternative to GPUArrays.gpu_rand? If so, I will give it a try. Using Distributions.jl we can do rand.(Binomial.(counts, probs)), but I don't actually see any support for rand!. We can also do broadcast((n,p)->rand(Binomial(n,p)), ns, ps), which has an in-place broadcast! variant. I'm not completely sure what is required to make this work for CuArrays as well; I'm quite new to GPU programming, and only started looking into it a few weeks ago because I needed binomial random variates sampled on the GPU.
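On the CPU, the pattern being described can be sketched as follows (plain Distributions.jl, not GPU code):

```julia
using Distributions

# CPU sketch of the broadcast pattern described above: sample one binomial
# variate per (n, p) pair, with an in-place variant via broadcast!.
ns = fill(10, 4)
ps = fill(0.5, 4)

draws = rand.(Binomial.(ns, ps))                         # allocating version

out = similar(ps, Int)
broadcast!((n, p) -> rand(Binomial(n, p)), out, ns, ps)  # in-place version
```

Each element of `out` is a draw from Binomial(10, 0.5), so it always lies between 0 and the corresponding n.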

@maleadt commented Mar 19, 2021

> So this rand is basically an alternative to the GPUArrays.gpu_rand?

This is rand from Random, made compatible by redefining default_rng on the device and using our own RNG. So yes, this is definitely the path forward, but we'll obviously have to flesh out the implementation by making sure the necessary APIs are GPU compatible and either overriding more calls or extending the RNG.

> I don't see any support for rand!

This is device-side; you don't need rand! there. Instead, you broadcast! with rand() (or rand_binomial()) in your kernel.
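Concretely, the usage pattern looks like this under the PR's API (note that rand_binomial is hypothetical; it does not exist in this PR):

```julia
using CUDA

# Fill an array with uniform variates by broadcasting the device-side
# rand(), as in the broadcast!(rand, ...) example earlier in this thread.
out = CUDA.zeros(Float32, 1024)
broadcast!(rand, out)

# A binomial sampler would follow the same pattern once it exists
# (rand_binomial is hypothetical, and ns/ps would be CuArrays):
# broadcast!((n, p) -> rand_binomial(n, p), out, ns, ps)
```

This requires a CUDA-capable GPU to run; the point is only that per-element sampling happens inside the broadcast kernel, so no host-side rand! is needed.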

@codecov bot commented Mar 23, 2021

Codecov Report

Merging #772 (14bb519) into master (247afb8) will increase coverage by 0.01%.
The diff coverage is 92.30%.

❗ Current head 14bb519 differs from pull request most recent head 620e9eb. Consider uploading reports for the commit 620e9eb to get more accurate results

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
+ Coverage   78.92%   78.93%   +0.01%     
==========================================
  Files         123      123              
  Lines        7510     7519       +9     
==========================================
+ Hits         5927     5935       +8     
- Misses       1583     1584       +1     
Impacted Files              Coverage Δ
src/CUDA.jl                 100.00% <ø> (ø)
src/array.jl                 87.75% <ø> (+0.60%) ⬆️
src/compiler/execution.jl    91.36% <92.30%> (-0.11%) ⬇️
lib/cudadrv/stream.jl        80.43% <0.00%> (-2.18%) ⬇️
src/pool.jl                  83.58% <0.00%> (ø)
src/pool/binned.jl            0.00% <0.00%> (ø)
src/pool/simple.jl            0.00% <0.00%> (ø)
src/pool/split.jl            97.14% <0.00%> (+0.01%) ⬆️
lib/cutensor/CUTENSOR.jl    100.00% <0.00%> (+3.57%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt commented Mar 23, 2021

Starting to look good. Note that the current implementation is slow, but not unusably so:

CURAND:

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.539 μs (0.00% GC)
  median time:      4.694 μs (0.00% GC)
  mean time:        4.700 μs (0.00% GC)
  maximum time:     9.496 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     15.074 μs (0.00% GC)
  median time:      15.798 μs (0.00% GC)
  mean time:        16.095 μs (0.00% GC)
  maximum time:     35.990 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  6
  --------------
  minimum time:     19.067 ms (0.00% GC)
  median time:      104.840 ms (0.33% GC)
  mean time:        99.918 ms (0.33% GC)
  maximum time:     105.639 ms (0.32% GC)
  --------------
  samples:          51
  evals/sample:     1

CUDA.jl:

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     8.752 μs (0.00% GC)
  median time:      8.967 μs (0.00% GC)
  mean time:        8.990 μs (0.00% GC)
  maximum time:     31.119 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     433.062 μs (0.00% GC)
  median time:      452.577 μs (0.00% GC)
  mean time:        452.522 μs (0.00% GC)
  maximum time:     568.405 μs (0.00% GC)
  --------------
  samples:          9469
  evals/sample:     1

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
ERROR: Out of GPU memory trying to allocate 4.000 GiB

julia> sum(sizeof, map(kernel->kernel.random_state, values(CUDA.cufunction_cache[device()]))) |> Base.format_bytes
"4.004 GiB"

The reason it's slow is also the reason it failed the last benchmark: the random state is stored in global memory and is unique per thread (which is why 4 GiB of random state was being wasted). But we can start with this, and once #552 lands we can experiment more easily with storing random state in shared memory. The tricky part will be how seed! from within a kernel behaves once random state is aliased between threads; maybe it should then be documented that this value should only be set from thread 0.

Relevant resources for such an optimization: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application. A simple implementation like https://forums.developer.nvidia.com/t/random-numbers-inside-the-kernel/14222/10?u=tim5 should yield a large improvement.
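The shared-memory direction hinted at above could be sketched along these lines. This is an assumption about the optimization, not this PR's code: one 32-bit state per thread lives in shared memory, seeded from a block-unique value, so no per-kernel global state is kept.

```julia
using CUDA

# Hypothetical kernel layout (not this PR's implementation): per-thread
# Xorshift32 state in shared memory, seeded from the block and thread
# indices so that no global-memory random state is needed.
function kernel_shared_rng!(A, seed::UInt32)
    state = @cuStaticSharedMem(UInt32, 256)   # assumes blockDim().x == 256
    tid = threadIdx().x
    # Per-thread seed; force it nonzero, since Xorshift32 fixes zero.
    state[tid] = (seed ⊻ UInt32(tid) ⊻ (UInt32(blockIdx().x) << 16)) | 0x00000001
    sync_threads()
    x = state[tid]
    x ⊻= x << 13; x ⊻= x >> 17; x ⊻= x << 5   # Xorshift32 step
    state[tid] = x
    A[(blockIdx().x - 1) * 256 + tid] = Float32(x >> 8) / Float32(1 << 24)
    return nothing
end
```

As the comment above notes, once state is shared like this, an in-kernel seed! would alias between threads, which is why it might need to be restricted to a single thread per block.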

@maleadt maleadt merged commit ba8ff2d into JuliaGPU:master Mar 24, 2021
@simsurace commented:

Which branch should I clone to try this out? The master branch will not precompile on my machine.

@maleadt commented Mar 24, 2021

@simsurace commented:

Thanks!
