
Add a basic implementation of rand() for use inside kernels #772

Merged
merged 3 commits into JuliaGPU:master on Mar 24, 2021

Conversation

@S-D-R (Contributor) commented Mar 16, 2021

This PR adds a rand() function that is callable inside kernels. The RNG algorithm is based on https://discourse.julialang.org/t/generating-random-number-from-inside-kernel/8071/2. Documentation will be added once the code has been reviewed.
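For reference, the core of the linked approach is a Xorshift32 step. A minimal host-side sketch of that step (illustrative names, not this PR's actual API) might look like:

```julia
# Illustrative host-side sketch of Xorshift32, the RNG family the
# device-side rand() in this PR is based on. Names are hypothetical.
mutable struct Xorshift32
    state::UInt32   # must be nonzero; a zero state stays zero forever
end

function next!(rng::Xorshift32)
    x = rng.state
    x ⊻= x << 13
    x ⊻= x >> 17
    x ⊻= x << 5
    rng.state = x
    return x
end

# Map the top 24 bits of the state to a Float32 in [0, 1).
rand_f32(rng::Xorshift32) = Float32(next!(rng) >> 8) / Float32(1 << 24)
```

Each of the three xor-shift operations is an invertible linear map on the 32-bit state, so the step is a bijection with a single cycle of length 2^32 - 1 over the nonzero states.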

@maleadt (Member) left a comment:


Nice!

Review comments were left on src/compiler/execution.jl and src/device/intrinsics/random.jl (all threads resolved).
@maleadt added the "cuda kernels" (Stuff about writing CUDA kernels) and "enhancement" (New feature or request) labels on Mar 17, 2021
@S-D-R commented Mar 17, 2021

I changed seeding to happen purely inside the kernel via a seed_rng! function. While this makes kernel launches more efficient, I think the old API was a bit more elegant: when using a custom seed, the user now needs to remember to mix in the thread id, or else every thread will produce the same rand() output.

@maleadt commented Mar 17, 2021

> when using a custom seed, the user needs to remember to add the thread id

The thread id could be mixed in when generating a random number.
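One way to do that mixing, sketched here as a host-side function (hypothetical names, not this PR's code), is to hash the user's seed together with the thread index so that threads sharing a seed still get independent streams:

```julia
# Hypothetical sketch: derive a per-thread seed by hashing the user seed
# with the thread index. The finalizer-style mix (xor-shift, odd multiply,
# xor-shift) is a bijection, so distinct thread ids give distinct seeds.
function per_thread_seed(seed::UInt32, threadid::UInt32)
    x = seed ⊻ (threadid * 0x9e3779b9)  # Weyl-style mixing constant
    x ⊻= x >> 16
    x *= 0x85ebca6b                      # odd constant, invertible mod 2^32
    x ⊻= x >> 13
    return x
end
```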

@S-D-R S-D-R marked this pull request as ready for review March 17, 2021 19:02
@S-D-R commented Mar 18, 2021

Did some quick-and-dirty testing of the uniformity of rand() with 1024 random numbers:

function kernel(A::CuDeviceArray{Float32}, B::CuDeviceArray{Float32})
    tid = threadIdx().x
    CUDA.seed_rng!()
    A[tid] = CUDA.rand()
    B[tid] = CUDA.rand()
    return nothing
end

quantile(A, [0.0, 0.25, 0.5, 0.75, 1.0]) = [0.0003167390823364258, 0.25575685501098633, 0.4976065158843994, 0.73170006275177, 0.9974073171615601]
quantile(B, [0.0, 0.25, 0.5, 0.75, 1.0]) = [0.00041496753692626953, 0.23968505859375, 0.49785327911376953, 0.744731992483139, 0.9999158382415771]

function kernel(A::CuDeviceArray{Float32})
    CUDA.seed_rng!()
    for i = 1:N
        A[i] = CUDA.rand()
    end
    return nothing
end

quantile(A, [0.0, 0.25, 0.5, 0.75, 1.0]) = [0.0011812448501586914, 0.2613106966018677, 0.5093778371810913, 0.7636812925338745, 0.9997673034667969]

That seems about right to me. The only thing left is figuring out why the tests keep failing on a debug build of Julia.
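The same sanity check can be run on the host against a plain Xorshift32 stream, without a GPU (a sketch, not this PR's test suite):

```julia
using Statistics

# Host-side analogue of the uniformity check above: draw 1024 variates
# from a Xorshift32 stream and compare their empirical quantiles against
# the uniform ideal [0, 0.25, 0.5, 0.75, 1].
xorshift32(x::UInt32) = (x ⊻= x << 13; x ⊻= x >> 17; x ⊻= x << 5; x)

function uniform_sample(seed::UInt32, n::Int)
    vals = Vector{Float32}(undef, n)
    x = seed
    for i in 1:n
        x = xorshift32(x)
        vals[i] = Float32(x >> 8) / Float32(1 << 24)
    end
    return vals
end

quantile(uniform_sample(0xdeadbeef, 1024), [0.0, 0.25, 0.5, 0.75, 1.0])
```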

@maleadt commented Mar 19, 2021

Squashed & rebased; looking into some simplifications now.

@maleadt commented Mar 19, 2021

@simsurace

julia> broadcast!(rand, CUDA.zeros(10))
10-element CuArray{Float32, 1}:
 0.015692137
 0.031620786
 0.047304787
 0.06324157
 0.07893371
 0.09437306
 0.11005706
 0.12648316
 0.14217122
 0.15810394

Getting closer :-) This is built on RandomNumbers.jl, so maybe you could add the functionality you want (rand_binomial) over there? Or can we re-use Distributions.jl for that? The Xorshift32 RNG we're using here might also be insufficient for practical purposes. It would be good if you could chime in or improve those pieces here now, since you've been looking into all of that recently anyway.

@maleadt commented Mar 19, 2021

An edge case I thought of: HostKernel represents a compilation, so it's possible multiple instances are running on different streams. Swapping out the random state could break something there.

The compute-sanitizer failures seem to be happening in different, non-rand related tests, and I'm not sure why they only trigger on this PR.

@simsurace commented:

> Getting closer :-) This is built on RandomNumbers.jl, so maybe you could add the functionality you want (rand_binomial) over there? Or can we re-use Distributions.jl for that? The Xorshift32 RNG we're using here might also be insufficient for practical purposes. Would be good if you could chime in or improve those pieces here now, since you've been looking into all that recently anyway.

This is interesting. So this rand is basically an alternative to GPUArrays.gpu_rand? If so, I will give it a try. Using Distributions.jl we can do rand.(Binomial.(counts, probs)), but I don't actually see any support for rand!. We can also do broadcast((n,p)->rand(Binomial(n,p)), ns, ps), which has an in-place broadcast! variant. I'm not completely sure what is required to make this work for CuArrays as well; I'm quite new to GPU programming, and only started looking into it a few weeks ago because I needed binomial random variates sampled on the GPU.
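On the CPU, the pattern being described can be sketched as follows (plain Distributions.jl, not GPU code):

```julia
using Distributions

# CPU sketch of the broadcast pattern described above: sample one binomial
# variate per (n, p) pair, with an in-place variant via broadcast!.
ns = fill(10, 4)
ps = fill(0.5, 4)

draws = rand.(Binomial.(ns, ps))                         # allocating version

out = similar(ps, Int)
broadcast!((n, p) -> rand(Binomial(n, p)), out, ns, ps)  # in-place version
```

Each element of `out` is a draw from Binomial(10, 0.5), so it always lies between 0 and the corresponding n.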

@maleadt commented Mar 19, 2021

> So this rand is basically an alternative to the GPUArrays.gpu_rand?

This is rand from Random, made compatible by redefining default_rng on the device and using our own RNG. So yes, this is definitely the path forward, but we'll obviously have to flesh out the implementation by making sure the necessary APIs are GPU compatible and either overriding more calls or extending the RNG.

> I don't see any support for rand!

This is device-side; you don't need rand! there. Instead, you broadcast! with rand() (or rand_binomial()) in your kernel.
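Concretely, the usage pattern looks like this under the PR's API (note that rand_binomial is hypothetical; it does not exist in this PR):

```julia
using CUDA

# Fill an array with uniform variates by broadcasting the device-side
# rand(), as in the broadcast!(rand, ...) example earlier in this thread.
out = CUDA.zeros(Float32, 1024)
broadcast!(rand, out)

# A binomial sampler would follow the same pattern once it exists
# (rand_binomial is hypothetical, and ns/ps would be CuArrays):
# broadcast!((n, p) -> rand_binomial(n, p), out, ns, ps)
```

This requires a CUDA-capable GPU to run; the point is only that per-element sampling happens inside the broadcast kernel, so no host-side rand! is needed.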

@codecov bot commented Mar 23, 2021

Codecov Report

Merging #772 (14bb519) into master (247afb8) will increase coverage by 0.01%.
The diff coverage is 92.30%.

❗ Current head 14bb519 differs from pull request most recent head 620e9eb. Consider uploading reports for the commit 620e9eb to get more accurate results

@@            Coverage Diff             @@
##           master     #772      +/-   ##
==========================================
+ Coverage   78.92%   78.93%   +0.01%     
==========================================
  Files         123      123              
  Lines        7510     7519       +9     
==========================================
+ Hits         5927     5935       +8     
- Misses       1583     1584       +1     
Impacted Files              Coverage Δ
src/CUDA.jl                 100.00% <ø> (ø)
src/array.jl                 87.75% <ø> (+0.60%) ⬆️
src/compiler/execution.jl    91.36% <92.30%> (-0.11%) ⬇️
lib/cudadrv/stream.jl        80.43% <0.00%> (-2.18%) ⬇️
src/pool.jl                  83.58% <0.00%> (ø)
src/pool/binned.jl            0.00% <0.00%> (ø)
src/pool/simple.jl            0.00% <0.00%> (ø)
src/pool/split.jl            97.14% <0.00%> (+0.01%) ⬆️
lib/cutensor/CUTENSOR.jl    100.00% <0.00%> (+3.57%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maleadt commented Mar 23, 2021

Starting to look good. Note that the current implementation is slow, but not unusably so:

CURAND:

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     4.539 μs (0.00% GC)
  median time:      4.694 μs (0.00% GC)
  mean time:        4.700 μs (0.00% GC)
  maximum time:     9.496 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     15.074 μs (0.00% GC)
  median time:      15.798 μs (0.00% GC)
  mean time:        16.095 μs (0.00% GC)
  maximum time:     35.990 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync(rand!(a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  6
  --------------
  minimum time:     19.067 ms (0.00% GC)
  median time:      104.840 ms (0.33% GC)
  mean time:        99.918 ms (0.33% GC)
  maximum time:     105.639 ms (0.32% GC)
  --------------
  samples:          51
  evals/sample:     1

CUDA.jl:

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     8.752 μs (0.00% GC)
  median time:      8.967 μs (0.00% GC)
  mean time:        8.990 μs (0.00% GC)
  maximum time:     31.119 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024))
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     433.062 μs (0.00% GC)
  median time:      452.577 μs (0.00% GC)
  mean time:        452.522 μs (0.00% GC)
  maximum time:     568.405 μs (0.00% GC)
  --------------
  samples:          9469
  evals/sample:     1

julia> @benchmark CUDA.@sync(broadcast!(()->rand(Float32), a)) setup=(a=CuArray{Float32}(undef, 1024, 1024, 1024))
ERROR: Out of GPU memory trying to allocate 4.000 GiB

julia> sum(sizeof, map(kernel->kernel.random_state, values(CUDA.cufunction_cache[device()]))) |> Base.format_bytes
"4.004 GiB"

The reason it's slow is also the reason it failed the last benchmark: the random state is stored in global memory and is unique per thread (which is why 4 GiB of random state was being wasted). But we can start with this, and once #552 lands we can experiment more easily with storing random state in shared memory. The tricky part will be how seed! from within a kernel behaves once random state is aliased between threads; maybe it should then be documented that this value should only be set from thread 0.

Relevant resources for such an optimization: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-37-efficient-random-number-generation-and-application. A simple implementation like https://forums.developer.nvidia.com/t/random-numbers-inside-the-kernel/14222/10?u=tim5 should yield a large improvement.
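The shared-memory direction hinted at above could be sketched along these lines. This is an assumption about the optimization, not this PR's code: one 32-bit state per thread lives in shared memory, seeded from a block-unique value, so no per-kernel global state is kept.

```julia
using CUDA

# Hypothetical kernel layout (not this PR's implementation): per-thread
# Xorshift32 state in shared memory, seeded from the block and thread
# indices so that no global-memory random state is needed.
function kernel_shared_rng!(A, seed::UInt32)
    state = @cuStaticSharedMem(UInt32, 256)   # assumes blockDim().x == 256
    tid = threadIdx().x
    # Per-thread seed; force it nonzero, since Xorshift32 fixes zero.
    state[tid] = (seed ⊻ UInt32(tid) ⊻ (UInt32(blockIdx().x) << 16)) | 0x00000001
    sync_threads()
    x = state[tid]
    x ⊻= x << 13; x ⊻= x >> 17; x ⊻= x << 5   # Xorshift32 step
    state[tid] = x
    A[(blockIdx().x - 1) * 256 + tid] = Float32(x >> 8) / Float32(1 << 24)
    return nothing
end
```

As the comment above notes, once state is shared like this, an in-kernel seed! would alias between threads, which is why it might need to be restricted to a single thread per block.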

@maleadt maleadt merged commit ba8ff2d into JuliaGPU:master Mar 24, 2021
@simsurace commented:

Which branch should I clone to try this out? The master branch will not precompile on my machine.

@maleadt commented Mar 24, 2021

@simsurace commented:

Thanks!
