Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use randn! for stochastic forcing implementations #351

Merged
merged 6 commits into from
Mar 11, 2024

Conversation

navidcy
Copy link
Member

@navidcy navidcy commented Mar 10, 2024

This forcing implementation ensures non-allocating calcF! methods both for CPU and GPU.

Closes #350

Few benchmarks:

# CPU with Fh .= sqrt.(forcing_spectrum) .* cis.(2π * random_uniform(T, size(forcing_spectrum)...)) ./ sqrt(clock.dt)
julia> @btime calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  239.890 μs (10 allocations: 130.23 KiB)

# GPU with Fh .= sqrt.(forcing_spectrum) .* cis.(2π * random_uniform(T, size(forcing_spectrum)...)) ./ sqrt(clock.dt)
julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  26.709 μs (49 allocations: 2.59 KiB)

# randn! on CPU
julia> @btime calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  122.298 μs (4 allocations: 96 bytes)

# randn! on GPU
julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)

Thus, this PR is 1.5-2x faster than the solution originally proposed in #350 and with less allocations.

@navidcy navidcy added 🐞 bug Something isn't working 🎮 gpu labels Mar 10, 2024
Copy link
Member

@glwagner glwagner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to perform long-running simulations on the GPU when there are allocations? Can GPU garbage collection keep up?

@navidcy
Copy link
Member Author

navidcy commented Mar 10, 2024

Is it possible to perform long-running simulations on the GPU when there are allocations? Can GPU garbage collection keep up?

I'm not sure about that.
But I'm also bit confused regarding where are the allocations coming from.

@glwagner
Copy link
Member

random_uniform(T, size(forcing_spectrum)...))

Calling CUDA.randn(sz...) allocates an array, and then populates it with random numbers.

@navidcy
Copy link
Member Author

navidcy commented Mar 10, 2024

random_uniform(T, size(forcing_spectrum)...))

Calling CUDA.randn(sz...) allocates an array, and then populates it with random numbers.

Oh yeah... that was the "allocating" version I suggested in the issue. The PR doesn't have that version, I just put it here for comparison.

But still using randn! you see there are some allocations... Those I don't understand where they come from.

julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)

@navidcy navidcy merged commit 7693b7b into main Mar 11, 2024
5 checks passed
@navidcy navidcy deleted the ncc/fix-stoch-forcing branch March 11, 2024 05:45
@glwagner
Copy link
Member

random_uniform(T, size(forcing_spectrum)...))

Calling CUDA.randn(sz...) allocates an array, and then populates it with random numbers.

Oh yeah... that was the "allocating" version I suggested in the issue. The PR doesn't have that version, I just put it here for comparison.

But still using randn! you see there are some allocations... Those I don't understand where they come from.

julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)

Did you look into the code for randn!? You'd probably find your answer quickly.

@navidcy
Copy link
Member Author

navidcy commented Mar 11, 2024

I actually didn't :(

@navidcy
Copy link
Member Author

navidcy commented Mar 11, 2024

omg, I figured it out!

@navidcy
Copy link
Member Author

navidcy commented Mar 11, 2024

randn! calls inplace_pow2 which, if is not provided with an array of length that is a power of 2, then it creates a new array that is of size the next power of 2 --- thus, it allocates!!

https://github.com/JuliaGPU/CUDA.jl/blob/bb49887198f258ffcb186d81df4a787453428b38/lib/curand/random.jl#L83-L111

If we have arrays that have length that is a power of 2 then there is no allocations:

julia> using CUDA, Random

julia> A = CUDA.zeros(1024, 1024);

julia> @btime Random.randn!($A);
  2.417 μs (0 allocations: 0 bytes)

julia> A = CUDA.zeros(1024, 1025);

julia> @btime Random.randn!($A);
  14.119 μs (10 allocations: 352 bytes)

@navidcy
Copy link
Member Author

navidcy commented Mar 11, 2024

Did you look into the code for randn!? You'd probably find your answer quickly.

You were right. In my head this was like an impossible task but it actually took me less than 10 minutes.

@glwagner
Copy link
Member

Nice work 🕵️‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug Something isn't working 🎮 gpu
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue with stochastic forcing implementation on GPU
2 participants