Scalar operations on GPU arrays: a potential source of slowdown #29
Hm....

```julia
using CuArrays, FourierFlows, Statistics, FFTW, Random, BenchmarkTools

A, T = CuArray, Float64
Fqh = A{Complex{T}}(undef, (129, 256))

function testscalar(Fqh)
    phase = 2π * rand!(A{T}(undef, size(Fqh)))
    eta = cos.(phase) + im * sin.(phase)
    eta[1, 1] = 0
    @. Fqh = eta
end

function testnoscalar(Fqh)
    phase = 2π * rand!(A{T}(undef, size(Fqh)))
    eta = cos.(phase) + im * sin.(phase)
    @. Fqh = eta
end

function testalternativetoscalar(Fqh)
    phase = 2π * rand!(A{T}(undef, size(Fqh)))
    eta = cos.(phase) + im * sin.(phase)
    @. Fqh = eta
    Fq = irfft(Fqh, 256)
    Fq = Fq .- mean(Fq)
    Fqh = rfft(Fq)
end
```

```julia
julia> @btime testscalar(Fqh);
  110.507 μs (310 allocations: 13.83 KiB)

julia> @btime testnoscalar(Fqh);
  87.748 μs (307 allocations: 13.67 KiB)

julia> @btime testalternativetoscalar(Fqh);
  1.899 ms (493 allocations: 21.70 KiB)
```

So the scalar operation causes a ~25% slowdown... However, the alternative method I used is unfair, since I didn't use FFT plans. But what other alternative do we have other than `eta[1, 1] = 0`?
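For a fairer comparison, the transforms in `testalternativetoscalar` could reuse precomputed plans instead of calling `rfft`/`irfft` directly. A rough, untested sketch (assuming `plan_rfft`/`plan_irfft` dispatch to CUFFT for `CuArray`s; the function name `testplannedalternative` is just illustrative):

```julia
using CuArrays, FFTW, Statistics, Random

A, T = CuArray, Float64
n = 256
Fqh = A{Complex{T}}(undef, (129, n))

# Build the plans once, outside the hot loop.
rfftplan  = plan_rfft(A{T}(undef, (n, n)))
irfftplan = plan_irfft(Fqh, n)

function testplannedalternative(Fqh, rfftplan, irfftplan)
    phase = 2π * rand!(A{T}(undef, size(Fqh)))
    @. Fqh = cos(phase) + im * sin(phase)
    Fq = irfftplan * Fqh     # inverse transform with the precomputed plan
    Fq = Fq .- mean(Fq)      # remove the mean entirely on the GPU
    Fqh .= rfftplan * Fq     # forward transform back to spectral space
    return Fqh
end
```

This at least removes the plan-creation cost from every call, so whatever gap remains against `testnoscalar` is the real price of the mean-removal round trip.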
The alternative is to launch a kernel, I suppose. We would just need one thread, so it'd be an extremely simple kernel. Perhaps we can write something into
I don't understand what "launch a kernel" means...
@navidcy a "kernel" is a function that, when "launched" on the GPU, executes in parallel on hundreds to thousands of GPU threads. All GPU computations are done with kernels; for example, a broadcast operation over a `CuArray` is compiled into and executed as a kernel.

For the most part, we are able to use powerful abstractions for launching kernels in FourierFlows, which has the benefit of being easy to program while also permitting code that runs on both CPUs and GPUs. However, there seem to be a small number of tasks that will require us to actually write the kernel functions ourselves (rather than using broadcasting or FFTs). See the CUDAnative.jl documentation: https://juliagpu.github.io/CUDAnative.jl/stable/man/usage.html

We can also use
So I think we need something ultra-simple. Instead of

```julia
zero_zeroth_mode!(a) = a[1] = 0
```

we need

```julia
@hascuda function zero_zeroth_mode!(a::CuArray)
    @cuda threads=1 _zero_zeroth_mode!(a)
    return nothing
end

function _zero_zeroth_mode!(a)
    a[1] = 0
    return nothing
end
```

Will have to see if that works (I'm not sure...); we might actually need

```julia
function _zero_zeroth_mode!(a)
    i = threadIdx().x
    a[i] = 0
    return nothing
end
```

Let's test it out and see.
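If that works, a quick sanity check might look like the following (a sketch, assuming CUDAnative is installed and a GPU is available):

```julia
using CuArrays, CUDAnative

function _zero_zeroth_mode!(a)
    a[1] = 0
    return nothing
end

a = CuArray(ones(Float64, 4, 4))
@cuda threads=1 _zero_zeroth_mode!(a)  # launch on a single GPU thread
Array(a)[1, 1]  # copy back to the CPU to inspect; should now be 0.0
```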
On this issue --- it seems that for very small scalar operations such as `fh[1, 1] = 0` the cost may be negligible.
Sure. For this case it may be minuscule. But if such an operation occurs every time step, then might it be an issue? Perhaps we should close the issue then...
By minuscule, I mean actually minuscule compared to something like an FFT, which also occurs every time step; thus the net effect would be in the noise. The way to test is just to benchmark with and without this operation (even though it is not physically correct to omit it, this still tests performance). I'd be curious to see if there's any impact.
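That comparison could reuse `testscalar` and `testnoscalar` from the first comment (a sketch; the `CuArrays.@sync` makes `@btime` wait for the GPU, so the kernel time is actually measured rather than just the launch overhead):

```julia
using CuArrays, BenchmarkTools

# With the scalar operation, as in testscalar above:
@btime CuArrays.@sync testscalar($Fqh);

# Without it, as in testnoscalar above:
@btime CuArrays.@sync testnoscalar($Fqh);
```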
Here is a good solution for scalar operations: CliMA/Oceananigans.jl#851 |
This will prevent unintended invocation of scalar operations, since using them requires the `@allowscalar` prefix. The above discussion still holds, however: it is not always desirable to eliminate scalar operations.
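For example (a sketch; `allowscalar` and the `@allowscalar` macro come from GPUArrays and, as far as I can tell, are re-exported by CuArrays):

```julia
using CuArrays

CuArrays.allowscalar(false)  # any implicit scalar indexing now throws an error

fh = CuArrays.zeros(Complex{Float64}, 129, 256)
# fh[1, 1] = 0               # would error with scalar indexing disallowed

CuArrays.@allowscalar fh[1, 1] = 0  # explicit opt-in for this one operation
```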
Modules now include
This error message stems from these two lines in `domains.jl`:

https://github.com/FourierFlows/FourierFlows.jl/blob/0c813d6b35235e5cd91364388111af562e48d992/src/domains.jl#L151
https://github.com/FourierFlows/FourierFlows.jl/blob/0c813d6b35235e5cd91364388111af562e48d992/src/domains.jl#L155

But is there any way to avoid it?

For constructing the grid it doesn't really matter, because it's only done once. But many times we use, e.g.,

```julia
fh[1, 1] = 0
```

to make sure that `f` has zero mean (see `GeophysicalFlows.jl/examples/twodturb_stochasticforcing.jl`, line 36 in bf6dc6b), and in this case this may be slowing down the computations on the GPU significantly.