
Scalar operations on GPU arrays: a potential source of slowdown #29

Closed
navidcy opened this issue Sep 2, 2019 · 11 comments

@navidcy
Member

navidcy commented Sep 2, 2019

This warning message

julia> using GeophysicalFlows
julia> gr  = TwoDGrid(GPU(), 32, 2π)
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays ~/.julia/packages/GPUArrays/J4c3Q/src/indexing.jl:16

stems from these two lines in domains.jl:
https://github.com/FourierFlows/FourierFlows.jl/blob/0c813d6b35235e5cd91364388111af562e48d992/src/domains.jl#L151, https://github.com/FourierFlows/FourierFlows.jl/blob/0c813d6b35235e5cd91364388111af562e48d992/src/domains.jl#L155

But is there any way to avoid it?
For constructing the grid it doesn't really matter because it's only done once. But many times we use, e.g., fh[1, 1]=0 to make sure that f has zero mean. See:


and in this case this may be slowing down the computations on the GPU significantly.
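(For reference, a minimal sketch of how a single deliberate scalar write can be kept while accidental ones are disallowed. This assumes a recent CUDA.jl rather than the CuArrays package current at the time of this comment:)

```julia
using CUDA

fh = CUDA.zeros(ComplexF64, 129, 256)

CUDA.allowscalar(false)          # turn accidental scalar indexing into an error
CUDA.@allowscalar fh[1, 1] = 0   # explicitly opt in for this one intentional write
```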

@navidcy
Member Author

navidcy commented Sep 2, 2019

Hm....

using CuArrays, FourierFlows, Statistics, FFTW, Random, BenchmarkTools

A, T = CuArray, Float64
Fqh = A{Complex{T}}(undef, (129, 256))

function testscalar(Fqh)
  phase = 2π*rand!(A{T}(undef, size(Fqh)))
  eta = cos.(phase) + im*sin.(phase)
  eta[1, 1] = 0
  @. Fqh = eta
end

function testnoscalar(Fqh)
  phase = 2π*rand!(A{T}(undef, size(Fqh)))
  eta = cos.(phase) + im*sin.(phase)
  @. Fqh = eta
end

function testalternativetoscalar(Fqh)
  phase = 2π*rand!(A{T}(undef, size(Fqh)))
  eta = cos.(phase) + im*sin.(phase)
  @. Fqh = eta
  Fq = irfft(Fqh, 256)
  Fq = Fq .- mean(Fq)
  Fqh = rfft(Fq)
end

julia> @btime testscalar(Fqh);
  110.507 μs (310 allocations: 13.83 KiB)

julia> @btime testnoscalar(Fqh);
  87.748 μs (307 allocations: 13.67 KiB)

julia> @btime testalternativetoscalar(Fqh);
  1.899 ms (493 allocations: 21.70 KiB)

So the scalar operation causes a ~25% slowdown... However, the alternative method I used is unfair since I didn't use FFT plans. But what other alternative do we have besides eta[1, 1] = 0?
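(For a fairer comparison, here's a sketch of the FFT-based alternative with precomputed plans, so the planning cost is paid only once outside the hot loop. `testplanned!` and its arguments are names made up for illustration; I haven't benchmarked this:)

```julia
using CUDA, FFTW, LinearAlgebra, Random, Statistics

A, T = CuArray, Float64
Fqh = A{Complex{T}}(undef, (129, 256))
Fq  = A{T}(undef, (256, 256))

# Precompute the transform plans once, outside the hot loop.
rfftplan  = plan_rfft(Fq)
irfftplan = plan_irfft(Fqh, 256)

function testplanned!(Fqh, Fq, rfftplan, irfftplan)
  phase = 2π * rand!(similar(Fq, size(Fqh)))
  @. Fqh = cos(phase) + im * sin(phase)
  mul!(Fq, irfftplan, Fqh)     # inverse transform with the cached plan
  Fq .-= mean(Fq)              # remove the mean in physical space
  mul!(Fqh, rfftplan, Fq)      # forward transform with the cached plan
  return Fqh
end
```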

@glwagner
Member

glwagner commented Sep 3, 2019

The alternative is to launch a kernel, I suppose. We would just need one thread so it'd be an extremely simple kernel. Perhaps we can write something into utils.jl that does it, like zero_zeroth_mode! or something?

@navidcy
Member Author

navidcy commented Sep 4, 2019

I don't understand what "launch a kernel" means...

@glwagner
Member

glwagner commented Sep 4, 2019

@navidcy a "kernel" is a function that, when "launched" on the GPU, executes in parallel on hundreds to thousands of GPU threads. All GPU computations are done with kernels. For example, a broadcast operation over CuArrays launches one kernel. Calling CuFFT also launches a kernel.

For the most part, we are able to use powerful abstractions for launching kernels in FourierFlows, which has the benefit of being easy to program and also permitting code that runs on both CPUs and GPUs.

However, there seem to be some small number of tasks that will require us to actually write the kernel functions ourselves (rather than using broadcasting or FFTs).

See the CUDAnative documentation --- the macro @cuda is used to launch a kernel:

https://juliagpu.github.io/CUDAnative.jl/stable/man/usage.html

We can also use GPUifyLoops to specify kernel functions that work on both CPU and GPU, though I don't think we will need to do that. Instead, we will define a high-level function like zero_zeroth_mode! which has methods for both CPU arrays (the easy case) and CuArrays (the case that requires writing a simple GPU kernel).

@glwagner
Member

glwagner commented Sep 4, 2019

So I think we need something ultra-simple like

zero_zeroth_mode!(a) = a[1] = 0

@hascuda function zero_zeroth_mode!(a::CuArray)
    @cuda threads=1 _zero_zeroth_mode!(a)
    return nothing
end

function _zero_zeroth_mode!(a)
    a[1] = 0
    return nothing
end

Will have to see if that works (I'm not sure...); we might actually need

function _zero_zeroth_mode!(a)
    i = threadIdx().x
    a[i] = 0
    return nothing
end

let's test it out and see.
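(One possible kernel-free alternative, assuming `fill!` on a one-element CuArray view dispatches to a GPU kernel rather than falling back to scalar indexing — worth verifying before relying on it:)

```julia
using CUDA

a = CUDA.rand(ComplexF64, 129, 256)

# fill! on a one-element view should launch a kernel under the hood,
# avoiding the scalar-indexing path entirely.
@views fill!(a[1:1, 1:1], 0)
```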

@glwagner
Member

On this issue --- it seems that for very small scalar operations such as eta[1, 1] = 0, a scalar operation may actually be the fastest method. There is another quite low-level CUDA function that could be used as an alternative, but I doubt that it's worth the effort. The cost of this single scalar operation should be fairly minuscule.

@navidcy
Member Author

navidcy commented Oct 23, 2019

Sure. For this case it may be minuscule. But if such an operation occurs every time step then it might be an issue?

Perhaps we should close the issue then...

@glwagner
Member

By minuscule, I mean actually minuscule compared to something like an FFT, which also occurs every time step. Thus the net effect would be in the noise.

The way to test is just to benchmark with and without this operation (even though it is not physically correct to omit, this still tests performance). I'd be curious to see if there's any impact.
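(A sketch of such a benchmark, assuming CUDA.jl. `step_with!`/`step_without!` are made-up stand-ins for one time step's work; `CUDA.@sync` makes sure the GPU work is actually included in the timing:)

```julia
using CUDA, BenchmarkTools

a = CUDA.rand(ComplexF64, 129, 256)

# Stand-in for a time step, with and without the single scalar write.
step_with!(a)    = (a .*= 2; CUDA.@allowscalar a[1, 1] = 0; nothing)
step_without!(a) = (a .*= 2; nothing)

@btime CUDA.@sync step_with!($a)
@btime CUDA.@sync step_without!($a)
```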

@navidcy
Member Author

navidcy commented Aug 23, 2020

Here is a good solution for scalar operations: CliMA/Oceananigans.jl#851

@glwagner
Member

This will prevent unintended invocation of scalar operations, since using scalar operations requires the prefix CUDA.@allowscalar.

The above discussion holds however. It is not always desirable to eliminate scalar operations.

@navidcy
Member Author

navidcy commented Nov 27, 2020

Modules now include CUDA.@allowscalar; I'm closing this.

@navidcy navidcy closed this as completed Nov 27, 2020