Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes CPU slowdown with KernelAbstractions >= v0.8 #3026

Merged
merged 7 commits into from
Mar 24, 2023

Conversation

simone-silvestri
Copy link
Collaborator

@simone-silvestri simone-silvestri commented Mar 23, 2023

Apparently, the issue with #2996 is caused by the fact that slurping inside a kernel signature
i.e:

@kernel function test_func(r, c, args...)
      i, j, k = @index(Global, NTuple)
      r[i, j, k] = whatever_inlined_function(i, j, k, c, args...)
end

This can be fixed by passing the size of the Vararg tuple
like this

@kernel function test_func(r, c, args::Vararg{T, N}) where {T, N}
      i, j, k = @index(Global, NTuple)
       r[i, j, k] = whatever_inlined_function(i, j, k, c, args...)
end

Edit, this approach only works if all Varargs are of the same type T.

Edit, Edit, with @glwagner we decided to pass all additional arguments to the kernel as a tuple, so

@kernel function test_func(r, c, args)
      i, j, k = @index(Global, NTuple)
      r[i, j, k] = whatever_inlined_function(i, j, k, c, args...)
end

which means that from now on, when calling a kernel

caller_func(arch, grid, r, c, args...) = launch!(arch, grid, worksize, test_func, r, c, args...)

will become

caller_func(arch, grid, r, c, args...) = launch!(arch, grid, worksize, test_func, r, c, Tuple(args))

Resolves #2996

@navidcy navidcy changed the title Fixes to #2996 Fixes CPU slowdown with KernelAbstractions >= v0.8 Mar 23, 2023
@navidcy navidcy added the performance 🏍️ So we can get the wrong answer even faster label Mar 23, 2023
navidcy referenced this pull request Mar 23, 2023
…#2) (#2990)

* ZDirection -> NegativeZDirection (doctests pass locally atm)

* @info -> @warn

* comment parts of the documentation

* uncomment tilted bbl example (docs still build locally w this change)

* bring back other doc pages (still builds)

* full docs

* adds review suggestions

* bugfix

* better unit-vector validation + fix coriolis tests

---------

Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

@simone-silvestri is this done? shall I test it on my laptop?

@simone-silvestri
Copy link
Collaborator Author

Sure give it a try. We can merge when tests pass

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

Well if the docs take 6hrs then it's a sign that the problem persists...?

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Mar 24, 2023

they seem to be finishing (~ 3hrs maybe). We should definitely separate the docs testing for when the PR is ready to merge, 3 hrs is a very long time for little returns

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

using Oceananigans main

julia> include("baroclinic_adjustment.jl")
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 10.162 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (5.052 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (17.311 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.900 minutes, max(u): (6.955e-01, 3.312e-01, 3.390e-03) m/s, next Δt: 8.858 minutes
[03.03%] i: 200, t: 1.213 days, wall time: 1.542 minutes, max(u): (6.076e-01, 3.510e-01, 2.477e-03) m/s, next Δt: 14.266 minutes
[05.99%] i: 300, t: 2.396 days, wall time: 3.374 minutes, max(u): (5.579e-01, 2.912e-01, 2.036e-03) m/s, next Δt: 20 minutes
[09.44%] i: 400, t: 3.778 days, wall time: 1.572 minutes, max(u): (4.628e-01, 2.791e-01, 1.828e-03) m/s, next Δt: 20 minutes
[12.92%] i: 500, t: 5.167 days, wall time: 1.532 minutes, max(u): (4.567e-01, 2.743e-01, 1.750e-03) m/s, next Δt: 20 minutes
[16.39%] i: 600, t: 6.556 days, wall time: 1.521 minutes, max(u): (4.579e-01, 3.130e-01, 1.805e-03) m/s, next Δt: 20 minutes

Using this PR

julia> include("baroclinic_adjustment.jl")
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 12.345 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (5.941 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (17.191 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.286 minutes, max(u): (6.958e-01, 3.309e-01, 3.365e-03) m/s, next Δt: 8.858 minutes
[03.03%] i: 200, t: 1.213 days, wall time: 55.848 seconds, max(u): (6.075e-01, 3.512e-01, 2.492e-03) m/s, next Δt: 14.266 minutes
[05.99%] i: 300, t: 2.396 days, wall time: 58.502 seconds, max(u): (5.586e-01, 2.880e-01, 2.085e-03) m/s, next Δt: 20 minutes
[09.44%] i: 400, t: 3.778 days, wall time: 57.702 seconds, max(u): (4.747e-01, 2.683e-01, 1.776e-03) m/s, next Δt: 20 minutes
[12.92%] i: 500, t: 5.167 days, wall time: 57.206 seconds, max(u): (4.541e-01, 2.486e-01, 1.641e-03) m/s, next Δt: 20 minutes

using Oceananigans v0.79.2 (with KernelAbstractions v0.7.2)

julia> include("baroclinic_adjustment.jl")
[ Info: Oceananigans will use 6 threads
[ Info: Precompiling CairoMakie [13f3f980-e62b-5c42-98c6-ff1f3baf88f0]
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 11.628 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (6.329 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (35.780 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.369 minutes, max(u): (6.973e-01, 3.325e-01, 3.378e-03) m/s, next Δt: 8.858 minutes
[03.03%] i: 200, t: 1.213 days, wall time: 43.693 seconds, max(u): (6.093e-01, 3.516e-01, 2.474e-03) m/s, next Δt: 14.266 minutes
[05.99%] i: 300, t: 2.396 days, wall time: 42.554 seconds, max(u): (5.587e-01, 2.900e-01, 2.016e-03) m/s, next Δt: 20 minutes
[09.44%] i: 400, t: 3.778 days, wall time: 42.783 seconds, max(u): (4.648e-01, 2.767e-01, 1.781e-03) m/s, next Δt: 20 minutes

@glwagner
Copy link
Member

@navidcy how about a Nonhydrostatic example?

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

using Oceananigans main

julia> include("ocean_wind_mixing_and_convection.jl")
[ Info: Initializing simulation...
Iteration: 0000, time: 0 seconds, Δt: 11 seconds, max(|w|) = 1.1e-05 ms⁻¹, wall time: 0 seconds
[ Info:     ... simulation initialization complete (2.049 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (16.222 seconds).
Iteration: 0020, time: 3.403 minutes, Δt: 13.310 seconds, max(|w|) = 8.8e-06 ms⁻¹, wall time: 23.980 seconds
Iteration: 0040, time: 7.488 minutes, Δt: 16.105 seconds, max(|w|) = 5.5e-06 ms⁻¹, wall time: 30.307 seconds
Iteration: 0060, time: 12 minutes, Δt: 10.350 seconds, max(|w|) = 8.9e-06 ms⁻¹, wall time: 36.549 seconds
Iteration: 0080, time: 15 minutes, Δt: 8.284 seconds, max(|w|) = 4.4e-05 ms⁻¹, wall time: 42.839 seconds
Iteration: 0100, time: 17.509 minutes, Δt: 7.101 seconds, max(|w|) = 1.2e-04 ms⁻¹, wall time: 49.022 seconds
Iteration: 0120, time: 19.561 minutes, Δt: 6.358 seconds, max(|w|) = 5.5e-04 ms⁻¹, wall time: 55.315 seconds
Iteration: 0140, time: 21.505 minutes, Δt: 5.772 seconds, max(|w|) = 2.5e-03 ms⁻¹, wall time: 1.026 minutes
Iteration: 0160, time: 23.277 minutes, Δt: 5.356 seconds, max(|w|) = 1.0e-02 ms⁻¹, wall time: 1.129 minutes
Iteration: 0180, time: 24.961 minutes, Δt: 5.303 seconds, max(|w|) = 2.6e-02 ms⁻¹, wall time: 1.232 minutes
Iteration: 0200, time: 26.626 minutes, Δt: 5.806 seconds, max(|w|) = 6.5e-02 ms⁻¹, wall time: 1.337 minutes
Iteration: 0220, time: 28.639 minutes, Δt: 6.807 seconds, max(|w|) = 1.0e-01 ms⁻¹, wall time: 1.440 minutes
Iteration: 0240, time: 30.873 minutes, Δt: 8.041 seconds, max(|w|) = 9.1e-02 ms⁻¹, wall time: 1.544 minutes
Iteration: 0260, time: 33.412 minutes, Δt: 8.474 seconds, max(|w|) = 7.0e-02 ms⁻¹, wall time: 1.652 minutes
Iteration: 0280, time: 35.873 minutes, Δt: 8.997 seconds, max(|w|) = 5.9e-02 ms⁻¹, wall time: 1.761 minutes
Iteration: 0300, time: 38.766 minutes, Δt: 9.388 seconds, max(|w|) = 6.3e-02 ms⁻¹, wall time: 1.866 minutes
[ Info: Simulation is stopping after running for 2.052 minutes.
[ Info: Simulation time 40 minutes equals or exceeds stop time 40 minutes.

Using this PR

julia> include("ocean_wind_mixing_and_convection.jl")
[ Info: Initializing simulation...
Iteration: 0000, time: 0 seconds, Δt: 11 seconds, max(|w|) = 1.0e-05 ms⁻¹, wall time: 0 seconds
[ Info:     ... simulation initialization complete (2.042 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (15.933 seconds).
Iteration: 0020, time: 3.403 minutes, Δt: 13.310 seconds, max(|w|) = 8.0e-06 ms⁻¹, wall time: 18.398 seconds
Iteration: 0040, time: 7.488 minutes, Δt: 16.105 seconds, max(|w|) = 6.0e-06 ms⁻¹, wall time: 18.752 seconds
Iteration: 0060, time: 12 minutes, Δt: 10.350 seconds, max(|w|) = 1.2e-05 ms⁻¹, wall time: 19.105 seconds
Iteration: 0080, time: 15 minutes, Δt: 8.284 seconds, max(|w|) = 4.8e-05 ms⁻¹, wall time: 19.541 seconds
Iteration: 0100, time: 17.509 minutes, Δt: 7.101 seconds, max(|w|) = 1.3e-04 ms⁻¹, wall time: 19.889 seconds
Iteration: 0120, time: 19.557 minutes, Δt: 6.359 seconds, max(|w|) = 6.3e-04 ms⁻¹, wall time: 20.244 seconds
Iteration: 0140, time: 21.504 minutes, Δt: 5.734 seconds, max(|w|) = 3.8e-03 ms⁻¹, wall time: 20.680 seconds
Iteration: 0160, time: 23.276 minutes, Δt: 5.325 seconds, max(|w|) = 1.4e-02 ms⁻¹, wall time: 21.026 seconds
Iteration: 0180, time: 24.959 minutes, Δt: 5.372 seconds, max(|w|) = 3.9e-02 ms⁻¹, wall time: 21.370 seconds
Iteration: 0200, time: 26.677 minutes, Δt: 6.354 seconds, max(|w|) = 7.1e-02 ms⁻¹, wall time: 21.800 seconds
Iteration: 0220, time: 28.672 minutes, Δt: 7.391 seconds, max(|w|) = 8.6e-02 ms⁻¹, wall time: 22.149 seconds
Iteration: 0240, time: 31 minutes, Δt: 8.598 seconds, max(|w|) = 8.4e-02 ms⁻¹, wall time: 22.498 seconds
Iteration: 0260, time: 33.878 minutes, Δt: 9.101 seconds, max(|w|) = 6.1e-02 ms⁻¹, wall time: 22.933 seconds
Iteration: 0280, time: 36.770 minutes, Δt: 8.891 seconds, max(|w|) = 6.2e-02 ms⁻¹, wall time: 23.287 seconds
Iteration: 0300, time: 39.637 minutes, Δt: 9.291 seconds, max(|w|) = 4.9e-02 ms⁻¹, wall time: 23.640 seconds
[ Info: Simulation is stopping after running for 23.692 seconds.
[ Info: Simulation time 40 minutes equals or exceeds stop time 40 minutes.

using Oceananigans v0.79.2 (with KernelAbstractions v0.7.2)

julia> include("ocean_wind_mixing_and_convection.jl")
[ Info: Precompiling CairoMakie [13f3f980-e62b-5c42-98c6-ff1f3baf88f0]
[ Info: Oceananigans will use 6 threads
[ Info: Initializing simulation...
Iteration: 0000, time: 0 seconds, Δt: 11 seconds, max(|w|) = 1.0e-05 ms⁻¹, wall time: 0 seconds
[ Info:     ... simulation initialization complete (5.240 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (46.825 seconds).
Iteration: 0020, time: 3.403 minutes, Δt: 13.310 seconds, max(|w|) = 8.4e-06 ms⁻¹, wall time: 52.487 seconds
Iteration: 0040, time: 7.488 minutes, Δt: 16.105 seconds, max(|w|) = 5.8e-06 ms⁻¹, wall time: 52.987 seconds
Iteration: 0060, time: 12 minutes, Δt: 10.350 seconds, max(|w|) = 1.6e-05 ms⁻¹, wall time: 53.334 seconds
Iteration: 0080, time: 15 minutes, Δt: 8.284 seconds, max(|w|) = 3.8e-05 ms⁻¹, wall time: 53.685 seconds
Iteration: 0100, time: 17.509 minutes, Δt: 7.100 seconds, max(|w|) = 1.7e-04 ms⁻¹, wall time: 54.035 seconds
Iteration: 0120, time: 19.669 minutes, Δt: 6.321 seconds, max(|w|) = 9.0e-04 ms⁻¹, wall time: 54.549 seconds
Iteration: 0140, time: 21.602 minutes, Δt: 5.748 seconds, max(|w|) = 4.9e-03 ms⁻¹, wall time: 54.895 seconds
Iteration: 0160, time: 23.370 minutes, Δt: 5.313 seconds, max(|w|) = 1.7e-02 ms⁻¹, wall time: 55.244 seconds
Iteration: 0180, time: 25 minutes, Δt: 5.172 seconds, max(|w|) = 3.6e-02 ms⁻¹, wall time: 55.592 seconds
Iteration: 0200, time: 26.737 minutes, Δt: 5.996 seconds, max(|w|) = 7.7e-02 ms⁻¹, wall time: 56.116 seconds
Iteration: 0220, time: 28.705 minutes, Δt: 6.649 seconds, max(|w|) = 8.4e-02 ms⁻¹, wall time: 56.467 seconds
Iteration: 0240, time: 30.929 minutes, Δt: 7.464 seconds, max(|w|) = 7.1e-02 ms⁻¹, wall time: 56.822 seconds
Iteration: 0260, time: 33.261 minutes, Δt: 8.024 seconds, max(|w|) = 7.9e-02 ms⁻¹, wall time: 57.178 seconds
Iteration: 0280, time: 35.866 minutes, Δt: 8.694 seconds, max(|w|) = 7.1e-02 ms⁻¹, wall time: 57.860 seconds
Iteration: 0300, time: 38.732 minutes, Δt: 8.671 seconds, max(|w|) = 5.8e-02 ms⁻¹, wall time: 58.212 seconds
[ Info: Simulation is stopping after running for 58.372 seconds.
[ Info: Simulation time 40 minutes equals or exceeds stop time 40 minutes.

all seem the same??

@glwagner
Copy link
Member

They don't seem the same at all --- this PR is way faster? It'd be easier if we printed the differential wall time tho

@glwagner
Copy link
Member

I think there's some stuff still to do for the hydrostatic model on this PR but not sure what

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

can we please wait for #2991 ?

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

I've been struggling to merge that for days.....

@simone-silvestri
Copy link
Collaborator Author

hmmm weird that non-hydrostatic is way faster and hydrostatic is similar/slower than 0.79.2

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

Damn, I'm silly!! I was looking at Δt which IS the same!

(I was looking at the place where the relevant info for the other example was being printed...)

@glwagner
Copy link
Member

glwagner commented Mar 24, 2023

I'm cautiously optimistic that we'll get a big speed up... just have to figure out the final touches for hydrostatic model

Edit: looking more closely the different is compile time, not run time. So we shouldn't expect speed up for run time, but maybe we will see something for compile time.

@glwagner
Copy link
Member

Here's a simple hydrostatic model benchmark

using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1), topology=(Periodic, Periodic, Bounded))
model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())
ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)

Results

10.220 s (85845109 allocations: 37.94 GiB) # this PR
6.284 s (66184308 allocations: 16.31 GiB) # this PR with KA downgraded to 0.7.2

So we are close but not there yet.

What about solvers ... ?

@glwagner
Copy link
Member

glwagner commented Mar 24, 2023

hmmm weird that non-hydrostatic is way faster and hydrostatic is similar/slower than 0.79.2

Actually looking more closely at the results, it's only the compile time that is improved. The run time is the same for both. Nevertheless, compile time dropping by 30s is a big improvement and should help docs a lot.

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

shall we merge? or investigate more?

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

I'll merge! And we can investigate more if we think it's needed?

@navidcy navidcy merged commit b37c00f into main Mar 24, 2023
@navidcy navidcy deleted the ss/fixes_performance_issues branch March 24, 2023 03:06
@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

Damn sorry! I didn't see all these comments @glwagner

@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

Let me reopen the issue

@navidcy navidcy restored the ss/fixes_performance_issues branch March 24, 2023 03:11
navidcy added a commit that referenced this pull request Mar 24, 2023
@navidcy
Copy link
Collaborator

navidcy commented Mar 24, 2023

We can undo the merging by git reset --hard ae12376d1e31128e2e17e6b9447e75d37510e550?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance 🏍️ So we can get the wrong answer even faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?)
3 participants