-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes CPU slowdown with KernelAbstractions >= v0.8 #3026
Conversation
…nigans.jl into ss/fixes_performance_issues
…#2) (#2990) * ZDirection -> NegativeZDirection (doctests pass locally atm) * @info -> @warn * comment parts of the documentation * uncomment tilted bbl example (docs still build locally w this change) * bring back other doc pages (still builds) * full docs * adds review suggestions * bugfix * better unit-vector validation + fix coriolis tests --------- Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@simone-silvestri is this done? shall I test it on my laptop? |
Sure give it a try. We can merge when tests pass |
Well if the docs take 6hrs then it's a sign that the problem persists...? |
they seem to be finishing (~ 3hrs maybe). We should definitely separate the docs testing for when the PR is ready to merge, 3 hrs is a very long time for little returns |
using Oceananigans mainjulia> include("baroclinic_adjustment.jl")
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 10.162 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info: ... simulation initialization complete (5.052 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (17.311 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.900 minutes, max(u): (6.955e-01, 3.312e-01, 3.390e-03) m/s, next Δt: 8.858 minutes
[03.03%] i: 200, t: 1.213 days, wall time: 1.542 minutes, max(u): (6.076e-01, 3.510e-01, 2.477e-03) m/s, next Δt: 14.266 minutes
[05.99%] i: 300, t: 2.396 days, wall time: 3.374 minutes, max(u): (5.579e-01, 2.912e-01, 2.036e-03) m/s, next Δt: 20 minutes
[09.44%] i: 400, t: 3.778 days, wall time: 1.572 minutes, max(u): (4.628e-01, 2.791e-01, 1.828e-03) m/s, next Δt: 20 minutes
[12.92%] i: 500, t: 5.167 days, wall time: 1.532 minutes, max(u): (4.567e-01, 2.743e-01, 1.750e-03) m/s, next Δt: 20 minutes
[16.39%] i: 600, t: 6.556 days, wall time: 1.521 minutes, max(u): (4.579e-01, 3.130e-01, 1.805e-03) m/s, next Δt: 20 minutes Using this PRjulia> include("baroclinic_adjustment.jl")
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 12.345 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info: ... simulation initialization complete (5.941 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (17.191 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.286 minutes, max(u): (6.958e-01, 3.309e-01, 3.365e-03) m/s, next Δt: 8.858 minutes
[03.03%] i: 200, t: 1.213 days, wall time: 55.848 seconds, max(u): (6.075e-01, 3.512e-01, 2.492e-03) m/s, next Δt: 14.266 minutes
[05.99%] i: 300, t: 2.396 days, wall time: 58.502 seconds, max(u): (5.586e-01, 2.880e-01, 2.085e-03) m/s, next Δt: 20 minutes
[09.44%] i: 400, t: 3.778 days, wall time: 57.702 seconds, max(u): (4.747e-01, 2.683e-01, 1.776e-03) m/s, next Δt: 20 minutes
[12.92%] i: 500, t: 5.167 days, wall time: 57.206 seconds, max(u): (4.541e-01, 2.486e-01, 1.641e-03) m/s, next Δt: 20 minutes using Oceananigans v0.79.2 (with KernelAbstractions v0.7.2)julia> include("baroclinic_adjustment.jl")
[ Info: Oceananigans will use 6 threads
[ Info: Precompiling CairoMakie [13f3f980-e62b-5c42-98c6-ff1f3baf88f0]
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 11.628 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info: ... simulation initialization complete (6.329 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (35.780 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.369 minutes, max(u): (6.973e-01, 3.325e-01, 3.378e-03) m/s, next Δt: 8.858 minutes
[03.03%] i: 200, t: 1.213 days, wall time: 43.693 seconds, max(u): (6.093e-01, 3.516e-01, 2.474e-03) m/s, next Δt: 14.266 minutes
[05.99%] i: 300, t: 2.396 days, wall time: 42.554 seconds, max(u): (5.587e-01, 2.900e-01, 2.016e-03) m/s, next Δt: 20 minutes
[09.44%] i: 400, t: 3.778 days, wall time: 42.783 seconds, max(u): (4.648e-01, 2.767e-01, 1.781e-03) m/s, next Δt: 20 minutes |
@navidcy how about a Nonhydrostatic example? |
using Oceananigans mainjulia> include("ocean_wind_mixing_and_convection.jl")
[ Info: Initializing simulation...
Iteration: 0000, time: 0 seconds, Δt: 11 seconds, max(|w|) = 1.1e-05 ms⁻¹, wall time: 0 seconds
[ Info: ... simulation initialization complete (2.049 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (16.222 seconds).
Iteration: 0020, time: 3.403 minutes, Δt: 13.310 seconds, max(|w|) = 8.8e-06 ms⁻¹, wall time: 23.980 seconds
Iteration: 0040, time: 7.488 minutes, Δt: 16.105 seconds, max(|w|) = 5.5e-06 ms⁻¹, wall time: 30.307 seconds
Iteration: 0060, time: 12 minutes, Δt: 10.350 seconds, max(|w|) = 8.9e-06 ms⁻¹, wall time: 36.549 seconds
Iteration: 0080, time: 15 minutes, Δt: 8.284 seconds, max(|w|) = 4.4e-05 ms⁻¹, wall time: 42.839 seconds
Iteration: 0100, time: 17.509 minutes, Δt: 7.101 seconds, max(|w|) = 1.2e-04 ms⁻¹, wall time: 49.022 seconds
Iteration: 0120, time: 19.561 minutes, Δt: 6.358 seconds, max(|w|) = 5.5e-04 ms⁻¹, wall time: 55.315 seconds
Iteration: 0140, time: 21.505 minutes, Δt: 5.772 seconds, max(|w|) = 2.5e-03 ms⁻¹, wall time: 1.026 minutes
Iteration: 0160, time: 23.277 minutes, Δt: 5.356 seconds, max(|w|) = 1.0e-02 ms⁻¹, wall time: 1.129 minutes
Iteration: 0180, time: 24.961 minutes, Δt: 5.303 seconds, max(|w|) = 2.6e-02 ms⁻¹, wall time: 1.232 minutes
Iteration: 0200, time: 26.626 minutes, Δt: 5.806 seconds, max(|w|) = 6.5e-02 ms⁻¹, wall time: 1.337 minutes
Iteration: 0220, time: 28.639 minutes, Δt: 6.807 seconds, max(|w|) = 1.0e-01 ms⁻¹, wall time: 1.440 minutes
Iteration: 0240, time: 30.873 minutes, Δt: 8.041 seconds, max(|w|) = 9.1e-02 ms⁻¹, wall time: 1.544 minutes
Iteration: 0260, time: 33.412 minutes, Δt: 8.474 seconds, max(|w|) = 7.0e-02 ms⁻¹, wall time: 1.652 minutes
Iteration: 0280, time: 35.873 minutes, Δt: 8.997 seconds, max(|w|) = 5.9e-02 ms⁻¹, wall time: 1.761 minutes
Iteration: 0300, time: 38.766 minutes, Δt: 9.388 seconds, max(|w|) = 6.3e-02 ms⁻¹, wall time: 1.866 minutes
[ Info: Simulation is stopping after running for 2.052 minutes.
[ Info: Simulation time 40 minutes equals or exceeds stop time 40 minutes. Using this PRjulia> include("ocean_wind_mixing_and_convection.jl")
[ Info: Initializing simulation...
Iteration: 0000, time: 0 seconds, Δt: 11 seconds, max(|w|) = 1.0e-05 ms⁻¹, wall time: 0 seconds
[ Info: ... simulation initialization complete (2.042 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (15.933 seconds).
Iteration: 0020, time: 3.403 minutes, Δt: 13.310 seconds, max(|w|) = 8.0e-06 ms⁻¹, wall time: 18.398 seconds
Iteration: 0040, time: 7.488 minutes, Δt: 16.105 seconds, max(|w|) = 6.0e-06 ms⁻¹, wall time: 18.752 seconds
Iteration: 0060, time: 12 minutes, Δt: 10.350 seconds, max(|w|) = 1.2e-05 ms⁻¹, wall time: 19.105 seconds
Iteration: 0080, time: 15 minutes, Δt: 8.284 seconds, max(|w|) = 4.8e-05 ms⁻¹, wall time: 19.541 seconds
Iteration: 0100, time: 17.509 minutes, Δt: 7.101 seconds, max(|w|) = 1.3e-04 ms⁻¹, wall time: 19.889 seconds
Iteration: 0120, time: 19.557 minutes, Δt: 6.359 seconds, max(|w|) = 6.3e-04 ms⁻¹, wall time: 20.244 seconds
Iteration: 0140, time: 21.504 minutes, Δt: 5.734 seconds, max(|w|) = 3.8e-03 ms⁻¹, wall time: 20.680 seconds
Iteration: 0160, time: 23.276 minutes, Δt: 5.325 seconds, max(|w|) = 1.4e-02 ms⁻¹, wall time: 21.026 seconds
Iteration: 0180, time: 24.959 minutes, Δt: 5.372 seconds, max(|w|) = 3.9e-02 ms⁻¹, wall time: 21.370 seconds
Iteration: 0200, time: 26.677 minutes, Δt: 6.354 seconds, max(|w|) = 7.1e-02 ms⁻¹, wall time: 21.800 seconds
Iteration: 0220, time: 28.672 minutes, Δt: 7.391 seconds, max(|w|) = 8.6e-02 ms⁻¹, wall time: 22.149 seconds
Iteration: 0240, time: 31 minutes, Δt: 8.598 seconds, max(|w|) = 8.4e-02 ms⁻¹, wall time: 22.498 seconds
Iteration: 0260, time: 33.878 minutes, Δt: 9.101 seconds, max(|w|) = 6.1e-02 ms⁻¹, wall time: 22.933 seconds
Iteration: 0280, time: 36.770 minutes, Δt: 8.891 seconds, max(|w|) = 6.2e-02 ms⁻¹, wall time: 23.287 seconds
Iteration: 0300, time: 39.637 minutes, Δt: 9.291 seconds, max(|w|) = 4.9e-02 ms⁻¹, wall time: 23.640 seconds
[ Info: Simulation is stopping after running for 23.692 seconds.
[ Info: Simulation time 40 minutes equals or exceeds stop time 40 minutes. using Oceananigans v0.79.2 (with KernelAbstractions v0.7.2)julia> include("ocean_wind_mixing_and_convection.jl")
[ Info: Precompiling CairoMakie [13f3f980-e62b-5c42-98c6-ff1f3baf88f0]
[ Info: Oceananigans will use 6 threads
[ Info: Initializing simulation...
Iteration: 0000, time: 0 seconds, Δt: 11 seconds, max(|w|) = 1.0e-05 ms⁻¹, wall time: 0 seconds
[ Info: ... simulation initialization complete (5.240 seconds)
[ Info: Executing initial time step...
[ Info: ... initial time step complete (46.825 seconds).
Iteration: 0020, time: 3.403 minutes, Δt: 13.310 seconds, max(|w|) = 8.4e-06 ms⁻¹, wall time: 52.487 seconds
Iteration: 0040, time: 7.488 minutes, Δt: 16.105 seconds, max(|w|) = 5.8e-06 ms⁻¹, wall time: 52.987 seconds
Iteration: 0060, time: 12 minutes, Δt: 10.350 seconds, max(|w|) = 1.6e-05 ms⁻¹, wall time: 53.334 seconds
Iteration: 0080, time: 15 minutes, Δt: 8.284 seconds, max(|w|) = 3.8e-05 ms⁻¹, wall time: 53.685 seconds
Iteration: 0100, time: 17.509 minutes, Δt: 7.100 seconds, max(|w|) = 1.7e-04 ms⁻¹, wall time: 54.035 seconds
Iteration: 0120, time: 19.669 minutes, Δt: 6.321 seconds, max(|w|) = 9.0e-04 ms⁻¹, wall time: 54.549 seconds
Iteration: 0140, time: 21.602 minutes, Δt: 5.748 seconds, max(|w|) = 4.9e-03 ms⁻¹, wall time: 54.895 seconds
Iteration: 0160, time: 23.370 minutes, Δt: 5.313 seconds, max(|w|) = 1.7e-02 ms⁻¹, wall time: 55.244 seconds
Iteration: 0180, time: 25 minutes, Δt: 5.172 seconds, max(|w|) = 3.6e-02 ms⁻¹, wall time: 55.592 seconds
Iteration: 0200, time: 26.737 minutes, Δt: 5.996 seconds, max(|w|) = 7.7e-02 ms⁻¹, wall time: 56.116 seconds
Iteration: 0220, time: 28.705 minutes, Δt: 6.649 seconds, max(|w|) = 8.4e-02 ms⁻¹, wall time: 56.467 seconds
Iteration: 0240, time: 30.929 minutes, Δt: 7.464 seconds, max(|w|) = 7.1e-02 ms⁻¹, wall time: 56.822 seconds
Iteration: 0260, time: 33.261 minutes, Δt: 8.024 seconds, max(|w|) = 7.9e-02 ms⁻¹, wall time: 57.178 seconds
Iteration: 0280, time: 35.866 minutes, Δt: 8.694 seconds, max(|w|) = 7.1e-02 ms⁻¹, wall time: 57.860 seconds
Iteration: 0300, time: 38.732 minutes, Δt: 8.671 seconds, max(|w|) = 5.8e-02 ms⁻¹, wall time: 58.212 seconds
[ Info: Simulation is stopping after running for 58.372 seconds.
[ Info: Simulation time 40 minutes equals or exceeds stop time 40 minutes. all seem the same?? |
They don't seem the same at all --- this PR is way faster? It'd be easier if we printed the differential wall time tho |
I think there's some stuff still to do for the hydrostatic model on this PR but not sure what |
can we please wait for #2991 ? |
I've been struggling to merge that for days..... |
hmmm weird that non-hydrostatic is way faster and hydrostatic is similar/slower than 0.79.2 |
Damn, I'm silly!! I was looking at Δt which IS the same! (I was looking at the place where the relevant info for the other example was being printed...) |
Edit: looking more closely the different is compile time, not run time. So we shouldn't expect speed up for run time, but maybe we will see something for compile time. |
Here's a simple hydrostatic model benchmark using Oceananigans
using BenchmarkTools
grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1), topology=(Periodic, Periodic, Bounded))
model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())
ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)
function lots_of_steps!(model, Δt, steps=100)
for _ = 1:steps
time_step!(model, Δt)
end
end
@btime lots_of_steps!(model, 0.01) Results 10.220 s (85845109 allocations: 37.94 GiB) # this PR
6.284 s (66184308 allocations: 16.31 GiB) # this PR with KA downgraded to 0.7.2 So we are close but not there yet. What about solvers ... ? |
Actually looking more closely at the results, it's only the compile time that is improved. The run time is the same for both. Nevertheless, compile time dropping by 30s is a big improvement and should help docs a lot. |
shall we merge? or investigate more? |
I'll merge! And we can investigate more if we think it's needed? |
Damn sorry! I didn't see all these comments @glwagner |
Let me reopen the issue |
We can undo the merging by |
Apparently, the issue with #2996 is caused by the fact that slurping inside a kernel signature
i.e:
This can be fixed by passing the size of the
Vararg
tuplelike this
Edit, this approach only works if all Varargs are of the same type
T
.Edit, Edit, with @glwagner we decided to pass all additional arguments to the kernel as a tuple, so
which means that from now on, when calling a kernel
will become
Resolves #2996