10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) #2996
I'm also detecting a massive slowdown in model execution and out-of-control allocations on the CPU. A type inference failure? It's unfortunate to find this so late, after the PRs were merged. |
@navidcy what's the last version before the catastrophic performance loss? I'll run a benchmark to compare against. |
Kind of a random case, but here are some timings from a calibration problem I'm doing (I'm running 6 3D simulations, ranging from roughly 33 to 117 time steps, either of size (6, 10, 32) or (6, 10, 64)). For these tests I just downgraded KernelAbstractions via Project.toml. With
|
Something a little puzzling to me is that we clearly succeed at type inference when running on the GPU. Do we have a better chance of succeeding there because of `Oceananigans.jl/src/Fields/field.jl`, line 400 at f70d094?
|
That's an order of magnitude slowdown... And that's not just compilation time, right? |
Compilation actually speeds up; it's the runtime that slows down. (The examples are probably running very, very, very slowly.) |
So the runtime is 50-100x slower. But because there's other stuff in the docs (and compilation time matters too), the total slowdown for the docs build is just 2x. |
@simone-silvestri this is a big deal, because 50x is essentially standing still --- hardly usable except for the smallest problems. |
Do you have a profile to share? |
Also a MWE would be nice to debug |
I haven't done any profiling --- just simple benchmarks. (Short example coming soon) |
Ok, here's something simple:

```julia
using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)
```

Here's what I've done. With KernelAbstractions 0.8:

```
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  20.460 s (144483404 allocations: 94.43 GiB)
```

With KernelAbstractions 0.7:

```
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  2.202 s (118604 allocations: 52.20 MiB)
```

I'm running on a single core on a Mac M1. Here the performance loss is just 10x, so I'll change the somewhat dramatic title of this issue. |
With

```
17.859 s (144483404 allocations: 94.43 GiB)  # KA 0.8
294.401 ms (118604 allocations: 52.20 MiB)  # KA 0.7
```

If we look just at `calculate_tendencies!`:

```julia
using Oceananigans
using Oceananigans.TimeSteppers: calculate_tendencies!
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        #time_step!(model, Δt)
        calculate_tendencies!(model, [])
    end
end

@btime lots_of_steps!(model, 0.01)
```

the results are

```
5.268 s (23061000 allocations: 11.73 GiB)  # KA 0.8
1.989 s (14600 allocations: 13.03 MiB)     # KA 0.7
```

and

```
2.846 s (23061000 allocations: 11.73 GiB)   # KA 0.8
105.867 ms (14600 allocations: 13.03 MiB)   # KA 0.7
```
|
Simple kernels are safe (in fact, faster):

```julia
using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ
using KernelAbstractions: @kernel, @index
using BenchmarkTools

@kernel function _diffuse!(c, grid, Δt)
    i, j, k = @index(Global, NTuple)
    @inbounds c[i, j, k] += Δt * ∇²ᶜᶜᶜ(i, j, k, grid, c)
end

function diffuse!(c, Δt)
    grid = c.grid
    arch = grid.architecture
    ev = launch!(arch, grid, :xyz, _diffuse!, c, grid, Δt)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)

@btime lots_of_steps!(c, 0.01)
```

This yields

```
447.763 ms (9832300 allocations: 975.37 MiB)  # KA 0.8
499.522 ms (9832300 allocations: 1.81 GiB)    # KA 0.7
```

adding |
I found something. It's hard to provide code since I hacked apart the code base for this, so I'll try to describe it. I reduced the tendency function to

```julia
u_velocity_tendency(i, j, k, grid) = zero(grid)
```

and

```julia
@inline function u_velocity_tendency(i, j, k, grid,
                                     advection,
                                     coriolis,
                                     stokes_drift,
                                     closure,
                                     u_immersed_bc,
                                     buoyancy,
                                     background_fields,
                                     velocities,
                                     tracers,
                                     auxiliary_fields,
                                     diffusivities,
                                     forcings,
                                     hydrostatic_pressure,
                                     clock)
    return zero(grid)
end
```

Even though these functions have identical output, applying them performs very differently. Maybe splatting is the issue, since we write

```julia
""" Calculate the right-hand-side of the u-velocity equation. """
@kernel function calculate_Gu!(Gu, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds Gu[i, j, k] = u_velocity_tendency(i, j, k, args...)
end
```

? |
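As a plain-Julia aside on the splatting suspicion (made-up function names, not Oceananigans' actual kernels, and this illustrates one way splatting defeats specialization, not necessarily the exact mechanism here): splatting is free when the compiler statically knows the container, and dynamic when it doesn't.

```julia
# Hypothetical sketch: splatting a Tuple (static length and element types)
# stays fully specialized, while splatting a Vector (runtime length) becomes
# a dynamic call that allocates.
g(a, b, c) = a + b + c

splat_tuple(t) = g(t...)   # t::NTuple{3, Int}: specialized, allocation-free
splat_vector(v) = g(v...)  # v::Vector{Int}: dynamic apply

const tup = (1, 2, 3)
const vec = [1, 2, 3]

splat_tuple(tup); splat_vector(vec)  # warm up before measuring

tuple_bytes = @allocated splat_tuple(tup)
vector_bytes = @allocated splat_vector(vec)
```

After warm-up, the tuple path measures zero bytes while the vector path still allocates on every call.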
Ok... maybe this is a minimal example:

```julia
using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.BoundaryConditions: fill_halo_regions!
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ
using KernelAbstractions: @kernel, @index
using BenchmarkTools

@inline laplacian(i, j, k, grid, c, args...) = ∇²ᶜᶜᶜ(i, j, k, grid, c)

@kernel function _diffuse!(c, grid, Δt, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds c[i, j, k] += Δt * laplacian(i, j, k, grid, c, args...)
end

function diffuse!(c, Δt)
    grid = c.grid
    Nargs = 2 # this is the key
    args = Tuple(grid for i = 1:Nargs)
    arch = grid.architecture
    fill_halo_regions!(c)
    ev = launch!(arch, grid, :xyz, _diffuse!, c, grid, Δt, args...)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)

@btime lots_of_steps!(c, 0.01)
```

Ok, so with nothing different except the KA version, this is what we find: |
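One detail worth flagging in this MWE (my reading, unconfirmed): `args = Tuple(grid for i = 1:Nargs)` builds a tuple whose length is a runtime value, so inference only sees a variable-length tuple type, and downstream code that splats it cannot fully specialize. A stripped-down, pure-Julia sketch with made-up names:

```julia
# Made-up names. A tuple built from a generator over a runtime range is
# inferred with unknown length (a non-concrete type), while a literal
# tuple is inferred concretely.
runtime_tuple(n) = Tuple(1 for i in 1:n)  # length n only known at runtime
static_tuple() = (1, 1)                   # length fixed at compile time

rt_runtime = Base.return_types(runtime_tuple, (Int,))[1]
rt_static  = Base.return_types(static_tuple, ())[1]
```

Inspecting `rt_runtime` shows a non-concrete tuple type, whereas `rt_static` is exactly `Tuple{Int, Int}`.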
So the likely answer here is that CPU execution no longer runs through my custom Cassette infrastructure, which inlined everything even on the CPU. So you are hitting the normal case of Julia deciding that it isn't worth inlining complicated functions. The benefits of not using Cassette are that backtraces are useful (for errors and profiling) and that compile time is much improved. There is a long-term fix that one could explore... I don't currently have time for this, but it might make an interesting UROP project: use a custom host compiler via GPUCompiler/LLVM.jl/AbstractInterpreter, and maybe OpaqueClosure, to mimic the compilation of the GPU code. |
In some ways this comes back to the fundamental question: what is the point of KernelAbstractions' CPU support? I originally intended it only for making debugging easier... but folks seem to be depending on it as a performance solution. I think it is feasible to get there, but it would require quite a bit of time and effort. |
Ok, so if we shouldn't be using KernelAbstractions for performance on the CPU, then I think this means we should write our own CPU infrastructure and rely on KA only for the GPU? Is that what you recommend? |
That wouldn't solve your problem here. KA gives you reasonable performance on the CPU, but since KA 0.8 its execution story on the CPU is much closer to the rest of Julia's, and it doesn't play special tricks. |
Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on the GPU, does that mean we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance? Another possibility is that we rewrite much of the code base to avoid the performance pitfalls we are currently facing, in order to get back to the level of performance we have with the current code + KA 0.7. I believe the issue is basically an interaction between some of the abstraction / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously. |
My point is that your own CPU infrastructure would encounter these same issues, unless you are willing to do a large amount of work --- in which case I would invite you to improve KA instead. |
Re-posting from #3026... that PR solved performance problems with

```julia
using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1),
                       topology=(Periodic, Periodic, Bounded))

model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())

ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)
```

Results: |
To follow up with @vchuravy: it seems that rewriting just some of the code was sufficient, so we are (probably) in the clear! The lesson learned is that we cannot slurp / splat arguments in kernels. |
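To spell out the "don't slurp / splat" lesson with a toy, pure-Julia example (made-up names; `tendency` is just a stand-in for something like `u_velocity_tendency`): explicit arguments keep the call inferred, while funneling arguments through an abstractly-typed container destroys inference.

```julia
# Made-up stand-in for a tendency function with several arguments.
tendency(i, grid, advection, coriolis) = i + grid + advection + coriolis

# Explicit arguments: return type inferred concretely.
explicit(i) = tendency(i, 1, 2, 3)

# Arguments smuggled through a Vector{Any}: inference gives up.
const boxed_args = Any[1, 2, 3]
boxed(i) = tendency(i, boxed_args...)

rt_explicit = Base.return_types(explicit, (Int,))[1]
rt_boxed    = Base.return_types(boxed, (Int,))[1]
```

Both functions compute the same value, but `rt_explicit` is `Int` while `rt_boxed` is abstract.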
I get: a 50% slowdown and many more allocations (if that matters). |
I think maybe we caught most of the problem kernels but not all... |
Does it change with different solvers? I'll do some testing today to try to snoop out the issue |
Regarding the remaining issues: our solution now is to infer the RHS function and pass it as an argument to the kernel. Apparently, this prevents compilation. I'll come up with a solution today. |
Also, the implicit vertical solver seems to affect performance. My guess is that this is because we are passing functions as arguments to the kernel. |
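On passing functions to kernels, here is a pure-Julia sketch of one way this can hurt (made-up names; whether this is the actual cause in the implicit solver is a guess): if the function ends up stored behind an abstract `Function` field, the callee can't be inferred, whereas carrying its concrete type as a parameter preserves inference.

```julia
struct BoxedRHS          # field typed as the abstract `Function`
    f::Function
end

struct TypedRHS{F}       # type parameter carries the concrete function type
    f::F
end

call_rhs(r, x) = r.f(x)  # quality of inference depends on the wrapper

rhs(x) = 2x

rt_boxed = Base.return_types(call_rhs, (BoxedRHS, Float64))[1]
rt_typed = Base.return_types(call_rhs, (TypedRHS{typeof(rhs)}, Float64))[1]
```

Here `rt_typed` is `Float64` while `rt_boxed` is `Any`, so the boxed version pays for dynamic dispatch on every call.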
I observe an abrupt doubling of the docs build time, from 3 hours on
https://buildkite.com/clima/oceananigans/builds/9920#01862a69-61ee-4e59-82b4-5c4b4684e7b6
to 6 hours after "Merge pull request #2899 from CliMA/vc/ka_upgrade"
https://buildkite.com/clima/oceananigans/builds/9935#01862d0c-e009-4a05-a8bb-023d1159a2ae
cc @simone-silvestri, @vchuravy