10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) #2996

Closed · navidcy opened this issue Mar 22, 2023 · 29 comments · Fixed by #3026 or #3030
Labels: 🚨 high priority 🚨, performance 🏍️ (So we can get the wrong answer even faster), question 💭 (No such thing as a stupid question)

@navidcy (Collaborator) commented Mar 22, 2023

I observe a discontinuous doubling of the docs build time, from 3 hours on

https://buildkite.com/clima/oceananigans/builds/9920#01862a69-61ee-4e59-82b4-5c4b4684e7b6

to 6 hours after "Merge pull request #2899 from CliMA/vc/ka_upgrade"

https://buildkite.com/clima/oceananigans/builds/9935#01862d0c-e009-4a05-a8bb-023d1159a2ae

cc @simone-silvestri, @vchuravy

@glwagner (Member)

I'm also detecting a massive slowdown in model execution and out-of-control allocations on the CPU. Type inference failure? Unfortunate to find this out so late, after the PRs are merged.

@glwagner (Member)

@navidcy what's the last version before the catastrophic performance loss? I'll do a benchmark to compare with main.

@glwagner (Member) commented Mar 23, 2023

Kind of a random case, but here are some timings from a calibration problem I'm working on (I'm running 6 3D simulations of roughly 33 to 117 time steps each, with size either (6, 10, 32) or (6, 10, 64)). For these tests I just downgraded KernelAbstractions via Project.toml:

With KernelAbstractions 0.7.2 (also downgraded CUDAKernels)

 23.405205 seconds (25.27 M allocations: 2.610 GiB, 1.78% gc time, 99.80% compilation time)
  5.019944 seconds (5.92 M allocations: 475.126 MiB, 0.86% gc time, 98.22% compilation time)
  0.067385 seconds (107.25 k allocations: 72.628 MiB)
  0.090308 seconds (107.25 k allocations: 72.628 MiB)
  0.139109 seconds (217.20 k allocations: 147.487 MiB)
  0.197798 seconds (217.20 k allocations: 147.487 MiB)

The first two simulations are dominated by compilation, but things go fast after that.

With KernelAbstractions 0.8.6

  4.914645 seconds (28.10 M allocations: 6.039 GiB, 15.62% gc time, 51.22% compilation time)
  5.011717 seconds (31.58 M allocations: 10.844 GiB, 19.87% gc time, 16.87% compilation time)
  4.236418 seconds (27.41 M allocations: 11.073 GiB, 21.75% gc time)
  8.501561 seconds (55.04 M allocations: 22.118 GiB, 22.03% gc time)
  8.618707 seconds (56.01 M allocations: 22.627 GiB, 21.83% gc time)
 17.081286 seconds (112.47 M allocations: 45.197 GiB, 21.73% gc time)

Smells like type inference failure to me. Some informal exploration shows that the tendency calculations dominate this problem (as they do many others) --- so it's a pretty basic issue I suspect.

@glwagner (Member) commented Mar 23, 2023

Something a little puzzling to me is that we clearly succeed at type inference when running on the GPU. Do we have a better chance of succeeding there because of adapt? We do make some critical simplifications via adapt, most notably

Adapt.adapt_structure(to, f::Field) = Adapt.adapt(to, f.data)
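
As a minimal sketch of what that rule buys us (using a made-up MyField rather than the real Field type), adapting strips the wrapper so a kernel only ever sees the raw data:

using Adapt

# Hypothetical stand-in for a Field: some data plus a grid
struct MyField{D, G}
    data :: D
    grid :: G
end

# Same pattern as the rule quoted above: only the data survives adaption
Adapt.adapt_structure(to, f::MyField) = Adapt.adapt(to, f.data)

f = MyField(rand(4, 4), nothing)
Adapt.adapt(Array, f)  # returns the underlying 4×4 Array, not a MyField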

@navidcy (Collaborator, Author) commented Mar 23, 2023

That's an order of magnitude slowdown... And that's not just compilation times, right?

@glwagner (Member)

Compilation actually speeds up; it's the run time that slows down. (The examples are probably running very, very, very slowly.)

@glwagner (Member) commented Mar 23, 2023

So the runtime is 50-100x slower. But because there's other stuff in the docs (and compilation time matters too), the total slowdown for the docs build is just 2x.

@glwagner (Member) commented Mar 23, 2023

@simone-silvestri this is a big deal, because 50x slower is essentially standing still; it's hardly usable except for the smallest problems.

@glwagner glwagner changed the title Docs build too slow... 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type instability?) Mar 23, 2023
@glwagner glwagner changed the title 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type instability?) 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) Mar 23, 2023
@vchuravy (Collaborator)

Do you have a profile to share?
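
For anyone collecting one, a sketch using the standard library profiler (this assumes the benchmark script shown later in the thread is loaded, so that model and lots_of_steps! exist):

using Profile

lots_of_steps!(model, 0.01)  # warm up so compilation doesn't dominate the profile

Profile.clear()
@profile lots_of_steps!(model, 0.01)
Profile.print(format=:flat, sortedby=:count)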

@simone-silvestri (Collaborator)

Also, an MWE would be nice for debugging.

@glwagner (Member) commented Mar 23, 2023

I haven't done any profiling --- just simple benchmarks. (Short example coming soon)

@glwagner (Member) commented Mar 23, 2023

Ok here's something simple:

using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)

Here's what I've done:

  • Run this on a fresh clone of main. This returns:
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  20.460 s (144483404 allocations: 94.43 GiB)
  • Restrict compat on KernelAbstractions to 0.7.2 and CUDAKernels to 0.3.3. This returns:
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  2.202 s (118604 allocations: 52.20 MiB)

I'm running on a single core of a Mac M1. Here the performance loss is just 10x, so I'll change the somewhat dramatic title of this issue.

@glwagner glwagner changed the title 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) 10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) Mar 23, 2023
@glwagner (Member) commented Mar 23, 2023

With advection = CenteredSecondOrder(), the differences are more dramatic (maybe this explains the 100x slowdown I saw with a simple setup):

  17.859 s (144483404 allocations: 94.43 GiB) # KA 0.8
  294.401 ms (118604 allocations: 52.20 MiB) # KA 0.7

If we look just at calculate_tendencies! via

using Oceananigans
using Oceananigans.TimeSteppers: calculate_tendencies!
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        #time_step!(model, Δt)
        calculate_tendencies!(model, [])
    end
end

@btime lots_of_steps!(model, 0.01)

results are (advection = WENO())

5.268 s (23061000 allocations: 11.73 GiB) # KA 0.8
1.989 s (14600 allocations: 13.03 MiB) # KA 0.7

and advection = CenteredSecondOrder()

  2.846 s (23061000 allocations: 11.73 GiB) # KA 0.8
105.867 ms (14600 allocations: 13.03 MiB) # KA 0.7

@glwagner (Member) commented Mar 23, 2023

Simple kernels are safe (in fact, faster)

using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ

using KernelAbstractions: @kernel, @index
using BenchmarkTools

@kernel function _diffuse!(c, Δt)
    i, j, k = @index(Global, NTuple)
    # note: `grid` here refers to the (non-const) global defined at the bottom of the script
    @inbounds c[i, j, k] += Δt * ∇²ᶜᶜᶜ(i, j, k, grid, c)
end

function diffuse!(c, Δt)
    grid = c.grid
    arch = grid.architecture
    ev = launch!(arch, grid, :xyz, _diffuse!, c, Δt)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)
@btime lots_of_steps!(c, 0.01)

yields

447.763 ms (9832300 allocations: 975.37 MiB) # KA 0.8
499.522 ms (9832300 allocations: 1.81 GiB) # KA 0.7

adding fill_halo_regions! doesn't change much either

@glwagner (Member) commented Mar 23, 2023

I found something. It's hard to provide code since I hacked apart the code base for this. I'll try to describe.

I reduced calculate_tendencies! to just calculating the u tendency. I then wrote two methods:

u_velocity_tendency(i, j, k, grid) = zero(grid)

and

@inline function u_velocity_tendency(i, j, k, grid,
                                     advection,
                                     coriolis,
                                     stokes_drift,
                                     closure,
                                     u_immersed_bc,
                                     buoyancy,
                                     background_fields,
                                     velocities,
                                     tracers,
                                     auxiliary_fields,
                                     diffusivities,
                                     forcings,
                                     hydrostatic_pressure,
                                     clock)

    return zero(grid)
end

Even though these functions have identical output, applying launch! to the "long argument list" version is 300x slower. So the problem is in launch! somewhere?

Maybe splatting is the issue since we write

""" Calculate the right-hand-side of the u-velocity equation. """
@kernel function calculate_Gu!(Gu, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds Gu[i, j, k] = u_velocity_tendency(i, j, k, args...)
end

?
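
If splatting is the culprit, a plausible mechanism is Julia's heuristic of not specializing on Vararg arguments that are merely forwarded to another call. A toy illustration, separate from the Oceananigans kernels themselves:

inner(i, a, b) = i + a + b

# args... is only forwarded, so Julia may not specialize on the types of the splatted arguments
passthrough(i, args...) = inner(i, args...)

# annotating with Vararg{Any, N} forces specialization on the splatted arguments
passthrough_forced(i, args::Vararg{Any, N}) where {N} = inner(i, args...)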

@glwagner (Member)

OK... maybe this is a minimal example:

using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.BoundaryConditions: fill_halo_regions!
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ

using KernelAbstractions: @kernel, @index
using BenchmarkTools

@inline laplacian(i, j, k, grid, c, args...) = ∇²ᶜᶜᶜ(i, j, k, grid, c)

@kernel function _diffuse!(c, Δt, args...)
    i, j, k = @index(Global, NTuple)
    # note: `grid` here refers to the (non-const) global defined at the bottom of the script
    @inbounds c[i, j, k] += Δt * laplacian(i, j, k, grid, c, args...)
end

function diffuse!(c, Δt)
    grid = c.grid

    Nargs = 2 # this is the key

    args = Tuple(grid for i = 1:Nargs)
    arch = grid.architecture
    fill_halo_regions!(c)
    ev = launch!(arch, grid, :xyz, _diffuse!, c, Δt, args...)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)
@btime lots_of_steps!(c, 0.01)

OK, so with nothing different except the KA version, this is what we find:

Nargs   KA 0.8                                          KA 0.7.2
1       479.273 ms (9847200 allocations: 977.66 MiB)    686.707 ms (11485600 allocations: 2.57 GiB)
2       1.281 s (14801200 allocations: 4.75 GiB)        840.756 ms (13124400 allocations: 3.32 GiB)
3       1.399 s (16447200 allocations: 5.52 GiB)        981.717 ms (14764000 allocations: 4.08 GiB)

@vchuravy (Collaborator)

The likely answer here is that CPU execution no longer runs through my custom Cassette infrastructure, which inlined everything even on the CPU. So you are hitting the normal case of Julia deciding that it isn't worth inlining complicated functions.

The benefits of not using Cassette are that backtraces are useful (for errors and profiling) and that compile time is much improved.

There is a long-term fix that one could explore... I don't currently have time for this, but it might make an interesting UROP project: use a custom host compiler via GPUCompiler/LLVM.jl/AbstractInterpreter and maybe OpaqueClosure to mimic the compilation of the GPU code.

@vchuravy (Collaborator)

In some ways this comes back to the fundamental question: what is the point of KernelAbstractions' CPU support? I originally intended it only to make debugging easier...

But folks seem to be depending on it as a performance solution... I think it is feasible to get there, but it would require quite a bit of time and effort.

@glwagner (Member) commented Mar 23, 2023

OK, so if we shouldn't be using KernelAbstractions for CPU performance, then I think this means we should write our own CPU infrastructure and rely on KA only for the GPU? Is that what you recommend?

@vchuravy (Collaborator)

Is that what you recommend?

That wouldn't solve your problem here. KA gives you reasonable performance on the CPU, but since KA 0.8 its execution story on the CPU is much closer to the rest of Julia, and it doesn't play special tricks.

@glwagner (Member) commented Mar 23, 2023

Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?

Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.

@navidcy navidcy added the performance 🏍️ So we can get the wrong answer even faster label Mar 23, 2023
@vchuravy (Collaborator)

in developing our own CPU infrastructure

My point is that your own CPU infrastructure would encounter these same issues, unless you are willing to do a large amount of work, in which case I would invite you to improve KA instead.

@navidcy navidcy reopened this Mar 24, 2023
@glwagner (Member)

Re-posting from #3026... that PR solved the performance problems with NonhydrostaticModel, but HydrostaticFreeSurfaceModel is still roughly 2x slower than when using KA 0.7.2. Here's a simple benchmark:

using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1), topology=(Periodic, Periodic, Bounded))
model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())
ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)

Results

10.220 s (85845109 allocations: 37.94 GiB) # main
6.284 s (66184308 allocations: 16.31 GiB) # main with KA downgraded to 0.7.2

cc @simone-silvestri

@glwagner (Member)

Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?

Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.

To follow up with @vchuravy, it seems that rewriting just some of the code was sufficient, so we are (probably) in the clear! The lesson learned is that we cannot slurp / splat @kernel function arguments, because it prevents the kernel code from being inlined.
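
A minimal sketch of that lesson (illustrative names only, not the actual diff from #3026):

using KernelAbstractions: @kernel, @index

@inline tendency(i, j, k, grid, advection) = 0.0  # stand-in for the real RHS

# Slurped arguments: the pattern that regressed on the CPU under KA 0.8
@kernel function _compute_G_slurped!(G, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds G[i, j, k] = tendency(i, j, k, args...)
end

# Spelled-out arguments: lets the tendency call specialize and inline
@kernel function _compute_G!(G, grid, advection)
    i, j, k = @index(Global, NTuple)
    @inbounds G[i, j, k] = tendency(i, j, k, grid, advection)
end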

@navidcy (Collaborator, Author) commented Mar 24, 2023

I get:

  12.372 s (119521270 allocations: 70.48 GiB) # main
  8.320 s (66686606 allocations: 16.40 GiB) # main with downgraded KA 0.7.3 & CUDAKernels 0.3.3

A 50% slowdown and many more allocations (if that matters).

@glwagner (Member)

I think maybe we caught most of the problem kernels but not all...

@simone-silvestri (Collaborator)

Does it change with different solvers? I'll do some testing today to try to snoop out the issue

navidcy referenced this issue Mar 27, 2023: Creates functions for grid spacings
@simone-silvestri (Collaborator)

The issues with the HydrostaticFreeSurfaceModel are in the tendency kernels. The difference from the non-hydrostatic model is that we do not know a priori which RHS function to call (for example, CATKE has an :e tracer that requires a different RHS function, and the same goes for a 1-equation parameterization of mesoscales that evolves an additional tracer equation for the mesoscale energy :K).

Our solution now is to infer the RHS function and pass it as an argument to the kernel. Apparently, this prevents compilation. I'll come up with a solution today.
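
For context, a toy sketch (made-up names, not the eventual fix in #3030) of the Julia behavior that could be at play: a function argument that is only forwarded onward may not trigger specialization unless it carries a type parameter.

# rhs is only forwarded here, so Julia may compile one unspecialized method for all RHS functions
launch_tendency(rhs, G, grid) = fill_tendency!(G, grid, rhs)

# the ::F type parameter forces specialization on the particular RHS function
launch_tendency_forced(rhs::F, G, grid) where {F} = fill_tendency!(G, grid, rhs)

# the callee actually calls rhs, so it specializes once the concrete function type is known
function fill_tendency!(G, grid, rhs)
    for i in eachindex(G)
        @inbounds G[i] = rhs(i, grid)
    end
    return nothing
end

# usage: launch_tendency_forced((i, grid) -> 0.0, zeros(16), nothing)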

@simone-silvestri (Collaborator) commented Mar 27, 2023

Also, the implicit vertical solver seems to affect performance. My guess is that it's because we are passing functions as arguments to the kernel.

simone-silvestri added a commit that referenced this issue Mar 28, 2023: …2996 (#3030)
