10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) #2996

Closed · navidcy opened this issue Mar 22, 2023 · 29 comments · Fixed by #3026 or #3030
Labels: 🚨 high priority 🚨, performance 🏍️ (So we can get the wrong answer even faster), question 💭 (No such thing as a stupid question)

@navidcy (Collaborator) commented Mar 22, 2023

I observe a discontinuous doubling of the docs build time, from 3 hours on

https://buildkite.com/clima/oceananigans/builds/9920#01862a69-61ee-4e59-82b4-5c4b4684e7b6

to 6 hours after "Merge pull request #2899 from CliMA/vc/ka_upgrade"

https://buildkite.com/clima/oceananigans/builds/9935#01862d0c-e009-4a05-a8bb-023d1159a2ae

cc @simone-silvestri, @vchuravy

@glwagner (Member)

I'm also detecting a massive slowdown in model execution and out-of-control allocations on the CPU. Type inference failure? Unfortunate to find this out so late, after the PRs are merged.

@glwagner (Member)

@navidcy what's the last version before the catastrophic performance loss? I'll do a benchmark to compare with main.

@glwagner (Member) commented Mar 23, 2023

Kind of a random case, but here are some timings from a calibration problem I'm working on (I'm running 6 3D simulations of roughly 33 to 117 time steps each, with size either (6, 10, 32) or (6, 10, 64)). For these tests I just downgraded KernelAbstractions via Project.toml:

With KernelAbstractions 0.7.2 (also downgraded CUDAKernels)

 23.405205 seconds (25.27 M allocations: 2.610 GiB, 1.78% gc time, 99.80% compilation time)
  5.019944 seconds (5.92 M allocations: 475.126 MiB, 0.86% gc time, 98.22% compilation time)
  0.067385 seconds (107.25 k allocations: 72.628 MiB)
  0.090308 seconds (107.25 k allocations: 72.628 MiB)
  0.139109 seconds (217.20 k allocations: 147.487 MiB)
  0.197798 seconds (217.20 k allocations: 147.487 MiB)

The first two simulations are dominated by compilation, but things go fast after that.

With KernelAbstractions 0.8.6

  4.914645 seconds (28.10 M allocations: 6.039 GiB, 15.62% gc time, 51.22% compilation time)
  5.011717 seconds (31.58 M allocations: 10.844 GiB, 19.87% gc time, 16.87% compilation time)
  4.236418 seconds (27.41 M allocations: 11.073 GiB, 21.75% gc time)
  8.501561 seconds (55.04 M allocations: 22.118 GiB, 22.03% gc time)
  8.618707 seconds (56.01 M allocations: 22.627 GiB, 21.83% gc time)
 17.081286 seconds (112.47 M allocations: 45.197 GiB, 21.73% gc time)

Smells like type inference failure to me. Some informal exploration shows that the tendency calculations dominate this problem (as they do many others) --- so it's a pretty basic issue I suspect.

@glwagner (Member) commented Mar 23, 2023

Something a little puzzling to me is that we clearly succeed at type inference when running on the GPU. Do we have a better chance of succeeding there because of adapt? We do make some critical simplifications via adapt, most notably

Adapt.adapt_structure(to, f::Field) = Adapt.adapt(to, f.data)
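
As a minimal sketch of what that rule buys us (using a made-up MyField rather than the real Field type), adapting strips the wrapper so a kernel only ever sees the raw data:

using Adapt

# Hypothetical stand-in for a Field: some data plus a grid
struct MyField{D, G}
    data :: D
    grid :: G
end

# Same pattern as the rule quoted above: only the data survives adaption
Adapt.adapt_structure(to, f::MyField) = Adapt.adapt(to, f.data)

f = MyField(rand(4, 4), nothing)
Adapt.adapt(Array, f)  # returns the underlying 4×4 Array, not a MyField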

@navidcy (Collaborator, Author) commented Mar 23, 2023

That's an order of magnitude slowdown... And that's not just compilation times, right?

@glwagner (Member)

Compilation actually speeds up; it's the run time that slows down. (The examples are probably running very, very, very slowly.)

@glwagner (Member) commented Mar 23, 2023

So the runtime is 50-100x slower. But because there's other stuff in the docs (and compilation time matters too), the total slowdown for the docs build is just 2x.

@glwagner (Member) commented Mar 23, 2023

@simone-silvestri this is a big deal, because 50x slower is essentially standing still; it's hardly usable except for the smallest problems.

@glwagner glwagner changed the title Docs build too slow... 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type instability?) Mar 23, 2023
@glwagner glwagner changed the title 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type instability?) 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) Mar 23, 2023
@vchuravy (Collaborator)

Do you have a profile to share?
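
For anyone collecting one, a sketch using the standard library profiler (this assumes the benchmark script shown later in the thread is loaded, so that model and lots_of_steps! exist):

using Profile

lots_of_steps!(model, 0.01)  # warm up so compilation doesn't dominate the profile

Profile.clear()
@profile lots_of_steps!(model, 0.01)
Profile.print(format=:flat, sortedby=:count)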

@simone-silvestri (Collaborator)

Also, an MWE would be nice for debugging.

@glwagner (Member) commented Mar 23, 2023

I haven't done any profiling --- just simple benchmarks. (Short example coming soon)

@glwagner (Member) commented Mar 23, 2023

Ok here's something simple:

using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)

Here's what I've done:

  • Run this on a fresh clone of main. This returns:
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  20.460 s (144483404 allocations: 94.43 GiB)
  • Restrict compat on KernelAbstractions to 0.7.2 and CUDAKernels to 0.3.3. This returns:
julia> include("../simple_benchmark.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
  2.202 s (118604 allocations: 52.20 MiB)

I'm running on a single core of a Mac M1. Here the performance loss is just 10x, so I'll change the somewhat dramatic title of this issue.

@glwagner glwagner changed the title 50-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) 10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?) Mar 23, 2023
@glwagner (Member) commented Mar 23, 2023

With advection = CenteredSecondOrder(), the differences are more dramatic (maybe this explains the 100x slowdown I saw with a simple setup):

  17.859 s (144483404 allocations: 94.43 GiB) # KA 0.8
  294.401 ms (118604 allocations: 52.20 MiB) # KA 0.7

If we look just at calculate_tendencies! via

using Oceananigans
using Oceananigans.TimeSteppers: calculate_tendencies!
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
model = NonhydrostaticModel(; grid, advection=WENO())

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        #time_step!(model, Δt)
        calculate_tendencies!(model, [])
    end
end

@btime lots_of_steps!(model, 0.01)

results are (advection = WENO())

5.268 s (23061000 allocations: 11.73 GiB) # KA 0.8
1.989 s (14600 allocations: 13.03 MiB) # KA 0.7

and advection = CenteredSecondOrder()

  2.846 s (23061000 allocations: 11.73 GiB) # KA 0.8
105.867 ms (14600 allocations: 13.03 MiB) # KA 0.7

@glwagner (Member) commented Mar 23, 2023

Simple kernels are safe (in fact, faster)

using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ

using KernelAbstractions: @kernel, @index
using BenchmarkTools

@kernel function _diffuse!(c, Δt)
    i, j, k = @index(Global, NTuple)
    # note: `grid` here refers to the (non-const) global defined at the bottom of the script
    @inbounds c[i, j, k] += Δt * ∇²ᶜᶜᶜ(i, j, k, grid, c)
end

function diffuse!(c, Δt)
    grid = c.grid
    arch = grid.architecture
    ev = launch!(arch, grid, :xyz, _diffuse!, c, Δt)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)
@btime lots_of_steps!(c, 0.01)

yields

447.763 ms (9832300 allocations: 975.37 MiB) # KA 0.8
499.522 ms (9832300 allocations: 1.81 GiB) # KA 0.7

adding fill_halo_regions! doesn't change much either

@glwagner (Member) commented Mar 23, 2023

I found something. It's hard to provide code since I hacked apart the code base for this. I'll try to describe.

I reduced calculate_tendencies! to just calculating the u tendency. I then wrote two methods:

u_velocity_tendency(i, j, k, grid) = zero(grid)

and

@inline function u_velocity_tendency(i, j, k, grid,
                                     advection,
                                     coriolis,
                                     stokes_drift,
                                     closure,
                                     u_immersed_bc,
                                     buoyancy,
                                     background_fields,
                                     velocities,
                                     tracers,
                                     auxiliary_fields,
                                     diffusivities,
                                     forcings,
                                     hydrostatic_pressure,
                                     clock)

    return zero(grid)
end

Even though these functions have identical output, applying launch! to the "long argument list" version is 300x slower. So the problem is in launch! somewhere?

Maybe splatting is the issue since we write

""" Calculate the right-hand-side of the u-velocity equation. """
@kernel function calculate_Gu!(Gu, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds Gu[i, j, k] = u_velocity_tendency(i, j, k, args...)
end

?
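
If splatting is the culprit, a plausible mechanism is Julia's heuristic of not specializing on Vararg arguments that are merely forwarded to another call. A toy illustration, separate from the Oceananigans kernels themselves:

inner(i, a, b) = i + a + b

# args... is only forwarded, so Julia may not specialize on the types of the splatted arguments
passthrough(i, args...) = inner(i, args...)

# annotating with Vararg{Any, N} forces specialization on the splatted arguments
passthrough_forced(i, args::Vararg{Any, N}) where {N} = inner(i, args...)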

@glwagner (Member)

OK... maybe this is a minimal example:

using Oceananigans
using Oceananigans.Architectures: device
using Oceananigans.BoundaryConditions: fill_halo_regions!
using Oceananigans.Utils: launch!
using Oceananigans.Operators: ∇²ᶜᶜᶜ

using KernelAbstractions: @kernel, @index
using BenchmarkTools

@inline laplacian(i, j, k, grid, c, args...) = ∇²ᶜᶜᶜ(i, j, k, grid, c)

@kernel function _diffuse!(c, Δt, args...)
    i, j, k = @index(Global, NTuple)
    # note: `grid` here refers to the (non-const) global defined at the bottom of the script
    @inbounds c[i, j, k] += Δt * laplacian(i, j, k, grid, c, args...)
end

function diffuse!(c, Δt)
    grid = c.grid

    Nargs = 2 # this is the key

    args = Tuple(grid for i = 1:Nargs)
    arch = grid.architecture
    fill_halo_regions!(c)
    ev = launch!(arch, grid, :xyz, _diffuse!, c, Δt, args...)
    wait(device(arch), ev)
    return nothing
end

function lots_of_steps!(c, Δt, steps=100)
    for _ = 1:steps
        diffuse!(c, Δt)
    end
end

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1))
c = CenterField(grid)
@btime lots_of_steps!(c, 0.01)

OK, so with nothing different except the KA version, this is what we find:

Nargs   KA 0.8                                          KA 0.7.2
1       479.273 ms (9847200 allocations: 977.66 MiB)    686.707 ms (11485600 allocations: 2.57 GiB)
2       1.281 s (14801200 allocations: 4.75 GiB)        840.756 ms (13124400 allocations: 3.32 GiB)
3       1.399 s (16447200 allocations: 5.52 GiB)        981.717 ms (14764000 allocations: 4.08 GiB)

@vchuravy (Collaborator)

The likely answer here is that CPU execution no longer runs through my custom Cassette infrastructure, which inlined everything even on the CPU. So you are hitting the normal case of Julia deciding that it isn't worth inlining complicated functions.

The benefits of not using Cassette are that backtraces are useful (for errors and profiling) and that compile time is much improved.

There is a long-term fix that one could explore... I don't currently have time for this, but it might make an interesting UROP project: use a custom host compiler via GPUCompiler/LLVM.jl/AbstractInterpreter and maybe OpaqueClosure to mimic the compilation of the GPU code.

@vchuravy (Collaborator)

In some ways this comes back to the fundamental question: what is the point of KernelAbstractions' CPU support? I originally intended it only to make debugging easier...

But folks seem to be depending on it as a performance solution... I think it is feasible to get there, but it would require quite a bit of time and effort.

@glwagner (Member) commented Mar 23, 2023

OK, so if we shouldn't be using KernelAbstractions for CPU performance, then I think this means we should write our own CPU infrastructure and rely on KA only for the GPU? Is that what you recommend?

@vchuravy (Collaborator)

Is that what you recommend?

That wouldn't solve your problem here. KA gives you reasonable performance on the CPU, but since KA 0.8 its execution story on the CPU is much closer to the rest of Julia, and it doesn't play special tricks.

@glwagner (Member) commented Mar 23, 2023

Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?

Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.

@navidcy navidcy added the performance 🏍️ So we can get the wrong answer even faster label Mar 23, 2023
@vchuravy (Collaborator)

in developing our own CPU infrastructure

My point is that your own CPU infrastructure would encounter these same issues, unless you are willing to do a large amount of work, in which case I would invite you to improve KA instead.

@navidcy navidcy reopened this Mar 24, 2023
@glwagner (Member)

Re-posting from #3026... that PR solved the performance problems with NonhydrostaticModel, but HydrostaticFreeSurfaceModel is still roughly 2x slower than when using KA 0.7.2. Here's a simple benchmark:

using Oceananigans
using BenchmarkTools

grid = RectilinearGrid(CPU(), size=(128, 128, 1), x=(0, 2π), y=(0, 2π), z=(0, 1), topology=(Periodic, Periodic, Bounded))
model = HydrostaticFreeSurfaceModel(; grid, momentum_advection=WENO(), tracer_advection=WENO())
ϵ(x, y, z) = 2rand() - 1
set!(model, u=ϵ, v=ϵ)

function lots_of_steps!(model, Δt, steps=100)
    for _ = 1:steps
        time_step!(model, Δt)
    end
end

@btime lots_of_steps!(model, 0.01)

Results

10.220 s (85845109 allocations: 37.94 GiB) # main
6.284 s (66184308 allocations: 16.31 GiB) # main with KA downgraded to 0.7.2

cc @simone-silvestri

@glwagner (Member)

Since we need the performance provided by KA 0.7, and we need to use KA 0.8+ on GPU, does that mean that we should invest in developing our own CPU infrastructure (replicating what KA 0.7 offered) to achieve that performance?

Another possibility is that we re-write much of the code base to avoid the performance pitfalls we are currently facing in order to get back to the level of performance we have with current code + KA 0.7. I believe the issue is basically an interaction between some of the abstractions / indirection we have developed and the compiler, so possibly rolling back that abstraction / indirection will bring us back to where we were previously.

To follow up with @vchuravy, it seems that rewriting just some of the code was sufficient, so we are (probably) in the clear! The lesson learned is that we cannot slurp / splat @kernel function arguments, because it prevents the kernel code from being inlined.
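
A minimal sketch of that lesson (illustrative names only, not the actual diff from #3026):

using KernelAbstractions: @kernel, @index

@inline tendency(i, j, k, grid, advection) = 0.0  # stand-in for the real RHS

# Slurped arguments: the pattern that regressed on the CPU under KA 0.8
@kernel function _compute_G_slurped!(G, args...)
    i, j, k = @index(Global, NTuple)
    @inbounds G[i, j, k] = tendency(i, j, k, args...)
end

# Spelled-out arguments: lets the tendency call specialize and inline
@kernel function _compute_G!(G, grid, advection)
    i, j, k = @index(Global, NTuple)
    @inbounds G[i, j, k] = tendency(i, j, k, grid, advection)
end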

@navidcy (Collaborator, Author) commented Mar 24, 2023

I get:

  12.372 s (119521270 allocations: 70.48 GiB) # main
  8.320 s (66686606 allocations: 16.40 GiB) # main with downgraded KA 0.7.3 & CUDAKernels 0.3.3

A 50% slowdown and many more allocations (if that matters).

@glwagner (Member)

I think maybe we caught most of the problem kernels but not all...

@simone-silvestri (Collaborator)

Does it change with different solvers? I'll do some testing today to try to snoop out the issue

navidcy referenced this issue Mar 27, 2023: Creates functions for grid spacings
@simone-silvestri (Collaborator)

The issues with the HydrostaticFreeSurfaceModel are in the tendency kernels. The difference from the non-hydrostatic model is that we do not know a priori which RHS function to call (for example, CATKE has an :e tracer that requires a different RHS function, and the same goes for a 1-equation parameterization of mesoscales that evolves an additional tracer equation for the mesoscale energy :K).

Our solution now is to infer the RHS function and pass it as an argument to the kernel. Apparently, this prevents compilation. I'll come up with a solution today.
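
For context, a toy sketch (made-up names, not the eventual fix in #3030) of the Julia behavior that could be at play: a function argument that is only forwarded onward may not trigger specialization unless it carries a type parameter.

# rhs is only forwarded here, so Julia may compile one unspecialized method for all RHS functions
launch_tendency(rhs, G, grid) = fill_tendency!(G, grid, rhs)

# the ::F type parameter forces specialization on the particular RHS function
launch_tendency_forced(rhs::F, G, grid) where {F} = fill_tendency!(G, grid, rhs)

# the callee actually calls rhs, so it specializes once the concrete function type is known
function fill_tendency!(G, grid, rhs)
    for i in eachindex(G)
        @inbounds G[i] = rhs(i, grid)
    end
    return nothing
end

# usage: launch_tendency_forced((i, grid) -> 0.0, zeros(16), nothing)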

@simone-silvestri (Collaborator) commented Mar 27, 2023

Also, the implicit vertical solver seems to affect performance. My guess is that it's because we are passing functions as arguments to the kernel.

simone-silvestri added a commit that referenced this issue Mar 28, 2023: …2996 (#3030)
