
Don't pass function coefficients to solve! with BatchedTridiagonalSolver for vertically implicit diffusion #3030

Merged
merged 10 commits into main from ss/fix-performance
Mar 28, 2023

Conversation

simone-silvestri
Collaborator

@simone-silvestri simone-silvestri commented Mar 27, 2023

On main we are passing the tendency kernel function as an argument to the kernel.
Apparently, this defeats type inference on the CPU, causing a large slowdown.
Another place where this design was implemented is the vertically implicit solver, where we pass functions that calculate the tridiagonal matrix coefficients.

This PR fixes both problems.

After this PR we should remember that we cannot pass functions as kernel arguments,
not even as properties of a struct! Instead, we can pass Val(:function_name) and dispatch on it to call the correct function_name (as implemented in this PR for the vertically implicit solver).
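A minimal standalone sketch of this Val-based dispatch pattern (the coefficient functions below are made up for illustration; they are not the solver's actual coefficient functions):

```julia
# Two made-up coefficient functions standing in for the real
# tridiagonal-coefficient functions.
@inline lower_coefficient(i, k) = i + k
@inline upper_coefficient(i, k) = i - k

# Instead of passing the function itself into the kernel, we pass
# Val(:function_name); one get_coefficient method per name dispatches
# on the symbol, so the call is resolved at compile time.
@inline get_coefficient(::Val{:lower_coefficient}, i, k) = lower_coefficient(i, k)
@inline get_coefficient(::Val{:upper_coefficient}, i, k) = upper_coefficient(i, k)

get_coefficient(Val(:lower_coefficient), 2, 3) # = 5
```

Because the symbol lives in the Val type, the kernel argument is a singleton with a fully concrete type, so inference never sees a function-valued argument.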

Baroclinic adjustment test (with Nx = Ny = 128, Nz = 10)
on main:

[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 870.922 ms, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 16.500 minutes
[ Info:     ... simulation initialization complete (957.942 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (19.178 seconds).
[15.28%] i: 20, t: 5.500 hours, wall time: 1.732 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 18.150 minutes
[32.08%] i: 40, t: 11.550 hours, wall time: 1.376 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 19.965 minutes
[50.57%] i: 60, t: 18.205 hours, wall time: 1.333 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[69.09%] i: 80, t: 1.036 days, wall time: 1.219 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[87.61%] i: 100, t: 1.314 days, wall time: 1.175 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[ Info: Simulation is stopping after running for 7.623 minutes.
[ Info: Simulation time 1.500 days equals or exceeds stop time 1.500 days.

on this PR:

[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 9.474 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 16.500 minutes
[ Info:     ... simulation initialization complete (4.467 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (29.607 seconds).
[15.28%] i: 20, t: 5.500 hours, wall time: 45.943 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 18.150 minutes
[32.08%] i: 40, t: 11.550 hours, wall time: 14.963 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 19.965 minutes
[50.57%] i: 60, t: 18.205 hours, wall time: 15.328 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[69.09%] i: 80, t: 1.036 days, wall time: 15.417 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[87.61%] i: 100, t: 1.314 days, wall time: 14.862 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[ Info: Simulation is stopping after running for 2.050 minutes.
[ Info: Simulation time 1.500 days equals or exceeds stop time 1.500 days.

on Oceananigans v0.79.2 (with KA 0.7.2):

[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 52.447 ms, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 16.500 minutes
[ Info:     ... simulation initialization complete (53.391 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (748.379 ms).
[15.28%] i: 20, t: 5.500 hours, wall time: 14.167 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 18.150 minutes
[32.08%] i: 40, t: 11.550 hours, wall time: 14.841 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 19.965 minutes
[50.57%] i: 60, t: 18.205 hours, wall time: 14.606 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[69.09%] i: 80, t: 1.036 days, wall time: 14.924 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[87.61%] i: 100, t: 1.314 days, wall time: 13.987 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[ Info: Simulation is stopping after running for 1.358 minutes.
[ Info: Simulation time 1.500 days equals or exceeds stop time 1.500 days.

closes #2996

@navidcy
Collaborator

navidcy commented Mar 27, 2023

I made a "clean" version of the BCI example:

bench_BCI.jl
using Oceananigans
using Oceananigans.Units

Lx = 1000kilometers
Ly = 1000kilometers
Lz = 1kilometers

Nx = 64
Ny = 64
Nz = 40

grid = RectilinearGrid(size = (Nx, Ny, Nz),
                       x = (0, Lx),
                       y = (-Ly/2, Ly/2),
                       z = (-Lz, 0),
                       topology = (Periodic, Bounded, Bounded))

model = HydrostaticFreeSurfaceModel(; grid,
                                    coriolis = BetaPlane(latitude = -45),
                                    buoyancy = BuoyancyTracer(),
                                    tracers = :b,
                                    momentum_advection = WENO(),
                                    tracer_advection = WENO())

ramp(y, Δy) = min(max(0, y/Δy + 1/2), 1)

N² = 4e-6 # [s⁻²] buoyancy frequency / stratification
M² = 8e-8 # [s⁻²] horizontal buoyancy gradient

Δy = 50kilometers # width of the region of the front
Δb = Δy * M²      # buoyancy jump associated with the front
ϵb = 1e-2 * Δb    # noise amplitude

bᵢ(x, y, z) = N² * z + Δb * ramp(y, Δy) + ϵb * randn()

set!(model, b=bᵢ)

Δt₀ = 5minutes
stop_time = 40days

simulation = Simulation(model, Δt=Δt₀, stop_time=stop_time)

wizard = TimeStepWizard(cfl=0.2, max_change=1.1, max_Δt=20minutes)

simulation.callbacks[:wizard] = Callback(wizard, IterationInterval(20))

using Printf

wall_clock = [time_ns()]

function print_progress(sim)
    @printf("[%05.2f%%] i: %d, t: %s, wall time: %s, max(u): (%6.3e, %6.3e, %6.3e) m/s, next Δt: %s\n",
            100 * (sim.model.clock.time / sim.stop_time),
            sim.model.clock.iteration,
            prettytime(sim.model.clock.time),
            prettytime(1e-9 * (time_ns() - wall_clock[1])),
            maximum(abs, sim.model.velocities.u),
            maximum(abs, sim.model.velocities.v),
            maximum(abs, sim.model.velocities.w),
            prettytime(sim.Δt))

    wall_clock[1] = time_ns()
  
    return nothing
end

simulation.callbacks[:print_progress] = Callback(print_progress, IterationInterval(100))

@info "Running the simulation..."

run!(simulation)

@info "Simulation completed in " * prettytime(simulation.run_wall_time)

For me:

on main

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
WARNING: using CUDA.instantiate in module CubedSpheres conflicts with an existing identifier.
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.837 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.810 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.936 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.632 minutes, max(u): (6.985e-01, 3.311e-01, 3.422e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.307 minutes, max(u): (5.972e-01, 3.536e-01, 2.517e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.299 minutes, max(u): (5.940e-01, 2.495e-01, 1.809e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.314 minutes, max(u): (5.097e-01, 2.361e-01, 1.700e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 1.307 minutes, max(u): (4.790e-01, 2.186e-01, 1.723e-03) m/s, next Δt: 20 minutes

on v0.79.2 (with KernelAbstractions v0.7.2, CUDAKernels v0.3.3)

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 6.164 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (1.943 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (37.897 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.616 minutes, max(u): (6.986e-01, 3.323e-01, 3.400e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 57.398 seconds, max(u): (5.969e-01, 3.537e-01, 2.494e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 57.343 seconds, max(u): (5.982e-01, 2.524e-01, 1.836e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 57.110 seconds, max(u): (5.151e-01, 2.545e-01, 1.683e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 57.194 seconds, max(u): (4.737e-01, 2.410e-01, 1.623e-03) m/s, next Δt: 20 minutes
[16.45%] i: 600, t: 6.581 days, wall time: 56.511 seconds, max(u): (4.713e-01, 2.876e-01, 1.829e-03) m/s, next Δt: 20 minutes

on this PR

[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
WARNING: using CUDA.instantiate in module CubedSpheres conflicts with an existing identifier.
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.729 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.718 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.285 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.490 minutes, max(u): (6.969e-01, 3.313e-01, 3.361e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.174 minutes, max(u): (5.931e-01, 3.540e-01, 2.515e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.170 minutes, max(u): (5.970e-01, 2.574e-01, 1.871e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.175 minutes, max(u): (5.186e-01, 2.392e-01, 1.702e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 1.169 minutes, max(u): (4.768e-01, 2.218e-01, 1.660e-03) m/s, next Δt: 20 minutes

@navidcy
Collaborator

navidcy commented Mar 27, 2023

My results seem very different from yours @simone-silvestri. I'll check again later in case I did something wrong...

@simone-silvestri
Collaborator Author

To see a larger difference you have to use a vertically implicit closure.
Anyway, I'll also do some more tests.

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 27, 2023

Using your example, on main:

julia> run!(simulation)
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 9.525 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (4.249 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (19.768 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.965 minutes, max(u): (6.980e-01, 3.315e-01, 3.357e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.684 minutes, max(u): (5.929e-01, 3.535e-01, 2.520e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.583 minutes, max(u): (5.931e-01, 2.530e-01, 1.924e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.644 minutes, max(u): (5.163e-01, 2.431e-01, 1.688e-03) m/s, next Δt: 20 minutes

on this PR

julia> run!(simulation)
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 9.839 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (4.567 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (19.823 seconds).
[09.33%] i: 100, t: 11.193 hours, wall time: 1.573 minutes, max(u): (6.961e-01, 3.346e-01, 3.385e-03) m/s, next Δt: 8.858 minutes
[24.35%] i: 200, t: 1.217 days, wall time: 1.195 minutes, max(u): (5.956e-01, 3.538e-01, 2.506e-03) m/s, next Δt: 14.266 minutes
[48.30%] i: 300, t: 2.415 days, wall time: 1.199 minutes, max(u): (5.931e-01, 2.529e-01, 1.828e-03) m/s, next Δt: 20 minutes
[76.07%] i: 400, t: 3.804 days, wall time: 1.210 minutes, max(u): (5.090e-01, 2.369e-01, 1.658e-03) m/s, next Δt: 20 minutes

on Oceananigans 0.79.2

julia> run!(simulation)
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 8.533 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (2.251 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (35.854 seconds).
[11.66%] i: 100, t: 11.193 hours, wall time: 1.961 minutes, max(u): (6.968e-01, 3.324e-01, 3.355e-03) m/s, next Δt: 8.858 minutes
[30.44%] i: 200, t: 1.217 days, wall time: 1.180 minutes, max(u): (5.960e-01, 3.550e-01, 2.434e-03) m/s, next Δt: 14.266 minutes
[60.37%] i: 300, t: 2.415 days, wall time: 1.200 minutes, max(u): (5.939e-01, 2.541e-01, 1.830e-03) m/s, next Δt: 20 minutes
[95.09%] i: 400, t: 3.804 days, wall time: 1.269 minutes, max(u): (5.162e-01, 2.501e-01, 1.745e-03) m/s, next Δt: 20 minutes

Might it be the threading?

@navidcy
Collaborator

navidcy commented Mar 28, 2023

To see a larger difference you have to use a vertically implicit closure.
Anyway, I'll also do some more tests.

But you reported results with the BCI example, right? Or did you use a different solver?

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 28, 2023

I used validation/mesoscale_turbulence/baroclinic_adjustment.jl, but I reduced the grid size. The big problem on main is the implicit solver, so if you don't use it you only see the difference from the tracer tendencies, which is smaller. The results above, however, were obtained with exactly the same code you posted (on a Mac M1 with one thread, though; you use 6, so maybe that is the root cause of the difference).

@navidcy
Collaborator

navidcy commented Mar 28, 2023

@simone-silvestri will the changes here impact CATKE on CliMA/ClimaOcean.jl#17 ?

@navidcy
Collaborator

navidcy commented Mar 28, 2023

I used validation/mesoscale_turbulence/baroclinic_adjustment.jl, but I reduced the grid size. The big problem on main is the implicit solver, so if you don't use it you only see the difference from the tracer tendencies, which is smaller. The results above, however, were obtained with exactly the same code you posted (on a Mac M1 with one thread, though; you use 6, so maybe that is the root cause of the difference).

with one thread

on main

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.775 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.851 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.536 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.978 minutes, max(u): (6.965e-01, 3.325e-01, 3.423e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.597 minutes, max(u): (5.961e-01, 3.542e-01, 2.452e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.589 minutes, max(u): (5.953e-01, 2.532e-01, 1.842e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.582 minutes, max(u): (5.110e-01, 2.416e-01, 1.716e-03) m/s, next Δt: 20 minutes

on v0.79.2 (with KernelAbstractions v0.7.2, CUDAKernels v0.3.3)

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 5.919 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (1.889 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (35.835 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.815 minutes, max(u): (6.956e-01, 3.317e-01, 3.438e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.204 minutes, max(u): (5.965e-01, 3.522e-01, 2.478e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.205 minutes, max(u): (5.984e-01, 2.521e-01, 1.876e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.188 minutes, max(u): (5.077e-01, 2.352e-01, 1.688e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 1.190 minutes, max(u): (4.808e-01, 2.206e-01, 1.666e-03) m/s, next Δt: 20 minutes
[16.45%] i: 600, t: 6.581 days, wall time: 1.177 minutes, max(u): (4.862e-01, 2.322e-01, 1.957e-03) m/s, next Δt: 20 minutes
[19.93%] i: 700, t: 7.970 days, wall time: 1.195 minutes, max(u): (5.148e-01, 3.746e-01, 2.576e-03) m/s, next Δt: 20 minutes
[23.40%] i: 800, t: 9.359 days, wall time: 1.178 minutes, max(u): (5.468e-01, 5.943e-01, 3.142e-03) m/s, next Δt: 20 minutes
[26.42%] i: 900, t: 10.570 days, wall time: 1.182 minutes, max(u): (7.035e-01, 7.189e-01, 3.633e-03) m/s, next Δt: 17.880 minutes
[29.46%] i: 1000, t: 11.785 days, wall time: 1.214 minutes, max(u): (8.409e-01, 6.869e-01, 5.817e-03) m/s, next Δt: 14.325 minutes
[32.28%] i: 1100, t: 12.913 days, wall time: 1.196 minutes, max(u): (8.707e-01, 7.634e-01, 4.171e-03) m/s, next Δt: 19.489 minutes

on this PR

[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.863 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.951 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.668 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.645 minutes, max(u): (6.978e-01, 3.308e-01, 3.392e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.283 minutes, max(u): (5.974e-01, 3.566e-01, 2.491e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.294 minutes, max(u): (5.923e-01, 2.544e-01, 1.822e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.289 minutes, max(u): (5.164e-01, 2.414e-01, 1.701e-03) m/s, next Δt: 20 minutes

@navidcy
Collaborator

navidcy commented Mar 28, 2023

But it's the same solver in the validation script also, right? The FFT solver...
Is there any other difference?

@navidcy
Collaborator

navidcy commented Mar 28, 2023

OK, now my results are similar to yours -- so it was the threading.

@navidcy navidcy changed the title Fix performance issue #2996 Fix performance on CPU after upgrade to KernelAbstractions 0.8; issue #2996 Mar 28, 2023
@simone-silvestri
Collaborator Author

But it's the same solver in the validation script also, right? The FFT solver... Is there any other difference?

Not the implicit free surface solver; the vertical tridiagonal solver that solves for implicit diffusion of tracers and momentum in the vertical.

@simone-silvestri
Collaborator Author

@simone-silvestri will the changes here impact CATKE on CliMA/ClimaOcean.jl#17 ?

Not on the GPU; this PR affects only CPU performance.

@simone-silvestri simone-silvestri merged commit f4bbce1 into main Mar 28, 2023
@simone-silvestri simone-silvestri deleted the ss/fix-performance branch March 28, 2023 11:55
@navidcy
Collaborator

navidcy commented Mar 28, 2023

@simone-silvestri will the changes here impact CATKE on CliMA/ClimaOcean.jl#17 ?

Not on the GPU; this PR affects only CPU performance.

Sorry, I was not clear... I was asking whether any syntax for adding parameterizations with additional tracers changed. (But I think not, right?)

@simone-silvestri
Collaborator Author

No no, the syntax stays the same.

- upper_diagonal = maybe_tupled_ivd_upper_diagonal)
+ lower_diagonal = Val(:maybe_tupled_ivd_lower_diagonal),
+ diagonal = Val(:ivd_diagonal),
+ upper_diagonal = Val(:maybe_tupled_ivd_upper_diagonal))
Member

This seems complicated. Can we come up with a better design?

Collaborator

I'll raise an issue?

The thing is that we can't pass a function as an argument to a kernel.

Member

That can't quite be true, because we pass forcing into the tendency kernels.

Still, there's no reason why we have to use the "function" interface here. There are many possible solutions: array-like structs, new types + extending get_coefficient (probably the best solution), or redesigning the solver. It's not as if there's only one solution.
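For concreteness, the "new types + extending get_coefficient" option could look something like this hypothetical sketch (types and signatures invented for illustration, not taken from the PR):

```julia
# One singleton type per coefficient; the solver would store instances
# of these instead of functions or Val(:name) tags.
struct LowerDiagonalCoefficient end
struct UpperDiagonalCoefficient end

# Extending get_coefficient for each type plays the same role as
# Val-based dispatch, but with named, documentable types that other
# closures could also extend.
@inline get_coefficient(::LowerDiagonalCoefficient, i, k) = i + k
@inline get_coefficient(::UpperDiagonalCoefficient, i, k) = i - k

get_coefficient(LowerDiagonalCoefficient(), 2, 3) # = 5
```

Like Val, each singleton is a zero-size, concretely typed kernel argument, so inference is preserved.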

Collaborator Author

Maybe I am wrong, but new types + extending get_coefficient sounds a bit like this approach with types instead of Val(:function_name). I guess redesigning the solvers might be a long-term solution that we can explore.

@glwagner
Member

After this PR we should remember that we cannot pass functions as kernel arguments,
not even as properties of a struct!

This can't be true because, for example, the forcing we pass into the tendency kernels is a NamedTuple of functions.

not even as properties of a struct!

This might be the key.

@glwagner
Member

Instead we can pass Val(:function_name) and dispatch on that to call the correct function_name (as implemented in this PR for the vertically implicit solver)

Let's do better than this?

Comment on lines -76 to +73
- return hydrostatic_turbulent_kinetic_energy_tendency, mews_closure, mews_diffusivity_fields
+ return calculate_hydrostatic_free_surface_Ge!, mews_closure, mews_diffusivity_fields
Member

I like the old name, it's more clear. Can we change it back?

Collaborator Author

this is the kernel, not the function

  i, j, k = @index(Global, NTuple)
- @inbounds Gc[i, j, k] = tendency_kernel_function(i, j, k, grid, args...)
+ @inbounds Ge[i, j, k] = hydrostatic_turbulent_kinetic_energy_tendency(i, j, k, grid, args...)
Member

Formatting inconsistency

Comment on lines +137 to +139
@inline get_coefficient(::Val{:maybe_tupled_ivd_lower_diagonal}, i, j, k, grid, p, args...) = maybe_tupled_ivd_lower_diagonal(i, j, k, grid, args...)
@inline get_coefficient(::Val{:maybe_tupled_ivd_upper_diagonal}, i, j, k, grid, p, args...) = maybe_tupled_ivd_upper_diagonal(i, j, k, grid, args...)
@inline get_coefficient(::Val{:ivd_diagonal}, i, j, k, grid, p, args...) = ivd_diagonal(i, j, k, grid, args...)
Member

What is the advantage of dispatching on Val, versus another type?

Collaborator Author

I would say using Val or a type results in the same behavior, unless you had another idea for the type. Val avoids having to define a new type.

Maybe we could actually try

@inline get_coefficient(::Val{:function_name}, i, j, k, grid, p, args...) where function_name = function_name(i, j, k, grid, p, args...)

and see if it compiles (probably not).
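The nearest syntactically valid form of that idea would bind the symbol as a type parameter and look the function up by name; a speculative, untested sketch (not code from this PR):

```julia
# Speculative: `where name` binds the Symbol inside Val{name} as a type
# parameter, and getfield looks the function up in the enclosing module
# by that name. Even if this compiles, the lookup is a global access,
# which is exactly the kind of indirection that can defeat inference
# inside a kernel.
@inline get_coefficient(::Val{name}, i, j, k, grid, p, args...) where name =
    getfield(@__MODULE__, name)(i, j, k, grid, args...)
```

This is why the PR spells out one explicit get_coefficient method per coefficient function instead.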

@glwagner glwagner changed the title Fix performance on CPU after upgrade to KernelAbstractions 0.8; issue #2996 Don't function coefficients to BatchedTridiagonalSolver for vertically implicit diffusion Apr 3, 2023
@glwagner glwagner changed the title Don't function coefficients to BatchedTridiagonalSolver for vertically implicit diffusion Don't pass function coefficients to solve! with BatchedTridiagonalSolver for vertically implicit diffusion Apr 3, 2023
Development

Successfully merging this pull request may close these issues.

10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?)