
Don't pass function coefficients to solve! with BatchedTridiagonalSolver for vertically implicit diffusion #3030

Merged
merged 10 commits into main from ss/fix-performance
Mar 28, 2023

Conversation

simone-silvestri
Collaborator

@simone-silvestri simone-silvestri commented Mar 27, 2023

On main we are passing the tendency kernel function as an argument to the kernel.
Apparently, this defeats type inference on the CPU, causing a large slowdown.
Another place where this design was implemented is the vertically implicit solver, where we pass functions that calculate the tridiagonal matrix coefficients.

This PR fixes both problems.

After this PR we should remember that we cannot pass functions as kernel arguments,
not even as properties of a struct! Instead, we can pass Val(:function_name) and dispatch on it to call the correct function_name (as implemented in this PR for the vertically implicit solver).
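A minimal standalone sketch of this Val-based dispatch pattern (the coefficient functions below are made up for illustration; they are not the solver's actual coefficient functions):

```julia
# Two made-up coefficient functions standing in for the real
# tridiagonal-coefficient functions.
@inline lower_coefficient(i, k) = i + k
@inline upper_coefficient(i, k) = i - k

# Instead of passing the function itself into the kernel, we pass
# Val(:function_name); one get_coefficient method per name dispatches
# on the symbol, so the call is resolved at compile time.
@inline get_coefficient(::Val{:lower_coefficient}, i, k) = lower_coefficient(i, k)
@inline get_coefficient(::Val{:upper_coefficient}, i, k) = upper_coefficient(i, k)

get_coefficient(Val(:lower_coefficient), 2, 3) # = 5
```

Because the symbol lives in the Val type, the kernel argument is a singleton with a fully concrete type, so inference never sees a function-valued argument.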

Baroclinic adjustment test (with Nx = Ny = 128, Nz = 10)
on main:

[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 870.922 ms, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 16.500 minutes
[ Info:     ... simulation initialization complete (957.942 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (19.178 seconds).
[15.28%] i: 20, t: 5.500 hours, wall time: 1.732 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 18.150 minutes
[32.08%] i: 40, t: 11.550 hours, wall time: 1.376 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 19.965 minutes
[50.57%] i: 60, t: 18.205 hours, wall time: 1.333 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[69.09%] i: 80, t: 1.036 days, wall time: 1.219 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[87.61%] i: 100, t: 1.314 days, wall time: 1.175 minutes, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[ Info: Simulation is stopping after running for 7.623 minutes.
[ Info: Simulation time 1.500 days equals or exceeds stop time 1.500 days.

on this PR:

[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 9.474 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 16.500 minutes
[ Info:     ... simulation initialization complete (4.467 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (29.607 seconds).
[15.28%] i: 20, t: 5.500 hours, wall time: 45.943 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 18.150 minutes
[32.08%] i: 40, t: 11.550 hours, wall time: 14.963 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 19.965 minutes
[50.57%] i: 60, t: 18.205 hours, wall time: 15.328 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[69.09%] i: 80, t: 1.036 days, wall time: 15.417 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[87.61%] i: 100, t: 1.314 days, wall time: 14.862 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[ Info: Simulation is stopping after running for 2.050 minutes.
[ Info: Simulation time 1.500 days equals or exceeds stop time 1.500 days.

on Oceananigans v0.79.2 (with KA 0.7.2):

[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 52.447 ms, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 16.500 minutes
[ Info:     ... simulation initialization complete (53.391 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (748.379 ms).
[15.28%] i: 20, t: 5.500 hours, wall time: 14.167 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 18.150 minutes
[32.08%] i: 40, t: 11.550 hours, wall time: 14.841 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 19.965 minutes
[50.57%] i: 60, t: 18.205 hours, wall time: 14.606 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[69.09%] i: 80, t: 1.036 days, wall time: 14.924 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[87.61%] i: 100, t: 1.314 days, wall time: 13.987 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 20 minutes
[ Info: Simulation is stopping after running for 1.358 minutes.
[ Info: Simulation time 1.500 days equals or exceeds stop time 1.500 days.

closes #2996

@navidcy
Collaborator

navidcy commented Mar 27, 2023

I made a "clean" version of the BCI example:

bench_BCI.jl
using Oceananigans
using Oceananigans.Units

Lx = 1000kilometers
Ly = 1000kilometers
Lz = 1kilometers

Nx = 64
Ny = 64
Nz = 40

grid = RectilinearGrid(size = (Nx, Ny, Nz),
                       x = (0, Lx),
                       y = (-Ly/2, Ly/2),
                       z = (-Lz, 0),
                       topology = (Periodic, Bounded, Bounded))

model = HydrostaticFreeSurfaceModel(; grid,
                                    coriolis = BetaPlane(latitude = -45),
                                    buoyancy = BuoyancyTracer(),
                                    tracers = :b,
                                    momentum_advection = WENO(),
                                    tracer_advection = WENO())

ramp(y, Δy) = min(max(0, y/Δy + 1/2), 1)

N² = 4e-6 # [s⁻²] buoyancy frequency / stratification
M² = 8e-8 # [s⁻²] horizontal buoyancy gradient

Δy = 50kilometers # width of the region of the front
Δb = Δy * M²      # buoyancy jump associated with the front
ϵb = 1e-2 * Δb    # noise amplitude

bᵢ(x, y, z) = N² * z + Δb * ramp(y, Δy) + ϵb * randn()

set!(model, b=bᵢ)

Δt₀ = 5minutes
stop_time = 40days

simulation = Simulation(model, Δt=Δt₀, stop_time=stop_time)

wizard = TimeStepWizard(cfl=0.2, max_change=1.1, max_Δt=20minutes)

simulation.callbacks[:wizard] = Callback(wizard, IterationInterval(20))

using Printf

wall_clock = [time_ns()]

function print_progress(sim)
    @printf("[%05.2f%%] i: %d, t: %s, wall time: %s, max(u): (%6.3e, %6.3e, %6.3e) m/s, next Δt: %s\n",
            100 * (sim.model.clock.time / sim.stop_time),
            sim.model.clock.iteration,
            prettytime(sim.model.clock.time),
            prettytime(1e-9 * (time_ns() - wall_clock[1])),
            maximum(abs, sim.model.velocities.u),
            maximum(abs, sim.model.velocities.v),
            maximum(abs, sim.model.velocities.w),
            prettytime(sim.Δt))

    wall_clock[1] = time_ns()
  
    return nothing
end

simulation.callbacks[:print_progress] = Callback(print_progress, IterationInterval(100))

@info "Running the simulation..."

run!(simulation)

@info "Simulation completed in " * prettytime(simulation.run_wall_time)

For me:

on main

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
WARNING: using CUDA.instantiate in module CubedSpheres conflicts with an existing identifier.
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.837 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.810 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.936 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.632 minutes, max(u): (6.985e-01, 3.311e-01, 3.422e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.307 minutes, max(u): (5.972e-01, 3.536e-01, 2.517e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.299 minutes, max(u): (5.940e-01, 2.495e-01, 1.809e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.314 minutes, max(u): (5.097e-01, 2.361e-01, 1.700e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 1.307 minutes, max(u): (4.790e-01, 2.186e-01, 1.723e-03) m/s, next Δt: 20 minutes

on v0.79.2 (with KernelAbstractions v0.7.2, CUDAKernels v0.3.3)

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 6.164 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (1.943 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (37.897 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.616 minutes, max(u): (6.986e-01, 3.323e-01, 3.400e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 57.398 seconds, max(u): (5.969e-01, 3.537e-01, 2.494e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 57.343 seconds, max(u): (5.982e-01, 2.524e-01, 1.836e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 57.110 seconds, max(u): (5.151e-01, 2.545e-01, 1.683e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 57.194 seconds, max(u): (4.737e-01, 2.410e-01, 1.623e-03) m/s, next Δt: 20 minutes
[16.45%] i: 600, t: 6.581 days, wall time: 56.511 seconds, max(u): (4.713e-01, 2.876e-01, 1.829e-03) m/s, next Δt: 20 minutes

on this PR

[ Info: Precompiling Oceananigans [9e8cae18-63c1-5223-a75c-80ca9d6e9a09]
WARNING: using CUDA.instantiate in module CubedSpheres conflicts with an existing identifier.
[ Info: Oceananigans will use 6 threads
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.729 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.718 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.285 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.490 minutes, max(u): (6.969e-01, 3.313e-01, 3.361e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.174 minutes, max(u): (5.931e-01, 3.540e-01, 2.515e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.170 minutes, max(u): (5.970e-01, 2.574e-01, 1.871e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.175 minutes, max(u): (5.186e-01, 2.392e-01, 1.702e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 1.169 minutes, max(u): (4.768e-01, 2.218e-01, 1.660e-03) m/s, next Δt: 20 minutes

@navidcy
Collaborator

navidcy commented Mar 27, 2023

My results seem very different from yours @simone-silvestri. I'll check again later in case I did something wrong...

@simone-silvestri
Collaborator Author

To see a larger difference you have to use a vertically implicit closure.
Anyway, I'll also do some more tests.

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 27, 2023

Using your example, on main:

julia> run!(simulation)
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 9.525 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (4.249 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (19.768 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.965 minutes, max(u): (6.980e-01, 3.315e-01, 3.357e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.684 minutes, max(u): (5.929e-01, 3.535e-01, 2.520e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.583 minutes, max(u): (5.931e-01, 2.530e-01, 1.924e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.644 minutes, max(u): (5.163e-01, 2.431e-01, 1.688e-03) m/s, next Δt: 20 minutes

on this PR

julia> run!(simulation)
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 9.839 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (4.567 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (19.823 seconds).
[09.33%] i: 100, t: 11.193 hours, wall time: 1.573 minutes, max(u): (6.961e-01, 3.346e-01, 3.385e-03) m/s, next Δt: 8.858 minutes
[24.35%] i: 200, t: 1.217 days, wall time: 1.195 minutes, max(u): (5.956e-01, 3.538e-01, 2.506e-03) m/s, next Δt: 14.266 minutes
[48.30%] i: 300, t: 2.415 days, wall time: 1.199 minutes, max(u): (5.931e-01, 2.529e-01, 1.828e-03) m/s, next Δt: 20 minutes
[76.07%] i: 400, t: 3.804 days, wall time: 1.210 minutes, max(u): (5.090e-01, 2.369e-01, 1.658e-03) m/s, next Δt: 20 minutes

on Oceananigans 0.79.2

julia> run!(simulation)
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 8.533 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (2.251 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (35.854 seconds).
[11.66%] i: 100, t: 11.193 hours, wall time: 1.961 minutes, max(u): (6.968e-01, 3.324e-01, 3.355e-03) m/s, next Δt: 8.858 minutes
[30.44%] i: 200, t: 1.217 days, wall time: 1.180 minutes, max(u): (5.960e-01, 3.550e-01, 2.434e-03) m/s, next Δt: 14.266 minutes
[60.37%] i: 300, t: 2.415 days, wall time: 1.200 minutes, max(u): (5.939e-01, 2.541e-01, 1.830e-03) m/s, next Δt: 20 minutes
[95.09%] i: 400, t: 3.804 days, wall time: 1.269 minutes, max(u): (5.162e-01, 2.501e-01, 1.745e-03) m/s, next Δt: 20 minutes

Might it be the threading?

@navidcy
Collaborator

navidcy commented Mar 28, 2023

To see a larger difference you have to use a vertically implicit closure.
Anyway, I'll also do some more tests.

But you reported results with the BCI example, right? Or did you use a different solver?

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 28, 2023

I used validation/mesoscale_turbulence/baroclinic_adjustment.jl, but I reduced the grid size. The big problem on main is the implicit solver, so if you don't use it you only see the difference from the tracer tendencies, which is smaller. The results above, however, were obtained with exactly the same code you posted (on a Mac M1 with one thread, though; you use 6, so maybe that is the root cause of the difference).

@navidcy
Collaborator

navidcy commented Mar 28, 2023

@simone-silvestri will the changes here impact CATKE on CliMA/ClimaOcean.jl#17 ?

@navidcy
Collaborator

navidcy commented Mar 28, 2023

I used validation/mesoscale_turbulence/baroclinic_adjustment.jl, but I reduced the grid size. The big problem on main is the implicit solver, so if you don't use it you only see the difference from the tracer tendencies, which is smaller. The results above, however, were obtained with exactly the same code you posted (on a Mac M1 with one thread, though; you use 6, so maybe that is the root cause of the difference).

with one thread

on main

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.775 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.851 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.536 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.978 minutes, max(u): (6.965e-01, 3.325e-01, 3.423e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.597 minutes, max(u): (5.961e-01, 3.542e-01, 2.452e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.589 minutes, max(u): (5.953e-01, 2.532e-01, 1.842e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.582 minutes, max(u): (5.110e-01, 2.416e-01, 1.716e-03) m/s, next Δt: 20 minutes

on v0.79.2 (with KernelAbstractions v0.7.2, CUDAKernels v0.3.3)

julia> include("/Users/navid/Desktop/bench_BCI.jl")
[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 5.919 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (1.889 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (35.835 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.815 minutes, max(u): (6.956e-01, 3.317e-01, 3.438e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.204 minutes, max(u): (5.965e-01, 3.522e-01, 2.478e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.205 minutes, max(u): (5.984e-01, 2.521e-01, 1.876e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.188 minutes, max(u): (5.077e-01, 2.352e-01, 1.688e-03) m/s, next Δt: 20 minutes
[12.98%] i: 500, t: 5.193 days, wall time: 1.190 minutes, max(u): (4.808e-01, 2.206e-01, 1.666e-03) m/s, next Δt: 20 minutes
[16.45%] i: 600, t: 6.581 days, wall time: 1.177 minutes, max(u): (4.862e-01, 2.322e-01, 1.957e-03) m/s, next Δt: 20 minutes
[19.93%] i: 700, t: 7.970 days, wall time: 1.195 minutes, max(u): (5.148e-01, 3.746e-01, 2.576e-03) m/s, next Δt: 20 minutes
[23.40%] i: 800, t: 9.359 days, wall time: 1.178 minutes, max(u): (5.468e-01, 5.943e-01, 3.142e-03) m/s, next Δt: 20 minutes
[26.42%] i: 900, t: 10.570 days, wall time: 1.182 minutes, max(u): (7.035e-01, 7.189e-01, 3.633e-03) m/s, next Δt: 17.880 minutes
[29.46%] i: 1000, t: 11.785 days, wall time: 1.214 minutes, max(u): (8.409e-01, 6.869e-01, 5.817e-03) m/s, next Δt: 14.325 minutes
[32.28%] i: 1100, t: 12.913 days, wall time: 1.196 minutes, max(u): (8.707e-01, 7.634e-01, 4.171e-03) m/s, next Δt: 19.489 minutes

on this PR

[ Info: Running the simulation...
[ Info: Initializing simulation...
[00.00%] i: 0, t: 0 seconds, wall time: 7.863 seconds, max(u): (0.000e+00, 0.000e+00, 0.000e+00) m/s, next Δt: 5.500 minutes
[ Info:     ... simulation initialization complete (3.951 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (18.668 seconds).
[01.17%] i: 100, t: 11.193 hours, wall time: 1.645 minutes, max(u): (6.978e-01, 3.308e-01, 3.392e-03) m/s, next Δt: 8.858 minutes
[03.04%] i: 200, t: 1.217 days, wall time: 1.283 minutes, max(u): (5.974e-01, 3.566e-01, 2.491e-03) m/s, next Δt: 14.266 minutes
[06.04%] i: 300, t: 2.415 days, wall time: 1.294 minutes, max(u): (5.923e-01, 2.544e-01, 1.822e-03) m/s, next Δt: 20 minutes
[09.51%] i: 400, t: 3.804 days, wall time: 1.289 minutes, max(u): (5.164e-01, 2.414e-01, 1.701e-03) m/s, next Δt: 20 minutes

@navidcy
Collaborator

navidcy commented Mar 28, 2023

But it's the same solver in the validation script also, right? The FFT solver...
Is there any other difference?

@navidcy
Collaborator

navidcy commented Mar 28, 2023

OK, now my results are similar to yours -- so it was the threading.

@navidcy navidcy changed the title Fix performance issue #2996 Fix performance on CPU after upgrade to KernelAbstractions 0.8; issue #2996 Mar 28, 2023
@simone-silvestri
Collaborator Author

But it's the same solver in the validation script also, right? The FFT solver... Is there any other difference?

Not the implicit free surface solver; the vertical tridiagonal solver that solves for implicit diffusion of tracers and momentum in the vertical.

@simone-silvestri
Collaborator Author

@simone-silvestri will the changes here impact CATKE on CliMA/ClimaOcean.jl#17 ?

Not on the GPU; this PR affects only CPU performance.

@simone-silvestri simone-silvestri merged commit f4bbce1 into main Mar 28, 2023
@simone-silvestri simone-silvestri deleted the ss/fix-performance branch March 28, 2023 11:55
@navidcy
Collaborator

navidcy commented Mar 28, 2023

@simone-silvestri will the changes here impact CATKE on CliMA/ClimaOcean.jl#17 ?

Not on the GPU; this PR affects only CPU performance.

Sorry, I was not clear... I was asking whether any syntax for adding parameterizations with additional tracers changed. (But I think not, right?)

@simone-silvestri
Collaborator Author

No no, the syntax stays the same.

- upper_diagonal = maybe_tupled_ivd_upper_diagonal)
+ lower_diagonal = Val(:maybe_tupled_ivd_lower_diagonal),
+ diagonal = Val(:ivd_diagonal),
+ upper_diagonal = Val(:maybe_tupled_ivd_upper_diagonal))
Member

This seems complicated. Can we come up with a better design?

Collaborator

I'll raise an issue?

The thing is that we can't pass a function as an argument to a kernel.

Member

That can't quite be true, because we pass forcing into the tendency kernels.

Still, there's no reason why we have to use the "function" interface here. There are many possible solutions: array-like structs, new types + extending get_coefficient (probably the best solution), or redesigning the solver. It's not as if there's only one solution.
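For concreteness, the "new types + extending get_coefficient" option could look something like this hypothetical sketch (types and signatures invented for illustration, not taken from the PR):

```julia
# One singleton type per coefficient; the solver would store instances
# of these instead of functions or Val(:name) tags.
struct LowerDiagonalCoefficient end
struct UpperDiagonalCoefficient end

# Extending get_coefficient for each type plays the same role as
# Val-based dispatch, but with named, documentable types that other
# closures could also extend.
@inline get_coefficient(::LowerDiagonalCoefficient, i, k) = i + k
@inline get_coefficient(::UpperDiagonalCoefficient, i, k) = i - k

get_coefficient(LowerDiagonalCoefficient(), 2, 3) # = 5
```

Like Val, each singleton is a zero-size, concretely typed kernel argument, so inference is preserved.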

Collaborator Author

Maybe I am wrong, but new types + extending get_coefficient sounds a bit like this approach with types instead of Val(:function_name). I guess redesigning the solvers might be a long-term solution that we can explore.

@glwagner
Member

After this PR we should remember that we cannot pass functions as kernel arguments,
not even as properties of a struct!

This can't be true because, for example, the forcing we pass into the tendency kernels is a NamedTuple of functions.

not even as properties of a struct!

This might be the key.

@glwagner
Member

Instead we can pass Val(:function_name) and dispatch on that to call the correct function_name (as implemented in this PR for the vertically implicit solver)

Let's do better than this?

Comment on lines -76 to +73
- return hydrostatic_turbulent_kinetic_energy_tendency, mews_closure, mews_diffusivity_fields
+ return calculate_hydrostatic_free_surface_Ge!, mews_closure, mews_diffusivity_fields
Member

I like the old name, it's more clear. Can we change it back?

Collaborator Author

this is the kernel, not the function

  i, j, k = @index(Global, NTuple)
- @inbounds Gc[i, j, k] = tendency_kernel_function(i, j, k, grid, args...)
+ @inbounds Ge[i, j, k] = hydrostatic_turbulent_kinetic_energy_tendency(i, j, k, grid, args...)
Member

Formatting inconsistency

Comment on lines +137 to +139
@inline get_coefficient(::Val{:maybe_tupled_ivd_lower_diagonal}, i, j, k, grid, p, args...) = maybe_tupled_ivd_lower_diagonal(i, j, k, grid, args...)
@inline get_coefficient(::Val{:maybe_tupled_ivd_upper_diagonal}, i, j, k, grid, p, args...) = maybe_tupled_ivd_upper_diagonal(i, j, k, grid, args...)
@inline get_coefficient(::Val{:ivd_diagonal}, i, j, k, grid, p, args...) = ivd_diagonal(i, j, k, grid, args...)
Member

What is the advantage of dispatching on Val, versus another type?

Collaborator Author

I would say using Val or a type results in the same behavior, unless you had another idea for the type. Val avoids having to define a new type.

Maybe we could actually try

@inline get_coefficient(::Val{:function_name}, i, j, k, grid, p, args...) where function_name = function_name(i, j, k, grid, p, args...)

and see if it compiles (probably not).
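The nearest syntactically valid form of that idea would bind the symbol as a type parameter and look the function up by name; a speculative, untested sketch (not code from this PR):

```julia
# Speculative: `where name` binds the Symbol inside Val{name} as a type
# parameter, and getfield looks the function up in the enclosing module
# by that name. Even if this compiles, the lookup is a global access,
# which is exactly the kind of indirection that can defeat inference
# inside a kernel.
@inline get_coefficient(::Val{name}, i, j, k, grid, p, args...) where name =
    getfield(@__MODULE__, name)(i, j, k, grid, args...)
```

This is why the PR spells out one explicit get_coefficient method per coefficient function instead.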

@glwagner glwagner changed the title Fix performance on CPU after upgrade to KernelAbstractions 0.8; issue #2996 Don't function coefficients to BatchedTridiagonalSolver for vertically implicit diffusion Apr 3, 2023
@glwagner glwagner changed the title Don't function coefficients to BatchedTridiagonalSolver for vertically implicit diffusion Don't pass function coefficients to solve! with BatchedTridiagonalSolver for vertically implicit diffusion Apr 3, 2023
Development

Successfully merging this pull request may close these issues.

10-100x slowdown on CPU after upgrade to KernelAbstractions 0.8 (due to type inference failure?)