
Oceananigans is much slower when using relaxation forcing functions #1827

Closed
tomchor opened this issue Jul 6, 2021 · 24 comments
Labels
performance 🏍️ So we can get the wrong answer even faster

Comments

@tomchor
Collaborator

tomchor commented Jul 6, 2021

I've noticed that Oceananigans is much slower when using forcing functions. As an example, I set up a simulation without any forcing functions and found that in the first minute of wall time it completes 3.5% of the whole simulation period.

However, if I include forcing functions as

bot_mask = GaussianMask{:z}(center=minimum(znodes(Face, grid)), width=grid.Lz/10)
mom_sponge = Relaxation(rate=1/10, mask=bot_mask, target=0)
forcing = (u=mom_sponge, v=mom_sponge, w=mom_sponge)

then in the first minute of wall time it completes only 0.15% of the simulation: basically around 20 times slower!

I of course expected a slowdown after including forcing functions, but not by this much. Is this normal behavior?

So far I've run my tests only on CPUs, but I've observed similar behavior on GPUs.
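
For reference, a minimal sketch of how a relaxation sponge like the one above is attached to a model. The grid dimensions are illustrative and the constructor names follow the Oceananigans API from around the time of this thread; this is not the actual setup used in this issue.

using Oceananigans

# Illustrative grid; the real simulation behind the numbers above is not shown.
grid = RegularRectilinearGrid(size=(32, 32, 32), extent=(1000, 1000, 100))

bot_mask = GaussianMask{:z}(center=minimum(znodes(Face, grid)), width=grid.Lz/10)
mom_sponge = Relaxation(rate=1/10, mask=bot_mask, target=0)
forcing = (u=mom_sponge, v=mom_sponge, w=mom_sponge)

# Forcings are passed to the model through the `forcing` keyword argument.
model = IncompressibleModel(grid=grid, forcing=forcing)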

@glwagner
Member

glwagner commented Jul 7, 2021

I think the problem is not forcing functions in general, but Relaxation.

There's always a trivial forcing function present:

regularize_forcing(::Nothing, field::AbstractField, field_name, model_field_names) = zeroforcing

which is the trivial function

@inline zeroforcing(args...) = 0

So any slowdown is due to the specific forcing function that's being used.

Let's look into why Relaxation or GaussianMask is so slow. I don't think it should be. Do you mind changing the title of this issue?

If the slowdown is different on CPU than GPU that might be an important clue.

@glwagner
Member

glwagner commented Jul 7, 2021

Could be good to put together a benchmarking script for ContinuousForcing and Relaxation so that we can test ideas.

@glwagner
Member

glwagner commented Jul 7, 2021

There is the dreaded exponentiation by Int64:

@inline (g::GaussianMask{:x})(x, y, z) = exp(-(x - g.center)^2 / (2 * g.width^2))
@inline (g::GaussianMask{:y})(x, y, z) = exp(-(y - g.center)^2 / (2 * g.width^2))
@inline (g::GaussianMask{:z})(x, y, z) = exp(-(z - g.center)^2 / (2 * g.width^2))

though this shouldn't affect CPU.
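
For reference, the usual way to sidestep an integer power inside a kernel is to write the square out explicitly. A sketch (the function name is hypothetical), equivalent to the :z mask above:

# GaussianMask{:z}-style evaluation with the squares written out, so no ^2 appears.
@inline function gaussian_mask_z(z, center, width)
    δ = z - center
    return exp(-δ * δ / (2 * width * width))
end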

Relaxation uses ContinuousForcing so we should probably look into whether the problem comes from that code. It does seem possible there were changes after upgrading to 1.6.

@tomchor tomchor changed the title Oeananigans is much slower when using forcing functions Oeananigans is much slower when using relaxation forcing functions Jul 7, 2021
@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

If the slowdown is different on CPU than GPU that might be an important clue.

I haven't had a chance to test this properly on GPUs, although I believe (from experience in a less controlled scenario) a similar slowdown occurs.

Relaxation uses ContinuousForcing so we should probably look into whether the problem comes from that code. It does seem possible there were changes after upgrading to 1.6.

I also can't prove/test it right now, but I think this issue has been there since before 1.6, basically since I started using Oceananigans: I've always had simulations with Relaxation, and the time they take to run has always been pretty much the same (apart from that WENO5 issue we found a couple of weeks ago). Only now did I realize that the simulations run much faster without these forcings.

@glwagner
Member

glwagner commented Jul 7, 2021

I haven't had a chance to test this properly on GPUs, although I believe (from experience in a less controlled scenario) a similar slowdown occurs.

Hmm ok. Not sure what that would mean, but I guess that is some kind of clue.

@glwagner
Member

glwagner commented Jul 7, 2021

I think it's probable that DiscreteForcing doesn't have the same performance issues.

@ali-ramadhan put together a benchmark script for forcing functions a while ago, I thought, but it might have disappeared (because it wasn't informative?). That might've been before we had ContinuousForcing, though.

@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

I've tried using the forcing as:

@inline sponge_func(x, y, z, ϕ) = -rate * bot_mask(x, y, z) * (ϕ - 0)
sponge_u(x, y, z, t, u) = sponge_func(x, y, z, u)
sponge_v(x, y, z, t, v) = sponge_func(x, y, z, v)
sponge_w(x, y, z, t, w) = sponge_func(x, y, z, w)

forc_u = Forcing(sponge_u, field_dependencies=:u,)
forc_v = Forcing(sponge_v, field_dependencies=:v,)
forc_w = Forcing(sponge_w, field_dependencies=:w,)
forcing = (u=forc_u, v=forc_v, w=forc_w)

and the same performance issues appear. But I guess this still uses ContinuousForcing.

I've never used DiscreteForcing but I'll try to use it and see what happens. If you have any examples that would help.

@glwagner
Member

glwagner commented Jul 7, 2021

How about using just one forcing function for simplicity?

I think something like this might work:

@inline u_mask(i, j, k, grid, p) = exp(-(xnode(Face(), Center(), Center(), i, j, k, grid) - p.center)^2 / (2 * p.width^2))

@inline u_forcing_func(i, j, k, grid, clock, model_fields, p) = @inbounds - p.rate * u_mask(i, j, k, grid, p) * model_fields.u[i, j, k]

u_forcing = Forcing(u_forcing_func, discrete_form=true, parameters=(rate=1/10, center=-grid.Lz, width=grid.Lz/10))

There's another example in the docs:

https://clima.github.io/OceananigansDocumentation/stable/model_setup/forcing_functions/#%22Discrete-form%22-forcing-functions

@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

I just tested DiscreteForcing and I have almost the same slowdown (I completed 0.20% of the simulation, compared to 0.15% using ContinuousForcing). Here's what I used:

Z(k) = @inbounds -grid.Lz + grid.Δz*(k-1/2)
bottom_mask(k) = @inbounds exp(-(Z(k)+80)^2 / ((2*8)^2))

sponge_u_disc(i, j, k, grid, clock, model_fields) = @inbounds - rate * bottom_mask(k) * (model_fields.u[i, j, k] -0)
sponge_v_disc(i, j, k, grid, clock, model_fields) = @inbounds - rate * bottom_mask(k) * (model_fields.v[i, j, k] -0)
sponge_w_disc(i, j, k, grid, clock, model_fields) = @inbounds - rate * bottom_mask(k) * (model_fields.w[i, j, k] -0)

forc_u = Forcing(sponge_u_disc, discrete_form=true)
forc_v = Forcing(sponge_v_disc, discrete_form=true)
forc_w = Forcing(sponge_w_disc, discrete_form=true)

forcing = (u=forc_u, v=forc_v, w=forc_w)

I may have made rookie errors here as well since this is my first time using DiscreteForcing.

@glwagner
Member

glwagner commented Jul 7, 2021

I think you may need @inline in front of most of those functions (only matters for CPU)

@glwagner
Member

glwagner commented Jul 7, 2021

If the slowdown is the same for DiscreteForcing then the problem may really just be evaluating exp, sadly... You could try @inline bottom_mask(k) = 1 to test...

@glwagner
Member

glwagner commented Jul 7, 2021

Looks like you're referencing grid as a global in Z(k). Not sure if that causes performance issues, but it won't compile on the GPU. So you might want to propagate that argument through.

@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

I think you may need @inline in front of most of those functions (only matters for CPU)

That was my first attempt, but I get an error saying ERROR: LoadError: LoadError: -(grid.Lz) + grid.Δz * (k - 1 / 2) is not a function expression. I guess I need to choose between @inline and @inbounds?

Never mind, I was doing something very dumb. Inlining gives me exactly the same performance as not inlining (0.20% of the simulation). I guess the compiler is getting smarter about inlining.

If the slowdown is the same for DiscreteForcing then the problem may really just be evaluating exp, sadly... You could try @inline bottom_mask(k) = 1 to test...

I'll try that. Although I have tried non-exponential masks in the past with a similar slowdown, so I'm not sure if that's the issue.

@glwagner
Member

glwagner commented Jul 7, 2021

You can use both @inline and @inbounds; they mean different things. @inline is a compiler directive to inline a function (important for performance when a function is called within inner loops). We put @inline in front of functions; eg @inline f(x) = .... @inbounds elides bounds checking when an array / field is indexed into. We need @inbounds in front of any indexing operation that occurs in a loop (eg u[i, j, k]).
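
A minimal generic illustration of the two macros (plain Julia, not Oceananigans code; the names are made up):

# @inline asks the compiler to paste the function body into the call site.
@inline square(x) = x * x

# @inbounds skips bounds checking on the indexing it wraps; only use it where
# the indices are guaranteed to be valid.
function scale!(u, factor)
    for k in eachindex(u)
        @inbounds u[k] = factor * square(u[k])
    end
    return u
end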

@glwagner
Member

glwagner commented Jul 7, 2021

I guess the compiler is getting smarter about inlining.

For sure, it should inline tiny functions... we just add it because "sometimes" the compiler doesn't do the right thing. So it's conservative to add @inline, and for something like this we should always have it so that we don't have to wonder whether it's making a difference.

@glwagner
Member

glwagner commented Jul 7, 2021

Although I have tried non-exponential masks in the past with a similar slowdown, so I'm not sure if that's the issue.

I guess the key here is something that doesn't have a transcendental function. I'd be surprised if it's the issue, but it's possible, so it's worth testing.

@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

Using @inline bottom_mask(k) = 1 improves things (twice as fast as with the exponential mask), but it's still much slower than the non-forced version: about 9 times slower, to be more precise. I'm pasting the relevant part of the code below:

@inline bottom_mask(k) = 1

sponge_u_disc(i, j, k, grid, clock, model_fields) = @inbounds - rate * bottom_mask(k) * (model_fields.u[i, j, k] -0)
sponge_v_disc(i, j, k, grid, clock, model_fields) = @inbounds - rate * bottom_mask(k) * (model_fields.v[i, j, k] -0)
sponge_w_disc(i, j, k, grid, clock, model_fields) = @inbounds - rate * bottom_mask(k) * (model_fields.w[i, j, k] -0)

forc_u = Forcing(sponge_u_disc, discrete_form=true)
forc_v = Forcing(sponge_v_disc, discrete_form=true)
forc_w = Forcing(sponge_w_disc, discrete_form=true)

forcing = (u=forc_u, v=forc_v, w=forc_w)

@glwagner
Member

glwagner commented Jul 7, 2021

Can you put @inline in front of sponge_u_disc, etc?

@glwagner
Member

glwagner commented Jul 7, 2021

Again, I'm unsure whether it affects performance, but since rate is referenced as a global it needs to be const; e.g. const rate = 1/10.
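
To illustrate the point (a generic Julia sketch, not Oceananigans code): a non-const global has no fixed type, so every function that references it pays for a dynamic lookup, whereas a const global, or better an explicit parameter, does not.

rate = 1/10           # non-const global: its type may change, so uses are type-unstable
const crate = 1/10    # const global: the compiler can assume a fixed Float64

slow_sponge(u) = -rate * u         # dynamic lookup of `rate` on every call
fast_sponge(u) = -crate * u        # compiles to a plain multiply
best_sponge(u, rate) = -rate * u   # best: pass the rate explicitly (or via `parameters`)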

@glwagner
Member

glwagner commented Jul 7, 2021

9 times slower might be 3x per kernel function (maybe...)

@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

Can you put @inline in front of sponge_u_disc, etc?

Done. Same result. I also tried ContinuousForcing with bottom_mask(x, y, z) = 1 and it's slower than its discrete counterpart. Apparently DiscreteForcing is a bit faster than ContinuousForcing, everything else being the same.

Again, I'm unsure whether it affects performance, but since rate is referenced as a global it needs to be const; e.g. const rate = 1/10.

Yes! That makes a big difference! I feel silly that I forgot that. With const rate=1/10 and DiscreteForcing things are as fast as with no forcing.

Using the same "trick" with ContinuousForcing doesn't change things though. So it does seem like the source of the issue is ContinuousForcing.

I should say though, I'm having some trouble securing a GPU right now, so I haven't been able to run these tests on a GPU.

Would a MWE help here?

@tomchor
Collaborator Author

tomchor commented Jul 7, 2021

Just got a hold of a GPU. I tried this and only saw a slowdown of 10% or so with ContinuousForcing, which I guess is okay.

So this seems to be a CPU issue.

@glwagner
Member

glwagner commented Jul 7, 2021

A slowdown of 10% when introducing exp could make sense.

With const rate=1/10 and DiscreteForcing things are as fast as with no forcing.

Okay, that makes sense.

So this seems to be a CPU issue.

So the problem is that ContinuousForcing is rather slow on the CPU. That is a bit annoying. Kind of like how WENO is really slow on the CPU for unknown reasons. At least we have a workaround with DiscreteForcing...

Would a MWE help here?

I think what would help the most is a simple benchmarking script that compares identical forcing function implementations with ContinuousForcing and DiscreteForcing to no forcing. I think we only need one forcing function, and it's probably best if it's simple (e.g. - p.mu * model_fields.u[i, j, k]) and doesn't involve complicated functions like exp. Since it's only a CPU issue I think it's not the highest priority, though (it might not be something we can easily solve either...)
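
For anyone picking this up, a rough sketch of such a benchmark (untested; the grid size, drag coefficient, and time step are arbitrary, and the constructor names follow the Oceananigans API from around the time of this thread):

using Oceananigans
using Oceananigans.TimeSteppers: time_step!
using BenchmarkTools

grid = RegularRectilinearGrid(size=(32, 32, 32), extent=(1, 1, 1))

# Continuous form of a simple linear drag, -μ * u, with μ passed as a parameter
@inline continuous_drag(x, y, z, t, u, μ) = -μ * u
continuous_forcing = Forcing(continuous_drag, field_dependencies=:u, parameters=0.1)

# Discrete form of the same forcing
@inline discrete_drag(i, j, k, grid, clock, model_fields, μ) = @inbounds -μ * model_fields.u[i, j, k]
discrete_forcing = Forcing(discrete_drag, discrete_form=true, parameters=0.1)

cases = [("no forcing", NamedTuple()),
         ("continuous", (u=continuous_forcing,)),
         ("discrete",   (u=discrete_forcing,))]

for (label, forcing) in cases
    model = IncompressibleModel(grid=grid, forcing=forcing)
    time_step!(model, 1e-3)                     # compile before timing
    t = @belapsed time_step!($model, 1e-3)
    @info "$label: $t seconds per time step"
end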

@glwagner glwagner added the performance 🏍️ So we can get the wrong answer even faster label Jul 9, 2021
@navidcy navidcy changed the title Oeananigans is much slower when using relaxation forcing functions Oceananigans is much slower when using relaxation forcing functions Jul 17, 2021
@glwagner
Member

I'm closing this issue because I'm judging that it's not of current, timely relevance to Oceananigans development. If you would like to make it a higher priority, or if you think the issue was closed in error, please feel free to re-open.
