
Improve the performance of buoyancy_gradients #2530

Closed
7 of 9 tasks
szy21 opened this issue Jan 16, 2024 · 15 comments

szy21 commented Jan 16, 2024

#2456 made all the dycore simulations slower by a factor of two. Looking more closely, most of the slowdown seems to come from buoyancy_gradients. As a temporary solution, the mixing length (and thus the buoyancy gradients) calculation was moved to a callback in #2466 for runs without EDMF. With EDMF, it is still called at every timestep. It would be good if we could improve the performance of buoyancy_gradients. cc @charleskawczynski



charleskawczynski commented Feb 12, 2024

Update on this: the performance of buoyancy_gradients is indeed poor; however, the full story is that two kernels, both reached through cloud_fraction_model_callback!, are expensive:

The NVTX trace, showing the relative performance hit compared to other high level kernels:
[NVTX trace screenshot]

And the flame graph, showing what's being called with more granularity:

[flame graph screenshot]

So buoyancy_gradients accounts for about 33% of the cost added by calling cloud_fraction_model_callback!, and the quad_loop call for about 63%. There are a few other broadcast expressions, but they are relatively inexpensive, so I'll focus on these two.

I think, to start with, we should simply break buoyancy_gradients up and see if/where the performance improves dramatically. Just eyeballing it, I'd say we should hoist out ᶜgradᵥ(ᶠinterp(ϕ)) from both broadcast expressions and see where that gets us. At the very least, we could reuse these precomputed quantities.
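As an editorial illustration of the hoisting idea, here is a minimal Python stand-in for the two Julia broadcasts (interp, grad, and phi are hypothetical stand-ins for ᶠinterp, ᶜgradᵥ, and ϕ; the real operators act on ClimaCore fields):

```python
# Sketch of hoisting a shared subexpression: grad(interp(phi)) is computed
# once and reused, instead of being re-evaluated inside each expression.

def interp(phi):
    # face-style interpolation stand-in: average adjacent values
    return [(a + b) / 2 for a, b in zip(phi, phi[1:])]

def grad(f, dz=1.0):
    # finite-difference stand-in: one-sided difference
    return [(b - a) / dz for a, b in zip(f, f[1:])]

phi = [0.0, 1.0, 4.0, 9.0]

# Before: each expression recomputes grad(interp(phi)).
term_a = [2.0 * g for g in grad(interp(phi))]
term_b = [0.5 * g for g in grad(interp(phi))]

# After: hoist the shared intermediate and reuse it in both expressions.
grad_phi = grad(interp(phi))  # computed once
term_a_hoisted = [2.0 * g for g in grad_phi]
term_b_hoisted = [0.5 * g for g in grad_phi]

assert term_a == term_a_hoisted and term_b == term_b_hoisted
```

The results are identical; the saving is that the interpolation and gradient are evaluated once rather than separately inside each broadcast.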

We may need to apply some RootSolvers/Thermodynamics optimizations (e.g., CliMA/RootSolvers.jl#51) to reduce the arithmetic intensity of those kernels (they are very expensive).

Next, I'll add a GPU job to see what the performance of this kernel looks like on the GPU, since that's probably the more important target to optimize.

cc @szy21, @trontrytel, @tapios


szy21 commented Feb 13, 2024

That's interesting, thanks! I concluded that buoyancy_gradients was more expensive than quad_loop because, when I first noticed the problem, commenting out buoyancy_gradients gave a time-to-solution similar to commenting out the entire cloud_fraction_model_callback!. But maybe that isn't an accurate way to estimate it, or maybe we have improved the performance of buoyancy_gradients since then.

@trontrytel

Sounds good. If there is anything easy to optimize in Thermodynamics, I think we should start with it. We call each function 9 times when running with quadrature points, so even small improvements should show up.
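To make the multiplier concrete, here is a hedged sketch (the function, values, and the 3×3 layout are hypothetical illustrations, not the actual Thermodynamics.jl API) of why a quadrature over two SGS variables runs the expensive call 9 times per grid point:

```python
# Sketch: a 3x3 quadrature over two SGS variables (e.g. total water and
# enthalpy) evaluates the expensive thermodynamics call at every node pair,
# so any per-call saving is multiplied by 9 at every grid point.
calls = 0

def expensive_thermo(q, h):
    # hypothetical stand-in for an expensive saturation-adjustment call
    global calls
    calls += 1
    return q + h

abscissas = [-1.2247, 0.0, 1.2247]  # 3-point Gauss-Hermite nodes

total = 0.0
for dq in abscissas:
    for dh in abscissas:
        total += expensive_thermo(1.0 + dq, 2.0 + dh)

print(calls)  # → 9
```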

@charleskawczynski

Here are some updates (cc @szy21):

  • I've converted the cloud diagnostics job into a GPU job (Make cloud diagnostics job run on gpu #2684), and the kernel is significantly cheaper there (our GPU implementation is much better than our CPU implementation). So one solution is simply to move more jobs to the GPU.
  • The quad loop has room for optimization in Thermodynamics / RootSolvers (by doing less work), which will improve both CPU and GPU performance.
  • The buoyancy gradients kernel may benefit from using shared memory in our FD kernels. That means fewer global reads/writes, which are serial on the CPU and parallel on the GPU (i.e., it improves global memory traffic). Using shared memory for our FD kernels is already on our perf road map (Performance roadmap #2632), so it's not just the buoyancy gradients kernel that will benefit.


szy21 commented Feb 26, 2024

Could you post the time-to-solution for the job with and without cloud diagnostics on GPU? Other than that I'm ok with closing this issue. Thanks for all the work!

@charleskawczynski

Yep, from this PR (with 1 p100 gpu):

[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
[ Info: sypd: 5.259918262132573
[ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds

So, it's actually not bad.


szy21 commented Feb 26, 2024

> Yep, from this PR (with 1 p100 gpu):
>
> [ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
> [ Info: sypd: 5.259918262132573
> [ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds
>
> So, it's actually not bad.

And how about the one without cloud diagnostics on GPU?

@charleskawczynski

Good question, I'll convert the other one in that PR too, so that we can compare.

@charleskawczynski

Without cloud diagnostics, we have:

[ Info: solve!: 442.539 s (150425525 allocations: 19.44 GiB)
[ Info: sypd: 5.348959731633479
[ Info: wall_time_per_timestep: 230 milliseconds, 489 microseconds

cc @szy21
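For reference, a quick back-of-the-envelope check (editorial arithmetic, not output from the PR) of the overhead implied by the two solve! timings quoted in this thread:

```python
# Overhead of cloud diagnostics on the GPU, from the solve! timings above.
solve_with = 450.030     # seconds, with cloud diagnostics
solve_without = 442.539  # seconds, without
overhead = (solve_with - solve_without) / solve_without
print(f"{overhead:.1%}")  # → 1.7%
```

So on the GPU the cloud-diagnostics overhead is under 2% of time-to-solution.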


szy21 commented Feb 27, 2024

Great, thanks!

@charleskawczynski

@tapios mentioned that this is still an issue, so I'm reopening.

@charleskawczynski

We should get updated numbers.


tapios commented Apr 17, 2024

For cloud diagnostics, the issue is not directly the buoyancy gradients, but the gradients of moisture/enthalpy. This may be related to the buoyancy gradient issue, though. @szy21 knows more.


szy21 commented Apr 18, 2024

The latest build has 2.03 SYPD for Held-Suarez and 1.47 SYPD for Held-Suarez with cloud diagnostics per stage, so about a 30% difference (assuming they use the same GPU on central). I don't know whether the slowdown is mostly from compute_gm_mixing_length! (which includes buoyancy_gradients) or from quad_loop.
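A quick check of the ~30% figure from the quoted SYPD numbers (editorial arithmetic, not from the build logs):

```python
# Slowdown implied by the two SYPD (simulated years per day) numbers above.
sypd_base = 2.03   # Held-Suarez
sypd_cloud = 1.47  # Held-Suarez + cloud diagnostics per stage
slowdown = (sypd_base - sypd_cloud) / sypd_base
print(f"{slowdown:.0%}")  # → 28%
```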

@charleskawczynski

The buoyancy gradients kernel itself now takes only 547 μs (xref: #2951 (comment)). Closing.
