Create the provis_state subpool at RK4 initialization to avoid memory leak #87

Closed

Conversation

@sbrus89 (Collaborator) commented Apr 2, 2024

I have noticed a memory leak in the RK4 timestepping when running 125-day single-layer barotropic tides cases with the vr45to5 mesh (MPAS-Dev/compass#802) on pm-cpu. I can typically only get through about 42 days of simulation before running out of memory.

This issue is related to creating/destroying the provis_state subpool at each timestep. We had a similar issue a few years back that required memory-leak fixes in the mpas_pool_destroy_pool subroutine (MPAS-Dev/MPAS-Model#367). However, I believe there is still a memory leak in mpas_pool_remove_subpool (which calls pool_remove_member), which is called after mpas_pool_destroy_pool. The TODO comment here: https://github.com/E3SM-Project/E3SM/blob/6b9ecaa67c81c65fe1f7063e5afe63ce9b2c66a9/components/mpas-framework/src/framework/mpas_pool_routines.F#L6036-L6038 suggests that things may not be completely cleaned up by this subroutine.
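To illustrate the kind of problem an incomplete cleanup causes (this is a hypothetical sketch, not the actual MPAS pool internals; the type and routine names are made up), removing a member from a linked list leaks whenever only the link node is freed while the data it points to stays allocated:

! Hypothetical illustration of a remove-member leak; not MPAS code.
module leaky_list
   implicit none

   type :: member_type
      character(len=32) :: key = ''
      real, dimension(:), pointer :: data => null()
      type(member_type), pointer :: next => null()
   end type member_type

contains

   subroutine remove_member(head, key)
      type(member_type), pointer :: head
      character(len=*), intent(in) :: key
      type(member_type), pointer :: cur, prev

      prev => null()
      cur => head
      do while (associated(cur))
         if (trim(cur % key) == trim(key)) then
            ! Unlink the member from the list.
            if (associated(prev)) then
               prev % next => cur % next
            else
               head => cur % next
            end if
            ! The leak: cur % data is never deallocated here, only the link
            ! node itself.  Repeating this every timestep grows memory steadily.
            deallocate(cur)
            return
         end if
         prev => cur
         cur => cur % next
      end do
   end subroutine remove_member

end module leaky_list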

I'm not familiar enough with the details of the pools framework to track down the memory leak itself. However, in any case, I think it makes more sense to create the provis_state subpool once at initialization rather than creating and destroying it every timestep. This PR is meant to socialize this as a potential approach. The main consequence is that the mpas_pool_copy_pool subroutine needs an overrideTimeLevels option similar to the one mpas_pool_clone_pool used under the previous approach. I've tested these changes with the vr45to5 test case, and they do allow me to run for the full 125 days. A sketch of the before/after pattern follows the summary bullet below.

 - Avoids create/destroy at each timestep
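A minimal sketch of the intended pattern, assuming the MPAS pool calls named in this thread (mpas_pool_create_pool, mpas_pool_clone_pool, mpas_pool_copy_pool); the subroutine names and the mpas_derived_types module name are illustrative, and the third argument to mpas_pool_copy_pool is the new optional overrideTimeLevels this PR adds:

! Illustrative sketch, not the actual RK4 code.
subroutine rk4_create_provis_state(statePool, provisStatePool)
   ! Called once at initialization: build provis_state with a single time level.
   use mpas_derived_types, only : mpas_pool_type   ! assumed module name
   use mpas_pool_routines
   implicit none
   type (mpas_pool_type), pointer :: statePool, provisStatePool

   call mpas_pool_create_pool(provisStatePool)
   call mpas_pool_clone_pool(statePool, provisStatePool, 1)
end subroutine rk4_create_provis_state

subroutine rk4_refresh_provis_state(statePool, provisStatePool)
   ! Called every timestep: copy data into the existing subpool instead of
   ! cloning a fresh pool and destroying it afterwards (the leaking pattern).
   use mpas_derived_types, only : mpas_pool_type   ! assumed module name
   use mpas_pool_routines
   implicit none
   type (mpas_pool_type), pointer :: statePool, provisStatePool

   call mpas_pool_copy_pool(statePool, provisStatePool, 1)
end subroutine rk4_refresh_provis_state

With this pattern the per-timestep clone/destroy pair is replaced by a single copy into a pool that persists for the life of the run, so nothing has to be removed from the parent pool each step.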
@sbrus89 added the bug (Something isn't working) label on Apr 2, 2024
@mark-petersen

@sbrus89 thanks for your work on this. In recent simulations by @jeremy-lilly and @gcapodag we had a memory leak problem with RK4 on Perlmutter. We will retest, as this might take care of it! We were trying to run a 125 m single-layer Hurricane Sandy test, very similar to your problem above, and it would run out of memory after a few days.

I spent time fixing RK4 memory leaks back in 2019. I obviously didn't fix them all, and greatly appreciate your effort on this now. For reference, here are the old issues and PRs:
MPAS-Dev/MPAS-Model#137
MPAS-Dev/MPAS-Model#142
MPAS-Dev/MPAS-Model#185

I will also conduct some tests. If they all work, we can move this to E3SM.

@mark-petersen

Will compile with this small fix:

--- a/components/mpas-framework/src/framework/mpas_pool_routines.F
+++ b/components/mpas-framework/src/framework/mpas_pool_routines.F
@@ -991,7 +991,7 @@ module mpas_pool_routines
-                  if (present(overrideTimeLevel)) then
+                  if (present(overrideTimeLevels)) then
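For context, a standalone illustration (not MPAS code) of why the one-character typo broke the build: present() may only be applied to an optional dummy argument that is actually declared, so referring to the misspelled overrideTimeLevel fails to compile when the dummy is named overrideTimeLevels:

! Hypothetical example of present() on an optional dummy argument.
subroutine copy_with_override(overrideTimeLevels)
   implicit none
   integer, intent(in), optional :: overrideTimeLevels
   integer :: nTimeLevels

   if (present(overrideTimeLevels)) then
      nTimeLevels = overrideTimeLevels
   else
      nTimeLevels = 2   ! fall back to some default time-level count
   end if
   print *, 'copying ', nTimeLevels, ' time level(s)'
end subroutine copy_with_override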

@xylar (Collaborator) left a comment

@sbrus89, this looks great! I like the improvements you just added based on our discussion today.

@sbrus89 (Collaborator, Author) commented Apr 3, 2024

Thanks @mark-petersen, hopefully this helps with @jeremy-lilly and @gcapodag's issue as well.

@sbrus89 (Collaborator, Author) commented Apr 3, 2024

Thanks @xylar, I tried to make overrideTimeLevels operate analogously to the existing functionality in mpas_pool_clone_pool. Thanks again for your suggestions!

@gcapodag commented Apr 3, 2024

Hi All, thanks @sbrus89 for bringing this up. LTS and FB-LTS, which are both merged into master now, also use the same process of cloning pools. If you are going to merge this into master, please also include the changes for those two time-stepping methods so that they are not left out.

@mark-petersen

Passes the nightly test suite and compares BFB with the master branch point on Chicoma with optimized GNU and on Chrysalis with optimized Intel. Also passes the nightly test suite with debug GNU on Chicoma. Note this includes a series of RK4 tests:

00:34 PASS ocean_global_ocean_Icos240_WOA23_RK4_performance_test
01:40 PASS ocean_global_ocean_Icos240_WOA23_RK4_restart_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_decomp_test
01:20 PASS ocean_global_ocean_Icos240_WOA23_RK4_threads_test

In E3SM, passes

./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_gnu
./create_test SMS_Ln9.T62_oQU240.GMPAS-IAF.chrysalis_intel

@mark-petersen left a comment

@sbrus89, since this shows 'no harm done' for the solution and fixes a memory leak, please move this PR over to E3SM-Project.

@xylar (Collaborator) commented Apr 4, 2024

I agree with @gcapodag, let's include LTS and FB-LTS before this goes to E3SM.

@gcapodag commented Apr 4, 2024

Thanks everyone for your work on this. I have just submitted a job on Perlmutter to see if it fixes our problem. I'll keep you all posted.

@sbrus89 (Collaborator, Author) commented Apr 4, 2024

Sounds good, @gcapodag. I'll add these changes for LTS and FB-LTS as well.

@gcapodag commented Apr 4, 2024

Great, thank you very much @sbrus89 !!

@gcapodag commented Apr 5, 2024

Hi All, just wanted to confirm that this fix for RK4 allowed us to run on Perlmutter for 25 days on a mesh with 4617372 cells and a highest resolution near the coast of 125 m. Before this fix, the run would crash with a segmentation fault around two days in, and we could only get it past that point by using around 5% of the node capacity with these srun settings: srun -N 64 -n 128 --ntasks-per-node=2 --ntasks-per-core=1. Thanks a lot to you all for figuring out a solution!

@sbrus89 (Collaborator, Author) commented Apr 5, 2024

@gcapodag, I pushed the LTS/FB-LTS changes but haven't tested them yet.

@gcapodag commented Apr 5, 2024

Thanks @sbrus89, I just tested a 2 hr run on Perlmutter and the changes to LTS and FB-LTS are BFB.

@sbrus89 (Collaborator, Author) commented Apr 5, 2024

Thanks very much for testing, @gcapodag! I think I'll go ahead and move this over to E3SM in that case.
