Pool memory leak #367

Merged: mark-petersen merged 6 commits into MPAS-Dev:develop from pool_memory_leak on Nov 14, 2019

Conversation

mark-petersen
Contributor

The destroy_pool subroutine does not deallocate the memory associated with a field, because it uses a simple Fortran deallocate rather than the more thorough mpas_deallocate_field routine. This has become a problem when using RK4 time-stepping in the ocean model, where we hit an out-of-memory error.

fixes #350
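
For illustration, here is a minimal standalone sketch of the kind of leak described above. It uses a simplified stand-in for the MPAS field types; the real types also carry attLists and block/time-level links, which are omitted here.

    program pool_leak_sketch
       implicit none

       ! simplified stand-in for an MPAS field type (illustrative only)
       type :: field1d
          real, dimension(:), pointer :: array => null()
          character(len=64), dimension(:), pointer :: constituentNames => null()
       end type field1d

       type (field1d), pointer :: f

       allocate(f)
       allocate(f % array(1000))
       allocate(f % constituentNames(4))

       ! What the plain deallocate in destroy_pool effectively did: free the
       ! array, then the field itself. The constituentNames (and, in the real
       ! code, attLists) pointer components are never freed, so that memory leaks.
       deallocate(f % array)
       deallocate(f)

       ! A deep deallocation, which is what mpas_deallocate_field provides,
       ! frees the remaining pointer components before releasing the field.
    end program pool_leak_sketch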

@mark-petersen
Contributor Author

@pwolfram, please cherry-pick 1670ba45 and test your hurricane case. In the simple test case you provided here: #350 (comment), it previously ran 26k iterations before hitting a memory error. With this fix it runs 79k iterations, a factor of three on that test. Obviously there is still a leak, but this might get you through a queue cycle. I spent several more hours replacing other deallocate statements, but the other replacements caused errors. I also looked for other memory leaks, without luck.

@mark-petersen
Contributor Author

@mgduda this was a minor memory leak that came up last March that we didn't fix. It is now stopping a set of simulations, so it is a priority. You gave a potential solution here: #185 (comment), which I implemented in this PR, but it only partially fixes the memory leak. A few days after that, here: #185 (comment), you said it needs a different solution than you previously thought.

I've been testing alterations "experimentally", but I honestly don't understand the pool clone and destroy routines very well. Any help or advice you could provide would be greatly appreciated.

@mark-petersen
Contributor Author

There are other deallocate lines, like this set:

 203    recursive subroutine mpas_pool_destroy_pool(inPool)!{{{
...
 253                   else if (associated(dptr % r1)) then
 254                      if (associated(dptr % r1 % array)) then
 255                         deallocate(dptr % r1 % array, stat=local_err)
 256                      end if
 258                      deallocate(dptr % r1, stat=local_err)

and I tried to alter it to this:

 253                   else if (associated(dptr % r1)) then
 254                      if (associated(dptr % r1 % array)) then
 255                         nullify(dptr % r1 % array)
 256                      end if
 257                      call mpas_deallocate_field(dptr % r1)

I made that change to all data types in mpas_pool_destroy_pool and it compiles. But it hits an immediate run-time error, trying to access out-of-bounds memory, so that didn't work.

A second set just after that looks like this:

 306                   else if (associated(dptr % r1a)) then
 307                      do j=1,dptr % contentsTimeLevs
 308                         if (associated(dptr % r1a(j) % array)) then
 309                            deallocate(dptr % r1a(j) % array)
 310                         end if
 311                      end do
 312                      deallocate(dptr % r1a)

That is also a candidate for a memory leak, due to the extra field information that remains. But the alteration to:

 312                      call mpas_deallocate_field(dptr % r1a)

gets a compiler error, presumably because r1a is an array of field types, and mpas_deallocate_field does not have a specific routine for that type.
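
As a side note, here is a small self-contained sketch (not the MPAS code) of why a dummy argument with the pointer attribute cannot accept an element of an array of field types, while a dummy with the target attribute can; this is the direction the eventual fix takes in the commits below. The names free_field and timeLevs are made up for this example.

    program array_element_dummy
       implicit none

       type :: field1d
          real, dimension(:), pointer :: array => null()
       end type field1d

       type (field1d), pointer :: single
       type (field1d), dimension(:), pointer :: timeLevs   ! like r1a: one field per time level
       integer :: j

       allocate(single)
       allocate(single % array(10))

       allocate(timeLevs(3))
       do j = 1, size(timeLevs)
          allocate(timeLevs(j) % array(10))
       end do

       call free_field(single)        ! a pointer actual works with a target dummy
       deallocate(single)

       do j = 1, size(timeLevs)
          ! An array element is not a pointer, so it could not be passed to a
          ! routine whose dummy argument has the pointer attribute; a target
          ! dummy accepts it.
          call free_field(timeLevs(j))
       end do
       deallocate(timeLevs)

    contains

       subroutine free_field(f)
          type (field1d), target, intent(inout) :: f
          if (associated(f % array)) deallocate(f % array)
       end subroutine free_field

    end program array_element_dummy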

@pwolfram
Contributor

@mark-petersen, thanks for the help. I've cherry-picked the commit and am running the job now and can report back later about its status.

@mgduda
Contributor

mgduda commented Sep 18, 2019

@mark-petersen I should be able to look more closely at this by the end of the week. From the discussion on PR #185 I think this is my point of concern regarding the approach taken at present:

Since we do a shallow copy of the source fields to the destination fields in a pool clone, if we ever access the attLists or constituentNames of members of the clone, we'd run into problems by calling mpas_deallocate_field in the mpas_pool_clone_pool routine.
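
To make that concern concrete, here is a tiny standalone example (with simplified types, not the MPAS source) of how a shallow copy leaves both fields sharing the same attList, so that a deep deallocation through one copy leaves the other with a dangling pointer:

    program shallow_copy_demo
       implicit none

       type :: att_list
          character(len=64) :: name = ''
       end type att_list

       type :: field1d
          real, dimension(:), pointer :: array => null()
          type (att_list), pointer :: attList => null()
       end type field1d

       type (field1d) :: src, dst

       allocate(src % array(5))
       allocate(src % attList)
       src % attList % name = 'units'

       ! A shallow copy duplicates the pointers, not the data they point to:
       ! dst % attList and src % attList now refer to the same memory.
       dst = src

       ! A deep deallocation through the clone also frees the original's
       ! attList, leaving src % attList dangling; this is the hazard described above.
       deallocate(dst % attList)
       deallocate(dst % array)

       ! Accessing src % attList or src % array here would be invalid.
    end program shallow_copy_demo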

@mark-petersen
Contributor Author

@mgduda thanks for looking. The changes I made in this PR only extended @pwolfram's simulations by ~20%, and they still die with an 'out of memory' error. I don't know how to completely purge the memory allocated in the mpas_pool_clone_pool routine, so any help you can provide is greatly appreciated.

@mark-petersen
Contributor Author

@mgduda any thoughts on this? Any help you can provide would be greatly appreciated.

@mark-petersen
Contributor Author

@mgduda, we are stuck on this, and it is limiting some of our simulations. If you have any suggestions, that would be extremely helpful. I don't know how to find further memory leaks by simply looking for missing deallocates or nullifies.

@pwolfram
Contributor

@mgduda, can you please provide some advice on this issue when you get a chance? I'm sure you are very busy, but if you don't have any time to engage at all, could you please let us know so we can move forward? Thanks!

@xylar
Collaborator

xylar commented Oct 28, 2019

@pwolfram and @mgduda, it might be worth having an MPAS developer telecon about this one (and whatever other pressing MPAS issues).

@pwolfram
Contributor

@xylar, do you know when the next one is scheduled?

@xylar
Collaborator

xylar commented Oct 28, 2019

@pwolfram, they aren't scheduled anymore. You have to ask for one as topics come up that need to be discussed, like this one. If posting here doesn't do the trick, maybe a quick email to the core MPAS team would do it.

@mgduda
Contributor

mgduda commented Oct 28, 2019

@pwolfram @mark-petersen @xylar Sorry, all, for not taking a closer look long ago! I'll start this afternoon and let you all know what I find by tomorrow afternoon at the latest.

@mark-petersen
Contributor Author

@mgduda thanks. Let us know if you need a test case to exemplify the behavior.

@mgduda
Contributor

mgduda commented Oct 29, 2019

@mark-petersen @pwolfram I have a tentative set of changes that I think will work. I've pushed these to a branch named 'framework/destroy_pool_mem_leak' in my fork of the repository. Would you all be willing to try merging this branch (which is based on the v7.0 tag) into your working branch to see if it addresses the memory leaks you're seeing? My approach was slightly different from what was done in this PR, so I created a separate branch.

@sbrus89

sbrus89 commented Oct 30, 2019

@mgduda, thanks for taking a look at this! I'm testing your branch now on a test case that previously failed due to this issue. Just for my own future reference: the simulation began at 2012-10-10_00:00:00 and crashed due to insufficient virtual memory at 2012-10-21_22:13:45. I'll report back on whether the changes allow the run to complete.

cc @pwolfram, @mark-petersen

@pwolfram
Contributor

@mgduda, thank you so VERY MUCH! I'm testing it now, and it is absolutely solid for my simplified test case at pwolfram/coastal/memory_leak_test, using compass/ocean/hurricane/USDEQU120at30cr10rr2/test_memory. It was really easy to get this to crash before.

I'll defer to @sbrus89 to confirm with the real-world case next, but I strongly expect not to see an issue. We'll let you know.

cc @mark-petersen

@pwolfram
Contributor

After discussion with @mark-petersen, and just to document: it ran on the order of 130,000 clone/destroy-pool cycles, and watching the memory footprint with htop, it was rock solid with no apparent long-term memory creep.

@mark-petersen will test more broadly tomorrow. Note that the functionality this affects is probably limited to ocean RK4 time integration and the shallow water pool, because the issue was with the mpas_pool_destroy_pool routine, if I understand correctly.

@sbrus89

sbrus89 commented Oct 31, 2019

My run with @mgduda's changes went from 2012-10-10_00:00:00 to 2012-10-31_22:02:55 (when it timed out after reaching the 4-hour interactive queue time limit), compared to the previous memory-leak failure at 2012-10-21_22:13:45. That was only a little over 2 simulation days short of the stop time for the test case. Based on these results and @pwolfram's and @mark-petersen's testing, I suspect these changes solve the problem. I'll go ahead and run on standard to make sure the run completes, but I don't anticipate any issues.

@mark-petersen
Contributor Author

I rebased and added the three commits by @mgduda.

@mgduda, was it your intention for me to keep my commit, b1d7384, or does your work replace that?


deallocate(f_cursor)
if ((.not. local_keep_head_ptr) .or. (.not. associated(f_cursor, f))) then
deallocate(f_cursor)
Contributor Author

@mgduda I did a local test merge into ocean/develop. I get a seg fault here when I compile with intel/17.0.1 in debug mode (and also with gnu), but only if I run multiple blocks per core; that is, if config_number_of_blocks is set to a number different from the number of cores given to mpirun -n. For example,

config_number_of_blocks=8
mpirun -n 4 ./ocean_model

causes a seg fault, but

config_number_of_blocks=8
mpirun -n 8 ./ocean_model

runs fine. Using config_number_of_blocks=0 is also fine. This test worked before the merge.

@mgduda
Contributor

mgduda commented Nov 7, 2019

@mark-petersen I just pushed three new commits to the framework/destroy_pool_mem_leak branch in my fork. I think these changes should address the deallocation errors you're seeing with multiple blocks. If you do notice any other odd behavior with the recent changes, just let me know and I'll continue to investigate. The commit messages will hopefully explain what I've done, but if there are any questions at all I'll be glad to clarify!

@mark-petersen
Contributor Author

@mgduda I added your three new commits, and they worked great! They solved the problem we saw before on finalize. Thank you!

@sbrus89 or @pwolfram if either of you have time to test these new commits on your previous case, that would be great.

I would like to merge this into develop soon. @mgduda please approve if you are happy with this final version. @matthewhoffman, do you want to review?

@mgduda
Contributor

mgduda commented Nov 7, 2019

@mark-petersen The framework/destroy_pool_mem_leak branch in my fork is based on the v7.0 tag, while the changes in your pool_mem_leak branch are based on a recent develop branch. Ideally, we'd base the changes on v7.0, which is an ancestor of the current develop, so that anyone working off of the MPAS v7.0 release code could pick up these fixes without pulling in unrelated code changes from develop. Would you all be open to the idea of force-pushing framework/destroy_pool_mem_leak onto pool_mem_leak before we merge this PR?

In cases where we need to deallocate a field that is an element of an array
of field types, we could not pass an element of this field array to
the mpas_deallocate_field routine, since the argument was a pointer.

The field argument to the mpas_deallocate_field routines is now a target
rather than a pointer, which allows elements of arrays of fields to be
deallocated. However, since we cannot deallocate an individual element of
an array, an optional argument, keep_head_ptr, has been added. When set
to true, the mpas_deallocate_field routine will not attempt to deallocate
the argument to the routine (though the contents of that argument's field
type -- array, constituentNames, and attLists -- will be deallocated).

Now, it is possible to deallocate an array of field types, e.g.:

    type (field1DReal), dimension(:), allocatable :: field_array

    ... code that allocates the field_array ...

    do i=1,size(field_array)
        call mpas_deallocate_field(field_array(i), keep_head_ptr=.true.)
    end do
    deallocate(field_array)

*** Note: Deallocating an allocated pointer to a single field works as before,
with the exception that the actual argument to the mpas_deallocate_field routine
is not nullified (since the dummy argument is now a target and not a pointer).
Therefore, it is recommended to explicitly nullify the actual argument after
a call to mpas_deallocate_field, e.g.,

    type (field1DReal), pointer :: f

    ... code that allocates f ...

    call mpas_deallocate_field(f)
    nullify(f)

This commit also adds code to the mpas_deallocate_field routines to deallocate
the constituentNames member of the field type if it is allocated.
Previously, the mpas_pool_destroy_pool routine simply deallocated the 'array'
component of the fields that were stored in the pool to be freed. This led to
a memory leak, which is fixed by calling mpas_deallocate_field for each
field in a pool.
…nsion

When querying the decomposed dimension for a 4-d real field with a time
dimension in the mpas_pool_link_parinfo routine, the routine previously
attempted to access poolMem % r4 % dimNames(4), when it should have
accessed poolMem % r4a(1) % dimNames(4). The former access is appropriate
only for 4-d real fields with no time dimension.

pool_get_member_decomp_type(poolMem % r4a(1) % dimNames(4))
…le blocks

When the keep_head_ptr argument to mpas_deallocate_field was .true.,
the mpas_deallocate_field routines previously attempted to deallocate all blocks
for the input field besides the first block. This led to problems in freeing
fields with multiple blocks and multiple time levels, since the blocks of
the field were each elements of an array of fields (the array containing
the time levels).

Now, setting the optional argument keep_head_ptr to .true. in a call to
one of the mpas_deallocate_field routines will prevent the containing
field type for all blocks from being deallocated. Memory leaks are still
avoided, since the field arrays for all blocks should eventually be
deallocated (e.g., with calls to mpas_deallocate_block in the loop
over blocks in lines 124-128 of mpas_domain_routines.F).
The previous implementations of the mpas_deallocate_field routines would
not return a nullified field pointer when the optional keep_head_ptr
argument was .false., since the dummy argument to these routines had
the target attribute rather than the pointer attribute. This could lead
to problems if the calling code didn't explicitly nullify the field pointer
but later checked on the association status of that pointer.

To remedy this issue, a new interface, mpas_deallocate_field_target, has been
implemented to handle the deallocation of fields that are elements of field
arrays. This interface is principally used when deallocating fields with
multiple time levels. The mpas_deallocate_field interface now reverts to
declaring the dummy argument with the pointer attribute; these routines are
implemented by a call to mpas_deallocate_field_target followed by an explicit
nullification of the dummy argument.
…ime level

The mpas_pool_destroy_pool routine previously called the mpas_deallocate_field
routine for fields with only a single time level. This works well for fields
that have just one block per MPI task, but leads to invalid memory references
when fields have multiple blocks: calling mpas_pool_destroy_pool for the pool
in the first block frees memory for all blocks of a field, leaving the field
pointers in subsequent blocks pointing to deallocated memory.

As a work-around, the mpas_pool_destroy_pool routine now calls
mpas_deallocate_field_target with the keep_head_ptr argument set to .true. so
that the field pointed to by pools from other blocks will not be deallocated.
Then, only for the field owned by a block, the field is explicitly deallocated
after the call to mpas_deallocate_field_target.
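
As a rough sketch of the two-interface pattern described in these commit messages (simplified types, and not taken verbatim from the MPAS source): a target-dummy routine does the component cleanup so that elements of field arrays can be passed in, while a pointer-dummy wrapper also frees the head and hands back a disassociated pointer. The real routines additionally take the optional keep_head_ptr argument and walk block and time-level lists, which this sketch omits.

    module dealloc_sketch
       implicit none

       type :: field1d
          real, dimension(:), pointer :: array => null()
       end type field1d

    contains

       ! Loosely analogous to mpas_deallocate_field_target: a target dummy, so
       ! elements of an array of fields (e.g. time levels) can be passed in.
       ! Only the field's contents are freed here, never the field itself.
       subroutine deallocate_field_target(f)
          type (field1d), target, intent(inout) :: f
          if (associated(f % array)) deallocate(f % array)
       end subroutine deallocate_field_target

       ! Loosely analogous to mpas_deallocate_field: a pointer dummy,
       ! implemented on top of the target version. Deallocating through the
       ! pointer frees the head and leaves the caller's pointer disassociated.
       subroutine deallocate_field(f)
          type (field1d), pointer :: f
          call deallocate_field_target(f)
          deallocate(f)
       end subroutine deallocate_field

    end module dealloc_sketch

In this simplified picture, a field allocated through a pointer goes through deallocate_field, while each element of an array of fields goes through deallocate_field_target followed by a single deallocate of the array, matching the usage example in the first commit message above.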
@mark-petersen
Contributor Author

@mgduda no problem. I did that and retested, and it all works.

@matthewhoffman (Member) left a comment

Approved by inspection

@mark-petersen mark-petersen removed the request for review from pwolfram November 14, 2019 16:54
mark-petersen added a commit that referenced this pull request Nov 14, 2019
@mark-petersen mark-petersen merged commit 7ca6769 into MPAS-Dev:develop Nov 14, 2019
@mark-petersen mark-petersen deleted the pool_memory_leak branch November 14, 2019 20:34
Comment on lines +1848 to +1850
if (.not. local_keep_head_ptr) then
deallocate(f_cursor)
end if
Contributor

Please note that the IBM compiler (xlf2008_r v16.1.1-3) stops on this deallocate with a run-time error:

1525-109 Error encountered while attempting to deallocate a data object.  The program will stop.

After re-running with additional STAT and ERRMSG arguments, the deallocate returns a status value of 2: 'An invalid data object has been specified for the DEALLOCATE statement.' It appears that the standard wants the object to be either an allocatable array or an object with the target attribute. That's not the case here, with f_cursor declared with the pointer attribute.

Grepping for calls to this subroutine, all callers either

  1. call with keep_head_ptr=.true. and deallocate the object immediately after the call, or
  2. call with keep_head_ptr=.false. and nullify the object right after the call.

If I comment out the if statement, the run proceeds with no errors. It appears that all calls need to pass in keep_head_ptr=.true. (or we could just remove this logic and the corresponding call to deallocate the head pointer). Would that be OK, @mgduda?
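
For reference, a small generic example (not the MPAS code) of the STAT/ERRMSG diagnostic mentioned above: with stat= present, a failing DEALLOCATE sets a positive status and lets the program continue instead of stopping, and errmsg= captures the processor's explanation.

    program dealloc_stat_demo
       implicit none
       real, dimension(:), pointer :: p => null()
       integer :: ierr
       character(len=256) :: emsg

       allocate(p(10))
       deallocate(p)            ! fine: p was allocated through this pointer

       ! A second deallocate is an error condition; stat= lets execution
       ! continue, and errmsg= records the message (wording varies by compiler).
       deallocate(p, stat=ierr, errmsg=emsg)
       if (ierr /= 0) then
          print '(a,i0,2a)', 'deallocate failed, stat=', ierr, ': ', trim(emsg)
       end if
    end program dealloc_stat_demo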

Contributor

I'll take a closer look. I'm hesitant to remove a DEALLOCATE statement without understanding whether this might re-introduce a memory leak under certain circumstances. From what I recall of the 2003 standard, objects with the POINTER attribute implicitly have the TARGET attribute.

Contributor

Here is a branch that lets me build and run a sea-ice case: diff. I can try valgrind on the case that had memory leaks; how do I configure one?

When I try SMS.T62_oQU120.CMPASO-NYF with config_time_integrator = 'RK4', the case runs into 'SIGFPE: erroneous arithmetic operation'. Does an RK4 case run in E3SM?

Contributor

Sorry for not following up on this issue for so long. I just encountered the same issue on Summit, and I'll see what I can find.

Contributor

I just opened Issue #481 to track progress on this.

mark-petersen added a commit that referenced this pull request Jun 25, 2020
…elop

Fix framework field deallocation errors for XL compiler #578

This merge addresses several sources of pool and field deallocation
errors that arise when MPAS is built with the XL 16 compilers.

Additional discussion is at #367 (review).

Fixes #481

[bit-for-bit]
ashwathsv pushed a commit to ashwathsv/MPAS-Model that referenced this pull request Jul 21, 2020
mark-petersen added a commit to gcapodag/MPAS-Model that referenced this pull request Jan 13, 2021
I found exactly the same problem as we are having here:
MPAS-Dev#367 (review),
and it was later fixed here:
MPAS-Dev#367 (comment).

The problem was introduced again since then, in this PR:
MPAS-Dev#578