Pool memory leak #367

Merged: mark-petersen merged 6 commits into MPAS-Dev:develop from pool_memory_leak on Nov 14, 2019

Conversation

mark-petersen
Contributor

The destroy_pool subroutine does not deallocate the memory associated with a field, because it uses a simple Fortran deallocate rather than the more thorough mpas_deallocate_field routine. This has become a problem when using RK4 time-stepping in the ocean model, where we hit an out-of-memory error.

fixes #350
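
For illustration, here is a minimal standalone sketch of the kind of leak described above. It uses a simplified stand-in for the MPAS field types; the real types also carry attLists and block/time-level links, which are omitted here.

    program pool_leak_sketch
       implicit none

       ! simplified stand-in for an MPAS field type (illustrative only)
       type :: field1d
          real, dimension(:), pointer :: array => null()
          character(len=64), dimension(:), pointer :: constituentNames => null()
       end type field1d

       type (field1d), pointer :: f

       allocate(f)
       allocate(f % array(1000))
       allocate(f % constituentNames(4))

       ! What the plain deallocate in destroy_pool effectively did: free the
       ! array, then the field itself. The constituentNames (and, in the real
       ! code, attLists) pointer components are never freed, so that memory leaks.
       deallocate(f % array)
       deallocate(f)

       ! A deep deallocation, which is what mpas_deallocate_field provides,
       ! frees the remaining pointer components before releasing the field.
    end program pool_leak_sketch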

@mark-petersen
Contributor Author

@pwolfram, please cherry-pick 1670ba45 and test your hurricane case. In the simple test case you provided here: #350 (comment), it previously ran 26k iterations before hitting a memory error. With this fix it runs 79k iterations, a factor of three on that test. Obviously there is still a leak, but this might get you through a queue cycle. I spent several more hours replacing other deallocate statements, but the other replacements caused errors. I also looked for other memory leaks, without luck.

@mark-petersen
Contributor Author

@mgduda this was a minor memory leak that came up last March that we didn't fix. It is now stopping a set of simulations, so it is a priority. You gave a potential solution here: #185 (comment), which I implemented in this PR, but it only partially fixes the memory leak. A few days after that, here: #185 (comment), you said it needs a different solution than you previously thought.

I've been testing alterations "experimentally", but I honestly don't understand the pool clone and destroy routines very well. Any help or advice you could provide would be greatly appreciated.

@mark-petersen
Contributor Author

There are other deallocate lines, like this set:

 203    recursive subroutine mpas_pool_destroy_pool(inPool)!{{{
...
 253                   else if (associated(dptr % r1)) then
 254                      if (associated(dptr % r1 % array)) then
 255                         deallocate(dptr % r1 % array, stat=local_err)
 256                      end if
 258                      deallocate(dptr % r1, stat=local_err)

and I tried to alter it to this:

 253                   else if (associated(dptr % r1)) then
 254                      if (associated(dptr % r1 % array)) then
 255                         nullify(dptr % r1 % array)
 256                      end if
 257                      call mpas_deallocate_field(dptr % r1)

I made that change to all data types in mpas_pool_destroy_pool and it compiles. But it hits an immediate run-time error, trying to access out-of-bounds memory, so that didn't work.

A second set just after that looks like this:

 306                   else if (associated(dptr % r1a)) then
 307                      do j=1,dptr % contentsTimeLevs
 308                         if (associated(dptr % r1a(j) % array)) then
 309                            deallocate(dptr % r1a(j) % array)
 310                         end if
 311                      end do
 312                      deallocate(dptr % r1a)

That is also a candidate for a memory leak, due to the extra field information that remains. But the alteration to:

 312                      call mpas_deallocate_field(dptr % r1a)

gets a compiler error, presumably because r1a is an array of field types, and mpas_deallocate_field does not have a specific routine for that type.
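
As a side note, here is a small self-contained sketch (not the MPAS code) of why a dummy argument with the pointer attribute cannot accept an element of an array of field types, while a dummy with the target attribute can; this is the direction the eventual fix takes in the commits below. The names free_field and timeLevs are made up for this example.

    program array_element_dummy
       implicit none

       type :: field1d
          real, dimension(:), pointer :: array => null()
       end type field1d

       type (field1d), pointer :: single
       type (field1d), dimension(:), pointer :: timeLevs   ! like r1a: one field per time level
       integer :: j

       allocate(single)
       allocate(single % array(10))

       allocate(timeLevs(3))
       do j = 1, size(timeLevs)
          allocate(timeLevs(j) % array(10))
       end do

       call free_field(single)        ! a pointer actual works with a target dummy
       deallocate(single)

       do j = 1, size(timeLevs)
          ! An array element is not a pointer, so it could not be passed to a
          ! routine whose dummy argument has the pointer attribute; a target
          ! dummy accepts it.
          call free_field(timeLevs(j))
       end do
       deallocate(timeLevs)

    contains

       subroutine free_field(f)
          type (field1d), target, intent(inout) :: f
          if (associated(f % array)) deallocate(f % array)
       end subroutine free_field

    end program array_element_dummy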

@pwolfram
Contributor

@mark-petersen, thanks for the help. I've cherry-picked the commit and am running the job now and can report back later about its status.

@mgduda
Contributor

mgduda commented Sep 18, 2019

@mark-petersen I should be able to look more closely at this by the end of the week. From the discussion on PR #185 I think this is my point of concern regarding the approach taken at present:

Since we do a shallow copy of the source fields to the destination fields in a pool clone, if we ever access the attLists or constituentNames of members of the clone, we'd run into problems by calling mpas_deallocate_field in the mpas_pool_clone_pool routine.
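
To make that concern concrete, here is a tiny standalone example (with simplified types, not the MPAS source) of how a shallow copy leaves both fields sharing the same attList, so that a deep deallocation through one copy leaves the other with a dangling pointer:

    program shallow_copy_demo
       implicit none

       type :: att_list
          character(len=64) :: name = ''
       end type att_list

       type :: field1d
          real, dimension(:), pointer :: array => null()
          type (att_list), pointer :: attList => null()
       end type field1d

       type (field1d) :: src, dst

       allocate(src % array(5))
       allocate(src % attList)
       src % attList % name = 'units'

       ! A shallow copy duplicates the pointers, not the data they point to:
       ! dst % attList and src % attList now refer to the same memory.
       dst = src

       ! A deep deallocation through the clone also frees the original's
       ! attList, leaving src % attList dangling; this is the hazard described above.
       deallocate(dst % attList)
       deallocate(dst % array)

       ! Accessing src % attList or src % array here would be invalid.
    end program shallow_copy_demo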

@mark-petersen
Contributor Author

@mgduda thanks for looking. The changes I made in this PR only extended @pwolfram's simulations by ~20%, and they still die with an 'out of memory' error. I don't know how to completely purge the memory allocated in the mpas_pool_clone_pool routine, so any help you can provide is greatly appreciated.

@mark-petersen
Contributor Author

@mgduda any thoughts on this? Any help you can provide would be greatly appreciated.

@mark-petersen
Contributor Author

@mgduda, we are stuck on this, and it is limiting some of our simulations. If you have any suggestions, that would be extremely helpful. I don't know how to find further memory leaks by simply looking for missing deallocates or nullifies.

@pwolfram
Contributor

@mgduda, can you please provide some advice on this issue when you get a chance? I'm sure you are very busy, but if you don't have any time to engage at all, could you please let us know so we can move forward? Thanks!

@xylar
Collaborator

xylar commented Oct 28, 2019

@pwolfram and @mgduda, it might be worth having an MPAS developer telecon about this one (and whatever other pressing MPAS issues).

@pwolfram
Contributor

@xylar, do you know when the next one is scheduled?

@xylar
Collaborator

xylar commented Oct 28, 2019

@pwolfram, they aren't scheduled anymore. You have to ask for one as topics come up that need to be discussed, like this one. If posting here doesn't do the trick, maybe a quick email to the core MPAS team would do it.

@mgduda
Contributor

mgduda commented Oct 28, 2019

@pwolfram @mark-petersen @xylar Sorry, all, for not taking a closer look long ago! I'll start this afternoon and let you all know what I find by tomorrow afternoon at the latest.

@mark-petersen
Contributor Author

@mgduda thanks. Let us know if you need a test case to exemplify the behavior.

@mgduda
Contributor

mgduda commented Oct 29, 2019

@mark-petersen @pwolfram I have a tentative set of changes that I think will work. I've pushed these to a branch named 'framework/destroy_pool_mem_leak' in my fork of the repository. Would you all be willing to try merging this branch (which is based on the v7.0 tag) into your working branch to see if it addresses the memory leaks you're seeing? My approach was slightly different from what was done in this PR, so I created a separate branch.

@sbrus89

sbrus89 commented Oct 30, 2019

@mgduda, thanks for taking a look at this! I'm testing your branch now on a test case that previously failed due to this issue. Just for my own future reference: the simulation began at 2012-10-10_00:00:00 and crashed due to insufficient virtual memory at 2012-10-21_22:13:45. I'll report back on whether the changes allow the run to complete.

cc @pwolfram, @mark-petersen

@pwolfram
Contributor

@mgduda, thank you so VERY MUCH! I'm testing it now, and it is absolutely solid for my simplified test case at pwolfram/coastal/memory_leak_test, using compass/ocean/hurricane/USDEQU120at30cr10rr2/test_memory. It was really easy to get this to crash before.

I'll defer to @sbrus89 to confirm with the real-world case next, but I strongly expect not to see an issue. We'll let you know.

cc @mark-petersen

@pwolfram
Contributor

After discussion with @mark-petersen, and just to document: it ran on the order of 130,000 clone/destroy-pool cycles, and watching the memory footprint with htop, it was rock solid with no apparent long-term memory creep.

@mark-petersen will test more broadly tomorrow. Note that the functionality this affects is probably limited to ocean RK4 time integration and the shallow water pool, because the issue was with the mpas_pool_destroy_pool routine, if I understand correctly.

@sbrus89

sbrus89 commented Oct 31, 2019

My run with @mgduda's changes went from 2012-10-10_00:00:00 to 2012-10-31_22:02:55 (when it timed out after reaching the 4-hour interactive queue time limit), compared to the previous memory-leak failure at 2012-10-21_22:13:45. That was only a little over 2 simulation days short of the stop time for the test case. Based on these results and @pwolfram's and @mark-petersen's testing, I suspect these changes solve the problem. I'll go ahead and run on standard to make sure the run completes, but I don't anticipate any issues.

@mark-petersen
Contributor Author

I rebased and added the three commits by @mgduda.

@mgduda, was it your intention for me to keep my commit, b1d7384, or does your work replace that?


deallocate(f_cursor)
if ((.not. local_keep_head_ptr) .or. (.not. associated(f_cursor, f))) then
deallocate(f_cursor)
Contributor Author

@mgduda I did a local test merge into ocean/develop. I get a seg fault here when I compile with intel/17.0.1 in debug mode (and also with gnu), but only if I run multiple blocks per core; that is, if config_number_of_blocks is set to a number different from the number of cores given to mpirun -n. For example,

config_number_of_blocks=8
mpirun -n 4 ./ocean_model

causes a seg fault, but

config_number_of_blocks=8
mpirun -n 8 ./ocean_model

runs fine. Using config_number_of_blocks=0 is also fine. This test worked before the merge.

@mgduda
Contributor

mgduda commented Nov 7, 2019

@mark-petersen I just pushed three new commits to the framework/destroy_pool_mem_leak branch in my fork. I think these changes should address the deallocation errors you're seeing with multiple blocks. If you do notice any other odd behavior with the recent changes, just let me know and I'll continue to investigate. The commit messages will hopefully explain what I've done, but if there are any questions at all I'll be glad to clarify!

@mark-petersen
Contributor Author

@mgduda I added your three new commits, and they worked great! They solved the problem we saw before on finalize. Thank you!

@sbrus89 or @pwolfram if either of you have time to test these new commits on your previous case, that would be great.

I would like to merge this into develop soon. @mgduda please approve if you are happy with this final version. @matthewhoffman, do you want to review?

@mgduda
Contributor

mgduda commented Nov 7, 2019

@mark-petersen The framework/destroy_pool_mem_leak branch in my fork is based on the v7.0 tag, while the changes in your pool_mem_leak branch are based on a recent develop branch. Ideally, we'd base the changes on v7.0, which is an ancestor of the current develop, so that anyone working off of the MPAS v7.0 release code could pick up these fixes without pulling in unrelated code changes from develop. Would you all be open to the idea of force-pushing framework/destroy_pool_mem_leak onto pool_mem_leak before we merge this PR?

In cases where we need to deallocate a field that is an element of an array
of field types, we could not pass an element of this field array to
the mpas_deallocate_field routine, since the argument was a pointer.

The field argument to the mpas_deallocate_field routines is now a target
rather than a pointer, which allows elements of arrays of fields to be
deallocated. However, since we cannot deallocate an individual element of
an array, an optional argument, keep_head_ptr, has been added. When set
to true, the mpas_deallocate_field routine will not attempt to deallocate
the argument to the routine (though the contents of that argument's field
type -- array, constituentNames, and attLists -- will be deallocated).

Now, it is possible to deallocate an array of field types, e.g.:

    type (field1DReal), dimension(:), allocatable :: field_array

    ... code that allocates the field_array ...

    do i=1,size(field_array)
        call mpas_deallocate_field(field_array(i), keep_head_ptr=.true.)
    end do
    deallocate(field_array)

*** Note: Deallocating an allocated pointer to a single field works as before,
with the exception that the actual argument to the mpas_deallocate_field routine
is not nullified (since the dummy argument is now a target and not a pointer).
Therefore, it is recommended to explicitly nullify the actual argument after
a call to mpas_deallocate_field, e.g.,

    type (field1DReal), pointer :: f

    ... code that allocates f ...

    call mpas_deallocate_field(f)
    nullify(f)

This commit also adds code to the mpas_deallocate_field routines to deallocate
the constituentNames member of the field type if it is allocated.
Previously, the mpas_pool_destroy_pool routine simply deallocated the 'array'
component of the fields that were stored in the pool to be freed. This led to
a memory leak, which is fixed by calling mpas_deallocate_field for each
field in a pool.
…nsion

When querying the decomposed dimension for a 4-d real field with a time
dimension in the mpas_pool_link_parinfo routine, the routine previously
attempted to access poolMem % r4 % dimNames(4), when it should have
accessed poolMem % r4a(1) % dimNames(4). The former access is appropriate
only for 4-d real fields with no time dimension.

pool_get_member_decomp_type(poolMem % r4a(1) % dimNames(4))
…le blocks

When the keep_head_ptr argument to mpas_deallocate_field was .true.,
the mpas_deallocate_field routines previously attempted to deallocate all blocks
for the input field besides the first block. This led to problems in freeing
fields with multiple blocks and multiple time levels, since the blocks of
the field were each elements of an array of fields (the array containing
the time levels).

Now, setting the optional argument keep_head_ptr to .true. in a call to
one of the mpas_deallocate_field routines will prevent the containing
field type for all blocks from being deallocated. Memory leaks are still
avoided, since the field arrays for all blocks should eventually be
deallocated (e.g., with calls to mpas_deallocate_block in the loop
over blocks in lines 124-128 of mpas_domain_routines.F).
The previous implementations of the mpas_deallocate_field routines would
not return a nullified field pointer when the optional keep_head_ptr
argument was .false., since the dummy argument to these routines had
the target attribute rather than the pointer attribute. This could lead
to problems if the calling code didn't explicitly nullify the field pointer
but later checked on the association status of that pointer.

To remedy this issue, a new interface, mpas_deallocate_field_target, has been
implemented to handle the deallocation of fields that are elements of field
arrays. This interface is principally used when deallocating fields with
multiple time levels. The mpas_deallocate_field interface now reverts to
declaring the dummy argument with the pointer attribute; these routines are
implemented by a call to mpas_deallocate_field_target followed by an explicit
nullification of the dummy argument.
…ime level

The mpas_pool_destroy_pool routine previously called the mpas_deallocate_field
routine for fields with only a single time level. This works well for fields
that have just one block per MPI task, but leads to invalid memory references
when fields have multiple blocks: calling mpas_pool_destroy_pool for the pool
in the first block frees memory for all blocks of a field, leaving the field
pointers in subsequent blocks pointing to deallocated memory.

As a work-around, the mpas_pool_destroy_pool routine now calls
mpas_deallocate_field_target with the keep_head_ptr argument set to .true. so
that the field pointed to by pools from other blocks will not be deallocated.
Then, only for the field owned by a block, the field is explicitly deallocated
after the call to mpas_deallocate_field_target.
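
As a rough sketch of the two-interface pattern described in these commit messages (simplified types, and not taken verbatim from the MPAS source): a target-dummy routine does the component cleanup so that elements of field arrays can be passed in, while a pointer-dummy wrapper also frees the head and hands back a disassociated pointer. The real routines additionally take the optional keep_head_ptr argument and walk block and time-level lists, which this sketch omits.

    module dealloc_sketch
       implicit none

       type :: field1d
          real, dimension(:), pointer :: array => null()
       end type field1d

    contains

       ! Loosely analogous to mpas_deallocate_field_target: a target dummy, so
       ! elements of an array of fields (e.g. time levels) can be passed in.
       ! Only the field's contents are freed here, never the field itself.
       subroutine deallocate_field_target(f)
          type (field1d), target, intent(inout) :: f
          if (associated(f % array)) deallocate(f % array)
       end subroutine deallocate_field_target

       ! Loosely analogous to mpas_deallocate_field: a pointer dummy,
       ! implemented on top of the target version. Deallocating through the
       ! pointer frees the head and leaves the caller's pointer disassociated.
       subroutine deallocate_field(f)
          type (field1d), pointer :: f
          call deallocate_field_target(f)
          deallocate(f)
       end subroutine deallocate_field

    end module dealloc_sketch

In this simplified picture, a field allocated through a pointer goes through deallocate_field, while each element of an array of fields goes through deallocate_field_target followed by a single deallocate of the array, matching the usage example in the first commit message above.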
@mark-petersen
Contributor Author

@mgduda no problem. I did that and retested, and it all works.

@matthewhoffman (Member) left a comment

Approved by inspection

@mark-petersen mark-petersen removed the request for review from pwolfram November 14, 2019 16:54
mark-petersen added a commit that referenced this pull request Nov 14, 2019
@mark-petersen mark-petersen merged commit 7ca6769 into MPAS-Dev:develop Nov 14, 2019
@mark-petersen mark-petersen deleted the pool_memory_leak branch November 14, 2019 20:34
Comment on lines +1848 to +1850
if (.not. local_keep_head_ptr) then
deallocate(f_cursor)
end if
Contributor

Please note that the IBM compiler (xlf2008_r v16.1.1-3) stops on this deallocate with a run-time error:

1525-109 Error encountered while attempting to deallocate a data object.  The program will stop.

After re-running with additional STAT and ERRMSG arguments, the deallocate returns a status value of 2: 'An invalid data object has been specified for the DEALLOCATE statement.' It appears that the standard wants the object to be either an allocatable array or an object with the target attribute. That's not the case here, with f_cursor declared with the pointer attribute.

Grepping for calls to this subroutine, all callers either

  1. call with keep_head_ptr=.true. and deallocate the object immediately after the call, or
  2. call with keep_head_ptr=.false. and nullify the object right after the call.

If I comment out the if statement, the run proceeds with no errors. It appears that all calls need to pass in keep_head_ptr=.true. (or we could just remove this logic and the corresponding call to deallocate the head pointer). Would that be OK, @mgduda?
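
For reference, a small generic example (not the MPAS code) of the STAT/ERRMSG diagnostic mentioned above: with stat= present, a failing DEALLOCATE sets a positive status and lets the program continue instead of stopping, and errmsg= captures the processor's explanation.

    program dealloc_stat_demo
       implicit none
       real, dimension(:), pointer :: p => null()
       integer :: ierr
       character(len=256) :: emsg

       allocate(p(10))
       deallocate(p)            ! fine: p was allocated through this pointer

       ! A second deallocate is an error condition; stat= lets execution
       ! continue, and errmsg= records the message (wording varies by compiler).
       deallocate(p, stat=ierr, errmsg=emsg)
       if (ierr /= 0) then
          print '(a,i0,2a)', 'deallocate failed, stat=', ierr, ': ', trim(emsg)
       end if
    end program dealloc_stat_demo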

Contributor

I'll take a closer look. I'm hesitant to remove a DEALLOCATE statement without understanding whether this might re-introduce a memory leak under certain circumstances. From what I recall of the 2003 standard, objects with the POINTER attribute implicitly have the TARGET attribute.

Contributor

Here is a branch that lets me build and run a sea-ice case: diff. I can try valgrind on the case that had memory leaks; how do I configure one?

When I try SMS.T62_oQU120.CMPASO-NYF with config_time_integrator = 'RK4', the case runs into 'SIGFPE: erroneous arithmetic operation'. Does an RK4 case run in E3SM?

Contributor

Sorry for not following up on this issue for so long. I just encountered the same issue on Summit, and I'll see what I can find.

Contributor

I just opened Issue #481 to track progress on this.

mark-petersen added a commit that referenced this pull request Jun 25, 2020
…elop

Fix framework field deallocation errors for XL compiler #578

This merge addresses several sources of pool and field deallocation
errors that arise when MPAS is built with the XL 16 compilers.

Additional discussion is at #367 (review).

Fixes #481

[bit-for-bit]
ashwathsv pushed a commit to ashwathsv/MPAS-Model that referenced this pull request Jul 21, 2020
mark-petersen added a commit to gcapodag/MPAS-Model that referenced this pull request Jan 13, 2021
I found exactly the same problem as we are having here:
MPAS-Dev#367 (review),
and it was later fixed here:
MPAS-Dev#367 (comment).

The problem was introduced again since then, in this PR:
MPAS-Dev#578