Pool memory leak #367
Conversation
@pwolfram, please cherry-pick 1670ba45 and test your hurricane case. In the simple test case you provided here: #350 (comment), it previously ran 26k iterations before a memory error. With this fix it runs 79k iterations, so a factor of three on that test. Obviously, there is still a leak, but this might get you through a queue cycle. I spent several more hours replacing other deallocate statements, but some of the replacements caused errors. I also looked for other memory leaks without luck.
@mgduda this was a minor memory leak that came up last March that we didn't fix. It is now stopping a set of simulations, so it is a priority. You gave a potential solution here: #185 (comment), which I implemented in this PR, but it only partially fixes the memory leak. A few days after that, here: #185 (comment), you said it needs a different solution than you previously thought. I've been testing alterations "experimentally", but I honestly don't understand the pool clone and destroy routines very well. Any help or advice you could provide would be greatly appreciated.
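For context, a minimal sketch of the cycle under discussion, assuming the call pattern of the ocean RK4 integrator; the variable names here are illustrative, not a verbatim excerpt of the MPAS source:

```fortran
use mpas_derived_types, only : mpas_pool_type
use mpas_pool_routines

type (mpas_pool_type), pointer :: statePool        ! owned by the block
type (mpas_pool_type), pointer :: provisStatePool  ! provisional RK4 state

! Each RK4 timestep clones the state pool into a provisional pool and
! destroys it at the end of the step; anything mpas_pool_destroy_pool
! fails to free therefore leaks once per timestep.
call mpas_pool_create_pool(provisStatePool)
call mpas_pool_clone_pool(statePool, provisStatePool, 1)

! ... provisional RK4 stages use provisStatePool ...

call mpas_pool_destroy_pool(provisStatePool)
```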
There are other deallocate lines, like the set sketched below, and I tried to alter them the same way; I made that change for all data types. A second set just after that, covering fields with multiple time levels, is also a candidate for a memory leak due to the extra field information that remains, but the analogous alteration gets a compiler error, presumably because an array element cannot be passed to a pointer dummy argument.
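Since the code snippets referenced above did not survive extraction, here is a hedged reconstruction of the kind of code being discussed, assuming the pool-member naming convention (`r1` for a single-time-level 1-d real field, `r1a` for an array of time levels); the exact lines in the original comment may differ:

```fortran
! First set: the original pool-destroy pattern frees the data array but
! leaks the rest of the field type (attLists, constituentNames, ...):
deallocate(dptr % data % r1 % array)
deallocate(dptr % data % r1)

! The attempted alteration, freeing the whole field instead:
call mpas_deallocate_field(dptr % data % r1)

! Second set: fields with multiple time levels are stored as an array of
! field types; the analogous alteration fails to compile because an
! array element cannot be passed to a pointer dummy argument:
do j = 1, size(dptr % data % r1a)
   call mpas_deallocate_field(dptr % data % r1a(j))   ! compile error
end do
```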
@mark-petersen, thanks for the help. I've cherry-picked the commit and am running the job now and can report back later about its status.
@mark-petersen I should be able to look more closely at this by the end of the week. The discussion on PR #185 captures my point of concern regarding the approach taken at present.
@mgduda thanks for looking. The changes I made in this PR only extended @pwolfram's simulations by ~20%, and they still die with an 'out of memory' error. I don't know how to completely purge the memory allocated in the mpas_pool_clone_pool routine, so any help you can provide is greatly appreciated.
@mgduda any thoughts on this? Any help you can provide would be greatly appreciated.
@mgduda, we are stuck on this, and it's limiting for some of our simulations. If you have any suggestions, that would be extremely helpful. I don't know how to find further memory leaks by simply looking for missing deallocates or nullifies.
@mgduda, can you please provide some advice on this issue when you get a chance? I'm sure you are very busy, but if you don't have any time to engage at all, can you please just let us all know so we can move forward? Thanks!
@xylar, do you know when the next one is scheduled?
@pwolfram, they aren't scheduled anymore. You have to ask for one as topics come up that need to be discussed, like this one. If posting here doesn't do the trick, maybe a quick email to the core MPAS team would do it.
@pwolfram @mark-petersen @xylar Sorry, all, for not taking a closer look long ago! I'll start this afternoon and let you all know what I'm able to find by tomorrow afternoon at the latest.
@mgduda thanks. Let us know if you need a test case to exemplify the behavior.
@mark-petersen @pwolfram I have a tentative set of changes that I think will work. I've pushed these to a branch named 'framework/destroy_pool_mem_leak' in my fork of the repository. Would you all be willing to try merging this branch (which is based on the v7.0 tag) into your working branch to see if it addresses the memory leaks you're seeing? My approach was slightly different from what was done in this PR, so I created a separate branch.
@mgduda, thanks for taking a look at this! I'm testing your branch now on a test case that previously failed due to this issue. Just for my own future reference, the simulation began at 2012-10-10_00:00:00 and crashed due to insufficient virtual memory at 2012-10-21_22:13:45. I'll report back on whether or not the changes allow the run to complete.
@mgduda, thank you so VERY MUCH! I'm testing it now, and it is absolutely solid for my simplified test case. I'll defer to @sbrus89 to confirm on the real-world case next, but I strongly expect not to see an issue. We can let you know.
After discussion with @mark-petersen, and just to document: the run went on the order of 130,000 clone / destroy pool cycles while I checked the memory footprint. @mark-petersen will test more broadly tomorrow. Note the functionality this affects is probably limited to ocean RK4 time integration and the shallow water core, because the issue was with the pool clone and destroy routines.
My run with @mgduda's changes went from 2012-10-10_00:00:00 to 2012-10-31_22:02:55 (when it timed out after reaching the 4-hour interactive queue time limit), as compared to the previous memory-leak failure at 2012-10-21_22:13:45. This was only a little over 2 simulation days short of the stop time for the test case. Based on these results and @pwolfram's and @mark-petersen's testing, I suspect these changes solve the problem. I'll go ahead and run on the standard queue to make sure the run completes, but I don't anticipate any issues.
```fortran
if ((.not. local_keep_head_ptr) .or. (.not. associated(f_cursor, f))) then
   deallocate(f_cursor)
```
@mgduda I did a local test merge into ocean/develop. I get a seg fault here when I compile with `intel/17.0.1` in debug mode, and also with gnu, but only if I run multiple blocks per core. That is, if the flag `config_number_of_blocks` is set to a different number from the `mpirun -n` number of cores. For example,

```
config_number_of_blocks = 8
mpirun -n 4 ./ocean_model
```

causes a seg fault, but

```
config_number_of_blocks = 8
mpirun -n 8 ./ocean_model
```

runs fine. Using `config_number_of_blocks = 0` is also fine. This test worked before the merge.
@mark-petersen I just pushed three new commits to the `framework/destroy_pool_mem_leak` branch.
@mgduda I added your three new commits, and they worked great! They solved the problem we saw before on finalize. Thank you! @sbrus89 or @pwolfram, if either of you has time to test these new commits on your previous case, that would be great. I would like to merge this in.
@mark-petersen The
In cases where we need to deallocate a field that is an element of an array of field types, we could not pass an element of this field array to the mpas_deallocate_field routine, since the argument was a pointer. The field argument to the mpas_deallocate_field routines is now a target rather than a pointer, which allows elements of arrays of fields to be deallocated. However, since we cannot deallocate an individual element of an array, an optional argument, keep_head_ptr, has been added. When set to true, the mpas_deallocate_field routine will not attempt to deallocate the argument to the routine (though the contents of that argument's field type -- array, constituentNames, and attLists -- will be deallocated). Now, it is possible to deallocate an array of field types, e.g.:

```fortran
type (field1DReal), dimension(:), allocatable :: field_array

! ... code that allocates field_array ...

do i = 1, size(field_array)
   call mpas_deallocate_field(field_array(i), keep_head_ptr=.true.)
end do
deallocate(field_array)
```

*** Note: Deallocating an allocated pointer to a single field works as before, with the exception that the actual argument to the mpas_deallocate_field routine is not nullified (since the dummy argument is now a target and not a pointer). Therefore, it is recommended to explicitly nullify the actual argument after a call to mpas_deallocate_field, e.g.:

```fortran
type (field1DReal), pointer :: f

! ... code that allocates f ...

call mpas_deallocate_field(f)
nullify(f)
```

This commit also adds code to the mpas_deallocate_field routines to deallocate the constituentNames member of the field type if it is allocated.
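A sketch of the resulting dummy-argument declaration, using an illustrative name for one of the specific routines behind the `mpas_deallocate_field` interface (the real routine bodies differ):

```fortran
subroutine mpas_deallocate_field1d_real(f, keep_head_ptr)
   type (field1DReal), target :: f               ! was: pointer
   logical, intent(in), optional :: keep_head_ptr

   logical :: local_keep_head_ptr

   local_keep_head_ptr = .false.
   if (present(keep_head_ptr)) local_keep_head_ptr = keep_head_ptr

   ! ... free f % array, f % constituentNames, and f % attLists, and,
   ! unless local_keep_head_ptr is set, the field type itself ...
end subroutine mpas_deallocate_field1d_real
```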
Previously, the mpas_pool_destroy_pool routine simply deallocated the 'array' component of the fields that were stored in the pool to be freed. This led to a memory leak, which is fixed by calling mpas_deallocate_field for each field in a pool.
…nsion

When querying the decomposed dimension for a 4-d real field with a time dimension in the mpas_pool_link_parinfo routine, the routine previously attempted to access `poolMem % r4 % dimNames(4)` when it should have accessed `poolMem % r4a(1) % dimNames(4)`. The former access is appropriate only for 4-d real fields with no time dimension.
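Side by side, the access before and after the fix (the `decompType` variable is hypothetical; only the member accesses come from the commit):

```fortran
! Before: valid only for 4-d real fields with no time dimension
decompType = pool_get_member_decomp_type(poolMem % r4 % dimNames(4))

! After: 4-d real field with a time dimension
decompType = pool_get_member_decomp_type(poolMem % r4a(1) % dimNames(4))
```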
…le blocks When the keep_head_ptr argument to mpas_deallocate_field was .true., the mpas_deallocate_field routines previously attempted to deallocate all blocks for the input field besides the first block. This led to problems in freeing fields with multiple blocks and multiple time levels, since the blocks of the field were each elements of an array of fields (the array containing the time levels). Now, setting the optional argument keep_head_ptr to .true. in a call to one of the mpas_deallocate_field routines will prevent the containing field type for all blocks from being deallocated. Memory leaks are still avoided, since the field arrays for all blocks should eventually be deallocated (e.g., with calls to mpas_deallocate_block in the loop over blocks in lines 124-128 of mpas_domain_routines.F).
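A sketch of the per-block cleanup the commit relies on; the loop shape is assumed from the commit's reference to mpas_domain_routines.F and is not a verbatim excerpt:

```fortran
block_ptr => domain % blocklist
do while (associated(block_ptr))
   next_block => block_ptr % next
   ! Frees the field arrays for this block, so fields whose containing
   ! type was kept by keep_head_ptr=.true. do not leak.
   call mpas_deallocate_block(block_ptr)
   block_ptr => next_block
end do
```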
The previous implementations of the mpas_deallocate_field routines would not return a nullified field pointer when the optional keep_head_ptr argument was .false., since the dummy argument to these routines had the target attribute rather than the pointer attribute. This could lead to problems if the calling code didn't explicitly nullify the field pointer but later checked on the association status of that pointer. To remedy this issue, a new interface, mpas_deallocate_field_target, has been implemented to handle the deallocation of fields that are elements of field arrays. This interface is principally used when deallocating fields with multiple time levels. The mpas_deallocate_field interface now reverts to declaring the dummy argument with the pointer attribute; these routines are implemented by a call to mpas_deallocate_field_target followed by an explicit nullification of the dummy argument.
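The layering described here, sketched for one illustrative specific routine (names assumed to follow the per-type convention):

```fortran
subroutine mpas_deallocate_field1d_real(f)
   type (field1DReal), pointer :: f   ! pointer attribute restored

   ! Delegate the actual freeing to the target-based interface, then
   ! nullify the caller's pointer so later association checks behave.
   call mpas_deallocate_field_target(f)
   nullify(f)
end subroutine mpas_deallocate_field1d_real
```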
…ime level

The mpas_pool_destroy_pool routine previously called the mpas_deallocate_field routine for fields with only a single time level. This works well for fields that have just one block per MPI task, but leads to invalid memory references when fields have multiple blocks: calling mpas_pool_destroy_pool for the pool in the first block frees memory for all blocks of a field, leaving the field pointers in subsequent blocks pointing to deallocated memory. As a work-around, mpas_pool_destroy_pool now calls mpas_deallocate_field_target with the keep_head_ptr argument set to .true. so that the field pointed to by pools from other blocks will not be deallocated. Then, only for the field owned by a block, the field is explicitly deallocated after the call to mpas_deallocate_field_target.
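A sketch of that work-around, with a hypothetical `ownsBlock` flag standing in for however the pool identifies the owning block:

```fortran
! Free the field's contents everywhere, but never the containing type:
call mpas_deallocate_field_target(dptr % data % r1, keep_head_ptr=.true.)

! Only the block that owns the field frees the containing type itself,
! so pools in other blocks are not left pointing at freed memory.
if (ownsBlock) then
   deallocate(dptr % data % r1)
end if
```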
@mgduda no problem. I did that and retested, and it all works.
Approved by inspection
The destroy_pool subroutine does not deallocate the memory associated with a field, because it uses a simple Fortran `deallocate` rather than the more thorough `mpas_deallocate_field` routine. This has become a problem when using RK4 time-stepping in the ocean model, as we hit an out-of-memory error. fixes #350
```fortran
if (.not. local_keep_head_ptr) then
   deallocate(f_cursor)
end if
```
Please note that the IBM compiler (`xlf2008_r` v16.1.1-3) stops on this `deallocate` with a run-time error:

```
1525-109 Error encountered while attempting to deallocate a data object. The program will stop.
```

After re-running with additional `STAT` and `ERRMSG` arguments, `deallocate` returns a status value of 2: "An invalid data object has been specified for the DEALLOCATE statement." It appears that the standard wants the object to be either an allocatable array or an object with the `target` attribute. That's not the case here with `f_cursor`, which is declared with the `pointer` attribute.

Grepping for calls to this subroutine, all callers either

- call with `keep_head_ptr=.true.` and deallocate the object immediately after the call, or
- call with `keep_head_ptr=.false.` and nullify the object right after the call.

If I comment out the `if` statement, the run proceeds with no errors. It appears that all calls need to pass in `keep_head_ptr=.true.` (or just remove this logic and the corresponding call to `deallocate` the head pointer). Would that be OK, @mgduda?
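For reference, the diagnostic re-run described above can be reproduced with the standard `STAT`/`ERRMSG` form of the statement (variable names here are illustrative):

```fortran
integer :: ierr
character(len=512) :: emsg

deallocate(f_cursor, stat=ierr, errmsg=emsg)
if (ierr /= 0) then
   ! XLF reports stat=2 here: an invalid data object for DEALLOCATE
   write(0, *) 'deallocate failed, stat=', ierr, ': ', trim(emsg)
end if
```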
I'll take a closer look. I'm hesitant to remove a `DEALLOCATE` statement without understanding whether this might re-introduce a memory leak under certain circumstances. From what I recall of the 2003 standard, objects with the `POINTER` attribute implicitly have the `TARGET` attribute.
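A minimal illustration of the constraint being debated (not MPAS code): a pointer may appear in `DEALLOCATE` only when it is associated with the whole of an object that was itself pointer-allocated, regardless of `TARGET` semantics:

```fortran
real, pointer :: p(:)
real, target  :: a(10)

allocate(p(10))
deallocate(p)   ! legal: p is associated with a whole pointer-allocated object

p => a          ! p now points at a non-allocated target
! deallocate(p) here would be invalid -- this is the situation XLF
! rejects when f_cursor points into an array of field types
```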
Here is a branch that lets me build and run a sea-ice case: diff. I can try valgrind on the case that had memory leaks. How do I configure one?

When I try `SMS.T62_oQU120.CMPASO-NYF` with `config_time_integrator = 'RK4'`, the case runs into `SIGFPE: erroneous arithmetic operation`. Does an RK4 case run in E3SM?
Sorry for not following up on this issue for so long. I just encountered the same issue on Summit, and I'll see what I can find.
I just opened Issue #481 to track progress on this.
I found exactly the same problem as we are having here: MPAS-Dev#367 (review), and I noted it was later fixed here: MPAS-Dev#367 (comment). The problem was re-introduced after that, in this PR: MPAS-Dev#578