Memory leak in RK4 #137
Comments
@xylar and @pwolfram, I'm bringing this issue over from MPAS-Dev/MPAS#1374 into the newer MPAS-Model repo. |
This is what I see. I am running from commit 2763249 on mark-petersen/ocean/smoothBLD_Luke_Phil_Todd. Identical runs with split explicit do not have this problem. I expect to see the same thing on MPAS-Dev/ocean/develop, but have not checked.
|
@mark-petersen, @pwolfram, and @xylar, I also just experienced this same error today. |
I am working with @hrmoncada to find the memory leak. He is using both valgrind and totalview on LANL IC, running simple examples for practice and then looking for memory leaks in MPAS-Ocean with both split explicit and RK4. We are having trouble, though. The valgrind output is very hard to interpret and only gives total memory leak numbers. With totalview, we can't get it to load the code correctly and show the actual lines of code. We compiled with gnu and DEBUG -g, but I suspect that the netcdf/pnetcdf/pio libraries are causing problems for totalview. @philipwjones and @amametjanov, if you could please advise on how to find memory leaks, or assist @hrmoncada with this, that would be very helpful. This problem is blocking some work by @sbrus89 and @xylar, so it is a priority. I checked for matching allocate/deallocate calls in MPAS-Ocean, both by eye and with grep. I found a few here: #142, but not the main one causing trouble in RK4. Thanks! |
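For anyone repeating the allocate/deallocate scan, here is a rough sketch of how it could be automated. This is only a textual heuristic, not a tool used in this thread: the function name `unmatched_allocations` and the sample subroutine are made up for illustration, and it only compares counts of the leading variable name in each statement.

```python
import re
from collections import Counter

# "allocate(name" but not "deallocate(name": the lookbehind rejects a match
# that is preceded by a letter or underscore.
ALLOC_RE = re.compile(r"(?<![a-z_])allocate\s*\(\s*([a-z_][\w%]*)", re.IGNORECASE)
DEALLOC_RE = re.compile(r"\bdeallocate\s*\(\s*([a-z_][\w%]*)", re.IGNORECASE)


def unmatched_allocations(source: str) -> dict:
    """Return {name: surplus} for names allocated more often than deallocated."""
    allocs = Counter()
    deallocs = Counter()
    for line in source.splitlines():
        # Drop trailing Fortran comments (crude: ignores '!' inside strings).
        code = line.split("!", 1)[0]
        for name in ALLOC_RE.findall(code):
            allocs[name.lower()] += 1
        for name in DEALLOC_RE.findall(code):
            deallocs[name.lower()] += 1
    return {n: c - deallocs[n] for n, c in allocs.items() if c > deallocs[n]}


# Hypothetical Fortran snippet, not taken from MPAS.
SAMPLE = """\
subroutine demo(n)
   allocate(work(n))      ! freed below
   allocate(tmp(n))       ! never freed
   deallocate(work)
end subroutine demo
"""

print(unmatched_allocations(SAMPLE))
```

A count like this can only flag statements that have no textual partner. It cannot see an allocate statement that is executed more often than its deallocate at run time, which may be why a scan of this kind would miss a leak like the one in RK4.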
Here is an E3SM test on Anvil:
And there appear to be no leaks in the default configuration:
Line numbers are printed for local sources: e.g.
Full log is at
|
@amametjanov, I'm pretty sure E3SM is only configured to run MPAS-Ocean with split-explicit time stepping (@mark-petersen or @jonbob, please correct me if I'm wrong). So I don't think the above test would have caught the RK4 memory leak we're concerned about. |
@hrmoncada and @mark-petersen - can you point me to valgrind output on IC? And were you running with @amametjanov 's arguments (ie track-origins)? |
That is correct. @hrmoncada has run a test with me using MPAS-Ocean stand-alone and setting |
I know the general problem but not the specific solution.

Summary: The pool destroy routine must be missing a corresponding deallocate for a particular type of variable, or it is not deallocating correctly. A clone pool only exists in RK4, and nowhere else in MPAS, which is why it is not a problem in E3SM. Thanks @matthewhoffman for a helpful conversation.

Details: Pools are created and cloned here. The line numbers don't exactly match those in the trace above, in either the F or f90 files; I'm not sure why.
Pools are destroyed here:
In framework, the variables are cloned here for 3d-reals. The trace above indicates that field3d_real is the problem
and destroyed here
Comparing these two blocks of code, the 3D field must not be fully deallocated. It must be allocated with |
I think I found it. Change the initial index from
In the code right now, @hrmoncada, @xylar or @sbrus89, if any of you could rerun your failing memory-leak case with that change, it would be most helpful. I can then make a PR. There are corresponding changes for r1a, r2a, r4a, r5a. |
I emailed Doug, and he kindly responded:
|
Currently, the subroutine `mpas_pool_clone_pool` double-allocates arrays due to an index error, so that `mpas_duplicate_field` is called twice (by mistake) but deallocate is only called once. This PR fixes that index. This error previously caused ocean simulations using RK4 time stepping to die with out-of-memory errors. fixes #137
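To make the mechanism concrete, here is a toy model of that kind of leak. It is not the MPAS code: the actual index values are not reproduced in this thread, so the loop bounds, the `Tracker` class and the function names below are all hypothetical. The only thing it is meant to show is how a loop that starts one slot too early allocates a slot twice while the destroy step frees it once.

```python
class Tracker:
    """Counts live blocks, roughly the way a leak checker would."""

    def __init__(self):
        self.live = 0

    def allocate(self, size):
        self.live += 1
        return [0.0] * size

    def deallocate(self, block):
        self.live -= 1


def clone_levels(tracker, src_levels, first_loop_index):
    """Clone a list of arrays; slot 0 is duplicated before the loop."""
    dest = [None] * len(src_levels)
    dest[0] = tracker.allocate(len(src_levels[0]))
    # Hypothetical index error: starting at 0 re-allocates slot 0 and
    # drops the only reference to the block allocated just above.
    for i in range(first_loop_index, len(src_levels)):
        dest[i] = tracker.allocate(len(src_levels[i]))
    return dest


def destroy_levels(tracker, levels):
    """Free each slot exactly once, as a destroy routine would."""
    for block in levels:
        tracker.deallocate(block)


def leaked_blocks(first_loop_index, n_levels=2, size=4):
    tracker = Tracker()
    src = [[0.0] * size for _ in range(n_levels)]
    clone = clone_levels(tracker, src, first_loop_index)
    destroy_levels(tracker, clone)
    return tracker.live


print(leaked_blocks(0), leaked_blocks(1))  # prints "1 0"
```

In this model one block is lost per clone when the loop starts too early. If a clone is made on every time step, as in a time-stepping scheme that clones a pool each step, the loss would accumulate until memory runs out.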
Fixed by #185. Will appear on |
I'm not 100% sure this issue was entirely resolved; it may be related to #350. |
We don't run RK4 very often. Running EC60to30 on 576 cores on grizzly with gnu optimized, it runs for 7 days and then dies with memory errors.