
Memory leak in RK4 #137

Closed

mark-petersen opened this issue Jan 22, 2019 · 13 comments

@mark-petersen
Contributor

We don't run RK4 very often. Running EC60to30 on 576 cores on grizzly with an optimized gnu build, the model runs for 7 days and then dies with memory errors.

@mark-petersen
Contributor Author

@xylar and @pwolfram, I'm bringing this issue over from MPAS-Dev/MPAS#1374 into the newer MPAS-Model repo.

@mark-petersen
Contributor Author

This is what I see. I am running from commit 2763249 on mark-petersen/ocean/smoothBLD_Luke_Phil_Todd. Identical runs with split-explicit time stepping do not have this problem. I expect to see the same thing on MPAS-Dev/ocean/develop, but have not checked.

gr1112.localdomain> mpirun -n 576 ocean_model_smoothBLD_Luke_Phil_Todd
App launch reported: 16 (out of 16) daemons - 540 (out of 576) procs
Operating system error: Operating system error: Cannot allocate memory
Allocation would exceed memory limit
...
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46293,1],286]
  Exit code:    1
--------------------------------------------------------------------------
gr1112.localdomain> pwd
/lustre/scratch2/turquoise/mpeterse/runs/c67n/forward

@sbrus89

sbrus89 commented Jan 23, 2019

@mark-petersen, @pwolfram, and @xylar, I also just experienced this same error today.

@mark-petersen
Contributor Author

I am working with @hrmoncada to find the memory leak. He is using both valgrind and totalview on LANL IC, running simple examples for practice and then trying to find memory leaks in MPAS-Ocean with both split-explicit and RK4 time stepping.

We are having trouble, though. The valgrind output is very hard to interpret and only reports total leaked-memory numbers. We have also struggled with totalview: we can't get it to load the code correctly and point to the actual source lines. We compiled with gnu and DEBUG (-g), but I suspect the netcdf/pnetcdf/pio libraries are confusing totalview.

@philipwjones and @amametjanov, if you could advise on how to find memory leaks, or assist @hrmoncada with this, that would be very helpful. This problem is blocking some of the work by @sbrus89 and @xylar, so it is a priority.

I scanned by eye and with grep to check for matching allocate/deallocate calls in MPAS-Ocean. I found a few unmatched ones here: #142, but not the main one causing trouble in RK4. Thanks!

@amametjanov
Contributor

Here is an E3SM test on Anvil:

$ ./cime/scripts/create_test --no-build SMS_D_Ln5.T62_oQU120.CMPASO-NYF --machine anvil --compiler gnu
$ cd $case-dir
$ ./xmlchange --id PIO_TYPENAME --val netcdf
$ ./xmlchange --id HIST_OPTION --val none
$ ./pelayout 
Comp  NTASKS  NTHRDS  ROOTPE
CPL :     36/     1;      0
ATM :     36/     1;      0
LND :     36/     1;      0
ICE :     36/     1;      0
OCN :     36/     1;      0
ROF :     36/     1;      0
GLC :     36/     1;      0
WAV :     36/     1;      0
ESP :      1/     1;      0
$ ./case.setup -r && ./case.build
$ qsub -I -q acme -A ACME -l nodes=1:ppn=36,walltime=1:00:00
$ source .env_mach_specific.sh
$ cd run/
$ mpiexec -n 1 valgrind -v --leak-check=yes --track-origins=yes ../bld/e3sm.exe : -n 35 ../bld/e3sm.exe >> e3sm.log 2>&1

And there appear to be no leaks in the default configuration:

==163534== HEAP SUMMARY:
==163534==     in use at exit: 0 bytes in 0 blocks
==163534==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==163534==
==163534== All heap blocks were freed -- no leaks are possible
==163534==
==163534== ERROR SUMMARY: 54320 errors from 91 contexts (suppressed: 26 from 9)
==163534==
==163534== 1 errors in context 1 of 91:
==163534== Conditional jump or move depends on uninitialised value(s)
...

Line numbers are printed for local sources: e.g.

==163534== Conditional jump or move depends on uninitialised value(s)
==163534==    at 0x979DE58: MPIDI_CH3_Rndv_transfer (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x9788D0B: MPIDI_CH3_PktHandler_RndvClrToSend (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x97A5AB9: MPIDI_CH3I_SMP_read_progress (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x979C289: MPIDI_CH3I_Progress_test (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x978F5AC: MPID_Isend (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x971D3CC: PMPI_Isend (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x92A6A18: mpi_isend (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpifort.so.12.0.5)
==163534==    by 0x531497: __mpas_dmpar_MOD_mpas_dmpar_exch_group_start_send (mpas_dmpar.f90:7644)
==163534==    by 0x5379B9: __mpas_dmpar_MOD_mpas_dmpar_exch_group_full_halo_exch (mpas_dmpar.f90:6916)
==163534==    by 0x537CF0: __mpas_dmpar_MOD_mpas_dmpar_field_halo_exch (mpas_dmpar.f90:6983)
==163534==    by 0xCD73B0: __ocn_time_integration_split_MOD_ocn_time_integrator_split (mpas_ocn_time_integration_split.f90:411)
==163534==    by 0xCD364E: __ocn_time_integration_MOD_ocn_timestep (mpas_ocn_time_integration.f90:110)
==163534==  Uninitialised value was created
==163534==    at 0xA638A4A: brk (in /lib64/libc-2.12.so)
==163534==    by 0xA638AF4: sbrk (in /lib64/libc-2.12.so)
==163534==    by 0x97DF7BD: _int_malloc (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x97E02D1: calloc (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x975ECFC: MPIU_Handle_obj_alloc_unsafe (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x97734D0: MPIDI_CH3U_Recvq_FDU_or_AEP (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x978E8BF: MPID_Irecv (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x974C369: MPIC_Irecv (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x957A074: MPIR_Gather_MV2_Direct (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x957BA1E: MPIR_Gather_index_tuned_intra_MV2 (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x952BF6F: MPIR_Gather_impl (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)
==163534==    by 0x952CB2B: PMPI_Gather (in /blues/gpfs/home/software/spack/opt/spack/linux-x86_64/gcc-5.3.0/mvapich2-2.2b-yzk3up5jlbhoyyfgu7tuezfbh2dhee6f/lib/libmpi.so.12.0.5)

Full log is at

/lcrc/group/acme/azamat/acme_scratch/anvil/SMS_D_Ln5.T62_oQU120.CMPASO-NYF.anvil_gnu.20190228_135852_wwch45/run/e3sm.log

@xylar
Collaborator

xylar commented Mar 1, 2019

@amametjanov, I'm pretty sure E3SM is only configured to run MPAS-Ocean with split-explicit time stepping (@mark-petersen or @jonbob, please correct me if I'm wrong). So I don't think the above test would have caught the RK4 memory leak we're concerned about.

@philipwjones
Contributor

@hrmoncada and @mark-petersen - can you point me to the valgrind output on IC? And were you running with @amametjanov's arguments (i.e., --track-origins)?

@mark-petersen
Contributor Author

I'm pretty sure E3SM is only configured to run MPAS-Ocean with split-explicit time stepping ... So I don't think the above test would have caught the RK4 memory leak we're concerned about.

That is correct.

@hrmoncada has run a test with me using stand-alone MPAS-Ocean with
config_time_integrator = 'RK4'
Both valgrind and totalview show memory leaks. To use totalview on IC, only the intel compiler worked (compiled with DEBUG=true). Here are the two types of output from totalview. The line numbers probably refer to the C-preprocessed code; just look at the MB-sized entries for now.

Leak detection source report
[screenshot from 2019-02-28 12-32-14]

Leak detection backtrace report
[screenshot from 2019-02-28 12-45-30]

@mark-petersen
Contributor Author

mark-petersen commented Mar 13, 2019

I know the general problem but not the specific solution.

Summary:
In RK4 we clone the state pool and then destroy it on each timestep, because it holds a copy of the RHS of the timestepping scheme for every single state variable. The clone is created in the framework, which cycles over all variables in the source pool to make an exact copy.

The problem must be that the pool destroy routine is missing a corresponding deallocate for a particular type of variable, or is not deallocating it correctly. A cloned pool exists only in RK4, and nowhere else in MPAS, which is why this is not a problem in E3SM.

Thanks @matthewhoffman for a helpful conversation.

Details
Looking closely at the memory trace above:
[screenshot: totalview memory trace]

Pools are created and cloned here. The line numbers don't exactly match those in the trace above, in either the .F or .f90 files; I'm not sure why.

src/core_ocean/mode_forward/mpas_ocn_time_integration_rk4.F
  74    subroutine ocn_time_integrator_rk4(domain, dt)!{{{
 207       block => domain % blocklist
 208       do while (associated(block))
 209          call mpas_pool_get_subpool(block % structs, 'state', statePool)
 
 213          call mpas_pool_create_pool(provisStatePool)
 215          call mpas_pool_clone_pool(statePool, provisStatePool, 1)

Pools are destroyed here:

src/core_ocean/mode_forward/mpas_ocn_time_integration_rk4.F
 707       block => domain % blocklist
 708       do while(associated(block))
 709          call mpas_pool_get_subpool(block % structs, 'provis_state', provisStatePool)
 711          call mpas_pool_destroy_pool(provisStatePool)
 713          call mpas_pool_remove_subpool(block % structs, 'provis_state')

In the framework, the variables are cloned here for 3D reals. The trace above indicates that field3d_real is the problem:

src/framework/mpas_pool_routines.F
 479    recursive subroutine mpas_pool_clone_pool(srcPool, destPool, overrideTimeLevels)!{{{

 757                else if (associated(dptr % r3a)) then
 758                   if ( newmem % data % contentsTimeLevs > 1) then
 759                      allocate(newmem % data % r3a(newmem % data % contentsTimeLevs))
 760                      minTimeLevels = min(dptr % contentsTimeLevs, newmem % data % contentsTimeLevs)
 761                      do j = 1, minTimeLevels
 762                         call mpas_duplicate_field(dptr % r3a(j), newmem % data % r3)
 763                         newmem % data % r3a(j) = newmem % data % r3
 764                         deallocate(newmem % data % r3)
 765                      end do
 766
 767                      do j = minTimeLevels, newmem % data % contentsTimeLevs
 768                      call mpas_duplicate_field(dptr % r3a(dptr % contentsTimeLevs), newmem % data % r3)
 769                         newmem % data % r3a(j) = newmem % data % r3
 770                         deallocate(newmem % data % r3)
 771                      end do

and destroyed here

src/framework/mpas_pool_routines.F
203    recursive subroutine mpas_pool_destroy_pool(inPool)!{{{
 329                   else if (associated(dptr % r3a)) then
 330                      do j=1,dptr % contentsTimeLevs
 331                         if (associated(dptr % r3a(j) % array)) then
 332                            deallocate(dptr % r3a(j) % array, stat=local_err)
 333                         end if
 334                      end do
 335                      deallocate(dptr % r3a, stat=local_err)

Comparing these two blocks of code, the 3D field must not be fully deallocated: every allocation made by mpas_duplicate_field must be matched by a deallocate. More soon...

@mark-petersen
Copy link
Contributor Author

I think I found it. Change the initial index from minTimeLevels to minTimeLevels+1 on the second do loop, here:

src/framework/mpas_pool_routines.F
767                      do j = minTimeLevels+1, newmem % data % contentsTimeLevs

In the code right now,
newmem % data % r3a(minTimeLevels) % array
is allocated twice, with two calls to mpas_duplicate_field. I'm not sure what happens when an array is allocated twice and deallocated once, but it is definitely a memory leak hazard.
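
For illustration only, here is a minimal stand-alone sketch of the same pattern (this is not the MPAS code; field3d, r3a, and the array sizes are made up). Because the second loop starts at minTimeLevels instead of minTimeLevels+1, the pointer at that index is re-associated by a second allocation, and the array from the first loop can never be deallocated:

program overlap_leak
   implicit none

   ! Stand-in for the MPAS field type: only the pointer component matters here.
   type :: field3d
      real, dimension(:,:,:), pointer :: array => null()
   end type field3d

   integer, parameter :: nTimeLevs = 2, minTimeLevels = 1
   type(field3d) :: r3a(nTimeLevs)
   integer :: j

   ! First loop: duplicate the existing time levels (j = 1 .. minTimeLevels).
   do j = 1, minTimeLevels
      allocate(r3a(j) % array(60, 60, 60))     ! stands in for mpas_duplicate_field
   end do

   ! Buggy second loop: it starts at minTimeLevels, so that index is allocated
   ! a second time. The pointer is simply re-associated, and the first array
   ! from the loop above becomes unreachable.
   do j = minTimeLevels, nTimeLevs              ! fix: do j = minTimeLevels+1, nTimeLevs
      allocate(r3a(j) % array(60, 60, 60))
   end do

   ! Destroy, as mpas_pool_destroy_pool does: one deallocate per index frees
   ! only the most recent allocation, so the orphaned copy leaks on every call.
   do j = 1, nTimeLevs
      if (associated(r3a(j) % array)) deallocate(r3a(j) % array)
   end do
end program overlap_leak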

@hrmoncada, @xylar, or @sbrus89, if any of you could retest your memory-leak failure with that change, it would be most helpful. I can then make a PR. There are corresponding changes for r1a, r2a, r4a, and r5a.

@mark-petersen
Contributor Author

I emailed Doug, and he kindly responded:

Yes, that would be a memory leak, and your fix looks correct.

Another tool you can use in the future is Intel Inspector, which is useful for helping to find threading race conditions, and memory issues (like memory leaks).

A memory leak is when you allocate memory and never deallocate it, so the OS thinks it's still in use but you can't actually access it anymore. If you do this once per time step, and you have many, many time steps, eventually you run out of available memory to allocate anything new, and the program dies.

In this case, you allocate space for a field, and store a pointer to it. Then you allocate space for the same field, and overwrite the pointer so it points to the new field rather than the old one. Then the first allocation is left dangling forever.
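
As a minimal sketch of the pattern Doug describes (again, not the MPAS code; the provis name and sizes are made up): each pass through the time loop allocates through a pointer and never deallocates, so memory use grows with the number of steps until an allocation fails, which matches the out-of-memory failure seen in the long RK4 runs above.

program per_step_leak
   implicit none
   real, dimension(:,:,:), pointer :: provis => null()
   integer :: step

   do step = 1, 100000
      ! "Clone" a provisional work array for this step.
      allocate(provis(60, 60, 60))
      provis = real(step)
      ! Missing deallocate(provis): on the next step the pointer is
      ! re-associated and this step's array becomes unreachable, so the
      ! leaked memory accumulates until allocate eventually fails.
   end do
end program per_step_leak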

mark-petersen added a commit that referenced this issue Apr 8, 2019
Currently, the subroutine `mpas_pool_clone_pool` double-allocates arrays
due to an index error, so that `mpas_duplicate_field` is called twice
(by mistake) but deallocate is only called once. This PR fixes that
index. This error previously caused ocean simulations using RK4 time
stepping to die with out-of-memory errors.

fixes #137
@mark-petersen
Contributor Author

Fixed by #185. Will appear on ocean/develop very soon, after develop is merged in.

@pwolfram
Contributor

I'm not 100% sure this issue was entirely resolved; it may be related to #350.
