Memory debugging using Valgrind tools

VALGRIND INFO

Valgrind is a free programming tool for memory debugging, memory leak detection, and profiling.
The program is available as a module on NOAA RDHPC machines.

Description

Non-interactive tool for Linux environments
Runs the compiled binary directly, so it should work with any programming language. It targets C/C++, but it also works for Fortran.
Open source / free software
Pros
Extremely easy to set up / run
Widely available
Cons
False positives, especially in the case of MPI
For MPI, Valgrind has to run on every process (rank), though this is still a single launch command

What Errors does Valgrind/Memcheck Detect?

Reading/writing freed memory or incorrect memory areas
Uninitialized values
Incorrect freeing of memory, such as double freeing heap blocks
Misuse of memory allocation and deallocation routines: new, malloc(), free(), deallocate, etc.
Memory leaks - unintentional memory consumption often related to program logic flaws which lead to loss of memory pointers prior to deallocation.

Limitations of Valgrind

Does not perform bounds checking on arrays that are allocated statically or on the stack
Only checks programs dynamically: it may report no errors for a particular input set even though the program contains bugs
Consumes more memory (~2x)
Slows programs down (10x or more)
Optimized binaries can cause Valgrind to wrongly report uninitialized value errors

See link below for additional suggestions for MPI prep

Useful Links:
Valgrind homepage
Quickstart Guide
Explanation of error messages
MPI Debugging (have not used before, but might be useful)

In order to utilize Valgrind, the following steps should be taken:

1. Prep: Compile in debug mode and run every program up to the executable that you want to run under Valgrind. Use -c intel_debug (required flags: -g -traceback -O0).
Use -q ww3_multi to get the executable.
See Prep template job card below.

Sample prep job card:

#!/bin/sh --login
#SBATCH -n 1
#SBATCH -q debug
#SBATCH -t 00:30:00
#SBATCH -A marine-cpu
#SBATCH -J ww3_vlgd
#SBATCH -o prep_vlgd.out

cd <home>/WW3/regtests
module purge
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.0.4
module load netcdf/4.7.4
module load jasper/2.0.25
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load bacio/2.4.1
module load g2/3.4.1
module load w3nco/2.4.1
module load esmf/8_1_1

export NETCDF_CONFIG=$NETCDF_ROOT/bin/nc-config
export METIS_PATH=/scratch2/COASTAL/coastal/save/Ali.Abdolali/hpc-stack/parmetis-4.0.3
export JASPER_LIB=$JASPER_ROOT/lib64/libjasper.a
export PNG_LIB=$PNG_ROOT/lib64/libpng.a
export Z_LIB=$ZLIB_ROOT/lib/libz.a
export ESMFMKFILE=$ESMF_LIB/esmf.mk
export WW3_PARCOMPN=4

echo ' '
echo ' **********************************************'
echo ' *** WAVEWATCH III matrix of regression tests ***'
echo ' **********************************************'
echo ' '

./bin/run_test -b slurm -c intel_debug -S -T -s MPI -w work_1 -m grdset_c -f -p srun -n 1 -q ww3_multi -o all ../model <test>

echo ' '
echo ' **************************************************************'
echo ' * end of WAVEWATCH III matrix of regression tests *'
echo ' **************************************************************'
echo ' '

2. Run: Run the executable with Valgrind
valgrind --leak-check=full /<path-to-exe>/<executable> (e.g., WW3/model/exe/ww3_multi)
Other available flags:
--leak-check=full: "each individual leak will be shown in detail".
--show-leak-kinds=all: Show all of "definite, indirect, possible, reachable" leak kinds in the "full" report.
--track-origins=yes: Favor useful output over speed. This tracks the origins of uninitialized values, which could be very useful for memory errors. Consider turning off if Valgrind is unacceptably slow.
--verbose: Can tell you about the unusual behavior of your program. Repeat for more verbosity.
--log-file=<file>: Write the report to a file. Useful when output exceeds terminal space.
--show-reachable=yes: Also report blocks that are still reachable at exit, to find absolutely every unpaired call to allocate/deallocate.
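
As a minimal sketch, the flags listed above can be bundled into a small shell function. The function name run_memcheck and its default log name are assumptions made for illustration; only the Valgrind flags come from the list above.

```shell
# Hypothetical helper: run one executable under Memcheck with the flags above.
#   $1 = path to the executable
#   $2 = log file name (optional, defaults to vlgd.out)
run_memcheck() {
    exe=$1
    log=${2:-vlgd.out}
    # Fail early with a hint if valgrind is not on the PATH.
    if ! command -v valgrind >/dev/null 2>&1; then
        echo "valgrind not found; try 'module load valgrind'" >&2
        return 127
    fi
    valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes \
        --log-file="$log" "$exe"
}
```

A call such as run_memcheck /<path-to-exe>/ww3_multi vlgd.out would then write the report to vlgd.out. If the run is unacceptably slow, --track-origins=yes is the first flag to drop.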

See the summary at the bottom of the output for some suggestions.
See Valgrind run card below.

Sample valgrind run card:

#!/bin/sh --login
#SBATCH -n 1
#SBATCH -q batch
#SBATCH -t 1:00:00
#SBATCH -A marine-cpu
#SBATCH -J ww3_vlgd
#SBATCH -o prep_vlgd.out

cd <home>/WW3/regtests
module purge
module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.0.4
module load netcdf/4.7.4
module load jasper/2.0.25
module load zlib/1.2.11
module load png/1.6.35
module load hdf5/1.10.6
module load bacio/2.4.1
module load g2/3.4.1
module load w3nco/2.4.1
module load esmf/8_1_1
module load valgrind

export NETCDF_CONFIG=$NETCDF_ROOT/bin/nc-config
export METIS_PATH=/scratch2/COASTAL/coastal/save/Ali.Abdolali/hpc-stack/parmetis-4.0.3
export JASPER_LIB=$JASPER_ROOT/lib64/libjasper.a
export PNG_LIB=$PNG_ROOT/lib64/libpng.a
export Z_LIB=$ZLIB_ROOT/lib/libz.a
export ESMFMKFILE=$ESMF_LIB/esmf.mk
export WW3_PARCOMPN=4
export VGDIR=/regtests/<test>/work_1
export VGEXE=/WW3/model/exe

echo ' '
echo ' *********************************************'
echo ' *** WAVEWATCH III ---- VALGRIND ***'
echo ' **********************************************'
echo ' '

cd ${VGDIR}
valgrind --leak-check=full --show-reachable=yes --log-file=vlgd.out ${VGEXE}/ww3_multi
#output report will be in outfile specified above (vlgd.out)

echo ' '
echo ' **************************************************************'
echo ' * end of WAVEWATCH III --*-- valgrind *'
echo ' **************************************************************'
echo ' '
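
The card above runs a single task. For a multi-task MPI run, one possible sketch follows, assuming srun is the launcher (as in the prep card). The function name run_memcheck_mpi is hypothetical. The %p in --log-file is expanded by Valgrind to the process ID, so each task writes its own log instead of all tasks sharing one file.

```shell
# Hypothetical helper: launch one Memcheck instance per MPI task with srun.
#   $1 = number of tasks
#   $2 = path to the executable
run_memcheck_mpi() {
    ntasks=$1
    exe=$2
    # Single quotes keep %p literal so that Valgrind, not the shell, expands it.
    srun -n "$ntasks" valgrind --leak-check=full \
        --log-file='vlgd.%p.out' "$exe"
}
```

This is still one launch command, but it produces one vlgd.<pid>.out report per task, and each report has to be analyzed separately. The #SBATCH -n value in the job card would need to match the task count.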

3. Analyze: Start at the bottom of the output, with the LEAK SUMMARY.
Focus on definitely lost blocks, starting with the highest-yield items and/or the easiest to fix. Some will require more sleuthing around the code than others.
For memory leaks, you are essentially looking for an ALLOCATE without a matching DEALLOCATE. Valgrind shows you the ALLOCATE; you need to assess whether it needs to be DEALLOCATE'd and, if so, where that statement should go. (Line numbers in the Valgrind output refer to the model/src/*.F90 files.)
Loosely speaking, you are probably OK to skip past MPI-related reports initially. MPI, especially anything init-related, will produce the most false positives.
For example, "Conditional jump or move depends on uninitialised value(s)" is typical in the case of MPI init.
You will start to build intuition pretty quickly, so you will get a better idea of what may be false positives and which items are likely to be easier than others.
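
To find the highest-yield items quickly, the "definitely lost" records in the log can be sorted by size. The sketch below assumes the usual Memcheck record line of the form "N bytes in M blocks are definitely lost in loss record X of Y"; the function name rank_definite_leaks is hypothetical.

```shell
# Hypothetical helper: list "definitely lost" records, largest first.
#   $1 = Valgrind log file (e.g. vlgd.out)
rank_definite_leaks() {
    grep 'are definitely lost in loss record' "$1" |
        sed -e 's/^==[0-9]*== *//' |                           # drop the ==PID== prefix
        awk '{ n = $1; gsub(/,/, "", n); print n "\t" $0 }' |  # numeric sort key without commas
        sort -k1,1nr |                                         # largest byte count first
        cut -f2-                                               # drop the sort key again
}
```

Each printed line carries its loss record number, which can be searched for in the full log to find the matching stack trace and the ALLOCATE it points to.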

Ideally, we would like to have zero leaks:
HEAP SUMMARY:
in use at exit: 0 bytes in 0 blocks
total heap usage: 636 allocs, 636 frees, 25,393 bytes allocated

All heap blocks were freed -- no leaks are possible

ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

In reality, however, the output often looks more like this:
==145333==
==145333== HEAP SUMMARY:
==145333== in use at exit: 20,558 bytes in 7 blocks
==145333== total heap usage: 25 allocs, 18 frees, 32,653 bytes allocated
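
The gap between allocs and frees in the HEAP SUMMARY gives a quick count of blocks that were never freed: 25 - 18 = 7 in the output above, which matches the 7 blocks still in use at exit. A minimal sketch that pulls this number out of a log file (the function name unfreed_allocs is hypothetical):

```shell
# Hypothetical helper: print (allocs - frees) from the HEAP SUMMARY in a log.
#   $1 = Valgrind log file (e.g. vlgd.out)
unfreed_allocs() {
    awk '/total heap usage:/ {
        a = $5; f = $7            # fields: ==PID== total heap usage: A allocs, F frees, ...
        gsub(/,/, "", a)          # strip thousands separators
        gsub(/,/, "", f)
        print a - f
    }' "$1"
}
```

For the HEAP SUMMARY shown above this prints 7; for the ideal case (636 allocs, 636 frees) it prints 0.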

Several kinds of leaks reported:

"definitely lost": leaking memory -- fix it!
"possibly lost": generally indicates leaking memory -- fix it!
"indirectly lost": usually disappears once the "definitely lost" block that caused the indirect leak is fixed.
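
Since "definitely lost" is the category to fix first, a job card could check it automatically. The sketch below assumes the LEAK SUMMARY line has the form "definitely lost: N bytes in M blocks"; the function name definitely_lost_bytes is hypothetical.

```shell
# Hypothetical helper: print the total "definitely lost" bytes in a log.
# Prints 0 when the log has no such line (e.g. when all heap blocks were freed).
#   $1 = Valgrind log file (e.g. vlgd.out)
definitely_lost_bytes() {
    awk '$2 == "definitely" && $3 == "lost:" {
        n = $4
        gsub(/,/, "", n)          # strip thousands separators
        total += n
    }
    END { print total + 0 }' "$1"
}
```

A possible use after the valgrind call in the run card: [ "$(definitely_lost_bytes vlgd.out)" -eq 0 ] || echo "definite leaks found, see vlgd.out".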
