
hdf5-iotest fails #11

Closed
brtnfld opened this issue Dec 17, 2021 · 57 comments
@brtnfld

brtnfld commented Dec 17, 2021

hdf5-iotest (https://github.com/brtnfld/hdf5-iotest) is a performance benchmark that compares the effects of different HDF5 parameters and I/O patterns. I tried it on Summit with vol-log, and it crashes in H5Dwrite on test #18. Test #18 outputs every step of rank-2 chunked arrays, with fill values on, alignment, a metadata block size of 2048, the earliest file format, and MPI collective I/O.

For the same test case on a local Linux box it worked fine.

On Summit,

Currently Loaded Modules:

 1) lsf-tools/2.0
 2) hsi/5.0.2.p5
 3) darshan-runtime/3.3.0-lite
 4) xalt/1.2.1
 5) DefApps
 6) gcc/11.1.0
 7) zlib/1.2.11
 8) cmake/3.21.3
 9) spectrum-mpi/10.4.0.3-20210112
10) essl/6.3.0
11) netlib-lapack/3.9.1
12) netlib-scalapack/2.1.0
13) fftw/3.3.9
14) boost/1.77.0
15) nsight-compute/2021.2.1
16) nsight-systems/2021.3.1.54
17) cuda/11.0.3
18) python/3.8-anaconda3

I used HDF5 1.13 and hdf5-iotest master. To compile hdf5-iotest:

#!/bin/bash
#autogen.sh

HDF5=$HOME/packages/hdf5-1.13/build/hdf5
export LDFLAGS="-L$HDF5/lib"
export LD_LIBRARY_PATH="$HDF5/lib:$LD_LIBRARY_PATH"
export LIBS="-lhdf5"
export CC="mpicc"
export CPPFLAGS="-I$HDF5/include"
export CFLAGS="-g"
../configure --prefix=$PWD
make 
make install

The program takes the input file "hdf5_iotest.ini":

[DEFAULT]
version = 0
steps = 10 
arrays = 10 
rows = 42 
columns = 42 
#process-rows = 1764
process-rows = 42
#process-rows = 882
process-columns = 1
# [weak, strong]
scaling = weak
# align along increment [bytes] boundaries
alignment-increment = 16777216 
# minimum object size [bytes] to force alignment (0=all objects)
alignment-threshold = 0
# minimum metadata block allocation size [bytes]
meta-block-size = 2048
# [posix, core, mpi-io-uni]
single-process = mpi-io-uni
[CUSTOM]
one-case = 18
#one-case = 119 
#gzip = 6
#szip = H5_SZIP_NN_OPTION_MASK, 8
#async = 1
#delay = 0s
hdf5-file = hdf5_iotest.h5
csv-file = hdf5_iotest.csv
#split = 1
#restart = 1

To run the program:


#!/bin/bash
###BSUB -P CSC300
#BSUB -P CSC444
#BSUB -W 01:00
# power 42
#BSUB -nnodes 1
#BSUB -J IOTEST 
#BSUB -o IOTEST.%J
#BSUB -e IOTEST.%J
##SMT1 -- 1 HW Thread per physical core
##SMT4 -- All 4 HW threads are active (Default)
##BSUB -alloc_flags smt1
# 42 physical cores, (21 each cpu), per node
# 84 per cpu, 168 total

module unload darshan-runtime

JID=$LSB_JOBID
cd $MEMBERWORK/csc444
mkdir iotest.$JID
cd iotest.$JID
EXEC=hdf5_iotest
cp $LS_SUBCWD/$EXEC .
cp $LS_SUBCWD/hdf5_iotest_000042.csv .
cp $LS_SUBCWD/hdf5_iotest.ini .
NPROCS_MAX=$(($LSB_MAX_NUM_PROCESSORS - 1))
NNODES=$(($NPROCS_MAX / 42))
echo "NUMBER OF NNODES, NPROCS_MAX = $NNODES $NPROCS_MAX"

NPROCS="42"

  HDF_DIR=$HOME/packages/hdf5-1.13/build/hdf5
  export LOGVOL_DEBUG_ABORT_ON_ERR=1
  export LD_LIBRARY_PATH="$HDF_DIR/lib:$HOME/packages/vol-log-based/build/log-vol/lib:$LD_LIBRARY_PATH"
  export HDF5_PLUGIN_PATH="$HOME/packages/vol-log-based/build/log-vol/lib"
  export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"

for i in ${NPROCS}
do
  foo=$(printf "%06d" $i)
  echo "csv-file = hdf5_iotest${C}_${foo}.csv" >> hdf5_iotest.ini
  echo "process-rows = $i" >> hdf5_iotest.ini
  jsrun -n $i ./$EXEC 
  mv *.csv $LS_SUBCWD
  sed -i '$d' hdf5_iotest.ini
  sed -i '$d' hdf5_iotest.ini
  du hdf5_iotest.h5
  ls hdf5_iotest.h5
  $HOME/packages/hdf5/build/hdf5/bin/h5stat -S hdf5_iotest.h5
  #cp stdio_hdf5_iotest.${foo} $LS_SUBCWD/
#  rm -f *.h5
done

The only information I could get from the core files was:

#0 0x0000200000863618 in raise () from /lib64/power9/libc.so.6
Backtrace stopped: Cannot access memory at address 0x7fffcf4085c0

@khou2020

Do you have the output error message? It looks like I got a different error:

Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be.
Error checking ibm license.
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

@brtnfld

brtnfld commented Dec 20, 2021

free(): invalid pointer
[h33n04:269834] *** Process received signal ***
[h33n04:269834] Signal: Aborted (6)
[h33n04:269834] Signal code: (-6)
[h33n04:269834] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000504d8]
[h33n04:269834] [ 1] /lib64/power9/libc.so.6(gsignal+0xd8)[0x200000863618]
[h33n04:269834] [ 2] /lib64/power9/libc.so.6(abort+0x164)[0x200000843a2c]
[h33n04:269834] [ 3] /lib64/power9/libc.so.6(+0x8f43c)[0x2000008af43c]
[h33n04:269834] [ 4] /lib64/power9/libc.so.6(+0x98c08)[0x2000008b8c08]
[h33n04:269834] [ 5] /lib64/power9/libc.so.6(+0x9af3c)[0x2000008baf3c]
[h33n04:269834] [ 6] /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib/libH5VL_log.so.0.0.0(_Z20H5VL_log_filei_bfreeP15H5VL_log_file_tPv+0x54)[0x2000277abed8]
[h33n04:269834] [ 7] /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib/libH5VL_log.so.0.0.0(_Z23H5VL_log_dataseti_writeP15H5VL_log_dset_tllP19H5VL_log_selectionslPKvPPv+0xa94)[0x20002779a290]
[h33n04:269834] [ 8] /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib/libH5VL_log.so.0.0.0(Z22H5VL_log_dataset_writePvllllPKvPS+0x210)[0x200027789ee0]
[h33n04:269834] [ 9] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(+0x3968d0)[0x2000004368d0]
[h33n04:269834] [10] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(H5VL_dataset_write+0x8c)[0x20000043b58c]
[h33n04:269834] [11] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(+0xe5fe8)[0x200000185fe8]
[h33n04:269834] [12] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(H5Dwrite+0xac)[0x20000018968c]
[h33n04:269834] [13] ./hdf5_iotest[0x1000c664]
[h33n04:269834] [14] ./hdf5_iotest[0x100076e8]
[h33n04:269834] [15] /lib64/power9/libc.so.6(+0x24078)[0x200000844078]
[h33n04:269834] [16] /lib64/power9/libc.so.6(__libc_start_main+0xb4)[0x200000844264]
[h33n04:269834] *** End of error message ***
free(): invalid pointer

@brtnfld

brtnfld commented Dec 20, 2021

If you comment out one-case = 18, you can check whether the other cases, 1-17, are working for your build.

@khou2020

This particular bug should be fixed. Can you try again?

Can you point out the location of case 18? Was it the cause of the error I saw?

@brtnfld

brtnfld commented Dec 20, 2021

It fails in H5Dwrite for case 18. I reran with the latest commit; same error. The core files are not useful, since they seem to be getting truncated.


[e11n09:88401] Signal: Aborted (6)
[e11n09:88401] Signal code:  (-6)
[e11n09:88401] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000504d8]
[e11n09:88401] [ 1] /lib64/power9/libc.so.6(gsignal+0xd8)[0x200000863618]
[e11n09:88401] [ 2] /lib64/power9/libc.so.6(abort+0x164)[0x200000843a2c]
[e11n09:88401] [ 3] /lib64/power9/libc.so.6(+0x8f43c)[0x2000008af43c]
[e11n09:88401] [ 4] /lib64/power9/libc.so.6(+0x98c08)[0x2000008b8c08]
[e11n09:88401] [ 5] /lib64/power9/libc.so.6(+0x9af3c)[0x2000008baf3c]
[e11n09:88401] [ 6] /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib/libH5VL_log.so.0.0.0(_Z20H5VL_log_filei_bfreeP15H5VL_log_file_tPv+0x54)[0x2000277ac028]
[e11n09:88401] [ 7] /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib/libH5VL_log.so.0.0.0(_Z23H5VL_log_dataseti_writeP15H5VL_log_dset_tllP19H5VL_log_selectionslPKvPPv+0xab0)[0x20002779a5e8]
[e11n09:88401] [ 8] /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib/libH5VL_log.so.0.0.0(_Z22H5VL_log_dataset_writePvllllPKvPS_+0x210)[0x20002778a080]
[e11n09:88401] [ 9] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(+0x3968d0)[0x2000004368d0]
[e11n09:88401] [10] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(H5VL_dataset_write+0x8c)[0x20000043b58c]
[e11n09:88401] [11] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(+0xe5fe8)[0x200000185fe8]
[e11n09:88401] [12] /ccs/home/brtnfld/packages/hdf5-1.13/build/hdf5/lib/libhdf5.so.300(H5Dwrite+0xac)[0x20000018968c]
[e11n09:88401] [13] ./hdf5_iotest[0x1000c664]
[e11n09:88401] [14] ./hdf5_iotest[0x100076e8]
[e11n09:88401] [15] /lib64/power9/libc.so.6(+0x24078)[0x200000844078]
[e11n09:88401] [16] /lib64/power9/libc.so.6(__libc_start_main+0xb4)[0x200000844264]
[e11n09:88401] *** End of error message ***
CC
free(): invalid pointer
ERROR:  One or more process (first noticed rank 5) terminated with signal 6 (core dumped)
1       hdf5_iotest.h5
hdf5_iotest.h5
Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be.
Error checking ibm license.
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,
FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

@khou2020

khou2020 commented Dec 20, 2021

Can you post all output prior to the error? Also, what are your build configurations for HDF5 and log-vol?
On my side it did not report any error.
This is all the output I got; it seems it skipped some tests.

NUMBER OF NNODES, NPROCS_MAX = 1 42
Output: hdf5_iotest_000042.csv
Config loaded from 'hdf5_iotest.ini':
steps=10, arrays=10, rows=42, columns=42, proc-grid=42x1, scaling=weak

step rk=2 chkd fill=true align-[incr:thold]=[1:0] mblk=2048 fmt=earliest io=mpi-io-col
Wall clock [s]: 0.26
File size [MiB]: 56.8
25424 hdf5_iotest.h5
hdf5_iotest.h5
Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be.
Error checking ibm license.
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL


Sender: LSF System lsfadmin@batch4
Subject: Job 1706489: in cluster Exited

Job was submitted from host by user in cluster at Mon Dec 20 09:51:42 2021
Job was executed on host(s) <1batch4>, in queue , as user in cluster at Mon Dec 20 10:06:15 2021
<42
h50n18>
</ccs/home/khl7265> was used as the home directory.
</ccs/home/khl7265/csc332/hdf5-iotest/src> was used as the working directory.
Started at Mon Dec 20 10:06:15 2021
Terminated at Mon Dec 20 10:06:33 2021
Results reported at Mon Dec 20 10:06:33 2021

The output (if any) is above this job summary.

@brtnfld

brtnfld commented Dec 20, 2021

#!/bin/bash 

if [[ $UNAME == "spock" ]]; then
  export CC=cc
  export RUNPARALLEL="srun -n6"
else
  export CC=mpicc
  export CXX=mpicxx
  export FC=mpif90
  export RUNPARALLEL="jsrun -n 6"
  opts="$opts --enable-build-mode=production"
fi

export CFLAGS="-g -O2 $OPT $CFLAGS"

opts="--enable-parallel --disable-hl --with-zlib $opts"

HDF_DIR=..

cmd="$HDF_DIR/configure \
$opts \
--disable-fortran --disable-tests --enable-tools --enable-shared"

echo $cmd
$cmd

make -j 16
make -j 16 install

@brtnfld

brtnfld commented Dec 20, 2021

#!/bin/bash

#autoreconf -i

export CC=mpicc
export CXX=mpicxx
export CXXFLAGS="-std=c++11"
export LDFLAGS="-L${HOME}/packages/hdf5-1.13/build/hdf5/lib"
export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n NP"

opts="--enable-debug"

$HOME/packages/vol-log-based/configure $opts --prefix=${PWD}/log-vol --with-hdf5=${HOME}/packages/hdf5-1.13/build/hdf5 --enable-shared --enable-zlib

gmake -j 8
gmake install

@brtnfld

brtnfld commented Dec 20, 2021

This is for just running step 18.

NUMBER OF NNODES, NPROCS_MAX = 1 42
Output: hdf5_iotest_000042.csv
Config loaded from 'hdf5_iotest.ini':
  steps=10, arrays=10, rows=42, columns=42, proc-grid=42x1, scaling=weak
-------------------------------------------------------------------------------
step rk=2 chkd fill=true align-[incr:thold]=[1:0] mblk=2048 fmt=earliest io=mpi-io-col
free(): invalid pointer
[e11n09:88401] *** Process received signal ***
[e11n09:88401] Signal: Aborted (6)
[e11n09:88401] Signal code:  (-6)

@khou2020

This is for just running step 18.

NUMBER OF NNODES, NPROCS_MAX = 1 42
Output: hdf5_iotest_000042.csv
Config loaded from 'hdf5_iotest.ini':
  steps=10, arrays=10, rows=42, columns=42, proc-grid=42x1, scaling=weak
-------------------------------------------------------------------------------
step rk=2 chkd fill=true align-[incr:thold]=[1:0] mblk=2048 fmt=earliest io=mpi-io-col
free(): invalid pointer
[e11n09:88401] *** Process received signal ***
[e11n09:88401] Signal: Aborted (6)
[e11n09:88401] Signal code:  (-6)

Isn't test 18 the one that crashed? It seems to finish without problems on my side.

I saw you using gmake. Are you using the GNU compiler or IBM XL?

@brtnfld

brtnfld commented Dec 20, 2021

Yes, 18 causes the crash. I'm using the GNU compiler. Are you using the 1.13.0 release?

@khou2020

Yes, 18 causes the crash. I'm using the GNU compiler. Are you using the 1.13.0 release?

Yes, but I am using IBM XL.
Can you show your loaded modules?

@brtnfld

brtnfld commented Dec 20, 2021

If you comment out the one-case = 18 line, does the program complete for you?

@brtnfld

brtnfld commented Dec 20, 2021

 1) lsf-tools/2.0
 2) hsi/5.0.2.p5
 3) darshan-runtime/3.3.0-lite
 4) xalt/1.2.1
 5) DefApps
 6) gcc/11.1.0
 7) zlib/1.2.11
 8) cmake/3.21.3
 9) spectrum-mpi/10.4.0.3-20210112
10) essl/6.3.0
11) netlib-lapack/3.9.1
12) netlib-scalapack/2.1.0
13) fftw/3.3.9
14) boost/1.77.0
15) nsight-compute/2021.2.1
16) nsight-systems/2021.3.1.54
17) cuda/11.0.3
18) python/3.8-anaconda3

@khou2020

If you comment out the one-case = 18 line, does the program complete for you?

Yes. All tests finished.

@wkliao

wkliao commented Dec 21, 2021

This error message indicates a problem calling free():

step rk=2 chkd fill=true align-[incr:thold]=[1:0] mblk=2048 fmt=earliest io=mpi-io-col
free(): invalid pointer

Kai-yuan, please run valgrind to narrow down the source code location.

@brtnfld

brtnfld commented Dec 21, 2021

Strange, I get the same error with the XL compilers.

@wkliao

wkliao commented Dec 22, 2021

Hi, Scot
Kai-yuan rebuilt everything (HDF5, log-based VOL, and hdf5-iotest) with the Spectrum MPI
compilers last night and ran the tests again on Summit. This time we used valgrind,
but it passed all tests. Could you send us the job script you used to run on Summit?

Another option is to link your hdf5-iotest against Kai-yuan's log-VOL.
Kai-yuan, please install the log-vol under your home folder and make it available to Scot.

@brtnfld

brtnfld commented Dec 22, 2021

I ran mine with the XL compilers and valgrind; I get this error:

==79519== Warning: invalid file descriptor -1 in syscall read()
==79519== Invalid free() / delete / delete[] / realloc()
==79519==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79519==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)
==79519==    by 0x12FDE933: H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:605)
==79519==    by 0x12FCC88B: H5VL_log_dataset_write(void*, long, long, long, long, void const*, void**) (H5VL_log_dataset.cpp:284)
==79519==    by 0x4598C47: IPRA.$H5VL__dataset_write (H5VLcallback.c:2147)
==79519==    by 0x4598A4F: H5VL_dataset_write (H5VLcallback.c:2179)
==79519==    by 0x41FBDEF: IPRA.$H5D__write_api_common (H5D.c:1167)
==79519==    by 0x41FBB2B: H5Dwrite (H5D.c:1220)
==79519==    by 0x10009FCB: write_test (write_test.c:265)
==79519==    by 0x1000612B: main (hdf5_iotest.c:328)
==79519==  Address 0x1fff0029e0 is on thread 1's stack
==79519==  in frame #2, created by H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:422)
==79519==
==79520== Invalid free() / delete / delete[] / realloc()
==79520==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79520==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)
==79520==    by 0x12FDE933: H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:605)
==79520==    by 0x12FCC88B: H5VL_log_dataset_write(void*, long, long, long, long, void const*, void**) (H5VL_log_dataset.cpp:284)
==79520==    by 0x4598C47: IPRA.$H5VL__dataset_write (H5VLcallback.c:2147)
==79520==    by 0x4598A4F: H5VL_dataset_write (H5VLcallback.c:2179)
==79520==    by 0x41FBDEF: IPRA.$H5D__write_api_common (H5D.c:1167)
==79520==    by 0x41FBB2B: H5Dwrite (H5D.c:1220)
==79520==    by 0x10009FCB: write_test (write_test.c:265)
==79520==    by 0x1000612B: main (hdf5_iotest.c:328)
==79520==  Address 0x1fff0029e0 is on thread 1's stack
==79520==  in frame #2, created by H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:422)

@khou2020

I ran mine with the XL compilers and valgrind; I get this error:

==79519== Warning: invalid file descriptor -1 in syscall read()
==79519== Invalid free() / delete / delete[] / realloc()
==79519==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79519==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)
==79519==    by 0x12FDE933: H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:605)
==79519==    by 0x12FCC88B: H5VL_log_dataset_write(void*, long, long, long, long, void const*, void**) (H5VL_log_dataset.cpp:284)
==79519==    by 0x4598C47: IPRA.$H5VL__dataset_write (H5VLcallback.c:2147)
==79519==    by 0x4598A4F: H5VL_dataset_write (H5VLcallback.c:2179)
==79519==    by 0x41FBDEF: IPRA.$H5D__write_api_common (H5D.c:1167)
==79519==    by 0x41FBB2B: H5Dwrite (H5D.c:1220)
==79519==    by 0x10009FCB: write_test (write_test.c:265)
==79519==    by 0x1000612B: main (hdf5_iotest.c:328)
==79519==  Address 0x1fff0029e0 is on thread 1's stack
==79519==  in frame #2, created by H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:422)
==79519==
==79520== Invalid free() / delete / delete[] / realloc()
==79520==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79520==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)
==79520==    by 0x12FDE933: H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:605)
==79520==    by 0x12FCC88B: H5VL_log_dataset_write(void*, long, long, long, long, void const*, void**) (H5VL_log_dataset.cpp:284)
==79520==    by 0x4598C47: IPRA.$H5VL__dataset_write (H5VLcallback.c:2147)
==79520==    by 0x4598A4F: H5VL_dataset_write (H5VLcallback.c:2179)
==79520==    by 0x41FBDEF: IPRA.$H5D__write_api_common (H5D.c:1167)
==79520==    by 0x41FBB2B: H5Dwrite (H5D.c:1220)
==79520==    by 0x10009FCB: write_test (write_test.c:265)
==79520==    by 0x1000612B: main (hdf5_iotest.c:328)
==79520==  Address 0x1fff0029e0 is on thread 1's stack
==79520==  in frame #2, created by H5VL_log_dataseti_write(H5VL_log_dset_t*, long, long, H5VL_log_selections*, long, void const*, void**) (H5VL_log_dataseti.cpp:422)

Are you using any filters? That part of the code is experimental and should not even be run.

@brtnfld

brtnfld commented Dec 22, 2021

It should not be using any filters; the gzip and szip input lines are commented out.

@brtnfld

brtnfld commented Dec 22, 2021

I checked the native-FD HDF5 file; there are no filters.

@wkliao

wkliao commented Dec 22, 2021

Scot, could you please share your modified test program?
I can see your code differs from the master branch.

==79520==    by 0x1000612B: main (hdf5_iotest.c:328)

https://github.com/brtnfld/hdf5-iotest/blob/ff0d833ab8908582c089dc88ec19e17a30abde44/src/hdf5_iotest.c#L328-L329

@khou2020

khou2020 commented Dec 22, 2021

I checked the native-FD HDF5 file; there are no filters.

That code should only run when there are filters defined in the dcpl. The buffer is only allocated when there are filters, so it makes sense to see the invalid-free error.

Was there any error reported prior to this?

@wkliao

wkliao commented Dec 22, 2021

@khou2020
The error lines in the valgrind log look fishy.

==79519==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79519==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)

https://github.com/DataLib-ECP/vol-log-based/blob/96c2cd2eea10d7e150eabbc6ffd39507c943135c/src/H5VL_log_filei.cpp#L164-L166

Shouldn't the free(bp) on line 166 be free(buf)?

@wkliao

wkliao commented Dec 22, 2021

@khou2020
https://github.com/DataLib-ECP/vol-log-based/blob/96c2cd2eea10d7e150eabbc6ffd39507c943135c/src/H5VL_log_filei.cpp#L161

The subroutine H5VL_log_filei_bfree should check whether the input
arguments fp and buf are NULL and return proper error codes.

@khou2020

@khou2020 The error lines in the valgrind log look fishy.

==79519==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79519==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)

https://github.com/DataLib-ECP/vol-log-based/blob/96c2cd2eea10d7e150eabbc6ffd39507c943135c/src/H5VL_log_filei.cpp#L164-L166

Shouldn't the free(bp) on line 166 be free(buf)?

H5VL_log_filei_balloc reserves the first 8 bytes to store the size of the allocated buffer, so the actual allocation starts 8 bytes before the user buffer.

@brtnfld

brtnfld commented Dec 22, 2021

I checked the native-FD HDF5 file; there are no filters.

That code should only run when there are filters defined in the dcpl. The buffer is only allocated when there are filters, so it makes sense to see the invalid-free error.

Was there any error reported prior to this?

The only dcpl setting is for chunking, no compression. The full output (2 ranks were used) is in /tmp/brtnfld on Summit.

@wkliao

wkliao commented Dec 22, 2021

@khou2020 The error lines in the valgrind log look fishy.

==79519==    at 0x4088C24: free (vg_replace_malloc.c:755)
==79519==    by 0x1300808B: H5VL_log_filei_bfree(H5VL_log_file_t*, void*) (H5VL_log_filei.cpp:166)

https://github.com/DataLib-ECP/vol-log-based/blob/96c2cd2eea10d7e150eabbc6ffd39507c943135c/src/H5VL_log_filei.cpp#L164-L166

Shouldn't the free(bp) on line 166 be free(buf)?

H5VL_log_filei_balloc reserves the first 8 bytes to store the size of the allocated buffer, so the actual allocation starts 8 bytes before the user buffer.

Please add comments to this subroutine.

@brtnfld

brtnfld commented Dec 22, 2021

I'm on login1; which login node did you place your build on?

@khou2020

I moved that to my home folder.

@brtnfld

brtnfld commented Dec 23, 2021

Note that the only difference between case 17 and case 18 is independent versus collective writes, respectively. Case 17 passes, and valgrind does not report any issues.

@khou2020

Note that the only difference between case 17 and case 18 is independent versus collective writes, respectively. Case 17 passes, and valgrind does not report any issues.

Log-vol writes are all independent. The actual write happens at file-flush time; H5Dwrite only stages the requests locally in a queue.

@khou2020

I'm on login1; which login node did you place your build on?

I tried your code but cannot reproduce the error. I did get many valgrind warnings about uninitialized values, but none originated within log-vol.

@wkliao

wkliao commented Dec 23, 2021

Note that the only difference between case 17 and case 18 is independent versus collective writes, respectively. Case 17 passes, and valgrind does not report any issues.

Log-vol writes are all independent. The actual write happens at file-flush time; H5Dwrite only stages the requests locally in a queue.

This is not optimal.
When flushing metadata and data, collective writes should be used.

@khou2020

Note that the only difference between case 17 and case 18 is independent versus collective writes, respectively. Case 17 passes, and valgrind does not report any issues.

Log-vol writes are all independent. The actual write happens at file-flush time; H5Dwrite only stages the requests locally in a queue.

This is not optimal. When flushing metadata and data, collective writes should be used.

Flushing (H5Fflush) is always collective, but posting (H5Dwrite) is always independent.
There should not be any difference between calling H5Dwrite with a collective or an independent dxpl.

@khou2020

@brtnfld We tried on 3 different machines but still cannot reproduce the error.
Can you try rebuilding everything (HDF5, log-vol, test program) and share your build with us?
Can you also share your command-line history, starting from login?

@khou2020

@brtnfld I rebuilt everything using all the modules you listed but still cannot reproduce the issue.

@brtnfld

brtnfld commented Dec 27, 2021

I put a fresh build of all the packages (hdf5, hdf5-log, qmcpack) in /tmp/brtnfld/VLOG on login4.

For my environment, I have in my .bashrc


module load zlib
module load cmake

# QMCPACK
module load gcc/11.1.0
module load spectrum-mpi
module load essl
module load netlib-lapack
module load netlib-scalapack
module load fftw
module load boost
module load cuda
module load python/3.8-anaconda3

I built everything in my /gpfs/alpine/csc300/proj-shared directory. You should be able to run the qmcpack/build_summit_cpu/bin batch script with only a #BSUB account change.

All the build scripts are in each package's directory; all but the qmcpack script are run from the build dir and use relative paths, so you should not have to change them at all.

@khou2020

I cannot access it.
$ cd /gpfs/alpine/csc300/proj-shared
-bash: cd: /gpfs/alpine/csc300/proj-shared: Permission denied

@brtnfld

brtnfld commented Dec 27, 2021

It is in /tmp/brtnfld on login4.

@khou2020

I can't see it.
[khl7265@login4.summit csc332]$ cd /tmp/brtnfld
-bash: cd: /tmp/brtnfld: No such file or directory

@brtnfld

brtnfld commented Dec 28, 2021

I recopied it.

@khou2020

Can you also include the built binaries? We previously used your source but could not reproduce the issue.

@brtnfld

brtnfld commented Dec 28, 2021

All the builds and binaries are there.

@khou2020

There is no hdf5-iotest.

@brtnfld

brtnfld commented Dec 28, 2021

This comment was for this PR, not #10.

I found the issue with my batch script: it was enabling compression, but only for the cases of chunked datasets and collective I/O. I removed that line in the script and it works.

This issue can be closed and PR #10 reopened.

@khou2020 khou2020 reopened this Dec 28, 2021
@khou2020

I am confused. Can you create a new issue for what's still unsolved?
These tickets seem to be mixed up.

@wkliao

wkliao commented Jan 5, 2022

Hi, @brtnfld @khou2020
To properly close this issue, could you both confirm whether hdf5-iotest
passes all its tests (on Summit and other machines e.g. local machines)
when using the master branch of this log-based VOL?

@khou2020 khou2020 reopened this Jan 6, 2022
@khou2020

khou2020 commented Jan 7, 2022

@brtnfld Do you have the steps to build on Cori? I keep getting this error:
hdf5_iotest: ../../src/hdf5_iotest.c:64: main: Assertion `MPI_THREAD_MULTIPLE == mpi_thread_lvl_provided' failed.

It seems to need a different toolchain from the one that comes with the system.

@brtnfld

brtnfld commented Jan 7, 2022

You have to set the environment variable MPICH_MAX_THREAD_SAFETY=multiple.

@khou2020

khou2020 commented Jan 7, 2022

I tested it on Cori without problems.

@khou2020 khou2020 closed this as completed Jan 7, 2022