dread test segfaults on summit #10

Closed
brtnfld opened this issue Dec 10, 2021 · 62 comments

@brtnfld
Collaborator

brtnfld commented Dec 10, 2021

Using HDF5 1.13 and

Currently Loaded Modules:

  1) lsf-tools/2.0                   2) hsi/5.0.2.p5                  3) darshan-runtime/3.3.0-lite
  4) xalt/1.2.1                      5) DefApps                       6) gcc/11.1.0
  7) zlib/1.2.11                     8) cmake/3.21.3                  9) spectrum-mpi/10.4.0.3-20210112
 10) essl/6.3.0                     11) netlib-lapack/3.9.1          12) netlib-scalapack/2.1.0
 13) fftw/3.3.9                     14) boost/1.77.0                 15) nsight-compute/2021.2.1
 16) nsight-systems/2021.3.1.54     17) cuda/11.0.3                  18) python/3.8-anaconda3

the dread test fails with the segfault below. It seems to work with one rank; this is with 2 ranks.

#0 0x0000200014f17a00 in ADIOI_GEN_WriteStrided () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#1 0x0000200014edba8c in ADIOI_GPFS_WriteStridedColl ()
from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#2 0x0000200014eceda8 in MPIOI_File_write_all () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#3 0x0000200014ecf908 in mca_io_romio_dist_MPI_File_write_at_all ()
from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#4 0x0000200014ec1c7c in mca_io_romio321_file_write_at_all ()
from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#5 0x0000200000b08a2c in PMPI_File_write_at_all () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/libmpi_ibm.so.3
#6 0x000020000009c4d8 in MPI_File_write_at_all (fh=, offset=2436, buf=0x0, count=, datatype=0x3a4abdc0, status=0x7fffe55cfb18) at lib/darshan-mpiio.c:563
#7 0x000020000017e374 in H5VL_log_filei_metaflush(H5VL_log_file_t*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#8 0x000020000017ab84 in H5VL_log_filei_close(H5VL_log_file_t*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#9 0x000020000017c3e0 in H5VL_log_filei_dec_ref(H5VL_log_file_t*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#10 0x000020000018780c in H5VL_log_obj_t::~H5VL_log_obj_t() () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#11 0x000020000018a920 in H5VL_log_free_wrap_ctx(void*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#12 0x00002000008594a4 in H5VL__free_vol_wrapper (vol_wrap_ctx=0x3a2efba0) at ../../src/H5VLint.c:2243
#13 0x000020000085c884 in H5VL_reset_vol_wrapper () at ../../src/H5VLint.c:2431
#14 0x000020000084fd88 in H5VL_file_close (vol_obj=, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:4163
#15 0x000020000060b9e0 in H5F__close_cb (file_vol_obj=, request=) at ../../src/H5Fint.c:216
#16 0x00002000006a388c in H5I__dec_ref (id=id@entry=72057594037927936, request=0x0) at ../../src/H5Iint.c:1036
#17 0x00002000006a3a40 in H5I__dec_app_ref (id=72057594037927936, request=) at ../../src/H5Iint.c:1108
#18 0x00002000006a3b5c in H5I_dec_app_ref (id=) at ../../src/H5Iint.c:1156
#19 0x0000200000600c3c in H5Fclose (file_id=72057594037927936) at ../../src/H5F.c:1060
#20 0x0000000010002338 in ?? ()
#21 0x0000200000e04078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#22 0x0000200000e04264 in __libc_start_main () from /lib64/power9/libc.so.6
#23 0x0000000000000000 in ?? ()

@khou2020
Collaborator

Which application was it? Can you share it along with the run script?

@wkliao
Collaborator

wkliao commented Dec 11, 2021

The dread test is test/basic/dread.cpp.

@khou2020
Collaborator

khou2020 commented Dec 12, 2021

It should be fixed.
@brtnfld Can you help test it on your side, just to be sure?

@khou2020 khou2020 reopened this Dec 12, 2021
@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

qmcpack: https://github.com/QMCPACK/qmcpack

To build on summit:

#!/bin/bash

#BUILD_MODULES=config/load_olcf_summit_modules.sh

#module purge
#echo "Purging current module set"
#echo "Sourcing file: $BUILD_MODULES to build QMCPACK"
#
#. $BUILD_MODULES

#echo "Either source $BUILD_MODULES or load these same modules to run QMCPACK"

declare -A builds=( ["cpu"]="-DQMC_BUILD_SANDBOX_ONLY=1 -DENABLE_SOA=0" ) #\
#                    ["complex_cpu"]="-DQMC_COMPLEX=1  -DQMC_MATH_VENDOR=IBM_MASS -DMASS_ROOT=/sw/summit/xl/16.1.1-10/xlmass/9.1.1" \
#                    ["legacy_gpu"]="-DQMC_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=70 " \
#                    ["complex_legacy_gpu"]="-DQMC_CUDA=1 -DQMC_COMPLEX=1 -DCMAKE_CUDA_ARCHITECTURES=70 " )
#
mkdir bin

export HDF5_ROOT=$HOME/packages/hdf5-1.13/build/hdf5

for build in "${!builds[@]}"
do
    echo "building: $build with ${builds[$build]}"
    rm bin/qmcpack_${build}
    mkdir build_summit_${build}
    cd build_summit_${build}
    cmake -DCMAKE_C_COMPILER="mpicc" \
          -DCMAKE_CXX_COMPILER="mpicxx" \
          -DBUILD_LMYENGINE_INTERFACE=0 \
          ${builds[$build]} \
          ..
    make -j 20 restart
    if [ $? -eq 0 ]; then
      build_dir=$(pwd)
      if [ -e ${build_dir}/bin/qmcpack_complex ]; then
        ln -sf ${build_dir}/bin/qmcpack_complex ${build_dir}/../bin/qmcpack_${build}
      else
        ln -sf ${build_dir}/bin/qmcpack ${build_dir}/../bin/qmcpack_${build}
      fi
    fi
    cd ..
done

------- END ----

To run the restart test from build_summit_cpu/bin (submitted from this directory):

#!/bin/bash
#BSUB -P CSC444
#BSUB -W 01:00
# power 42
#BSUB -nnodes 1
#BSUB -J qmcpack 
#BSUB -o qmcpack.%J
#BSUB -e qmcpack.%J
##SMT1 -- 1 HW Thread per physical core
##SMT4 -- All 4 HW threads are active (Default)
##BSUB -alloc_flags smt1
# 42 physical cores, (21 each cpu), per node
# 84 per cpu, 168 total

export LD_LIBRARY_PATH="$HOME/packages/hdf5-1.13/build/hdf5/lib:$LD_LIBRARY_PATH"

JID=$LSB_JOBID
cd $MEMBERWORK/csc444
mkdir qmcpack.$JID
cd qmcpack.$JID
EXEC=restart
cp $LS_SUBCWD/$EXEC .
NPROCS_MAX=$(($LSB_MAX_NUM_PROCESSORS - 1))
NNODES=$(($NPROCS_MAX / 42))
echo "NUMBER OF NNODES, NPROCS_MAX = $NNODES $NPROCS_MAX"

NPROCS="42"

export LD_LIBRARY_PATH="$HOME/packages/vol-log-based/build/log-vol/lib:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/packages/vol-log-based/build/log-vol/lib"
export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"

for i in ${NPROCS}
do
  jsrun -n $i ./$EXEC -g "8 8 8"
  ls -haolF *.h5
done

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

For log VOL make check, I run an interactive job on summit and run it using:

#!/bin/bash

HDF5_DIR=$HOME/packages/hdf5/build/hdf5
ABT_DIR=$HOME/packages/argobots/build/argobots
VOL_DIR=$HOME/packages/vol-async/src

export LD_LIBRARY_PATH=${HDF5_DIR}/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$VOL_DIR/src:$ABT_DIR/lib:${LD_LIBRARY_PATH}

export HDF5_PLUGIN_PATH="$VOL_DIR"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}" 
export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n 4"

make check

# OR RUN THIS FROM test/basic
jsrun -n 4 group
jsrun -n 4 attr 
jsrun -n 4 dset
jsrun -n 4 dwrite
jsrun -n 4 dread 

dread now passes with the newest updates if I run the test manually (see the second option in the script).
I can't get VERBOSE output to work with make check to see why it is failing.

@wkliao
Collaborator

wkliao commented Dec 13, 2021

Are you using the async I/O VOL? I am seeing this line:

VOL_DIR=$HOME/packages/vol-async/src

@khou2020
Collaborator

For log VOL make check, I run an interactive job on summit and run it using:

#!/bin/bash

HDF5_DIR=$HOME/packages/hdf5/build/hdf5
ABT_DIR=$HOME/packages/argobots/build/argobots
VOL_DIR=$HOME/packages/vol-async/src

export LD_LIBRARY_PATH=${HDF5_DIR}/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$VOL_DIR/src:$ABT_DIR/lib:${LD_LIBRARY_PATH}

export HDF5_PLUGIN_PATH="$VOL_DIR"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}" 
export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n 4"

make check

# OR RUN THIS FROM test/basic
jsrun -n 4 group
jsrun -n 4 attr 
jsrun -n 4 dset
jsrun -n 4 dwrite
jsrun -n 4 dread 

dread now passes with the newest updates if I run the test manually (see the second option in the script). I can't get VERBOSE output to work with make check to see why it is failing.

The output should be saved to test-suite.log and dread.log.
The log VOL is hard-coded in dread, so there is no need to set HDF5_VOL_CONNECTOR and HDF5_PLUGIN_PATH.
The wrapper script for make check also sets the proper environment, so there is no need to export anything.
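
For reference, with the Automake test harness the per-test output can be inspected directly, and setting VERBOSE=1 makes a failed "make check" print the log summary. A minimal sketch, assuming the default build layout (adjust paths to your build directory):

# Print the failing-test summary at the end of "make check"
make check VERBOSE=1

# Or inspect the per-test logs directly
cat test/basic/test-suite.log
cat test/basic/dread.log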

My build steps are:
./configure --prefix=/gpfs/alpine/csc332/scratch/khl7265/.local/log_io_vol/debug --with-hdf5=/gpfs/alpine/csc332/scratch/khl7265/.local/hdf5/develop --enable-shared --enable-profiling CFLAGS="-O0 -g" CXXFLAGS="-O0 -g -std=c++14" --enable-debug --enable-zlib TESTSEQRUN="jsrun -n 1" TESTMPIRUN="jsrun -n NP"
make -j 64
make -j 64 tests
In an interactive job:
make check
make ptest
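
For reference, an interactive allocation on Summit can be requested with something like the sketch below before running those make commands (the project ID and walltime are placeholders):

# Request a one-node interactive job, then run "make check" / "make ptest" inside it
bsub -W 0:30 -nnodes 1 -P PROJ123 -Is /bin/bash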

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

Sorry, I pasted the wrong script:


#!/bin/bash -l

export LD_LIBRARY_PATH="$HOME/packages/vol-log-based/build/log-vol/lib:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/packages/vol-log-based/build/log-vol/lib"
export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"

#export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n 4" 

make check

#jsrun -n 2 group
#jsrun -n 2 attr 
#jsrun -n 2 dset
#jsrun -n 2 dwrite
#jsrun -n 2 dread 

@khou2020
Collaborator

Sorry, I pasted the wrong script:


#!/bin/bash -l

export LD_LIBRARY_PATH="$HOME/packages/vol-log-based/build/log-vol/lib:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/packages/vol-log-based/build/log-vol/lib"
export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"

#export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n 4" 

make check

#jsrun -n 2 group
#jsrun -n 2 attr 
#jsrun -n 2 dset
#jsrun -n 2 dwrite
#jsrun -n 2 dread 

Did you set TESTSEQRUN in the configure step? TESTMPIRUN is only used in make ptest.

@wkliao
Collaborator

wkliao commented Dec 13, 2021

Hi, @brtnfld
To clarify the usage of "make check" and "make ptest":
"make check" runs all test programs on a single MPI process, and
"make ptest" runs all test programs on 4 MPI processes.
So if you want to run both in your script, please
add both commands, for example,

export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n 4" 

make check
make ptest

@wkliao
Collaborator

wkliao commented Dec 13, 2021

A correction for the environment variable TESTMPIRUN:
one should use the literal string NP, instead of 4, as "make ptest" may run with different numbers of MPI processes, between 2 and 10.

export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n NP" 

make check
make ptest
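
In other words, TESTMPIRUN keeps the literal placeholder NP, and the test wrapper substitutes the actual process count when it launches each program. A sketch of the idea only, not the wrapper's exact code:

# Illustrative only: expanding an "NP" placeholder at run time
TESTMPIRUN="jsrun -n NP"
NPROCS=4
${TESTMPIRUN/NP/$NPROCS} ./dread      # runs: jsrun -n 4 ./dread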

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

There is no test-suite.log; the individual log files just say:

Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be.
Error checking ibm license.
FAIL dset (exit status: 255)

It does not give the actual verbose make output.

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

For the parallel test (and probably for the serial test), it seems to be using mpiexec instead of my environment setting.

export SED="/usr/bin/sed"; export srcdir="/ccs/home/brtnfld/packages/vol-log-based/test/basic"; export TESTOUTDIR="."; export TESTSEQRUN=""; export TESTMPIRUN="/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/bin/mpiexec"; export TESTPROGRAMS="file group attr dset dwrite dread memsel null_req"; export check_PROGRAMS="file group attr dset dwrite dread memsel null_req";
/ccs/home/brtnfld/packages/vol-log-based/test/basic/parallel_run.sh 4 || exit 1

[h16n05:3615581] Error: common_pami.c:1094 - ompi_common_pami_init() 27: Unable to create 1 PAMI communication context(s) rc=1
Unable to connect queue-pairs
[h16n05:3615586] Error: common_pami.c:1094 - ompi_common_pami_init() 28: Unable to create 1 PAMI communication context(s) rc=1
Unable to connect queue-pairs
[h16n05:3615588] Error: common_pami.c:1094 - ompi_common_pami_init() 30: Unable to create 1 PAMI communication context(s) rc=1
Unable to connect queue-pairs
[h16n05:3615594] Error: common_pami.c:1094 - ompi_common_pami_init() 39: Unable to create 1 PAMI communication context(s) rc=1
Unable to connect queue-pairs
[h16n05:3615563] Error: common_pami.c:1094 - ompi_common_pami_init() 6: Unable to create 1 PAMI communication context(s) rc=1

@wkliao
Collaborator

wkliao commented Dec 13, 2021

What is the configure command line you used to build the log-based VOL?
Can you share the file config.log?

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

#!/bin/bash

#autoreconf -i

export CC=mpicc
export CXX=mpicxx
export CXXFLAGS="-std=c++11"
export LDFLAGS="-L${HOME}/packages/hdf5-1.13/build/hdf5/lib"

../configure --prefix=${PWD}/log-vol --with-hdf5=${HOME}/packages/hdf5-1.13/build/hdf5 --enable-shared --enable-zlib

gmake -j 8

@khou2020
Collaborator

There is no test_suit.log, the individual log files just say:

Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be. Error checking ibm license. FAIL dset (exit status: 255)

It does not give the actual verbose make output.

That error means it was not being run with jsrun.

@wkliao
Collaborator

wkliao commented Dec 13, 2021

@brtnfld , Please try this configure command.

./configure --prefix=${PWD}/log-vol  \
          --with-hdf5=${HOME}/packages/hdf5-1.13/build/hdf5 \
          --enable-shared --enable-zlib \
          CC=mpicc CXX=mpicxx \
          CXXFLAGS="-std=c++14" \
          TESTSEQRUN="jsrun -n 1" \
          TESTMPIRUN="jsrun -n NP"

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

I added the MPI run environment variables before configure and it is now correct. All but null_req.log pass.

HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:
#000: ../../src/H5VL.c line 87 in H5VLregister_connector(): library initialization failed
major: Function entry/exit
minor: Unable to initialize object
#1: ../../src/H5.c line 285 in H5_init_library(): unable to initialize VOL interface
major: Function entry/exit
minor: Unable to initialize object
#2: ../../src/H5VLint.c line 213 in H5VL_init_phase2(): unable to set default VOL connector
major: Virtual Object Layer
minor: Can't set value
#3: ../../src/H5VLint.c line 423 in H5VL__set_def_conn(): can't register connector
major: Virtual Object Layer
minor: Unable to register new ID
#4: ../../src/H5VLint.c line 1354 in H5VL__register_connector_by_name(): unable to load VOL connector
major: Virtual Object Layer
minor: Unable to initialize object
#5: ../../src/H5PLint.c line 257 in H5PL_load(): search in path table failed
major: Plugin for dynamically loaded library
minor: Can't get value
#6: ../../src/H5PLpath.c line 804 in H5PL__find_plugin_in_path_table(): search in path /src/.libs encountered an error
major: Plugin for dynamically loaded library
minor: Can't get value
#7: ../../src/H5PLpath.c line 857 in H5PL__find_plugin_in_path(): can't open directory: /src/.libs

It is looking for the VOL in the wrong directory, same with the parallel tests.

@brtnfld
Collaborator Author

brtnfld commented Dec 13, 2021

It is looking for /src/.libs instead of src/.libs in the build directory.

@wkliao
Collaborator

wkliao commented Dec 13, 2021

Can you rebuild it? Maybe there are some residual files;
i.e., run "make distclean" and then the configure commands again.

@wkliao
Collaborator

wkliao commented Dec 13, 2021

I pushed a commit to fix it. Please run 'git pull' to update your local repo.
Let us know if it fixes the problem.

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

I still have the same issue with the new code. I did the build in an empty build directory.

Also, it would be nice if make also built the tests. Unfortunately, make check or make ptest requires running parallel programs on the frontend, which is not allowed and will fail.

@wkliao
Collaborator

wkliao commented Dec 14, 2021

I pushed another commit. That should fix the bug.

Summit, like other DOE parallel computers, is a cross-compile environment.
The executables created from "make" must be run on compute nodes.
Thus, the commands "make check" and "make ptest" will fail if run on the login node.
You can run these make commands in a batch script or in an interactive job.

@wkliao
Collaborator

wkliao commented Dec 14, 2021

FYI, the command "make tests" compiles and builds all executables of the
test and example programs. You can run that command on the login node first
and then submit a job to run "make check" and "make ptest".
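
For example, a batch script along those lines might look like the sketch below (the project ID, walltime, and build directory are placeholders, and "make tests" is assumed to have already been run on the login node):

#!/bin/bash
#BSUB -P PROJ123
#BSUB -W 00:30
#BSUB -nnodes 1
#BSUB -J logvol-check
#BSUB -o logvol-check.%J

# Build directory where "make tests" was run on the login node (placeholder path)
cd $HOME/packages/vol-log-based/build

export TESTSEQRUN="jsrun -n 1"
export TESTMPIRUN="jsrun -n NP"

make check
make ptest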

@khou2020
Collaborator

qmcpack: https://github.com/QMCPACK/qmcpack

To build on summit:

#!/bin/bash

#BUILD_MODULES=config/load_olcf_summit_modules.sh

#module purge
#echo "Purging current module set"
#echo "Sourcing file: $BUILD_MODULES to build QMCPACK"
#
#. $BUILD_MODULES

#echo "Either source $BUILD_MODULES or load these same modules to run QMCPACK"

declare -A builds=( ["cpu"]="-DQMC_BUILD_SANDBOX_ONLY=1 -DENABLE_SOA=0" ) #\
#                    ["complex_cpu"]="-DQMC_COMPLEX=1  -DQMC_MATH_VENDOR=IBM_MASS -DMASS_ROOT=/sw/summit/xl/16.1.1-10/xlmass/9.1.1" \
#                    ["legacy_gpu"]="-DQMC_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=70 " \
#                    ["complex_legacy_gpu"]="-DQMC_CUDA=1 -DQMC_COMPLEX=1 -DCMAKE_CUDA_ARCHITECTURES=70 " )
#
mkdir bin

export HDF5_ROOT=$HOME/packages/hdf5-1.13/build/hdf5

for build in "${!builds[@]}"
do
    echo "building: $build with ${builds[$build]}"
    rm bin/qmcpack_${build}
    mkdir build_summit_${build}
    cd build_summit_${build}
    cmake -DCMAKE_C_COMPILER="mpicc" \
          -DCMAKE_CXX_COMPILER="mpicxx" \
          -DBUILD_LMYENGINE_INTERFACE=0 \
          ${builds[$build]} \
          ..
    make -j 20 restart
    if [ $? -eq 0 ]; then
      build_dir=$(pwd)
      if [ -e ${build_dir}/bin/qmcpack_complex ]; then
        ln -sf ${build_dir}/bin/qmcpack_complex ${build_dir}/../bin/qmcpack_${build}
      else
        ln -sf ${build_dir}/bin/qmcpack ${build_dir}/../bin/qmcpack_${build}
      fi
    fi
    cd ..
done

------- END ----

To run the restart test from build_summit_cpu/bin (submitted from this directory):

#!/bin/bash
#BSUB -P CSC444
#BSUB -W 01:00
# power 42
#BSUB -nnodes 1
#BSUB -J qmcpack 
#BSUB -o qmcpack.%J
#BSUB -e qmcpack.%J
##SMT1 -- 1 HW Thread per physical core
##SMT4 -- All 4 HW threads are active (Default)
##BSUB -alloc_flags smt1
# 42 physical cores, (21 each cpu), per node
# 84 per cpu, 168 total

export LD_LIBRARY_PATH="$HOME/packages/hdf5-1.13/build/hdf5/lib:$LD_LIBRARY_PATH"

JID=$LSB_JOBID
cd $MEMBERWORK/csc444
mkdir qmcpack.$JID
cd qmcpack.$JID
EXEC=restart
cp $LS_SUBCWD/$EXEC .
NPROCS_MAX=$(($LSB_MAX_NUM_PROCESSORS - 1))
NNODES=$(($NPROCS_MAX / 42))
echo "NUMBER OF NNODES, NPROCS_MAX = $NNODES $NPROCS_MAX"

NPROCS="42"

export LD_LIBRARY_PATH="$HOME/packages/vol-log-based/build/log-vol/lib:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/packages/vol-log-based/build/log-vol/lib"
export HDF5_VOL_CONNECTOR="LOG under_vol=0;under_info={}"

for i in ${NPROCS}
do
  jsrun -n $i ./$EXEC -g "8 8 8"
  ls -haolF *.h5
done

I got the following error running the script.
[khl7265@login5.summit build]$ ./build.sh
-bash: ./build.sh: Permission denied
[khl7265@login5.summit build]$ chmod 755 build.sh
[khl7265@login5.summit build]$ ./build.sh
building: cpu with -DQMC_BUILD_SANDBOX_ONLY=1 -DENABLE_SOA=0
rm: cannot remove 'bin/qmcpack_cpu': No such file or directory
-- Defining the float point precision
Base precision = double
Full precision = double
-- CMAKE_BUILD_TYPE is RELEASE
-- Enable sanitizer ENABLE_SANITIZER=none
-- Trying to figure out compiler options ....
-- C++ Compiler is identified by QMCPACK as : IBM
-- Power8+ system using xlC/xlc/xlf
CMake Error in /ccs/home/khl7265/csc332/qmcpack/build/CMakeFiles/CMakeTmp/CMakeLists.txt:
Target "cmTC_c3ccb" requires the language dialect "CXX17" , but CMake does
not know the compile flags to use to enable it.

CMake Error at CMake/Testlibstdc++.cmake:3 (try_compile):
Failed to generate test project build system.
Call Stack (most recent call first):
CMakeLists.txt:283 (include)

-- Configuring incomplete, errors occurred!
See also "/ccs/home/khl7265/csc332/qmcpack/build/CMakeFiles/CMakeOutput.log".
make: *** No rule to make target 'restart'. Stop.

I also tried building locally on my PC, but it also failed.

[ 78%] Building CXX object src/QMCDrivers/CMakeFiles/qmcdriver_unit.dir/WaveFunctionTester.cpp.o
g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

Did you have these modules loaded?
Currently Loaded Modules:

  1) lsf-tools/2.0                   2) hsi/5.0.2.p5                  3) darshan-runtime/3.3.0-lite
  4) xalt/1.2.1                      5) DefApps                       6) gcc/11.1.0
  7) zlib/1.2.11                     8) cmake/3.21.3                  9) spectrum-mpi/10.4.0.3-20210112
 10) essl/6.3.0                     11) netlib-lapack/3.9.1          12) netlib-scalapack/2.1.0
 13) fftw/3.3.9                     14) boost/1.77.0                 15) nsight-compute/2021.2.1
 16) nsight-systems/2021.3.1.54     17) cuda/11.0.3                  18) python/3.8-anaconda3

module load zlib
module load cmake

# QMCPACK
module load gcc/11.1.0
module load spectrum-mpi
module load essl
module load netlib-lapack
module load netlib-scalapack
module load fftw
module load boost
module load cuda
module load python/3.8-anaconda3

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

For the tests: it passes the null_req test now, but fails with:

*** TESTING CXX lt-dynamic: Creating files
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:
#000: ../../src/H5.c line 1110 in H5open(): library initialization failed
major: Function entry/exit
minor: Unable to initialize object
#1: ../../src/H5.c line 285 in H5_init_library(): unable to initialize VOL interface
major: Function entry/exit
minor: Unable to initialize object
#2: ../../src/H5VLint.c line 213 in H5VL_init_phase2(): unable to set default VOL connector
major: Virtual Object Layer
minor: Can't set value
#3: ../../src/H5VLint.c line 423 in H5VL__set_def_conn(): can't register connector
major: Virtual Object Layer
minor: Unable to register new ID
#4: ../../src/H5VLint.c line 1354 in H5VL__register_connector_by_name(): unable to load VOL connector
major: Virtual Object Layer
minor: Unable to initialize object
#5: ../../src/H5PLint.c line 257 in H5PL_load(): search in path table failed
major: Plugin for dynamically loaded library
minor: Can't get value
#6: ../../src/H5PLpath.c line 804 in H5PL__find_plugin_in_path_table(): search in path /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib encountered an error
major: Plugin for dynamically loaded library
minor: Can't get value
#7: ../../src/H5PLpath.c line 857 in H5PL__find_plugin_in_path(): can't open directory: /ccs/home/brtnfld/packages/vol-log-based/build/log-vol/lib
major: Plugin for dynamically loaded library
minor: Can't open directory or file
Error at line 60 in /ccs/home/brtnfld/packages/vol-log-based/test/dynamic/dynamic.cpp: expecting vol_name = LOG, but got native

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

The parallel tests pass, but the examples do not:

===========================================================
    examples: Parallel testing on 4 MPI processes
===========================================================
export SED="/usr/bin/sed"; export srcdir="/ccs/home/brtnfld/packages/vol-log-based/examples"; export top_builddir=".."; export TESTOUTDIR="."; export TESTSEQRUN="jsrun -n 1"; export TESTMPIRUN="jsrun -n NP"; export TESTPROGRAMS="create_open dwrite_n dread_n non_blocking"; export check_PROGRAMS="create_open dwrite_n dread_n non_blocking"; \
/ccs/home/brtnfld/packages/vol-log-based/examples/parallel_run.sh 4 || exit 1
Error (No such file or directory) executing process: ./create_open
Error (No such file or directory) executing process: ./create_open
Error (No such file or directory) executing process: ./create_open
make[1]: *** [Makefile:1217: ptest] Error 1
make[1]: Leaving directory '/gpfs/alpine/csc300/proj-shared/brtnfld/build/examples'

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

Unless the environment variable is set wrong, I don't understand why setting these should cause the tests to fail. Users might just set these by default in their environment if they plan on using the log VOL regularly, or may have already set them previously.

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

@khou2020 were you able to compile qmcpack?

@wkliao
Collaborator

wkliao commented Dec 14, 2021

Unless the environment variable is set wrong, I don't understand why setting these should cause the tests to fail. Users might just set these by default in their environment if they plan on using the log VOL regularly, or may have already set them previously.

This is because your setting (quoted below) picks up a previously installed log-based VOL which, I assume, uses the same shared-library ABI version, causing a conflict. You can still set LD_LIBRARY_PATH, just not to the same VOL that is being built.

export LD_LIBRARY_PATH="$HOME/packages/vol-log-based/build/log-vol/lib:$LD_LIBRARY_PATH"
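
For example, one way to run the tests against the freshly built library without picking up a previously installed log VOL is sketched below (paths are placeholders, and as noted earlier the tests themselves do not need the connector/plugin variables):

# Keep HDF5 on the library path, but do not point at an installed copy of the log VOL
export LD_LIBRARY_PATH="$HOME/packages/hdf5-1.13/build/hdf5/lib:$LD_LIBRARY_PATH"
unset HDF5_PLUGIN_PATH
unset HDF5_VOL_CONNECTOR

make check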

@khou2020
Collaborator

@khou2020 were you able to compile qmcpack?

I will try it when summit is back online.
Do you know how to build it locally? It is easier to debug with a local build.

@brtnfld
Collaborator Author

brtnfld commented Dec 14, 2021

I've not built it locally; they do have build scripts in the config directory.

@wkliao
Collaborator

wkliao commented Dec 14, 2021

@brtnfld
Since we have committed a few fixes to the log-based VOL, they may
resolve the issue you encountered when running qmcpack.
Can you give it a try and let us know?

@brtnfld
Collaborator Author

brtnfld commented Dec 15, 2021

I still get the same QMCPACK errors:

All the species have the same mass 1
Error at line 121 in /ccs/home/brtnfld/packages/vol-log-based/src/H5VL_log_group.cpp:
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 18:
#000: ../../src/H5VLcallback.c line 4406 in H5VLgroup_open(): unable to open group
major: Virtual Object Layer
minor: Unable to initialize object
#1: ../../src/H5VLcallback.c line 4335 in H5VL__group_open(): group open failed
major: Virtual Object Layer
minor: Can't open object
#2: ../../src/H5VLnative_group.c line 154 in H5VL__native_group_open(): unable to open group
major: Symbol table
minor: Can't open object
#3: ../../src/H5Gint.c line 397 in H5G__open_name(): group not found
major: Symbol table
minor: Object not found
#4: ../../src/H5Gloc.c line 439 in H5G_loc_find(): can't find object
major: Symbol table
minor: Object not found
#5: ../../src/H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#6: ../../src/H5Gtraverse.c line 614 in H5G__traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#7: ../../src/H5Gloc.c line 396 in H5G__loc_find_cb(): object 'state_0' doesn't exist
major: Symbol table
minor: Object not found

@khou2020
Collaborator

khou2020 commented Dec 15, 2021

Are there any error messages before this? It seems the group does not exist.
The H5L API is currently not supported. If the application creates soft links between objects and opens them through the links, it will get this error. Also, the log VOL only supports locating objects by self.

If your log VOL is built in debug mode, you can set the environment variable LOGVOL_DEBUG_ABORT_ON_ERR to 1 to stop at the first error.
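
For example (assuming a debug build of the log VOL; the executable and arguments are just the ones from the run script above):

# Abort at the first log VOL error so the failing call shows up in the core file / backtrace
export LOGVOL_DEBUG_ABORT_ON_ERR=1
jsrun -n 42 ./restart -g "8 8 8"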

@brtnfld
Collaborator Author

brtnfld commented Dec 15, 2021

It is not using the H5L APIs.

I think I see the issue: they are using H5Gopen1, which your VOL probably does not support.

@brtnfld
Collaborator Author

brtnfld commented Dec 15, 2021

H5Gopen2 still has the same issue. With debugging enabled and stop-on-error set:

(gdb) bt
#0  0x0000200003a33618 in raise () from /lib64/power9/libc.so.6
#1  0x0000200003a13a2c in abort () from /lib64/power9/libc.so.6
#2  0x00002000185a8df4 in H5VL_log_group_open (obj=0x2fc2b950, loc_params=0x7fffc7e9f0c0, name=0x7fffc7e9f240 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0)
    at /ccs/home/brtnfld/packages/vol-log-based/src/H5VL_log_group.cpp:121
#3  0x0000200002f4592c in H5VL__group_open (obj=<optimized out>, loc_params=<optimized out>, loc_params@entry=0x7fffc7e9f0c0, name=<optimized out>, name@entry=0x7fffc7e9f240 "state_0", gapl_id=<optimized out>, 
    gapl_id@entry=792633534417207299, dxpl_id=<optimized out>, dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:4334
#4  0x0000200002f5048c in H5VL_group_open (vol_obj=0x2fc26db0, loc_params=0x7fffc7e9f0c0, name=0x7fffc7e9f240 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:4366
#5  0x0000200002d51508 in H5G__open_api_common (loc_id=loc_id@entry=72057594037927936, name=name@entry=0x7fffc7e9f240 "state_0", gapl_id=<optimized out>, gapl_id@entry=0, token_ptr=token_ptr@entry=0x0, 
    _vol_obj_ptr=_vol_obj_ptr@entry=0x0) at ../../src/H5G.c:397
#6  0x0000200002d52798 in H5Gopen2 (loc_id=72057594037927936, name=0x7fffc7e9f240 "state_0", gapl_id=0) at ../../src/H5G.c:437
#7  0x00000000100e0dbc in ?? ()
#8  0x00000000100c1dc8 in ?? ()
#9  0x00000000100c3f64 in ?? ()
#10 0x000000001000cc34 in ?? ()
#11 0x0000200003a14078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#12 0x0000200003a14264 in __libc_start_main () from /lib64/power9/libc.so.6
#13 0x0000000000000000 in ?? ()

@khou2020
Collaborator

khou2020 commented Dec 16, 2021

Can you share the program, the core file, and the output file on summit (with access permission granted)?
That error means the underlying VOL (native) failed to open the group. It is likely the group does not exist at all.

@brtnfld
Collaborator Author

brtnfld commented Dec 16, 2021

I put the core files in /tmp/brtnfld on login3, if that is important.

@khou2020
Collaborator

I built it locally, but I cannot locate the "restart" benchmark. Is it an optional feature?

@brtnfld
Collaborator Author

brtnfld commented Dec 16, 2021

cmake -DCMAKE_C_COMPILER="mpicc" \
      -DCMAKE_CXX_COMPILER="mpicxx" \
      -DBUILD_LMYENGINE_INTERFACE=0 \
      ${builds[$build]} \
      ..
make -j 20 restart

@khou2020
Collaborator

I debugged the program and saw that it calls H5Gopen1 before any call to H5Gcreate. The group does not exist, so the open fails.
Can you explain how the program works?

@brtnfld
Collaborator Author

brtnfld commented Dec 16, 2021

The program uses H5Gopen to determine whether the group exists; if not, it is created.

@brtnfld
Collaborator Author

brtnfld commented Dec 16, 2021

The other option is to have them use H5Oexists_by_name, but I don't see that listed as a supported API in your VOL.

@khou2020
Collaborator

H5Oexists_by_name is supported.
I fixed the bugs that were crashing qmcpack. Can you try again with the latest commit?

@brtnfld
Collaborator Author

brtnfld commented Dec 20, 2021

With the latest commit I get the core dump:


#0  0x0000200003a33618 in raise () from /lib64/power9/libc.so.6
#1  0x0000200003a13a2c in abort () from /lib64/power9/libc.so.6
#2  0x00002000185a93c8 in H5VL_log_group_open (obj=0x2e1cc940, loc_params=0x7fffd99b58f0, name=0x7fffd99b5a70 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0)
    at /ccs/home/brtnfld/packages/vol-log-based/src/H5VL_log_group.cpp:121
#3  0x0000200002f4592c in H5VL__group_open (obj=<optimized out>, loc_params=<optimized out>, loc_params@entry=0x7fffd99b58f0, name=<optimized out>, name@entry=0x7fffd99b5a70 "state_0", gapl_id=<optimized out>, 
    gapl_id@entry=792633534417207299, dxpl_id=<optimized out>, dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:4334
#4  0x0000200002f5048c in H5VL_group_open (vol_obj=0x2e1c7d70, loc_params=0x7fffd99b58f0, name=0x7fffd99b5a70 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:4366
#5  0x0000200002d51508 in H5G__open_api_common (loc_id=loc_id@entry=72057594037927936, name=name@entry=0x7fffd99b5a70 "state_0", gapl_id=<optimized out>, gapl_id@entry=0, token_ptr=token_ptr@entry=0x0, _vol_obj_ptr=_vol_obj_ptr@entry=0x0)
    at ../../src/H5G.c:397
#6  0x0000200002d52798 in H5Gopen2 (loc_id=72057594037927936, name=0x7fffd99b5a70 "state_0", gapl_id=0) at ../../src/H5G.c:437
#7  0x00000000100f34ac in qmcplusplus::hdf_archive::push(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#8  0x00000000100d76d8 in qmcplusplus::RandomNumberControl::write_parallel(qmcplusplus::hdf_archive&, Communicate*) ()
#9  0x00000000100d9874 in qmcplusplus::RandomNumberControl::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Communicate*) ()
#10 0x000000001000d0bc in main ()

@khou2020
Collaborator

With the latest commit I get the core dump:


#0  0x0000200003a33618 in raise () from /lib64/power9/libc.so.6
#1  0x0000200003a13a2c in abort () from /lib64/power9/libc.so.6
#2  0x00002000185a93c8 in H5VL_log_group_open (obj=0x2e1cc940, loc_params=0x7fffd99b58f0, name=0x7fffd99b5a70 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0)
    at /ccs/home/brtnfld/packages/vol-log-based/src/H5VL_log_group.cpp:121
#3  0x0000200002f4592c in H5VL__group_open (obj=<optimized out>, loc_params=<optimized out>, loc_params@entry=0x7fffd99b58f0, name=<optimized out>, name@entry=0x7fffd99b5a70 "state_0", gapl_id=<optimized out>, 
    gapl_id@entry=792633534417207299, dxpl_id=<optimized out>, dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:4334
#4  0x0000200002f5048c in H5VL_group_open (vol_obj=0x2e1c7d70, loc_params=0x7fffd99b58f0, name=0x7fffd99b5a70 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:4366
#5  0x0000200002d51508 in H5G__open_api_common (loc_id=loc_id@entry=72057594037927936, name=name@entry=0x7fffd99b5a70 "state_0", gapl_id=<optimized out>, gapl_id@entry=0, token_ptr=token_ptr@entry=0x0, _vol_obj_ptr=_vol_obj_ptr@entry=0x0)
    at ../../src/H5G.c:397
#6  0x0000200002d52798 in H5Gopen2 (loc_id=72057594037927936, name=0x7fffd99b5a70 "state_0", gapl_id=0) at ../../src/H5G.c:437
#7  0x00000000100f34ac in qmcplusplus::hdf_archive::push(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#8  0x00000000100d76d8 in qmcplusplus::RandomNumberControl::write_parallel(qmcplusplus::hdf_archive&, Communicate*) ()
#9  0x00000000100d9874 in qmcplusplus::RandomNumberControl::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Communicate*) ()
#10 0x000000001000d0bc in main ()

Did you unset LOGVOL_DEBUG_ABORT_ON_ERR? It seems the log VOL aborts when it cannot find the group.

@khou2020
Collaborator

With the latest commit I get the core dump:


#0  0x0000200003a33618 in raise () from /lib64/power9/libc.so.6
#1  0x0000200003a13a2c in abort () from /lib64/power9/libc.so.6
#2  0x00002000185a93c8 in H5VL_log_group_open (obj=0x2e1cc940, loc_params=0x7fffd99b58f0, name=0x7fffd99b5a70 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0)
    at /ccs/home/brtnfld/packages/vol-log-based/src/H5VL_log_group.cpp:121
#3  0x0000200002f4592c in H5VL__group_open (obj=<optimized out>, loc_params=<optimized out>, loc_params@entry=0x7fffd99b58f0, name=<optimized out>, name@entry=0x7fffd99b5a70 "state_0", gapl_id=<optimized out>, 
    gapl_id@entry=792633534417207299, dxpl_id=<optimized out>, dxpl_id@entry=792633534417207304, req=<optimized out>, req@entry=0x0, cls=<optimized out>, cls=<optimized out>) at ../../src/H5VLcallback.c:4334
#4  0x0000200002f5048c in H5VL_group_open (vol_obj=0x2e1c7d70, loc_params=0x7fffd99b58f0, name=0x7fffd99b5a70 "state_0", gapl_id=792633534417207299, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:4366
#5  0x0000200002d51508 in H5G__open_api_common (loc_id=loc_id@entry=72057594037927936, name=name@entry=0x7fffd99b5a70 "state_0", gapl_id=<optimized out>, gapl_id@entry=0, token_ptr=token_ptr@entry=0x0, _vol_obj_ptr=_vol_obj_ptr@entry=0x0)
    at ../../src/H5G.c:397
#6  0x0000200002d52798 in H5Gopen2 (loc_id=72057594037927936, name=0x7fffd99b5a70 "state_0", gapl_id=0) at ../../src/H5G.c:437
#7  0x00000000100f34ac in qmcplusplus::hdf_archive::push(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#8  0x00000000100d76d8 in qmcplusplus::RandomNumberControl::write_parallel(qmcplusplus::hdf_archive&, Communicate*) ()
#9  0x00000000100d9874 in qmcplusplus::RandomNumberControl::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Communicate*) ()
#10 0x000000001000d0bc in main ()

LOGVOL_DEBUG_ABORT_ON_ERR is removed in the latest version; can you try again with the latest commit?

@brtnfld
Collaborator Author

brtnfld commented Dec 27, 2021

I found the issue with my batch script: it was enabling compression, only for the cases of chunked datasets and collective I/O. I removed that line from the script and it works.

@khou2020
Collaborator

I found the issue with my batch script: it was enabling compression, only for the cases of chunked datasets and collective I/O. I removed that line from the script and it works.
@brtnfld

Compression is currently experimental. Should we close this issue and open a new one for filter support?

@brtnfld
Collaborator Author

brtnfld commented Dec 27, 2021

Fine with me to close it.

Maybe for now add a disclaimer:

Compile with the zlib library (--enable-zlib) to enable metadata compression, or mention it in the limitations section of the README.

@brtnfld
Collaborator Author

brtnfld commented Dec 28, 2021

The comments related to QMCPACK in PR#11 and /tmp on Summit should be here, and this issue reopened.

@khou2020 khou2020 reopened this Dec 28, 2021
@brtnfld
Collaborator Author

brtnfld commented Dec 28, 2021

Created #13; this can be closed.
