dread test segfaults on summit #10
Comments
Which application was it? Can you share it along with the run script?
The dread test is test/basic/dread.cpp.
Should be fixed.
qmcpack: https://github.com/QMCPACK/qmcpack. To build on Summit:
To run the restart test, use build_summit_cpu/bin (submitted from this directory):
For the log VOL make check, I run an interactive job on Summit and run it using:
dread now passes with the newest updates if I run the test manually (see the second option in the script).
Are you using the async I/O VOL? I am seeing this line:
The output should be saved to test-suit.log and dread.log. My build steps are:
Sorry, I pasted the wrong script:
Did you set TESTSEQRUN in the configure step? TESTMPIRUN is only used in make ptest.
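On a cross-compile system like Summit, these variables would typically point at the site launcher rather than mpiexec. A sketch of what that might look like before configure; the jsrun arguments and the configure paths here are assumptions for illustration, not taken from this thread:

```shell
# Hypothetical Summit-style settings; adjust the jsrun resource counts
# to your allocation.
export TESTSEQRUN="jsrun -n 1"    # launches serial test programs (make check)
export TESTMPIRUN="jsrun -n 2"    # launches parallel test programs (make ptest)
../configure --prefix=${PWD}/log-vol --with-hdf5=${HOME}/packages/hdf5-1.13/build/hdf5
```

Setting these before configure bakes the launcher into the generated test scripts, so the tests never fall back to mpiexec on the compute nodes.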
Hi, @brtnfld
Correction for the environment variable
There is no test_suit.log, and the individual log files just say "Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be." They do not give the actual verbose make output.
For the parallel test (and probably for the serial test) it seems to be using mpiexec instead of my env. setting:

    export SED="/usr/bin/sed"
    export srcdir="/ccs/home/brtnfld/packages/vol-log-based/test/basic"
    export TESTOUTDIR="."
    export TESTSEQRUN=""
    export TESTMPIRUN="/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-1

    [h16n05:3615581] Error: common_pami.c:1094 - ompi_common_pami_init() 27: Unable to create 1 PAMI communication context(s) rc=1
What is the configure command line you used to build the log-based VOL?
    #!/bin/bash
    #autoreconf -i
    export CC=mpicc
    ../configure --prefix=${PWD}/log-vol --with-hdf5=${HOME}/packages/hdf5-1.13/build/hdf5 --enable-shared --enable-zlib
    gmake -j 8
That error means it was not being run with jsrun.
@brtnfld, please try this configure command:
I added the MPI run env. variables before configure and it is now correct. All but null_req.log pass:

    HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:

It is looking for the VOL in the wrong directory; same with the parallel tests.
It is looking for /src/.libs instead of using src/.libs in the build directory |
Can you rebuild it? Maybe there are some residual files.
I pushed a commit to fix. Please run 'git pull' to update your local repo. |
I still have the same issue with the new code. I did the build in an empty build directory. Also, it would be nice if make also built the tests. Unfortunately, make check or make ptest requires running parallel programs on the front end, which is not allowed and will fail.
I pushed another commit. That should fix the bug. Summit, like other DOE parallel computers, is a cross-compile environment. |
FYI, the command "make tests" compiles and builds all the test executables.
I got the following error running the script:

    CMake Error at CMake/Testlibstdc++.cmake:3 (try_compile):
    -- Configuring incomplete, errors occurred!

I also tried building locally on my PC, but it also failed:

    [ 78%] Building CXX object src/QMCDrivers/CMakeFiles/qmcdriver_unit.dir/WaveFunctionTester.cpp.o
Did you have these modules loaded?
For the tests: it passes the null_req tests, but fails with:

    *** TESTING CXX lt-dynamic: Creating files
The parallel tests pass, but not:
Unless the env. variable is set wrong, I don't understand why setting these should cause the tests to fail. Users might set these by default in their environment if they plan on using the log VOL regularly, or may have already set them previously.
@khou2020 were you able to compile qmcpack? |
This is because your setting (quoted below) uses a previously installed log-based VOL, which I assume uses the same shared-library ABI version, causing a conflict. You can still set LD_LIBRARY_PATH, just not to the same VOL being built.
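One way to check whether an older installed copy of the VOL library shadows the freshly built one is to split LD_LIBRARY_PATH into its entries; the dynamic linker searches them in order, so the first directory containing the library wins. The example value below is hypothetical:

```shell
# Hypothetical LD_LIBRARY_PATH with a stale install ahead of the build tree.
LD_LIBRARY_PATH="/opt/old-vol/lib:${HOME}/vol-log-based/build/src/.libs"

# Print each entry on its own line; the first directory containing
# libH5VL_log.so is the one the tests will actually load.
echo "$LD_LIBRARY_PATH" | tr ':' '\n'
```

If a stale entry appears first, either drop it from LD_LIBRARY_PATH or move the build tree's .libs directory ahead of it.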
I will try it when Summit is back online.
I've not built it locally; they do have build scripts in the config directory.
@brtnfld
I still get the same QMCPACK errors:

    All the species have the same mass 1
Are there any error messages before that? It seems the group does not exist. If your log VOL is built in debug mode, you can set the environment variable LOGVOL_DEBUG_ABORT_ON_ERR to 1 to stop at the first error.
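In a batch script, the debug switch above would look something like this; the jsrun launch line is a hypothetical placeholder, not the actual QMCPACK command from this thread:

```shell
# Requires a log VOL built in debug mode; the variable name is from the
# comment above. Abort at the first VOL-internal error to get a usable core.
export LOGVOL_DEBUG_ABORT_ON_ERR=1

# Hypothetical launch line; substitute the real executable and input file.
jsrun -n 2 ./qmcpack input.xml
```

Remember to unset the variable again for normal runs, since any recoverable error will otherwise abort the job.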
It is not using H5L APIs. I think I see the issue: they are using H5Gopen1, which your VOL probably does not support.
H5Gopen2 still has the same issue. With debugging and stop-on-error:
Can you share the program, the core file, and the output file on Summit (giving access permission)?
I put the core files in /tmp/brtnfld, login3, if that is important.
I built it locally, but I cannot locate the "restart" benchmark. Is it an optional feature?
I debugged the program and saw it calls H5Gopen1 before any call to H5Gcreate. The group does not exist, so the open failed.
The program uses H5Gopen to determine whether the group exists; if not, it is created.
The other option is to have them use H5Oexists_by_name, but I don't see that listed as a supported API in your VOL.
H5Oexists_by_name is supported.
With the latest commit I get this core dump:
Did you unset LOGVOL_DEBUG_ABORT_ON_ERR? It seems the log VOL aborts after it cannot find the group.
LOGVOL_DEBUG_ABORT_ON_ERR was removed in the latest version; can you try again on the latest commit?
I found the issue with my batch script: it was enabling compression, but only for the cases of chunked datasets and collective I/O. I removed that line in the script and it works.
Compression is currently experimental. Should we close this issue and open a new one for filter support?
Fine with me to close it. Maybe for now add a disclaimer: "Compile with the zlib library (--enable-zlib) to enable metadata compression," or mention it in the limitations section of the README.
The comments related to QMCPACK in PR #11 and /tmp on Summit should be here, and this issue reopened.
Created #13; this can be closed.
Using HDF5 1.13 and the currently loaded modules, the dread test fails with the segfault below. It seems to work with one rank; this is with 2 ranks.
#0 0x0000200014f17a00 in ADIOI_GEN_WriteStrided () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#1 0x0000200014edba8c in ADIOI_GPFS_WriteStridedColl ()
from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#2 0x0000200014eceda8 in MPIOI_File_write_all () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#3 0x0000200014ecf908 in mca_io_romio_dist_MPI_File_write_at_all ()
from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#4 0x0000200014ec1c7c in mca_io_romio321_file_write_at_all ()
from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/spectrum_mpi/mca_io_romio321.so
#5 0x0000200000b08a2c in PMPI_File_write_at_all () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/container/../lib/libmpi_ibm.so.3
#6 0x000020000009c4d8 in MPI_File_write_at_all (fh=, offset=2436, buf=0x0, count=, datatype=0x3a4abdc0, status=0x7fffe55cfb18) at lib/darshan-mpiio.c:563
#7 0x000020000017e374 in H5VL_log_filei_metaflush(H5VL_log_file_t*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#8 0x000020000017ab84 in H5VL_log_filei_close(H5VL_log_file_t*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#9 0x000020000017c3e0 in H5VL_log_filei_dec_ref(H5VL_log_file_t*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#10 0x000020000018780c in H5VL_log_obj_t::~H5VL_log_obj_t() () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#11 0x000020000018a920 in H5VL_log_free_wrap_ctx(void*) () from /ccs/home/brtnfld/scratch/vol-log-based/build/src/.libs/libH5VL_log.so.0
#12 0x00002000008594a4 in H5VL__free_vol_wrapper (vol_wrap_ctx=0x3a2efba0) at ../../src/H5VLint.c:2243
#13 0x000020000085c884 in H5VL_reset_vol_wrapper () at ../../src/H5VLint.c:2431
#14 0x000020000084fd88 in H5VL_file_close (vol_obj=, dxpl_id=792633534417207304, req=0x0) at ../../src/H5VLcallback.c:4163
#15 0x000020000060b9e0 in H5F__close_cb (file_vol_obj=, request=) at ../../src/H5Fint.c:216
#16 0x00002000006a388c in H5I__dec_ref (id=id@entry=72057594037927936, request=0x0) at ../../src/H5Iint.c:1036
#17 0x00002000006a3a40 in H5I__dec_app_ref (id=72057594037927936, request=) at ../../src/H5Iint.c:1108
#18 0x00002000006a3b5c in H5I_dec_app_ref (id=) at ../../src/H5Iint.c:1156
#19 0x0000200000600c3c in H5Fclose (file_id=72057594037927936) at ../../src/H5F.c:1060
#20 0x0000000010002338 in ?? ()
#21 0x0000200000e04078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#22 0x0000200000e04264 in __libc_start_main () from /lib64/power9/libc.so.6
#23 0x0000000000000000 in ?? ()