Unable to pass parallel make tests #1

Closed
scottneuhoff opened this issue Mar 29, 2021 · 14 comments

@scottneuhoff

Hello vol-async team,

I'm trying to get this HDF5 Asynchronous I/O VOL Connector installed on my system. I can get it to the point where it passes the serial tests (in vol-async/test/pytest.py) but never the parallel ones. I think there may be some inconsistencies with the directory structures / paths as written, so hopefully we can clear this up together. Let me walk you through how I got here:

  1. I cloned the HDF5 and vol-async repos and set my environment directories like so:
export H5_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/
export VOL_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/
export ABT_DIR=/home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/
  2. I checked out the async_vol_register_optional branch and ran autogen.sh
  3. I ran ./configure --prefix=$H5_DIR/install --enable-parallel --enable-threadsafe --enable-unsupported CC=mpicc using my system's HPE MPT installation for MPI.
  4. I ran make install with no issues, switched to $ABT_DIR, and ran ./autogen.sh && CC=cc ./configure --prefix=$ABT_DIR/build && make install with no issues.
  5. Here is where I think things start to break down a little. I cd into $VOL_DIR/src and copy Makefile.summit to Makefile. I edit it so that:
HDF5_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/install
ABT_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/build

Notice these are not as written in the repo's README: I had to add /install to the end of HDF5_DIR for it to find the correct header files. If I did not do this, it would complain that hdf5dev.h could not be found (as it should; that header file is not in $H5_DIR as Makefile.summit would have you believe).
6. After editing that Makefile, I run make and it completes smoothly. Next, I run

export LD_LIBRARY_PATH=$VOL_DIR/src:$H5_DIR/lib:$LD_LIBRARY_PATH
export HDF5_PLUGIN_PATH="$VOL_DIR"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"

Here again, though, I find that $H5_DIR/lib doesn't exist; perhaps it should be $H5_DIR/install/lib.
7. In $VOL_DIR/test, I copy Makefile.summit to Makefile and again edit it so that:

ASYNC_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/src
HDF5_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/hdf5/install
ABT_DIR = /home1/sneuhoff/nbu11/scratch/hdf5_async/vol-async/argobots/build
  8. I run make with no issues
  9. When I run make check (my Python is version 3.7.0), I get the following:
./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -6 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful
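
The path mismatch noted in the steps above ($H5_DIR/lib not existing) is the kind of problem a quick check can catch before running the tests. The following is only a minimal sketch: report_missing_dirs is a hypothetical helper, not part of HDF5 or vol-async, and it simply reports entries of a colon-separated path list that are not existing directories.

```shell
# report_missing_dirs is a hypothetical helper (not part of HDF5 or vol-async).
# It prints every entry of a colon-separated path list that is not a directory.
report_missing_dirs() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        # Skip empty entries, e.g. from a trailing colon.
        [ -n "$dir" ] || continue
        [ -d "$dir" ] || echo "missing: $dir"
    done
    IFS=$old_ifs
}

# Example: check the library search path before running the tests.
report_missing_dirs "${LD_LIBRARY_PATH:-}"
```

With the settings listed above, this would be expected to flag $H5_DIR/lib, since the libraries were installed under $H5_DIR/install/lib.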

Running async_test_multifile.exe alone gives me:

async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed.
Aborted (core dumped)

In my other attempts changing various things I was able to get it to pass all the way to here:

./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
Test # 3 : async_test_multifile.exe PASSED
Test # 4 : async_test_serial_event_set.exe PASSED
ERROR: Test async_test_serial_event_set_error_stack.exe : returned non-zero exit status= 255 aborting test
run_cmd= ./async_test_serial_event_set_error_stack.exe
pytest was unsuccessful

Running that test individually gives:

H5Fcreate start
H5Fcreate done
H5Gcreate start
H5Gcreate done
H5Gcreate 2 start (should fail when executed)
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5G.c line 268 in H5Gcreate_async(): unable to asynchronously create group
    major: Symbol table
    minor: Unable to create file
  #001: H5G.c line 185 in H5G__create_api_common(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #002: H5VLcallback.c line 4920 in H5VL_group_create(): group create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLcallback.c line 4887 in H5VL__group_create(): group create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: H5VLnative_group.c line 103 in H5VL__native_group_create(): unable to create group
    major: Symbol table
    minor: Unable to initialize object
  #005: H5Gint.c line 328 in H5G__create_named(): unable to create and link to group
    major: Symbol table
    minor: Unable to initialize object
  #006: H5L.c line 2383 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #007: H5L.c line 2625 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #008: H5Gtraverse.c line 838 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #009: H5Gtraverse.c line 614 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #010: H5L.c line 2418 in H5L__link_cb(): name already exists
    major: Links
    minor: Object already exists
Error with group create
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5S.c line 496 in H5Sclose(): not a dataspace
    major: Invalid arguments to routine
    minor: Inappropriate type
Closing dataset's dataspace failed
HDF5-DIAG: Error detected in HDF5 (1.13.0) thread 0:
  #000: H5D.c line 472 in H5Dclose(): not a dataset ID
    major: Invalid arguments to routine
    minor: Inappropriate type
Closing dataset failed

I am wondering if there is anything here that is obviously inconsistent with how I should be installing things. Let me know, thanks!

@houjun houjun self-assigned this Mar 31, 2021
@houjun
Collaborator

houjun commented Mar 31, 2021

Hi @scottneuhoff, thank you for your interest in trying our async VOL connector and for providing feedback.

I have updated the README with the correct path for the installed libraries as you suggested.

For the errors you are seeing with "./async_test_serial_event_set_error_stack.exe", it is likely that the HDF5_VOL_CONNECTOR environment variable is not set properly, so the test program is not using the async connector. Can you run "echo $HDF5_VOL_CONNECTOR" before your run and make sure it is "async under_vol=0;under_info={}"? It would also be good to update the HDF5 code and async code to the latest version with "git pull", since we have been fixing bugs.
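
The check suggested above can also be scripted. This is a minimal sketch, assuming the expected value is exactly the string quoted in this thread; vol_connector_ok and EXPECTED_CONNECTOR are hypothetical names introduced here for illustration.

```shell
# Expected value, as quoted in this thread.
EXPECTED_CONNECTOR='async under_vol=0;under_info={}'

# vol_connector_ok is a hypothetical helper: it succeeds only on an exact match.
vol_connector_ok() {
    [ "$1" = "$EXPECTED_CONNECTOR" ]
}

if vol_connector_ok "${HDF5_VOL_CONNECTOR:-}"; then
    echo "HDF5_VOL_CONNECTOR looks right"
else
    echo "HDF5_VOL_CONNECTOR is '${HDF5_VOL_CONNECTOR:-}', expected '$EXPECTED_CONNECTOR'"
fi
```

Note that the value has to be quoted when it is exported, because an unquoted ";" ends the shell command and the variable would then hold only the part before it.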

For parallel tests, can you check the content of async_vol_test.err, or just run "mpirun -np 4 ./async_test_parallel.exe" and see if there are any errors?

What version of Linux system are you running on? You can check with "cat /etc/os-release".

@scottneuhoff
Author

@houjun, thanks for your feedback; the directory structures look much better. However, I am still unable to pass the multifile test. When I run make check from $VOL_DIR/test, I get the following:

./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -6 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful

Running the multifile test individually and/or checking async_vol_test.err yields the same message:
async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed.

This was after double-checking my directories and environment variables and pulling the most recent updates from git. I am currently running SUSE Linux, SLES12.
As for async_test_parallel.exe, when I try to run that individually, I get the following:

HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:
  #000: H5.c line 1010 in H5open(): library initialization failed
    major: Function entry/exit
    minor: Unable to initialize object
  #001: H5.c line 277 in H5_init_library(): unable to initialize vol interface
    major: Function entry/exit
    minor: Unable to initialize object
  #002: H5VLint.c line 202 in H5VL_init_phase2(): unable to set default VOL connector
    major: Virtual Object Layer
    minor: Can't set value
  #003: H5VLint.c line 444 in H5VL__set_def_conn(): can't register connector
    major: Virtual Object Layer
    minor: Unable to register new ID
  #004: H5VLint.c line 1371 in H5VL__register_connector_by_name(): unable to load VOL connector
    major: Virtual Object Layer
    minor: Unable to initialize object
HDF5-DIAG: Error detected in HDF5 (1.13.0) MPI-process 0:
  #000: H5VL.c line 144 in H5VLregister_connector_by_name(): unable to register VOL connector
    major: Virtual Object Layer
    minor: Unable to register new ID
  #001: H5VLint.c line 1371 in H5VL__register_connector_by_name(): unable to load VOL connector
    major: Virtual Object Layer
    minor: Unable to initialize object
  [ASYNC VOL ERROR] with H5VLregister_connector_by_name
  [ASYNC VOL ERROR] H5Pset_vol_async: async_setup
async_test_parallel.exe: async_test_parallel.c:41: main: Assertion `status >= 0' failed.
MPT ERROR: Rank 0(g:0) received signal SIGABRT/SIGIOT(6).

@qkoziol
Contributor

qkoziol commented Apr 1, 2021

@scottneuhoff - This looks like the bug I fixed on the async_vol_register_optional branch of the HPC-IO org's HDF5 git repo last week. Can you please pull the latest HDF5 code from that branch, rebuild & install, then try this test again?

@houjun
Collaborator

houjun commented Apr 1, 2021

@scottneuhoff I just pushed a change to the parallel testing that removes H5Pset_vol_async() calls which are not necessary. Can you try again? If there are still errors, maybe we can find a time for a Zoom session to go through the tests together?

@scottneuhoff
Author

scottneuhoff commented Apr 1, 2021

@qkoziol @houjun Thanks for your quick responses. I switched to another machine that I hoped would be less complicated to work with (Red Hat Linux) and went through the process from a fresh directory; this ensured that I had the most recent code from git, so all the changes you refer to should be in. However, I get to exactly the same place as before: I set those environment variables, go into $VOL_DIR/test, edit my Makefile, run make check and hit:

./pytest.py -p
Running serial tests
Test # 1 : async_test_serial.exe PASSED
Test # 2 : async_test_serial2.exe PASSED
ERROR: Test async_test_multifile.exe : returned non-zero exit status= -6 aborting test
run_cmd= ./async_test_multifile.exe
pytest was unsuccessful

Where again, running async_test_multifile.exe individually (or checking async_vol_test.err) tells me:

async_test_multifile.exe: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed.

Interestingly, I also found that earlier in the install process, when running make check in my $H5_DIR (the hdf5 directory), I pass most of those checks but get to test_mirror.sh, which fails with the same error:

============================
Testing test_mirror.sh 
============================
test_mirror.sh  Test Log
============================
mkdir: cannot create directory ‘mirror_vfd_test’: File exists
Launching Mirror Server
Mirror VFD was not built -- cannot launch server.
mirror_vfd: H5CX.c:3610: H5CX__pop_common: Assertion `head && *head' failed.
test_mirror.sh: line 80: 13012 Aborted                 (core dumped) ./mirror_vfd
Stopping Mirror Server
Mirror VFD not built -- unable to perform shutdown.
Mirror VFD tests FAILED.
0.05user 0.11system 0:00.29elapsed 58%CPU (0avgtext+0avgdata 3376maxresident)k
0inputs+1656outputs (0major+18419minor)pagefaults 0swaps
make[4]: *** [Makefile:3536: test_mirror.sh.chkexe_] Error 1

It's the same Assertion `head && *head' failed.
@houjun if a call would make solving this issue smoother, then please email me at scott.neuhoff@nasa.gov and we can figure it out. Thanks~

@qkoziol
Contributor

qkoziol commented Apr 2, 2021

@scottneuhoff - Although the assert is the same, I have a feeling that these have different root causes. I'm also happy to "pair debug" in a call; I'll send you and Tang an email to set something up.

@houjun
Collaborator

houjun commented Apr 6, 2021

Closing this as we solved the problem in the call.

@houjun houjun closed this as completed Apr 6, 2021
@gsjaardema

I am trying the async VOL today and am getting the same head && *head assertion failure. My application accesses HDF5 through either the netCDF or CGNS library (the assertion fails with both). In the assert, head is non-null, but *head is NULL.

Not sure if the same cause as this issue, but the failure looks the same, so perhaps the solution is the same?

@houjun
Collaborator

houjun commented Apr 7, 2021

Hi @gsjaardema, the previous problem was due to an environment variable setting: HDF5_PLUGIN_PATH should be set to "$VOL_DIR/src" instead of "$VOL_DIR". Can you check the HDF5_PLUGIN_PATH value in your environment? (Also, please update the HDF5 library as well as the async VOL to the latest version.)
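
A quick way to confirm that setting is to look for the connector library in each directory on the plugin path. This is a rough sketch rather than an official tool: find_connector is a hypothetical helper, and it assumes the connector is built as libh5async.so (the file name mentioned later in this thread) and that HDF5_PLUGIN_PATH is a colon-separated list.

```shell
# find_connector is a hypothetical helper: it prints each directory in a
# colon-separated plugin path that contains libh5async.so.
find_connector() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        if [ -f "$dir/libh5async.so" ]; then
            echo "$dir"
        fi
    done
    IFS=$old_ifs
}

found=$(find_connector "${HDF5_PLUGIN_PATH:-}")
if [ -n "$found" ]; then
    echo "connector found in: $found"
else
    echo "libh5async.so not found in HDF5_PLUGIN_PATH='${HDF5_PLUGIN_PATH:-}'"
fi
```

With HDF5_PLUGIN_PATH set to "$VOL_DIR" this would report nothing found, while "$VOL_DIR/src" would be listed once the connector has been built there.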

@gsjaardema

The HDF5_PLUGIN_PATH is set to the correct location and the HDF5 library is up-to-date as of yesterday. I will do some more debugging and make sure the plugin is being found and loaded and then open a new issue if I still can't determine the problem.

@qkoziol
Contributor

qkoziol commented Apr 8, 2021

This also seems similar to a problem that I fixed in the incoming branch. Can you try again with the 'async_vol_register_optional' branch of both the HDF5 and vol-async repos from the HPC-IO org's forks:

https://github.com/hpc-io/hdf5/tree/async_vol_register_optional
https://github.com/hpc-io/vol-async/tree/async_vol_register_optional

@gsjaardema

I am on the async_vol_register_optional branch of both the HDF5 and vol-async repos, and the code is up-to-date. For some reason the dlopen is failing in H5PL__open. The path that it is using in that routine points to the correct plugin library, but the handle coming back is NULL.

I made a simple C program to just do a dlopen on the same file and a dlsym on the symbol and that works correctly.

Not sure why it can't open the libh5async.so library in the app, but can in my simple program... The tests in vol-async run correctly, so it is opening the library correctly there...

@gsjaardema

OK, I reran with LD_DEBUG=libs which outputs debug information related to shared libraries and it showed that the application could not find libabt.so. I then reran with the LD_LIBRARY_PATH specifying the path to libabt.so and everything seems to be working.
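
A complementary check to LD_DEBUG=libs is to ask ldd which dependencies of the plugin itself cannot be resolved. The sketch below assumes a glibc-based Linux system where ldd is available; missing_libs is a hypothetical helper, and the PLUGIN location is an assumption that should be adjusted to wherever libh5async.so was built.

```shell
# missing_libs is a hypothetical helper: it reads ldd output on stdin and prints
# the names of dependencies the dynamic loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# PLUGIN is an assumed location; adjust it to your build.
PLUGIN="${VOL_DIR:-.}/src/libh5async.so"
if [ -f "$PLUGIN" ]; then
    ldd "$PLUGIN" | missing_libs
else
    echo "no plugin at $PLUGIN"
fi
```

If the Argobots library is not on the loader's search path, it would be expected to show up in this list; an empty list suggests the plugin's direct dependencies resolve in the current environment.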

I'm not sure why it isn't finding the library without LD_LIBRARY_PATH being set since the executable has that location specified with the -rpath option... But, seems to be working for now. Will see if I can figure out why it needs LD_LIBRARY_PATH.
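
One possible explanation, offered as an assumption and not something established in this thread: with glibc, a RUNPATH entry is only consulted for the direct dependencies of the object that carries it, so an -rpath recorded on the executable as RUNPATH may not help the loader find libabt.so when it is needed by the dlopen'ed libh5async.so. Inspecting the dynamic section of both objects can show which tag each one has. In this sketch, search_path_tags is a hypothetical helper and APP is a placeholder for the application binary.

```shell
# search_path_tags is a hypothetical helper: it reads `readelf -d` output on
# stdin and keeps only the RPATH/RUNPATH entries.
search_path_tags() {
    grep -E '\((RPATH|RUNPATH)\)'
}

# Inspect both the application (APP is a placeholder) and the plugin.
for obj in "${APP:-./my_app}" "${VOL_DIR:-.}/src/libh5async.so"; do
    if [ -f "$obj" ]; then
        echo "== $obj"
        readelf -d "$obj" | search_path_tags || echo "(no RPATH/RUNPATH entry)"
    fi
done
```

If the plugin itself has no entry pointing at the Argobots lib directory, linking it with an -rpath of its own would be one thing to try, in line with the suggestion that the dependency location be linked into the connector.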

@qkoziol
Contributor

qkoziol commented Apr 8, 2021

Ah, very cool! Annoying about the LD_LIBRARY_PATH though - I would tend to agree with you, it should have been linked into the async VOL connector and shouldn't need to be added to the dynamic library path.
