hdf5-iotest fails #11
Do you have the output error message? Looks like I got a different error:
Error: OMPI_COMM_WORLD_RANK is not set in environment like it should be.
free(): invalid pointer
If you comment out one-case = 18, then you can check whether the other cases, 1-17, are working for your build.
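For reference, the change amounts to commenting out one line in hdf5_iotest.ini (a minimal sketch; the surrounding section and key names follow the sample config shipped with hdf5-iotest):

```ini
[DEFAULT]
# ... other benchmark parameters left unchanged ...
# one-case = 18   <- commented out, so cases 1-17 run instead of only case 18
```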
This particular bug should be fixed. Can you try again? Can you point out the location of case 18? Was it the cause of the error I saw?
It fails in H5Dwrite for case 18. I reran with the latest commit, same error. The core files are not useful since they seem to be getting truncated.
Can you post all output prior to the error? Also, your build configuration of hdf5 and logvol.
NUMBER OF NNODES, NPROCS_MAX = 1 42
step rk=2 chkd fill=true align-[incr:thold]=[1:0] mblk=2048 fmt=earliest io=mpi-io-col
Sender: LSF System lsfadmin@batch4
Job was submitted from host by user in cluster at Mon Dec 20 09:51:42 2021
The output (if any) is above this job summary.
This is for just running step 18.
Isn't test 18 the one that crashed? It seems to finish without problems on my side. I saw you using gmake. Are you using the GNU compiler or IBM XL?
Yes, 18 causes the crash. I'm using the GNU compiler. Are you using the 1.13.0 release?
Yes. But I am using IBM XL.
If you comment out the one-case = 18 line, does the program complete for you?
Yes. All tests finished.
This error message indicates a problem of calling
Kai-yuan, please run |
Strange, I get the same error with the XL compilers.
Hi Scot, another way I suggest is to link your hdf5-iotest against Kai-yuan's log-VOL.
I ran mine with the XL compilers and valgrind, and I get this error:
Are you using any filters? That part of the code is experimental and should not even be run.
It should not be using any filters; those input lines, gzip and szip, are commented out.
I checked the native FD hdf5 file; no filters are set.
Scot, could you please share your modified test programs?
That code should only run when there are filters defined in the dcpl. The buffer is only allocated when there are filters, so it makes sense to see the invalid free error. Is there any error reported prior to this?
@khou2020
Shouldn't line 166
@khou2020 Subroutine
H5VL_log_filei_balloc reserves the first 8 bytes to store the size of the allocated buffer, so the actual buffer starts 8 bytes before the user buffer.
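For readers following along, here is a minimal sketch of the size-prefix pattern described above (not the actual H5VL_log_filei_balloc source); passing the user pointer straight to free(), instead of stepping back over the 8-byte header, is exactly what produces a free(): invalid pointer abort:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a buffer whose first 8 bytes record the allocation size;
 * the pointer handed to the caller starts just past that header. */
static void *balloc(size_t size)
{
    char *base = malloc(size + sizeof(uint64_t));
    if (base == NULL) return NULL;
    *(uint64_t *)base = (uint64_t)size;  /* stash the size in the header */
    return base + sizeof(uint64_t);      /* caller only ever sees the payload */
}

/* The matching free must step back over the header; calling free()
 * directly on the user pointer is an invalid free. */
static void bfree(void *buf)
{
    if (buf != NULL) free((char *)buf - sizeof(uint64_t));
}
```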
The only dcpl is for chunking, no compression. The full output (2 ranks were used) is in /tmp/brtnfld on Summit.
Please add comments to this subroutine.
Which login node did you place your build on, login1?
I moved that to my home folder.
Note that the only difference between case 17 and case 18 is independent versus collective writes, respectively. Case 17 passes, and valgrind does not report any issues.
Logvol writes are all independent. The actual write happens at file flush time; H5Dwrite only stages the requests locally in a queue.
I tried your code but cannot reproduce the error. I did get many valgrind warnings about uninitialized values, but none originated within logvol.
This is not optimal.
Flushing (H5Fflush) is always collective, but posting (H5Dwrite) is always independent.
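In application terms the pattern looks roughly like this (a sketch using standard HDF5 calls, not the connector's internals): each rank posts its writes independently, and the staged requests are only written out at the collective flush.

```c
#include "hdf5.h"

/* Post a write: with a staging VOL connector this only queues the
 * request locally, so every rank may call it independently. */
void post_write(hid_t dset, hid_t memspace, hid_t filespace, const double *buf)
{
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, buf);
}

/* Flush the file: collective, all ranks must participate; this is
 * where the staged data actually reaches the file. */
void flush_step(hid_t file)
{
    H5Fflush(file, H5F_SCOPE_GLOBAL);
}
```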
@brtnfld We tried on 3 different machines but still cannot reproduce the error.
@brtnfld I rebuilt everything using all the modules you listed but still cannot reproduce the issue.
I put a fresh build of all the packages, hdf5, hdf5-log, and qmcpack, in /tmp/brtnfld/VLOG on login4. For my environment, I have in my .bashrc
I built everything in my /gpfs/alpine/csc300/proj-shared directory. You should be able to run the qmcpack/build_summit_cpu/bin batch script with only a #BSUB account change. All the build scripts are in the package directories; all but the qmcpack script are run from the build dir and use relative paths, so you should not have to change them at all.
I cannot access it.
It is in /tmp/brtnfld on login4.
I can't see it.
Recopied it.
Can you also include the built binaries? We once used your source but could not reproduce the issue.
All the builds and binaries are there.
There is no hdf5-iotest.
This comment was for this PR, not #10. I found the issue with my batch script: it was enabling compression, but only for the cases of chunked datasets and collective I/O. I removed that line in the script and it works. This issue can be closed and PR #10 reopened.
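For anyone hitting the same thing, a quick way to confirm whether a filter slipped into a dataset's creation property list is to query it after the fact (a sketch; the dataset name is a placeholder):

```c
#include "hdf5.h"
#include <stdio.h>

/* Open a dataset and report how many filters are set in its dcpl;
 * anything greater than zero means compression (e.g. gzip) was enabled. */
static void check_filters(hid_t file, const char *name)
{
    hid_t dset  = H5Dopen2(file, name, H5P_DEFAULT);
    hid_t dcpl  = H5Dget_create_plist(dset);
    int nfilter = H5Pget_nfilters(dcpl);
    printf("%s: %d filter(s) in dcpl\n", name, nfilter);
    H5Pclose(dcpl);
    H5Dclose(dset);
}
```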
I am confused. Can you create a new issue for what's still unsolved?
@brtnfld Do you have the steps to build on Cori? I keep getting this error. It seems to need a different toolchain than the one that comes with the system.
You have to set the environment variable MPICH_MAX_THREAD_SAFETY=multiple.
I tested it on Cori without problems.
hdf5-iotest (https://github.com/brtnfld/hdf5-iotest) is a performance benchmark that compares the effects of different HDF5 parameters and I/O patterns. I tried it on Summit with vol-log, and it crashes on test #18 in H5Dwrite. #18 outputs, at every step, rank-2 chunked arrays, with fill values on, alignment, a metadata block size of 2048, the earliest format, and MPI collective I/O.
For the same test case on a local Linux box it worked fine.
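For context, here is a rough sketch of how the case-18 knobs map onto HDF5 property-list calls (illustrative values only; this is not the benchmark's code):

```c
#include "hdf5.h"
#include <mpi.h>

/* Illustrative setup for the case-18 combination: chunked rank-2 data,
 * fill values on, alignment, 2048-byte metadata block size, earliest
 * file format, and collective MPI-IO. Chunk sizes and the fill value
 * are placeholders. */
static void case18_properties(hid_t *fapl, hid_t *dcpl, hid_t *dxpl)
{
    *fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(*fapl, MPI_COMM_WORLD, MPI_INFO_NULL);              /* parallel access */
    H5Pset_alignment(*fapl, 0, 1);                                       /* align-[incr:thold]=[1:0] */
    H5Pset_meta_block_size(*fapl, 2048);                                 /* mblk=2048 */
    H5Pset_libver_bounds(*fapl, H5F_LIBVER_EARLIEST, H5F_LIBVER_LATEST); /* fmt=earliest */

    *dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[2] = {64, 64};                                         /* rank-2 chunked layout */
    H5Pset_chunk(*dcpl, 2, chunk);
    double fill = 0.0;
    H5Pset_fill_value(*dcpl, H5T_NATIVE_DOUBLE, &fill);                  /* fill values on */

    *dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(*dxpl, H5FD_MPIO_COLLECTIVE);                       /* io=mpi-io-col */
}
```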
On summit,
Currently Loaded Modules:
I used HDF5 1.13 and hdf5-iotest master. To compile hdf5-iotest:
The program takes the input file "hdf5_iotest.ini".
To run the program:
The only information I could get from the core files was:
#0 0x0000200000863618 in raise () from /lib64/power9/libc.so.6
Backtrace stopped: Cannot access memory at address 0x7fffcf4085c0