HDF5: Field "Kinks" with FreeFormula #2841
Could you please provide the function you used to create the density?
Could you please also provide the version of PIConGPU, and the commit hash if it is the dev branch?
It's PIConGPU 0.4.1, and I sent you a mail with the details :) But since this shows up, oddly, again at later time steps where the target has expanded, I suspect it is somehow in the field dumping routines (E and density!)...
Perfect catch, we saw that indeed the whole front is moved by one cell as well, occurring in sections at transversal GPU boundaries (of which I have four). Since I averaged transversally in the images above, my artifact was not perfectly zero, as one can see now. Could be a host-to-device preparation issue or a libSplash dump issue.
Is this a PNG plugin output or also HDF5?
This output is HDF5. I also tried ADIOS and cannot find such artifacts.
So it's hopefully just a libSplash bug, since the deviceToHost functionality for fields is identical in our HDF5 & ADIOS plugins.
The reason for this bug is https://github.com/ComputationalRadiationPhysics/libSplash/blob/140d6ce903d265909c3f8021d5d9cc04e5a72d37/src/DCDataSet.cpp#L220. Somehow the chunking is creating these artefacts. |
Ouch. What do we do about this? It looks like an HDF5 bug to me, since I can't see a limitation that we violate (besides our chunks being a little small for parallel I/O). We can of course just disable the whole feature in libSplash, losing some functionality such as serial compression and extensible data sets (both off by default).
I also read all the documentation I could find and cannot see anything we did wrong. I am currently not sure if we should disable chunking. Maybe we should first check other HDF5 versions and create a mini app.
As suggested by @ax3l, I created an example based on the following code snippet; compiling and running it reproduces the problem:

```c
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define X 1872llu
#define Y 1872llu

int write_HDF5(
    MPI_Comm const comm, MPI_Info const info,
    float* data, size_t len, int rank)
{
    // file access property list with MPI-I/O driver
    hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(plist_id, comm, info);

    // create the file, named after the local buffer length
    char file_name[100];
    sprintf(file_name, "%zu.h5", len);
    hid_t file_id = H5Fcreate(file_name, H5F_ACC_TRUNC,
                              H5P_DEFAULT, plist_id);

    // dataspaces: local block and global extent (4 x 4 ranks)
    hsize_t dims[2] = {Y, X};
    hsize_t globalDims[2] = {Y * 4, X * 4};
    hsize_t max_dims[2] = {Y * 4, X * 4};
    hsize_t offset[2] = {rank / 4 * Y, rank % 4 * X};
    hid_t srcSize = H5Screate_simple(2, dims, NULL);
    hid_t filespace = H5Screate_simple(2, globalDims, max_dims);
    printf("%i: %llu,%llu %llu,%llu\n", rank,
           offset[0], offset[1], globalDims[0], globalDims[1]);

    // chunking: 128 x 128 elements per chunk
    hsize_t chunk[2] = {128, 128};
    hid_t datasetCreationProperty = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(datasetCreationProperty, 2, chunk);

    // dataset
    hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_FLOAT,
                              filespace, H5P_DEFAULT,
                              datasetCreationProperty, H5P_DEFAULT);

    // transfer property list: collective mode triggers the bug,
    // independent mode (the default) does not
    hid_t dset_plist_id = H5Pcreate(H5P_DATASET_XFER);
#ifdef FIX
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_INDEPENDENT); // default
#else
    H5Pset_dxpl_mpio(dset_plist_id, H5FD_MPIO_COLLECTIVE);
#endif

    // select this rank's hyperslab and write
    hid_t dd = H5Dget_space(dset_id);
    H5Sselect_hyperslab(dd, H5S_SELECT_SET, offset,
                        NULL, dims, NULL);
    herr_t status;
    status = H5Dwrite(dset_id, H5T_NATIVE_FLOAT,
                      srcSize, dd, dset_plist_id, data);
    (void)status;

    // close all handles
    H5Sclose(dd);
    H5Sclose(filespace);
    H5Sclose(srcSize);
    H5Pclose(datasetCreationProperty);
    H5Pclose(plist_id);
    H5Pclose(dset_plist_id);
    H5Dclose(dset_id);
    H5Fclose(file_id);
    return 0;
}

int main(int argc, char* argv[])
{
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Info info = MPI_INFO_NULL;
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t lengths[1] = {X * Y};
    for (size_t i = 0; i < 1; ++i)
    {
        size_t len = lengths[i];
        printf("Writing for len=%zu ...\n", len);
        float* data = (float*)malloc(len * sizeof(float));
        for (size_t y = 0; y < Y; ++y)
            for (size_t x = 0; x < X; ++x)
                data[y * X + x] = 100.f + y % 1024;
        write_HDF5(comm, info, data, len, rank);
        free(data);
        printf("Finished write for len=%zu ...\n", len);
    }
    MPI_Finalize();
    return 0;
}
```
Thanks, great reproducer. In libSplash we also default to collective writing. We know that independent writes are broken in other ways, e.g. they sometimes deadlock and have bad performance if I remember correctly. |
@ax3l, could you have a look at my report in the HDF5 forum and reproduce it? https://forum.hdfgroup.org/t/hdf5bug-h5fd-mpio-collective-chunking/5279/4 It looks like some grid sizes are also broken even if I use my workaround. I need an additional pair of eyes to validate that I did not do something wrong.
I solved my issue with the small-size arrays and updated the example code here and in the HDF5 forum post.
Suggestions to improve the report, as it currently reads confusingly: you are mixing the terminology of threads and MPI ranks in your HDF5 forum post. Please remove all wrong text and wrong images rather than crossing them out; they will confuse readers, who will read the post next week at the earliest anyway. You want them to reproduce only the actual bug demonstrator, not usage errors. Please also report whether this is seen with the HDF5 1.10 releases as well; they will not fix 1.8 with high priority.
The issue is very likely OpenMPI specific, as I cannot reproduce it with MPICH 3.3 open-mpi/ompi#6285 |
A new temporary work-around surfaced: OpenMPI also ships a ported ROMIO implementation for I/O. We can switch the OpenMPI I/O backend away from its default to ROMIO (the same implementation as in MPICH) via: mpirun --mca io ^ompio ...
This issue is worked around in #2857 and reported upstream to OpenMPI. (Affects releases 2.0-4.0.)
I am seeing density "kinks"/dips at multiples of about 1024 cells in one of my simulations:
Looks like either an initialization bug or an HDF5 output bug to me, since I can later see the same dip not only in the density but also in the electric fields, and they stay at the same y over time. Or something with our field strides...
The same output dip is seen for multiples of this position at later times when particles move into the simulation box.
Cells in y: 14720
GPUs in y: 4
GPUs in x: 4
y-cells per GPU: 3680 (<- not at a GPU boundary)
Simulation: 2D
Supercell: 16x16
Particles per cell: 20
Init in cell: random
Shape: PCS
Software / System
PIConGPU 0.4.1 on Hypnos (K80) and Hemera (P100).
Files: (per mail)