A couple of us in the SWIFT team have been having issues running VR on snapshots of a particular box size (50 Mpc) at a particular resolution (around 10^6 Msun); these runs require MPI. VR runs fine on equally large runs (in terms of particle number) if the box size is 25 Mpc or 100 Mpc.
There is a segmentation fault that happens late in the run (I may open another issue about that), but I have now come across a different bug that happens quite early. Here is the error message:
```
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x2b30784136b0, scount=-95418416, MPI_BYTE, dest=59, stag=20, rbuf=0x2b31874d6850, rcount=38400, MPI_BYTE, src=59, rtag=20, MPI_COMM_WORLD, status=0x7ffe17ebe620) failed
MPI_Sendrecv(125): Negative count, value is -95418416
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x9253efa0, scount=38400, MPI_BYTE, dest=58, stag=20, rbuf=0x2ac35a697a30, rcount=-95418416, MPI_BYTE, src=58, rtag=20, MPI_COMM_WORLD, status=0x7fff18993aa0) failed
MPI_Sendrecv(126): Negative count, value is -95418416
```
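If it helps with the diagnosis: interpreting -95418416 as an unsigned 32-bit value gives 4199548880 bytes (about 4.2 GB), which is exactly what a byte count just past 2^31 looks like after being narrowed to MPI's `int` count argument. A minimal sketch of that mechanism (illustration only, not VR's actual code; the "true size" below is an assumption inferred from the reported count):

```cpp
// Illustration only, not VR's code: MPI-3 count arguments are plain
// `int`, so a byte count above INT_MAX wraps to a negative value when
// narrowed to 32 bits.
#include <climits>
#include <cstdio>

int main() {
    // The reported count -95418416, reinterpreted as an unsigned 32-bit
    // value, is 4199548880 bytes (~4.2 GB), just past the 2^31 limit.
    unsigned long long nbytes = 4199548880ULL;  // hypothetical true size
    int scount = static_cast<int>(nbytes);      // what MPI_Sendrecv sees
    std::printf("%llu bytes narrowed to int = %d\n", nbytes, scount);
    if (nbytes > static_cast<unsigned long long>(INT_MAX))
        std::puts("count exceeds INT_MAX; the message must be split "
                  "or sent with a larger derived datatype");
    return 0;
}
```

If that is what is happening here, the size would need to be computed and range-checked in 64 bits before the call ever reaches MPI.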
The snapshot itself is /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/runs3_Leiden_tests/EAGLE_res/R50_aug2023_v3p75_dT9p5/colibre_0002.hdf5. An output log file is in the same directory, out_6512100.out.
I compiled VR with the following flags: `cmake ../ -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=Release`. The submission script requests 8 nodes with 8 tasks per node.
The same configuration runs successfully without MPI. With MPI (on this same snapshot) it gets further into the calculation, but only if I use fewer nodes/tasks: this particular crash appears with 8 nodes and 8 tasks per node, while with e.g. 4 nodes and 4 tasks per node it doesn't happen. A workaround sketch for the oversized-count hypothesis follows below.
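If the root cause is indeed a single exchange growing past `INT_MAX` bytes at some node/task counts, one hedged workaround would be to split the exchange into chunks whose counts fit in an `int`. This is only a sketch; `sendrecv_bytes` is a hypothetical helper, not a VR function, and it assumes both ranks exchange the same number of bytes (real code with asymmetric sizes, as in the log above, would have to agree on the chunk count, padding with zero-size messages):

```cpp
// Hedged workaround sketch, not VR's actual code: split one large
// byte exchange into sub-INT_MAX chunks so each count fits in an int.
#include <mpi.h>
#include <algorithm>
#include <cstddef>

void sendrecv_bytes(const char* sbuf, char* rbuf, std::size_t nbytes,
                    int peer, int tag, MPI_Comm comm) {
    const std::size_t chunk = std::size_t(1) << 30;  // 1 GiB per step
    for (std::size_t off = 0; off < nbytes; off += chunk) {
        int n = static_cast<int>(std::min(chunk, nbytes - off));
        MPI_Sendrecv(sbuf + off, n, MPI_BYTE, peer, tag,
                     rbuf + off, n, MPI_BYTE, peer, tag,
                     comm, MPI_STATUS_IGNORE);
    }
}
```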
@FHusko thanks for the report. Do you think it'd be possible to narrow this down a bit more on your side? For example, a stack trace of where the issue is happening.
We saw a very similar-looking error recently, and I'd like to confirm whether it's the same issue or a different one. I don't have the details of our exact issue at hand, but I think it was also a negative scount in an MPI_Sendrecv operation (and I think there are only a few of those in the code).
Of course! I tried creating a core dump by setting ulimit -c unlimited, but for some reason no core file was produced. Do you have an idea why? I haven't seen that happen before.
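One possibility is that `ulimit -c unlimited` in the submission script may not propagate to MPI ranks launched on remote nodes, and batch systems often apply their own core-size limits on the compute nodes. As an alternative way to get a stack trace without core dumps, a signal handler can write a backtrace to stderr when the segfault happens. This is only a sketch, assuming glibc's `<execinfo.h>` is available; it is not part of VR:

```cpp
// Hedged sketch: print a glibc backtrace on SIGSEGV instead of relying
// on a core dump. Compile with -g and link with -rdynamic so the frames
// resolve to symbol names. Illustrative only, not part of VR.
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

extern "C" void segv_handler(int sig) {
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  // async-signal-safe
    _exit(sig);  // exit immediately; do not return into the faulting code
}

int main() {
    std::signal(SIGSEGV, segv_handler);
    // ... rest of the program would run here ...
    return 0;
}
```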