Crash with large SWIFT runs using MPI #129

Open
FHusko opened this issue Oct 5, 2023 · 2 comments

FHusko commented Oct 5, 2023

A couple of us on the SWIFT team have been having issues running VR on snapshots with a particular box size (50 Mpc) at a particular resolution (around 10^6 Msun); these runs require MPI. VR runs fine on equally large runs (in terms of particle number) if the box size is 25 Mpc or 100 Mpc.

There is a segmentation fault that happens late in the run (I may open another issue about that), but I have now come across a different bug that happens quite early. Here is the error message:

fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x2b30784136b0, scount=-95418416, MPI_BYTE, dest=59, stag=20, rbuf=0x2b31874d6850, rcount=38400, MPI_BYTE, src=59, rtag=20, MPI_COMM_WORLD, status=0x7ffe17ebe620) failed
MPI_Sendrecv(125): Negative count, value is -95418416
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x9253efa0, scount=38400, MPI_BYTE, dest=58, stag=20, rbuf=0x2ac35a697a30, rcount=-95418416, MPI_BYTE, src=58, rtag=20, MPI_COMM_WORLD, status=0x7fff18993aa0) failed
MPI_Sendrecv(126): Negative count, value is -95418416
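
For what it's worth, if the negative count comes from a 64-bit byte count being narrowed into MPI's 32-bit int count argument (just a guess on my part), then the intended message size would be well above INT_MAX:

# Reinterpret the reported count as an unsigned 32-bit value (bash arithmetic,
# purely a sanity check on the numbers in the log above):
echo $(( 2**32 - 95418416 ))   # 4199548880 bytes, i.e. ~3.9 GiB > INT_MAX (2147483647)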

The snapshot itself is /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/runs3_Leiden_tests/EAGLE_res/R50_aug2023_v3p75_dT9p5/colibre_0002.hdf5. An output log file is in the same directory, out_6512100.out.

I compiled VR with the following flags: cmake ../ -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=Release. The submission script contains the following:

#!/bin/bash

#SBATCH -J VR
#SBATCH -N 8
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH -o out_%j.out
#SBATCH -e err_%j.err
#SBATCH -p cosma8
#SBATCH -A dp004
#SBATCH -t 48:00:00

module purge

module load intel_comp/2021.1.0 compiler
module load intel_mpi/2018
module load ucx/1.10.1
module load fftw/3.3.9epyc
module load parallel_hdf5/1.10.6
module load parmetis/4.0.3-64bit
module load gsl/2.5

module unload python/2.7.15
module load python/3.6.5

ulimit -c unlimited
export OMP_NUM_THREADS=1

mpirun -np 64 /cosma/home/durham/dc-husk1/VR_MPI/VELOCIraptor-STF/build/stf -I 2 -i colibre_0002 -o haloes_0002 -C /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/vrconfig_baseline.cfg

The same configuration runs successfully without MPI. With MPI it also gets further into the calculation (on this same snapshot), but only if I use fewer nodes/tasks. This particular crash appears to be related to using 8 nodes and 8 tasks per node; with e.g. 4 nodes and 4 tasks per node, the crash doesn't happen.

rtobar commented Oct 6, 2023

@FHusko thanks for the report. Do you think it'd be possible to narrow this down a bit more on your side? For example, with a stack trace of where the issue is happening.

We saw a very similar-looking error recently, and I'd like to confirm whether it's the same issue or a different one. I don't have the details of our exact issue at hand, but I think it was also a negative scount in an MPI_Sendrecv operation (and I think there are only a few of those in the code).
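
If it helps, here is roughly what I'd try (the binary path is taken from your job script; the src/ layout and core file name are assumptions on my end):

# Reconfigure with debug symbols so backtraces have file/line information:
cmake ../ -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=RelWithDebInfo

# Locate the candidate MPI_Sendrecv call sites in the source tree:
grep -rn "MPI_Sendrecv" src/

# If a core file does get written, print a backtrace from it:
gdb /cosma/home/durham/dc-husk1/VR_MPI/VELOCIraptor-STF/build/stf core.<pid> -ex bt -ex quit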

FHusko commented Oct 6, 2023

Of course! I tried creating a core dump by including ulimit -c unlimited, but for some reason no core file was produced. Do you have an idea why? I haven't seen that happen before.
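
In the meantime, a couple of things I plan to check on my side (these are just guesses, nothing confirmed yet):

# Where would a core file go? On some systems cores are piped to a handler
# rather than written to the working directory:
cat /proc/sys/kernel/core_pattern

# Set the limit inside each MPI rank's own shell, in case the ulimit from the
# batch script doesn't propagate to the remote processes:
mpirun -np 64 bash -c 'ulimit -c unlimited; exec /cosma/home/durham/dc-husk1/VR_MPI/VELOCIraptor-STF/build/stf -I 2 -i colibre_0002 -o haloes_0002 -C /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/vrconfig_baseline.cfg'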
