Crash with large SWIFT runs using MPI #129

Open
FHusko opened this issue Oct 5, 2023 · 2 comments

FHusko commented Oct 5, 2023

A couple of us on the SWIFT team have been having issues running VR on snapshots with a particular box size (50 Mpc) at a particular resolution (around 10^6 Msun); these runs require MPI. VR runs fine on equally large runs (in terms of particle number) if the box size is 25 Mpc or 100 Mpc.

There is a segmentation fault that happens late in the run (I may open another issue about that), but I have now come across a different bug that happens quite early. Here is the error message:

fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x2b30784136b0, scount=-95418416, MPI_BYTE, dest=59, stag=20, rbuf=0x2b31874d6850, rcount=38400, MPI_BYTE, src=59, rtag=20, MPI_COMM_WORLD, status=0x7ffe17ebe620) failed
MPI_Sendrecv(125): Negative count, value is -95418416
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x9253efa0, scount=38400, MPI_BYTE, dest=58, stag=20, rbuf=0x2ac35a697a30, rcount=-95418416, MPI_BYTE, src=58, rtag=20, MPI_COMM_WORLD, status=0x7fff18993aa0) failed
MPI_Sendrecv(126): Negative count, value is -95418416
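
For what it's worth, if the negative count comes from a 64-bit byte count being narrowed into MPI's 32-bit int count argument (just a guess on my part), then the intended message size would be well above INT_MAX:

# Reinterpret the reported count as an unsigned 32-bit value (bash arithmetic,
# purely a sanity check on the numbers in the log above):
echo $(( 2**32 - 95418416 ))   # 4199548880 bytes, i.e. ~3.9 GiB > INT_MAX (2147483647)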

The snapshot itself is /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/runs3_Leiden_tests/EAGLE_res/R50_aug2023_v3p75_dT9p5/colibre_0002.hdf5. An output log file is in the same directory, out_6512100.out.

I compiled VR with the following flags: cmake ../ -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=Release. The submission script contains the following:

#!/bin/bash

#SBATCH -J VR
#SBATCH -N 8
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH -o out_%j.out
#SBATCH -e err_%j.err
#SBATCH -p cosma8
#SBATCH -A dp004
#SBATCH -t 48:00:00

module purge

module load intel_comp/2021.1.0 compiler
module load intel_mpi/2018
module load ucx/1.10.1
module load fftw/3.3.9epyc
module load parallel_hdf5/1.10.6
module load parmetis/4.0.3-64bit
module load gsl/2.5

module unload python/2.7.15
module load python/3.6.5

ulimit -c unlimited
export OMP_NUM_THREADS=1

mpirun -np 64 /cosma/home/durham/dc-husk1/VR_MPI/VELOCIraptor-STF/build/stf -I 2 -i colibre_0002 -o haloes_0002 -C /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/vrconfig_baseline.cfg

The same configuration runs successfully without MPI. With MPI it also gets further into the calculation (on this same snapshot), but only if I use fewer nodes/tasks. This particular crash appears to be related to using 8 nodes and 8 tasks per node; with e.g. 4 nodes and 4 tasks per node, the crash doesn't happen.

rtobar commented Oct 6, 2023

@FHusko thanks for the report. Do you think it'd be possible to narrow this down a bit more on your side? For example, with a stack trace of where the issue is happening.

We saw a very similar-looking error recently, and I'd like to confirm whether it's the same issue or a different one. I don't have the details of our exact issue at hand, but I think it was also a negative scount in an MPI_Sendrecv operation (and I think there are only a few of those in the code).
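
If it helps, here is roughly what I'd try (the binary path is taken from your job script; the src/ layout and core file name are assumptions on my end):

# Reconfigure with debug symbols so backtraces have file/line information:
cmake ../ -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=RelWithDebInfo

# Locate the candidate MPI_Sendrecv call sites in the source tree:
grep -rn "MPI_Sendrecv" src/

# If a core file does get written, print a backtrace from it:
gdb /cosma/home/durham/dc-husk1/VR_MPI/VELOCIraptor-STF/build/stf core.<pid> -ex bt -ex quit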

FHusko commented Oct 6, 2023

Of course! I tried creating a core dump by including ulimit -c unlimited, but for some reason no core file was produced. Do you have an idea why? I haven't seen that happen before.
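
In the meantime, a couple of things I plan to check on my side (these are just guesses, nothing confirmed yet):

# Where would a core file go? On some systems cores are piped to a handler
# rather than written to the working directory:
cat /proc/sys/kernel/core_pattern

# Set the limit inside each MPI rank's own shell, in case the ulimit from the
# batch script doesn't propagate to the remote processes:
mpirun -np 64 bash -c 'ulimit -c unlimited; exec /cosma/home/durham/dc-husk1/VR_MPI/VELOCIraptor-STF/build/stf -I 2 -i colibre_0002 -o haloes_0002 -C /cosma8/data/dp004/dc-husk1/SWIFT/COLIBRE/vrconfig_baseline.cfg'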
