Hamiltonian replica exchange simulation might hang at gromacs checkpoint writing #829
Comments
Thanks! This is the same as #742. The report with the stack traces is very useful. I will try to have a look at this in the next couple of weeks.
Hi @GiovanniBussi, after diving into the code, I think I found the root cause. It can be triggered as follows: suppose there are 4 replicas, and replicas #1 and #2 have done a replica exchange at step X; they are then going to write a checkpoint, which will call […] I think it can be fixed by passing the previous […] Thanks.
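The deadlock mechanism described above can be illustrated with a minimal Python sketch. This is a hypothetical toy model, not GROMACS or PLUMED code: the idea is that ranks which just exchanged state take a code path that posts an extra blocking receive before the checkpoint barrier, while the other ranks go straight to the barrier, so the blocking MPI calls no longer pair up across ranks and all of them block.

```python
# Toy model (NOT actual GROMACS/PLUMED code) of the checkpoint hang:
# after an H-REX swap, exchanged ranks post an extra blocking MPI_Recv
# before the checkpoint MPI_Barrier, so the call sequences diverge.

def posted_calls(rank, exchanged):
    """Sequence of blocking MPI calls a rank would post at checkpoint time."""
    calls = []
    if exchanged:                 # ranks that swapped state at step X
        calls.append("MPI_Recv")  # waits for the partner's state first
    calls.append("MPI_Barrier")   # checkpoint synchronization point
    return calls

def would_deadlock(ranks):
    """If ranks post different blocking-call sequences, they hang."""
    sequences = {tuple(posted_calls(r, ex)) for r, ex in ranks}
    return len(sequences) > 1

# Replicas 1 and 2 exchanged at step X; replicas 0 and 3 did not.
ranks = [(0, False), (1, True), (2, True), (3, False)]
print(would_deadlock(ranks))  # prints True: mismatched sequences, hang
```

The model only captures the ordering mismatch; the actual fix would have to make all ranks agree on the call sequence before checkpoint writing begins.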
Closed by #831
Summary
While running a replica-exchange multi-simulation, the gmx_mpi processes hang.
GROMACS version
2021.5-plumed-2.7.2
Steps to reproduce
```
cd gromacs-test
mpirun -np 4 gmx_mpi mdrun -v -ntomp 12 -cpt 1 --deffnm lambda -plumed plumed.dat -hrex -replex 100 -nb gpu -bonded gpu -pme gpu -multidir lambda0 lambda1 lambda2 lambda3 -gpu_id 01
```
gromacs-test.tar.gz
(GROMACS compile options: `cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=CUDA -DCUDA_cufft_LIBRARY=/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcufft.so.10 -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_SOURCE_DIR=/usr/local/cuda-11.0/targets/x86_64-linux/include -DGMX_MPI=ON`)
What is the current bug behavior?
The simulation might hang while some ranks are writing a checkpoint. Attaching a debugger (i.e. `gdb -p PID`, then type `bt`) gives the stack traces below:
These processes are hanging at PMPI_Barrier and PMPI_Recv, which looks like a deadlock. The second stack trace is related to the PLUMED patch: the command option '-hrex' triggers it and calls exchange_state().