
Hamiltonian replica exchange simulation might hang at GROMACS checkpoint writing #829

Closed
shazj99 opened this issue Jun 7, 2022 · 3 comments

shazj99 commented Jun 7, 2022

Summary
While running a replica-exchange multi-simulation, gmx_mpi processes can hang.

GROMACS version
2021.5-plumed-2.7.2

Steps to reproduce
cd gromacs-test
mpirun -np 4 gmx_mpi mdrun -v -ntomp 12 -cpt 1 --deffnm lambda -plumed plumed.dat -hrex -replex 100 -nb gpu -bonded gpu -pme gpu -multidir lambda0 lambda1 lambda2 lambda3 -gpu_id 01

gromacs-test.tar.gz

(Gromacs compile options: cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=CUDA -DCUDA_cufft_LIBRARY=/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcufft.so.10 -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_SOURCE_DIR=/usr/local/cuda-11.0/targets/x86_64-linux/include -DGMX_MPI=ON)

What is the current bug behavior?
The simulation can hang when some ranks are writing a checkpoint. Attaching a debugger to the hung processes (i.e. gdb -p PID, then typing "bt") yields the stack traces below, taken from two different ranks:

#0  0x00007f953708a4e0 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#1  0x00007f95370791aa in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#2  0x00007f9536f5e35b in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#3  0x00007f9536fd9456 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#4  0x00007f9536fda1fc in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#5  0x00007f9536f924c7 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#6  0x00007f9536eeb921 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#7  0x00007f9536eeba9d in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#8  0x00007f9536f9260c in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#9  0x00007f9536eeb9f3 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#10 0x00007f9536eeba9d in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#11 0x00007f9536eebbcb in PMPI_Barrier () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#12 0x00007f9537ded120 in write_checkpoint(char const*, bool, _IO_FILE*, t_commrec const*, int*, int, int, int, bool, int, long, double, t_state*, ObservablesHistory*, gmx::MdModulesNotifier const&, gmx::WriteCheckpointDataHolder*, bool, int) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#13 0x00007f9537dedba3 in mdoutf_write_checkpoint(gmx_mdoutf*, _IO_FILE*, t_commrec const*, long, double, t_state*, ObservablesHistory*, gmx::WriteCheckpointDataHolder*) ()
   from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#14 0x00007f9537dede54 in mdoutf_write_to_trajectory_files(_IO_FILE*, t_commrec const*, gmx_mdoutf*, int, int, long, double, t_state*, t_state*, ObservablesHistory*, gmx::ArrayRef<gmx::BasicVector<float> const>, gmx::WriteCheckpointDataHolder*) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#15 0x00007f9537e1acff in do_md_trajectory_writing(_IO_FILE*, t_commrec*, int, t_filenm const*, long, long, double, t_inputrec*, t_state*, t_state*, ObservablesHistory*, gmx_mtop_t const*, t_forcerec*, gmx_mdoutf*, gmx::EnergyOutput const&, gmx_ekindata_t*, gmx::ArrayRef<gmx::BasicVector<float> const>, bool, bool, bool, bool, bool) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#16 0x00007f9537f17a7d in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#17 0x00007f9537f1558d in gmx::LegacySimulator::run() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#18 0x00007f9537f4f73c in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#19 0x000055bf81b9231c in gmx::gmx_mdrun(int, gmx_hw_info_t const&, int, char**) ()
#20 0x000055bf81b92417 in gmx::gmx_mdrun(int, char**) ()
#21 0x00007f95378acde2 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#22 0x000055bf81b9088c in main ()

#0  0x00007f74b07fb046 in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#1  0x00007f74b06e035b in ?? () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#2  0x00007f74b06d374e in PMPI_Recv () from /usr/lib/x86_64-linux-gnu/libmpich.so.12
#3  0x00007f74b16bde44 in exchange_rvecs(gmx_multisim_t const*, int, float (*) [3], int) [clone .isra.4] [clone .part.5] [clone .constprop.63] () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#4  0x00007f74b16bf227 in exchange_state(gmx_multisim_t const*, int, t_state*) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#5  0x00007f74b169cbd9 in gmx::LegacySimulator::do_md() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#6  0x00007f74b169758d in gmx::LegacySimulator::run() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#7  0x00007f74b16d173c in gmx::Mdrunner::mdrunner() () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#8  0x000056080e98831c in gmx::gmx_mdrun(int, gmx_hw_info_t const&, int, char**) ()
#9  0x000056080e988417 in gmx::gmx_mdrun(int, char**) ()
#10 0x00007f74b102ede2 in gmx::CommandLineModuleManager::run(int, char**) () from /usr/local/gromacs/lib/libgromacs_mpi.so.6
#11 0x000056080e98688c in main ()

These processes are hanging in PMPI_Barrier and PMPI_Recv respectively, which looks like a deadlock. The second stack is related to the PLUMED patch: the '-hrex' option triggers the code path that calls exchange_state().

GiovanniBussi (Member) commented:

Thanks! This is the same as #742

The stack-trace report is very useful. I will try to have a look at this in the next couple of weeks.

GiovanniBussi self-assigned this Jun 7, 2022
shazj99 (Author) commented Jun 7, 2022

Hi @GiovanniBussi ,

After diving into the code, I think I found the root cause. It can be triggered as follows: suppose there are 4 replicas, and replicas #1 and #2 perform a replica exchange at step X. They then go on to write a checkpoint, which calls PMPI_Barrier in write_checkpoint() to wait for all other replicas to write at the same step. But since replicas #0 and #3 did not exchange in this round, their subsequent check of checkpointHandler->decideIfCheckpointingThisStep returns false, so they run ahead to the next exchange step (X+replex), where they block on PMPI_Recv inside exchange_state(). Each group is now waiting for the other, and neither can make progress.
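
To illustrate the pattern, here is a minimal standalone MPI sketch (not GROMACS code): ranks 1 and 2 stand in for the replicas that exchanged and enter a barrier, as write_checkpoint() does; ranks 0 and 3 stand in for the replicas that skipped the checkpoint and block in a point-to-point receive, as exchange_state() does. Run with mpirun -np 4, it hangs by construction, mirroring the two stack traces above:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 1 || rank == 2)
    {
        // "Exchanged" replicas: checkpoint writing ends in a collective barrier.
        std::printf("rank %d entering barrier (write_checkpoint)\n", rank);
        MPI_Barrier(MPI_COMM_WORLD); // never completes: ranks 0 and 3 don't arrive
    }
    else
    {
        // "Non-exchanged" replicas: run ahead to the next exchange step and
        // block in a receive from a partner that is stuck in the barrier.
        int buf = 0;
        std::printf("rank %d posting recv (exchange_state)\n", rank);
        MPI_Recv(&buf, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE); // never completes: no matching send exists
    }

    MPI_Finalize();
    return 0;
}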

I think it can be fixed by passing the previous step's bDoReplEx value into decideIfCheckpointingThisStep: if some replicas decide to write a checkpoint because they exchanged, the others should be made to do so as well. I paste the changes below (with a small model of the effect sketched after the diff); if you agree with me, I'll send a new PR for it.

Thanks.

--- md.cpp.orig	2022-06-08 00:22:11.286821932 +0800
+++ md.cpp	2022-06-08 00:28:41.643920122 +0800
@@ -177,7 +177,7 @@
     gmx_repl_ex_t     repl_ex = nullptr;
     gmx_global_stat_t gstat;
     gmx_shellfc_t*    shellfc;
-    gmx_bool          bSumEkinhOld, bDoReplEx, bExchanged, bNeedRepartition;
+    gmx_bool          bSumEkinhOld, bDoReplEx, bDoReplExPrev, bExchanged, bNeedRepartition;
     gmx_bool          bTemp, bPres, bTrotter;
     real              dvdl_constr;
     std::vector<RVec> cbuf;
@@ -693,6 +693,7 @@
     bSumEkinhOld     = FALSE;
     bExchanged       = FALSE;
     bNeedRepartition = FALSE;
+    bDoReplEx        = FALSE;

     step     = ir->init_step;
     step_rel = 0;
@@ -760,6 +761,7 @@
                            && (!bFirstStep));
         }

+        bDoReplExPrev = bDoReplEx;
         bDoReplEx = (useReplicaExchange && (step > 0) && !bLastStep
                      && do_per_step(step, replExParams.exchangeInterval));

@@ -873,7 +875,7 @@
         }
         clear_mat(force_vir);

-        checkpointHandler->decideIfCheckpointingThisStep(bNS, bFirstStep, bLastStep);
+        checkpointHandler->decideIfCheckpointingThisStep(bNS||bDoReplExPrev, bFirstStep, bLastStep);

         /* Determine the energy and pressure:
          * at nstcalcenergy steps and at energy output steps (set below).
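
For intuition, here is a toy model of why OR-ing in the previous step's bDoReplEx aligns the decision (an assumed simplification, not the real CheckpointHandler logic; the real decision also involves the -cpt timer). The assumption modelled: on the step right after an exchange, bNS is true only on replicas that actually exchanged, because the exchange forces a neighbour search, while bDoReplExPrev is true on every replica, because whether a step is an exchange step is decided globally by do_per_step():

#include <cstdio>

// Hypothetical stand-in for the "checkpointing allowed this step" test
// inside the real handler.
static bool decideIfCheckpointing(bool checkpointAllowedThisStep)
{
    return checkpointAllowedThisStep;
}

int main()
{
    const bool bDoReplExPrev = true; // previous step was an exchange step (same on all replicas)

    for (bool bExchanged : { true, false })
    {
        const bool bNS = bExchanged; // only exchanged replicas are forced into a neighbour search

        std::printf("%s replica: vanilla=%d patched=%d\n",
                    bExchanged ? "exchanged    " : "non-exchanged",
                    decideIfCheckpointing(bNS),
                    decideIfCheckpointing(bNS || bDoReplExPrev));
    }
    // vanilla: 1 and 0 -> replicas disagree, some wait in the barrier forever
    // patched: 1 and 1 -> all replicas reach the barrier together
    return 0;
}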

GiovanniBussi (Member) commented:

Closed by #831
