Skip to content

Re-enable REMD with TC-MPI interface#68

Merged
danielhollas merged 1 commit intomasterfrom
remd-terapi
May 28, 2021
Merged

Re-enable REMD with TC-MPI interface#68
danielhollas merged 1 commit intomasterfrom
remd-terapi

Conversation

@danielhollas
Copy link
Copy Markdown
Contributor

This has worked before but was removed in the previous commit. Here I add it back and add a test.

The current approach is a bit hacky, and requires that the number of TC servers is exactly equal to number of replicas ,
as each MPI process is connecting to its own TC server.
This is untenable for larger number of replicas so we might need to improve upon this in the future.
(note that for technical reasons, nteraservers must be 1 in the ABIN input). Though maybe we should just ignore the setting and set it automatically in the code?)

@danielhollas danielhollas self-assigned this May 13, 2021
Comment thread src/init.F90
write (*, *)
if (iqmmm == 3 .or. pot == 'mm') write (*, nml=qmmm)
write (*, *)
if (inose >= 1) then
Copy link
Copy Markdown
Contributor Author

@danielhollas danielhollas May 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated cleanup. In general I am getting rid of one line if statements since they don't play nicely with code coverage (as can be seen by the Codecov warning below).

Comment thread src/init.F90
#ifdef USE_MPI
call MPI_Barrier(MPI_COMM_WORLD, ierr)
write (*, '(A,I0,A,I0)') 'MPI rank: ', my_rank, ' PID: ', GetPID()
call MPI_Barrier(MPI_COMM_WORLD, ierr)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for a nicer output.

@@ -0,0 +1,37 @@

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this file and the reference files from the REMD test.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2021

Codecov Report

Merging #68 (349eca1) into master (add5286) will increase coverage by 0.03%.
The diff coverage is 79.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #68      +/-   ##
==========================================
+ Coverage   68.93%   68.97%   +0.03%     
==========================================
  Files          38       38              
  Lines        5676     5689      +13     
==========================================
+ Hits         3913     3924      +11     
- Misses       1763     1765       +2     
Impacted Files Coverage Δ
src/fftw_interface.F90 68.42% <ø> (ø)
src/remd.F90 81.97% <0.00%> (-1.96%) ⬇️
src/init.F90 66.42% <86.66%> (+0.29%) ⬆️
src/tera_mpi_api.F90 98.30% <100.00%> (+0.64%) ⬆️

@danielhollas danielhollas force-pushed the remd-terapi branch 2 times, most recently from a98d266 to af80eba Compare May 14, 2021 10:26
@danielhollas danielhollas requested a review from suchanj May 14, 2021 10:27
@danielhollas danielhollas marked this pull request as ready for review May 14, 2021 10:29
@danielhollas
Copy link
Copy Markdown
Contributor Author

@suchanj can you take a look? The only (small) functional change is in tera_mpi_api.F90. Then there are just a couple of cosmetic changes in other files and a whole bunch of new test files, but they were mostly copied from existing tests so nothing interesting to review there.

@danielhollas
Copy link
Copy Markdown
Contributor Author

@suchanj I am trying to test this on NEON but with no luck. :-(
I am using the r.terabin script, but when I submit it, it is stuck on the node, and the standard outputs from ABIN and TC say only:

error: executing task of job 332517 failed: execution daemon on host "n28.vscht.cz" didn't accept task

It looks like some bad interaction between MPI and SGE, but this is the first time that I see this. I've tried multiple MPICH versions with no luck. Have you encountered this before?

I was only able to test on n28, n26 is throwing the SGE CUDA starter error, and the other nodes are full.

This has worked before but was removed in the previous
commit. Here I add it back and add a test.

The current approach is a bit hacky and requires
nreplica == number of TC servers.
@suchanj
Copy link
Copy Markdown
Contributor

suchanj commented May 28, 2021

@danielhollas I haven't seen this error before. It is also not very specific... Well, this functionality is not critical so we might wait for new cluster to be installed. I reported the n26 SGE CUDA error. (PS: I also remember having minor doubts about future r.terabin functionality based on previous commits, just don't remember why.)

@danielhollas
Copy link
Copy Markdown
Contributor Author

@suchanj thanks. I'm going to merge this but will need to figure it out. The problem is that I am not able to run normal MD either. It might also be an issue with the fact the TeraChem is compiled with different MPICH version, I am not sure.

@danielhollas danielhollas merged commit b5f4933 into master May 28, 2021
@danielhollas danielhollas deleted the remd-terapi branch May 28, 2021 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants