Test TeraChem MPI interface #62

danielhollas · 2021-02-03T20:15:10Z

The primary goal here was to write tests for the TeraChem MPI interface implemented in src/force_tera.F90. There's also a separate interface for Surface Hopping defined in force_terash.F90, but that will be handled in a follow-up PR.

I also ended up doing a small refactor and moved some common functions from force_tera.F90 to tera_mpi_api.F90. The advantage is that they can now be more easily reused in force_terash.F90, and that force_tera.F90 and force_terash.F90 are now independent of each other. I also spent quite some time improving the error handling, so the interface should now be more robust and easier to debug.

The tricky thing here is of course the TC part. We don't want to run real TC as part of our tests, so I essentially ended up extracting the TeraChem side of the API from the TeraChem source code; we compile it and run it as a fake TC server as part of the tests (see key files below). Moreover, the TC server is compiled together with the q-TIP4P/Fw force field, so it returns real forces and energies (of course only for pure water). I reused the same code that is part of ABIN itself and that we use in other tests (pot='mmwater'). Thanks to this trick, I was able to generate the test reference files (*ref) by simply running ABIN with the q-TIP4P/Fw force field, and that way I could verify that the fake TC server works correctly and returns correct energies and forces.

There are several tests:

tests/TERAPI: The most common use case, classical MD.
tests/TERAPI-PIMD: Mostly the same as above but with several PI beads. The only difference in terms of the API is that we pass a different scratch directory name for each bead so that each bead has its own scratch dir. (The fake TC server does not really populate the scratch directory. Instead, for testing purposes, it creates a file with the same name as the scratch directory.)
tests/TERAPI-PIMD-PARALLEL: Same as above, but we connect to several TC servers so that we can run different beads in parallel. This is controlled by the nteraservers variable in the ABIN input.
tests/TERAPI-FAILS: This is in fact a collection of different tests verifying different failure modes and error handling. Thanks to these tests I was able to achieve almost 100% test coverage (see Codecov report).
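
To make the PIMD cases above more concrete, here is a minimal sketch of how beads might be spread over several TC servers, each with its own scratch directory name. The `BeadTask` struct, the `assign_beads` function, the round-robin rule and the `scrdir_beadNNN` naming are all assumptions made for illustration; they are not taken from ABIN or TeraChem.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of the work assigned to one PI bead.
struct BeadTask {
    int bead;            // 0-based bead index
    int server;          // index of the TC server that handles this bead
    std::string scrdir;  // scratch directory name sent for this bead
};

// Assign beads to servers round-robin and give each bead a unique scratch
// directory name. Both the round-robin rule and the naming scheme are
// assumptions of this sketch, not the actual ABIN behaviour.
std::vector<BeadTask> assign_beads(int nbeads, int nteraservers) {
    assert(nbeads > 0 && nteraservers > 0);
    std::vector<BeadTask> tasks;
    tasks.reserve(nbeads);
    for (int ib = 0; ib < nbeads; ++ib) {
        std::string num = std::to_string(ib + 1);
        // Zero-pad to three digits so that the names sort naturally.
        if (num.size() < 3) num = std::string(3 - num.size(), '0') + num;
        tasks.push_back({ib, ib % nteraservers, "scrdir_bead" + num});
    }
    return tasks;
}
```

With four beads and two servers, this rule sends beads 1 and 3 to the first server and beads 2 and 4 to the second, so two beads can be evaluated at the same time.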

There are a lot of new files, but most of them are new test files. In any case, it's probably easier to check out this branch locally instead of browsing it in the GitHub UI.

Key files:

src/force_tera.F90 and src/tera_mpi_api.F90
tests/tc_mpi_api.cpp, tests/tc_mpi_api.h - TeraChem side of the API
tests/TERAPI*/tc_server.cpp, the actual code for the server that uses the API and acts as TeraChem (executes in an MD loop).
Each test is driven by its own test.sh file. Common Bash helper functions were extracted to tests/test_tc_server_utils.sh

TODO

Verify that everything works with actual TeraChem, both on Argon and Neon clusters.
Verify parallelization over multiple TC servers (with real TeraChem)
Figure out how to handle REMD in combination with TeraChem

codecov · 2021-02-03T23:09:26Z

Codecov Report

Merging #62 (4131a6a) into master (c6a7c67) will increase coverage by 5.05%.
The diff coverage is 77.98%.

@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
+ Coverage   63.87%   68.93%   +5.05%     
==========================================
  Files          37       38       +1     
  Lines        5645     5676      +31     
==========================================
+ Hits         3606     3913     +307     
+ Misses       2039     1763     -276

Impacted Files	Coverage Δ
src/force_abin.F90	`64.06% <ø> (ø)`
src/force_terash.F90	`0.00% <0.00%> (ø)`
src/init.F90	`66.13% <88.88%> (+4.41%)`	⬆️
src/force_tera.F90	`95.52% <97.05%> (+95.52%)`	⬆️
src/tera_mpi_api.F90	`97.66% <97.66%> (ø)`
src/abin.F90	`83.06% <100.00%> (-0.14%)`	⬇️
src/arrays.F90	`69.09% <100.00%> (ø)`
src/forces.F90	`83.83% <100.00%> (+1.19%)`	⬆️
src/io.F90	`71.05% <100.00%> (+42.10%)`	⬆️
src/modules.F90	`56.46% <100.00%> (+1.15%)`	⬆️
... and 7 more

danielhollas · 2021-02-05T16:56:07Z

.github/workflows/gfortran.yml

@@ -14,6 +14,9 @@ env:
  CURL_OPTS:  -S --no-silent
  CODECOV_OPTIONS: -Z -X coveragepy -X xcode

+  # FFLAGS for building ABIN, applicable for most jobs
+  ABIN_FFLAGS: -O0 -fopenmp -Wall --coverage -ffpe-trap=invalid,zero,overflow,denormal


The only job where we don't use this is the optimized_build

src/force_tera.F90

Also check the received count for each MPI_Recv call in TCServerMock. MPI fails automatically if it receives more than specified, but it allows receiving less than the buffer capacity, so we need to check that manually. I wonder what else we should be checking.
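
A minimal sketch of such a manual check is shown below. The name `check_recv_count` appears in the commit titles of this PR, but the signature and the error message here are assumptions. In an MPI program the `received` value would typically be obtained with `MPI_Get_count` from the `MPI_Status` filled in by `MPI_Recv`; it is a plain argument here so that the check can be shown without an MPI runtime.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Compare the number of elements actually received with the number expected.
// Throws std::runtime_error on a mismatch. The signature is hypothetical.
void check_recv_count(int received, int expected, const std::string& what) {
    assert(expected >= 0);
    if (received != expected) {
        std::ostringstream msg;
        msg << "Unexpected amount of data for " << what << ": received "
            << received << ", expected " << expected;
        throw std::runtime_error(msg.str());
    }
}
```

Calling `check_recv_count(count, natom * 3, "coordinates")` after each receive would turn a silently short message into an explicit error.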

There's a bug in hydra_nameserver where it crashes when multiple TC servers call MPI_Unpublish_name (pmodels/mpich#5058). Hence, we're passing TC port names to ABIN via files.
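
The file-based handoff can be sketched as follows. The function names and the one-line file layout are assumptions, not the format used by the actual tests. In a real server the port name would come from `MPI_Open_port`, and the client would pass the string it reads to `MPI_Comm_connect`.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>

// Server side: store the port name on a single line of a text file.
void write_port_file(const std::string& path, const std::string& port_name) {
    assert(!port_name.empty());
    std::ofstream out(path);
    if (!out) throw std::runtime_error("Cannot open " + path + " for writing");
    out << port_name << '\n';
}

// Client side: read the port name back, failing loudly if the file is
// missing or empty instead of attempting to connect with a bogus name.
std::string read_port_file(const std::string& path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("Cannot open " + path);
    std::string port_name;
    if (!std::getline(in, port_name) || port_name.empty())
        throw std::runtime_error("No port name found in " + path);
    return port_name;
}
```

With several TC servers, each server would write its own file, so the name server is not needed for the lookup at all.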

I'm getting segfaults from MPI_Comm_connect, maybe a bug in MPICH? On the other hand, we still try to parallelize the TC initialization, sending the number of atoms and atom types at the beginning.

danielhollas · 2021-03-22T03:01:11Z

@suchanj feel free to spend as much or as little time on this. :-) I tried to write up the general gist in the PR description, feel free to ask if anything is not clear.

In the meantime I will have to resolve conflicts with master and do a bit of testing with real TeraChem on NEON.

Conflicts: codecov.yml src/abin.F90 src/arrays.F90 src/force_abin.F90 src/force_tera.F90 src/force_terash.F90 src/forces.F90 src/init.F90 src/io.F90 src/modules.F90 src/read_cmdline.F90 tests/test.sh

suchanj · 2021-03-23T09:41:50Z

What a chonky PR! Description nicely sums it up, a great step forward.

danielhollas · 2021-04-28T17:13:02Z

There's a remaining TODO related to REMD, but I'll fix that in a separate PR.

While the Amber MPI interface that we use for classical MD expects coordinates in angstroms, the FMS interface expects bohrs, which I didn't notice when I was refactoring this code. I really need to write those tests for the TC-SH interface.
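
As an illustration of the unit mismatch, the sketch below converts a flat coordinate array between bohr and angstrom. The function names are hypothetical and the code does not reflect how ABIN stores coordinates internally; the conversion factor is the CODATA 2018 value of the Bohr radius.

```cpp
#include <cassert>
#include <vector>

// 1 bohr expressed in angstroms (CODATA 2018).
constexpr double BOHR_TO_ANGSTROM = 0.529177210903;

// Multiply every coordinate of a flat xyz array by a constant factor.
std::vector<double> scale_coords(const std::vector<double>& xyz, double factor) {
    assert(xyz.size() % 3 == 0);  // expect x, y, z for every atom
    std::vector<double> out;
    out.reserve(xyz.size());
    for (double x : xyz) out.push_back(x * factor);
    return out;
}

// Conversion for an interface that expects angstroms.
std::vector<double> bohr_to_angstrom(const std::vector<double>& xyz_bohr) {
    return scale_coords(xyz_bohr, BOHR_TO_ANGSTROM);
}

// Conversion for an interface that expects bohr.
std::vector<double> angstrom_to_bohr(const std::vector<double>& xyz_ang) {
    return scale_coords(xyz_ang, 1.0 / BOHR_TO_ANGSTROM);
}
```

Keeping each conversion in a named function at the point where data is sent makes it easier to see which unit a given interface receives.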

danielhollas added the testing label Feb 3, 2021

danielhollas self-assigned this Feb 3, 2021

danielhollas force-pushed the test-terapi branch 2 times, most recently from 4712bae to cd0f1af Compare February 3, 2021 20:39

WIP: Test TC-MPI interface

8cd4bd9

danielhollas force-pushed the test-terapi branch from cd0f1af to 8cd4bd9 Compare February 3, 2021 21:16

danielhollas added 2 commits February 3, 2021 23:50

Disable TERAPI to fix MPICH cache

fb5ee80

Debug option for mpich build

bb21306

danielhollas added 7 commits February 4, 2021 00:11

Revert "Disable TERAPI to fix MPICH cache"

f08b367

This reverts commit fb5ee80.

Test

bf9a2e6

Simplify handle_mpi_error

dda6b57

Skip TERAPI test for OpenMPI

8dff536

Separate TCServerMock so it can be reused in multiple tests

7fea23a

Refactor TCServerMock to be more granular

7f2a2b0

Hide/remove unused code from force_tera.F90

1bef2da

danielhollas commented Feb 4, 2021

.github/workflows/gfortran.yml

Squash compiler warnings for non-MPI compilation from force_terash

24ff1e2

danielhollas force-pushed the test-terapi branch 2 times, most recently from 4b0b644 to 9e250ec Compare February 4, 2021 04:58

Use common FFLAGS accross jobs in GA

abada1d

danielhollas force-pushed the test-terapi branch from 9e250ec to abada1d Compare February 4, 2021 05:03

danielhollas commented Feb 5, 2021

src/force_tera.F90

danielhollas added 2 commits February 6, 2021 07:28

Try MPICH 3.4.1

cc4181f

danielhollas force-pushed the test-terapi branch from 96bdb4b to cc4181f Compare February 8, 2021 07:12

TCServerMock uses qTIP4PFw for energies and gradients

ba84c8d

danielhollas force-pushed the test-terapi branch from 2c6a972 to 811bf8c Compare February 8, 2021 18:48

Test multiple TC servers with PIMD

0eae50d

More refactor

ff87382

danielhollas force-pushed the test-terapi branch from 49b2ede to ff87382 Compare February 12, 2021 09:11

danielhollas and others added 9 commits February 12, 2021 10:20

Debug mpich build to see transient segfaults

b1ed7d9

Add CMDLINE test

c9832c9

Validate scrdir names

757d652

Try addressing flakiness with sleep after launching hydra_nameserver

c2ea7ae

Check we got all expected data from MPI_Recv

9981f0b

Test mpich build with ch4:ofi

9785862

Back to ch3:nemesis

6e91370

Prettify!

c416415

Do not connect to TC servers in parallel

bdbed14

danielhollas force-pushed the test-terapi branch from da4a78d to bdbed14 Compare February 22, 2021 17:53

danielhollas and others added 2 commits February 22, 2021 19:05

Try reenabling parallel TC test

a116007

One more failing test to test check_recv_count

150cffc

danielhollas changed the title ~~WIP: Test TC-MPI interface~~ Test TeraChem MPI interface Feb 23, 2021

Remove file interface for TeraChem

9eb8965

danielhollas requested a review from suchanj March 22, 2021 02:22

danielhollas marked this pull request as ready for review March 22, 2021 02:54

just comments

df54cdb

danielhollas and others added 2 commits March 22, 2021 12:50

Merge branch 'master' into test-terapi

5a4606b

Rerun prettify

4131a6a

suchanj approved these changes Mar 23, 2021

danielhollas merged commit c86dc82 into master Apr 28, 2021

danielhollas deleted the test-terapi branch May 30, 2021 18:34

danielhollas mentioned this pull request Feb 4, 2022

Terachem-MPI for Surface Hopping interface broken #79

Closed
