Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test TeraChem MPI interface #62

Merged
merged 74 commits into from
Apr 28, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
74 commits
Select commit Hold shift + click to select a range
8cd4bd9
WIP: Test TC-MPI interface
danielhollas Feb 3, 2021
fb5ee80
Disable TERAPI to fix MPICH cache
danielhollas Feb 3, 2021
bb21306
Debug option for mpich build
danielhollas Feb 3, 2021
f08b367
Revert "Disable TERAPI to fix MPICH cache"
danielhollas Feb 3, 2021
bf9a2e6
Test
danielhollas Feb 3, 2021
dda6b57
Simplify handle_mpi_error
danielhollas Feb 3, 2021
8dff536
Skip TERAPI test for OpenMPI
danielhollas Feb 4, 2021
7fea23a
Separate TCServerMock so it can be reused in multiple tests
danielhollas Feb 4, 2021
7f2a2b0
Refactor TCServerMock to be more granular
danielhollas Feb 4, 2021
1bef2da
Hide/remove unused code from force_tera.F90
danielhollas Feb 4, 2021
24ff1e2
Squash compiler warnings for non-MPI compilation from force_terash
danielhollas Feb 4, 2021
abada1d
Use common FFLAGS accross jobs in GA
danielhollas Feb 4, 2021
3e23422
Print charges and dipole moments in force_tera.F90
danielhollas Feb 6, 2021
cc4181f
Try MPICH 3.4.1
danielhollas Feb 8, 2021
ba84c8d
TCServerMock uses qTIP4PFw for energies and gradients
danielhollas Feb 8, 2021
0eae50d
Test multiple TC servers with PIMD
danielhollas Feb 8, 2021
77bc31e
GA: Use v3.3.2 and and v3.4.1 MPICH
danielhollas Feb 8, 2021
8d31bf0
whoops, test bugfix
danielhollas Feb 8, 2021
ec2e64c
TC-MPI: Test failure modes (WIP)
danielhollas Feb 9, 2021
cd2e821
Minor stuff
danielhollas Feb 9, 2021
1e4fb3a
Print MPI ERROR string
danielhollas Feb 9, 2021
3bfe472
Parallelize connection to multiple TS servers via OpenMP
danielhollas Feb 9, 2021
6b17c6d
Add WITHOUT_MPI test
danielhollas Feb 9, 2021
4e54d75
Do not print dipoles and charges when running multiple TC servers
danielhollas Feb 9, 2021
b9e4f71
Skip TERAPI-PIMD test for now, it is flaky
danielhollas Feb 9, 2021
c02d226
Increase Codecov patch threshold to 2%
danielhollas Feb 9, 2021
5d2d4cc
Minor cleanup
danielhollas Feb 9, 2021
972a441
Remove insignificant whitespace changes
danielhollas Feb 9, 2021
ca2ca41
Reenable TERAPI-PIMD with one TC server
danielhollas Feb 10, 2021
7094adc
Don't kill hydra_nameserver in tests
danielhollas Feb 10, 2021
a35b6af
Update gitignore for TERAPI tests
danielhollas Feb 10, 2021
f4cf20d
Refactor + 1 more test
danielhollas Feb 10, 2021
9afe5c6
none -> _none_
danielhollas Feb 10, 2021
3630dca
Remove MAXTERASERVERS restriction
danielhollas Feb 10, 2021
f416425
Increase waiting time
danielhollas Feb 10, 2021
fe2bd93
Print TC and ABIN outputs when failing
danielhollas Feb 10, 2021
9217e45
gitignore update
danielhollas Feb 10, 2021
58374da
always restart hydra_nameserver
danielhollas Feb 10, 2021
b9ae1c4
Temporarily disable test
danielhollas Feb 10, 2021
45653c3
see what's going on
danielhollas Feb 10, 2021
79f824b
test3
danielhollas Feb 10, 2021
c744bdd
Deallocate TC MPI communicators!
danielhollas Feb 10, 2021
16eac61
Don't call MPI_Abort with _tera_
danielhollas Feb 10, 2021
5f050e7
Test sending non-empty array
danielhollas Feb 10, 2021
8941834
Revert "Test sending non-empty array"
danielhollas Feb 10, 2021
f037f50
Reenable TERAPI-FAILS/test2
danielhollas Feb 10, 2021
eeea8d0
cat
danielhollas Feb 10, 2021
cc4e4f8
Relaunch nameserver in each subtest
danielhollas Feb 10, 2021
7506a6d
Log if server name not provided
danielhollas Feb 10, 2021
abc888b
debug print
danielhollas Feb 10, 2021
1dbcf70
Decrese max time
danielhollas Feb 10, 2021
28db648
Stupido! Do not deallocate so damn early
danielhollas Feb 10, 2021
6ae4624
Minor cleanup
danielhollas Feb 10, 2021
813be6f
Harden MPI Publish + subroutine print_mpi_error
danielhollas Feb 11, 2021
3126dab
New failing test
danielhollas Feb 11, 2021
c3f5f69
Refactor + add faling test
danielhollas Feb 11, 2021
0cec500
test mpi_sleep validation
danielhollas Feb 11, 2021
27c5005
Separate common functions
danielhollas Feb 11, 2021
ff87382
More refactor
danielhollas Feb 12, 2021
b1ed7d9
Debug mpich build to see transient segfaults
danielhollas Feb 12, 2021
c9832c9
Add CMDLINE test
danielhollas Feb 12, 2021
757d652
Validate scrdir names
danielhollas Feb 19, 2021
c2ea7ae
Try addressing flakiness with sleep after launching hydra_nameserver
danielhollas Feb 19, 2021
9981f0b
Check we got all expected data from MPI_Recv
danielhollas Feb 19, 2021
9785862
Test mpich build with ch4:ofi
danielhollas Feb 19, 2021
6e91370
Back to ch3:nemesis
danielhollas Feb 21, 2021
c416415
Prettify!
danielhollas Feb 21, 2021
bdbed14
Do not connect to TC servers in parallel
danielhollas Feb 22, 2021
a116007
Try reenabling parallel TC test
danielhollas Feb 22, 2021
150cffc
One more failing test to test check_recv_count
danielhollas Feb 23, 2021
9eb8965
Remove file interface for TeraChem
danielhollas Feb 23, 2021
df54cdb
just comments
danielhollas Mar 22, 2021
5a4606b
Merge branch 'master' into test-terapi
danielhollas Mar 22, 2021
4131a6a
Rerun prettify
danielhollas Mar 22, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 16 additions & 15 deletions .github/workflows/gfortran.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,9 @@ env:
CURL_OPTS: -S --no-silent
CODECOV_OPTIONS: -Z -X coveragepy -X xcode

# FFLAGS for building ABIN, applicable for most jobs
ABIN_FFLAGS: -O0 -fopenmp -Wall --coverage -ffpe-trap=invalid,zero,overflow,denormal
Copy link
Contributor Author

@danielhollas danielhollas Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only job where we don't use this is the optimized_build


jobs:

basic_build:
Expand All @@ -25,7 +28,6 @@ jobs:
gcc_v: [7, 8, 9]
env:
FC: gfortran
FFLAGS: -O0 -fopenmp -Wall --coverage -ffpe-trap=invalid,zero,overflow,denormal
GCC_V: ${{ matrix.gcc_v}}
CODECOV_NAME: ${{format('{0} GCC-{1}', github.job, matrix.gcc_v)}}

Expand Down Expand Up @@ -56,6 +58,8 @@ jobs:

- name: Build ABIN
run: ./configure --pfunit ${HOME}/pfunit/build/installed/ && make
env:
FFLAGS: ${{ env.ABIN_FFLAGS }}

- name: Run Unit tests
run: make unittest
Expand Down Expand Up @@ -132,7 +136,6 @@ jobs:
matrix:
gcc_v: [7]
env:
FFLAGS: -O0 -fopenmp -Wall --coverage -ffpe-trap=invalid,zero,overflow,denormal
GCC_V: ${{ matrix.gcc_v}}
CODECOV_NAME: ${{format('{0} GCC-{1}', github.job, matrix.gcc_v)}}
steps:
Expand All @@ -156,6 +159,8 @@ jobs:

- name: Build ABIN
run: ./configure --pfunit ${HOME}/pfunit/build/installed/ --fftw && make
env:
FFLAGS: ${{ env.ABIN_FFLAGS }}

- name: Run Unit tests
run: make unittest
Expand All @@ -174,9 +179,8 @@ jobs:
fail-fast: false
matrix:
gcc_v: [7, 8, 9]
mpich_v: ["3.1.3", "3.3.2"]
mpich_v: ["3.3.2", "3.4.1"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped official support for MPICH versions < 3.3, because the hydra_nameserver which we use to connect to TC contained a bug where MPI_Lookup_name() would sometimes return invalid port names. See pmodels/mpich@cb6deb3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw: That fix is a text-book example of a bug in handling strings in C code. strlen() returns length of the string without the terminating null character :-)

env:
ABIN_FFLAGS: -O0 -fopenmp --coverage -ffpe-trap=invalid,zero,overflow,denormal
# To speed-up MPICH build
CFLAGS: -O0
GCC_V: ${{ matrix.gcc_v}}
Expand Down Expand Up @@ -204,10 +208,9 @@ jobs:
run: ./dev_scripts/install_mpich.sh ${HOME}/mpich ${MPICH_V}

- name: build ABIN
run: |
export FFLAGS=${ABIN_FFLAGS} &&\
./configure --mpi ${HOME}/mpich/${MPICH_V}/install &&\
make
run: ./configure --mpi ${HOME}/mpich/${MPICH_V}/install && make
env:
FFLAGS: ${{ env.ABIN_FFLAGS }} -g
- name: test ABIN
run: make test
# We upload to Codecov from the OpenMPI build,
Expand All @@ -227,7 +230,6 @@ jobs:
matrix:
gcc_v: [7]
env:
ABIN_FFLAGS: -O0 --coverage -fopenmp -ffpe-trap=invalid,zero,overflow,denormal
# To speed-up OpenMPI build
CFLAGS: -O0
GCC_V: ${{ matrix.gcc_v}}
Expand Down Expand Up @@ -255,10 +257,9 @@ jobs:
run: ./dev_scripts/install_openmpi.sh ${HOME}/openmpi ${OPENMPI_V} ${OPENMPI_PATCH}

- name: build ABIN
run: |
export FFLAGS=${ABIN_FFLAGS} &&\
./configure --mpi "${HOME}/openmpi/${OPENMPI_V}/install" &&\
make
run: ./configure --mpi "${HOME}/openmpi/${OPENMPI_V}/install" && make
env:
FFLAGS: ${{ env.ABIN_FFLAGS }}
- name: test ABIN
run: make test
- name: Codecov upload
Expand All @@ -278,7 +279,6 @@ jobs:
env:
PLUMED_V: ${{ matrix.plumed_v}}
GCC_V: ${{ matrix.gcc_v}}
ABIN_FFLAGS: -O0 -fopenmp --coverage -ffpe-trap=invalid,zero,overflow,denormal
# Speeding up the Plumed build
CFLAGS: -O0
CXXLAGS: -O0
Expand Down Expand Up @@ -312,10 +312,11 @@ jobs:

- name: build ABIN
run: |
export FFLAGS=${ABIN_FFLAGS} &&\
./configure --plumed "${HOME}/plumed/${PLUMED_V}/install"\
--pfunit ~/pfunit/build/installed/ &&\
make
env:
FFLAGS: ${{ env.ABIN_FFLAGS }}

- name: Run Unit tests
run: make unittest
Expand Down
2 changes: 1 addition & 1 deletion codecov.yml
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ coverage:
patch:
default:
target: auto
threshold: 3.0%
threshold: 3%

parsers:
gcov:
Expand Down
25 changes: 17 additions & 8 deletions dev_scripts/install_mpich.sh
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,10 @@ TAR_FILE="mpich-${MPICH_VERSION}.tar.gz"
DOWNLOAD_URL="https://www.mpich.org/static/downloads/${MPICH_VERSION}/${TAR_FILE}"
INSTALL_DIR="$MPICH_DIR/$MPICH_VERSION/install"

# Github Actions machines have two CPUs, per:
# https://docs.github.com/en/free-pro-team@latest/actions/reference/specifications-for-github-hosted-runners#supported-runners-and-hardware-resources
NCPUS=2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already do this for OpenMPI build.


if [[ -d $MPICH_DIR/$MPICH_VERSION ]];then
echo "Found existing MPICH installation in $MPICH_DIR/$MPICH_VERSION"
echo "Remove this folder if you want to reinstall"
Expand All @@ -30,18 +34,23 @@ curl "$DOWNLOAD_URL" > $MPICH_DIR/$MPICH_VERSION/pkg/${TAR_FILE}
cd $MPICH_DIR/$MPICH_VERSION/src && tar -xzf ../pkg/${TAR_FILE} && cd mpich-${MPICH_VERSION}

# If you're building MPI for general use, not only for ABIN,
# you might want to change the configure options:
# --enable-fortran=yes Compile all versions of Fortran interfaces
# This option is needed for GitHub Actions build, I don't know why
# --with-hydra-pm=pmiserv --with-namepublisher=pmi
# Needed for MPI interface with TeraChem
# you might want change some of the configure options.
# --enable-fortran=all Compile all versions of Fortran interfaces
# In principle we don't need F77, but configure fails in that case.
# --with-namepublisher=pmi
# This compiles hydra_nameserver binary, needed for MPI interface with TeraChem
#
# Use the two rows below for a debug build/
# export CFLAGS='-g -O0'
# --disable-fast --enable-g-option=all \
./configure FC=gfortran CC=gcc \
--enable-fortran=all --disable-cxx \
--with-hydra-pm=pmiserv --with-namepublisher=pmi \
--enable-fortran=all \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted --disable-cxx since we need to mpic++ to build the mock TeraChem server for testing.

I also added the MPICH interface explicitly --with-device=ch3 since v3.4.1 uses a new ch4 default which I'm not sure whether it works. The device is basically an underlying communication protocol that MPICH uses, and can influence performance. But we don't really care about performance that much here, ab initio computation is the bottleneck, MPI communication overhead should be insignificant.

--with-pm=hydra --with-device=ch3:nemesis \
--with-namepublisher=pmi \
--enable-static --disable-shared \
--prefix=${INSTALL_DIR} 2>&1 |\
tee configure.log
make 2>&1 | tee make.log
make -j $NCPUS 2>&1 | tee make.log
make install 2>&1 | tee make_install.log

echo "
Expand Down
110 changes: 0 additions & 110 deletions interfaces/TERA/r.tera

This file was deleted.

2 changes: 1 addition & 1 deletion src/Makefile
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
F_OBJS := modules.o fortran_interfaces.o utils.o io.o random.o arrays.o qmmm.o fftw_interface.o \
shake.o nosehoover.o gle.o transform.o potentials.o estimators.o ekin.o vinit.o plumed.o \
remd.o force_bound.o water.o force_cp2k.o sh_integ.o surfacehop.o landau_zener.o\
force_mm.o force_tera.o force_terash.o force_abin.o en_restraint.o analyze_ext_template.o geom_analysis.o analysis.o \
force_mm.o tera_mpi_api.o force_tera.o force_terash.o force_abin.o en_restraint.o analyze_ext_template.o geom_analysis.o analysis.o \
minimizer.o mdstep.o forces.o read_cmdline.o init.o

C_OBJS := water_interface.o
Expand Down
11 changes: 6 additions & 5 deletions src/abin.F90
Original file line number Diff line number Diff line change
Expand Up @@ -43,16 +43,19 @@ program abin
logical :: file_exists
integer, dimension(8) :: time_start, time_end
real(DP) :: total_cpu_time
#ifdef USE_MPI
integer :: ierr
!$ integer :: nthreads, omp_get_max_threads
#endif

call date_and_time(VALUES=time_start)

! INPUT AND INITIALIZATION SECTION
call init(dt)

! This cannot be in init because of the namelist 'system'
if (my_rank == 0) call clean_temp_files()
if (my_rank == 0) then
call clean_temp_files()
end if

if (irest == 1 .and. (my_rank == 0 .or. iremd == 1)) then
call archive_file('restart.xyz', it)
Expand All @@ -66,9 +69,7 @@ program abin
call lz_rewind(en_array_lz)
end if

!$ nthreads = omp_get_max_threads()
if (my_rank == 0) then
!$ write (*, *) 'Number of OpenMP threads used = ', nthreads
write (*, '(A)') 'Job started at: '//trim(get_formatted_date_and_time(time_start))
write (*, *) ''
end if
Expand Down Expand Up @@ -111,7 +112,7 @@ program abin
if (ipimd == 5) call lz_rewind(en_array_lz)

! if we use reference potential with RESPA
if (pot_ref /= 'none') then
if (pot_ref /= '_none_') then
call force_clas(fxc, fyc, fzc, x, y, z, eclas, pot_ref)
call force_clas(fxc_diff, fyc_diff, fzc_diff, x, y, z, eclas, pot)
end if
Expand Down
2 changes: 1 addition & 1 deletion src/arrays.F90
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ subroutine allocate_arrays(natomalloc, nwalkalloc)
fxc = 0.0D0; fyc = fxc; fzc = fxc
fxq = fxc; fyq = fxc; fzq = fxc
transfxc = fxc; transfyc = fxc; transfzc = fxc
if (pot_ref /= 'none') then
if (pot_ref /= '_none_') then
allocate (fxc_diff(natomalloc, nwalkalloc))
allocate (fyc_diff(natomalloc, nwalkalloc))
allocate (fzc_diff(natomalloc, nwalkalloc))
Expand Down
7 changes: 3 additions & 4 deletions src/force_abin.F90
Original file line number Diff line number Diff line change
Expand Up @@ -121,12 +121,11 @@ subroutine force_abin(x, y, z, fx, fy, fz, eclas, chpot, walkmax)
!-----MAKE THE CALL----------!
ISTATUS = system(chsystem)

! TODO: Verify that the below is true even for GCC >= 7
! Exit status 0 turns to 0
! For some reason, exit status 1 turns to 256
! However, this one we get by default from bash, don't know why...
! see this thread for explanation:
! http://coding.derkeiler.com/Archive/Fortran/comp.lang.fortran/MAXUNITS07-01/msg00085.html
! If the bash script wants to notify ABIN, it can use e.g. exit 2
! However, this one we get by default from BASH, I don't know why.
! If the BASH script wants to notify ABIN, it can use e.g. exit 2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link doesn't work anymore.

if (ISTATUS /= 0 .and. ISTATUS /= 256) then
write (*, '(A)') 'ERROR during the execution of the ab initio external program.'
write (*, '(A)') 'Please inspect the output files in&
Expand Down
Loading