Error writing checkpoints at high core counts #548

Closed
lizziel opened this issue Sep 21, 2020 · 23 comments
Labels: 🪲 Bug Something isn't working

Comments

@lizziel (Contributor) commented Sep 21, 2020

I've been running GCHPctm with MAPL 2.2.7 for various grid resolutions and core counts on the Harvard Cannon cluster. I am encountering an error while writing checkpoint files when running with high core counts, in my case 1440 cores. The error is in UCX, so it is not MAPL-specific, but it only occurs when writing the MAPL checkpoint files:

srun: error: holy2a01303: task 0: Killed
[holy2a19302:73246:0:73246]    ud_iface.c:746  Fatal: transport error: Endpoint timeout
==== backtrace ====
[holy2a19304:31734:0:31734]    ud_iface.c:746  Fatal: transport error: Endpoint timeout
==== backtrace ====
    0  /usr/lib64/libucs.so.0(ucs_fatal_error_message+0x68) [0x2b64849e1318]
    1  /usr/lib64/libucs.so.0(+0x17495) [0x2b64849e1495]
    2  /usr/lib64/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x121) [0x2b648b0267c1]
    3  /usr/lib64/ucx/libuct_ib.so.0(+0x1d902) [0x2b648b02a902]
    4  /usr/lib64/libucp.so.0(ucp_worker_progress+0x5a) [0x2b64843679ea]
    5  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi.so.40(mca_pml_ucx_send+0x107) [0x2b6481f48727]
    6  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi.so.40(MPI_Gatherv+0xf0) [0x2b6481e354c0]
    7  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi_mpifh.so.40(pmpi_gatherv__+0xad) [0x2b648196212d]
    8  /n/holyscratch01/jacob_lab/elundgren/testruns/GCHPctm/13.0.0-alpha.10/scalability/gfortran93/gchp_standard_c180_1440core/./geos() [0x13d93e8]
   etc
===================
Program received signal SIGABRT: Process abort signal.

My libraries are as follows (plus UCX 1.6.0):

  1) git/2.17.0-fasrc01      7) zlib/1.2.11-fasrc02
  2) gmp/6.1.2-fasrc01       8) szip/2.1.1-fasrc01
  3) mpfr/3.1.5-fasrc01      9) hdf5/1.10.6-fasrc03
  4) mpc/1.0.3-fasrc06      10) netcdf/4.7.3-fasrc03
  5) gcc/9.3.0-fasrc01      11) netcdf-fortran/4.5.2-fasrc04
  6) openmpi/4.0.2-fasrc01  12) cmake/3.16.1-fasrc01

My run is at c180 with NX=16 and NY=90. I am using 24 cores per node across 60 nodes, reserving the full 128 GB of memory on each node. I originally encountered this error at the start of the run because I had periodic checkpoints configured (RECORD_* in GCHP.rc), which causes a checkpoint to be written at the beginning of the run. After turning that off, the run got to the end and successfully wrote History files, but then hit the same error while writing the end-of-run checkpoint file.
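For reference, the corresponding decomposition lines in my GCHP.rc look roughly like this (a sketch; the comments are mine, and the constraint as I understand it is that NX * NY must equal the total core count, with NY divisible by 6 for the six cubed-sphere faces):

  NX: 16    # cores along X on each face
  NY: 90    # cores along Y across the six faces (divisible by 6)
  # 16 * 90 = 1440 cores, matching 24 cores/node * 60 nodes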

@LiamBindle also encountered this problem on a separate compute cluster with c360 using 1200 cores.

Have you seen this before?

@lizziel added the 🪲 Bug Something isn't working label on Sep 21, 2020
@LiamBindle (Contributor) commented Sep 21, 2020

I received a similar error running a C360 sim on 1200 cores. The error message I got was

 ExtData Run_: Calculating derived fields
 ExtData Run_: End
 Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint.20160701_0000z.nc4
[compute1-exec-78:49   :0:49]    ud_iface.c:747  Fatal: transport error: Endpoint timeout
==== backtrace ====
    0  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_message+0x60) [0x7fb40badbaa0]
    1  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_format+0xde) [0x7fb40badbc0e]
    2  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x4d355) [0x7fb4035d2355]
    3  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x10b) [0x7fb4035d246b]
    4  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x5bb90) [0x7fb4035e0b90]
    5  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fb40bf3cdba]
    6  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_progress+0x17) [0x7fb40da757d7]
    7  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libopen-pal.so.40(opal_progress+0x2b) [0x7fb40a51a3ab]
    8  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_send+0x275) [0x7fb40da77645]
    9  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(PMPI_Gatherv+0x190) [0x7fb40d95f830]
   10  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi_mpifh.so.40(MPI_Gatherv_f08+0xab) [0x7fb40dfb1d9b]
   11  /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x136e57e]
   12  /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x1380cec]
    .
    .
    .

My libraries are

bash-4.2$ spack find --loaded
==> 14 installed packages
-- linux-centos7-skylake_avx512 / gcc@8 -------------------------
esmf@8.0.0  hdf5@1.10.6  hwloc@1.11.11  libnl@3.3.0  libpciaccess@0.13.5  libxml2@2.9.9  lsf@10.1  netcdf-c@4.7.3  netcdf-fortran@4.5.2  numactl@2.0.12  openmpi@3.1.5  rdma-core@20  ucx@1.6.1  zlib@1.2.11

In my run I have NX=10 and NY=120. I used 30 cores per node across 40 nodes with 300 GB of memory per node. Let me know if there's any more information I can provide.

@mathomp4 (Member)

A couple of things. First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?

Second, as a test can you see if adding:

WRITE_RESTART_BY_OSERVER: YES

to AGCM.rc (or your equivalent) does anything? I set it when I run GEOS with Open MPI but that's actually for a performance reason, not a 'things go crash' reason. (Or, conversely, if you run with that set to YES, can you try it with NO.)
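For placement, a sketch of how that would look among the other checkpoint settings (the neighboring key is taken from the log above, just to show where the line goes):

  GCHPchem_INTERNAL_CHECKPOINT_TYPE: pnc4   # existing setting, shown for context only
  WRITE_RESTART_BY_OSERVER: YES             # new line: have the o-server write the restart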

@LiamBindle (Contributor)

For me,

First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?

My only OMPI MCA setting is:

bash-4.2$ env | grep OMPI_
OMPI_MCA_btl_vader_single_copy_mechanism=none

I must admit, I'm not familiar with these settings. Our sysadmin set this and I've used it blindly.
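My understanding (which may be off, given the above) is that OMPI_MCA_<param> environment variables are just Open MPI's way of setting MCA parameters without editing the launch command, so these two spellings should be equivalent (a sketch, assuming mpirun is the launcher):

  export OMPI_MCA_btl_vader_single_copy_mechanism=none       # via the environment
  mpirun --mca btl_vader_single_copy_mechanism none ./geos   # same parameter on the command line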

For the second point, I'll give WRITE_RESTART_BY_OSERVER a try and report back once the job has run!

@lizziel (Contributor, Author) commented Sep 21, 2020

I've got a rerun in the queue with the WRITE_RESTART_BY_OSERVER setting and should have results tomorrow.

I actually use srun rather than mpirun. I have nothing containing OMPI_ in my environment. Should I? My Open MPI build settings are summarized in this file in case it's useful:
ompi_info.txt
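For reference, the two launch styles under discussion look roughly like this in my batch script (a sketch; the core count is from this run and the site-specific srun/mpirun flags are omitted):

  srun -n 1440 ./geos      # what I currently use: Slurm launches the MPI processes
  mpirun -np 1440 ./geos   # the alternative: Open MPI's own launcher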

@mathomp4 (Member)

It looks pretty standard. You built with UCX instead of verbs, which I think is the currently preferred method for InfiniBand. I will note I often have issues with srun on discover, but maybe Cannon is different; I tend to stick with mpirun because I'm old. I suppose you could try that, but in the end I imagine Open MPI does the same thing either way.

The OMPI_MCA_btl_vader_single_copy_mechanism setting is one I've seen before when using Open MPI with containers, and indeed it is set to:

OMPI_MCA_btl_vader_single_copy_mechanism: none

so I can't complain.

If I had one suggestion based on what you've said, it would be to try a newer version of UCX, say one in the 1.8 series or the new 1.9.0. Though maybe that'll just cause different errors...
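One quick sanity check, assuming the UCX install provides its command-line utility, is to confirm which UCX version the stack is actually picking up:

  ucx_info -v   # prints the UCX version and build configuration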

@LiamBindle (Contributor)

I checked this morning and my sim is ~3 days in, so it looks like

WRITE_RESTART_BY_OSERVER: YES

worked! Thanks! What does this switch do?

@lizziel (Contributor, Author) commented Sep 22, 2020

This is now solved for me as well. My first run kept srun but added WRITE_RESTART_BY_OSERVER. It crashed before it even reached the first checkpoint write, but I think that was due to cluster issues, given that a couple of other runs inexplicably failed but are fine today. For my latest run I switched to mpirun, which adds some uncertainty about what exactly the fix was: dropping srun, or using the o-server for the restart write. I'll narrow it down.

@mathomp4 (Member)

I checked this morning and my sim is ~3 days in, so it looks like

WRITE_RESTART_BY_OSERVER: YES

worked! Thanks! What does this switch do?

I'll ping @weiyuan-jiang on this thread so he can be more specific, but when I was trying to run with Open MPI on Discover, I found that it was taking ages to write out restarts. I think I eventually tracked it down to Open MPI having some bad MPI_GatherV (or Gather? I can't remember) timings. Like stupid bad. And guess what calls are used when writing checkpoints/restarts? 😄

So I asked around, and it turns out @weiyuan-jiang added a (somewhat hidden) ability for the IOSERVER to write the restarts instead of the "normal" path. The IOSERVER uses Send/Recv, I think, so it bypasses the badly performing call.

Now, I will say that in our GEOSldas @weiyuan-jiang found some sort of oddity happening with the WRITE_RESTART_BY_OSERVER method. I can't remember what (binary restarts?) but I have never seen any issues in my testing with the GCM.

@lizziel (Contributor, Author) commented Sep 22, 2020

This is great. I'll add the line to the default GCHP.rc file for the GCHP 13.0.0 release, pending a response from @weiyuan-jiang on what the observed oddity was, of course.

@mathomp4 (Member)

@lizziel Note that I only turn this on with Open MPI. I keep our "default" behavior with Intel MPI, etc. because, well, it works so don't rock the boat.

(Well, we do need I_MPI_ADJUST_GATHERV=3 for Intel MPI because the other GatherV algorithms seemed to have issues on our system, etc.)
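For concreteness, that Intel MPI tweak is just an environment variable in the run script (a sketch; whether algorithm 3 is the right pick on other systems is something to test):

  export I_MPI_ADJUST_GATHERV=3   # force Intel MPI's GatherV algorithm 3; the others were slow for us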

@weiyuan-jiang (Contributor)

I am checking with @bena-nasa. Eventually we will eliminate the WRITE_RESTART_BY_OSERVER parameter. For now, without this parameter the program goes down a different branch, which may cause problems.

@mathomp4 (Member)

If you do eliminate it, that would probably mean I have to stop using Open MPI on discover. It is the only way I can write checkpoints due to the crazy slow MPI_GatherV performance.

@lizziel (Contributor, Author) commented Sep 23, 2020

Could you update checkpoint writing to be similar to History writing so it avoids the problem?

@weiyuan-jiang (Contributor)

Even when we use WRITE_RESTART_BY_OSERVER, we still use mpi_gatherV. I am wondering if that is the problem.

@tclune (Collaborator) commented Sep 24, 2020

@weiyuan-jiang But isn't it the case that when we use the OSERVER, the gatherv() is over a much smaller set of processes? For the main application there are many cores and therefore many very small messages. On the server there are far fewer cores and thus fewer, larger messages.

@weiyuan-jiang (Contributor)

The oserver does not have mpi_gatherV. This gatherV happens on the client side, and only in 1D tile space. On the client side, it gathers all the data and then sends it through the oserver. For multi-dimensional data, WRITE_RESTART_BY_OSERVER bypasses the gatherV. @tclune

@weiyuan-jiang (Contributor)

@lizziel Do you have any problems after setting WRITE_RESTART_BY_OSERVER to YES?

@bena-nasa (Collaborator)

It looks like you hit an MPI problem in a gatherV. As Weiyuan said, if you use the write-by-oserver option it bypasses the gatherV and takes a whole different code path to write the checkpoint. So you have sidestepped the problem by not exercising the code that was causing it in the first place.

@lizziel (Contributor, Author) commented Sep 24, 2020

I have not noticed any run issues after setting WRITE_RESTART_BY_OSERVER to yes.

@lizziel (Contributor, Author) commented Sep 30, 2020

I am going to close this issue. Please keep @LiamBindle and me informed if there is a new fix in a future MAPL release, or if this fix is to be retired without a replacement for the problem.

@LiamBindle (Contributor) commented Nov 30, 2020

Last week I tried GCHP at C360 with Intel MPI on Compute1 (the WashU cluster) and saw that the checkpoint file was being written extremely slowly. I saw @mathomp4's comment about I_MPI_ADJUST_GATHERV=3, so I tried it and it fixed my problem. Thanks @mathomp4!

@mathomp4 (Member)

@LiamBindle The other one to watch out for is I_MPI_ADJUST_ALLREDUCE. For some reason (on discover) we had to set that to 12. I think it was a weird allreduce crash inside of ESMF that @bena-nasa and I took a while to track down. Since then, we've always run GEOS with both the GATHERV and ALLREDUCE settings with Intel MPI.
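Put together, the Intel MPI environment we run GEOS with looks roughly like this (a sketch with the values mentioned above; whether other systems need the same algorithm numbers is an open question):

  export I_MPI_ADJUST_GATHERV=3     # work around the slow GatherV during checkpoint/restart writes
  export I_MPI_ADJUST_ALLREDUCE=12  # work around the Allreduce crash we saw inside ESMF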

@LiamBindle (Contributor) commented Dec 1, 2020

Thanks @mathomp4—I'll give that a try too.
