Error writing checkpoints at high core counts #548

Closed
lizziel opened this issue Sep 21, 2020 · 23 comments
Labels: 🪲 Bug Something isn't working

Comments

@lizziel (Contributor) commented Sep 21, 2020

I've been running GCHPctm with MAPL 2.2.7 for various grid resolutions and core counts on the Harvard Cannon cluster. I am encountering an error while writing checkpoint files when running with high core counts, in my case 1440 cores. The error is in UCX, so it is not MAPL-specific, but it only occurs when writing the MAPL checkpoint files:

srun: error: holy2a01303: task 0: Killed
[holy2a19302:73246:0:73246]    ud_iface.c:746  Fatal: transport error: Endpoint timeout
==== backtrace ====
[holy2a19304:31734:0:31734]    ud_iface.c:746  Fatal: transport error: Endpoint timeout
==== backtrace ====
    0  /usr/lib64/libucs.so.0(ucs_fatal_error_message+0x68) [0x2b64849e1318]
    1  /usr/lib64/libucs.so.0(+0x17495) [0x2b64849e1495]
    2  /usr/lib64/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x121) [0x2b648b0267c1]
    3  /usr/lib64/ucx/libuct_ib.so.0(+0x1d902) [0x2b648b02a902]
    4  /usr/lib64/libucp.so.0(ucp_worker_progress+0x5a) [0x2b64843679ea]
    5  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi.so.40(mca_pml_ucx_send+0x107) [0x2b6481f48727]
    6  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi.so.40(MPI_Gatherv+0xf0) [0x2b6481e354c0]
    7  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi_mpifh.so.40(pmpi_gatherv__+0xad) [0x2b648196212d]
    8  /n/holyscratch01/jacob_lab/elundgren/testruns/GCHPctm/13.0.0-alpha.10/scalability/gfortran93/gchp_standard_c180_1440core/./geos() [0x13d93e8]
   etc
===================
Program received signal SIGABRT: Process abort signal.

My libraries are as follows (plus UCX 1.6.0):

  1) git/2.17.0-fasrc01      7) zlib/1.2.11-fasrc02
  2) gmp/6.1.2-fasrc01       8) szip/2.1.1-fasrc01
  3) mpfr/3.1.5-fasrc01      9) hdf5/1.10.6-fasrc03
  4) mpc/1.0.3-fasrc06      10) netcdf/4.7.3-fasrc03
  5) gcc/9.3.0-fasrc01      11) netcdf-fortran/4.5.2-fasrc04
  6) openmpi/4.0.2-fasrc01  12) cmake/3.16.1-fasrc01

My run is at c180 with NX=16 and NY=90. I am using 24 cores per node across 60 nodes, reserving the full 128 GB of memory on each node. I originally encountered this error at the start of the run because I had periodic checkpoints configured (RECORD_* in GCHP.rc), which causes a checkpoint to be written at the beginning of the run. After turning that off, the run got to the end and successfully wrote History files, but then hit the same error while writing the end-of-run checkpoint file.
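For reference, the corresponding decomposition lines in my GCHP.rc look roughly like this (a sketch; the comments are mine, and the constraint as I understand it is that NX * NY must equal the total core count, with NY divisible by 6 for the six cubed-sphere faces):

  NX: 16    # cores along X on each face
  NY: 90    # cores along Y across the six faces (divisible by 6)
  # 16 * 90 = 1440 cores, matching 24 cores/node * 60 nodes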

@LiamBindle also encountered this problem on a separate compute cluster with c360 using 1200 cores.

Have you seen this before?

@lizziel added the 🪲 Bug Something isn't working label on Sep 21, 2020
@LiamBindle (Contributor) commented Sep 21, 2020

I received a similar error running a C360 sim on 1200 cores. The error message I got was

 ExtData Run_: Calculating derived fields
 ExtData Run_: End
 Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint.20160701_0000z.nc4
[compute1-exec-78:49   :0:49]    ud_iface.c:747  Fatal: transport error: Endpoint timeout
==== backtrace ====
    0  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_message+0x60) [0x7fb40badbaa0]
    1  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_format+0xde) [0x7fb40badbc0e]
    2  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x4d355) [0x7fb4035d2355]
    3  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x10b) [0x7fb4035d246b]
    4  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x5bb90) [0x7fb4035e0b90]
    5  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fb40bf3cdba]
    6  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_progress+0x17) [0x7fb40da757d7]
    7  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libopen-pal.so.40(opal_progress+0x2b) [0x7fb40a51a3ab]
    8  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_send+0x275) [0x7fb40da77645]
    9  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(PMPI_Gatherv+0x190) [0x7fb40d95f830]
   10  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi_mpifh.so.40(MPI_Gatherv_f08+0xab) [0x7fb40dfb1d9b]
   11  /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x136e57e]
   12  /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x1380cec]
    .
    .
    .

My libraries are

bash-4.2$ spack find --loaded
==> 14 installed packages
-- linux-centos7-skylake_avx512 / gcc@8 -------------------------
esmf@8.0.0  hdf5@1.10.6  hwloc@1.11.11  libnl@3.3.0  libpciaccess@0.13.5  libxml2@2.9.9  lsf@10.1  netcdf-c@4.7.3  netcdf-fortran@4.5.2  numactl@2.0.12  openmpi@3.1.5  rdma-core@20  ucx@1.6.1  zlib@1.2.11

In my run I have NX=10 and NY=120. I used 30 cores per node across 40 nodes with 300 GB of memory per node. Let me know if there's any more information I can provide.

@mathomp4 (Member)

A couple of things. First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?

Second, as a test can you see if adding:

WRITE_RESTART_BY_OSERVER: YES

to AGCM.rc (or your equivalent) does anything? I set it when I run GEOS with Open MPI but that's actually for a performance reason, not a 'things go crash' reason. (Or, conversely, if you run with that set to YES, can you try it with NO.)
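For placement, a sketch of how that would look among the other checkpoint settings (the neighboring key is taken from the log above, just to show where the line goes):

  GCHPchem_INTERNAL_CHECKPOINT_TYPE: pnc4   # existing setting, shown for context only
  WRITE_RESTART_BY_OSERVER: YES             # new line: have the o-server write the restart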

@LiamBindle (Contributor)

For me,

First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?

My only OMPI MCA setting is:

bash-4.2$ env | grep OMPI_
OMPI_MCA_btl_vader_single_copy_mechanism=none

I must admit, I'm not familiar with these settings. Our sysadmin set this and I've used it blindly.
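My understanding (which may be off, given the above) is that OMPI_MCA_<param> environment variables are just Open MPI's way of setting MCA parameters without editing the launch command, so these two spellings should be equivalent (a sketch, assuming mpirun is the launcher):

  export OMPI_MCA_btl_vader_single_copy_mechanism=none       # via the environment
  mpirun --mca btl_vader_single_copy_mechanism none ./geos   # same parameter on the command line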

For the second point, I'll give WRITE_RESTART_BY_OSERVER a try and report back once the job has run!

@lizziel (Contributor, Author) commented Sep 21, 2020

I've got a rerun in the queue with the WRITE_RESTART_BY_OSERVER setting and should have results tomorrow.

I actually use srun rather than mpirun. I have nothing containing OMPI_ in my environment. Should I? My Open MPI build settings are summarized in this file in case it's useful:
ompi_info.txt
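For reference, the two launch styles under discussion look roughly like this in my batch script (a sketch; the core count is from this run and the site-specific srun/mpirun flags are omitted):

  srun -n 1440 ./geos      # what I currently use: Slurm launches the MPI processes
  mpirun -np 1440 ./geos   # the alternative: Open MPI's own launcher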

@mathomp4 (Member)

It looks pretty standard. You built with UCX instead of verbs, which I think is the currently preferred method for InfiniBand. I will note I often have issues with srun on discover, but maybe Cannon is different; I tend to stick with mpirun because I'm old. I suppose you could try that, but in the end I imagine Open MPI does the same thing either way.

The OMPI_MCA_btl_vader_single_copy_mechanism setting is one I've seen before when using Open MPI with containers, and indeed it is set to:

OMPI_MCA_btl_vader_single_copy_mechanism: none

so I can't complain.

If I had one suggestion based on what you've said, it would be to try a newer version of UCX, say one in the 1.8 series or the new 1.9.0. Though maybe that'll just cause different errors...
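One quick sanity check, assuming the UCX install provides its command-line utility, is to confirm which UCX version the stack is actually picking up:

  ucx_info -v   # prints the UCX version and build configuration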

@LiamBindle (Contributor)

I checked this morning and my sim is ~3 days in, so it looks like

WRITE_RESTART_BY_OSERVER: YES

worked! Thanks! What does this switch do?

@lizziel (Contributor, Author) commented Sep 22, 2020

This is now solved for me as well. My first run kept srun but added WRITE_RESTART_BY_OSERVER. It crashed before it even reached the first checkpoint write, but I think that was due to cluster issues, given that a couple of other runs inexplicably failed but are fine today. For my latest run I switched to mpirun, which adds some uncertainty about what exactly the fix was: dropping srun, or using the o-server for the restart write. I'll narrow it down.

@mathomp4 (Member)

I checked this morning and my sim is ~3 days in, so it looks like

WRITE_RESTART_BY_OSERVER: YES

worked! Thanks! What does this switch do?

I'll ping @weiyuan-jiang on this thread so he can be more specific, but when I was trying to run with Open MPI on Discover, I found that it was taking ages to write out restarts. I think I eventually tracked it down to Open MPI having some bad MPI_GatherV (or Gather? I can't remember) timings. Like stupid bad. And guess what calls are used when writing checkpoints/restarts? 😄

So I asked around, and it turns out @weiyuan-jiang added a (somewhat hidden) ability for the IOSERVER to write the restarts instead of the "normal" path. The IOSERVER uses Send/Recv, I think, so it bypasses the badly performing call.

Now, I will say that in our GEOSldas @weiyuan-jiang found some sort of oddity happening with the WRITE_RESTART_BY_OSERVER method. I can't remember what (binary restarts?) but I have never seen any issues in my testing with the GCM.

@lizziel (Contributor, Author) commented Sep 22, 2020

This is great. I'll add the line to the default GCHP.rc file for the GCHP 13.0.0 release, pending a response from @weiyuan-jiang on what the observed oddity was, of course.

@mathomp4 (Member)

@lizziel Note that I only turn this on with Open MPI. I keep our "default" behavior with Intel MPI, etc. because, well, it works so don't rock the boat.

(Well, we do need I_MPI_ADJUST_GATHERV=3 for Intel MPI because the other GatherV algorithms seemed to have issues on our system, etc.)
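For concreteness, that Intel MPI tweak is just an environment variable in the run script (a sketch; whether algorithm 3 is the right pick on other systems is something to test):

  export I_MPI_ADJUST_GATHERV=3   # force Intel MPI's GatherV algorithm 3; the others were slow for us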

@weiyuan-jiang (Contributor)

I am checking with @bena-nasa. Eventually we will eliminate the WRITE_RESTART_BY_OSERVER parameter. For now, without this parameter the program goes down a different branch, which may cause problems.

@mathomp4 (Member)

If you do eliminate it, that would probably mean I have to stop using Open MPI on discover. It is the only way I can write checkpoints due to the crazy slow MPI_GatherV performance.

@lizziel (Contributor, Author) commented Sep 23, 2020

Could you update checkpoint writing to be similar to History writing so it avoids the problem?

@weiyuan-jiang (Contributor)

Even when we use WRITE_RESTART_BY_OSERVER, we still use mpi_gatherV. I am wondering if that is the problem.

@tclune (Collaborator) commented Sep 24, 2020

@weiyuan-jiang But isn't it the case that when we use the OSERVER, the gatherv() is over a much smaller set of processes? For the main application there are many cores and therefore many very small messages. On the server there are far fewer cores and thus fewer, larger messages.

@weiyuan-jiang (Contributor)

The oserver does not have mpi_gatherV. This gatherV happens on the client side, and only in 1D tile space. On the client side, it gathers all the data and then sends it through the oserver. For multi-dimensional data, WRITE_RESTART_BY_OSERVER bypasses the gatherV. @tclune

@weiyuan-jiang (Contributor)

@lizziel Do you have any problems after setting WRITE_RESTART_BY_OSERVER to YES?

@bena-nasa (Collaborator)

It looks like you hit an MPI problem in a gatherV. As Weiyuan said, if you use the write-by-oserver option it bypasses the gatherV and takes a whole different code path to write the checkpoint. So you have sidestepped the problem by not exercising the code that was causing it in the first place.

@lizziel (Contributor, Author) commented Sep 24, 2020

I have not noticed any run issues after setting WRITE_RESTART_BY_OSERVER to yes.

@lizziel (Contributor, Author) commented Sep 30, 2020

I am going to close this issue. Please keep @LiamBindle and me informed if there is a new fix in a future MAPL release, or if this fix is to be retired without a replacement for the problem.

@LiamBindle (Contributor) commented Nov 30, 2020

Last week I tried GCHP at C360 with Intel MPI on Compute1 (the WashU cluster) and saw that the checkpoint file was being written extremely slowly. I saw @mathomp4's comment about I_MPI_ADJUST_GATHERV=3, so I tried it and it fixed my problem. Thanks @mathomp4!

@mathomp4 (Member)

@LiamBindle The other one to watch out for is I_MPI_ADJUST_ALLREDUCE. For some reason (on discover) we had to set that to 12. I think it was a weird allreduce crash inside of ESMF that @bena-nasa and I took a while to track down. Since then, we've always run GEOS with both the GATHERV and ALLREDUCE settings with Intel MPI.
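Put together, the Intel MPI environment we run GEOS with looks roughly like this (a sketch with the values mentioned above; whether other systems need the same algorithm numbers is an open question):

  export I_MPI_ADJUST_GATHERV=3     # work around the slow GatherV during checkpoint/restart writes
  export I_MPI_ADJUST_ALLREDUCE=12  # work around the Allreduce crash we saw inside ESMF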

@LiamBindle (Contributor) commented Dec 1, 2020

Thanks @mathomp4—I'll give that a try too.
