
Don't exceed maximum MPI tag value #2804

Merged: 4 commits into master from jhkennedy/mpi_tags on Oct 15, 2018
Conversation

jhkennedy (Contributor)

The MPI standard specifies that the MPI tag upper bound (MPI_TAG_UB) must
be at least 32767, and many implementations greatly exceed this number
(e.g., cray-mpich/7.4.4 has MPI_TAG_UB == 2^21-1). However, in
multi-instance runs, this tag is quickly exceeded (~30 instances).

This PR adds a check for tags larger than MPI_TAG_UB and reduces
msgtag so that it will take ~300 instances to exceed MPI_TAG_UB on Cray
machines while still protecting against tag collisions out to ~100
instances (see #2397 and #2363).

This should be sufficient for any multi-instance runs not using
multi-driver (where this isn't a problem); larger runs should be
using multi-driver.
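
A minimal sketch of the kind of check described above, assuming use mpi is in scope; names like mpicom and msgtag are illustrative rather than the exact driver code:

   integer (kind=MPI_ADDRESS_KIND) :: max_mpi_tag
   logical :: found
   integer :: ierr

   ! Query the implementation's maximum tag value; the Fortran binding
   ! returns the value itself in max_mpi_tag (see the review discussion below).
   call MPI_Comm_get_attr(mpicom, MPI_TAG_UB, max_mpi_tag, found, ierr)

   if (found .and. msgtag > max_mpi_tag) then
      write(*,*) 'msgtag ', msgtag, ' exceeds MPI_TAG_UB ', max_mpi_tag
      call MPI_Abort(mpicom, 1, ierr)
   end if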

Test suite: scripts_regression_tests.py on edison; MVK SystemTest on Titan (E3SM)
Test baseline: N/A
Test namelist changes: none
Test status: [bit for bit]

Fixes E3SM-Project/E3SM#2487

User interface changes?: N

Update gh-pages html (Y/N)?: N

Code review:

@amametjanov (Member) left a comment

+1 on the upper bound check. My vote is to keep the tags unchanged in single-instance runs and use the old (or this PR's) tag values in multi-instance runs. Tracking down the hang in high-res runs took a while, and reducing tags by a factor of 10 may bring that issue back.

@@ -63,6 +63,8 @@ module component_mod
integer :: mpicom_GLOID, mpicom_CPLID ! GLOID, CPLID mpi communicator
integer :: nthreads_GLOID, nthreads_CPLID
logical :: drv_threading
integer (kind=MPI_ADDRESS_KIND) :: max_mpi_tag
Contributor
Are you sure that this is the correct type for this variable - from the mpich documentation:
The attribute_val in Fortran is a pointer to a Fortran integer, not a pointer to a void *.
Which I think implies that what is in max_mpi_tag is the address of the value and not the value itself?

jhkennedy (Contributor, Author)
No, I'm not 100% sure. From my understanding, though, this is returning the actual value, not an address of the value.

This mpi-forum page explains it and provides some examples, but here is the relevant text:

MPI supports two types of attributes: address-valued (pointer) attributes, and integer valued attributes. C and C++ attribute functions put and get address valued attributes. Fortran attribute functions put and get integer valued attributes. When an integer valued attribute is accessed from C or C++, then MPI_xxx_get_attr will return the address of (a pointer to) the integer valued attribute, which is a pointer to MPI_Aint if the attribute was stored with Fortran MPI_xxx_SET_ATTR, and a pointer to int if it was stored with the deprecated Fortran MPI_ATTR_PUT. When an address valued attribute is accessed from Fortran, then MPI_xxx_GET_ATTR will convert the address into an integer and return the result of this conversion. This conversion is lossless if new style attribute functions are used, and an integer of kind MPI_ADDRESS_KIND is returned. The conversion may cause truncation if deprecated attribute functions are used. In C, the deprecated routines MPI_Attr_put and MPI_Attr_get behave identical to MPI_Comm_set_attr and MPI_Comm_get_attr.

And that same behavior is implied here as well:

https://www.open-mpi.org/doc/v3.0/man3/MPI_Comm_set_attr.3.php
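
As a quick way to confirm this behavior, a small standalone Fortran check (not part of this PR) could query and print the attribute; if the Fortran binding returns the value itself, this prints the implementation's tag limit directly:

   program check_tag_ub
   use mpi
   implicit none
   integer :: ierr
   logical :: flag
   integer (kind=MPI_ADDRESS_KIND) :: tag_ub

   call MPI_Init(ierr)
   ! attribute_val is an integer of kind MPI_ADDRESS_KIND and receives the value
   call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, tag_ub, flag, ierr)
   if (flag) print *, 'MPI_TAG_UB = ', tag_ub   ! e.g. 2**21-1 with cray-mpich/7.4.4
   call MPI_Finalize(ierr)
   end program check_tag_ub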

Contributor
thanks

@jhkennedy (Contributor, Author)

@amametjanov I think you're right, it's probably best to keep the message tags unchanged (Note: this will limit multi-instance to ~20 instances without using multi-driver). I'm in the process of converting some of my multi-instance runs to use multi-driver. Once those pass I'll bump it back up and update the multi-instance docs to discuss this limitation.

@jedwards4b (Contributor)

@jhkennedy is this ready to merge?

@jhkennedy (Contributor, Author)

@jedwards4b I think we need to rethink these tags first.

To recap, #2397 and #2363 caused the MPI tag values in the mct driver to change from:

 mpi_tag = comp(eci)%cplcompid*100+eci*10+N

to

 mpi_tag = comp(eci)%cplcompid*10000+eci*10+N

where N is in the range of [1,9], because there were some message tag collisions causing hangs on single-instance high-res runs.
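
To make the scale of that change concrete, here is a rough illustration with a hypothetical cplcompid of 210 (the actual compid values depend on the component and instance layout):

   ! old scheme:  210*100   + eci*10 + N  ->  ~21,000     (well below 2**21-1)
   ! new scheme:  210*10000 + eci*10 + N  ->  ~2,100,000  (above 2**21-1 = 2,097,151)

On implementations that only provide the guaranteed minimum MPI_TAG_UB of 32767, the new scheme overflows at much smaller compid values still.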

This change causes multi-instance runs to exceed the maximum MPI tag value at ~20 instances on Titan (see E3SM-Project/E3SM#2487) and at ~100 instances on Edison (see ESCOMP/CTSM#515; pinging @ckoven).

Multi-driver should alleviate this issue, but I've run into #2295 when trying to change my multi-instance runs over to multi-driver (seems to be Titan specific? trying at NERSC), and multi-driver isn't useful for ESCOMP/CTSM#515 because of how long it takes to build the DATM namelists in their case.

So the proposed solution was to leave a (MPI implementation dependent) limit on multi-instance runs and encourage multi-driver use, while also picking a middle ground in the MPI tag size (mpi_tag = comp(eci)%cplcompid*1000+eci*10+5; 1000 instead of 100 (old) or 10000 (new)).

But, due to the aforementioned multi-driver issues and @amametjanov's feeling that 1000 still leaves too much potential for tag collision, I don't think this solution is the right one.

So that leaves us with: is there a different and better way to assign the tags?

For example, it seems the comp(eci)%cplcompid part of the tag varies the most, so could we reorder the tag to something like:

 mpi_tag = eci*10000+comp(eci)%cplcompid*10+5

?

However, I'm not sure I understand how the collisions are occurring in single-instance runs -- @amametjanov what are these tags colliding with?

@amametjanov (Member)

@jhkennedy how about this patch to use the old tags for multi-instance runs: https://gist.github.com/amametjanov/790661841af3a49977b1b584fd509668

@ckoven commented Sep 28, 2018

Hi All,
Just wanted to add that the --multi-driver method doesn't seem to be working for my use case described in ESCOMP/CTSM#515; I get DATM errors as described in ESCOMP/CTSM#515 (comment). So I guess I would just like to advocate maintaining/strengthening the ability to do single-instance DATM, multi-instance land simulations with large numbers of land instances if at all possible. Thanks!

@rljacob (Member) commented Oct 10, 2018

What is the status of this PR?

@jhkennedy (Contributor, Author)

@rljacob I'm testing @amametjanov's patch; so far it looks good. I'll know more tomorrow.

Instead of picking a middle ground between tag value magnitudes, use the
old tags for multi-instance runs and the new tags for single-instance
runs. This allows high resolution cases to run, but also preserves
multi-instance functionality for non-high res runs.
@jhkennedy (Contributor, Author)

@jedwards4b this is ready to merge.

@ckoven this should allow you to run your multi-instance runs.

@jedwards4b (Contributor) left a comment
It's not clear to me why we need a separate case for ninst > 1.

src/drivers/mct/main/component_mod.F90 (review comments resolved)
call seq_map_map_exchange(comp(1), flow='c2x', dom_flag=.true., msgtag=comp(1)%cplcompid*10000+1*10+1)

if (size(comp) > 1) then
mpi_tag = comp(eci)%cplcompid*100+eci*10+1
Contributor
Why two cases here? Why not just use line 393 regardless of size(comp)?

jhkennedy (Contributor, Author)
To summarize the thread, that was @amametjanov's suggestion to let multi-instance runs with >20 instances run without exceeding the MPI tag limit, while still preventing tag collisions on single-instance high-res runs. Multi-driver is a possible alternative, but there were challenges setting that up for people in this thread and others I've talked to about this issue. It seems that if multi-instance is going to be an option, it should be able to handle at least a normal ensemble size.

I'd welcome a suggestion on how to assign these tag values entirely differently, but I don't know them, or what they are colliding with, well enough. I suggested reordering the way the tags were written, but that didn't get any traction from anyone in this thread. So this at least fixes the immediate issue.
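
As merged, the tag selection looks roughly like the sketch below (N stands in for the per-message offset in [1,9]; this is the idea rather than the literal diff):

   if (size(comp) > 1) then
      ! multi-instance: old, smaller multiplier keeps tags under MPI_TAG_UB
      mpi_tag = comp(eci)%cplcompid*100 + eci*10 + N
   else
      ! single instance: larger multiplier avoids the high-res tag collisions
      mpi_tag = comp(eci)%cplcompid*10000 + eci*10 + N
   end if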

@jedwards4b jedwards4b self-assigned this Oct 15, 2018
@jedwards4b jedwards4b merged commit 490074f into master Oct 15, 2018
@jhkennedy jhkennedy deleted the jhkennedy/mpi_tags branch October 23, 2018 17:47
jhkennedy pushed a commit that referenced this pull request Oct 25, 2018
Fix the baseline gen to use the updated _get_all_hist_files function
using the archiver.

Also, now that #2804 has fixed the maximum MPI tag problem,
bump the number of instances to 30
jedwards4b added a commit that referenced this pull request Oct 30, 2018
Remove redundant check of mpi-tag values
Because MPI-serial doesn't have the MPI_attr_get function, PR #2804 broke MPI-serial builds as discussed in #2890. This removes a redundant exception that uses MPI_attr_get to determine the max MPI tag value and report it when it is exceeded. MPI will still throw an exception when the maximum is exceeded, but doesn't report the maximum value. I also added an MPI-serial test to the cime_developer test suite.

Test suite: ./scripts_regression_tests on titan, ./create_test cime_developer (with new SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A test) on anvil
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2890

User interface changes?: N

Update gh-pages html (Y/N)?: N/A

Code review: jedwards
jhkennedy added a commit to E3SM-Project/E3SM that referenced this pull request Apr 9, 2019
In ESMCI/cime#2804 and ESMCI/cime#2892, the MPI tag generation for MCT
was adjusted to allow more than ~30 multi-instance runs, fixing
#2487.

This patch is required in order to run Salil's climate reproducibility
tests on maint-1.0, to determine the reproducibility of simulations
on Cori which were originally run on Edison.
singhbalwinder added a commit to E3SM-Project/E3SM that referenced this pull request Apr 23, 2019
Multiple mods for maint-1.0 to run compset F1850C5-CMIP6 and nbfb tests (#2877)

This PR adds the following 4 fixes:

1. Fixes the land initial conditions for CMIP6 compset:
Adds the land/ocean mask for the existing land initial conditions to
only define a land initial condition file for appropriate resolutions.

2. Apply maximum MPI tag value patch:
Context:
@rljacob requested the CMDV-SM team apply their climate reproducibility
 tests to maint-1.0 to determine if the simulations run on Edison are
reproducible on Cori. This patch is required in order to run Salil's
multivariate climate reproducibility tests.

Changes:

In ESMCI/cime#2804 and ESMCI/cime#2892, the MPI tag generation for MCT
was adjusted to allow more than ~30 multi-instance runs, fixing #2487
and ESMCI/cime#2890. This PR applies these changes to maint-1.0.

3. Adds TSC and NPG nbfb tests:
This fix adds python scripts for nbfb tests which were missing in
maint-1.0. Adding these tests will allow us to run them using
maint-1.0.

4. Fixes a bug in PIO1 which was fixed in master via a CIME update:
This bug was fixed in the master branch via a CIME update (see 8352dd4).

[BFB]

* singhbalwinder/atm/mods-for-nbfb-test:
  Fixes a bug in pio1
  Adds tsc and npg tests
  MPI tag fix
  Land init fix for missing mask