
Don't exceed maximum MPI tag value #2804

Merged: 4 commits into master from jhkennedy/mpi_tags on Oct 15, 2018
Conversation

jhkennedy (Contributor)

The MPI standard specifies that the MPI tag upper bound (MPI_TAG_UB) must
be at least 32767, and many implementations greatly exceed this number
(e.g., cray-mpich/7.4.4 has MPI_TAG_UB == 2^21-1). However, in
multi-instance runs, this tag is quickly exceeded (~30 instances).

This PR adds a check for tags larger than MPI_TAG_UB and reduces
msgtag so that it will take ~300 instances to exceed MPI_TAG_UB on Cray
machines while still protecting against tag collisions out to ~100
instances (see #2397 and #2363).

This should be sufficient for any multi-instance runs not using
multi-driver (where this isn't a problem); larger runs should be
using multi-driver.
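
A minimal sketch of the kind of check described above, assuming use mpi is in scope; names like mpicom and msgtag are illustrative rather than the exact driver code:

   integer (kind=MPI_ADDRESS_KIND) :: max_mpi_tag
   logical :: found
   integer :: ierr

   ! Query the implementation's maximum tag value; the Fortran binding
   ! returns the value itself in max_mpi_tag (see the review discussion below).
   call MPI_Comm_get_attr(mpicom, MPI_TAG_UB, max_mpi_tag, found, ierr)

   if (found .and. msgtag > max_mpi_tag) then
      write(*,*) 'msgtag ', msgtag, ' exceeds MPI_TAG_UB ', max_mpi_tag
      call MPI_Abort(mpicom, 1, ierr)
   end if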

Test suite: scripts_regression_tests.py on edison; MVK SystemTest on Titan (E3SM)
Test baseline: N/A
Test namelist changes: none
Test status: [bit for bit]

Fixes E3SM-Project/E3SM#2487

User interface changes?: N

Update gh-pages html (Y/N)?: N

Code review:

@amametjanov (Member) left a comment

+1 on the upper bound check. My vote is to keep the tags unchanged in single-instance runs and use the old (or this PR's) tag values in multi-instance runs. Tracking down the hang in high-res runs took a while, and reducing tags by a factor of 10 may bring that issue back.

@@ -63,6 +63,8 @@ module component_mod
integer :: mpicom_GLOID, mpicom_CPLID ! GLOID, CPLID mpi communicator
integer :: nthreads_GLOID, nthreads_CPLID
logical :: drv_threading
integer (kind=MPI_ADDRESS_KIND) :: max_mpi_tag
Contributor
Are you sure that this is the correct type for this variable - from the mpich documentation:
The attribute_val in Fortran is a pointer to a Fortran integer, not a pointer to a void *.
Which I think implies that what is in max_mpi_tag is the address of the value and not the value itself?

jhkennedy (Contributor, Author)
No, I'm not 100% sure. From my understanding, though, this is returning the actual value, not an address of the value.

This mpi-forum page explains it and provides some examples, but here is the relevant text:

MPI supports two types of attributes: address-valued (pointer) attributes, and integer valued attributes. C and C++ attribute functions put and get address valued attributes. Fortran attribute functions put and get integer valued attributes. When an integer valued attribute is accessed from C or C++, then MPI_xxx_get_attr will return the address of (a pointer to) the integer valued attribute, which is a pointer to MPI_Aint if the attribute was stored with Fortran MPI_xxx_SET_ATTR, and a pointer to int if it was stored with the deprecated Fortran MPI_ATTR_PUT. When an address valued attribute is accessed from Fortran, then MPI_xxx_GET_ATTR will convert the address into an integer and return the result of this conversion. This conversion is lossless if new style attribute functions are used, and an integer of kind MPI_ADDRESS_KIND is returned. The conversion may cause truncation if deprecated attribute functions are used. In C, the deprecated routines MPI_Attr_put and MPI_Attr_get behave identical to MPI_Comm_set_attr and MPI_Comm_get_attr.

And that same behavior is implied here as well:

https://www.open-mpi.org/doc/v3.0/man3/MPI_Comm_set_attr.3.php
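
As a quick way to confirm this behavior, a small standalone Fortran check (not part of this PR) could query and print the attribute; if the Fortran binding returns the value itself, this prints the implementation's tag limit directly:

   program check_tag_ub
   use mpi
   implicit none
   integer :: ierr
   logical :: flag
   integer (kind=MPI_ADDRESS_KIND) :: tag_ub

   call MPI_Init(ierr)
   ! attribute_val is an integer of kind MPI_ADDRESS_KIND and receives the value
   call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, tag_ub, flag, ierr)
   if (flag) print *, 'MPI_TAG_UB = ', tag_ub   ! e.g. 2**21-1 with cray-mpich/7.4.4
   call MPI_Finalize(ierr)
   end program check_tag_ub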

Contributor
thanks

@jhkennedy (Contributor, Author)

@amametjanov I think you're right, it's probably best to keep the message tags unchanged (Note: this will limit multi-instance to ~20 instances without using multi-driver). I'm in the process of converting some of my multi-instance runs to use multi-driver. Once those pass I'll bump it back up and update the multi-instance docs to discuss this limitation.

@jedwards4b (Contributor)

@jhkennedy is this ready to merge?

@jhkennedy (Contributor, Author)

@jedwards4b I think we need to rethink these tags first.

To recap, #2397 and #2363 caused the MPI tag values in the mct driver to change from:

 mpi_tag = comp(eci)%cplcompid*100+eci*10+N

to

 mpi_tag = comp(eci)%cplcompid*10000+eci*10+N

where N is in the range of [1,9], because there were some message tag collisions causing hangs on single-instance high-res runs.
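
To make the scale of that change concrete, here is a rough illustration with a hypothetical cplcompid of 210 (the actual compid values depend on the component and instance layout):

   ! old scheme:  210*100   + eci*10 + N  ->  ~21,000     (well below 2**21-1)
   ! new scheme:  210*10000 + eci*10 + N  ->  ~2,100,000  (above 2**21-1 = 2,097,151)

On implementations that only provide the guaranteed minimum MPI_TAG_UB of 32767, the new scheme overflows at much smaller compid values still.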

This change causes multi-instance runs to exceed the maximum MPI tag value at ~20 instances on Titan (see E3SM-Project/E3SM#2487) and at ~100 instances on Edison (see ESCOMP/CTSM#515; pinging @ckoven).

Multi-driver should alleviate this issue, but I've run into #2295 when trying to change my multi-instance runs over to multi-driver (seems to be Titan specific? trying at NERSC), and multi-driver isn't useful for ESCOMP/CTSM#515 because of how long it takes to build the DATM namelists in their case.

So the proposed solution was to leave a (MPI implementation dependent) limit on multi-instance runs and encourage multi-driver use, while also picking a middle ground in the MPI tag size (mpi_tag = comp(eci)%cplcompid*1000+eci*10+5; 1000 instead of 100 (old) or 10000 (new)).

But, due to the aforementioned multi-driver issues and @amametjanov's feeling that 1000 still leaves too much potential for tag collision, I don't think this solution is the right one.

So that leaves us with: is there a different and better way to assign the tags?

For example, it seems the comp(eci)%cplcompid part of the tag varies the most, so could we reorder the tag to something like:

 mpi_tag = eci*10000+comp(eci)%cplcompid*10+5

?

However, I'm not sure I understand how the collisions are occurring in single-instance runs -- @amametjanov what are these tags colliding with?

@amametjanov (Member)

@jhkennedy how about this patch to use the old tags for multi-instance runs: https://gist.github.com/amametjanov/790661841af3a49977b1b584fd509668

@ckoven commented Sep 28, 2018

Hi All,
Just wanted to add that the --multi-driver method doesn't seem to be working for my use case described in ESCOMP/CTSM#515; I get DATM errors as described in ESCOMP/CTSM#515 (comment). So I guess I would just like to advocate maintaining/strengthening the ability to do single-instance DATM, multi-instance land simulations with large numbers of land instances if at all possible. Thanks!

@rljacob (Member) commented Oct 10, 2018

What is the status of this PR?

@jhkennedy (Contributor, Author)

@rljacob I'm testing @amametjanov's patch; so far it looks good. I'll know more tomorrow.

Instead of picking a middle ground between tag value magnitudes, use the
old tags for multi-instance runs and the new tags for single-instance
runs. This allows high resolution cases to run, but also preserves
multi-instance functionality for non-high res runs.
@jhkennedy (Contributor, Author)

@jedwards4b this is ready to merge.

@ckoven this should allow you to run your multi-instance runs.

@jedwards4b (Contributor) left a comment
It's not clear to me why we need a separate case for ninst > 1.

src/drivers/mct/main/component_mod.F90 (review comments resolved)
call seq_map_map_exchange(comp(1), flow='c2x', dom_flag=.true., msgtag=comp(1)%cplcompid*10000+1*10+1)

if (size(comp) > 1) then
mpi_tag = comp(eci)%cplcompid*100+eci*10+1
Contributor
Why two cases here? Why not just use line 393 regardless of size(comp)?

jhkennedy (Contributor, Author)
To summarize the thread, that was @amametjanov's suggestion to let multi-instance runs with >20 instances run without exceeding the MPI tag limit, while still preventing tag collisions on single-instance high-res runs. Multi-driver is a possible alternative, but there were challenges setting that up for people in this thread and others I've talked to about this issue. It seems that if multi-instance is going to be an option, it should be able to handle at least a normal ensemble size.

I'd welcome a suggestion on how to assign these tag values entirely differently, but I don't know them, or what they are colliding with, well enough. I suggested reordering the way the tags were written, but that didn't get any traction from anyone in this thread. So this at least fixes the immediate issue.
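
As merged, the tag selection looks roughly like the sketch below (N stands in for the per-message offset in [1,9]; this is the idea rather than the literal diff):

   if (size(comp) > 1) then
      ! multi-instance: old, smaller multiplier keeps tags under MPI_TAG_UB
      mpi_tag = comp(eci)%cplcompid*100 + eci*10 + N
   else
      ! single instance: larger multiplier avoids the high-res tag collisions
      mpi_tag = comp(eci)%cplcompid*10000 + eci*10 + N
   end if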

@jedwards4b jedwards4b self-assigned this Oct 15, 2018
@jedwards4b jedwards4b merged commit 490074f into master Oct 15, 2018
@jhkennedy jhkennedy deleted the jhkennedy/mpi_tags branch October 23, 2018 17:47
jhkennedy pushed a commit that referenced this pull request Oct 25, 2018
Fix the baseline gen to use the updated _get_all_hist_files function
using the archiver.

Also, now that #2804 has fixed the maximum MPI tag problem,
bump the number of instances to 30
jedwards4b added a commit that referenced this pull request Oct 30, 2018
Remove redundant check of mpi-tag values
Because MPI-serial doesn't have the MPI_attr_get function, PR #2804 broke MPI-serial builds as discussed in #2890. This removes a redundant exception that uses MPI_attr_get to determine the max MPI tag value and report it when it is exceeded. MPI will still throw an exception when the maximum is exceeded, but doesn't report the maximum value. I also added an MPI-serial test to the cime_developer test suite.

Test suite: ./scripts_regression_tests on titan, ./create_test cime_developer (with new SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A test) on anvil
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2890

User interface changes?: N

Update gh-pages html (Y/N)?: N/A

Code review: jedwards
jhkennedy added a commit to E3SM-Project/E3SM that referenced this pull request Apr 9, 2019
In ESMCI/cime#2804 and ESMCI/cime#2892, the MPI tag generation for MCT
was adjusted to allow more than ~30 multi-instance runs, fixing
#2487.

This patch is required in order to run Salil's climate reproducibility
tests on maint-1.0, to determine the reproducibility of simulations
on Cori which were originally run on Edison.
singhbalwinder added a commit to E3SM-Project/E3SM that referenced this pull request Apr 23, 2019
Multiple mods for maint-1.0 to run compset F1850C5-CMIP6 and nbfb tests (#2877)

This PR adds the following 4 fixes:

1. Fixes the land initial conditions for CMIP6 compset:
Adds the land/ocean mask for the existing land initial conditions to
only define a land initial condition file for appropriate resolutions.

2. Apply maximum MPI tag value patch:
Context:
@rljacob requested the CMDV-SM team apply their climate reproducibility
 tests to maint-1.0 to determine if the simulations run on Edison are
reproducible on Cori. This patch is required in order to run Salil's
multivariate climate reproducibility tests.

Changes:

In ESMCI/cime#2804 and ESMCI/cime#2892, the MPI tag generation for MCT
was adjusted to allow more than ~30 multi-instance runs, fixing #2487
and ESMCI/cime#2890. This PR applies these changes to maint-1.0.

3. Adds TSC and NPG nbfb tests:
This fix adds python scripts for nbfb tests which were missing in
maint-1.0. Adding these tests will allow us to run them using
maint-1.0.

4. Fixes a bug in PIO1 which was fixed in master via a CIME update:
This bug was fixed in the master branch via a CIME update (see 8352dd4).

[BFB]

* singhbalwinder/atm/mods-for-nbfb-test:
  Fixes a bug in pio1
  Adds tsc and npg tests
  MPI tag fix
  Land init fix for missing mask