Don't exceed maximum MPI tag value #2804
Conversation
The MPI standard specifies that the MPI tag upper bound (MPI_TAG_UB) must be at least 32767, and many implementations greatly exceed this number (e.g. cray-mpich/7.4.4 has MPI_TAG_UB == 2^21-1). However, in multi-instance runs, this tag is quickly exceeded (~30 instances). This commit adds a check for tags larger than MPI_TAG_UB and reduces msgtag so that it will take ~300 instances to exceed MPI_TAG_UB on cray machines while still protecting against tag collisions out to ~100 instances (see #2397 and #2363). This should be sufficient for any multi-instance runs not using multi-driver (where this isn't a problem); larger runs should use multi-driver. A rough sketch of the bound check follows below.
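As a hedged illustration of the kind of check described above (the subroutine name and abort behavior here are illustrative, not this commit's exact code):

    ! Sketch of an MPI_TAG_UB bound check; check_msgtag is a hypothetical name.
    subroutine check_msgtag(mpicom, msgtag)
      use mpi
      implicit none
      integer, intent(in) :: mpicom, msgtag
      integer(kind=MPI_ADDRESS_KIND) :: max_mpi_tag
      logical :: found
      integer :: ierr
      ! MPI_TAG_UB is a predefined attribute cached on communicators;
      ! the standard only guarantees a value of at least 32767.
      call MPI_Comm_get_attr(mpicom, MPI_TAG_UB, max_mpi_tag, found, ierr)
      if (found .and. msgtag > max_mpi_tag) then
        write(*,*) 'msgtag ', msgtag, ' exceeds MPI_TAG_UB ', max_mpi_tag
        call MPI_Abort(mpicom, 1, ierr)
      end if
    end subroutine check_msgtag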
+1 on the upper bound check. My vote is to keep the tags unchanged in single-instance runs and use the old (or this PR's) tag values in multi-instance runs. Tracking down the hang in high-res runs took a while, and reducing tags by a factor of 10 may bring that issue back.
@@ -63,6 +63,8 @@ module component_mod
   integer :: mpicom_GLOID, mpicom_CPLID ! GLOID, CPLID mpi communicator
   integer :: nthreads_GLOID, nthreads_CPLID
   logical :: drv_threading
   integer (kind=MPI_ADDRESS_KIND) :: max_mpi_tag
Are you sure that this is the correct type for this variable - from the mpich documentation:
The attribute_val in Fortran is a pointer to a Fortran integer, not a pointer to a void *.
Which I think implies that what is in max_mpi_tag is the address of the value and not the value itself?
No, I'm not 100% sure. From my understanding, though, this returns the actual value, not the address of the value.
This mpi-forum page explains it and provides some examples, but here is the relevant text:
MPI supports two types of attributes: address-valued (pointer) attributes, and integer valued attributes. C and C++ attribute functions put and get address valued attributes. Fortran attribute functions put and get integer valued attributes. When an integer valued attribute is accessed from C or C++, then MPI_xxx_get_attr will return the address of (a pointer to) the integer valued attribute, which is a pointer to MPI_Aint if the attribute was stored with Fortran MPI_xxx_SET_ATTR, and a pointer to int if it was stored with the deprecated Fortran MPI_ATTR_PUT. When an address valued attribute is accessed from Fortran, then MPI_xxx_GET_ATTR will convert the address into an integer and return the result of this conversion. This conversion is lossless if new style attribute functions are used, and an integer of kind MPI_ADDRESS_KIND is returned. The conversion may cause truncation if deprecated attribute functions are used. In C, the deprecated routines MPI_Attr_put and MPI_Attr_get behave identical to MPI_Comm_set_attr and MPI_Comm_get_attr.
And that same behavior is implied here as well:
https://www.open-mpi.org/doc/v3.0/man3/MPI_Comm_set_attr.3.php
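For what it's worth, a minimal sketch of querying the attribute from Fortran, consistent with the forum text quoted above (program name and the output comment are illustrative):

    program tag_ub_demo
      use mpi
      implicit none
      integer(kind=MPI_ADDRESS_KIND) :: tag_ub
      logical :: flag
      integer :: ierr
      call MPI_Init(ierr)
      ! In Fortran the attribute comes back as the integer value itself,
      ! in an integer of kind MPI_ADDRESS_KIND -- not as an address.
      call MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, tag_ub, flag, ierr)
      if (flag) write(*,*) 'MPI_TAG_UB = ', tag_ub  ! e.g. 2097151 with cray-mpich
      call MPI_Finalize(ierr)
    end program tag_ub_demo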
thanks
@amametjanov I think you're right, it's probably best to keep the message tags unchanged (note: this will limit multi-instance to ~20 instances without using multi-driver). I'm in the process of converting some of my multi-instance runs to use multi-driver. Once those pass I'll bump it back up and update the multi-instance docs to discuss this limitation.
@jhkennedy is this ready to merge?
@jedwards4b I think we need to rethink these tags first. To recap, #2397 and #2363 caused the MPI tag values in the mct driver to change from mpi_tag = comp(eci)%cplcompid*100+eci*10+N to mpi_tag = comp(eci)%cplcompid*10000+eci*10+N, where eci is the component-instance index and N is a small per-message offset. This change causes multi-instance runs to exceed the maximum MPI tag value at ~20 instances on titan (see E3SM-Project/E3SM#2487) and at ~100 instances on edison (see ESCOMP/CTSM#515; pinging @ckoven). Multi-driver should alleviate this issue, but I've run into #2295 when trying to change my multi-instance runs over to multi-driver (seems to be titan specific? trying at NERSC), and multi-driver isn't useful for ESCOMP/CTSM#515 because of how long it takes to build the DATM namelists in their case. So the proposed solution was to leave an (MPI implementation dependent) limit on multi-instance runs and encourage multi-driver use, while also picking a middle ground in the MPI tag size (this PR's *1000 multiplier). But, due to the aforementioned multi-driver issues and @amametjanov feeling that 1000 still leaves too much potential for tag collision, I don't think this solution is the right one. So, that leaves us with: is there a different and better way to assign the tags? For example, could the terms be reordered so the instance index carries the large multiplier, e.g. mpi_tag = eci*10000+comp(eci)%cplcompid*10+5? However, I'm not sure I understand how the collisions are occurring in single-instance runs -- @amametjanov what are these tags colliding with?
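To make the instance limits above concrete, a rough back-of-the-envelope illustration in Fortran comment form (this assumes cplcompid grows with each component instance, which is an approximation of the actual id assignment):

    ! old tags: mpi_tag = comp(eci)%cplcompid*100   + eci*10 + N
    ! new tags: mpi_tag = comp(eci)%cplcompid*10000 + eci*10 + N
    !
    ! With cray-mpich's MPI_TAG_UB = 2**21 - 1 = 2097151, the new scheme
    ! overflows once cplcompid*10000 > 2097151, i.e. cplcompid > 209.
    ! With several components per instance, each with its own cplcompid,
    ! that bound is hit at roughly 20-100 instances depending on the
    ! machine's MPI_TAG_UB; the old *100 multiplier raises the bound 100x.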
@jhkennedy how about this patch to use old tags for multi-instance runs: https://gist.github.com/amametjanov/790661841af3a49977b1b584fd509668
Hi All, what is the status of this PR?
@rljacob I'm testing @amametjanov's patch -- so far it looks good; I'll know more tomorrow.
Instead of picking a middle ground between tag value magnitudes, use the old tags for multi-instance runs and the new tags for single-instance runs. This allows high-resolution cases to run, but also preserves multi-instance functionality for non-high-res runs.
@jedwards4b this is ready to merge. @ckoven this should allow you to run your multi-instance runs.
It's not clear to me why we need a separate case for ninst > 1.
call seq_map_map_exchange(comp(1), flow='c2x', dom_flag=.true., msgtag=comp(1)%cplcompid*10000+1*10+1)

if (size(comp) > 1) then
   mpi_tag = comp(eci)%cplcompid*100+eci*10+1
Why two cases here? Why not just use line 393 regardless of size(comp)?
To summarize the thread, that was @amametjanov's suggestion to let multi-instance runs with >20 instances run without exceeding the MPI tag limit, while still preventing tag collisions in single-instance high-res runs. Multi-driver is a possible alternative, but there were challenges setting that up for people in this thread and others I've talked to about this issue. It seems that if multi-instance is going to be an option, it should be able to handle at least a normal ensemble size.
I'd welcome a suggestion on how to assign these tag values entirely differently, but I don't understand them, or what they're colliding with, well enough. I suggested re-ordering the way the tags were written, but that didn't get any traction from anyone in this thread. So this at least fixes the immediate issue; see the sketch below.
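For reference, a simplified sketch of the tag selection this PR settles on (condensed from the diff above; the actual change touches several call sites in the mct driver):

    if (size(comp) > 1) then
       ! multi-instance: old, smaller multiplier keeps tags under MPI_TAG_UB
       mpi_tag = comp(eci)%cplcompid*100 + eci*10 + 1
    else
       ! single instance: larger multiplier avoids the tag collisions
       ! seen in high-resolution runs (#2397, #2363)
       mpi_tag = comp(eci)%cplcompid*10000 + eci*10 + 1
    end if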
Fix the baseline gen to use the updated _get_all_hist_files function via the archiver. Also, now that #2804 has fixed the maximum MPI tag problem, bump the number of instances to 30.
Remove redundant check of MPI tag values

Because MPI-serial doesn't have the MPI_attr_get function, PR #2804 broke MPI-serial builds as discussed in #2890. This removes a redundant check that used MPI_attr_get to determine the max MPI tag value and report when it is exceeded. MPI will still throw an exception when the tag limit is exceeded, but it won't report the maximum value. I also added an MPI-serial test to the cime_developer test suite.

Test suite: ./scripts_regression_tests on titan, ./create_test cime_developer (with new SMS_D_Ln9_Mmpi-serial.f19_g16_rx1.A test) on anvil
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2890
User interface changes?: N
Update gh-pages html (Y/N)?: N/A
Code review: jedwards
In ESMCI/cime#2804 and ESMCI/cime#2892, the MPI tag generation for MCT was adjusted to allow more than ~30 multi-instance runs, fixing #2487. This patch is required in order to run Salil's climate reproducibility tests on maint-1.0, to determine the reproducibility on Cori of simulations which were originally run on Edison.
Multiple mods for maint-1.0 to run compset F1850C5-CMIP6 and nbfb tests (#2877)

This PR adds the following 4 fixes:
1. Fixes the land initial conditions for the CMIP6 compset: adds the land/ocean mask for the existing land initial conditions to only define a land initial condition file for appropriate resolutions.
2. Applies the maximum MPI tag value patch: Context: @rljacob requested the CMDV-SM team apply their climate reproducibility tests to maint-1.0 to determine whether the simulations run on Edison are reproducible on Cori. This patch is required in order to run Salil's multivariate climate reproducibility tests. Changes: in ESMCI/cime#2804 and ESMCI/cime#2892, the MPI tag generation for MCT was adjusted to allow more than ~30 multi-instance runs, fixing #2487 and ESMCI/cime#2890. This PR applies these changes to maint-1.0.
3. Adds TSC and NPG nbfb tests: this fix adds the python scripts for nbfb tests which were missing in maint-1.0. Adding these tests will allow us to run them on maint-1.0.
4. Fixes a bug in PIO1 which was fixed in master via a CIME update (see 8352dd4).
[BFB]

* singhbalwinder/atm/mods-for-nbfb-test:
  Fixes a bug in pio1
  Adds tsc and npg tests
  MPI tag fix
  Land init fix for missing mask
The MPI standard specifies that the MPI tag upper bound (MPI_TAG_UB) must
be at least 32767 and many implementations greatly exceed this number
(e.g. cray-mpich/7.4.4 has MPI_TAG_UB == 2^21-1). However, in
multi-instance runs, this tag is quickly exceeded (~30 instances).
This PR adds a check for tags larger than MPI_TAG_UB and reduces
msgtag so that it will take ~300 instances to exceed MPI_TAG_UB on cray
machines while still protecting against tag collisions out to ~100
instances (see #2397 and #2363).
This should be sufficient for any multi-instance runs not using
multi-driver (where this isn't a problem); larger runs should use
multi-driver.
Test suite: scripts_regression_tests.py on edison; MVK SystemTest on Titan (E3SM)
Test baseline: N/A
Test namelist changes: none
Test status: [bit for bit]
Fixes E3SM-Project/E3SM#2487
User interface changes?: N
Update gh-pages html (Y/N)?: N
Code review: