Some runs with NUOPC driver with multiple threads can hang #1331
I'm temporarily changing the processor layout for SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline from 480x3 to 480x1 because it is sometimes hanging with a 480x3 layout. Once we have resolved the threading-related hangs with nuopc documented in ESCOMP#1331, we can change this back to use threading.
For now, I'm changing |
Three attempts, no hangs: |
@jedwards4b - Interesting, thanks for trying this. It failed for me twice using the externals here https://github.com/billsacks/ctsm/blob/6c3abc680cce8328569a3d9efddfeee801231def/Externals.cfg even though I'm pretty sure it passed at least once for me with essentially the same set of externals a few days ago. I wonder if your externals are different, or if it's just chance whether it passes or fails. |
Update externals and fixes for nuopc threading (1) Some fixes for threading with the nuopc/cmeps driver. (However, threading with nuopc/cmeps still doesn't work completely: see #1331.) (2) Updates externals to versions needed for these nuopc threading changes
Has anybody tried running 480x3 without the OMP_WAIT_POLICY=passive setting? I think that would still be an interesting test. |
I just tried this from CTSM master. Unfortunately, I still get a hang in what appears to be the same (not-totally-determined) location. I did this with this change to cime:

diff --git a/config/cesm/machines/config_machines.xml b/config/cesm/machines/config_machines.xml
index 3809a2c9f..17dc94f22 100644
--- a/config/cesm/machines/config_machines.xml
+++ b/config/cesm/machines/config_machines.xml
@@ -634,7 +634,6 @@ This allows using a different mpirun command to launch unit tests
<env name="UGCSINPUTPATH">/glade/work/turuncu/FV3GFS/benchmark-inputs/2012010100/gfs/fcst</env>
<env name="UGCSFIXEDFILEPATH">/glade/work/turuncu/FV3GFS/fix_am</env>
<env name="UGCSADDONPATH">/glade/work/turuncu/FV3GFS/addon</env>
- <env name="OMP_WAIT_POLICY">PASSIVE</env>
<env name="MPI_DSM_VERBOSE">true</env>
</environment_variables>
<environment_variables unit_testing="true"> |
Here is how I find the hang location: in the DDT GUI, attach to cesm, then look at the stack; in my experience it will point to a write statement in CTSM. |
One more permutation to try: |
@jedwards4b thanks a lot for that pointer. To be honest, I had completely forgotten how to use DDT, so thanks for that jump-start. And yes, it points to a write statement in CTSM (https://github.com/escomp/CTSM/blob/master/src/main/lnd2glcMod.F90#L228). (As a side-note, I'm just realizing that there's probably an underlying issue that we should resolve that is leading to the generation of a bunch of these WARNING messages in the first time step; I'll open a separate issue for that.) @theurich unfortunately, I still get a hang in the same place with that additional permutation:

diff --git a/config/cesm/machines/config_machines.xml b/config/cesm/machines/config_machines.xml
index 3809a2c9f..db87985c2 100644
--- a/config/cesm/machines/config_machines.xml
+++ b/config/cesm/machines/config_machines.xml
@@ -634,7 +634,8 @@ This allows using a different mpirun command to launch unit tests
<env name="UGCSINPUTPATH">/glade/work/turuncu/FV3GFS/benchmark-inputs/2012010100/gfs/fcst</env>
<env name="UGCSFIXEDFILEPATH">/glade/work/turuncu/FV3GFS/fix_am</env>
<env name="UGCSADDONPATH">/glade/work/turuncu/FV3GFS/addon</env>
- <env name="OMP_WAIT_POLICY">PASSIVE</env>
+ <env name="OMP_WAIT_POLICY">active</env>
+ <env name="KMP_BLOCKTIME">infinite</env>
<env name="MPI_DSM_VERBOSE">true</env>
</environment_variables>
 <environment_variables unit_testing="true">

Incidentally, I also ran a couple more times with the out-of-the-box settings, and this time they ran successfully, suggesting that (not surprisingly) this failure is somewhat sporadic / random, presumably based on the exact timing of when different threads hit the write statements. |
@billsacks thanks for running the extra tests. It definitely looks like a race condition. You say that for the new permutation of settings you tested, it hung "in the same place". Do you mean that when it hangs, it always hangs in the same write statement? If so, it would be interesting to know what that write statement looks like. E.g., does it write out maybe parts of a shared array, or does it construct an internal unit first, and then write that out? Maybe coming from that end, i.e. the actual write statement that triggers the issue, we might be able to come up with some more ideas of what might be the underlying issue we are hitting. |
Yes, I mean that, when it hangs, it seems to hang in the same write statement. I've only run it through a debugger to confirm once, but I'm pretty sure of this based on what's present in the log files at the time of the hang. This isn't the only problematic write statement: @jedwards4b encountered some others that we worked around, and this is just the next one causing problems. The currently problematic write statement is this one:
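(Quoting the statement from the lnd2glcMod.F90 diff later in this thread, the qice warning write:)

write(iulog,*) 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)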
However, one of the lines that caused problems for @jedwards4b was this one (before it was wrapped in an OMP MASTER), which doesn't print out any values. |
I've been down this rabbit hole and offered a branch of ctsm with critical clauses surrounding most of the write statements in ctsm. There is nothing special I can see about any of the write statements except that they are in threaded regions. |
@jedwards4b and @billsacks I suggest we attempt a test run where we try to eliminate as many of the ESMF threading-related aspects as possible. The changes I have in mind are these:

1. Use an ESMF that is completely built without Pthread dependency:
/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.0.5/esmf-fit2-ncdfio-mpt-g.lua
/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.0.5/esmf-fit2-ncdfio-mpt-O.lua
2. Comment out the call to ESMF_InitializePreMPI().
3. Replace the call to MPI_Init_thread() with a regular MPI_Init().
4. Comment out any of the OMP_WAIT_POLICY and KMP_BLOCKTIME environment variables.
|
So then you also want me to turn off esmf_aware_threading? With esmf_aware_threading enabled, this leads to a setvm error.
|
That is correct. This is only a test with esmf_aware_threading off, to see if any of the listed aspects might be the underlying cause of the problem. This is not a long-term solution, because I think we really want esmf_aware_threading support, but if we can figure out what causes the issue, maybe we can find a way to resolve it. |
One other thing I just thought of is the fact that the hanging was observed with Intel+MPT. I am not aware of such hanging under UFS with Intel+IntelMPI (not sure they ever run with MPT). But I do know they run FV3 threaded. However, I am not 100% sure that they have write statements inside threaded regions - but I would almost assume they do. So, another interesting test would be to look into Intel+IntelMPI on Cheyenne and see if CTSM writes would hang for that combo ever. That is a separate test from the other set of changes I suggested above. |
Thanks a lot for your continued thoughts, @theurich and @jedwards4b . I wanted to share a few more thoughts and findings. The important finding is near the bottom: the problem only seems to appear for writes to the lnd.log file; writes to the cesm.log file (unit 6) seem fine. (@theurich - in CESM, we open separate files for each component's master proc; so, for example, the CTSM master proc writes to a lnd.log file; all other procs from all components write to the cesm.log file by writing to unit 6, I think via stdout being redirected to the cesm.log file. In other words, it seems that output to stdout is fine, but output to other files is affected.)

1. I realized that my testing probably isn't incorporating some of @theurich 's suggestions from the last week. I haven't followed all of the email discussions closely, but my sense is that @jedwards4b has tried a variety of suggestions, none of which seem to completely solve the problem. It may be, though, that the specific issue I'm hitting here is resolved by one of those suggestions.

2. I tried changing the offending write statement to use a format string rather than list-directed i/o, but this did not solve the problem:
diff --git a/src/main/lnd2glcMod.F90 b/src/main/lnd2glcMod.F90
index f48b3ef8b..dce65272d 100644
--- a/src/main/lnd2glcMod.F90
+++ b/src/main/lnd2glcMod.F90
@@ -225,7 +225,7 @@ contains
! Check for bad values of qice
if ( abs(this%qice_grc(g,n)) > 1.0_r8) then
- write(iulog,*) 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)
+ write(iulog,'(a,i0,i0,f14.7)') 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)
end if
end if
3. I noticed that the problem seems isolated to the master proc and/or output to the lnd.log file (in CESM, the master proc outputs to the lnd.log file); output from other procs to the cesm.log file (which is done in CESM via redirecting stdout to the cesm log file) does not appear to be impacted. I saw this by noticing that, at the point of the hang, there is always just a single write statement's output in lnd.log, but the full complement of output from this write statement in cesm.log (i.e., the same amount of output as we get in a run that gets past the hang, including multiple outputs per proc).

4. I then tried sending all CTSM output to the cesm log file (i.e., stdout) by making this change:

diff --git a/src/cpl/nuopc/lnd_comp_nuopc.F90 b/src/cpl/nuopc/lnd_comp_nuopc.F90
index 19c774829..c79f45670 100644
--- a/src/cpl/nuopc/lnd_comp_nuopc.F90
+++ b/src/cpl/nuopc/lnd_comp_nuopc.F90
@@ -211,7 +211,7 @@ contains
! reset shr logging to my log file
!----------------------------------------------------------------------------
- call set_component_logging(gcomp, localPet==0, iulog, shrlogunit, rc)
+ call set_component_logging(gcomp, .false., iulog, shrlogunit, rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
 ! Note still need compid for those parts of the code that use the data model

I got 5 passes in a row (whereas, without this change, I was getting failures about 50% of the time). So it seems like the issue only arises during writes to the lnd.log file: writes to stdout do not appear to be impacted. Maybe this gives some hints at the possible problem? (Not to me, but maybe to one of you?) But it also hints at a possible solution. I'll put my ideas of a possible solution in a separate comment, since this comment is already getting so long. |
Given my above finding that the issue only appears to impact writes to the lnd.log file, I'm wondering about the following solution, if @theurich 's ideas don't work out and/or it starts to feel too time-consuming to address the root cause. @jedwards4b and @mvertens I'm wondering what you think about this solution: change the logging logic so that only the master thread on the master proc outputs to the lnd.log file; all other threads (and procs) output to stdout/cesm.log. I think this could be done by changing the logic wherever we currently reference the relevant logging variable.

I think this would solve the problem for writes within the CTSM code. It would lead to some inconsistencies with any write statements done in share code, since those use the proc-level iulog (so all output from the master proc would go to the lnd.log file, for writes called from share code called from CTSM). So I think that leaves open the possibility that we'd get model hangs for writes done in share code. This would probably mainly occur as the model is about to abort, in the write statements called just before the abort. A few ways to deal with this are:
|
Bill, that is very interesting. It definitely gives me something to try within the ESMF regression testing. I think we always just output directly to stdout from within a threaded region in the tests (unless we go through ESMF_Log, which we know is not thread-safe, and therefore must use OMP CRITICAL for that reason).

A quick question, just to check my understanding. You are saying that under CESM, when a component is entered, the master proc's iulog is changed to output to the component-specific file, so output from all threads under the master proc goes there. However, for all other procs, output from all of their threads continues to go to the iulog which is unit 6, and is redirected to cesm.log. But doesn't that mean that, e.g., CTSM output ends up in both places, some going to lnd.log and some to cesm.log? Is that actually desirable? It just makes me wonder if I got that detail right.

-Gerhard
|
@theurich yes, currently output goes either to lnd.log or cesm.log: all threads on the root MPI task go to lnd.log, and everything else goes to cesm.log. If I understand @billsacks correctly, the only difference with his suggestion is that on the root MPI task only the master thread goes to lnd.log, and other threads go to cesm.log. Personally I think this is fine. And I love it as a straightforward solution. |
@ekluzek yep, I was just checking whether I understood the current situation correctly. I had not realized that output from different procs goes to different files, but I certainly have no issue with that. I also agree that the proposed solution seems straightforward; it just pushes more output over to cesm.log vs. lnd.log, but it is in a sense consistent with the split that is already happening. |
@ekluzek yes, you are understanding my suggestion right – thanks for the clear restatement of it. @theurich yes, you are also understanding correctly that output from different tasks goes to different log files. Whether this is desirable is up for debate... it's the way CESM has done things for at least the 10 years I've been here, and probably much longer, so I don't know the rationale that led to this design. Good question about a possible race condition on iulog. I think this is just set in ctsm's nuopc cap, outside of any threaded region, but let me double check that after lunch and get back to you on that. |
@theurich As far as I can see, the only place where the module-level variable iulog is set is here: CTSM/src/cpl/nuopc/lnd_comp_nuopc.F90, line 214 (at 30ca184):
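(That line, as quoted in the diff earlier in this thread, is:)

   call set_component_logging(gcomp, localPet==0, iulog, shrlogunit, rc)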
Here, in case you're interested, is the routine that it calls. After that point, I believe iulog is only read, not written. |
Inspired by @billsacks 's analysis of the hanging write problem, I tried several things. The conclusion for me now is that we are dealing with an Intel compiler bug for versions < intel/19.1.1.

First thing I did was to add write statements inside of OpenMP threaded blocks in components of some of the NUOPC prototype codes (used as examples and for regression testing). Everything worked fine on my local GNU setup, as expected. However, when I took it over to Cheyenne and built with Intel+MPT, I experienced exactly the same hang when writing to a specific unit other than 6. For 6 (stdout), everything ran fine, but as soon as I used a different unit, to write directly to a file, it would hang if the code was writing from more than a single thread. I could work around the issue, just as you and Jim have, by adding the !$omp critical block around the write statement.

At this point I was still going under the assumption that somehow ESMF was part of the problem. To make sure it actually was, I just wanted to quickly verify that for a simple Fortran program things would actually work correctly. So I put a simple code together (see below). It turned out that when I built that code with the default intel/19.0.5 on Cheyenne, it also hung in the write!!! Unless I either let it go to stdout, or used omp critical! No ESMF, not even MPT involved, and it would hang every single time!!! I then tried different versions: intel/17.0.1, intel/18.0.5, and intel/19.0.2 all showed the same issue!

My conclusion now is that this is an Intel compiler bug affecting versions before intel/19.1.1. The final test of the above theory will be for me to rebuild an ESMF installation with intel/19.1.1, and then for you to re-run the regression tests that showed the 50% rate of hanging. I will kick that off still tonight and it should be ready in the morning.

Plain Fortran reproducer:
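(Reconstructed from the variant quoted in the next comment, with the open(newunit=...) line active; a sketch, not necessarily the exact code originally posted. With the affected Intel versions it hangs unless output goes to stdout or the write is wrapped in omp critical. A typical build/run: ifort -qopenmp omp_write.f90 && OMP_NUM_THREADS=3 ./a.out)

program omp_write
!$ use omp_lib
  implicit none
  integer :: i, tid, funit

  ! Open a unit other than stdout; with newunit this is where the hang appears.
  open(newunit=funit, file="test.out")

  !$omp parallel private(tid)
  tid = -1   ! obvious value if building without OpenMP
  !$ tid = omp_get_thread_num()
  !$omp do
  do i = 1, 100
     ! Wrapping this write in !$omp critical avoids the hang.
     write(funit,*) "test write, tid=", tid, " i=", i
  enddo
  !$omp end parallel
  close(funit)
end program omp_write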
|
@theurich Thanks so much for your analysis of the issue. I really appreciate all of your time helping with this – especially since it's looking like this isn't an ESMF issue after all!

I was still puzzled by why we are hitting this with esmf but not mct, so I looked more closely at what might differ in setting up the output log unit. I noticed that the esmf version in CESM uses newunit to open its log file.

After reproducing the hang with your very helpful little reproducer, I made a small change: using a hard-coded unit number rather than newunit:

program omp_write
!$ use omp_lib
implicit none
integer :: i,tid, funit
funit = 17
! open(newunit=funit, file="test.out")
open(funit, file="test.out")
!$omp parallel private(tid)
tid = -1 ! initialize to obvious value if building without OpenMP
!$ tid = omp_get_thread_num()
!$omp do
do i=1, 100
!!$omp critical
write(funit,*) "test write, tid=", tid, " i=", i
!!$omp end critical
enddo
!$omp end parallel
close(funit)
end program omp_write

This worked, implicating the newunit specifier. I am now rerunning with these diffs in our nuopc code:

diff --git a/nuopc_cap_share/nuopc_shr_methods.F90 b/nuopc_cap_share/nuopc_shr_methods.F90
index 8d3283a..d4ffff5 100644
--- a/nuopc_cap_share/nuopc_shr_methods.F90
+++ b/nuopc_cap_share/nuopc_shr_methods.F90
@@ -22,7 +22,7 @@ module nuopc_shr_methods
use NUOPC_Model , only : NUOPC_ModelGet
use shr_kind_mod , only : r8 => shr_kind_r8, cl=>shr_kind_cl, cs=>shr_kind_cs
use shr_sys_mod , only : shr_sys_abort
- use shr_file_mod , only : shr_file_setlogunit, shr_file_getLogUnit
+ use shr_file_mod , only : shr_file_setlogunit, shr_file_getLogUnit, shr_file_getUnit
implicit none
private
@@ -154,7 +154,8 @@ contains
call NUOPC_CompAttributeGet(gcomp, name="logfile", value=logfile, rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return
- open(newunit=logunit,file=trim(diro)//"/"//trim(logfile))
+ logunit = shr_file_getUnit()
+ open(logunit,file=trim(diro)//"/"//trim(logfile))
else
logUnit = 6
endif

I have gotten three passes in a row. I'm going to run a couple more times to be sure, but I think that works around the compiler issue.
Thanks again for your help with this! |
@billsacks this is awesome! It is very satisfying to have this mystery resolved, with a clear understanding as to why the MCT version of the code consistently worked, even with the older Intel versions, while the NUOPC version ran into these frequent hangs. So nice to have it traced down to the one specific difference of using newunit. Not that it really matters all that much anymore, but the ESMF 8.1.0 installation built with intel/19.1.1 is now available.
|
@theurich @billsacks - thank you both for working on this and finding a resolution to a problem that has plagued us for months. @theurich - I want to particularly thank you for helping with this given that it ended up not being an ESMF issue. |
@mvertens - no problem at all. I slept a lot better last night knowing this was likely not anything that had to do with ESMF. Great work from @billsacks and @jedwards4b pushing this forward, too! @mvertens - one thing I am now wondering: is there any chance that any other components, under their NUOPC caps, might be doing similar things with the use of newunit? |
@theurich yes they all use the same mechanism. |
Thanks for the new builds, @theurich . From separate discussions with @jedwards4b this morning, our tentative plan is to update to intel 19.1.1, so your new builds could be helpful for that. To clarify, the code in question (with newunit) […]

I'll leave this issue open here until we update to a newer version of CIME that points to a more recent version of the intel compiler. |
We are using intel 19.1.1 on cheyenne now, and the problem no longer appears, so I'm closing this issue.
Brief summary of bug
Some runs with NUOPC driver with multiple threads can hang. From what @jedwards4b has found, this happens during write statements in threaded regions.
General bug information
CTSM version you are using: master
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: Runs using NUOPC driver with multiple OpenMP threads
Details of bug
This test sometimes hangs in testing from the branch that I'm about to bring to master (will be ctsm5.1.dev031): SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline. Note that there are a couple of fixes for this issue in this upcoming tag: one write statement is wrapped in OMP MASTER, and a bunch of unnecessary write statements in subgrid initialization have been removed. Those fixes apparently make the issue appear less frequently, but don't completely solve it.

We do have a few other nuopc tests with threading that are passing, but those all use 36x2 or 72x2 layouts – so many fewer tasks and 2 rather than 3 threads.
@jedwards4b and @theurich are working to identify an underlying cause of this problem. If they can't, then we may need to put OMP CRITICAL blocks around our write statements (either directly or via a write_log subroutine that does this).
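A rough sketch of what such a write_log wrapper could look like (hypothetical module and routine names, not existing CTSM code; it assumes a module-level log unit iulog):

module write_log_mod
  implicit none
  public :: write_log
  integer, public :: iulog = 6   ! log unit; a cap would reset this to the component log file

contains

  subroutine write_log(msg)
    ! Serialize writes to the shared log unit so that concurrent threads
    ! cannot interleave (or, with the affected compilers, hang) in the write.
    character(len=*), intent(in) :: msg
    !$omp critical (write_log_unit)
    write(iulog,'(a)') trim(msg)
    !$omp end critical (write_log_unit)
  end subroutine write_log

end module write_log_mod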