Some runs with NUOPC driver with multiple threads can hang #1331
I'm temporarily changing the processor layout for SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline from 480x3 to 480x1 because it is sometimes hanging with a 480x3 layout. Once we have resolved the threading-related hangs with nuopc documented in ESCOMP#1331, we can change this back to use threading.
For now, I'm changing |
Three attempts, no hangs: |
@jedwards4b - Interesting, thanks for trying this. It failed for me twice using the externals here https://github.com/billsacks/ctsm/blob/6c3abc680cce8328569a3d9efddfeee801231def/Externals.cfg even though I'm pretty sure it passed at least once for me with essentially the same set of externals a few days ago. I wonder if your externals are different, or if it's just chance whether it passes or fails. |
Update externals and fixes for nuopc threading (1) Some fixes for threading with the nuopc/cmeps driver. (However, threading with nuopc/cmeps still doesn't work completely: see #1331.) (2) Updates externals to versions needed for these nuopc threading changes
Has anybody tried running 480x3 without the OMP_WAIT_POLICY=passive setting? I think that would still be an interesting test. |
I just tried this from CTSM master. Unfortunately, I still get a hang in what appears to be the same (not-totally-determined) location. I did this with this change to cime:

diff --git a/config/cesm/machines/config_machines.xml b/config/cesm/machines/config_machines.xml
index 3809a2c9f..17dc94f22 100644
--- a/config/cesm/machines/config_machines.xml
+++ b/config/cesm/machines/config_machines.xml
@@ -634,7 +634,6 @@ This allows using a different mpirun command to launch unit tests
<env name="UGCSINPUTPATH">/glade/work/turuncu/FV3GFS/benchmark-inputs/2012010100/gfs/fcst</env>
<env name="UGCSFIXEDFILEPATH">/glade/work/turuncu/FV3GFS/fix_am</env>
<env name="UGCSADDONPATH">/glade/work/turuncu/FV3GFS/addon</env>
- <env name="OMP_WAIT_POLICY">PASSIVE</env>
<env name="MPI_DSM_VERBOSE">true</env>
</environment_variables>
<environment_variables unit_testing="true"> |
Here is how I find the hang location: in the DDT GUI, attach to cesm, then look at the stack; in my experience it will point to a write statement in CTSM. |
One more permutation to try: |
@jedwards4b thanks a lot for that pointer. To be honest, I had completely forgotten how to use DDT, so thanks for that jump-start. And yes, it points to a write statement in CTSM (https://github.com/escomp/CTSM/blob/master/src/main/lnd2glcMod.F90#L228). (As a side-note, I'm just realizing that there's probably an underlying issue that we should resolve that is leading to the generation of a bunch of these WARNING messages in the first time step; I'll open a separate issue for that.) @theurich unfortunately, I still get a hang in the same place with that additional permutation:

diff --git a/config/cesm/machines/config_machines.xml b/config/cesm/machines/config_machines.xml
index 3809a2c9f..db87985c2 100644
--- a/config/cesm/machines/config_machines.xml
+++ b/config/cesm/machines/config_machines.xml
@@ -634,7 +634,8 @@ This allows using a different mpirun command to launch unit tests
<env name="UGCSINPUTPATH">/glade/work/turuncu/FV3GFS/benchmark-inputs/2012010100/gfs/fcst</env>
<env name="UGCSFIXEDFILEPATH">/glade/work/turuncu/FV3GFS/fix_am</env>
<env name="UGCSADDONPATH">/glade/work/turuncu/FV3GFS/addon</env>
- <env name="OMP_WAIT_POLICY">PASSIVE</env>
+ <env name="OMP_WAIT_POLICY">active</env>
+ <env name="KMP_BLOCKTIME">infinite</env>
<env name="MPI_DSM_VERBOSE">true</env>
</environment_variables>
 <environment_variables unit_testing="true">

Incidentally, I also ran a couple more times with the out-of-the-box settings, and this time they ran successfully, suggesting that (not surprisingly) this failure is somewhat sporadic / random, presumably based on the exact timing of when different threads hit the write statements. |
@billsacks thanks for running the extra tests. It definitely looks like a race condition. You say that for the new permutation of settings you tested, it hung "in the same place". Do you mean that when it hangs, it always hangs in the same write statement? If so, it would be interesting to know what that write statement looks like. E.g., does it write out maybe parts of a shared array, or does it construct an internal unit first, and then write that out? Maybe coming from that end, i.e. the actual write statement that triggers the issue, we might be able to come up with some more ideas of what might be the underlying issue we are hitting. |
Yes, I mean that, when it hangs, it seems to hang in the same write statement. I've only run it through a debugger to confirm once, but I'm pretty sure of this based on what's present in the log files at the time of the hang. This isn't the only problematic write statement: @jedwards4b encountered some others that we worked around, and this is just the next one causing problems. The currently problematic write statement is this one:
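(Quoting the statement from the lnd2glcMod.F90 diff later in this thread, the qice warning write:)

write(iulog,*) 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)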
However, one of the lines that caused problems for @jedwards4b was this one (before it was wrapped in an OMP MASTER), which doesn't print out any values. |
I've been down this rabbit hole and offered a branch of ctsm with critical clauses surrounding most of the write statements in ctsm. There is nothing special I can see about any of the write statements except that they are in threaded regions. |
@jedwards4b and @billsacks I suggest we attempt a test run where we try to eliminate as many of the ESMF threading-related aspects as possible. The changes I have in mind are these:

1. Use an ESMF that is completely built without Pthread dependency:
/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.0.5/esmf-fit2-ncdfio-mpt-g.lua
/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.0.5/esmf-fit2-ncdfio-mpt-O.lua
2. Comment out the call to ESMF_InitializePreMPI().
3. Replace the call to MPI_Init_thread() with a regular MPI_Init().
4. Comment out any of the OMP_WAIT_POLICY and KMP_BLOCKTIME environment variables.
|
So then you also want me to turn off esmf_aware_threading? With esmf_aware_threading enabled, this leads to a setvm error.
|
That is correct. This is only a test with esmf_aware_threading off, to see if any of the listed aspects might be the underlying cause of the problem. This is not a long-term solution, because I think we really want esmf_aware_threading support, but if we can figure out what causes the issue, maybe we can find a way to resolve it. |
One other thing I just thought of is the fact that the hanging was observed with Intel+MPT. I am not aware of such hanging under UFS with Intel+IntelMPI (not sure they ever run with MPT). But I do know they run FV3 threaded. However, I am not 100% sure that they have write statements inside threaded regions - but I would almost assume they do. So, another interesting test would be to look into Intel+IntelMPI on Cheyenne and see if CTSM writes would hang for that combo ever. That is a separate test from the other set of changes I suggested above. |
Thanks a lot for your continued thoughts, @theurich and @jedwards4b . I wanted to share a few more thoughts and findings. The important finding is near the bottom: the problem only seems to appear for writes to the lnd.log file; writes to the cesm.log file (unit 6) seem fine. (@theurich - in CESM, we open separate files for each component's master proc; so, for example, the CTSM master proc writes to a lnd.log file; all other procs from all components write to the cesm.log file by writing to unit 6, I think via stdout being redirected to the cesm.log file. In other words, it seems that output to stdout is fine, but output to other files is affected.)

1. I realized that my testing probably isn't incorporating some of @theurich 's suggestions from the last week. I haven't followed all of the email discussions closely, but my sense is that @jedwards4b has tried a variety of suggestions, none of which seem to completely solve the problem. It may be, though, that the specific issue I'm hitting here is resolved by one of those suggestions.

2. I tried changing the offending write statement to use a format string rather than list-directed i/o, but this did not solve the problem:
diff --git a/src/main/lnd2glcMod.F90 b/src/main/lnd2glcMod.F90
index f48b3ef8b..dce65272d 100644
--- a/src/main/lnd2glcMod.F90
+++ b/src/main/lnd2glcMod.F90
@@ -225,7 +225,7 @@ contains
! Check for bad values of qice
if ( abs(this%qice_grc(g,n)) > 1.0_r8) then
- write(iulog,*) 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)
+ write(iulog,'(a,i0,i0,f14.7)') 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)
end if
end if
3. I noticed that the problem seems isolated to the master proc and/or output to the lnd.log file (in CESM, the master proc outputs to the lnd.log file); output from other procs to the cesm.log file (which is done in CESM via redirecting stdout to the cesm log file) does not appear to be impacted. I saw this by noticing that, at the point of the hang, there is always just a single write statement's output in lnd.log, but the full complement of output from this write statement in cesm.log (i.e., the same amount of output as we get in a run that gets past the hang, including multiple outputs per proc).

4. I then tried sending all CTSM output to the cesm log file (i.e., stdout) by making this change:

diff --git a/src/cpl/nuopc/lnd_comp_nuopc.F90 b/src/cpl/nuopc/lnd_comp_nuopc.F90
index 19c774829..c79f45670 100644
--- a/src/cpl/nuopc/lnd_comp_nuopc.F90
+++ b/src/cpl/nuopc/lnd_comp_nuopc.F90
@@ -211,7 +211,7 @@ contains
! reset shr logging to my log file
!----------------------------------------------------------------------------
- call set_component_logging(gcomp, localPet==0, iulog, shrlogunit, rc)
+ call set_component_logging(gcomp, .false., iulog, shrlogunit, rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
 ! Note still need compid for those parts of the code that use the data model

I got 5 passes in a row (whereas, without this change, I was getting failures about 50% of the time). So it seems like the issue only arises during writes to the lnd.log file: writes to stdout do not appear to be impacted. Maybe this gives some hints at the possible problem? (Not to me, but maybe to one of you?) But it also hints at a possible solution. I'll put my ideas of a possible solution in a separate comment, since this comment is already getting so long. |
Given my above finding that the issue only appears to impact writes to the lnd.log file, I'm wondering about the following solution, if @theurich 's ideas don't work out and/or it starts to feel too time-consuming to address the root cause. @jedwards4b and @mvertens I'm wondering what you think about this solution: change the logging logic so that only the master thread on the master proc outputs to the lnd.log file; all other threads (and procs) output to stdout/cesm.log. I think this could be done by changing the logic wherever we currently reference the relevant logging variable.

I think this would solve the problem for writes within the CTSM code. It would lead to some inconsistencies with any write statements done in share code, since those use the proc-level iulog (so all output from the master proc would go to the lnd.log file, for writes called from share code called from CTSM). So I think that leaves open the possibility that we'd get model hangs for writes done in share code. This would probably mainly occur as the model is about to abort, in the write statements called just before the abort. A few ways to deal with this are:
|
Bill, that is very interesting. It definitely gives me something to try within the ESMF regression testing. I think we always just output directly to stdout from within a threaded region in the tests (unless we go through ESMF_Log, which we know is not thread-safe, and therefore must use OMP CRITICAL for that reason).

A quick question, just to check my understanding. You are saying that under CESM, when a component is entered, the master proc's iulog is changed to output to the component-specific file, so output from all threads under the master proc goes there. However, for all other procs, output from all of their threads continues to go to the iulog which is unit 6, and is redirected to cesm.log. But doesn't that mean that, e.g., CTSM output ends up in both places, some going to lnd.log and some to cesm.log? Is that actually desirable? It just makes me wonder if I got that detail right.

-Gerhard
|
@theurich yes, currently output goes either to lnd.log or cesm.log: all threads on the root MPI task go to lnd.log, and everything else goes to cesm.log. If I understand @billsacks correctly, the only difference with his suggestion is that on the root MPI task only the master thread goes to lnd.log, and other threads go to cesm.log. Personally I think this is fine. And I love it as a straightforward solution. |
@ekluzek yep, I was just checking whether I understood the current situation correctly. I had not realized that output from different procs goes to different files, but I certainly have no issue with that. I also agree that the proposed solution seems straightforward; it just pushes more output over to cesm.log vs. lnd.log, but it is in a sense consistent with the split that is already happening. |
@ekluzek yes, you are understanding my suggestion right – thanks for the clear restatement of it. @theurich yes, you are also understanding correctly that output from different tasks goes to different log files. Whether this is desirable is up for debate... it's the way CESM has done things for at least the 10 years I've been here, and probably much longer, so I don't know the rationale that led to this design. Good question about a possible race condition on iulog. I think this is just set in ctsm's nuopc cap, outside of any threaded region, but let me double check that after lunch and get back to you on that. |
@theurich As far as I can see, the only place where the module-level variable iulog is set is here: CTSM/src/cpl/nuopc/lnd_comp_nuopc.F90, line 214 (at 30ca184):
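(That line, as quoted in the diff earlier in this thread, is:)

   call set_component_logging(gcomp, localPet==0, iulog, shrlogunit, rc)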
Here, in case you're interested, is the routine that it calls. After that point, I believe iulog is only read, not written. |
Inspired by @billsacks 's analysis of the hanging write problem, I tried several things. The conclusion for me now is that we are dealing with an Intel compiler bug for versions < intel/19.1.1.

First thing I did was to add write statements inside of OpenMP threaded blocks in components of some of the NUOPC prototype codes (used as examples and for regression testing). Everything worked fine on my local GNU setup, as expected. However, when I took it over to Cheyenne and built with Intel+MPT, I experienced exactly the same hang when writing to a specific unit other than 6. For 6 (stdout), everything ran fine, but as soon as I used a different unit, to write directly to a file, it would hang if the code was writing from more than a single thread. I could work around the issue, just as you and Jim have, by adding the !$omp critical block around the write statement.

At this point I was still going under the assumption that somehow ESMF was part of the problem. To make sure it actually was, I just wanted to quickly verify that for a simple Fortran program things would actually work correctly. So I put a simple code together (see below). It turned out that when I built that code with the default intel/19.0.5 on Cheyenne, it also hung in the write!!! Unless I either let it go to stdout, or used omp critical! No ESMF, not even MPT involved, and it would hang every single time!!! I then tried different versions: intel/17.0.1, intel/18.0.5, and intel/19.0.2 all showed the same issue!

My conclusion now is that this is an Intel compiler bug affecting versions before intel/19.1.1. The final test of the above theory will be for me to rebuild an ESMF installation with intel/19.1.1, and then for you to re-run the regression tests that showed the 50% rate of hanging. I will kick that off still tonight and it should be ready in the morning.

Plain Fortran reproducer:
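(Reconstructed from the variant quoted in the next comment, with the open(newunit=...) line active; a sketch, not necessarily the exact code originally posted. With the affected Intel versions it hangs unless output goes to stdout or the write is wrapped in omp critical. A typical build/run: ifort -qopenmp omp_write.f90 && OMP_NUM_THREADS=3 ./a.out)

program omp_write
!$ use omp_lib
  implicit none
  integer :: i, tid, funit

  ! Open a unit other than stdout; with newunit this is where the hang appears.
  open(newunit=funit, file="test.out")

  !$omp parallel private(tid)
  tid = -1   ! obvious value if building without OpenMP
  !$ tid = omp_get_thread_num()
  !$omp do
  do i = 1, 100
     ! Wrapping this write in !$omp critical avoids the hang.
     write(funit,*) "test write, tid=", tid, " i=", i
  enddo
  !$omp end parallel
  close(funit)
end program omp_write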
|
@theurich Thanks so much for your analysis of the issue. I really appreciate all of your time helping with this – especially since it's looking like this isn't an ESMF issue after all!

I was still puzzled by why we are hitting this with esmf but not mct, so I looked more closely at what might differ in setting up the output log unit. I noticed that the esmf version in CESM uses newunit to open its log file.

After reproducing the hang with your very helpful little reproducer, I made a small change: using a hard-coded unit number rather than newunit:

program omp_write
!$ use omp_lib
implicit none
integer :: i,tid, funit
funit = 17
! open(newunit=funit, file="test.out")
open(funit, file="test.out")
!$omp parallel private(tid)
tid = -1 ! initialize to obvious value if building without OpenMP
!$ tid = omp_get_thread_num()
!$omp do
do i=1, 100
!!$omp critical
write(funit,*) "test write, tid=", tid, " i=", i
!!$omp end critical
enddo
!$omp end parallel
close(funit)
end program omp_write

This worked, implicating the newunit specifier. I am now rerunning with these diffs in our nuopc code:

diff --git a/nuopc_cap_share/nuopc_shr_methods.F90 b/nuopc_cap_share/nuopc_shr_methods.F90
index 8d3283a..d4ffff5 100644
--- a/nuopc_cap_share/nuopc_shr_methods.F90
+++ b/nuopc_cap_share/nuopc_shr_methods.F90
@@ -22,7 +22,7 @@ module nuopc_shr_methods
use NUOPC_Model , only : NUOPC_ModelGet
use shr_kind_mod , only : r8 => shr_kind_r8, cl=>shr_kind_cl, cs=>shr_kind_cs
use shr_sys_mod , only : shr_sys_abort
- use shr_file_mod , only : shr_file_setlogunit, shr_file_getLogUnit
+ use shr_file_mod , only : shr_file_setlogunit, shr_file_getLogUnit, shr_file_getUnit
implicit none
private
@@ -154,7 +154,8 @@ contains
call NUOPC_CompAttributeGet(gcomp, name="logfile", value=logfile, rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return
- open(newunit=logunit,file=trim(diro)//"/"//trim(logfile))
+ logunit = shr_file_getUnit()
+ open(logunit,file=trim(diro)//"/"//trim(logfile))
else
logUnit = 6
endif

I have gotten three passes in a row. I'm going to run a couple more times to be sure, but I think that works around the compiler issue.
Thanks again for your help with this! |
@billsacks this is awesome! It is very satisfying to have this mystery resolved, with a clear understanding as to why the MCT version of the code consistently worked, even with the older Intel versions, while the NUOPC version ran into these frequent hangs. So nice to have it traced down to the one specific difference of using newunit. Not that it really matters all that much anymore, but the ESMF 8.1.0 installation built with intel/19.1.1 is now available.
|
@theurich @billsacks - thank you both for working on this and finding a resolution to a problem that has plagued us for months. @theurich - I want to particularly thank you for helping with this given that it ended up not being an ESMF issue. |
@mvertens - no problem at all. I slept a lot better last night knowing this was likely not anything that had to do with ESMF. Great work from @billsacks and @jedwards4b pushing this forward, too! @mvertens - one thing I am now wondering: is there any chance that any other components, under their NUOPC caps, might be doing similar things with the use of newunit? |
@theurich yes they all use the same mechanism. |
Thanks for the new builds, @theurich . From separate discussions with @jedwards4b this morning, our tentative plan is to update to intel 19.1.1, so your new builds could be helpful for that. To clarify, the code in question (with newunit) […]

I'll leave this issue open here until we update to a newer version of CIME that points to a more recent version of the intel compiler. |
We are using intel 19.1.1 on cheyenne now, and the problem no longer appears, so I'm closing this issue.
Brief summary of bug
Some runs with NUOPC driver with multiple threads can hang. From what @jedwards4b has found, this happens during write statements in threaded regions.
General bug information
CTSM version you are using: master
Does this bug cause significantly incorrect results in the model's science? No
Configurations affected: Runs using NUOPC driver with multiple OpenMP threads
Details of bug
This test sometimes hangs in testing from the branch that I'm about to bring to master (will be ctsm5.1.dev031): SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline. Note that there are a couple of fixes for this issue in this upcoming tag: one write statement is wrapped in OMP MASTER, and a bunch of unnecessary write statements in subgrid initialization have been removed. Those fixes apparently make the issue appear less frequently, but don't completely solve it.

We do have a few other nuopc tests with threading that are passing, but those all use 36x2 or 72x2 layouts – so many fewer tasks and 2 rather than 3 threads.
@jedwards4b and @theurich are working to identify an underlying cause of this problem. If they can't, then we may need to put OMP CRITICAL blocks around our write statements (either directly or via a write_log subroutine that does this).
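A rough sketch of what such a write_log wrapper could look like (hypothetical module and routine names, not existing CTSM code; it assumes a module-level log unit iulog):

module write_log_mod
  implicit none
  public :: write_log
  integer, public :: iulog = 6   ! log unit; a cap would reset this to the component log file

contains

  subroutine write_log(msg)
    ! Serialize writes to the shared log unit so that concurrent threads
    ! cannot interleave (or, with the affected compilers, hang) in the write.
    character(len=*), intent(in) :: msg
    !$omp critical (write_log_unit)
    write(iulog,'(a)') trim(msg)
    !$omp end critical (write_log_unit)
  end subroutine write_log

end module write_log_mod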