
Some runs with NUOPC driver with multiple threads can hang #1331

Closed
billsacks opened this issue Apr 10, 2021 · 30 comments
Labels
type: bug something is working incorrectly

Comments

@billsacks
Member

billsacks commented Apr 10, 2021

Brief summary of bug

Some runs with the NUOPC driver and multiple threads can hang. From what @jedwards4b has found, this happens during write statements in threaded regions.

General bug information

CTSM version you are using: master

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: Runs using the NUOPC driver with multiple OpenMP threads

Details of bug

This test sometimes hangs in testing from the branch that I'm about to bring to master (will be ctsm5.1.dev031): SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline. Note that there are a couple of fixes for this issue in this upcoming tag: one write statement is wrapped in OMP MASTER and a bunch of unnecessary write statements in subgrid initialization have been removed. Those fixes apparently make the issue appear less frequently, but don't completely solve the issue.

We do have a few other nuopc tests with threading that are passing, but those all use 36x2 or 72x2 layouts – so many fewer tasks and 2 rather than 3 threads.

@jedwards4b and @theurich are working to identify an underlying cause of this problem. If they can't, then we may need to put OMP CRITICAL blocks around our write statements (either directly or via a write_log subroutine that does this).
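
For reference, a minimal sketch of the kind of write_log wrapper mentioned above (the module and subroutine names are hypothetical, not existing CTSM code; it assumes iulog comes from clm_varctl):

module write_log_mod
  ! Hypothetical sketch: serialize CTSM log writes issued from threaded regions.
  use clm_varctl, only : iulog
  implicit none
  private
  public :: write_log

contains

  subroutine write_log(msg)
    ! Write a message to the CTSM log unit, letting only one thread at a
    ! time execute the write statement.
    character(len=*), intent(in) :: msg

    !$omp critical (ctsm_log_write)
    write(iulog,'(a)') trim(msg)
    !$omp end critical (ctsm_log_write)
  end subroutine write_log

end module write_log_mod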

@billsacks billsacks added the type: bug something is working incorrectly label Apr 10, 2021
billsacks added a commit to billsacks/ctsm that referenced this issue Apr 10, 2021
I'm temporarily changing the processor layout for
SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline
from 480x3 to 480x1 because it is sometimes hanging with a 480x3 layout.
Once we have resolved the threading-related hangs with nuopc documented
in ESCOMP#1331, we can change this back
to use threading.
@billsacks
Member Author

For now, I'm changing SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline to use a 480x1 layout so that it will pass reliably. Once we resolve this issue, we should change it back – reverting the changes in 6c3abc6.

@jedwards4b
Contributor

Three attempts, no hangs:
/glade/scratch/jedwards/SMS_D_Ln9_P480x3_Vnuopc.f19_g17.IHistClm50Sp.cheyenne_intel.clm-waccmx_offline.20210410_071232_x6tojp

@billsacks
Member Author

@jedwards4b - Interesting, thanks for trying this. It failed for me twice using the externals here https://github.com/billsacks/ctsm/blob/6c3abc680cce8328569a3d9efddfeee801231def/Externals.cfg even though I'm pretty sure it passed at least once for me with essentially the same set of externals a few days ago. I wonder whether your externals are different or whether it's just chance whether a given run passes or fails.

billsacks added a commit that referenced this issue Apr 10, 2021
Update externals and fixes for nuopc threading

(1) Some fixes for threading with the nuopc/cmeps driver. (However,
    threading with nuopc/cmeps still doesn't work completely: see
    #1331.)

(2) Updates externals to versions needed for these nuopc threading changes
@theurich

Has anybody tried running 480x3 without the OMP_WAIT_POLICY=passive setting? I think that would still be an interesting test.

@billsacks
Member Author

billsacks commented Apr 12, 2021

Has anybody tried running 480x3 without the OMP_WAIT_POLICY=passive setting? I think that would still be an interesting test.

I just tried this from CTSM master. Unfortunately, I still get a hang in what appears to be the same (not-totally-determined) location.

I did this with this change to cime:

diff --git a/config/cesm/machines/config_machines.xml b/config/cesm/machines/config_machines.xml
index 3809a2c9f..17dc94f22 100644
--- a/config/cesm/machines/config_machines.xml
+++ b/config/cesm/machines/config_machines.xml
@@ -634,7 +634,6 @@ This allows using a different mpirun command to launch unit tests
       <env name="UGCSINPUTPATH">/glade/work/turuncu/FV3GFS/benchmark-inputs/2012010100/gfs/fcst</env>
       <env name="UGCSFIXEDFILEPATH">/glade/work/turuncu/FV3GFS/fix_am</env>
       <env name="UGCSADDONPATH">/glade/work/turuncu/FV3GFS/addon</env>
-      <env name="OMP_WAIT_POLICY">PASSIVE</env>
       <env name="MPI_DSM_VERBOSE">true</env>
     </environment_variables>
     <environment_variables unit_testing="true">

@jedwards4b
Contributor

Here is how I find the hang location:

qstat -f jobid  # this will give a long output including the list of compute nodes.
ssh -XY computenode
cd $casedir
source .env_mach_specific.sh
module load arm-forge
ddt 

(In the ddt GUI, attach to cesm.) Then look at the stack - in my experience it'll point to a write statement in ctsm.

@theurich

One more permutation to try:
OMP_WAIT_POLICY=active
KMP_BLOCKTIME=infinite

Thanks!

@billsacks
Member Author

@jedwards4b thanks a lot for that pointer. To be honest, I had completely forgotten how to use DDT, so thanks for that jump-start. And yes, it points to a write statement in CTSM (https://github.com/escomp/CTSM/blob/master/src/main/lnd2glcMod.F90#L228). (As a side-note, I'm just realizing that there's probably an underlying issue that we should resolve that is leading to the generation of a bunch of these WARNING messages in the first time step; I'll open a separate issue for that.)

@theurich unfortunately, I still get a hang in the same place with that additional permutation:

diff --git a/config/cesm/machines/config_machines.xml b/config/cesm/machines/config_machines.xml
index 3809a2c9f..db87985c2 100644
--- a/config/cesm/machines/config_machines.xml
+++ b/config/cesm/machines/config_machines.xml
@@ -634,7 +634,8 @@ This allows using a different mpirun command to launch unit tests
       <env name="UGCSINPUTPATH">/glade/work/turuncu/FV3GFS/benchmark-inputs/2012010100/gfs/fcst</env>
       <env name="UGCSFIXEDFILEPATH">/glade/work/turuncu/FV3GFS/fix_am</env>
       <env name="UGCSADDONPATH">/glade/work/turuncu/FV3GFS/addon</env>
-      <env name="OMP_WAIT_POLICY">PASSIVE</env>
+      <env name="OMP_WAIT_POLICY">active</env>
+      <env name="KMP_BLOCKTIME">infinite</env>
       <env name="MPI_DSM_VERBOSE">true</env>
     </environment_variables>
     <environment_variables unit_testing="true">

Incidentally, I also ran a couple more times with the out-of-the-box settings, and this time they ran successfully, suggesting that (not surprisingly) this failure is somewhat sporadic / random, presumably based on the exact timing of when different threads hit the write statements.

@theurich

@billsacks thanks for running the extra tests. It definitely looks like a race condition. You say that for the new permutation of settings you tested, it hung "in the same place". Do you mean that when it hangs, it always hangs in the same write statement? If so, it would be interesting to know what that write statement looks like. E.g., does it write out maybe parts of a shared array, or does it construct an internal unit first and then write that out? Maybe coming from that end, i.e. the actual write statement that triggers the issue, we might be able to come up with some more ideas about the underlying issue we are hitting.

@billsacks
Member Author

Yes, I mean that, when it hangs, it seems to hang in the same write statement. I've only run it through a debugger to confirm once, but I'm pretty sure of this based on what's present in the log files at the time of the hang. This isn't the only problematic write statement: @jedwards4b encountered some others that we worked around, and this is just the next one causing problems.

The write statement currently causing problems is this one:
https://github.com/escomp/CTSM/blob/master/src/main/lnd2glcMod.F90#L228

qice_grc is a shared array.

However, one of the lines that caused problems for @jedwards4b was this one, before it was wrapped in an OMP MASTER:
https://github.com/escomp/CTSM/blob/master/src/biogeochem/CNVegetationFacade.F90#L1131

which doesn't print out any values.

@jedwards4b
Contributor

I've been down this rabbit hole and offered a branch of ctsm with critical clauses surrounding most of its write statements. There is nothing special I can see about any of the write statements except that they are in threaded regions.

@theurich

@jedwards4b and @billsacks I suggest we attempt a test run where we try to eliminate as many of the ESMF threading-related aspects as possible. The changes I have in mind are these:

  1. Use an ESMF that is completely built without Pthread dependency:
    /glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.0.5/esmf-fit2-ncdfio-mpt-g.lua
    /glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.0.5/esmf-fit2-ncdfio-mpt-O.lua

  2. Comment out the call to ESMF_InitializePreMPI().

  3. Replace the call to MPI_Init_thread() with a regular MPI_Init() (see the sketch after this list).

  4. Comment out any of the OMP_WAIT_POLICY and KMP_BLOCKTIME environment variables.
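
For item 3, a minimal sketch of the kind of change meant (purely illustrative; the actual driver code and the threading level it requests are assumptions here):

program mpi_init_test
  ! Illustrative only: swap MPI_Init_thread for a plain MPI_Init, as in item 3.
  use mpi
  implicit none
  integer :: ierr, provided

  ! Before (ESMF-aware threading path; the requested level is an assumption):
  ! call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)

  ! After (for this diagnostic test):
  call MPI_Init(ierr)

  call MPI_Finalize(ierr)
end program mpi_init_test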

@jedwards4b
Contributor

jedwards4b commented Apr 13, 2021 via email

@theurich

That is correct. This is only a test with esmf_aware_threading off, to see if any of the listed aspects might be the underlying cause of the problem. This is not a long-term solution, because I think we really want esmf_aware_threading support, but if we can figure out what causes the issue, maybe we can find a way to resolve it.

@theurich

One other thing I just thought of is the fact that the hanging was observed with Intel+MPT. I am not aware of such hanging under UFS with Intel+IntelMPI (I am not sure they ever run with MPT). But I do know they run FV3 threaded. However, I am not 100% sure that they have write statements inside threaded regions - but I would almost assume they do. So, another interesting test would be to look into Intel+IntelMPI on Cheyenne and see whether CTSM writes ever hang for that combo. That is a separate test from the other set of changes I suggested above.

@billsacks
Member Author

Thanks a lot for your continued thoughts, @theurich and @jedwards4b .

I wanted to share a few more thoughts and findings. The important finding is near the bottom: the problem only seems to appear for writes to the lnd.log file: writes to the cesm.log file (unit 6) seem fine. (@theurich - in CESM, we open separate files for each component's master proc; so, for example, the CTSM master proc writes to a lnd.log file; all other procs from all components write to the cesm.log file by writing to unit 6, I think via stdout being redirected to the cesm.log file. In other words, it seems that output to stdout is fine, but output to other files is affected.)

  1. I realized that my testing probably isn't incorporating some of @theurich 's suggestions from the last week. I haven't followed all of the email discussions closely, but my sense is that @jedwards4b has tried a variety of suggestions, but none seem to completely solve the problem. But it may be that the specific issue I'm hitting here is resolved by one of these suggestions.

  2. I tried changing the offending write statement to use a format string rather than list-directed i/o, but this did not solve the problem:

diff --git a/src/main/lnd2glcMod.F90 b/src/main/lnd2glcMod.F90
index f48b3ef8b..dce65272d 100644
--- a/src/main/lnd2glcMod.F90
+++ b/src/main/lnd2glcMod.F90
@@ -225,7 +225,7 @@ contains
 
          ! Check for bad values of qice
          if ( abs(this%qice_grc(g,n)) > 1.0_r8) then
-            write(iulog,*) 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)
+            write(iulog,'(a,i0,i0,f14.7)') 'WARNING: qice out of bounds: g, n, qice =', g, n, this%qice_grc(g,n)
          end if
       end if
  3. I noticed that the problem seems isolated to the master proc and/or output to the lnd.log file (in CESM, the master proc outputs to the lnd.log file); output from other procs to the cesm.log file (which is done in CESM via redirecting stdout to the cesm log file) does not appear to be impacted. I saw this by noticing that, at the point of the hang, there is always just a single write statement's output on lnd.log, but the full complement of output from this write statement on cesm.log (i.e., the same amount of output as we get in a run that gets past the hang, including multiple outputs per proc).

  4. I then tried sending all CTSM output to the cesm log file (i.e., stdout) by making this change:

diff --git a/src/cpl/nuopc/lnd_comp_nuopc.F90 b/src/cpl/nuopc/lnd_comp_nuopc.F90
index 19c774829..c79f45670 100644
--- a/src/cpl/nuopc/lnd_comp_nuopc.F90
+++ b/src/cpl/nuopc/lnd_comp_nuopc.F90
@@ -211,7 +211,7 @@ contains
     ! reset shr logging to my log file
     !----------------------------------------------------------------------------
 
-    call set_component_logging(gcomp, localPet==0, iulog, shrlogunit, rc)
+    call set_component_logging(gcomp, .false., iulog, shrlogunit, rc)
     if (ChkErr(rc,__LINE__,u_FILE_u)) return
 
     ! Note still need compid for those parts of the code that use the data model        

I got 5 passes in a row (whereas, without this change, I was getting failures about 50% of the time).

So it seems like the issue only arises during writes to the lnd.log file: writes to stdout do not appear to be impacted. Maybe this gives some hints at the possible problem? (Not to me, but maybe to one of you?) But it also hints at a possible solution. I'll put my ideas of a possible solution in a separate comment, since this comment is already getting so long.

@billsacks
Member Author

Given my above finding that the issue only appears to impact writes to the lnd.log file, I'm wondering about the following solution, if @theurich 's ideas don't work out and/or it is starting to feel too time-consuming to try to address the root cause. @jedwards4b and @mvertens I'm wondering what you think about this solution:

Change the logging logic so that only the master thread on the master proc outputs to the lnd log file; all other threads (and procs) output to stdout/cesm.log. I think this could be done as follows: wherever we currently reference the variable iulog in ctsm, change it to a function call (simply by changing iulog to iulog()). This function would have some logic that returns a different value for the OMP master thread than for other threads.
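
A minimal sketch of the kind of function I have in mind (the module and variable names are hypothetical, not existing CTSM code):

module log_unit_mod
  ! Hypothetical sketch: return the lnd.log unit only for the master thread
  ! on the master proc; every other thread writes to stdout (unit 6), which
  ! ends up in cesm.log.
  !$ use omp_lib, only : omp_get_thread_num
  implicit none
  private
  public :: iulog

  integer :: masterproc_log_unit = 6   ! set to the lnd.log unit during init
  logical :: masterproc = .false.      ! set during init

contains

  integer function iulog()
    integer :: tid
    tid = 0   ! default if building without OpenMP
    !$ tid = omp_get_thread_num()
    if (masterproc .and. tid == 0) then
       iulog = masterproc_log_unit
    else
       iulog = 6
    end if
  end function iulog

end module log_unit_mod

Call sites would then change from write(iulog,*) ... to write(iulog(),*) ...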

I think this would solve the problem for writes within the CTSM code. It would lead to some inconsistencies with any write statements done in share code, since those use the proc-level iulog (so, for writes in share code called from CTSM, all output from the master proc would still go to the lnd.log file). So I think that leaves open the possibility that we'd get model hangs for writes done in share code. This probably mainly matters as the model is about to abort, in the write statements called just before the abort. A few ways to deal with this are:

  1. Change CTSM's nuopc cap so that it sets the share code's logunit to the default (unit 6) for all procs. This would mean that share code output from the master task would go to cesm log rather than lnd log, which isn't ideal, but may be acceptable.
  2. Put similar logic in the share code, so that rather than using a logunit variable, shr code instead uses a logunit function whose value depends in part on whether you are the master thread.
  3. Change the interface to the relevant share code so that callers pass in a log unit, rather than getting this from a share code module variable. This could be an optional argument. Then CTSM could pass in the result of its iulog() function. However, this might require a lot of changes, since I imagine there are a lot of share code routines that do write statements themselves or call other subroutines that do write statements.
  4. (I started to think about a way to set a thread-specific logunit variable in the share code at the start of each OMP loop in CTSM, but I'm struggling to come up with a way that could work.)

@theurich

theurich commented Apr 13, 2021 via email

@ekluzek
Contributor

ekluzek commented Apr 13, 2021

@theurich yes, currently output goes either to lnd.log or cesm.log. But all threads on the root MPI task go to lnd.log and everything else goes to cesm.log. If I understand @billsacks correctly, the only difference with his suggestion is that on the root MPI task only the master thread would go to lnd.log and other threads would go to cesm.log. Personally I think this is fine. And I love it as a straightforward solution.

@theurich

theurich commented Apr 13, 2021

@ekluzek yep, was just checking if I understood the current situation correctly. I had not realized that output from different procs goes to different files. But I certainly have no issue with that. I also agree that the proposed solution seems straightforward, and just pushes more output over to cesm.log vs. lnd.log, but it is in a sense consistent with the split that is already happening.

@billsacks your analysis of the observations has me wondering about one thing right now. Could it be that the root cause of the issue is a race condition on the iulog variable on the master proc? I assume that iulog is declared as shared, and that the resetting to unit 6 happens outside of any parallel region. Just the fact that this is happening in writes, where iulog has been modified by the component, seems notable.

@billsacks
Member Author

@ekluzek yes, you are understanding my suggestion right – thanks for the clear restatement of it.

@theurich yes, you are also understanding correctly that output from different tasks goes to different log files. Whether this is desirable is up for debate... it's the way CESM has done things for at least the 10 years I've been here, and probably much longer, so I don't know the rationale that led to this design.

Good question about a possible race condition on iulog. I think this is just set in ctsm's nuopc cap, outside of any threaded region, but let me double check that after lunch and get back to you on that.

@billsacks
Member Author

@theurich As far as I can see, the only place where the module-level variable iulog is set is in this call in initialization (outside of any threading loops as far as I can tell):

call set_component_logging(gcomp, localPet==0, iulog, shrlogunit, rc)

Here, in case you're interested, is the routine that calls:

https://github.com/ESCOMP/CMEPS/blob/448a81c2b251018311aafbcb028314ddf0352918/nuopc_cap_share/nuopc_shr_methods.F90#L133-L164

After that point, I believe iulog is only read, not written.

@theurich

theurich commented Apr 14, 2021

Inspired by @billsacks analysis of the hanging write problem, I tried several things. The conclusion for me now is that we are dealing with an Intel compiler bug for versions < intel/19.1.1.

The first thing I did was to add write statements inside OpenMP threaded blocks in components of some of the NUOPC prototype codes (used as examples and for regression testing). Everything worked fine on my local GNU setup - as expected. However, when I took it over to Cheyenne and built with Intel+MPT, I experienced exactly the same hang when writing to a specific unit other than 6. For 6 (stdout), everything ran fine, but as soon as I used a different unit, to write directly to a file, it would hang if the code was writing from more than a single thread. I could work around the issue, just as you and Jim have, by adding an !$omp critical block around the write statement.

At this point I still was going under the assumption that somehow ESMF was part of the problem. To make sure it actually was, I just wanted to quickly verify that for a simple Fortran program things would actually work correctly. So I put a simple code together (see below). It turned out that when I built that code with the default intel/19.0.5 on Cheyenne, it also hung in the write!!! Unless I either let it go to stdout, or used omp critical! No ESMF, not even MPT involved, and it would hang every single time!!!

I then tried different versions: intel/17.0.1, intel/18.0.5, and intel/19.0.2 all showed the same issue!
However, the latest intel/19.1.1 worked. So I also tried it with the NUOPC test, and that also worked!
Finally, I tried all of the GNU modules on Cheyenne, and the NVHPC module. Those also all worked as expected.

My conclusion now is:
(1) Intel < 19.1.1 has a bug that causes the issue. It is fixed in 19.1.1.
(2) If you explicitly make sure only a single thread ever calls write (with a unit other than 6), e.g. via OMP CRITICAL, you do not trigger the issue.
(3) This probably also explains why some NUOPC runs did not hang, and could be a way to explain why MCT runs apparently never hung. It would mean that under MCT the threads under the master proc were always sufficiently out of sync hitting the write statement that they did not trigger the Intel bug, while for NUOPC that was sometimes the case and sometimes they were just close enough in sync to trigger it.

The final test of the above theory will be for me to rebuild an ESMF installation with intel/19.1.1, and then for you to re-run the regression tests that showed the 50% rate of hanging. I will kick that off tonight, and it should be ready in the morning.

Plain Fortran reproducer:

program omp_write
!$  use omp_lib
  implicit none

  integer                     :: i,tid, funit
 
#define STDOUTxx
#ifdef STDOUT
    funit = 6
#else
    open(newunit=funit, file="test.out")
#endif

!$omp parallel private(tid)
    tid = -1 ! initialize to obvious value if building without OpenMP
!$  tid = omp_get_thread_num()
!$omp do
    do i=1, 100
!!$omp critical
      write(funit,*) "test write, tid=", tid, "  i=", i
!!$omp end critical
    enddo
!$omp end parallel

  close(funit)

end program

@billsacks
Member Author

billsacks commented Apr 14, 2021

@theurich Thanks so much for your analysis of the issue. I really appreciate all of your time helping with this – especially since it's looking like this isn't an ESMF issue after all!

I was still puzzled by why we are hitting this with esmf but not mct. So I looked more closely at what might differ in setting up the output log unit. I noticed that the esmf version in CESM uses open(newunit=..., whereas the mct version uses some old homegrown logic for finding an available unit number (long predating the newunit option to open).
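
For context, here is a rough sketch of the kind of pre-newunit, homegrown unit-search logic meant here (purely illustrative; this is not the actual shr_file_getUnit implementation):

program unit_search_demo
  ! Illustrative sketch: find a free unit number by scanning a fixed range,
  ! rather than relying on open(newunit=...).
  implicit none
  integer :: funit

  funit = get_free_unit()
  open(funit, file="test.out")
  write(funit,*) "writing to unit ", funit
  close(funit)

contains

  function get_free_unit() result(funit)
    ! Return the first unit in [10,99] that is not currently open.
    integer :: funit
    integer :: n
    logical :: opened
    funit = -1
    do n = 10, 99
       inquire(unit=n, opened=opened)
       if (.not. opened) then
          funit = n
          return
       end if
    end do
    ! A real implementation would abort here if no free unit was found.
  end function get_free_unit

end program unit_search_demo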

After reproducing the hang with your very helpful little reproducer, I made a small change: using a hard-coded unit number rather than newunit. See below (I also removed the ifdefs for clarity):

program omp_write
  !$ use omp_lib
  implicit none

  integer :: i,tid, funit

  funit = 17
  ! open(newunit=funit, file="test.out")
  open(funit, file="test.out")

  !$omp parallel private(tid)
  tid = -1 ! initialize to obvious value if building without OpenMP
  !$ tid = omp_get_thread_num()
  !$omp do
  do i=1, 100
!!$omp critical
     write(funit,*) "test write, tid=", tid, " i=", i
!!$omp end critical
  enddo
  !$omp end parallel

  close(funit)

end program omp_write

This worked, implicating the newunit.

I am now rerunning with these diffs in our nuopc code:

diff --git a/nuopc_cap_share/nuopc_shr_methods.F90 b/nuopc_cap_share/nuopc_shr_methods.F90
index 8d3283a..d4ffff5 100644
--- a/nuopc_cap_share/nuopc_shr_methods.F90
+++ b/nuopc_cap_share/nuopc_shr_methods.F90
@@ -22,7 +22,7 @@ module nuopc_shr_methods
   use NUOPC_Model  , only : NUOPC_ModelGet
   use shr_kind_mod , only : r8 => shr_kind_r8, cl=>shr_kind_cl, cs=>shr_kind_cs
   use shr_sys_mod  , only : shr_sys_abort
-  use shr_file_mod , only : shr_file_setlogunit, shr_file_getLogUnit
+  use shr_file_mod , only : shr_file_setlogunit, shr_file_getLogUnit, shr_file_getUnit
 
   implicit none
   private
@@ -154,7 +154,8 @@ contains
        call NUOPC_CompAttributeGet(gcomp, name="logfile", value=logfile, rc=rc)
        if (chkerr(rc,__LINE__,u_FILE_u)) return
 
-       open(newunit=logunit,file=trim(diro)//"/"//trim(logfile))
+       logunit = shr_file_getUnit()
+       open(logunit,file=trim(diro)//"/"//trim(logfile))
     else
        logUnit = 6
     endif

I have gotten three passes in a row. I'm going to run a couple more times to be sure, but I think that works around the compiler issue.

  • Update: I am now at five passes in a row. This seems to be a robust workaround.

Thanks again for your help with this!

@theurich

@billsacks this is awesome! It is very satisfying to have this mystery resolved, with a clear understanding as to why the MCT version of the code consistently worked, even with the older Intel versions, while the NUOPC version ran into these frequent hangs. So nice to have it traced down to one specific difference of using newunit or not.

Not that it really matters all that much anymore, but the ESMF 8.1.0 installation with intel/19.1.1 is now available:

/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.1.1/esmf-8.1.0-ncdfio-mpt-O.lua
/glade/p/cesmdata/cseg/PROGS/modulefiles/esmfpkgsNEW/intel/19.1.1/esmf-8.1.0-ncdfio-mpt-g.lua

@mvertens

@theurich @billsacks - thank you both for working on this and finding a resolution to a problem that has plagued us for months. @theurich - I want to particularly thank you for helping with this given that it ended up not being an ESMF issue.

@theurich

@mvertens - no problem at all. I slept a lot better last night knowing this was likely not anything that had to do with ESMF. Great work from @billsacks and @jedwards4b pushing this forward, too!

@mvertens - one thing I am now wondering: is there any chance that any other components, under their NUOPC caps, might be doing similar things with open(newunit=...) for their logs?

@jedwards4b
Contributor

@theurich yes they all use the same mechanism.

@billsacks
Member Author

Thanks for the new builds, @theurich . From separate discussions with @jedwards4b this morning, our tentative plan is to update to intel 19.1.1, so your new builds could be helpful for that.

To clarify, the code in question (with open(newunit=...)) is in shared code in CESM's CMEPS repository, not in the CTSM repository. So no code changes appear to be needed in CTSM. Our tentative plan, from discussions with @jedwards4b, is not to put the above workaround in the code, but instead to require intel 19.1.1 or later when doing threaded runs with NUOPC/CMEPS.

I'll leave this issue open here until we update to a newer version of CIME that points to a more recent version of the intel compiler.

@billsacks
Member Author

We are using intel 19.1.1 on Cheyenne now, and the problem no longer appears, so I'm closing this issue.
