
Hang in dp_coupling::d_p_coupling with newer module versions and compilers (GNU version 12.3) #6451

Closed
ndkeen opened this issue May 29, 2024 · 6 comments · Fixed by #6687
Labels: HOMME, intel (Intel compilers), Machine Files, pm-cpu (Perlmutter at NERSC, CPU-only nodes)

ndkeen commented May 29, 2024

This issue was originally reported with a newer Intel compiler, but as of Sep 7, 2024, I no longer see it with Intel and have created a PR to upgrade (#6596). The hang still occurs with GNU. I might close this issue and open a fresh one for GNU only, but for now I'm leaving the text below as-is:

While trying to update module versions on pm-cpu, I have hit a few issues. One, with Intel, is that this test hangs in init:
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_intel.allactive-wcprodssp
I'm noting the hang in HOMME, but since I don't know the root cause, the issue may not actually be there.
The test works with the current Intel version (intel/2023.1.0); what I'd like to use is the new default for the machine (intel/2023.2.0).

We see this in cpl.log (indicating the run is still in init):

(seq_mct_drv) : Calling atm_init_mct phase 2
(component_init_cc:mct) : Initialize component atm

Looking at where the stack is on a compute node:

#0  cxi_eq_peek_event (eq=0x22e12dc8) at /usr/include/cxi_prov_hw.h:1531
#1  cxip_ep_ctrl_eq_progress (ep_obj=0x22e25790, ctrl_evtq=0x22e12dc8, tx_evtq=true, ep_obj_locked=true) at prov/cxi/src/cxip_ctrl.c:318
#2  0x00001503828591dd in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:186
#3  0x000015038285e969 in cxip_util_cq_progress (util_cq=0x22e15220) at prov/cxi/src/cxip_cq.c:112
#4  0x000015038283a301 in ofi_cq_readfrom (cq_fid=0x22e15220, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#5  0x00001503860fa0f2 in MPIR_Wait_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#6  0x0000150386c9b926 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_intel.so.12
#7  0x0000150386ca7685 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_intel.so.12
#8  0x0000150386bd232d in MPIR_Alltoall_intra_brucks () from /opt/cray/pe/lib64/libmpi_intel.so.12
#9  0x00001503855bee8a in MPIR_Alltoall_intra_auto.part.0 () from /opt/cray/pe/lib64/libmpi_intel.so.12
#10 0x00001503855bf05c in MPIR_Alltoall_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#11 0x00001503855bf83f in PMPI_Alltoall () from /opt/cray/pe/lib64/libmpi_intel.so.12
#12 0x0000150387c4364e in pmpi_alltoall__ () from /opt/cray/pe/lib64/libmpifort_intel.so.12
#13 0x0000000000bcad8f in mpialltoallint (sendbuf=..., sendcnt=1, recvbuf=..., recvcnt=1, comm=-1006632954) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/wrap_mpi.F90:1143
#14 0x0000000002b93c02 in phys_grid::transpose_block_to_chunk (record_size=88, block_buffer=<error reading variable: value requires 2509056 bytes, which is more than max-value-size>, chunk_buffer=<error reading variable: value requires 2452032 bytes, which is more than max-value-size>,
    window=<error reading variable: Cannot access memory at address 0x0>) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/physics/cam/phys_grid.F90:4137
#15 0x0000000005304965 in dp_coupling::d_p_coupling (phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dp_coupling.F90:242
#16 0x0000000003719020 in stepon::stepon_run1 (dtime_out=1800, phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_in=..., dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/stepon.F90:244
#17 0x0000000000948d7c in cam_comp::cam_run1 (cam_in=..., cam_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:251
#18 0x0000000000905530 in atm_comp_mct::atm_init_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=..., nlfilename=..., .tmp.NLFILENAME.len_V$5bab=6) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:499
#19 0x00000000004a7045 in component_mod::component_init_cc (eclock=..., comp=..., infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., .tmp.NLFILENAME.len_V$7206=6, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$7209=4096, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$720c=4096)
    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:257
#20 0x000000000045d9d6 in cime_comp_mod::cime_init () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:2370
#21 0x000000000049dfc2 in cime_driver () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122

Above, I pasted results from running on muller-cpu, but I see the same behavior on pm-cpu (after updating the module versions there).

I made a copy of the case on PSCRATCH in case someone wants to look at the logs:

/pscratch/sd/n/ndk/SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_intel.allactive-wcprodssp.20240529_092039_q82rr

I would like to try this test with other compilers, but we currently have a segfault with GNU (#6428).

ndkeen added the HOMME, pm-cpu (Perlmutter at NERSC, CPU-only nodes), and intel (Intel compilers) labels on May 29, 2024
ndkeen commented May 29, 2024

After adding a temporary workaround for the GNU issue noted above, I can now run a GNU-built exe. It suffers the same fate -- it hangs in what looks like the same place.
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu.allactive-wcprodssp

I also still see the hang without the test modifier, for both Intel and GNU:

SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370

Without DEBUG, the test completes (for both Intel and GNU).

ndkeen commented May 30, 2024

Since there appears to be a difference in behavior between DEBUG and OPT, I'm trying a few different things. If I stay with DEBUG but simplify the flags to only -O -g, I actually get a different error -- which, if real, might be worth tracking:

213: SHR_REPROSUM_CALC: Input contains  0.10000E+01 NaNs and  0.00000E+00 INFs on MPI task     213
213:  ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
213: #0  0x14891a423372 in ???
213: #1  0x23f19fc in __shr_abort_mod_MOD_shr_abort_backtrace
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:104
213: #2  0x23f1b83 in __shr_abort_mod_MOD_shr_abort_abort
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:61
213: #3  0x24361d5 in __shr_reprosum_mod_MOD_shr_reprosum_calc
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_reprosum_mod.F90:644
213: #4  0xc6f638 in __global_norms_mod_MOD_wrap_repro_sum
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/global_norms_mod.F90:864
213: #5  0xcc31e5 in __prim_state_mod_MOD_prim_printstate
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/prim_state_mod.F90:216
213: #6  0xc8a5e3 in __prim_driver_base_MOD_prim_init2
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/prim_driver_base.F90:1033
213: #7  0xf3a909 in __dyn_comp_MOD_dyn_init2
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dyn_comp.F90:380
213: #8  0xc352fe in __inital_MOD_cam_initial
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:73
213: #9  0x520eb3 in __cam_comp_MOD_cam_init
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:162
213: #10  0x51aad1 in __atm_comp_mct_MOD_atm_init_mct
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:371
213: #11  0x489151 in __component_mod_MOD_component_init_cc
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:258
213: #12  0x477ef1 in __cime_comp_mod_MOD_cime_init
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:1488
213: #13  0x4866dc in cime_driver
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
213: #14  0x4866dc in main
213:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:23
213: MPICH ERROR [Rank 213] [job id 692934.0] [Wed May 29 16:50:38 2024] [nid200068] - Abort(1001) (rank 213 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 213

I also tried running with OPT flags but without -O2, and that completed.
This was all with GNU, using a test like SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu
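The shr_reprosum check that fired above scans the reduction input for non-finite values before summing. A hedged sketch of that kind of pre-scan (illustrative only; this is not the actual shr_reprosum_calc code, and the function name here is hypothetical):

```python
import math

def count_nonfinite(values):
    """Count NaNs and INFs, the way a reproducible-sum input check might."""
    nans = sum(1 for v in values if math.isnan(v))
    infs = sum(1 for v in values if math.isinf(v))
    return nans, infs

nans, infs = count_nonfinite([1.0, float("nan"), 2.0])
print(nans, infs)  # 1 0
if nans or infs:
    print("ERROR: NaNs or INFs in input")
```

The real routine reports the counts per MPI task (here "0.10000E+01 NaNs" on task 213) and then aborts.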

ndkeen commented May 30, 2024

By adjusting compiler flags, I was able to get a stack trace -- which may or may not be the same issue.

391: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
391: 
391: Backtrace for this error:
391: #0  0x145ddf423372 in ???
391: #1  0x145ddf422505 in ???
391: #2  0x145dde851dbf in ???
391: #3  0xcb15ec in __eos_MOD_pnh_and_exner_from_eos2
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:121
391: #4  0xcb238f in __eos_MOD_pnh_and_exner_from_eos
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:74
391: #5  0xcaea84 in __element_ops_MOD_tests_finalize
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:723
391: #6  0xcb068f in __element_ops_MOD_set_thermostate
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:489
391: #7  0xf32265 in __inidat_MOD_read_inidat
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inidat.F90:674
391: #8  0xd3684e in __startup_initialconds_MOD_initial_conds
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/startup_initialconds.F90:54
391: #9  0xc34dd7 in __inital_MOD_cam_initial
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:67
391: #10  0x5209c9 in __cam_comp_MOD_cam_init
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:162
391: #11  0x51a5e7 in __atm_comp_mct_MOD_atm_init_mct
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:371
391: #12  0x48903b in __component_mod_MOD_component_init_cc
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:258
391: #13  0x477dd1 in __cime_comp_mod_MOD_cime_init
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:1488
391: #14  0x4865c6 in cime_driver
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122
391: #15  0x4865c6 in main
391:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:23

components/homme/src/theta-l/share/eos.F90

  ! check for bad state that will crash exponential function below
  if (theta_hydrostatic_mode) then
    ierr= any(dp3d(:,:,:) < 0 ) ! <-- line 121
  else
    ierr= any(vtheta_dp(:,:,:) < 0 )  .or. &
          any(dp3d(:,:,:) < 0 ) .or. &
          any(dphi(:,:,:) > 0 )
  endif
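Note that the guard above cannot detect NaNs: under IEEE 754 semantics, every ordered comparison with NaN is false, so `any(dp3d(:,:,:) < 0)` stays false even when the array is full of NaNs. A quick sketch of the same pitfall in Python (illustrative only; the model code is Fortran):

```python
import math

# Sample layer values with one NaN, as seen in the bad-state dump below.
dp3d = [float("nan"), 4.7223, 6.9384]

# IEEE 754: NaN < 0 is False, so this guard misses NaNs entirely.
print(any(v < 0 for v in dp3d))          # False -- NaN slips through
print(any(math.isnan(v) for v in dp3d))  # True  -- explicit check catches it
```

This is consistent with an FPE firing later in the exponential even though the guard passed.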

ndkeen commented May 30, 2024

With a slightly different flag variation, I see this error:

  2:  bad state in EOS, called from: not specified
  2:  bad i,j,k=           1           4          42
  2:  vertical column: dphi,dp3d,vtheta_dp
  2:   1           NaN        4.7223    15234.5361
  2:   2           NaN        6.9384    21262.8696
  2:   3           NaN       10.1644    28133.1975
  2:   4           NaN       14.8259    35098.1232
  2:   5           NaN       21.4903    46351.4278
  2:   6           NaN       30.8760    62234.7248
  2:   7           NaN       43.8245    79485.2115
  2:   8           NaN       61.2040    99465.7967
  2:   9           NaN       83.7154   122918.9372
  2:  10           NaN      111.5971   147073.4247
  2:  11           NaN      144.2831   166819.3864
  2:  12           NaN      180.1535   182548.8554

...

  2:  ERROR: EOS bad state: d(phi), dp3d or vtheta_dp < 0
  2: #0  0x1470f0c23372 in ???
  2: #1  0x23c0c04 in __shr_abort_mod_MOD_shr_abort_backtrace
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:104
  2: #2  0x23c0d8b in __shr_abort_mod_MOD_shr_abort_abort
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/share/util/shr_abort_mod.F90:61
  2: #3  0x57673b in __cam_abortutils_MOD_endrun
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/utils/cam_abortutils.F90:60
  2: #4  0xc7f2c5 in __parallel_mod_MOD_abortmp
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/share/parallel_mod.F90:278
  2: #5  0xcb1923 in __eos_MOD_pnh_and_exner_from_eos2
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:140
  2: #6  0xcb2325 in __eos_MOD_pnh_and_exner_from_eos
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/eos.F90:74
  2: #7  0xcaea1a in __element_ops_MOD_tests_finalize
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:723
  2: #8  0xcb0625 in __element_ops_MOD_set_thermostate
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/homme/src/theta-l/share/element_ops.F90:489
  2: #9  0xf321fb in __inidat_MOD_read_inidat
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inidat.F90:674
  2: #10  0xd367e4 in __startup_initialconds_MOD_initial_conds
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/startup_initialconds.F90:54
  2: #11  0xc34d6d in __inital_MOD_cam_initial
  2:    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/inital.F90:67

/mscratch/sd/n/ndk/e3sm_scratch/muller-cpu/mullerup/SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.muller-cpu_gnu.NDEBUG-Og/run/e3sm.log.692939.240529-174020

ndkeen commented Jun 11, 2024

I've been adjusting compiler flags in an attempt to debug. With -g -O -DNDEBUG -ffpe-trap=invalid,zero, I'm able to run a DEBUG case and get a quick FPE. After adding checks on arrays higher up, I see there are issues with the data just after it is read in from file. The existing test on that data does not catch the problem because a mask is involved. So I don't know whether the issue is with the data, the mask, or how the rest of the code expects/uses the data.

components/eam/src/dynamics/se/inidat.F90

    fieldname = 'PS'
    tmp(:,:,:) = 0.0_r8 ! ndk try further init
    tmp(:,1,:) = 0.0_r8
    call t_startf('read_inidat_infld')
    if (.not. scm_multcols) then
      call infld(fieldname, ncid_ini, ncol_name,      &
           1, npsq, 1, nelemd, tmp(:,1,:), found, gridname=grid_name)
    else
      call infld(fieldname, ncid_ini, ncol_name,      &
           1, 1, 1, 1, tmp(:,1,:), found, gridname=grid_name)
    endif
    call t_stopf('read_inidat_infld')
    if(.not. found) then
       call endrun('Could not find PS field on input datafile')
    end if

    ! Check read-in data to make sure it is in the appropriate units
    allocate(tmpmask(npsq,nelemd))
    tmpmask = (reshape(ldof, (/npsq,nelemd/)) /= 0)

    if(minval(tmp(:,1,:), mask=tmpmask) < 10000._r8 .and. .not. scm_multcols) then
       call endrun('Problem reading ps field')
    end if

    ierr= any(tmp(1,1,:) < 0.0 ) !ndk  this test will catch FPE's that will later have issues

ndkeen changed the title from "Hang in dp_coupling::d_p_coupling with newer intel compiler (version 2023.2.0)" to "Hang in dp_coupling::d_p_coupling with newer module versions and compilers (Intel version 2023.2.0, GNU version 12.3)" on Jun 11, 2024
ndkeen commented Sep 6, 2024

I'm not sure what has changed, but when I try the test again with an updated repo, it completes for Intel (using the updated version):
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp
with intel/2023.2.0

I can check again with an updated GNU compiler as well, but this may be enough justification for a PR to at least update the Intel compiler version.

ndkeen changed the title from "Hang in dp_coupling::d_p_coupling with newer module versions and compilers (Intel version 2023.2.0, GNU version 12.3)" to "Hang in dp_coupling::d_p_coupling with newer module versions and compilers (GNU version 12.3)" on Sep 7, 2024
ndkeen added a commit that referenced this issue Oct 19, 2024
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for certain cases at higher node counts. Setting the environment variable FI_MR_CACHE_MONITOR=kdreg2 avoids any issues so far.
kdreg2 is another option for memory-registration cache monitoring -- it is a Linux kernel module using open-source licensing.
It comes with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.
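A minimal way to apply the workaround (the variable name is the real libfabric setting used here; placing it in the job script rather than the machine config is just one option):

```shell
# Select the kdreg2 memory-registration cache monitor for libfabric
# before launching the run, e.g. in the job script or machine env config.
export FI_MR_CACHE_MONITOR=kdreg2
echo "$FI_MR_CACHE_MONITOR"
```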

Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node counts) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]