GSI built with debug mode failed in the test global_4denvar on wcoss2 #712

Closed
TingLei-NOAA opened this issue Mar 12, 2024 · 14 comments · Fixed by #722

Comments

@TingLei-NOAA
Contributor

TingLei-NOAA commented Mar 12, 2024

On WCOSS2, when GSI is built in debug mode, it becomes idle and the job is eventually killed for exceeding the wall clock limit, for example:

PBS: job killed: walltime 12607 exceeded limit 12600

The error message shows:

nid001408.cactus.wcoss2.ncep.noaa.gov 76: forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
gsi.x              0000000007B4769B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014BE749408C0  Unknown               Unknown  Unknown
.......
libmpi_intel.so.1  000014BE70BA2394  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014BE6EF08231  PMPI_Allreduce        Unknown  Unknown
libmpifort_intel.  000014BE75014856  mpi_allreduce_        Unknown  Unknown
gsi.x              0000000000982CA8  m_gpsstats_mp_gen         311  genstats_gps.f90
gsi.x              0000000002600FF7  setuprhsall_              531  setuprhsall.f90

Line 311 in genstats_gps.f90 is

  call mpi_allreduce(toss_gps_sub,toss_gps,nprof_gps,mpi_rtype,mpi_max,&
       mpi_comm_world,ierror)

The reason GSI hangs at this point needs to be investigated.
Added on Mar. 15, 2024: another issue was found. loproc_updat != hiproc_updat, loproc_contrl != hiproc_contrl, and hiproc_updat != hiproc_contrl; only loproc_contrl = loproc_updat.

@RussTreadon-NOAA
Contributor

WCOSS2 test

The following has been done on Cactus

  1. clone develop at fca6bea
  2. change BUILD_TYPE in ush/build.sh from Release to Debug
  3. build gsi.x and enkf.x in debug mode
  4. increase wall clock limit to 3 hours for global_4denvar in regression/regression_param.sh
  5. execute ctest -R global_4denvar

global_4denvar_loproc_updat and global_4denvar_hiproc_updat ran to completion in debug mode. Neither job hangs. The loproc job took 2835.414645 seconds to complete. The hiproc job took 1534.105594 seconds to complete.

Interestingly (and disturbingly), the initial gradients of the loproc and hiproc jobs differ in the 10th printed digit. The initial total penalties are identical in all 19 printed digits.

loproc

Initial cost function =  6.584168406980320578E+05
Initial gradient norm =  1.700751272515137998E+03
cost,grad,step,b,step? =   1   0  6.584168406980320578E+05  1.700751272515137998E+03  1.057429134927269976E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  6.547802752899429761E+05  2.113894264916150860E+03  1.927405067994416132E+00  1.302727127739893298E+00  good

hiproc

Initial cost function =  6.584168406980320578E+05
Initial gradient norm =  1.700751274435556979E+03
cost,grad,step,b,step? =   1   0  6.584168406980320578E+05  1.700751274435556979E+03  1.057429135803278131E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  6.547802752834755229E+05  2.113894268161601758E+03  1.927405066764274588E+00  1.302727128193786887E+00  good
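To make the digit comparison concrete, here is a minimal sketch that locates the first differing printed digit between two values copied from the loproc and hiproc output above. The helper name first_differing_digit is hypothetical and is not part of GSI or its regression scripts; it only compares the printed mantissa strings.

```python
def first_differing_digit(a, b):
    """Return the 1-based position of the first differing mantissa digit
    between two printed values, or None if every printed digit matches."""
    def split(text):
        mantissa, _, exponent = text.strip().upper().partition("E")
        return mantissa.replace(".", "").lstrip("+-"), exponent

    digits_a, exp_a = split(a)
    digits_b, exp_b = split(b)
    if exp_a != exp_b:
        # This sketch only handles values printed with the same exponent.
        raise ValueError("exponents differ; compare the values directly")
    for position, (x, y) in enumerate(zip(digits_a, digits_b), start=1):
        if x != y:
            return position
    return None


# Initial gradient norm, loproc vs hiproc (values copied from the output above)
print(first_differing_digit("1.700751272515137998E+03",
                            "1.700751274435556979E+03"))  # 10
# Initial cost function, loproc vs hiproc
print(first_differing_digit("6.584168406980320578E+05",
                            "6.584168406980320578E+05"))  # None
```

The same check could be applied to any pair of lines from the two job logs to see where contrl/updat or loproc/hiproc runs start to diverge.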

Differences in WCOSS2 results were observed in PR #616 and #692. Refactoring code yielded reproducible results with respect to the control. Now loproc and hiproc debug runs demonstrate lack of reproducibility.

The WCOSS2 build uses hpc-stack with an older version of the intel compiler. The GSI builds on other platforms use spack-stack modules and newer intel compilers. Are we dealing with a compiler or module issue on WCOSS2? Would repeating the above test on other platforms yield non-reproducible loproc and hiproc results?

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks. I will dig further into a few questions I still have and come back with an update. The failure of global_4denvar due to non-reproducible results (updat vs contrl) is reported in #679 (comment).

@RussTreadon-NOAA
Contributor

@xincjin-NOAA 's PR #692 yields reproducible global_4denvar results on WCOSS2 (Cactus). PR #692 is now the head of develop.

@RussTreadon-NOAA
Contributor

PR #692 updates the global_4denvar case date to include gmi data (monitored, not assimilated). Repeating the above debug test on Cactus from the current f282a94 head of develop results in a floating invalid segmentation fault in read_ozone.f90.

@RussTreadon-NOAA
Contributor

GSI develop at f282a94 seg faults in read_ozone.f90 due to an inconsistency between the mnemonics GSI uses to read the GOME bufr dump file and the mnemonics actually encoded in the file.

The DOYR mnemonic was replaced by MNTH DAYS effective 20240131 18Z. The GOME reader in read_ozone.f90 needs to be updated accordingly. This was done in a working copy of develop on Cactus. After this change gsi.x ran to completion in debug mode in 6474.196832 seconds.
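As a rough illustration of what the reader change involves, the sketch below converts between a day-of-year value and a month/day pair. It assumes DOYR carries the day of year, MNTH the month, and DAYS the day of the month; the function names are hypothetical, and this is not the code in read_ozone.f90.

```python
from datetime import date, timedelta


def doy_to_month_day(year, doy):
    """Convert a 1-based day of year (assumed DOYR) to (month, day)."""
    d = date(year, 1, 1) + timedelta(days=doy - 1)
    return d.month, d.day


def month_day_to_doy(year, month, day):
    """Convert month and day (assumed MNTH, DAYS) to a 1-based day of year."""
    return date(year, month, day).timetuple().tm_yday


# 20240131 is day 31 of 2024, the date the encoding changed.
print(doy_to_month_day(2024, 31))     # (1, 31)
print(month_day_to_doy(2024, 1, 31))  # 31
```

A reader that still needs a day-of-year value internally could derive it from the month and day this way, including in leap years.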

Issue #716 was opened to document addition of the required changes to read_ozone.f90.

@TingLei-NOAA
Contributor Author

Adding extra debug compiler options (-init=snan,arrays) exposed an apparent misuse of variables in read_nsstbufr.f90. To document this fix, a draft PR was created in my fork: TingLei-daprediction#2.
After this fix was applied to both control and update in the global_4denvar test, all runs finished "in time". loproc_contrl and loproc_updat produce identical results, but hiproc_contrl and hiproc_updat give different results. The differences first appear in the initial gradient.
@ADCollard @emilyhcliu @XuLi-NOAA Would you please confirm or correct the fix in read_nsstbufr.f90, following "changed files" in the above draft PR?
BTW: I have not updated my GSI (neither control nor update) to the current HEAD of GSI, hopefully to keep things simpler.
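For readers unfamiliar with the idea behind -init=snan,arrays, the following is a small analogy only: storage is pre-filled with NaN so that any element the code forgets to assign shows up later. The array and helper names are hypothetical. Unlike a signaling NaN, which can raise a floating-point exception at the point of use when trapping is enabled, the quiet NaN in this sketch only propagates into the result.

```python
import math


def fill_nan(n):
    """Return n NaN values, mimicking NaN-initialized storage."""
    return [float("nan")] * n


def unset_indices(values):
    """Indices that still hold NaN, i.e. were never assigned."""
    return [i for i, v in enumerate(values) if math.isnan(v)]


# Hypothetical work array in which the code forgets to set element 2.
work = fill_nan(4)
for i in (0, 1, 3):
    work[i] = float(i)

total = sum(work)           # the unset element poisons the sum with NaN
print(unset_indices(work))  # [2]
print(math.isnan(total))    # True
```

With zero-initialized or leftover memory, the same mistake would produce a plausible but wrong number, which is why it can go unnoticed in a release build.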

@TingLei-NOAA
Contributor Author

A reduced version of global_4denvar was run with radiance obs removed. The obs setup in gsiparm.anl is as follows:

OBS_INPUT::
!  dfile          dtype       dplat       dsis                dval    dthin dsfcalc
   prepbufr       ps          null        ps                  0.0     0     0
   prepbufr       t           null        t                   0.0     0     0
   prepbufr_profl t           null        t                   0.0     0     0
   hdobbufr       t           null        t                   0.0     0     0
   prepbufr       q           null        q                   0.0     0     0
   prepbufr_profl q           null        q                   0.0     0     0
   hdobbufr       q           null        q                   0.0     0     0
   prepbufr       pw          null        pw                  0.0     0     0
   prepbufr       uv          null        uv                  0.0     0     0
   prepbufr_profl uv          null        uv                  0.0     0     0
   satwndbufr     uv          null        uv                  0.0     0     0
   hdobbufr       uv          null        uv                  0.0     0     0
   prepbufr       spd         null        spd                 0.0     0     0
   hdobbufr       spd         null        spd                 0.0     0     0
   prepbufr       dw          null        dw                  0.0     0     0
   radarbufr      rw          null        rw                  0.0     0     0

GSI behaved similarly: loproc_contrl and loproc_updat show identical results, while the hiproc runs differ from the loproc ones and from each other. So the culprit does not seem to be specific to radiance observations.
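To double-check which observation types remain in a reduced setup like the one above, a table in this layout can be parsed with a few lines of Python. This is a sketch under the assumption that the header line starts with "!" and that columns are whitespace separated; parse_obs_input is a hypothetical helper, and only three of the rows above are reproduced here.

```python
OBS_INPUT = """\
!  dfile          dtype       dplat       dsis                dval    dthin dsfcalc
   prepbufr       ps          null        ps                  0.0     0     0
   satwndbufr     uv          null        uv                  0.0     0     0
   radarbufr      rw          null        rw                  0.0     0     0
"""


def parse_obs_input(text):
    """Parse OBS_INPUT-style rows into dicts keyed by the header names."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "!":
            columns = fields[1:]   # header line: column names follow the "!"
            continue
        if columns is None:
            raise ValueError("data row found before the header line")
        rows.append(dict(zip(columns, fields)))
    return rows


rows = parse_obs_input(OBS_INPUT)
print(sorted({row["dtype"] for row in rows}))  # ['ps', 'rw', 'uv']
```

Listing the distinct dtype values makes it easy to confirm that no radiance entries are left in the namelist before comparing runs.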

@TingLei-NOAA
Contributor Author

Another "reduced" version of global_4denvar, in which only static B was used (namely, a 3DVar with FGAT), still showed the same behavior.

@TingLei-NOAA
Contributor Author

An interesting finding: when global_4denvar is run with a reduced setup as in the previous runs (only 2 global members were used) and factqmin=factqmax=0 (namely, this constraint is turned off), the test does succeed (with only "Failure of max-time in the regression test").
So further digging on this issue could focus on the related code and steps.

@TingLei-NOAA
Contributor Author

Another update: with GSI built in debug mode, global_4denvar failed on Hera for the same reason as on WCOSS2, though GSI (both update and contrl) has not been updated to the current head of EMC GSI.

@TingLei-NOAA
Contributor Author

A modification within one OpenMP directive appears to have addressed the reproducibility issue observed between loproc and hiproc runs in the reduced version of global_4denvar (using only 2 members and a maximum of 5 inner iteration steps).
The changes can be reviewed at https://github.com/TingLei-daprediction/GSI/pull/2/files#diff-ff9860deeec140b2a1307734f3bf0ba00df64a66ae682aea121de529536926bf.
I will update the control and the PR to the current head of EMC GSI and see if GSI works as expected.

@RussTreadon-NOAA
Contributor

Great detective work! ii should clearly be declared private in the threaded loop in subroutine intlimq (file intjcmod.f90).
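Why a shared loop temporary breaks threaded results can be shown with a deterministic simulation. This is not the intlimq code and does not use OpenMP: two generator-based "workers" are interleaved by hand, and the scratch dictionary stands in for a variable like ii. All names are hypothetical.

```python
def worker(indices, data, out, scratch):
    """Handle each index in two steps, yielding in between so that a
    scheduler can switch to another worker at that point."""
    for i in indices:
        scratch["ii"] = i                 # step 1: store the working index
        yield                             # another worker may run here
        out[scratch["ii"]] += data[i]     # step 2: use the stored index


def run(shared):
    data = [1.0, 2.0, 3.0, 4.0]
    out = [0.0] * len(data)
    common = {}
    # shared=True: both workers write the same slot (ii left shared)
    # shared=False: each worker has its own slot (ii declared private)
    pending = [
        worker([0, 1], data, out, common if shared else {}),
        worker([2, 3], data, out, common if shared else {}),
    ]
    while pending:
        for w in list(pending):           # round-robin interleaving
            try:
                next(w)
            except StopIteration:
                pending.remove(w)
    return out


print(run(shared=False))  # [1.0, 2.0, 3.0, 4.0]
print(run(shared=True))   # [0.0, 3.0, 1.0, 6.0]
```

With a private index every element lands where it belongs. With a shared index each worker can pick up the value the other one just stored, so the result depends on how the threads happen to interleave, which is consistent with results changing between runs that use different task and thread layouts.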

@TingLei-NOAA
Contributor Author

An update on global_4denvar, with update and control both brought up to the current head of the EMC GSI develop branch.
GSI was built in debug mode (for the control GSI, -init=snan was not used; otherwise the control run would fail, as reported for the issue in read_nsstbufr.f90).
As expected, loproc_contrl = loproc_updat and loproc_updat = hiproc_updat, but loproc_contrl != hiproc_contrl.
@RussTreadon-NOAA Do you think I should open a separate PR for the changes mentioned in this issue, or could they go into the current PR #698?

@RussTreadon-NOAA
Contributor

I suggest a separate PR for the intjcmod.f90 OpenMP bug fix. PR #698 addresses a different problem.

@TingLei-NOAA changed the title from "GSI built with debug mode became idle in the test global_4denvar on wcoss2" to "GSI built with debug mode failed in the test global_4denvar on wcoss2" on Mar 15, 2024