Bug fix & safeguard updates #623

Closed
RussTreadon-NOAA opened this issue Sep 18, 2023 · 10 comments · Fixed by #629

RussTreadon-NOAA commented Sep 18, 2023

Two GSI-specific issues were identified while testing develop at 008c63c in the global-workflow.

  1. bug: gsi.x encodes Time into the netcdf radiance diagnostic file, but enkf.x expects Obs_Time
  2. safeguard: PR #571 (Update intel compile to Intel2022) modified how gsi.x creates sub-directories. This update did not consider the case in which the directory to be created already exists.

This issue is opened to address both of these points.

RussTreadon-NOAA self-assigned this Sep 18, 2023
RussTreadon-NOAA changed the title from "Check for sub-directory existence before creating" to "Bug fix & safeguard updates" on Sep 19, 2023

RussTreadon-NOAA commented Sep 19, 2023

Obs_Time or Time in netcdf diagnostic files

A check of src/gsi/setup* shows that Time, not Obs_Time, is the variable encoded in netcdf observation diagnostic files for all except two observation classes. The two exceptions are setupaod.f90 and setuprad.f90. Both of these routines use Obs_Time when writing the observation time to netcdf diagnostic files. All other setup routines use Time.

Given this, it seems preferable to replace Obs_Time in setupaod.f90 and setuprad.f90 with Time. Doing so requires that Obs_Time be replaced with Time in src/enkf/readsatobs.f90. References to aod were not found in src/enkf.
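For illustration, a hedged sketch of what the corresponding reader-side change in src/enkf/readsatobs.f90 might look like (the file-handle and array names below are illustrative, mirroring the read_diag.f90 usage quoted later in this thread):

         ! sketch only: the netcdf variable name passed to nc_diag_read_get_var
         ! must match what the setup routine writes
         call nc_diag_read_get_var(ftin, 'Time', Obs_Time)   ! was 'Obs_Time'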

@CoryMartin-NOAA, @EdwardSafford-NOAA, and @andytangborn, would changing Obs_Time in the section of setupaod.f90 below

         call nc_diag_metadata_to_single("Longitude",(cenlon)) ! observation longitude (degrees)
         call nc_diag_metadata_to_single("Obs_Time",(dtime))!-time_offset)) ! observation time (hours relative to analysis time)
         call nc_diag_metadata_to_single("Sol_Zenith_Angle",(pangs)) ! solar zenith angle (degrees)

to

         call nc_diag_metadata_to_single("Longitude",(cenlon)) ! observation longitude (degrees)
         call nc_diag_metadata_to_single("Time",(dtime))!-time_offset)) ! observation time (hours relative to analysis time)
         call nc_diag_metadata_to_single("Sol_Zenith_Angle",(pangs)) ! solar zenith angle (degrees)

cause any problems in monitoring code, workflow, UFO evaluation, etc?

@RussTreadon-NOAA

gsi.x lrun_subdirs option

Subroutine init_directories in src/gsi/obsmod.F90 creates mpi-rank-specific sub-directories for use by gsi.x when the SETUP namelist variable lrun_subdirs = .true.

PR #571 modified init_directories to use the intel portability function MAKEDIRQQ to create the sub-directories. This function fails (rightfully so) when attempting to create a directory which already exists. Existing logic in init_directories notes the failure and gsi.x execution is terminated. This happens when running g-w jobs anal.sh and eobs.sh with gsi.x built from GSI hashes at or after the merge of PR #571 into develop.

If mpi-rank-specific sub-directories already exist, we don't want gsi.x to abort. Instead, execution should continue. To achieve this, a call to the INQUIRE function is added to init_directories. INQUIRE checks for the existence of the mpi rank directories, and MAKEDIRQQ is only executed when the directories do not exist.
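For illustration, here is a minimal, self-contained sketch of this safeguard (not the exact obsmod.F90 code; the directory name and error handling are placeholders). INQUIRE(DIRECTORY=...) and MAKEDIRQQ are Intel Fortran extensions provided by the IFPORT module.

         program check_then_mkdir
           use ifport, only: makedirqq
           implicit none
           character(len=*), parameter :: dirname = 'dir.0000'   ! placeholder mpi-rank sub-directory name
           logical :: direxist, created

           ! only attempt to create the sub-directory when it does not already exist
           inquire(directory=dirname, exist=direxist)
           if (.not. direxist) then
              created = makedirqq(dirname)
              if (.not. created) then
                 write(6,*) 'init_directories: unable to create ', trim(dirname)
                 stop 1    ! GSI would call its own error-exit routine here instead
              end if
           end if
         end program check_then_mkdir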

@RussTreadon-NOAA

Work for this issue will be done in forked branch feature/intel2022_updates.

@RussTreadon-NOAA

@EdwardSafford-NOAA, a closer look at src/gsi finds Obs_Time also in read_diag.f90:

src/gsi/read_diag.f90:  real(r_single), allocatable, dimension(:)  :: Latitude, Longitude, Elevation, Obs_Time, Scan_Position, &
src/gsi/read_diag.f90:            Obs_Time(ndatum),                 Scan_Position(ndatum),                    Sat_Zenith_Angle(ndatum),                      &
src/gsi/read_diag.f90:  call nc_diag_read_get_var(ftin, 'Obs_Time', Obs_Time)
src/gsi/read_diag.f90:    diag_status%all_data_fix(ir)%obstime           = Obs_Time(cdatum)
src/gsi/read_diag.f90:  deallocate(Channel_Index, Latitude, Longitude, Elevation, Obs_Time, Scan_Position,    &

This prompted me to check source code in NOAA-EMC/GSI-Monitor. Obs_Time occurs in multiple copies of read_diag.f90.

Given this, the do-no-harm approach is to replace Time with Obs_Time in src/gsi/setuprad.f90. This allows us to also leave src/enkf/readsatobs.f90 as is.

To keep things simple, @CoryMartin-NOAA and @andytangborn, I can also leave Obs_Time alone in src/gsi/setupaod.f90.

Thoughts? Comments?

RussTreadon-NOAA added a commit to RussTreadon-NOAA/GSI that referenced this issue Sep 19, 2023
@CoryMartin-NOAA

@RussTreadon-NOAA we don't use GSI for AOD in any operational code and don't plan to for GFS/GDAS, but may for RRFS-SD. But since it is self-contained, I see no reason why these can't be made consistent; it shouldn't break anything related to AOD assimilation.

RussTreadon-NOAA added a commit to RussTreadon-NOAA/GSI that referenced this issue Sep 19, 2023
@RussTreadon-NOAA

@CoryMartin-NOAA, thanks for the feedback. I agree. My preference is for consistency across netcdf diagnostic files. I replaced Obs_Time in setupaod.f90 with Time. Done at 4fdbd2f.

RussTreadon-NOAA added a commit to RussTreadon-NOAA/GSI that referenced this issue Sep 21, 2023
@RussTreadon-NOAA

While working on g-w issue #1863, a failure was encountered in ozmon_time.x. The source of the problem was traced to src/gsi/setupoz.f90. Subroutine setupozlay writes an integer for the netcdf ozone diagnostic variable Analysis_Use_Flag. Subroutine setupozlev writes a float for Analysis_Use_Flag, while OznMon source code expects an integer.

A short-term fix is to replace the float Analysis_Use_Flag in setupozlev with an integer. Change committed at 45eff70.
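For illustration, a hedged sketch of the kind of change involved (not the exact setupozlev code), assuming the nc_diag_metadata generic interface resolves a default-integer actual argument; the logical muse flag and the 1/-1 values are illustrative:

         ! write Analysis_Use_Flag as an integer, matching setupozlay and
         ! what the OznMon readers expect, rather than converting to a float
         if (muse(i)) then
            call nc_diag_metadata("Analysis_Use_Flag", 1)
         else
            call nc_diag_metadata("Analysis_Use_Flag", -1)
         endif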

@RussTreadon-NOAA

Cycled tests on Hera, Orion, and WCOSS2

g-w issue #1863 documents the cycling of gsi.x and enkf.x built from feature/intel2022_updates in parallels run on Hera, Orion, and WCOSS2 (Cactus). The parallel was cold started from 2021073106 and ran through 2021080118. The gfs cycle ran at 2021080100. All jobs successfully ran to completion.

@RussTreadon-NOAA

WCOSS2 ctests

Ran ctests on WCOSS2 (Cactus) with the following results:

russ.treadon@clogin04:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/intel2022_updates/build> ctest -j 9
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/intel2022_updates/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  483.04 sec
2/9 Test #5: hwrf_nmm_d3 ......................   Passed  493.64 sec
3/9 Test #7: rrfs_3denvar_glbens ..............   Passed  605.25 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  606.66 sec
5/9 Test #9: global_enkf ......................   Passed  611.33 sec
6/9 Test #6: rtma .............................   Passed  1149.42 sec
7/9 Test #3: global_4denvar ...................   Passed  1445.76 sec
8/9 Test #1: global_3dvar .....................***Failed  1565.02 sec
9/9 Test #2: global_4dvar .....................   Passed  1625.12 sec

78% tests passed, 2 tests failed out of 9

Total Test time (real) = 1625.12 sec

The following tests FAILED:
          1 - global_3dvar (Failed)
          8 - netcdf_fv3_regional (Failed)

The global_3dvar test failed a timing check

The runtime for global_3dvar_loproc_updat is 434.012112 seconds.  This has exceeded maximum allowable threshold time of 412.364447 seconds,
resulting in Failure time-thresh of the regression test.

A check of the gsi.x wall times shows the loproc_updat ran longer than the contrl.

global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 254.987663
global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 265.071814
global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 374.876770
global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 434.012112

Reran the global_3dvar test. This time the test passed.

russ.treadon@clogin04:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/intel2022_updates/build> ctest -R global_3dvar
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/intel2022_updates/build
    Start 1: global_3dvar
1/1 Test #1: global_3dvar .....................   Passed  1502.81 sec

100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1502.93 sec

Apparently there is considerable runtime variability on Cactus. This may be due to system load (filesystem, interconnect, etc.).

The netcdf_fv3_regional test failed due to a memory check

The memory for netcdf_fv3_regional_loproc_updat is 291348 KBs.  This has exceeded maximum allowable memory of 201176 KBs,
resulting in Failure memthresh of the regression test.

This is not a fatal fail.

Orion ctests

Ran ctests on Orion with the following results:

Orion-login-2:/work2/noaa/da/rtreadon/git/gsi/intel2022_updates/build$ ctest -j 9
Test project /work2/noaa/da/rtreadon/git/gsi/intel2022_updates/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............   Passed  482.54 sec
2/9 Test #9: global_enkf ......................   Passed  489.93 sec
3/9 Test #5: hwrf_nmm_d3 ......................   Passed  496.50 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  547.15 sec
5/9 Test #7: rrfs_3denvar_glbens ..............   Passed  605.34 sec
6/9 Test #6: rtma .............................***Failed  1269.48 sec
7/9 Test #3: global_4denvar ...................   Passed  1682.11 sec
8/9 Test #2: global_4dvar .....................   Passed  1861.84 sec
9/9 Test #1: global_3dvar .....................***Failed  1981.78 sec

78% tests passed, 2 tests failed out of 9

Total Test time (real) = 1981.79 sec

The following tests FAILED:
          1 - global_3dvar (Failed)
          6 - rtma (Failed)
Errors while running CTest

The global_3dvar failure is due to a time check

The runtime for global_3dvar_loproc_updat is 632.400759 seconds.  This has exceeded maximum allowable threshold time of 625.180949 seconds,
resulting in Failure time-thresh of the regression test.

A check of the gsi.x wall times shows that the loproc_updat runs longer than the contrl

global_3dvar_hiproc_contrl/stdout:The total amount of wall time                        = 306.361695
global_3dvar_hiproc_updat/stdout:The total amount of wall time                        = 313.146756
global_3dvar_loproc_contrl/stdout:The total amount of wall time                        = 568.346318
global_3dvar_loproc_updat/stdout:The total amount of wall time                        = 632.400759

The rtma test also failed due to a timing check

The runtime for rtma_loproc_updat is 308.126300 seconds.  This has exceeded maximum allowable threshold time of 296.189397 seconds,
resulting in Failure time-thresh of the regression test.

Again we see that the loproc_updat wall time exceeds that of the contrl.

rtma_hiproc_contrl/stdout:The total amount of wall time                        = 278.963886
rtma_hiproc_updat/stdout:The total amount of wall time                        = 264.492583
rtma_loproc_contrl/stdout:The total amount of wall time                        = 269.263089
rtma_loproc_updat/stdout:The total amount of wall time                        = 308.126300

Reran the rtma test. This time the test passed.

Orion-login-2:/work2/noaa/da/rtreadon/git/gsi/intel2022_updates/build$ ctest -R rtma
Test project /work2/noaa/da/rtreadon/git/gsi/intel2022_updates/build
    Start 6: rtma
1/1 Test #6: rtma .............................   Passed  1209.61 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1209.63 sec

Apparently there is considerable runtime variability on Orion. This may be due to system load (filesystem, interconnect, etc.).

Ctests on WCOSS2 and Orion show acceptable behavior.

@RussTreadon-NOAA

Hera ctests

Ran ctests on Hera with the following results:

Hera(hfe01):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/intel2022_updates/build$ ctest -j 9
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/intel2022_updates/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #7: rrfs_3denvar_glbens ..............   Passed  733.27 sec
2/9 Test #9: global_enkf ......................   Passed  1490.65 sec
3/9 Test #8: netcdf_fv3_regional ..............   Passed  4929.17 sec
4/9 Test #3: global_4denvar ...................   Passed  5450.31 sec
5/9 Test #5: hwrf_nmm_d3 ......................   Passed  5780.40 sec
6/9 Test #1: global_3dvar .....................   Passed  6170.57 sec
7/9 Test #2: global_4dvar .....................   Passed  6366.06 sec
8/9 Test #4: hwrf_nmm_d2 ......................***Failed  6372.81 sec
9/9 Test #6: rtma .............................   Passed  10642.88 sec

89% tests passed, 1 tests failed out of 9

Total Test time (real) = 10642.97 sec

The following tests FAILED:
          4 - hwrf_nmm_d2 (Failed)

The hwrf_nmm_d2 failure is due to the scalability check

The case has Failed the scalability test.
The slope for the update (16.086432 seconds per node) is less than that for the control (16.591666 seconds per node).

A check of the gsi.x wall times does not reveal any anomalous behavior

hwrf_nmm_d2_hiproc_contrl/stdout:The total amount of wall time                        = 76.143124
hwrf_nmm_d2_hiproc_updat/stdout:The total amount of wall time                        = 80.511887
hwrf_nmm_d2_loproc_contrl/stdout:The total amount of wall time                        = 92.734790
hwrf_nmm_d2_loproc_updat/stdout:The total amount of wall time                        = 91.236175

This is not a fatal fail.

CoryMartin-NOAA pushed a commit that referenced this issue Sep 25, 2023
**Description**
This PR fixes two types of bugs discovered when cycling `gsi.x` and
`enkf.x` with intel/2022 in the global workflow:
1. modify variables written to netcdf diagnostic files by `gsi.x` to be
consistent with codes which read netcdf diagnostic files
2. modify `lrun_subdirs=.true.` option of `gsi.x` to properly handle the
case in which sub-directories already exist in the run directory

Fixes #623

**Type of change**
- [x] Bug fix (non-breaking change which fixes an issue)

**How Has This Been Tested?**
Ctests have been run on Hera, Orion, and WCOSS2 (Cactus) with acceptable
behavior. A global parallel covering the period 2021073106 through
2021080118 has been run on Hera, Orion, and WCOSS2 (Cactus). All global
workflow jobs ran as expected.
  
**Checklist**
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] New and existing tests pass with my changes