Github issue #37: fix for broken generate_ens=T option #283

Closed
wants to merge 9 commits into from

@jswhit2 jswhit2 commented Jan 7, 2022

supersedes PR#42

@MichaelLueken
Contributor

@jswhit2 While running the regression tests, two test configurations failed:

  • global_4dvar_T62
  • global_lanczos_T62

Both test configurations failed due to segfault with the following output:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
gsi.x              0000000007874CDD  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B8DE15E6630  Unknown               Unknown  Unknown
gsi.x              00000000007CE938  general_sub2grid_        1605  general_sub2grid_mod.f90
gsi.x              00000000007CC70B  general_sub2grid_        1506  general_sub2grid_mod.f90
gsi.x              0000000005AEB6C0  ckgcov_                   158  bkgcov.f90
gsi.x              0000000005CB51F2  control2model_            176  control2model.f90
gsi.x              000000000583B879  adjtest_mp_adtest         117  adjtest.f90
gsi.x              000000000524A73F  sqrtmin_                   84  sqrtmin.f90
gsi.x              00000000039F499F  glbsoi_                   362  glbsoi.f90
gsi.x              0000000000DD6B04  gsisub_                   200  gsisub.F90
gsi.x              000000000041FC69  gsimod_mp_gsimain        2137  gsimod.F90
gsi.x              0000000000408F3B  MAIN__                    631  gsimain.f90
gsi.x              0000000000408D9E  Unknown               Unknown  Unknown
libc-2.17.so       00002B8DE767D555  __libc_start_main     Unknown  Unknown
gsi.x              0000000000408CA9  Unknown               Unknown  Unknown

The changes in this PR won't be able to go out for review and be merged to the authoritative repo until these two configurations are able to run. If you have any questions or would like assistance to track this issue down, please let me know.

@jswhit2
Contributor Author

jswhit2 commented Jan 13, 2022

@MichaelLueken-NOAA I updated the branch with a change that I think should fix it, but for some reason I can't run the regression tests on hera (they all fail).

aerorahul previously approved these changes Feb 24, 2022
Contributor

@aerorahul aerorahul left a comment

I have done a visual review of the branch and found it to be acceptable for merge.

There was one stray , that could/should be removed, as noted in the review.
I have noted some points in the review that will hopefully help in readability and debugging.
I have also noted some commented calls which are no longer necessary.

This PR:

  • fixes a previously broken functionality of generating an ensemble by sampling the static BECM (for the GFS)
  • adds an option to write out the generated ensemble in a netCDF file.

@@ -173,7 +173,7 @@ subroutine control2model(xhat,sval,bval)
! Apply sqrt of variance, as well as vertical & horizontal parts of background
! error

- call ckgcov(xhat%step(jj)%values(:),wbundle,size(xhat%step(jj)%values(:)))
+ call ckgcov(grd_a,,xhat%step(jj)%values(:),wbundle,size(xhat%step(jj)%values(:)))

I think the extra , is an oversight.

Comment on lines 1406 to 1409
!if ( mype == 0 ) then
! write(6,*) 'write_gfsncatm is not adapted to write out perturbations yet'
! iret = 999
!endif

It is safe to remove these comments, no?

yes done

@@ -2256,6 +2261,447 @@ subroutine write_atm_ (grd,sp_a,filename,mype_out,gfs_bundle,ibin)

end subroutine write_atm_

subroutine write_atm_pert_ (grd,sp_a,filename,mype_out,gfs_bundle,ibin)

The messages in this subroutine still refer to write_atm or write_ncatm.
The print statements etc. would be cleaner if they used my_name, which should also be corrected to WRITE_GFSNCATM_PERT.

The code here works. These changes will likely improve readability and help debugging.
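
The suggestion above can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the body of write_atm_pert_: the routine report_status and the message text are assumptions made for the example; only the my_name convention and the value WRITE_GFSNCATM_PERT come from the review.

```fortran
program my_name_demo
  implicit none
  call report_status(0)
contains
  ! Hypothetical stand-in for a routine such as write_atm_pert_.
  subroutine report_status(mype)
    integer, intent(in) :: mype
    ! One place to define the routine name used in every message.
    character(len=*), parameter :: my_name = 'WRITE_GFSNCATM_PERT'
    if (mype == 0) then
      ! The message text is made up for this sketch.
      write(6,'(3a)') '*** ', my_name, ': hypothetical status message'
    end if
  end subroutine report_status
end program my_name_demo
```

With the name held in one parameter, a copied routine only needs that one line changed for its messages to identify the right caller.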

my_name changed to WRITE_GFSNCATM_PERT

! depends on model changes from Jeff Whitaker
nfhour = fhour(1)

atmanl = create_dataset(filename, atmges, &

should this not be atmpert?

fixed

@@ -487,7 +487,8 @@ subroutine prewgt(mype)
end do

! Special case of dssv for qoption=2 and cw
- if (qoption==2) call compute_qvar3d
+ !if (qoption==2) call compute_qvar3d

Is there a reason to keep the commented-out line?
I am not sure whether qoption /= 2 supports calling this routine.

@@ -4017,7 +4051,7 @@ subroutine hybens_localization_setup
if(verbose .and. mype == 0)print_verbose=.true.

! Allocate
- call create_hybens_localization_parameters
+ !call create_hybens_localization_parameters

This is now called in glbsoi during the initialization.
It can be removed from here, no?

yes done

@RussTreadon-NOAA
Contributor

@jswhit2 , this PR dates back to early 2022. Do we still want the changes in this PR to be merged into NOAA-EMC/GSI develop? If so, feature/generate_ens needs to be updated to the current head of develop.

@RussTreadon-NOAA
Contributor

@jswhit2 , this PR dates back to January 2022. What is the status of this PR? I assume the changes in this PR are still needed in the GSI. Is this true?

@jswhit
Contributor

jswhit commented Nov 16, 2022

@RussTreadon-NOAA if we want the generate_ens namelist parameter to actually work, then this should remain open. I can update the branch to the latest develop so more testing can be done to make sure this fix doesn't break anything else.

@RussTreadon-NOAA
Contributor

Thank you, @jswhit2 , for the quick reply. Yes, please update your branch to the head of develop and ensure the updated branch works as intended. I'll add myself as a reviewer.

@RussTreadon-NOAA RussTreadon-NOAA requested review from RussTreadon-NOAA and removed request for MichaelLueken November 16, 2022 14:52
@jswhit
Contributor

jswhit commented Nov 16, 2022

now up-to-date with GSI develop

@RussTreadon-NOAA
Contributor

@jswhit2 , merger of PR #460 into develop generated a conflict with this PR (#283). Would you please update jswhit2:feature/generate_ens to the current head, 3a4484d4. I can then review and merge this PR into develop.

@jswhit
Contributor

jswhit commented Nov 22, 2022

> @jswhit2 , merger of PR #460 into develop generated a conflict with this PR (#283). Would you please update jswhit2:feature/generate_ens to the current head, 3a4484d4. I can then review and merge this PR into develop.

@RussTreadon-NOAA I have updated jswhit2:feature/generate_ens to the current NOAA-EMC/GSI/develop.

@RussTreadon-NOAA
Contributor

> > @jswhit2 , merger of PR #460 into develop generated a conflict with this PR (#283). Would you please update jswhit2:feature/generate_ens to the current head, 3a4484d4. I can then review and merge this PR into develop.
>
> @RussTreadon-NOAA I have updated jswhit2:feature/generate_ens to the current NOAA-EMC/GSI/develop.

Thank you, @jswhit2 . I'll review the changes and run regression tests.

@RussTreadon-NOAA
Contributor

@jswhit2 , I ran the standard suite of 19 regression tests on Hera. 5 of the 19 tests fail:

 1/19 Test  #2: [=[global_T62_ozonly]=] ..........   Passed  1143.39 sec
 2/19 Test #11: [=[nmm_netcdf]=] .................   Passed  2703.56 sec
 3/19 Test  #8: [=[arw_netcdf]=] .................   Passed  2822.11 sec
 4/19 Test  #3: [=[global_4dvar_T62]=] ...........***Failed  2822.67 sec
 5/19 Test  #9: [=[arw_binary]=] .................   Passed  2823.46 sec
 6/19 Test #16: [=[netcdf_fv3_regional]=] ........   Passed  2883.63 sec
 7/19 Test #19: [=[global_enkf_T62]=] ............   Passed  2887.03 sec
 8/19 Test  #4: [=[global_4denvar_T126]=] ........***Failed  3003.32 sec
 9/19 Test #18: [=[global_C96_fv3aerorad]=] ......   Passed  9970.32 sec
10/19 Test #13: [=[hwrf_nmm_d2]=] ................   Passed  14894.46 sec
11/19 Test #14: [=[hwrf_nmm_d3]=] ................   Passed  14903.48 sec
12/19 Test #17: [=[global_C96_fv3aero]=] .........***Failed  23594.32 sec
13/19 Test #10: [=[nmm_binary]=] .................   Passed  24264.77 sec
14/19 Test  #7: [=[global_lanczos_T62]=] .........***Failed  25091.28 sec
15/19 Test  #6: [=[global_fv3_4denvar_C192]=] ....***Failed  25872.51 sec
16/19 Test #15: [=[rtma]=] .......................   Passed  25881.83 sec
17/19 Test #12: [=[nmmb_nems_4denvar]=] ..........   Passed  26133.85 sec
18/19 Test  #5: [=[global_fv3_4denvar_T126]=] ....   Passed  40036.92 sec
19/19 Test  #1: [=[global_T62]=] .................   Passed  44178.34 sec

74% tests passed, 5 tests failed out of 19

Total Test time (real) = 44178.39 sec

The following tests FAILED:
          3 - [=[global_4dvar_T62]=] (Failed)
          4 - [=[global_4denvar_T126]=] (Failed)
          6 - [=[global_fv3_4denvar_C192]=] (Failed)
          7 - [=[global_lanczos_T62]=] (Failed)
         17 - [=[global_C96_fv3aero]=] (Failed)
Errors while running CTest

I individually reran the failed tests. 4 of the 5 tests still fail. Job run directories for all jobs are in /scratch1/NCEPDEV/stmp2/Russ.Treadon/_scratch1_NCEPDEV_da_Russ.Treadon_git_gsi_pr283_build/. The sub-directories of the failed runs are:

  • tmpreg_global_4dvar_T62

    • non-reproducible results between updat hiproc and loproc. Neither reproduces contrl
    • different sqrtmin: Initial gradient norm in updat hiproc and loproc stdout
  • tmpreg_global_fv3_4denvar_C192

    • exceeded maximum allowable threshold time
global_fv3_4denvar_C192_hiproc_contrl/stdout:The total amount of wall time                        = 458.221288
global_fv3_4denvar_C192_hiproc_updat/stdout:The total amount of wall time                        = 465.680441
global_fv3_4denvar_C192_loproc_contrl/stdout:The total amount of wall time                        = 634.784478
global_fv3_4denvar_C192_loproc_updat/stdout:The total amount of wall time                        = 710.690504
  • tmpreg_global_lanczos_T62

    • non-reproducible results between updat hiproc and loproc. Neither reproduces contrl.
    • different sqrtmin: Initial gradient norm in updat hiproc and loproc stdout
  • tmpreg_global_C96_fv3aero

    • exceeded maximum allowable threshold time
global_C96_fv3aero_hiproc_contrl/stdout:The total amount of wall time                        = 29.196846
global_C96_fv3aero_hiproc_updat/stdout:The total amount of wall time                        = 33.172050
global_C96_fv3aero_loproc_contrl/stdout:The total amount of wall time                        = 52.918691
global_C96_fv3aero_loproc_updat/stdout:The total amount of wall time                        = 67.770020
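
The wall times listed above can be compared as updat/contrl ratios. The short program below is only a sketch of that arithmetic using the numbers reported in this comment; the threshold the regression suite actually applies is not stated here, so none is assumed, and the labels and variable names are made up for the example.

```fortran
program walltime_ratio
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Labels are illustrative; the times are copied from the stdout lines above.
  character(len=32), parameter :: label(4) = [character(len=32) :: &
       'global_fv3_4denvar_C192 hiproc', &
       'global_fv3_4denvar_C192 loproc', &
       'global_C96_fv3aero hiproc', &
       'global_C96_fv3aero loproc']
  real(dp), parameter :: contrl(4) = &
       [458.221288_dp, 634.784478_dp, 29.196846_dp, 52.918691_dp]
  real(dp), parameter :: updat(4) = &
       [465.680441_dp, 710.690504_dp, 33.172050_dp, 67.770020_dp]
  integer :: i

  ! A ratio above 1 means the updat executable ran slower than contrl.
  do i = 1, size(label)
    write(*,'(a32,f8.3)') label(i), updat(i)/contrl(i)
  end do
end program walltime_ratio
```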

The global_fv3_4denvar_C192 and global_C96_fv3aero failures are not fatal. The other 2 failures are due to non-reproducible results and should be examined more closely.

The contrl is the current head of develop 3a4484d. updat is the head of your branch feature/generate_ens at 7f44526.

@RussTreadon-NOAA
Contributor

generate_ens test

Build and run feature/generate_ens gsi.x with the following settings in namelist HYBRID_ENSEMBLE

 GENERATE_ENS = T,
 WRITE_GENERATED_ENS = F,

Jobs submitted on WCOSS2 and Hera. gsi.x ran to completion on both machines.

Rerun with

 GENERATE_ENS = T,
 WRITE_GENERATED_ENS = T,

Jobs submitted on WCOSS2 and Hera. gsi.x seg faulted on both machines. Debugging thus far has not identified specifically where the failure occurs in the ens_pert write.
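
For readers unfamiliar with how these switches reach the code, the following is a minimal sketch of reading the two logicals from a namelist group. It is not the GSI reader: the real HYBRID_ENSEMBLE namelist has many more entries, the defaults below are placeholders rather than GSI's defaults, and an internal file stands in for the namelist file that gsi.x reads.

```fortran
program read_hybrid_ensemble
  implicit none
  ! Placeholder defaults for this sketch only.
  logical :: generate_ens = .false.
  logical :: write_generated_ens = .false.
  namelist /hybrid_ensemble/ generate_ens, write_generated_ens
  character(len=80) :: nml_text
  integer :: ios

  ! Internal-file stand-in for the namelist file.
  nml_text = '&hybrid_ensemble generate_ens=T, write_generated_ens=T /'
  read(nml_text, nml=hybrid_ensemble, iostat=ios)
  if (ios /= 0) error stop 'namelist read failed'

  print *, 'generate_ens        = ', generate_ens
  print *, 'write_generated_ens = ', write_generated_ens
end program read_hybrid_ensemble
```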

@jswhit2, have you successfully run gsi.x with WRITE_GENERATED_ENS = T? If so, where is your run script and/or run directory?

@RussTreadon-NOAA
Contributor

generate_ens test follow up

Problem found to be with the length of the character variable containing the name of the output ensemble perturbation file.

  1. cplr_gfs_ensmod.f90 declares character(len=70) :: filename.
  2. netcdfgfs_io.f90 declares character(len=24), intent(in) :: filename ! file to open and write to

For my runs the name of the output filename was ./ensemble_data/sigf06_ens_pertXXX. This string is 34 characters long. Thus, netcdfgfs_io.f90 was attempting to repeatedly write to ./ensemble_data/sigf06_e.

As a test, increase the length of filename in netcdfgfs_io.f90 to 70. Recompile and rerun. Job with

GENERATE_ENS = T,
WRITE_GENERATED_ENS = T

successfully ran to completion.
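
The truncation described above can be reproduced in isolation. In the sketch below, only the two declared lengths (70 in the caller, 24 in the dummy argument) and the file name come from this comment; the routine open_for_write is a hypothetical stand-in for the writer in netcdfgfs_io.f90. Fortran associates a fixed-length default character dummy with the leftmost characters of a longer actual argument, so the routine only ever sees the first 24 characters.

```fortran
program filename_truncation
  implicit none
  character(len=70) :: filename   ! caller-side length, as in cplr_gfs_ensmod.f90

  filename = './ensemble_data/sigf06_ens_pertXXX'
  print '(a,i0)', 'significant characters in caller: ', len_trim(filename)
  call open_for_write(filename)
contains
  ! Hypothetical stand-in for the writer; only the len=24 declaration
  ! mirrors netcdfgfs_io.f90.
  subroutine open_for_write(fname)
    character(len=24), intent(in) :: fname
    ! fname holds just the leftmost 24 characters of the actual argument.
    print '(3a)', 'name seen by writer: "', fname, '"'
  end subroutine open_for_write
end program filename_truncation
```

Because every ensemble member name shares the same first 24 characters, the member suffix never reaches the writer, which matches the repeated writes to ./ensemble_data/sigf06_e noted above.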


type(sub2grid_info), intent(in) :: grd
type(spec_vars), intent(in) :: sp_a
character(len=24), intent(in) :: filename ! file to open and write to

Increase to at least len=70 to be consistent with size of filename in cplr_gfs_ensmod.f90. Is there a way to move away from a fixed character length? It's possible len=70 could fail.
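
One possible way to avoid a fixed length is an assumed-length dummy argument, character(len=*), which takes whatever length the caller passes. The sketch below is an assumption about how that could look, not a tested change to netcdfgfs_io.f90; open_for_write is a hypothetical stand-in routine.

```fortran
program assumed_length_filename
  implicit none
  character(len=70) :: filename

  filename = './ensemble_data/sigf06_ens_pertXXX'
  ! trim() drops the trailing blanks so only the significant part is passed.
  call open_for_write(trim(filename))
contains
  ! Hypothetical stand-in; the dummy adopts the length of the actual argument.
  subroutine open_for_write(fname)
    character(len=*), intent(in) :: fname
    print '(a,i0,3a)', 'received ', len(fname), ' characters: "', fname, '"'
  end subroutine open_for_write
end program assumed_length_filename
```

With this form the callee no longer truncates, so the only remaining limit is the length declared in the caller.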

@RussTreadon-NOAA RussTreadon-NOAA left a comment

19 regression tests run on Hera. Two cases fail with non-reproducible results.

  1. global_4dvar_T62
  2. global_lanczos_T62

It would be good to examine these cases and determine why results are not reproducible.

The declared length of filename in netcdfgfs_io.f90 needs to be increased; len=24 was not sufficient in a test case with WRITE_GENERATED_ENS = T.

@RussTreadon-NOAA
Contributor

@jswhit2 , this PR has two remaining issues:

  1. please increase the length of character variable filename in netcdfgfs_io.f90 - see comments above for details
  2. feature/generate_ens gsi.x does not reproduce analysis results from develop gsi.x for regression tests global_4dvar_T62 and global_lanczos_T62. Both these tests use the Lanczos solver.

Issue 1 is easily resolved. Resolving issue 2 may prove challenging.

@RussTreadon-NOAA
Contributor

@jswhit2 , for this PR to move forward we need to address a few items

  • address reviewer comments above
  • update jswhit2:feature/generate_ens to the current head of develop
  • decide if a new ctest should be added to ensure future updates to develop do not break the generate_ens=T option

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Jan 23, 2023

@jswhit2 , just a quick check on the status of this PR. The GSI Review team meets in a few weeks (2/13).

@jswhit
Contributor

jswhit commented Jan 24, 2023

@RussTreadon-NOAA this has fallen down in the priority queue for me, so I don't think I will be able to revisit this before the next meeting

@RussTreadon-NOAA
Contributor

@jswhit2 , thank you for your reply. With your permission I can commit changes to your branch which

  • increase the length of character variable filename in netcdfgfs_io.f90
  • bring your branch up to date with the head of NOAA-EMC/GSI develop.

I'll try to look at non-reproducibility when using lanczos as time permits.

@RussTreadon-NOAA
Contributor

@jswhit2 , I have a working copy of your forked master up to date with the current head of NOAA-EMC/GSI develop. If you like, I can push this to your fork and get feature/generate_ens up to date with develop. After this netcdfgfs_io.f90 can be updated and ctests rerun to see how the lanczos minimization behaves. Shall I proceed?

@jswhit
Contributor

jswhit commented Jan 24, 2023

@RussTreadon-NOAA if you have time to do that I would very much appreciate it.

@RussTreadon-NOAA
Contributor

> @RussTreadon-NOAA if you have time to do that I would very much appreciate it.

Thanks @jswhit2 for the green light. Unfortunately, I got the following when I tried to push the updated master to your fork:

Hera(hfe11):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr283$ git status
On branch master
Your branch is ahead of 'origin/master' by 9 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
Hera(hfe11):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr283$ git push origin master
Total 0 (delta 0), reused 0 (delta 0)
To https://github.com/jswhit2/GSI.git
 ! [remote rejected]   master -> master (permission denied)
error: failed to push some refs to 'https://github.com/jswhit2/GSI.git'

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Feb 9, 2023
@RussTreadon-NOAA
Contributor

With the transition to JEDI for UFS DA this PR is closed.

Hash 7f44526, the current head of branch jswhit2:feature/generate_ens, is recorded here for reference. If developers want to exercise the generate_ens=.true. option they may do so from this hash.

As noted above setting write_generated_ens=.true. when generate_ens=.true. will likely result in a gsi.x segmentation fault due to character variable filename in netcdfgfs_io.f90 being declared with too small a length.

Development

Successfully merging this pull request may close these issues.

generate_ens=T (in &hybrid_ensemble) causes segfault
5 participants