Copy FV3 restarts and bugfix on COM_MED_RESTART_PREV #2534

Closed
aerorahul wants to merge 3 commits


aerorahul
Contributor

Description

This PR fixes 2 issues:

Relevant for #2524
model_configure has an entry restart_interval. In the past, an entry such as 6 -1 would translate into restarts written at a frequency of 6h and at the end of the forecast. This has been replaced by an explicit list of the forecast hours at which restarts are required.
In the case of gdas, this means that if restarts are needed at 6h and at the end of the forecast, the appropriate entry is restart_interval = 6 9
To achieve this, forecast_predet.sh calculates the FV3_RESTART_FH list based on restart_interval (a frequency), and this array is used to copy restarts back to COM from DATArestart.
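
As a rough illustration of the frequency-to-list conversion described above, here is a minimal sketch. The function name restart_hours and its two arguments are hypothetical, and always appending the final forecast hour is an assumption made for this sketch; the actual logic in forecast_predet.sh may differ.

```shell
# Hypothetical helper (not the code in forecast_predet.sh): expand a restart
# frequency into an explicit, space-separated list of forecast hours.
restart_hours() (
  freq="$1"    # restart frequency in hours, e.g. 6
  fhmax="$2"   # forecast length in hours, e.g. 9

  # Guard against a non-positive frequency, which would loop forever.
  [ "${freq}" -gt 0 ] || return 1

  hours=""
  fh="${freq}"
  while [ "${fh}" -lt "${fhmax}" ]; do
    hours="${hours}${fh} "
    fh=$((fh + freq))
  done

  # Assumption for this sketch: the end of the forecast is always included.
  echo "${hours}${fhmax}"
)

# Usage: build the value that would go into restart_interval for a 9h gdas forecast.
restart_hours 6 9
```

With a frequency of 6 and a 9h forecast, this yields the "6 9" entry mentioned above.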

Type of change

  • Bug fix
  • Maintenance

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

Ran a standalone first half cycle from the UFS DA ci test

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

@aerorahul aerorahul added the CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS label Apr 25, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA added the CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion label Apr 25, 2024
@RussTreadon-NOAA
Contributor

Install aerorahul:bugfix/issue_2524 on Hera and run C96C48_ufs_hybatmDA CI.

Half-cycle forecasts for 2024022318 fail because coupler.res needs to be updated as described in #2524. The enkfgdas ratminc.nc files for mem001 and mem002 also need to be added. After these edits and additions, the half-cycle forecasts complete.

Unfortunately, the gdas, gfs, and enkfgdas forecasts for 2024022400 all fail for the same reason:

 0: INPUT/coupler.res: date_init=2024   2  23  18   0   0
 0: INPUT/coupler.res: date     =2024   2  24   0   0   0
 0: fcst_initialize ERROR: date_init /= date_init_res
 0:                        date_init     = 2024   2  24   0   0   0
 0:                        date_init_res = 2024   2  23  18   0   0

JEDI ATM CI runs with DOIAU=NO. Log files are in /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prtest/logs/2024022400

Repeat the above test using C96_atm3DVar. Half-cycle for 2021122018 ran to completion. The 2021122100 gdas and enkfgdas forecasts run to completion. The C96_atm3DVar runs with DOIAU=YES.

The 2021122100 gfs forecast immediately aborts upon execution

+ exglobal_forecast.sh[153]: srun -l --export=ALL -n 40 /scratch1/NCEPDEV/stmp2/Russ.Treadon/RUNDIRS/prtest_gsi/gfsfcst.2021122100/fcst.1657314/ufs_model.x
 0:
 0:
 0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
 0:      PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
 0:      STARTING DATE-TIME  APR 25,2024  13:52:00.492  116  THU   2460426
 0:
 0:
 0: MPI Library = Intel(R) MPI Library 2021.5 for Linux* OS
 0:
 0: MPI Version = 3.1
 6: Abort(1) on node 6 (rank 6 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 6
20: Abort(1) on node 20 (rank 20 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 20

The gfsfcst was rewound twice and rebooted. All three attempts fail in the same manner. The log file is /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/prtest_gsi/logs/2021122100/gfsfcst.log

Could the JEDI ATM CI forecast failures be due to the fact that JEDI ATM CI runs with DOIAU=NO?

@aerorahul aerorahul removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS labels Apr 25, 2024
@aerorahul
Contributor Author

Thanks @RussTreadon-NOAA
The issue here is the inconsistency in the times in the coupler.res file when DOIAU=NO. I think we should do either:

  1. not have an INPUT/coupler.res when DOIAU=NO, or
  2. have an INPUT/coupler.res generated in the workflow with the model start time == model current time.

The mismatch between these two times is the cause of the failure in the 00z cycle in your case. It arises because we are copying restarts from COM_ATMOS_RESTART_PREV. We should copy all restarts, but create the coupler.res based on the configurations and times. This will become tricky when doing restarts from failure and segments. I'll think some more.
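
To make the condition in option 2 concrete, here is a minimal sketch of a pre-flight check. It assumes the three-line coupler.res layout (calendar type, model start time, current model time) and compares the first six fields of lines 2 and 3. The function name coupler_res_times_match is hypothetical and is not part of the workflow; the check only makes sense for the DOIAU=NO case discussed here.

```shell
# Hypothetical check (not part of the workflow): succeed only when the
# "model start time" (line 2) equals the "current model time" (line 3).
coupler_res_times_match() (
  file="$1"

  # Adding 0 normalizes zero-padded fields, so "02" and "2" compare equal.
  start="$(awk 'NR==2 {print $1+0, $2+0, $3+0, $4+0, $5+0, $6+0}' "${file}")"
  current="$(awk 'NR==3 {print $1+0, $2+0, $3+0, $4+0, $5+0, $6+0}' "${file}")"

  # An unreadable or truncated file leaves ${start} empty and counts as a mismatch.
  [ -n "${start}" ] && [ "${start}" = "${current}" ]
)

# Usage, with DATA set by the caller:
#   coupler_res_times_match "${DATA}/INPUT/coupler.res" || echo "start/current time mismatch"
```

A file carrying start time 2024 2 23 18 and current time 2024 2 24 0, as in the log above, would fail this check.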

@emcbot emcbot added the CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion label Apr 25, 2024
@RussTreadon-NOAA
Contributor

@aerorahul , I modified a local copy of forecast_postdet.sh to implement your option 2.

When DOIAU=NO the modified script creates a new coupler.res file in the run directory

--- a/ush/forecast_postdet.sh
+++ b/ush/forecast_postdet.sh
@@ -99,6 +99,15 @@ FV3_postdet() {
 EOF
       fi

+      if [[ "${DOIAU}" == "NO" ]]; then
+        rm -f "${DATA}/INPUT/coupler.res"
+        cat >> "${DATA}/INPUT/coupler.res" << EOF
+        3        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
+        ${current_cycle:0:4}  ${current_cycle:4:2}  ${current_cycle:6:2}  ${current_cycle:8:2}  0  0        Model start time: year, month, day, hour, minute, second
+        ${current_cycle:0:4}  ${current_cycle:4:2}  ${current_cycle:6:2}  ${current_cycle:8:2}  0  0        Current model time: year, month, day, hour, minute, second
+EOF
+      fi
+
       # Create a array of increment files
       local inc_files inc_file iaufhrs iaufhr
       if [[ "${DOIAU}" == "YES" ]]; then

With the above change in forecast_postdet.sh the JEDI ATM CI 2024022318 gdas and enkfgdas forecasts ran to completion. The same comment applies to the 2024022400 gdas, enkfgdas, and gfs forecasts.

Not sure if this is the modification you had in mind. You mention that option 2 may prove problematic when doing restarts from failures and segments. Thus, I do not advocate the above modification to forecast_postdet.sh as the solution. It's a data point on the path to a robust solution.

Ideally the model would write a coupler.res appropriate to the way in which the model is being run.

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion and removed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Apr 25, 2024
@emcbot emcbot added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion labels Apr 25, 2024
@emcbot

emcbot commented Apr 25, 2024

Build FAILED on Orion with error logs:

/work2/noaa/stmp/CI/ORION/2534/gfs/sorc/logs/build_gdas.log

Follow link here to view the contents of the above file(s): (link)

@RussTreadon-NOAA
Contributor

A check of /work2/noaa/stmp/CI/ORION/2534/gfs/sorc/logs/build_gdas.log shows

CMake Error at gdas/test/CMakeLists.txt:9 (file):
  file COPY cannot find
  "/work2/noaa/da/cmartin/CI/GDASApp/data/test//gdasapp-fix-bda76f.tgz":
  Permission denied.


CMake Error: Problem with archive_read_open_file(): Failed to open '/work2/noaa/stmp/CI/ORION/2534/gfs/sorc/gdas.cd/build/gdas/test/gdasapp-fix-bda76f.tgz'
CMake Error at gdas/test/CMakeLists.txt:23 (file):
  file failed to extract:
  /work2/noaa/stmp/CI/ORION/2534/gfs/sorc/gdas.cd/build/gdas/test/gdasapp-fix-bda76f.tgz

While the file in question has user, group, and global read permission

Orion-login-4:~$ ls -l /work2/noaa/da/cmartin/CI/GDASApp/data/test/
total 23528
-rw-r--r-- 1 cmartin da 24092163 Apr 16 20:20 gdasapp-fix-bda76f.tgz

the data directory under /work2/noaa/da/cmartin/CI/GDASApp/ is only accessible to user and group

Orion-login-4:~$ ls -l /work2/noaa/da/cmartin/CI/GDASApp
total 4
drwxr-s--- 7 cmartin da 4096 Apr 16 20:20 data

@CoryMartin-NOAA , can we chmod /work2/noaa/da/cmartin/CI/GDASApp/data so that it is readable by everyone?
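
For reference, a file that is world-readable can still be unreachable if any parent directory lacks "other" search (execute) permission. A minimal sketch of how to find the blocking directory follows. The function name first_blocked_dir and its optional second argument are hypothetical; the check looks only at the mode string printed by ls -ld, so it ignores ACLs and group membership.

```shell
# Hypothetical helper: print the topmost parent directory of a path that
# lacks "other" execute (search) permission. The walk stops at the optional
# second argument (default: /). Returns non-zero if nothing is blocked.
first_blocked_dir() (
  dir="$(dirname "$1")"
  root="${2:-/}"
  blocked=""

  while [ "${dir}" != "${root}" ] && [ "${dir}" != "/" ] && [ "${dir}" != "." ]; do
    # ls -ld prints the mode string first, e.g. drwxr-s---
    mode="$(ls -ld "${dir}" | awk '{print $1}')"
    case "${mode}" in
      # The 10th character is the "other" execute bit (x, or t with sticky).
      ?????????[xt]*) ;;
      *) blocked="${dir}" ;;
    esac
    dir="$(dirname "${dir}")"
  done

  [ -n "${blocked}" ] && echo "${blocked}"
)

# Usage:
#   first_blocked_dir /work2/noaa/da/cmartin/CI/GDASApp/data/test/gdasapp-fix-bda76f.tgz
```

Given the listing above, this would be expected to report the data directory (drwxr-s---), which something like chmod o+rx on that directory would address.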

@TerrenceMcGuinness-NOAA
Collaborator

Build FAILED on Orion with error logs:

/work2/noaa/stmp/CI/ORION/2534/gfs/sorc/logs/build_gdas.log

Follow link here to view the contents of the above file(s): (link)

@CoryMartin-NOAA
Contributor

@RussTreadon-NOAA will do thanks

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Apr 25, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA added CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Apr 25, 2024
@emcbot emcbot added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Apr 25, 2024
@emcbot emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Apr 25, 2024
@emcbot

emcbot commented Apr 25, 2024

Experiment C96C48_hybatmDA FAILED on Orion
in /work2/noaa/stmp/CI/ORION/2534/RUNTESTS/C96C48_hybatmDA_a4c00e35

@emcbot

emcbot commented Apr 25, 2024

Experiment C96_atm3DVar FAILED on Orion with error logs:

/work2/noaa/stmp/CI/ORION/2534/RUNTESTS/COMROOT/C96_atm3DVar_a4c00e35/logs/2021122100/gfsfcst.log

Follow link here to view the contents of the above file(s): (link)

@emcbot

emcbot commented Apr 25, 2024

Experiment C96_atm3DVar FAILED on Orion
in /work2/noaa/stmp/CI/ORION/2534/RUNTESTS/C96_atm3DVar_a4c00e35

@aerorahul aerorahul removed the CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed label Apr 26, 2024
@aerorahul
Contributor Author

I will close this PR and reopen it when I have the tests running successfully.

@aerorahul
Contributor Author

@RussTreadon-NOAA
All the changes from this PR should now be in develop, except for the change in config.base.

@RussTreadon-NOAA
Contributor

Thank you @aerorahul . Tests of C96C48_ufs_hybatmDA last night and this morning on Hera and Orion run OK with the following exceptions

  • need to update setup_expt.py to link ensemble increments for warm_start
  • 2024022400 gfsfcst aborts on ufs_model.x start

The ufs_model.x abort was also observed in the 2021122100 gfsfcst for C96C48_hybatmDA CI.

@aerorahul
Contributor Author

Thanks @RussTreadon-NOAA

@RussTreadon-NOAA
Contributor

gfsfcst also failed this morning on Orion. PR #2553 has been opened to add ensemble analysis increment links to setup_expt.py.


Successfully merging this pull request may close these issues.

  • COM_MED_RESTART_PREV is not defined
  • C96C48_ufs_hybatmDA breaks with 3b208124
6 participants