GFSv16.3.8 - add debug flag to resolve wave post job runtime issues #1843

Closed
KateFriedman-NOAA opened this issue Sep 8, 2023 · 2 comments
Labels: production update, triage

KateFriedman-NOAA commented Sep 8, 2023

Description

The wave_post_bndpnt job walltime is extended in production. (9/11/23 update) An ldebug flag was added to the PBS statements in the wave post ecf scripts to resolve long runtimes in those jobs. The walltimes were extended at first; once the debug flag was added, the runtimes came back down and the walltimes were reverted. This is a temporary measure while the cause is determined and resolved. Plan to revert walltime change eventually.

Initial email from NCO:

Andrew
We schedule the GFS changes on coming Monday 9/11 at 1430z. It will be GFS v16.3.9 (from original para 
gfs v16.3.8) since we implemented an ARFC v16.3.8 (based on prod v16.3.7) on 9/6 to add ldebug in
wave_post*pnt ecf scripts to resolve the long runtime issue.

Please let me know if you have any questions or need more information.

Thanks,

/Simon
SPA Office

Mentions of issue from SDM logs:
9/5 log:

CONTINUED...GFS WAVE WALLTIME EXTENDED/FAILURES
NWPS PROD JOB FAILURES

2a. 1104Z - SOS Fred extended the wallclock of 06Z
gfs_wave_post_bndpnt and gfs_wave_post_bndpntbll

2b. 1258Z -
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed
due to the missing 06Z GFS data.

2c. 1346Z - GFS job finished. Fred reran the nwps prep job to
completion.

2d. 1710Z - Same for 12Z jgfs_wave_post_bndpntbll.

2e. 1808Z - Fred extended the wallclock of 12Z jgfs_wave_postpnt
job.

2f. 2318Z - SOS Houmin reported that aborted:
/prod/primary/18/gfs/v16.3/gfs/wave/post/jgfs_wave_post_bndpnt
failed due to walltime limit. Reran with increased time. (RJS)

2g. 0056Z - Houmin reports job
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed as
waiting gfs post jobs.  Houmin will rerun when the gfs wave
completes. (RJS)

2h. 0247Z - SOS Kevin reports that the NWPS rerun completed.
The 18Z gfs wave is still running. 0158Z - GFS wave job
completed. (RJS)

2i. 0658Z - Job
/prod/primary/cron/nwps/v1.4/regions/ER/gyx/jnwps_prep failed
due to missing 00Z GFS wave data.  0741Z - SOS Kevin reports
that the nwps prep job rerun completed.  0825Z - SS Kevin
reports that the 00Z GFS wave post job runs completed. (RLR)

9/6 log:

CONTINUED...GFS WAVE WALLTIME EXTENDED

3a. 1212Z - SOS Ying noted the gfs_wave_post_bndpnt job was
running long and wall time for the job has been extended. 1340Z
- Complete.

3b. 2105Z - SPA Simon reports testing continues and requested
permission to increase the wallclock of the jobs that are
failing to allow ops to run smoother overnight. Approved. (KAL)

9/11 email from Simon:

The "ldebug" was added into gfs_wave post*pnt* ecf scripts to resolve the long runtime issue.
Per GDIT, it only remounts /apps; lustre debugging actions are skipped. Since the
"ldebug" option works, we have used the original walltime for these gfs_wave post*pnt* ecf
scripts in prod gfs.v16.3.8 and v16.3.9.
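
To make the scope of the `post*pnt*` shorthand concrete, here is a minimal sketch that applies it as a shell-style pattern to the job names mentioned in this issue. The full pattern `jgfs_wave_post*pnt*` and the extra name `jgfs_wave_postsbs` are assumptions added for illustration; the email itself only gives the shorthand.

```python
from fnmatch import fnmatchcase

# Job names mentioned in this issue, plus one extra name that is NOT taken
# from the email and is included only to show a name the pattern skips.
JOB_NAMES = [
    "jgfs_wave_post_bndpnt",
    "jgfs_wave_post_bndpntbll",
    "jgfs_wave_postpnt",
    "jgfs_wave_postsbs",  # assumption, for contrast only
]

# Assumed full form of the "post*pnt*" shorthand used in the email.
PATTERN = "jgfs_wave_post*pnt*"


def matching_jobs(names, pattern=PATTERN):
    """Return the names that match the shell-style pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    for name in matching_jobs(JOB_NAMES):
        print(name)
```

Under that assumed pattern, the three point-output jobs named in the SDM logs match, and a job name without `pnt` after `post` does not.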

Target version

v16.3.8

Expected workflow changes

Walltimes in relevant ecf scripts.
Add additional debug PBS statements into wave post ecf scripts.
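
A rough way to confirm both changes in a given ecf script is to read its PBS header and report the walltime and whether the debug statement is present. The sketch below assumes the statement has the form `#PBS -l debug=true`; this issue only refers to it as `ldebug`, so treat that string, the helper name `inspect_ecf_header`, and the sample header as hypothetical.

```python
import re

# Assumption: the debug statement is a PBS resource line of this form.
# The issue only calls it "ldebug"; the exact text is not quoted there.
ASSUMED_DEBUG_DIRECTIVE = "#PBS -l debug=true"

# Matches a walltime request such as "#PBS -l walltime=01:00:00".
WALLTIME_RE = re.compile(
    r"^#PBS\s+-l\s+walltime=(\d{2}:\d{2}:\d{2})\s*$", re.MULTILINE
)


def inspect_ecf_header(text, debug_directive=ASSUMED_DEBUG_DIRECTIVE):
    """Return (walltime or None, True if the debug directive is present)."""
    match = WALLTIME_RE.search(text)
    walltime = match.group(1) if match else None
    has_debug = any(line.strip() == debug_directive for line in text.splitlines())
    return walltime, has_debug


# Made-up header for illustration; values are placeholders, not production settings.
SAMPLE_HEADER = """\
#PBS -N example_wave_post_bndpnt
#PBS -l walltime=01:00:00
#PBS -l debug=true
"""

if __name__ == "__main__":
    print(inspect_ecf_header(SAMPLE_HEADER))
```

Passing a different `debug_directive` string lets the same check be reused if the actual statement in the ecf scripts differs from the assumed one.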

FYI @JessicaMeixner-NOAA

KateFriedman-NOAA added the production update and triage labels Sep 8, 2023
KateFriedman-NOAA self-assigned this Sep 8, 2023
KateFriedman-NOAA commented

Created release branch off of dev/gfs.v16 branch for this ARFC (release/gfs.v16.3.8).

KateFriedman-NOAA changed the title from "GFSv16.3.8 - extend wave_post_bndpnt job walltime" to "GFSv16.3.8 - add debug flag to resolve wave post job runtime issues" Sep 11, 2023
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Sep 11, 2023
The wave post jobs were running long in production.
GDIT added the `ldebug` PBS statements to the wave post
ecf scripts to resolve the issue.

This change became GFSv16.3.8 in operations.

Refs NOAA-EMC#1843
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Sep 11, 2023
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Sep 11, 2023
Update run.ver to new v16.3.8 GFS version.

Refs NOAA-EMC#1843
KateFriedman-NOAA added a commit that referenced this issue Sep 11, 2023
ARFC to add debug flag to resolve wave post runtimes

* The wave post jobs were running long in production. GDIT added the `ldebug` PBS statements to the wave post ecf scripts to resolve the issue.
* Updates version to v16.3.8

Refs #1843
KateFriedman-NOAA commented

I will open a separate issue to make a similar change for the wave post jobs in the develop branch. Completing this issue.
