v16.3 DA pre-implementation parallel #776

Closed
lgannoaa opened this issue May 10, 2022 · 72 comments
@lgannoaa commented May 10, 2022

Description

This issue documents the v16.3 DA pre-implementation parallel.

The initial tasking email sent by @arunchawla-NOAA indicated that Emily and Andrew summarized the following on May 9, 2022:

  1. All v16.3 implementation code, script, and fix file updates have been merged into the GSI master, except the IR bug fix from Jim and Andrew (a simple change that is already in a PR)
  2. The WCOSS-2 porting branch was merged into GSI master about 3 hours ago!
  3. We will ask Mike to create a gfsda.v16.3 as soon as (1) is merged --- this should be done by the end of this week

First full cycle starting CDATE is retro 2021101600
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: da-dev16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/da-dev16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/da-dev16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/da-dev16-ecf
METPlus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/da-dev16-ecf/fits
Verification Web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/da-dev16-ecf
(Updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/da-dev16-ecf

FIT2OBS:
/lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5
df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)

obsproc:
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2
83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)

prepobs
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1
5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)

HOMEMET
/apps/ops/para/libs/intel/19.1.3.304/met/9.1.3

METplus
/apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1

verif_global
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd
1aabae3aa (HEAD, tag: verif_global_v2.9.4)

Requirements

A meeting has been set up to discuss the action summary for package preparation.

Acceptance Criteria (Definition of Done)

Dependencies

@lgannoaa lgannoaa self-assigned this May 10, 2022
@lgannoaa

A meeting hosted by @aerorahul and joined by @arunchawla-NOAA on May 10th outlined the following actions regarding this issue:

  1. @KateFriedman-NOAA will arrange the dump archive transfer from WCOSS1 to WCOSS2 and enable the parallel to run in retro mode (e.g., making changes to switch input between the EMC dump archive and NCO real-time data).
  2. @KateFriedman-NOAA will try to set up and test an end-to-end retro parallel using the rocoto workflow on WCOSS2.
  3. Lin will test retro mode in ecflow using the dump archive after step 1 above is completed.
  4. We will get more information from DA at a meeting scheduled for May 11th on the details of the science changes in the DA package.

@lgannoaa commented May 13, 2022

  • A planning meeting between EIB (@aerorahul @lgannoaa @KateFriedman-NOAA) and DA (@emilyhcliu) on May 11th outlined the following:
  • The gfsda v16.3 changes include science changes that require updates to four tags.
  • A summary Google spreadsheet (GFSv16.3 Implementation) can be used to identify the tags and their change status.

https://docs.google.com/spreadsheets/d/1fVa-yqGmxqwCrruq73agRKbXnpAPVzNOlvnu-pp1sn0/edit#gid=0

  • The retro and real-time parallels were suggested to run on WCOSS2 using ecflow because the Global Workflow dev-v16 package is not fully available on other platforms.
  • EIB is currently waiting for preliminary versions of all four tags to be created.
  • Once the tags are ready, EIB will run a one-week test parallel.
  • DA (Emily) will review the output and get the final version of all changes ready.
  • DA (Emily) will update tags as needed.
  • EIB will start the retro parallel with SDATE=20211015.
  • DA (Emily) will provide bias correction files as needed to ensure the parallel starts in the correct state.
  • The official pre-implementation parallel will then start.

@lgannoaa

Setup of the DA retro ecflow workflow parallel on WCOSS2 started on 5/12. This activity was delayed from 5/12 to 5/23 due to WCOSS2 RFCs, the production switch, system issues, etc.
A few issues have been resolved or worked around:

  • EMC dump file directory structure and naming (adding the atmos subdirectory and using symlinks to an NCO-approved, look-alike prefix directory location).
  • Worked with the NCO/network/WCOSS2 helpdesk to adjust a Cactus ecflow network ping timeout issue.
  • Modified the cleanup configuration due to a WCOSS2 quota change.

@lgannoaa commented Jun 1, 2022

The new GSI package as of June 1st is 047b5da with submodule 99f147c. This version has three files and one directory changed in the build process:
global_gsi.x ---> gsi.x
global_enkf.x ---> enkf.x
ncdiag_cat.x ---> nc_diag_cat.x
$LINK ../sorc/gsi.fd/exec/$gsiexe . ---> $LINK ../sorc/gsi.fd/install/bin/$gsiexe .
The above GSI package and modifications are now obsolete.

As of June 10th we have the following on Dogwood for gsi:
release/gfsda.v16.3.0 at 42cc83
git submodule
99f147cc7a55032b329bcd4f738cabd28b129295 fix (fv3da.v1.0.1-159-g99f147c)

@MichaelLueken

The new GSI package as of June 1st is 047b5da with submodule 99f147c. This version has three files and one directory changed in the build process:
global_gsi.x ---> gsi.x
global_enkf.x ---> enkf.x
ncdiag_cat.x ---> nc_diag_cat.x
$LINK ../sorc/gsi.fd/exec/$gsiexe . ---> $LINK ../sorc/gsi.fd/install/bin/$gsiexe .

Hi @lgannoaa - There is one minor correction here - instead of
ncdiag_cat.x ---> nc_diag_cat.x
it should be
ncdiag_cat.x ---> ncdiag_cat_serial.x

Additionally, given changes in the develop branch that have never made it into the release branch for GFS DA components previously, the .v1.0.0, .v2.0.0, and .v3.0.0 entries for Minimization_Monitor, Ozone_Monitor, and Radiance_Monitor, respectively, need to be removed.

Also, since fv3gfs_ncio has been removed from the GSI project, the ncio stack module will need to be added to versions/build.ver:

export ncio_ver=1.0.0

and ncio will need to be added to modulefiles/fv3gfs/enkf_chgres_recenter_nc.wcoss2.lua:

load(pathJoin("ncio", os.getenv("ncio_ver")))

Using the stack-built ncio module will also require changes to:

sorc/build_enkf_chgres_recenter_nc.sh - removal of the following lines:

export FV3GFS_NCIO_LIB="${cwd}/gsi.fd/build/lib/libfv3gfs_ncio.a"
export FV3GFS_NCIO_INC="${cwd}/gsi.fd/build/include"

if [ ! -f $FV3GFS_NCIO_LIB ]; then
  echo "BUILD ERROR: missing GSI library file"
  echo "Missing file: $FV3GFS_NCIO_LIB"
  echo "Please build the GSI first (build_gsi.sh)"
  echo "EXITING..."
  exit 1
fi

sorc/enkf_chgres_recenter_nc.fd/input_data.f90 - replace module_fv3gfs_ncio with module_ncio

sorc/enkf_chgres_recenter_nc.fd/output_data.f90 - replace module_fv3gfs_ncio with module_ncio

sorc/enkf_chgres_recenter_nc.fd/makefile - replace FV3GFS_NCIO_INC entries with NCIO_INC

Finally, for building the GSI, I'd recommend the following changes in sorc/build_gsi.sh:

export GSI_MODE="GFS"
export UTIL_OPTS="-DBUILD_UTIL_ENKF_GFS=ON -DBUILD_UTIL_MON=ON -DBUILD_UTIL_NCIO=ON"

This will build the GSI in global mode (the default is regional, which adds WRF to the build) and will limit the utilities built from all utilities (the default) to just those required within the GFS.

@MichaelLueken

@lgannoaa I missed an update that is required for making sorc/enkf_chgres_recenter_nc.fd/makefile work with the stack's ncio module:

Replacing FV3GFS_NCIO_LIB with NCIO_LIB.

Many thanks to @RussTreadon-NOAA for bringing this to my attention.
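
For reference, a minimal sketch of applying the module and macro renames described in the two comments above (an illustration only, assuming GNU sed is available; each change should still be reviewed by hand):

# hypothetical one-off commands to apply the ncio-related renames
sed -i 's/module_fv3gfs_ncio/module_ncio/g' \
    sorc/enkf_chgres_recenter_nc.fd/input_data.f90 \
    sorc/enkf_chgres_recenter_nc.fd/output_data.f90
sed -i -e 's/FV3GFS_NCIO_INC/NCIO_INC/g' -e 's/FV3GFS_NCIO_LIB/NCIO_LIB/g' \
    sorc/enkf_chgres_recenter_nc.fd/makefile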

@MichaelLueken

Hi @lgannoaa @KateFriedman-NOAA @emilyhcliu, I wanted to ask a question about how you would like to proceed with respect to the renaming of the gsi, enkf, and ncdiag_cat executables. Should I make changes to the GSI/scripts to use the new executables, or will sorc/link_fv3gfs.sh be updated to link these new executable names with the old naming convention? Please let me know your preference so that I can make the necessary changes for release/gfsda.v16.3.0.

@KateFriedman-NOAA

I wanted to ask a question about how you would like to proceed with respect to the renaming of the gsi, enkf, and ncdiag_cat executables. Should I make changes to the GSI/scripts to use the new executables, or will sorc/link_fv3gfs.sh be updated to link these new executable names with the old naming convention? Please let me know your preference so that I can make the necessary changes for release/gfsda.v16.3.0.

@MichaelLueken-NOAA Let's move this discussion back to the main issue for this upgrade: issue #744. This issue is just for documenting the parallel. @lgannoaa Please keep workflow changes and non-parallel setup discussions in issue #744. Thanks!

@MichaelLueken-NOAA Please summarize the GSI executable name changes that are occurring in a new comment in #744 and then tag Emily, Rahul, Lin, and myself to discuss. Thanks!

@lgannoaa commented Jun 9, 2022

@arunchawla-NOAA @aerorahul @emilyhcliu @MichaelLueken-NOAA @RussTreadon-NOAA @KateFriedman-NOAA
This parallel is set up on Dogwood. Please note the following configuration information.
I will have a meeting with Emily to review the package before starting up the run. The meeting is scheduled for June 10th.

HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: da-dev16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/da-dev16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/da-dev16-ecf/para/com/output/prod/today

@lgannoaa

A meeting with Emily on June 10th outlined the following actions:

  • I will do a three-cycle test run with the current configuration.
  • Emily will review the output afterward.
  • There are still a few changes needed before this parallel can start:
    1. ICs need to be modified (this will be done by Emily later).
    2. Software packages such as fv3 and post need to be checked by Emily to ensure the correct versions are checked out.
    3. The run configuration in EXPDIR, such as DELTIM (the current default for the EnKF is DELTIM=240), will need to be reviewed.

@yangfanglin

@emilyhcliu Emily, could you please check with Jun Wang and Helin Wei to make sure the updated forecast model is used in this cycled experiment? Model updates include changes in the LSM for improving snow forecasts and in the UPP for fixing the cloud ceiling calculation. (@junwang-noaa @HelinWei-NOAA @WenMeng-NOAA)

@KateFriedman-NOAA

@lgannoaa I'm working with Helin to get a new GLDAS tag ready (which includes a small update needed for the "atmos" subfolder added to the GDA; the PRCP CPC gauge file path needed it added too). The current GLDAS tag in the release/gfs.v16.3.0 branch will work for GDA dates prior to the WCOSS2 go-live but not for dates after go-live (when the new "atmos" subfolder is added).

I'm also working to wrap up Fit2Obs testing and try to get a new tag for your use on WCOSS2 ASAP.

@junwang-noaa

@HelinWei-NOAA I do not see the snow updates PR to the ufs-weather-model production/GFS.v16 branch. Would you please make one if the code updates are ready? Thanks

@HelinWei-NOAA

@junwang-noaa I created one on fv3atm

@lgannoaa

In HOMEgfs/parm/config, config.resources.nco.static and config.fv3.nco.static have been used to fix the eupd job card issue that caused it to fail. The Global Workflow emc.dyn versions of those configs aren't yet updated to run high-res on WCOSS2.

@lgannoaa commented Jun 13, 2022

@emilyhcliu Please review the first three full cycles of output from this parallel.
There are seven and a half cycles of a test run available for review, located in /lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf/para/com/gfs/v16.3. The first half cycle is 20211015 18Z, followed by seven completed cycles.

@lgannoaa

A meeting with @emilyhcliu on June 14th indicated that a few more short cycled tests are required for package adjustments before the official implementation parallel can start.
The meeting also outlined the following action items that must happen before the next cycled test run:

  1. CRTM v2.4.0 - Waiting for NCO to approve and install the new version of CRTM v2.4.0 on WCOSS2. This version is required to start the official parallel.
  2. Namelist change - A namelist change in the configuration file is required before the official parallel can start.
  3. Script changes - The GSI branch will have more updates/commits prior to starting the official parallel.
  4. Modify three IC files - Three files in the initial conditions need to be modified for this parallel.

@lgannoaa

A meeting with @emilyhcliu @aerorahul @KateFriedman-NOAA on June 15th outlined the following:
New gdas/gfs prep jobs will be installed in ecflow, similar to those running in the real-time rocoto workflow, to run the EMC gdas/gfs prep jobs that create prepbufr files.
The ecflow workflow will run internal METplus and gplots jobs to create the verification web page.
Emily will check with the DA team on the UPP, UFS model, and VPPGB points of contact, how many post output hours are needed for this parallel, and whether any external partners or downstream users will be evaluating this parallel.
A change to the hard-wired TCVITALS path assignment will be made so that this parallel uses the TCVITALS file from the EMC dump archive.

@lgannoaa lgannoaa changed the title v16.3 da pre-implementation parallel v16.3 DA pre-implementation parallel Jun 16, 2022
@lgannoaa

Information received from Daryl on June 16th outlined the following:

  1. Helin Wei is the POC for physics changes added to V16.3. We do not need hourly output but would like to have 3-hourly 1-deg grib2 files from the 00Z cycle for an extended period of time for evaluation.
  2. Hui-Ya and Yali Mao are the UPP contacts for GFS v16.3. Yali is making several changes to WAFS aviation products, so I assume we'll need hourly data, but I'm not sure for how long.
  3. I think there is still an open question regarding stakeholder engagement and evaluation.

@lgannoaa

Ali Abdolali is now assigned as WAVE point of contact.

@lgannoaa commented Jun 24, 2022

As of noon on June 24th, here is the state of this parallel:

  • I ran test cycles to check and modify the configuration as needed.
  • The pending configuration changes are CRTM and DELTIM. We are waiting for management to make a final decision.
  • A test cycle run for 2021101718 is now in review. The changes are summarized below:
  • New ecflow workflow jobs for EMC gdas prep and EMC gfs prep are included. prepbufr output can be found in:
    /lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf/para/com/obsproc/v1.0/
  • Non-zero-sized tcvitals files are located in COM.
  • gfs minmon, gdas minmon, gdas radmon, and gdas oznmon stat files are copied into:
    /lfs/h2/emc/global/noscrub/$USER/monitor/

@lgannoaa

A meeting with CYCLONE_TRACKER code manager Jiayi was held on June 24th. Jobs and output in $COM/$RUN.$PDY/$cyc/atmos/epac and natl were checked. These jobs ran successfully.

@lgannoaa

Emily checked the new gdas/gfs prep jobs and the jobs using their output on June 24th. They are working.

@lgannoaa commented Jun 29, 2022

As of EOB June 28th, there are three incoming changes:

  • CRTM v2.4.0
  • gfsda.v16.3 GSI PR 412
  • parm/config.anl GW PR 876

A new cycled test run started on June 29th to test a few days with DELETE_COM_IN_ARCHIVE_JOB="YES".
This test will be used to review the EMC developer verification jobs, FIT2OBS, METplus, gplots, and the EMC para-check jobs.

@lgannoaa commented Jul 5, 2022

Switched verif-global.fd to the verif_global_v2.9.5 tag on July 5th.
Added the METplus off-line driver to HOMEgfs/ecf/scripts/workflow_manager/scripts.

@lgannoaa commented Jul 5, 2022

A meeting with the code managers for verif-global and FIT2OBS on July 5th indicated no issues were found in a 10-day test run.

@emilyhcliu commented Jul 5, 2022

Status of DA package

  • gfsda.v16.3.0 PR #412 --- merged
  • global-workflow PR #876 - in parm/config.anl --- merged
  • task assignment in DA group - create GSI observation monitoring plots and website for pre-implementation parallel
  • Installation of CRTM v2.4.0 on WCOSS-2 by NCO --- waiting for NCO to install
  • Restart files for 20211015 18Z (initialization of satbias-related files)

@lgannoaa commented Jul 14, 2022

On July 14th, Cactus had a system degradation issue and NCO halted the Cactus system. This brought the parallel to a halt and left transfer jobs incomplete.
Action:
Rerun jobs as needed.
DELETE_COM_IN_ARCHIVE_JOB="NO" is in place until the parallel runs smoothly.
2021122412 gfs arch incomplete; rerun as needed.
2021122500 gfs arch incomplete; rerun as needed.
2021102512 efcs jobs were found to be unstable due to system errors; the jobs have been rerun.
DELETE_COM_IN_ARCHIVE_JOB="YES" was set after the reruns completed.
The 2021102518 eupd was impacted by a system error and became a zombie job; a rerun is in place.
Cactus is still unstable; therefore, DELETE_COM_IN_ARCHIVE_JOB="NO" is set until the parallel runs smoothly.
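
For reference, a minimal sketch of how this switch is toggled for the experiment (assuming it lives in the EXPDIR configuration, as in the standard global-workflow setup; an illustration, not the exact file used here):

# hypothetical excerpt from the experiment configuration in EXPDIR
# keep COM output on disk until archive/transfer jobs are verified,
# then restore "YES" so the cleanup job can clear COM again
export DELETE_COM_IN_ARCHIVE_JOB="NO"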

example of failed eupd (zombie) job from system issue:
enkfgdas_update_06.o8587654
sed: can't read /var/spool/pbs/aux/8587654.cbqs01: No such file or directory
sed: can't read /var/spool/pbs/aux/8587654.cbqs01: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
grep: /tmp/qstat.8587654: No such file or directory
000 - nid001054 : Job 8587654.cbqs01 - DEBUG-DMESG: Unable to find NFS stats file: /tmp/nfsstats.8587654.cbqs01
000 - nid001054 Job 8587654.cbqs01 - DEBUG-DMESG: Unable to find Mount stats file: /tmp/mntstats.begin.8587654.cbqs01

@arunchawla-NOAA @dtkleist @emilyhcliu @aerorahul
The Cactus system degradation issue remains. This parallel proceeding very slow. Many jobs need to be rerun. Many jobs become zombie. I will execute touch commend to renew all files in ptmp by noon on July 15th to ensure files in COM not deleted.

Execute touch on COM.
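
A minimal sketch of the kind of touch sweep used for this (the path is from this parallel; the exact command is an assumption):

# refresh modification times so the PTMP auto-scrubber does not remove active COM files
find /lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf -type f -exec touch {} +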

@lgannoaa

Set DELETE_COM_IN_ARCHIVE_JOB="YES" after the 20211025 rerun completed.
HPSS transfers remain slow, causing archive jobs to hit the wall clock limit and need to be rerun. The parallel is still proceeding slowly.

@lgannoaa commented Jul 16, 2022

Due to the extreme slowness of the HPSS transfer rate, which caused archive job failures, a redesign of the archive job is in progress. The prototype was tested on July 15th for a single cycle and provided much higher performance compared to the original archive job design.
Due to the system degradation issue on July 15th, the METplus jobs (part of the verification jobs) are taking too long to finish. Jobs that hit the wall clock need to be rerun. It may be better to run METplus in offline mode.
These two modifications will be in testing starting Monday, July 18th, and will be put in place once testing completes satisfactorily.

Cactus performance over the weekend of July 16th and 17th improved a little. HPSS transfers still remain slow. The plan to include the above two changes is still ongoing.

Starting with 2021103000, a two-cycle test of the newly designed archive jobs and the offline METplus cron task is in place.
DELETE_COM_IN_ARCHIVE_JOB="NO" is set for the duration of this test.
Performance improvement status: the original archive jobs for 20211029 18Z and 20211028 18Z took 17 hours.
-rw-r--r-- 1 lin.gan emc 2.5M Jul 17 18:35 gfs_emc_arch_18.o8694471
-rw-r--r-- 1 lin.gan emc 2.6M Jul 18 11:18 gfs_emc_arch_18.o8711811

As of 10:00 AM on July 19th, the test of the above two changes was in place for CDATE 2021103000 ~ 2021103106. The initial test result was good; however, the WCOSS2 HPSS transfer slowness in the afternoon caused many test jobs to fail. The parallel is on halt at CDATE 2021103106. The COM size is 91 TB, waiting for the archive jobs to succeed so that COM cleanup can resume and parallel jobs can proceed.

As of July 20th, the newly designed archive jobs have been running without issue. The HPSS transfer speed became slow again in the afternoon and into the evening. This time the parallel had advanced too far and had to be halted to wait for the archive jobs to finish; COM usage was too high, and the archive jobs need to finish before the cleanup jobs can clear COM.

At 3:00 PM EST on July 21st, a check on the archive jobs showed they had all caught up with the parallel. Therefore, the next time the parallel resumes, at CDATE 2021110212, the COM cleanup job will be enabled.

Tag: @arunchawla-NOAA @dtkleist @emilyhcliu @aerorahul for your awareness

@lgannoaa

@emilyhcliu HPSS has had system errors for the past few hours. Many archive jobs failed with error status=141. The parallel is on halt as of 10:00 PM July 18th until the HPSS system issue is resolved.

@lgannoaa commented Jul 19, 2022

We do not have access to Cactus from 11-15Z on July 19th, and again from 20-00Z, due to system upgrades and tests.
NCO updated the notice to extend the outage until 18Z on July 19th.

@lgannoaa commented Jul 20, 2022

Transfer speed has increased. The parallel resumed at CDATE=2021103118. The newly designed archive and cleanup jobs are now in place, and the METplus jobs are now running offline.
The new standalone METplus jobs ran and generated their first output. The stat files were generated, and the gplots website shows plots for the new day.

Transfer speed became slow again on 7/20 in the afternoon and into the evening. The parallel halted at 2021110206, waiting for the archive jobs to finish transferring.

All pending transfer jobs have completed. The parallel will resume when Cactus is returned to developers.

@lgannoaa commented Jul 21, 2022

Cactus will not be available on 7/21 (Thu), 7/22 (Fri), and 7/25 (Mon). There will be a production switch on 7/28 (Thu). These events will impact the parallel.

@lgannoaa

On the evening of July 21st, two jobs ran into a system issue and failed: the 2021110300 eupd and gfs fcst.
Error message: launch RPC: Couldn't allocate a port for PMI to use. The eupd rerun succeeded.

@lgannoaa commented Jul 22, 2022

On July 22nd, @emilyhcliu indicated a need to start a second parallel at CDATE=2022062000. Preparation is pending NCO approval to run a parallel on the production machine.
If it is approved, then after the production switch (on July 28th) the current retro parallel will stay on Cactus (which becomes the production machine), and a new ecflow real-time parallel will be set up and run on Dogwood (which becomes the development machine). These two parallels will stay on their current machines for the remaining time, regardless of production switches.

Tag: @arunchawla-NOAA @dtkleist @emilyhcliu @aerorahul @junwang-noaa

@lgannoaa commented Jul 23, 2022

On the evening of July 22nd, transfer speeds remained slow.
It appears Cactus has now implemented memory enforcement, in which any transfer job that uses more memory than requested is killed by the system. This impacted some of the archive jobs in the parallel. I modified the memory requirement and reran the jobs (see the sketch below).
The parallel is on halt until the archive jobs from the prior cycle finish so COM can be cleaned up. Current GROUP EMC usage of PTMP is at 95%.
Transfers remained slow on July 23rd; the parallel stayed on halt.
Transfer speed increased on the morning of July 24th, and the parallel continues to progress.
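
For illustration, the kind of job-card change applied to the archive/transfer jobs (the values shown are assumptions, not the exact numbers used):

# hypothetical PBS job-card excerpt for a transfer/archive job
# raise the memory request so the job survives memory enforcement
#PBS -l select=1:mpiprocs=1:ncpus=1:mem=10GB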

@lgannoaa commented Jul 24, 2022

From the evening of July 23rd into July 24th, FIT2OBS failed on every cycle starting at CDATE=2021110518. The code manager was contacted for assistance.

The following FIT2OBS job logs shown HDF error in reading gfs forecast output atmf nc files:
FITS.da-dev16-ecf.2021110518.180291: NetCDF: HDF error
FITS.da-dev16-ecf.2021110600.237555: NetCDF: HDF error
FITS.da-dev16-ecf.2021110606.152150: NetCDF: HDF error
FITS.da-dev16-ecf.2021110618.62068: NetCDF: HDF error
FITS.da-dev16-ecf.2021110700.126613: NetCDF: HDF error

The following FIT2OBS job logs show a segmentation fault:
FITS.da-dev16-ecf.2021110518.180291:nid001012.cactus.wcoss2.ncep.noaa.gov: rank 2 exited with code 174
FITS.da-dev16-ecf.2021110600.237555:nid001285.cactus.wcoss2.ncep.noaa.gov: rank 1 exited with code 174
FITS.da-dev16-ecf.2021110612.204246:nid001038.cactus.wcoss2.ncep.noaa.gov: rank 1 exited with code 174

The fit2obs output from CDATE 2021110518 ~ 2021111000 is missing due to the job failures noted above.
@emilyhcliu - What should we do if there is no way to recreate the missing fit2obs files?

Due to the need to test the fit2obs failure issue, the fit2obs package is now switched to /lfs/h2/emc/global/noscrub/lin.gan/git/Fit2Obs/newm.1.5 starting at CDATE=2021110906.

On July 25th, @jack-woollen suggested rerunning the job with increased memory or exclusive node use. I modified the ecflow suite to make this change; however, the job failure issue still remains. The parallel resumed at CDATE=2021110906.
Tag @emilyhcliu @jack-woollen

@lgannoaa

@emilyhcliu @MichaelLueken-NOAA
Jobs using GSI executables sometimes fail with RETURN CODE 127, for example eupd (enkf.x) and analysis (gsi.x). The issue does not recur after rerunning the same jobs.
You can see a list of failed jobs on Cactus: /lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf/para/com/output/prod/today/gsi_err_127.log

@MichaelLueken commented Jul 26, 2022

@lgannoaa
Looking in the files for the failed runs, I see the following error message:

launch RPC: Couldn't allocate a port for PMI to use

The stdout files are empty and this is the only error message, and I'm not sure what it means. Is this an issue with allocating resources through ecflow?

@MichaelLueken-NOAA
Please let me know the certified resource requirements for the failed job above.
The gfs analysis at CDATE=2021111100 failed with the same issue. Both the failed job and its rerun requested the same resources:
#PBS -l select=55:mpiprocs=15:ompthreads=8:ncpus=120
#PBS -l place=vscatter:exclhost
mpiexec -l -n 825 -ppn 15 --cpu-bind depth --depth 8 ...gsi.x
gfs_atmos_analysis_00.o9078921
DATA: /lfs/h2/emc/stmp/lin.gan/RUNDIRS/da-dev16-ecf/2021111100/gfs_atmos_analysis_00.9078921.cbqs01
This job failed with: launch RPC: Couldn't allocate a port for PMI to use
gfs_atmos_analysis_00.o9078937
This job is a rerun from the failure above.

@lgannoaa

@emilyhcliu @junwang-noaa
In the GFS v16.3 parallel log directory (/lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf/para/com/output/prod/today), two jobs using global_fv3gfs.x failed with "launch RPC: Couldn't allocate a port for PMI to use". Both jobs completed after being rerun. I want to bring this to your attention.

The efcs group 36 job on 2021110918 (PDY=20211109). Both the failed job and its rerun requested the same resources:
#PBS -l select=4:mpiprocs=128:ompthreads=1:ncpus=128
#PBS -l place=vscatter:exclhost
mpiexec -l -n 512 -ppn 128 --cpu-bind depth --depth 1 ...global_fv3gfs.x
enkfgdas_fcst_36_18.o9058893
This job failed with: launch RPC: Couldn't allocate a port for PMI to use
enkfgdas_fcst_36_18.o9058919
This job is a rerun from the failure above. It completed.

The gfs forecast at CDATE=2021111100. Both the failed job and its rerun requested the same resources:
#PBS -l select=112:mpiprocs=24:ompthreads=5:ncpus=120
#PBS -l place=vscatter:exclhost
mpiexec -l -n 2688 -ppn 24 --cpu-bind depth --depth 5 ...global_fv3gfs.x
gfs_forecast_00.o9079588
This job failed with: launch RPC: Couldn't allocate a port for PMI to use
gfs_forecast_00.o9079598
This job is a rerun from the failure above.

@MichaelLueken

@MichaelLueken-NOAA Please let me know the certified resource requirements for the failed job above. The gfs analysis at CDATE=2021111100 failed with the same issue. Both the failed job and its rerun requested the same resources:
#PBS -l select=55:mpiprocs=15:ompthreads=8:ncpus=120
#PBS -l place=vscatter:exclhost
mpiexec -l -n 825 -ppn 15 --cpu-bind depth --depth 8 ...gsi.x
gfs_atmos_analysis_00.o9078921
DATA: /lfs/h2/emc/stmp/lin.gan/RUNDIRS/da-dev16-ecf/2021111100/gfs_atmos_analysis_00.9078921.cbqs01
This job failed with: launch RPC: Couldn't allocate a port for PMI to use
gfs_atmos_analysis_00.o9078937
This job is a rerun from the failure above.

@lgannoaa
I have no idea what the certified resource requirement for the failed job is. I don't run the global-workflow j-jobs and scripts. All I know is that the jobs failed with the launch RPC: Couldn't allocate a port for PMI to use error message. That is all I can assist with.

@SMoorthi-emc commented Jul 26, 2022 via email

@KateFriedman-NOAA

@lgannoaa If you haven't already, I suggest opening a WCOSS2 helpdesk ticket about this error; it looks like a machine problem, particularly because reruns are successful. GDIT can take a look at the nodes that the failed jobs ran on.

@aerorahul

@SMoorthi-emc Thank you for providing a data point.
@lgannoaa Please open an issue with WCOSS2 helpdesk and cc Steven.Earle.

@SMoorthi-emc commented Jul 26, 2022 via email

@lgannoaa

Transfer speeds slowed down again in the evening of July 26th. PTMP is at 90%. The parallel paused at CDATE=2021111300, waiting for the transfer jobs to catch up.

@lgannoaa commented Jul 27, 2022

The real-time parallel is in preparation. The configuration:
pslot: rt-gfsv163-ecf
SDATE=2022061918
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
symlink location (NCO look-alike):
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0
com: /lfs/h2/emc/ptmp/lin.gan/rt-gfsv163-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/lin.gan/rt-gfsv163-ecf/para/com/output/prod/today
HPSS: /NCEPDEV/emc-global/1year/Lin.Gan/WCOSS2/202205da
gplots: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/rt-gfsv163-ecf/

The DATA directory limit is 12 TB, which requires a self-cleanup process (see the sketch below).
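
A minimal sketch of the kind of scrubber that could satisfy this limit (illustrative only; the RUNDIRS path and the 2-day retention are assumptions, not the actual cleanup used in the suite):

# hypothetical scrubber: remove run directories older than 2 days to stay under the 12 TB DATA quota
find /lfs/h2/emc/stmp/lin.gan/RUNDIRS/rt-gfsv163-ecf -maxdepth 1 -mindepth 1 -type d -mtime +2 -exec rm -rf {} +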

@KateFriedman-NOAA commented Jul 27, 2022

@lgannoaa For the real-time parallel, are you planning to source dump/obs files from production com or the global dump archive?
@KateFriedman-NOAA In the beginning (CDATE=2022061918) it will be retro. When it catches up to real time, it will continue using the EMC global dump archive location.

@lgannoaa commented Jul 28, 2022

Production switch on July 28th. This retro parallel is on halt at CDATE=2021111506. All archive jobs completed.

NCO indicated the system auto-scrub will remain on for PTMP on all systems. The COM directory /lfs/h2/emc/ptmp/lin.gan/da-dev16-ecf will be touched; the first touch will be executed at the end of July 28th.

@lgannoaa commented Aug 1, 2022

NCO announced on Aug 1st that Dogwood will have two days of outages, on Aug 2nd and Aug 4th.

@lgannoaa commented Aug 1, 2022

Emily requested that the real-time parallel be put on hold for an NSST-related issue (#449) until pull request #448 is in place.

@emilyhcliu

Status update from DA - issues, diagnostics, solutions, and moving forward
Background
The gfs.v16.3.0 retrospective parallel started from 2021101518z on Cactus. So far we have about 3-4 weeks of results. The overall forecast skill shows degradation in the NH. The DA team investigated possible causes and solutions. The run configured and maintained by @lgannoaa has been very helpful for the DA team in spotting a couple of issues in the gfsda.v16.3.0 package.

Issues, diagnostics, bug fixes, and tests
(1) An initialization problem for satellite bias correction coefficients was found for sensors with coefficients initialized from zero. The quasi-mode initialization procedure was skipped due to a bug merged from GSI develop into gfs.v16.3.0.

The issue and diagnostics are documented in NOAA-EMC/GSI#438
The bug fix is provided in NOAA-EMC/GSI#439
The bug fix had been merged into gfsda.v16.3.0

A short gfs.v16.3.0 parallel test (v163t) was performed to verify the bug fix

(2) Increasing NSST biases and RMS of O-F (no bias) are observed in the time series of AVHRR MetOp-B channel 3 and the window channels from hyperspectral sensors (IASI, CrIS). The foundation temperature bias and RMS compared to the operational GFS and OSTIA increase with time. It was found that the NSST increment file from the GSI was not being passed into the global cycle properly.

The issue and diagnostics in detail are documented in NOAA-EMC/GSI#449

The bug fix is documented in NOAA-EMC/GSI#448

Test
A short gfs.v16.3.0 real-time parallel (starting from 2022061918z; v163ctl) with the bug fixes from (1) and (2) is currently running on Dogwood to verify the bug fixes.

We will keep this running for a few days....

Here is the link to the Verification page: https://www.emc.ncep.noaa.gov/gc_wmb/eliu/v163ctl/

We should stop the retrospective parallel on Cactus and re-run it with the bug fixes.

@lgannoaa commented Aug 2, 2022

Closing this issue. It is replaced going forward with:
A new issue #951 for the GFS v16.3 retro parallel for implementation
A new issue #952 for the GFS v16.3 realtime parallel for implementation
Tag: @emilyhcliu @dtkleist @junwang-noaa @aerorahul

@lgannoaa lgannoaa closed this as completed Aug 2, 2022