
Port to S4 #649

Closed · wants to merge 55 commits

Conversation

@DavidHuber-NOAA (Contributor) commented Feb 10, 2022

Description

This ports the global workflow to S4, completing #138. S4 is capable of supporting experiments up to C384, so C768 and finer resolutions are disabled. Because S4 is a smaller system, with at most 40 nodes available to a user at a given time, config.resources and config.fv3 have been heavily modified to optimize performance.

During testing, it was determined that C48 ensemble forecasts required more memory on all systems, so the number of threads was increased to 2 for that purpose.

Suggested documentation updates:
HPC Helpdesks, add an entry for S4: https://groups.ssec.wisc.edu/groups/S4/s4-user-guides/support-on-s4-cardinal

Restricted data access: a note should be added that restricted data is not allowed on S4.

After 'Restricted data access,' add a subsection titled "Additional considerations for running on S4" with the following information:
S4 can support deterministic/ensemble resolutions of C384/C192, C192/C96, and C96/C48. The C192/C96 resolution works well for testing purposes, while C384/C192 runs should be reserved for experiments that are at the appropriate stage. Any higher-resolution runs (e.g. C768/C384) would require more nodes than S4 can provide and will not run.

Similarly, users should weigh whether the use of the 4DVAR data assimilation (i.e. with ensembles) is necessary for their experiment. For instance, if you are testing the incorporation of a new temperature data source, 3DVAR assimilation may be adequate, while a new wind data source likely necessitates 4DVAR assimilation. If running 4DVAR, a minimum of 40 ensemble members is encouraged, though all 80 may be required to maintain stability when running at C192/C96. If you are running without ensembles, when following the instructions in this guide, be sure to set the variable DOHYBVAR="NO" in the experiment's config.base before running setup_workflow.py.

Lastly, if you are performing data analysis experiments, the GFS forecast may not need to be run. If that is the case, be sure to set --gfs_cyc 0 when running setup_expt.py. If the GFS forecast does need to be run for your experiment, note that GFS data sources (i.e. the global dump archive) are only available for the 00Z cycle, and so setup_expt.py should be called with --gfs_cyc 1.
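To make the two cases above concrete, here is a hypothetical sketch. Only the --gfs_cyc flag and its values come from the text; the helper function and everything else in a real setup_expt.py invocation are placeholders:

```shell
# Hypothetical helper; only the --gfs_cyc flag is taken from the text
# above -- the rest of a real setup_expt.py invocation is elided.
choose_gfs_cyc() {
  # $1 = "yes" if the experiment needs the GFS forecast, else "no"
  if [ "$1" = "yes" ]; then
    echo "--gfs_cyc 1"   # S4's global dump archive only covers the 00Z cycle
  else
    echo "--gfs_cyc 0"   # data-assimilation-only experiment, no GFS forecast
  fi
}
choose_gfs_cyc yes   # prints: --gfs_cyc 1
```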

Prepare initial conditions: add the following information to the introduction
"Initial conditions must be created/obtained from a system that has access to HPSS. If you are working on S4 or Orion, these initial conditions must be transferred from one of those systems. For those working on S4 without access to a system with HPSS access, contact David Huber (david.huber@noaa.gov) with the resolution(s) and cycle(s) that you need and he will generate them for you."

Free-forecast coupled: Add that users wishing to run coupled forecasts on S4 should contact David Huber (david.huber@noaa.gov) with their requests.

Run setup scripts: Add the following commands for S4 to the introduction
for Python: load miniconda/3.8-s4
for rocoto: load rocoto/1.3.2

Also in this section, under the $EXPDIR bullet, add this information: "On S4, users should choose a directory within their home directory."

Set user and experiment settings: Add three variables that may need to be changed in config.base
PARTITION_BATCH (S4 only, may be s4 or ivy)
QUEUE (S4 only, must match PARTITION_BATCH)
FHMAX_GFS_XX (on S4, C384 forecasts should be limited to 168 hours)
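An illustrative sketch of how these three settings might look in config.base. The values are examples only; the per-partition core counts come from the config.fv3.nco.static excerpt discussed later in this PR, and the exact variable name for FHMAX_GFS_XX depends on the cycle hour (XX = 00, 06, 12, or 18):

```shell
# Illustrative config.base settings for S4 (values are examples).
export PARTITION_BATCH="s4"        # S4 only: "s4" (32 cores/node) or "ivy" (20 cores/node)
export QUEUE="$PARTITION_BATCH"    # S4 only: must match PARTITION_BATCH
export FHMAX_GFS_00=168            # S4: limit C384 GFS forecasts to 168 hours
echo "$QUEUE"   # s4
```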

HPCs with access to directly edit your crontab files: add S4

Dependency: #628

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

  • Cycled C96/C48, C192/C96, C384/C192 tests on S4
  • Free-forecast, S2SW test on S4 at C384
  • Cycled test on Orion or Hera
  • Cycled test on WCOSS
  • Free-forecast on Orion or Hera

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

DavidHuber-NOAA and others added 30 commits December 14, 2021 19:45
Updates ufs-weather-model to the 2021 Dec 23 commit and the matching UPP hash. Coupled settings are updated to run the P8a mini prototype.

Updates include:

Turn on ice-albedo feedback in atm (Requires changing input.nml to set use_cice_alb=true in &gfs_physics_nml )
Updates to CA namelists
Updates for NOAH-MP, which require an input.nml update to export iopt_sfc="3" and a parm_fv3diag modification to include pahi, pah_ave, ecan_acc, etran_acc, edir_acc, wa_acc, and lfrac in the grib outputs: "gfs_phys", "pahi", "pahi", "fv3_history2d", "all", .false., "none", 2

Closes NOAA-EMC#525
The incorrect path to config.base was displayed to the user after
the experiment directory was created. The correct paths are now
shown.

Refs: NOAA-EMC#536
@ADCollard (Contributor) commented Feb 15, 2022

This is going to take some debugging, I think. Something is being corrupted but I'm not sure why.

In read_bufrtovs.f90 around line 690 are these lines:
! Read data record. Increment data counter
! TMBR is actually the antenna temperature for most microwave
! sounders.
if (llll == 1) then
   call ufbrep(lnbufr,data1b8,1,nchanl,iret,'TMBR')
else   ! EARS / DB
   call ufbrep(lnbufr,data1b8,1,nchanl,iret,'TMBRST')
   if ( amsua .or. amsub .or. mhs ) then
      data1b8x = data1b8
      data1b4  = data1b8
      call remove_antcorr(sc(instrument)%ac,ifov,data1b4)
      data1b8 = data1b4
      do j=1,nchanl
         if (data1b8x(j) > r1000) data1b8(j) = 1000000._r_kind
      end do
   end if
endif

Can you print out obstype, sc(instrument)%ac%n_fovs and ifov?

That might give a clue as to what is going on.

@RussTreadon-NOAA (Contributor) commented:

gdasanal.log shows

CRTM_FIX=/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/crtm/2.4.0/fix
RTMFIX=/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/crtm/2.4.0/fix

but

GSIEXEC=/work/noaa/nesdis-rdo2/dhuber/gw_port_2_s4/exec/global_gsi.x

is built with crtm/2.3.0.

Module module_base.orion contains

module load crtm/2.3.0
setenv CRTM_FIX /apps/contrib/NCEPLIBS/orion/fix/crtm_v2.3.0
... 
module load upp/10.0.8

While crtm/2.3.0 is initially loaded, upp/10.0.8 unloads crtm/2.3.0 and loads crtm/2.4.0. module show upp/10.0.8 returns

-------------------------------------------------------------------------------------------------------------------------
   /apps/contrib/NCEP/libs/hpc-stack/modulefiles/mpi/intel/2018.4/impi/2018.4/upp/10.0.8.lua:
--------------------------------------------------------------------------------------------------------------------------
help([[]])
conflict("upp")
load("bacio","g2","g2tmpl","ip","sp","w3nco","w3emc","crtm")
prereq("bacio","g2","g2tmpl","ip","sp","w3nco","w3emc","crtm")
setenv("upp_ROOT","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8")
setenv("upp_VERSION","10.0.8")
setenv("NCEPPOST_INC","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8/include")
setenv("NCEPPOST_LIB","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8/lib/libupp.a")
setenv("POST_INC","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8/include")
setenv("POST_LIB","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8/lib/libupp.a")
setenv("UPP_INC","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8/include")
setenv("UPP_LIB","/apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/impi-2018.4/upp/10.0.8/lib/libupp.a")
whatis("Name: upp")
whatis("Version: 10.0.8")
whatis("Category: library")
whatis("Description: upp library")

File amsua_metop-c.SpcCoeff.bin differs between /apps/contrib/NCEPLIBS/orion/fix/crtm_v2.3.0 and /apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/crtm/2.4.0/fix

Orion-login-4[90] rtreadon$ ls -l /apps/contrib/NCEPLIBS/orion/fix/crtm_v2.3.0/amsua_metop-c.SpcCoeff.bin /apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/crtm/2.4.0/fix/amsua_metop-c.SpcCoeff.bin
-rw-r----- 1 gkyle noaa-hpc  1252 Feb  4 20:03 /apps/contrib/NCEP/libs/hpc-stack/intel-2018.4/crtm/2.4.0/fix/amsua_metop-c.SpcCoeff.bin
-rw-r----- 1 hlei  noaa-hpc 12196 Jun 11  2020 /apps/contrib/NCEPLIBS/orion/fix/crtm_v2.3.0/amsua_metop-c.SpcCoeff.bin

Is this a problem?

@DavidHuber-NOAA (Contributor, author) commented:

@ADCollard @RussTreadon-NOAA OK, the job is running now with the print statement indicating the problematic instrument is amsua.

@ADCollard (Contributor) commented:

Did you print out sc(instrument)%ac%n_fovs and ifov?

If they are outside of the range 1-30 there is a problem. If it's sc(instrument)%ac%n_fovs that is wrong it could be the CRTM issue that @RussTreadon-NOAA noted.

@DavidHuber-NOAA (Contributor, author) commented:

@ADCollard Yes, this is the line I added: print*, obstype, sc(instrument)%ac%n_fovs,ifov at line 686, just before call remove_antcorr(sc(instrument)%ac,ifov,data1b4).

@DavidHuber-NOAA (Contributor, author) commented:

I will try rearranging the module loads to make sure that crtm/2.3.0 is loaded and see if that fixes the issue.

@WalterKolczynski-NOAA (Contributor) commented:

Sounds like we really should just be updating crtm to be 2.4.0

@DavidHuber-NOAA (Contributor, author) commented:

@WalterKolczynski-NOAA I believe the plan is to upgrade CRTM as there is currently an open GSI issue to do so.

Rearranging does fix the issue, so it does indeed appear to be related to using the wrong CRTM fix files. Since the GSI currently uses CRTM 2.3.0, I suspect this is how we will want to proceed for at least this PR. Would you like me to add this modulefile change to this PR, or address it in a new issue? @KateFriedman-NOAA @WalterKolczynski-NOAA

@DavidHuber-NOAA (Contributor, author) commented:

I'm going to restart this test run on Orion and have moved the current log files to /work/noaa/nesdis-rdo2/dhuber/SAVELOGS/p_2_s4_192_orig.

@DavidHuber-NOAA (Contributor, author) commented:

@RussTreadon-NOAA Thanks for pointing out the CRTM 2.4.0 module load. Everything seems to have run well on Orion now.

I do notice some differences between S4 and Orion:

  • Both S4 and Orion module files specify that g2/3.4.1 should be loaded, though 3.4.2 and 3.4.3 are loaded, respectively. Similarly, g2tmpl/1.10.0 is loaded instead of 1.9.1 on both systems. Both of these are additional casualties of the upp module load. I have rearranged the S4 load order and rerun one gdasanal successfully.
  • BACK=YES vs BACK=NO in prep jobs. Can someone enlighten me to the differences between running the prep jobs with BACK=YES and BACK=NO? Is it simply reading in the prepbufr files in parallel? For S4 and Hera, BACK=NO, while for Orion, BACK=YES.
  • There are differences in the number of measurements ingested due to data restrictions (expected)
  • Also in the prep step, S4 must create its own NSSTBUFR file since it is not available due to data restrictions (expected).
  • There are data quality warnings on Orion for SAPHIR on the 2020080200 cycle during the analysis jobs, which is not available on S4 due to data restrictions.

The Orion test thus seems to have been successful based on the current module files.

S4 has a restriction of 5000 core-minutes.  For 384, using 35 nodes,
this comes out to ~4.46 hours.
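The ~4.46-hour figure in this commit message is consistent with interpreting the 5000 limit as core-hours (an assumption on my part; the message says "core-minutes") on the 32-cores-per-node s4 partition described elsewhere in this PR:

```shell
# Sanity check of the ~4.46 h figure (assumption: the 5000 limit is in
# core-hours; the s4 partition has 32 cores per node per config.fv3.nco.static).
hours=$(awk -v n=35 -v c=32 -v l=5000 'BEGIN{printf "%.2f", l/(n*c)}')
echo "$hours"   # 4.46  (5000 / (35 * 32) = 5000 / 1120 cores)
```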
@WalterKolczynski-NOAA (Contributor) left a review:

A few things while I free up the bandwidth to test this.

@@ -61,6 +67,11 @@ case $case_in in
export layout_y=4
export layout_x_gfs=6
export layout_y_gfs=4
if [[ "$machine" == "S4" ]]; then
Contributor commented:

I'd rather not add all these machine-specific layouts if we can at all help it. It adds a lot of ugly code for a machine that isn't even a tier-1 supported machine. I know S4 is small, but is there a reason S4 can't just use the same layout as the other machines for most of these?

Besides that, this script shouldn't be looking at $CASE_ENKF. First, there is little reason the layout should be different for EnKF than for any other forecast at the same resolution. Second, this construct will not behave as you want if EnKF were ever run at the same resolution as deterministic.

Contributor Author commented:

Most of these changes are here to improve cycle completion times and minimize resource usage on the system so it isn't too bogged down. In particular, the EnKF is problematic with the default layouts. S4 is restricted to 40 nodes per user. Each efcs instance, by default, would use 6 nodes (i.e. 6 instances at once). With the 4x4 tiling, it is possible to push 10 instances through at a time. In timing experiments with 80 ensemble members, this reduced mean cycle times by ~1.5 hours (5.5 vs. 4 hours) when running at C384/C192. Similar speedups were observed for C192/C96.

That said, perhaps this could be addressed a different way. I could try modifying the S4.env file to specify the layouts to use, and rewrite the layouts in config.fv3 to accept a predefined value, e.g.
export layout_x=${layout_x:-6}. I'm not sure off the top of my head if this would work with the setup scripts.
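The proposal above relies on standard shell default-value parameter expansion: the assignment in config.fv3 would only apply its default when nothing upstream has set the variable. A minimal demonstration of the semantics (the variable name matches the proposal; the surrounding files are elided):

```shell
# config.fv3 would keep a default, but a machine-specific file loaded
# earlier could pre-set the value and have it survive.
unset layout_x
export layout_x=${layout_x:-6}    # no upstream override -> default applies
echo "$layout_x"                  # 6

layout_x=8                        # e.g. set earlier by an S4-specific block
export layout_x=${layout_x:-6}    # existing value wins over the default
echo "$layout_x"                  # 8
```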

Alternatively, I may be able to add some magic to setup_expt.py and config.base to specify layouts, but this could get messy, too.

If you have another suggestion, I would be willing to try it.

Contributor Author commented:

Adding something to S4.env will not work since it is not loaded by setup_expt.py.

Looking through the scripts and configs, I see one other relatively clean option:

  1. Add an S4-specific if-block to config.efcs that exports layout_x for different CASE_ENKFs before sourcing config.fv3.
  2. As discussed above, modify the layout assignments in config.fv3 (export layout_x=${layout_x:-6}).

@WalterKolczynski-NOAA Would you prefer this solution?

Contributor commented:

Yes, I think I'd like putting it in config.efcs better.

Contributor Author commented:

Done. I will test that this works on Orion and run single gdasefcs and gdasfcst jobs at 384, 192, and 96 on both Orion and S4.

Resolved review threads: sorc/machine-setup.sh, README.md
@DavidHuber-NOAA (Contributor, author) commented:

Comparisons of a 10-day experiment run on both S4 and Orion showed significant differences. I have traced these back to two issues on Orion. The first is that while I copied over the GDA from S4 and pointed to it on Orion, I failed to specify DO_MAKE_NSSTBUFR="YES" in config.prep. Since the nsstbufr files are not available in S4's GDA due to the restricted data, they must be created on the fly from the available unrestricted bufr files.

The second issue was related to other module loading order problems. In particular, when the UPP loads in module_base.orion, g2/3.4.3 and g2tmpl/1.10.0 were being loaded. After fixing these, the two experiments are now much better aligned after one cycle. I'm not sure which of these was causing the more significant differences, but my guess is g2tmpl.

Another 10-day run has started now that I will let run over the weekend and then perform another comparison on Monday.

@DavidHuber-NOAA (Contributor, author) commented:

@WalterKolczynski-NOAA @KateFriedman-NOAA The cycled testing on Orion is complete. Comparisons between Orion and S4 out to 6 days show very good agreement and this port should now be ready for testing on WCOSS.

@aerorahul (Contributor) left a review:

Please see comments re. fv3 configurations for a specific machine.

Comment on lines +11 to +18
if [[ "$CASE_ENKF" == "C96" ]]; then
    export layout_x=4
    export layout_y=4
elif [[ "$CASE_ENKF" == "C192" ]]; then
    export layout_x=4
    export layout_y=4
    export nth_fv3=1
fi
Contributor commented:

I am not sure I agree with this in config.efcs

Comment on lines +11 to +21
unset -v layout_x layout_y layout_x_gfs layout_y_gfs nth_fv3 nth_fv3_gfs
if [[ "$machine" == "S4" ]]; then
    if [[ "$CASE" == "C384" ]]; then
        export layout_x=8
        export layout_y=8
        export layout_x_gfs=8
        export layout_y_gfs=8
        export nth_fv3=1
        export nth_fv3_gfs=1
    fi
fi
Contributor commented:

Or this.

Comment on lines +31 to +36
elif [[ "$machine" = "S4" ]]; then
    if [[ "$PARTITION_BATCH" = "s4" ]]; then
        export npe_node_max=32
    elif [[ "$PARTITION_BATCH" = "ivy" ]]; then
        export npe_node_max=20
    fi
Contributor commented:

I realize this is for consistency, but, config.fv3.nco.static is for a specific application and machine

@WalterKolczynski-NOAA (Contributor) commented:

@DavidHuber-NOAA The global-workflow team is going to get together and discuss this so we're not pulling you all over the place with edits. Might be a couple days before we can though.

@DavidHuber-NOAA (Contributor, author) commented:

@WalterKolczynski-NOAA @aerorahul Sounds good, thanks. I will finish the tests on this round of edits just to verify it could work.

if [ $CASE = "C768" ]; then
    export wtime_fcst_gfs="06:00:00"
elif [ $CASE = "C384" ]; then
    export wtime_fcst_gfs="06:00:00"
    if [[ "$machine" == "S4" && "$APP" == "S2SW" ]]; then export wtime_fcst_gfs="04:00:00"; fi
Contributor commented:

I think we should remove this line and reduce the default to 4 hrs.

@@ -335,6 +385,7 @@ elif [ $step = "vrfy" ]; then
export wtime_vrfy="03:00:00"
export wtime_vrfy_gfs="06:00:00"
export npe_vrfy=3
[[ "$machine" == "S4" ]] && export npe_vrfy=1
Contributor commented:

We really need to know what drives some of these special S4 resource requests.

Comment on lines +478 to +481
if [[ "$machine" = "S4" ]]; then
    export npe_node_eobs=10
    export wtime_eobs="00:30:00"
fi
Contributor commented:

Based on the calculation on line 475, for the different partitions of S4, the result on line 479 will be either 16 or 10. What's wrong with those?
Is the 15 min for eobs too short on S4?

Comment on lines +68 to +71
# The NSST bufr files are not available on S4, so we need to make them
if [[ "$machine" = "S4" ]]; then
    DO_MAKE_NSSTBUFR="YES"
fi
Member commented:

Is this because the S4 GDA doesn't include restricted datasets (e.g. nsstbufr)?

@aerorahul (Contributor) commented:

This PR is likely out of date and will likely remain so until @DavidHuber-NOAA recovers.
Closing PR now and will ask to reopen when ready.

@aerorahul aerorahul closed this Apr 5, 2022
kayeekayee pushed a commit to kayeekayee/global-workflow that referenced this pull request May 30, 2024
* Switch range_check_2d to use minval/maxval intrinsics as requested
8 participants