gldas fails with USE_CFP=YES on Hera #1089

Closed
RussTreadon-NOAA opened this issue Oct 23, 2022 · 3 comments · Fixed by #1094

Comments

@RussTreadon-NOAA
Contributor

Expected behavior
gdasgldas should run to completion with USE_CFP=YES

Current behavior
When gdasgldas runs on Hera with USE_CFP=YES, it fails with

+ gldas_forcing.sh[68](20211222): [[ YES = \Y\E\S ]]
+ gldas_forcing.sh[69](20211222): rm -f ./cfile
+ gldas_forcing.sh[70](20211222): touch ./cfile
+ gldas_forcing.sh[72](20211222): echo '/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-18.0.5.274/grib_util/1.2.2/bin/copygb -i3 '\''-g255 0 2881 1441 90000 0 128 -90000 360000 125 125'\'' -x gdas.2021122112 grib.12'
+ gldas_forcing.sh[73](20211222): echo '/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-18.0.5.274/grib_util/1.2.2/bin/copygb -i3 '\''-g255 0 2881 1441 90000 0 128 -90000 360000 125 125'\'' -x gdas.2021122118 grib.18'
+ gldas_forcing.sh[74](20211222): echo '/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-18.0.5.274/grib_util/1.2.2/bin/copygb -i3 '\''-g255 0 2881 1441 90000 0 128 -90000 360000 125 125'\'' -x gdas.2021122200 grib.00'
+ gldas_forcing.sh[75](20211222): echo '/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-18.0.5.274/grib_util/1.2.2/bin/copygb -i3 '\''-g255 0 2881 1441 90000 0 128 -90000 360000 125 125'\'' -x gdas.2021122206 grib.06'
+ gldas_forcing.sh[77](20211222): srun -l --export=ALL -n 84 --multi-prog ./cfile
srun: error: Invalid task range specification (/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-18.0.5.274/grib_util/1.2.2/bin/copygb)
srun: error: Line 1 of configuration file ./cfile invalid
+ gldas_forcing.sh[1](20211222): postamble gldas_forcing.sh 1666501527 1

Machines affected
Hera

To Reproduce

  1. install a fresh clone of g-w develop on Hera
  2. create EXPDIR
  3. populate ROTDIR with the files needed to run 00Z gldas, making sure ROTDIR contains a sufficient history of sfluxgrb files to run fully at 00Z
  4. submit 00Z gdasgldas job

Additional Information
The Hera gdasgldas job log file indicates that CFP failed because the command file contains four entries but srun was invoked with 84 tasks. This apparently causes a problem on Hera. A check of operational gdasgldas log files on WCOSS2 shows that CFP on WCOSS2 accepts more tasks than there are entries in the command file.

Possible Implementation
If it is true that on Hera the number of tasks must equal the number of entries in the command file, the gldas script(s) invoking CFP can count the number of entries in command files and execute CFP specifying that number of tasks.
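A minimal sketch of the counting idea, assuming the command file is `./cfile` with one command per non-blank line. The helper name `count_cfp_tasks` is hypothetical and not part of the gldas scripts. Note that the srun error in the log complains about the first token on line 1 of the file, so matching the task count may not be sufficient by itself.

```shell
# Hypothetical helper: count the non-blank lines in a CFP command file.
count_cfp_tasks() {
  # grep -c prints the number of lines containing at least one
  # non-whitespace character.
  grep -c '[^[:space:]]' "$1"
}

# Illustrative usage (assumption, not taken from the gldas scripts):
#   ntasks=$(count_cfp_tasks ./cfile)
#   srun -l --export=ALL -n "${ntasks}" --multi-prog ./cfile
```

For the four-entry command file shown in the log, this would request 4 tasks instead of 84.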

For the time being, I changed USE_CFP="YES" in the gldas section of HERA.env to USE_CFP="NO". The Hera gldas job runs to completion with this change.

@RussTreadon-NOAA RussTreadon-NOAA added the bug Something isn't working label Oct 23, 2022
@WalterKolczynski-NOAA
Contributor

I suspect the command file (also?) needs to have the CPU indices prepended to each line. This is likely an issue on other slurm machines as well.
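For reference, the configuration file read by `srun --multi-prog` begins each line with a task rank specification (zero-based), followed by the executable and its arguments. A rough sketch of prepending those indices is below; the helper name `add_rank_prefix` and the output file name `cfile.slurm` are assumptions for illustration, not names used in the gldas scripts.

```shell
# Hypothetical helper: prefix each non-blank line of a command file with a
# zero-based task rank, the layout srun --multi-prog expects.
add_rank_prefix() {
  # NF is non-zero only for lines with at least one field, so blank lines
  # are skipped; n starts at 0 and increments once per emitted line.
  awk 'NF { printf "%d %s\n", n++, $0 }' "$1"
}

# Illustrative usage (assumption, not taken from the gldas scripts):
#   add_rank_prefix ./cfile > ./cfile.slurm
#   ntasks=$(grep -c '' ./cfile.slurm)
#   srun -l --export=ALL -n "${ntasks}" --multi-prog ./cfile.slurm
```

With this layout the first token on each line is a rank rather than the path to copygb, which is the token the "Invalid task range specification" error points at.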

@KateFriedman-NOAA
Member

USE_CFP=NO is set in ORION.env and probably should have been set in HERA.env as well, but I see I specifically added USE_CFP=YES in HERA.env in the recent PR to "fix" the GLDAS job. I can't get onto supercomputers today to look at my Hera logs from tests for that PR, but I can take a look tomorrow and see what's missing (either USE_CFP=NO in HERA.env or a change to the cfile section of the GLDAS scripts). Each of our supported platforms handles CFP a bit differently (wants the leading counter # or not). Will assign this issue to myself. Thanks for reporting, @RussTreadon-NOAA!

@KateFriedman-NOAA KateFriedman-NOAA self-assigned this Oct 24, 2022
KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Oct 25, 2022
@KateFriedman-NOAA
Member

USE_CFP should definitely be set to NO in HERA.env. This is an oversight on my part. Will remedy in a PR momentarily.
