
Update RDHPCS Hera resource for eupd task #2636

Conversation

HenryRWinterbottom
Contributor

This PR addresses issue #2454 by updating the RDHPCS Hera resource configuration for the eupd task.

Type of change

  • Bug fix (fixes something broken)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

This has been tested by @wx20jjung during experiment applications. A Rocoto workflow XML was provided containing the following job information for the gdaseupd task:

<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">

        <command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>

        <jobname><cyclestr>HW_test_enkfgdaseupd_@H</cyclestr></jobname>
        <account>da-cpu</account>
        <queue>batch</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>16:ppn=5:tpp=8</nodes>
        <native>--export=NONE</native>

The changes to parm/config/gfs/config.resources result in the following:

<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">

        <command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>

        <jobname><cyclestr>x002_gwdev_issue_2454_enkfgdaseupd_@H</cyclestr></jobname>
        <account>fv3-cpu</account>
        <queue>batch</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>16:ppn=5:tpp=8</nodes>
        <native>--export=NONE</native>

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

Co-authored-by: Walter Kolczynski - NOAA <Walter.Kolczynski@noaa.gov>
@KateFriedman-NOAA
Member

@wx20jjung found a solution: changing the runtime layout to 5 PEs per node with 8 threads (instead of 8 PEs / 5 threads) and 80 PEs total (instead of 270). This resulted in much shorter wait times and only about 5 minutes of additional run time.

@wx20jjung @HenryWinterbottom-NOAA Interesting, I'm not used to seeing a resource fix that means fewer tasks and nodes... if this is a memory issue, doesn't fewer nodes mean less available memory? The change in this PR really only reduces the task count for this job on Hera; it will still use 8 threads and 5 ppn (as it did before). Since the issue that this PR aims to fix was reported as intermittent and resolvable upon rerun, was this tested over many cycles to confirm that it actually fixes the problem?

Another question...why change the default nth_eupd to be 5 threads instead of 8? That change means that every machine not already specified with a machine if-block in that section will now use 5 threads instead of 8 (e.g. Orion, Hercules, and Jet). Was that change tested on those machines at C384?

@wx20jjung
Contributor

wx20jjung commented May 30, 2024 via email

@KateFriedman-NOAA
Member

@wx20jjung Thanks for that explanation, that helps my understanding of the issue!

> I have been running the 5 task / 8 thread combination at C384 on Hera

So, the global-workflow already has/had 8 threads and 5 tasks per node (ppn = 40 / 8 threads) for C384, so this PR is only lowering the total task count and thus the resulting node count. It seems we already have the thread/ppn solution that was working for you. Does lowering the task and node counts (what this PR does) help further? I suspect a different resource configuration is needed.

> The global workflow is also not set up to call tasks-per-node for this step, which would help optimize the node (and memory) usage.

The global-workflow config.resources has npe_node_JOB variables that can be adjusted if needed. We generally just set them like this for each job:

export npe_node_eupd=$(( npe_node_max / nth_eupd ))

...but, if needed, we can set this differently for a job/resolution.

Currently we set resources based on the following variables and calculations (showing this PR's eupd resources as an example):

npe_node_max=40 (total number of PEs per node on Hera)
npe_eupd=80 (total number of tasks for job)
nth_eupd=8 (threads for job)
npe_node_eupd=40/8=5 (PEs per node for job)

--> nodes=npe_eupd/npe_node_eupd=80/5=16
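For reference, here is a minimal shell sketch of that layout arithmetic using the values above (the variable names mirror config.resources, but this is illustrative only, not the actual workflow code):

    # Hera eupd layout from this PR (values as listed above)
    npe_node_max=40    # PEs available per Hera node
    npe_eupd=80        # total tasks for the job
    nth_eupd=8         # threads per task

    npe_node_eupd=$(( npe_node_max / nth_eupd ))                   # 40 / 8 = 5 tasks per node
    nodes=$(( (npe_eupd + npe_node_eupd - 1) / npe_node_eupd ))    # ceil(80 / 5) = 16 nodes

    echo "<nodes>${nodes}:ppn=${npe_node_eupd}:tpp=${nth_eupd}</nodes>"   # 16:ppn=5:tpp=8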

> I am not allowed to log in to the cluster nodes to monitor memory usage, so I do not know what the optimum configuration should be.

We can add a memory command to a job to get the memory information printed in the log if needed. It's messy output with some error messages that can be ignored...which is why we don't have it on by default on Hera. Let me know if that would help to determine the memory needed.
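(If logging onto the compute nodes isn't possible, another rough option is to query Slurm accounting after the job finishes; this is a generic Slurm example, not necessarily the mechanism referred to above:)

    # Peak resident memory (MaxRSS) per job step after the job completes; <jobid> is a placeholder
    sacct -j <jobid> --format=JobID,JobName,MaxRSS,NNodes,Elapsed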

> I suspect 6 or 7 tasks (and 1 thread) would be the optimum use for the 40-core nodes on Hera and Jet.

Perhaps stepping back from what I went through above... what would you suggest for the resulting XML resource statement? The current result from this PR would be: <nodes>16:ppn=5:tpp=8</nodes>
It sounds like this may be what you're suggesting: <nodes>13:ppn=6:tpp=1</nodes> (note: the node value is rounded down using 6 ppn, which doesn't divide evenly into 80 tasks; it may end up as 14 nodes, and one would have to run the setup_xml step to see).
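A quick sanity check of that rounding (just arithmetic, not a substitute for running setup_xml):

    npe_eupd=80; ppn=6
    echo $(( (npe_eupd + ppn - 1) / ppn ))   # ceil(80 / 6) = 14, so 13 nodes would come up one node short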

Let us know what would potentially be a better resource configuration for C384 eupd.

Note: the resource configuration method in global-workflow is being redesigned now, so feel free to provide a resource suggestion that doesn't have the current calculation constraints, and we can see if we can accommodate it.

@wx20jjung
Contributor

wx20jjung commented May 30, 2024 via email

@@ -1045,13 +1045,16 @@ case ${step} in
     ;;
   "C384")
     export npe_eupd=270
-    export nth_eupd=8
+    export nth_eupd=5
Contributor


@HenryWinterbottom-NOAA
Is this change necessary? It will impact all machines except the ones noted in the if-block below.
Also, in the if-block, Hera is reset to 8 threads, which is the develop version.
Is the only change needed here the one on line 1056, i.e., npe_eupd=80 for Hera?
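In other words, the suggested minimal change would look roughly like the following for the C384 block (a sketch only; the actual structure and machine check in config.resources may differ):

    "C384")
        export npe_eupd=270
        export nth_eupd=8                      # keep the develop default of 8 threads
        if [[ "${machine}" = "HERA" ]]; then
            export npe_eupd=80                 # the only Hera-specific change (line 1056)
        fi
        ;;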

@HenryRWinterbottom (Contributor, Author) commented Jun 3, 2024

No, thank you for catching my oversight. Updating only line 1056 results in:

<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">

        <command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>

        <jobname><cyclestr>x002_gwdev_issue_2454_enkfgdaseupd_@H</cyclestr></jobname>
        <account>fv3-cpu</account>
        <queue>batch</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>16:ppn=5:tpp=8</nodes>
        <native>--export=NONE</native>

This is consistent with @wx20jjung's configuration:

<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">

        <command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>

        <jobname><cyclestr>x002_gwdev_issue_2454_enkfgdaseupd_@H</cyclestr></jobname>
        <account>fv3-cpu</account>
        <queue>batch</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>16:ppn=5:tpp=8</nodes>
        <native>--export=NONE</native>

I pushed the correction.

@aerorahul
Contributor

@wx20jjung Do the changes in this PR resolve the issue reported in #2454?
Thank you for your time.

@DavidHuber-NOAA (Contributor) left a review comment:

LGTM

@WalterKolczynski-NOAA (Contributor) left a review comment:

Conditionally approved

@WalterKolczynski-NOAA added the CI-Hera-Ready label on Jun 6, 2024
@emcbot added the CI-Hera-Building label and removed the CI-Hera-Ready and CI-Hera-Building labels on Jun 6, 2024
@emcbot added the CI-Hera-Running label on Jun 6, 2024
@emcbot added the CI-Hera-Passed label and removed the CI-Hera-Running label on Jun 6, 2024
@emcbot

emcbot commented Jun 6, 2024

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2636

@WalterKolczynski-NOAA added and then removed the CI-Orion-Ready, CI-Hercules-Ready, and CI-Wcoss2-Ready labels on Jun 7, 2024
@WalterKolczynski-NOAA merged commit 9caa51d into NOAA-EMC:develop on Jun 7, 2024
4 checks passed
danholdaway added a commit to danholdaway/global-workflow that referenced this pull request Jun 13, 2024
…bal-workflow into feature/move_jcb

* 'feature/move_jcb' of https://github.com/danholdaway/global-workflow:
  Add COM template for JEDI obs (NOAA-EMC#2678)
  Link both global-nest fix files and non-nest ones at the same time (NOAA-EMC#2632)
  Update ufs-weather-model  (NOAA-EMC#2663)
  Add ability to process ocean/ice products specific to GEFS (NOAA-EMC#2561)
  Update cleanup job to use COMIN/COMOUT (NOAA-EMC#2649)
  Add overwrite to creat experiment in BASH CI (NOAA-EMC#2676)
  Add handling to select CRTM cloud optical table based on cloud scheme and update calcanal_gfs.py  (NOAA-EMC#2645)
  Update RDHPCS Hera resource for `eupd` task (NOAA-EMC#2636)
Labels: CI-Hera-Passed
Successfully merging this pull request may close these issues: gdaseupd memory issues on Hera
8 participants