Update RDHPCS Hera resource for eupd task #2636
Conversation
@wx20jjung @HenryWinterbottom-NOAA Interesting, I'm not used to seeing a resource fix that means fewer tasks and nodes...if this is a memory issue, then doesn't fewer nodes mean not enough memory? The change in this PR really only reduces the task count for this job on Hera; it will still be using 8 threads and 5 ppn (as it was before). Since the issue that this PR aims to fix reported that the failure was intermittent and resolvable upon rerun, was this tested over many cycles to ensure it's good and fixes the problem?
Another question...why change the default nth_eupd to be 5 threads instead of 8? That change means that every machine not already specified with a machine if-block in that section will now use 5 threads instead of 8 (e.g. Orion, Hercules, and Jet). Was that change tested on those machines at C384?
Kate Friedman - NOAA Federal,
The underlying problem is memory per node, not total memory. The eupd step does not have OpenMP statements, so having threads greater than 1 just shuts down cores on the node. Using 8 tasks can sometimes cause a memory-use problem within a node. Adding more nodes does not solve this memory failure, as it is not a total-memory problem. I am not allowed to log in to the cluster nodes to monitor memory usage, so I do not know what the optimum configuration should be. The global workflow is also not set up to call tasks-per-node for this step, which would help optimize the node (and memory) usage. I suspect 6 or 7 tasks (and 1 thread) would be the optimum use for the 40-core nodes on Hera and Jet.
I have been running the 5-task / 8-thread combination at C384 on Hera and Jet (kjet, 40-core nodes) for several months now with no failures. I can't comment on any of the other machines or model resolutions.
…On Thu, May 30, 2024 at 10:12 AM Kate Friedman wrote:
@wx20jjung found a solution to be to change the runtime layout to 5 PEs per node with 8 threads (instead of 8 PEs/5 threads) and 80 PEs total (instead of 270). This resulted in much shorter wait times and only about 5 minutes longer run time.
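To make the per-node memory arithmetic in the explanation above concrete, here is a rough sketch (the ~96 GB of usable memory per 40-core Hera node is a nominal figure assumed for illustration, not a number taken from this thread):
# Both layouts keep all 40 cores occupied or reserved, but differ in how
# many MPI tasks share one node's memory (~96 GB assumed):
#   8 tasks x 5 threads per node: 96 / 8 = ~12 GB per MPI task
#   5 tasks x 8 threads per node: 96 / 5 = ~19 GB per MPI task
# Since eupd has no OpenMP directives, the extra threads idle their cores;
# the layout's real effect is fewer tasks competing for one node's memory.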
@wx20jjung Thanks for that explanation, that helps my understanding of the issue!
So, the global-workflow already has/had 8 threads and 5 tasks (ppn = 40 / 8 threads) for C384, so this PR is only lowering the total task count and thus the resulting node count. It seems like we already have the thread/ppn solution that was working for you. Does lowering the task and node count help further then (what this PR does)? I suspect a different resource configuration is needed.
The global workflow is also not set up to call tasks-per-node for this step, which would help optimize the node (and memory) usage.
The global-workflow config.resources has npe_node_JOB variables that can be adjusted if needed. We generally just set them like this for each job:
export npe_node_eupd=$(( npe_node_max / nth_eupd ))
...but, if needed, we can set this differently for a job/resolution. Currently we set resources based on the following variables and calculations (showing this PR's eupd resources as an example):
npe_node_max=40 (total number of PEs per node on Hera)
npe_eupd=80 (total number of tasks for the job)
nth_eupd=8 (threads for the job)
npe_node_eupd=40/8=5 (PEs per node for the job)
--> nodes = npe_eupd / npe_node_eupd = 80/5 = 16
I am not allowed to log in to the cluster nodes to monitor memory usage, so I do not know what the optimum configuration should be.
We can add a memory command to a job to get the memory information printed in the log if needed. It's messy output, with some error messages that can be ignored...which is why we don't have it on by default on Hera. Let me know if that would help to determine the memory needed.
I suspect 6 or 7 tasks (and 1 thread) would be the optimum use for the 40-core nodes on Hera and Jet.
Perhaps stepping back from what I went through above...what would you suggest for the resulting xml resource statement? The current result from this PR would be: <nodes>16:ppn=5:tpp=8</nodes>
Sounds like this may be what you're suggesting: <nodes>13:ppn=6:tpp=1</nodes> (note: the node value is a round-down using 6 ppn, which doesn't divide evenly into 80 tasks; it may end up as 14 nodes, one would have to run the setup_xml step to see).
Let us know what would potentially be a better resource configuration for C384 eupd. Note: the resource configuration method in global-workflow is being redesigned now, so feel free to provide a resource suggestion that doesn't have the current calculation constraints and we can see if we can accommodate it.
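As a quick check on that round-down caveat (a hypothetical shell calculation, not output from the setup_xml step): 80 tasks at 6 per node need 14 nodes, since 80/6 must round up rather than down.
npe_eupd=80
ppn=6
nodes=$(( (npe_eupd + ppn - 1) / ppn ))   # ceiling division: (80+5)/6 = 14
echo "<nodes>${nodes}:ppn=${ppn}:tpp=1</nodes>"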
Kate Friedman - NOAA Federal,
First, a clarification. The version(s) of global-workflow I am using have the "old" configuration of npe_eupd=270, nth_eupd=5. I changed these in my versions so that npe_eupd=80, ppn=5, tpp=8, or <nodes>16:ppn=5:tpp=8</nodes>, to keep the jobs from failing on Hera and Jet. This keeps the *.xml consistent with the config.* file.
From this point on, I have to be careful, as grant funding is not allowed to "transition items to operations" and I am already in trouble for transitioning code to EMC. So, these are only suggestions.
My first suggestion is to identify the total memory needed for a specific number of ensembles, resolution, and observation data volume. You only need this info for a few cycles. This should identify how many nodes you will need. If possible, also check the memory requirements for each MPI task. The nature of this failure suggests the memory requirement for each task is not balanced. There are probably one or more "outliers". The compiler and hardware vendors, and RDHPCS, should be able to help with this. You will need to assume all the tasks use the maximum (outlier) memory. There is no "one size fits all" configuration for the complex workflow you have. Your configurations seem to be set up for task/thread ratios per node. SLURM gives you a lot of options on how to pack a node. I do not know what the defaults are on the various machines. S4 was set up to fill a node before moving on to the next node. Some put a task on each node (round robin) until it runs out of tasks. I suggest a scenario where you put as many tasks on a node as possible to keep the MPI communication traffic across the network to a minimum. Any distribution scenario is messy and will have to be tailored for each job and the node configuration.
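Two standard SLURM facilities line up with these suggestions (a sketch; the job ID and executable name are placeholders, and accounting-field availability varies by site):
# Per-task memory for a completed job: MaxRSS is the peak resident set
# size; MaxRSSTask/MaxRSSNode identify the "outlier" task and its node.
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,MaxRSSTask,MaxRSSNode
# Task placement: block fills each node before moving to the next
# (minimizing off-node MPI traffic); cyclic round-robins tasks across
# the allocated nodes.
srun --distribution=block ./eupd.x
srun --distribution=cyclic ./eupd.x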
parm/config/gfs/config.resources (Outdated)
@@ -1045,13 +1045,16 @@ case ${step} in
        ;;
    "C384")
        export npe_eupd=270
-       export nth_eupd=8
+       export nth_eupd=5
@HenryWinterbottom-NOAA
Is this change necessary? This will impact all machines except the ones noted in the if-block below.
Also, in the if-block, Hera is reset to 8; 8 is the develop version.
Is the only change needed here on line 1056? For Hera, npe_eupd=80?
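A sketch of the shape that suggestion implies (the surrounding machine if-block and the ${machine} variable are assumed from context; the exact lines in config.resources may differ):
"C384")
    export npe_eupd=270
    export nth_eupd=8
    if [[ ${machine} = "HERA" ]]; then
        export npe_eupd=80   # Hera-only override: 80 tasks -> 16 nodes at 5 ppn
    fi
    ;;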
No, thank you for catching my oversight. Updating only line 1056 results in
<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">
<command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>
<jobname><cyclestr>x002_gwdev_issue_2454_enkfgdaseupd_@H</cyclestr></jobname>
<account>fv3-cpu</account>
<queue>batch</queue>
<partition>hera</partition>
<walltime>00:30:00</walltime>
<nodes>16:ppn=5:tpp=8</nodes>
<native>--export=NONE</native>
Which is consistent with @wx20jjung's configuration:
<task name="enkfgdaseupd" cycledefs="gdas" maxtries="&MAXTRIES;">
<command>/scratch1/NCEPDEV/da/Henry.Winterbottom/trunk/global-workflow.gwdev_issue_2454/jobs/rocoto/eupd.sh</command>
<jobname><cyclestr>x002_gwdev_issue_2454_enkfgdaseupd_@H</cyclestr></jobname>
<account>fv3-cpu</account>
<queue>batch</queue>
<partition>hera</partition>
<walltime>00:30:00</walltime>
<nodes>16:ppn=5:tpp=8</nodes>
<native>--export=NONE</native>
I pushed the correction.
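For reference, under a SLURM backend a Rocoto <nodes> spec of this shape typically translates to an sbatch request along these lines (a sketch of the usual mapping, not output captured from this PR):
# <nodes>16:ppn=5:tpp=8</nodes> becomes approximately:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=8
# 16 nodes x 5 MPI tasks = 80 tasks, each reserving 8 of a node's 40 cores.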
@wx20jjung Do the changes in this PR resolve the issue reported in #2454?
LGTM
Conditionally approved
CI Passed Hera at …
…bal-workflow into feature/move_jcb
* 'feature/move_jcb' of https://github.com/danholdaway/global-workflow:
  Add COM template for JEDI obs (NOAA-EMC#2678)
  Link both global-nest fix files and non-nest ones at the same time (NOAA-EMC#2632)
  Update ufs-weather-model (NOAA-EMC#2663)
  Add ability to process ocean/ice products specific to GEFS (NOAA-EMC#2561)
  Update cleanup job to use COMIN/COMOUT (NOAA-EMC#2649)
  Add overwrite to create experiment in BASH CI (NOAA-EMC#2676)
  Add handling to select CRTM cloud optical table based on cloud scheme and update calcanal_gfs.py (NOAA-EMC#2645)
  Update RDHPCS Hera resource for `eupd` task (NOAA-EMC#2636)
This PR addresses issue #2454. The following is accomplished:
As per @wx20jjung, the resources for the eupd task have been updated for RDHPCS Hera to account for the memory issues for which the C384 gdaseupd job fails.
Resolves gdaseupd memory issues on Hera #2454
Type of change
Change characteristics
How has this been tested?
This has been tested by @wx20jjung during experiment applications. A Rocoto workflow (i.e., XML) was provided containing the following job information for the gdaseupd task:
The changes to parm/config/gfs/config.resources result in the following:
Checklist