What is wrong?
When gfswavepostpnt runs, it spawns 200 MPMD jobs, each a wrapper around the ush/wave_outp_spec.sh script. Altogether, the script is called several thousand times and performs many on-disk operations (e.g. cat, sed, grep). These operations do not scale well on large nodes, causing the job to run very slowly.
What should have happened?
Ideally, the job would use more efficient operations to achieve its goals, but a workaround in its current state is to run it with fewer MPMD jobs per node.
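As a rough illustration of what "more efficient operations" could look like, the sketch below collapses a cat | grep | sed pipeline into a single sed call, which starts one process per invocation instead of three. The input file, pattern, and substitution are hypothetical and are not taken from wave_outp_spec.sh.

```shell
#!/usr/bin/env bash
# Hypothetical input file; not an actual file used by wave_outp_spec.sh.
infile=$(mktemp)
printf 'point 41001\nother line\npoint 41002\n' > "${infile}"

# Three processes per call: cat, grep, and sed.
slow=$(cat "${infile}" | grep 'point' | sed 's/point/buoy/')

# One process per call: sed selects the matching lines and substitutes in one pass.
fast=$(sed -n '/point/s/point/buoy/p' "${infile}")

rm -f "${infile}"
```

Both variables hold the same text; the difference is only in how many processes and pipes each call creates, which matters when the script is invoked thousands of times.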
What machines are impacted?
Hercules
Steps to reproduce
Run a gfswavepostpnt job on both Hercules and Hera, then compare the runtimes of the wave_outp_spec.sh script on the two machines.
This fixes the slow runtime of the gfswavepostpnt job on Hercules. The
job is very I/O intensive and does not scale well to large nodes, so
limit the number of jobs/node to 40.
Resolves #2587
commit 6ca106e (origin/develop, origin/HEAD, may30, develop)
Author: David Huber <69919478+DavidHuber-NOAA@users.noreply.github.com>
Date: Mon May 13 22:57:38 2024 +0000
Limit gfswavepostpnt to 40 PEs/node (NOAA-EMC#2588)
This fixes the slow runtime of the gfswavepostpnt job on Hercules. The
job is very I/O intensive and does not scale well to large nodes, so
limit the number of jobs/node to 40.
Resolves NOAA-EMC#2587
Additional information
Discovered while testing #2527.
Do you have a proposed solution?
Spread the jobs over 5 nodes on Hercules (40/node) instead of 3.
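The node count in this proposal follows from dividing the 200 MPMD jobs by the per-node limit and rounding up. A minimal sketch of that arithmetic, where the function name nodes_needed is hypothetical and not part of the workflow:

```shell
#!/usr/bin/env bash
# Hypothetical helper: nodes required for a given task count and per-node limit.
nodes_needed() {
  local ntasks=$1
  local tasks_per_node=$2
  # Integer ceiling division.
  echo $(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 200 40   # prints 5
```

By the same arithmetic, packing 200 jobs onto 3 nodes places up to 67 jobs on a single node.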