The gfswavepostpnt job runs very slowly on Hercules #2587

Closed
DavidHuber-NOAA opened this issue May 10, 2024 · 0 comments · Fixed by #2588
Labels
bug Something isn't working

Comments

@DavidHuber-NOAA
Contributor

What is wrong?

When gfswavepostpnt runs, it spawns 200 MPMD jobs, each a wrapper around the ush/wave_outp_spec.sh script. Altogether, the script is called several thousand times and performs many on-disk operations (e.g. cat, sed, grep). These operations do not scale well on large nodes, so the job runs very slowly.

What should have happened?

Ideally, the job would use more efficient operations to achieve its goals, but a workaround in its current state is to run it with fewer MPMD jobs per node.
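
One generic way to reduce this kind of disk traffic is to read an input once and apply the filter and substitution in memory, rather than running separate cat, grep, and sed steps that pass intermediate files between them. The sketch below is an illustration only: the contents of wave_outp_spec.sh are not shown in this issue, and the function name, template lines, and buoy ID are all hypothetical.

```python
import re

def extract_and_substitute(lines, pattern, old, new):
    """Keep lines matching pattern (like grep) and replace old with new (like sed), in one pass."""
    regex = re.compile(pattern)
    return [line.replace(old, new) for line in lines if regex.search(line)]

# Hypothetical template lines; NOT taken from wave_outp_spec.sh.
template = [
    "POINT = BUOY_NAME",
    "comment line",
    "OUTPUT = BUOY_NAME.spec",
]

# Hypothetical buoy ID used purely for illustration.
result = extract_and_substitute(template, r"BUOY_NAME", "BUOY_NAME", "41001")
print(result)
```

Whether this helps in practice depends on how the real script spends its time, which would need profiling on Hercules.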

What machines are impacted?

Hercules

Steps to reproduce

Run a gfswavepostpnt job on both Hercules and Hera, then compare the runtimes of the wave_outp_spec.sh script on the two machines.

Additional information

Discovered while testing #2527.

Do you have a proposed solution?

Spread the jobs over 5 nodes on Hercules (40/node) instead of 3.
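
The node counts follow from the task count and the per-node cap. A minimal sketch of that arithmetic, assuming fully packed Hercules nodes hold 80 tasks each (an assumption not stated in this issue, though it is consistent with the 3-node baseline); the function name `nodes_needed` is hypothetical:

```python
import math

def nodes_needed(total_tasks: int, tasks_per_node: int) -> int:
    """Smallest number of nodes that holds total_tasks at the given per-node cap."""
    return math.ceil(total_tasks / tasks_per_node)

TOTAL_TASKS = 200  # MPMD jobs spawned by gfswavepostpnt (from this issue)

# Assumed: 80 tasks per node when nodes are fully packed; not stated in the issue.
packed = nodes_needed(TOTAL_TASKS, 80)
# Proposed cap of 40 tasks per node.
capped = nodes_needed(TOTAL_TASKS, 40)

print(packed, capped)
```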

DavidHuber-NOAA added the bug and triage labels, then removed the triage label, on May 10, 2024
WalterKolczynski-NOAA pushed a commit that referenced this issue May 13, 2024
This fixes the slow runtime of the gfswavepostpnt job on Hercules. The
job is very I/O intensive and does not scale well to large nodes, so
limit the number of jobs/node to 40.

Resolves #2587
zhanglikate added a commit to zhanglikate/global-workflow that referenced this issue May 17, 2024
commit 6ca106e (origin/develop, origin/HEAD, may30, develop)
Author: David Huber <69919478+DavidHuber-NOAA@users.noreply.github.com>
Date:   Mon May 13 22:57:38 2024 +0000

    Limit gfswavepostpnt to 40 PEs/node (NOAA-EMC#2588)

    This fixes the slow runtime of the gfswavepostpnt job on Hercules. The
    job is very I/O intensive and does not scale well to large nodes, so
    limit the number of jobs/node to 40.

    Resolves NOAA-EMC#2587