gitlab-ci: add concurrent jobs in run stage #150

Merged: 2 commits, Jul 4, 2022

@adcroft (Member) commented Jun 17, 2022

  • Splits the run stage (~40 mins) into four smaller jobs.
  • Prior to this, typical pipeline turnaround was more than 1 hour; two consecutive tests of this refactoring finished in 23 minutes
  • The old run stage used all executables in one run script and so could not start until the pgi executable was ready, even though the gnu executable was ready 10 minutes earlier
  • Breaking the run stage into tests grouped by compiler allows some "tetris" to be played to minimize wait time between jobs
  • Implemented by making four copies of MOM6-examples: three to allow concurrency across the compilers (gnu, intel, pgi), and a fourth for restart tests (gnu only)
  • The results are copied into sub-directories under results/ for later comparison; tar files are no longer used for caching output
  • Added "needs:" so jobs can start when their dependency is ready
  • Re-ordered jobs in the .gitlab-ci.yml files so that the slowest compilation starts first (pgi)
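
As a rough illustration of the "needs:" and job-ordering changes described above, a fragment of a `.gitlab-ci.yml` could look like the sketch below. The job names, stage names, and script paths are hypothetical placeholders, not taken from the actual MOM6 pipeline; only the `stages`, `stage`, `needs`, and `script` keywords are standard GitLab CI.

```yaml
# Hypothetical sketch: job names, stage names, and script paths are
# illustrative assumptions, not the contents of the real .gitlab-ci.yml.
stages:
  - builds
  - run

# The slowest compilation (pgi) is listed first, mirroring the re-ordering
# described in this PR.
compile-pgi:
  stage: builds
  script:
    - ./ci/build.sh pgi        # placeholder build command

compile-gnu:
  stage: builds
  script:
    - ./ci/build.sh gnu        # placeholder build command

# Each run job names only the build it depends on, so the gnu tests can
# start as soon as compile-gnu finishes instead of waiting for pgi.
run-gnu:
  stage: run
  needs: ["compile-gnu"]
  script:
    - ./ci/run_tests.sh gnu    # placeholder test command

run-pgi:
  stage: run
  needs: ["compile-pgi"]
  script:
    - ./ci/run_tests.sh pgi    # placeholder test command
```

Without `needs:`, every job in the `run` stage would wait for the whole `builds` stage to complete, which is the behaviour the old single run job effectively had.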

Considerations:

  • We can't run two tests in the same directory at the same time because their output would collide. The old CI therefore launched the tests for all experiments/configurations concurrently but cycled through each group of tests (compilers, layouts, etc.) sequentially, copying the output and reusing the same work space. Making copies of the work space is slow, and running more concurrent jobs requires more nodes to be available at once, so four copies were found to be optimal for Gaea under the current workload.
  • We only have six runners (on the six compilation nodes), which limits the pipeline to six jobs at once. Allowing multiple jobs per runner could remove this limitation but would have a greater impact on the system.
  • Restart testing is the slowest part of the run stage, even though it covers only a subset of experiments. Separating the restart tests out allows more concurrency. Running restart tests for more experiments and all compilers would be very expensive.
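
To make the work-space consideration concrete, a separate restart job with its own copy of MOM6-examples might be sketched as follows. Everything except the GitLab CI keywords is an assumption: the job name, the `work-restarts` directory, the `output` sub-directory, and the script path are hypothetical, and how `results/` is kept available to later jobs is not shown here.

```yaml
# Hypothetical sketch: names and paths are illustrative assumptions only.
run-restarts:
  stage: run
  needs: ["compile-gnu"]       # restart tests use the gnu executable only
  script:
    # Private copy of the work space so output cannot collide with the
    # other run jobs that execute at the same time.
    - cp -r MOM6-examples work-restarts
    - ./ci/run_restart_tests.sh work-restarts   # placeholder test command
    # Keep this group's output in its own sub-directory under results/.
    - mkdir -p results/restarts
    - cp -r work-restarts/output/. results/restarts/
```

Each additional job of this kind costs one more work-space copy and one more runner, which is the trade-off behind settling on four.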


codecov bot commented Jun 17, 2022

Codecov Report

Merging #150 (408939d) into dev/gfdl (1e9febe) will not change coverage.
The diff coverage is n/a.

❗ Current head 408939d differs from pull request most recent head f3867c5. Consider uploading reports for the commit f3867c5 to get more accurate results

@@            Coverage Diff            @@
##           dev/gfdl     #150   +/-   ##
=========================================
  Coverage     34.05%   34.05%           
=========================================
  Files           259      259           
  Lines         70126    70126           
  Branches      12984    12984           
=========================================
  Hits          23879    23879           
  Misses        41753    41753           
  Partials       4494     4494           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e9febe...f3867c5.

@Hallberg-NOAA (Member) left a comment

Fantastic!

@marshallward (Member) commented

Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/16054 ✔️

34 jobs for pr/150 in 32 minutes and 46 seconds (queued for 2 seconds)

@marshallward merged commit 12f2e55 into NOAA-GFDL:dev/gfdl on Jul 4, 2022
@adcroft deleted the new-pipes2 branch on June 26, 2023 17:46