gitlab-ci: add concurrent jobs in run stage #150

adcroft · 2022-06-17T19:15:14Z

- Splits the run stage (~40 mins) into four smaller jobs.
- Prior to this, typical turnaround for a pipeline was over 1 hour, but two consecutive tests of this refactoring finished in 23 minutes.
- The old run stage used all executables in one run script, so it could not start until the pgi executable was ready, even though the gnu executable was ready 10 minutes earlier.
- Breaking the run stage into tests grouped by compiler allows some "tetris" to be played to minimize wait time between jobs.
- Implemented by making four copies of MOM6-examples: three to allow concurrency across the compilers (gnu, intel, pgi), and a fourth for restart tests (gnu only).
- The results are copied into sub-directories under results/ for later comparison; tar files are no longer used for caching output.
- Added "needs:" so jobs can start as soon as their dependency is ready.
- Re-ordered jobs in the .gitlab-ci.yml files so that the slowest compilation (pgi) starts first.
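
As a rough illustration of the "needs:" change, here is a minimal sketch of how a run job can be tied to a single build job instead of waiting for the whole build stage. The stage names, job names, and script lines are hypothetical placeholders and are not copied from the actual .gitlab-ci.yml; only the `stage:`, `needs:`, and `script:` keywords are standard GitLab CI.

```yaml
# Sketch only: stage names, job names, and script lines are hypothetical.
stages:
  - builds
  - run

# The slowest compilation is listed first, following the re-ordering described above.
build:pgi:
  stage: builds
  script:
    - echo "compile with pgi"    # placeholder for the real build command

build:gnu:
  stage: builds
  script:
    - echo "compile with gnu"    # placeholder for the real build command

# Without "needs:", this job would wait for every job in the builds stage.
# With it, the job can start as soon as build:gnu has succeeded.
run:gnu:
  stage: run
  needs: ["build:gnu"]
  script:
    - echo "run gnu tests"       # placeholder for the real test command

run:pgi:
  stage: run
  needs: ["build:pgi"]
  script:
    - echo "run pgi tests"       # placeholder for the real test command
```

In this layout the gnu tests no longer have to idle while the pgi build finishes, which is the wait the old single run script could not avoid.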

Considerations:

- We can't run two tests in the same directory at the same time because their output would collide. Therefore, the old CI would launch tests of all experiments/configurations concurrently but would cycle through each group of tests (compilers, layout, etc.) sequentially, copying the output and reusing the same work space. Making copies of the work space is slow, and running more concurrent jobs requires more nodes to be available at once, so four jobs was found to be the best choice for gaea and the current workload.
- We only have six runners (on the six compilation nodes), which limits the pipeline to six jobs at once. Allowing multiple jobs per runner could remove this limitation but would put more load on the system.
- The restart testing is the slowest section of the run stage (even though it covers only a subset of experiments). Separating restarts out allows more concurrency. Doing restart tests for more experiments and all compilers would be very expensive.
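
The first consideration (no two tests sharing a directory) could look roughly like the fragment below, where a run job works in its own copy of MOM6-examples and leaves its output under results/. This is a sketch, not the actual job definition: the job name, the `WORK_DIR` variable, the directory names, and the echo placeholders are all hypothetical, and it assumes a `run` stage and a `build:intel` job are defined elsewhere in the file.

```yaml
# Sketch only: job name, variable name, directories, and commands are hypothetical.
run:intel:
  stage: run
  needs: ["build:intel"]
  variables:
    WORK_DIR: work-intel    # private copy, so output cannot collide with other jobs
  script:
    - cp -r MOM6-examples "$WORK_DIR"
    - echo "run intel tests inside $WORK_DIR"    # placeholder for the real test command
    - mkdir -p results/intel
    - echo "copy outputs from $WORK_DIR into results/intel for later comparison"
```

Each additional concurrent job pays for one more copy of the work space, which is the cost weighed against concurrency above.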

codecov · 2022-06-17T19:24:43Z

Codecov Report

Merging #150 (408939d) into dev/gfdl (1e9febe) will not change coverage.
The diff coverage is n/a.

❗ Current head 408939d differs from pull request most recent head f3867c5. Consider uploading reports for the commit f3867c5 to get more accurate results

@@            Coverage Diff            @@
##           dev/gfdl     #150   +/-   ##
=========================================
  Coverage     34.05%   34.05%           
=========================================
  Files           259      259           
  Lines         70126    70126           
  Branches      12984    12984           
=========================================
  Hits          23879    23879           
  Misses        41753    41753           
  Partials       4494     4494

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e9febe...f3867c5.

Hallberg-NOAA

Fantastic!

marshallward · 2022-07-04T15:55:13Z

Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/16054 ✔️

34 jobs for pr/150 in 32 minutes and 46 seconds (queued for 2 seconds)

Hallberg-NOAA approved these changes Jun 20, 2022

Merge branch 'dev/gfdl' into new-pipes2

f3867c5

marshallward merged commit 12f2e55 into NOAA-GFDL:dev/gfdl Jul 4, 2022

marshallward mentioned this pull request Jul 21, 2022

GFDL to main 2022-07-21 mom-ocean/MOM6#1577

Merged

adcroft deleted the new-pipes2 branch June 26, 2023 17:46
