gitlab-ci: add concurrent jobs in run stage #150
- This commit splits the run stage (~40 mins) into four smaller jobs.
- Prior to this commit, typical turnaround for a pipeline was ~1 hour, but two consecutive tests of this refactoring finished in 23 minutes.
- The old run stage used all executables in one run script, so it could not start until the pgi executable was ready, even though the gnu executable was ready 10 minutes earlier.
- Breaking the run stage into tests grouped by compiler allows some "tetris" to be played to minimize wait time between jobs.
- Implemented by making four copies of MOM6-examples: three to allow concurrency across the three compilers (gnu, intel, pgi), and a fourth for restart tests (gnu only).
- The results are copied into sub-directories under results/ for later comparison; tar files are no longer used for caching output.
- Added "needs:" so jobs can start as soon as their dependency is ready.
- Re-ordered jobs in the .gitlab-ci.yml files so that the slowest compilation (pgi) starts first.

Considerations:
- We can't run two tests in the same directory at the same time because their output would collide. The old CI therefore launched tests of all experiments/configurations concurrently, but cycled through each group of tests (compilers, layout, etc.) sequentially, copying the output and reusing the same work space. Making copies of the work space is slow, and running more concurrent jobs requires more nodes to be available at once, so four jobs was found to be optimal for gaea under the current workload.
- We only have six runners (on the six compilation nodes), which limits the pipeline to six jobs at once. Allowing multiple jobs per runner could remove this limitation but would have a greater impact on the system.
- The restart testing is the slowest section of the run stage, even though it covers only a subset of experiments. Separating the restart tests out allows more concurrency. Doing restart tests for more experiments and all compilers would be very expensive.
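For readers unfamiliar with "needs:", the dependency structure described above might look roughly like the following fragment. This is only a sketch, not the actual .gitlab-ci.yml from this PR: the stage names, job names (`build-pgi`, `run-gnu`, etc.) and the `./build.sh` / `./run_tests.sh` scripts are hypothetical placeholders. Only the `stages`, `stage`, `script` and `needs` keywords are standard GitLab CI.

```yaml
# Illustrative sketch only; all job names and scripts below are hypothetical.
stages:
  - builds
  - run

# Compilation jobs, with the slowest compiler (pgi) listed first.
build-pgi:
  stage: builds
  script:
    - ./build.sh pgi        # hypothetical build script

build-intel:
  stage: builds
  script:
    - ./build.sh intel

build-gnu:
  stage: builds
  script:
    - ./build.sh gnu

# Run jobs: each one names only the build it depends on, so it can start
# as soon as that build finishes instead of waiting for the whole stage.
run-gnu:
  stage: run
  needs: ["build-gnu"]
  script:
    - ./run_tests.sh gnu    # hypothetical; runs in its own copy of the work space

run-intel:
  stage: run
  needs: ["build-intel"]
  script:
    - ./run_tests.sh intel

run-pgi:
  stage: run
  needs: ["build-pgi"]
  script:
    - ./run_tests.sh pgi

# Restart tests use the gnu executable only, in a fourth work space.
run-restart:
  stage: run
  needs: ["build-gnu"]
  script:
    - ./run_tests.sh restart
```

With this layout, `run-gnu` and `run-restart` would no longer have to wait for the pgi build, which is the source of the roughly 10 minutes of idle time mentioned above.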
Codecov Report
@@            Coverage Diff            @@
##           dev/gfdl    #150    +/-   ##
=========================================
  Coverage     34.05%   34.05%
=========================================
  Files           259      259
  Lines         70126    70126
  Branches      12984    12984
=========================================
  Hits          23879    23879
  Misses        41753    41753
  Partials       4494     4494
Fantastic!
Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/16054 ✔️