Optimize CI #1310

joshlf · 2024-05-19T17:05:13Z

As of this writing, our CI tests (specified in .github/workflows/ci.yml) take ~5.5m to run end-to-end during PR development and ~19.5m to run end-to-end in the merge queue. This significantly affects developer velocity, especially when developing a sequence of features which stack (i.e., one PR needs to land before the next PR can be seriously considered).

This task tracks optimizing our end-to-end CI latency. Anything is on the table!

Note that both the PR latency and the merge queue latency are on the table. The PR latency is obviously the more important metric, since PR tests may run multiple times during PR development. However, given that GitHub has no automated way to merge a stack of PRs, we often have to actively keep an eye on the merge queue in order to know when we can kick off the next PR's merge. For this reason, merge queue latency is important as well.

Advice

As of this writing, we skip 5 out of 7 build targets and all Miri tests during PR development. Thus, the merge queue CI tests have somewhat different performance characteristics than PR CI tests.

In my own investigations, I've discovered the following:

- In the merge queue, the bottleneck seems to be the build_test job, which encompasses the primary test matrix (there are other ancillary jobs such as kani, check_fmt, etc.; these do not appear to be the bottleneck)
- Among individual matrix jobs, the distribution of times appears to be highly bimodal:
  - Most matrix jobs take ~1-2m to complete
  - Some matrix jobs take ~13m to complete
  - What distinguishes the two appears to be Miri tests, which are run only in the latter (~13m) group
- It also seems to take a few minutes just to spawn all of the ~200 jobs in the matrix (before they start executing)
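To illustrate why the Miri jobs dominate, here is a rough sketch of the latency model implied by the observations above: if matrix jobs run in parallel, end-to-end time is roughly the spawn overhead plus the slowest job. The job counts, per-job durations, and the 3-minute spawn overhead are hypothetical placeholders chosen to resemble the figures above, not measurements, and the model ignores runner concurrency limits and the ancillary jobs.

```python
def estimate_wall_clock(job_minutes, spawn_overhead_minutes):
    """Rough model: every matrix job starts after a fixed spawn delay
    and all jobs then run fully in parallel (an assumption)."""
    return spawn_overhead_minutes + max(job_minutes)


# Hypothetical numbers, loosely based on the observations in this issue.
fast_jobs = [1.5] * 190   # jobs without Miri (~1-2m each)
miri_jobs = [13.0] * 10   # jobs that run Miri (~13m each)
spawn = 3.0               # "a few minutes" to spawn the matrix

with_miri = estimate_wall_clock(fast_jobs + miri_jobs, spawn)
without_miri = estimate_wall_clock(fast_jobs, spawn)

print(f"with Miri:    ~{with_miri}m")
print(f"without Miri: ~{without_miri}m")
```

Under this model, making the ~1-2m jobs faster does not change the total at all while any ~13m job remains; only the slowest job and the spawn overhead matter.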

We've already done some work to speed up Miri test execution (recently, #1307, #1308, and #1313). There is probably a lot more that could be done there.

There are probably also a lot of other optimization opportunities besides Miri; I just haven't taken the time to investigate in detail.

Failed attempts

I tried these, but found no speedup, or wasn't able to get them working:

- [ci] Run Miri test alias models in parallel (#1311): no measurable speedup
- [ci] Run on the ubuntu-20.04-64core runner (#1309): confusing build failures


* [ci] Only run Miri tests in merge queue

  Comparing [1] (run with the parent commit) and [2] (run with this commit), we see an overall speedup of 19m33s -> 6m53s, or ~65%. These gains will only be realized during PR development; the CI test execution time in the merge queue will remain unchanged.

  [1] https://github.com/google/zerocopy/actions/runs/9149347472
  [2] https://github.com/google/zerocopy/actions/runs/9149505999?pr=1313

* [ci] Test some targets only in the merge queue

  Comparing [1] (run with the parent commit) and [2] (run with this commit), we see an overall speedup of 6m54s -> 5m36s, or ~19%. These gains will only be realized during PR development; the CI test execution time in the merge queue will remain unchanged.

  Makes progress on #1310

  [1] https://github.com/google/zerocopy/actions/runs/9149561660
  [2] https://github.com/google/zerocopy/actions/runs/9149620991?pr=1314
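The percentages quoted in these commit messages can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names `to_seconds` and `percent_reduction` are made up for illustration, and the parser only handles the `XmYs` format used above.

```python
def to_seconds(duration):
    """Parse a duration such as '19m33s' into a number of seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def percent_reduction(before, after):
    """Percentage by which wall-clock time dropped from `before` to `after`."""
    b, a = to_seconds(before), to_seconds(after)
    return 100 * (b - a) / b


# Figures taken from the two commit messages above.
miri_change = round(percent_reduction("19m33s", "6m53s"))
targets_change = round(percent_reduction("6m54s", "5m36s"))

print(miri_change)     # 65
print(targets_change)  # 19
```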

joshlf added the help wanted, experience-medium, and google-20%-project labels on May 19, 2024

joshlf added a commit that referenced this issue May 19, 2024

[ci] Test some targets only in the merge queue

7c33a64

Makes progress on #1310

joshlf mentioned this issue May 19, 2024

[ci] Test some targets only in the merge queue #1314

Merged
