Optimize CI #1310

Open
joshlf opened this issue May 19, 2024 · 0 comments
Labels
experience-medium (This issue is of medium difficulty, and requires some experience)
google-20%-project (Potential 20% project for a Google employee)
help wanted (Extra attention is needed)

Comments

@joshlf
Member

joshlf commented May 19, 2024

As of this writing, our CI tests (specified in .github/workflows/ci.yml) take ~5.5m to run end-to-end during PR development and ~19.5m to run end-to-end in the merge queue. This significantly affects developer velocity, especially when developing a sequence of stacked features (i.e., one PR needs to land before the next PR can be seriously considered).

This task tracks optimizing our end-to-end CI latency. Anything is on the table!

Both the PR latency and the merge queue latency are in scope. The PR latency is the more important metric, since PR tests may run multiple times during PR development. However, because GitHub has no automated way to merge a stack of PRs, we often have to actively watch the merge queue to know when we can kick off the next PR's merge. For this reason, merge queue latency is important as well.

Advice

As of this writing, we skip 5 out of 7 build targets and all Miri tests during PR development. Thus, the merge queue CI tests have somewhat different performance characteristics than PR CI tests.

In my own investigations, I've discovered the following:

  • In the merge queue, the bottleneck seems to be the build_test job, which encompasses the primary test matrix (there are other ancillary jobs such as kani, check_fmt, etc.; these do not appear to be the bottleneck)
  • Among individual matrix jobs, the distribution of times appears to be highly bimodal:
    • Most matrix jobs take ~1-2m to complete
    • Some matrix jobs take ~13m to complete
    • What distinguishes the two appears to be Miri tests, which are run only in the latter (~13m) group
  • It also seems to take a few minutes just to spawn all of the ~200 jobs in the matrix (before they start executing)
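
To make the observations above concrete, here is a rough sketch of why the slow (Miri) jobs dominate. It assumes a simplified model that is not taken from the actual CI setup: all matrix jobs run in parallel once spawned, so end-to-end time is roughly the spawn delay plus the slowest job. The job counts, the 3-minute spawn delay, and the function name are all hypothetical and for illustration only.

```python
# Simplified model (an assumption): every matrix job runs in parallel once
# spawned, so wall-clock time is about spawn overhead plus the slowest job.
def end_to_end_minutes(job_minutes, spawn_overhead_minutes):
    return spawn_overhead_minutes + max(job_minutes)

# Illustrative numbers only, loosely based on the rough figures above.
fast_jobs = [1.5] * 180  # hypothetical count of ~1-2m jobs
miri_jobs = [13.0] * 20  # hypothetical count of ~13m Miri jobs
spawn = 3.0              # hypothetical "few minutes" of spawn delay

baseline = end_to_end_minutes(fast_jobs + miri_jobs, spawn)

# Halving the fast jobs does not change the total under this model...
faster_fast_jobs = end_to_end_minutes([0.75] * 180 + miri_jobs, spawn)

# ...while halving the Miri jobs does.
faster_miri = end_to_end_minutes(fast_jobs + [6.5] * 20, spawn)

print(baseline, faster_fast_jobs, faster_miri)
```

Under this model, only the slowest group and the spawn delay are worth optimizing. The real pipeline may have queuing or runner limits that this sketch ignores.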

We've already done some work to speed up Miri test execution (recently, #1307, #1308, and #1313). There is probably a lot more that could be done there.

There are probably also a lot of other optimization opportunities besides Miri; I just haven't taken the time to investigate in detail.

See also: #1312, #1314

Failed attempts

I tried these, but found no speedup, or wasn't able to get them working:

joshlf added the help wanted, experience-medium, and google-20%-project labels May 19, 2024
joshlf added a commit that referenced this issue May 19, 2024
joshlf added a commit that referenced this issue May 19, 2024
Comparing [1] (run with the parent commit) and [2] (run with this
commit), we see an overall speedup of 6m54s -> 5m36s, or ~19%. These
gains will only be realized during PR development; the CI test execution
time in the merge queue will remain unchanged.

Makes progress on #1310

[1] https://github.com/google/zerocopy/actions/runs/9149561660
[2] https://github.com/google/zerocopy/actions/runs/9149620991?pr=1314
github-merge-queue bot pushed a commit that referenced this issue May 19, 2024
* [ci] Only run Miri tests in merge queue

Comparing [1] (run with the parent commit) and [2] (run with this
commit), we see an overall speedup of 19m33s -> 6m53s, or ~65%. These
gains will only be realized during PR development; the CI test execution
time in the merge queue will remain unchanged.

[1] https://github.com/google/zerocopy/actions/runs/9149347472
[2] https://github.com/google/zerocopy/actions/runs/9149505999?pr=1313

* [ci] Test some targets only in the merge queue

Comparing [1] (run with the parent commit) and [2] (run with this
commit), we see an overall speedup of 6m54s -> 5m36s, or ~19%. These
gains will only be realized during PR development; the CI test execution
time in the merge queue will remain unchanged.

Makes progress on #1310

[1] https://github.com/google/zerocopy/actions/runs/9149561660
[2] https://github.com/google/zerocopy/actions/runs/9149620991?pr=1314
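
The percentages quoted in the commit messages above can be reproduced with a short calculation. This is only a sketch for checking the arithmetic. The helper names are hypothetical, and it assumes durations are written in the `NmNs` form used in those messages.

```python
import re

def to_seconds(duration):
    """Parse a duration written like '19m33s' into whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds

def percent_reduction(before, after):
    """Percentage by which the run time dropped from `before` to `after`."""
    b, a = to_seconds(before), to_seconds(after)
    return 100.0 * (b - a) / b

# Durations copied from the commit messages above.
print(round(percent_reduction("19m33s", "6m53s")))  # 65
print(round(percent_reduction("6m54s", "5m36s")))   # 19
```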