front end caching #308
Comments
I think so. There are plenty of actions that will automatically commit to the repository, such as https://github.com/devops-infra/action-commit-push. I do not see a way to do this without committing to the repository: we cannot access artifacts across runs ("Note: You can only download artifacts in a workflow that were uploaded during the same workflow run.") and I do not think we can write environment variables (at least, I have not found a way to do so in their documentation). We could potentially write JSON or YAML with the compiler version, kernel version, and build status from the previous run, then parse that when a new run starts. Some initial thoughts:
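As a minimal sketch of what such a previous-run file and its handling could look like (assuming a JSON format; the file name and field names are hypothetical, not a settled format):

```python
import json
from pathlib import Path

INFO_FILE = Path("last_run_info.json")  # hypothetical file name


def save_run_info(compiler_version, kernel_sha, build_status):
    # Record what this run saw so the next run can compare against it.
    INFO_FILE.write_text(json.dumps({
        "compiler_version": compiler_version,
        "kernel_sha": kernel_sha,
        "build_status": build_status,
    }, indent=2))


def load_run_info():
    # No file means there is no previous run to compare against.
    if not INFO_FILE.exists():
        return None
    return json.loads(INFO_FILE.read_text())
```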
Talking with @kees about this during our weekly meeting, we think this is doable by giving every workflow file its own branch and information file, which eliminates the concern I had over multiple workflows updating one branch. Rough idea for a
I am not sure propagating the build status is necessary, since that flow does not seem to care about the result of the last build. I guess we would potentially want to retry builds that failed the previous time they were run in case things were flaky, but it is possible we should just be re-running workflows manually in those cases and ignoring this check, which would simplify the overall implementation because we would not have to wait for the build results to update the file.
There are times where neither the kernel revision nor the compiler has changed since the last run, which means there is little point running the build because it is unlikely anything would change (except due to flakiness with GitHub Actions, more on that later). While tuxsuite does have compiler caching enabled, it still relies on spinning up all the build machines, letting the cache work, then firing off the boot tests, which can be expensive. By caching the compiler string and the kernel revision, it is possible to avoid even spinning up the tuxsuite jobs if nothing has changed.

should_run.py takes care of this by exiting:

* 0 if the build should run (something changed or there were no previous results)
* 1 for any internal assertion failure
* 2 if nothing changed from the previous run

If there is any internal assertion failure, the cache step fails, failing the whole workflow, so that those situations can be properly dealt with. The script should never do this, but it is possible there are things happening that I have not considered yet.

To avoid contention with pushing and pulling, each should_run.py job gets its own branch. While this will result in a lot of branches, they should not cause too many issues because they will just contain the JSON file with the last run info.

should_run.py does not try to account for flakiness on the GitHub Actions or TuxSuite side, so it is possible that a previous build will fail due to flakiness and not be retried on the next cron if nothing has changed since then. To account for the situation where we want to re-run a known flaky build, the script gets out of the way when the GitHub Actions event is "workflow_dispatch", meaning a workflow was manually run. Additionally, if the previous run was only flaky during the QEMU boots (rather than during the TuxSuite stage), they can generally just be re-run right away, since the kernels do not need to be rebuilt. I do not think this will happen too often, but if it does, we can try to come up with a better heuristic.

Closes: ClangBuiltLinux#308
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
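For readers skimming the thread, a bare-bones sketch of that exit-code contract might look like the following; the real should_run.py's layout, checks, and environment handling are not reproduced here, and the file name and environment variable names below are assumptions.

```python
#!/usr/bin/env python3
# Sketch of the exit-code contract described above, not the actual should_run.py.
import json
import os
import sys
from pathlib import Path

INFO_FILE = Path("last_run_info.json")  # assumed name for the cached info file


def main():
    # A manually triggered run ("workflow_dispatch") always proceeds, so a
    # flaky previous run can be retried by hand.
    if os.environ.get("GITHUB_EVENT_NAME") == "workflow_dispatch":
        sys.exit(0)

    current = {
        "compiler_version": os.environ.get("COMPILER_VERSION"),  # assumed variable
        "kernel_sha": os.environ.get("KERNEL_SHA"),              # assumed variable
    }
    if not all(current.values()):
        # Internal assertion failure: we cannot tell what we are building.
        sys.exit(1)

    if not INFO_FILE.exists():
        sys.exit(0)  # no previous result, so run the build

    previous = json.loads(INFO_FILE.read_text())
    unchanged = (
        previous.get("compiler_version") == current["compiler_version"]
        and previous.get("kernel_sha") == current["kernel_sha"]
    )
    # 2 means "nothing changed, skip the build"; 0 means "run the build".
    sys.exit(2 if unchanged else 0)


if __name__ == "__main__":
    main()
```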
Initial take based on my comment above: #522
Here's a one-month data dump regarding potential caching opportunities, covering October 8th to November 8th, 2023. tl;dr: 17.1% of builds may have been skippable (assuming the kernel SHA and full compiler version are the only things that affect a build's outcome per job). Note: FWIW, the actual total number of jobs run in a month is
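The "skippable" classification above boils down to a simple pass over the job history; the record format below is made up for illustration and is not taken from the actual data dump.

```python
# Hypothetical job history records: (job_name, kernel_sha, compiler_version),
# in chronological order; the real data dump's format is not reproduced here.
jobs = [
    ("mainline-llvm-17-x86_64", "abc123", "clang 17.0.3"),
    ("mainline-llvm-17-x86_64", "abc123", "clang 17.0.3"),  # skippable repeat
    ("mainline-llvm-17-x86_64", "def456", "clang 17.0.3"),
]

last_seen = {}
skippable = 0
for name, sha, compiler in jobs:
    # A job is skippable if neither the kernel SHA nor the compiler changed
    # since the last time this particular job ran.
    if last_seen.get(name) == (sha, compiler):
        skippable += 1
    last_seen[name] = (sha, compiler)

print(f"{skippable}/{len(jobs)} jobs ({skippable / len(jobs):.1%}) may have been skippable")
```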
Wow, thanks for running those statistics! Some insights I get from the data (sort the sheet by column F):
No matter what, caching would be a win.
👍
To be honest, it probably makes sense to just rely on -next testing for this one and delete its builds outright, in my opinion. Doing the build once a week makes little sense to me. It gets included in the fixes side of -next, and the likelihood of a change getting added there that breaks LLVM and is fast-tracked to Linus without us noticing from -next seems fairly low, as the arm64 folks are usually pretty good about holding fixes until they have been in -next for a bit.
4.14 goes EOL in January, so much agreed. We could reduce the LLVM ToT/stable builds of these trees to once a week, which would significantly reduce the number of builds of those trees.
Agreed. Maybe we sit down this weekend and think through implementation details?
I'm up for it! I think we can copy KernelCI's architecture (https://kernelci.org/docs/api/pipeline-details/). tl;dr: a dispatcher system where a central dispatch service just spins forever until a new version is detected, at which point a job is dispatched. No cron scheduling; no wasted builds.
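A bare-bones sketch of that kind of dispatcher loop, assuming a polling design; the poll interval, tree list, and dispatch step are all placeholders, not KernelCI's actual pipeline:

```python
import subprocess
import time

POLL_INTERVAL = 600  # seconds between polls; placeholder value

# Hypothetical mapping of tree names to git URLs; not the actual tree list.
TREES = {
    "mainline": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git",
}


def latest_revision(url):
    # `git ls-remote <url> HEAD` prints "<sha>\tHEAD"; take the SHA.
    output = subprocess.check_output(["git", "ls-remote", url, "HEAD"], text=True)
    return output.split()[0]


def dispatch_build(tree, sha):
    # Placeholder for kicking off the TuxSuite build / boot-test pipeline.
    print(f"dispatching {tree} at {sha}")


def dispatcher():
    # Spin forever; only dispatch when a tree's tip actually moves, so there
    # is no cron schedule and no wasted builds.
    seen = {}
    while True:
        for tree, url in TREES.items():
            sha = latest_revision(url)
            if seen.get(tree) != sha:
                dispatch_build(tree, sha)
                seen[tree] = sha
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    dispatcher()
```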
Implemented in #664 :)