front end caching #308

Closed · 7 tasks
nickdesaulniers opened this issue Feb 17, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@nickdesaulniers
Member

  • can we propagate info between builds?
    • propagate HEAD SHAs for the kernel tree
    • propagate HEAD SHAs or version info for LLVM (maybe hardcode for released versions)
    • propagate prior run build status (red/green)
  • at the start of a workflow, refetch SHAs
  • if no change to SHAs, set exit status to prior build status
  • rewrite latest SHAs for propagation
@nathanchance
Member

can we propagate info between builds?

I think so. There are plenty of actions that will automatically commit to the repository:

https://github.com/devops-infra/action-commit-push
https://github.com/stefanzweifel/git-auto-commit-action
https://github.com/EndBug/add-and-commit

I do not see a way to do this without committing to the repository: we cannot access artifacts across runs ("Note: You can only download artifacts in a workflow that were uploaded during the same workflow run.") and I do not think we can write environment variables across runs (at least, I have not found a way to in their documentation).

We could potentially write JSON or YAML with the compiler version, kernel version, and build status from the previous run, then parse that when a new run starts.
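
For illustration, a minimal sketch of what writing and parsing such a state file could look like (the file name and field names here are hypothetical, not anything this repository actually uses):

    import json
    from pathlib import Path

    STATE_FILE = Path("last_run.json")  # hypothetical per-workflow file

    def write_state(compiler: str, kernel_sha: str, status: str) -> None:
        # Record what the current run built so the next run can compare against it.
        STATE_FILE.write_text(json.dumps({
            "compiler": compiler,
            "kernel_sha": kernel_sha,
            "status": status,  # e.g. "pass" or "fail"
        }, indent=4) + "\n")

    def read_state() -> dict | None:
        # Returns None when there is no previous run to compare against.
        if not STATE_FILE.exists():
            return None
        return json.loads(STATE_FILE.read_text())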

Some initial thoughts:

  • Will we run into issues where multiple workflows attempt to update their status files at the same time (git push will fail)? We could potentially run git pull -r before pushing? It looks like the last action I linked above has support for this.
  • To get the current compiler version, we will need to use the TuxMake container, like we do in clang-version.yml.
  • git ls-remote <REPO> <BRANCH> | awk '{print $1}' to get the latest SHA without cloning (a Python equivalent is sketched below):
    $ git ls-remote https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git linux-4.9.y | awk '{print $1}'
    9279031d74f8fe8760ce32ac527bc4658b578926
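
In a Python helper, the same lookup could be done without awk; a small sketch (not existing code in this repository):

    import subprocess

    def remote_sha(repo: str, branch: str) -> str:
        # Equivalent to `git ls-remote <REPO> <BRANCH> | awk '{print $1}'`:
        # ls-remote prints "<sha>\t<ref>" per line, so take the first field.
        output = subprocess.run(
            ["git", "ls-remote", repo, branch],
            capture_output=True, check=True, text=True,
        ).stdout
        return output.split()[0]

    print(remote_sha(
        "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
        "linux-4.9.y",
    ))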
    

@nathanchance nathanchance added the enhancement New feature or request label Aug 31, 2022
@nathanchance
Member

Talking with @kees about this during our weekly meeting, we think this is doable by giving every workflow file its own branch and information file, which eliminates the concern I had over multiple workflows updating one branch.

Rough idea for a do-we-need-to-run.py (or whatever we want to do), sketched in code below the list:

  • Does the workflow's branch exist in the repository?
    • If yes, clone the repo, checkout the branch, and read the file.
    • If no, create it, generate the information file, and exit with "we do need to build"
  • Generate the new information file, including the current:
    • Kernel SHA
    • Compiler version string
  • Has either item changed since the last run?
    • If yes, update the file, commit it, push it, and exit "we need to build"
    • If no, exit "we do not need to build"
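
A minimal, self-contained sketch of that flow (all file names, repo/branch values, and the compiler string are placeholders; the real script would get the compiler version from the TuxMake container and push the updated file to the per-workflow branch):

    #!/usr/bin/env python3
    # Hypothetical sketch only, not the actual script.
    import json
    import subprocess
    import sys
    from pathlib import Path

    INFO_FILE = Path("last_run.json")  # hypothetical per-workflow information file

    def current_state(kernel_repo: str, kernel_branch: str, compiler: str) -> dict:
        # Latest remote SHA without cloning, as discussed above.
        sha = subprocess.run(
            ["git", "ls-remote", kernel_repo, kernel_branch],
            capture_output=True, check=True, text=True,
        ).stdout.split()[0]
        return {"kernel_sha": sha, "compiler": compiler}

    def main() -> int:
        # "Does the workflow's branch exist?" is approximated here by checking
        # whether the information file is present on the checked-out branch.
        previous = json.loads(INFO_FILE.read_text()) if INFO_FILE.exists() else None

        new = current_state(
            "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
            "linux-4.9.y",
            "clang version 17.0.0",  # placeholder; would come from the TuxMake container
        )

        if previous == new:
            print("we do not need to build")
        else:
            INFO_FILE.write_text(json.dumps(new, indent=4) + "\n")
            # Committing and pushing the updated file would happen here,
            # e.g. via one of the auto-commit actions linked earlier.
            print("we need to build")
        return 0  # the exit-code convention for skip vs. build is up to the workflow

    if __name__ == "__main__":
        sys.exit(main())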

I am not sure propagating the build status is necessary, since that flow does not seem to care about the result of the last build. I guess we would potentially want to retry builds that failed the previous time they were run in case things were flaky, but it is possible we should just be re-running workflows manually in those cases and ignoring this check, which would simplify the overall implementation because we would not have to wait for the build results to update the file.

nathanchance added a commit to nathanchance/continuous-integration2 that referenced this issue Feb 26, 2023
There are times when neither the kernel revision nor the compiler has
changed since the last run, which means there is little point in running the
build because it is unlikely anything would change (except due to
flakiness with GitHub Actions, more on that later). While tuxsuite does
have compiler caching enabled, it still relies on spinning up all the
build machines, letting the cache work, then firing off the boot tests,
which can be expensive.

By caching the compiler string and the kernel revision, it is possible
to avoid even spinning up the tuxsuite jobs if nothing has changed.
should_run.py takes care of this by exiting:

    * 0 if the build should run (something changed or there were no
      previous results)
    * 1 for any internal assertion failure
    * 2 if nothing changed from the previous run

If there is any internal assertion failure, the cache step fails,
failing the whole workflow, so that those situations can be properly
dealt with. The script should never exit this way, but it is possible
there are situations I have not considered yet.

To avoid contention with pushing and pulling, each should_run.py job
gets its own branch. While this will result in a lot of branches, they
should not cause too many issues because they will just contain the JSON
file with the last run info.

should_run.py does not try to account for flakiness on the GitHub
Actions or TuxSuite side, so it is possible that a previous build will
fail due to flakiness and not be retried on the next cron if nothing
changes since then. To attempt to account for a situation where we
re-run a known flaky build, the script gets out of the way when the
GitHub Actions event is "workflow_dispatch", meaning a workflow was
manually run. Additionally, if the previous run was only flaky during
the QEMU boots (rather than during the TuxSuite stage), they can
generally just be re-run right away, since the kernels do not need to be
rebuilt. I do not think this will happen too often but if it does, we
can try to come up with a better heuristic.

Closes: ClangBuiltLinux#308
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
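
To make the exit-code convention above concrete, a rough illustrative sketch (this is not the contents of should_run.py; the helper and state shapes are hypothetical):

    import os

    SHOULD_RUN, INTERNAL_ERROR, NOTHING_CHANGED = 0, 1, 2

    def decide(previous: dict | None, current: dict) -> int:
        # Manually dispatched runs bypass the cache so known-flaky builds can be retried.
        if os.environ.get("GITHUB_EVENT_NAME") == "workflow_dispatch":
            return SHOULD_RUN
        if previous is None or previous != current:
            return SHOULD_RUN
        return NOTHING_CHANGED

    # The script would call sys.exit(decide(...)); exit code 1 is reserved for
    # unexpected internal failures so that the workflow itself goes red.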
@nathanchance
Member

Initial take based on my comment above: #522

nathanchance added a commit to nathanchance/continuous-integration2 that referenced this issue Feb 28, 2023

@JustinStitt
Contributor

JustinStitt commented Nov 9, 2023

Here's a one-month data dump regarding potential caching opportunities:

This covers October 8th to November 8th, 2023.

tl;dr: 17.1% of builds may have been skippable (assuming Kernel SHA and full compiler version are the only things that affect a build's outcome per job).

https://docs.google.com/spreadsheets/d/1Ag_N3kXYrBrBAq1VuXJ6ZAbBY8tHnWOaGV81VwaoGwk/edit?usp=sharing&resourcekey=0-OWVX-KZdJx23OOGn4nC-Cg

Note

FWIW, the actual total number of jobs run in that month was 39874, and ./scripts/estimate-builds.py said we would have 9425 per week. Multiply that by 4-ish and you get 39585. So hey, that script is pretty good 😄

@nickdesaulniers
Member Author

Wow, thanks for running those statistics! Some insights I get from the data (sort the sheet by column F):

  • we're way overbuilding arm64-fixes. That branch is not changing that often, so we would have had ~95% cache hits (if we had caching). Until we do, I think it makes immediate sense to turn down the frequency of arm64-fixes builds.
  • 5.10 (at least for Android and CrOS) and newer are changing pretty rapidly; the opportunities for caching are low.
  • android-4.14 and android12-5.4 should be built less frequently.

No matter what, caching would be a win.

So hey, that script is pretty good 😄

👍

@nathanchance
Member

  • we're way overbuilding arm64-fixes. That branch is not changing that often, so we would have had ~95% cache hits (if we had caching). Until we do, I think it makes immediate sense to turn down the frequency of arm64-fixes builds.

To be honest, it probably makes sense to just rely on -next testing for this one and delete its builds outright, in my opinion. Doing the build once a week makes little sense to me. The branch gets included in the fixes side of -next, and the likelihood of a change being added there that breaks LLVM and getting fast-tracked to Linus without us noticing from -next seems fairly low, as the arm64 folks are usually pretty good about holding fixes until they have been in -next for a bit.

  • android-4.14 and android12-5.4 should be built less frequently.

4.14 goes EOL in January, so, much agreed. We could reduce the LLVM ToT/stable builds of these trees to once a week, which would significantly cut down their total number of builds.

No matter what, caching would be a win.

Agreed. Maybe we sit down this weekend and think through implementation details?

@JustinStitt
Contributor

Agreed. Maybe we sit down this weekend and think through implementation details?

I'm up for it! I think we can copy KernelCI's architecture: https://kernelci.org/docs/api/pipeline-details/

tl;dr: a dispatcher system where a central dispatch service polls forever; when a new version is detected, a job is dispatched. No cron scheduling; no wasted builds.
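
A toy sketch of such a poll-and-dispatch loop (everything here is hypothetical: the polling interval, the workflow file name, and dispatching via the gh CLI; it is not KernelCI's pipeline or what ultimately got implemented):

    import subprocess
    import time

    def remote_sha(repo: str, branch: str) -> str:
        # Same ls-remote trick as earlier in the thread.
        out = subprocess.run(["git", "ls-remote", repo, branch],
                             capture_output=True, check=True, text=True).stdout
        return out.split()[0]

    def dispatch_loop(repo: str, branch: str, workflow: str, poll_seconds: int = 600) -> None:
        last_seen = None
        while True:
            sha = remote_sha(repo, branch)
            if sha != last_seen:
                # Trigger the corresponding workflow_dispatch-enabled workflow.
                subprocess.run(["gh", "workflow", "run", workflow], check=True)
                last_seen = sha
            time.sleep(poll_seconds)

    # Example (hypothetical workflow file name):
    # dispatch_loop("https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
    #               "linux-4.9.y", "4.9-clang-17.yml")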

@nathanchance
Member

Implemented in #664 :)
