front end caching #308

Closed · 7 tasks
nickdesaulniers opened this issue Feb 17, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@nickdesaulniers
Member

  • can we propagate info between builds?
    • propagate HEAD SHAs for the kernel tree
    • propagate HEAD SHAs or version info for LLVM (maybe hardcode for released versions)
    • propagate prior run build status (red/green)
  • at the start of a workflow, refetch SHAs
  • if no change to SHAs, set exit status to prior build status
  • rewrite latest SHAs for propagation
@nathanchance
Member

can we propagate info between builds?

I think so. There are plenty of actions that will automatically commit to the repository:

https://github.com/devops-infra/action-commit-push
https://github.com/stefanzweifel/git-auto-commit-action
https://github.com/EndBug/add-and-commit

I do not see a way to do this without committing to the repository: we cannot access artifacts across runs ("Note: You can only download artifacts in a workflow that were uploaded during the same workflow run.") and I do not think we can write environment variables across runs (at least, I have not found a way to in their documentation).

We could potentially write JSON or YAML with the compiler version, kernel version, and build status from the previous run, then parse that when a new run starts.
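
For illustration, a minimal sketch of what writing and parsing such a state file could look like (the file name and field names here are hypothetical, not anything this repository actually uses):

    import json
    from pathlib import Path

    STATE_FILE = Path("last_run.json")  # hypothetical per-workflow file

    def write_state(compiler: str, kernel_sha: str, status: str) -> None:
        # Record what the current run built so the next run can compare against it.
        STATE_FILE.write_text(json.dumps({
            "compiler": compiler,
            "kernel_sha": kernel_sha,
            "status": status,  # e.g. "pass" or "fail"
        }, indent=4) + "\n")

    def read_state() -> dict | None:
        # Returns None when there is no previous run to compare against.
        if not STATE_FILE.exists():
            return None
        return json.loads(STATE_FILE.read_text())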

Some initial thoughts:

  • Will we run into issues where multiple workflows attempt to update their status files at the same time (git push will fail)? We could potentially run git pull -r before pushing? It looks like the last action I linked above has support for this.
  • To get the current compiler version, we will need to use the TuxMake container, like we do in clang-version.yml.
  • git ls-remote <REPO> <BRANCH> | awk '{print $1}' to get the latest SHA without cloning (a Python equivalent is sketched below):
    $ git ls-remote https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git linux-4.9.y | awk '{print $1}'
    9279031d74f8fe8760ce32ac527bc4658b578926
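
In a Python helper, the same lookup could be done without awk; a small sketch (not existing code in this repository):

    import subprocess

    def remote_sha(repo: str, branch: str) -> str:
        # Equivalent to `git ls-remote <REPO> <BRANCH> | awk '{print $1}'`:
        # ls-remote prints "<sha>\t<ref>" per line, so take the first field.
        output = subprocess.run(
            ["git", "ls-remote", repo, branch],
            capture_output=True, check=True, text=True,
        ).stdout
        return output.split()[0]

    print(remote_sha(
        "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
        "linux-4.9.y",
    ))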
    

@nathanchance nathanchance added the enhancement New feature or request label Aug 31, 2022
@nathanchance
Member

Talking with @kees about this during our weekly meeting, we think this is doable by giving every workflow file its own branch and information file, which eliminates the concern I had over multiple workflows updating one branch.

Rough idea for a do-we-need-to-run.py (or whatever we want to do), sketched in code below the list:

  • Does the workflow's branch exist in the repository?
    • If yes, clone the repo, checkout the branch, and read the file.
    • If no, create it, generate the information file, and exit with "we do need to build"
  • Generate the new information file, including the current:
    • Kernel SHA
    • Compiler version string
  • Has either item changed since the last run?
    • If yes, update the file, commit it, push it, and exit "we need to build"
    • If no, exit "we do not need to build"
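
A minimal, self-contained sketch of that flow (all file names, repo/branch values, and the compiler string are placeholders; the real script would get the compiler version from the TuxMake container and push the updated file to the per-workflow branch):

    #!/usr/bin/env python3
    # Hypothetical sketch only, not the actual script.
    import json
    import subprocess
    import sys
    from pathlib import Path

    INFO_FILE = Path("last_run.json")  # hypothetical per-workflow information file

    def current_state(kernel_repo: str, kernel_branch: str, compiler: str) -> dict:
        # Latest remote SHA without cloning, as discussed above.
        sha = subprocess.run(
            ["git", "ls-remote", kernel_repo, kernel_branch],
            capture_output=True, check=True, text=True,
        ).stdout.split()[0]
        return {"kernel_sha": sha, "compiler": compiler}

    def main() -> int:
        # "Does the workflow's branch exist?" is approximated here by checking
        # whether the information file is present on the checked-out branch.
        previous = json.loads(INFO_FILE.read_text()) if INFO_FILE.exists() else None

        new = current_state(
            "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
            "linux-4.9.y",
            "clang version 17.0.0",  # placeholder; would come from the TuxMake container
        )

        if previous == new:
            print("we do not need to build")
        else:
            INFO_FILE.write_text(json.dumps(new, indent=4) + "\n")
            # Committing and pushing the updated file would happen here,
            # e.g. via one of the auto-commit actions linked earlier.
            print("we need to build")
        return 0  # the exit-code convention for skip vs. build is up to the workflow

    if __name__ == "__main__":
        sys.exit(main())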

I am not sure propagating the build status is necessary, since that flow does not seem to care about the result of the last build. I guess we would potentially want to retry builds that failed the previous time they were run in case things were flaky, but it is possible we should just be re-running workflows manually in those cases and ignoring this check, which would simplify the overall implementation because we would not have to wait for the build results to update the file.

nathanchance added a commit to nathanchance/continuous-integration2 that referenced this issue Feb 26, 2023
There are times when neither the kernel revision nor the compiler has
changed since the last run, which means there is little point in running the
build because it is unlikely anything would change (except due to
flakiness with GitHub Actions, more on that later). While tuxsuite does
have compiler caching enabled, it still relies on spinning up all the
build machines, letting the cache work, then firing off the boot tests,
which can be expensive.

By caching the compiler string and the kernel revision, it is possible
to avoid even spinning up the tuxsuite jobs if nothing has changed.
should_run.py takes care of this by exiting:

    * 0 if the build should run (something changed or there were no
      previous results)
    * 1 for any internal assertion failure
    * 2 if nothing changed from the previous run

If there is any internal assertion failure, the cache step fails,
failing the whole workflow, so that those situations can be properly
dealt with. The script should never exit this way, but it is possible
there are situations I have not considered yet.

To avoid contention with pushing and pulling, each should_run.py job
gets its own branch. While this will result in a lot of branches, they
should not cause too many issues because they will just contain the JSON
file with the last run info.

should_run.py does not try to account for flakiness on the GitHub
Actions or TuxSuite side, so it is possible that a previous build will
fail due to flakiness and not be retried on the next cron if nothing
changes since then. To attempt to account for a situation where we
re-run a known flaky build, the script gets out of the way when the
GitHub Actions event is "workflow_dispatch", meaning a workflow was
manually run. Additionally, if the previous run was only flaky during
the QEMU boots (rather than during the TuxSuite stage), they can
generally just be re-run right away, since the kernels do not need to be
rebuilt. I do not think this will happen too often but if it does, we
can try to come up with a better heuristic.

Closes: ClangBuiltLinux#308
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
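
To make the exit-code convention above concrete, a rough illustrative sketch (this is not the contents of should_run.py; the helper and state shapes are hypothetical):

    import os

    SHOULD_RUN, INTERNAL_ERROR, NOTHING_CHANGED = 0, 1, 2

    def decide(previous: dict | None, current: dict) -> int:
        # Manually dispatched runs bypass the cache so known-flaky builds can be retried.
        if os.environ.get("GITHUB_EVENT_NAME") == "workflow_dispatch":
            return SHOULD_RUN
        if previous is None or previous != current:
            return SHOULD_RUN
        return NOTHING_CHANGED

    # The script would call sys.exit(decide(...)); exit code 1 is reserved for
    # unexpected internal failures so that the workflow itself goes red.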
@nathanchance
Member

Initial take based on my comment above: #522

nathanchance added a commit to nathanchance/continuous-integration2 that referenced this issue Feb 28, 2023

@JustinStitt
Contributor

JustinStitt commented Nov 9, 2023

Here's a one-month data dump regarding potential caching opportunities:

This covers October 8th to November 8th, 2023.

tl;dr: 17.1% of builds may have been skippable (assuming Kernel SHA and full compiler version are the only things that affect a build's outcome per job).

https://docs.google.com/spreadsheets/d/1Ag_N3kXYrBrBAq1VuXJ6ZAbBY8tHnWOaGV81VwaoGwk/edit?usp=sharing&resourcekey=0-OWVX-KZdJx23OOGn4nC-Cg

Note

FWIW, the actual total number of jobs run in that month was 39874, and ./scripts/estimate-builds.py said we would have 9425 per week. Multiply that by 4-ish and you get 39585. So hey, that script is pretty good 😄

@nickdesaulniers
Member Author

Wow, thanks for running those statistics! Some insights I get from the data (sort the sheet by column F):

  • we're way overbuilding arm64-fixes. That branch is not changing that often, so we would have had ~95% cache hits (if we had caching). Until we do, I think it makes immediate sense to turn down the frequency of arm64-fixes builds.
  • 5.10 (at least for Android and CrOS) and newer are changing pretty rapidly; the opportunities for caching are low.
  • android-4.14 and android12-5.4 should be built less frequently.

No matter what, caching would be a win.

So hey, that script is pretty good 😄

👍

@nathanchance
Member

  • we're way overbuilding arm64-fixes. That branch is not changing that often, so we would have had ~95% cache hits (if we had caching). Until we do, I think it makes immediate sense to turn down the frequency of arm64-fixes builds.

To be honest, it probably makes sense to just rely on -next testing for this one and delete its builds outright, in my opinion. Doing the build once a week makes little sense to me. The branch gets included in the fixes side of -next, and the likelihood of a change being added there that breaks LLVM and getting fast-tracked to Linus without us noticing from -next seems fairly low, as the arm64 folks are usually pretty good about holding fixes until they have been in -next for a bit.

  • android-4.14 and android12-5.4 should be built less frequently.

4.14 goes EOL in January, so, much agreed. We could reduce the LLVM ToT/stable builds of these trees to once a week, which would significantly cut down their total number of builds.

No matter what, caching would be a win.

Agreed. Maybe we sit down this weekend and think through implementation details?

@JustinStitt
Contributor

Agreed. Maybe we sit down this weekend and think through implementation details?

I'm up for it! I think we can copy KernelCI's architecture: https://kernelci.org/docs/api/pipeline-details/

tl;dr: a dispatcher system where a central dispatch service polls forever; when a new version is detected, a job is dispatched. No cron scheduling; no wasted builds.
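
A toy sketch of such a poll-and-dispatch loop (everything here is hypothetical: the polling interval, the workflow file name, and dispatching via the gh CLI; it is not KernelCI's pipeline or what ultimately got implemented):

    import subprocess
    import time

    def remote_sha(repo: str, branch: str) -> str:
        # Same ls-remote trick as earlier in the thread.
        out = subprocess.run(["git", "ls-remote", repo, branch],
                             capture_output=True, check=True, text=True).stdout
        return out.split()[0]

    def dispatch_loop(repo: str, branch: str, workflow: str, poll_seconds: int = 600) -> None:
        last_seen = None
        while True:
            sha = remote_sha(repo, branch)
            if sha != last_seen:
                # Trigger the corresponding workflow_dispatch-enabled workflow.
                subprocess.run(["gh", "workflow", "run", workflow], check=True)
                last_seen = sha
            time.sleep(poll_seconds)

    # Example (hypothetical workflow file name):
    # dispatch_loop("https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
    #               "linux-4.9.y", "4.9-clang-17.yml")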

@nathanchance
Member

Implemented in #664 :)
