fix(ci): partition GHA sccache cache per arch in shadow spike #961
Merged
Without a per-arch `SCCACHE_GHA_VERSION`, both matrix jobs (`linux-amd64-cpu8` and `linux-arm64-cpu8`) target the same GHA cache key. Run 1 on 2026-04-24 showed this concretely: arm64 started first and landed 575/1064 cache writes; amd64 started ~1 min later and every one of its 1062 writes failed, leaving the cache with no amd64 entries at all. Partitioning by `matrix.runner` ensures each arch gets its own cache namespace, letting us see real warm-cache behavior on the subsequent Run 2-4 dispatches.
Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>
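For readers who want to see the shape of the change, here is a minimal sketch of a job-level `env` keyed to the matrix. It is not the actual diff: the job name `build` and the checkout step are placeholders assumed for illustration. Only the two runner labels and the `SCCACHE_GHA_VERSION` line come from the description above.

```yaml
# Hypothetical excerpt; the job name and steps are assumptions,
# not the real contents of shadow-shared-cpu-spike.yml.
jobs:
  build:
    strategy:
      matrix:
        runner: [linux-amd64-cpu8, linux-arm64-cpu8]
    runs-on: ${{ matrix.runner }}
    env:
      # Must be job-level: workflow-level env cannot reference matrix.*.
      # Each runner label becomes its own sccache GHA cache namespace,
      # so arm64 and amd64 no longer compete for the same keys.
      SCCACHE_GHA_VERSION: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
```

With this in place, the amd64 job would write under a `linux-amd64-cpu8` namespace and the arm64 job under `linux-arm64-cpu8`, so neither job's writes should conflict with the other's.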
pimlock approved these changes on Apr 24, 2026
Summary
Run 1 of `shadow-shared-cpu-spike` on 2026-04-24 (run 24890968200) succeeded on wall time but exposed a cache-key collision: both matrix jobs wrote to the same GHA cache key because we didn't partition by arch. arm64 started first and landed 575/1064 writes; amd64 started ~60s later and every one of its 1062 writes failed (409 Conflict on keys already claimed by arm64).
Fix: add job-level `SCCACHE_GHA_VERSION: ${{ matrix.runner }}` so each arch gets its own cache namespace. One-liner (plus comment).
Related Issue
OS-49 runner migration, Phase 2 / OS-126. Unblocks the real warm-cache behavior on Run 2-4 dispatches.
Changes
`.github/workflows/shadow-shared-cpu-spike.yml`: new job-level `env.SCCACHE_GHA_VERSION` keyed to `${{ matrix.runner }}`. Workflow-level `env` can't reference `matrix.*`, hence the job-level placement.
Testing
Context — Run 1 data (cache-side is void, wall/queue are good)
Wall time comfortably below the exit-criterion threshold even on cold cache — the real question left is whether warm cache holds up, which this PR unblocks.
Checklist