
fix(ci): partition GHA sccache cache per arch in shadow spike #961

Merged
jtoelke2 merged 1 commit into main from jtoelke/os-126-partition-gha-sccache
Apr 24, 2026

@jtoelke2
Collaborator

Summary

Run 1 of shadow-shared-cpu-spike on 2026-04-24 (run 24890968200) succeeded on wall time but exposed a cache-key collision: both matrix jobs wrote to the same GHA cache key because we didn't partition by arch. arm64 started first and landed 575/1064 writes; amd64 started ~60s later and every one of its 1062 writes failed (409 Conflict on keys already claimed by arm64).

Fix: add a job-level `SCCACHE_GHA_VERSION: ${{ matrix.runner }}` so each arch gets its own cache namespace. One-liner (plus comment).

Related Issue

OS-49 runner migration, Phase 2 / OS-126. Unblocks the real warm-cache behavior on Run 2-4 dispatches.

Changes

  • .github/workflows/shadow-shared-cpu-spike.yml: new job-level `env.SCCACHE_GHA_VERSION` keyed to `${{ matrix.runner }}`. Workflow-level `env` can't reference `matrix.*`, hence the job-level placement.
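For readers without the diff open, the change has roughly the shape sketched below. This is an illustration, not the merged workflow: the job name `build`, the `runs-on` wiring, and the checkout step are assumptions, while the two runner labels are the ones named in the commit message.

```yaml
# Sketch only. The job name, runs-on wiring, and steps are hypothetical.
jobs:
  build:
    strategy:
      matrix:
        # Runner labels as named in the commit message.
        runner: [linux-amd64-cpu8, linux-arm64-cpu8]
    runs-on: ${{ matrix.runner }}
    env:
      # Must be job-level: workflow-level env cannot reference matrix.*.
      # A distinct value per arch gives each matrix job its own sccache
      # GHA cache namespace, so the two jobs stop colliding on keys.
      SCCACHE_GHA_VERSION: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
```

Keying on `matrix.runner` rather than a hand-written arch string means any runner label added to the matrix later gets its own namespace without further edits.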

Testing

  • `mise run pre-commit` passes
  • Unit tests added/updated — N/A; workflow config change
  • E2E tests added/updated — N/A
  • End-to-end dispatch — redispatch planned immediately after merge; expect both arches to land writes independently and show symmetric cache-hit numbers on the next warm run
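One way to read per-arch cache numbers off the redispatch is to print sccache's counters at the end of each matrix job. The step below is a sketch: it assumes `sccache` is on `PATH` in the job, and it is not confirmed to be part of the current workflow.

```yaml
      # Hypothetical step; assumes sccache is installed and on PATH.
      - name: Show sccache stats
        if: always()  # report even when an earlier step fails
        run: sccache --show-stats
```

Comparing the cache hit and write-error counts between the arm64 and amd64 jobs would show whether both arches now land writes independently.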

Context — Run 1 data (cache-side is void, wall/queue are good)

| Metric | arm64 | amd64 | ARC baseline (branch-checks) |
| --- | --- | --- | --- |
| Queue time | 29s | 85s | p95 15.4m |
| Wall time (cold) | 3m45s | 3m03s | p50 4.0m |
| Cache writes | 575 / 1064 | 0 / 1062 | n/a |
| Cache hit rate | 0% (cold) | 0% (cold) | n/a |

Wall time comfortably below the exit-criterion threshold even on cold cache — the real question left is whether warm cache holds up, which this PR unblocks.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A; plan lives in Linear OS-126

Without a per-arch SCCACHE_GHA_VERSION, both matrix jobs
(linux-amd64-cpu8 and linux-arm64-cpu8) target the same GHA cache
key. Run 1 on 2026-04-24 showed this concretely: arm64 started
first and landed 575/1064 cache writes; amd64 started ~1 min later
and every one of its 1062 writes failed, leaving the cache with no
amd64 entries at all.

Partitioning by matrix.runner ensures each arch gets its own
cache namespace, letting us see real warm-cache behavior on the
subsequent Run 2-4 dispatches.

Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>
@jtoelke2 jtoelke2 requested a review from a team as a code owner April 24, 2026 15:33
@jtoelke2 jtoelke2 self-assigned this Apr 24, 2026
@jtoelke2 jtoelke2 requested a review from pimlock April 24, 2026 15:33
@jtoelke2 jtoelke2 merged commit 77a88c3 into main Apr 24, 2026
21 checks passed
@jtoelke2 jtoelke2 deleted the jtoelke/os-126-partition-gha-sccache branch April 24, 2026 16:39