fix(ci): partition GHA sccache cache per arch in shadow spike by jtoelke2 · Pull Request #961 · NVIDIA/OpenShell

jtoelke2 · 2026-04-24T15:33:04Z

Summary

Run 1 of shadow-shared-cpu-spike on 2026-04-24 (run 24890968200) succeeded on wall time but exposed a cache-key collision: both matrix jobs wrote to the same GHA cache key because we didn't partition by arch. arm64 started first and landed 575/1064 writes; amd64 started ~60s later and every one of its 1062 writes failed (409 Conflict on keys already claimed by arm64).

Fix: add job-level `SCCACHE_GHA_VERSION: ${{ matrix.runner }}` so each arch gets its own cache namespace. One-liner (plus comment).

Related Issue

OS-49 runner migration, Phase 2 / OS-126. Unblocks measuring real warm-cache behavior on the Run 2-4 dispatches.

Changes

.github/workflows/shadow-shared-cpu-spike.yml: new job-level `env.SCCACHE_GHA_VERSION` keyed to `${{ matrix.runner }}`. Workflow-level `env` can't reference `matrix.*`, hence the job-level placement.
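
A minimal sketch of the job-level placement, for illustration only; it is not the actual diff. The job name `build`, the `runs-on` expression, the workflow-level `SCCACHE_GHA_ENABLED` setting, and the build step are all assumptions. The two runner labels are the ones named in the commit message.

```yaml
# Hypothetical excerpt of .github/workflows/shadow-shared-cpu-spike.yml.
# Job name, runs-on expression, and build step are assumptions, not the real file.
env:
  # Workflow-level env cannot reference matrix.*, so only arch-independent
  # settings belong here. (Assumed: sccache's GHA backend is enabled here.)
  SCCACHE_GHA_ENABLED: "true"

jobs:
  build:
    strategy:
      matrix:
        runner: [linux-amd64-cpu8, linux-arm64-cpu8]
    runs-on: ${{ matrix.runner }}
    env:
      # Job-level env can read matrix.*; each runner label becomes its own
      # sccache cache namespace, so the two arches no longer share keys.
      SCCACHE_GHA_VERSION: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: cargo build --release  # placeholder build step
```

Keying the version on the runner label rather than a hand-written arch string means a new matrix entry would get its own namespace without further edits.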

Testing

`mise run pre-commit` passes
Unit tests added/updated — N/A; workflow config change
E2E tests added/updated — N/A
End-to-end dispatch — redispatch planned immediately after merge; expect both arches to land writes independently and show symmetric cache-hit numbers on the next warm run

Context — Run 1 data (cache-side is void, wall/queue are good)

| Metric | arm64 | amd64 | ARC baseline (branch-checks) |
| --- | --- | --- | --- |
| Queue time | 29s | 85s | p95 15.4m |
| Wall time (cold) | 3m45s | 3m03s | p50 4.0m |
| Cache writes | 575 / 1064 | 0 / 1062 | n/a |
| Cache hit rate | 0% (cold) | 0% (cold) | n/a |

Wall time comfortably below the exit-criterion threshold even on cold cache — the real question left is whether warm cache holds up, which this PR unblocks.

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated — N/A; plan lives in Linear OS-126

Without a per-arch SCCACHE_GHA_VERSION, both matrix jobs (linux-amd64-cpu8 and linux-arm64-cpu8) target the same GHA cache key. Run 1 on 2026-04-24 showed this concretely: arm64 started first and landed 575/1064 cache writes; amd64 started ~1 min later and every one of its 1062 writes failed, leaving the cache with no amd64 entries at all. Partitioning by matrix.runner ensures each arch gets its own cache namespace, letting us see real warm-cache behavior on the subsequent Run 2-4 dispatches.

Signed-off-by: Jonas Toelke <jtoelke@nvidia.com>

jtoelke2 requested a review from a team as a code owner April 24, 2026 15:33

jtoelke2 self-assigned this Apr 24, 2026

jtoelke2 requested a review from pimlock April 24, 2026 15:33

pimlock approved these changes Apr 24, 2026

jtoelke2 merged commit 77a88c3 into main Apr 24, 2026
21 checks passed

jtoelke2 deleted the jtoelke/os-126-partition-gha-sccache branch April 24, 2026 16:39

jtoelke2 mentioned this pull request Apr 24, 2026

feat(ci): add shadow-rust-native-build workflow for OS-49 Phase 4 (PR 4a) #973

Draft

6 tasks

jtoelke2 merged 1 commit into main from jtoelke/os-126-partition-gha-sccache