fix: keep GPU image fresh for SkyPilot launches #732

Merged: bradhilton merged 7 commits into main from codex/gpu-prewarm-mutable-tag on Jun 22, 2026

@bradhilton (Collaborator)

Summary

  • publish and prewarm the GPU image through Docker Hub so downstream SkyPilot jobs can use the fresh mutable latest tag
  • pin prewarm pulls to the pushed digest and install a steady-state DaemonSet after all GPU nodes are warmed
  • keep dynamic image revision metadata at the end of the Dockerfile so normal commits do not invalidate heavy dependency layers
  • add a workflow smoke launch that asserts ART_IMAGE_REVISION matches the workflow commit
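The smoke assertion in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the function name `assert_image_revision` is hypothetical, and how the expected commit reaches the check (for example through the `GITHUB_SHA` variable that GitHub Actions sets) is an assumption. Only `ART_IMAGE_REVISION` and the `ART_IMAGE_SMOKE_OK` marker come from this PR.

```python
def assert_image_revision(env, expected_sha):
    """Return the smoke marker if the image revision matches the expected commit.

    `env` is any mapping of environment variables (e.g. os.environ).
    Raises RuntimeError when the variable is missing or does not match.
    """
    actual = env.get("ART_IMAGE_REVISION", "")
    if not actual:
        raise RuntimeError("ART_IMAGE_REVISION is not set in the image")
    if actual != expected_sha:
        raise RuntimeError(
            f"stale image: ART_IMAGE_REVISION={actual}, expected {expected_sha}"
        )
    # Marker string taken from the validation output in this PR.
    return "ART_IMAGE_SMOKE_OK"
```

Inside the launched job this could be called as `assert_image_revision(os.environ, expected_sha)`, with a non-zero exit on the raised error so that a stale image fails the workflow instead of passing silently.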

Root Cause

The failing builds published to Docker Hub but prewarmed and launched through a CoreWeave pull-through path. Pinning a Docker Hub digest under the CoreWeave registry failed, and the CoreWeave mutable latest tag was stale, so downstream launches could not rely on it for freshness.
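To make the pinning step concrete, here is a small sketch of composing a digest-pinned reference so that the registry host and the digest come from the same push. The helper name `pinned_reference` is hypothetical and is not taken from `scripts/build-gpu-image.sh`; it only illustrates the `repository[:tag]@sha256:...` reference form.

```python
import re

# A sha256 digest is the literal prefix plus 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def pinned_reference(repository, digest, tag=None):
    """Build `repository[:tag]@digest`, rejecting malformed digests."""
    if not _DIGEST_RE.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    name = f"{repository}:{tag}" if tag else repository
    return f"{name}@{digest}"
```

Calling it with `docker.io/bradhiltonnw/art-gpu`, the pushed digest, and `tag="latest"` yields the same reference string that this PR reports as pushed. The tag stays in the string for readability, while the digest is what identifies the image content.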

Validation

  • bash -n scripts/build-gpu-image.sh
  • git diff --check
  • GitHub Actions Build GPU Image run 27983169214 succeeded: https://github.com/OpenPipe/ART/actions/runs/27983169214
  • Pushed docker.io/bradhiltonnw/art-gpu:latest@sha256:6082406c573dd0a30de022268cf8b42c0483ffbaa0913f2493c8349426419c0c
  • Prewarm DaemonSet reported 10 desired / 10 updated / 10 available
  • SkyPilot smoke cluster printed ART_IMAGE_REVISION 16d439d61679a7c37c24d61ef324e39073bd845c and ART_IMAGE_SMOKE_OK
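The "10 desired / 10 updated / 10 available" check above can be expressed as a simple predicate over a DaemonSet's status. The sketch below assumes those three numbers correspond to the Kubernetes `DaemonSetStatus` fields `desiredNumberScheduled`, `updatedNumberScheduled`, and `numberAvailable`; the function name and the decision to treat zero desired pods as not ready are illustrative choices, not something this PR specifies.

```python
def daemonset_rolled_out(status):
    """True when every desired pod is both updated and available.

    `status` is the `.status` mapping of a DaemonSet object, e.g. parsed
    from JSON output. Missing fields are treated as zero.
    """
    desired = status.get("desiredNumberScheduled", 0)
    updated = status.get("updatedNumberScheduled", 0)
    available = status.get("numberAvailable", 0)
    # Require at least one desired pod so an unscheduled DaemonSet
    # is not reported as a successful prewarm.
    return desired > 0 and updated == desired and available == desired
```

A caller could poll this until it returns true before installing the steady-state DaemonSet or launching the smoke cluster.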

@bradhilton bradhilton marked this pull request as ready for review June 22, 2026 21:22
@bradhilton bradhilton merged commit f5ef20d into main Jun 22, 2026
6 checks passed
@bradhilton bradhilton deleted the codex/gpu-prewarm-mutable-tag branch June 22, 2026 21:28