ci: serialize uv installs against the shared node-local cache#1630
Merged
Conversation
Self-hosted Frontier/Frontier-AMD matrix legs (acc/omp/cpu x shards) run their "Fetch Dependencies" step directly on the same login node as the same OS user, all pointed at the same UV_CACHE_DIR (introduced in #1385 to dodge NFS file-lock errors on ~/.cache/uv). uv's own cache lock guards individual entries, but concurrent installs from separate uv processes can still race while one extracts/prunes the shared archive-v0 store, leaving a corrupted entry behind (e.g. a missing dist-info METADATA file) that fails every subsequent install until the cache is cleared by hand -- as happened on PR #1414's Frontier gpu-acc [2/2] job. Serialize the actual `uv pip install` call with flock so only one process touches a given cache dir at a time, while keeping the cache itself shared and warm across runs.
f3ff276 to
74db3a0
Compare
|
Claude Code Review Head SHA: f3ff276 Files changed:
Findings:
|
Addresses Claude Code Review on #1630: - flock hard-fails (and skips running uv pip install entirely) if its lock file's directory doesn't exist or isn't writable. TMPDIR can be stale on HPC login nodes (e.g. left over from a prior job's since-deleted scratch dir), so an unconditional flock target risked breaking installs that worked fine before this lock existed. Guard with the same -w check already used for the UV_CACHE_DIR redirect, falling back to /tmp. - Clarified the comment: uv's cache is shared per-user by default (~/.cache/uv), not just in the CI node-local-redirect case, so the serialization also protects concurrent local builds, not only self-hosted CI matrix legs.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses intermittent CI failures caused by concurrent uv pip install processes racing on a shared per-user uv cache directory (notably on self-hosted runners), by serializing the install operation with a file lock.
Changes:
- Introduces a
uv_installwrapper that usesflock(when available) to serializeuv pip installagainst a per-user lock file. - Uses a TMPDIR-based lock location with a fallback to
/tmpwhen TMPDIR is not writable. - Replaces direct
uv pip installinvocations with the newuv_installwrapper in both verbose and non-verbose paths.
Comment on lines
+206
to
+208
| UV_LOCK_DIR="${TMPDIR:-/tmp}" | ||
| [ -w "$UV_LOCK_DIR" ] || UV_LOCK_DIR=/tmp | ||
| UV_INSTALL_LOCK="${UV_LOCK_DIR}/mfc-uv-install-${USER:-$(id -un)}.lock" |
9 tasks
Addresses Copilot review on #1630, plus a live recurrence caught on this PR's own CI run (job 85133634699, "Frontier (AMD) cpu [1/2]"): - UV_LOCK_DIR's guard only checked -w, so a writable non-directory TMPDIR would pass and then get used as a directory prefix, breaking flock. Add a -d check alongside -w, per Copilot's suggestion. - That same CI run hit a *new* corruption symptom ("The wheel is invalid: Missing .dist-info directory" for pandas) on the same physical login node (login05) as the original incident on #1414, even with the new lock in place. Root cause: a cache entry corrupted before the lock existed (or by any other cause) just fails forever until someone manually clears it -- which is exactly what had happened here; login05's ~1.2GiB cache had never actually been cleared (an earlier `uv cache clean` in this investigation was run from a different login node's session and never touched login05). Since self-hosted runners are spread across login nodes we can't all individually SSH into and inspect every time this happens, make the script self-heal instead: on install failure, clear the uv cache and retry once before giving up.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1630 +/- ##
=======================================
Coverage 60.43% 60.43%
=======================================
Files 83 83
Lines 19871 19871
Branches 2956 2956
=======================================
Hits 12010 12010
Misses 5860 5860
Partials 2001 2001 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes CI failures like the one seen on #1414's Frontier
gpu-acc [2/2]job:Root cause
toolchain/bootstrap/python.shredirectsUV_CACHE_DIRto a node-local, per-user path ($TMPDIR/uv-cache-$USER) on self-hosted GitHub Actions runners (added in #1385 to dodge an NFS file-lock error,os error 524, when~/.cache/uvlives on shared NFS$HOME).That path is keyed only by username, not by job/run/matrix-leg. The
selfjob intest.ymlruns several Frontier/Frontier-AMD matrix legs (acc/omp/cpu x shards) whose "Fetch Dependencies" step runs directly on the shared login node, as the same OS user, at the same time — all pointed at the identical cache directory. uv's own lock protects individual cache entries, but concurrent installs from separateuvprocesses can still race while one extracts/prunes the sharedarchive-v0store, leaving behind a corrupted entry (like a missingdist-info/METADATAfile) that fails every subsequent install until someone manually clears the cache.Fix
Serialize the actual
uv pip installcall withflock(falls back to running unlocked ifflockisn't available, e.g. stock macOS) so only one process touches a given cache directory at a time, while keeping the cache itself shared and warm across runs — no loss of the caching benefit #1385 was going for, just closes the intra-node concurrency gap it didn't address.Testing
bash -non the modified script (syntax check)uv_install-style wrapper against a shared lock file and confirmed they run strictly serialized (each start/end pair completes before the next begins), rather than interleaving.Type of change
Checklist