
ci: serialize uv installs against the shared node-local cache #1630

Merged
sbryngelson merged 4 commits into master from ci/serialize-uv-cache-install
Jul 4, 2026

Conversation

@sbryngelson

Member

Description

Fixes CI failures like the one seen on #1414's Frontier gpu-acc [2/2] job:

error: Failed to install: pandas-3.0.3-...whl (pandas==3.0.3)
  Caused by: failed to open file `.../pandas-3.0.3.dist-info/METADATA`: No such file or directory (os error 2)

Root cause

toolchain/bootstrap/python.sh redirects UV_CACHE_DIR to a node-local, per-user path ($TMPDIR/uv-cache-$USER) on self-hosted GitHub Actions runners (added in #1385 to dodge an NFS file-lock error, os error 524, when ~/.cache/uv lives on shared NFS $HOME).

That path is keyed only by username, not by job, run, or matrix leg. The self job in test.yml runs several Frontier/Frontier-AMD matrix legs (acc/omp/cpu x shards) whose "Fetch Dependencies" step runs directly on the shared login node, as the same OS user, at the same time, so all of them point at the identical cache directory. uv's own lock protects individual cache entries, but concurrent installs from separate uv processes can still race while one of them extracts or prunes the shared archive-v0 store. The race can leave behind a corrupted entry (such as a missing dist-info/METADATA file) that fails every subsequent install until someone manually clears the cache.
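
To make the collision concrete, here is a minimal sketch of a redirect shaped like the one described above. The function name `resolve_uv_cache_dir`, the `GITHUB_JOB` values, and the fallback branch are illustrative assumptions; the actual code in toolchain/bootstrap/python.sh may differ.

```shell
# Sketch only: mirrors the gating and path described in this PR, not the real script.
resolve_uv_cache_dir() {
    local tmp="${TMPDIR:-/tmp}"
    if [ "${GITHUB_ACTIONS:-}" = "true" ] && [ -w "$tmp" ]; then
        # Keyed by user only. Nothing here depends on the job, run, or matrix leg.
        printf '%s\n' "$tmp/uv-cache-${USER:-$(id -un)}"
    else
        # Assumed fallback for the sketch: an explicit UV_CACHE_DIR or ~/.cache/uv.
        printf '%s\n' "${UV_CACHE_DIR:-$HOME/.cache/uv}"
    fi
}

# Two hypothetical matrix legs on the same login node, same OS user.
leg_a="$(GITHUB_ACTIONS=true GITHUB_JOB=acc-1 resolve_uv_cache_dir)"
leg_b="$(GITHUB_ACTIONS=true GITHUB_JOB=omp-2 resolve_uv_cache_dir)"
echo "leg A cache: $leg_a"
echo "leg B cache: $leg_b"
```

Both legs resolve to the same directory, which is why their installs can interfere with each other.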

Fix

Serialize the actual uv pip install call with flock so that only one process touches a given cache directory at a time. If flock isn't available (e.g. stock macOS), the install runs unlocked. The cache itself stays shared and warm across runs, so the caching benefit #1385 was going for is preserved; this only closes the intra-node concurrency gap that #1385 didn't address.
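
A minimal sketch of what such a wrapper could look like. `UV_INSTALL_LOCK` and `uv_install` are names mentioned in this PR's review; the helper `run_locked` and the function bodies are assumptions for illustration, not the merged code.

```shell
# Sketch only: a flock-based wrapper with an unlocked fallback.
UV_INSTALL_LOCK="${TMPDIR:-/tmp}/mfc-uv-install-${USER:-$(id -un)}.lock"

run_locked() {
    if command -v flock > /dev/null 2>&1; then
        # flock opens (creating if needed) the lock file, takes an exclusive
        # lock, runs the command, and releases the lock when the command exits.
        flock "$UV_INSTALL_LOCK" "$@"
    else
        # No flock on this system: run unlocked, as before the change.
        "$@"
    fi
}

uv_install() {
    run_locked uv pip install "$@"
}

# Harmless demonstration that does not need uv installed.
run_locked echo "ran under the install lock (or unlocked if flock is missing)"
```

Because the lock is held by the flock process itself, it is released even if the wrapped command crashes, so a failed install should not leave a stale lock behind.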

Testing

  • bash -n on the modified script (syntax check)
  • Local smoke test: launched 3 concurrent invocations of the new uv_install-style wrapper against a shared lock file and confirmed they run strictly serialized (each start/end pair completes before the next begins), rather than interleaving.
  • Have not yet reproduced the actual corruption race end-to-end on Frontier (hard to force deterministically); opening as draft pending a live CI run against this branch.
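
A smoke test along the lines of the second bullet might look like the following sketch. The file names, the sleep duration, and the `locked_job` helper are hypothetical; the `sleep` stands in for a slow `uv pip install`.

```shell
# Sketch only: launch 3 concurrent jobs against one lock file and check ordering.
workdir="$(mktemp -d)"
lock="$workdir/install.lock"
log="$workdir/events.log"

locked_job() {
    # $1 is a job id; the inner script logs start/end around a short sleep.
    flock "$lock" sh -c 'echo "start $1" >> "$2"; sleep 0.2; echo "end $1" >> "$2"' _ "$1" "$log"
}

serialized=skipped
if command -v flock > /dev/null 2>&1; then
    for i in 1 2 3; do
        locked_job "$i" &
    done
    wait

    # Serialized runs produce strictly alternating start/end lines with
    # matching ids; interleaved runs would break that pattern.
    serialized=yes
    while read -r s_word s_id && read -r e_word e_id; do
        if [ "$s_word" != start ] || [ "$e_word" != end ] || [ "$s_id" != "$e_id" ]; then
            serialized=no
        fi
    done < "$log"
fi
echo "serialized: $serialized"
```

This only checks the locking behavior; it does not reproduce the cache corruption itself.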

Type of change

  • Bug fix (CI infra)

Checklist

  • Tested locally (script syntax + lock-serialization behavior)
  • Verified fix on an actual Frontier CI run (pending — draft)

Self-hosted Frontier/Frontier-AMD matrix legs (acc/omp/cpu x shards) run
their "Fetch Dependencies" step directly on the same login node as the
same OS user, all pointed at the same UV_CACHE_DIR (introduced in #1385
to dodge NFS file-lock errors on ~/.cache/uv). uv's own cache lock
guards individual entries, but concurrent installs from separate uv
processes can still race while one extracts/prunes the shared
archive-v0 store, leaving a corrupted entry behind (e.g. a missing
dist-info METADATA file) that fails every subsequent install until the
cache is cleared by hand -- as happened on PR #1414's Frontier gpu-acc
[2/2] job.

Serialize the actual `uv pip install` call with flock so only one
process touches a given cache dir at a time, while keeping the cache
itself shared and warm across runs.

@sbryngelson sbryngelson force-pushed the ci/serialize-uv-cache-install branch from f3ff276 to 74db3a0 July 4, 2026 12:38
@github-actions

github-actions Bot commented Jul 4, 2026


Claude Code Review

Head SHA: f3ff276

Files changed:

  • 1
  • toolchain/bootstrap/python.sh

Findings:

  • toolchain/bootstrap/python.sh:188-206: The UV_CACHE_DIR redirect just above is explicitly gated on [ "${GITHUB_ACTIONS:-}" = "true" ] && [ -w "${TMPDIR:-/tmp}" ], but the new UV_INSTALL_LOCK/uv_install wrapper is unconditional — it always serializes uv pip install via flock for every invocation (any local dev machine, any CI job), not just the self-hosted HPC case described in the comment, and it never checks that ${TMPDIR:-/tmp} is writable before pointing flock at a lock file there. If TMPDIR is set to a directory that doesn't exist or isn't writable, flock will fail to open/create mfc-uv-install-${USER}.lock and uv_install will return non-zero without ever invoking uv pip install, so the script reports "Installation failed" (and dumps a log) even though the actual uv pip install was never attempted. This is a regression for any environment where TMPDIR is non-writable, since prior to this change the same environment installed fine (no lock file was created).
  • toolchain/bootstrap/python.sh:201: The lock filename only encodes $USER, not the repository path/build dir, so two unrelated concurrent invocations of mfc.sh by the same user (e.g. separate git worktrees/clones of MFC building at once) will now serialize against each other through /tmp/mfc-uv-install-$USER.lock even though they don't share a UV_CACHE_DIR (that redirect is CI-only) — a behavior change beyond the stated self-hosted-runner-cache-corruption fix.

Addresses Claude Code Review on #1630:

- flock hard-fails (and skips running uv pip install entirely) if its
  lock file's directory doesn't exist or isn't writable. TMPDIR can be
  stale on HPC login nodes (e.g. left over from a prior job's
  since-deleted scratch dir), so an unconditional flock target risked
  breaking installs that worked fine before this lock existed. Guard
  with the same -w check already used for the UV_CACHE_DIR redirect,
  falling back to /tmp.

- Clarified the comment: uv's cache is shared per-user by default
  (~/.cache/uv), not just in the CI node-local-redirect case, so the
  serialization also protects concurrent local builds, not only
  self-hosted CI matrix legs.
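
The failure mode in the first bullet can be demonstrated with a short sketch. The stale directory name and the demo lock file name are hypothetical, and `touch` stands in for the install command.

```shell
# Sketch only: flock never runs the command if it cannot open its lock file.
workdir="$(mktemp -d)"
missing_dir="$workdir/deleted-scratch"   # hypothetical stale TMPDIR, never created
unguarded_ran=skipped
guarded_ran=skipped

if command -v flock > /dev/null 2>&1; then
    # Unguarded: the lock file's directory is missing, so flock exits
    # non-zero and `touch` is never invoked.
    unguarded_ran=no
    if flock "$missing_dir/install.lock" touch "$workdir/unguarded" 2> /dev/null; then
        unguarded_ran=yes
    fi

    # Guarded: fall back to /tmp when the candidate directory is not writable.
    lock_dir="$missing_dir"
    [ -w "$lock_dir" ] || lock_dir=/tmp
    guarded_ran=no
    if flock "$lock_dir/mfc-flock-demo-$$.lock" touch "$workdir/guarded"; then
        guarded_ran=yes
    fi
    rm -f "$lock_dir/mfc-flock-demo-$$.lock"
fi
echo "unguarded ran: $unguarded_ran, guarded ran: $guarded_ran"
```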
@sbryngelson sbryngelson marked this pull request as ready for review July 4, 2026 12:49
Copilot AI review requested due to automatic review settings July 4, 2026 12:49

Copilot AI left a comment

Contributor


Pull request overview

This PR addresses intermittent CI failures caused by concurrent uv pip install processes racing on a shared per-user uv cache directory (notably on self-hosted runners), by serializing the install operation with a file lock.

Changes:

  • Introduces a uv_install wrapper that uses flock (when available) to serialize uv pip install against a per-user lock file.
  • Uses a TMPDIR-based lock location with a fallback to /tmp when TMPDIR is not writable.
  • Replaces direct uv pip install invocations with the new uv_install wrapper in both verbose and non-verbose paths.

Comment on lines +206 to +208
UV_LOCK_DIR="${TMPDIR:-/tmp}"
[ -w "$UV_LOCK_DIR" ] || UV_LOCK_DIR=/tmp
UV_INSTALL_LOCK="${UV_LOCK_DIR}/mfc-uv-install-${USER:-$(id -un)}.lock"
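
The guard quoted above accepts any writable path, including a regular file. A stricter variant that also requires a directory could be sketched as follows; the function name `pick_lock_dir` is hypothetical and this is not necessarily how the script ended up structuring it.

```shell
# Sketch only: choose a lock directory, requiring both -d and -w.
pick_lock_dir() {
    # $1: candidate path, e.g. "${TMPDIR:-/tmp}".
    local candidate="$1"
    if [ -d "$candidate" ] && [ -w "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        # Not a writable directory: fall back to /tmp.
        printf '%s\n' /tmp
    fi
}

# A writable regular file passes -w but fails -d, so it is rejected.
not_a_dir="$(mktemp)"
real_dir="$(mktemp -d)"
echo "file candidate -> $(pick_lock_dir "$not_a_dir")"
echo "dir candidate  -> $(pick_lock_dir "$real_dir")"
```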
Addresses Copilot review on #1630, plus a live recurrence caught on
this PR's own CI run (job 85133634699, "Frontier (AMD) cpu [1/2]"):

- UV_LOCK_DIR's guard only checked -w, so a writable non-directory
  TMPDIR would pass and then get used as a directory prefix, breaking
  flock. Add a -d check alongside -w, per Copilot's suggestion.

- That same CI run hit a *new* corruption symptom ("The wheel is
  invalid: Missing .dist-info directory" for pandas) on the same
  physical login node (login05) as the original incident on #1414,
  even with the new lock in place. Root cause: a cache entry corrupted
  before the lock existed (or by any other cause) just fails forever
  until someone manually clears it -- which is exactly what had
  happened here; login05's ~1.2GiB cache had never actually been
  cleared (an earlier `uv cache clean` in this investigation was run
  from a different login node's session and never touched login05).

  Self-hosted runners are spread across login nodes, and we can't SSH
  into each one to inspect its cache every time this happens, so make
  the script self-heal instead: on install failure, clear the uv cache
  and retry once before giving up.
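
The self-healing step described above can be sketched as a small retry helper. The names `retry_after_clean` and `clean_uv_cache` are hypothetical, and passing the cleanup as a function is a choice made here so the sketch can be exercised without uv; the merged script may be organized differently.

```shell
# Sketch only: run an install command, and on failure clear the cache and retry once.
clean_uv_cache() {
    uv cache clean
}

retry_after_clean() {
    # $1: name of a cleanup command or function; remaining args: the install command.
    local clean_fn="$1"
    shift
    if "$@"; then
        return 0
    fi
    echo "Install failed; clearing the cache and retrying once." >&2
    "$clean_fn" || return 1
    # Second and final attempt; its exit status is the helper's exit status.
    "$@"
}

# Intended use (not executed here because it needs uv):
#   retry_after_clean clean_uv_cache uv_install <install arguments>
```

One design question this sketch leaves open is whether the cleanup should run under the same lock as the install, since clearing a cache while another leg is installing from it could reintroduce a race.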
@codecov

codecov Bot commented Jul 4, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.43%. Comparing base (804c3bf) to head (b9c8665).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1630   +/-   ##
=======================================
  Coverage   60.43%   60.43%           
=======================================
  Files          83       83           
  Lines       19871    19871           
  Branches     2956     2956           
=======================================
  Hits        12010    12010           
  Misses       5860     5860           
  Partials     2001     2001           


@sbryngelson sbryngelson merged commit 9478e2d into master Jul 4, 2026
22 checks passed
@sbryngelson sbryngelson deleted the ci/serialize-uv-cache-install branch July 4, 2026 16:57

2 participants