Centralize cache delete-and-push mechanism to one place #1645

Merged: coreyjadams merged 6 commits into main from fix-testmon-db-cache on May 18, 2026

@coreyjadams

Collaborator

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented May 13, 2026

Contributor

Greptile Summary

This PR centralises the GitHub Actions delete-before-save cache pattern into a new reusable replace-cache composite action, eliminating ~100 lines of duplicated shell logic across the nightly workflow. It simultaneously fixes a long-standing stale-cache bug: testmon and coverage caches previously used hashFiles('uv.lock', 'pyproject.toml') keys that collided on consecutive nightlies with an unchanged lockfile, silently leaving stale data in place for days.

  • New replace-cache action: encapsulates delete → save → verify for any mutable -latest slot; callers supply their own if: gate and github-token.
  • Nightly workflow: four mutable-slot saves (uv, JIT, testmon, coverage) now all go through replace-cache; testmon and coverage keys migrated from hash-suffix to -latest.
  • PR workflow: restore steps updated to the new -latest key; restore-keys prefix fallback removed intentionally (fail-open semantics preserved, testmon handles stale DBs gracefully).
  • Dependency pinning: CI-only test deps in setup-uv-env tightened from >= to == to stabilise the testmon DB environment fingerprint and prevent spurious full-suite re-runs on PRs.
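The stale-cache bug and the delete → save → verify fix described above can be illustrated with a toy model. This is only a sketch, assuming a cache store in which saving under an existing key is skipped rather than overwritten; `ImmutableCacheStore`, `replace_cache`, and the key names are hypothetical and are not the code in the actual `replace-cache` action.

```python
class ImmutableCacheStore:
    """Toy model of a cache where an existing key cannot be overwritten."""

    def __init__(self):
        self._entries = {}

    def save(self, key, payload):
        # A save under a key that already exists is skipped (first write wins).
        if key in self._entries:
            return False
        self._entries[key] = payload
        return True

    def delete(self, key):
        # Returns True only if an entry was actually removed.
        return self._entries.pop(key, None) is not None

    def restore(self, key):
        return self._entries.get(key)


def replace_cache(store, key, payload):
    """Hypothetical delete -> save -> verify sequence for a mutable slot."""
    store.delete(key)
    saved = store.save(key, payload)
    return saved and store.restore(key) == payload


store = ImmutableCacheStore()

# Hash-suffixed key: identical on consecutive nights while the lockfile is unchanged.
hash_key = "testmon-abc123"  # hypothetical key value
store.save(hash_key, "night-1 db")
second_save = store.save(hash_key, "night-2 db")  # silently skipped
stale = store.restore(hash_key)                   # still the night-1 payload

# Mutable -latest slot refreshed through delete-before-save.
latest_key = "testmon-latest"  # hypothetical key value
replace_cache(store, latest_key, "night-1 db")
replaced = replace_cache(store, latest_key, "night-2 db")
fresh = store.restore(latest_key)                 # now the night-2 payload
```

In this model the hash-suffixed key keeps serving the first night's payload, while the `-latest` slot always reflects the most recent save.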

Important Files Changed

Filename | Overview
--- | ---
.github/actions/replace-cache/action.yml | New composite action encapsulating delete-before-save for mutable -latest cache slots; verify step retries 5×5 s which may be tight under heavy GitHub API load
.github/workflows/github-nightly-uv.yml | Replaces three separate inline delete/save/verify blocks with single replace-cache invocations; testmon and coverage keys migrated from hash-suffix to -latest; no logic regressions
.github/workflows/github-pr.yml | PR restore steps updated to match new -latest keys; restore-keys prefix fallback removed intentionally (fail-open design)
.github/actions/setup-uv-env/action.yml | CI-only test deps pinned with == to stabilise testmon DB environment fingerprint; transitive churn acknowledged in comments
.github/CACHE_CONTRACT.md | Documentation updated to cover testmon and coverage cache contracts, -latest mutable-slot rationale, and the replace-cache building block
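The verify step mentioned for replace-cache can be pictured as a bounded polling loop. The sketch below reads "5×5 s" as five attempts spaced five seconds apart, which is an assumption; `verify_with_retries` and the fake check are hypothetical names, and the real action's verification logic may differ.

```python
import time


def verify_with_retries(check, attempts=5, delay_s=5.0, sleep=time.sleep):
    """Poll `check` up to `attempts` times, sleeping `delay_s` between tries.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            # No sleep after the final failed attempt.
            sleep(delay_s)
    return None


# Demo with a fake check that only succeeds on the third poll, and a fake
# sleep that records delays instead of actually waiting.
calls = {"n": 0}
slept = []


def visible_on_third_poll():
    calls["n"] += 1
    return calls["n"] >= 3


result = verify_with_retries(visible_on_third_poll, sleep=slept.append)
```

With these defaults the loop waits at most 20 seconds in total (four sleeps of 5 s), so a slow cache listing would need a larger `attempts` or `delay_s`.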

Reviews (1): Last reviewed commit: "Merge branch 'main' into fix-testmon-db-..."

Comment thread .github/actions/replace-cache/action.yml
Comment thread .github/actions/replace-cache/action.yml
Comment thread .github/actions/setup-uv-env/action.yml Outdated
Comment on lines +237 to +243
"moto[s3]==5.2.1" \
"numpy-stl==3.2.0" \
"scikit-image==0.26.0" \
"shapely==2.1.2" \
"multi-storage-client[boto3]==0.48.0" \
"tensorstore==0.1.83" \
"pyarrow==24.0.0"

Collaborator Author

@NickGeneva @laserkelvin and also @peterdsharpe there has been a little discussion about pinning here, vs. pinning in pyproject.toml. Summarizing some pros and cons.

Why pin? If we don't pin, and one of these updates, the nightly build will get out of sync with the PR venv and it will trigger a rebuild of the environment (slow on the PR on the GPU nodes) and trigger ALL tests to run (also slow) because the testmon DB requires the venv to match. So, pinning is a good idea IMO.

We can pin here, and that is nice because it's not disruptive to pyproject.toml, and can control our CI system independently. We already have that in blossom since we run in a container, and the installed packages are not necessarily aligned with what's in uv.lock. I contend that is OK. On the other hand, we might want to be able to control the CI env tightly against the uv.lock file for some reason?

We can pin in pyproject.toml by putting all of these deps in a ci-deps development group with pinned versions. That's an update to pyproject.toml (no big deal) and an extra lock resolution (no big deal), but any change to the CI env will have to go through that instead. And changes to pyproject.toml are meant, deliberately, to invalidate the testmon DB and trigger all tests to rerun, for what it's worth (and I like that design; updating pyproject.toml in physicsnemo should be painful).

A middle ground might be to put these in a ci-requirements.txt or similar that is contained?
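To make the pinning rationale concrete: the argument is that the testmon DB is only reusable when the PR environment matches the nightly one. A rough sketch of that idea follows, assuming the match is decided by a hash over installed package versions. This is not testmon's actual fingerprinting code; `env_fingerprint` is a hypothetical helper and the drifted version number is made up for illustration.

```python
import hashlib


def env_fingerprint(packages):
    """Hash a {name: version} mapping in an order-independent way (hypothetical)."""
    lines = sorted(f"{name.lower()}=={version}" for name, version in packages.items())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Environment recorded by the nightly run (versions taken from the pins above).
nightly = {"pyarrow": "24.0.0", "shapely": "2.1.2"}

# PR environment built from the same pins: same fingerprint, DB is reusable.
pr_pinned = {"shapely": "2.1.2", "pyarrow": "24.0.0"}

# PR environment where an unpinned dep floated to a newer release
# (24.0.1 is a hypothetical version): fingerprint changes, all tests rerun.
pr_drifted = {"pyarrow": "24.0.1", "shapely": "2.1.2"}

same = env_fingerprint(nightly) == env_fingerprint(pr_pinned)
drifted = env_fingerprint(nightly) != env_fingerprint(pr_drifted)
```

Under this model, any of the three pinning locations works as long as the nightly and PR jobs resolve to identical versions.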

Collaborator

This is a good analysis!

I agree that pinning overall is a good idea.

As for how to implement it, I think all three strategies are viable (pyproject.toml, here, or pulled out into a ci-requirements.txt) and I'd approve of any of them. I've been mulling over all three options for the past ~5 mins in my head and struggle to come up with any truly airtight ideas for why one is better than the others.

@coreyjadams coreyjadams requested a review from peterdsharpe May 14, 2026 14:41
@coreyjadams

Collaborator Author

This PR effectively is doing two things:

  1. Consolidate the logic of delete-then-upload to refresh immutable caches for the various caching mechanisms I've set up to accelerate our CI. Since there are now several, it made sense to turn it into a custom action that can be reused.

  2. Pin CI dependencies to specific versions.

If needed I'll split these up.

@peterdsharpe peterdsharpe left a comment

Collaborator

Looks good! Interesting discussion about how to implement version-pinning; I think any of the three presented options are perfectly fine (including as-is).

@coreyjadams coreyjadams merged commit 0b2c91a into main May 18, 2026
3 checks passed
@coreyjadams coreyjadams deleted the fix-testmon-db-cache branch May 18, 2026 18:11
kashif pushed a commit to kashif/physicsnemo that referenced this pull request May 21, 2026
* Centralize cache delete-and-push mechanism to one place

* Pin CI deps in a file, instead of the github action.

* Use a localized reinstall for pyg.

* increasing retry backoff

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>