[PRI-280] Nightly notebook install check #901
Conversation
Catches the failure mode that caused PRI-280: a transitive dependency in one of the example notebooks silently caps `tabpfn` below the latest PyPI release, so Colab users get an outdated model. Two files:

- tests/test_notebook_installs.py
  Parses pip-install lines from each `examples/notebooks/*.ipynb` and reproduces the install sequence in a fresh per-notebook venv. Asserts the resolved `tabpfn` version matches the current latest on PyPI. Skipped by default; enabled via RUN_NOTEBOOK_INSTALL_CHECK=1.
  Sets UV_NO_CONFIG=1 in the subprocess env so any [tool.uv] config in the repo's pyproject.toml (e.g. `exclude-newer = "7 days"`) doesn't bleed into the simulated user environment. Without this, the test would produce false positives when a freshly released dep is excluded by exclude-newer and the resolver falls back to a stale capped version.
  Network failures during install or PyPI metadata fetch are treated as infra issues (test skip, not fail) so a flaky PyPI doesn't open false tracking issues from the nightly run.
- .github/workflows/nightly-notebook-check.yml
  Daily cron at 07:17 UTC + workflow_dispatch. Runs the test against the live PyPI index. On failure, opens or comments on a tracking GitHub issue with the `nightly-failure` label — scheduled-run failures don't notify maintainers by default, so we make the failure visible by surfacing it in the issue tracker.
  `pytest --noconftest` is used because tests/conftest.py imports numpy/torch and the runner here doesn't install them (the test creates its own per-notebook venvs).

Verified end-to-end locally: with the cap fix in tabpfn-extensions 0.3.0 now live on PyPI, both notebooks resolve `tabpfn==7.1.1` and the test passes. With UV_NO_CONFIG omitted, the test correctly fails because TabPFN's repo-local `exclude-newer = "7 days"` excludes the 3-day-old 0.3.0 release and the resolver falls back to 0.2.2 → 6.4.1. That negative test confirms the assertion catches the real bug class.
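The parsing step described above can be sketched roughly as follows. This is illustrative only (`pip_install_args` is a hypothetical name, and the real test reads `.ipynb` files from disk and replays each install in a fresh uv venv with UV_NO_CONFIG=1 set):

```python
import json
import re


def pip_install_args(notebook_json: str) -> list[str]:
    """Extract the arguments of `pip install ...` lines from a notebook's code cells.

    Sketch only: takes the notebook's JSON text rather than a file path.
    """
    nb = json.loads(notebook_json)
    args = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for line in cell.get("source", []):
            # Colab-style installs look like `!pip install tabpfn` or `%pip install -q foo`
            m = re.match(r"^\s*[!%]\s*pip install (.+?)\s*$", line)
            if m:
                args.append(m.group(1))
    return args
```

Each extracted argument string would then be handed to `uv pip install` inside the per-notebook venv.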
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
This pull request introduces a new test suite in tests/test_notebook_installs.py designed to verify that example notebooks correctly resolve the tabpfn package to its latest PyPI release. The test simulates a fresh installation environment using uv to detect potential transitive dependency conflicts. Feedback focuses on optimizing the PyPI version lookup through caching and improving the portability of the test suite by handling cross-platform path differences for virtual environments.
- Suppress S603/S607 on subprocess calls (uv args come from controlled notebook source, and uv is intentionally found via PATH).
- Suppress S310 on the urlopen (hardcoded https://pypi.org URL, no scheme ambiguity) — same pattern as src/tabpfn/model_loading.py.
- Cast the JSON-loaded version string to str so mypy stops complaining about returning Any.
- Add explicit check=False on the install subprocess (the return code is inspected manually).
- Wrap the over-long stderr-tail line.
- Unquote the cron expression to satisfy yamllint.
- Apply ruff format.

Verified: `ruff check` + `ruff format --check` both pass; the test still passes locally with RUN_NOTEBOOK_INSTALL_CHECK=1.
- @functools.lru_cache on _latest_tabpfn_version: hit PyPI once per session instead of once per notebook. Also keeps the assertion stable if a tabpfn release happens to land mid-test-session.
- Cross-platform PATH/bin construction: use os.pathsep and `Scripts` on Windows. CI is ubuntu-only, but local-on-Windows now works.
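A minimal sketch of both changes (helper names here are illustrative, not the test's actual identifiers):

```python
import functools
import json
import os
from pathlib import Path
from urllib.request import urlopen


@functools.lru_cache(maxsize=None)
def latest_pypi_version(package: str = "tabpfn") -> str:
    """Hit PyPI once per session; repeat calls return the cached value,
    so the asserted version stays stable even if a release lands mid-run."""
    with urlopen(f"https://pypi.org/pypi/{package}/json") as resp:  # noqa: S310
        return str(json.load(resp)["info"]["version"])


def venv_bin_dir(venv: Path) -> Path:
    """Windows venvs put executables under Scripts/, POSIX venvs under bin/."""
    return venv / ("Scripts" if os.name == "nt" else "bin")


# When spawning commands inside the venv, prepend its bin dir portably:
#   env["PATH"] = str(venv_bin_dir(venv)) + os.pathsep + env["PATH"]
```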
Previously the test caught network-shaped errors and converted them to pytest.skip — to avoid false-positive nightly-failure issues from transient PyPI blips. In practice this hides real problems behind a "skipped" status and adds non-trivial complexity (retry loops, backoff, NETWORKY string matching, an "unreachable" RuntimeError fallback).

New approach: any error is a test failure. The workflow's failure-issue filer catches it; the maintainer triages and reruns via workflow_dispatch if it was just a transient network issue. One click vs. 30 lines of defensive code. Concretely:

- Drop the retry loop in _latest_tabpfn_version; one urlopen, raise on failure.
- Drop the NETWORKY constant + the network-error-detection branch in the install loop.
- Use check=True on the install subprocess and drop capture_output — uv's progress now streams directly to the GH Actions log, which is more useful for debugging than a captured stderr tail.
- Drop the now-unused imports: time, urllib.error.

13 insertions, 39 deletions.
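The simplified install step amounts to something like this (a sketch, assuming a hypothetical `run_streaming` wrapper rather than the test's actual code):

```python
import subprocess
import sys


def run_streaming(cmd: list[str]) -> None:
    """Run a command with stdout/stderr inherited from the parent process.

    No capture_output: progress goes straight to the CI log. check=True turns
    any non-zero exit into CalledProcessError, which pytest reports as a
    failure -- no manual returncode inspection, no skip-on-network branch.
    """
    subprocess.run(cmd, check=True)


# e.g. run_streaming(["uv", "pip", "install", *install_args]) inside the test loop
```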
- Drop the speculative "likely cause" line from the assertion message and the issue body. Just state that the installed version doesn't match PyPI latest; let the maintainer triage.
- Drop the "catches transitive caps" framing from the test docstring and workflow header — describe what the workflow does, not what it's supposed to catch.
- Add changelog/901.added.md so the towncrier changelog check passes.
- Remove the "skip if no install line mentions tabpfn" check. It's half-broken (matches tabpfn-client, which doesn't install tabpfn) and defends against a scenario that shouldn't exist: an example notebook that doesn't install tabpfn. If that ever happens, the test failing loudly is the right outcome.
- Drop --python 3.12 from `uv venv`. The workflow's actions/setup-python step already pins the runner Python to 3.12, so the flag was duplicate configuration. Keeping the version in one place means future bumps (e.g. when Colab moves off 3.12) only need a single edit.
Catches Python-version-specific resolver outcomes (e.g. a transitive dep with `python_version`-conditional pins, or a wheel that ships for some versions but not others). Also future-proofs against Colab moving off 3.12 — the matrix will catch a regression on the new version without requiring a workflow edit.

- strategy.matrix.python-version: ["3.11", "3.12", "3.13"]
- fail-fast: false, so 3.11 failing doesn't hide the 3.12 / 3.13 results.
- Each failing matrix cell files or comments on the same shared tracking issue, with the Python version included in the body for triage.

Documented race condition: if all three matrix cells fail simultaneously and all see an empty issue list, multiple issues may get filed. Acceptable — the maintainer closes duplicates.
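The matrix shape described above would look roughly like this in the workflow (a sketch; the step layout and job name are assumed, not copied from the PR):

```yaml
jobs:
  notebook-install-check:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false                      # report all versions, even if one fails
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: RUN_NOTEBOOK_INSTALL_CHECK=1 pytest --noconftest tests/test_notebook_installs.py -v
```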
Previously the issue body only had the run URL and Python version. The
maintainer had to click through to the GH Actions log to see which
notebook failed and what version was installed.
Now the test step tees pytest output to pytest_output.txt (alongside the
live stream to the runner log). The github-script step reads the file
and extracts lines starting with "FAILED " — which pytest -v prints in
the short summary section, including the test ID (with notebook name)
and the full assertion message. Both are factual content, not
speculation.
Example body the maintainer would see:
Run: https://...
Python: 3.13
```
FAILED tests/test_notebook_installs.py::...[TabPFN_Demo_Local.ipynb]
- AssertionError: TabPFN_Demo_Local.ipynb: installed tabpfn==6.4.1,
PyPI latest is 7.1.1.
```
See logs for full output.
If the file is missing or contains no FAILED lines (e.g. a step crashed
before pytest ran), the body falls back to a "(no FAILED lines parsed;
see logs)" placeholder.
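The real extraction runs in the workflow's github-script step (JavaScript); the same logic, sketched in Python with an illustrative helper name:

```python
from pathlib import Path

FALLBACK = "(no FAILED lines parsed; see logs)"


def failure_summary(output_file: Path) -> str:
    """Pull pytest's short-summary `FAILED ...` lines out of the teed log."""
    try:
        text = output_file.read_text()
    except OSError:
        # step crashed before pytest ran, so no output file exists
        return FALLBACK
    failed = [ln for ln in text.splitlines() if ln.startswith("FAILED ")]
    return "\n".join(failed) if failed else FALLBACK
```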
To be reverted before merge. Lets us exercise the workflow end-to-end on a GitHub-hosted runner (matrix across Python 3.11/3.12/3.13) without waiting for the nightly cron. The `paths:` filter prevents the workflow from firing on unrelated PRs if this commit accidentally lands on main.
Was added to validate the workflow on a PR runner before merge. Now that we've seen it run end-to-end (https://github.com/PriorLabs/TabPFN/actions/runs/25063167744), revert to the schedule + workflow_dispatch shape.
bejaeger
left a comment
Thanks for working on this! Looks mostly good to me, just some suggestions.
Also, do you have an estimate of how long that test would run?
```js
owner: context.repo.owner,
repo: context.repo.repo,
state: "open",
labels: ["nightly-failure"],
```
is that label already added?
```js
  if (lines.length > 0) failedLines = lines.join("\n");
} catch (e) { /* output file missing — fall through to default */ }

const title = "Nightly: notebook install check failed";
```
Let's add the python version to this title? Otherwise we get three issues with the same title if all python versions fail.
Summary
Adds a nightly workflow that reproduces every `examples/notebooks/*.ipynb` install sequence in a fresh venv across Python 3.11 / 3.12 / 3.13, and asserts `tabpfn` resolves to PyPI's latest. Failures open or comment on a tracking issue (no email by default for scheduled runs). Catches the regression class behind PRI-280 — a transitive dep capping `tabpfn` below the latest release.

Files

- `tests/test_notebook_installs.py` — pytest test, gated by RUN_NOTEBOOK_INSTALL_CHECK=1. Parameterized across notebooks.
- `.github/workflows/nightly-notebook-check.yml` — daily cron + workflow_dispatch, matrix over Python 3.11/3.12/3.13.
- `changelog/901.added.md` — towncrier fragment.

Verified end-to-end
Sample run on this PR: https://github.com/PriorLabs/TabPFN/actions/runs/25063167744/job/73422829076?pr=901