Skip to content

[PRI-280] Nightly notebook install check#901

Open
adrian-prior wants to merge 10 commits intomainfrom
adrian/pri-280-colab-notebook-outdated-tabpfn
Open

[PRI-280] Nightly notebook install check#901
adrian-prior wants to merge 10 commits intomainfrom
adrian/pri-280-colab-notebook-outdated-tabpfn

Conversation

@adrian-prior
Copy link
Copy Markdown
Contributor

@adrian-prior adrian-prior commented Apr 27, 2026

Summary

Adds a nightly workflow that reproduces every examples/notebooks/*.ipynb install sequence in a fresh venv across Python 3.11 / 3.12 / 3.13, and asserts tabpfn resolves to PyPI's latest. Failures open or comment on a tracking issue (no email by default for scheduled runs).

Catches the regression class behind PRI-280 — a transitive dep capping tabpfn below the latest release.

Files

  • tests/test_notebook_installs.py — pytest test, gated by RUN_NOTEBOOK_INSTALL_CHECK=1. Parameterized across notebooks.
  • .github/workflows/nightly-notebook-check.yml — daily cron + workflow_dispatch, matrix over Python 3.11/3.12/3.13.
  • changelog/901.added.md — towncrier fragment.

Verified end-to-end

Sample run on this PR: https://github.com/PriorLabs/TabPFN/actions/runs/25063167744/job/73422829076?pr=901

Catches the failure mode that caused PRI-280: a transitive dependency
in one of the example notebooks silently caps `tabpfn` below the latest
PyPI release, so Colab users get an outdated model.

Two files:

- tests/test_notebook_installs.py
  Parses pip-install lines from each `examples/notebooks/*.ipynb` and
  reproduces the install sequence in a fresh per-notebook venv. Asserts
  the resolved `tabpfn` version matches the current latest on PyPI.
  Skipped by default; enabled via RUN_NOTEBOOK_INSTALL_CHECK=1.

  Sets UV_NO_CONFIG=1 in the subprocess env so any [tool.uv] config in
  the repo's pyproject.toml (e.g. `exclude-newer = "7 days"`) doesn't
  bleed into the simulated user environment. Without this the test would
  produce false positives when a freshly-released dep is excluded by
  exclude-newer and the resolver falls back to a stale capped version.

  Network failures during install or PyPI metadata fetch are treated as
  infra issues (test skip, not fail) so a flaky PyPI doesn't open false
  tracking issues from the nightly run.

- .github/workflows/nightly-notebook-check.yml
  Daily cron at 07:17 UTC + workflow_dispatch. Runs the test against
  the live PyPI index. On failure, opens or comments on a tracking
  GitHub issue with the `nightly-failure` label — scheduled-run
  failures don't notify maintainers by default, so we make the failure
  visible by surfacing it in the issue tracker.

  `pytest --noconftest` is used because tests/conftest.py imports
  numpy/torch and the runner here doesn't install them (the test
  creates its own per-notebook venvs).

Verified end-to-end locally: with the cap fix in tabpfn-extensions
0.3.0 now live on PyPI, both notebooks resolve `tabpfn==7.1.1` and
the test passes. With UV_NO_CONFIG omitted, the test correctly fails
because TabPFN's repo-local `exclude-newer = "7 days"` excludes the
3-day-old 0.3.0 release and the resolver falls back to 0.2.2 → 6.4.1.
That negative test confirms the assertion catches the real bug class.
@adrian-prior adrian-prior requested a review from a team as a code owner April 27, 2026 22:04
@adrian-prior adrian-prior requested review from bejaeger and removed request for a team April 27, 2026 22:04
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new test suite in tests/test_notebook_installs.py designed to verify that example notebooks correctly resolve the tabpfn package to its latest PyPI release. The test simulates a fresh installation environment using uv to detect potential transitive dependency conflicts. Feedback focuses on optimizing the PyPI version lookup through caching and improving the portability of the test suite by handling cross-platform path differences for virtual environments.

Comment thread tests/test_notebook_installs.py
Comment thread tests/test_notebook_installs.py Outdated
Comment thread tests/test_notebook_installs.py
@adrian-prior adrian-prior marked this pull request as draft April 27, 2026 22:27
@adrian-prior adrian-prior removed the request for review from bejaeger April 27, 2026 22:28
- Suppress S603/S607 on subprocess calls (uv args come from controlled
  notebook source, uv intentionally found via PATH).
- Suppress S310 on the urlopen (hardcoded https://pypi.org URL, no scheme
  ambiguity) — same pattern as src/tabpfn/model_loading.py.
- Cast json-loaded version string to str so mypy stops complaining about
  Returning Any.
- Add explicit check=False on the install subprocess (return code is
  inspected manually).
- Wrap the over-long stderr-tail line.
- Unquote the cron expression to satisfy yamllint.
- Apply ruff format.

Verified: `ruff check` + `ruff format --check` both pass; the test still
passes locally with RUN_NOTEBOOK_INSTALL_CHECK=1.
- @functools.lru_cache on _latest_tabpfn_version: hit PyPI once per
  session instead of once per notebook. Also makes assertion stable if
  a tabpfn release happens to land mid-test-session.
- Cross-platform PATH/bin construction: use os.pathsep and `Scripts`
  on Windows. CI is ubuntu-only, but local-on-Windows now works.
Previously the test caught network-shaped errors and converted them to
pytest.skip — to avoid false-positive nightly-failure issues from
transient PyPI blips. In practice this hides real problems behind a
"skipped" status and adds non-trivial complexity (retry loops, backoff,
NETWORKY string matching, an "unreachable" RuntimeError fallback).

New approach: any error is a test failure. The workflow's failure-issue
filer catches it; the maintainer triages and reruns via workflow_dispatch
if it was just a transient network issue. One click vs. 30 lines of
defensive code.

Concretely:
- Drop retry loop in _latest_tabpfn_version; one urlopen, raise on failure.
- Drop NETWORKY constant + the network-error-detection branch in the
  install loop.
- Use check=True on the install subprocess and drop capture_output —
  uv's progress now streams directly to the GH Actions log, which is
  more useful for debugging than a captured stderr tail.
- Drop now-unused imports: time, urllib.error.

13 insertions, 39 deletions.
- Drop speculative "likely cause" line from the assertion message and
  the issue body. Just state that the installed version doesn't match
  PyPI latest; let the maintainer triage.
- Drop the "catches transitive caps" framing from the test docstring
  and workflow header — describe what the workflow does, not what it's
  supposed to catch.
- Add changelog/901.added.md so the towncrier changelog check passes.
- Remove the "skip if no install line mentions tabpfn" check. It's
  half-broken (matches tabpfn-client, which doesn't install tabpfn) and
  defensive against a scenario that shouldn't exist: an example notebook
  that doesn't install tabpfn. If that ever happens, the test failing
  loudly is the right outcome.

- Drop --python 3.12 from `uv venv`. The workflow's actions/setup-python
  step already pins the runner Python to 3.12, so the flag was duplicate
  configuration. Keeping the version in one place means future bumps
  (e.g. when Colab moves off 3.12) only need a single edit.
Catches Python-version-specific resolver outcomes (e.g. a transitive
dep with `python_version`-conditional pins, or a wheel that ships for
some versions but not others). Future-proofs against Colab moving off
3.12 too — the matrix will catch a regression on the new version
without requiring a workflow edit.

- strategy.matrix.python-version: ["3.11", "3.12", "3.13"]
- fail-fast: false so 3.11 failing doesn't hide 3.12 / 3.13 results
- Each failing matrix cell files or comments on the same shared
  tracking issue, with the Python version included in the body for
  triage. Documented race condition: if all three matrix cells fail
  simultaneously and all see an empty issue list, multiple issues may
  get filed. Acceptable — maintainer closes duplicates.
Previously the issue body only had the run URL and Python version. The
maintainer had to click through to the GH Actions log to see which
notebook failed and what version was installed.

Now the test step tees pytest output to pytest_output.txt (alongside the
live stream to the runner log). The github-script step reads the file
and extracts lines starting with "FAILED " — which pytest -v prints in
the short summary section, including the test ID (with notebook name)
and the full assertion message. Both are factual content, not
speculation.

Example body the maintainer would see:

  Run: https://...
  Python: 3.13

  ```
  FAILED tests/test_notebook_installs.py::...[TabPFN_Demo_Local.ipynb]
   - AssertionError: TabPFN_Demo_Local.ipynb: installed tabpfn==6.4.1,
     PyPI latest is 7.1.1.
  ```

  See logs for full output.

If the file is missing or contains no FAILED lines (e.g. a step crashed
before pytest ran), the body falls back to a "(no FAILED lines parsed;
see logs)" placeholder.
To be reverted before merge. Lets us exercise the workflow end-to-end
on a GitHub-hosted runner (matrix across Python 3.11/3.12/3.13) without
waiting for the nightly cron. paths: filter prevents firing on
unrelated PRs if this commit accidentally lands on main.
Was added to validate the workflow on a PR runner before merge. Now that
we've seen it run end-to-end (https://github.com/PriorLabs/TabPFN/actions/runs/25063167744),
revert to the schedule + workflow_dispatch shape.
@adrian-prior adrian-prior marked this pull request as ready for review April 28, 2026 15:56
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@adrian-prior adrian-prior requested a review from bejaeger April 28, 2026 15:57
Copy link
Copy Markdown
Collaborator

@bejaeger bejaeger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Looks mostly good to me, just some suggestions.

Also, do you have an estimate how long that test would run?

owner: context.repo.owner,
repo: context.repo.repo,
state: "open",
labels: ["nightly-failure"],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that label already added?

if (lines.length > 0) failedLines = lines.join("\n");
} catch (e) { /* output file missing — fall through to default */ }

const title = "Nightly: notebook install check failed";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add the python version in this title? otherwise we get three issues with the same title of all python versions fail?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants