fix: cache notebook builds to avoid flaky upstream model failures by andreatgretel · Pull Request #370 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-03-05T11:02:27Z

Summary

The build-notebooks CI workflow executes all tutorial notebooks on every run. When an upstream model (e.g. black-forest-labs/flux.2-pro) is temporarily down, the entire docs build fails even if no notebooks changed. This PR adds per-notebook caching so unchanged notebooks are served from cache, and only modified ones are re-executed.

Caching is scoped to workflow_call only (i.e. when triggered from build-docs). Scheduled Monday runs and manual workflow_dispatch runs execute all notebooks without caching to catch regressions from library changes.

Changes

Added

docs/scripts/build_notebooks_cached.sh - per-notebook caching script that compares source file SHA-256 hashes against cached values, skipping execution for unchanged notebooks
Cache restore + artifact seed steps in build-notebooks.yml - uses actions/cache with a fallback that seeds from the last successful main-branch artifact when the cache is completely empty
USE_CACHE=1 option for make convert-execute-notebooks for local use

Changed

build-notebooks.yml now conditionally uses caching (workflow_call) or full execution (schedule, workflow_dispatch)
Added actions: read permission for gh run list / gh run download

Attention Areas

Reviewers: Please pay special attention to the following:

build-notebooks.yml#L34-L69 - seed step logic: only runs when cache dir is truly empty (not on partial restore), restricts artifact lookup to main branch
Caching is only used via workflow_call (from build-docs), not on scheduled or manual runs — see this comment for the rationale and open question about release builds

Test plan

Verified locally: make convert-execute-notebooks USE_CACHE=1 serves all 6 notebooks from cache when sources are unchanged
Triggered build-notebooks workflow — completed in 26s (vs ~9min) with all notebooks cached
Triggered build-docs workflow — completed successfully
Modify a notebook source and verify only that one is re-executed

Description updated with AI

The build-notebooks CI executes all tutorial notebooks on every run. When an upstream model (e.g. black-forest-labs/flux.2-pro) is down, the entire docs build fails even if no notebooks changed. Add per-notebook caching based on source file SHA-256 hashes. Unchanged notebooks are served from cache, and only modified ones are re-executed. On the first CI run (empty cache), the workflow seeds the cache from the last successful build artifact. Also add a minimal test script (test_flux_image_gen.py) to reproduce the flux.2-pro health check failure locally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-05T11:05:45Z

Greptile Summary

This PR introduces per-notebook caching for the build-notebooks CI workflow to prevent full docs-build failures when upstream AI models are transiently unavailable. It is a well-designed, multi-iteration change with the main concerns from prior review rounds now addressed.

Key changes:

A new docs/scripts/build_notebooks_cached.sh script computes a SHA-256 hash of each .py source file and only re-executes notebooks whose hash has changed since the last cached run.
actions/cache is used in build-notebooks.yml with a source-hash-based key and a notebooks- fallback restore key to support partial restores.
A seed step bootstraps an empty cache from the last successful main-branch artifact, with a partial-restore guard preventing stale hash overwrites.
Caching is opt-in: enabled by default for workflow_call (from build-docs), disabled by default for workflow_dispatch and schedule to preserve regression-detection runs.
build-docs.yml exposes use_cache as a manual input but always passes false for release-triggered builds (acknowledged open question in PR description).

Minor observations:

The seed step in build-notebooks.yml calls sha256sum directly rather than using the cross-platform compute_sha256 pattern from build_notebooks_cached.sh — inconsistent but harmless on ubuntu-latest.
build_notebooks_cached.sh does not set shopt -s nullglob before the *.py glob, which would cause a hard failure (via set -e) if SOURCE_DIR were ever empty.

Confidence Score: 4/5

Safe to merge; caching logic is correct and all previously flagged critical issues have been addressed.
The core caching mechanism (hash comparison, partial-restore guard, seed step, --branch main filter) is sound and all prior review concerns are resolved. The two remaining observations are style/robustness nits that do not affect correctness under normal CI conditions. The one acknowledged gap — release builds bypassing the cache — is a conscious design decision documented in the PR description.
.github/workflows/build-notebooks.yml seed step (lines 44–79) deserves the most scrutiny as it contains the most complex conditional logic; docs/scripts/build_notebooks_cached.sh is otherwise straightforward.

Important Files Changed

Filename	Overview
.github/workflows/build-notebooks.yml	Adds conditional caching logic (`use_cache` input), `actions/cache` restore step, an artifact seed step that seeds `.notebook-cache/` when it is truly empty, and plumbs `USE_CACHE=1` through to `make`. Previously-discussed concerns (partial-restore guard, `--branch main` filter, `SEED_TMPDIR` rename, sha hash-writing in bootstrapping case) are addressed in the current HEAD.
.github/workflows/build-docs.yml	Adds `use_cache` workflow_dispatch input (default true) and passes it to `build-notebooks.yml` only for dispatch events; release-triggered builds always pass `false`, leaving them without flakiness protection (acknowledged open question in PR description).
docs/scripts/build_notebooks_cached.sh	New per-notebook caching script using SHA-256 source hashes; includes cross-platform `compute_sha256` helper, correct cache read/write logic, and conditional cleanup only when at least one notebook is re-executed.
Makefile	Adds `ifeq (USE_CACHE,1)` branch to `convert-execute-notebooks` target that delegates to the new script, with a cleaner conditional artifact cleanup compared to the non-cached path.
.gitignore	Adds `.notebook-cache/` to .gitignore — correct and expected.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build-notebooks triggered] --> B{Trigger type?}
    B -- workflow_call\nuse_cache=true --> C[Restore notebook cache\nactions/cache keyed on source hashes]
    B -- workflow_dispatch\nuse_cache=false --> Z[Full re-execution\nno caching]
    B -- schedule Mon --> Z

    C --> D{cache-hit?}
    D -- Yes --> G[make convert-execute-notebooks USE_CACHE=1]
    D -- No --> E{.notebook-cache\nnon-empty?\npartial restore?}
    E -- Yes --> G
    E -- No --> F[Seed from last successful\nmain-branch artifact\ngh run download\nwrite .sha256 per notebook]
    F --> G

    G --> H{Per notebook:\nstored hash == current hash?}
    H -- Match --> I[Serve .ipynb from cache]
    H -- No match or missing --> J[jupytext execute source.py\nUpdate cache entry]
    I --> K[Upload docs/notebooks artifact]
    J --> K
    Z --> K

Comments Outside Diff (2)

.github/workflows/build-notebooks.yml, line 66 (link)

Seed step bypasses the cross-platform sha256sum wrapper

The previous review noted shasum vs sha256sum inconsistency, and the fix added a compute_sha256 wrapper function to build_notebooks_cached.sh that prefers sha256sum with a shasum -a 256 fallback. However, the seed step here still calls sha256sum directly.

On ubuntu-latest this is harmless, but if the runner were ever changed to macOS (where sha256sum is not in PATH by default), the seed step would fail while the script itself would succeed. For symmetry and to avoid future confusion, consider inlining the same two-liner fallback pattern here:
```
hash="$(if command -v sha256sum >/dev/null 2>&1; then sha256sum "$src" | cut -d' ' -f1; else shasum -a 256 "$src" | cut -d' ' -f1; fi)"
```
Or, since this step already sources a shell snippet, extracting it to a small helper script (similar to build_notebooks_cached.sh) would keep both places consistent.
docs/scripts/build_notebooks_cached.sh, line 36 (link)

Unguarded glob may fail with a literal pattern when SOURCE_DIR is empty

With set -euo pipefail active, if "$SOURCE_DIR"/*.py matches nothing, bash expands it to the literal string $SOURCE_DIR/*.py (bash's default failglob is off, but the unexpanded path is passed to compute_sha256, which then calls sha256sum on a non-existent file and exits non-zero).

Adding shopt -s nullglob before the loop (and optionally clearing it after) makes the empty-directory case a no-op instead of an error:

_{Last reviewed commit: 67101d6}

- Don't write .sha256 during seeding so changed notebooks are detected - Rename TMPDIR to SEED_TMPDIR to avoid shadowing the POSIX env var - Use portable sha256 helper (sha256sum with shasum fallback) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Skip artifact seeding when a partial cache was restored (it already has correct per-file hashes). Only seed + write current hashes when the cache dir is completely empty (true bootstrapping). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Prevents seeding from feature branch runs that may have different notebook sources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The seed step uses gh run list and gh run download which require actions:read. Without it, these calls silently fail and the cold-start cache bootstrapping never executes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scheduled Monday runs and manual workflow_dispatch should execute all notebooks to catch regressions (e.g. library changes that break a notebook). Caching is only used via workflow_call (from build-docs) where the goal is fast, resilient doc deployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andreatgretel · 2026-03-05T11:48:50Z

On caching strategy: Both workflows now have a use_cache input flag that can be toggled from the GitHub Actions UI. Defaults:

Trigger	Cache	Rationale
`build-docs` manual dispatch	on (toggleable)	fast doc rebuilds, skip flaky models
`build-docs` release	off	releases should validate everything
`build-notebooks` manual dispatch	off (toggleable)	manual runs are for regression checking
`build-notebooks` Monday cron	off	weekly full validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andreatgretel · 2026-03-05T11:59:52Z

Also, re greptile's comment on the summary above - it's a fair point but I think it's fine in practice. The _pyproject.toml just has a loose data-designer>=0.1.0 constraint, and including uv.lock in the cache key would bust the cache on every unrelated dependency bump, which kind of defeats the purpose. The Monday scheduled run already re-executes everything without cache, so dependency regressions get caught weekly anyway.

Replace event_name-based cache logic with an explicit use_cache boolean input. Defaults: - build-notebooks: workflow_call=true, dispatch=false, schedule=false - build-docs: dispatch=true (toggleable), release=false This gives full control over caching from the GitHub Actions UI.

andreatgretel requested a review from a team as a code owner March 5, 2026 11:02

greptile-apps Bot reviewed Mar 5, 2026

View reviewed changes

Comment thread .github/workflows/build-notebooks.yml

Comment thread .github/workflows/build-notebooks.yml Outdated

Comment thread docs/scripts/build_notebooks_cached.sh Outdated

greptile-apps Bot reviewed Mar 5, 2026

View reviewed changes

Comment thread .github/workflows/build-notebooks.yml

greptile-apps Bot reviewed Mar 5, 2026

View reviewed changes

Comment thread .github/workflows/build-notebooks.yml Outdated

andreatgretel and others added 3 commits March 5, 2026 08:19

fix: restrict artifact seed lookup to main branch

1502732

Prevents seeding from feature branch runs that may have different notebook sources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: use jq // empty to avoid "null" string on empty run list

c8add33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

johnnygreco approved these changes Mar 5, 2026

View reviewed changes

andreatgretel merged commit 2564834 into main Mar 5, 2026
47 checks passed

andreatgretel deleted the andreatgretel/fix/notebook-caching branch April 14, 2026 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: cache notebook builds to avoid flaky upstream model failures#370

fix: cache notebook builds to avoid flaky upstream model failures#370
andreatgretel merged 8 commits intomainfrom
andreatgretel/fix/notebook-caching

andreatgretel commented Mar 5, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Mar 5, 2026 •

edited

Loading

Confidence Score: 4/5

Flowchart

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel commented Mar 5, 2026 •

edited

Loading

Uh oh!

andreatgretel commented Mar 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andreatgretel commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Added

Changed

Attention Areas

Test plan

Uh oh!

greptile-apps Bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (2)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andreatgretel commented Mar 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andreatgretel commented Mar 5, 2026 •

edited

Loading

greptile-apps Bot commented Mar 5, 2026 •

edited

Loading

andreatgretel commented Mar 5, 2026 •

edited

Loading