Skip to content

fix: cache notebook builds to avoid flaky upstream model failures#370

Merged
andreatgretel merged 8 commits intomainfrom
andreatgretel/fix/notebook-caching
Mar 5, 2026
Merged

fix: cache notebook builds to avoid flaky upstream model failures#370
andreatgretel merged 8 commits intomainfrom
andreatgretel/fix/notebook-caching

Conversation

@andreatgretel
Copy link
Copy Markdown
Contributor

@andreatgretel andreatgretel commented Mar 5, 2026

Summary

The build-notebooks CI workflow executes all tutorial notebooks on every run. When an upstream model (e.g. black-forest-labs/flux.2-pro) is temporarily down, the entire docs build fails even if no notebooks changed. This PR adds per-notebook caching so unchanged notebooks are served from cache, and only modified ones are re-executed.

Caching is scoped to workflow_call only (i.e. when triggered from build-docs). Scheduled Monday runs and manual workflow_dispatch runs execute all notebooks without caching to catch regressions from library changes.

Changes

Added

  • docs/scripts/build_notebooks_cached.sh - per-notebook caching script that compares source file SHA-256 hashes against cached values, skipping execution for unchanged notebooks
  • Cache restore + artifact seed steps in build-notebooks.yml - uses actions/cache with a fallback that seeds from the last successful main-branch artifact when the cache is completely empty
  • USE_CACHE=1 option for make convert-execute-notebooks for local use

Changed

  • build-notebooks.yml now conditionally uses caching (workflow_call) or full execution (schedule, workflow_dispatch)
  • Added actions: read permission for gh run list / gh run download

Attention Areas

Reviewers: Please pay special attention to the following:

  • build-notebooks.yml#L34-L69 - seed step logic: only runs when cache dir is truly empty (not on partial restore), restricts artifact lookup to main branch
  • Caching is only used via workflow_call (from build-docs), not on scheduled or manual runs — see this comment for the rationale and open question about release builds

Test plan

  • Verified locally: make convert-execute-notebooks USE_CACHE=1 serves all 6 notebooks from cache when sources are unchanged
  • Triggered build-notebooks workflow — completed in 26s (vs ~9min) with all notebooks cached
  • Triggered build-docs workflow — completed successfully
  • Modify a notebook source and verify only that one is re-executed

Description updated with AI

The build-notebooks CI executes all tutorial notebooks on every run.
When an upstream model (e.g. black-forest-labs/flux.2-pro) is down, the
entire docs build fails even if no notebooks changed.

Add per-notebook caching based on source file SHA-256 hashes. Unchanged
notebooks are served from cache, and only modified ones are re-executed.
On the first CI run (empty cache), the workflow seeds the cache from the
last successful build artifact.

Also add a minimal test script (test_flux_image_gen.py) to reproduce the
flux.2-pro health check failure locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andreatgretel andreatgretel requested a review from a team as a code owner March 5, 2026 11:02
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 5, 2026

Greptile Summary

This PR introduces per-notebook caching for the build-notebooks CI workflow to prevent full docs-build failures when upstream AI models are transiently unavailable. It is a well-designed, multi-iteration change with the main concerns from prior review rounds now addressed.

Key changes:

  • A new docs/scripts/build_notebooks_cached.sh script computes a SHA-256 hash of each .py source file and only re-executes notebooks whose hash has changed since the last cached run.
  • actions/cache is used in build-notebooks.yml with a source-hash-based key and a notebooks- fallback restore key to support partial restores.
  • A seed step bootstraps an empty cache from the last successful main-branch artifact, with a partial-restore guard preventing stale hash overwrites.
  • Caching is opt-in: enabled by default for workflow_call (from build-docs), disabled by default for workflow_dispatch and schedule to preserve regression-detection runs.
  • build-docs.yml exposes use_cache as a manual input but always passes false for release-triggered builds (acknowledged open question in PR description).

Minor observations:

  • The seed step in build-notebooks.yml calls sha256sum directly rather than using the cross-platform compute_sha256 pattern from build_notebooks_cached.sh — inconsistent but harmless on ubuntu-latest.
  • build_notebooks_cached.sh does not set shopt -s nullglob before the *.py glob, which would cause a hard failure (via set -e) if SOURCE_DIR were ever empty.

Confidence Score: 4/5

  • Safe to merge; caching logic is correct and all previously flagged critical issues have been addressed.
  • The core caching mechanism (hash comparison, partial-restore guard, seed step, --branch main filter) is sound and all prior review concerns are resolved. The two remaining observations are style/robustness nits that do not affect correctness under normal CI conditions. The one acknowledged gap — release builds bypassing the cache — is a conscious design decision documented in the PR description.
  • .github/workflows/build-notebooks.yml seed step (lines 44–79) deserves the most scrutiny as it contains the most complex conditional logic; docs/scripts/build_notebooks_cached.sh is otherwise straightforward.

Important Files Changed

Filename Overview
.github/workflows/build-notebooks.yml Adds conditional caching logic (use_cache input), actions/cache restore step, an artifact seed step that seeds .notebook-cache/ when it is truly empty, and plumbs USE_CACHE=1 through to make. Previously-discussed concerns (partial-restore guard, --branch main filter, SEED_TMPDIR rename, sha hash-writing in bootstrapping case) are addressed in the current HEAD.
.github/workflows/build-docs.yml Adds use_cache workflow_dispatch input (default true) and passes it to build-notebooks.yml only for dispatch events; release-triggered builds always pass false, leaving them without flakiness protection (acknowledged open question in PR description).
docs/scripts/build_notebooks_cached.sh New per-notebook caching script using SHA-256 source hashes; includes cross-platform compute_sha256 helper, correct cache read/write logic, and conditional cleanup only when at least one notebook is re-executed.
Makefile Adds ifeq (USE_CACHE,1) branch to convert-execute-notebooks target that delegates to the new script, with a cleaner conditional artifact cleanup compared to the non-cached path.
.gitignore Adds .notebook-cache/ to .gitignore — correct and expected.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build-notebooks triggered] --> B{Trigger type?}
    B -- workflow_call\nuse_cache=true --> C[Restore notebook cache\nactions/cache keyed on source hashes]
    B -- workflow_dispatch\nuse_cache=false --> Z[Full re-execution\nno caching]
    B -- schedule Mon --> Z

    C --> D{cache-hit?}
    D -- Yes --> G[make convert-execute-notebooks USE_CACHE=1]
    D -- No --> E{.notebook-cache\nnon-empty?\npartial restore?}
    E -- Yes --> G
    E -- No --> F[Seed from last successful\nmain-branch artifact\ngh run download\nwrite .sha256 per notebook]
    F --> G

    G --> H{Per notebook:\nstored hash == current hash?}
    H -- Match --> I[Serve .ipynb from cache]
    H -- No match or missing --> J[jupytext execute source.py\nUpdate cache entry]
    I --> K[Upload docs/notebooks artifact]
    J --> K
    Z --> K
Loading

Comments Outside Diff (2)

  1. .github/workflows/build-notebooks.yml, line 66 (link)

    Seed step bypasses the cross-platform sha256sum wrapper

    The previous review noted shasum vs sha256sum inconsistency, and the fix added a compute_sha256 wrapper function to build_notebooks_cached.sh that prefers sha256sum with a shasum -a 256 fallback. However, the seed step here still calls sha256sum directly.

    On ubuntu-latest this is harmless, but if the runner were ever changed to macOS (where sha256sum is not in PATH by default), the seed step would fail while the script itself would succeed. For symmetry and to avoid future confusion, consider inlining the same two-liner fallback pattern here:

    hash="$(if command -v sha256sum >/dev/null 2>&1; then sha256sum "$src" | cut -d' ' -f1; else shasum -a 256 "$src" | cut -d' ' -f1; fi)"

    Or, since this step already sources a shell snippet, extracting it to a small helper script (similar to build_notebooks_cached.sh) would keep both places consistent.

  2. docs/scripts/build_notebooks_cached.sh, line 36 (link)

    Unguarded glob may fail with a literal pattern when SOURCE_DIR is empty

    With set -euo pipefail active, if "$SOURCE_DIR"/*.py matches nothing, bash expands it to the literal string $SOURCE_DIR/*.py (bash's default failglob is off, but the unexpanded path is passed to compute_sha256, which then calls sha256sum on a non-existent file and exits non-zero).

    Adding shopt -s nullglob before the loop (and optionally clearing it after) makes the empty-directory case a no-op instead of an error:

Last reviewed commit: 67101d6

Comment thread .github/workflows/build-notebooks.yml
Comment thread .github/workflows/build-notebooks.yml Outdated
Comment thread docs/scripts/build_notebooks_cached.sh Outdated
- Don't write .sha256 during seeding so changed notebooks are detected
- Rename TMPDIR to SEED_TMPDIR to avoid shadowing the POSIX env var
- Use portable sha256 helper (sha256sum with shasum fallback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread .github/workflows/build-notebooks.yml
Skip artifact seeding when a partial cache was restored (it already has
correct per-file hashes). Only seed + write current hashes when the
cache dir is completely empty (true bootstrapping).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread .github/workflows/build-notebooks.yml Outdated
andreatgretel and others added 3 commits March 5, 2026 08:19
Prevents seeding from feature branch runs that may have different
notebook sources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The seed step uses gh run list and gh run download which require
actions:read. Without it, these calls silently fail and the cold-start
cache bootstrapping never executes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scheduled Monday runs and manual workflow_dispatch should execute all
notebooks to catch regressions (e.g. library changes that break a
notebook). Caching is only used via workflow_call (from build-docs)
where the goal is fast, resilient doc deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andreatgretel
Copy link
Copy Markdown
Contributor Author

andreatgretel commented Mar 5, 2026

On caching strategy: Both workflows now have a use_cache input flag that can be toggled from the GitHub Actions UI. Defaults:

Trigger Cache Rationale
build-docs manual dispatch on (toggleable) fast doc rebuilds, skip flaky models
build-docs release off releases should validate everything
build-notebooks manual dispatch off (toggleable) manual runs are for regression checking
build-notebooks Monday cron off weekly full validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andreatgretel
Copy link
Copy Markdown
Contributor Author

Also, re greptile's comment on the summary above - it's a fair point but I think it's fine in practice. The _pyproject.toml just has a loose data-designer>=0.1.0 constraint, and including uv.lock in the cache key would bust the cache on every unrelated dependency bump, which kind of defeats the purpose. The Monday scheduled run already re-executes everything without cache, so dependency regressions get caught weekly anyway.

Replace event_name-based cache logic with an explicit use_cache boolean
input. Defaults:
- build-notebooks: workflow_call=true, dispatch=false, schedule=false
- build-docs: dispatch=true (toggleable), release=false

This gives full control over caching from the GitHub Actions UI.
@andreatgretel andreatgretel merged commit 2564834 into main Mar 5, 2026
47 checks passed
@andreatgretel andreatgretel deleted the andreatgretel/fix/notebook-caching branch April 14, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants