fix: cache notebook builds to avoid flaky upstream model failures #370
andreatgretel merged 8 commits into main
The build-notebooks CI executes all tutorial notebooks on every run. When an upstream model (e.g. black-forest-labs/flux.2-pro) is down, the entire docs build fails even if no notebooks changed.

Add per-notebook caching based on source file SHA-256 hashes. Unchanged notebooks are served from cache, and only modified ones are re-executed. On the first CI run (empty cache), the workflow seeds the cache from the last successful build artifact.

Also add a minimal test script (test_flux_image_gen.py) to reproduce the flux.2-pro health check failure locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
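A minimal sketch of the per-notebook check described above, assuming a cache directory that holds one .ipynb and one .sha256 file per source. The function names (`file_sha256`, `cache_is_fresh`, `store_in_cache`) and the directory layout are hypothetical and are not taken from build_notebooks_cached.sh.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the SHA-256 hex digest of a file (GNU sha256sum, else BSD/macOS shasum).
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Return 0 when the cache holds a notebook whose stored hash matches the source.
# Assumed layout: <cache_dir>/<name>.ipynb and <cache_dir>/<name>.sha256.
cache_is_fresh() {
  local src="$1" cache_dir="$2" name
  name="$(basename "$src" .py)"
  if [[ ! -f "$cache_dir/$name.ipynb" || ! -f "$cache_dir/$name.sha256" ]]; then
    return 1
  fi
  [[ "$(cat "$cache_dir/$name.sha256")" == "$(file_sha256 "$src")" ]]
}

# Copy a freshly executed notebook into the cache and record the source hash.
store_in_cache() {
  local src="$1" notebook="$2" cache_dir="$3" name
  name="$(basename "$src" .py)"
  mkdir -p "$cache_dir"
  cp "$notebook" "$cache_dir/$name.ipynb"
  file_sha256 "$src" > "$cache_dir/$name.sha256"
}
```

A caller would loop over the sources, copy the cached notebook when `cache_is_fresh` succeeds, and otherwise execute the source and call `store_in_cache`. Keying on the source hash means an upstream outage only affects notebooks whose source actually changed.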
Greptile Summary

This PR introduces per-notebook caching for the

Key changes:
Minor observations:
| Filename | Overview |
|---|---|
| .github/workflows/build-notebooks.yml | Adds conditional caching logic (use_cache input), actions/cache restore step, an artifact seed step that seeds .notebook-cache/ when it is truly empty, and plumbs USE_CACHE=1 through to make. Previously-discussed concerns (partial-restore guard, --branch main filter, SEED_TMPDIR rename, sha hash-writing in bootstrapping case) are addressed in the current HEAD. |
| .github/workflows/build-docs.yml | Adds use_cache workflow_dispatch input (default true) and passes it to build-notebooks.yml only for dispatch events; release-triggered builds always pass false, leaving them without flakiness protection (acknowledged open question in PR description). |
| docs/scripts/build_notebooks_cached.sh | New per-notebook caching script using SHA-256 source hashes; includes cross-platform compute_sha256 helper, correct cache read/write logic, and conditional cleanup only when at least one notebook is re-executed. |
| Makefile | Adds ifeq (USE_CACHE,1) branch to convert-execute-notebooks target that delegates to the new script, with a cleaner conditional artifact cleanup compared to the non-cached path. |
| .gitignore | Adds .notebook-cache/ to .gitignore — correct and expected. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[build-notebooks triggered] --> B{Trigger type?}
B -- workflow_call\nuse_cache=true --> C[Restore notebook cache\nactions/cache keyed on source hashes]
B -- workflow_dispatch\nuse_cache=false --> Z[Full re-execution\nno caching]
B -- schedule Mon --> Z
C --> D{cache-hit?}
D -- Yes --> G[make convert-execute-notebooks USE_CACHE=1]
D -- No --> E{.notebook-cache\nnon-empty?\npartial restore?}
E -- Yes --> G
E -- No --> F[Seed from last successful\nmain-branch artifact\ngh run download\nwrite .sha256 per notebook]
F --> G
G --> H{Per notebook:\nstored hash == current hash?}
H -- Match --> I[Serve .ipynb from cache]
H -- No match or missing --> J[jupytext execute source.py\nUpdate cache entry]
I --> K[Upload docs/notebooks artifact]
J --> K
Z --> K
Comments Outside Diff (2)
- .github/workflows/build-notebooks.yml, line 66: Seed step bypasses the cross-platform sha256sum wrapper

  The previous review noted shasum vs sha256sum inconsistency, and the fix added a compute_sha256 wrapper function to build_notebooks_cached.sh that prefers sha256sum with a shasum -a 256 fallback. However, the seed step here still calls sha256sum directly.

  On ubuntu-latest this is harmless, but if the runner were ever changed to macOS (where sha256sum is not in PATH by default), the seed step would fail while the script itself would succeed. For symmetry and to avoid future confusion, consider inlining the same two-liner fallback pattern here:

      hash="$(if command -v sha256sum >/dev/null 2>&1; then sha256sum "$src" | cut -d' ' -f1; else shasum -a 256 "$src" | cut -d' ' -f1; fi)"

  Or, since this step already sources a shell snippet, extracting it to a small helper script (similar to build_notebooks_cached.sh) would keep both places consistent.

- docs/scripts/build_notebooks_cached.sh, line 36: Unguarded glob may fail with a literal pattern when SOURCE_DIR is empty

  With set -euo pipefail active, if "$SOURCE_DIR"/*.py matches nothing, bash expands it to the literal string $SOURCE_DIR/*.py (bash's default failglob is off, but the unexpanded path is passed to compute_sha256, which then calls sha256sum on a non-existent file and exits non-zero).

  Adding shopt -s nullglob before the loop (and optionally clearing it after) makes the empty-directory case a no-op instead of an error:
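A minimal sketch of that guard combined with the hash fallback from the first comment. The function name `hash_sources` and its output format are hypothetical; only `shopt -s nullglob` and the sha256sum/shasum fallback come from the review comments.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "<sha256>  <file name>" for every *.py file in a directory.
# An empty directory produces no output instead of an error.
hash_sources() {
  local source_dir="$1" src hash
  # Subshell keeps nullglob from leaking into the caller's shell options.
  (
    shopt -s nullglob
    for src in "$source_dir"/*.py; do
      if command -v sha256sum >/dev/null 2>&1; then
        hash="$(sha256sum "$src" | cut -d' ' -f1)"
      else
        hash="$(shasum -a 256 "$src" | cut -d' ' -f1)"
      fi
      printf '%s  %s\n' "$hash" "$(basename "$src")"
    done
  )
}
```

Scoping `nullglob` to a subshell is one way to get the "optionally clearing it after" behaviour without an explicit `shopt -u nullglob`.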
Last reviewed commit: 67101d6
- Don't write .sha256 during seeding so changed notebooks are detected
- Rename TMPDIR to SEED_TMPDIR to avoid shadowing the POSIX env var
- Use portable sha256 helper (sha256sum with shasum fallback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip artifact seeding when a partial cache was restored (it already has correct per-file hashes). Only seed + write current hashes when the cache dir is completely empty (true bootstrapping). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents seeding from feature branch runs that may have different notebook sources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The seed step uses gh run list and gh run download which require actions:read. Without it, these calls silently fail and the cold-start cache bootstrapping never executes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scheduled Monday runs and manual workflow_dispatch should execute all notebooks to catch regressions (e.g. library changes that break a notebook). Caching is only used via workflow_call (from build-docs) where the goal is fast, resilient doc deployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
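The seeding guard from the commits above (seed only when nothing at all was restored) can be sketched as follows. The function names are hypothetical, and the `cache_hit` argument is assumed to carry the `cache-hit` output of the actions/cache restore step, which is `true` only on an exact key match.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 when the directory is missing or has no entries (dotfiles included).
cache_dir_is_empty() {
  local dir="$1"
  if [[ ! -d "$dir" ]]; then
    return 0
  fi
  [[ -z "$(ls -A "$dir")" ]]
}

# Return 0 only for true bootstrapping: no exact cache hit AND an empty cache dir.
# A partial restore leaves files behind, so seeding is skipped in that case.
should_seed() {
  local cache_hit="$1" cache_dir="$2"
  if [[ "$cache_hit" == "true" ]]; then
    return 1
  fi
  cache_dir_is_empty "$cache_dir"
}
```

When `should_seed` succeeds, the workflow would go on to fetch the last successful main-branch artifact with `gh run list` and `gh run download`, which is why the job needs the `actions: read` permission.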
On caching strategy: Both workflows now have a
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also, re greptile's comment on the summary above - it's a fair point but I think it's fine in practice. The
Replace event_name-based cache logic with an explicit use_cache boolean input.

Defaults:
- build-notebooks: workflow_call=true, dispatch=false, schedule=false
- build-docs: dispatch=true (toggleable), release=false

This gives full control over caching from the GitHub Actions UI.
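As an illustration of how a boolean input like this might be turned into the make invocation, here is a small sketch. The helper name `notebook_make_args` is hypothetical; only the `convert-execute-notebooks` target and the `USE_CACHE=1` variable come from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the make arguments, one per line, for a given use_cache value.
# Anything other than the literal string "true" means full re-execution.
notebook_make_args() {
  local use_cache="${1:-false}"
  printf '%s\n' convert-execute-notebooks
  if [[ "$use_cache" == "true" ]]; then
    printf '%s\n' USE_CACHE=1
  fi
}
```

A workflow step could then run `make $(notebook_make_args "$use_cache")`, so the cached path is taken only when the input is explicitly true.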
Summary
The build-notebooks CI workflow executes all tutorial notebooks on every run. When an upstream model (e.g. black-forest-labs/flux.2-pro) is temporarily down, the entire docs build fails even if no notebooks changed. This PR adds per-notebook caching so unchanged notebooks are served from cache, and only modified ones are re-executed.

Caching is scoped to workflow_call only (i.e. when triggered from build-docs). Scheduled Monday runs and manual workflow_dispatch runs execute all notebooks without caching to catch regressions from library changes.

Changes

Added
- docs/scripts/build_notebooks_cached.sh - per-notebook caching script that compares source file SHA-256 hashes against cached values, skipping execution for unchanged notebooks
- build-notebooks.yml - uses actions/cache with a fallback that seeds from the last successful main-branch artifact when the cache is completely empty
- USE_CACHE=1 option for make convert-execute-notebooks for local use

Changed
- build-notebooks.yml now conditionally uses caching (workflow_call) or full execution (schedule, workflow_dispatch)
- actions: read permission for gh run list / gh run download

Attention Areas
- build-notebooks.yml#L34-L69 - seed step logic: only runs when cache dir is truly empty (not on partial restore), restricts artifact lookup to main branch
- workflow_call (from build-docs), not on scheduled or manual runs; see this comment for the rationale and open question about release builds

Test plan
- make convert-execute-notebooks USE_CACHE=1 serves all 6 notebooks from cache when sources are unchanged
- build-notebooks workflow - completed in 26s (vs ~9min) with all notebooks cached
- build-docs workflow - completed successfully

Description updated with AI