Skip to content

fix(baker): per-file derive must update release-manifest.json (#207)#209

Merged
gerchowl merged 1 commit intodevfrom
bugfix/207-derive-manifest-update
Apr 28, 2026
Merged

fix(baker): per-file derive must update release-manifest.json (#207)#209
gerchowl merged 1 commit intodevfrom
bugfix/207-derive-manifest-update

Conversation

@gerchowl
Copy link
Copy Markdown
Contributor

Summary

Live anvil-dev dagger smoke uncovered: after a successful per-file derive (1k → 512), the derived tier appeared in the source catalog's `available_tiers` but NOT in `release-manifest.json`'s `sources..tiers` block. JS / shell / Rust clients (which fetch the static manifest) couldn't see derived tiers until a re-bake; Python masked it with the tree-fallback.

Same root cause family as #208 (bake-side manifest emission). This PR mirrors that fix on the derive side.

What changed

`hf_derive_per_file.py` now bundles the catalog update + manifest update into a single `create_commit` call with `parent_commit` for CAS retry (matrix derives won't trample each other). Reuses `_fetch_manifest_with_parent` and `_merge_manifest_for_source` from `hf_bake_per_file.py` for parity.

Verified live (anvil-dev, gerchowl/mat-vis-tst)

Before After
Bake polyhaven 1k `tiers: {1k}` ✓ `tiers: {1k}` ✓
Derive 1k → 512 catalog `["1k","512"]` ✓; manifest still `{1k}` ✗ both correct
JS/shell/Rust visibility of 512 invisible visible

Test plan

  • `just test-fast` — 327 baker + 211 client = 538 pass
  • `TestDeriveSmallerTier` updated to assert the bundled commit + parent_commit + manifest body
  • Live anvil-dev re-run after fix
  • CI: matrix tests + integration test

Live-test gap discovered on anvil-dev (dagger smoke against mat-vis-tst):
after a successful per-file derive (1k → 512), the derived tier was
present on disk + in the source catalog's available_tiers, but
release-manifest.json's sources.<src>.tiers block was NOT updated.

Effect: the Python client masks this via tree-fallback; JS / shell /
Rust clients fetching the static manifest never see derived tiers
until a new bake re-emits the manifest. Same shape as the bake-side
bug fixed in #208 (#207 part 1).

Fix: bundle the manifest update into the catalog commit so they land
atomically in one HF commit. CAS-retried with parent_commit so matrix
derives (multiple sources/tiers in parallel containers) can't trample
each other — same pattern as the bake fix.

Verified live on anvil-dev:
- Bake polyhaven 1k → manifest.sources.polyhaven.tiers = {1k}
- Derive 1k → 512 → manifest.sources.polyhaven.tiers = {1k, 512}
- Catalog available_tiers also correctly shows ["1k", "512"]
@github-actions github-actions Bot added area:testing Test infrastructure, BATS, pytest area:baker Baker pipeline, Dagger, data fetchers labels Apr 28, 2026
@gerchowl gerchowl merged commit 28936b8 into dev Apr 28, 2026
5 checks passed
@gerchowl gerchowl deleted the bugfix/207-derive-manifest-update branch April 28, 2026 10:02
gerchowl added a commit that referenced this pull request Apr 28, 2026
) (#211)

* test(baker): assert manifest emission + CAS-retry on bake (#210 part A)

#207 was a real contract bug — the per-file bake never wrote
release-manifest.json — that mocks didn't catch because they only
asserted the happy path (textures + catalog + sentinel). The
JS/shell/Rust clients hard-failed against a fresh bake; only live
anvil-dev runs surfaced it.

Tightening the unit-test contract so a future regression can't ship.

Adds two test classes (7 new tests):

TestManifestEmission (substrate-contract path set):
- commit history must include {release-manifest.json, <src>.json,
  <src>/<tier>/.tier_complete}
- manifest commit bundled atomically with catalog (single commit)
- manifest commit carries `parent_commit` for the CAS lock
- sentinel is strictly the LAST commit

TestCasRetryOnManifestCommit (the previously-untested retry loop):
- one 412 → fresh manifest re-fetch → retry → success
- continuous 412 → exactly 6 attempts before propagating
- non-412 (401 auth error) → propagates immediately, no retry

Net: 5 → 12 tests on the bake path.

* test(baker): assert manifest emission + CAS-retry on derive (#210 part A)

Mirror of the bake-side hardening. Locks down the derive path's
manifest emission contract and the previously-untested 412-retry
loop added in #209.

TestDeriveManifestEmission:
- commit history must include {release-manifest.json, <src>.json,
  <src>/<target_tier>/.tier_complete}
- manifest body's sources.<src>.tiers.<target_tier> = {complete: true}
  (catches a regression that drops the tier-update from the merge)

TestDeriveCasRetryOnManifestCommit:
- one 412 → retry → success (asserts 2 attempts + 2 manifest fetches)
- continuous 412 → exactly 6 attempts then propagate (locks the budget)
- 401 (non-412) → propagates immediately, no retry

Net: 13 → 18 tests on the derive path.

* test(e2e): multi-source merge + opt-in concurrency stress (#210 part B)

Two new E2E classes that catch contract slips mocks can't:

TestMultiSourceManifestMerge (always-on under MAT_VIS_E2E=1):
- Bakes polyhaven THEN ambientcg sequentially into the same release
  tag, asserts the manifest accumulates both sources. A regression
  that clobbers the second bake's manifest write would only show one
  source here. Lives nightly via e2e.yml from #205.

TestConcurrentBakesShareTag (opt-in MAT_VIS_E2E_CONCURRENCY=N):
- Spawns N=3 multiprocessing.Process workers (default) baking
  polyhaven/ambientcg/gpuopen at tier 1k into ONE release tag. The
  CAS retry loop is the only thing keeping their manifest writes
  from clobbering each other.
- Threading.Thread is NOT used — Python's GIL serializes live HTTP
  and never fires the retry path; verified during the #207 fix.
- Asserts every worker succeeds AND the final manifest carries
  every source. A "race lost" outcome would manifest as a missing
  source entry. Retry exhaustion would surface as a worker returning
  the 412 in its `error` field.

Both clean up their throwaway tags on teardown. The opt-in test
serves as the trigger for the Option-B "fragments + final merge"
escalation path: if it ever fails repeatedly in nightly CI, that's
the data-driven trigger to redesign.

Net: 7 → 9 E2E test classes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:baker Baker pipeline, Dagger, data fetchers area:testing Test infrastructure, BATS, pytest

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant