fix(baker): per-file derive must update release-manifest.json (#207)#209
Merged
fix(baker): per-file derive must update release-manifest.json (#207)#209
Conversation
Live-test gap discovered on anvil-dev (dagger smoke against mat-vis-tst): after a successful per-file derive (1k → 512), the derived tier was present on disk + in the source catalog's available_tiers, but release-manifest.json's sources.<src>.tiers block was NOT updated. Effect: the Python client masks this via tree-fallback; JS / shell / Rust clients fetching the static manifest never see derived tiers until a new bake re-emits the manifest. Same shape as the bake-side bug fixed in #208 (#207 part 1). Fix: bundle the manifest update into the catalog commit so they land atomically in one HF commit. CAS-retried with parent_commit so matrix derives (multiple sources/tiers in parallel containers) can't trample each other — same pattern as the bake fix. Verified live on anvil-dev: - Bake polyhaven 1k → manifest.sources.polyhaven.tiers = {1k} - Derive 1k → 512 → manifest.sources.polyhaven.tiers = {1k, 512} - Catalog available_tiers also correctly shows ["1k", "512"]
6 tasks
gerchowl
added a commit
that referenced
this pull request
Apr 28, 2026
) (#211) * test(baker): assert manifest emission + CAS-retry on bake (#210 part A) #207 was a real contract bug — the per-file bake never wrote release-manifest.json — that mocks didn't catch because they only asserted the happy path (textures + catalog + sentinel). The JS/shell/Rust clients hard-failed against a fresh bake; only live anvil-dev runs surfaced it. Tightening the unit-test contract so a future regression can't ship. Adds two test classes (7 new tests): TestManifestEmission (substrate-contract path set): - commit history must include {release-manifest.json, <src>.json, <src>/<tier>/.tier_complete} - manifest commit bundled atomically with catalog (single commit) - manifest commit carries `parent_commit` for the CAS lock - sentinel is strictly the LAST commit TestCasRetryOnManifestCommit (the previously-untested retry loop): - one 412 → fresh manifest re-fetch → retry → success - continuous 412 → exactly 6 attempts before propagating - non-412 (401 auth error) → propagates immediately, no retry Net: 5 → 12 tests on the bake path. * test(baker): assert manifest emission + CAS-retry on derive (#210 part A) Mirror of the bake-side hardening. Locks down the derive path's manifest emission contract and the previously-untested 412-retry loop added in #209. TestDeriveManifestEmission: - commit history must include {release-manifest.json, <src>.json, <src>/<target_tier>/.tier_complete} - manifest body's sources.<src>.tiers.<target_tier> = {complete: true} (catches a regression that drops the tier-update from the merge) TestDeriveCasRetryOnManifestCommit: - one 412 → retry → success (asserts 2 attempts + 2 manifest fetches) - continuous 412 → exactly 6 attempts then propagate (locks the budget) - 401 (non-412) → propagates immediately, no retry Net: 13 → 18 tests on the derive path. * test(e2e): multi-source merge + opt-in concurrency stress (#210 part B) Two new E2E classes that catch contract slips mocks can't: TestMultiSourceManifestMerge (always-on under MAT_VIS_E2E=1): - Bakes polyhaven THEN ambientcg sequentially into the same release tag, asserts the manifest accumulates both sources. A regression that clobbers the second bake's manifest write would only show one source here. Lives nightly via e2e.yml from #205. TestConcurrentBakesShareTag (opt-in MAT_VIS_E2E_CONCURRENCY=N): - Spawns N=3 multiprocessing.Process workers (default) baking polyhaven/ambientcg/gpuopen at tier 1k into ONE release tag. The CAS retry loop is the only thing keeping their manifest writes from clobbering each other. - Threading.Thread is NOT used — Python's GIL serializes live HTTP and never fires the retry path; verified during the #207 fix. - Asserts every worker succeeds AND the final manifest carries every source. A "race lost" outcome would manifest as a missing source entry. Retry exhaustion would surface as a worker returning the 412 in its `error` field. Both clean up their throwaway tags on teardown. The opt-in test serves as the trigger for the Option-B "fragments + final merge" escalation path: if it ever fails repeatedly in nightly CI, that's the data-driven trigger to redesign. Net: 7 → 9 E2E test classes.
This was referenced Apr 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Live anvil-dev dagger smoke uncovered: after a successful per-file derive (1k → 512), the derived tier appeared in the source catalog's `available_tiers` but NOT in `release-manifest.json`'s `sources..tiers` block. JS / shell / Rust clients (which fetch the static manifest) couldn't see derived tiers until a re-bake; Python masked it with the tree-fallback.
Same root cause family as #208 (bake-side manifest emission). This PR mirrors that fix on the derive side.
What changed
`hf_derive_per_file.py` now bundles the catalog update + manifest update into a single `create_commit` call with `parent_commit` for CAS retry (matrix derives won't trample each other). Reuses `_fetch_manifest_with_parent` and `_merge_manifest_for_source` from `hf_bake_per_file.py` for parity.
Verified live (anvil-dev, gerchowl/mat-vis-tst)
Test plan