Add preservation-mirror fields to DataReleaseManifest #317
Open
Extends the data-release manifest model to carry optional
preservation-grade mirror metadata:
- New PreservationMirror model with kind ('zenodo', 'archival_gcs',
etc.), url, and optional doi / sha256 / deposited_at fields.
- New preservation_mirrors list on each DataReleaseArtifact, for
per-artifact mirrors (Zenodo file deposits, GCS archival copies).
- New preservation_dois list on DataReleaseManifest for release-level
DOIs (Zenodo mints one per deposit covering all files).
All new fields have defaults, so existing manifest JSON continues to
validate unchanged. This is verified with a backwards-compatibility
test that loads a legacy manifest JSON blob.
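A minimal sketch of the shape described above, using stdlib dataclasses so it runs on its own. The modeling library, the `path` and `artifacts` fields, the `load_manifest` helper, and the string representation of `deposited_at` are assumptions made for illustration; only the new field and model names come from this PR.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PreservationMirror:
    kind: str  # e.g. "zenodo" or "archival_gcs"
    url: str
    doi: Optional[str] = None
    sha256: Optional[str] = None
    deposited_at: Optional[str] = None  # assumed here to be an ISO 8601 string


@dataclass
class DataReleaseArtifact:
    path: str  # hypothetical pre-existing field
    # New: per-artifact mirrors, empty by default.
    preservation_mirrors: List[PreservationMirror] = field(default_factory=list)


@dataclass
class DataReleaseManifest:
    artifacts: List[DataReleaseArtifact] = field(default_factory=list)  # hypothetical
    # New: release-level DOIs, empty by default.
    preservation_dois: List[str] = field(default_factory=list)


def load_manifest(blob: str) -> DataReleaseManifest:
    """Hypothetical loader: missing new keys fall back to their defaults."""
    raw = json.loads(blob)
    artifacts = [
        DataReleaseArtifact(
            path=a["path"],
            preservation_mirrors=[
                PreservationMirror(**m) for m in a.get("preservation_mirrors", [])
            ],
        )
        for a in raw.get("artifacts", [])
    ]
    return DataReleaseManifest(
        artifacts=artifacts,
        preservation_dois=raw.get("preservation_dois", []),
    )


# A legacy blob with none of the new fields still loads.
legacy = load_manifest('{"artifacts": [{"path": "example.h5"}]}')

# A newer blob carrying a mirror (URL is a placeholder).
mirrored = load_manifest(
    '{"artifacts": [{"path": "example.h5", "preservation_mirrors": '
    '[{"kind": "zenodo", "url": "https://example.org/example.h5"}]}]}'
)
```

Because every new field defaults to empty or `None`, the legacy blob yields empty lists rather than a validation error, which is the property the backwards-compatibility test checks.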
This is the data contract for the Zenodo-mirror workstream scoped in
PolicyEngine/policyengine-us-data#810: the us-data Modal build will
deposit each certified h5 to Zenodo and populate these fields when
emitting the DataReleaseManifest to HuggingFace. The TRACE TRO
emission helpers will then read preservation_mirrors / preservation_dois
to record durable fallback locations in every TRO they build.
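As a rough illustration of how a consumer might read these fields, the sketch below collects fallback locations from a manifest that has already been parsed into a plain dict. The `fallback_locations` helper, the `artifacts` and `path` keys, the `__release__` key, and the example URL and DOI are all hypothetical placeholders; this is not the TRACE TRO helper code.

```python
from typing import Any, Dict, List

DOI_RESOLVER = "https://doi.org/"


def fallback_locations(manifest: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each artifact path to its mirror URLs and DOI links.

    Release-level DOIs are listed under the hypothetical "__release__" key.
    """
    locations: Dict[str, List[str]] = {}
    for artifact in manifest.get("artifacts", []):
        urls: List[str] = []
        for mirror in artifact.get("preservation_mirrors", []):
            urls.append(mirror["url"])
            if mirror.get("doi"):
                urls.append(DOI_RESOLVER + mirror["doi"])
        locations[artifact["path"]] = urls
    release = [DOI_RESOLVER + doi for doi in manifest.get("preservation_dois", [])]
    if release:
        locations["__release__"] = release
    return locations


example = {
    "artifacts": [
        {
            "path": "example.h5",
            "preservation_mirrors": [
                {
                    "kind": "zenodo",
                    "url": "https://example.org/mirror/example.h5",  # placeholder
                    "doi": "10.5281/zenodo.0000000",  # placeholder DOI
                }
            ],
        },
        {"path": "legacy.h5"},  # no mirrors recorded
    ],
    "preservation_dois": ["10.5281/zenodo.0000000"],
}

locations = fallback_locations(example)
```

Artifacts without mirrors map to an empty list, so a caller can tell "no fallback recorded" apart from "artifact missing".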
Motivation (2026-04-21 meeting with Lars Vilhuber / AEA Data Editor):
HuggingFace doesn't publish a preservation commitment, so a TRO
citation URL that resolves only through HF can 404 decades from now.
Zenodo (CERN / OpenAIRE-operated, DOI-minting) is the reference
preservation-grade host Lars pointed at.
9 new tests; full non-integration suite green (444 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Why
This is the data contract for the Zenodo-mirror workstream in PolicyEngine/policyengine-us-data#810. The us-data Modal build will deposit each certified h5 to Zenodo and populate these fields when emitting the `DataReleaseManifest` to HuggingFace. The TRACE TRO emission helpers will then read these fields to record durable fallback locations in every TRO.
Motivated by the 2026-04-21 meeting with Lars Vilhuber (AEA Data Editor): HuggingFace doesn't publish a preservation commitment, so a TRO citation URL that resolves only through HF can 404 decades from now. Zenodo (CERN / OpenAIRE-operated, DOI-minting) is the reference preservation-grade host Lars pointed at.
Test plan
Related
🤖 Generated with Claude Code