Context
At the 2026-04-21 meeting with Lars Vilhuber (AEA Data Editor), Lars raised that Hugging Face does not publish a preservation commitment comparable to Zenodo, CLOCKSS, LOCKSS, or similar long-term-preservation services. Codex's review of our post-meeting plan also flagged this as a dropped workstream.
Our calibrated microdata artifacts (enhanced_cps_YYYY.h5, small_enhanced_cps_YYYY.h5, sparse_enhanced_cps_YYYY.h5, etc.) currently live at huggingface.co/PolicyEngine/policyengine-us-data and huggingface.co/PolicyEngine/policyengine-uk-data. Every TRACE TRO we emit binds these artifacts by SHA-256 and cites a Hugging Face URL.
If that URL breaks in 10 years (HF changes its hosting model, is acquired, goes bankrupt, or changes its free-tier policies), every TRO we have ever emitted becomes unverifiable. The recorded SHA-256 is still correct, but a reader cannot recover the bytes it was computed from.
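To make the failure mode concrete, here is a minimal sketch of what a verifier does with a TRO entry: try each recorded location in turn and accept the first one whose bytes hash to the recorded SHA-256. The names `fetch_verified` and `sha256_hex`, and the injected `fetch` callable, are hypothetical and not part of any PolicyEngine or TRACE API.

```python
import hashlib
from typing import Callable, Iterable, Optional


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(
    locations: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return bytes from the first location whose content matches the digest.

    `fetch` is injected (hypothetical) so the sketch runs without network
    access; in practice it might wrap urllib.request.urlopen.
    """
    for url in locations:
        try:
            data = fetch(url)
        except OSError:
            # Dead link or host gone: try the next recorded location.
            continue
        if sha256_hex(data) == expected_sha256:
            return data
    # The digest is still valid, but no location can supply matching bytes.
    return None
```

With a single recorded location, one dead URL is enough to make `fetch_verified` return `None`, even though the digest itself is intact.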
What to do
Mirror each certified release to a preservation-grade host with an explicit long-term preservation policy, and record the mirror URL in the DataReleaseManifest alongside the HF URL. Candidates:
- Zenodo: preferred. Operated by CERN with OpenAIRE / European Commission support. Each deposit gets a DOI. Explicit preservation commitment. Free tier sufficient for our file sizes.
- Internet Archive — secondary possibility, less DOI-centric.
- GCS bucket under a policyengine.org domain using the Archive storage class: cheapest, but we remain the single point of failure.
Zenodo is the best fit because a DOI gives us a citable identifier that is not tied to any single bucket / URL and that external data catalogs can resolve.
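A rough sketch of what a Zenodo deposit could look like, assuming Zenodo's documented deposit REST API (create a deposition, upload to its file bucket, set metadata, publish). The helper names, the metadata wording, and the single-file flow are assumptions for illustration; the endpoint behavior should be checked against the current Zenodo API documentation, ideally on the Zenodo sandbox, before it goes into the pipeline.

```python
import json
import os
import urllib.request

ZENODO_API = "https://zenodo.org/api/deposit/depositions"


def build_metadata(title: str, version: str, creators: list[str]) -> dict:
    """Build a Zenodo deposition metadata payload for a dataset release."""
    return {
        "metadata": {
            "title": title,
            "upload_type": "dataset",
            "version": version,
            "description": f"Certified data release {version}.",
            "creators": [{"name": name} for name in creators],
        }
    }


def _call(method: str, url: str, token: str, data=None, headers=None) -> dict:
    """Send one authenticated request and decode the JSON response."""
    all_headers = {"Authorization": f"Bearer {token}", **(headers or {})}
    req = urllib.request.Request(url, data=data, method=method, headers=all_headers)
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else {}


def deposit_release(path: str, metadata: dict, token: str) -> str:
    """Create a deposition, upload one file, publish, and return the DOI."""
    as_json = {"Content-Type": "application/json"}
    dep = _call("POST", ZENODO_API, token, data=b"{}", headers=as_json)
    # Upload the artifact to the deposition's file bucket.
    bucket_url = f"{dep['links']['bucket']}/{os.path.basename(path)}"
    upload_headers = {
        "Content-Length": str(os.path.getsize(path)),
        "Content-Type": "application/octet-stream",
    }
    with open(path, "rb") as fh:
        _call("PUT", bucket_url, token, data=fh, headers=upload_headers)
    dep_url = f"{ZENODO_API}/{dep['id']}"
    _call("PUT", dep_url, token, data=json.dumps(metadata).encode(), headers=as_json)
    published = _call("POST", f"{dep_url}/actions/publish", token)
    return published["doi"]
```

Publishing is irreversible on Zenodo, so a real pipeline step would likely gate `deposit_release` behind the same certification check that gates the release itself.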
Mechanical changes
- Add a Zenodo upload step to the us-data Modal build pipeline; run on every certified release.
- Extend DataReleaseManifest to carry an optional primary_mirror_url field and a preservation_dois list field.
- Update the TRO emission helpers in policyengine.py to record the DOI in the TRO's artifact location list so verifiers have a durable fallback.
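The manifest and TRO changes above might look roughly like the following. `ManifestSketch` is a hypothetical stand-in for DataReleaseManifest (its real fields are not shown in this issue), `hf_url` and `sha256` are assumed field names, and `artifact_locations` is an illustrative helper rather than an existing function in policyengine.py. Only `primary_mirror_url` and `preservation_dois` come from the proposal itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManifestSketch:
    """Hypothetical stand-in for DataReleaseManifest; real fields may differ."""

    hf_url: str  # assumed name for the existing Hugging Face URL
    sha256: str  # assumed name for the artifact digest
    primary_mirror_url: Optional[str] = None  # proposed optional field
    preservation_dois: List[str] = field(default_factory=list)  # proposed


def artifact_locations(manifest: ManifestSketch) -> List[str]:
    """List every place the artifact can be fetched, HF URL first."""
    locations = [manifest.hf_url]
    if manifest.primary_mirror_url:
        locations.append(manifest.primary_mirror_url)
    # DOIs resolve through doi.org, independent of any one host's URL layout.
    locations.extend(f"https://doi.org/{doi}" for doi in manifest.preservation_dois)
    return locations
```

Because both new fields default to empty, manifests written before the change would still load and would yield a location list containing only the HF URL.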
Scope
Not a blocker for webapp-run TRO emission (policyengine-api#3485); the emission itself works against HF URLs today. This is about making what we already emit durable enough to be citable in 20 years.
Related
policyengine.py/docs/trace-case-study.md