Mirror certified releases to a preservation-grade host (Zenodo) with DOI #810

@MaxGhenis

Context

At the 2026-04-21 meeting with Lars Vilhuber (AEA Data Editor), Lars raised that Hugging Face does not publish a preservation commitment comparable to Zenodo, CLOCKSS, LOCKSS, or similar long-term-preservation services. Codex's review of our post-meeting plan also flagged this as a dropped workstream.

Our calibrated microdata artifacts (enhanced_cps_YYYY.h5, small_enhanced_cps_YYYY.h5, sparse_enhanced_cps_YYYY.h5, etc.) currently live at huggingface.co/PolicyEngine/policyengine-us-data and huggingface.co/PolicyEngine/policyengine-uk-data. Every TRACE TRO we emit binds these artifacts by SHA-256 and cites a Hugging Face URL.

If that URL breaks in 10 years (HF changes its hosting model, gets acquired, goes bankrupt, or changes its free-tier policies), every TRO we have ever emitted becomes unverifiable. The recorded SHA-256 is still correct, but a reader can no longer retrieve the bytes it is supposed to match.
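To make the failure mode concrete, here is a minimal sketch of how a verifier might try several locations for one artifact and accept the first whose bytes match the recorded SHA-256. The function names and the idea of an ordered location list are illustrative assumptions, not existing policyengine.py helpers, and the sketch assumes each location is a direct download URL.

```python
import hashlib
import urllib.request
from typing import BinaryIO, Callable, Iterable, Optional


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large .h5 files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def first_matching_location(
    expected_sha256: str,
    locations: Iterable[str],
    open_url: Callable[[str], BinaryIO] = urllib.request.urlopen,
) -> Optional[str]:
    """Return the first location whose bytes hash to expected_sha256, else None.

    Hypothetical helper: `locations` is assumed to be ordered by preference,
    for example the HF URL first and a preservation mirror second.
    """
    for url in locations:
        try:
            with open_url(url) as stream:
                if sha256_of_stream(stream) == expected_sha256.lower():
                    return url
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses:
            # treat an unreachable location as "try the next one".
            continue
    return None
```

With only a Hugging Face URL in the list, a dead link means this returns `None` even though the hash on record is valid. A second location is what turns the hash back into something checkable.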

What to do

Mirror each certified release to a preservation-grade host with an explicit long-term preservation policy, and record the mirror URL in the DataReleaseManifest alongside the HF URL. Candidates:

  1. Zenodo (preferred). Operated by CERN and developed under the EU-funded OpenAIRE programme. Each deposit gets a DOI. Explicit preservation commitment. The free tier is sufficient for our file sizes.
  2. Internet Archive: a secondary possibility, less DOI-centric.
  3. GCS bucket under a policyengine.org domain, using the Archive storage class: cheapest, but we would be the single point of failure.

Zenodo is the best fit because a DOI gives us a citable identifier that is not tied to any single bucket / URL and that external data catalogs can resolve.
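As a rough illustration of what recording a mirror and a DOI next to the HF URL could look like, the sketch below models one manifest entry. `ArtifactRecord` and its `locations()` method are hypothetical stand-ins, not the real `DataReleaseManifest` schema. The Zenodo record URL and DOI are placeholders, and the filename keeps the `YYYY` pattern rather than naming a specific release.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ArtifactRecord:
    """Hypothetical manifest entry for one certified artifact."""

    filename: str
    sha256: str
    hf_url: str
    primary_mirror_url: Optional[str] = None  # optional: older releases have none
    preservation_dois: List[str] = field(default_factory=list)

    def locations(self) -> List[str]:
        """Ordered locations: HF first, then the mirror, then DOI resolver URLs."""
        urls = [self.hf_url]
        if self.primary_mirror_url:
            urls.append(self.primary_mirror_url)
        # https://doi.org/<doi> is the standard resolver form for any DOI.
        urls.extend(f"https://doi.org/{doi}" for doi in self.preservation_dois)
        return urls


record = ArtifactRecord(
    filename="enhanced_cps_YYYY.h5",
    sha256="0" * 64,  # placeholder digest
    hf_url="https://huggingface.co/PolicyEngine/policyengine-us-data",
    primary_mirror_url="https://zenodo.org/records/0000000",  # placeholder record
    preservation_dois=["10.5281/zenodo.0000000"],  # placeholder DOI
)
```

Note that a DOI resolves to the record's landing page rather than to the file bytes, so a verifier following the DOI would still need to pick the file out of the record and check it against the SHA-256.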

Mechanical changes

  • Add a Zenodo upload step to the us-data Modal build pipeline; run on every certified release.
  • Extend DataReleaseManifest with an optional primary_mirror_url field and a preservation_dois list field.
  • Update the TRO emission helpers in policyengine.py to record the DOI in the TRO's artifact location list so verifiers have a durable fallback.
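The upload step could follow Zenodo's documented deposit flow: create a deposition, stream the file into its bucket, attach metadata, then publish. The sketch below is one possible shape under stated assumptions, not the pipeline's actual code. `deposit_to_zenodo` is a hypothetical name, `session` is assumed to be a `requests.Session`-like object, and the caller supplies the metadata dict (Zenodo expects at least a title, upload type, description, and creators).

```python
import os
from typing import Any, Dict

# Production API root; https://sandbox.zenodo.org/api is the test instance.
ZENODO_API = "https://zenodo.org/api"


def deposit_to_zenodo(
    session: Any, token: str, path: str, metadata: Dict[str, Any]
) -> str:
    """Create, fill, and publish one Zenodo deposition; return its DOI.

    Hypothetical helper. `session` is assumed to expose requests-style
    post/put methods returning objects with raise_for_status() and json().
    """
    auth = {"Authorization": f"Bearer {token}"}

    # 1. Create an empty deposition and read its file-bucket link.
    resp = session.post(f"{ZENODO_API}/deposit/depositions", json={}, headers=auth)
    resp.raise_for_status()
    deposition = resp.json()
    bucket_url = deposition["links"]["bucket"]
    deposition_url = f"{ZENODO_API}/deposit/depositions/{deposition['id']}"

    # 2. Stream the artifact into the bucket under its own filename.
    with open(path, "rb") as fh:
        upload = session.put(
            f"{bucket_url}/{os.path.basename(path)}", data=fh, headers=auth
        )
        upload.raise_for_status()

    # 3. Attach the descriptive metadata.
    session.put(
        deposition_url, json={"metadata": metadata}, headers=auth
    ).raise_for_status()

    # 4. Publish. Files in a published record cannot be changed afterwards.
    resp = session.post(f"{deposition_url}/actions/publish", headers=auth)
    resp.raise_for_status()
    return resp.json()["doi"]
```

Because publishing is irreversible, it would be prudent to exercise this against the sandbox instance first. The returned DOI is the value that would go into `preservation_dois` and into the TRO's artifact location list.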

Scope

Not a blocker for webapp-run TRO emission (policyengine-api#3485); the emission itself works against HF URLs today. This is about making what we already emit durable enough to be citable in 20 years.

Related
