Make policyengine.py the immutable release boundary for country model and data versions #270

@MaxGhenis

Problem

policyengine.py already has the right concepts for versioned orchestration, but it is not yet the authoritative immutable boundary for country model and data releases.

Today, the package still relies on mutable dataset locations and country-package-local defaults:

  • TaxBenefitModelVersion and DatasetVersion exist, but they do not currently pin or resolve concrete model/data artifact compatibility in a reusable way.
  • US dataset helpers default to floating HF paths like hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5 in src/policyengine/tax_benefit_models/us/datasets.py.
  • UK dataset helpers do the same for policyengine-uk-data in src/policyengine/tax_benefit_models/uk/datasets.py.
  • US region datasets are hard-coded to mutable GCS paths in src/policyengine/countries/us/regions.py.
  • UK region datasets and weight artifacts are hard-coded to mutable/private GCS paths in src/policyengine/countries/uk/regions.py.
  • The US model currently looks up policyengine-us release metadata from PyPI at runtime in src/policyengine/tax_benefit_models/us/model.py. That lookup is network-dependent and does nothing to pin which model or data artifacts a simulation uses.

This means that pinning policyengine.py==X is not currently enough to guarantee a fully reproducible default simulation environment.

Desired contract

Pinning one top-level version should be sufficient.

If a user installs policyengine.py==X, that release should deterministically define, for each supported country:

  • the exact country model package version to use (policyengine-us, policyengine-uk, etc.)
  • the exact country data package version to use (policyengine-us-data, policyengine-uk-data, etc.)
  • the exact immutable dataset artifacts to fetch by default
  • checksums for those artifacts
  • enough provenance to rebuild the dataset deterministically from source inputs when needed

In other words:

policyengine.py version -> country manifest -> model package version + data package version -> exact dataset artifacts

The same contract should hold for US, UK, and future countries.

What should change

1. Add a packaged release manifest layer in policyengine.py

Introduce a machine-readable manifest format, versioned with policyengine.py, that maps each supported country to:

  • model_package.name
  • model_package.version
  • data_package.name
  • data_package.version
  • default datasets by logical name
  • artifact locators for each dataset
  • artifact checksums
  • optional build provenance metadata

Example shape:

{
  "country": "us",
  "policyengine_py_version": "X.Y.Z",
  "model_package": {
    "name": "policyengine-us",
    "version": "A.B.C"
  },
  "data_package": {
    "name": "policyengine-us-data",
    "version": "D.E.F"
  },
  "datasets": {
    "enhanced_cps_2024": {
      "repo": "policyengine/policyengine-us-data",
      "path": "enhanced_cps_2024.h5",
      "revision": "D.E.F",
      "sha256": "..."
    }
  }
}

This manifest should be the canonical default lookup mechanism for policyengine.py.
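
As a rough illustration of how the example shape above could be loaded, here is a minimal sketch that parses the JSON into read-only records. The class names (PackageRef, DatasetArtifact, CountryManifest) and the parse_manifest helper are hypothetical and not part of policyengine.py today; the placeholder versions and the elided checksum are copied from the example as-is.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageRef:
    name: str
    version: str


@dataclass(frozen=True)
class DatasetArtifact:
    repo: str
    path: str
    revision: str
    sha256: str


@dataclass(frozen=True)
class CountryManifest:
    country: str
    policyengine_py_version: str
    model_package: PackageRef
    data_package: PackageRef
    datasets: dict  # logical dataset name -> DatasetArtifact


def parse_manifest(text: str) -> CountryManifest:
    """Parse manifest JSON into typed records (hypothetical helper)."""
    raw = json.loads(text)
    return CountryManifest(
        country=raw["country"],
        policyengine_py_version=raw["policyengine_py_version"],
        model_package=PackageRef(**raw["model_package"]),
        data_package=PackageRef(**raw["data_package"]),
        datasets={
            name: DatasetArtifact(**fields)
            for name, fields in raw["datasets"].items()
        },
    )


# Same placeholder values as the example shape in this issue.
EXAMPLE = """
{
  "country": "us",
  "policyengine_py_version": "X.Y.Z",
  "model_package": {"name": "policyengine-us", "version": "A.B.C"},
  "data_package": {"name": "policyengine-us-data", "version": "D.E.F"},
  "datasets": {
    "enhanced_cps_2024": {
      "repo": "policyengine/policyengine-us-data",
      "path": "enhanced_cps_2024.h5",
      "revision": "D.E.F",
      "sha256": "..."
    }
  }
}
"""

manifest = parse_manifest(EXAMPLE)
```

Frozen dataclasses are used here only to signal that a packaged manifest is not meant to be mutated at runtime; a schema library would work equally well.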

2. Stop treating free-form URLs as the source of truth

Inside policyengine.py, country code should resolve datasets from logical refs plus manifest metadata, not from handwritten floating URLs.

Examples:

  • Prefer dataset="enhanced_cps_2024" + manifest resolution over embedding hf://.../enhanced_cps_2024.h5
  • Prefer country-specific dataset registries resolved from a pinned data_package.version
  • Prefer manifest-based resolution for region datasets and weight matrices too, not just national microdata
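
A minimal sketch of what resolving a logical ref could look like, assuming the manifest has already been loaded into a plain dict. The function name resolve_dataset_uri is hypothetical, and the `@revision` suffix on the hf:// locator is an assumed convention for pinning a revision, not something this issue specifies.

```python
def resolve_dataset_uri(manifest: dict, logical_name: str) -> str:
    """Map a logical dataset name to a revision-pinned locator.

    Hypothetical helper: the hf://repo/path@revision format is an
    assumption made for illustration.
    """
    datasets = manifest["datasets"]
    if logical_name not in datasets:
        known = ", ".join(sorted(datasets)) or "(none)"
        raise KeyError(
            f"Unknown dataset {logical_name!r} for country "
            f"{manifest['country']!r}; known datasets: {known}"
        )
    artifact = datasets[logical_name]
    return f"hf://{artifact['repo']}/{artifact['path']}@{artifact['revision']}"


# Placeholder values taken from the example manifest in this issue.
MANIFEST = {
    "country": "us",
    "datasets": {
        "enhanced_cps_2024": {
            "repo": "policyengine/policyengine-us-data",
            "path": "enhanced_cps_2024.h5",
            "revision": "D.E.F",
            "sha256": "...",
        }
    },
}

uri = resolve_dataset_uri(MANIFEST, "enhanced_cps_2024")
```

The point of the design is that callers only ever pass the logical name; the repo, path, and revision come from the manifest that shipped with the installed release.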

3. Make each -data version discoverable on Hugging Face

Each country -data release should publish a release manifest or index that lives at the corresponding HF revision/tag.

For example, given policyengine-us-data==1.25.3, we should be able to resolve:

  • HF repo: policyengine/policyengine-us-data
  • revision/tag: 1.25.3
  • machine-readable manifest at that revision listing the datasets available there

This is close to current behavior in the US and UK data repos, which already upload root-level filenames and tag the HF commit with the package version. The missing contract is:

  • make the tag the official lookup boundary
  • publish a manifest/index at that revision
  • include checksums and provenance
  • avoid depending on users knowing filenames by convention
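
To make the lookup boundary concrete, the sketch below builds the URL at which such a manifest could be fetched for a given -data version. It relies on the standard Hugging Face `resolve/<revision>/<file>` URL layout for model-type repos (dataset-type repos add a `datasets/` prefix, which this sketch does not handle). The filename release_manifest.json and the function name are hypothetical; this issue does not fix either.

```python
from urllib.parse import quote

HF_ENDPOINT = "https://huggingface.co"
MANIFEST_FILENAME = "release_manifest.json"  # hypothetical filename


def data_release_manifest_url(
    repo: str,
    data_version: str,
    filename: str = MANIFEST_FILENAME,
) -> str:
    """Build the URL of a per-release manifest at a tagged HF revision.

    Assumes the HF tag equals the -data package version, as described above.
    """
    revision = quote(data_version, safe="")  # escape any unusual tag characters
    return f"{HF_ENDPOINT}/{repo}/resolve/{revision}/{filename}"


url = data_release_manifest_url("policyengine/policyengine-us-data", "1.25.3")
```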

4. Support deterministic rebuilds from a pinned -data version

The -data package version should not only identify a downloadable artifact. It should also identify the build recipe.

For each published dataset artifact, the release metadata should include enough information to rebuild it from scratch, including:

  • source repo commit
  • pipeline name / entrypoint
  • upstream raw input identifiers and checksums
  • calibration inputs and checksums
  • any random seeds or deterministic parameters
  • expected output checksum

By default, policyengine.py should download the prebuilt artifact for speed. But the pinned -data version should also support a deterministic rebuild path whose output can be verified against the published checksum.
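
Both paths (download and rebuild) end in the same check: hash the resulting file and compare it with the published checksum. A minimal sketch using only the standard library follows; the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large .h5 artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_sha256: str) -> None:
    """Raise if the file on disk does not match the manifest checksum."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(
            f"Checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

A rebuild would call verify_artifact on its output with the same expected value, which is what makes the rebuild claim testable.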

5. Record the resolved model/data bundle in simulation outputs

Every simulation/report/output that relies on this orchestration should expose the resolved bundle, including:

  • policyengine.py version
  • country model package name/version
  • country data package name/version
  • resolved dataset artifact locator
  • artifact checksum
  • optional build provenance identifier

That metadata should be easy to serialize into result objects, exports, or manifests for academic replication.
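
One possible shape for that metadata is a small frozen record that serializes to JSON. This is a sketch only: the ResolvedBundle name and its field names are hypothetical, chosen to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedBundle:
    """Hypothetical record of what a simulation actually resolved."""

    policyengine_py_version: str
    model_package_name: str
    model_package_version: str
    data_package_name: str
    data_package_version: str
    dataset_locator: str
    dataset_sha256: str
    build_provenance_id: Optional[str] = None  # optional, per the list above

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values, matching the example manifest in this issue.
bundle = ResolvedBundle(
    policyengine_py_version="X.Y.Z",
    model_package_name="policyengine-us",
    model_package_version="A.B.C",
    data_package_name="policyengine-us-data",
    data_package_version="D.E.F",
    dataset_locator="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5",
    dataset_sha256="...",
)
serialized = bundle.to_json()
```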

6. Generalize this across countries

This should not be implemented as a US-only special case.

The orchestration mechanism should work for:

  • US national datasets
  • US region datasets (states, districts, place-derived workflows)
  • UK national datasets
  • UK regional artifacts and weight matrices
  • future country packages that follow the same contract

Repo scope

This issue belongs in policyengine.py, but it requires coordinated changes across repos.

Expected participating repos:

  • policyengine.py
  • policyengine-us
  • policyengine-uk
  • policyengine-us-data
  • policyengine-uk-data

Likely responsibilities:

policyengine.py

  • define the manifest schema and resolution API
  • package country manifests with each release
  • resolve defaults from manifest rather than floating URLs
  • expose resolved bundle metadata on simulations/results
  • add replay tests for pinned historical manifests

country model repos (policyengine-us, policyengine-uk)

  • stop assuming mutable default dataset URLs are the default replication path
  • support explicit data_version / manifest-driven dataset resolution where needed
  • expose version metadata without requiring live PyPI lookups
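
For the last point, the installed version can be read from local package metadata with the standard library, with no network call. The sketch below shows one way to do that and to compare it with a manifest pin; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the locally installed version, or None if not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def check_pin(distribution: str, pinned: str) -> None:
    """Raise if the installed version differs from the manifest pin."""
    actual = installed_version(distribution)
    if actual != pinned:
        raise RuntimeError(
            f"{distribution}: manifest pins {pinned}, "
            f"but installed version is {actual}"
        )
```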

country data repos (policyengine-us-data, policyengine-uk-data)

  • publish a per-release manifest/index at each release tag
  • make the mapping from data_version to the datasets available at that release machine-readable
  • include checksums and build provenance
  • preserve or publish immutable artifact paths for region/national assets

Proposed rollout

Phase 1: contract and schema

  • define manifest schema in policyengine.py
  • implement resolver for country + logical dataset name + policyengine.py release
  • package initial manifests for US and UK

Phase 2: runtime migration

  • remove floating dataset defaults from policyengine.py dataset helpers
  • migrate region registries to manifest-backed artifact resolution
  • stop using runtime PyPI metadata lookups for orchestration decisions

Phase 3: provenance and replay

  • add per-release manifest publishing in -data repos
  • add checksum verification
  • add deterministic rebuild metadata
  • add golden historical replay tests in CI for at least one US and one UK release
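
Alongside replay tests, CI could also run a cheap lint over each packaged manifest to catch floating references before release. The sketch below is one such check and is not the replay test itself; the function name and the set of revision names treated as floating are assumptions.

```python
import re

# Revision names treated as mutable for this sketch (an assumption).
FLOATING_REVISIONS = {"", "main", "master", "latest", "head"}
SHA256_PATTERN = re.compile(r"[0-9a-f]{64}")


def find_unpinned_artifacts(manifest: dict) -> list:
    """Return human-readable problems; an empty list means fully pinned."""
    problems = []
    for name, artifact in manifest.get("datasets", {}).items():
        revision = str(artifact.get("revision", "")).strip().lower()
        if revision in FLOATING_REVISIONS:
            problems.append(f"{name}: revision is missing or floating")
        if not SHA256_PATTERN.fullmatch(str(artifact.get("sha256", ""))):
            problems.append(f"{name}: sha256 is missing or malformed")
    return problems
```

A CI job could assert that this returns an empty list for every manifest shipped in the wheel.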

Acceptance criteria

  • Installing policyengine.py==X is sufficient to determine the default country model and data versions for US and UK without relying on mutable latest dataset paths.
  • policyengine.py contains a machine-readable manifest for each supported country release bundle.
  • Each pinned -data version can be resolved to a machine-readable dataset index at the corresponding HF revision.
  • Default dataset resolution in policyengine.py goes through manifest-backed logical dataset names, not handwritten floating HF/GCS URLs.
  • Region datasets and related weight artifacts are covered by the same versioning contract.
  • Simulation outputs can report the exact resolved model/data artifact bundle.
  • At least one historical US release and one historical UK release can be replayed in CI from pinned manifests.

Non-goals

  • bundling large dataset artifacts directly into the policyengine.py wheel
  • forcing an immediate monorepo migration
  • solving only the US case and retrofitting UK later

Why this matters

This would make policyengine.py the actual immutable boundary for replication.

That is a much stronger and simpler contract for research, debugging, support, and reproducibility than the current situation where users effectively rely on a mix of package versions, floating dataset URLs, and country-specific conventions.

Labels: code-health, enhancement