Problem
policyengine.py already has the right concepts for versioned orchestration, but it is not yet the authoritative immutable boundary for country model and data releases.
Today, the package still relies on mutable dataset locations and country-package-local defaults:
- TaxBenefitModelVersion and DatasetVersion exist, but they do not currently pin or resolve concrete model/data artifact compatibility in a reusable way.
- US dataset helpers default to floating HF paths like hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5 in src/policyengine/tax_benefit_models/us/datasets.py.
- UK dataset helpers do the same for policyengine-uk-data in src/policyengine/tax_benefit_models/uk/datasets.py.
- US region datasets are hard-coded to mutable GCS paths in src/policyengine/countries/us/regions.py.
- UK region datasets and weight artifacts are hard-coded to mutable/private GCS paths in src/policyengine/countries/uk/regions.py.
- The US model currently looks up policyengine-us release metadata from PyPI at runtime in src/policyengine/tax_benefit_models/us/model.py, which is both network-dependent and orthogonal to reproducible artifact resolution.
This means that pinning policyengine.py==X is not currently enough to guarantee a fully reproducible default simulation environment.
Desired contract
Pinning one top-level version should be sufficient.
If a user installs policyengine.py==X, that release should deterministically define, for each supported country:
- the exact country model package version to use (policyengine-us, policyengine-uk, etc.)
- the exact country data package version to use (policyengine-us-data, policyengine-uk-data, etc.)
- the exact immutable dataset artifacts to fetch by default
- checksums for those artifacts
- enough provenance to rebuild the dataset deterministically from source inputs when needed
In other words:
policyengine.py version -> country manifest -> model package version + data package version -> exact dataset artifacts
The same contract should hold for US, UK, and future countries.
What should change
1. Add a packaged release manifest layer in policyengine.py
Introduce a machine-readable manifest format, versioned with policyengine.py, that maps each supported country to:
- model_package.name
- model_package.version
- data_package.name
- data_package.version
- default datasets by logical name
- artifact locators for each dataset
- artifact checksums
- optional build provenance metadata
Example shape:
    {
      "country": "us",
      "policyengine_py_version": "X.Y.Z",
      "model_package": {
        "name": "policyengine-us",
        "version": "A.B.C"
      },
      "data_package": {
        "name": "policyengine-us-data",
        "version": "D.E.F"
      },
      "datasets": {
        "enhanced_cps_2024": {
          "repo": "policyengine/policyengine-us-data",
          "path": "enhanced_cps_2024.h5",
          "revision": "D.E.F",
          "sha256": "..."
        }
      }
    }
This manifest should be the canonical default lookup mechanism for .py.
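As a rough illustration of what consuming such a manifest could look like, here is a minimal sketch that parses the example shape into typed, immutable objects. The names PackagePin, DatasetArtifact, CountryManifest, and load_manifest are hypothetical and do not exist in policyengine.py today; the version strings and checksum are the placeholders from the example shape.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PackagePin:
    """A package name pinned to one exact version (hypothetical type)."""
    name: str
    version: str


@dataclass(frozen=True)
class DatasetArtifact:
    """One immutable dataset artifact entry (hypothetical type)."""
    repo: str
    path: str
    revision: str
    sha256: str


@dataclass(frozen=True)
class CountryManifest:
    """The per-country release bundle described by the manifest."""
    country: str
    policyengine_py_version: str
    model_package: PackagePin
    data_package: PackagePin
    datasets: dict[str, DatasetArtifact]


def load_manifest(text: str) -> CountryManifest:
    """Parse manifest JSON; missing or unexpected keys raise immediately."""
    raw = json.loads(text)
    return CountryManifest(
        country=raw["country"],
        policyengine_py_version=raw["policyengine_py_version"],
        model_package=PackagePin(**raw["model_package"]),
        data_package=PackagePin(**raw["data_package"]),
        datasets={
            name: DatasetArtifact(**entry)
            for name, entry in raw["datasets"].items()
        },
    )


# Placeholder values copied from the example shape in this issue.
EXAMPLE = """
{
  "country": "us",
  "policyengine_py_version": "X.Y.Z",
  "model_package": {"name": "policyengine-us", "version": "A.B.C"},
  "data_package": {"name": "policyengine-us-data", "version": "D.E.F"},
  "datasets": {
    "enhanced_cps_2024": {
      "repo": "policyengine/policyengine-us-data",
      "path": "enhanced_cps_2024.h5",
      "revision": "D.E.F",
      "sha256": "..."
    }
  }
}
"""

manifest = load_manifest(EXAMPLE)
```

Parsing strictly (unknown or missing keys fail at load time) is one possible design choice; it would surface schema drift at import rather than in the middle of a simulation.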
2. Stop treating free-form URLs as the source of truth
Inside policyengine.py, country code should resolve datasets from logical refs plus manifest metadata, not from handwritten floating URLs.
Examples:
- Prefer dataset="enhanced_cps_2024" + manifest resolution over embedding hf://.../enhanced_cps_2024.h5
- Prefer country-specific dataset registries resolved from a pinned data_package.version
- Prefer manifest-based resolution for region datasets and weight matrices too, not just national microdata
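The resolution step could be sketched roughly as follows. This assumes a manifest dict shaped like the example in section 1; resolve_dataset is a hypothetical helper, and the @revision suffix on the hf:// locator is an assumed convention for pinning a revision, not a confirmed policyengine.py format.

```python
# Manifest fragment with the placeholder values used in this issue.
MANIFEST = {
    "datasets": {
        "enhanced_cps_2024": {
            "repo": "policyengine/policyengine-us-data",
            "path": "enhanced_cps_2024.h5",
            "revision": "D.E.F",
            "sha256": "...",
        }
    }
}


def resolve_dataset(manifest: dict, logical_name: str) -> str:
    """Map a logical dataset name to a revision-pinned locator.

    Hypothetical helper: the "@revision" suffix is an assumption.
    """
    try:
        entry = manifest["datasets"][logical_name]
    except KeyError:
        known = ", ".join(sorted(manifest["datasets"]))
        raise KeyError(
            f"Unknown dataset {logical_name!r}; known datasets: {known}"
        ) from None
    return f"hf://{entry['repo']}/{entry['path']}@{entry['revision']}"


locator = resolve_dataset(MANIFEST, "enhanced_cps_2024")
```

Failing loudly on unknown logical names, rather than falling back to a floating URL, keeps the manifest as the only source of truth.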
3. Make each -data version discoverable on Hugging Face
Each country -data release should publish a release manifest or index that lives at the corresponding HF revision/tag.
For example, given policyengine-us-data==1.25.3, we should be able to resolve:
- HF repo: policyengine/policyengine-us-data
- revision/tag: 1.25.3
- machine-readable manifest at that revision listing the datasets available there
This is close to current behavior in the US and UK data repos, which already upload root-level filenames and tag the HF commit with the package version. The missing contract is:
- make the tag the official lookup boundary
- publish a manifest/index at that revision
- include checksums and provenance
- avoid depending on users knowing filenames by convention
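One way the tag-as-lookup-boundary idea might look in code: given a data package version, derive the URL of the per-release index without knowing any dataset filename in advance. This is a sketch under assumptions: the filename release_manifest.json is hypothetical (no such file is confirmed to exist), and the URL layout assumes the standard Hugging Face "resolve" pattern for a model-type repo.

```python
def release_manifest_url(
    repo: str,
    data_version: str,
    filename: str = "release_manifest.json",  # hypothetical filename
) -> str:
    """Build the URL of the per-release index at a given HF tag.

    Assumes the repo is tagged with the package version and that the
    index lives at the repo root under `filename`.
    """
    return f"https://huggingface.co/{repo}/resolve/{data_version}/{filename}"


url = release_manifest_url("policyengine/policyengine-us-data", "1.25.3")
```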
4. Support deterministic rebuilds from a pinned -data version
The -data package version should not only identify a downloadable artifact. It should also identify the build recipe.
For each published dataset artifact, the release metadata should include enough information to rebuild it from scratch, including:
- source repo commit
- pipeline name / entrypoint
- upstream raw input identifiers and checksums
- calibration inputs and checksums
- any random seeds or deterministic parameters
- expected output checksum
By default, .py should download the prebuilt artifact for speed. But the pinned -data version should also support a deterministic rebuild path that can be verified against the published checksum.
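Whether an artifact was downloaded or rebuilt, verification against the published checksum can be the same step. A minimal sketch using only the standard library; sha256_of and verify_artifact are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large .h5 artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the published checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

A rebuild path would then count as verified if verify_artifact(rebuilt_path, manifest_sha256) passes, where manifest_sha256 is the checksum recorded in the release metadata.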
5. Record the resolved model/data bundle in simulation outputs
Every simulation/report/output that relies on this orchestration should expose the resolved bundle, including:
- policyengine.py version
- country model package name/version
- country data package name/version
- resolved dataset artifact locator
- artifact checksum
- optional build provenance identifier
That metadata should be easy to serialize into result objects, exports, or manifests for academic replication.
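A sketch of what such a serializable record might look like. ResolvedBundle and its field names are hypothetical; they simply mirror the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedBundle:
    """Everything needed to name the exact model/data bundle used (hypothetical)."""
    policyengine_py_version: str
    model_package_name: str
    model_package_version: str
    data_package_name: str
    data_package_version: str
    dataset_locator: str
    dataset_sha256: str
    build_provenance_id: Optional[str] = None

    def to_json(self) -> str:
        # Sorted keys give a stable serialization that is easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values taken from the example manifest shape in this issue.
bundle = ResolvedBundle(
    policyengine_py_version="X.Y.Z",
    model_package_name="policyengine-us",
    model_package_version="A.B.C",
    data_package_name="policyengine-us-data",
    data_package_version="D.E.F",
    dataset_locator="policyengine/policyengine-us-data/enhanced_cps_2024.h5",
    dataset_sha256="...",
)
serialized = bundle.to_json()
```

Attaching an object like this to each result would let a paper or report cite the bundle directly instead of reconstructing it from environment listings.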
6. Generalize this across countries
This should not be implemented as a US-only special case.
The orchestration mechanism should work for:
- US national datasets
- US region datasets (states, districts, place-derived workflows)
- UK national datasets
- UK regional artifacts and weight matrices
- future country packages that follow the same contract
Repo scope
This issue belongs in policyengine.py, but it requires coordinated changes across repos.
Expected participating repos:
- policyengine.py
- policyengine-us
- policyengine-uk
- policyengine-us-data
- policyengine-uk-data
Likely responsibilities:
policyengine.py
- define the manifest schema and resolution API
- package country manifests with each release
- resolve defaults from manifest rather than floating URLs
- expose resolved bundle metadata on simulations/results
- add replay tests for pinned historical manifests
country model repos (policyengine-us, policyengine-uk)
- stop assuming mutable default dataset URLs are the default replication path
- support explicit data_version / manifest-driven dataset resolution where needed
- expose version metadata without requiring live PyPI lookups
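For the last point, installed package metadata is already available offline through the standard library, so a live PyPI request is not needed to learn which version is installed. A minimal sketch; installed_version and matches_pin are hypothetical helper names, not existing APIs in the country packages.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the locally installed version, or None if not installed.

    Reads local distribution metadata only; no network access.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package: str, pinned: str) -> bool:
    """True only if the installed version equals the manifest pin exactly."""
    return installed_version(package) == pinned
```

A resolver could call matches_pin("policyengine-us", manifest_version) at startup and warn or fail when the environment has drifted from the manifest.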
country data repos (policyengine-us-data, policyengine-uk-data)
- publish a per-release manifest/index at each release tag
- make the mapping data_version -> datasets available at that release machine-readable
- include checksums and build provenance
- preserve or publish immutable artifact paths for region/national assets
Proposed rollout
Phase 1: contract and schema
- define manifest schema in policyengine.py
- implement resolver for country + logical dataset name + policyengine.py release
- package initial manifests for US and UK
Phase 2: runtime migration
- remove floating dataset defaults from .py dataset helpers
- migrate region registries to manifest-backed artifact resolution
- stop using runtime PyPI metadata lookups for orchestration decisions
Phase 3: provenance and replay
- add per-release manifest publishing in -data repos
- add checksum verification
- add deterministic rebuild metadata
- add golden historical replay tests in CI for at least one US and one UK release
Acceptance criteria
- Installing policyengine.py==X is sufficient to determine the default country model and data versions for US and UK without relying on mutable latest dataset paths.
- policyengine.py contains a machine-readable manifest for each supported country release bundle.
- Each pinned -data version can be resolved to a machine-readable dataset index at the corresponding HF revision.
- Default dataset resolution in .py goes through manifest-backed logical dataset names, not handwritten floating HF/GCS URLs.
- Region datasets and related weight artifacts are covered by the same versioning contract.
- Simulation outputs can report the exact resolved model/data artifact bundle.
- At least one historical US release and one historical UK release can be replayed in CI from pinned manifests.
Non-goals
- bundling large dataset artifacts directly into the policyengine.py wheel
- forcing an immediate monorepo migration
- solving only the US case and retrofitting UK later
Why this matters
This would make policyengine.py the actual immutable boundary for replication.
That is a much stronger and simpler contract for research, debugging, support, and reproducibility than the current situation where users effectively rely on a mix of package versions, floating dataset URLs, and country-specific conventions.