Make `policyengine.py` the immutable release boundary for country model and data versions

## Problem

`policyengine.py` already has the right concepts for versioned orchestration, but it is not yet the authoritative immutable boundary for country model and data releases.

Today, the package still relies on mutable dataset locations and country-package-local defaults:

- `TaxBenefitModelVersion` and `DatasetVersion` exist, but they do not currently pin or resolve concrete model/data artifact compatibility in a reusable way.
- US dataset helpers default to floating HF paths like `hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5` in `src/policyengine/tax_benefit_models/us/datasets.py`.
- UK dataset helpers do the same for `policyengine-uk-data` in `src/policyengine/tax_benefit_models/uk/datasets.py`.
- US region datasets are hard-coded to mutable GCS paths in `src/policyengine/countries/us/regions.py`.
- UK region datasets and weight artifacts are hard-coded to mutable/private GCS paths in `src/policyengine/countries/uk/regions.py`.
- The US model currently looks up `policyengine-us` release metadata from PyPI at runtime in `src/policyengine/tax_benefit_models/us/model.py`, which is both network-dependent and orthogonal to reproducible artifact resolution.

This means that pinning `policyengine.py==X` is **not** currently enough to guarantee a fully reproducible default simulation environment.

## Desired contract

Pinning one top-level version should be sufficient.

If a user installs `policyengine.py==X`, that release should deterministically define, for each supported country:

- the exact country model package version to use (`policyengine-us`, `policyengine-uk`, etc.)
- the exact country data package version to use (`policyengine-us-data`, `policyengine-uk-data`, etc.)
- the exact immutable dataset artifacts to fetch by default
- checksums for those artifacts
- enough provenance to rebuild the dataset deterministically from source inputs when needed

In other words:

`policyengine.py version -> country manifest -> model package version + data package version -> exact dataset artifacts`

The same contract should hold for US, UK, and future countries.

## What should change

### 1. Add a packaged release manifest layer in `policyengine.py`

Introduce a machine-readable manifest format, versioned with `policyengine.py`, that maps each supported country to:

- `model_package.name`
- `model_package.version`
- `data_package.name`
- `data_package.version`
- default datasets by logical name
- artifact locators for each dataset
- artifact checksums
- optional build provenance metadata

Example shape:

```json
{
  "country": "us",
  "policyengine_py_version": "X.Y.Z",
  "model_package": {
    "name": "policyengine-us",
    "version": "A.B.C"
  },
  "data_package": {
    "name": "policyengine-us-data",
    "version": "D.E.F"
  },
  "datasets": {
    "enhanced_cps_2024": {
      "repo": "policyengine/policyengine-us-data",
      "path": "enhanced_cps_2024.h5",
      "revision": "D.E.F",
      "sha256": "..."
    }
  }
}
```

This manifest should be the canonical default lookup mechanism for `policyengine.py`.
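As a rough illustration of how the example shape above could be loaded into typed objects, here is a minimal sketch. The class names (`PackagePin`, `DatasetArtifact`, `CountryManifest`) and the `parse_manifest` helper are hypothetical and not part of the current `policyengine.py` API; only the field names come from the example JSON.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PackagePin:
    """A package name pinned to one exact version."""
    name: str
    version: str


@dataclass(frozen=True)
class DatasetArtifact:
    """One immutable dataset artifact, as described in the manifest."""
    repo: str
    path: str
    revision: str
    sha256: str


@dataclass(frozen=True)
class CountryManifest:
    """Hypothetical in-memory form of a per-country release manifest."""
    country: str
    policyengine_py_version: str
    model_package: PackagePin
    data_package: PackagePin
    datasets: dict


def parse_manifest(raw: str) -> CountryManifest:
    """Parse manifest JSON; a missing required key raises KeyError."""
    doc = json.loads(raw)
    return CountryManifest(
        country=doc["country"],
        policyengine_py_version=doc["policyengine_py_version"],
        model_package=PackagePin(**doc["model_package"]),
        data_package=PackagePin(**doc["data_package"]),
        datasets={
            name: DatasetArtifact(**spec)
            for name, spec in doc["datasets"].items()
        },
    )
```

Failing loudly on a missing or unexpected key is deliberate in this sketch: a manifest that only partially parses would undermine the point of pinning.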

### 2. Stop treating free-form URLs as the source of truth

Inside `policyengine.py`, country code should resolve datasets from logical refs plus manifest metadata, not from handwritten floating URLs.

Examples:

- Prefer `dataset="enhanced_cps_2024"` + manifest resolution over embedding `hf://.../enhanced_cps_2024.h5`
- Prefer country-specific dataset registries resolved from a pinned `data_package.version`
- Prefer manifest-based resolution for region datasets and weight matrices too, not just national microdata
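A sketch of what logical-name resolution might look like, assuming the manifest has already been loaded as a plain dict in the shape shown in section 1. The names `ArtifactRef` and `resolve_dataset` are illustrative only, and rejecting URLs outright is one possible policy for the default path, not a settled design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactRef:
    """A fully pinned artifact reference taken from the manifest."""
    repo: str
    path: str
    revision: str
    sha256: str


def resolve_dataset(manifest: dict, name: str) -> ArtifactRef:
    """Resolve a logical dataset name against a country manifest."""
    # The default path accepts logical names only, never raw URLs.
    if "://" in name:
        raise ValueError(
            f"expected a logical dataset name, got a URL: {name!r}"
        )
    datasets = manifest["datasets"]
    if name not in datasets:
        known = ", ".join(sorted(datasets))
        raise KeyError(
            f"unknown dataset {name!r} for {manifest['country']}; "
            f"known: {known}"
        )
    entry = datasets[name]
    return ArtifactRef(
        repo=entry["repo"],
        path=entry["path"],
        revision=entry["revision"],
        sha256=entry["sha256"],
    )
```

The same lookup could serve region datasets and weight matrices if they are registered under their own logical names in the manifest.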

### 3. Make each `-data` version discoverable on Hugging Face

Each country `-data` release should publish a release manifest or index that lives at the corresponding HF revision/tag.

For example, given `policyengine-us-data==1.25.3`, we should be able to resolve:

- HF repo: `policyengine/policyengine-us-data`
- revision/tag: `1.25.3`
- machine-readable manifest at that revision listing the datasets available there

This is close to current behavior in the US and UK data repos, which already upload root-level filenames and tag the HF commit with the package version. The missing contract is:

- make the tag the official lookup boundary
- publish a manifest/index at that revision
- include checksums and provenance
- avoid depending on users knowing filenames by convention
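One way the tag-as-lookup-boundary idea could be expressed in code is sketched below. Everything here is an assumption: the index filename `release_manifest.json`, the index layout, and the `load_release_index` helper are hypothetical. The download itself is injected as a `fetch` callable so the sketch stays independent of any particular client.

```python
import json
from typing import Callable

# Hypothetical filename for the per-release index; not an existing convention.
INDEX_FILENAME = "release_manifest.json"

# fetch(repo, filename, revision) -> raw bytes of that file at that revision
Fetch = Callable[[str, str, str], bytes]


def load_release_index(repo: str, version: str, fetch: Fetch) -> dict:
    """Load and sanity-check the dataset index published at a release tag."""
    index = json.loads(fetch(repo, INDEX_FILENAME, version))
    # The index must describe the same version as the tag it was read from.
    declared = index.get("data_package", {}).get("version")
    if declared != version:
        raise ValueError(
            f"index at revision {version!r} declares version {declared!r}"
        )
    # Every listed dataset needs at least a path and a checksum.
    for name, entry in index.get("datasets", {}).items():
        missing = {"path", "sha256"} - set(entry)
        if missing:
            raise ValueError(f"dataset {name!r} is missing {sorted(missing)}")
    return index
```

In practice `fetch` could wrap `huggingface_hub.hf_hub_download(repo_id=..., filename=..., revision=...)` and read the returned local file, but that wiring is left out here.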

### 4. Support deterministic rebuilds from a pinned `-data` version

The `-data` package version should not only identify a downloadable artifact. It should also identify the build recipe.

For each published dataset artifact, the release metadata should include enough information to rebuild it from scratch, including:

- source repo commit
- pipeline name / entrypoint
- upstream raw input identifiers and checksums
- calibration inputs and checksums
- any random seeds or deterministic parameters
- expected output checksum

By default, `policyengine.py` should download the prebuilt artifact for speed. The pinned `-data` version should also support a deterministic rebuild path whose output can be verified against the published checksum.
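Whether an artifact was downloaded or rebuilt, the verification step reduces to comparing a SHA-256 digest with the published one. A minimal sketch using only the standard library follows; the function names are illustrative, not an existing API.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the published checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

Reading in chunks matters because the `.h5` artifacts can be large; the same check applies unchanged to a locally rebuilt file.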

### 5. Record the resolved model/data bundle in simulation outputs

Every simulation/report/output that relies on this orchestration should expose the resolved bundle, including:

- `policyengine.py` version
- country model package name/version
- country data package name/version
- resolved dataset artifact locator
- artifact checksum
- optional build provenance identifier

That metadata should be easy to serialize into result objects, exports, or manifests for academic replication.
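For illustration, the resolved bundle could be a small frozen record that serializes to JSON. The `ResolvedBundle` class and its field names are hypothetical; they mirror the bullet list above rather than any existing result object.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedBundle:
    """Hypothetical record of exactly what a simulation ran against."""
    policyengine_py_version: str
    country: str
    model_package: str
    model_version: str
    data_package: str
    data_version: str
    dataset_locator: str
    dataset_sha256: str
    build_provenance_id: Optional[str] = None

    def to_json(self) -> str:
        # Sorted keys keep the output stable across runs, which helps diffing.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "ResolvedBundle":
        return cls(**json.loads(raw))
```

A replication package could then store the `to_json()` output next to its results and compare it against a fresh resolution before re-running.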

### 6. Generalize this across countries

This should not be implemented as a US-only special case.

The orchestration mechanism should work for:

- US national datasets
- US region datasets (states, districts, place-derived workflows)
- UK national datasets
- UK regional artifacts and weight matrices
- future country packages that follow the same contract

## Repo scope

This issue belongs in `policyengine.py`, but it requires coordinated changes across repos.

Expected participating repos:

- `policyengine.py`
- `policyengine-us`
- `policyengine-uk`
- `policyengine-us-data`
- `policyengine-uk-data`

Likely responsibilities:

### `policyengine.py`

- define the manifest schema and resolution API
- package country manifests with each release
- resolve defaults from manifest rather than floating URLs
- expose resolved bundle metadata on simulations/results
- add replay tests for pinned historical manifests

### country model repos (`policyengine-us`, `policyengine-uk`)

- stop assuming mutable default dataset URLs are the default replication path
- support explicit `data_version` / manifest-driven dataset resolution where needed
- expose version metadata without requiring live PyPI lookups

### country data repos (`policyengine-us-data`, `policyengine-uk-data`)

- publish a per-release manifest/index at each release tag
- make `data_version -> datasets available at that release` machine-readable
- include checksums and build provenance
- preserve or publish immutable artifact paths for region/national assets

## Proposed rollout

### Phase 1: contract and schema

- define manifest schema in `policyengine.py`
- implement resolver for `country + logical dataset name + policyengine.py release`
- package initial manifests for US and UK

### Phase 2: runtime migration

- remove floating dataset defaults from `policyengine.py` dataset helpers
- migrate region registries to manifest-backed artifact resolution
- stop using runtime PyPI metadata lookups for orchestration decisions

### Phase 3: provenance and replay

- add per-release manifest publishing in `-data` repos
- add checksum verification
- add deterministic rebuild metadata
- add golden historical replay tests in CI for at least one US and one UK release

## Acceptance criteria

- Installing `policyengine.py==X` is sufficient to determine the default country model and data versions for US and UK without relying on mutable latest dataset paths.
- `policyengine.py` contains a machine-readable manifest for each supported country release bundle.
- Each pinned `-data` version can be resolved to a machine-readable dataset index at the corresponding HF revision.
- Default dataset resolution in `policyengine.py` goes through manifest-backed logical dataset names, not handwritten floating HF/GCS URLs.
- Region datasets and related weight artifacts are covered by the same versioning contract.
- Simulation outputs can report the exact resolved model/data artifact bundle.
- At least one historical US release and one historical UK release can be replayed in CI from pinned manifests.

## Non-goals

- bundling large dataset artifacts directly into the `policyengine.py` wheel
- forcing an immediate monorepo migration
- solving only the US case and retrofitting UK later

## Why this matters

This would make `policyengine.py` the actual immutable boundary for replication.

That is a much stronger and simpler contract for research, debugging, support, and reproducibility than the current situation, in which users effectively rely on a mix of package versions, floating dataset URLs, and country-specific conventions.

