Track and migrate policyengine-us-data construction dependencies #19

## Context

`microplex-us` currently uses `policyengine-us-data` in two materially different ways:

1. **Construction dependencies**: PE-US-data code, files, or storage artifacts help build source frames, target constraints, calibration/selection weights, or other inputs that can affect the final Microplex H5.
2. **Validation/comparison dependencies**: PE-US-data is used after a candidate H5 exists, usually as the incumbent comparator or as the legacy PE-native scoring surface.

If the goal is for `microplex-us` to fully supersede PE-US-data, I recommend duplicating or re-homing the construction-related functionality inside `microplex-us` or a Microplex-owned data/artifact package. The validation/comparison uses do not necessarily need to be removed, because PE-US-data remains useful as an incumbent benchmark.

## Construction-related PE-US-data dependencies to migrate

These are the current places where PE-US-data can affect the final product, not just evaluate it afterward.

### Donor survey loading

`microplex-us` can load ACS and SCF donor frames through PE-US-data dataset classes and then convert those outputs into Microplex household/person source frames.

Relevant paths:

- `src/microplex_us/manifests/pe_source_impute_blocks.json`
  - `policyengine_us_data.datasets.acs.acs.ACS_2022`
  - `policyengine_us_data.datasets.scf.scf.SCF_2022`
- `src/microplex_us/data_sources/donor_surveys.py`
  - builds a subprocess loader for those PE-US-data dataset classes
  - parses the resulting arrays into `ObservationFrame` tables

Recommended migration:

- Move/port the ACS and SCF loading contracts into `microplex-us` source providers.
- Keep the current variable mapping manifests, but make their loaders Microplex-owned rather than PE-US-data imports.
- Add parity tests showing the Microplex-owned loaders reproduce the existing PE-US-data-backed source frames for the canonical years.
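
A parity test of this kind could be sketched as follows. This is a minimal, dependency-free illustration, not the repo's test harness: `find_parity_mismatches` and the row-as-dict representation are hypothetical, and a real test would more likely compare the actual source-frame tables (for example with `pandas.testing.assert_frame_equal`).

```python
import math


def find_parity_mismatches(reference_rows, candidate_rows, key, rel_tol=1e-9):
    """Return human-readable mismatches between two sets of rows.

    Hypothetical helper: rows are plain dicts and `key` names the column
    that identifies a record (e.g. a household id). Row order is ignored.
    Note: NaN never compares close to NaN under math.isclose.
    """
    reference = {row[key]: row for row in reference_rows}
    candidate = {row[key]: row for row in candidate_rows}
    mismatches = []

    # Records present on one side only.
    for record_id in sorted(reference.keys() - candidate.keys()):
        mismatches.append(f"{record_id}: missing from candidate")
    for record_id in sorted(candidate.keys() - reference.keys()):
        mismatches.append(f"{record_id}: unexpected in candidate")

    # Column-by-column comparison for shared records.
    for record_id in sorted(reference.keys() & candidate.keys()):
        ref_row, cand_row = reference[record_id], candidate[record_id]
        for column, expected in ref_row.items():
            if column not in cand_row:
                mismatches.append(f"{record_id}.{column}: column missing")
                continue
            actual = cand_row[column]
            both_numeric = isinstance(expected, (int, float)) and isinstance(
                actual, (int, float)
            )
            if both_numeric:
                equal = math.isclose(expected, actual, rel_tol=rel_tol)
            else:
                equal = expected == actual
            if not equal:
                mismatches.append(f"{record_id}.{column}: {expected!r} != {actual!r}")
    return mismatches
```

In a parity test, `reference_rows` would come from the existing PE-US-data-backed loader and `candidate_rows` from the Microplex-owned loader, with the test asserting that the returned list is empty.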

### SIPP donor source files

SIPP donor blocks currently download raw files from the Hugging Face repo `PolicyEngine/policyengine-us-data`, e.g. `pu2023_slim.csv` and `pu2023.csv`, then parse them into donor source frames.

Relevant paths:

- `src/microplex_us/manifests/pe_source_impute_blocks.json`
- `src/microplex_us/data_sources/donor_surveys.py`
  - `_download_policyengine_us_data_file(...)`

Recommended migration:

- Re-home the required SIPP files in a Microplex-owned dataset/artifact location, or define a source contract that fetches them from their original public source.
- Keep a manifest with checksums, year, schema, and provenance.
- Update donor survey loaders to reference the Microplex-owned source contract.
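
As a rough illustration of the checksum part of such a manifest, the sketch below verifies files against recorded SHA-256 digests. The manifest layout (`files`, `name`, `sha256`) and both function names are assumptions made for this example; they are not an existing `microplex-us` schema or API.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large CSVs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, artifact_dir):
    """Return a list of problems; an empty list means every file matched.

    Assumed (hypothetical) manifest shape:
    {"files": [{"name": "...", "sha256": "...", ...}, ...]}
    Extra keys such as year, schema, or provenance are ignored here.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for entry in manifest["files"]:
        path = Path(artifact_dir) / entry["name"]
        if not path.exists():
            problems.append(f"{entry['name']}: file not found")
        elif sha256_of_file(path) != entry["sha256"]:
            problems.append(f"{entry['name']}: checksum mismatch")
    return problems
```

A loader could call `verify_manifest(...)` before parsing and refuse to build donor frames if the returned list is non-empty.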

### PUF raw files and uprating inputs

When a PE-US-data checkout is supplied, the PUF provider can use repo-local PE-US-data files and tables during source construction:

- raw PUF CSVs and demographics files
- `soi.csv`
- `uprating_factors.csv`

These affect the PUF source before final export and can therefore affect the final H5.

Relevant paths:

- `src/microplex_us/data_sources/puf.py`
  - `_resolve_policyengine_repo_local_puf_paths(...)`
  - `_resolve_pe_soi_path(...)`
  - `_resolve_pe_uprating_factors_path(...)`
  - `uprate_raw_puf_pe_style(...)`
  - `uprate_mapped_puf_with_pe_factors(...)`
- `src/microplex_us/pipelines/pe_us_data_rebuild.py`
  - canonical rebuild provider bundle sets PE-style PUF uprating via `PUF_UPRATING_MODE_PE_SOI`

Recommended migration:

- Move the required PE-style SOI/uprating tables into a Microplex-owned artifact or source manifest.
- Make PUF raw artifact resolution independent of a local PE-US-data checkout.
- Preserve PE-style parity behavior as a named Microplex-owned mode, with tests against the current PE-US-data-backed results.

### PUF pre-tax contribution imputation support

The optional PE-style pre-tax contribution imputation path can use PE-US-data artifacts/code for training data:

- `extended_cps_2024.h5` or `extended_cps_<year>.h5` under PE-US-data storage
- fallback subprocess path using `policyengine_us_data.datasets.cps.CPS_2021`

Relevant paths:

- `src/microplex_us/data_sources/puf.py`
  - `_load_pe_extended_cps_pre_tax_training_frame(...)`
  - `_default_pe_style_puf_pre_tax_contribution_model(...)`
  - `PEStyleSubprocessImputationPredictor`

Recommended migration:

- Decide whether this PE-style imputation path is part of the canonical build.
- If yes, port the training frame and model construction into `microplex-us` with Microplex-owned fixtures/artifacts.
- If no, mark it explicitly as a PE-compatibility optional path and exclude it from the canonical no-PE-US-data build contract.

### Calibration target DB ownership

The active PE-US target DB is often supplied from PE-US-data storage and is used to build calibration constraints, not only to evaluate final H5s. This means it affects final weights.

Relevant paths:

- `src/microplex_us/policyengine/us.py`
  - `PolicyEngineUSDBTargetProvider`
- `src/microplex_us/pipelines/us.py`
  - `calibrate_policyengine_tables(...)`
  - `_load_policyengine_target_set(...)`

Recommended migration:

- Treat the target DB as a first-class Microplex/PolicyEngine target artifact, not as an incidental file inside a PE-US-data checkout.
- Keep the DB schema reader if it is the canonical target schema, but document and package the target DB as an explicit input to Microplex builds.
- Add clear provenance/version metadata to saved artifacts.
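
One possible shape for that provenance metadata is a small JSON sidecar written next to each saved artifact. The sketch below is only an illustration under assumed names: the record fields, the `.provenance.json` suffix, and both functions are hypothetical, not current `microplex-us` behavior.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_target_db_provenance(db_path, source_label, version):
    """Describe the target DB that was used to build calibration constraints.

    Reads the whole file into memory for simplicity; a large DB would be
    better hashed in chunks.
    """
    data = Path(db_path).read_bytes()
    return {
        "target_db_file": Path(db_path).name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "source": source_label,  # free-text label chosen by the caller
        "version": version,  # version string chosen by the caller
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_provenance_sidecar(record, artifact_path):
    """Write the record as `<artifact>.provenance.json` and return that path."""
    sidecar = Path(str(artifact_path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

Recording the digest of the target DB alongside the H5 makes it possible to tell later which targets shaped a given set of final weights.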

### Optional solver and PE-native selection paths

Some optional construction paths directly reuse PE-US-data code:

- `calibration_backend="pe_l0"` calls `policyengine_us_data.calibration.unified_calibration.fit_l0_weights`.
- `policyengine_selection_backend="pe_native_loss"` and PE-native optimization extract `policyengine_us_data.utils.loss.build_loss_matrix(...)` and can affect household selection/weights.

Relevant paths:

- `src/microplex_us/pipelines/pe_l0.py`
- `src/microplex_us/pipelines/pe_native_optimization.py`
- `src/microplex_us/pipelines/us.py`

Recommended migration:

- Port any solver/loss-matrix functionality intended for canonical builds into `microplex-us` or `microplex` core.
- If these remain experimental/benchmark-only paths, gate and document them as optional PE-US-data compatibility features.
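
Gating could look roughly like the sketch below, which requires both an explicit opt-in and an installed `policyengine_us_data` package before a PE-only backend is accepted. The `allow_pe_us_data` flag, the exception class, and the function names are hypothetical; only the `"pe_l0"` backend name and the package name come from the description above.

```python
import importlib.util

PE_US_DATA_MODULE = "policyengine_us_data"
# Backends that call into PE-US-data code; only "pe_l0" is listed for illustration.
PE_ONLY_BACKENDS = {"pe_l0"}


class OptionalBackendUnavailable(RuntimeError):
    """Raised when a PE-US-data compatibility backend is requested but not usable."""


def module_is_installed(module_name):
    """Check for a top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_calibration_backend(
    backend, allow_pe_us_data=False, is_installed=module_is_installed
):
    """Return `backend` if it may be used, otherwise raise with a clear reason."""
    if backend not in PE_ONLY_BACKENDS:
        return backend  # Microplex-owned backends need no gate
    if not allow_pe_us_data:
        raise OptionalBackendUnavailable(
            f"{backend!r} is a PE-US-data compatibility backend; opt in explicitly"
        )
    if not is_installed(PE_US_DATA_MODULE):
        raise OptionalBackendUnavailable(
            f"{backend!r} requires the optional {PE_US_DATA_MODULE!r} package"
        )
    return backend
```

Passing `is_installed` as a parameter keeps the gate testable without installing or uninstalling the optional package.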

## Validation/comparison PE-US-data uses that should not necessarily be removed

These usages are different from construction dependencies. They are useful for benchmarking Microplex against the incumbent PE-US-data dataset and for preserving continuity with existing PE-US evaluation conventions.

Examples:

- `src/microplex_us/pipelines/pe_native_scores.py`
  - runs PE-US-data's native enhanced-CPS broad-loss scorer over candidate and baseline H5s
  - useful for incumbent comparison and historical continuity
- `src/microplex_us/pipelines/pe_us_data_rebuild_checkpoint.py`
  - can attach `policyengine_native_scores.json` and PE-native audit evidence after the candidate H5 is saved
- `src/microplex_us/pipelines/pe_us_data_rebuild_audit.py`
- `src/microplex_us/pipelines/backfill_pe_native_scores.py`
- `src/microplex_us/pipelines/backfill_pe_native_audit.py`
- `src/microplex_us/pipelines/pe_focus_targets.py`
- PE baseline H5s such as `enhanced_cps_2024.h5`
  - useful as incumbent comparator datasets

Recommendation: keep these validation/comparison paths available, but document them as benchmark/comparator dependencies rather than construction dependencies.

## Suggested acceptance criteria

- The canonical `microplex-us` build path can run without a local PE-US-data checkout, except when the user explicitly opts into PE-US-data validation/comparison.
- Construction-time PE-US-data imports are removed from the canonical build path or replaced by Microplex-owned loaders/artifacts.
- Source manifests identify Microplex-owned provenance, checksums, and schemas for ACS, SCF, SIPP, PUF, SOI, and uprating inputs.
- Optional PE-US-data compatibility paths are clearly gated and documented.
- Docs distinguish:
  - construction dependencies that affect final outputs
  - validation/comparison dependencies that benchmark final outputs
- Tests cover the migrated construction paths against current PE-US-data-backed behavior where parity is expected.
