Track and migrate policyengine-us-data construction dependencies #19

@anth-volk

Context

microplex-us currently uses policyengine-us-data in two materially different ways:

  1. Construction dependencies: PE-US-data code, files, or storage artifacts help build source frames, target constraints, calibration/selection weights, or other inputs that can affect the final Microplex H5.
  2. Validation/comparison dependencies: PE-US-data is used after a candidate H5 exists, usually as the incumbent comparator or as the legacy PE-native scoring surface.

If the goal is to fully migrate/supersede PE-US-data with microplex-us, I recommend duplicating or re-homing the construction-related functionality inside microplex-us or a Microplex-owned data/artifact package. The validation/comparison uses should not necessarily be removed, because PE-US-data remains useful as an incumbent benchmark.

Construction-related PE-US-data dependencies to migrate

These are the current places where PE-US-data can affect the final product, not just evaluate it afterward.

Donor survey loading

microplex-us can load ACS and SCF donor frames through PE-US-data dataset classes and then convert those outputs into Microplex household/person source frames.

Relevant paths:

  • src/microplex_us/manifests/pe_source_impute_blocks.json
    • policyengine_us_data.datasets.acs.acs.ACS_2022
    • policyengine_us_data.datasets.scf.scf.SCF_2022
  • src/microplex_us/data_sources/donor_surveys.py
    • builds a subprocess loader for those PE-US-data dataset classes
    • parses the resulting arrays into ObservationFrame tables

Recommended migration:

  • Move/port the ACS and SCF loading contracts into microplex-us source providers.
  • Keep the current variable mapping manifests, but make their loaders Microplex-owned rather than PE-US-data imports.
  • Add parity tests showing the Microplex-owned loaders reproduce the existing PE-US-data-backed source frames for the canonical years.
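
A parity test of this kind could be sketched as follows. This is a minimal, hypothetical helper, not existing microplex-us code: it assumes both loaders' outputs have already been reduced to column-oriented dicts (column name to list of values), and the name frame_parity_report and the tolerance defaults are illustrative only.

```python
import math


def _values_match(ref, cand, rel_tol, abs_tol):
    """Numeric values match within tolerance (NaN equals NaN); others must be equal."""
    numeric = (int, float)
    if isinstance(ref, numeric) and isinstance(cand, numeric):
        if math.isnan(ref) and math.isnan(cand):
            return True
        return math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol)
    return ref == cand


def frame_parity_report(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Compare two column-oriented frames; an empty report means parity holds.

    `reference` stands for the PE-US-data-backed frame and `candidate` for the
    Microplex-owned loader output (both hypothetical inputs in this sketch).
    """
    report = {}
    missing = sorted(set(reference) - set(candidate))
    extra = sorted(set(candidate) - set(reference))
    if missing:
        report["missing_columns"] = missing
    if extra:
        report["extra_columns"] = extra
    for column in sorted(set(reference) & set(candidate)):
        ref_values, cand_values = reference[column], candidate[column]
        if len(ref_values) != len(cand_values):
            report[column] = f"length {len(ref_values)} != {len(cand_values)}"
            continue
        bad_rows = [
            index
            for index, (ref, cand) in enumerate(zip(ref_values, cand_values))
            if not _values_match(ref, cand, rel_tol, abs_tol)
        ]
        if bad_rows:
            report[column] = f"{len(bad_rows)} mismatched rows, first at index {bad_rows[0]}"
    return report
```

Returning a report rather than asserting inside the helper keeps the failure message readable when several columns drift at once; a test would simply assert that the report is empty for each canonical year.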

SIPP donor source files

SIPP donor blocks currently download raw files from the Hugging Face repo PolicyEngine/policyengine-us-data, e.g. pu2023_slim.csv and pu2023.csv, then parse them into donor source frames.

Relevant paths:

  • src/microplex_us/manifests/pe_source_impute_blocks.json
  • src/microplex_us/data_sources/donor_surveys.py
    • _download_policyengine_us_data_file(...)

Recommended migration:

  • Re-home the required SIPP files in a Microplex-owned dataset/artifact location, or define a source contract that fetches them from their original public source.
  • Keep a manifest with checksums, year, schema, and provenance.
  • Update donor survey loaders to reference the Microplex-owned source contract.
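
One possible shape for such a manifest check is sketched below. The entry fields (filename, year, sha256, provenance, schema) and the function names are assumptions for illustration, not an existing microplex-us contract, and the demo uses a stand-in file rather than real SIPP data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large survey files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest_entry(entry, root):
    """Return the resolved path if the file matches its manifest checksum."""
    path = Path(root) / entry["filename"]
    actual = sha256_of(path)
    if actual != entry["sha256"]:
        raise ValueError(
            f"{entry['filename']}: expected {entry['sha256']}, got {actual}"
        )
    return path


# Demo with a stand-in file; a real manifest would pin the checksum ahead of time.
with tempfile.TemporaryDirectory() as root:
    sample = Path(root) / "pu2023_slim.csv"
    sample.write_bytes(b"id,value\n1,101\n")
    entry = {
        "filename": "pu2023_slim.csv",
        "year": 2023,  # illustrative value
        "sha256": sha256_of(sample),
        "provenance": "hypothetical: where and when the file was obtained",
        "schema": ["id", "value"],  # stand-in column names
    }
    verified_ok = verify_manifest_entry(entry, root) == sample

    # Any change to the file contents should now be rejected.
    sample.write_bytes(b"tampered\n")
    try:
        verify_manifest_entry(entry, root)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```

Verifying at load time means a donor loader fails fast if the re-homed file drifts from what the manifest recorded, regardless of where the file is hosted.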

PUF raw files and uprating inputs

When a PE-US-data checkout is supplied, the PUF provider can use repo-local PE-US-data files and tables during source construction:

  • raw PUF CSVs and demographics files
  • soi.csv
  • uprating_factors.csv

These affect the PUF source before final export and can therefore affect the final H5.

Relevant paths:

  • src/microplex_us/data_sources/puf.py
    • _resolve_policyengine_repo_local_puf_paths(...)
    • _resolve_pe_soi_path(...)
    • _resolve_pe_uprating_factors_path(...)
    • uprate_raw_puf_pe_style(...)
    • uprate_mapped_puf_with_pe_factors(...)
  • src/microplex_us/pipelines/pe_us_data_rebuild.py
    • canonical rebuild provider bundle sets PE-style PUF uprating via PUF_UPRATING_MODE_PE_SOI

Recommended migration:

  • Move the required PE-style SOI/uprating tables into a Microplex-owned artifact or source manifest.
  • Make PUF raw artifact resolution independent of a local PE-US-data checkout.
  • Preserve PE-style parity behavior as a named Microplex-owned mode, with tests against the current PE-US-data-backed results.

PUF pre-tax contribution imputation support

The optional PE-style pre-tax contribution imputation path can use PE-US-data artifacts/code for training data:

  • extended_cps_2024.h5 or extended_cps_<year>.h5 under PE-US-data storage
  • fallback subprocess path using policyengine_us_data.datasets.cps.CPS_2021

Relevant path:

  • src/microplex_us/data_sources/puf.py
    • _load_pe_extended_cps_pre_tax_training_frame(...)
    • _default_pe_style_puf_pre_tax_contribution_model(...)
    • PEStyleSubprocessImputationPredictor

Recommended migration:

  • Decide whether this PE-style imputation path is part of the canonical build.
  • If yes, port the training frame and model construction into microplex-us with Microplex-owned fixtures/artifacts.
  • If no, mark it explicitly as a PE-compatibility optional path and exclude it from the canonical no-PE-US-data build contract.

Calibration target DB ownership

The active PE-US target DB is often supplied from PE-US-data storage and is used to build calibration constraints, not only to evaluate final H5s. This means it affects final weights.

Relevant paths:

  • src/microplex_us/policyengine/us.py
    • PolicyEngineUSDBTargetProvider
  • src/microplex_us/pipelines/us.py
    • calibrate_policyengine_tables(...)
    • _load_policyengine_target_set(...)

Recommended migration:

  • Treat the target DB as a first-class Microplex/PolicyEngine target artifact, not as an incidental file inside a PE-US-data checkout.
  • Keep the DB schema reader if it is the canonical target schema, but document and package the target DB as an explicit input to Microplex builds.
  • Add clear provenance/version metadata to saved artifacts.
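
As a rough illustration of the provenance point, the sketch below records a content hash and version label for the target DB next to a saved artifact. All field names, labels, and the candidate.h5 filename are hypothetical; a real build would likely hash the DB file in chunks rather than from in-memory bytes.

```python
import hashlib
import json
from datetime import datetime, timezone


def target_db_provenance(db_bytes, source_label, schema_version):
    """Describe which target DB a build consumed (hypothetical field names)."""
    return {
        "sha256": hashlib.sha256(db_bytes).hexdigest(),
        "source": source_label,
        "schema_version": schema_version,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Stand-in bytes and labels; none of these values come from a real build.
provenance = target_db_provenance(
    b"stand-in database bytes",
    source_label="hypothetical-target-db-release",
    schema_version="hypothetical-v1",
)
metadata = {"artifact": "candidate.h5", "target_db": provenance}
serialized = json.dumps(metadata, sort_keys=True, indent=2)
```

Because the target DB affects final weights, pinning its hash in the artifact metadata lets two H5s be compared knowing whether they were calibrated against the same targets.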

Optional solver and PE-native selection paths

Some optional construction paths directly reuse PE-US-data code:

  • calibration_backend="pe_l0" calls policyengine_us_data.calibration.unified_calibration.fit_l0_weights.
  • policyengine_selection_backend="pe_native_loss" and the PE-native optimization path use policyengine_us_data.utils.loss.build_loss_matrix(...), so they can affect household selection and weights.

Relevant paths:

  • src/microplex_us/pipelines/pe_l0.py
  • src/microplex_us/pipelines/pe_native_optimization.py
  • src/microplex_us/pipelines/us.py

Recommended migration:

  • Port any solver/loss-matrix functionality intended for canonical builds into microplex-us or microplex core.
  • If these remain experimental/benchmark-only paths, gate and document them as optional PE-US-data compatibility features.
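
Gating could look something like the sketch below. The option names and values are taken from the text above, but the mapping, the allow_pe_compat flag, and both function names are hypothetical. The availability check uses importlib.util.find_spec, which reports whether a top-level package is importable without importing it.

```python
import importlib.util

# Backend values come from the issue text; the mapping itself is a hypothetical sketch.
PE_US_DATA_BACKENDS = {
    "calibration_backend": {"pe_l0"},
    "policyengine_selection_backend": {"pe_native_loss"},
}


def requires_pe_us_data(options):
    """Return the option names whose selected value needs PE-US-data at build time."""
    return sorted(
        name
        for name, gated_values in PE_US_DATA_BACKENDS.items()
        if options.get(name) in gated_values
    )


def check_optional_backends(options, package="policyengine_us_data", allow_pe_compat=False):
    """Reject PE-US-data-backed options unless the caller explicitly opted in."""
    gated = requires_pe_us_data(options)
    if not gated:
        return
    if not allow_pe_compat:
        raise ValueError(
            f"{gated} depend on {package}; pass allow_pe_compat=True to opt in"
        )
    if importlib.util.find_spec(package) is None:
        raise ImportError(f"{package} is not installed but is required by {gated}")
```

With a check like this at the top of a pipeline, the canonical path never touches PE-US-data, and an opted-in run fails with a clear message when the package is missing instead of partway through calibration.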

Validation/comparison PE-US-data uses that should not necessarily be removed

These usages are different from construction dependencies. They are useful for benchmarking Microplex against the incumbent PE-US-data dataset and for preserving continuity with existing PE-US evaluation conventions.

Examples:

  • src/microplex_us/pipelines/pe_native_scores.py
    • runs PE-US-data's native enhanced-CPS broad-loss scorer over candidate and baseline H5s
    • useful for incumbent comparison and historical continuity
  • src/microplex_us/pipelines/pe_us_data_rebuild_checkpoint.py
    • can attach policyengine_native_scores.json and PE-native audit evidence after the candidate H5 is saved
  • src/microplex_us/pipelines/pe_us_data_rebuild_audit.py
  • src/microplex_us/pipelines/backfill_pe_native_scores.py
  • src/microplex_us/pipelines/backfill_pe_native_audit.py
  • src/microplex_us/pipelines/pe_focus_targets.py
  • PE baseline H5s such as enhanced_cps_2024.h5
    • useful as incumbent comparator datasets

Recommendation: keep these validation/comparison paths available, but document them as benchmark/comparator dependencies rather than construction dependencies.

Suggested acceptance criteria

  • The canonical microplex-us build path can run without a local PE-US-data checkout, except when the user explicitly opts into PE-US-data validation/comparison.
  • Construction-time PE-US-data imports are removed from the canonical build path or replaced by Microplex-owned loaders/artifacts.
  • Source manifests identify Microplex-owned provenance, checksums, and schemas for ACS, SCF, SIPP, PUF, SOI, and uprating inputs.
  • Optional PE-US-data compatibility paths are clearly gated and documented.
  • Docs distinguish:
    • construction dependencies that affect final outputs
    • validation/comparison dependencies that benchmark final outputs
  • Tests cover the migrated construction paths against current PE-US-data-backed behavior where parity is expected.
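
One way to check the "no construction-time imports" criterion is a static scan of the canonical build modules, sketched below with the standard-library ast module. The function name is hypothetical. A scan like this only catches literal import statements: it would miss dynamic imports and the subprocess-based loaders described above, so it complements a runtime test rather than replacing one.

```python
import ast

FORBIDDEN_ROOT = "policyengine_us_data"


def find_forbidden_imports(source, forbidden_root=FORBIDDEN_ROOT):
    """Return sorted (line number, module name) pairs for imports of the package."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalike names.
            if name == forbidden_root or name.startswith(forbidden_root + "."):
                hits.append((node.lineno, name))
    return sorted(hits)
```

A test could read each module on the canonical build path, pass its text to this function, and assert the result is empty, while leaving the validation/comparison modules out of the scanned set.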
