Context
microplex-us currently uses policyengine-us-data in two materially different ways:
- Construction dependencies: PE-US-data code, files, or storage artifacts help build source frames, target constraints, calibration/selection weights, or other inputs that can affect the final Microplex H5.
- Validation/comparison dependencies: PE-US-data is used after a candidate H5 exists, usually as the incumbent comparator or as the legacy PE-native scoring surface.
If the goal is to fully migrate/supersede PE-US-data with microplex-us, I recommend duplicating or re-homing the construction-related functionality inside microplex-us or a Microplex-owned data/artifact package. The validation/comparison uses should not necessarily be removed, because PE-US-data remains useful as an incumbent benchmark.
Construction-related PE-US-data dependencies to migrate
These are the current places where PE-US-data can affect the final product, not just evaluate it afterward.
Donor survey loading
microplex-us can load ACS and SCF donor frames through PE-US-data dataset classes and then convert those outputs into Microplex household/person source frames.
Relevant paths:
- src/microplex_us/manifests/pe_source_impute_blocks.json
  - policyengine_us_data.datasets.acs.acs.ACS_2022
  - policyengine_us_data.datasets.scf.scf.SCF_2022
- src/microplex_us/data_sources/donor_surveys.py
  - builds a subprocess loader for those PE-US-data dataset classes
  - parses the resulting arrays into ObservationFrame tables
Recommended migration:
- Move/port the ACS and SCF loading contracts into microplex-us source providers.
- Keep the current variable mapping manifests, but make their loaders Microplex-owned rather than PE-US-data imports.
- Add parity tests showing the Microplex-owned loaders reproduce the existing PE-US-data-backed source frames for the canonical years.
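A parity test of this kind could take roughly the following shape. This is a minimal sketch, not the repository's test code: each source frame is represented as a plain dict mapping a column name to a list of numeric values (a stand-in for the real ObservationFrame tables), and `column_parity_report` and its tolerances are hypothetical names chosen for illustration.

```python
import math


def column_parity_report(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return {column: reason} for every column that fails parity.

    `reference` stands for the PE-US-data-backed frame and `candidate` for the
    Microplex-owned one. Both are dicts of column name -> list of numbers.
    An empty result means the two frames agree within tolerance.
    """
    problems = {}
    for column in sorted(set(reference) | set(candidate)):
        if column not in candidate:
            problems[column] = "missing from candidate"
            continue
        if column not in reference:
            problems[column] = "unexpected in candidate"
            continue
        ref_values, cand_values = reference[column], candidate[column]
        if len(ref_values) != len(cand_values):
            problems[column] = f"length {len(ref_values)} != {len(cand_values)}"
            continue
        # Report only the first differing row per column to keep output short.
        for row, (ref, cand) in enumerate(zip(ref_values, cand_values)):
            if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
                problems[column] = f"row {row}: {ref} != {cand}"
                break
    return problems


# Toy frames; real tests would load the canonical-year ACS and SCF frames.
reference = {"age": [34.0, 51.0], "net_worth": [1000.0, 250000.0]}
candidate = {"age": [34.0, 51.0], "net_worth": [1000.0, 250000.5]}
print(column_parity_report(reference, candidate))
```

Returning a report instead of raising on the first mismatch makes it easier to see every diverging column in one test run.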
SIPP donor source files
SIPP donor blocks currently download raw files from the Hugging Face repo PolicyEngine/policyengine-us-data, e.g. pu2023_slim.csv and pu2023.csv, then parse them into donor source frames.
Relevant paths:
- src/microplex_us/manifests/pe_source_impute_blocks.json
- src/microplex_us/data_sources/donor_surveys.py
  - _download_policyengine_us_data_file(...)
Recommended migration:
- Re-home the required SIPP files in a Microplex-owned dataset/artifact location, or define a source contract that fetches them from their original public source.
- Keep a manifest with checksums, year, schema, and provenance.
- Update donor survey loaders to reference the Microplex-owned source contract.
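One way the manifest entry and its checksum check could look is sketched below. Everything here is an assumption for illustration: `SourceArtifact`, its field names, and `verify_artifact` are hypothetical and do not exist in microplex-us today. The sketch only shows how year, schema, provenance, and a SHA-256 digest could travel together and be enforced before a loader parses a file.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceArtifact:
    """One manifest entry for a re-homed source file (hypothetical shape)."""

    filename: str
    year: int
    sha256: str
    schema_columns: tuple
    provenance: str


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large raw files are not read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, artifact):
    """Return the path if its digest matches the manifest entry, else raise ValueError."""
    actual = sha256_of(path)
    if actual != artifact.sha256:
        raise ValueError(
            f"{artifact.filename}: expected sha256 {artifact.sha256}, got {actual}"
        )
    return Path(path)
```

A donor survey loader would call `verify_artifact(...)` on the downloaded file before parsing it, so a silently changed upstream file fails loudly instead of altering the final H5.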
PUF raw files and uprating inputs
When a PE-US-data checkout is supplied, the PUF provider can use repo-local PE-US-data files and tables during source construction:
- raw PUF CSVs and demographics files
- soi.csv
- uprating_factors.csv
These affect the PUF source before final export and can therefore affect the final H5.
Relevant paths:
- src/microplex_us/data_sources/puf.py
  - _resolve_policyengine_repo_local_puf_paths(...)
  - _resolve_pe_soi_path(...)
  - _resolve_pe_uprating_factors_path(...)
  - uprate_raw_puf_pe_style(...)
  - uprate_mapped_puf_with_pe_factors(...)
- src/microplex_us/pipelines/pe_us_data_rebuild.py
  - canonical rebuild provider bundle sets PE-style PUF uprating via PUF_UPRATING_MODE_PE_SOI
Recommended migration:
- Move the required PE-style SOI/uprating tables into a Microplex-owned artifact or source manifest.
- Make PUF raw artifact resolution independent of a local PE-US-data checkout.
- Preserve PE-style parity behavior as a named Microplex-owned mode, with tests against the current PE-US-data-backed results.
PUF pre-tax contribution imputation support
The optional PE-style pre-tax contribution imputation path can use PE-US-data artifacts/code for training data:
- extended_cps_2024.h5 or extended_cps_<year>.h5 under PE-US-data storage
- fallback subprocess path using policyengine_us_data.datasets.cps.CPS_2021
Relevant path:
- src/microplex_us/data_sources/puf.py
  - _load_pe_extended_cps_pre_tax_training_frame(...)
  - _default_pe_style_puf_pre_tax_contribution_model(...)
  - PEStyleSubprocessImputationPredictor
Recommended migration:
- Decide whether this PE-style imputation path is part of the canonical build.
- If yes, port the training frame and model construction into microplex-us with Microplex-owned fixtures/artifacts.
- If no, mark it explicitly as a PE-compatibility optional path and exclude it from the canonical no-PE-US-data build contract.
Calibration target DB ownership
The active PE-US target DB is often supplied from PE-US-data storage and is used to build calibration constraints, not only to evaluate final H5s. This means it affects final weights.
Relevant paths:
- src/microplex_us/policyengine/us.py
  - PolicyEngineUSDBTargetProvider
- src/microplex_us/pipelines/us.py
  - calibrate_policyengine_tables(...)
  - _load_policyengine_target_set(...)
Recommended migration:
- Treat the target DB as a first-class Microplex/PolicyEngine target artifact, not as an incidental file inside a PE-US-data checkout.
- Keep the DB schema reader if it is the canonical target schema, but document and package the target DB as an explicit input to Microplex builds.
- Add clear provenance/version metadata to saved artifacts.
Optional solver and PE-native selection paths
Some optional construction paths directly reuse PE-US-data code:
- calibration_backend="pe_l0" calls policyengine_us_data.calibration.unified_calibration.fit_l0_weights.
- policyengine_selection_backend="pe_native_loss" and PE-native optimization extract policyengine_us_data.utils.loss.build_loss_matrix(...) and can affect household selection/weights.
Relevant paths:
- src/microplex_us/pipelines/pe_l0.py
- src/microplex_us/pipelines/pe_native_optimization.py
- src/microplex_us/pipelines/us.py
Recommended migration:
- Port any solver/loss-matrix functionality intended for canonical builds into microplex-us or microplex core.
- If these remain experimental/benchmark-only paths, gate and document them as optional PE-US-data compatibility features.
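Gating could be as simple as an explicit opt-in check that runs before any PE-US-data import is attempted. The sketch below is illustrative only: the backend values "pe_l0" and "pe_native_loss" come from the list above, while `check_pe_us_data_gate`, the `allow_pe_us_data` flag, and the `installed` override are hypothetical names, not existing microplex-us API.

```python
import importlib.util

# Backend values taken from this issue; the grouping into sets is an assumption.
PE_BACKED_CALIBRATION_BACKENDS = frozenset({"pe_l0"})
PE_BACKED_SELECTION_BACKENDS = frozenset({"pe_native_loss"})


def pe_us_data_installed():
    """True if the policyengine_us_data package can be located, without importing it."""
    return importlib.util.find_spec("policyengine_us_data") is not None


def check_pe_us_data_gate(
    calibration_backend,
    selection_backend,
    *,
    allow_pe_us_data=False,
    installed=None,
):
    """Return True if the requested backends need PE-US-data and the gate is open.

    Returns False when neither backend depends on PE-US-data. Raises
    RuntimeError when a PE-backed backend is requested without opt-in, and
    ImportError when opt-in is given but the package is not installed.
    `installed` lets tests override the package lookup.
    """
    needs_pe = (
        calibration_backend in PE_BACKED_CALIBRATION_BACKENDS
        or selection_backend in PE_BACKED_SELECTION_BACKENDS
    )
    if not needs_pe:
        return False
    if not allow_pe_us_data:
        raise RuntimeError(
            "This backend reuses PE-US-data code and is not part of the "
            "canonical build; pass allow_pe_us_data=True to opt in."
        )
    if installed is None:
        installed = pe_us_data_installed()
    if not installed:
        raise ImportError("policyengine_us_data is required for this optional backend.")
    return True
```

Keeping the check separate from the solver code means the canonical path never touches `policyengine_us_data`, and the opt-in shows up as one explicit argument in build configurations.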
Validation/comparison PE-US-data uses that should not necessarily be removed
These usages are different from construction dependencies. They are useful for benchmarking Microplex against the incumbent PE-US-data dataset and for preserving continuity with existing PE-US evaluation conventions.
Examples:
- src/microplex_us/pipelines/pe_native_scores.py
  - runs PE-US-data's native enhanced-CPS broad-loss scorer over candidate and baseline H5s
  - useful for incumbent comparison and historical continuity
- src/microplex_us/pipelines/pe_us_data_rebuild_checkpoint.py
  - can attach policyengine_native_scores.json and PE-native audit evidence after the candidate H5 is saved
- src/microplex_us/pipelines/pe_us_data_rebuild_audit.py
- src/microplex_us/pipelines/backfill_pe_native_scores.py
- src/microplex_us/pipelines/backfill_pe_native_audit.py
- src/microplex_us/pipelines/pe_focus_targets.py
- PE baseline H5s such as enhanced_cps_2024.h5
  - useful as incumbent comparator datasets
Recommendation: keep these validation/comparison paths available, but document them as benchmark/comparator dependencies rather than construction dependencies.
Suggested acceptance criteria
- The canonical microplex-us build path can run without a local PE-US-data checkout, except when the user explicitly opts into PE-US-data validation/comparison.
- Construction-time PE-US-data imports are removed from the canonical build path or replaced by Microplex-owned loaders/artifacts.
- Source manifests identify Microplex-owned provenance, checksums, and schemas for ACS, SCF, SIPP, PUF, SOI, and uprating inputs.
- Optional PE-US-data compatibility paths are clearly gated and documented.
- Docs distinguish:
  - construction dependencies that affect final outputs
  - validation/comparison dependencies that benchmark final outputs
- Tests cover the migrated construction paths against current PE-US-data-backed behavior where parity is expected.