Add fast eCPS column-parity gate (CLI + committed contract + CI)#120

Merged
MaxGhenis merged 4 commits into main from add-ecps-column-parity-gate on Jun 1, 2026

@MaxGhenis
Contributor

Motivation

The repo already knows how to detect eCPS export column drift:
_column_contract_gate and _compatibility_gate in
src/microplex_us/pipelines/mp300k_artifact_gates.py fail when an exported
H5 is missing required PolicyEngine columns or includes forbidden
diagnostic columns. But that logic only runs inside the slow MP-300k
artifact path -- after a multi-hour build has produced an H5. Column
drift is therefore invisible until late and expensive to discover.

This PR makes the same column diff runnable in milliseconds, locally and
as the first CI job, with no build, no GPU, and no heavy ML deps. It is
the cheap gate that should pass before the long build runs.

This PR adds the GATE only -- it does not change the exporter or fix
any missing/extra columns.

What's added

  • CLI -- src/microplex_us/pipelines/check_export_columns.py with
    main(argv) and __main__:
    • python -m microplex_us.pipelines.check_export_columns <h5path> reads
      top-level H5 column names (groups name/<period> or flat datasets
      name, collapsed to the base name) and diffs them against the
      contract.
    • --columns-json <file> diffs a JSON list of column names with no
      data file at all (the truly-fast CI path).
    • --contract <file> overrides the committed contract.
    • Exits 1 if any required column is missing or any forbidden column is
      present, else 0.
  • Reuse, not duplication -- the H5 column reader is a new shared helper
    _h5_top_level_columns added to mp300k_artifact_gates.py (handles both
    groups and flat datasets, base-name collapse); existing gate helpers are
    unchanged. The CLI loads it by file path so neither importing nor
    running the CLI pulls the microplex_us package __init__ (microplex
    / torch) -- keeping the gate and its CI torch-free.
  • Frozen contract -- src/microplex_us/pipelines/ecps_export_contract.json,
    derived from the real eCPS baseline H5 (244 columns), with three
    explicit, documented categories.
  • Fast CI -- .github/workflows/export-columns.yml: a standalone
    ubuntu-latest job (no H5/GPU) that installs only pytest/h5py/numpy,
    runs the new unit tests (loaded via importlib so no microplex), and
    runs a --columns-json self-check against a committed clean fixture by
    invoking the module as a file. Seconds, not minutes -- intentionally a
    separate workflow so it reads as the first/cheap contract gate.
  • Tests + fixture -- tests/pipelines/test_check_export_columns.py
    (15 tests) and a clean passing fixture
    tests/pipelines/fixtures/ecps_clean_columns.json (required +
    eCPS-internal-optional, no forbidden) so the green CI path proves the
    gate passes on a good set.
  • Packaged the contract via [tool.hatch.build.targets.wheel.force-include]
    and registered a microplex-us-check-export-columns console script.
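
The "loads it by file path" point above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual loader: `load_module_from_path` and the module name passed to it are hypothetical. It only stays torch-free if the loaded file's own top-level imports are light.

```python
import importlib.util
import sys
from pathlib import Path


def load_module_from_path(module_name: str, path: Path):
    """Import one .py file directly, without running its package __init__.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not code taken from check_export_columns.py.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so code that looks the module up in
    # sys.modules during import (e.g. dataclasses) can find it.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

Because the file is executed under a standalone module name, the parent package's `__init__` never runs, so anything it imports is never pulled in.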

The contract's three categories

| category | count | meaning | gate behavior |
| --- | --- | --- | --- |
| required | 239 | columns MP must export to be a drop-in eCPS replacement (the 244 minus the clone-bookkeeping columns) | fail if any missing |
| ecps_internal_optional | 5 | eCPS clone-bookkeeping flags: household_is_puf_clone, person_is_puf_clone, spm_unit_is_puf_clone, tax_unit_is_puf_clone, family_is_puf_clone -- MP need not export these | neither required nor forbidden |
| forbidden | 15 | transient takeup-input columns eCPS deliberately drops (the *_reported family, e.g. snap_reported, ssi_reported, tanf_reported, ..., plus unreported_payroll_tax) | fail if any present |
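
The gate behavior in the table reduces to a few set operations. The sketch below assumes the contract JSON is a mapping from each category name to a list of column names. That layout, and the names `diff_columns` and `exit_code`, are assumptions for illustration rather than the PR's actual API.

```python
from typing import Iterable


def diff_columns(columns: Iterable[str], contract: dict) -> dict:
    """Compare an export's column names against the three contract categories."""
    present = set(columns)
    required = set(contract["required"])
    optional = set(contract.get("ecps_internal_optional", []))
    forbidden = set(contract["forbidden"])
    return {
        "missing_required": sorted(required - present),
        "forbidden_present": sorted(forbidden & present),
        # Columns the contract does not mention at all (informational only).
        "extra_unknown": sorted(present - required - optional - forbidden),
    }


def exit_code(diff: dict) -> int:
    """1 if any required column is missing or any forbidden one is present."""
    return 1 if diff["missing_required"] or diff["forbidden_present"] else 0


# Toy contract: a tiny stand-in for the real 239 / 5 / 15 contract.
toy_contract = {
    "required": ["has_esi", "block_geoid"],
    "ecps_internal_optional": ["household_is_puf_clone"],
    "forbidden": ["snap_reported"],
}
red = diff_columns(["has_esi", "snap_reported", "debug_col"], toy_contract)
green = diff_columns(["has_esi", "block_geoid", "household_is_puf_clone"], toy_contract)
```

Here `red` reports `block_geoid` as missing and `snap_reported` as forbidden, so `exit_code(red)` is 1; `green` is clean and yields 0. Optional columns never affect the result either way.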

Current observed diff (the motivation)

Run against the present MP candidate export, the gate is red:

  • The MP candidate exports ~210 columns against the 239 required, with
    roughly 51 required columns missing, including (all verified present
    in the frozen contract's required set):
    • SCF asset/debt columns -- scf_business_equity, scf_retirement_assets,
      scf_primary_residence_value, scf_mortgage_debt,
      scf_student_loan_debt, scf_credit_card_debt, ...
    • the American Opportunity Credit input family --
      american_opportunity_credit_claimed_prior_years,
      attends_eligible_educational_institution_for_american_opportunity_credit,
      is_enrolled_at_least_half_time_for_american_opportunity_credit, ...
    • retirement contributions -- roth_401k_contributions,
      roth_ira_contributions, traditional_ira_contributions
    • geoids -- block_geoid, tract_geoid, congressional_district_geoid
    • ESI / health premiums -- employer_sponsored_insurance_premiums,
      has_esi, health_insurance_premiums_without_medicare_part_b,
      other_health_insurance_premiums
  • And ~15 forbidden *_reported extras the candidate still exports and
    should drop.

(The exact required-missing count depends on the candidate column set; the
figures above are the observed motivating diff. The committed contract
category counts -- 239 / 5 / 15 -- are exact.)

This PR adds the GATE that catches all of the above in milliseconds. The
data fixes (adding the missing columns, dropping the forbidden ones) are
out of scope here.

Validation

  • tests/pipelines/test_check_export_columns.py -- 15 tests pass
    (missing-required -> exit 1, forbidden-present -> exit 1, clean -> exit 0,
    --columns-json path, H5 group + flat-dataset paths, mutual-exclusivity
    error, contract category-key/size validation, committed-fixture pass).
  • The shared _h5_top_level_columns helper is purely additive; existing
    gate helpers are untouched.
  • ruff format --check and ruff check clean on all changed files.
  • CLI self-check against the committed fixture exits 0, with no microplex
    / torch installed.

This PR adds the GATE only, not the data fixes.

[Generated with Claude Code]

Add a millisecond, local-runnable check comparing an export's column
set to a frozen eCPS contract, so column drift is caught before the
slow MP-300k build. The same required/forbidden diff already runs
inside _compatibility_gate / _column_contract_gate, but only deep in
the artifact path; this surfaces it as the first, cheap CI gate and a
one-line local command.

- check_export_columns.py: argparse CLI with main(argv); positional
  H5 path or --columns-json (no data file); --contract override.
  Reuses the gate's H5 column reader via a shared
  _h5_top_level_columns helper, loaded by file path so neither
  importing nor running the module pulls the microplex_us package
  __init__ (microplex / torch). Exits 1 on missing-required or
  forbidden-present, else 0.
- ecps_export_contract.json: 239 required, 5 eCPS-internal-optional
  (PUF-clone flags), 15 forbidden (*_reported takeup inputs).
- mp300k_artifact_gates.py: add shared _h5_top_level_columns (groups
  and flat datasets, base-name collapse); existing helpers unchanged.
- export-columns.yml: standalone fast CI job (no build/GPU, deps
  limited to pytest/h5py/numpy) running the tests and a fixture
  self-check.
- tests + clean fixture; packaged the contract via force-include.

Adds the gate only, not the underlying data fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxGhenis
Contributor Author

Follow-up from the current artifact/code audit: the forbidden *_reported exports are not coming from SAFE_POLICYENGINE_US_EXPORT_VARIABLES. They are entering through POLICYENGINE_US_LEGACY_CONTRACT_VARIABLE_ENTITIES in src/microplex_us/policyengine/us.py.

Relevant path:

  • build_policyengine_us_export_variable_maps(...) groups allowed variables via _group_policyengine_us_export_variables_by_entity(...).
  • _group_policyengine_us_export_variables_by_entity(...) starts from SAFE_POLICYENGINE_US_EXPORT_VARIABLES | POLICYENGINE_US_EXPORT_DEFAULTS | direct_override_variables, but then unconditionally adds every variable in POLICYENGINE_US_LEGACY_CONTRACT_VARIABLE_ENTITIES by entity.
  • That legacy dict includes the reported/diagnostic family (reported_has_*, reported_is_*, snap_reported, spm_unit_*_reported, tanf_reported, etc.), which explains why they appear in candidate H5s even when grep shows they are absent from SAFE.

This matches the gate report's finding that the forbidden vars are outside SAFE but still exported. The exporter fix should target that legacy-contract bypass, not just SAFE allowlist cleanup.

Separate artifact check: the ACS-donor candidate recovered at /Users/maxghenis/CosilicoAI/microplex-us/artifacts/ecps_shaped_cps_puf_sipp_scf_acs_20260531/resume_correct_targets/policyengine_us.h5 still has 210 top-level vars and only differs from local eCPS by the same 9 eCPS-only variables; after #121 re-export it gains social_security_retirement as candidate-only.

@MaxGhenis
Contributor Author

Correction/refinement to my previous trace: the bypass is both POLICYENGINE_US_EXPORT_DEFAULTS and POLICYENGINE_US_LEGACY_CONTRACT_VARIABLE_ENTITIES, not only the legacy entity map.

Why: _group_policyengine_us_export_variables_by_entity(...) allows SAFE_POLICYENGINE_US_EXPORT_VARIABLES | set(POLICYENGINE_US_EXPORT_DEFAULTS) | direct_override_variables, so every forbidden variable that is in POLICYENGINE_US_EXPORT_DEFAULTS is already eligible. The legacy contract entity map then also force-adds many of the same reported variables by entity. _infer_policyengine_us_table_variable_map(...) finally exports missing allowed defaults via the for target_variable in sorted(set(POLICYENGINE_US_EXPORT_DEFAULTS) & allowed_variables) loop.

So the exporter fix should subtract the forbidden contract set from both routes:

  1. the allowed names built from defaults; and
  2. the legacy-contract force-add loop.
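
A minimal sketch of that two-route subtraction, under stated assumptions: the function and parameter names below are hypothetical stand-ins for the real constants, the example entity assignments are made up, and this does not reproduce `_group_policyengine_us_export_variables_by_entity(...)`.

```python
def exportable_variables(
    safe: set[str],
    defaults: dict[str, object],
    direct_overrides: set[str],
    legacy_entities: dict[str, str],
    forbidden: set[str],
) -> set[str]:
    """Apply the forbidden contract set to both routes into the export."""
    # Route 1: names allowed via SAFE | defaults | direct overrides.
    allowed = (set(safe) | set(defaults) | set(direct_overrides)) - forbidden
    # Route 2: the legacy-contract map that force-adds variables by entity.
    legacy = {name for name in legacy_entities if name not in forbidden}
    return allowed | legacy


# Illustrative inputs only; the real sets live in policyengine/us.py.
result = exportable_variables(
    safe={"employment_income"},
    defaults={"snap_reported": 0.0, "has_esi": False},
    direct_overrides=set(),
    legacy_entities={"tanf_reported": "spm_unit", "scf_mortgage_debt": "household"},
    forbidden={"snap_reported", "tanf_reported"},
)
```

Filtering only one route would still leak: `snap_reported` arrives through the defaults and `tanf_reported` through the legacy map in this toy input, and both have to be removed.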

I also ran the PR #120 contract against the recovered ACS candidate and the current local eCPS file. Both currently fail the new contract's forbidden set because those 15 variables are present; the current local eCPS file also misses many of the same required columns, so this contract is stricter than today's local eCPS artifact rather than a literal mirror of that file.

MaxGhenis and others added 3 commits May 31, 2026 16:40
Regenerate the column-parity contract and clean fixture from the actual
clone-correct baseline H5 (enhanced_cps_2024_postfix_clonecorrect) instead of a
stale snapshot, reconciled with policyengine-us 1.715.2 variable roles:

- required: replace the 5 bare retirement-contribution columns (pe-us formulas,
  desired * scale) with the 5 *_desired INPUT columns the baseline exports.
- required: move in_nyc / has_tin / has_itin / weeks_worked (pe-us formula
  variables) into a new formula_owned_excluded category.
- forbidden: add the 7 PUF_REPORTED_CALCULATED_TAX_OUTPUT_VARIABLES tax-credit
  outputs the baseline excludes but the candidate currently leaks.
- fixture: regenerate ecps_clean_columns.json from the baseline H5 (252 cols).
- tests: update category counts (required 235, forbidden 22,
  formula_owned_excluded 4) and disjointness checks.

Contract now matches the baseline exactly; all 15 gate tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The first-pass contract under-specified `required` relative to the repo's
authoritative `_column_contract_gate`, so a deficient MP export could pass this
fast gate while failing the real one. Reviewer-verified against the 252-column
clone-correct baseline H5 and the in-tree computed-export allow-sets.

- contract: `required` is now the full baseline export minus the 5 clone-
  bookkeeping flags and `weeks_worked` (246 cols). This restores 8 columns the
  first pass silently dropped (difficulty_* x6, fsla_overtime_premium,
  meets_ssi_disability_criteria) and moves has_tin/has_itin/in_nyc back to
  `required` -- they are in POLICYENGINE_US_STRUCTURAL_COMPUTED_EXPORT_VARIABLES
  (structural fields us-data persists), and fsla_overtime_premium /
  meets_ssi_disability_criteria are in the OVERRIDABLE set. Only weeks_worked
  remains in `formula_owned_excluded`.
- check_export_columns.py: `compute_column_diff` now recognizes the
  `formula_owned_excluded` category (it was dead -- never read, so its members
  fell into extra_unknown); `load_contract` defaults it; fix the module
  docstring's stale `_h5_export_compatibility_gate` reference to
  `_column_contract_gate`.
- fixture: regenerated from the baseline H5 (252 cols).
- tests: counts updated to 246/5/22/1; assert structural fields are required and
  excluded == {weeks_worked}; new completeness test asserts the contract covers
  every baseline column (extra_unknown == []), which catches silent
  under-specification.

Verified: 16 gate tests pass; CLI self-check on the clean fixture reports
extra_unknown=0 / RESULT PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

_description referenced the old _h5_export_compatibility_gate name (-> _column_contract_gate)
and _categories.required said '244 minus clone flags' (now 252 baseline - 5 clone flags - weeks_worked = 246).
Metadata only; 16 gate tests still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
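
The count arithmetic in these commits (252 baseline columns - 5 clone flags - weeks_worked = 246 required) and the completeness test can be sketched as below. The `col_NNN` names are placeholders for the real baseline columns, and `derive_required` / `uncovered_columns` are hypothetical helper names, not functions from the PR.

```python
CLONE_FLAGS = {
    "household_is_puf_clone",
    "person_is_puf_clone",
    "spm_unit_is_puf_clone",
    "tax_unit_is_puf_clone",
    "family_is_puf_clone",
}
FORMULA_OWNED_EXCLUDED = {"weeks_worked"}
CATEGORIES = ("required", "ecps_internal_optional", "forbidden", "formula_owned_excluded")


def derive_required(baseline_columns: list[str]) -> list[str]:
    """Required = full baseline export minus clone flags and formula-owned columns."""
    return sorted(set(baseline_columns) - CLONE_FLAGS - FORMULA_OWNED_EXCLUDED)


def uncovered_columns(columns: list[str], contract: dict) -> list[str]:
    """Columns that no contract category accounts for (the extra_unknown bucket)."""
    known = set().union(*(contract.get(key, []) for key in CATEGORIES))
    return sorted(set(columns) - known)


# Placeholder baseline with the same shape as the real one: 246 + 5 + 1 = 252.
baseline = [f"col_{i:03d}" for i in range(246)] + sorted(CLONE_FLAGS) + ["weeks_worked"]
contract = {
    "required": derive_required(baseline),
    "ecps_internal_optional": sorted(CLONE_FLAGS),
    "forbidden": [],
    "formula_owned_excluded": sorted(FORMULA_OWNED_EXCLUDED),
}
```

An empty `uncovered_columns(baseline, contract)` is the completeness property: if a column is silently dropped from `required`, it shows up in that list instead of disappearing.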
@MaxGhenis MaxGhenis marked this pull request as ready for review June 1, 2026 02:53
@MaxGhenis MaxGhenis merged commit 6bc5bd4 into main Jun 1, 2026
5 checks passed
MaxGhenis added a commit that referenced this pull request Jun 1, 2026

Re-measuring Gate-1 coverage after the seven gap-fill PRs surfaced six
contract-required columns no PR had built: difficulty_seeing, difficulty_hearing,
difficulty_walking_or_climbing_stairs, difficulty_dressing_or_bathing,
difficulty_doing_errands, difficulty_remembering_or_making_decisions. They were
missed because the original 47-column gap report ran against the first,
under-specified contract that had dropped them; PR #120 re-added them to the
contract but the imputation lane plan was never regenerated.

These are eCPS final-H5 contract columns present in the newest eCPS builds
(policyengine-us-data PR #1151 and the clone-correct baseline) and absent only
from the older published HF baseline. They recode from the ASEC PEDIS* fields
(PEDIS{X} == 1 -> True; verified difficulty_seeing is byte-identical to is_blind,
both PEDISEYE == 1, in the PR #1151 eCPS export). They are not PolicyEngine-US
variables, so they export as person-level dataset columns via the
legacy-contract entity map (the scf_* pattern).

Microplex already ingested the six PEDIS* fields into _disability_* staging
columns (used to compute is_disabled); this produces the difficulty_* leaves
from that staging before it is dropped, and wires the SAFE export set, the
export defaults (False), and the legacy-entity map. Static config coverage is
now 246/246 contract-required columns.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
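
The recode rule stated above (PEDIS{X} == 1 -> True, otherwise False) is small enough to sketch directly. Only the PEDISEYE -> difficulty_seeing pairing is taken from the commit message; the function name and the sample values are illustrative assumptions, and the real code works from the _disability_* staging columns rather than raw lists.

```python
def recode_difficulty(pedis_values: list[int]) -> list[bool]:
    """Map an ASEC PEDIS* code to a boolean flag: 1 means 'yes', all else False."""
    return [value == 1 for value in pedis_values]


# Made-up sample codes; anything other than 1 (including negative
# not-in-universe style codes) falls back to False.
sample = {"PEDISEYE": [1, 2, -1, 1]}
difficulty_seeing = recode_difficulty(sample["PEDISEYE"])
```

Treating every non-1 code as False agrees with the export default of False mentioned in the commit message.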