Make tax-unit role imputation household-coherent #76

@MaxGhenis

Context

The first mp-300k replacement candidate still loses the sound eCPS comparison, mainly because the constructed PolicyEngine tax-unit structure is too fragmented. This is a separate problem from the comparison-harness bug: the current sound comparison is matched-N and symmetric-refit, sets score_candidate_only=false, checks objective/scoring identity, and verifies eCPS refit recovery.

Latest evidence from /Users/maxghenis/CosilicoAI/microplex-us/artifacts/small_asec_acs100k_role_flags_validation_20260529/:

  • Sound comparison: candidate refit loss 4.6236 vs eCPS 0.1727; candidate holdout 0.6574 vs eCPS 0.0275.
  • Role-flag materialized H5: 177,692 tax units on 100,000 households (1.777 tax units/HH).
  • Matched eCPS: 55,264 tax units on 41,314 households (1.338 tax units/HH).
  • PR #75 (Resolve conflicting tax unit role flags) sanitizes impossible overlapping role flags and reduces the same 100k-household structural probe to 171,058 tax units (1.711/HH), which is still far above eCPS.
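
As a quick arithmetic check on the figures above, the per-household ratios can be recomputed from the reported counts. This is only a sketch: the dictionary keys are labels made up for this example, not names from the artifact.

```python
# Counts copied from the evidence list above; the keys are illustrative labels.
probes = {
    "role_flag_h5": (177_692, 100_000),
    "matched_ecps": (55_264, 41_314),
    "after_pr_75": (171_058, 100_000),
}

# Tax units per household for each probe.
ratios = {name: units / households for name, (units, households) in probes.items()}

# Relative excess of each probe over the matched eCPS ratio.
excess_vs_ecps = {name: r / ratios["matched_ecps"] - 1 for name, r in ratios.items()}

for name in probes:
    print(f"{name}: {ratios[name]:.3f} tax units/HH, "
          f"{excess_vs_ecps[name]:+.1%} vs matched eCPS")
```

The relative excess makes it easier to see how much of the gap the PR #75 sanitization closes and how much remains.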

The root problem appears to be that is_tax_unit_head, is_tax_unit_spouse, and is_tax_unit_dependent are being treated as independent imputed person-level columns. The artifact has impossible overlaps and too many heads per household. Conflict sanitization is necessary but not sufficient: the remaining excess comes from over-imputed head roles and household-incoherent role surfaces.
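
A minimal sketch of how the two symptoms could be measured on a person-level table. The three flag names come from this issue; `household_id` and the toy records are assumptions made for illustration, not the artifact's actual schema.

```python
from collections import Counter

ROLE_FLAGS = ("is_tax_unit_head", "is_tax_unit_spouse", "is_tax_unit_dependent")


def role_overlap_count(persons):
    """Count persons that carry more than one role flag at once."""
    return sum(1 for p in persons if sum(bool(p[f]) for f in ROLE_FLAGS) > 1)


def heads_per_household(persons):
    """Map each household id to the number of persons flagged as head."""
    heads = Counter()
    for p in persons:
        heads[p["household_id"]] += int(bool(p["is_tax_unit_head"]))
    return dict(heads)


# Toy records (hypothetical): household 1 has two heads and one overlapping person.
persons = [
    {"household_id": 1, "is_tax_unit_head": 1, "is_tax_unit_spouse": 0, "is_tax_unit_dependent": 0},
    {"household_id": 1, "is_tax_unit_head": 1, "is_tax_unit_spouse": 1, "is_tax_unit_dependent": 0},
    {"household_id": 1, "is_tax_unit_head": 0, "is_tax_unit_spouse": 0, "is_tax_unit_dependent": 1},
    {"household_id": 2, "is_tax_unit_head": 1, "is_tax_unit_spouse": 0, "is_tax_unit_dependent": 0},
]

overlaps = role_overlap_count(persons)
heads = heads_per_household(persons)
```

When the three flags are imputed independently, nothing prevents the second record above, which is why the overlap has to be ruled out structurally rather than cleaned up afterwards.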

Desired direction

Do not just copy the eCPS structure because it is the incumbent. Use the best Microplex architecture:

  • Treat tax-unit construction as a household/tax-unit relational problem, not three independent binary person variables.
  • Preserve or synthesize coherent memberships where source data has them.
  • Use ACS/ASEC household relationships, spouse pointers, age, marital status, dependent hints, and donor tax-unit surfaces as inputs, but enforce hard per-household consistency before writing PE H5 tables.
  • Keep diagnostics out of the model H5; write sidecars/gates.
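
One way to read the points above is to resolve roles per household rather than per person. The sketch below is only an illustration under stated assumptions, not the Microplex implementation: `role_scores` (imputed per-role scores), `spouse_id`, `claimed_by`, and the tie-breaking rules are all hypothetical.

```python
ROLES = ("head", "spouse", "dependent")


def build_memberships(household):
    """Assign one role and one tax unit per person within a single household."""
    by_id = {p["person_id"]: p for p in household}
    # One categorical role per person, so role overlaps cannot occur.
    role = {pid: max(ROLES, key=p["role_scores"].get) for pid, p in by_id.items()}

    # Mutual spouse pointers: exactly one head and one spouse per couple.
    paired_spouses = set()
    for pid, p in by_id.items():
        partner = p.get("spouse_id")
        mutual = partner in by_id and by_id[partner].get("spouse_id") == pid
        if mutual and pid < partner:
            a, b = pid, partner
            if by_id[b]["role_scores"]["head"] > by_id[a]["role_scores"]["head"]:
                a, b = b, a
            role[a], role[b] = "head", "spouse"
            paired_spouses.add(b)

    # A spouse role without a paired head in the household is not coherent.
    for pid in by_id:
        if role[pid] == "spouse" and pid not in paired_spouses:
            role[pid] = "head"

    # Every household needs at least one head.
    if "head" not in role.values():
        best = max(by_id, key=lambda pid: by_id[pid]["role_scores"]["head"])
        role[best] = "head"

    # Each head opens a tax unit; spouses join their partner's unit.
    heads = sorted((pid for pid in by_id if role[pid] == "head"),
                   key=lambda pid: -by_id[pid]["role_scores"]["head"])
    unit_of = {pid: i for i, pid in enumerate(heads)}
    for pid in by_id:
        if role[pid] == "spouse":
            unit_of[pid] = unit_of[by_id[pid]["spouse_id"]]
    adult_units = dict(unit_of)
    # Assumption: dependents follow a claimed_by hint, else the strongest head.
    for pid, p in by_id.items():
        if role[pid] == "dependent":
            unit_of[pid] = adult_units.get(p.get("claimed_by"), 0)

    return [{"person_id": pid, "tax_unit": unit_of[pid], "role": role[pid]}
            for pid in sorted(by_id)]


# Toy household: independent imputation would make persons 1, 2, and 4 all heads.
household = [
    {"person_id": 1, "spouse_id": 2, "role_scores": {"head": 0.6, "spouse": 0.3, "dependent": 0.1}},
    {"person_id": 2, "spouse_id": 1, "role_scores": {"head": 0.55, "spouse": 0.4, "dependent": 0.05}},
    {"person_id": 3, "spouse_id": None, "claimed_by": 2,
     "role_scores": {"head": 0.1, "spouse": 0.0, "dependent": 0.9}},
    {"person_id": 4, "spouse_id": None, "role_scores": {"head": 0.7, "spouse": 0.2, "dependent": 0.1}},
]
memberships = build_memberships(household)
```

The design choice illustrated here is that the output is a membership table (person, tax unit, role), from which the three binary flags can be derived, instead of three flags from which memberships must be guessed.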

Acceptance criteria

  • Impossible role overlaps are zero after the role-surface construction stage, not only during H5 materialization.
  • Add a diagnostic sidecar with tax units/HH, heads/HH, spouses/HH, dependents/HH, singleton tax-unit share, and role-overlap counts by source/overall.
  • Add a release gate or warning for implausible tax-unit fragmentation, calibrated from source distributions rather than a hard-coded eCPS-only target.
  • Re-run the small ASEC+ACS100k sound comparison and report whether protected-family losses and filing-status-sensitive IRS cells improve.
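
The sidecar and gate criteria above could look roughly like the following. This is a hedged sketch: the row fields `household_id`, `tax_unit_id`, and `source`, the function names, the 10% tolerance, and the per-source ratios in the usage lines are all made up for illustration. Only the three flag names and the 1.711 figure come from this issue.

```python
import json
from collections import Counter, defaultdict

FLAGS = {
    "heads": "is_tax_unit_head",
    "spouses": "is_tax_unit_spouse",
    "dependents": "is_tax_unit_dependent",
}


def structure_diagnostics(rows):
    """Summarize tax-unit structure for one set of person rows."""
    n_hh = len({r["household_id"] for r in rows})
    unit_sizes = Counter((r["household_id"], r["tax_unit_id"]) for r in rows)
    out = {
        "households": n_hh,
        "tax_units_per_hh": len(unit_sizes) / n_hh,
        "singleton_tax_unit_share":
            sum(1 for n in unit_sizes.values() if n == 1) / len(unit_sizes),
        "role_overlaps":
            sum(1 for r in rows if sum(bool(r[f]) for f in FLAGS.values()) > 1),
    }
    for label, flag in FLAGS.items():
        out[f"{label}_per_hh"] = sum(bool(r[flag]) for r in rows) / n_hh
    return out


def diagnostics_sidecar(rows):
    """Overall and per-source diagnostics, kept outside the model H5."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["source"]].append(r)
    return {
        "overall": structure_diagnostics(rows),
        "by_source": {src: structure_diagnostics(g) for src, g in groups.items()},
    }


def fragmentation_gate(candidate_ratio, source_ratios, tolerance=0.10):
    """Flag a candidate whose tax units/HH falls outside the source range."""
    lo = min(source_ratios.values()) * (1 - tolerance)
    hi = max(source_ratios.values()) * (1 + tolerance)
    return {"ok": lo <= candidate_ratio <= hi, "lower": lo, "upper": hi}


def row(hh, tu, source, head=0, spouse=0, dependent=0):
    return {"household_id": hh, "tax_unit_id": tu, "source": source,
            "is_tax_unit_head": head, "is_tax_unit_spouse": spouse,
            "is_tax_unit_dependent": dependent}


# Toy rows (hypothetical); the last one carries an impossible head+dependent overlap.
rows = [
    row(1, 1, "asec", head=1), row(1, 1, "asec", spouse=1), row(1, 2, "asec", head=1),
    row(2, 1, "acs", head=1), row(2, 1, "acs", head=1, dependent=1),
]
sidecar = diagnostics_sidecar(rows)
sidecar_json = json.dumps(sidecar, indent=2, sort_keys=True)

# Source ratios here are invented placeholders, not measured values.
gate = fragmentation_gate(1.711, {"asec": 1.30, "acs": 1.40})
```

Deriving the gate bounds from per-source ratios, rather than from a single eCPS number, matches the requirement that the threshold be calibrated from source distributions.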
