State-specific .h5 datasets undercount population by 30-65% #135

@DTrim99

Summary

State-specific datasets at policyengine/policyengine-us-data (states/{STATE}.h5) appear to systematically undercount population in every state tested. Comparing weighted person counts between each state dataset and the national dataset filtered to that state (via state_code) shows the state datasets hold only ~35-69% of the expected population, i.e. an undercount of roughly 30-65%.

Reproduction

See us/poverty/michigan_under_age_1.ipynb for the minimal Michigan reproducer.

For Michigan (2026):

  • states/MI.h5 weighted total: 4,104,577
  • National dataset filtered to MI: 10,106,712
  • Census MI population: ~10.1M

The state dataset holds only ~41% of the true population. Both totals are computed the same way, by summing person_weight.
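The comparison can be sketched as follows. This is a minimal illustration on toy records, not the notebook's code: the record layout and the helper `weighted_person_count` are hypothetical, and loading the .h5 files is left out. Only the names `person_weight` and `state_code` come from this issue.

```python
# Toy person-level records. In practice these columns would come from the
# state and national datasets (loading step omitted; layout is an assumption).
national_persons = [
    {"state_code": "MI", "person_weight": 1200.0},
    {"state_code": "MI", "person_weight": 800.0},
    {"state_code": "OH", "person_weight": 1500.0},
]
state_persons = [
    {"state_code": "MI", "person_weight": 500.0},
    {"state_code": "MI", "person_weight": 320.0},
]


def weighted_person_count(persons, state_code=None):
    """Sum person_weight, optionally restricted to one state_code."""
    return sum(
        p["person_weight"]
        for p in persons
        if state_code is None or p["state_code"] == state_code
    )


# Same summation for both sources; only the national one needs a filter.
state_total = weighted_person_count(state_persons)
national_total = weighted_person_count(national_persons, state_code="MI")
ratio = state_total / national_total
print(f"state={state_total:,.0f} national={national_total:,.0f} ratio={ratio:.0%}")
```

A ratio near 100% would indicate the two sources agree; the Michigan figures above give roughly 41%.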

Scope

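All 11 states tested undercount:

| State | State dataset total | National (filtered to state) | Ratio |
|-------|---------------------|------------------------------|-------|
| CA    | 16.1M               | 39.3M                        | 41%   |
| TX    | 11.0M               | 31.6M                        | 35%   |
| NY    | 7.6M                | 19.9M                        | 38%   |
| FL    | 8.5M                | 23.5M                        | 36%   |
| PA    | 5.2M                | 13.4M                        | 39%   |
| OH    | 4.6M                | 12.0M                        | 38%   |
| GA    | 4.3M                | 11.5M                        | 38%   |
| NC    | 4.6M                | 11.0M                        | 42%   |
| MI    | 4.1M                | 10.1M                        | 41%   |
| WY    | 385K                | 559K                         | 69%   |
| VT    | 408K                | 591K                         | 69%   |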

Note: large states cluster at ~35-42%; very small states (WY, VT) sit at ~69%. The pattern is consistent enough to suggest a systematic problem in a calibration step rather than random noise.
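For anyone re-checking the ratios, here is a small sketch that recomputes them from the rounded totals in the table. Because the inputs are rounded, a recomputed ratio can differ from the table by about a point (GA comes out nearer 37% this way). The `TOLERANCE` threshold is an arbitrary, hypothetical choice for illustration.

```python
# (state dataset total, national-filtered total) in millions, copied from the
# Scope table. The values are rounded, so the ratios are approximate.
totals_m = {
    "CA": (16.1, 39.3),
    "TX": (11.0, 31.6),
    "NY": (7.6, 19.9),
    "FL": (8.5, 23.5),
    "PA": (5.2, 13.4),
    "OH": (4.6, 12.0),
    "GA": (4.3, 11.5),
    "NC": (4.6, 11.0),
    "MI": (4.1, 10.1),
    "WY": (0.385, 0.559),
    "VT": (0.408, 0.591),
}

ratios = {state: ds / national for state, (ds, national) in totals_m.items()}

# Hypothetical threshold for "roughly matches the national dataset".
TOLERANCE = 0.05
undercounted = [s for s, r in ratios.items() if r < 1 - TOLERANCE]

# Print states from the largest undercount to the smallest.
for state, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{state}: {r:.1%}")
```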

Downstream impact

Any analysis using the state datasets for absolute population counts (poverty headcounts, program enrollment, cost estimates) will be biased low. Rates (poverty rate, share affected) may still be approximately correct if the bias is uniform across the population, but this should be verified.
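To illustrate why rates can survive a uniform undercount but not a non-uniform one, here is a toy sketch. The data and the scaling factors (0.41, 0.30, 0.45) are made up for illustration and say nothing about how the actual state weights are distorted.

```python
def weighted_rate(flags, weights):
    """Weighted share of records whose flag is True."""
    total = sum(weights)
    return sum(w for f, w in zip(flags, weights) if f) / total


# Toy data: whether each person is in poverty, and their weight.
in_poverty = [True, False, False, True, False]
weights = [100.0, 300.0, 200.0, 150.0, 250.0]

baseline = weighted_rate(in_poverty, weights)

# Uniform undercount: every weight is scaled by the same factor,
# so the factor cancels in the numerator and denominator.
uniform = weighted_rate(in_poverty, [0.41 * w for w in weights])

# Non-uniform undercount: weights of people in poverty shrink more,
# which biases the rate downward.
skewed = weighted_rate(
    in_poverty,
    [(0.30 if f else 0.45) * w for f, w in zip(in_poverty, weights)],
)

print(f"baseline={baseline:.3f} uniform={uniform:.3f} skewed={skewed:.3f}")
```

Comparing a few rates between the state dataset and the filtered national dataset would show which of the two cases applies here.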

Notably, the existing us/poverty/state_poverty_rates.ipynb reported a total MI population of 10.17M when run previously, so this appears to be a recent regression in the state datasets.

Cross-check note

The national dataset filtered to a single state has very high variance for granular age cells (e.g., the under-1 count for TX from the national dataset is 666K vs ~370K actual births). Neither approach is reliable for tight demographic slices at the state level, but the systematic undercount in the state datasets is a separate, fixable issue.
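One rough way to gauge how noisy a narrow cell is would be Kish's effective sample size, (Σw)² / Σw², computed over the weights of the records in that cell. This is a suggested diagnostic, not something the notebooks are known to do; the weights below are made up and the helper name is hypothetical.

```python
def kish_effective_sample_size(weights):
    """Kish approximation: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    return total * total / sum_sq if sum_sq else 0.0


# Toy weights for a narrow cell (e.g. under-1 records in one state).
# Both cells have the same weighted total of 4,000 people.
equal_weights = [1000.0] * 4
uneven_weights = [3700.0, 100.0, 100.0, 100.0]

# Equal weights keep the full record count; a cell dominated by one
# heavily weighted record behaves like a much smaller sample.
print(kish_effective_sample_size(equal_weights))
print(kish_effective_sample_size(uneven_weights))
```

A cell whose effective sample size is only a handful of records can easily swing by a large factor, which would be consistent with the TX under-1 figure above.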
