Summary
State-specific datasets at policyengine/policyengine-us-data (states/{STATE}.h5) appear to systematically undercount population. Comparing weighted person counts between each state dataset and the national dataset filtered to that state (via state_code) shows the state datasets at roughly 35-69% of the expected total in all 11 states tested.
Reproduction
See us/poverty/michigan_under_age_1.ipynb for the minimal Michigan reproducer.
For Michigan (2026):
- states/MI.h5 weighted total: 4,104,577
- National dataset filtered to MI: 10,106,712
- Census MI population: ~10.1M
The state dataset is at ~41% of true population. The same person_weight summation method is used in both cases.
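The comparison above can be sketched as follows. This is a minimal illustration rather than the notebook's code: the arrays are synthetic stand-ins, state codes are shown as strings purely for readability, and `weighted_total` and `coverage_ratio` are hypothetical helper names. Only the Michigan totals at the end come from this report.

```python
import numpy as np


def weighted_total(person_weight, mask=None):
    """Sum person weights, optionally restricted to a boolean mask."""
    weights = np.asarray(person_weight, dtype=float)
    if mask is None:
        return float(weights.sum())
    return float(weights[np.asarray(mask, dtype=bool)].sum())


def coverage_ratio(state_total, national_filtered_total):
    """Share of the national-filtered total that the state file reproduces."""
    return state_total / national_filtered_total


# Synthetic stand-ins for the two sources (assumption: not real records).
state_weights = np.array([1200.0, 900.0, 2000.0])  # state file analogue
national_weights = np.array([3000.0, 4100.0, 2500.0, 2900.0])
national_state_code = np.array(["MI", "OH", "MI", "MI"])

# Same summation in both cases; the national file is filtered first.
state_total = weighted_total(state_weights)
national_mi_total = weighted_total(national_weights, national_state_code == "MI")
synthetic_ratio = coverage_ratio(state_total, national_mi_total)

# The same ratio for the Michigan totals reported above.
mi_ratio = coverage_ratio(4_104_577, 10_106_712)
print(f"synthetic: {synthetic_ratio:.1%}, reported MI: {mi_ratio:.1%}")
```

Keeping the summation in one helper makes it easier to rule out a difference in method as the source of the gap.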
Scope
Tested 11 states — all undercount:
| State | State dataset total | National (filtered to state) | Ratio |
|---|---|---|---|
| CA | 16.1M | 39.3M | 41% |
| TX | 11.0M | 31.6M | 35% |
| NY | 7.6M | 19.9M | 38% |
| FL | 8.5M | 23.5M | 36% |
| PA | 5.2M | 13.4M | 39% |
| OH | 4.6M | 12.0M | 38% |
| GA | 4.3M | 11.5M | 38% |
| NC | 4.6M | 11.0M | 42% |
| MI | 4.1M | 10.1M | 41% |
| WY | 385K | 559K | 69% |
| VT | 408K | 591K | 69% |
Note: large states cluster at ~35-42%, while very small states (WY, VT) sit at ~69%. The pattern is consistent enough to suggest a systematic problem in a calibration step rather than random noise.
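The ratios in the table above can be recomputed with a short script like the one below. It is a sketch under stated assumptions: the totals are the rounded figures from the table, and the 95% cutoff for flagging an undercount is a hypothetical threshold chosen only for illustration.

```python
# (state dataset total, national dataset filtered to the state), in persons,
# copied from the rounded figures in the table above.
totals = {
    "CA": (16.1e6, 39.3e6),
    "TX": (11.0e6, 31.6e6),
    "NY": (7.6e6, 19.9e6),
    "FL": (8.5e6, 23.5e6),
    "PA": (5.2e6, 13.4e6),
    "OH": (4.6e6, 12.0e6),
    "GA": (4.3e6, 11.5e6),
    "NC": (4.6e6, 11.0e6),
    "MI": (4.1e6, 10.1e6),
    "WY": (385e3, 559e3),
    "VT": (408e3, 591e3),
}

ratios = {state: ds / national for state, (ds, national) in totals.items()}

# Hypothetical threshold: flag anything below 95% of the national-filtered total.
UNDERCOUNT_THRESHOLD = 0.95
undercounted = [s for s, r in ratios.items() if r < UNDERCOUNT_THRESHOLD]

# Print states from the lowest ratio to the highest.
for state, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{state}: {ratio:.0%}")
```

Because the inputs are rounded, a recomputed ratio can differ from the table by about a percentage point.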
Downstream impact
Any analysis using the state datasets for absolute population counts (poverty headcounts, program enrollment, cost estimates) will be biased low. Rates (poverty rate, share affected) may still be approximately correct if the bias is uniform across the population, but this should be verified.
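The distinction between counts and rates can be illustrated with a small synthetic example. This is a sketch, not a diagnosis of the actual datasets: the weights, the poverty flags, and the scaling factors (0.41, 0.30, 0.45) are all made-up values chosen to show the two cases.

```python
import math

import numpy as np

# Synthetic population (assumption: illustrative values, not real records).
weights = np.array([1000.0, 1500.0, 800.0, 1200.0, 500.0])
in_poverty = np.array([True, False, False, True, False])


def headcount(w, flag):
    """Weighted number of people for whom the flag is true."""
    return float(w[flag].sum())


def rate(w, flag):
    """Weighted share of people for whom the flag is true."""
    return headcount(w, flag) / float(w.sum())


# Case 1: every weight shrunk by the same factor (uniform bias).
uniform = weights * 0.41

# Case 2: weights shrunk more for people in poverty (non-uniform bias).
skewed = weights * np.where(in_poverty, 0.30, 0.45)

for label, w in [("true", weights), ("uniform", uniform), ("skewed", skewed)]:
    print(f"{label}: headcount={headcount(w, in_poverty):.0f}, "
          f"rate={rate(w, in_poverty):.3f}")
```

Under uniform scaling the headcount shrinks by the scaling factor while the rate is unchanged. Once the scaling differs between groups, the rate shifts as well, which is why the uniformity assumption needs checking.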
Notably, the existing us/poverty/state_poverty_rates.ipynb reported MI total population of 10.17M when run previously — so this appears to be a recent regression in the state datasets.
Cross-check note
The national dataset filtered to a single state has very high variance for granular age cells (e.g., the under-1 count for TX from the national dataset is 666K vs ~370K actual births). Neither approach is reliable for tight demographic slices at the state level, but the systematic undercount in the state datasets is a separate, fixable issue.