State-specific .h5 datasets undercount population by 30-65% (#135)

## Summary

State-specific datasets at `policyengine/policyengine-us-data` (`states/{STATE}.h5`) appear to systematically undercount population in every state tested. Comparing weighted person counts between each state dataset and the national dataset filtered to that state (via `state_code`) shows that the state datasets contain only ~35-69% of the expected population, i.e. an undercount of roughly 30-65%.

## Reproduction

See `us/poverty/michigan_under_age_1.ipynb` for the minimal Michigan reproducer.

For Michigan (2026):
- `states/MI.h5` weighted total: **4,104,577**
- National dataset filtered to MI: **10,106,712**
- Census MI population: **~10.1M**

The state dataset total is ~41% of the true population (4,104,577 / 10,106,712). Both totals are computed the same way, by summing `person_weight`, so the gap is not an artifact of the counting method.
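For reference, the comparison boils down to the arithmetic below. This is a minimal sketch, not the notebook's code: `weighted_count` and `coverage_ratio` are hypothetical helper names, the toy weights and state codes are made up, and loading `person_weight` and `state_code` from the actual datasets is omitted. Only the two Michigan totals come from this issue.

```python
def weighted_count(weights, mask=None):
    """Sum person weights, optionally restricted to rows where mask is True."""
    if mask is None:
        return float(sum(weights))
    return float(sum(w for w, keep in zip(weights, mask) if keep))


def coverage_ratio(state_total, national_filtered_total):
    """Share of the expected population that the state dataset accounts for."""
    return state_total / national_filtered_total


# Toy illustration of filtering a national dataset by state (made-up values).
toy_weights = [1000.0, 2500.0, 1500.0]
toy_state_codes = ["MI", "OH", "MI"]
mi_mask = [code == "MI" for code in toy_state_codes]
toy_mi_total = weighted_count(toy_weights, mi_mask)

# Totals reported in this issue for Michigan (2026).
MI_STATE_TOTAL = 4_104_577          # states/MI.h5
MI_NATIONAL_FILTERED = 10_106_712   # national dataset filtered to MI

ratio = coverage_ratio(MI_STATE_TOTAL, MI_NATIONAL_FILTERED)
print(f"MI coverage: {ratio:.1%}")
```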

## Scope

Tested 11 states; all undercount:

| State | State dataset total | National (filtered to state) | Ratio (state / national) |
|---|---|---|---|
| CA | 16.1M | 39.3M | 41% |
| TX | 11.0M | 31.6M | 35% |
| NY | 7.6M | 19.9M | 38% |
| FL | 8.5M | 23.5M | 36% |
| PA | 5.2M | 13.4M | 39% |
| OH | 4.6M | 12.0M | 38% |
| GA | 4.3M | 11.5M | 38% |
| NC | 4.6M | 11.0M | 42% |
| MI | 4.1M | 10.1M | 41% |
| WY | 385K | 559K | 69% |
| VT | 408K | 591K | 69% |

Note: the nine large states cluster at ~35-42% of expected, while the two very small states (WY, VT) sit at ~69%. The pattern is consistent enough to suggest a systematic problem, possibly in a calibration step, rather than random noise.
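The range can be recomputed from the table as a quick sanity check. The sketch below uses only the rounded totals listed above, so an individual ratio may differ from the table's Ratio column by about a percentage point; the variable names are illustrative.

```python
# (state dataset total, national dataset filtered to state), from the table above.
totals = {
    "CA": (16.1e6, 39.3e6),
    "TX": (11.0e6, 31.6e6),
    "NY": (7.6e6, 19.9e6),
    "FL": (8.5e6, 23.5e6),
    "PA": (5.2e6, 13.4e6),
    "OH": (4.6e6, 12.0e6),
    "GA": (4.3e6, 11.5e6),
    "NC": (4.6e6, 11.0e6),
    "MI": (4.1e6, 10.1e6),
    "WY": (385e3, 559e3),
    "VT": (408e3, 591e3),
}

# Coverage = state total as a share of the national-filtered total.
coverage = {state: s / n for state, (s, n) in totals.items()}

lowest = min(coverage, key=coverage.get)
highest = max(coverage, key=coverage.get)

# Undercount is the complement of coverage.
max_undercount = 1 - coverage[lowest]
min_undercount = 1 - coverage[highest]

print(f"coverage range: {coverage[lowest]:.0%} ({lowest}) to {coverage[highest]:.0%} ({highest})")
print(f"undercount range: {min_undercount:.0%} to {max_undercount:.0%}")
```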

## Downstream impact

Any analysis using the state datasets for absolute population counts (poverty headcounts, program enrollment, cost estimates) will be biased low. Rates (poverty rate, share affected) may still be approximately correct if the bias is uniform across the population, but this should be verified.
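To illustrate why rates can survive while counts do not, here is a toy sketch with made-up weights and poverty flags (none of these values come from the datasets). Scaling every weight by the same factor leaves the weighted rate unchanged but shrinks the headcount; scaling groups by different factors shifts the rate as well, which is the case that needs verifying.

```python
def weighted_headcount(weights, flags):
    """Weighted number of people for whom the flag is True."""
    return sum(w for w, f in zip(weights, flags) if f)


def weighted_rate(weights, flags):
    """Weighted share of people for whom the flag is True."""
    return weighted_headcount(weights, flags) / sum(weights)


# Made-up example: four sampled people with weights and poverty flags.
weights = [1200.0, 800.0, 1500.0, 500.0]
in_poverty = [True, False, False, True]

# Uniform undercount: every weight scaled by the same factor.
uniform = [w * 0.41 for w in weights]

# Non-uniform undercount: people in poverty lose more weight than others.
non_uniform = [w * (0.35 if p else 0.45) for w, p in zip(weights, in_poverty)]

base_rate = weighted_rate(weights, in_poverty)
uniform_rate = weighted_rate(uniform, in_poverty)
non_uniform_rate = weighted_rate(non_uniform, in_poverty)

print(f"headcount: {weighted_headcount(weights, in_poverty):.0f} -> "
      f"{weighted_headcount(uniform, in_poverty):.0f}")
print(f"rate: base {base_rate:.3f}, uniform {uniform_rate:.3f}, "
      f"non-uniform {non_uniform_rate:.3f}")
```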

Notably, the existing `us/poverty/state_poverty_rates.ipynb` reported a total MI population of 10.17M when run previously, so this appears to be a recent regression in the state datasets.

## Cross-check note

The national dataset filtered to a single state has very high variance for granular age cells (e.g., the under-1 count for TX from the national dataset is 666K vs ~370K actual births). Neither approach is reliable for tight demographic slices at the state level, but the systematic undercount in the state datasets is a separate, fixable issue.
