Motivation
The API v2 alpha's create_datasets() is extremely slow (~1hr+ per state) because it routes every variable through sim.calculate(), invoking the full simulation engine for each variable × each year. The UK avoids this entirely: policyengine-uk has extend_single_year_dataset() which uprates DataFrames via simple multiplication — no simulation engine needed.
Now that policyengine-us-data is being updated to publish entity-level HDFStore files alongside the existing h5py files (PolicyEngine/policyengine-us-data#567), policyengine-us needs the machinery to consume them.
Changes
1. Dataset schema classes (data/dataset_schema.py)
- USSingleYearDataset: holds one DataFrame per entity (person, household, tax_unit, spm_unit, family, marital_unit) plus time_period. Supports load/save/copy.
- USMultiYearDataset: holds a dict of USSingleYearDataset indexed by year. load() returns {variable: {year: array}}, compatible with policyengine-core's TIME_PERIOD_ARRAYS format.
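The shape of these two classes can be sketched roughly as follows. This is a minimal illustration, not the code in data/dataset_schema.py: the class names SingleYearSketch and MultiYearSketch, the tables and datasets attributes, and the column-flattening logic in load() are all assumptions made for the example. Only the entity list and the {variable: {year: array}} return shape come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np
import pandas as pd

# Entity names taken from the issue text.
ENTITIES = ("person", "household", "tax_unit", "spm_unit", "family", "marital_unit")


@dataclass
class SingleYearSketch:
    """Hypothetical stand-in for USSingleYearDataset."""

    tables: dict[str, pd.DataFrame]  # entity name -> one DataFrame
    time_period: int

    def copy(self) -> "SingleYearSketch":
        # Copy every DataFrame so edits to the copy do not leak back.
        copied = {name: df.copy() for name, df in self.tables.items()}
        return SingleYearSketch(copied, self.time_period)


@dataclass
class MultiYearSketch:
    """Hypothetical stand-in for USMultiYearDataset."""

    datasets: dict[int, SingleYearSketch]  # year -> single-year dataset

    def load(self) -> dict[str, dict[int, np.ndarray]]:
        # Flatten entity tables into {variable: {year: array}}.
        arrays: dict[str, dict[int, np.ndarray]] = {}
        for year, dataset in self.datasets.items():
            for df in dataset.tables.values():
                for column in df.columns:
                    arrays.setdefault(column, {})[year] = df[column].to_numpy()
        return arrays
```

Keeping one table per entity means a year can be duplicated with a plain DataFrame copy, which is what makes the uprating step in the next section cheap.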
2. extend_single_year_dataset() (data/economic_assumptions.py)
Copies base-year DataFrames for each year through end_year (default 2035), then applies multiplicative uprating:
- Reads the uprating parameter path from system.variables[var].uprating at runtime
- Computes the growth factor as param(current_year) / param(prev_year) (absolute index ratio)
- Variables without uprating are carried forward unchanged
- No separate uprating list to maintain: picks up changes from variable definitions and default_uprating.py automatically
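The year-over-year multiplication described above might look like the sketch below. It works on a single DataFrame and uses two hypothetical lookup tables, UPRATING_PATH and INDEX_VALUES, in place of system.variables[var].uprating and the real parameter tree; the parameter path "index.wages" and the index values are invented for illustration. Each year's frame is copied from the previous year and scaled by the one-year ratio, which is assumed here to be equivalent to copying the base year and chaining the same ratios.

```python
from __future__ import annotations

import pandas as pd

# Hypothetical stand-ins for system.variables[var].uprating and the parameter tree.
UPRATING_PATH = {"employment_income": "index.wages"}  # variable -> parameter path
INDEX_VALUES = {"index.wages": {2024: 100.0, 2025: 104.0, 2026: 106.08}}


def extend_frames(
    base: pd.DataFrame, start_year: int, end_year: int
) -> dict[int, pd.DataFrame]:
    """Return one DataFrame per year from start_year through end_year."""
    frames = {start_year: base.copy()}
    for year in range(start_year + 1, end_year + 1):
        df = frames[year - 1].copy()
        for column in df.columns:
            path = UPRATING_PATH.get(column)
            if path is None:
                continue  # no uprating defined: carry the column forward unchanged
            index = INDEX_VALUES[path]
            factor = index[year] / index[year - 1]  # absolute index ratio
            df[column] = df[column] * factor
        frames[year] = df
    return frames


base = pd.DataFrame({"employment_income": [1000.0, 2000.0], "age": [30, 40]})
frames = extend_frames(base, 2024, 2026)
```

No simulation is constructed at any point; the cost per year is one column-wise multiplication for each uprated variable.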
3. Dual-path loading in Microsimulation (system.py)
- _is_hdfstore_format(): detects entity-level HDFStore vs variable-centric h5py by checking top-level HDF5 keys
- _resolve_dataset_path(): resolves HuggingFace URLs and local paths
- Microsimulation.__init__: when an HDFStore file is detected, loads it as USSingleYearDataset, extends via extend_single_year_dataset, and passes the resulting USMultiYearDataset to the parent class
- Legacy h5py files continue to work via the existing code path
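One way the key-based detection could work is sketched below. The exact rule used by _is_hdfstore_format() is not stated here, so the function name looks_like_entity_store and the rule itself (a top-level "person" key marks an entity-level file) are assumptions for illustration. The sketch takes the keys as plain strings so it runs without opening a file.

```python
from typing import Iterable


def looks_like_entity_store(top_level_keys: Iterable[str]) -> bool:
    """Guess whether a file is entity-level from its top-level HDF5 keys.

    Assumed rule: an entity-level file has a top-level "person" table,
    while a variable-centric file is keyed by variable names instead.
    """
    # pandas HDFStore reports keys with a leading slash ("/person"),
    # h5py reports them without one ("person"); normalize both forms.
    normalized = {key.lstrip("/") for key in top_level_keys}
    return "person" in normalized


entity_level = looks_like_entity_store(["/person", "/household", "/tax_unit"])
variable_centric = looks_like_entity_store(["employment_income", "age"])
```

With a real file, the keys could be read with h5py via `h5py.File(path, "r").keys()` and passed to the function; the constructor would then branch to the HDFStore path or the legacy path based on the result.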
Depends on
- PolicyEngine/policyengine-us-data#567: Add entity-level HDFStore output format alongside h5py