
Add extend_single_year_dataset for fast dataset year projection #7699

@anth-volk


Motivation

The API v2 alpha's create_datasets() is extremely slow (~1hr+ per state) because it routes every variable through sim.calculate(), invoking the full simulation engine for each variable × each year. The UK avoids this entirely: policyengine-uk has extend_single_year_dataset() which uprates DataFrames via simple multiplication — no simulation engine needed.

Now that policyengine-us-data is being updated to publish entity-level HDFStore files alongside the existing h5py files (PolicyEngine/policyengine-us-data#567), policyengine-us needs the machinery to consume them.

Changes

1. Dataset schema classes (data/dataset_schema.py)

  • USSingleYearDataset — holds one DataFrame per entity (person, household, tax_unit, spm_unit, family, marital_unit) + time_period. Supports load/save/copy.
  • USMultiYearDataset — holds a dict of USSingleYearDataset indexed by year. load() returns {variable: {year: array}} compatible with policyengine-core's TIME_PERIOD_ARRAYS format.
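As a rough illustration of the `{variable: {year: array}}` shape described above, here is a minimal sketch. It is not the code in data/dataset_schema.py: the class names `SingleYearSketch` and `MultiYearSketch` are hypothetical, plain dicts of lists stand in for the per-entity DataFrames, and `employment_income` and `household_weight` are used only as example column names.

```python
from dataclasses import dataclass, field


@dataclass
class SingleYearSketch:
    """Hypothetical stand-in for USSingleYearDataset."""
    time_period: int
    # entity name -> {variable name -> list of values}
    # (a dict of lists stands in for one DataFrame per entity)
    tables: dict = field(default_factory=dict)


@dataclass
class MultiYearSketch:
    """Hypothetical stand-in for USMultiYearDataset."""
    datasets: dict  # year -> SingleYearSketch

    def load(self):
        # Pivot from year -> entity -> variable into variable -> year -> values.
        out = {}
        for year, dataset in sorted(self.datasets.items()):
            for columns in dataset.tables.values():
                for variable, values in columns.items():
                    out.setdefault(variable, {})[year] = list(values)
        return out


multi = MultiYearSketch({
    2024: SingleYearSketch(2024, {
        "person": {"employment_income": [10.0, 20.0]},
        "household": {"household_weight": [1.5]},
    }),
    2025: SingleYearSketch(2025, {
        "person": {"employment_income": [11.0, 22.0]},
        "household": {"household_weight": [1.5]},
    }),
})
loaded = multi.load()
```

Here `loaded["employment_income"][2025]` holds the 2025 person-level values, so each variable can be looked up by year without knowing which entity table it came from.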

2. extend_single_year_dataset() (data/economic_assumptions.py)

Copies base-year DataFrames for each year through end_year (default 2035), then applies multiplicative uprating:

  • Reads uprating parameter path from system.variables[var].uprating at runtime
  • Computes growth factor as param(current_year) / param(prev_year) (absolute index ratio)
  • Variables without uprating are carried forward unchanged
  • No separate uprating list to maintain — picks up changes from variable definitions and default_uprating.py automatically
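The uprating loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in data/economic_assumptions.py: `extend_sketch`, `uprating_of`, and `index` are hypothetical names, dicts of lists stand in for DataFrames, the index values are made up, and each year is built from the previous year's copy so that the year-over-year ratios chain.

```python
import copy


def extend_sketch(base_tables, base_year, end_year, uprating_of, index):
    """Project base-year tables forward by multiplicative uprating.

    base_tables: {entity: {variable: [values]}}
    uprating_of: {variable: parameter path}, absent if the variable is not uprated
    index:       {parameter path: {year: index value}}
    """
    years = {base_year: copy.deepcopy(base_tables)}
    for year in range(base_year + 1, end_year + 1):
        current = copy.deepcopy(years[year - 1])
        for columns in current.values():
            for variable, values in columns.items():
                path = uprating_of.get(variable)
                if path is None:
                    continue  # no uprating: carried forward unchanged
                # Growth factor is the ratio of absolute index values.
                factor = index[path][year] / index[path][year - 1]
                columns[variable] = [value * factor for value in values]
        years[year] = current
    return years


base = {"person": {"employment_income": [1000.0], "age": [40]}}
uprating_of = {"employment_income": "hypothetical.wage_index"}
index = {"hypothetical.wage_index": {2024: 100.0, 2025: 150.0, 2026: 300.0}}
projected = extend_sketch(base, 2024, 2026, uprating_of, index)
```

With these made-up index values, `employment_income` grows by a factor of 1.5 and then 2.0, while `age` has no uprating entry and is copied through as-is. Note that this sketch does not age people forward; it only shows the multiplicative step.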

3. Dual-path loading in Microsimulation (system.py)

  • _is_hdfstore_format() — detects entity-level HDFStore vs variable-centric h5py by checking top-level HDF5 keys
  • _resolve_dataset_path() — resolves HuggingFace URLs and local paths
  • Microsimulation.__init__ — when an HDFStore file is detected, loads it as USSingleYearDataset, extends via extend_single_year_dataset, and passes the resulting USMultiYearDataset to the parent class
  • Legacy h5py files continue to work via the existing code path
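A minimal sketch of the key-based detection idea follows. The actual rule used by `_is_hdfstore_format()` is not specified here, so this assumes that entity-level files expose top-level groups named after entities while variable-centric files expose variable names. The function name `looks_like_entity_store` and the `REQUIRED_ENTITY_KEYS` set are hypothetical. The function takes the key list as input (in practice obtained from something like `h5py.File(path).keys()`) so that it stays testable without opening a file.

```python
# Assumption: an entity-level store has at least these top-level groups.
REQUIRED_ENTITY_KEYS = {"person", "household"}


def looks_like_entity_store(top_level_keys):
    """Guess the file layout from its top-level HDF5 keys.

    Returns True for an entity-level layout and False for a
    variable-centric one, under the assumption stated above.
    """
    # pandas HDFStore reports keys with a leading "/", h5py without.
    names = {key.lstrip("/") for key in top_level_keys}
    return REQUIRED_ENTITY_KEYS <= names


entity_keys = ["/person", "/household", "/tax_unit", "/spm_unit"]
variable_keys = ["employment_income", "age", "household_id"]
is_entity_file = looks_like_entity_store(entity_keys)
is_legacy_file = not looks_like_entity_store(variable_keys)
```

A caller could branch on this result: the entity-level path loads a single-year dataset and extends it, while the legacy path hands the file to the existing loader unchanged.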

Depends on
