User-set ETERNITY inputs are dropped after Simulation._invalidate_all_caches (3.24.0+) #482

@juaristi22

Description

Symptom

When a Microsimulation is constructed from a dataset that supplies an ETERNITY-defined variable as input, calculate/calculate_dataframe returns the variable's default value for every row instead of the dataset-provided values, as soon as anything triggers Simulation._invalidate_all_caches (e.g. apply_reform, subsample).

We hit this in policyengine-us-data after upgrading to policyengine-us 1.674.1 / policyengine-core 3.25.3. is_household_head (a bool ETERNITY variable on Person) came back all-False from acs.calculate_dataframe(["is_household_head", ...]), even though the ACS H5 dataset has it correctly populated. Filtering by is_household_head then yielded zero rows and downstream code crashed.

Suspected mechanism

In policyengine_core/simulations/simulation.py:

  • Holder.set_input records (variable_name, branch_name, period) into simulation._user_input_keys using the user-supplied period (e.g. Period(2024) from a dataset's time_period).
  • For an ETERNITY-defined variable, Holder._set canonicalizes storage to period=ETERNITY, so _memory_storage._arrays is keyed by \"default:eternity\".
  • _invalidate_all_caches then iterates _user_input_keys and looks up f\"{branch_name}:{period}\" (\"default:2024\"), which misses the canonicalized \"default:eternity\" entry.
  • The preserve-loop therefore does not re-add the array, the holder's _arrays is wiped during the iteration over population._holders, and subsequent calculate calls fall back to the variable's default_value.
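
To make the suspected key mismatch concrete, here is a toy model in plain Python. It does not import or reproduce policyengine-core: the dict `storage`, the set `user_input_keys`, and both functions are hypothetical stand-ins for `_memory_storage._arrays`, `_user_input_keys`, `Holder.set_input`, and `_invalidate_all_caches`, simplified to a single variable.

```python
# Toy model only: plain dicts stand in for the holder storage and the
# simulation's _user_input_keys. None of this is policyengine-core code.
ETERNITY = "eternity"

storage = {}             # stands in for _memory_storage._arrays
user_input_keys = set()  # stands in for simulation._user_input_keys


def set_input(variable, branch, period, value, definition_period):
    # The key is recorded with the period the caller supplied...
    user_input_keys.add((variable, branch, period))
    # ...but storage is canonicalized for ETERNITY-defined variables.
    stored_period = ETERNITY if definition_period == ETERNITY else period
    storage[f"{branch}:{stored_period}"] = value


def invalidate_all_caches():
    # Preserve only the arrays that the recorded keys can find, then wipe.
    preserved = {}
    for _variable, branch, period in user_input_keys:
        key = f"{branch}:{period}"
        if key in storage:
            preserved[key] = storage[key]
    storage.clear()
    storage.update(preserved)


set_input("flag", "default", "2024", [True, False], definition_period=ETERNITY)
keys_before = sorted(storage)  # the value lives under "default:eternity"
invalidate_all_caches()        # looks up "default:2024" and misses
keys_after = sorted(storage)   # nothing was preserved
print(keys_before, keys_after)
```

In this model the lookup key is built from the recorded period while the stored key uses the canonicalized one, so the input does not survive the invalidation. Whether the real code follows exactly this path still needs confirming against the source.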

Affected versions: introduced sometime in 3.24.x (the _invalidate_all_caches + _user_input_keys machinery is in 3.24.0/3.24.4/3.25.x; not present in 3.23.6).

Minimal repro sketch

```python
import numpy as np
from policyengine_core.data import Dataset
from policyengine_core.simulations import Simulation

# ... a tiny tax-benefit system (`tbs`) with one ETERNITY bool variable `flag` ...


class Tiny(Dataset):
    data_format = Dataset.ARRAYS
    time_period = 2024  # any non-ETERNITY period

    def generate(self):
        # Write {"flag": [True, False], "person_id": [1, 2], ...}
        ...


# `some_reform` stands for any reform; it is not defined in this sketch.
sim = Simulation(dataset=Tiny, tax_benefit_system=tbs)
print(sim.calculate("flag"))  # [True, False]
sim.apply_reform(some_reform)  # triggers _invalidate_all_caches
print(sim.calculate("flag"))  # [False, False] ← bug
```

Suggested fix directions

In `_user_input_keys` accounting (in `Holder.set_input` and `Simulation.set_input`), normalize the recorded `period` to `ETERNITY` whenever `variable.definition_period == ETERNITY`, so the preserve-loop's `f"{branch_name}:{period}"` lookup matches the canonicalized storage key. Same normalization in the preserve-loop's lookup would also work.
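
As a rough illustration of the first direction, the sketch below normalizes the period at recording time. It is a standalone toy, not a patch: `canonical_period`, `record_user_input`, the string value of `ETERNITY`, and the dict-based storage are all hypothetical names and shapes chosen for the example, and the real key layout in policyengine-core may differ.

```python
# Hypothetical stand-ins: the real ETERNITY constant and key layout in
# policyengine-core may differ from this toy version.
ETERNITY = "eternity"


def canonical_period(period, definition_period):
    """Map any user-supplied period to ETERNITY for ETERNITY-defined variables."""
    return ETERNITY if definition_period == ETERNITY else period


def record_user_input(user_input_keys, variable, branch, period, definition_period):
    """Record the key with the same period that storage will use."""
    user_input_keys.add(
        (variable, branch, canonical_period(period, definition_period))
    )


storage = {"default:eternity": [True, False]}  # canonicalized storage key
user_input_keys = set()
record_user_input(user_input_keys, "flag", "default", "2024", ETERNITY)

# The preserve-loop style lookup now hits the stored entry.
lookup_keys = [f"{branch}:{period}" for _, branch, period in user_input_keys]
preserved = {key: storage[key] for key in lookup_keys if key in storage}
print(preserved)
```

Normalizing once at recording time keeps the preserve-loop unchanged. Applying the same `canonical_period` step inside the lookup instead would be the second option mentioned above.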

Impact

Any country package that loads ETERNITY variables from a `Dataset` whose `time_period != ETERNITY` and that subsequently calls `apply_reform` or `subsample` will silently see those inputs replaced with default values. In `policyengine-us-data`, this took down the CPS data build's rent imputation step.
