Forbid randomness inside variable formulas#500
Merged
Conversation
A rules-engine formula must be a pure, deterministic function of its inputs: identical inputs must always produce identical outputs. Calling a random number generator inside a formula breaks that contract and makes datasets non-reproducible (a property_purchased assignment built on unseeded np.random recently spiked a UK income decile's tax rate and blocked data releases for ~2 weeks). While a formula runs, forbid_randomness replaces the public callables of numpy.random and the stdlib random module with functions that raise NonDeterministicFormulaError. Seeding does not make randomness acceptable in a formula; stochastic inputs must be precomputed deterministically when building the dataset and stored as inputs. The guard is re-entrant and restores the namespaces once the outermost formula returns. Removes the per-variable np.random.seed previously applied at the top of calculate(), which existed only to make formula-level randomness reproducible and now conflicts with the guard. Core-internal seeding in __init__ (for non-formula randomness) is unchanged. No formula in policyengine-uk or policyengine-us uses randomness, so this is enforcement of an already-held invariant. Full core suite passes (580 tests).
Independent review of the determinism guard surfaced two interception gaps and missing exception-safety coverage: - A generator hoisted to module scope (rng = np.random.default_rng(0)) and used inside a formula is not caught: numpy's generator classes are immutable C extension types, so their bound methods cannot be patched. (Building a generator inside the formula IS caught, since the np.random.default_rng constructor is patched.) - A drawing function imported by name before the guard installs (from numpy.random import random) is not caught. Neither is closeable without heavy machinery, and no PolicyEngine formula uses randomness, so document both accurately in the module docstring and pin them with tests so the boundary cannot drift silently. Corrects an earlier draft docstring that wrongly claimed the generator classes were patched. Adds tests: - randomness namespaces are restored after a formula raises a non-RNG exception (pins _run_formula guarded-block exception safety), - constructing a generator inside a formula raises, - the two known gaps behave as documented.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
A rules-engine formula must be a pure, deterministic function of its inputs — identical inputs must always produce identical outputs. Calling an RNG inside a formula breaks that contract and makes whole datasets non-reproducible.
This is not hypothetical: a
property_purchasedassignment inpolicyengine-uk-databuilt on unseedednp.randomintermittently spiked the UK first income decile's effective tax rate to 251%, failing the data build's sanity test and blocking all UK data releases for ~2 weeks. The data-side fixes (seed the draw; flip the model default) address that instance; this PR makes the whole class impossible at the engine level.What
While a formula runs,
forbid_randomnessreplaces the public callables ofnumpy.randomand the stdlibrandommodule with functions that raiseNonDeterministicFormulaError.np.random.default_rng(0).random()inside a formula still raises. Stochastic inputs must be precomputed deterministically when building the dataset and stored as input variables.Simulation._run_formula.Also removes the per-variable
np.random.seed(...)previously applied at the top of everycalculate(). It existed only to make formula-level randomness reproducible — now moot — and it conflicted with the guard (it fires during nested formula execution). Core-internal seeding in__init__(governing legitimate non-formula randomness, e.g. axes) is unchanged.Impact
Enforcement of an already-held invariant: no formula in
policyengine-ukorpolicyengine-ususes randomness (grepacross bothvariables/trees returns zero).Test plan
tests/core/test_randomness_guard.py: numpy random, stdlib random, and seeded-generator calls inside a formula all raise; deterministic formulas unaffected; namespaces restored afterwards; re-entrancy correct.