Random seed utilities #496
Conversation
alibi_detect/utils/random.py
Outdated
import torch

def set_seed(seed):
`seed: int` — Is `None` valid/desirable here? (e.g. you can do `np.random.seed(None)` to seed with an unknown state).
Yes I wanted that, but annoyingly `torch.random.manual_seed` doesn't support `None`. I guess we could take `None`, and then generate a random seed ourselves using system time etc., but I didn't want to complicate matters too much for now, especially since `random`, `np.random` and `tf.random` all randomly generate seeds in different ways if `None` is passed.
On second thoughts, in the event of `None`, we could call `np.random.seed(None)` first, and then get that seed to use to set the others?
I don't think it's worth complicating at this point; the utility of a "random random seed" is also not clear in a function specifically for setting a known seed.
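For orientation, a minimal runnable sketch of the kind of `set_seed` being reviewed here (Python and NumPy only; the real utility also seeds TensorFlow and PyTorch, indicated as comments since this sketch assumes those libraries may not be installed):

```python
import os
import random

import numpy as np


def set_seed(seed: int):
    """Set the global seeds for Python's `random`, hashing, and NumPy's legacy RNG."""
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # The actual utility also seeds the DL frameworks, roughly:
    # tf.random.set_seed(seed)
    # torch.manual_seed(seed)


set_seed(0)
a = (random.random(), np.random.rand())
set_seed(0)
b = (random.random(), np.random.rand())
assert a == b  # re-seeding reproduces the same draws
```

Note that `torch.manual_seed` requires an integer, which is why `None` cannot simply be passed straight through.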
alibi_detect/utils/random.py
Outdated
"""
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
np.random.seed(seed)
Not sure if relevant, but what was the deal with using new-style `numpy` random number generators?
AFAIK, this shouldn't affect the numpy generators, in that they don't appear to be affected by `np.random.seed()` at all. A very crude example:
import numpy as np
from alibi_detect.utils.random import set_seed, get_seed

rng = np.random.default_rng()
set_seed(0)
a = rng.random()
set_seed(0)
b = rng.random()
assert a == b  # False: new-style generators ignore the global seed

rng = np.random.default_rng(0)
a = rng.random()
rng = np.random.default_rng(0)
b = rng.random()
assert a == b  # True
I've changed `utils.random` to private for now, since going forward we might want to more deeply consider how we can move from the legacy `numpy` RNG to generators (I assume this would involve passing a `seed` kwarg to detectors, a la `scikit-learn`).
alibi_detect/utils/random.py
Outdated
try:
    yield
finally:
    set_seed(orig_seed)
I'm not sure resetting the seed (or the RNG state, if that was the intention) is that easy (see also above). E.g. what happens in the following scenario, where the different libraries' RNGs are out of sync?

set_seed(42) -> some numpy calls -> set_seed(other)
I think the new global variable approach fixes this? I've extended the tests in `tests_random.py` to check that the seed returned by `get_seed` is unchanged after RNG calls.
Since a number of existing tests still use …

else:
    raise RuntimeError('`set_seed` must be called before `get_seed` can be called.')
What's the reason for not accepting `None` as valid? Think I'm a bit lost in the motivation of these utilities again...
`set_seed` doesn't work with `None` since `torch.manual_seed()` doesn't accept `None` (#496 (comment)).
WRT motivation, the primary motivation of this PR now is to provide the `fixed_seed` context manager for use in tests. We want to be able to run detectors deterministically, i.e.

with fixed_seed(42):
    detector1 = MMDDrift(X_ref, ...)
    preds1 = detector1.predict(x_h0)

But we want to avoid setting global seeds, since these transfer into the next `pytest` test (I've checked this with a simple demo). This is quite an issue, since in some tests we don't want the seeds fixed (see #250). If we move to using generators (e.g. `np.random.Generator`) in the future, and have `MMDDrift(x_ref, ..., random_seed=...)`, we can probably get rid of this entire `_random` module.
The `set_seed` and `get_seed` methods might be helpful to a user in some way, but they're primarily for use with `fixed_seed`, so I've kept the submodule private for now.

Does the above help or am I barking up the wrong tree? 🙂
See individual comments.
Rebased onto …
requirements/dev.txt
Outdated
types-requests~=2.25
types-toml~=0.10
Conscious that we would not keep up with new minor releases if going with this?
To clarify, `types-requests~=2.25` means `>=2.25, ==2.*`, so I think we are OK, but I have changed it to `types-requests>=2.25, <3.0` for consistency with elsewhere.
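To illustrate the claim, a hand-rolled check of PEP 440's compatible-release operator for this specific case (a hypothetical simplification handling numeric versions only; real resolution should use the `packaging` library):

```python
def satisfies_tilde_2_25(version: str) -> bool:
    """Simplified check of `~=2.25`, i.e. `>=2.25, ==2.*` (numeric segments only)."""
    v = tuple(int(p) for p in version.split("."))
    return v[0] == 2 and v[:2] >= (2, 25)


assert not satisfies_tilde_2_25("2.24")   # below the floor
assert satisfies_tilde_2_25("2.25")       # lower bound included
assert satisfies_tilde_2_25("2.31.0")     # new minor releases are picked up
assert not satisfies_tilde_2_25("3.0")    # major bump excluded
```

This matches `>=2.25, <3.0` for ordinary releases, which is why the two spellings are interchangeable here.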
LGTM
This PR implements a number of final fixes and refactorings for the config-driven detectors:

1. The `preprocess_at_init` and `preprocess_at_pred` logic implemented in #381 and #458 has been reworked. This turned out to have a problem in how it dealt with `update_x_ref`: regardless of `x_ref_preprocessed`, we still need to update reference data within `.predict()` when `update_x_ref` is set. All offline drift detectors have been reworked to use the old logic (but with `preprocess_x_ref` renamed to `preprocess_at_init`), with the addition that `self.x_ref_preprocessed` is also checked internally.
2. The previous `get_config` methods involved a lot of boilerplate to try to recover the original args/kwargs from detector attributes. The new approach calls a generic `_set_config()` within `__init__`, and then `self.config` is returned by `get_config`. This should significantly reduce the workload to add save/load to new detectors. To avoid memory overheads, large artefacts such as `x_ref` are not set at `__init__`, and instead are added within `get_config`.
3. Owing to the ease of implementation with the new `get_config` approach, save/load has been added for the model uncertainty and online detectors!
4. Kernels and `preprocess_fn`'s were previously resolved in `_load_detector_config`, which wasn't consistent with how other artefacts were resolved (it also caused extra challenges). These are now resolved in `resolve_config` instead. Following this, the `KernelConfigResolved` and `PreprocessConfigResolved` pydantic models have been removed (they could be added back, but it would complicate `resolve_config`).
5. Fixing determinism in #496 has allowed us to compare original and loaded detector predictions in `test_saving.py`. This uncovered bugs with how kernels were saved and loaded. These have been fixed.
6. The readthedocs.yml has been fully updated to the V2 schema so that we can use Python 3.9 for building the docs.
This is required as the `class NDArray(Generic[T], np.ndarray[Any, T])` in `utils._typing` causes an error with `autodoc` on older Python versions.
This PR adds utility functions to get and set seeds for the Python, NumPy, TensorFlow and PyTorch random number generators.
Motivation
Some detectors such as `MMDDrift` and `LSDDDrift` involve operations using NumPy, TensorFlow and/or PyTorch random number generators. The resulting non-deterministic behavior is a problem for testing, as it prevents us performing checks such as asserting that two identically-configured detectors give identical predictions.

Setting all random seeds (with `np.random.seed()`, `tf.random.set_seed()` etc.) at the start of the test is insufficient to provide determinism here, since the operations on the first detector will still cycle through the RNG states, such that the second detector would see a different RNG state. To ensure determinism we would instead have to set seeds again before the init and prediction of the second detector.

Setting global random seeds in individual `pytest` functions is also undesirable in general, and leads to a difficult-to-maintain and unpredictable test suite.

Solution
This PR addresses the above through two changes:
Introduces the use of pytest-randomly for testing. This package manages the random seeds used for each test. The seed is randomly changed for each test, which avoids us being tricked by tests passing due to a "friendly" hardcoded random state. Importantly, the random seed is reported for each test e.g.:
Using --randomly-seed=1553614239
and a test can then be repeated with a given random seed for debugging purposes:
pytest --randomly-seed=1234 alibi_detect -k <name_of_failed_test>
A new submodule `alibi_detect.utils.random` is introduced. This contains functionality to set and get RNG seeds (the relevant seeds for NumPy, TensorFlow, PyTorch etc. are all set at once). A context manager `fixed_seed` is also introduced to enable select lines of code to be run with a given seed in an isolated fashion, e.g.:

The `fixed_seed`'s above set all the random seeds to 42 within the `with` block, but they are reset to their original values upon exit of the block.

The `alibi_detect.utils.random` submodule has been made public as it might be quite useful in example notebooks, experiments etc.

Note: If we decided we didn't want the new utilities public, we could hide them within `tests` instead; however, we would not be able to set `set_seeds` as an `entry_point` for `pytest`, meaning `pytest` wouldn't use our custom seed setter internally (our custom one also sets the TF and torch random seeds, whereas the `pytest` reseed function only sets Python and NumPy seeds).
function only sets Python and Numpy seeds).What does this PR not address
Going forward we might want to avoid setting seeds altogether, by using generators such as tf.random.Generator and np.random.Generator. This would likely involve passing a
seed
kwarg to detectors (similar to thescikit-learn
random_state
). This PR doesn't help with this, and is instead a stop-gap to be used in the interim.Related issues
Related to #250.
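As a rough sketch of that future direction (`ToyDetector` and its `random_state` kwarg are hypothetical, not an actual alibi-detect API):

```python
import numpy as np


class ToyDetector:
    """Illustrative detector holding its own Generator instead of touching global seeds."""

    def __init__(self, x_ref, random_state=None):
        self.x_ref = np.asarray(x_ref)
        self.rng = np.random.default_rng(random_state)  # isolated RNG state

    def predict(self, x):
        # Toy "permutation" statistic using only the detector's own rng
        perm = self.rng.permutation(len(self.x_ref))
        return float(np.mean(self.x_ref[perm][: len(x)]) - np.mean(x))


x_ref = np.arange(100.0)
x = np.arange(50.0)
d1 = ToyDetector(x_ref, random_state=0)
d2 = ToyDetector(x_ref, random_state=0)
# Identical seeds give identical predictions, with no global RNG state mutated
assert d1.predict(x) == d2.predict(x)
```

Because each detector draws from its own `np.random.Generator`, tests can compare two detectors without any `set_seed`/`fixed_seed` machinery, which is why the `_random` module could then be retired.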