
Random seed utilities #496

Conversation

@ascillitoe (Contributor) commented May 3, 2022

This PR adds utility functions to get and set seeds for the Python, Numpy, TensorFlow and PyTorch random number generators.

Motivation

Some detectors such as MMDDrift and LSDDDrift involve operations using NumPy, TensorFlow and/or PyTorch random number generators. The resulting non-deterministic behavior is a problem for testing, as it prevents us performing checks such as:

random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

detector1 = MMDDrift(X_ref, ...)
preds1 = detector1.predict(x_h0)

detector2 = MMDDrift(X_ref, ...)
preds2 = detector2.predict(x_h0)

assert preds1['data']['p_val'] == preds2['data']['p_val']

Setting all random seeds (with np.random.seed(), tf.random.set_seed() etc.) at the start of the test is insufficient to provide determinism here, since the operations of the first detector still advance the RNG states, so the second detector sees a different RNG state. To ensure determinism we would instead have to set the seeds again before the init and prediction of the second detector, as sketched below.
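
For illustration, a minimal sketch of what the test above would need without the new utilities (reusing the hypothetical X_ref and x_h0 from the snippet above):

import random

import numpy as np
import tensorflow as tf


def reseed_all(seed):
    # Illustrative helper: re-seed every global RNG the detector might touch.
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)


reseed_all(42)
detector1 = MMDDrift(X_ref)
preds1 = detector1.predict(x_h0)

reseed_all(42)  # must re-seed: detector1 has advanced the global RNG states
detector2 = MMDDrift(X_ref)
preds2 = detector2.predict(x_h0)

assert preds1['data']['p_val'] == preds2['data']['p_val']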

Setting global random seeds in individual pytest functions is also undesirable in general, and leads to a difficult-to-maintain and unpredictable test suite.

Solution

This PR addresses the above through two changes:

  1. Introduces the use of pytest-randomly for testing. This package manages the random seeds used for each test. The seed is randomly changed for each test, which avoids us being tricked by tests passing due to a "friendly" hardcoded random state. Importantly, the random seed is reported for each test e.g.:

    Using --randomly-seed=1553614239

    and a test can then be repeated with a given random seed for debugging purposes:

    pytest --randomly-seed=1234 alibi_detect -k <name_of_failed_test>

  2. A new submodule alibi_detect.utils.random is introduced. This contains functionality to set and get RNG seeds (the relevant seeds for NumPy, TensorFlow, PyTorch etc. are all set at once). A context manager fixed_seed is also introduced to enable select lines of code to be run with a given seed in an isolated fashion, e.g.:

    with fixed_seed(42):
       detector1 = MMDDrift(X_ref, ...)
       preds1 = detector1.predict(x_h0)
    
    with fixed_seed(42):
       detector2 = MMDDrift(X_ref, ...)
       preds2 = detector2.predict(x_h0)
    
    assert preds1['data']['p_val'] == preds2['data']['p_val']

    The fixed_seed calls above set all the random seeds to 42 within their with blocks, but the seeds are reset to their original values upon exit of each block.
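
    For reference, a minimal sketch of roughly how such a context manager can be built on top of set_seed and get_seed (the actual implementation in this PR may differ in its details):

    from contextlib import contextmanager

    @contextmanager
    def fixed_seed(seed):
        # Sketch only: the PR's actual implementation may differ.
        orig_seed = get_seed()  # seed recorded by the most recent set_seed call
        set_seed(seed)
        try:
            yield
        finally:
            set_seed(orig_seed)  # restore the original seed on exiting the block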

The alibi_detect.utils.random submodule has been made public as it might be quite useful in example notebooks, experiments etc.

Note: If we decided we didn't want the new utilities public, we could hide them within the tests instead, however we would not be able to register set_seed as an entry_point for pytest, meaning pytest wouldn't use our custom seed setter internally (our custom setter also sets the TF and PyTorch random seeds, whereas the pytest-randomly reseed function only sets the Python and NumPy seeds). A sketch of the entry-point registration is given below.
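
For reference, a minimal sketch of how a seed setter can be registered with pytest-randomly via its pytest_randomly.random_seeder entry point group (shown here in setup.py form; the exact module path and packaging file used in this PR are assumptions):

from setuptools import setup

setup(
    name='alibi-detect',  # other metadata omitted
    entry_points={
        # pytest-randomly calls the registered function with the seed used for each test.
        'pytest_randomly.random_seeder': [
            'alibi-detect = alibi_detect.utils._random:set_seed',
        ],
    },
)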

What this PR does not address

Going forward we might want to avoid setting seeds altogether, by using generators such as tf.random.Generator and np.random.Generator. This would likely involve passing a seed kwarg to detectors (similar to the scikit-learn random_state). This PR doesn't help with this, and is instead a stop-gap to be used in the interim.
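
As a rough illustration of that future direction (names below are hypothetical, not the actual detector API), each detector could own an isolated generator seeded via a kwarg, so that global RNG state is never touched:

import numpy as np


class SomeDetector:
    """Hypothetical detector sketch, not the real MMDDrift implementation."""

    def __init__(self, x_ref, seed=None):
        self.x_ref = x_ref
        # An isolated new-style Generator; seed=None gives non-deterministic behaviour.
        self._rng = np.random.default_rng(seed)

    def _permutation(self, n):
        # All randomness is drawn from the detector's own Generator.
        return self._rng.permutation(n)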

Related issues

Related to #250.

@jklaise self-requested a review May 3, 2022 12:28
@ascillitoe requested a review from mauicv May 3, 2022 12:36
import torch


def set_seed(seed):
jklaise (Member) commented:

seed: int
Is None valid/desirable here (e.g. you can do np.random.seed(None) to seed with an unknown state)?

ascillitoe (Contributor, Author) replied:

Yes I wanted that, but annoyingly torch.random.manual_seed doesn't support None. I guess we could take None, and then generate a random seed ourselves using system time etc, but I didn't want to complicate matters too much for now, especially since random, np.random and tf.random all randomly generate seeds in different ways if None is passed.

ascillitoe (Contributor, Author) replied:

On second thoughts, in the event of None, we could call np.random.seed(None) first, and then get that seed to use to set the others?

jklaise (Member) replied:

I don't think it's worth complicating at this point; the utility of a "random random seed" is also not clear in a function specifically for setting a known seed.

"""
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
np.random.seed(seed)
jklaise (Member) commented:

Not sure if relevant, but what was the deal with using new-style numpy random number generators?

@ascillitoe (Contributor, Author) replied May 3, 2022:

AFAIK this shouldn't matter for the new-style numpy generators, since they don't appear to be affected by np.random.seed() at all. A very crude example:

import numpy as np

from alibi_detect.utils.random import set_seed

# set_seed does not re-seed an independent new-style Generator...
rng = np.random.default_rng()
set_seed(0)
a = rng.random()
set_seed(0)
b = rng.random()
assert a != b  # the two draws differ: set_seed(0) had no effect on rng

# ...whereas seeding the Generator directly is deterministic.
rng = np.random.default_rng(0)
a = rng.random()
rng = np.random.default_rng(0)
b = rng.random()
assert a == b  # identical draws

I've changed utils.random to private for now, since going forward we might want to consider more deeply how we can move from the legacy numpy RNG to generators (I assume this would involve passing a seed kwarg to detectors, à la scikit-learn).

    try:
        yield
    finally:
        set_seed(orig_seed)
jklaise (Member) commented:

I'm not sure resetting the seed (or the RNG state, if that was the intention) is that easy (see also above). E.g. what happens in the following scenario, where the different libraries' RNGs are out of sync?

set_seed(42) -> some numpy calls -> set_seed(other)

ascillitoe (Contributor, Author) replied:

I think the new global variable approach fixes this? I've extended the tests in tests_random.py to check that the seed returned by get_seed is unchanged after RNG calls.
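
(For reference, a minimal sketch of the global-variable idea; the module in this PR may differ in its details:)

import os
import random

import numpy as np
import tensorflow as tf
import torch

_seed = None  # module-level record of the seed passed to the last set_seed call


def set_seed(seed):
    # Sketch: set all supported RNG seeds at once and record the value.
    global _seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    _seed = seed


def get_seed():
    # Returning the recorded value means later RNG calls cannot change it.
    if _seed is None:
        raise RuntimeError('`set_seed` must be called before `get_seed` can be called.')
    return _seed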

@ascillitoe (Contributor, Author) commented May 3, 2022

Since a number of existing tests still use np/tf.random.seed(0) etc. in numerous places, I've hardcoded --randomly-seed=0 in setup.cfg for now, so that existing tests don't randomly fail (reworking these tests to avoid setting global seed values will need to be part of a wider project). This does mean that pytest-randomly is not actually doing anything! However, I've left it in the PR since the related code to set up set_seed as an entry point is already there, and it should pave the way for the future work. I can remove pytest-randomly for now though, if people think that would be better...

Comment on lines +51 to +52
    else:
        raise RuntimeError('`set_seed` must be called before `get_seed` can be called.')
jklaise (Member) commented:

What's the reason for not accepting None as valid? Think I'm a bit lost in the motivation of these utilities again...

ascillitoe (Contributor, Author) replied:

set_seed doesn't work with None since torch.manual_seed() doesn't accept None (#496 (comment)).

ascillitoe (Contributor, Author) replied:

WRT the motivation: the primary aim of this PR now is to provide the fixed_seed context manager for use in tests. We want to be able to run detectors deterministically, i.e.

with fixed_seed(42):
   detector1 = MMDDrift(X_ref, ...)
   preds1 = detector1.predict(x_h0)

But we want to avoid setting global seeds, since these carry over into the next pytest test (I've checked this with a simple demo). This is quite an issue, since in some tests we don't want the seeds fixed (see #250). If we move to using generators (e.g. np.random.Generator) in the future, and have MMDDrift(x_ref, ..., random_seed=...), we can probably get rid of this entire _random module.

The set_seed and get_seed methods might be helpful to a user in some way, but they're primarily for use with fixed_seed so I've kept the submodule private for now.
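
(As a concrete sketch of the test-isolation point, with hypothetical test names:)

def test_mmd_deterministic():
    # Seeds are only fixed inside the with blocks...
    with fixed_seed(42):
        detector1 = MMDDrift(X_ref)
        preds1 = detector1.predict(x_h0)
    with fixed_seed(42):
        detector2 = MMDDrift(X_ref)
        preds2 = detector2.predict(x_h0)
    assert preds1['data']['p_val'] == preds2['data']['p_val']
    # ...and the original seeds are restored on exit, so the global RNG state
    # does not leak into whichever test pytest collects next.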

Does the above help or am I barking up the wrong tree? 🙂

@jklaise (Member) left a comment:

See individual comments.

@ascillitoe changed the base branch from master to feature/config_driven_detectors May 10, 2022 10:21
@ascillitoe (Contributor, Author) commented May 10, 2022

Rebased onto feature/config_driven_detectors since this functionality is initially intended for use in tests within #497.

Comment on lines 5 to 6
types-requests~=2.25
types-toml~=0.10
jklaise (Member) commented:

Conscious that we would not keep up with new minor releases if going with this?

ascillitoe (Contributor, Author) replied:

To clarify, types-requests~=2.25 means >= 2.25, == 2.*, so I think we are OK, but have changed to types-requests>=2.25, <3.0 for consistency with elsewhere.

@jklaise (Member) left a comment:

LGTM

@ascillitoe merged commit a87153b into SeldonIO:feature/config_driven_detectors May 16, 2022
ascillitoe added a commit that referenced this pull request May 25, 2022
This PR implements a number of final fixes and refactorings for the config driven detectors:

1. The `preprocess_at_init` and `preprocess_at_pred` logic implemented in #381 and #458 has been reworked. This turned out to have a problem in how it dealt with `update_x_ref`, since regardless of `x_ref_preprocessed`, we still need to update reference data within `.predict()` when `update_x_ref` is set. All offline drift detectors have been reworked to use the old logic (but with `preprocess_x_ref` renamed `preprocess_at_init`), with the addition that `self.x_ref_preprocessed` is also checked internally. 
2. The previous `get_config` methods involved a lot of boilerplate to try to recover the original args/kwargs from detector attributes. The new approach calls a generic `_set_config()` within `__init__`, and then `self.config` is returned by `get_config`. This should significantly reduce the workload to add save/load to new detectors. To avoid memory overheads, large artefacts such as `x_ref` are not set at `__init__`, and instead are added within `get_config`.
3. Owing to the ease of implementation with the new `get_config` approach, save/load has been added for the model uncertainty and online detectors!
4. Kernels and `preprocess_fn`'s were previously resolved in `_load_detector_config`, which wasn't consistent with how other artefacts were resolved (it also caused added extra challenges). These are now resolved in `resolve_config` instead. Following this the `KernelConfigResolved` and `PreprocessConfigResolved` pydantic models have been removed (they could be added back but it would complicate `resolve_config`).
5. Fixing determinism in #496 has allowed us to compare original and loaded detector predictions in `test_saving.py`. This uncovered bugs with how kernels were saved and loaded. These have been fixed.
6. The readthedocs.yml has been fully updated to the V2 schema so that we can use Python 3.9 for building the docs. This is required as the `class NDArray(Generic[T], np.ndarray[Any, T])` in `utils._typing` causes an error with `autodoc` on older Python versions.
@ascillitoe mentioned this pull request Nov 17, 2022