Support numpy.random.RandomState objects (take 2) #556

dsherry · 2020-04-01T06:46:46Z

Closes #347.

A version of #530 which won't break the Windows tests.

Will retarget on master once #554 and #555 are merged.

Changes

Have all components and pipelines take random_state as a keyword argument
Have the entire codebase accept random_state as either an int seed or a numpy.random.RandomState object
Add get_random_state helper to standardize to np.random.RandomState objects
Provide a way for components which don't support np.random.RandomState objects to get random seeds, via a get_random_seed method
Ensure getting random seed will be safe on 32-bit systems using a SEED_BOUNDS range constant
Add unit tests to increase coverage
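
The get_random_state helper described in the list above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: the function body, the error message, and the choice to accept NumPy integer types are assumptions made for illustration.

```python
import numpy as np


def get_random_state(seed):
    """Normalize an int seed or a RandomState into a np.random.RandomState.

    Illustrative sketch only; the real helper in this PR may differ.
    """
    # An existing generator is passed through untouched so callers can share it.
    if isinstance(seed, np.random.RandomState):
        return seed
    # An integer seed is wrapped in a fresh generator.
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    raise ValueError(
        "random_state must be an int or a np.random.RandomState, got {!r}".format(seed)
    )


# Usage: both call styles give the caller a RandomState to draw from.
rng_from_int = get_random_state(0)
rng_from_obj = get_random_state(np.random.RandomState(0))
```

Passing the same integer twice yields two generators that produce the same sequence, while passing a RandomState returns the same object, so draws advance the caller's generator.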

Building off of #441, opening to test my own changes on top of Angela's work (thank you @angela97lin!)

Note @kmax12: we had discussed sticking with seeds internally, but using np.random.RandomState turned out to be the simpler option.

codecov · 2020-04-01T06:50:46Z

Codecov Report

Merging #556 into master will increase coverage by 13.53%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           master     #556       +/-   ##
===========================================
+ Coverage   85.30%   98.83%   +13.53%     
===========================================
  Files         115      115               
  Lines        4205     4297       +92     
===========================================
+ Hits         3587     4247      +660     
+ Misses        618       50      -568

Impacted Files	Coverage Δ
evalml/automl/auto_classification_search.py	`100.00% <ø> (ø)`
evalml/automl/auto_regression_search.py	`100.00% <ø> (ø)`
evalml/pipelines/components/utils.py	`100.00% <ø> (ø)`
evalml/preprocessing/utils.py	`100.00% <ø> (ø)`
evalml/tuners/skopt_tuner.py	`100.00% <ø> (ø)`
evalml/tuners/tuner.py	`100.00% <ø> (ø)`
evalml/automl/auto_base.py	`96.19% <100.00%> (+3.67%)`	⬆️
evalml/pipelines/classification/catboost.py	`100.00% <100.00%> (+14.28%)`	⬆️
evalml/pipelines/classification/xgboost.py	`100.00% <100.00%> (+12.50%)`	⬆️
evalml/pipelines/components/component_base.py	`88.88% <100.00%> (ø)`
... and 47 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2e0288e...85fae99.

dsherry · 2020-04-01T06:53:07Z

This was the error coming out of all the Windows unit tests: ValueError: high is out of bounds for int32, from the call to randint. It happened because I was passing in 2**32 - 1 as the upper bound, but the max int on 32-bit systems is 2**31 - 1 🤦‍♂️
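
The failure can be reproduced on any platform by asking randint for int32 output explicitly. The snippet below is a sketch for illustration: forcing dtype=np.int32 is an assumption used to mimic the integer width the Windows tests hit, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)

# Largest value a signed 32-bit integer can hold: 2**31 - 1.
INT32_MAX = int(np.iinfo(np.int32).max)

# An upper bound of 2**32 - 1 does not fit in int32, so randint rejects it.
try:
    rng.randint(0, 2**32 - 1, dtype=np.int32)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Using the int32 maximum as the (exclusive) upper bound is accepted.
seed = int(rng.randint(0, INT32_MAX, dtype=np.int32))
```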

dsherry · 2020-04-01T06:55:19Z

evalml/pipelines/components/estimators/classifiers/catboost_classifier.py

@@ -26,6 +26,7 @@ class CatBoostClassifier(Estimator):
    supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

    def __init__(self, n_estimators=1000, eta=0.03, max_depth=6, bootstrap_type=None, random_state=0):
+        random_seed = get_random_seed(random_state, 0, SEED_BOUNDS.max_bound)


The catboost estimators need random_seed to be between 0 and max int, otherwise they throw an error.

dsherry · 2020-04-01T06:57:20Z

evalml/pipelines/components/estimators/classifiers/xgboost_classifier.py

@@ -19,13 +19,14 @@ class XGBoostClassifier(Estimator):
    supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

    def __init__(self, eta=0.1, max_depth=3, min_child_weight=1, n_estimators=100, random_state=0):
+        random_seed = get_random_seed(random_state, SEED_BOUNDS.min_bound, SEED_BOUNDS.max_bound)


Like catboost, the xgboost classifier needs random_seed to be an int, not np.random.RandomState. The weird thing is, passing random_state worked fine on Linux... so it appears it's only on Windows that xgboost requires this!
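
The pattern in these two review comments, converting a RandomState into an integer seed before handing it to a third-party estimator, can be sketched as follows. Everything here is an assumption for illustration: StubEstimator and make_estimator are hypothetical names standing in for the real component classes, and the SEED_BOUNDS values are taken from the int32 range rather than from the merged code.

```python
from collections import namedtuple

import numpy as np

# Assumption: bounds derived from the signed 32-bit integer range.
_int32 = np.iinfo(np.int32)
SEED_BOUNDS = namedtuple("SEED_BOUNDS", ("min_bound", "max_bound"))(
    int(_int32.min), int(_int32.max)
)


class StubEstimator:
    """Hypothetical stand-in for a library estimator that only accepts int seeds."""

    def __init__(self, random_seed):
        if not isinstance(random_seed, int):
            raise TypeError("random_seed must be an int")
        self.random_seed = random_seed


def make_estimator(random_state=0):
    """Accept an int or a RandomState, but always pass an int seed downstream."""
    if isinstance(random_state, np.random.RandomState):
        # Draw a non-negative seed; the upper bound is exclusive.
        seed = int(random_state.randint(0, SEED_BOUNDS.max_bound))
    else:
        seed = int(random_state)
    return StubEstimator(random_seed=seed)


# Usage: equal RandomState seeds lead to equal derived integer seeds.
est = make_estimator(np.random.RandomState(42))
```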

dsherry · 2020-04-01T07:03:42Z

evalml/utils/gen_utils.py

+        return random_state.randint(min_bound, max_bound)
+    if random_state < min_bound or random_state >= max_bound:
+        return random_state % min(abs(min_bound), abs(max_bound))
+    return random_state


I was hoping to avoid defining a function like this because it introduces complexity. But having it gives us a pattern to follow when we add new pipelines which require integer seeds instead of np.random.RandomState.

I think this is a good solution!

Thanks. I still wish it were simpler and therefore easier to understand, but I'm not sure there's a better alternative right now. Let's continue to think about it :)
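
One detail of the excerpt above is worth noting: when min_bound is 0, as in the catboost call, min(abs(min_bound), abs(max_bound)) is 0, so an out-of-range integer seed would hit a modulo by zero. The sketch below shows one possible alternative that shifts the value into the range instead. It is not the code merged in this PR; the bounds check and the int conversion are assumptions added for illustration.

```python
import numpy as np


def get_random_seed(random_state, min_bound, max_bound):
    """Return an int seed in [min_bound, max_bound). Illustrative sketch only."""
    if not min_bound < max_bound:
        raise ValueError("min_bound must be strictly less than max_bound")
    # A RandomState supplies a fresh integer within the requested range.
    if isinstance(random_state, np.random.RandomState):
        return int(random_state.randint(min_bound, max_bound))
    # An out-of-range int is wrapped around into the range. The divisor is the
    # width of the range, which is never zero given the check above.
    if random_state < min_bound or random_state >= max_bound:
        return ((random_state - min_bound) % (max_bound - min_bound)) + min_bound
    return random_state


# Usage: -1 wraps to the top of [0, 10) and 10 wraps to the bottom.
wrapped_low = get_random_seed(-1, 0, 10)   # 9
wrapped_high = get_random_seed(10, 0, 10)  # 0
```

Wrapping by the range width keeps in-range seeds unchanged and maps every other integer to a valid seed deterministically, which preserves reproducibility for callers that pass large or negative ints.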

* Squash random_state work from 347_random_state
* Lint
* Lint
* Changelog
* Lint
* Test update
* Always pass random_state to components
* Lint
* Fix bug: set random state first. Remove usages of random_state as dict param item in test_pipelines.py
* update test for clarity
* Fix catboost
* Update logreg test
* Lint catboost
* Update tuner impl to handle random_state
* Test changes
* Lint
* Docs changes
* Add unit test for get_random_state
* Update test
* Remove uncalled code after my changes
* Fix tests after rebase
* Add unit test coverage for RandomSearchTuner.is_search_space_exhausted
* Add unit test coverage for max_time
* Add test coverage of get_pipeline when invalid
* Lint
* Add unit test coverage of when fit/score throws in autobase
* Remove duplicate
* Lets try that again... got mysterious docs failure

jeremyliweishih

Looks good to me!

dsherry requested review from kmax12, angela97lin and jeremyliweishih April 1, 2020 06:46

dsherry self-assigned this Apr 1, 2020

dsherry force-pushed the ds_revert_347_random_state branch from d60a2e7 to fbaa89f Compare April 1, 2020 14:09

dsherry force-pushed the ds_347_random_state_windows branch 2 times, most recently from a295283 to 33ef6ba Compare April 1, 2020 15:15

dsherry mentioned this pull request Apr 1, 2020

Release 0.8.0 #532

Closed

dsherry added 7 commits April 1, 2020 11:49

Get min/max int instead of using fixed number which is incorrect for 32bit systems

b02c3e8

Add limits to seed range for xgboost too

fe747fb

Introduce a sustainable pattern for generating random seeds from RNGs for different classes

bcdd2d9

Update changelog

fec067d

Use SEED_BOUNDS in unit tests

377ed8f

Update comment

85fae99

dsherry force-pushed the ds_347_random_state_windows branch from 33ef6ba to 85fae99 Compare April 1, 2020 15:49

dsherry changed the base branch from ds_revert_347_random_state to master April 1, 2020 15:50

dsherry requested review from christopherbunn and rwedge April 1, 2020 18:04

jeremyliweishih approved these changes Apr 1, 2020

dsherry merged commit 9bafdd2 into master Apr 1, 2020

dsherry deleted the ds_347_random_state_windows branch April 1, 2020 18:57

angela97lin mentioned this pull request Apr 1, 2020

Make release for v0.8.0 #565

Merged
