Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support numpy.random.RandomState objects (take 2) #556

Merged
merged 7 commits into from Apr 1, 2020

Conversation

dsherry
Copy link
Contributor

@dsherry dsherry commented Apr 1, 2020

Closes #347 .

A version of #530 which won't break the windows tests.

Will retarget on master once #554 and #555 are merged.

Changes

  • Have all components and pipelines take random_state as a keyword argument
  • Have the entire codebase accept random_state as either an int seed or a numpy.random.RandomState object
  • Add get_random_state helper to standardize to np.random.RandomState objects
  • Provide a way for components which don't support np.random.RandomState objects to get random seeds, via a get_random_seed method
  • Ensure getting random seed will be safe on 32-bit systems using a SEED_BOUNDS range constant
  • Add test coverage to increase coverage

Building off of #441 , opening to test my own changes on top of Angela's work (thank you @angela97lin!)

Note @kmax12 we had discussed sticking with seeds internally, but using np.random.RandomState turned out to be the simpler option.

@dsherry dsherry self-assigned this Apr 1, 2020
@codecov
Copy link

codecov bot commented Apr 1, 2020

Codecov Report

Merging #556 into master will increase coverage by 13.53%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #556       +/-   ##
===========================================
+ Coverage   85.30%   98.83%   +13.53%     
===========================================
  Files         115      115               
  Lines        4205     4297       +92     
===========================================
+ Hits         3587     4247      +660     
+ Misses        618       50      -568     
Impacted Files Coverage Δ
evalml/automl/auto_classification_search.py 100.00% <ø> (ø)
evalml/automl/auto_regression_search.py 100.00% <ø> (ø)
evalml/pipelines/components/utils.py 100.00% <ø> (ø)
evalml/preprocessing/utils.py 100.00% <ø> (ø)
evalml/tuners/skopt_tuner.py 100.00% <ø> (ø)
evalml/tuners/tuner.py 100.00% <ø> (ø)
evalml/automl/auto_base.py 96.19% <100.00%> (+3.67%) ⬆️
evalml/pipelines/classification/catboost.py 100.00% <100.00%> (+14.28%) ⬆️
evalml/pipelines/classification/xgboost.py 100.00% <100.00%> (+12.50%) ⬆️
evalml/pipelines/components/component_base.py 88.88% <100.00%> (ø)
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e0288e...85fae99. Read the comment docs.

@dsherry
Copy link
Contributor Author

dsherry commented Apr 1, 2020

This was the error coming out of all the windows unit tests: ValueError: high is out of bounds for int32, from the call to randint. It was because I was passing in 2**32 - 1, but max int on 32-bit systems is 2**31 - 1 🤦‍♂️

@@ -26,6 +26,7 @@ class CatBoostClassifier(Estimator):
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, n_estimators=1000, eta=0.03, max_depth=6, bootstrap_type=None, random_state=0):
random_seed = get_random_seed(random_state, 0, SEED_BOUNDS.max_bound)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catboost estimators need random_seed to be between 0 and max int, otherwise they throw an error.

@@ -19,13 +19,14 @@ class XGBoostClassifier(Estimator):
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, eta=0.1, max_depth=3, min_child_weight=1, n_estimators=100, random_state=0):
random_seed = get_random_seed(random_state, SEED_BOUNDS.min_bound, SEED_BOUNDS.max_bound)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like catboost, the xgboost classifier needs random_seed to be an int, not np.random.RandomState. The weird thing is, passing random_state worked fine on linux... so it appears its only on windows that xgboost requires this!

return random_state.randint(min_bound, max_bound)
if random_state < min_bound or random_state >= max_bound:
return random_state % min(abs(min_bound), abs(max_bound))
return random_state
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to avoid defining a function like this because it introduces complexity. But having this provides a pattern which we can follow for when we add new pipelines which require integer seeds instead of np.random.RandomState.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good solution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I still wish it were simpler and therefore easier to understand, but I'm not sure there's a better alternative right now. Let's continue to think about it :)

@dsherry dsherry force-pushed the ds_revert_347_random_state branch from d60a2e7 to fbaa89f Compare April 1, 2020 14:09
@dsherry dsherry force-pushed the ds_347_random_state_windows branch 2 times, most recently from a295283 to 33ef6ba Compare April 1, 2020 15:15
@dsherry dsherry mentioned this pull request Apr 1, 2020
* Squash random_state work from 347_random_state

* Lint

* Lint

* Changelog

* Lint

* Test update

* Always pass random_state to components

* Lint

* Fix bug: set random state first. Remove usages of random_state as dict param item in test_pipelines.py

* update test for clarity

* Fix catboost

* Update logreg test

* Lint catboost

* Update tuner impl to handle random_state

* Test changes

* Lint

* Docs changes

* Add unit test for get_random_state

* Update test

* Remove uncalled code after my changes

* Fix tests after rebase

* Add unit test coverage for RandomSearchTuner.is_search_space_exhausted

* Add unit test coverage for max_time

* Add test coverage of get_pipeline when invalid

* Lint

* Add unit test coverage of when fit/score throws in autobase

* Remove duplicate

* Lets try that again... got mysterious docs failure
@dsherry dsherry force-pushed the ds_347_random_state_windows branch from 33ef6ba to 85fae99 Compare April 1, 2020 15:49
@dsherry dsherry changed the base branch from ds_revert_347_random_state to master April 1, 2020 15:50
Copy link
Contributor

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@dsherry dsherry merged commit 9bafdd2 into master Apr 1, 2020
@dsherry dsherry deleted the ds_347_random_state_windows branch April 1, 2020 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

random_state usage is inconsistent with sklearn
2 participants