
[REVIEW] Sklearn meta-estimators into namespace #3493

Conversation

viclafargue
Contributor

@viclafargue viclafargue commented Feb 12, 2021

Closes #3484

Imports sklearn's Pipeline and GridSearch meta-estimators into cuML namespace for ease-of-use.

@viclafargue viclafargue requested a review from a team as a code owner February 12, 2021 14:10
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Feb 12, 2021
@viclafargue viclafargue added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 12, 2021
@viclafargue viclafargue changed the title [REVIEW] Sklearn meta-estimators into namespace Sklearn meta-estimators into namespace Feb 12, 2021
@viclafargue viclafargue added the 3 - Ready for Review Ready for review by team label Feb 12, 2021
Contributor

@wphicks left a comment

This looks great! It's a straightforward change that will help make the transition easier for people currently using an sklearn-only workflow.

A few thoughts on this:

  1. We should probably delay merging this until [REVIEW] Ensure global_output_type is thread-safe #3497 is in, since our estimators currently do not play nice with sklearn's GridSearchCV in any sort of multithreaded environment.
  2. We should probably do something to explicitly and unambiguously give the fine folks of scikit-learn credit for these features in our docs. I could imagine doing that a couple different ways. One would be to edit the docstring at import time, adding a line that clarifies where these classes come from. The other would be to wrap them in a class that provides a docstring which links directly to scikit-learn documentation.
  3. Since we're bringing these features into our namespace, do we want to do something to test them as well? I don't think there's any need to duplicate sklearn tests, but perhaps we could explicitly use these features with one of our estimators, which would open up the possibility of detecting bugs like the one addressed by [REVIEW] Ensure global_output_type is thread-safe #3497.
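The second option in point 2 might look vaguely like the sketch below. This is purely illustrative and not code from the PR: `_UpstreamPipeline` is a stand-in for the real `sklearn.pipeline.Pipeline` import so the snippet is self-contained, and the docstring wording is hypothetical.

```python
# Sketch of the "wrap in a class" approach: a thin subclass whose
# docstring credits scikit-learn and points at the upstream docs.
# _UpstreamPipeline stands in for sklearn.pipeline.Pipeline here.

class _UpstreamPipeline:
    """Original scikit-learn docstring."""

    def fit(self, X, y=None):
        return self


class Pipeline(_UpstreamPipeline):
    """Pipeline of transforms with a final estimator.

    Imported from scikit-learn; see the scikit-learn documentation
    for sklearn.pipeline.Pipeline for the full upstream reference.
    """
```

A wrapper like this keeps the attribution visible in `help(Pipeline)` without mutating the upstream class.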

Resolved review threads (outdated): python/cuml/model_selection/__init__.py, python/cuml/pipeline/__init__.py
@JohnZed
Contributor

JohnZed commented Feb 22, 2021

Updating the docstrings seems appealing... if we import 4 classes, we can iterate over them and append something vaguely like this:

"""This code is developed and maintained by scikit-learn (insert proper citation here) and imported by cuML to maintain the familiar sklearn.* namespace structure. cuML includes tests to ensure full compatibility of these wrappers with CUDA-based data and cuML estimators, but all of the underlying code is due to the scikit-learn developers."""
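That iterate-and-append idea might be sketched like this. The two stand-in classes replace the real `sklearn.pipeline.Pipeline` and `sklearn.model_selection.GridSearchCV` imports so the snippet runs on its own; the real code would loop over the four imported classes.

```python
# Append an attribution note to each re-exported class's docstring at
# import time. Stand-in classes replace the real sklearn imports here.

class Pipeline:
    """Pipeline of transforms with a final estimator."""


class GridSearchCV:
    """Exhaustive search over specified parameter values."""


_NOTE = (
    "\n\nThis code is developed and maintained by scikit-learn and "
    "imported by cuML to maintain the familiar sklearn.* namespace "
    "structure."
)

for _cls in (Pipeline, GridSearchCV):
    _cls.__doc__ = (_cls.__doc__ or "") + _NOTE
```

The `or ""` guard covers classes whose docstring is `None` (e.g. when Python runs with `-OO`).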

Contributor

@JohnZed left a comment

Looks great... I just have one question/topic regarding tests.


GridSearchCV.__doc__ = """
Contributor

Ah, bummer that we have to copy-paste since these are in separate files (pipeline and model_selection). If it ever goes to a third file, I think we should break it out in a shared way.
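If it ever did need a third copy, the shared version might look vaguely like this. This is a hypothetical helper, not code from the PR; the module name and the stand-in `Pipeline` class are both made up for illustration.

```python
# Hypothetical shared helper (e.g. something like
# cuml/internals/sklearn_attribution.py) so every re-exporting module
# applies the same docstring note instead of copy-pasting it.

_ATTRIBUTION = (
    "\n\nThis class is developed and maintained by scikit-learn and "
    "re-exported by cuML for namespace compatibility."
)


def add_sklearn_attribution(*classes):
    """Append the shared attribution note to each class's docstring."""
    for cls in classes:
        cls.__doc__ = (cls.__doc__ or "") + _ATTRIBUTION
    return classes


# Usage in a re-exporting __init__.py; a stand-in class replaces the
# real sklearn import so this snippet is self-contained.
class Pipeline:
    """Pipeline of transforms."""


add_sklearn_attribution(Pipeline)
```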

from cuml.svm import SVC


def test_pipeline():
Contributor

Should we test that every model is compatible? Seems like we could do this with about 10 rows of data and keep it fast. We could use the same approach as test_pickle to grab at least all regression and classification models pretty easily. This is an open question, so LMK if you think it's a bad idea.
It might also be nice to confirm that the output type doesn't get messed up. Basically, predict should preserve the on-GPU-ness of the data even through the pipeline (predict + check output class type). Not sure the scoring is even necessary, since this is more a question of whether the pipeline is flowing.
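A GPU-free sketch of that parametrized-test idea is below. Dummy estimators stand in for the cuML models so the snippet has no cuDF/CuPy dependency; in the real test the model classes would come from cuml (e.g. cuml.svm.SVC), the data would be a GPU array, and the assertion would check that predict preserves its type.

```python
# Sketch: loop a tiny pipeline over several estimator classes and
# assert the output "type" survives. Dummy classes stand in for cuML
# estimators; the real test would check on-GPU output types instead.

class DummyScaler:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v * 2.0 for v in row] for row in X]


class DummyClassifier:
    def fit(self, X, y):
        self._label = y[0]
        return self

    def predict(self, X):
        return [self._label] * len(X)


class DummyRegressor(DummyClassifier):
    pass


def run_pipeline(model_cls, X, y):
    """Fit scaler + model, then predict through both steps."""
    scaler = DummyScaler().fit(X)
    model = model_cls().fit(scaler.transform(X), y)
    return model.predict(scaler.transform(X))


X = [[0.0, 1.0], [1.0, 0.0]]
y = [1, 0]

# Parametrize over model classes, as test_pickle does for cuML models.
for model_cls in (DummyClassifier, DummyRegressor):
    preds = run_pipeline(model_cls, X, y)
    # Output-type check: predictions keep the input container type,
    # mirroring the on-GPU-ness check suggested for the real test.
    assert isinstance(preds, list) and len(preds) == len(X)
```

In pytest this loop would become `@pytest.mark.parametrize` over the collected estimator classes, which keeps each model's failure independently visible.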

Contributor

@wphicks left a comment

This looks good to me. I know you're still tweaking some things, but wanted to go ahead and approve what I've seen so far since I'll be out next week.

Contributor

@JohnZed left a comment

Looks great!

@JohnZed
Contributor

JohnZed commented Mar 2, 2021

rerun tests

@viclafargue force-pushed the sklearn-meta-estimator-into-namespace branch from 8699bad to d53f13a on March 3, 2021 08:53
@viclafargue
Contributor Author

rerun tests

@viclafargue viclafargue changed the title Sklearn meta-estimators into namespace [REVIEW] Sklearn meta-estimators into namespace Mar 5, 2021
@codecov-io

Codecov Report

Merging #3493 (79b4a95) into branch-0.19 (6dfff66) will increase coverage by 1.62%.
The diff coverage is 94.95%.


@@               Coverage Diff               @@
##           branch-0.19    #3493      +/-   ##
===============================================
+ Coverage        79.21%   80.84%   +1.62%     
===============================================
  Files              226      228       +2     
  Lines            17900    17742     -158     
===============================================
+ Hits             14180    14343     +163     
+ Misses            3720     3399     -321     
Flag Coverage Δ
dask 45.30% <70.42%> (+1.55%) ⬆️
non-dask 73.10% <94.95%> (+1.62%) ⬆️

Flags with carried-forward coverage won't be shown.

Impacted Files Coverage Δ
python/cuml/dask/cluster/dbscan.py 97.29% <ø> (+0.07%) ⬆️
python/cuml/dask/linear_model/elastic_net.py 100.00% <ø> (ø)
python/cuml/datasets/arima.pyx 97.56% <ø> (ø)
python/cuml/datasets/blobs.py 77.27% <ø> (ø)
python/cuml/datasets/regression.pyx 98.11% <ø> (ø)
python/cuml/decomposition/incremental_pca.py 94.70% <ø> (ø)
python/cuml/feature_extraction/_tfidf.py 94.73% <ø> (+0.11%) ⬆️
python/cuml/metrics/pairwise_distances.pyx 98.83% <ø> (ø)
python/cuml/neighbors/__init__.py 100.00% <ø> (ø)
python/cuml/preprocessing/LabelEncoder.py 94.73% <ø> (ø)
... and 93 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bacb05e...79b4a95.

@JohnZed
Contributor

JohnZed commented Mar 16, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 28c3e39 into rapidsai:branch-0.19 Mar 16, 2021
Successfully merging this pull request may close these issues.

[FEA] Import compatible meta-estimators (e.g. pipelines) from sklearn into namespace