[REVIEW] cuML's estimator Base class for preprocessing models #3270

Merged
merged 34 commits into from
Mar 29, 2021

Contributor

@viclafargue viclafargue commented Dec 7, 2020

Answers #3201.
This PR makes preprocessing models fully compliant with cuML's estimator Base class and the tagging system.

Preprocessing models are now decorated with cuml_estimator, and preprocessing functions with cuml_function, so that they can use the features offered by cuML's estimator Base class. The return types of the fit and transform methods are now specified. CumlArrayDescriptor attributes were created, and the get_param_names and _more_tags methods were added where necessary.
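As a rough illustration of the last point, here is a minimal sketch of how get_param_names and _more_tags might look on a preprocessing model. SketchBase and SketchScaler are hypothetical stand-ins, not cuML classes, and only the tag key visible in the review diff (X_types_gpu) is used.

```python
class SketchBase:
    """Hypothetical stand-in for an estimator base class (not cuML's Base)."""

    def __init__(self, *, verbose=False):
        self.verbose = verbose

    def get_param_names(self):
        # Names of the constructor arguments handled by this class.
        return ["verbose"]

    def get_params(self):
        # Collect the current value of every declared parameter.
        return {name: getattr(self, name) for name in self.get_param_names()}

    def _more_tags(self):
        # No extra tags by default.
        return {}


class SketchScaler(SketchBase):
    """Hypothetical preprocessing model built on the sketch base class."""

    def __init__(self, *, copy=True, verbose=False):
        super().__init__(verbose=verbose)
        self.copy = copy

    def get_param_names(self):
        # Extend the parent's list with this model's own parameters.
        return super().get_param_names() + ["copy"]

    def _more_tags(self):
        # Declare that dense and sparse GPU inputs are accepted.
        return {"X_types_gpu": ["2darray", "sparse"]}


scaler = SketchScaler(copy=False)
params = scaler.get_params()
tags = scaler._more_tags()
```

The point of the pattern is that a generic get_params in the base class can work for any model, provided each model lists its own constructor arguments.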

As the SparseCumlArray class can only handle CSR matrices for now, preprocessing models will only return this type as sparse outputs.
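For readers less familiar with the CSR layout, the sketch below converts COO triplets into the three CSR arrays in plain Python. It only illustrates the format; coo_to_csr is a hypothetical helper, and this is not how cuML or CuPy performs the conversion.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets to CSR arrays (data, indices, indptr)."""
    # Count the non-zero entries in each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1

    # indptr[i] is the offset at which row i starts in data/indices.
    indptr = [0] * (n_rows + 1)
    for i in range(n_rows):
        indptr[i + 1] = indptr[i] + counts[i]

    # Place each entry in its row segment, keeping input order within a row
    # (column indices inside a row are therefore not necessarily sorted).
    data = [0] * len(vals)
    indices = [0] * len(vals)
    next_slot = indptr[:-1].copy()
    for r, c, v in zip(rows, cols, vals):
        k = next_slot[r]
        data[k] = v
        indices[k] = c
        next_slot[r] += 1
    return data, indices, indptr


# 3x3 matrix with entries (0, 2) = 5, (2, 0) = 7 and (0, 0) = 1.
data, indices, indptr = coo_to_csr(3, [0, 2, 0], [2, 0, 0], [5, 7, 1])
```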

@viclafargue viclafargue requested a review from a team as a code owner December 7, 2020 15:56
@GPUtester
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@viclafargue viclafargue added breaking Breaking change improvement Improvement / enhancement to an existing function labels Dec 7, 2020
Member

@dantegd dantegd left a comment

Had one question about the relationship with PR #3257

        return X

    def _more_tags(self):
-        return {'allow_nan': True}
+        return {'X_types_gpu': ['2darray', 'sparse'],
Member

@viclafargue we're in the process of making the tags system static in #3257, so depending on timing that PR will affect this one or the other way around. Do you foresee many issues arising from that change for these classes in _thirdparty/sklearn?

Contributor Author

No problem, I'll wait for your PR. Should be fairly simple, the models inherit from the Base class. I'll just have to make _more_tags methods static everywhere.

Contributor

@viclafargue I think we're interested in knowing whether you have any tags that are dynamic and will change from one instance to another depending on the properties of the class, or whether all of the tags can be determined in a static method.

Contributor Author

Sorry, I didn't know these tags could be instance-specific. From what I could see, all of them seemed to be class-specific (static) for preprocessing. After a closer look, there seems to be at least one occurrence of an instance-specific tag.

Member

which tag would be instance specific?

Member

ah I see it, let me think on it for a second, have a couple of ideas

Member

Since this is using the AllowNaNTagMixin already, instead of defining _more_tags, you can just add

class SparseInputTagMixin:
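
The mixin idea can be sketched roughly as follows, assuming static tags are merged along the class hierarchy. The mixin names come from this thread, but their bodies, the TagsSketch base, and the _more_static_tags / _get_tags names are assumptions made for illustration, not the code merged in this PR or in #3257.

```python
class TagsSketch:
    """Hypothetical base that merges static tags along the MRO."""

    @staticmethod
    def _more_static_tags():
        # Default tags; mixins override individual keys.
        return {"allow_nan": False, "X_types_gpu": ["2darray"]}

    @classmethod
    def _get_tags(cls):
        tags = {}
        # Walk from the most generic class to the most specific one, so
        # classes earlier in the MRO override those that come later.
        for klass in reversed(cls.__mro__):
            more = klass.__dict__.get("_more_static_tags")
            if more is not None:
                tags.update(more.__func__())
        return tags


class AllowNaNTagMixin:
    @staticmethod
    def _more_static_tags():
        return {"allow_nan": True}


class SparseInputTagMixin:
    @staticmethod
    def _more_static_tags():
        return {"X_types_gpu": ["2darray", "sparse"]}


class SketchScaler(AllowNaNTagMixin, SparseInputTagMixin, TagsSketch):
    """Hypothetical model that gets its tags purely from mixins."""


tags = SketchScaler._get_tags()
```

Because every tag is resolved from the class alone, no instance is needed to query them. An instance-specific tag, like the one mentioned earlier in this thread, would need a separate mechanism.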

Contributor

@mdemoret-nv mdemoret-nv left a comment

Looking at this PR I think the use of CumlArrayDescriptor looks pretty good but I have some concerns about the use of decorators and the class inheritance. In order to work seamlessly with the descriptors/decorators added in 0.17, this will need some significant changes to the architecture (the ESTIMATOR_GUIDE.md might be helpful).

Before approving this PR or making any suggestions, I would prefer to discuss the design decisions with Victor to understand the motivation first and then do another review.

@viclafargue viclafargue requested a review from a team as a code owner December 29, 2020 10:40
@viclafargue viclafargue removed the 0 - Blocked Cannot progress due to external reasons label Mar 5, 2021
v0.19 Release automation moved this from PR-WIP to PR-Needs review Mar 16, 2021
Member

@dantegd dantegd left a comment

Sorry for the delay on my review @viclafargue

@JohnZed JohnZed self-assigned this Mar 18, 2021
Contributor

@JohnZed JohnZed left a comment

(Will give feedback on the rest tomorrow - just need to think on it a bit but wanted to add this comment first)

I see the TODOs about preserving order. Is this something that should really be a generic feature of the base class output conversion? Maybe some other transform-type models need this?

Contributor

@JohnZed JohnZed left a comment

Apologies for being so slow! I REALLY debated whether there was another approach that would reduce the delta in the _thirdparty_dependencies section, but in the end I believe you found the best solution so we should move forward with this PR.

My one caveat overall is that we should make sure we're testing the various different sparse matrix formats that are supported... in a couple of cases, I think we may not be testing them all (noted in comments)

Member

@dantegd dantegd left a comment

Changes look good from my review pov

Contributor

JohnZed commented Mar 25, 2021

Looks great!

@JohnZed JohnZed dismissed mdemoret-nv’s stale review March 25, 2021 22:24

Outdated review from several months ago.

v0.19 Release automation moved this from PR-Needs review to PR-Reviewer approved Mar 25, 2021
@viclafargue viclafargue added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Waiting on Reviewer Waiting for reviewer to review or respond labels Mar 26, 2021
@codecov-io

Codecov Report

Merging #3270 (d199019) into branch-0.19 (c2f246a) will increase coverage by 1.36%.
The diff coverage is 87.22%.

@@               Coverage Diff               @@
##           branch-0.19    #3270      +/-   ##
===============================================
+ Coverage        80.87%   82.23%   +1.36%     
===============================================
  Files              228      226       -2     
  Lines            17630    17480     -150     
===============================================
+ Hits             14258    14375     +117     
+ Misses            3372     3105     -267     
Flag       Coverage Δ
dask       46.37% <1.66%> (+1.41%) ⬆️
non-dask   74.17% <87.22%> (+1.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files                                         Coverage Δ
...cuml/_thirdparty/sklearn/preprocessing/__init__.py  100.00% <ø> (ø)
python/cuml/thirdparty_adapters/adapters.py            92.08% <ø> (+3.08%) ⬆️
...on/cuml/_thirdparty/sklearn/preprocessing/_data.py  64.65% <81.45%> (+1.54%) ⬆️
...hirdparty/sklearn/preprocessing/_discretization.py  83.59% <100.00%> (-0.62%) ⬇️
...l/_thirdparty/sklearn/preprocessing/_imputation.py  64.54% <100.00%> (+1.74%) ⬆️
...cuml/_thirdparty/sklearn/utils/skl_dependencies.py  80.00% <100.00%> (+25.09%) ⬆️
python/cuml/common/array_sparse.py                     96.29% <100.00%> (+1.95%) ⬆️
python/cuml/internals/api_context_managers.py          93.61% <100.00%> (+0.13%) ⬆️
python/cuml/thirdparty_adapters/__init__.py            100.00% <100.00%> (ø)
python/cuml/_thirdparty/sklearn/utils/_pprint.py       0.00% <0.00%> (-27.54%) ⬇️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2f246a...d199019. Read the comment docs.

Contributor

JohnZed commented Mar 29, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d4d1bcf into rapidsai:branch-0.19 Mar 29, 2021
v0.19 Release automation moved this from PR-Reviewer approved to Done Mar 29, 2021
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function
Projects
No open projects
v0.19 Release
  
Done
8 participants