Switch estimator operator logical checks to be interface based rather than inheritance based #1108
Conversation
…sor, transformer, or selector
@weixuanfu , I'd be happy to add "a demo for using TPOT with a cuML configuration similar to TPOT's default configurations." I do have a couple of questions to help make sure what I provide is useful.
Thank you for the PR. A Jupyter notebook demo (like this one) stored in the tutorials folder would be great. Also, you could add the cuML configuration under `tpot/config`.
Happy to add this. As a note, as of now I would plan to wrap this logic inside a check to make sure cuML is available:

```python
def _has_cuml():
    try:
        import cuml  # NOQA
        return True
    except ImportError:
        return False
```

to create a binary flag and raise an informative warning if it is not.
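For context, a minimal sketch of how that flag might be consumed. The call site, helper name, and message below are illustrative assumptions, not code from this PR:

```python
import warnings

# Assumption: evaluated once at import time to create the binary flag.
_CUML_AVAILABLE = _has_cuml()

def _check_cuml_available(config_dict):
    """Hypothetical helper: warn informatively if the cuML configuration
    is requested but cuml cannot be imported."""
    if config_dict == "TPOT cuML" and not _CUML_AVAILABLE:
        warnings.warn(
            "The 'TPOT cuML' configuration requires the cuml library, which "
            "could not be imported. See https://rapids.ai for installation "
            "instructions."
        )
```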
One issue with cuml is that the …
Thanks for highlighting that discrepancy and the specific location in the code. We have an open issue, and I agree with your assessment. Rather than special-case cuML in the TPOT codebase, I'd prefer we resolve it upstream (at which point, the existing code you linked should "just work"). I'll push the next commit, and then will shift to address the …
@weixuanfu I've included two example notebooks for review. Since this PR now adds a configuration, it should also update the documentation correspondingly. I'll do that shortly, but first will address the upstream cuML compatibility discussed above.
@beckernick thank you for submitting those examples. I tried to test one of them on Colab but somehow it failed. Here is the link. Any idea? Key runtime info: …
Ah, that's my mistake. While debugging I discovered one more necessary cuML change (PR) and ran on that build without thinking about it. Apologies for not being clearer about the cuML status and timeline earlier. This will not work with cuML 0.14. RAPIDS cuML is scheduled to release 0.15 next week, which has a significant number of enhancements that enable this functionality (minus the bug addressed in the PR above). Conda packages for the nightly release, cuML 0.16, are currently available. The examples will work with what we call the "0.16 nightly" packages once the linked PR lands.
Great, I will test with the nightly build. Thank you for the information.
Hmm, I cannot install the nightly build in Colab (stderr below). I think the reason is that Colab only supports Python 3.6 for now, and that does not seem to work with RAPIDS 0.15. Will RAPIDS 0.15 support Python 3.6 later?

```
Installing RAPIDS 0.15 packages from the nightly release channel
Please standby, this will take a few minutes...
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed

SpecsConfigurationConflictError: Requested specs conflict with configured specs.
  requested specs:
    - cudatoolkit=10.1
    - cudf=0.15
    - cugraph
    - cuml
    - cusignal
    - cuspatial
    - dask-cudf
    - gcsfs
    - pynvml
  pinned specs:
    - python=3.6
    - xgboost
```
RAPIDS 0.15 and onward will not support Python 3.6. The rationale comes from the broader community in NEP 29 ("Recommend Python and NumPy version support as a community policy standard"); in particular, see the NumPy recommended support table and drop schedule.
Given that the cuML bugfix PR above has not yet merged, I'm happy to ping you when this PR is ready for testing; I want to be respectful of your time. I appreciate your guidance and support on this PR 😄
Thank you for the information and the heads-up. I will be on vacation next week, so I will wait to test again after the stable release of cuML 0.15. (Then I will set up cuML 0.15 with a 2080 Ti GPU on my local compute node.)
Sounds great. Just to note, you'd need to use the 0.16 nightly conda package (either locally or in the cloud). Enjoy your vacation! 🌴
…ub.com/beckernick/tpot into feature/duck-typed-estimator-op-checks
… to highlight the impact
```html
<td>TPOT will search over a restricted configuration using the GPU-accelerated estimators in <a href="https://github.com/rapidsai/cuml">RAPIDS cuML</a> and <a href="https://github.com/dmlc/xgboost">DMLC XGBoost</a>. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and that the library cuML is installed. With this configuration, all model training and predicting will be GPU-accelerated.
<br /><br />
This configuration is particularly useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.</td>
<td align="center"><a href="https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier_cuml.py">Classification</a>
```
@weixuanfu, for these documentation hyperlinks, should I be hard-coding the `master` branch like the existing documentation? Or, since this PR is targeting the development branch, should these refer to `development`?
No need for that; eventually it will be merged to the master branch.
@weixuanfu this PR should be ready to test with the 0.16 nightly packages.
The above would need a different … In terms of loose memory requirements for the biggest example notebook, when I ran Higgs_Boson.ipynb as-is (400k rows in the training set), peak GPU memory appeared to be about 2800 MB.
Thank you! I will check it next week.
To test whether this PR can make a material impact, I ran some small timed classification experiments (…). For both datasets, the GPU-accelerated PR configuration achieved higher accuracy in one hour than the default configuration achieved in eight hours. Because this PR provides a restricted configuration set (somewhere between TPOT-Light and TPOT-Default) with GPU-accelerated estimators, it's able to evaluate quite a few more individuals in a given amount of time on medium-to-large datasets. In these experiments, this allowed TPOT to find a higher-accuracy pipeline much faster. The graphs below provide a brief summary.

As a concrete example, in the eight-hour airlines dataset experiment, the final default pipeline achieved 87.2% cross-validation accuracy, while the final PR pipeline achieved 88.5%, a 1.3 percentage point increase.

[Figure: TPOT Default Final Pipeline achieving 87.2% accuracy (8 Hour Experiment)]

[Figure: TPOT PR Final Pipeline achieving 88.5% accuracy (8 Hour Experiment)]

I've included the experiment code I used in these two gists below:
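For reference, a minimal sketch of how such a timed comparison can be set up with TPOT's `max_time_mins` budget. The synthetic data below is a stand-in for the datasets used in the gists, and the parameter values are illustrative, not the exact gist code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Placeholder data standing in for the airlines/Higgs datasets.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)

configs = {
    "default": TPOTClassifier(max_time_mins=60, cv=5, random_state=12, verbosity=1),
    "cuML PR": TPOTClassifier(
        config_dict="TPOT cuML", max_time_mins=60, cv=5, random_state=12, verbosity=1
    ),
}

for name, tpot in configs.items():
    tpot.fit(X_train, y_train)  # evolves pipelines until the time budget expires
    print(name, tpot.score(X_test, y_test))
```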
Thanks @beckernick, I just tested it in a smaller experiment and it worked well in my environment. The only minor issue is that I got a lot of warning messages like "Expected column …".
Ah, thanks for catching this. Just needs a one-line change. cuML currently has verbose logging on by default. I'll update the PR.
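For anyone hitting this before the fix lands, a hedged workaround sketch, assuming the `cuml.common.logger` API available around the 0.15/0.16 releases:

```python
# Assumption: cuml.common.logger exposes set_level and level_* constants,
# as in the RAPIDS 0.15/0.16 era. Raising the threshold to errors-only
# suppresses the informational "Expected column..." messages.
import cuml.common.logger as logger

logger.set_level(logger.level_error)
```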
Today, no. For interface consistency with the necessary scikit-learn APIs, these conversions need to happen automatically within the cuML methods, rather than beforehand by the user.
After thinking it through, we've decided to make a change upstream to turn off these messages by default (rapidsai/cuml#2824). I've tested this PR with the upstream PR and can confirm that those "Expected column..." messages no longer appear. That PR is approved and "Ready for Merge", but is slated to merge after rapidsai/cuml#2747, which should happen imminently (also approved). Would you consider this minor issue non-blocking for this PR, given it will be resolved shortly upstream? EDIT: As of 6 PM EDT, both of the linked PRs have merged 👍
@beckernick this minor issue won't block us from merging this PR. I will merge this one into the dev branch soon, and we will test it more before merging the dev branch into the master branch. Also, do you have any info on when the stable version of cuML 0.16 will be released? I hope we have a stable version of cuML before the next release of TPOT with this cool feature!
Sounds good! Please do let me know if there's anything I can do to help out. The 0.16 release is currently scheduled for Wednesday, October 14th (release timeline). If there are any changes to that plan, I'll make sure to ping you as well.
Nice! Thank you for the info. I will test the new feature and prepare a new TPOT release in October.
As a quick update, cuML 0.16 is on target for the planned October 14th release 😄
Thank you for the update. We will release a new version of TPOT with this PR right after the cuML 0.16 release.
cuML 0.16 is now planned for an October 21st release, a delay of 7 days.
OK, thank you for the update!
cuML 0.16 is on track to release on October 21 (tomorrow)!
Thank you! We will test TPOT with the cuML 0.16 stable version and then make a release later this month if there are no further major issues.
Sounds good. cuML 0.16 has been released, so the following environment should work (I ran the example notebooks in the …).
What does this PR do?
This PR:

1. Switches several inheritance-based operator/estimator checks to be duck-typing based (verifying against the estimator interface). The primary use of this is in evaluating whether an operator can be the root of a pipeline and setting the `optype` correctly. Specifically, logical checks for whether an operator is an estimator of a certain category are currently done by checking whether it inherits from one of several scikit-learn Mixin classes. This PR switches these checks to evaluate whether the interface of the operator is consistent with the scikit-learn estimators, rather than requiring explicit subclassing (see the sketch after this list).
2. Adds a new configuration, `"TPOT cuML"`. With this configuration, TPOT will search over a restricted configuration using the GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and that the library cuML is installed. With this configuration, all model training and predicting will be GPU-accelerated. This configuration is particularly useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.
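As an illustration of the interface-based approach, here is a minimal sketch of what such checks can look like. The exact attribute lists are assumptions for illustration; see `operator_utils.py` in this PR for the real checks:

```python
def _is_selector(estimator):
    # Anything exposing the feature-selection interface counts as a selector,
    # whether or not it subclasses sklearn.feature_selection.SelectorMixin.
    selector_attributes = [
        "get_support", "transform", "inverse_transform", "fit_transform",
    ]
    return all(hasattr(estimator, attr) for attr in selector_attributes)


def _is_transformer(estimator):
    # Anything with fit_transform walks and quacks like a transformer.
    return hasattr(estimator, "fit_transform")
```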
Where should the reviewer start?
The reviewer should start in `stacking_estimator.py`, and then move to `operator_utils.py`. Next, they should look at `base.py` and then the new configuration options.
How should this PR be tested?
This PR adds new tests to the general testing suite, as it now adds new public-interface options. It can be tested with the standard tests: `nosetests -s -v` (on my local machine). If it's desirable, I'm happy to also add tests for the private helper methods in `operator_utils.py` (`_is_selector` and `_is_transformer`); a sketch of what such tests might look like is shown below.
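If those helper tests were desired, a minimal sketch might look like the following; the test names and the choice of scikit-learn estimators as fixtures are assumptions:

```python
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from tpot.operator_utils import _is_selector, _is_transformer


def test_is_selector():
    # A selector exposes get_support/transform/etc.; a bare classifier does not.
    assert _is_selector(SelectPercentile())
    assert not _is_selector(LogisticRegression())


def test_is_transformer():
    # A transformer exposes fit_transform; a bare classifier does not.
    assert _is_transformer(StandardScaler())
    assert not _is_transformer(LogisticRegression())
```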
Any background context you want to provide?
Currently, TPOT requires that estimators explicitly inherit from scikit-learn Mixin classes in order to determine the nature of an estimator operator within the TPOTOperatorClassFactory and StackingEstimator. This scikit-learn inheritance-based programming model provides consistency, but also limits flexibility. Switching these checks to be duck-typing based rather than inheritance based preserves that consistency while also allowing users to use other libraries with TPOT, such as cuML, a GPU-based machine learning library. Using cuML with TPOT can provide significant speedups, as shown in the issue linked below.
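For a sense of how the new configuration plugs in, below is an illustrative excerpt in the style of TPOT's configuration dictionaries (cf. `tpot/config/classifier_cuml.py`). The estimator classes are real cuML/XGBoost classes, but the hyperparameter grids here are placeholders rather than the PR's actual values:

```python
# Illustrative sketch only; see tpot/config/classifier_cuml.py for the real file.
classifier_config_cuml = {
    # GPU-accelerated estimators from RAPIDS cuML
    "cuml.neighbors.KNeighborsClassifier": {
        "n_neighbors": range(1, 101),
        "weights": ["uniform"],
    },
    "cuml.linear_model.LogisticRegression": {
        "penalty": ["l1", "l2"],
        "C": [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1.0, 10.0, 25.0],
    },
    # GPU-accelerated gradient boosting from DMLC XGBoost
    "xgboost.XGBClassifier": {
        "n_estimators": [100],
        "max_depth": range(3, 10),
        "learning_rate": [1e-3, 1e-2, 1e-1, 0.5, 1.0],
        "tree_method": ["gpu_hist"],
    },
}
```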
What are the relevant issues?
This closes #1106
Screenshots (if appropriate)
From the linked issue, with the specified configuration and key parameters:
Questions: