
Switch estimator operator logical checks to be interface based rather than inheritance based #1108

Conversation

@beckernick (Contributor) commented Aug 19, 2020

What does this PR do?

This PR:

  • Switches several inheritance-based operator/estimator checks to duck-typing-based checks (verifying against the estimator interface). The primary use of this is in evaluating whether an operator can be the root of a pipeline and setting the optype correctly. Currently, logical checks for whether an operator is an estimator of a certain category are done by checking whether it inherits from one of several scikit-learn Mixin classes. This PR switches these checks to evaluate whether the interface of the operator is consistent with the scikit-learn estimators, rather than requiring explicit subclassing.

  • Adds a new configuration, "TPOT cuML" (a minimal usage sketch follows this list). With this configuration, TPOT searches over a restricted configuration using the GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost. It requires an NVIDIA GPU of Pascal architecture or newer (compute capability 6.0+) and that the cuML library is installed. With this configuration, all model training and prediction is GPU-accelerated. It is particularly useful for medium-sized and larger datasets, on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.
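
For context, a minimal usage sketch of the new configuration (the dataset and parameter values below are illustrative, not taken from this PR):

import numpy as np
from tpot import TPOTClassifier

# Illustrative toy data; any NumPy-compatible input works.
X = np.random.rand(10_000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=10_000)

# Select the restricted, GPU-accelerated search space by name.
# Requires a compatible NVIDIA GPU and an installed cuML.
tpot = TPOTClassifier(
    config_dict="TPOT cuML",
    generations=5,
    population_size=20,
    verbosity=2,
)
tpot.fit(X, y)
print(tpot.score(X, y))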

Where should the reviewer start?

The reviewer should start in stacking_estimator.py, and then move to operator_utils.py. Next, they should look at base.py and then the new configuration options.

How should this PR be tested?

Originally, this PR did not introduce any new public-interface behavior or dependencies, so it could be tested with the existing tests; if desirable, I'm happy to add tests for the private helper methods in operator_utils.py (_is_selector and _is_transformer).

With the addition of the "TPOT cuML" configuration, this PR now adds new public-interface options and corresponding tests to the general testing suite. It can be tested with the standard tests.

  • Passes existing + new tests with nosetests -s -v (on my local machine)
Ran 256 tests in 81.943s
OK (SKIP=1)

Any background context you want to provide?

Currently, TPOT requires that estimators explicitly inherit from scikit-learn Mixin classes in order to determine the nature of an estimator operator within the TPOTOperatorClassFactory and StackingEstimator. This scikit-learn inheritance-based programming model provides consistency, but it also limits flexibility. Switching these checks to duck typing rather than inheritance preserves that consistency while also allowing users to use other libraries with TPOT, such as cuML, a GPU-based machine learning library. Using cuML with TPOT can provide significant speedups, as shown in the issue linked below.
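
As a rough illustration of the difference, here is a sketch of inheritance-based versus interface-based checks (the actual helpers in operator_utils.py and stacking_estimator.py may inspect different attributes):

from sklearn.base import TransformerMixin

# Inheritance-based check: only operators that subclass scikit-learn's
# mixins are recognized as transformers.
def _is_transformer_by_inheritance(estimator):
    return isinstance(estimator, TransformerMixin)

# Interface-based (duck-typed) check: any operator exposing the expected
# methods is treated as a transformer, whether or not it subclasses sklearn.
def _is_transformer_by_interface(estimator):
    return hasattr(estimator, "fit") and hasattr(estimator, "transform")

# Feature selectors additionally expose get_support().
def _is_selector_by_interface(estimator):
    return _is_transformer_by_interface(estimator) and hasattr(estimator, "get_support")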

What are the relevant issues?

This closes #1106

Screenshots (if appropriate)

From the linked issue, with the specified configuration and key parameters:

[Image: tpot-cuml-speedup]

Questions:

  • Do the docs need to be updated? Yes.
  • Does this PR add new (Python) dependencies? No. Only optional dependencies that a user can control independently.

@beckernick (Contributor, Author) commented Aug 19, 2020

@weixuanfu , I'd be happy to add "a demo for using TPOT with a cuML configuration similar to TPOT's default configurations." I do have a couple of questions to help make sure what I provide is useful.

  • Do you have a specific kind of demo format in mind (e.g., a Jupyter Notebook example)?
  • Would you like this to live in this PR, as a gist, or somewhere else? If in this PR, would you prefer it to go in the tutorials directory?

@coveralls commented Aug 19, 2020

Coverage Status

Coverage decreased (-0.07%) to 96.533% when pulling 3a74907 on beckernick:feature/duck-typed-estimator-op-checks into d887251 on EpistasisLab:development.

@weixuanfu (Contributor):
Thank you for the PR

A Jupyter notebook demo (like this one), stored in tutorials, should be fine.

Also, you could add the cuML configuration under tpot/config so it can be used simply via config_dict (something like config_dict="TPOT cuML"). That would also make it easier to compare the performance of the default TPOT configuration and the cuML configuration.

@beckernick (Contributor, Author) commented Aug 19, 2020

Also, you could add the cuML configuration under tpot/config so it can be used simply via config_dict (something like config_dict="TPOT cuML"). That would also make it easier to compare the performance of the default TPOT configuration and the cuML configuration.

Happy to add this. As a note, for now I plan to wrap this logic inside a check that makes sure cuml is available. Otherwise, a user could pass config_dict="TPOT cuML" and get a failure where they didn't expect one. If you have any thoughts on your preferred style for this, I'm open to suggestions. Currently, I plan to follow this pattern:

def _has_cuml():
    # Return True only if the optional cuml package can be imported.
    try:
        import cuml   # NOQA
        return True
    except ImportError:
        return False

to create a binary flag and raise an informative warning if False. If you'd prefer to not add something like this, that's fine with me too 😄
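
For concreteness, one way the flag could be wired in, building on the _has_cuml helper above (the helper name and message below are hypothetical, not the final implementation):

import warnings

def _check_cuml_config(config_dict):
    # Hypothetical guard: warn with an informative message if the cuML
    # configuration is requested but cuml cannot be imported.
    if config_dict == "TPOT cuML" and not _has_cuml():
        warnings.warn(
            "config_dict='TPOT cuML' was requested, but the cuml package "
            "could not be imported. Install RAPIDS cuML or choose another "
            "configuration."
        )
        return False
    return True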

Review thread on tpot/config/classifier_cuml.py (outdated, resolved).
@weixuanfu weixuanfu changed the base branch from master to development August 20, 2020 17:26
@weixuanfu (Contributor):
One issue with cuML is that its RandomForestClassifier uses seed instead of random_state to set the random seed. But in TPOT, we use random_state (see those lines) to set the random seeds of all operators in a pipeline. I think we need a workaround (such as adding set_param_recursive(sklearn_pipeline.steps, 'seed', self.random_state) there). But I hope cuML can update this to match scikit-learn's API.
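
For reference, the suggested workaround would look roughly like the following (a sketch only; set_param_recursive is TPOT's existing helper in tpot.export_utils, the pipeline here is illustrative, and running it requires a GPU environment with cuML installed):

from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive
from cuml.ensemble import RandomForestClassifier

# Illustrative pipeline containing a cuML estimator.
pipeline = make_pipeline(RandomForestClassifier())

# Existing behavior: propagate random_state to operators that accept it.
set_param_recursive(pipeline.steps, 'random_state', 42)

# Suggested workaround: also propagate the value under the name `seed`,
# which cuML 0.14's RandomForestClassifier uses instead of random_state.
set_param_recursive(pipeline.steps, 'seed', 42)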

@beckernick (Contributor, Author) commented Aug 20, 2020

Thanks for highlighting that discrepancy and the specific location in the code. We have an open issue.

I agree with your assessment. Rather than special case cuML in the TPOT codebase, I'd prefer we resolve it upstream (at which point, the existing code you linked should "just work").

I'll push the next commit, and then shift to addressing the random_state vs. seed discrepancy in cuML.

@beckernick (Contributor, Author):
@weixuanfu I've included two example notebooks for review. Since this PR now adds a configuration, it should also update the documentation accordingly.

I'll do that shortly, but will first address the upstream cuML compatibility issue discussed above.

@beckernick beckernick changed the title Switch estimator operator logical checks to be interfaced based rather than inheritance based Switch estimator operator logical checks to be interface based rather than inheritance based Aug 20, 2020
@weixuanfu (Contributor) commented Aug 20, 2020

@beckernick thank you for submitting those examples.

I tried to test one of them on Colab but somehow it failed. Here is the link. Any idea?

Key runtime info:
cuML version 0.14
CUDA version 10.1

@beckernick (Contributor, Author) commented Aug 20, 2020

Ah, that's my mistake. While debugging I discovered one more necessary cuML change (PR) and ran on that build without thinking about it.

Apologies for not being clearer about the cuML status and timeline earlier. This will not work with cuML 0.14. RAPIDS cuML is scheduled to release 0.15 next week, which includes a significant number of enhancements that enable this functionality (aside from the bug addressed in the PR above). Conda packages for the cuML 0.16 nightly release are currently available. The examples will work with those "0.16 nightly" packages once the linked PR lands.

@weixuanfu (Contributor):
Great, I will test with the nightly build. Thank you for the information.

@weixuanfu (Contributor) commented Aug 20, 2020

Hmm, I cannot install the nightly build in Colab (stderr below). I think the reason is that Colab only supports Python 3.6 for now, which does not seem to work with RAPIDS 0.15. Will RAPIDS 0.15 support Python 3.6 later?

Installing RAPIDS 0.15 packages from the nightly release channel
Please standby, this will take a few minutes...
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed

SpecsConfigurationConflictError: Requested specs conflict with configured specs.
  requested specs: 
    - cudatoolkit=10.1
    - cudf=0.15
    - cugraph
    - cuml
    - cusignal
    - cuspatial
    - dask-cudf
    - gcsfs
    - pynvml
  pinned specs: 
    - python=3.6
    - xgboost

@beckernick (Contributor, Author) commented Aug 20, 2020

RAPIDS 0.15 and onward will not support Python 3.6. The rationale comes from the broader community policy in NEP 29 (NEP 29 — Recommend Python and NumPy version support as a community policy standard).

In particular, the NumPy recommended support table and drop schedule:

Support Table

| Date         | Python | NumPy |
| ------------ | ------ | ----- |
| Jan 07, 2020 | 3.6+   | 1.15+ |
| Jun 23, 2020 | 3.7+   | 1.15+ |
| Jul 23, 2020 | 3.7+   | 1.16+ |
| Jan 13, 2021 | 3.7+   | 1.17+ |
| Jul 26, 2021 | 3.7+   | 1.18+ |
| Dec 26, 2021 | 3.8+   | 1.18+ |
| Apr 14, 2023 | 3.9+   | 1.18+ |

Drop Table
On next release, drop support for Python 3.5 (initially released on Sep 13, 2015)
On Jan 07, 2020 drop support for Numpy 1.14 (initially released on Jan 06, 2018)
On Jun 23, 2020 drop support for Python 3.6 (initially released on Dec 23, 2016)
On Jul 23, 2020 drop support for Numpy 1.15 (initially released on Jul 23, 2018)
On Jan 13, 2021 drop support for Numpy 1.16 (initially released on Jan 13, 2019)
On Jul 26, 2021 drop support for Numpy 1.17 (initially released on Jul 26, 2019)
On Dec 26, 2021 drop support for Python 3.7 (initially released on Jun 27, 2018)
On Apr 14, 2023 drop support for Python 3.8 (initially released on Oct 14, 2019)

Given that the cuML bugfix PR above has not yet merged, I'm happy to ping you when this PR is ready for testing. I want to make sure to be respectful of your time. I appreciate your guidance and support on this PR 😄

@weixuanfu (Contributor) commented Aug 20, 2020

Thank you for the information and the heads-up. I will be on vacation next week, so I will wait and run another test after the stable release of cuML 0.15. (Then I will set up cuML 0.15 with a 2080 Ti GPU on my local compute node.)

@beckernick (Contributor, Author) commented Aug 20, 2020

Sounds great. Just to note, you'd need to use the 0.16 nightly conda package (either locally or in the cloud).

Enjoy your vacation! 🌴

Review comment on the documentation entry for the new configuration (snippet under review):

<td>TPOT will search over a restricted configuration using the GPU-accelerated estimators in <a href="https://github.com/rapidsai/cuml">RAPIDS cuML</a> and <a href="https://github.com/dmlc/xgboost">DMLC XGBoost</a>. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and that the library cuML is installed. With this configuration, all model training and predicting will be GPU-accelerated.
<br /><br />
This configuration is particularly useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.</td>
<td align="center"><a href="https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier_cuml.py">Classification</a>
@beckernick (Contributor, Author):
@weixuanfu for these documentation hyperlinks, should I hard-code the master branch like the existing documentation? Or, since this PR targets the development branch, should these refer to development?

@weixuanfu (Contributor):

No need for that; it will eventually be merged into the master branch.

@beckernick (Contributor, Author) commented Sep 3, 2020

@weixuanfu this PR should be ready to test with the 0.16 rapidsai-nightly release of cuML and standard XGBoost from conda-forge. If you want to build a small conda environment, the following should work:

tpot-pr.yml

channels:
  - rapidsai-nightly
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cuml
  - scikit-learn
  - ipython
  - ipywidgets
  - jupyterlab
  - xgboost
  - pip
  - pip:
    - jupyter-server-proxy
    - git+https://github.com/beckernick/tpot.git@feature/duck-typed-estimator-op-checks

conda env create -f tpot-pr.yml -n tpot-pr --force
conda activate tpot-pr

The above would need a different cudatoolkit version depending on the system.

In terms of loose memory requirements for the biggest example notebook: when I ran Higgs_Boson.ipynb as is (400k rows in the training set), peak GPU memory appeared to be about 2800 MB.

@weixuanfu (Contributor):
Thank you! I will check it next week.

@beckernick (Contributor, Author) commented Sep 14, 2020

To test whether this PR can make a material impact, I ran some small timed classification experiments (max_time_mins=60, 120, 240, and 480 minutes) using a population size of 30 on a 500,000-row sample from two reasonably standard ML datasets (Higgs Boson and Airline flights). Actual experiment times vary depending on how long each run took to finish after reaching the max_time_mins threshold (quite long, in one case). I used the conda environment in the tpot-pr.yml example above. The goal was to compare this PR with the TPOT default. Please do note that this is only a loose test, using dual Intel Xeon Platinum 8168 CPUs and one V100 GPU.
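
A rough sketch of the per-run setup described here (the time budget, population size, and random_state come from this comment and the results below; everything else is illustrative, and the exact scripts are in the gists linked at the end):

from tpot import TPOTClassifier

def run_experiment(X_train, y_train, use_gpu_config, max_time_mins):
    # One timed run of the comparison: the GPU-accelerated "TPOT cuML"
    # configuration vs. the TPOT default (config_dict=None).
    tpot = TPOTClassifier(
        max_time_mins=max_time_mins,  # 60, 120, 240, or 480
        population_size=30,
        config_dict="TPOT cuML" if use_gpu_config else None,
        random_state=12,
        verbosity=2,
    )
    tpot.fit(X_train, y_train)
    return tpot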

For both datasets, the GPU-accelerated PR configuration achieved higher accuracy in one hour than the default did in eight hours. Because this PR provides a restricted configuration set (somewhere in between TPOT-Light and TPOT-Default) with GPU-accelerated estimators, it is able to evaluate quite a few more individuals in a given amount of time on medium-to-large datasets. In these experiments, this allowed TPOT to find a higher-accuracy pipeline much faster. The graphs below provide a brief summary.

[Charts: tpot-higgs-accuracy-actual-time-hours, tpot-higgs-num-pipelines-actual-time-hours]

[Charts: tpot-airline-accuracy-actual-time-hours, tpot-airline-num-pipelines-actual-time-hours]

As a concrete example, in the eight-hour airlines dataset experiment, the final default pipeline achieved 87.2% cross-validation accuracy, while the final PR pipeline achieved 88.5%, an increase of 1.3 percentage points. In line with expectations, the fitted_pipeline is quite a bit more complex in the PR result, highlighting the impact of evaluating more pipelines.

TPOT Default Final Pipeline achieving 87.2% accuracy (8 Hour Experiment):

Pipeline(steps=[('extratreesclassifier',
                 ExtraTreesClassifier(bootstrap=True,
                                      max_features=0.9500000000000001,
                                      min_samples_split=11, random_state=12))])

TPOT PR Final Pipeline achieving 88.5% accuracy (8 Hour Experiment):

Pipeline(steps=[('zerocount-1', ZeroCount()),
                ('variancethreshold', VarianceThreshold(threshold=0.01)),
                ('selectpercentile', SelectPercentile(percentile=43)),
                ('pca',
                 PCA(iterated_power=5, random_state=12,
                     svd_solver='randomized')),
                ('zerocount-2', ZeroCount()),
                ('xgbclassifier',
                 XGBClassifier(alpha=1, base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsampl...
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.5,
                               max_delta_step=0, max_depth=9,
                               min_child_weight=3, missing=nan,
                               monotone_constraints='(0,0,0,0,0,0,0,0)',
                               n_estimators=100, n_jobs=1, nthread=1,
                               num_parallel_tree=1, random_state=12,
                               reg_alpha=1, reg_lambda=1, scale_pos_weight=1,
                               subsample=1.0, tree_method='gpu_hist',
                               validate_parameters=1, verbosity=None))])

I’ve included the experiment code I used in these two gists below:

tpot-benchmark.py
tpot-benchmark-config.yml

@weixuanfu (Contributor) commented Sep 14, 2020

Thanks @beckernick, I just tested it in a smaller experiment and it worked well in my environment. The only minor issue is that I got a lot of warning messages like [11:10:10.910600] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization. Do you know of a way to turn these off, or should the input X and y be converted to cuDF objects?

@beckernick (Contributor, Author):
The only minor issue is that I got a lot of warning messages like [11:10:10.910600] Expected column ('F') major order, but got the opposite.

Ah, thanks for catching this. It just needs a one-line change: cuML currently has verbose logging on by default. I'll update the PR.

Should the input X and y be converted to cuDF objects?

Today, no. To keep the interface consistent with the scikit-learn APIs that TPOT relies on, these conversions need to happen automatically inside the cuML methods rather than beforehand by the user.

@beckernick (Contributor, Author) commented Sep 15, 2020

The only minor issue is that I got a lot of warning messages like [11:10:10.910600] Expected column ('F') major order, but got the opposite.

After thinking it through, we've decided to make a change upstream to turn off these messages by default (rapidsai/cuml#2824). I've tested this PR against that upstream PR and can confirm that the "Expected column..." messages no longer appear. The upstream PR is approved and marked "Ready for Merge", but is slated to merge after rapidsai/cuml#2747 (also approved), which should happen imminently.

Would you consider this minor issue non-blocking for this PR, given it will be resolved shortly upstream?

EDIT: As of 6 PM EDT, both of the linked PRs have merged 👍

@weixuanfu (Contributor):
@beckernick this minor issue won't block us from merging this PR. I will merge it into the dev branch soon, and we will test it more before merging the dev branch into the master branch. Also, do you have any info on when the stable version of cuML 0.16 will be released? I hope we have a stable version of cuML before the next release of TPOT with this cool feature!

@weixuanfu weixuanfu merged commit 593763d into EpistasisLab:development Sep 15, 2020
@beckernick (Contributor, Author):
Sounds good! Please do let me know if there's anything I can do to help out.

The 0.16 release is currently scheduled for Wednesday, October 14th (release timeline). If there are any changes to that plan, I'll make sure to ping you as well.

@weixuanfu (Contributor):
Nice! Thank you for the info. I will test the new feature and prepare a new TPOT release in Oct.

@beckernick (Contributor, Author):
As a quick update, cuML 0.16 is on target for the planned October 14th release 😄

@weixuanfu (Contributor):
Thank you for the update. We will release a new version of TPOT with this PR right after cuML 0.16 release.

@beckernick (Contributor, Author):
cuML 0.16 is now planned for an October 21st release, a delay of 7 days.

@weixuanfu (Contributor):
OK, thank you for the update!

@beckernick (Contributor, Author):
cuML 0.16 is on track to release on October 21 (tomorrow)!

@weixuanfu (Contributor) commented Oct 20, 2020

Thank you! We will test TPOT with cuML 0.16 stable version and then make a release later this month if there is no further major issue.

@beckernick (Contributor, Author) commented Oct 22, 2020

Sounds good. cuML 0.16 has been released, so the following environment should work (I ran the example notebooks in the development branch with it this morning):

tpot-minimal.yml

channels:
  - rapidsai
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cuml=0.16
  - scikit-learn
  - ipython
  - ipywidgets
  - jupyterlab
  - pip
  - pip:
    - xgboost
    - git+https://github.com/epistasislab/tpot.git@development

conda env create -f tpot-minimal.yml -n tpot-minimal --force

Linked issue: Check estimator interface with duck typing rather than scikit-learn Mixin class inheritance