
Switch estimator operator logical checks to be interface based rather than inheritance based #1108

Conversation

@beckernick (Contributor) commented Aug 19, 2020

What does this PR do?

This PR:

  • Switches several inheritance-based operator/estimator checks to duck-typing-based checks (verifying against the estimator interface). The primary use of this is in evaluating whether an operator can be the root of a pipeline and setting the optype correctly. Currently, logical checks for whether an operator is an estimator of a certain category are done by checking whether it inherits from one of several scikit-learn Mixin classes. This PR switches these checks to evaluate whether the interface of the operator is consistent with the scikit-learn estimators, rather than requiring explicit subclassing.

  • Adds a new configuration, "TPOT cuML" (a minimal usage sketch follows this list). With this configuration, TPOT searches over a restricted configuration using the GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost. It requires an NVIDIA GPU of Pascal architecture or newer (compute capability 6.0+) and that the cuML library is installed. With this configuration, all model training and prediction is GPU-accelerated. It is particularly useful for medium-sized and larger datasets, on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.
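
For context, a minimal usage sketch of the new configuration (the dataset and parameter values below are illustrative, not taken from this PR):

import numpy as np
from tpot import TPOTClassifier

# Illustrative toy data; any NumPy-compatible input works.
X = np.random.rand(10_000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=10_000)

# Select the restricted, GPU-accelerated search space by name.
# Requires a compatible NVIDIA GPU and an installed cuML.
tpot = TPOTClassifier(
    config_dict="TPOT cuML",
    generations=5,
    population_size=20,
    verbosity=2,
)
tpot.fit(X, y)
print(tpot.score(X, y))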

Where should the reviewer start?

The reviewer should start in stacking_estimator.py, and then move to operator_utils.py. Next, they should look at base.py and then the new configuration options.

How should this PR be tested?

Originally, this PR did not introduce any new public-interface behavior or dependencies, so it could be tested with the existing tests; if desirable, I'm happy to add tests for the private helper methods in operator_utils.py (_is_selector and _is_transformer).

With the addition of the "TPOT cuML" configuration, this PR now adds new public-interface options and corresponding tests to the general testing suite. It can be tested with the standard tests.

  • Passes existing + new tests with nosetests -s -v (on my local machine)
Ran 256 tests in 81.943s
OK (SKIP=1)

Any background context you want to provide?

Currently, TPOT requires that estimators explicitly inherit from scikit-learn Mixin classes in order to determine the nature of an estimator operator within the TPOTOperatorClassFactory and StackingEstimator. This scikit-learn inheritance-based programming model provides consistency, but it also limits flexibility. Switching these checks to duck typing rather than inheritance preserves that consistency while also allowing users to use other libraries with TPOT, such as cuML, a GPU-based machine learning library. Using cuML with TPOT can provide significant speedups, as shown in the issue linked below.
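
As a rough illustration of the difference, here is a sketch of inheritance-based versus interface-based checks (the actual helpers in operator_utils.py and stacking_estimator.py may inspect different attributes):

from sklearn.base import TransformerMixin

# Inheritance-based check: only operators that subclass scikit-learn's
# mixins are recognized as transformers.
def _is_transformer_by_inheritance(estimator):
    return isinstance(estimator, TransformerMixin)

# Interface-based (duck-typed) check: any operator exposing the expected
# methods is treated as a transformer, whether or not it subclasses sklearn.
def _is_transformer_by_interface(estimator):
    return hasattr(estimator, "fit") and hasattr(estimator, "transform")

# Feature selectors additionally expose get_support().
def _is_selector_by_interface(estimator):
    return _is_transformer_by_interface(estimator) and hasattr(estimator, "get_support")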

What are the relevant issues?

This closes #1106

Screenshots (if appropriate)

From the linked issue, with the specified configuration and key parameters:

[Image: tpot-cuml-speedup]

Questions:

  • Do the docs need to be updated? Yes.
  • Does this PR add new (Python) dependencies? No. Only optional dependencies that a user can control independently.

@beckernick (Contributor, Author) commented Aug 19, 2020

@weixuanfu , I'd be happy to add "a demo for using TPOT with a cuML configuration similar to TPOT's default configurations." I do have a couple of questions to help make sure what I provide is useful.

  • Do you have a specific kind of demo format in mind (e.g., a Jupyter Notebook example)?
  • Would you like this to live in this PR, as a gist, or somewhere else? If in this PR, would you prefer it to go in the tutorials directory?

@coveralls commented Aug 19, 2020

Coverage Status

Coverage decreased (-0.07%) to 96.533% when pulling 3a74907 on beckernick:feature/duck-typed-estimator-op-checks into d887251 on EpistasisLab:development.

@weixuanfu (Contributor):
Thank you for the PR

A Jupyter notebook demo (like this one), stored in tutorials, should be fine.

Also, you could add the cuML configuration under tpot/config so it can be used simply via config_dict (something like config_dict="TPOT cuML"). That would also make it easier to compare the performance of the default TPOT configuration and the cuML configuration.

@beckernick (Contributor, Author) commented Aug 19, 2020

Also, you could add the cuML configuration under tpot/config so it can be used simply via config_dict (something like config_dict="TPOT cuML"). That would also make it easier to compare the performance of the default TPOT configuration and the cuML configuration.

Happy to add this. As a note, for now I plan to wrap this logic inside a check that makes sure cuml is available. Otherwise, a user could pass config_dict="TPOT cuML" and get a failure where they didn't expect one. If you have any thoughts on your preferred style for this, I'm open to suggestions. Currently, I plan to follow this pattern:

def _has_cuml():
    # Return True only if the optional cuml package can be imported.
    try:
        import cuml   # NOQA
        return True
    except ImportError:
        return False

to create a binary flag and raise an informative warning if False. If you'd prefer to not add something like this, that's fine with me too 😄
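
For concreteness, one way the flag could be wired in, building on the _has_cuml helper above (the helper name and message below are hypothetical, not the final implementation):

import warnings

def _check_cuml_config(config_dict):
    # Hypothetical guard: warn with an informative message if the cuML
    # configuration is requested but cuml cannot be imported.
    if config_dict == "TPOT cuML" and not _has_cuml():
        warnings.warn(
            "config_dict='TPOT cuML' was requested, but the cuml package "
            "could not be imported. Install RAPIDS cuML or choose another "
            "configuration."
        )
        return False
    return True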

Review thread on tpot/config/classifier_cuml.py (outdated, resolved).
@weixuanfu weixuanfu changed the base branch from master to development August 20, 2020 17:26
@weixuanfu (Contributor):
One issue with cuML is that its RandomForestClassifier uses seed instead of random_state to set the random seed. But in TPOT, we use random_state (see those lines) to set the random seeds of all operators in a pipeline. I think we need a workaround (such as adding set_param_recursive(sklearn_pipeline.steps, 'seed', self.random_state) there). But I hope cuML can update this to match scikit-learn's API.
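
For reference, the suggested workaround would look roughly like the following (a sketch only; set_param_recursive is TPOT's existing helper in tpot.export_utils, the pipeline here is illustrative, and running it requires a GPU environment with cuML installed):

from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive
from cuml.ensemble import RandomForestClassifier

# Illustrative pipeline containing a cuML estimator.
pipeline = make_pipeline(RandomForestClassifier())

# Existing behavior: propagate random_state to operators that accept it.
set_param_recursive(pipeline.steps, 'random_state', 42)

# Suggested workaround: also propagate the value under the name `seed`,
# which cuML 0.14's RandomForestClassifier uses instead of random_state.
set_param_recursive(pipeline.steps, 'seed', 42)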

@beckernick (Contributor, Author) commented Aug 20, 2020

Thanks for highlighting that discrepancy and the specific location in the code. We have an open issue.

I agree with your assessment. Rather than special case cuML in the TPOT codebase, I'd prefer we resolve it upstream (at which point, the existing code you linked should "just work").

I'll push the next commit, and then shift to addressing the random_state vs. seed discrepancy in cuML.

@beckernick (Contributor, Author):
@weixuanfu I've included two example notebooks for review. Since this PR now adds a configuration, it should also update the documentation accordingly.

I'll do that shortly, but will first address the upstream cuML compatibility issue discussed above.

@beckernick beckernick changed the title Switch estimator operator logical checks to be interfaced based rather than inheritance based Switch estimator operator logical checks to be interface based rather than inheritance based Aug 20, 2020
@weixuanfu (Contributor) commented Aug 20, 2020

@beckernick thank you for submitting those examples.

I tried to test one of them on Colab but somehow it failed. Here is the link. Any idea?

Key runtime info:
cuML version 0.14
CUDA version 10.1

@beckernick (Contributor, Author) commented Aug 20, 2020

Ah, that's my mistake. While debugging I discovered one more necessary cuML change (PR) and ran on that build without thinking about it.

Apologies for not being clearer about the cuML status and timeline earlier. This will not work with cuML 0.14. RAPIDS cuML is scheduled to release 0.15 next week, which includes a significant number of enhancements that enable this functionality (aside from the bug addressed in the PR above). Conda packages for the cuML 0.16 nightly release are currently available. The examples will work with those "0.16 nightly" packages once the linked PR lands.

@weixuanfu (Contributor):
Great, I will test with the nightly build. Thank you for the information.

@weixuanfu (Contributor) commented Aug 20, 2020

Hmm, I cannot install the nightly build in Colab (stderr below). I think the reason is that Colab only supports Python 3.6 for now, which does not seem to work with RAPIDS 0.15. Will RAPIDS 0.15 support Python 3.6 later?

Installing RAPIDS 0.15 packages from the nightly release channel
Please standby, this will take a few minutes...
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed

SpecsConfigurationConflictError: Requested specs conflict with configured specs.
  requested specs: 
    - cudatoolkit=10.1
    - cudf=0.15
    - cugraph
    - cuml
    - cusignal
    - cuspatial
    - dask-cudf
    - gcsfs
    - pynvml
  pinned specs: 
    - python=3.6
    - xgboost

@beckernick (Contributor, Author) commented Aug 20, 2020

RAPIDS 0.15 and onward will not support Python 3.6. The rationale comes from the broader community policy in NEP 29 (NEP 29 — Recommend Python and NumPy version support as a community policy standard).

In particular, the NumPy recommended support table and drop schedule:

Support Table

| Date         | Python | NumPy |
| ------------ | ------ | ----- |
| Jan 07, 2020 | 3.6+   | 1.15+ |
| Jun 23, 2020 | 3.7+   | 1.15+ |
| Jul 23, 2020 | 3.7+   | 1.16+ |
| Jan 13, 2021 | 3.7+   | 1.17+ |
| Jul 26, 2021 | 3.7+   | 1.18+ |
| Dec 26, 2021 | 3.8+   | 1.18+ |
| Apr 14, 2023 | 3.9+   | 1.18+ |

Drop Table
On next release, drop support for Python 3.5 (initially released on Sep 13, 2015)
On Jan 07, 2020 drop support for Numpy 1.14 (initially released on Jan 06, 2018)
On Jun 23, 2020 drop support for Python 3.6 (initially released on Dec 23, 2016)
On Jul 23, 2020 drop support for Numpy 1.15 (initially released on Jul 23, 2018)
On Jan 13, 2021 drop support for Numpy 1.16 (initially released on Jan 13, 2019)
On Jul 26, 2021 drop support for Numpy 1.17 (initially released on Jul 26, 2019)
On Dec 26, 2021 drop support for Python 3.7 (initially released on Jun 27, 2018)
On Apr 14, 2023 drop support for Python 3.8 (initially released on Oct 14, 2019)

Given that the cuML bugfix PR above has not yet merged, I'm happy to ping you when this PR is ready for testing. I want to make sure to be respectful of your time. I appreciate your guidance and support on this PR 😄

@weixuanfu (Contributor) commented Aug 20, 2020

Thank you for the information and the heads-up. I will be on vacation next week, so I will wait and run another test after the stable release of cuML 0.15. (Then I will set up cuML 0.15 with a 2080 Ti GPU on my local compute node.)

@beckernick (Contributor, Author) commented Aug 20, 2020

Sounds great. Just to note, you'd need to use the 0.16 nightly conda package (either locally or in the cloud).

Enjoy your vacation! 🌴

Review comment on the documentation entry for the new configuration (snippet under review):

<td>TPOT will search over a restricted configuration using the GPU-accelerated estimators in <a href="https://github.com/rapidsai/cuml">RAPIDS cuML</a> and <a href="https://github.com/dmlc/xgboost">DMLC XGBoost</a>. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and that the library cuML is installed. With this configuration, all model training and predicting will be GPU-accelerated.
<br /><br />
This configuration is particularly useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.</td>
<td align="center"><a href="https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier_cuml.py">Classification</a>
@beckernick (Contributor, Author):
@weixuanfu for these documentation hyperlinks, should I hard-code the master branch like the existing documentation? Or, since this PR targets the development branch, should these refer to development?

@weixuanfu (Contributor):

No need for that; it will eventually be merged into the master branch.

@beckernick (Contributor, Author) commented Sep 3, 2020

@weixuanfu this PR should be ready to test with the 0.16 rapidsai-nightly release of cuML and standard XGBoost from conda-forge. If you want to build a small conda environment, the following should work:

tpot-pr.yml

channels:
  - rapidsai-nightly
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cuml
  - scikit-learn
  - ipython
  - ipywidgets
  - jupyterlab
  - xgboost
  - pip
  - pip:
    - jupyter-server-proxy
    - git+https://github.com/beckernick/tpot.git@feature/duck-typed-estimator-op-checks

conda env create -f tpot-pr.yml -n tpot-pr --force
conda activate tpot-pr

The above would need a different cudatoolkit version depending on the system.

In terms of loose memory requirements for the biggest example notebook: when I ran Higgs_Boson.ipynb as is (400k rows in the training set), peak GPU memory appeared to be about 2800 MB.

@weixuanfu (Contributor):
Thank you! I will check it next week.

@beckernick (Contributor, Author) commented Sep 14, 2020

To test whether this PR can make a material impact, I ran some small timed classification experiments (max_time_mins=60, 120, 240, and 480 minutes) using a population size of 30 on a 500,000-row sample from two reasonably standard ML datasets (Higgs Boson and Airline flights). Actual experiment times vary depending on how long each run took to finish after reaching the max_time_mins threshold (quite long, in one case). I used the conda environment in the tpot-pr.yml example above. The goal was to compare this PR with the TPOT default. Please do note that this is only a loose test, using dual Intel Xeon Platinum 8168 CPUs and one V100 GPU.
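
A rough sketch of the per-run setup described here (the time budget, population size, and random_state come from this comment and the results below; everything else is illustrative, and the exact scripts are in the gists linked at the end):

from tpot import TPOTClassifier

def run_experiment(X_train, y_train, use_gpu_config, max_time_mins):
    # One timed run of the comparison: the GPU-accelerated "TPOT cuML"
    # configuration vs. the TPOT default (config_dict=None).
    tpot = TPOTClassifier(
        max_time_mins=max_time_mins,  # 60, 120, 240, or 480
        population_size=30,
        config_dict="TPOT cuML" if use_gpu_config else None,
        random_state=12,
        verbosity=2,
    )
    tpot.fit(X_train, y_train)
    return tpot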

For both datasets, the GPU-accelerated PR configuration achieved higher accuracy in one hour than the default did in eight hours. Because this PR provides a restricted configuration set (somewhere in between TPOT-Light and TPOT-Default) with GPU-accelerated estimators, it is able to evaluate quite a few more individuals in a given amount of time on medium-to-large datasets. In these experiments, this allowed TPOT to find a higher-accuracy pipeline much faster. The graphs below provide a brief summary.

[Charts: tpot-higgs-accuracy-actual-time-hours, tpot-higgs-num-pipelines-actual-time-hours]

[Charts: tpot-airline-accuracy-actual-time-hours, tpot-airline-num-pipelines-actual-time-hours]

As a concrete example, in the eight-hour airlines dataset experiment, the final default pipeline achieved 87.2% cross-validation accuracy, while the final PR pipeline achieved 88.5%, an increase of 1.3 percentage points. In line with expectations, the fitted_pipeline is quite a bit more complex in the PR result, highlighting the impact of evaluating more pipelines.

TPOT Default Final Pipeline achieving 87.2% accuracy (8 Hour Experiment):

Pipeline(steps=[('extratreesclassifier',
                 ExtraTreesClassifier(bootstrap=True,
                                      max_features=0.9500000000000001,
                                      min_samples_split=11, random_state=12))])

TPOT PR Final Pipeline achieving 88.5% accuracy (8 Hour Experiment):

Pipeline(steps=[('zerocount-1', ZeroCount()),
                ('variancethreshold', VarianceThreshold(threshold=0.01)),
                ('selectpercentile', SelectPercentile(percentile=43)),
                ('pca',
                 PCA(iterated_power=5, random_state=12,
                     svd_solver='randomized')),
                ('zerocount-2', ZeroCount()),
                ('xgbclassifier',
                 XGBClassifier(alpha=1, base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsampl...
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.5,
                               max_delta_step=0, max_depth=9,
                               min_child_weight=3, missing=nan,
                               monotone_constraints='(0,0,0,0,0,0,0,0)',
                               n_estimators=100, n_jobs=1, nthread=1,
                               num_parallel_tree=1, random_state=12,
                               reg_alpha=1, reg_lambda=1, scale_pos_weight=1,
                               subsample=1.0, tree_method='gpu_hist',
                               validate_parameters=1, verbosity=None))])

I’ve included the experiment code I used in these two gists below:

tpot-benchmark.py
tpot-benchmark-config.yml

@weixuanfu (Contributor) commented Sep 14, 2020

Thanks @beckernick, I just tested it in a smaller experiment and it worked well in my environment. The only minor issue is that I got a lot of warning messages like [11:10:10.910600] Expected column ('F') major order, but got the opposite. Converting data, this will result in additional memory utilization. Do you know of a way to turn these off, or should the input X and y be converted to cuDF objects?

@beckernick (Contributor, Author):
The only minor issue is that I got a lot of warning messages like [11:10:10.910600] Expected column ('F') major order, but got the opposite.

Ah, thanks for catching this. It just needs a one-line change: cuML currently has verbose logging on by default. I'll update the PR.

Should the input X and y be converted to cuDF objects?

Today, no. To keep the interface consistent with the scikit-learn APIs that TPOT relies on, these conversions need to happen automatically inside the cuML methods rather than beforehand by the user.

@beckernick (Contributor, Author) commented Sep 15, 2020

The only minor issue is that I got a lot of warning messages like [11:10:10.910600] Expected column ('F') major order, but got the opposite.

After thinking it through, we've decided to make a change upstream to turn off these messages by default (rapidsai/cuml#2824). I've tested this PR against that upstream PR and can confirm that the "Expected column..." messages no longer appear. The upstream PR is approved and marked "Ready for Merge", but is slated to merge after rapidsai/cuml#2747 (also approved), which should happen imminently.

Would you consider this minor issue non-blocking for this PR, given it will be resolved shortly upstream?

EDIT: As of 6 PM EDT, both of the linked PRs have merged 👍

@weixuanfu (Contributor):
@beckernick this minor issue won't block us from merging this PR. I will merge it into the dev branch soon, and we will test it more before merging the dev branch into the master branch. Also, do you have any info on when the stable version of cuML 0.16 will be released? I hope we have a stable version of cuML before the next release of TPOT with this cool feature!

@weixuanfu weixuanfu merged commit 593763d into EpistasisLab:development Sep 15, 2020
@beckernick (Contributor, Author):
Sounds good! Please do let me know if there's anything I can do to help out.

The 0.16 release is currently scheduled for Wednesday, October 14th (release timeline). If there are any changes to that plan, I'll make sure to ping you as well.

@weixuanfu (Contributor):
Nice! Thank you for the info. I will test the new feature and prepare a new TPOT release in Oct.

@beckernick (Contributor, Author):
As a quick update, cuML 0.16 is on target for the planned October 14th release 😄

@weixuanfu (Contributor):
Thank you for the update. We will release a new version of TPOT with this PR right after cuML 0.16 release.

@beckernick (Contributor, Author):
cuML 0.16 is now planned for an October 21st release, a delay of 7 days.

@weixuanfu (Contributor):
OK, thank you for the update!

@beckernick (Contributor, Author):
cuML 0.16 is on track to release on October 21 (tomorrow)!

@weixuanfu (Contributor) commented Oct 20, 2020

Thank you! We will test TPOT with cuML 0.16 stable version and then make a release later this month if there is no further major issue.

@beckernick (Contributor, Author) commented Oct 22, 2020

Sounds good. cuML 0.16 has been released, so the following environment should work (I ran the example notebooks in the development branch with it this morning):

tpot-minimal.yml

channels:
  - rapidsai
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - cudatoolkit=10.2
  - cuml=0.16
  - scikit-learn
  - ipython
  - ipywidgets
  - jupyterlab
  - pip
  - pip:
    - xgboost
    - git+https://github.com/epistasislab/tpot.git@development

conda env create -f tpot-minimal.yml -n tpot-minimal --force

Linked issue: Check estimator interface with duck typing rather than scikit-learn Mixin class inheritance