Gradient Boosting with XGBoost #81

Closed · tcfuji opened this issue Feb 9, 2016 · 15 comments

tcfuji commented Feb 9, 2016

Hi Randy,

Thanks to XGBoost's scikit-learn API, it was not difficult to replace scikit-learn's GradientBoostingClassifier with xgboost. I created a separate branch here: https://github.com/tcfuji/tpot/tree/xgboost

I tested it a little and it seems to be working. Here's an example of an exported pipeline:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from xgboost import XGBClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


result1 = tpot_data.copy()

# Perform classification with an eXtreme gradient boosting classifier
xgbc1 = XGBClassifier(learning_rate=0.01, n_estimators=42, max_depth=94)
xgbc1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices, 'class'].values)
result1['xgbc1-classification'] = xgbc1.predict(result1.drop('class', axis=1).values)

Would this be a desirable addition to the master branch? Of course, this would require another dependency (XGBoost itself!).

rhiever commented Feb 9, 2016

I've been looking into XGBoost and I'm trying to understand what it adds over sklearn's implementation of GradientBoostingClassifier. Do you know?

tcfuji commented Feb 9, 2016

It's mostly just a faster version of GradientBoostingClassifier:
http://auduno.com/post/96084011658/some-nice-ml-libraries

However, it's also mentioned in a number of Kaggle winning solutions, because gradient boosting apparently does quite well in those competitions:

1. https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
2. https://github.com/dmlc/xgboost/tree/master/demo/kaggle-higgs
3. https://github.com/daxiongshu/kaggle-tradeshift-winning-solution
4. http://blog.kaggle.com/2015/12/21/rossmann-store-sales-winners-interview-1st-place-gert/
5. http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/

(I just found out it has its own tag on the Kaggle blog: http://blog.kaggle.com/tag/xgboost/)

bartleyn commented Feb 9, 2016

Besides being highly optimized, as tcfuji mentioned, I understand it can also be trained in a distributed fashion. Would it easily interface with pandas, though?

rhiever commented Feb 9, 2016

@tcfuji: If it works better than sklearn's GradientBoostingClassifier, isn't incredibly slow (in comparison), and the XGBoost library isn't a pain to integrate with, then I'm not opposed to integrating XGBoost into TPOT. Are you free to do a small benchmark on, say, MNIST or CIFAR-10? I'd be interested to see performance in terms of accuracy and training time.

@bartleyn: From my readings, the Python implementation of XGBoost has the exact same interface as all other sklearn classifiers. I don't think that would be a difficulty.
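
For illustration, here is a minimal sketch (added here, not code from the thread) of what "same interface" means in practice; the dataset and split are arbitrary, and the only change between the two estimators is the class being instantiated:

from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Identical fit/score calls for both estimators.
for clf in (GradientBoostingClassifier(), XGBClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))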

tcfuji commented Feb 9, 2016

@bartleyn As Randy mentioned, the xgboost Python API makes it easy, since it can construct its main data structure (DMatrix) from numpy arrays. Also, TPOT's _train_model_and_predict method converts the pandas DataFrame inputs into numpy arrays (using the .values attribute).
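
To make that concrete, here is a minimal sketch (not from the thread; the column names are made up) of a DataFrame converted via .values feeding either the native DMatrix or the sklearn-style wrapper:

import numpy as np
import pandas as pd
import xgboost

df = pd.DataFrame({'f0': np.random.rand(100),
                   'f1': np.random.rand(100),
                   'class': np.random.randint(0, 2, 100)})

features = df.drop('class', axis=1).values  # numpy array, as TPOT passes internally
labels = df['class'].values

dtrain = xgboost.DMatrix(features, label=labels)     # XGBoost's native data structure
clf = xgboost.XGBClassifier().fit(features, labels)  # sklearn-style wrapper, same arrays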

@rhiever Sure, I'll work on it over the weekend. Do you want something similar to tutorials/IRIS.ipynb and tutorials/MNIST.ipynb, but also keeping track of the training time?

rhiever commented Feb 9, 2016

That sounds good to me, @tcfuji. Thank you!

tcfuji commented Feb 16, 2016

@rhiever As we discussed yesterday, you wanted me to evaluate the performance of xgboost itself, not my fork.

The results were better than I expected:

from sklearn.datasets import load_digits, make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from time import perf_counter
import numpy as np

gb = GradientBoostingClassifier()
xgb = XGBClassifier()

MNIST (sklearn's load_digits):

digits = load_digits()
X_train_digit, X_test_digit, y_train_digit, y_test_digit = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

start = perf_counter()
gb.fit(X_train_digit, y_train_digit)
print(gb.score(X_test_digit, y_test_digit))
print("%f seconds" % (perf_counter() - start))

0.957777777778
6.918697 seconds

start = perf_counter()
xgb.fit(X_train_digit, y_train_digit)
print(np.mean(xgb.predict(X_test_digit) == y_test_digit))
print("%f seconds" % (perf_counter() - start))

0.955555555556
1.720479 seconds


Using the make_classification function:

X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y,
                                                    train_size=0.7, test_size=0.3)
start = perf_counter()
gb.fit(X_train_mc, y_train_mc)
print(gb.score(X_test_mc, y_test_mc))
print("%f seconds" % (perf_counter() - start))

0.494
763.447380 seconds

start = perf_counter()
xgb.fit(X_train_mc, y_train_mc)
print(np.mean(xgb.predict(X_test_mc) == y_test_mc))
print("%f seconds" % (perf_counter() - start))

0.513
52.425525 seconds

With a few other variations of make_classification (changing the parameters), xgboost consistently performed about 14x faster than scikit GB.

One caveat is that this speed increase is likely due to OpenMP. For people on Linux, or on OS X after running brew install gcc --without-multilib, this shouldn't be a problem, but it's still another dependency.
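
One way to gauge how much of that gap comes from OpenMP parallelism, without rebuilding the library, would be to pin XGBoost to a single thread. This is a sketch, not part of the original benchmark, and it assumes the nthread parameter of the sklearn wrapper from that era (newer releases call it n_jobs):

from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)

# Limit XGBoost to one thread to approximate a build without OpenMP parallelism.
start = perf_counter()
xgb_single = XGBClassifier(nthread=1)  # n_jobs=1 in newer xgboost releases
xgb_single.fit(X_train, y_train)
print(xgb_single.score(X_test, y_test))
print("%f seconds" % (perf_counter() - start))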

rhiever commented Feb 16, 2016

Thank you for running these benchmarks, @tcfuji! Is that a hard dependency on OpenMP, or is it optional? I'm concerned that making OpenMP a requirement for TPOT would cut down on its potential user base pretty significantly.

tcfuji commented Feb 16, 2016

According to the xgboost docs (https://xgboost.readthedocs.org/en/latest/build.html), it does not appear to be a hard dependency.

rhiever commented Feb 17, 2016

How do the benchmarks look without OpenMP?

On Tuesday, February 16, 2016, Ted notifications@github.com wrote:

According to the xgboost docs (
https://xgboost.readthedocs.org/en/latest/build.html), it does not appear
to be a hard dependency.

On Tue, Feb 16, 2016 at 6:27 PM Randy Olson <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');>
wrote:

Thank you for running these benchmarks, @tcfuji
https://github.com/tcfuji! Is that a hard dependency on OpenMP, or is
it optional? I'm concerned that making OpenMP a requirement for TPOT
would
cut down on its potential user base pretty significantly.


Reply to this email directly or view it on GitHub
#81 (comment).


Reply to this email directly or view it on GitHub
#81 (comment).

Randal S. Olson, Ph.D.
Postdoctoral Researcher, Institute for Biomedical Informatics
University of Pennsylvania

E-mail: rso@randalolson.com | Twitter: @randal_olson
https://twitter.com/randal_olson
http://www.randalolson.com

@tcfuji
Copy link
Contributor Author

tcfuji commented Feb 17, 2016

I just ran the same code without OpenMP. As expected, it's not as fast, but it's still consistently faster (about 2x to 4x) than scikit-learn's GradientBoostingClassifier.

If we can make OpenMP an optional dependency, I think xgboost would be a great addition.

rhiever commented Feb 17, 2016

Looks good to me. Just tried running the benchmarks myself and XGBoost looks like a solid improvement over the GradientBoostingClassifier. Easy to pip install too. Go ahead and put together a PR to replace the GradientBoostingClassifier with XGBoost.

Thanks for looking into this, @tcfuji.

tcfuji commented Feb 17, 2016

#83

rhiever commented Feb 19, 2016

#83 merged.

rhiever closed this as completed Feb 19, 2016

tcfuji commented Feb 19, 2016

👍
