Gradient Boosting with XGBoost #81

Closed · tcfuji opened this issue Feb 9, 2016 · 15 comments

tcfuji commented Feb 9, 2016

Hi Randy,

Thanks to XGBoost's scikit-learn API, it was not difficult to replace scikit-learn's GradientBoostingClassifier with xgboost. I created a separate branch here: https://github.com/tcfuji/tpot/tree/xgboost

I tested it a little and it seems to be working. Here's an example of an exported pipeline:

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from xgboost import XGBClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))


result1 = tpot_data.copy()

# Perform classification with an eXtreme gradient boosting classifier
xgbc1 = XGBClassifier(learning_rate=0.01, n_estimators=42, max_depth=94)
xgbc1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices, 'class'].values)
result1['xgbc1-classification'] = xgbc1.predict(result1.drop('class', axis=1).values)

Would this be a desirable addition to the master branch? Of course, this would require another dependency (XGBoost itself!).

rhiever commented Feb 9, 2016

I've been looking into XGBoost and I'm trying to understand what it adds over sklearn's implementation of GradientBoostingClassifier. Do you know?

tcfuji commented Feb 9, 2016

It's mostly just a faster version of GradientBoostingClassifier:
http://auduno.com/post/96084011658/some-nice-ml-libraries

However, it's also mentioned in a number of Kaggle winning solutions, because gradient boosting apparently does quite well in those competitions:

1. https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
2. https://github.com/dmlc/xgboost/tree/master/demo/kaggle-higgs
3. https://github.com/daxiongshu/kaggle-tradeshift-winning-solution
4. http://blog.kaggle.com/2015/12/21/rossmann-store-sales-winners-interview-1st-place-gert/
5. http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/

(I just found out it has its own tag on the Kaggle blog: http://blog.kaggle.com/tag/xgboost/)

bartleyn commented Feb 9, 2016

Besides being highly optimized, as tcfuji mentioned, I understand it can also be trained in a distributed fashion. Would it easily interface with pandas, though?

rhiever commented Feb 9, 2016

@tcfuji: If it works better than sklearn's GradientBoostingClassifier, isn't incredibly slow (in comparison), and the XGBoost library isn't a pain to integrate with, then I'm not opposed to integrating XGBoost into TPOT. Are you free to do a small benchmark on, say, MNIST or CIFAR-10? I'd be interested to see performance in terms of accuracy and training time.

@bartleyn: From my readings, the Python implementation of XGBoost has the exact same interface as all other sklearn classifiers. I don't think that would be a difficulty.
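
For illustration, here is a minimal sketch (added here, not code from the thread) of what "same interface" means in practice; the dataset and split are arbitrary, and the only change between the two estimators is the class being instantiated:

from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Identical fit/score calls for both estimators.
for clf in (GradientBoostingClassifier(), XGBClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))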

tcfuji commented Feb 9, 2016

@bartleyn As Randy mentioned, the xgboost Python API makes it easy, since it can construct its main data structure (DMatrix) from numpy arrays. Also, TPOT's _train_model_and_predict method converts the pandas DataFrame inputs into numpy arrays (using the .values attribute).
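
To make that concrete, here is a minimal sketch (not from the thread; the column names are made up) of a DataFrame converted via .values feeding either the native DMatrix or the sklearn-style wrapper:

import numpy as np
import pandas as pd
import xgboost

df = pd.DataFrame({'f0': np.random.rand(100),
                   'f1': np.random.rand(100),
                   'class': np.random.randint(0, 2, 100)})

features = df.drop('class', axis=1).values  # numpy array, as TPOT passes internally
labels = df['class'].values

dtrain = xgboost.DMatrix(features, label=labels)     # XGBoost's native data structure
clf = xgboost.XGBClassifier().fit(features, labels)  # sklearn-style wrapper, same arrays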

@rhiever Sure, I'll work on it over the weekend. Do you want something similar to tutorials/IRIS.ipynb and tutorials/MNIST.ipynb, but also keeping track of the training time?

rhiever commented Feb 9, 2016

That sounds good to me, @tcfuji. Thank you!

tcfuji commented Feb 16, 2016

@rhiever As we discussed yesterday, you wanted me to evaluate the performance of xgboost itself, not my fork.

The results were better than I expected:

from sklearn.datasets import load_digits, make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from time import perf_counter
import numpy as np

gb = GradientBoostingClassifier()
xgb = XGBClassifier()

MNIST (sklearn's load_digits):

digits = load_digits()
X_train_digit, X_test_digit, y_train_digit, y_test_digit = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

start = perf_counter()
gb.fit(X_train_digit, y_train_digit)
print(gb.score(X_test_digit, y_test_digit))
print("%f seconds" % (perf_counter() - start))

0.957777777778
6.918697 seconds

start = perf_counter()
xgb.fit(X_train_digit, y_train_digit)
print(np.mean(xgb.predict(X_test_digit) == y_test_digit))
print("%f seconds" % (perf_counter() - start))

0.955555555556
1.720479 seconds


Using the make_classification function:

X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y,
                                                    train_size=0.7, test_size=0.3)
start = perf_counter()
gb.fit(X_train_mc, y_train_mc)
print(gb.score(X_test_mc, y_test_mc))
print("%f seconds" % (perf_counter() - start))

0.494
763.447380 seconds

start = perf_counter()
xgb.fit(X_train_mc, y_train_mc)
print(np.mean(xgb.predict(X_test_mc) == y_test_mc))
print("%f seconds" % (perf_counter() - start))

0.513
52.425525 seconds

With a few other variations of make_classification (changing the parameters), xgboost consistently performed about 14x faster than scikit GB.

One caveat is that this speed increase is likely due to OpenMP. For people on Linux, or on OS X after running brew install gcc --without-multilib, this shouldn't be a problem, but it's still another dependency.
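
One way to gauge how much of that gap comes from OpenMP parallelism, without rebuilding the library, would be to pin XGBoost to a single thread. This is a sketch, not part of the original benchmark, and it assumes the nthread parameter of the sklearn wrapper from that era (newer releases call it n_jobs):

from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)

# Limit XGBoost to one thread to approximate a build without OpenMP parallelism.
start = perf_counter()
xgb_single = XGBClassifier(nthread=1)  # n_jobs=1 in newer xgboost releases
xgb_single.fit(X_train, y_train)
print(xgb_single.score(X_test, y_test))
print("%f seconds" % (perf_counter() - start))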

rhiever commented Feb 16, 2016

Thank you for running these benchmarks, @tcfuji! Is that a hard dependency on OpenMP, or is it optional? I'm concerned that making OpenMP a requirement for TPOT would cut down on its potential user base pretty significantly.

tcfuji commented Feb 16, 2016

According to the xgboost docs (https://xgboost.readthedocs.org/en/latest/build.html), it does not appear to be a hard dependency.

rhiever commented Feb 17, 2016

How do the benchmarks look without OpenMP?

On Tuesday, February 16, 2016, Ted notifications@github.com wrote:

According to the xgboost docs (
https://xgboost.readthedocs.org/en/latest/build.html), it does not appear
to be a hard dependency.

On Tue, Feb 16, 2016 at 6:27 PM Randy Olson <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');>
wrote:

Thank you for running these benchmarks, @tcfuji
https://github.com/tcfuji! Is that a hard dependency on OpenMP, or is
it optional? I'm concerned that making OpenMP a requirement for TPOT
would
cut down on its potential user base pretty significantly.


Reply to this email directly or view it on GitHub
#81 (comment).


Reply to this email directly or view it on GitHub
#81 (comment).

Randal S. Olson, Ph.D.
Postdoctoral Researcher, Institute for Biomedical Informatics
University of Pennsylvania

E-mail: rso@randalolson.com | Twitter: @randal_olson
https://twitter.com/randal_olson
http://www.randalolson.com

@tcfuji
Copy link
Contributor Author

tcfuji commented Feb 17, 2016

I just ran the same code without OpenMP. As expected, it's not as fast, but it's still consistently faster (about 2x to 4x) than scikit-learn's GradientBoostingClassifier.

If we can make OpenMP an optional dependency, I think xgboost would be a great addition.

rhiever commented Feb 17, 2016

Looks good to me. Just tried running the benchmarks myself and XGBoost looks like a solid improvement over the GradientBoostingClassifier. Easy to pip install too. Go ahead and put together a PR to replace the GradientBoostingClassifier with XGBoost.

Thanks for looking into this, @tcfuji.

tcfuji commented Feb 17, 2016

#83

rhiever commented Feb 19, 2016

#83 merged.

rhiever closed this as completed Feb 19, 2016

tcfuji commented Feb 19, 2016

👍
