Gradient Boosting with XGBoost #81
I've been looking into XGBoost and I'm trying to understand what it adds over sklearn's implementation of GradientBoostingClassifier. Do you know?
Besides being highly optimized like tcfuji mentioned, I understand it also has the ability to be trained in a distributed fashion. Would it easily interface with pandas though?
@tcfuji: If it works better than sklearn's GradientBoostingClassifier, isn't incredibly slow (in comparison), and the XGBoost library isn't a pain to integrate with, then I'm not opposed to integrating XGBoost into TPOT. Are you free to do a small benchmark on, say, MNIST or CIFAR-10? I'd be interested to see performance in terms of accuracy and training time. @bartleyn: From my readings, the Python implementation of XGBoost has the exact same interface as all other sklearn classifiers. I don't think that would be a difficulty.
@bartleyn As Randy mentioned, the xgboost Python API makes it easy since it can construct its main data structure (DMatrix) from numpy arrays. Also, the …

@rhiever Sure, I'll work on it over the weekend. Want something similar to …
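(Not part of the thread, but for context, a minimal sketch of the two ways the Python package can be driven: the sklearn-style wrapper with the usual fit/predict interface, or the native DMatrix built directly from numpy arrays; a pandas DataFrame can be passed by way of its underlying array.)

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier

# Toy data; a pandas DataFrame works via its underlying numpy array (df.values)
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
df = pd.DataFrame(X)

# sklearn-style wrapper: same fit/predict interface as other sklearn estimators
clf = XGBClassifier()
clf.fit(df.values, y)
preds = clf.predict(df.values)

# native interface: DMatrix is constructed directly from numpy arrays
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=10)
```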
That sounds good to me, @tcfuji. Thank you!
@rhiever As we discussed yesterday, you wanted me to evaluate the performance of xgboost itself, not my fork. The results were better than I expected:

```python
from sklearn.datasets import load_digits, make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from time import perf_counter
import numpy as np

gb = GradientBoostingClassifier()
xgb = XGBClassifier()
```

MNIST:

```python
digits = load_digits()
X_train_digit, X_test_digit, y_train_digit, y_test_digit = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25)

start = perf_counter()
gb.fit(X_train_digit, y_train_digit)
print(gb.score(X_test_digit, y_test_digit))
print("%f seconds" % (perf_counter() - start))
```

0.957777777778

```python
start = perf_counter()
xgb.fit(X_train_digit, y_train_digit)
print(np.mean(xgb.predict(X_test_digit) == y_test_digit))
print("%f seconds" % (perf_counter() - start))
```

0.955555555556

Using the make_classification method:

```python
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_classes=10)
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(
    X, y, train_size=0.7, test_size=0.3)

start = perf_counter()
gb.fit(X_train_mc, y_train_mc)
print(gb.score(X_test_mc, y_test_mc))
print("%f seconds" % (perf_counter() - start))
```

0.494

```python
start = perf_counter()
xgb.fit(X_train_mc, y_train_mc)
print(np.mean(xgb.predict(X_test_mc) == y_test_mc))
print("%f seconds" % (perf_counter() - start))
```

0.513

With a few other variations of make_classification (changing the parameters), xgboost consistently performed about 14x faster than scikit GB. One caveat is that this speed increase is likely due to OpenMP. I think people with Linux and OS X (by running …
Thank you for running these benchmarks, @tcfuji! Is that a hard dependency on OpenMP, or is it optional? I'm concerned that making OpenMP a requirement for TPOT would cut down on its potential user base pretty significantly.
According to the xgboost docs (…
How do the benchmarks look without OpenMP?
Just ran the same code without OpenMP. As expected, not as fast but still consistently faster (about 2x to 4x) than scikit's GradientBoostingClassifier. If we can make OpenMP an optional dependency, I think xgboost would be a great addition.
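(A side note, not from the thread: if rebuilding without OpenMP is inconvenient, the single-threaded case can be roughly approximated by capping xgboost's thread count, assuming the wrapper exposes the nthread parameter as the sklearn API did at the time.)

```python
from xgboost import XGBClassifier

# Restrict xgboost to one thread so the timing comparison against
# scikit-learn's single-threaded GradientBoostingClassifier is closer to apples-to-apples.
xgb_single = XGBClassifier(nthread=1)
```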
Looks good to me. Just tried running the benchmarks myself and XGBoost looks like a solid improvement over the GradientBoostingClassifier. Easy to pip install too. Go ahead and put together a PR to replace the GradientBoostingClassifier with XGBoost. Thanks for looking into this, @tcfuji.
#83 merged.
👍
Hi Randy,
Thanks to XGBoost's scikit-learn API, it was not difficult to replace the scikit-learn GB with xgboost. I created a separate branch, available here: https://github.com/tcfuji/tpot/tree/xgboost
I tested it a little and it seems to be working. Here's an example of an exported pipeline:
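(The exported script itself didn't survive in this copy of the thread. As a purely hypothetical sketch, not the code that was actually posted, an XGBoost-based pipeline exported in roughly this style might look like the following, assuming a CSV with a 'class' column.)

```python
import pandas as pd
from sklearn.cross_validation import train_test_split
from xgboost import XGBClassifier

# Hypothetical example; the real exported script differs by TPOT version.
tpot_data = pd.read_csv('your_data.csv')  # placeholder path
features = tpot_data.drop('class', axis=1).values

training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'].values, random_state=42)

exported_pipeline = XGBClassifier(learning_rate=0.1, max_depth=3, n_estimators=100)
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
```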
Would this be a desirable addition to the master branch? Of course, this would require another dependency (XGBoost itself!).