<a href="https://colab.research.google.com/github/gisandnes/Extreme-Gradient-Boosting-with-XGBoost_DataCamp/blob/master/custom_objective.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
!rm -rf ../data
!mkdir -p ../data  # Create the folder for the downloads

!wget --no-verbose https://raw.githubusercontent.com/gisandnes/xgboost/master/demo/data/agaricus.txt.train -O ../data/agaricus.txt.train
!wget --no-verbose https://raw.githubusercontent.com/gisandnes/xgboost/master/demo/data/agaricus.txt.test  -O ../data/agaricus.txt.test

2019-03-09 18:09:45 URL:https://raw.githubusercontent.com/gisandnes/xgboost/master/demo/data/agaricus.txt.train [742257/742257] -> "../data/agaricus.txt.train" [1]
2019-03-09 18:09:47 URL:https://raw.githubusercontent.com/gisandnes/xgboost/master/demo/data/agaricus.txt.test [183611/183611] -> "../data/agaricus.txt.test" [1]


In [18]:
#!/usr/bin/python
import numpy as np
import xgboost as xgb
# Advanced: customized loss (objective) function
print('start running example to used customized objective function')

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

# Note: with a customized objective function, leave 'objective' at its default.
# Note: predictions are then raw margin values, not probabilities,
# so any code that consumes them must account for that.
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
#watchlist = [(dtest, 'eval'), (dtrain, 'train')]
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
num_round = 100

# User-defined objective function: given the predictions, return the gradient
# and the second-order gradient (hessian). This is the log-likelihood loss.
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

# User-defined evaluation function: return a pair (metric_name, result).
# NOTE: with a customized loss function, the default prediction value is the margin.
# This may make the built-in evaluation metrics behave incorrectly.
# For example, with logistic loss the prediction is the score before the logistic transformation,
# whereas the built-in error metric assumes its input is after the logistic transformation.
# Keep this in mind when customizing; you may need to write a customized evaluation function.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # Return a pair (metric_name, result). The metric name must not contain a colon (:) or a space.
    # preds are margins (before the logistic transformation), so the cutoff is at 0.
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

# Training with a customized objective; step-by-step training is also possible.
# See the implementation of train in xgboost.py.

# Case 1: obj and feval fit together (both work on margins)
bst = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=watchlist, obj=logregobj, feval=evalerror, early_stopping_rounds=5)
print("best_score:       ", bst.best_score)
print("best_iteration:   ", bst.best_iteration)
print("best_ntree_limit: ", bst.best_ntree_limit)

start running example to used customized objective function
[0]	train-rmse:1.59597	eval-rmse:1.59229	train-my-error:0.046522	eval-my-error:0.042831
Multiple eval metrics have been passed: 'eval-my-error' will be used for early stopping.

Will train until eval-my-error hasn't improved in 5 rounds.
[1]	train-rmse:2.40977	eval-rmse:2.40519	train-my-error:0.022263	eval-my-error:0.021726
[2]	train-rmse:2.87459	eval-rmse:2.88253	train-my-error:0.007063	eval-my-error:0.006207
[3]	train-rmse:3.63621	eval-rmse:3.62808	train-my-error:0.0152	eval-my-error:0.018001
[4]	train-rmse:3.83893	eval-rmse:3.80794	train-my-error:0.007063	eval-my-error:0.006207
[5]	train-rmse:3.96515	eval-rmse:3.9293	train-my-error:0.001228	eval-my-error:0
[6]	train-rmse:4.70775	eval-rmse:4.68611	train-my-error:0.001228	eval-my-error:0
[7]	train-rmse:5.6368	eval-rmse:5.6103	train-my-error:0.001228	eval-my-error:0
[8]	train-rmse:5.37778	eval-rmse:5.33298	train-my-error:0.001228	eval-my-error:0
[9]	train-rmse:5.76417	eval-rms
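
The `grad` and `hess` returned by `logregobj` above are the first and second derivatives of the logistic log-loss with respect to the raw margin. As a sanity check, the following sketch compares them against finite differences using NumPy only. The helper names `logloss` and `analytic_grad_hess`, and the sample margins and labels, are made up for illustration and are not part of the notebook or of xgboost.

```python
import numpy as np

def logloss(margin, label):
    # Negative log-likelihood of a 0/1 label, given a raw margin score.
    p = 1.0 / (1.0 + np.exp(-margin))
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))

def analytic_grad_hess(margin, label):
    # Same formulas as logregobj, written without a DMatrix (illustrative helper).
    p = 1.0 / (1.0 + np.exp(-margin))
    return p - label, p * (1.0 - p)

# Made-up sample values, chosen only to exercise both label classes.
margins = np.array([-2.0, -0.5, 0.0, 0.7, 3.0])
labels = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
eps = 1e-4

# Central differences for the first and second derivative w.r.t. the margin.
num_grad = (logloss(margins + eps, labels) - logloss(margins - eps, labels)) / (2 * eps)
num_hess = (logloss(margins + eps, labels) - 2 * logloss(margins, labels)
            + logloss(margins - eps, labels)) / eps**2

grad, hess = analytic_grad_hess(margins, labels)
print(np.allclose(grad, num_grad, atol=1e-6))
print(np.allclose(hess, num_hess, atol=1e-5))
```

If either comparison fails after changing the loss, the gradient or hessian formula probably no longer matches it.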

In [19]:
# Case 2: default objective function with the user-defined feval.
# evalerror cuts at 0, which assumes margins, so 'my-error' is misleading here.
bst2 = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=watchlist, feval=evalerror, early_stopping_rounds=5)
print("best_score:       ", bst2.best_score)
print("best_iteration:   ", bst2.best_iteration)
print("best_ntree_limit: ", bst2.best_ntree_limit)

[0]	train-rmse:0.208674	eval-rmse:0.200713	train-my-error:0.517887	eval-my-error:0.518312
Multiple eval metrics have been passed: 'eval-my-error' will be used for early stopping.

Will train until eval-my-error hasn't improved in 5 rounds.
[1]	train-rmse:0.12273	eval-rmse:0.127662	train-my-error:0.517887	eval-my-error:0.518312
[2]	train-rmse:0.094829	eval-rmse:0.091932	train-my-error:0.209888	eval-my-error:0.199255
[3]	train-rmse:0.07799	eval-rmse:0.073037	train-my-error:0.453094	eval-my-error:0.45748
[4]	train-rmse:0.070741	eval-rmse:0.069815	train-my-error:0.419622	eval-my-error:0.431409
[5]	train-rmse:0.06459	eval-rmse:0.060638	train-my-error:0.3883	eval-my-error:0.399131
[6]	train-rmse:0.059991	eval-rmse:0.055395	train-my-error:0.400123	eval-my-error:0.410925
[7]	train-rmse:0.057305	eval-rmse:0.052123	train-my-error:0.400123	eval-my-error:0.410925
Stopping. Best iteration:
[2]	train-rmse:0.094829	eval-rmse:0.091932	train-my-error:0.209888	eval-my-error:0.199255

best_score:        
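
In Case 2 the log reports `rmse` falling toward zero, which suggests the default objective is fitting predictions on the label scale (roughly 0 to 1) rather than logistic margins. If so, the cutoff at 0 in `evalerror` marks nearly every row as positive, which would explain an error near 0.5. A minimal sketch of a threshold-aware variant follows; `LabelHolder` and `make_evalerror` are hypothetical names, and `LabelHolder` only mimics the `get_label()` method of a `DMatrix`.

```python
import numpy as np

class LabelHolder:
    """Hypothetical stand-in for xgb.DMatrix; only get_label() is mimicked."""
    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=float)

    def get_label(self):
        return self._labels

def make_evalerror(threshold):
    # Build an evaluation function whose cutoff matches the prediction scale.
    def evalerror(preds, dmat):
        labels = dmat.get_label()
        return 'my-error', float(np.mean(labels != (preds > threshold)))
    return evalerror

# Made-up predictions on the label scale, all of them correct at a 0.5 cutoff.
data = LabelHolder([0, 1, 1, 0])
preds = np.array([0.1, 0.9, 0.8, 0.2])

name_margin, err_margin = make_evalerror(0.0)(preds, data)  # cutoff meant for margins
name_label, err_label = make_evalerror(0.5)(preds, data)    # cutoff meant for label-scale output
print(err_margin, err_label)
```

With a helper like this, `make_evalerror(0.5)` could be passed as `feval` when predictions are on the label scale, and `make_evalerror(0.0)` when they are margins.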

In [20]:
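# Case 3: user-defined objective function with the default evaluation metric.
# The default metric is computed on raw margins, so it is not a reliable early-stopping signal here.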
bst3 = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=watchlist, obj=logregobj, early_stopping_rounds=5)
print("best_score:       ", bst3.best_score)
print("best_iteration:   ", bst3.best_iteration)
print("best_ntree_limit: ", bst3.best_ntree_limit)

[0]	train-rmse:1.59597	eval-rmse:1.59229
Multiple eval metrics have been passed: 'eval-rmse' will be used for early stopping.

Will train until eval-rmse hasn't improved in 5 rounds.
[1]	train-rmse:2.40977	eval-rmse:2.40519
[2]	train-rmse:2.87459	eval-rmse:2.88253
[3]	train-rmse:3.63621	eval-rmse:3.62808
[4]	train-rmse:3.83893	eval-rmse:3.80794
[5]	train-rmse:3.96515	eval-rmse:3.9293
Stopping. Best iteration:
[0]	train-rmse:1.59597	eval-rmse:1.59229

best_score:        1.592294
best_iteration:    0
best_ntree_limit:  1
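
Case 3 is the reverse mismatch: the custom objective produces margins, while the logged `rmse` is computed directly against 0/1 labels. As the model grows more confident, margins move far from 0 and 1, so `rmse` on margins can rise even while the classifier improves. The sketch below illustrates this with made-up margins; `sigmoid` and `rmse` are illustrative helpers, not xgboost functions.

```python
import numpy as np

def sigmoid(x):
    # Logistic transformation from margin to probability.
    return 1.0 / (1.0 + np.exp(-x))

def rmse(preds, labels):
    return float(np.sqrt(np.mean((preds - labels) ** 2)))

labels = np.array([0.0, 1.0, 1.0, 0.0])
# Made-up margins: every row is classified correctly in both cases,
# but the second set is five times more confident.
margins_small = np.array([-1.0, 1.0, 1.0, -1.0])
margins_large = 5.0 * margins_small

rmse_margin_small = rmse(margins_small, labels)
rmse_margin_large = rmse(margins_large, labels)
rmse_prob_small = rmse(sigmoid(margins_small), labels)
rmse_prob_large = rmse(sigmoid(margins_large), labels)

print("rmse on margins:      ", rmse_margin_small, "->", rmse_margin_large)
print("rmse on probabilities:", rmse_prob_small, "->", rmse_prob_large)
```

On margins the metric gets worse as confidence grows, while on probabilities it gets better, which is why early stopping on the margin-based metric halts at iteration 0 in the log above.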


In [22]:
# Case 4: default objective function, default evaluation metric
bst4 = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=watchlist, early_stopping_rounds=5, verbose_eval=10)
print("best_score:       ", bst4.best_score)
print("best_iteration:   ", bst4.best_iteration)
print("best_ntree_limit: ", bst4.best_ntree_limit)

[0]	train-rmse:0.208674	eval-rmse:0.200713
Multiple eval metrics have been passed: 'eval-rmse' will be used for early stopping.

Will train until eval-rmse hasn't improved in 5 rounds.
[10]	train-rmse:0.04218	eval-rmse:0.037257
[20]	train-rmse:0.027572	eval-rmse:0.018735
[30]	train-rmse:0.016569	eval-rmse:0.012981
[40]	train-rmse:0.014828	eval-rmse:0.010438
[50]	train-rmse:0.0133	eval-rmse:0.008507
Stopping. Best iteration:
[53]	train-rmse:0.012907	eval-rmse:0.008088

best_score:        0.008088
best_iteration:    53
best_ntree_limit:  54
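
The stopping points in the logs above are consistent with a simple patience rule: track the best score seen so far and stop once `early_stopping_rounds` rounds have passed without a strict improvement. The sketch below replays that rule in plain Python on the `eval` values logged for Case 2 and Case 3. It is an assumption about the behaviour inferred from the logs, not xgboost's own implementation, and `simulate_early_stopping` is a hypothetical name.

```python
def simulate_early_stopping(scores, patience):
    """Return (best_iteration, stop_iteration) for a metric that should be minimized.

    stop_iteration is None if the scores run out before the rule triggers.
    """
    best_iteration = 0
    best_score = scores[0]
    for iteration, score in enumerate(scores):
        if score < best_score:  # only a strict improvement resets the counter
            best_score = score
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, None

# eval-my-error values logged in Case 2, rounds 0..7
case2 = [0.518312, 0.518312, 0.199255, 0.45748, 0.431409, 0.399131, 0.410925, 0.410925]
# eval-rmse values logged in Case 3, rounds 0..5
case3 = [1.59229, 2.40519, 2.88253, 3.62808, 3.80794, 3.9293]

case2_result = simulate_early_stopping(case2, patience=5)
case3_result = simulate_early_stopping(case3, patience=5)
print(case2_result)
print(case3_result)
```

In the logs, `best_ntree_limit` is `best_iteration + 1` (0 and 1 in Case 3, 53 and 54 in Case 4).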
