# Tuning XGBoost

### [Advice from a Kaggle forum](https://www.kaggle.com/general/4092)
1. tune the number of candidate feature for splitting (starting with min_samples_leaf=1)
2. tune the depth of the trees (min_samples_leaf)
3. tune the number of trees

### Original XGBoost article : [Greedy Function Approximation: A Gradient Boosting Machine](http://www.jstor.org/stable/2699986?seq=1#page_scan_tab_contents)

### sklearn's [tips for parameter tuning](http://scikit-learn.org/stable/modules/grid_search.html#tips-for-parameter-search)

### Analytics Vidhya [tuning random forest](https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/)

In [31]:
import pandas as pd
import numpy as np
import time
import importlib.machinery
es = importlib.machinery.SourceFileLoader('extrasense','/home/sac086/extrasensory/extrasense/extrasense.py').load_module()
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GroupShuffleSplit, GroupKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

In [63]:
import xgboost as xgb

### Load Data

In [2]:
features_df = es.get_impersonal_data(leave_users_out=[], data_type="activity", labeled_only=False)

# remove nan rows
no_label_indeces = features_df.label.isnull()
features_df = features_df[~no_label_indeces]

timestamps = features_df.pop('timestamp')
label_source = features_df.pop("label_source")
labels = features_df.pop("label")
user_ids = features_df.pop("user_id")

In [3]:
resulting_scores = {}

## Score to beat
* In previous trials, the best parameters were n_estimators=100, max_features="log2", min_samples_leaf=1

In [6]:
# set up the accuracy scorer for cross validation
scorer = make_scorer(accuracy_score)
resulting_scores = {}

### ZeroR

In [7]:
steps = []
steps.append(('standardize', StandardScaler()))
steps.append(('ZeroR', es.ZeroR()))
clf = Pipeline(steps)

In [9]:
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=20, scoring=scorer)
resulting_scores["ZeroR"]=score

# Try [XGBoost](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/)

# Boosting algos
* [Machine Learning Mastery's page on gradient boosting, lots of other resources too!](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
* [Kaggle master's blog on gradient descent](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)
* [Notes on tuning from the XGBoost docs](https://xgboost.readthedocs.io/en/latest/how_to/param_tuning.html)


In [10]:
from xgboost import XGBClassifier

In [15]:
xgboost.cv?

Object `xgboost.cv` not found.


In [17]:
steps = [('standardize', StandardScaler()),
         ('clf', XGBClassifier())]
clf = Pipeline(steps)

In [18]:
cv = cv=GroupKFold(n_splits=2).split(features_df, labels, user_ids)
iteration = next(cv)
X_train = features_df.iloc[iteration[0]]
y_train = labels.iloc[iteration[0]]

X_test = features_df.iloc[iteration[1]]
y_test = labels.iloc[iteration[1]]

In [19]:
clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))])

In [20]:
predictions = clf.predict(X_test)

In [21]:
accuracy_score(y_test, predictions)

0.65276582427040142

In [25]:
training_predictions = clf.predict(X_train)
accuracy_score(y_train, training_predictions)

0.55462198046807643

## ^^^ Best score yet and it's not even optimized!

### Optimizing 1 : Control Overfitting

from docs:
When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem.

There are in general two ways that you can control overfitting in xgboost

* The first way is to directly control model complexity
    * This include max_depth, min_child_weight and gamma
* The second way is to add randomness to make training robust to noise
    * This include subsample, colsample_bytree

You can also reduce stepsize eta, but needs to remember to increase num_round when you do so.

### How much are we overfitting?

In [28]:
training_sizes = [1,5,10,15,20,25,30,35] # this is the number of participants whose data we are including

test_size = 5

rows = []
for ts in training_sizes:
    # setup classification pipeline
    steps = [('standardize', StandardScaler()),
         ('clf', XGBClassifier())]
    clf = Pipeline(steps)
    
    # set up splitter
    splitter = GroupShuffleSplit(n_splits=3, test_size = test_size, train_size=ts)
    
    for train_ind, test_ind in splitter.split(features_df, labels, groups=user_ids):
        X_train = features_df.iloc[train_ind]
        y_train = labels.iloc[train_ind]

        X_test = features_df.iloc[test_ind]
        y_test = labels.iloc[test_ind]
        
        clf.fit(X_train, y_train)
        
        training_predictions = clf.predict(X_train)
        training_score = accuracy_score(y_train, training_predictions)
        
        test_predictions = clf.predict(X_test)
        test_score = accuracy_score(y_test, test_predictions)
        
        print("With %s users for training : Training Score=%.3f, Test Score=%.3f" % (ts, training_score, test_score))
        
        row = {"test score" : test_score,
               "training score" : training_score,
               "test instances" : len(y_test),
               "training instances" : len(y_train),
               "training users" : ts}
        rows.append(row)

With 1 users for training : Training Score=0.942, Test Score=0.517
With 1 users for training : Training Score=0.859, Test Score=0.437
With 1 users for training : Training Score=0.865, Test Score=0.280
With 5 users for training : Training Score=0.803, Test Score=0.582
With 5 users for training : Training Score=0.826, Test Score=0.553
With 5 users for training : Training Score=0.816, Test Score=0.518
With 10 users for training : Training Score=0.783, Test Score=0.605
With 10 users for training : Training Score=0.818, Test Score=0.480
With 10 users for training : Training Score=0.771, Test Score=0.607
With 15 users for training : Training Score=0.746, Test Score=0.558
With 15 users for training : Training Score=0.742, Test Score=0.553
With 15 users for training : Training Score=0.728, Test Score=0.611
With 20 users for training : Training Score=0.749, Test Score=0.590
With 20 users for training : Training Score=0.748, Test Score=0.576
With 20 users for training : Training Score=0.719, Tes

In [29]:
results_df = pd.DataFrame(rows)

In [30]:
for ts in training_sizes:
    ts_df = results_df[results_df['training users'] == ts]
    training_score = ts_df['training score']
    test_score = ts_df['test score']
    print("TS = %s" % ts)
    print("\ttest : M=%.3f, SD=%.3f" % (test_score.mean(), np.std(test_score)))
    print("\ttrain : M=%.3f, SD=%.3f" % (training_score.mean(), np.std(training_score)))

TS = 1
	test : M=0.411, SD=0.098
	train : M=0.889, SD=0.038
TS = 5
	test : M=0.551, SD=0.026
	train : M=0.815, SD=0.009
TS = 10
	test : M=0.564, SD=0.059
	train : M=0.791, SD=0.020
TS = 15
	test : M=0.574, SD=0.026
	train : M=0.739, SD=0.008
TS = 20
	test : M=0.599, SD=0.024
	train : M=0.739, SD=0.014
TS = 25
	test : M=0.565, SD=0.023
	train : M=0.721, SD=0.008
TS = 30
	test : M=0.598, SD=0.020
	train : M=0.703, SD=0.002
TS = 35
	test : M=0.646, SD=0.021
	train : M=0.709, SD=0.007


### Random Grid Tuning

In [46]:
param_grid = {"clf__max_depth" : [3,4,5,6,7,8,9,10],
              "clf__min_child_weight" : [1,2,3]}
steps = [('standardize', StandardScaler()),
         ('clf', XGBClassifier())]
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gscv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=12, verbose=2)

In [47]:
gscv.fit(features_df, labels)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] clf__max_depth=3, clf__min_child_weight=1 .......................
[CV] clf__max_depth=3, clf__min_child_weight=1 .......................
[CV] clf__max_depth=3, clf__min_child_weight=1 .......................
[CV] clf__max_depth=3, clf__min_child_weight=2 .......................
[CV] clf__max_depth=3, clf__min_child_weight=2 .......................
[CV] clf__max_depth=3, clf__min_child_weight=2 .......................
[CV] clf__max_depth=3, clf__min_child_weight=3 .......................
[CV] clf__max_depth=3, clf__min_child_weight=3 .......................
[CV] clf__max_depth=3, clf__min_child_weight=3 .......................
[CV] clf__max_depth=4, clf__min_child_weight=1 .......................
[CV] clf__max_depth=4, clf__min_child_weight=1 .......................
[CV] clf__max_depth=4, clf__min_child_weight=1 .......................
[CV] ........ clf__max_depth=3, clf__min_child_weight=1, total= 3.9min
[CV] clf__max_de

[Parallel(n_jobs=12)]: Done  17 tasks      | elapsed:  9.8min


[CV] ........ clf__max_depth=4, clf__min_child_weight=3, total= 5.4min
[CV] clf__max_depth=6, clf__min_child_weight=1 .......................
[CV] ........ clf__max_depth=5, clf__min_child_weight=1, total= 6.2min
[CV] clf__max_depth=6, clf__min_child_weight=2 .......................
[CV] ........ clf__max_depth=5, clf__min_child_weight=1, total= 6.7min
[CV] clf__max_depth=6, clf__min_child_weight=2 .......................
[CV] ........ clf__max_depth=5, clf__min_child_weight=1, total= 6.6min
[CV] clf__max_depth=6, clf__min_child_weight=2 .......................
[CV] ........ clf__max_depth=5, clf__min_child_weight=2, total= 6.2min
[CV] clf__max_depth=6, clf__min_child_weight=3 .......................
[CV] ........ clf__max_depth=5, clf__min_child_weight=2, total= 6.6min
[CV] clf__max_depth=6, clf__min_child_weight=3 .......................
[CV] ........ clf__max_depth=5, clf__min_child_weight=2, total= 6.7min
[CV] clf__max_depth=6, clf__min_child_weight=3 .......................
[CV] .

[Parallel(n_jobs=12)]: Done  72 out of  72 | elapsed: 55.8min finished


GridSearchCV(cv=[(array([  9347,   9348, ..., 307241, 307242]), array([     0,      1, ..., 301622, 301623])), (array([     0,      1, ..., 307241, 307242]), array([  9347,   9348, ..., 305083, 305084])), (array([     0,      1, ..., 305083, 305084]), array([ 19009,  19010, ..., 307241, 307242]))],
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardize', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_w...       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1))]),
       fit_params=None, iid=True, n_jobs=12,
       param_grid={'clf__max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'clf__min_child_weight': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

In [51]:
gscv.best_params_

{'clf__max_depth': 3, 'clf__min_child_weight': 1}

In [52]:
gscv.best_score_

0.63768743307414655

In [54]:
gscv.cv_results_.keys()

dict_keys(['rank_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'split0_test_score', 'std_train_score', 'std_test_score', 'split0_train_score', 'param_clf__min_child_weight', 'split1_test_score', 'mean_fit_time', 'mean_train_score', 'mean_test_score', 'param_clf__max_depth', 'split2_train_score', 'params', 'split1_train_score', 'mean_score_time'])

In [58]:
gscv.cv_results_['mean_train_score']

0.79484629838441145

In [61]:
gscv.cv_results_['params']

[{'clf__max_depth': 3, 'clf__min_child_weight': 1},
 {'clf__max_depth': 3, 'clf__min_child_weight': 2},
 {'clf__max_depth': 3, 'clf__min_child_weight': 3},
 {'clf__max_depth': 4, 'clf__min_child_weight': 1},
 {'clf__max_depth': 4, 'clf__min_child_weight': 2},
 {'clf__max_depth': 4, 'clf__min_child_weight': 3},
 {'clf__max_depth': 5, 'clf__min_child_weight': 1},
 {'clf__max_depth': 5, 'clf__min_child_weight': 2},
 {'clf__max_depth': 5, 'clf__min_child_weight': 3},
 {'clf__max_depth': 6, 'clf__min_child_weight': 1},
 {'clf__max_depth': 6, 'clf__min_child_weight': 2},
 {'clf__max_depth': 6, 'clf__min_child_weight': 3},
 {'clf__max_depth': 7, 'clf__min_child_weight': 1},
 {'clf__max_depth': 7, 'clf__min_child_weight': 2},
 {'clf__max_depth': 7, 'clf__min_child_weight': 3},
 {'clf__max_depth': 8, 'clf__min_child_weight': 1},
 {'clf__max_depth': 8, 'clf__min_child_weight': 2},
 {'clf__max_depth': 8, 'clf__min_child_weight': 3},
 {'clf__max_depth': 9, 'clf__min_child_weight': 1},
 {'clf__max_

In [62]:
for i in range(len(gscv.cv_results_['mean_train_score'])):
    train_score = gscv.cv_results_['mean_train_score'][i]
    test_score = gscv.cv_results_['mean_test_score'][i]
    params = gscv.cv_results_['params'][i]
    print('params : %s' % params)
    print('\ttrain score : %.3f' % train_score)
    print('\ttest score : %.3f' % test_score)

params : {'clf__max_depth': 3, 'clf__min_child_weight': 1}
	train score : 0.702
	test score : 0.638
params : {'clf__max_depth': 3, 'clf__min_child_weight': 2}
	train score : 0.702
	test score : 0.636
params : {'clf__max_depth': 3, 'clf__min_child_weight': 3}
	train score : 0.702
	test score : 0.636
params : {'clf__max_depth': 4, 'clf__min_child_weight': 1}
	train score : 0.730
	test score : 0.635
params : {'clf__max_depth': 4, 'clf__min_child_weight': 2}
	train score : 0.730
	test score : 0.634
params : {'clf__max_depth': 4, 'clf__min_child_weight': 3}
	train score : 0.729
	test score : 0.634
params : {'clf__max_depth': 5, 'clf__min_child_weight': 1}
	train score : 0.756
	test score : 0.625
params : {'clf__max_depth': 5, 'clf__min_child_weight': 2}
	train score : 0.755
	test score : 0.625
params : {'clf__max_depth': 5, 'clf__min_child_weight': 3}
	train score : 0.756
	test score : 0.625
params : {'clf__max_depth': 6, 'clf__min_child_weight': 1}
	train score : 0.782
	test score : 0.620


In [None]:
param_grid = {"clf__gamma" : [0,1,2,3,4,5]}
steps = [('standardize', StandardScaler()),
         ('clf', XGBClassifier(max_depth=3, min_child_weight=1))]
clf = Pipeline(steps)
cv = list(GroupKFold(n_splits=3).split(features_df, labels, user_ids))
gscv = GridSearchCV(clf, param_grid=param_grid, cv=cv, n_jobs=12, verbose=2)

In [64]:
xgb.cv?

### ^^^^ Optimize the shit out of this!!!
* [Analytics Vidhya's blog about tuning xgboost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

In [74]:
from sklearn.preprocessing import LabelEncoder

In [99]:
def modelfit(alg, features, labels,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(features, label=labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(features_df, numeric_labels,eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(features_df)
    dtrain_predprob = alg.predict_proba(features_df)[:,1]
        
    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob))
                    
    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

In [102]:
xgtrain = xgb.DMatrix(features_df, label=numeric_labels)

In [103]:
params = xgb1.get_params()
params

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 0.8,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 1000,
 'n_jobs': 1,
 'nthread': 4,
 'num_class': 7,
 'objective': 'multi:softprob',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': 27,
 'silent': True,
 'subsample': 0.8}

In [None]:
cvresult = xgb.cv(params,
                  xgtrain, 
                  num_boost_round=params['n_estimators'], 
                  nfold=5, 
                  early_stopping_rounds=50)

In [75]:
le = LabelEncoder()
le.fit(labels)
numeric_labels = le.transform(labels)

In [84]:
numeric_labels

array([4, 4, 4, ..., 2, 2, 2])

In [85]:
len(set(numeric_labels))

7

In [87]:
len(le.classes_)

7

In [89]:
len(numeric_labels)

307243

In [90]:
len(labels)

307243

In [106]:
xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=3,
 min_child_weight=1,
 gamma=0,
 num_class = len(le.classes_),
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'multi:softmax',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb1, features_df, numeric_labels)

XGBoostError: b'[10:39:59] src/metric/rank_metric.cc:88: Check failed: preds.size() == info.labels.size() (1720558 vs. 245794) label size predict size not match\n\nStack trace returned 10 entries:\n[bt] (0) /usr/lib/python3.4/site-packages/xgboost/./lib/libxgboost.so(_ZN4dmlc15LogMessageFatalD1Ev+0x29) [0x7fe7946fc849]\n[bt] (1) /usr/lib/python3.4/site-packages/xgboost/./lib/libxgboost.so(_ZNK7xgboost6metric7EvalAuc4EvalERKSt6vectorIfSaIfEERKNS_8MetaInfoEb+0x172) [0x7fe794755b62]\n[bt] (2) /usr/lib/python3.4/site-packages/xgboost/./lib/libxgboost.so(_ZN7xgboost11LearnerImpl11EvalOneIterEiRKSt6vectorIPNS_7DMatrixESaIS3_EERKS1_ISsSaISsEE+0x26b) [0x7fe79470684b]\n[bt] (3) /usr/lib/python3.4/site-packages/xgboost/./lib/libxgboost.so(XGBoosterEvalOneIter+0x36d) [0x7fe79485797d]\n[bt] (4) /lib64/libffi.so.6(ffi_call_unix64+0x4c) [0x7fe7b3a59dcc]\n[bt] (5) /lib64/libffi.so.6(ffi_call+0x1f5) [0x7fe7b3a596f5]\n[bt] (6) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(_ctypes_callproc+0x2fb) [0x7fe7b3c6c4fb]\n[bt] (7) /usr/lib64/python3.4/lib-dynload/_ctypes.cpython-34m.so(+0xa5ef) [0x7fe7b3c665ef]\n[bt] (8) /lib64/libpython3.4m.so.1.0(PyObject_Call+0x8c) [0x7fe7c0b09dcc]\n[bt] (9) /lib64/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x3ce2) [0x7fe7c0bbcf52]\n'

In [98]:
len(features_df) == len(numeric_labels)

True

# Attempting one vs. rest with regression models
* [Drawing from Sklearn's page](http://scikit-learn.org/stable/modules/multiclass.html#one-vs-the-rest)

# Maybe also consider [Deep Forest](https://github.com/kingfengji/gcForest)

In [80]:
xgb.DMatrix?

In [77]:
xgb.cv?

In [81]:
xgb.XGBClassifier?