# Tuning XGBoost

### [Advice from a Kaggle forum](https://www.kaggle.com/general/4092)
1. tune the number of candidate features for splitting (starting with min_samples_leaf=1)
2. tune the depth of the trees (min_samples_leaf)
3. tune the number of trees
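
The three steps above amount to a greedy, one-parameter-at-a-time search. A minimal sketch of that idea, assuming a hypothetical `toy_score` function standing in for a real cross-validated score; `tune_sequentially`, the parameter names and the candidate values are illustrative and do not come from the forum post:

```python
def tune_sequentially(score_fn, stages, start):
    """Greedy search: tune one parameter at a time, keeping the others fixed."""
    params = dict(start)
    for name, candidates in stages:
        # Score every candidate for this parameter with the rest held constant.
        params[name] = max(candidates, key=lambda v: score_fn({**params, name: v}))
    return params


def toy_score(p):
    # Hypothetical stand-in for a cross-validated score (higher is better).
    return (-(p["max_features"] - 8) ** 2
            - (p["min_samples_leaf"] - 5) ** 2
            - ((p["n_estimators"] - 200) / 100) ** 2)


stages = [
    ("max_features", [2, 4, 8, 16]),      # step 1: candidate features per split
    ("min_samples_leaf", [1, 5, 25]),     # step 2: tree depth via leaf size
    ("n_estimators", [100, 200, 400]),    # step 3: number of trees
]
start = {"max_features": 4, "min_samples_leaf": 1, "n_estimators": 100}

best = tune_sequentially(toy_score, stages, start)
print(best)
```

A greedy search like this is much cheaper than a full grid, but it can miss interactions between parameters, so the order of the stages matters.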

### Original gradient boosting article : [Greedy Function Approximation: A Gradient Boosting Machine](http://www.jstor.org/stable/2699986?seq=1#page_scan_tab_contents)

### sklearn's [tips for parameter tuning](http://scikit-learn.org/stable/modules/grid_search.html#tips-for-parameter-search)

### Analytics Vidhya [tuning random forest](https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/)

In [30]:
import pandas as pd
import numpy as np
import time
import importlib.machinery
import sys
sys.path.append('/home/sac086/extrasensory/')
import extrasense as es
from sklearn.metrics import accuracy_score, make_scorer, roc_auc_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GroupShuffleSplit, GroupKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

In [31]:
import xgboost as xgb

### Load Data

In [36]:
features_df = es.get_impersonal_data(leave_users_out=[], data_type="activity", labeled_only=False)

# remove nan rows
no_label_indeces = features_df.label.isnull()
features_df = features_df[~no_label_indeces]

timestamps = features_df.pop('timestamp')
label_source = features_df.pop("label_source")
labels = features_df.pop("label")
user_ids = features_df.pop("user_id")

KeyboardInterrupt: 

In [4]:
resulting_scores = {}

## Score to beat
* In previous trials, the best parameters were n_estimators=100, max_features="log2", min_samples_leaf=1

In [5]:
# set up the accuracy scorer for cross validation
scorer = make_scorer(accuracy_score)
resulting_scores = {}

### ZeroR

In [7]:
steps = []
steps.append(('standardize', StandardScaler()))
steps.append(('ZeroR', es.ZeroR()))
clf = Pipeline(steps)

In [8]:
score = cross_val_score(clf, features_df, labels, groups=user_ids, cv=GroupKFold(n_splits=3), n_jobs=20, scoring=scorer)
resulting_scores["ZeroR"]=score
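
The source of `es.ZeroR` is not shown in this notebook. Assuming it is the usual majority-class baseline, its behavior can be sketched as follows; `MajorityBaseline` and the toy labels are hypothetical and only illustrate the idea:

```python
from collections import Counter


class MajorityBaseline:
    """Hypothetical ZeroR-style classifier: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Remember the single most common label seen during training.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Features are ignored; every row gets the majority label.
        return [self.majority_ for _ in X]


# Toy data, not taken from the ExtraSensory features.
y_train = ["SITTING", "SITTING", "LYING_DOWN", "FIX_walking", "SITTING"]
X_train = [[0.0]] * len(y_train)

baseline = MajorityBaseline().fit(X_train, y_train)
preds = baseline.predict(X_train)
accuracy = sum(p == t for p, t in zip(preds, y_train)) / len(y_train)
print(baseline.majority_, accuracy)
```

The accuracy of such a baseline equals the share of the majority class, which is why it is a useful floor for imbalanced activity labels.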

# Tuning XGBoost
* [Machine Learning Mastery's page on gradient boosting, lots of other resources too!](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
* [Kaggle master's blog on gradient boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)
* [Notes on tuning from the XGBoost docs](https://xgboost.readthedocs.io/en/latest/how_to/param_tuning.html)
* [Analytics Vidhya Blog on tuning XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
* [Tuning Gamma Parameter of XGBoost](https://medium.com/data-design/xgboost-hi-im-gamma-what-can-i-do-for-you-and-the-tuning-of-regularization-a42ea17e6ab6)
* [Dataiku blog on tuning XGBoost](https://www.dataiku.com/learn/guide/code/python/advanced-xgboost-tuning.html)


### Optimizing 1 : Control Overfitting

From the docs:
When you observe high training accuracy but low test accuracy, you have likely run into an overfitting problem.

There are in general two ways to control overfitting in XGBoost:

* The first way is to directly control model complexity
    * This includes max_depth, min_child_weight and gamma
* The second way is to add randomness to make training robust to noise
    * This includes subsample and colsample_bytree

You can also reduce the step size eta, but remember to increase num_round when you do so.
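
As a rough sketch, the two groups of controls can be kept in separate dictionaries and merged before building a classifier. The starting values below mirror the base configuration used later in this notebook. `scale_rounds` is a hypothetical helper that encodes one common rule of thumb (keep `eta * num_round` roughly constant); that rule is an assumption, not something the docs prescribe:

```python
# Group 1: parameters that directly limit model complexity.
complexity_params = {"max_depth": 5, "min_child_weight": 1, "gamma": 0}

# Group 2: parameters that add randomness to training.
randomness_params = {"subsample": 0.8, "colsample_bytree": 0.8}


def scale_rounds(eta_old, rounds_old, eta_new):
    """Rule of thumb (assumption): keep eta * num_round roughly constant."""
    return int(round(rounds_old * eta_old / eta_new))


# Halving the learning rate from 0.1 to 0.05 doubles the round budget.
new_rounds = scale_rounds(eta_old=0.1, rounds_old=3000, eta_new=0.05)

params = {
    **complexity_params,
    **randomness_params,
    "learning_rate": 0.05,
    "n_estimators": new_rounds,
}
print(params)
```

The merged dictionary could then be passed as keyword arguments, for example `xgb.XGBClassifier(**params)`. With early stopping in place, the scaled round count only acts as an upper bound.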

### Activity Classes

In [4]:
for l in labels.unique():
    print(l)

SITTING
FIX_walking
LYING_DOWN
FIX_running
BICYCLING
STAIRS_-_GOING_DOWN
STAIRS_-_GOING_UP


In [5]:
def get_base_xgb():
    return xgb.XGBClassifier(
                            learning_rate =0.1,
                            n_estimators=3000,
                            max_depth=5,
                            min_child_weight=1,
                            gamma=0,
                            subsample=0.8,
                            colsample_bytree=0.8,
                            objective= 'binary:logistic',
                            nthread=4,
                            scale_pos_weight=1,
                            seed=27)

In [6]:
from sklearn.metrics import classification_report, precision_score, f1_score, average_precision_score, recall_score

In [7]:
def modelfit(alg, features, labels, activity, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    values = [1 if l == activity else 0 for l in labels.unique()]
    activity_labels  = labels.replace(to_replace=labels.unique(), value=values)
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(features, label=activity_labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(features, activity_labels,eval_metric='auc')
        
    #Predict training set:
    train_predictions = alg.predict(features)
    train_predprob = alg.predict_proba(features)[:,1]
        
    #Print model report:
    print("Training Accuracy : %.4g" % accuracy_score(activity_labels, train_predictions))
    print("Training AUC Score (Train): %f" % roc_auc_score(activity_labels, train_predprob))
    
    return alg
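
`modelfit` builds a one-vs-rest target by mapping every unique label to 1 or 0 with `Series.replace`. What appears to be the same encoding can be written as a direct comparison. This is a sketch only; `binarize` and the toy labels are illustrative names that are not part of the notebook:

```python
import pandas as pd


def binarize(labels, activity):
    """Return 1 where the label equals `activity`, else 0 (one-vs-rest target)."""
    return (labels == activity).astype(int)


# Toy labels using class names that appear in this notebook.
toy_labels = pd.Series(["SITTING", "FIX_walking", "SITTING", "BICYCLING"])
sitting_target = binarize(toy_labels, "SITTING")
print(sitting_target.tolist())
```

The comparison form does not depend on the order of `labels.unique()`, which makes it harder to misalign labels and values by accident.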

In [15]:
alg_params = {}

for l in labels.unique():
    print("Training %s classifier" % l)
    alg = modelfit(get_base_xgb(), X_train, y_train, l)
    
    print("Getting Test Predictions")
    y_pred = alg.predict(X_test)
    y_pred_proba = alg.predict_proba(X_test)[:,1]
    
    values = [1 if i == l else 0 for i in labels.unique()]
    y_test_activity = y_test.replace(to_replace=labels.unique(), value=values)
    print("Test Accuracy : %.4g" % accuracy_score(y_test_activity, y_pred))
    print("Test AUC Score (Train): %f\n" % roc_auc_score(y_test_activity, y_pred_proba))
    alg_params[l] = alg.get_params()

Training SITTING classifier

Model Report : SITTING 
Training Accuracy : 0.9246
Training AUC Score (Train): 0.980337
Getting Test Predictions
Test Accuracy : 0.6244
Test AUC Score (Train): 0.672881

Training FIX_walking classifier

Model Report : FIX_walking 
Training Accuracy : 0.9623
Training AUC Score (Train): 0.976131
Getting Test Predictions
Test Accuracy : 0.9368
Test AUC Score (Train): 0.827958

Training LYING_DOWN classifier

Model Report : LYING_DOWN 
Training Accuracy : 0.9331
Training AUC Score (Train): 0.985091
Getting Test Predictions
Test Accuracy : 0.6315
Test AUC Score (Train): 0.687873

Training FIX_running classifier

Model Report : FIX_running 
Training Accuracy : 0.9998
Training AUC Score (Train): 0.999998
Getting Test Predictions
Test Accuracy : 0.997
Test AUC Score (Train): 0.590481

Training BICYCLING classifier

Model Report : BICYCLING 
Training Accuracy : 0.9998
Training AUC Score (Train): 0.999998
Getting Test Predictions
Test Accuracy : 0.9851
Test AUC Score

In [8]:
import pickle

In [9]:
params_step1_filename = "xgboost_step1_tuning_params.pickle"

In [77]:
with open(params_step1_filename, "wb") as fOut:
    pickle.dump(params_to_save1, fOut)

In [10]:
with open(params_step1_filename, "rb") as fIn:
    params_step1 = pickle.load(fIn)

In [11]:
for activity, alg in params_step1.items():
    print("%s : %s" % (activity, alg['n_estimators']))

LYING_DOWN : 2648
FIX_walking : 812
FIX_running : 393
BICYCLING : 943
STAIRS_-_GOING_UP : 330
SITTING : 2546
STAIRS_-_GOING_DOWN : 164
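
These tree counts presumably come from the `xgb.cv` call in `modelfit`, which stops once the validation metric has not improved for `early_stopping_rounds` rounds and keeps the history up to the best round. The rule can be illustrated in plain Python. This is a simplified sketch, not XGBoost's implementation, and `rounds_kept` and the score list are made up for the example:

```python
def rounds_kept(scores, patience, maximize=True):
    """Return how many boosting rounds a simple early-stopping rule would keep."""
    best_idx = 0
    for i, score in enumerate(scores):
        if maximize:
            improved = score > scores[best_idx]
        else:
            improved = score < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            # No improvement for `patience` rounds: stop training.
            break
    # Keep everything up to and including the best round.
    return best_idx + 1


# Hypothetical per-round validation AUC values.
val_auc = [0.60, 0.65, 0.70, 0.69, 0.68, 0.71, 0.70]

print(rounds_kept(val_auc, patience=2))
print(rounds_kept(val_auc, patience=3))
```

With the smaller patience the search stops before reaching the later, better round, so the patience value trades training time against the risk of stopping too early. A tree count equal to the `n_estimators` cap would suggest that the rule never fired.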


## Step 2 : Tune max_depth and min_child_weight

In [12]:
param_test1 = {'max_depth':range(3,10,2),
              'min_child_weight': range(1,6,2)}
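
To see how much work this grid implies, the candidates can be enumerated with `itertools.product`. The fold count of 3 is an assumption taken from the 3-fold `GroupKFold` used in the full tuning loop further down:

```python
from itertools import product

param_test1 = {"max_depth": range(3, 10, 2),
               "min_child_weight": range(1, 6, 2)}

# Build one dict per combination of parameter values.
names = list(param_test1)
candidates = [dict(zip(names, combo)) for combo in product(*param_test1.values())]

n_folds = 3  # assumed fold count
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")
```

This agrees with the search log in this notebook, which reports 12 candidates and 36 fits.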

### Begin testing section

In [11]:
activity = "SITTING"
old_alg = params_step1[activity]

In [18]:
values = [1 if l == activity else 0 for l in labels.unique()]
activity_labels  = labels.replace(to_replace=labels.unique(), value=values)

In [19]:
new_alg = get_base_xgb()
new_alg.n_estimators = old_alg['n_estimators']
print("Tuning Alg for %s" % activity)
cv = list(GroupKFold(n_splits=5).split(features_df, activity_labels, user_ids))
gsv = GridSearchCV(estimator = new_alg, param_grid = param_test1,
                   scoring='roc_auc', n_jobs=8, iid=False, cv=cv, verbose=2)

Tuning Alg for SITTING
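
The splits above are grouped by `user_ids`, so no user contributes rows to both the training and the validation side of a fold. A small check of that property on synthetic data (the arrays and user names below are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data: 6 hypothetical users with 2 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat(["u1", "u2", "u3", "u4", "u5", "u6"], 2)

folds = list(GroupKFold(n_splits=3).split(X, y, groups))

for train_idx, test_idx in folds:
    # Users present on both sides of the split; expected to be empty.
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print("held-out users:", sorted(set(groups[test_idx])), "shared:", len(shared))
```

Grouping this way estimates performance on unseen users, which is usually a harder and more honest test than a random row split.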


In [20]:
import time

In [24]:
start = time.time()
gsv.fit(features_df, activity_labels, user_ids)
print("best params : %s" % gsv.best_params_)
print("best score : %s" % gsv.best_score_)
finish = time.time()
print("took %s seconds" % (finish-start))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=5 .................................
[CV] max_depth=3, min_child_weight=5 .................................


KeyboardInterrupt: 

In [None]:
# set new alg dictionary
algs_2 = {}

for activity, old_alg in params_step1.items():
    new_alg = get_base_xgb()
    new_alg.n_estimators = old_alg['n_estimators']
    
    values = [1 if l == activity else 0 for l in labels.unique()]
    activity_labels  = labels.replace(to_replace=labels.unique(), value=values)

    print("Tuning Alg for %s" % activity)
    cv = list(GroupKFold(n_splits=3).split(features_df, activity_labels, user_ids))
    gsv = GridSearchCV(estimator = new_alg, param_grid = param_test1,
                       scoring='roc_auc', n_jobs=8, iid=False, cv=cv, verbose=2)
    gsv.fit(features_df, activity_labels, user_ids)
    print("best params : %s" % gsv.best_params_)
    print("best score : %s" % gsv.best_score_)
    algs_2[activity] = (gsv.grid_scores_, gsv.best_params_, gsv.best_score_)

Tuning Alg for LYING_DOWN
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=5 .................................
[CV] max_depth=3, min_child_weight=5 .................................
[CV] .................. max_depth=3, min_child_weight=3, total= 5.3min
[CV] max_depth=3, min_child_weight=5 .................................
[CV] .................. max_depth=3, min_child_weight=3, total= 5.3min
[CV] max_depth=5, min_child_weight=1 .................................
[CV] .................. max_depth=3, min_child_weight=1, tota

[Parallel(n_jobs=8)]: Done  36 out of  36 | elapsed: 56.9min finished


best params : {'max_depth': 3, 'min_child_weight': 5}
best score : 0.694242659224




Tuning Alg for FIX_walking
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=1 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=3 .................................
[CV] max_depth=3, min_child_weight=5 .................................
[CV] max_depth=3, min_child_weight=5 .................................


KeyboardInterrupt: 
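
The loop above relies on `iid=False` and `grid_scores_`, which were removed from newer scikit-learn releases; per-candidate results are now read from `cv_results_`. A minimal sketch of the same search pattern on synthetic data, assuming a `DecisionTreeClassifier` as a lightweight stand-in for the XGBoost model and `min_samples_leaf` as a stand-in for `min_child_weight`:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem with 12 hypothetical users, 10 rows each.
rng = np.random.RandomState(27)
X = rng.rand(120, 4)
y = (X[:, 0] + 0.1 * rng.randn(120) > 0.5).astype(int)
groups = np.repeat(np.arange(12), 10)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=27),
    param_grid={"max_depth": [3, 5, 7, 9], "min_samples_leaf": [1, 3, 5]},
    scoring="roc_auc",
    cv=GroupKFold(n_splits=3),
)
# Passing groups lets GroupKFold build the user-wise splits itself.
search.fit(X, y, groups=groups)

# cv_results_ replaces the old grid_scores_ attribute.
results = search.cv_results_
for params, mean in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean, 3))
print("best params:", search.best_params_)
```

Storing `(search.cv_results_, search.best_params_, search.best_score_)` would keep the same three pieces of information that the loop saves per activity.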

In [22]:
[i for i in range(1,8,2)]

[1, 3, 5, 7]

In [24]:
[i for i in range(3,10,2)]

[3, 5, 7, 9]

In [11]:
algs[1]

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=812,
       n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27, silent=True,
       subsample=0.8)

In [25]:
alg_test = get_base_xgb()

In [26]:
alg_test.set_params(n_estimators=140)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=140,
       n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27, silent=True,
       subsample=0.8)

In [27]:
alg_test.n_estimators

140

In [12]:
algs[2]

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27, silent=True,
       subsample=0.8)

In [13]:
algs[3]

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.8, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=393,
       n_jobs=1, nthread=4, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27, silent=True,
       subsample=0.8)

# Next step : Start making a pipeline that follows the optimization steps in the blog

In [None]:
param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}

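# train, predictors and target are placeholders from the blog's example; they are not defined in this notebook
gsearch1 = GridSearchCV(estimator = xgb.XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,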
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

# Running the tuning through a script

In [5]:
import pickle

In [7]:
with open("../processes/step1_params.pickle", "rb") as fIn:
    step1_params = pickle.load(fIn)

In [8]:
step1_params

{'BICYCLING': {'base_score': 0.5,
  'booster': 'gbtree',
  'colsample_bylevel': 1,
  'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.1,
  'max_delta_step': 0,
  'max_depth': 5,
  'min_child_weight': 1,
  'missing': None,
  'n_estimators': 943,
  'n_jobs': 1,
  'nthread': 4,
  'objective': 'binary:logistic',
  'random_state': 0,
  'reg_alpha': 0,
  'reg_lambda': 1,
  'scale_pos_weight': 1,
  'seed': 27,
  'silent': True,
  'subsample': 0.8},
 'FIX_running': {'base_score': 0.5,
  'booster': 'gbtree',
  'colsample_bylevel': 1,
  'colsample_bytree': 0.8,
  'gamma': 0,
  'learning_rate': 0.1,
  'max_delta_step': 0,
  'max_depth': 5,
  'min_child_weight': 1,
  'missing': None,
  'n_estimators': 393,
  'n_jobs': 1,
  'nthread': 4,
  'objective': 'binary:logistic',
  'random_state': 0,
  'reg_alpha': 0,
  'reg_lambda': 1,
  'scale_pos_weight': 1,
  'seed': 27,
  'silent': True,
  'subsample': 0.8},
 'FIX_walking': {'base_score': 0.5,
  'booster': 'gbtree',
  'colsample_bylevel': 1,

## After step 2

In [32]:
with open("../processes/step2_params.pickle", "rb") as fIn:
    step2_params = pickle.load(fIn)

In [34]:
type(step2_params)

NoneType

In [35]:
ls -l ../processes/

total 36
-rw-r--r--. 1 sac086 sac086 5773 Dec 29 18:46 experiment1.py
-rw-r--r--. 1 sac086 sac086 5287 Dec 13 15:12 experimental_setups.py
drwxrwxr-x. 2 sac086 sac086   55 Dec 13 15:40 __pycache__/
-rw-rw-r--. 1 sac086 sac086 1144 Jan 14 16:34 step1_params.pickle
-rw-rw-r--. 1 sac086 sac086 1302 Jan 13 18:27 step1_params.pickle.old
-rw-rw-r--. 1 sac086 sac086    4 Jan 14 21:09 step2_params.pickle
-rw-r--r--. 1 sac086 sac086 6439 Jan 14 15:17 tuning_xgboost.py
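
The 4-byte size of `step2_params.pickle` in the listing above is what a pickled `None` occupies under pickle protocol 2 or later, which fits the `NoneType` result. A quick check (protocol 3 is chosen here only as an example):

```python
import pickle

# Serialize None and inspect the size of the resulting byte string.
payload = pickle.dumps(None, protocol=3)
restored = pickle.loads(payload)
print(len(payload), restored)
```

One possible cause, stated as an assumption because the tuning script is not shown here, is that the script pickled the return value of a function that has no `return` statement.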


# Results from output

In [None]:
Step 1 Tuning...
Training SITTING classifier
Training Accuracy : 0.9219
Training AUC Score (Train): 0.979328
Getting Test Predictions
Test Accuracy : 0.6321
Test AUC Score (Train): 0.672686

Training FIX_walking classifier
Training Accuracy : 0.9694
Training AUC Score (Train): 0.986583
Getting Test Predictions
Test Accuracy : 0.9311
Test AUC Score (Train): 0.864814

Training LYING_DOWN classifier
Training Accuracy : 0.9347
Training AUC Score (Train): 0.985816
Getting Test Predictions
Test Accuracy : 0.6606
Test AUC Score (Train): 0.700288

Training STAIRS classifier
Training Accuracy : 0.9993
Training AUC Score (Train): 0.999895
Getting Test Predictions
Test Accuracy : 0.9993
Test AUC Score (Train): 0.715796

Training FIX_running classifier
Training Accuracy : 0.9994
Training AUC Score (Train): 0.999973
Getting Test Predictions
Test Accuracy : 0.9985
Test AUC Score (Train): 0.948706

Training BICYCLING classifier
Training Accuracy : 0.9941
Training AUC Score (Train): 0.997416
Getting Test Predictions
Test Accuracy : 0.9939
Test AUC Score (Train): 0.913646

BICYCLING : 393
SITTING : 2754
STAIRS : 263
LYING_DOWN : 3000
FIX_walking : 1436
FIX_running : 336

In [None]:
Step2
BICYCLING
Training results:
best params : {'max_depth': 3, 'min_child_weight': 1}
best score : 0.87567072957
Test Accuracy : 0.995
Test AUC Score (Train): 0.822023
    
SITTING
best params : {'max_depth': 3, 'min_child_weight': 1}
best score : 0.660944318251
Test Accuracy : 0.6317
Test AUC Score (Train): 0.668495
    
STAIRS
Training results:
best params : {'max_depth': 3, 'min_child_weight': 5}
best score : 0.634735966304
Test Accuracy : 0.998
Test AUC Score (Train): 0.618597
    
LYING_DOWN
best params : {'max_depth': 3, 'min_child_weight': 1}
best score : 0.659388735858
Test Accuracy : 0.6685
Test AUC Score (Train): 0.712522