In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
sys.path.append(module_path)

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
color = sns.color_palette()
%matplotlib inline
matplotlib.style.use('ggplot')

import time
import numpy as np
import pandas as pd
from IPython.display import display

# remove warnings
import warnings
warnings.filterwarnings('ignore')

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, roc_curve
from itertools import product

# my module
from conf.configure import Configure
from utils import data_utils, dataframe_util
from utils.common_utils import common_num_range

import model.get_datasets as gd

# Load Datasets

In [2]:
train = pd.read_csv(Configure.base_path + 'huang_lin/train_dataHL.csv')
test = pd.read_csv(Configure.base_path + 'huang_lin/test_dataHL.csv')

y_train = train['orderType']
train.drop(['orderType'], axis=1, inplace=True)

df_columns = train.columns.values
print('train: {}, test: {}, feature count: {}, orderType 1:0 = {:.5f}'.format(
    train.shape[0], test.shape[0], len(df_columns), 1.0*sum(y_train) / len(y_train)))

train: 40307, test: 10076, feature count: 368, orderType 1:0 = 0.16436


In [3]:
np.mean(y_train)

0.1643635100602873

In [4]:
dtrain = xgb.DMatrix(train.values, y_train, feature_names=df_columns)
dtest = xgb.DMatrix(test, feature_names=df_columns)

# Parameter Fine Tuning

## Parameters
The overall parameters can be divided into 3 categories:

### General Parameters: Guide the overall functioning
1. booster [default=gbtree], Select the type of model to run at each iteration. It has 2 options:
- gbtree: tree-based models
- gblinear: linear models

2. silent [default=0]:
- Silent mode is activated when this is set to 1, i.e. no running messages will be printed.
- It’s generally good to keep it at 0, as the messages might help in understanding the model.

3. nthread [default to maximum number of threads available if not set]

### Booster Parameters : Guide the individual booster (tree/regression) at each step
Though there are 2 types of boosters, I’ll consider only the tree booster here because it usually outperforms the linear booster, and thus the latter is rarely used.

1. eta [default=0.3]
    - Analogous to **learning rate** in GBM
    - Makes the model more robust by shrinking the weights on each step
    - Typical final values to be used: 0.01-0.2
    
2. min_child_weight [default=1]
    - Defines **the minimum sum of weights of all observations required in a child**.
    - This is similar to min_samples_leaf in GBM, but not exactly the same: it refers to the minimum “sum of weights” of observations, while GBM uses the minimum “number of observations”.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    - Values that are too high can lead to under-fitting; hence, it should be tuned using CV.

3. max_depth [default=6]
    - **The maximum depth of a tree**, same as GBM.
    - Used to control over-fitting, as a higher depth allows the model to learn relations very specific to a particular sample.
    - Should be tuned using CV.
    - Typical values: 3-10

4. max_leaf_nodes
    - **The maximum number of terminal nodes or leaves in a tree**.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
    - **If this is defined, GBM will ignore max_depth**.

5. gamma [default=0]
    - A node is split only when the resulting split gives a positive reduction in the loss function. **Gamma specifies the minimum loss reduction required to make a split**.
    - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
    - The higher Gamma is, the higher the regularization. Default value is 0 (no regularization).

6. max_delta_step [default=0]
    - The maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
    - Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
    - You can explore it further if you wish.

7. subsample [default=1]
    - Same as the subsample of GBM. Denotes **the fraction of observations to be randomly sampled for each tree**.
    - Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
    - Typical values: 0.5-1

8. colsample_bytree [default=1]
    - Similar to max_features in GBM. Denotes **the fraction of columns to be randomly sampled for each tree**.
    - Typical values: 0.5-1

9. colsample_bylevel [default=1]
    - Denotes **the subsample ratio of columns for each split, in each level**.
    - I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you wish.

10. lambda [default=1]
    - **L2 regularization term on weights** (analogous to Ridge regression)
    - This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.

11. alpha [default=0]
    - **L1 regularization term on weights** (analogous to Lasso regression)
    - Can be used in case of very high dimensionality so that the algorithm runs faster.

12. scale_pos_weight [default=1]
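    - Controls the balance of positive and negative weights. It is useful in case of high class imbalance as it helps in faster convergence; a typical value to consider is sum(negative instances) / sum(positive instances).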
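
To make the roles of gamma, lambda and min_child_weight more concrete, here is a minimal sketch of the split-gain rule in the form given in the XGBoost paper, written in plain Python. The function names `split_gain` and `split_allowed` and the toy gradient/hessian sums are illustrative assumptions, not part of the xgboost API, and the library's internal constants may differ.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into a left and a right child.

    g_* are sums of first-order gradients and h_* are sums of
    second-order gradients (hessians) over the observations in each child.
    """
    def score(g, h):
        # Leaf quality; a larger lam (L2 term) shrinks it.
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  lam=1.0, gamma=0.0, min_child_weight=1.0):
    # min_child_weight bounds the hessian sum ("sum of weights") of each child.
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    # gamma is the minimum loss reduction: keep the split only if the gain is positive.
    return split_gain(g_left, h_left, g_right, h_right, lam, gamma) > 0


# Toy numbers, made up for illustration.
print(split_gain(-4.0, 2.0, 4.0, 2.0))                # gain with gamma=0
print(split_allowed(-4.0, 2.0, 4.0, 2.0, gamma=6.0))  # False: gain does not exceed gamma
print(split_allowed(-4.0, 0.5, 4.0, 2.0))             # False: left child is too light
```

Raising lambda lowers every score, raising gamma subtracts a fixed penalty per split, and raising min_child_weight rejects splits that isolate only a few observations.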

### Learning Task Parameters: Guide the optimization performed
These parameters are used to define the optimization objective and the metric to be calculated at each step.

1. objective [default=reg:linear]
    - This defines the loss function to be minimized. Mostly used values are:
        - binary:logistic – logistic regression for binary classification, returns predicted probability (not class)
        - multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities)
            - you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
        - multi:softprob – same as softmax, but returns predicted probability of each data point belonging to each class.

2. eval_metric [ default according to objective ]
    - The metric to be used for validation data.
    - The default values are rmse for regression and error for classification.
    - Typical values are:
        - rmse – root mean square error
        - mae – mean absolute error
        - logloss – negative log-likelihood
        - error – Binary classification error rate (0.5 threshold)
        - merror – Multiclass classification error rate
        - mlogloss – Multiclass logloss
        - auc – area under the ROC curve

3. seed [default=0]
    - The random number seed.
    - Can be used for generating reproducible results and also for parameter tuning.
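
As a rough illustration of three of the metrics listed above, the sketch below computes error, logloss and auc by hand on a made-up four-row example. The function names and the toy data are assumptions for illustration only; XGBoost computes these metrics internally.

```python
import math


def error_rate(y_true, y_prob, threshold=0.5):
    # Fraction of observations whose thresholded prediction is wrong.
    wrong = sum((p > threshold) != bool(y) for y, p in zip(y_true, y_prob))
    return wrong / len(y_true)


def logloss(y_true, y_prob, eps=1e-15):
    # Negative mean log-likelihood of the true labels.
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def auc_score(y_true, y_prob):
    # Probability that a random positive is ranked above a random negative
    # (ties count as half). O(n^2), which is fine for a toy example.
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(error_rate(y_true, y_prob))  # 0.25
print(logloss(y_true, y_prob))
print(auc_score(y_true, y_prob))   # 0.75
```

Note that auc depends only on the ranking of the predictions, while error depends on the 0.5 threshold, which matters for an imbalanced target like the one in this notebook.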
    

## General Approach for Parameter Tuning

The various steps to be performed are:

1. Choose a **relatively high learning rate**. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. **Determine the optimum number of trees for this learning rate**. XGBoost has a very useful function called “cv” which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required.
2. **Tune tree-specific parameters** (*max_depth, min_child_weight, gamma, subsample, colsample_bytree*) for the decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
3. **Tune regularization parameters** (*lambda, alpha*) for xgboost which can help reduce model complexity and enhance performance.
4. Lower the learning rate and decide the optimal parameters.
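
One reason for tuning in stages rather than in a single joint grid is cost. The sketch below counts cross-validation runs for both approaches, using the first-pass grid sizes from the cells later in this notebook. The gamma entry assumes that `common_num_range(0, 10, 1)` yields 10 values, which is not confirmed by the source.

```python
# Number of candidate values per parameter in each tuning stage.
stages = {
    'max_depth x min_child_weight': [5, 5],
    'gamma': [10],  # assumption: 10 candidate values
    'subsample x colsample_bytree': [3, 3],
    'alpha x lambda': [7, 7],
}


def n_combinations(sizes):
    # Product of the grid sizes, i.e. the number of parameter combinations.
    total = 1
    for size in sizes:
        total *= size
    return total


staged_runs = sum(n_combinations(sizes) for sizes in stages.values())
joint_runs = n_combinations([s for sizes in stages.values() for s in sizes])
print('staged: {} CV runs, joint grid: {} CV runs'.format(staged_runs, joint_runs))
# staged: 93 CV runs, joint grid: 110250 CV runs
```

The staged search can miss interactions between parameter groups, which is the price paid for the much smaller number of runs.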

In [5]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, roc_curve
from itertools import product

def model_cross_validate(xgb_params, cv_param_dict, dtrain, cv_num_boost_round=4000, early_stopping_rounds=100, cv_nfold=5, stratified=True):
    """Grid-search cv_param_dict with xgb.cv; xgb_params is updated in place with the best combination."""
    # list() so the names can be indexed on Python 3, where dict.keys() is a view
    params_name = list(cv_param_dict.keys())
    params_value = [cv_param_dict[param] for param in params_name]
    max_auc = 0
    best_param = None

    # try every combination of the candidate values
    for param_pair in product(*params_value):
        param_str = ''
        for i in range(len(param_pair)):
            param_str += params_name[i] + '=' + str(param_pair[i]) + ' '
            xgb_params[params_name[i]] = param_pair[i]

        start = time.time()
        cv_result = xgb.cv(xgb_params, dtrain, num_boost_round=cv_num_boost_round, stratified=stratified,
                           nfold=cv_nfold, early_stopping_rounds=early_stopping_rounds)

        # with early stopping, the length of cv_result is taken as the best number of rounds
        best_num_boost_rounds = len(cv_result)
        # average the test AUC over the last rounds (.loc slices are inclusive, so this spans up to 6 rows)
        mean_test_auc = cv_result.loc[best_num_boost_rounds - 6: best_num_boost_rounds - 1, 'test-auc-mean'].mean()
        if mean_test_auc > max_auc:
            best_param = param_pair
            max_auc = mean_test_auc

        end = time.time()
        print('Tuning paramter: {}, best_ntree_limit:{}, auc = {:.7f}, cost: {}s'.format(param_str, best_num_boost_rounds,
                                                                              mean_test_auc, end-start))
    if best_param is None:
        return  # no combination was evaluated successfully

    # write the best combination back into xgb_params
    param_str = ''
    for i in range(len(best_param)):
        param_str += params_name[i] + '=' + str(best_param[i]) + ' '
        xgb_params[params_name[i]] = best_param[i]
    print('===========best paramter: {} auc={:.7f}==========='.format(param_str, max_auc))
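
Before launching a long run, it can help to preview which combinations the function above will visit. The following is a small sketch of the same `itertools.product` expansion; the helper name `expand_grid` is hypothetical and not part of this notebook's modules.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the candidate values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


grid = {'max_depth': range(5, 15, 2), 'min_child_weight': range(1, 10, 2)}
combos = expand_grid(grid)
print(len(combos))  # 25
print(combos[0])    # {'max_depth': 5, 'min_child_weight': 1}
print(combos[1])    # {'max_depth': 5, 'min_child_weight': 3}
```

The last parameter varies fastest, so the combinations appear in the same order as the tuning log lines.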

### Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

### Baseline model

In [6]:
xgb_params = {
    'eta': 0.1,
    'max_depth': 5,
    'min_child_weight': 1,
    'scale_pos_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'auc',
    'objective': 'binary:logistic',
    'updater': 'grow_gpu',
    'gpu_id':0,
    'nthread': -1,
    'silent': 1,
    'booster': 'gbtree',
}
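
The baseline keeps `scale_pos_weight` at 1 even though only about 16% of the training rows are positive. As a hedged aside, a commonly suggested starting value is the ratio of negative to positive instances; the sketch below derives it from the positive rate printed earlier. The helper name is hypothetical, and whether this setting improves AUC on this dataset is untested here.

```python
def suggested_scale_pos_weight(pos_rate):
    # Ratio of negatives to positives, i.e. sum(negative) / sum(positive).
    return (1.0 - pos_rate) / pos_rate


pos_rate = 0.1643635100602873  # value of np.mean(y_train) printed above
candidate = suggested_scale_pos_weight(pos_rate)
print(round(candidate, 3))  # roughly 5 negatives per positive
```

Such a value could be included as one more candidate in the cross-validation grid rather than adopted blindly.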

In [7]:
print('---> calc baseline model')

cv_num_boost_round=4000
early_stopping_rounds=100
cv_nfold=5
stratified=True

cv_result = xgb.cv(xgb_params,
                   dtrain,
                   nfold=cv_nfold,
                   stratified=stratified,
                   num_boost_round=cv_num_boost_round,
                   early_stopping_rounds=early_stopping_rounds,
                   )
best_num_boost_rounds = len(cv_result)
mean_train_auc = cv_result.loc[best_num_boost_rounds-6 : best_num_boost_rounds-1, 'train-auc-mean'].mean()
mean_test_auc = cv_result.loc[best_num_boost_rounds-6 : best_num_boost_rounds-1, 'test-auc-mean'].mean()

print('mean_train_auc = {:.7f} , mean_test_auc = {:.7f}\n'.format(mean_train_auc, mean_test_auc))

---> calc baseline model
mean_train_auc = 0.9974385 , mean_test_auc = 0.9691350



### Fine tune *max_depth* and *min_child_weight*

In [8]:
cv_paramters = {'max_depth':range(5,15,2),'min_child_weight':range(1,10,2)}
model_cross_validate(xgb_params, cv_paramters, dtrain)

Tuning paramter: max_depth=5 min_child_weight=1 , best_ntree_limit:368, auc = 0.9691350, cost: 90.7996020317s
Tuning paramter: max_depth=5 min_child_weight=3 , best_ntree_limit:388, auc = 0.9698913, cost: 94.5384869576s


KeyboardInterrupt: 

In [8]:
cv_paramters = {'max_depth':range(10,13,1),'min_child_weight':range(2,5,1)}
model_cross_validate(xgb_params, cv_paramters, dtrain)

Tuning paramter: max_depth=10 min_child_weight=2 , best_ntree_limit:607, auc = 0.9692929, cost: 507.173731089s
Tuning paramter: max_depth=10 min_child_weight=3 , best_ntree_limit:434, auc = 0.9687700, cost: 375.949675083s
Tuning paramter: max_depth=10 min_child_weight=4 , best_ntree_limit:348, auc = 0.9691484, cost: 313.063239098s
Tuning paramter: max_depth=11 min_child_weight=2 , best_ntree_limit:601, auc = 0.9696587, cost: 547.44369483s
Tuning paramter: max_depth=11 min_child_weight=3 , best_ntree_limit:596, auc = 0.9697522, cost: 541.135313034s
Tuning paramter: max_depth=11 min_child_weight=4 , best_ntree_limit:367, auc = 0.9696156, cost: 366.118253946s
Tuning paramter: max_depth=12 min_child_weight=2 , best_ntree_limit:974, auc = 0.9694576, cost: 949.071660042s
Tuning paramter: max_depth=12 min_child_weight=3 , best_ntree_limit:460, auc = 0.9696810, cost: 502.434431076s
Tuning paramter: max_depth=12 min_child_weight=4 , best_ntree_limit:497, auc = 0.9694188, cost: 533.029690981s


### Tune gamma

In [11]:
cv_paramters={'gamma':common_num_range(0,10,1)}
# model_cross_validate(xgb_params, cv_paramters, dtrain)

### Tune subsample and colsample_bytree

In [13]:
cv_paramters = {'subsample':common_num_range(0.5, 1, 0.2), 'colsample_bytree':common_num_range(0.5,1,0.2)}
model_cross_validate(xgb_params,cv_paramters,dtrain)

Tuning paramter: subsample=0.5 colsample_bytree=0.5 , best_ntree_limit:223, auc = 0.9679042, cost: 222.292212009s
Tuning paramter: subsample=0.5 colsample_bytree=0.7 , best_ntree_limit:577, auc = 0.9670692, cost: 502.259080172s
Tuning paramter: subsample=0.5 colsample_bytree=0.9 , best_ntree_limit:326, auc = 0.9678104, cost: 377.781816959s
Tuning paramter: subsample=0.7 colsample_bytree=0.5 , best_ntree_limit:368, auc = 0.9690405, cost: 462.038894176s
Tuning paramter: subsample=0.7 colsample_bytree=0.7 , best_ntree_limit:327, auc = 0.9691685, cost: 569.801518917s
Tuning paramter: subsample=0.7 colsample_bytree=0.9 , best_ntree_limit:632, auc = 0.9687758, cost: 1015.72838306s
Tuning paramter: subsample=0.9 colsample_bytree=0.5 , best_ntree_limit:595, auc = 0.9700122, cost: 740.865426064s
Tuning paramter: subsample=0.9 colsample_bytree=0.7 , best_ntree_limit:758, auc = 0.9697036, cost: 719.50315094s
Tuning paramter: subsample=0.9 colsample_bytree=0.9 , best_ntree_limit:750, auc = 0.96939

In [14]:
cv_paramters = {'subsample':common_num_range(0.8, 1.1, 0.1), 'colsample_bytree':common_num_range(0.4,0.7,0.1)}
model_cross_validate(xgb_params,cv_paramters,dtrain)

Tuning paramter: subsample=0.8 colsample_bytree=0.4 , best_ntree_limit:648, auc = 0.9694050, cost: 550.568568945s
Tuning paramter: subsample=0.8 colsample_bytree=0.5 , best_ntree_limit:652, auc = 0.9697718, cost: 582.018703938s
Tuning paramter: subsample=0.8 colsample_bytree=0.6 , best_ntree_limit:809, auc = 0.9695466, cost: 731.63205409s
Tuning paramter: subsample=0.9 colsample_bytree=0.4 , best_ntree_limit:515, auc = 0.9698373, cost: 417.920390844s
Tuning paramter: subsample=0.9 colsample_bytree=0.5 , best_ntree_limit:595, auc = 0.9700122, cost: 500.100728989s
Tuning paramter: subsample=0.9 colsample_bytree=0.6 , best_ntree_limit:639, auc = 0.9696043, cost: 554.651578903s
Tuning paramter: subsample=1.0 colsample_bytree=0.4 , best_ntree_limit:543, auc = 0.9696669, cost: 429.173371792s
Tuning paramter: subsample=1.0 colsample_bytree=0.5 , best_ntree_limit:1046, auc = 0.9697188, cost: 807.565859079s
Tuning paramter: subsample=1.0 colsample_bytree=0.6 , best_ntree_limit:742, auc = 0.9696
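
The helper `common_num_range` comes from the author's `utils.common_utils` module, whose source is not shown. A hypothetical stand-in that reproduces the values printed in the two logs above (0.5, 0.7, 0.9 and 0.8, 0.9, 1.0) might look like the sketch below; the real helper may behave differently, in particular at the end point.

```python
def num_range_sketch(start, stop, step):
    """Hypothetical stand-in: values from start up to, but excluding, stop."""
    values = []
    i = 0
    while True:
        # Build each value from an integer multiple of step and round it,
        # which avoids floating-point drift such as 0.7000000000000001.
        value = round(start + i * step, 10)
        if value >= stop:
            break
        values.append(value)
        i += 1
    return values


print(num_range_sketch(0.5, 1, 0.2))    # [0.5, 0.7, 0.9]
print(num_range_sketch(0.8, 1.1, 0.1))  # [0.8, 0.9, 1.0]
print(num_range_sketch(0.4, 0.7, 0.1))  # [0.4, 0.5, 0.6]
```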

### Tuning Regularization Parameters: alpha, lambda

In [16]:
cv_paramters = {'alpha':[1e-5, 1e-3, 1e-2, 0.1, 1, 10, 100],
                'lambda':[1e-5, 1e-3, 1e-2, 0.1, 1, 10, 100]}
model_cross_validate(xgb_params,cv_paramters,dtrain)

Tuning paramter: alpha=1e-05 lambda=1e-05 , best_ntree_limit:503, auc = 0.9692792, cost: 425.94354701s
Tuning paramter: alpha=1e-05 lambda=0.001 , best_ntree_limit:467, auc = 0.9695275, cost: 416.013977051s
Tuning paramter: alpha=1e-05 lambda=0.01 , best_ntree_limit:689, auc = 0.9692839, cost: 563.177620888s
Tuning paramter: alpha=1e-05 lambda=0.1 , best_ntree_limit:670, auc = 0.9695547, cost: 528.730576992s
Tuning paramter: alpha=1e-05 lambda=1 , best_ntree_limit:923, auc = 0.9701873, cost: 698.041824818s
Tuning paramter: alpha=1e-05 lambda=10 , best_ntree_limit:453, auc = 0.9702873, cost: 391.562325001s
Tuning paramter: alpha=1e-05 lambda=100 , best_ntree_limit:561, auc = 0.9696862, cost: 474.354606152s
Tuning paramter: alpha=0.001 lambda=1e-05 , best_ntree_limit:692, auc = 0.9692623, cost: 544.138518095s
Tuning paramter: alpha=0.001 lambda=0.001 , best_ntree_limit:628, auc = 0.9693156, cost: 509.361008883s
Tuning paramter: alpha=0.001 lambda=0.01 , best_ntree_limit:376, auc = 0.9690

### Reducing Learning Rate and Done!

In [17]:
xgb_params

{'alpha': 1e-05,
 'booster': 'gbtree',
 'colsample_bytree': 0.5,
 'eta': 0.1,
 'eval_metric': 'auc',
 'gamma': 4,
 'gpu_id': 0,
 'lambda': 10,
 'max_depth': 11,
 'min_child_weight': 3,
 'nthread': -1,
 'objective': 'binary:logistic',
 'scale_pos_weight': 1,
 'silent': 1,
 'subsample': 0.9,
 'updater': 'grow_gpu'}

In [None]:
xgb_params['eta'] = 0.01