The purpose of this experiment is to show how heuristics such as Simulated Annealing can be used to efficiently find good combinations of hyper-parameters in machine learning algorithms. This approach is better than blind random generation of parameter combinations. It is also preferable to tuning each hyper-parameter separately, because there are typically interactions between them.
The XGBoost algorithm is a good showcase because it has many hyper-parameters, and an exhaustive grid search over them can be computationally prohibitive.
For a very good discussion of the theoretical details of XGBoost, see the Slideshare presentation titled "Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author" by Tianqi Chen.
This is a Kaggle dataset taken from here, which contains credit card transaction data and a fraud flag. It appeared originally in Dal Pozzolo, Andrea, et al. "Calibrating Probability with Undersampling for Unbalanced Classification." Computational Intelligence, 2015 IEEE Symposium Series on. IEEE, 2015. There is a Time variable (seconds elapsed since the first transaction in the dataset), an Amount variable, the Class variable (1 = fraud, 0 = no fraud), and the remaining variables (V1-V28) are factors obtained through Principal Components Analysis of the original variables.
As will be seen, this is not a very difficult case for XGBoost. The main objective of this experiment is to show that the heuristic search finds a suitable set of hyper-parameters out of a quite large set of potential combinations.
We can verify below that this is a highly imbalanced dataset, typical of fraud detection data. We will take this into account when setting the weights of observations in XGBoost parameters.
Since we have plenty of data we are going to calibrate the hyper-parameters on a validation dataset and evaluate performance on an unseen testing dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import random
random.seed(1234)
np.random.seed(1234)  ## numpy's random generator is also used below (e.g. np.random.shuffle)
plt.style.use('ggplot')
dat = pd.read_csv('creditcard.csv')
print(dat.head())
print('\nThe distribution of the target variable:\n')
dat['Class'].value_counts()
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V21 V22 V23 V24 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281
3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575
4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267
V25 V26 V27 V28 Amount Class
0 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 -0.206010 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
The distribution of the target variable:
0 284315
1 492
Name: Class, dtype: int64
Although the context of most of the variables is not known (recall that V1-V28 are factors summarizing the transactional data), we know that V1-V28 are by construction standardized, with a mean of 0 and a standard deviation of 1. We therefore standardize the Time and Amount variables too.
Data exploration, and in particular Welch's t-tests, reveals that almost all the factors are significantly associated with the Class variable: the mean of these variables is almost zero in Class 0 and clearly non-zero in Class 1. The Time and Amount variables are also significant. There does not seem to be any reason for variable selection yet, although we could drop the Time variable, which is probably not useful.
Some of the factors (and the Amount variable) are quite skewed and have strongly peaked, heavy-tailed distributions. If we were to apply some other method (say, logistic regression) we might apply some transformations (and probably binning), but XGBoost is insensitive to such deviations from normality.
from scipy import stats
from sklearn import preprocessing
dat['Time'] = preprocessing.scale(dat['Time'])
dat['Amount'] = preprocessing.scale(dat['Amount'])
print('\nMeans of variables in the two Class categories:\n')
pt = pd.pivot_table(dat, values=dat.columns, columns = 'Class', aggfunc='mean')
print(pt.loc[dat.columns])
print('\nP-values of Welch’s t-tests and shape statistics:\n')
for i in range(30):
    col_name = dat.columns[i]
    t, p_val = stats.ttest_ind(dat.loc[dat['Class']==0, col_name], dat.loc[dat['Class']==1, col_name], equal_var=False)
    skewness = dat.loc[:,col_name].skew()
    kurtosis = stats.kurtosis(dat.loc[:,col_name])
    print('Variable: {:7s}'.format(col_name), end='')
    print('p-value: {:6.3f} skewness: {:6.3f} kurtosis: {:6.3f}'.format(p_val, skewness, kurtosis))
fig, axes = plt.subplots(nrows=6, ncols=5, figsize=(10,10))
axes = axes.flatten()
columns = dat.columns
for i in range(30):
    axes[i].hist(dat[columns[i]], bins=50, facecolor='b', alpha=0.5)
    axes[i].set_title(columns[i])
    axes[i].set_xlim([-4., +4.])
    plt.setp(axes[i].get_xticklabels(), visible=False)
    plt.setp(axes[i].get_yticklabels(), visible=False)
Means of variables in the two Class categories:
Class 0 1
Time 0.000513 -0.296223
V1 0.008258 -4.771948
V2 -0.006271 3.623778
V3 0.012171 -7.033281
V4 -0.007860 4.542029
V5 0.005453 -3.151225
V6 0.002419 -1.397737
V7 0.009637 -5.568731
V8 -0.000987 0.570636
V9 0.004467 -2.581123
V10 0.009824 -5.676883
V11 -0.006576 3.800173
V12 0.010832 -6.259393
V13 0.000189 -0.109334
V14 0.012064 -6.971723
V15 0.000161 -0.092929
V16 0.007164 -4.139946
V17 0.011535 -6.665836
V18 0.003887 -2.246308
V19 -0.001178 0.680659
V20 -0.000644 0.372319
V21 -0.001235 0.713588
V22 -0.000024 0.014049
V23 0.000070 -0.040308
V24 0.000182 -0.105130
V25 -0.000072 0.041449
V26 -0.000089 0.051648
V27 -0.000295 0.170575
V28 -0.000131 0.075667
Amount -0.000234 0.135382
Class NaN NaN
P-values of Welch’s t-tests and shape statistics:
Variable: Time p-value: 0.000 skewness: -0.036 kurtosis: -1.294
Variable: V1 p-value: 0.000 skewness: -3.281 kurtosis: 32.486
Variable: V2 p-value: 0.000 skewness: -4.625 kurtosis: 95.771
Variable: V3 p-value: 0.000 skewness: -2.240 kurtosis: 26.619
Variable: V4 p-value: 0.000 skewness: 0.676 kurtosis: 2.635
Variable: V5 p-value: 0.000 skewness: -2.426 kurtosis: 206.901
Variable: V6 p-value: 0.000 skewness: 1.827 kurtosis: 42.642
Variable: V7 p-value: 0.000 skewness: 2.554 kurtosis: 405.600
Variable: V8 p-value: 0.063 skewness: -8.522 kurtosis: 220.583
Variable: V9 p-value: 0.000 skewness: 0.555 kurtosis: 3.731
Variable: V10 p-value: 0.000 skewness: 1.187 kurtosis: 31.988
Variable: V11 p-value: 0.000 skewness: 0.357 kurtosis: 1.634
Variable: V12 p-value: 0.000 skewness: -2.278 kurtosis: 20.241
Variable: V13 p-value: 0.028 skewness: 0.065 kurtosis: 0.195
Variable: V14 p-value: 0.000 skewness: -1.995 kurtosis: 23.879
Variable: V15 p-value: 0.050 skewness: -0.308 kurtosis: 0.285
Variable: V16 p-value: 0.000 skewness: -1.101 kurtosis: 10.419
Variable: V17 p-value: 0.000 skewness: -3.845 kurtosis: 94.798
Variable: V18 p-value: 0.000 skewness: -0.260 kurtosis: 2.578
Variable: V19 p-value: 0.000 skewness: 0.109 kurtosis: 1.725
Variable: V20 p-value: 0.000 skewness: -2.037 kurtosis: 271.011
Variable: V21 p-value: 0.000 skewness: 3.593 kurtosis: 207.283
Variable: V22 p-value: 0.835 skewness: -0.213 kurtosis: 2.833
Variable: V23 p-value: 0.571 skewness: -5.875 kurtosis: 440.081
Variable: V24 p-value: 0.000 skewness: -0.552 kurtosis: 0.619
Variable: V25 p-value: 0.249 skewness: -0.416 kurtosis: 4.290
Variable: V26 p-value: 0.015 skewness: 0.577 kurtosis: 0.919
Variable: V27 p-value: 0.006 skewness: -1.170 kurtosis: 244.985
Variable: V28 p-value: 0.002 skewness: 11.192 kurtosis: 933.381
Variable: Amount p-value: 0.004 skewness: 16.978 kurtosis: 845.078
In this step we partition the dataset into 40% training, 30% validation and 30% testing. Note the use of numpy's np.random.shuffle() function. We also create the corresponding predictor matrices train, valid and test, with label vectors trainY, validY and testY, respectively.
Class = dat['Class'].values
allIndices = np.arange(len(Class))
np.random.shuffle(allIndices) ## shuffle the indices of the observations
numTrain = int(round(0.40*len(Class)))
numValid = int(round(0.30*len(Class)))
numTest = len(Class)-numTrain-numValid
inTrain = allIndices[:numTrain]
inValid = allIndices[numTrain:(numTrain+numValid)]
inTest = allIndices[(numTrain+numValid):]
train = dat.iloc[inTrain,:30]
valid= dat.iloc[inValid,:30]
test = dat.iloc[inTest,:30]
trainY = Class[inTrain]
validY = Class[inValid]
testY = Class[inTest]
First we create the matrices in the format required by XGBoost with the xgb.DMatrix() function, passing for each dataset the predictors data and the labels. Then we set some fixed parameters. The number of boosting iterations (num_rounds) is set to 20. Normally we would use a larger number, but we want to keep the processing time low for the purposes of this experiment.
We initialize the param dictionary with silent=1 (no messages). The min_child_weight parameter, the minimum weighted number of observations required in a child node for further partitioning, is kept at its default value of 1 because the data is highly unbalanced. The objective is binary classification and the evaluation metric is the Area Under the ROC Curve (AUC). In a more advanced implementation we could write a customized evaluation function, as described in the XGBoost API. The internal random number seed is set to a constant for reproducible results (this is not guaranteed, though, among other reasons because XGBoost runs multi-threaded).
We are going to expand the param dictionary with the parameters in the heuristic search.
import xgboost as xgb
dtrain = xgb.DMatrix(train, label=trainY)
dvalid = xgb.DMatrix(valid, label=validY)
dtest = xgb.DMatrix(test, label=testY)
## fixed parameters
num_rounds=20 # number of boosting iterations
param = {'silent':1,
'min_child_weight':1,
'objective':'binary:logistic',
'eval_metric':'auc',
'seed' : 1234}
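As an illustration of the customized evaluation function mentioned above, here is a minimal sketch (our own addition, not part of the original workflow): a hypothetical F-score metric passed to xgb.train() through its feval argument. The function name and the 0.5 threshold are illustrative choices, and whether the preds passed to feval are probabilities or raw margins can depend on the XGBoost version and on whether a custom objective is used, so treat this only as a sketch.
## Sketch of a customized evaluation metric for xgb.train().
## Assumes preds are predicted probabilities (binary:logistic, no custom objective);
## the function name and the 0.5 threshold are illustrative choices.
def fscore_eval(preds, dmatrix):
    labels = dmatrix.get_label()
    pred_pos = preds >= 0.5                              ## predicted positives at a 0.5 cut-off
    tp = (pred_pos & (labels == 1)).sum()                ## true positives
    precision = tp / max(pred_pos.sum(), 1)
    recall = tp / max((labels == 1).sum(), 1)
    f1 = 2*precision*recall / max(precision + recall, 1e-12)
    return 'f-score', f1                                 ## xgb.train() expects a (name, value) pair
## hypothetical usage (larger is better, hence maximize=True):
## model = xgb.train(param, dtrain, num_boost_round=num_rounds,
##                   evals=[(dtrain,'train'), (dvalid,'valid')],
##                   feval=fscore_eval, maximize=True)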
In what follows we combine the suggestions from several sources, referenced below as [1], [2] and [3]:
- [1] The official XGBoost documentation, in particular the Notes on Parameter Tuning
- [2] The article "Complete Guide to Parameter Tuning in XGBoost" from Analytics Vidhya
- [3] The Slideshare presentation titled "Open Source Tools & Data Science Competitions"
We select several important parameters for the heuristic search:
- max_depth: the maximum depth of a tree, in [1,∞], with default value 6. This is highly data-dependent. [2] quotes 3-10 as typical values and [3] advises to start from 6. We choose to also explore larger values and select depths of 5-25 in steps of 5.
- subsample, in (0,1], with default value 1. This is the proportion of the training instances used to grow each tree; smaller values can prevent over-fitting. [2] suggests values in 0.5-1, while [3] suggests leaving this at 1. We decide to test values in 0.5-1.0 in steps of 0.1.
- colsample_bytree, in (0,1], with default value 1. This is the subsample ratio of columns (features) used to construct each tree. [2] suggests values in 0.5-1; the advice in [3] is 0.3-0.5. We will try similar values as with subsample.
- eta (or learning_rate), in [0,1], with default value 0.3. This is the shrinkage rate of the feature weights; smaller values make the boosting more conservative and help prevent overfitting, at the cost of more iterations. [2] suggests values in 0.01-0.2. We select some values in [0.01,0.4].
- gamma, in [0,∞], with default value 0. This is the minimum loss reduction required to make a split. [3] suggests leaving this at 0. We experiment with values in 0-0.2 in steps of 0.05.
- scale_pos_weight, which controls the balance of positive and negative weights, with default value 1. The advice in [1] is to use the ratio of negative to positive cases, which is about 595 here, i.e. to give a weight that large to the positive cases. [2] similarly suggests a large value in case of high class imbalance, as is the case here. We try some small values as well as some large ones.
The total number of possible combinations is 5 × 6 × 6 × 6 × 5 × 8 = 43,200, and we are only going to test 100 of them, i.e. as many as the number of heuristic search iterations, which is a small fraction of the total.
We also initialize a dataframe which will hold the results, for later examination.
from collections import OrderedDict
ratio_neg_to_pos = sum(trainY==0)/sum(trainY==1) ## about 595.5 here
print('Ratio of negative to positive instances: {:6.1f}'.format(ratio_neg_to_pos))
## parameters to be tuned
tune_dic = OrderedDict()
tune_dic['max_depth']= [5,10,15,20,25] ## maximum tree depth
tune_dic['subsample']=[0.5,0.6,0.7,0.8,0.9,1.0] ## proportion of training instances used in trees
tune_dic['colsample_bytree']= [0.5,0.6,0.7,0.8,0.9,1.0] ## subsample ratio of columns
tune_dic['eta']= [0.01,0.05,0.10,0.20,0.30,0.40] ## learning rate
tune_dic['gamma']= [0.00,0.05,0.10,0.15,0.20] ## minimum loss function reduction required for a split
tune_dic['scale_pos_weight']=[30,40,50,300,400,500,600,700] ## relative weight of positive/negative instances
lengths = [len(lst) for lst in tune_dic.values()]
combs = 1
for i in range(len(lengths)):
    combs *= lengths[i]
print('Total number of combinations: {:16d}'.format(combs))
maxiter=100
columns=[*tune_dic.keys()]+['F-Score','Best F-Score']
results = pd.DataFrame(index=range(maxiter), columns=columns) ## dataframe to hold training results
Ratio of negative to positive instances: 595.5
Total number of combinations: 43200
Next we define two functions:
Function perf_measures() accepts predictions and labels, optionally prints the confusion matrix, and returns the F-Score. This is a measure of performance combining precision and recall and it will guide the heuristic search.
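For reference, writing TP, FP and FN for the numbers of true positives, false positives and false negatives:
precision = TP / (TP + FP),  recall = TP / (TP + FN),  F-Score = 2 × precision × recall / (precision + recall).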
Function do_train() accepts as parameters:
- the current choice of variable parameters in a dictionary (cur_choice),
- the full dictionary of parameters to be passed to the main XGBoost training routine (param),
- a train dataset in XGBoost format (train),
- a string identifier (train_s),
- its labels (trainY),
- the corresponding arguments for a validation dataset (valid, valid_s, validY),
- and the option to print the confusion matrix (print_conf_matrix).
It then trains the model and returns the F-score of the predictions on the validation dataset together with the model. The call to the main train routine xgb.train() has the following arguments:
- the full dictionary of the parameters (param),
- the train dataset in XGBoost format (train),
- the number of boosting iterations (num_boost_round),
- a watchlist with dataset information to show progress (evals); this is commented out below.
def perf_measures(preds, labels, print_conf_matrix=False):
    act_pos = sum(labels==1)   ## actual positives
    act_neg = len(labels) - act_pos   ## actual negatives
    pred_pos = sum(1 for i in range(len(preds)) if preds[i]>=0.5)   ## predicted positives
    true_pos = sum(1 for i in range(len(preds)) if preds[i]>=0.5 and labels[i]==1)   ## true positives
    false_pos = pred_pos - true_pos   ## false positives
    false_neg = act_pos - true_pos    ## false negatives
    true_neg = act_neg - false_pos    ## true negatives
    precision = true_pos/pred_pos     ## tp/(tp+fp) proportion of predicted positives correctly classified
    recall = true_pos/act_pos         ## tp/(tp+fn) proportion of actual positives correctly classified
    f_score = 2*precision*recall/(precision+recall)
    if print_conf_matrix:
        print('\nconfusion matrix')
        print('----------------')
        print('tn:{:6d} fp:{:6d}'.format(true_neg, false_pos))
        print('fn:{:6d} tp:{:6d}'.format(false_neg, true_pos))
    return(f_score)
def do_train(cur_choice, param, train, train_s, trainY, valid, valid_s, validY, print_conf_matrix=False):
    ## train with given fixed and variable parameters
    ## and report the F-score on the validation dataset
    print('Parameters:')
    for (key,value) in cur_choice.items():
        print(key,': ',value,' ',end='')
        param[key]=value
    print('\n')
    ## the commented-out lines below use a watchlist to monitor the progress of the boosting iterations
    ## evallist = [(train, train_s), (valid, valid_s)]
    ## model = xgb.train(param, train, num_boost_round=num_rounds,
    ##                   evals=evallist, verbose_eval=False)
    model = xgb.train(param, train, num_boost_round=num_rounds)
    preds = model.predict(valid)
    labels = valid.get_label()
    f_score = perf_measures(preds, labels, print_conf_matrix)
    return(f_score, model)
Next we define a function next_choice() which either produces a random combination of the variable parameters (if no current parameters are passed with cur_params) or generates a neighboring combination of the parameters passed in cur_params.
In the second case we first select at random a parameter to be changed. Then:
- If this parameter currently has the smallest value, we select the next (larger) one.
- If this parameter currently has the largest value, we select the previous (smaller) one.
- Otherwise, we select the left (smaller) or right (larger) value randomly.
Repetitions are avoided in the function which carries out the heuristic search.
def next_choice(cur_params=None):
    ## returns a random combination of the variable parameters (if cur_params=None)
    ## or a random neighboring combination of cur_params
    if cur_params:
        ## choose a parameter to change: parameter name and current value
        choose_param_name, cur_value = random.choice(list(cur_params.items()))
        all_values = list(tune_dic[choose_param_name])   ## all values of the selected parameter
        cur_index = all_values.index(cur_value)          ## current index of the selected parameter
        if cur_index==0:   ## if it is the first in the range, select the second one
            next_index = 1
        elif cur_index==len(all_values)-1:   ## if it is the last in the range, select the previous one
            next_index = len(all_values)-2
        else:   ## otherwise select the left or right value randomly
            direction = np.random.choice([-1,1])
            next_index = cur_index + direction
        next_params = dict((k,v) for k,v in cur_params.items())
        next_params[choose_param_name] = all_values[next_index]   ## change the value of the selected parameter
        print('selected move: {:10s}: from {:6.2f} to {:6.2f}'.
              format(choose_param_name, cur_value, all_values[next_index]))
    else:   ## generate a random combination of parameters
        next_params = dict()
        for key, values in tune_dic.items():
            next_params[key] = np.random.choice(values)
    return(next_params)
At each iteration of the Simulated Annealing algorithm, one combination of hyper-parameters is selected. The XGBoost algorithm is trained with these parameters and the F-score on the validation set is computed.
- If this F-score is better (larger) than the one at the previous iteration, i.e. there is a "local" improvement, the combination is accepted as the current combination and a neighbouring combination is selected for the next iteration through a call to the next_choice() function.
- Otherwise, i.e. if this F-score is worse (smaller) than the one at the previous iteration by some decline Δf > 0, the combination is accepted as the current one with probability exp(-beta Δf/T), where beta is a constant and T is the current "temperature". The idea is that we start with a high temperature, so "bad" solutions are easily accepted at first, in the hope of exploring wide areas of the search space. As the temperature drops, bad solutions are less likely to be accepted and the search becomes more focused.
The temperature starts at a fixed value T0 and is reduced by a factor of alpha < 1 every n iterations. Here T0 = 0.40, n = 5 and alpha = 0.85. The beta constant is 1.3.
The selection of the parameters of this "cooling schedule" can be done easily in MS Excel. In this example we consider average acceptance probabilities for F-Score deteriorations of 0.150, 0.075, 0.038, 0.019, 0.009, i.e. starting from 0.150 and halving each time. We require these average probabilities to be around 50% during the first, second, ..., fifth block of 20 iterations respectively, and use Excel Solver to find suitable parameters. The Excel file can be found here.
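The same check can also be done directly in Python; the short sketch below (our own addition, not part of the original workflow) computes the average acceptance probability within each block of 20 iterations for the deterioration levels quoted above, using the same temperature update rule as in the main loop further down:
## Rough check of the cooling schedule (a sketch).
T0, alpha, n, beta = 0.40, 0.85, 5, 1.3
declines = [0.150, 0.075, 0.038, 0.019, 0.009]   ## assumed typical decline in each block of 20 iterations
T = T0
probs = []
for it in range(100):
    block = it // 20                              ## block of 20 iterations this iteration belongs to
    probs.append(np.exp(-beta*declines[block]/T))
    if it % n == 0: T = alpha*T                   ## temperature reduction, as in the main loop
for block in range(5):
    print('block {}: average acceptance probability {:.2f}'.format(block+1, np.mean(probs[20*block:20*(block+1)])))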
Repetitions are avoided with a simple hashing scheme.
A warning: if the number of iterations is not sufficiently smaller than the total number of combinations, there may be too many repetitions and delays. The simple approach for producing combinations implemented here does not address such cases.
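To illustrate the hashing scheme with a toy example of our own (it relies on the fact that every parameter here has at most 10 candidate values, so each index fits in one decimal digit):
## Toy illustration of the duplicate-detection hash used below: the indices of a combination's
## values, taken in the sorted order of the parameter names, are treated as decimal digits.
example = {'max_depth': 15, 'subsample': 0.5, 'colsample_bytree': 0.9,
           'eta': 0.05, 'gamma': 0.15, 'scale_pos_weight': 300}
weights = [10**k for k in range(len(tune_dic))]                                      ## 1, 10, 100, ...
indices = [tune_dic[name].index(example[name]) for name in sorted(tune_dic.keys())]
print(sum(w*i for w, i in zip(weights, indices)))   ## distinct combinations give distinct hash values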
import time
t0 = time.perf_counter()   ## time.clock() was removed in Python 3.8+
T = 0.40
best_params = dict()   ## initialize dictionary to hold the best parameters
best_f_score = -1.     ## initialize best f-score
prev_f_score = -1.     ## initialize previous f-score
prev_choice = None     ## initialize previous selection of parameters
weights = [10**k for k in range(len(tune_dic))]   ## hash weights: one decimal digit per parameter (all six)
hash_values = set()

for iter in range(maxiter):
    print('\nIteration = {:5d} T = {:12.6f}'.format(iter,T))

    ## find next selection of parameters not visited before
    while True:
        cur_choice = next_choice(prev_choice)   ## first selection or selection neighboring prev_choice

        ## indices of the selections in alphabetical order of the parameters
        indices = [tune_dic[name].index(cur_choice[name]) for name in sorted([*tune_dic.keys()])]

        ## check if the selection has already been visited
        hash_val = sum([i*j for (i, j) in zip(weights, indices)])
        if hash_val in hash_values:
            print('\nCombination revisited - searching again')
            ## alternative check via the results dataframe:
            # tmp = abs(results.loc[:,[*cur_choice.keys()]] - list(cur_choice.values()))
            # tmp = tmp.sum(axis=1)
            # if any(tmp==0):   ## selection has already been visited
            #     print('\nCombination revisited - searching again')
        else:
            hash_values.add(hash_val)
            break   ## break out of the while-loop

    ## train the model and obtain the f-score on the validation dataset
    f_score, model = do_train(cur_choice, param, dtrain, 'train', trainY, dvalid, 'valid', validY)

    ## store the parameters
    results.loc[iter,[*cur_choice.keys()]] = list(cur_choice.values())

    print('    F-Score: {:6.2f} previous: {:6.2f} best so far: {:6.2f}'.format(f_score, prev_f_score, best_f_score))

    if f_score > prev_f_score:
        print('    Local improvement')

        ## accept this combination as the new starting point
        prev_f_score = f_score
        prev_choice = cur_choice

        ## update the best parameters if the f-score is globally better
        if f_score > best_f_score:
            best_f_score = f_score
            print('    Global improvement - best f-score updated')
            for (key, value) in prev_choice.items():
                best_params[key] = value

    else:   ## f-score is smaller than the previous one
        ## accept this combination as the new starting point with probability exp(-(1.3 x f-score decline)/temperature)
        rnd = random.random()
        diff = f_score - prev_f_score
        thres = np.exp(1.3*diff/T)
        if rnd <= thres:
            print('    Worse result. F-Score change: {:8.4f} threshold: {:6.4f} random number: {:6.4f} -> accepted'.
                  format(diff, thres, rnd))
            prev_f_score = f_score
            prev_choice = cur_choice
        else:
            ## do not update the previous f-score and previous choice
            print('    Worse result. F-Score change: {:8.4f} threshold: {:6.4f} random number: {:6.4f} -> rejected'.
                  format(diff, thres, rnd))

    ## store results
    results.loc[iter,'F-Score'] = f_score
    results.loc[iter,'Best F-Score'] = best_f_score

    if iter % 5 == 0: T = 0.85*T   ## reduce the temperature every 5 iterations and continue

print('\n{:6.1f} minutes process time\n'.format((time.perf_counter() - t0)/60))
print('Best variable parameters found:\n')
print(best_params)
Iteration = 0 T = 0.400000
Parameters:
eta : 0.05 colsample_bytree : 0.9 scale_pos_weight : 300 gamma : 0.2 max_depth : 20 subsample : 0.5
F-Score: 0.80 previous: -1.00 best so far: -1.00
Local improvement
Global improvement - best f-score updated
Iteration = 1 T = 0.340000
selected move: gamma : from 0.20 to 0.15
Parameters:
eta : 0.05 colsample_bytree : 0.9 scale_pos_weight : 300 gamma : 0.15 max_depth : 20 subsample : 0.5
F-Score: 0.80 previous: 0.80 best so far: 0.80
Worse result. F-Score change: 0.0000 threshold: 1.0000 random number: 0.1169 -> accepted
Iteration = 2 T = 0.340000
selected move: eta : from 0.05 to 0.10
Parameters:
eta : 0.1 colsample_bytree : 0.9 scale_pos_weight : 300 gamma : 0.15 max_depth : 20 subsample : 0.5
F-Score: 0.80 previous: 0.80 best so far: 0.80
Worse result. F-Score change: -0.0087 threshold: 0.9672 random number: 0.9110 -> accepted
Iteration = 3 T = 0.340000
selected move: max_depth : from 20.00 to 15.00
Parameters:
eta : 0.1 colsample_bytree : 0.9 scale_pos_weight : 300 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.80 previous: 0.80 best so far: 0.80
Local improvement
Iteration = 4 T = 0.340000
selected move: eta : from 0.10 to 0.05
Parameters:
eta : 0.05 colsample_bytree : 0.9 scale_pos_weight : 300 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.80 previous: 0.80 best so far: 0.80
Worse result. F-Score change: -0.0001 threshold: 0.9998 random number: 0.6716 -> accepted
Iteration = 5 T = 0.340000
selected move: eta : from 0.05 to 0.01
Parameters:
eta : 0.01 colsample_bytree : 0.9 scale_pos_weight : 300 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.81 previous: 0.80 best so far: 0.80
Local improvement
Global improvement - best f-score updated
Iteration = 6 T = 0.289000
selected move: eta : from 0.01 to 0.05
Combination revisited - searching again
selected move: scale_pos_weight: from 300.00 to 400.00
Parameters:
eta : 0.01 colsample_bytree : 0.9 scale_pos_weight : 400 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.81 previous: 0.81 best so far: 0.81
Local improvement
Global improvement - best f-score updated
Iteration = 7 T = 0.289000
selected move: colsample_bytree: from 0.90 to 1.00
Parameters:
eta : 0.01 colsample_bytree : 1.0 scale_pos_weight : 400 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.81 previous: 0.81 best so far: 0.81
Worse result. F-Score change: -0.0043 threshold: 0.9809 random number: 0.0174 -> accepted
Iteration = 8 T = 0.289000
selected move: eta : from 0.01 to 0.05
Parameters:
eta : 0.05 colsample_bytree : 1.0 scale_pos_weight : 400 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.81 previous: 0.81 best so far: 0.81
Local improvement
Global improvement - best f-score updated
Iteration = 9 T = 0.289000
selected move: scale_pos_weight: from 400.00 to 500.00
Parameters:
eta : 0.05 colsample_bytree : 1.0 scale_pos_weight : 500 gamma : 0.15 max_depth : 15 subsample : 0.5
F-Score: 0.82 previous: 0.81 best so far: 0.81
Local improvement
Global improvement - best f-score updated
...
Iteration = 96 T = 0.015504
selected move: scale_pos_weight: from 500.00 to 600.00
Parameters:
eta : 0.1 colsample_bytree : 1.0 scale_pos_weight : 600 gamma : 0.1 max_depth : 15 subsample : 0.5
F-Score: 0.81 previous: 0.82 best so far: 0.86
Worse result. F-Score change: -0.0029 threshold: 0.7819 random number: 0.4103 -> accepted
Iteration = 97 T = 0.015504
selected move: max_depth : from 15.00 to 20.00
Parameters:
eta : 0.1 colsample_bytree : 1.0 scale_pos_weight : 600 gamma : 0.1 max_depth : 20 subsample : 0.5
F-Score: 0.82 previous: 0.81 best so far: 0.86
Local improvement
Iteration = 98 T = 0.015504
selected move: scale_pos_weight: from 600.00 to 500.00
Parameters:
eta : 0.1 colsample_bytree : 1.0 scale_pos_weight : 500 gamma : 0.1 max_depth : 20 subsample : 0.5
F-Score: 0.82 previous: 0.82 best so far: 0.86
Worse result. F-Score change: -0.0055 threshold: 0.6282 random number: 0.5258 -> accepted
Iteration = 99 T = 0.015504
selected move: subsample : from 0.50 to 0.60
Combination revisited - searching again
selected move: gamma : from 0.10 to 0.15
Parameters:
eta : 0.1 colsample_bytree : 1.0 scale_pos_weight : 500 gamma : 0.15 max_depth : 20 subsample : 0.5
F-Score: 0.82 previous: 0.82 best so far: 0.86
Worse result. F-Score change: 0.0000 threshold: 1.0000 random number: 0.2742 -> accepted
19.2 minutes process time
Best variable parameters found:
{'eta': 0.4, 'colsample_bytree': 1.0, 'scale_pos_weight': 50, 'gamma': 0.1, 'max_depth': 15, 'subsample': 0.5}
The evaluation on the test dataset results in an F-Score of 0.86, which is considered good given the high class imbalance. The run time was about 19 minutes on a 64-bit Windows system with 6GB RAM and an Intel Core i3 processor at 2.30GHz.
The best hyper-parameters found are in the ranges expected to be good according to all sources. One could then proceed as follows (a brief sketch is given after the list):
- Narrowing the ranges of these hyper-parameters,
- Possibly adding others which are not used here (for example, regularization parameters),
- Possibly doing some variable selection on the basis of the variable importance information,
- Importantly, combining different models in an ensemble.
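For instance, a second, narrower search could be set up around the best combination found, this time including the L1/L2 regularization parameters. The ranges below are purely illustrative choices of our own, not recommendations:
## Illustrative (hypothetical) grid for a follow-up search: narrowed ranges around the best
## combination found above, plus the regularization parameters lambda (L2) and alpha (L1).
tune_dic2 = OrderedDict()
tune_dic2['max_depth'] = [12, 15, 18]
tune_dic2['subsample'] = [0.4, 0.5, 0.6]
tune_dic2['colsample_bytree'] = [0.9, 1.0]
tune_dic2['eta'] = [0.2, 0.3, 0.4]
tune_dic2['gamma'] = [0.05, 0.10, 0.15]
tune_dic2['scale_pos_weight'] = [25, 50, 100]
tune_dic2['lambda'] = [0.5, 1.0, 2.0]    ## L2 regularization term on weights (default 1)
tune_dic2['alpha'] = [0.0, 0.5, 1.0]     ## L1 regularization term on weights (default 0)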
print('\nBest parameters found:\n')
print(best_params)
print('\nEvaluation on the test dataset\n')
best_f_score,best_model=do_train(best_params, param, dtrain,'train',trainY,dtest,'test',testY,print_conf_matrix=True)
print('\nF-score on the test dataset: {:6.2f}'.format(best_f_score))
f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey=False, figsize=(8,5))
ax1.plot(results['F-Score'])
ax2.plot(results['Best F-Score'])
ax1.set_xlabel('Iterations',fontsize=11)
ax2.set_xlabel('Iterations',fontsize=11)
ax1.set_ylabel('F-Score',fontsize=11)
ax2.set_ylabel('Best F-Score',fontsize=11)
ax1.set_ylim([0.7,0.9])
ax2.set_ylim([0.7,0.9])
plt.tight_layout()
plt.show()
print('\nVariable importance:\n')
p = xgb.plot_importance(best_model)
plt.show()
Best parameters found:
{'eta': 0.4, 'colsample_bytree': 1.0, 'scale_pos_weight': 50, 'gamma': 0.1, 'max_depth': 15, 'subsample': 0.5}
Evaluation on the test dataset
Parameters:
eta : 0.4 colsample_bytree : 1.0 scale_pos_weight : 50 gamma : 0.1 max_depth : 15 subsample : 0.5
confusion matrix
----------------
tn: 85265 fp: 18
fn: 25 tp: 134
F-score on the test dataset: 0.86
Variable importance: