# In-class Kaggle Competition - Example - v3

This notebook gives an example implementation of a machine learning pipeline for the in-class Kaggle competition.

v1 - Updated on 05/03/2020:
- Implementing a basic machine learning pipeline
- Data processing: grouping (binning), dummy coding, variable transformation, variable selection
- Methodology: LR, RF, 10-fold CV, AUC

v2 - Updated on 13/03/2020:
- Addressing information leakage in the train data, the data processing and the k-fold CV process
- Changing the previous experimental setup to a train, validation, test setup

v3 - Updated on 27/03/2020:
- Hyper-parameter tuning and reference papers

## 1. Initiation

In [1]:
Sys.setenv(LANG = "en")

# Data processing library
library(data.table)       # Data manipulation
library(plyr)             # Data manipulation
library(stringr)          # String, text processing
library(vita)             # Quickly check variable importance
library(dataPreparation)  # Data preparation library
library(woeBinning)       # Decision tree–based binning for numerical and categorical variables
library(Boruta)           # Variable selection

# Machine learning library
library(mlr)          # Machine learning framework
library(caret)         # Data processing and machine learning framework
library(MASS)          # LDA
library(randomForest)  # RF
library(gbm)           # Gradient boosting machine
library(xgboost)       # XGBoost
library(LLM)           # Logit Leaf Model
library(ada)           # AdaBoost
library(kknn)          # k-nearest neighbours

"package 'dataPreparation' was built under R version 3.6.3"
Loading required package: lubridate
"package 'lubridate' was built under R version 3.6.2"
Attaching package: 'lubridate'

The following object is masked from 'package:plyr':

    here

The following objects are masked from 'package:data.table':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following object is masked from 'package:base':

    date

Loading required package: Matrix
Loading required package: progress
"package 'progress' was built under R version 3.6.1"
dataPreparation 0.

In [2]:
#install.packages('kknn')

## 2. Data summary and processing

### 2.1. Data summary

#### Read and print out some data

In [3]:
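# Read train (full), test (holdout)
# stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0)
train_full <- read.csv('./data/Kaggle/input/bank_mkt_train.csv', stringsAsFactors=TRUE)  # Training dataset
test_holdout <- read.csv('./data/Kaggle/input/bank_mkt_test.csv', stringsAsFactors=TRUE)  # Holdout dataset without response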

In [4]:
colnames(test_holdout)

In [5]:
train_full

client_id,age,job,marital,education,default,housing,loan,contact,month,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,subscribe
2,29,housemaid,single,high.school,no,no,no,telephone,may,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.858,5191.0,0
3,39,unemployed,married,basic.9y,unknown,yes,no,telephone,jun,...,6,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,0
4,49,blue-collar,married,basic.6y,unknown,no,no,cellular,nov,...,2,999,0,nonexistent,-0.1,93.200,-42.0,4.153,5195.8,0
5,32,self-employed,single,university.degree,no,yes,no,cellular,may,...,3,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,0
6,29,admin.,single,high.school,unknown,yes,no,cellular,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,0
7,51,self-employed,married,university.degree,unknown,yes,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.864,5228.1,0
8,34,blue-collar,married,basic.4y,no,yes,no,cellular,nov,...,1,999,0,nonexistent,-0.1,93.200,-42.0,4.153,5195.8,0
9,52,services,married,high.school,no,yes,no,cellular,nov,...,1,999,0,nonexistent,-0.1,93.200,-42.0,4.153,5195.8,0
14,52,admin.,married,university.degree,no,yes,no,cellular,nov,...,3,999,0,nonexistent,-0.1,93.200,-42.0,4.076,5195.8,0
15,29,admin.,single,university.degree,no,yes,no,cellular,jun,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.266,5076.2,0


In [6]:
colnames(train_full)

In [7]:
# Print out to check the data type
str(train_full)

'data.frame':	7000 obs. of  21 variables:
 $ client_id     : int  2 3 4 5 6 7 8 9 14 15 ...
 $ age           : int  29 39 49 32 29 51 34 52 52 29 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 11 2 7 1 7 2 8 1 1 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 3 2 2 3 3 2 2 2 2 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 4 3 2 7 4 7 1 4 7 7 ...
 $ default       : Factor w/ 2 levels "no","unknown": 1 2 2 1 2 2 1 1 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 3 1 3 3 3 3 3 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 1 1 1 2 1 1 1 1 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 5 8 7 4 5 8 8 8 5 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 1 4 2 1 4 4 4 3 2 ...
 $ campaign      : int  3 6 2 3 2 1 1 1 3 1 ...
 $ pdays         : int  999 999 999 999 9

#### Correct the variable: campaign

Since `campaign` also counts the last contact, its value should be reduced by 1.

In [8]:
# Fix the value
train_full[, 'campaign'] <- train_full[, 'campaign'] - 1
test_holdout[, 'campaign'] <- test_holdout[, 'campaign'] - 1

# Quick check
min(train_full[, 'campaign'])  # Previously = 1
min(test_holdout[, 'campaign'])  # Previously = 1

#### Check and fix data error (if any)

In [9]:
# Check missing value
apply(is.na(train_full), 2, sum)

#### Split train (full) data into train, valid, test (60:20:20)

In [10]:
set.seed(1)

train_idx <- caret::createDataPartition(y=train_full[, 'subscribe'], p=.6, list=F)
train <- train_full[train_idx, ]  # Train 60%
valid_test <- train_full[-train_idx, ]  # Valid + Test 40%

valid_idx <- caret::createDataPartition(y=valid_test[, 'subscribe'], p=.5, list=F)
valid <- valid_test[valid_idx, ]  # Valid 20%
test <- valid_test[-valid_idx, ]  # Test 20%
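
One way to make the class stratification of a 60:20:20 split explicit is to sample within each class separately. The following is a minimal base-R sketch on toy data; the helper `stratified_idx` and the toy data frame are assumptions for illustration and are not part of the pipeline above.

```r
# Toy data: 100 rows, 20% positives (illustrative only)
set.seed(1)
toy <- data.frame(id = 1:100, subscribe = rep(c(0, 1), times = c(80, 20)))

# Hypothetical helper: sample a fraction p of the row positions within each class
stratified_idx <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Train 60%
train_idx <- stratified_idx(toy$subscribe, p = 0.6)
toy_train <- toy[train_idx, ]
toy_rest <- toy[-train_idx, ]

# Split the remaining 40% in half: valid 20%, test 20%
valid_idx <- stratified_idx(toy_rest$subscribe, p = 0.5)
toy_valid <- toy_rest[valid_idx, ]
toy_test <- toy_rest[-valid_idx, ]

# Share of positives in each set
sapply(list(train = toy_train, valid = toy_valid, test = toy_test),
       function(d) mean(d$subscribe))
```

Sampling inside each class keeps the share of positives the same in all three sets, up to rounding of the per-class counts.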

#### Check the target variable class distribution

In [11]:
# By number
table(train$subscribe)
table(valid$subscribe)
table(test$subscribe)


   0    1 
3721  479 


   0    1 
1247  153 


   0    1 
1210  190 


In [12]:
# By percentage
table(train$subscribe) / nrow(train)
table(valid$subscribe) / nrow(valid)
table(test$subscribe) / nrow(test)


        0         1 
0.8859524 0.1140476 


        0         1 
0.8907143 0.1092857 


        0         1 
0.8642857 0.1357143 


#### Quick check of which variables are potentially important

Note: Running this permutation algorithm may take some time.

In [13]:
# PIMP algorithm for the permutation variable importance measure
# https://cran.r-project.org/web/packages/vita/vita.pdf
X <- train[, 2:(ncol(train)-1)]  # All columns except client_id (first) and subscribe (last)
y <- as.factor(train[, 'subscribe'])
set.seed(1)  # randomForest() has no seed argument, so seed the RNG directly
rf_model <- randomForest(X, y, mtry=3, ntree=100, importance=T)
pimp_varImp <- PIMP(X, y, rf_model, S=10, parallel=F, seed=123)

In [14]:
# Print the variables sorted by importance, most important first
pimp_varImp$VarImp[order(pimp_varImp$VarImp[, 1], decreasing=T), ]
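
The idea behind PIMP is to compare each observed importance with a null distribution obtained by recomputing the importance on a permuted target. The sketch below illustrates that idea in base R only. As a loudly labelled simplification, the random forest importance is replaced by the absolute correlation between each predictor and the target, and the toy data and all names are assumptions for illustration.

```r
set.seed(123)
n <- 500
# Toy data (illustrative only): x1 drives the target, x2 is pure noise
X_toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y_toy <- rbinom(n, size = 1, prob = plogis(1.5 * X_toy$x1))

# Stand-in importance (NOT random forest importance):
# absolute correlation of each predictor with the target
importance <- function(X, y) sapply(X, function(col) abs(cor(col, y)))

obs_imp <- importance(X_toy, y_toy)

# Null distribution: recompute the importance S times on a permuted target
S <- 200
null_imp <- replicate(S, importance(X_toy, sample(y_toy)))  # matrix: variables x S

# Empirical p-value: share of permutations at least as important as observed
p_value <- rowMeans(null_imp >= obs_imp)
round(cbind(importance = obs_imp, p_value = p_value), 3)
```

A variable whose observed importance rarely occurs under the permuted target gets a small p-value, which is the basis for ranking variables in the cell above.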

### 2.2. Feature engineering

Hints:
- Focus on the most important variables.
- Build a small framework for generating and testing new variables.

#### Add variable: month_spring

In [15]:
# Add new variable to train and test (holdout)
# Train, valid, test
train[, 'month_spring'] <- as.logical(train$month %in% c('mar', 'apr', 'may'))
valid[, 'month_spring'] <- as.logical(valid$month %in% c('mar', 'apr', 'may'))
test[, 'month_spring'] <- as.logical(test$month %in% c('mar', 'apr', 'may'))
# Test (holdout)
test_holdout[, 'month_spring'] <- as.logical(test_holdout$month %in% c('mar', 'apr', 'may'))

#### Add variable: month_summer

In [16]:
# Add new variable to train and test (holdout)
# Train, valid, test
train[, 'month_summer'] <- as.logical(train$month %in% c('jun', 'jul', 'aug'))
valid[, 'month_summer'] <- as.logical(valid$month %in% c('jun', 'jul', 'aug'))
test[, 'month_summer'] <- as.logical(test$month %in% c('jun', 'jul', 'aug'))
# Test (holdout)
test_holdout[, 'month_summer'] <- as.logical(test_holdout$month %in% c('jun', 'jul', 'aug'))

#### Add variable: month_autumn

In [17]:
# Add new variable to train and test (holdout)
# Train, valid, test
train[, 'month_autumn'] <- as.logical(train$month %in% c('sep', 'oct', 'nov'))
valid[, 'month_autumn'] <- as.logical(valid$month %in% c('sep', 'oct', 'nov'))
test[, 'month_autumn'] <- as.logical(test$month %in% c('sep', 'oct', 'nov'))
# Test (holdout)
test_holdout[, 'month_autumn'] <- as.logical(test_holdout$month %in% c('sep', 'oct', 'nov'))

#### Add variable: month_winter

In [18]:
# Add new variable to train and test (holdout)
# Train, valid, test
train[, 'month_winter'] <- as.logical(train$month %in% c('dec', 'jan', 'feb'))
valid[, 'month_winter'] <- as.logical(valid$month %in% c('dec', 'jan', 'feb'))
test[, 'month_winter'] <- as.logical(test$month %in% c('dec', 'jan', 'feb'))
# Test (holdout)
test_holdout[, 'month_winter'] <- as.logical(test_holdout$month %in% c('dec', 'jan', 'feb'))

#### Add variable: age > mean(age)

In [19]:
# Add new variable to train and test (holdout)
# Use the train mean as the threshold for every set, so that the variable
# means the same thing everywhere and no valid/test statistics are used
age_mean <- mean(train$age)
# Train, valid, test
train[, 'age_ge_mean'] <- train$age > age_mean
valid[, 'age_ge_mean'] <- valid$age > age_mean
test[, 'age_ge_mean'] <- test$age > age_mean
# Test (holdout)
test_holdout[, 'age_ge_mean'] <- test_holdout$age > age_mean

#### Add variable: pdays_999

In [20]:
# Add new variable to train and test (holdout)
# pdays == 999 is a special value
# Train, valid, test
train[, 'pdays_999'] <- as.logical(train$pdays == 999)
valid[, 'pdays_999'] <- as.logical(valid$pdays == 999)
test[, 'pdays_999'] <- as.logical(test$pdays == 999)
# Test (holdout)
test_holdout[, 'pdays_999'] <- as.logical(test_holdout$pdays == 999)
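
The cells above repeat the same assignment for each of the four data sets. One possible framework for adding new variables, sketched here on toy data, is to describe each variable as a function and apply all of them to every data set in one pass. The names `feature_defs` and `add_features` are hypothetical and not part of the original notebook.

```r
# Toy stand-ins for train / valid / test / test_holdout (illustrative only)
datasets <- list(
  train = data.frame(month = c('mar', 'jun', 'dec'), pdays = c(999, 3, 999)),
  valid = data.frame(month = c('apr', 'nov'), pdays = c(6, 999))
)

# Each new variable is a function of one data frame that returns one column
feature_defs <- list(
  month_spring = function(d) d$month %in% c('mar', 'apr', 'may'),
  pdays_999 = function(d) d$pdays == 999
)

# Hypothetical helper: add every defined variable to a single data frame
add_features <- function(d, defs) {
  for (name in names(defs)) d[[name]] <- defs[[name]](d)
  d
}

datasets <- lapply(datasets, add_features, defs = feature_defs)
datasets$valid
```

Variables that depend on a statistic, such as a mean, would need that statistic computed once on the train data and captured inside the function, so that every data set is transformed with the same value.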

### 2.3. Processing data

#### 2.3.1. Value transformation

- Categorical variables: remapping
- Continuous variables: discretization

Reference:  

Coussement, K., Lessmann, S., & Verstraeten, G. (2017). A comparative analysis of data preparation algorithms for customer churn prediction: A case study in the telecommunication industry. Decision Support Systems, 95, 27-36.

#### Get the list of categorical, boolean and numerical variables

In [21]:
# Get the IV and DV list name
# Dependent variable (DV)
dv_list <- c('subscribe')
# Independent variable (IV)
iv_list <- setdiff(colnames(train), dv_list)  # Exclude the target variable
iv_list <- setdiff(iv_list, 'client_id')  # Exclude the client_id

In [22]:
# Pick out categorical, boolean and numerical variable
iv_cat_list <- c()  # List to store categorical variable
iv_bool_list <- c()  # List to store boolean variable
iv_num_list <- c()  # List to store numerical variable
for (v in iv_list) {
    if (class(train[, v]) == 'factor') {  # Factor == categorical variable
        iv_cat_list <- c(iv_cat_list, v)
    } else if (class(train[, v]) == 'logical') {  # Logical == boolean variable
        iv_bool_list <- c(iv_bool_list, v)
    } else {  # Non-factor + Non-logical == numerical variable
        iv_num_list <- c(iv_num_list, v)
    }
}
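
The same three lists can also be built without a loop. The sketch below uses `is.factor` and `is.logical` on a toy data frame (the toy data and variable names are assumptions for illustration). These predicates also work for columns with more than one class, such as ordered factors, where a comparison like `class(x) == 'factor'` returns more than one value.

```r
# Toy data frame (illustrative only)
toy <- data.frame(
  job = factor(c('admin.', 'student')),
  pdays_999 = c(TRUE, FALSE),
  age = c(29L, 52L),
  euribor3m = c(4.858, 1.266)
)

is_cat <- sapply(toy, is.factor)    # Factor == categorical variable
is_bool <- sapply(toy, is.logical)  # Logical == boolean variable

cat_vars <- names(toy)[is_cat]
bool_vars <- names(toy)[is_bool]
num_vars <- names(toy)[!is_cat & !is_bool]  # Everything else == numerical variable

list(categorical = cat_vars, boolean = bool_vars, numerical = num_vars)
```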

#### Grouping (or remapping) categorical variables - Decision tree–based remapping

Reference:  

Package ‘woeBinning’: https://cran.r-project.org/web/packages/woeBinning/woeBinning.pdf

Test the variable remapping on one categorical variable.

In [23]:
# Grouping the 12 categories of the variable job into 3 groups using WOE
binning_cat <- woe.binning(train, 'subscribe', 'job')
binning_cat

Unnamed: 0,Group.2,Group.1,woe,iv.total.final,1,0,col.perc.a,col.perc.b,iv.bins
2,management + technician + misc. level neg. + blue-collar + services,blue-collar,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
3,management + technician + misc. level neg. + blue-collar + services,entrepreneur,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
4,management + technician + misc. level neg. + blue-collar + services,housemaid,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
5,management + technician + misc. level neg. + blue-collar + services,management,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
6,management + technician + misc. level neg. + blue-collar + services,services,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
7,management + technician + misc. level neg. + blue-collar + services,technician,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
8,management + technician + misc. level neg. + blue-collar + services,unknown,-26.98343,0.1237265,235,2391,0.4906054,0.6425692,0.041005042
1,admin.,admin.,15.07074,0.1237265,135,902,0.2818372,0.242408,0.005942273
9,misc. level pos.,retired,68.22718,0.1237265,109,428,0.2275574,0.1150228,0.076779164
10,misc. level pos.,self-employed,68.22718,0.1237265,109,428,0.2275574,0.1150228,0.076779164
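
As a worked check of the table above, the `woe` and `iv.bins` values of the `admin.` group can be recomputed by hand from the class counts. The counts (135 and 902 for the group, 479 and 3721 for the whole train set) are taken from the outputs in this notebook. That the table reports WOE multiplied by 100 is inferred from the printed values, so treat it as an assumption.

```r
# Counts for the admin. group (from the binning table)
n1_bin <- 135    # subscribe == 1 in the group
n0_bin <- 902    # subscribe == 0 in the group
# Class totals in train (from the class distribution check)
n1_all <- 479
n0_all <- 3721

col_perc_a <- n1_bin / n1_all  # share of all 1s that fall in the group
col_perc_b <- n0_bin / n0_all  # share of all 0s that fall in the group

woe <- 100 * log(col_perc_a / col_perc_b)                           # WOE, scaled by 100
iv_bin <- (col_perc_a - col_perc_b) * log(col_perc_a / col_perc_b)  # IV contribution of the group

c(woe = woe, iv_bin = iv_bin)
```

Both results agree with the tabulated 15.07074 and 0.005942273 up to rounding. A positive WOE means the group holds a larger share of the subscribers than of the non-subscribers.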


In [24]:
# Apply the binning to data
tmp <- woe.binning.deploy(train, binning_cat, add.woe.or.dum.var='woe')
head(tmp[, c('job', 'job.binned', 'woe.job.binned')])

Unnamed: 0,job,job.binned,woe.job.binned
1,housemaid,management + technician + misc. level neg. + blue-collar + services,-26.98343
4,self-employed,misc. level pos.,68.22718
7,blue-collar,management + technician + misc. level neg. + blue-collar + services,-26.98343
14,technician,management + technician + misc. level neg. + blue-collar + services,-26.98343
15,management,management + technician + misc. level neg. + blue-collar + services,-26.98343
16,admin.,admin.,15.07074


Apply the variable remapping to all categorical variables.

In [25]:
# Loop through all categorical variables
for (v in iv_cat_list) {
    
    # Remapping categorical variable on train data
    binning_cat <- woe.binning(train, 'subscribe', v)
    
    # Apply the binning to the train, valid and test data
    train <- woe.binning.deploy(train, binning_cat, add.woe.or.dum.var='woe')
    valid <- woe.binning.deploy(valid, binning_cat, add.woe.or.dum.var='woe')
    test <- woe.binning.deploy(test, binning_cat, add.woe.or.dum.var='woe')
    
    # Apply the binning to the test (holdout) data
    test_holdout <- woe.binning.deploy(test_holdout, binning_cat, add.woe.or.dum.var='woe')
}

#### Grouping (or discretizing) numerical variables - Decision tree–based discretization

Test the discretization on one numerical variable.

In [26]:
# Grouping the variable age into 4 groups using WOE
binning_num <- woe.binning(train, 'subscribe', 'age')
binning_num

Unnamed: 0,woe,cutpoints.final,cutpoints.final[-1],iv.total.final,1,0,col.perc.a,col.perc.b,iv.bins
"(-Inf,26]",62.64151,-inf,26,0.1257925,46,191,0.0960334,0.05133029,0.028002706
"(26,38]",1.589851,26.0,38,0.1257925,223,1705,0.4655532,0.45821016,0.000116744
"(38,50]",-53.033993,38.0,50,0.1257925,89,1175,0.1858038,0.31577533,0.068929114
"(50, Inf]",36.886531,50.0,Inf,0.1257925,121,650,0.2526096,0.17468422,0.028743969
Missing,,inf,Missing,0.1257925,0,0,0.0,0.0,


In [27]:
# Apply the binning to data
tmp <- woe.binning.deploy(train, binning_num, add.woe.or.dum.var='woe')
head(tmp[, c('age', 'age.binned', 'woe.age.binned')])

Unnamed: 0,age,age.binned,woe.age.binned
1,29,"(26,38]",1.589851
4,32,"(26,38]",1.589851
7,34,"(26,38]",1.589851
14,39,"(38,50]",-53.033993
15,39,"(38,50]",-53.033993
16,33,"(26,38]",1.589851


Apply the discretization to all numerical variables.

In [28]:
# Loop through all numerical variables
for (v in iv_num_list) {
    
    # Discretizing numerical variable on train data
    binning_num <- woe.binning(train, 'subscribe', v)
    
    # Apply the binning to the train, valid and test data
    train <- woe.binning.deploy(train, binning_num, add.woe.or.dum.var='woe')
    valid <- woe.binning.deploy(valid, binning_num, add.woe.or.dum.var='woe')
    test <- woe.binning.deploy(test, binning_num, add.woe.or.dum.var='woe')
    
    # Apply the binning to the test (holdout) data
    test_holdout <- woe.binning.deploy(test_holdout, binning_num, add.woe.or.dum.var='woe')
}

#### Grouping (or discretizing) numerical variables - Equal frequency discretization

Reference:  

Tutorial to prepare train and test set using dataPreparation: https://cran.r-project.org/web/packages/dataPreparation/vignettes/train_test_prep.html

Test the discretization on one numerical variable.

In [29]:
# Build the discretization
bins <- build_bins(dataSet=train, cols="age", n_bins=5, type="equal_freq", verbose=F)

# Print out to check
bins

In [30]:
# Apply to the data
tmp <- fastDiscretization(dataSet=train, bins=bins, verbose=F)
setDF(tmp); setDF(train)  # Convert data.table to data.frame
head(tmp[, 'age'])

Apply the discretization to all numerical variables.

In [31]:
# Loop through all numerical variables
for (v in iv_num_list) {
    
    # Discretizing numerical variable on train data, n_bins=5
    bins <- build_bins(dataSet=train, cols=v, n_bins=5, type="equal_freq", verbose=F)
    
    # Apply the binning to the train, valid and test data
    tmp <- fastDiscretization(dataSet=train, bins=bins, verbose=F)
    setDF(tmp); setDF(train)  # Convert data.table to data.frame
    train[, paste0(v, '_freq_bin')] <- tmp[, v]  # Add new variable
    
    tmp <- fastDiscretization(dataSet=valid, bins=bins, verbose=F)
    setDF(tmp); setDF(valid)  # Convert data.table to data.frame
    valid[, paste0(v, '_freq_bin')] <- tmp[, v]  # Add new variable
    
    tmp <- fastDiscretization(dataSet=test, bins=bins, verbose=F)
    setDF(tmp); setDF(test)  # Convert data.table to data.frame
    test[, paste0(v, '_freq_bin')] <- tmp[, v]  # Add new variable
    
    # Apply the binning to the test (holdout) data
    tmp <- fastDiscretization(dataSet=test_holdout, bins=bins, verbose=F)
    setDF(tmp); setDF(test_holdout)  # Convert data.table to data.frame
    test_holdout[, paste0(v, '_freq_bin')] <- tmp[, v]  # Add new variable
}
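
To show what equal-frequency discretization does without relying on `dataPreparation`, the sketch below derives the cut points from the quantiles of the train data and reuses them on another set. The toy vectors and the choice to open the outer bins to infinity are assumptions for illustration; they are not necessarily what `build_bins` does internally.

```r
set.seed(1)
# Toy stand-ins for the train and valid age columns (illustrative only)
age_train <- sample(18:90, size = 200, replace = TRUE)
age_valid <- c(17, 25, 40, 95)

# Equal-frequency cut points: quantiles of the train data only
n_bins <- 5
breaks <- unique(quantile(age_train, probs = seq(0, 1, length.out = n_bins + 1)))

# Open the outer bins so values outside the train range still get a bin
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

# The same train-based breaks are applied to every data set
train_bin <- cut(age_train, breaks = breaks)
valid_bin <- cut(age_valid, breaks = breaks)

table(train_bin)  # Counts per bin are roughly equal
valid_bin
```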

#### Grouping (or discretizing) numerical variables - Equal width discretization

Test the discretization on one numerical variable.

In [32]:
# Build the discretization
bins <- build_bins(dataSet=train, cols="age", n_bins=5, type="equal_width", verbose=F)

# Print out to check
bins

In [33]:
# Apply to the data
tmp <- fastDiscretization(dataSet=train, bins=bins, verbose=F)
setDF(tmp); setDF(train)  # Convert data.table to data.frame
head(tmp[, 'age'])

Apply the discretization to all numerical variables.

In [34]:
# Loop through all numerical variables
for (v in iv_num_list) {
    
    # Discretizing numerical variable on train data, n_bins=5
    bins <- build_bins(dataSet=train, cols=v, n_bins=5, type="equal_width", verbose=F)
    
    # Apply the binning to the train, valid and test data
    tmp <- fastDiscretization(dataSet=train, bins=bins, verbose=F)
    setDF(tmp); setDF(train)  # Convert data.table to data.frame
    train[, paste0(v, '_width_bin')] <- tmp[, v]  # Add new variable
    
    tmp <- fastDiscretization(dataSet=valid, bins=bins, verbose=F)
    setDF(tmp); setDF(valid)  # Convert data.table to data.frame
    valid[, paste0(v, '_width_bin')] <- tmp[, v]  # Add new variable
    
    tmp <- fastDiscretization(dataSet=test, bins=bins, verbose=F)
    setDF(tmp); setDF(test)  # Convert data.table to data.frame
    test[, paste0(v, '_width_bin')] <- tmp[, v]  # Add new variable
    
    # Apply the binning to the test (holdout) data
    tmp <- fastDiscretization(dataSet=test_holdout, bins=bins, verbose=F)
    setDF(tmp); setDF(test_holdout)  # Convert data.table to data.frame
    test_holdout[, paste0(v, '_width_bin')] <- tmp[, v]  # Add new variable
}
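
For comparison, equal-width discretization splits the range of the train data into intervals of identical width. A minimal base-R sketch under the same assumptions as before (toy vectors, outer bins opened to infinity, not necessarily the internals of `build_bins`):

```r
set.seed(1)
# Toy stand-ins for the train and valid age columns (illustrative only)
age_train <- sample(18:90, size = 200, replace = TRUE)
age_valid <- c(17, 25, 40, 95)

# Equal-width cut points: evenly spaced between the train minimum and maximum
n_bins <- 5
breaks <- seq(min(age_train), max(age_train), length.out = n_bins + 1)

# Open the outer bins so values outside the train range still get a bin
breaks[1] <- -Inf
breaks[n_bins + 1] <- Inf

# The same train-based breaks are applied to every data set
train_bin <- cut(age_train, breaks = breaks)
valid_bin <- cut(age_valid, breaks = breaks)

table(train_bin)  # Counts per bin can differ, unlike equal frequency
valid_bin
```

Equal width keeps the bins easy to interpret, while equal frequency guarantees that no bin is nearly empty; skewed variables such as `campaign` usually show the largest difference between the two.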

#### 2.3.2. Value representation

- Dummy coding
- Incidence replacement
- Weight of evidence (WoE conversion)

Reference:  

Coussement, K., Lessmann, S., & Verstraeten, G. (2017). A comparative analysis of data preparation algorithms for customer churn prediction: A case study in the telecommunication industry. Decision Support Systems, 95, 27-36.

All about Categorical Variable Encoding: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

#### Get the updated list of categorical, boolean and numerical variables

In [35]:
# Get the IV and DV list name
# Dependent variable (DV)
dv_list <- c('subscribe')
# Independent variable (IV)
iv_list <- setdiff(colnames(train), dv_list)  # Exclude the target variable
iv_list <- setdiff(iv_list, 'client_id')  # Exclude the client_id

In [36]:
# Pick out categorical, boolean and numerical variable
iv_cat_list <- c()  # List to store categorical variable
iv_bool_list <- c()  # List to store boolean variable
iv_num_list <- c()  # List to store numerical variable
for (v in iv_list) {
    if (class(train[, v]) == 'factor') {  # Factor == categorical variable
        iv_cat_list <- c(iv_cat_list, v)
    } else if (class(train[, v]) == 'logical') {  # Logical == boolean variable
        iv_bool_list <- c(iv_bool_list, v)
    } else {  # Non-factor + Non-logical == numerical variable
        iv_num_list <- c(iv_num_list, v)
    }
}

#### Convert categorical variables to dummy

Test the dummy coding on one categorical variable.

In [37]:
# Build the dummy encoding
encoding <- build_encoding(dataSet=train, cols="job", verbose=F)

In [38]:
# Transform the categorical variable
tmp <- one_hot_encoder(dataSet=train, encoding=encoding, type='logical', drop=F, verbose=F)
setDF(tmp)
tmp <- tmp[, -ncol(tmp)]  # Drop the last dummy column
head(tmp[, 84:ncol(tmp)])

job.admin.,job.blue.collar,job.entrepreneur,job.housemaid,job.management,job.retired,job.self.employed,job.services,job.student,job.technician,job.unemployed
False,False,False,True,False,False,False,False,False,False,False
False,False,False,False,False,False,True,False,False,False,False
False,True,False,False,False,False,False,False,False,False,False
False,False,False,False,False,False,False,False,False,True,False
False,False,False,False,True,False,False,False,False,False,False
True,False,False,False,False,False,False,False,False,False,False


Apply the dummy coding to all categorical variables.

In [39]:
# Loop through all categorical variables
for (v in iv_cat_list) {
    
    # Build the dummy encoding on train data
    encoding <- build_encoding(dataSet=train, cols=v, verbose=F)
    
    # Apply the encoding to the train, valid and test data
    train <- one_hot_encoder(dataSet=train, encoding=encoding, type='logical', drop=F, verbose=F)
    setDF(train)
    train <- train[, -ncol(train)]  # Drop the last dummy column
    
    valid <- one_hot_encoder(dataSet=valid, encoding=encoding, type='logical', drop=F, verbose=F)
    setDF(valid)
    valid <- valid[, -ncol(valid)]  # Drop the last dummy column
    
    test <- one_hot_encoder(dataSet=test, encoding=encoding, type='logical', drop=F, verbose=F)
    setDF(test)
    test <- test[, -ncol(test)]  # Drop the last dummy column
    
    # Apply the encoding to the test (holdout) data
    test_holdout <- one_hot_encoder(dataSet=test_holdout, encoding=encoding, type='logical', drop=F, verbose=F)
    setDF(test_holdout)
    test_holdout <- test_holdout[, -ncol(test_holdout)]  # Drop the last dummy column
}
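
The key point of the loop above is that the set of dummy columns is learned on the train data and then reused for every other set. A minimal base-R sketch of that idea follows; the helper `dummy_encode` and the toy vectors are assumptions for illustration and do not reproduce `one_hot_encoder` exactly.

```r
# Toy data (illustrative only); 'student' never appears in train
job_train <- c('admin.', 'retired', 'services', 'admin.')
job_valid <- c('retired', 'student')

# Learn the category levels on train and reuse them everywhere
job_levels <- sort(unique(job_train))

# Hypothetical helper: one logical column per level, last level dropped
dummy_encode <- function(x, levels, prefix) {
  kept <- levels[-length(levels)]          # The dropped level acts as the reference
  m <- outer(as.character(x), kept, '==')  # n x (k - 1) logical matrix
  colnames(m) <- paste0(prefix, '.', kept)
  m
}

enc_train <- dummy_encode(job_train, job_levels, prefix = 'job')
enc_valid <- dummy_encode(job_valid, job_levels, prefix = 'job')
enc_valid
```

With this scheme a category that is unseen in train, such as `student` here, is encoded as all `FALSE` and therefore cannot be told apart from the reference category. Whether that is acceptable depends on the data.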

#### Represent categorical variables using incidence of target variable

Test the incidence replacement on one categorical variable.

In [40]:
# Find the incidence rates per category of a variable
tb <- table(train$job, train$subscribe)
incidence_map <- data.frame('v1'=rownames(tb), 'v2'=tb[, '1'] / (tb[, '0'] + tb[, '1']))
colnames(incidence_map) <- c('job', 'job_incidence')
incidence_map

Unnamed: 0,job,job_incidence
admin.,admin.,0.13018322
blue-collar,blue-collar,0.08278146
entrepreneur,entrepreneur,0.10493827
housemaid,housemaid,0.08256881
management,management,0.11006289
retired,retired,0.24102564
self-employed,self-employed,0.13548387
services,services,0.07616708
student,student,0.29347826
technician,technician,0.09571429


In [41]:
# Convert the categories with incidences
tmp <- plyr::join(x=train, y=incidence_map, by='job', type="left", match="all")  # Left join
head(tmp[, c('job', 'job_incidence')])

job,job_incidence
housemaid,0.08256881
self-employed,0.13548387
blue-collar,0.08278146
technician,0.09571429
management,0.11006289
admin.,0.13018322


Apply the incidence replacement to all categorical variables.

In [42]:
# Loop through all categorical variables
for (v in iv_cat_list) {
    
    # Find the incidence rates per category of a variable
    tb <- table(train[, v], train[, 'subscribe'])
    incidence_map <- data.frame('v1'=rownames(tb), 'v2'=tb[, '1'] / (tb[, '0'] + tb[, '1']))
    colnames(incidence_map) <- c(v, paste0(v, '_incidence'))  # Rename the columns to join
    
    # Apply the variable representation to the train, valid and test data
    train <- plyr::join(x=train, y=incidence_map, by=v, type="left", match="all")
    valid <- plyr::join(x=valid, y=incidence_map, by=v, type="left", match="all")
    test <- plyr::join(x=test, y=incidence_map, by=v, type="left", match="all")
    
    # Apply the variable representation to the test (holdout) data
    test_holdout <- plyr::join(x=test_holdout, y=incidence_map, by=v, type="left", match="all")
}
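
With a left join, a category that occurs in valid, test or holdout but not in train receives `NA` as its incidence. One possible way to handle this, sketched below on toy data, is to fall back to the overall train incidence. The helper `encode_incidence` and the fallback rule are assumptions for illustration, not part of the original pipeline.

```r
# Toy data (illustrative only); 'student' never appears in train
train_toy <- data.frame(job = c('admin.', 'admin.', 'retired', 'retired', 'retired'),
                        subscribe = c(0, 1, 1, 1, 0))
valid_toy <- data.frame(job = c('retired', 'student'))

# Incidence (share of subscribe == 1) per category, learned on train only
incidence <- sapply(split(train_toy$subscribe, train_toy$job), mean)
overall <- mean(train_toy$subscribe)

# Hypothetical helper: look up the incidence, fall back to the overall rate
encode_incidence <- function(x, incidence, fallback) {
  out <- unname(incidence[as.character(x)])  # Unknown categories give NA
  out[is.na(out)] <- fallback
  as.numeric(out)
}

valid_toy$job_incidence <- encode_incidence(valid_toy$job, incidence, overall)
valid_toy
```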

#### Represent categorical variables using weight-of-evidence conversion

Test the weight-of-evidence conversion on one categorical variable.

In [43]:
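# Find the log-odds per category of a variable; this equals the usual WOE up to an
# additive constant (the log of the overall odds) and is infinite if a category has no 0s or no 1s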
tb <- table(train$job, train$subscribe)
woe_map <- data.frame('v1'=rownames(tb), 'v2'=log(tb[, '1'] / tb[, '0']))
colnames(woe_map) <- c('job', 'job_woe')
woe_map

Unnamed: 0,job,job_woe
admin.,admin.,-1.8993397
blue-collar,blue-collar,-2.4051417
entrepreneur,entrepreneur,-2.1435204
housemaid,housemaid,-2.4079456
management,management,-2.0900988
retired,retired,-1.1470647
self-employed,self-employed,-1.8533174
services,services,-2.4956019
student,student,-0.8785504
technician,technician,-2.2457778


In [44]:
# Convert the categories with WOE
tmp <- plyr::join(x=train, y=woe_map, by='job', type="left", match="all")  # Left join
head(tmp[, c('job', 'job_woe')])

job,job_woe
housemaid,-2.407946
self-employed,-1.853317
blue-collar,-2.405142
technician,-2.245778
management,-2.090099
admin.,-1.89934


Apply the weight-of-evidence conversion to all categorical variables.

In [45]:
# Loop through all categorical variables
for (v in iv_cat_list) {
    
    # Find the WOE per category of a variable
    tb <- table(train[, v], train[, 'subscribe'])
    woe_map <- data.frame('v1'=rownames(tb), 'v2'=log(tb[, '1'] / tb[, '0']))
    colnames(woe_map) <- c(v, paste0(v, '_woe'))  # Rename the columns to join
    
    # Apply the variable representation to the train, valid and test data
    train <- plyr::join(x=train, y=woe_map, by=v, type="left", match="all")
    valid <- plyr::join(x=valid, y=woe_map, by=v, type="left", match="all")
    test <- plyr::join(x=test, y=woe_map, by=v, type="left", match="all")
    
    # Apply the variable representation to the test (holdout) data
    test_holdout <- plyr::join(x=test_holdout, y=woe_map, by=v, type="left", match="all")
}
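
A category with no subscribers in the train data makes the log ratio above infinite (the notebook later replaces +/-Inf by NA and imputes the mean in section 2.4.1). The sketch below, on made-up toy data, shows one common alternative: add a small constant to both counts before taking the log. The helper name `smoothed_woe_map` and the smoothing constant 0.5 are assumptions for illustration, not part of the original pipeline.

```r
# Hypothetical helper: log ratio per category with additive smoothing
smoothed_woe_map <- function(x, y, smooth = 0.5) {
  tb <- table(x, y)  # counts per category and class
  woe <- log((tb[, "1"] + smooth) / (tb[, "0"] + smooth))
  data.frame(level = rownames(tb), woe = as.numeric(woe),
             stringsAsFactors = FALSE)
}

# Toy data: category "c" has no positive case
toy <- data.frame(
  job = factor(c("a", "a", "a", "b", "b", "c", "c")),
  subscribe = factor(c(0, 1, 0, 1, 0, 0, 0), levels = c(0, 1))
)

raw <- table(toy$job, toy$subscribe)
log(raw[, "1"] / raw[, "0"])  # "c" is -Inf without smoothing

woe_toy <- smoothed_woe_map(toy$job, toy$subscribe)
woe_toy                       # all values are finite
```

A larger smoothing constant pulls rare categories more strongly towards zero, so it is itself a choice worth checking on the validation data.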

#### 2.3.3. Other variable transformations

#### Log-transform a numerical variable

In [46]:
# Transform the variable age on train and test (holdout)
# Train, valid, test
train[, 'age_log'] <- log(train[, 'age'])
valid[, 'age_log'] <- log(valid[, 'age'])
test[, 'age_log'] <- log(test[, 'age'])
# Test (holdout)
test_holdout[, 'age_log'] <- log(test_holdout[, 'age'])

#### Standardize a numerical variable

In [47]:
# Standardize the variable age using the mean and sd of the train data only,
# so that all sets share one scale and no information leaks from valid/test
age_mean <- mean(train[, 'age'], na.rm=T)
age_sd <- sd(train[, 'age'], na.rm=T)
# Train, valid, test
train[, 'age_scaled'] <- (train[, 'age'] - age_mean) / age_sd  # sd = 1, mean = 0 on train
valid[, 'age_scaled'] <- (valid[, 'age'] - age_mean) / age_sd
test[, 'age_scaled'] <- (test[, 'age'] - age_mean) / age_sd
# Test (holdout)
test_holdout[, 'age_scaled'] <- (test_holdout[, 'age'] - age_mean) / age_sd

### 2.4. Variable selection

Reference:  

Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1), 211-229.

Boruta: https://www.datacamp.com/community/tutorials/feature-selection-R-boruta

#### Get the updated list of categorical, boolean and numerical variables

In [48]:
# Get the IV and DV list name
# Dependent variable (DV)
dv_list <- c('subscribe')
# Independent variable (IV)
iv_list <- setdiff(colnames(train), dv_list)  # Exclude the target variable
iv_list <- setdiff(iv_list, 'client_id')  # Exclude the client_id

In [49]:
# Pick out categorical, boolean and numerical variable
iv_cat_list <- c()  # List to store categorical variable
iv_bool_list <- c()  # List to store boolean variable
iv_num_list <- c()  # List to store numerical variable
for (v in iv_list) {
    if (is.factor(train[, v])) {  # Factor == categorical variable
        iv_cat_list <- c(iv_cat_list, v)
    } else if (is.logical(train[, v])) {  # Logical == boolean variable
        iv_bool_list <- c(iv_bool_list, v)
    } else {  # Non-factor + Non-logical == numerical variable
        iv_num_list <- c(iv_num_list, v)
    }
}

#### 2.4.1. Variable correcting and filtering

#### Check and correct +/-Inf values (if any)

In [50]:
# Check +/-Inf values
# Train, valid, test
sum(apply(sapply(train, is.infinite), 2, sum))
sum(apply(sapply(valid, is.infinite), 2, sum))
sum(apply(sapply(test, is.infinite), 2, sum))
# Test (holdout)
sum(apply(sapply(test_holdout, is.infinite), 2, sum))

In [51]:
# Replace +/-Inf values with NA
# Train, valid, test
train[sapply(train, is.infinite)] <- NA
valid[sapply(valid, is.infinite)] <- NA
test[sapply(test, is.infinite)] <- NA
# Test (holdout)
test_holdout[sapply(test_holdout, is.infinite)] <- NA

#### Check and correct missing values (if any)

In [52]:
# Check missing value
# Train, valid, test
sum(apply(is.na(train), 2, sum))
sum(apply(is.na(valid), 2, sum))
sum(apply(is.na(test), 2, sum))
# Test (holdout)
sum(apply(is.na(test_holdout), 2, sum))

In [53]:
# Impute missing values in numerical variables by the mean of the train data
for (v in iv_num_list) {
    # Compute the mean on train only and reuse it, to avoid information leakage
    v_mean <- mean(train[, v], na.rm=T)
    
    # Train, valid, test
    train[is.na(train[, v]), v] <- v_mean
    valid[is.na(valid[, v]), v] <- v_mean
    test[is.na(test[, v]), v] <- v_mean
    
    # Test (holdout)
    test_holdout[is.na(test_holdout[, v]), v] <- v_mean
}

#### Drop categorical variables (all were processed)

In [54]:
for (v in iv_cat_list) {
    # Train, valid, test
    train[, v] <- NULL
    valid[, v] <- NULL
    test[, v] <- NULL
    
    # Test (holdout)
    test_holdout[, v] <- NULL
}

#### Convert boolean variable to numerical

In [55]:
# Convert boolean to int
for (v in iv_bool_list) {
    # Train, valid, test
    train[, v] <- as.integer(train[, v])
    valid[, v] <- as.integer(valid[, v])
    test[, v] <- as.integer(test[, v])
    
    # Test (holdout)
    test_holdout[, v] <- as.integer(test_holdout[, v])
}

#### Drop constant variable (i.e. variance=0)

In [56]:
# Find the constant variable
var_list <- c()
for (v in c(iv_num_list, iv_bool_list)) {
    var_list <- c(var_list, var(train[, v], na.rm=T))
}
constant_var <- c(iv_num_list, iv_bool_list)[var_list == 0]
constant_var

In [57]:
# Drop the constant variable
for (v in constant_var) {
    # Train, valid, test
    train[, v] <- NULL
    valid[, v] <- NULL
    test[, v] <- NULL
    
    # Test (holdout)
    test_holdout[, v] <- NULL
}

#### 2.4.2. Variable selection: Fisher Score

In [58]:
FisherScore <- function(basetable, depvar, IV_list) {
  "
  This function calculates the Fisher score of each variable in IV_list.
  
  Ref:
  ---
  Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1), 211-229.
  "
  
  # Get the unique values of dependent variable
  DV <- unique(basetable[, depvar])
  
  IV_FisherScore <- c()
  
  for (v in IV_list) {
    fs <- abs((mean(basetable[which(basetable[, depvar]==DV[1]), v]) - mean(basetable[which(basetable[, depvar]==DV[2]), v]))) /
      sqrt((var(basetable[which(basetable[, depvar]==DV[1]), v]) + var(basetable[which(basetable[, depvar]==DV[2]), v])))
    IV_FisherScore <- c(IV_FisherScore, fs)
  }
  
  return(data.frame(IV=IV_list, fisher_score=IV_FisherScore))
}

varSelectionFisher <- function(basetable, depvar, IV_list, num_select=20) {
  "
  This function will calculate the Fisher score for all IVs and select the best
  top IVs.

  Assumption: all variables of input dataset are converted into numeric type.
  "
  
  fs <- FisherScore(basetable, depvar, IV_list)  # Calculate Fisher Score for all IVs
  num_select <- min(num_select, length(IV_list))  # Top N IVs to be selected (at most the number of IVs)
  return(as.vector(fs[order(fs$fisher_score, decreasing=T), ][1:num_select, 'IV']))
}
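
As a small worked example of the score computed by `FisherScore` (the absolute difference of the two class means divided by the square root of the sum of the two class variances), the sketch below evaluates it on made-up toy data. The helper `fisher_score_1d` and the toy vectors are assumptions for illustration only.

```r
# Hypothetical one-variable helper mirroring the formula used in FisherScore()
fisher_score_1d <- function(x, y) {
  g <- split(x, y)  # values of x per class of y
  abs(mean(g[[1]]) - mean(g[[2]])) / sqrt(var(g[[1]]) + var(g[[2]]))
}

# Toy data: x_good separates the classes, x_weak barely does
y      <- c(0, 0, 0, 1, 1, 1)
x_good <- c(1, 2, 3, 7, 8, 9)
x_weak <- c(1, 5, 9, 2, 6, 10)

fs_good <- fisher_score_1d(x_good, y)  # |2 - 8| / sqrt(1 + 1)
fs_weak <- fisher_score_1d(x_weak, y)  # |5 - 6| / sqrt(16 + 16)
c(x_good = fs_good, x_weak = fs_weak)
```

A variable whose class means are far apart relative to the within-class spread gets a high score, which is why `x_good` ranks above `x_weak`.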

In [59]:
# Calculate Fisher Score for all variables
# Get the IV and DV list
dv_list <- c('subscribe')  # DV list
iv_list <- setdiff(names(train), dv_list)  # IV list excluded DV
iv_list <- setdiff(iv_list, 'client_id')  # Excluded the client_id
fs <- FisherScore(train, dv_list, iv_list)
head(fs)

IV,fisher_score
age,0.0364485
campaign,0.1443533
pdays,0.4343708
previous,0.3454339
emp.var.rate,0.653128
cons.price.idx,0.3030515


In [60]:
# Select top 50 variables according to the Fisher Score
best_fs_var <- varSelectionFisher(train, dv_list, iv_list, num_select=50)
head(best_fs_var, 10)

In [61]:
# Apply variable selection to the data
# Train
var_select <- names(train)[names(train) %in% best_fs_var]
train_processed <- train[, c('client_id', var_select, 'subscribe')]
# Valid
var_select <- names(valid)[names(valid) %in% best_fs_var]
valid_processed <- valid[, c('client_id', var_select, 'subscribe')]
# Test
var_select <- names(test)[names(test) %in% best_fs_var]
test_processed <- test[, c('client_id', var_select, 'subscribe')]
# Test (holdout)
var_select <- names(test_holdout)[names(test_holdout) %in% best_fs_var]
test_holdout_processed <- test_holdout[, c('client_id', var_select)]

### 2.5. Finalize data processing

In [62]:
# Check if train and test (holdout) have same variables
# Train, valid, test
dim(train_processed)
dim(valid_processed)
dim(test_processed)
# Test (holdout)
dim(test_holdout_processed)

In [63]:
# Rename the data columns
for (v in colnames(train_processed)) {
    
    # Fix the column name
    fix_name <- str_replace_all(v, "[^[:alnum:] ]", "_")
    fix_name <- gsub(' +', '', fix_name) 
    
    # Train, valid, test
    colnames(train_processed)[colnames(train_processed) == v] <- fix_name
    colnames(valid_processed)[colnames(valid_processed) == v] <- fix_name
    colnames(test_processed)[colnames(test_processed) == v] <- fix_name
    
    # Test (holdout)
    colnames(test_holdout_processed)[colnames(test_holdout_processed) == v] <- fix_name
}
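
To illustrate what the renaming loop does to a single name, the sketch below applies the same two patterns with base R `gsub` (swapped in for `str_replace_all` only to keep the example self-contained). The name `"job - binned"` is made up for illustration.

```r
raw_names <- c("emp.var.rate", "cons.price.idx", "job - binned")

fixed <- gsub("[^[:alnum:] ]", "_", raw_names)  # non-alphanumeric (except space) -> "_"
fixed <- gsub(" +", "", fixed)                  # drop the remaining spaces
fixed

# Two different raw names can collapse to the same cleaned name
anyDuplicated(fixed)  # 0 means no collision
```

If `anyDuplicated` returns a non-zero value on the real column names, the colliding columns would need distinct names before modelling.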

In [64]:
# Print out to check
head(train_processed)

client_id,pdays,emp_var_rate,euribor3m,nr_employed,pdays_999,woe_month_binned,woe_emp_var_rate_binned,woe_cons_price_idx_binned,woe_cons_conf_idx_binned,...,euribor3m_binned_woe,nr_employed_binned_woe,emp_var_rate_freq_bin_woe,euribor3m_freq_bin_woe,nr_employed_freq_bin_woe,emp_var_rate_width_bin_woe,cons_conf_idx_width_bin_woe,euribor3m_width_bin_woe,nr_employed_width_bin_woe,subscribe
2,999,1.1,4.858,5191.0,1,-33.39693,-94.56851,-123.93836,9.414245,...,-2.616464,-2.896025,-3.289431,-3.192275,-3.289431,-2.995732,-2.671493,-2.929108,-2.896025,0
5,999,-1.8,1.299,5099.1,1,-33.39693,44.1457,30.17728,-59.006784,...,-2.616464,-1.971674,-1.429056,-2.098587,-1.974348,-1.540039,-1.876394,-1.171029,-1.691018,0
8,999,-0.1,4.153,5195.8,1,-33.39693,44.1457,-33.24684,-59.006784,...,-2.616464,-2.896025,-2.363994,-2.098587,-2.363994,-2.370244,-2.756119,-2.639057,-2.896025,0
21,999,1.4,4.967,5228.1,1,-33.39693,-94.56851,-33.24684,-106.548634,...,-2.616464,-2.896025,-2.876386,-2.833213,-2.876386,-2.995732,-2.671493,-2.929108,-2.896025,0
22,999,1.4,4.964,5228.1,1,-33.39693,-94.56851,-33.24684,-106.548634,...,-2.616464,-2.896025,-2.876386,-2.833213,-2.876386,-2.995732,-2.671493,-2.929108,-2.896025,1
24,999,1.4,4.964,5228.1,1,-33.39693,-94.56851,-33.24684,-106.548634,...,-2.616464,-2.896025,-2.876386,-2.833213,-2.876386,-2.995732,-2.671493,-2.929108,-2.896025,0


## 3. Methodology

Reference:  

Lessmann, S., Baesens, B., Seow, H. V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247(1), 124-136.

### 3.1. Logistic Regression model

In [65]:
# Set up cross-validation
rdesc = makeResampleDesc("CV", iters=5, predict="both")

# Define the model
learner <- makeLearner("classif.logreg", predict.type="prob", fix.factors.prediction=T)

# Define the task
train_task <- makeClassifTask(id="bank_train", data=train_processed[, -1], target="subscribe")

# Set hyper parameter tuning
tune_params <- makeParamSet(
)
ctrl = makeTuneControlGrid()

# Run the hyper parameter tuning with k-fold CV
if (length(tune_params$pars) > 0) {
    # Run parameter tuning
    res <- tuneParams(learner, task=train_task, resampling=rdesc,
      par.set=tune_params, control=ctrl, measures=list(mlr::auc))
    
    # Extract best model
    best_learner <- res$learner
    
} else {
    # Simple cross-validation
    res <- resample(learner, train_task, rdesc, measures=list(mlr::auc, setAggregation(mlr::auc, train.mean)))
    
    # No parameter for tuning, only 1 best learner
    best_learner <- learner
}

"prediction from a rank-deficient fit may be misleading"

Resampling: cross-validation
Measures:             auc.train   auc.test    
[Resample] iter 1:    0.7942316   0.7487939   
[Resample] iter 2:    0.7873690   0.7697736   
[Resample] iter 3:    0.7910198   0.7663166   
[Resample] iter 4:    0.7795883   0.8042540   
[Resample] iter 5:    0.7892984   0.7713832   

Aggregated Result: auc.test.mean=0.7721043,auc.train.mean=0.7883014
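
The warning "prediction from a rank-deficient fit may be misleading" is raised when predicting from a linear model whose design matrix has linearly dependent columns. A plausible cause here, stated as an assumption, is that several of the 50 selected variables are different encodings of the same raw variable. The sketch below uses made-up toy data and base R `qr` to show how such redundancy can be detected.

```r
# Toy design matrix: x3 is an exact linear combination of x1 and x2
set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)
toy_x <- cbind(x1 = x1, x2 = x2, x3 = 2 * x1 - x2)

qr_x <- qr(toy_x)
qr_x$rank                # 2, although the matrix has 3 columns
ncol(toy_x) - qr_x$rank  # number of redundant columns

# Columns that the pivoted QR decomposition moved to the end as redundant
redundant <- colnames(toy_x)[qr_x$pivot[seq_len(ncol(toy_x)) > qr_x$rank]]
redundant
```

Applied to the feature columns of the processed train data, the same check would indicate how many variables could be dropped without changing the fit.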

In [66]:
# Retrain the model with the best hyper-parameters
best_md <- mlr::train(best_learner, train_task)

In [67]:
best_md

Model for learner.id=classif.logreg; learner.class=classif.logreg
Trained on: task.id = bank_train; obs = 4200; features = 50
Hyperparameters: model=FALSE

In [68]:
# Make prediction on valid data
pred <- predict(best_md, newdata=valid_processed[, -1])
performance(pred, measures=mlr::auc)

"prediction from a rank-deficient fit may be misleading"

In [69]:
# Make prediction on test data
pred <- predict(best_md, newdata=test_processed[, -1])
performance(pred, measures=mlr::auc)

"prediction from a rank-deficient fit may be misleading"

In [70]:
# Make prediction on test (holdout) data
pred <- predict(best_md, newdata=test_holdout_processed[, -1])
pred

"prediction from a rank-deficient fit may be misleading"

Prediction: 3000 observations
predict.type: prob
threshold: 0=0.50,1=0.50
time: 0.02
     prob.0     prob.1 response
1 0.9481784 0.05182160        0
2 0.9578501 0.04214988        0
3 0.5400638 0.45993620        0
4 0.9289286 0.07107139        0
5 0.9643685 0.03563154        0
6 0.9281973 0.07180266        0
... (#rows: 3000, #cols: 3)

In [71]:
# Output predicted file
output <- data.frame(client_id=test_holdout$client_id, subscribe=pred$data$prob.1)
write.csv(output, './data/Kaggle/output/lr_submission_5.csv', row.names=FALSE)
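
The models in this section are scored with AUC, which depends only on how the predicted probabilities rank the observations, not on the 0.50 threshold shown in the prediction output. As a minimal sketch of what the measure computes, the block below derives AUC from the Mann-Whitney rank statistic on made-up toy scores. The helper name `auc_by_rank` is hypothetical.

```r
# Hypothetical helper: AUC from the Mann-Whitney rank statistic
auc_by_rank <- function(score, label) {
  r  <- rank(score)  # average ranks handle tied scores
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy scores: 3 of the 4 positive/negative pairs are ordered correctly
label <- c(0, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.80)
auc_toy <- auc_by_rank(score, label)
auc_toy  # 0.75
```

Because only the ranking matters, writing `prob.1` to the output file keeps all the information the measure uses, whereas the thresholded `response` would discard most of it.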

### 3.2. RandomForest model

In [72]:
# Set up cross-validation
rdesc = makeResampleDesc("CV", iters=5)

# Define the model
learner <- makeLearner("classif.randomForest", predict.type="prob", fix.factors.prediction=T)

# Define the task
train_task <- makeClassifTask(id="bank_train", data=train_processed[, -1], target="subscribe")

# Set hyper parameter tuning
tune_params <- makeParamSet(
  makeDiscreteParam('ntree', values=c(100, 250, 500, 750, 1000)),
  makeDiscreteParam('mtry', values=round(sqrt((ncol(train_processed)-1) * c(0.1, 0.25, 0.5, 1, 2, 4))))
)
ctrl = makeTuneControlGrid()

# Run the hyper parameter tuning with k-fold CV
if (length(tune_params$pars) > 0) {
    # Run parameter tuning
    res <- tuneParams(learner, task=train_task, resampling=rdesc,
      par.set=tune_params, control=ctrl, measures=list(mlr::auc))
    
    # Extract best model
    best_learner <- res$learner
    
} else {
    # Simple cross-validation
    res <- resample(learner, train_task, rdesc, measures=list(mlr::auc))
    
    # No parameter for tuning, only 1 best learner
    best_learner <- learner
}

[Tune] Started tuning learner classif.randomForest for parameter set:
          Type len Def               Constr Req Tunable Trafo
ntree discrete   -   - 100,250,500,750,1000   -    TRUE     -
mtry  discrete   -   -        2,4,5,7,10,14   -    TRUE     -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: ntree=100; mtry=2
[Tune-y] 1: auc.test.mean=0.7118991; time: 0.1 min
[Tune-x] 2: ntree=250; mtry=2
[Tune-y] 2: auc.test.mean=0.7287134; time: 0.1 min
[Tune-x] 3: ntree=500; mtry=2

In [73]:
# Retrain the model with the best hyper-parameters
best_md <- mlr::train(best_learner, train_task)

In [74]:
# Make prediction on valid data
pred <- predict(best_md, newdata=valid_processed[, -1])
performance(pred, measures=mlr::auc)

In [75]:
# Make prediction on test data
pred <- predict(best_md, newdata=test_processed[, -1])
performance(pred, measures=mlr::auc)

In [76]:
# Make prediction on test holdout data
pred <- predict(best_md, newdata=test_holdout_processed[, -1])
pred

Prediction: 3000 observations
predict.type: prob
threshold: 0=0.50,1=0.50
time: 0.09
  prob.0 prob.1 response
1  1.000  0.000        0
2  1.000  0.000        0
3  0.734  0.266        0
4  1.000  0.000        0
5  1.000  0.000        0
6  1.000  0.000        0
... (#rows: 3000, #cols: 3)

In [77]:
# Output predicted file
output <- data.frame(client_id=test_holdout$client_id, subscribe=pred$data$prob.1)
write.csv(output, './data/Kaggle/output/rf_submission_4.csv', row.names=FALSE)

### 3.3. Gradient boosting tree
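
The tuning log of this section shows shrinkage values of 0.01, 0.12, 0.23 and so on, and interaction depths that skip no integer between 2 and 10. This is consistent with a grid control that cuts each numeric or integer range into 10 equally spaced points. The resolution of 10 is an assumption inferred from the log, not something set explicitly in the notebook. The base R sketch below reproduces such a grid without calling mlr, to show how large the search becomes.

```r
resolution <- 10  # assumed grid resolution per numeric/integer parameter

n_trees   <- unique(round(seq(100, 1000, length.out = resolution)))
depth     <- unique(round(seq(2, 10, length.out = resolution)))
shrinkage <- seq(0.01, 1, length.out = resolution)

grid <- expand.grid(n.trees = n_trees, interaction.depth = depth,
                    shrinkage = shrinkage)
nrow(grid)           # number of candidate settings
round(shrinkage, 2)  # starts 0.01 0.12 0.23 0.34 0.45
```

With 5-fold cross-validation every candidate is fitted five times, so narrowing the ranges or switching to a random search (mlr provides `makeTuneControlRandom`) may be worth considering when run time matters.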

In [78]:
# Set up cross-validation
rdesc = makeResampleDesc("CV", iters=5, predict="both")

# Define the model
learner <- makeLearner("classif.gbm", predict.type="prob", fix.factors.prediction=T)

# Define the task
train_task <- makeClassifTask(id="bank_train", data=train_processed[, -1], target="subscribe")

# Set hyper parameter tuning
tune_params <- makeParamSet(
  makeDiscreteParam("distribution", values="bernoulli"),
  makeIntegerParam("n.trees", lower=100, upper=1000),  # Number of trees
  makeIntegerParam("interaction.depth", lower=2, upper=10),  # Depth of each tree
  makeNumericParam("shrinkage", lower=0.01, upper=1)  # Learning rate
)

ctrl = makeTuneControlGrid()

# Run the hyper parameter tuning with k-fold CV
if (length(tune_params$pars) > 0) {
    # Run parameter tuning
    res <- tuneParams(learner, task=train_task, resampling=rdesc,
      par.set=tune_params, control=ctrl, measures=list(mlr::auc))
    
    # Extract best model
    best_learner <- res$learner
    
} else {
    # Simple cross-validation
    res <- resample(learner, train_task, rdesc, measures=list(mlr::auc, setAggregation(mlr::auc, train.mean)))
    
    # No parameter for tuning, only 1 best learner
    best_learner <- learner
}

[Tune] Started tuning learner classif.gbm for parameter set:
[Tune] Started tuning learner classif.gbm for parameter set:
                      Type len Def       Constr Req Tunable Trafo
distribution      discrete   -   -    bernoulli   -    TRUE     -
n.trees            integer   -   - 100 to 1e+03   -    TRUE     -
interaction.depth  integer   -   -      2 to 10   -    TRUE     -
shrinkage          numeric   -   -    0.01 to 1   -    TRUE     -
                      Type len Def       Constr Req Tunable Trafo
distribution      discrete   -   -    bernoulli   -    TRUE     -
n.trees            integer   -   - 100 to 1e+03   -    TRUE     -
interaction.depth  integer   -   -      2 to 10   -    TRUE     -
shrinkage          numeric   -   -    0.01 to 1   -    TRUE     -
With control class: TuneControlGrid
With control class: TuneControlGrid
Imputation value: -0
Imputation value: -0
[Tune-x] 1: distribution=bernoulli; n.trees=100; interaction.depth=2; shrinkage=0.01
[Tune-x] 1: distrib

[Tune-y] 27: auc.test.mean=0.7750140; time: 0.5 min
[Tune-y] 27: auc.test.mean=0.7750140; time: 0.5 min
[Tune-x] 28: distribution=bernoulli; n.trees=800; interaction.depth=4; shrinkage=0.01
[Tune-x] 28: distribution=bernoulli; n.trees=800; interaction.depth=4; shrinkage=0.01
[Tune-y] 28: auc.test.mean=0.7747758; time: 0.6 min
[Tune-y] 28: auc.test.mean=0.7747758; time: 0.6 min
[Tune-x] 29: distribution=bernoulli; n.trees=900; interaction.depth=4; shrinkage=0.01
[Tune-x] 29: distribution=bernoulli; n.trees=900; interaction.depth=4; shrinkage=0.01
[Tune-y] 29: auc.test.mean=0.7742846; time: 0.6 min
[Tune-y] 29: auc.test.mean=0.7742846; time: 0.6 min
[Tune-x] 30: distribution=bernoulli; n.trees=1000; interaction.depth=4; shrinkage=0.01
[Tune-x] 30: distribution=bernoulli; n.trees=1000; interaction.depth=4; shrinkage=0.01
[Tune-y] 30: auc.test.mean=0.7742334; time: 0.7 min
[Tune-y] 30: auc.test.mean=0.7742334; time: 0.7 min
[Tune-x] 31: distribution=bernoulli; n.trees=100; interaction.dept

[Tune-x] 57: distribution=bernoulli; n.trees=700; interaction.depth=7; shrinkage=0.01
[Tune-y] 57: auc.test.mean=0.7719285; time: 0.8 min
[Tune-y] 57: auc.test.mean=0.7719285; time: 0.8 min
[Tune-x] 58: distribution=bernoulli; n.trees=800; interaction.depth=7; shrinkage=0.01
[Tune-x] 58: distribution=bernoulli; n.trees=800; interaction.depth=7; shrinkage=0.01
[Tune-y] 58: auc.test.mean=0.7713096; time: 0.9 min
[Tune-y] 58: auc.test.mean=0.7713096; time: 0.9 min
[Tune-x] 59: distribution=bernoulli; n.trees=900; interaction.depth=7; shrinkage=0.01
[Tune-x] 59: distribution=bernoulli; n.trees=900; interaction.depth=7; shrinkage=0.01
[Tune-y] 59: auc.test.mean=0.7724468; time: 1.0 min
[Tune-y] 59: auc.test.mean=0.7724468; time: 1.0 min
[Tune-x] 60: distribution=bernoulli; n.trees=1000; interaction.depth=7; shrinkage=0.01
[Tune-x] 60: distribution=bernoulli; n.trees=1000; interaction.depth=7; shrinkage=0.01
[Tune-y] 60: auc.test.mean=0.7707862; time: 1.2 min
[Tune-y] 60: auc.test.mean=0.770

[Tune-x] 87: distribution=bernoulli; n.trees=700; interaction.depth=10; shrinkage=0.01
[Tune-x] 87: distribution=bernoulli; n.trees=700; interaction.depth=10; shrinkage=0.01
[Tune-y] 87: auc.test.mean=0.7718383; time: 1.1 min
[Tune-y] 87: auc.test.mean=0.7718383; time: 1.1 min
[Tune-x] 88: distribution=bernoulli; n.trees=800; interaction.depth=10; shrinkage=0.01
[Tune-x] 88: distribution=bernoulli; n.trees=800; interaction.depth=10; shrinkage=0.01
[Tune-y] 88: auc.test.mean=0.7702443; time: 1.4 min
[Tune-y] 88: auc.test.mean=0.7702443; time: 1.4 min
[Tune-x] 89: distribution=bernoulli; n.trees=900; interaction.depth=10; shrinkage=0.01
[Tune-x] 89: distribution=bernoulli; n.trees=900; interaction.depth=10; shrinkage=0.01
[Tune-y] 89: auc.test.mean=0.7696372; time: 1.7 min
[Tune-y] 89: auc.test.mean=0.7696372; time: 1.7 min
[Tune-x] 90: distribution=bernoulli; n.trees=1000; interaction.depth=10; shrinkage=0.01
[Tune-x] 90: distribution=bernoulli; n.trees=1000; interaction.depth=10; shrin

[Tune-y] 116: auc.test.mean=0.7554442; time: 0.5 min
[Tune-y] 116: auc.test.mean=0.7554442; time: 0.5 min
[Tune-x] 117: distribution=bernoulli; n.trees=700; interaction.depth=4; shrinkage=0.12
[Tune-x] 117: distribution=bernoulli; n.trees=700; interaction.depth=4; shrinkage=0.12
[Tune-y] 117: auc.test.mean=0.7494593; time: 0.6 min
[Tune-y] 117: auc.test.mean=0.7494593; time: 0.6 min
[Tune-x] 118: distribution=bernoulli; n.trees=800; interaction.depth=4; shrinkage=0.12
[Tune-x] 118: distribution=bernoulli; n.trees=800; interaction.depth=4; shrinkage=0.12
[Tune-y] 118: auc.test.mean=0.7398561; time: 0.6 min
[Tune-y] 118: auc.test.mean=0.7398561; time: 0.6 min
[Tune-x] 119: distribution=bernoulli; n.trees=900; interaction.depth=4; shrinkage=0.12
[Tune-x] 119: distribution=bernoulli; n.trees=900; interaction.depth=4; shrinkage=0.12
[Tune-y] 119: auc.test.mean=0.7385488; time: 0.7 min
[Tune-y] 119: auc.test.mean=0.7385488; time: 0.7 min
[Tune-x] 120: distribution=bernoulli; n.trees=1000; in

[Tune-x] 146: distribution=bernoulli; n.trees=600; interaction.depth=7; shrinkage=0.12
[Tune-x] 146: distribution=bernoulli; n.trees=600; interaction.depth=7; shrinkage=0.12
[Tune-y] 146: auc.test.mean=0.7258019; time: 0.7 min
[Tune-y] 146: auc.test.mean=0.7258019; time: 0.7 min
[Tune-x] 147: distribution=bernoulli; n.trees=700; interaction.depth=7; shrinkage=0.12
[Tune-x] 147: distribution=bernoulli; n.trees=700; interaction.depth=7; shrinkage=0.12
[Tune-y] 147: auc.test.mean=0.7265460; time: 0.9 min
[Tune-y] 147: auc.test.mean=0.7265460; time: 0.9 min
[Tune-x] 148: distribution=bernoulli; n.trees=800; interaction.depth=7; shrinkage=0.12
[Tune-x] 148: distribution=bernoulli; n.trees=800; interaction.depth=7; shrinkage=0.12
[Tune-y] 148: auc.test.mean=0.7219422; time: 1.0 min
[Tune-y] 148: auc.test.mean=0.7219422; time: 1.0 min
[Tune-x] 149: distribution=bernoulli; n.trees=900; interaction.depth=7; shrinkage=0.12
[Tune-x] 149: distribution=bernoulli; n.trees=900; interaction.depth=7; s

[Tune-x] 175: distribution=bernoulli; n.trees=500; interaction.depth=10; shrinkage=0.12
[Tune-y] 175: auc.test.mean=0.7161604; time: 0.9 min
[Tune-y] 175: auc.test.mean=0.7161604; time: 0.9 min
[Tune-x] 176: distribution=bernoulli; n.trees=600; interaction.depth=10; shrinkage=0.12
[Tune-x] 176: distribution=bernoulli; n.trees=600; interaction.depth=10; shrinkage=0.12
[Tune-y] 176: auc.test.mean=0.7184303; time: 1.1 min
[Tune-y] 176: auc.test.mean=0.7184303; time: 1.1 min
[Tune-x] 177: distribution=bernoulli; n.trees=700; interaction.depth=10; shrinkage=0.12
[Tune-x] 177: distribution=bernoulli; n.trees=700; interaction.depth=10; shrinkage=0.12
[Tune-y] 177: auc.test.mean=0.7073330; time: 1.3 min
[Tune-y] 177: auc.test.mean=0.7073330; time: 1.3 min
[Tune-x] 178: distribution=bernoulli; n.trees=800; interaction.depth=10; shrinkage=0.12
[Tune-x] 178: distribution=bernoulli; n.trees=800; interaction.depth=10; shrinkage=0.12
[Tune-y] 178: auc.test.mean=0.7003425; time: 1.5 min
[Tune-y] 178:

[Tune-y] 204: auc.test.mean=0.7460740; time: 0.3 min
[Tune-y] 204: auc.test.mean=0.7460740; time: 0.3 min
[Tune-x] 205: distribution=bernoulli; n.trees=500; interaction.depth=4; shrinkage=0.23
[Tune-x] 205: distribution=bernoulli; n.trees=500; interaction.depth=4; shrinkage=0.23
[Tune-y] 205: auc.test.mean=0.7329525; time: 0.4 min
[Tune-y] 205: auc.test.mean=0.7329525; time: 0.4 min
[Tune-x] 206: distribution=bernoulli; n.trees=600; interaction.depth=4; shrinkage=0.23
[Tune-x] 206: distribution=bernoulli; n.trees=600; interaction.depth=4; shrinkage=0.23
[Tune-y] 206: auc.test.mean=0.7317291; time: 0.5 min
[Tune-y] 206: auc.test.mean=0.7317291; time: 0.5 min
[Tune-x] 207: distribution=bernoulli; n.trees=700; interaction.depth=4; shrinkage=0.23
[Tune-x] 207: distribution=bernoulli; n.trees=700; interaction.depth=4; shrinkage=0.23
[Tune-y] 207: auc.test.mean=0.7208347; time: 0.6 min
[Tune-y] 207: auc.test.mean=0.7208347; time: 0.6 min
[Tune-x] 208: distribution=bernoulli; n.trees=800; int

[Tune-x] 234: distribution=bernoulli; n.trees=400; interaction.depth=7; shrinkage=0.23
[Tune-x] 234: distribution=bernoulli; n.trees=400; interaction.depth=7; shrinkage=0.23
[Tune-y] 234: auc.test.mean=0.7159008; time: 0.5 min
[Tune-y] 234: auc.test.mean=0.7159008; time: 0.5 min
[Tune-x] 235: distribution=bernoulli; n.trees=500; interaction.depth=7; shrinkage=0.23
[Tune-x] 235: distribution=bernoulli; n.trees=500; interaction.depth=7; shrinkage=0.23
[Tune-y] 235: auc.test.mean=0.7113298; time: 0.6 min
[Tune-y] 235: auc.test.mean=0.7113298; time: 0.6 min
[Tune-x] 236: distribution=bernoulli; n.trees=600; interaction.depth=7; shrinkage=0.23
[Tune-x] 236: distribution=bernoulli; n.trees=600; interaction.depth=7; shrinkage=0.23
[Tune-y] 236: auc.test.mean=0.7134736; time: 0.7 min
[Tune-y] 236: auc.test.mean=0.7134736; time: 0.7 min
[Tune-x] 237: distribution=bernoulli; n.trees=700; interaction.depth=7; shrinkage=0.23
[Tune-x] 237: distribution=bernoulli; n.trees=700; interaction.depth=7; s

[Tune-x] 263: distribution=bernoulli; n.trees=300; interaction.depth=10; shrinkage=0.23
[Tune-y] 263: auc.test.mean=0.7123729; time: 0.5 min
[Tune-y] 263: auc.test.mean=0.7123729; time: 0.5 min
[Tune-x] 264: distribution=bernoulli; n.trees=400; interaction.depth=10; shrinkage=0.23
[Tune-x] 264: distribution=bernoulli; n.trees=400; interaction.depth=10; shrinkage=0.23
[Tune-y] 264: auc.test.mean=0.7093715; time: 0.6 min
[Tune-y] 264: auc.test.mean=0.7093715; time: 0.6 min
[Tune-x] 265: distribution=bernoulli; n.trees=500; interaction.depth=10; shrinkage=0.23
[Tune-x] 265: distribution=bernoulli; n.trees=500; interaction.depth=10; shrinkage=0.23
[Tune-y] 265: auc.test.mean=0.6954504; time: 0.8 min
[Tune-y] 265: auc.test.mean=0.6954504; time: 0.8 min
[Tune-x] 266: distribution=bernoulli; n.trees=600; interaction.depth=10; shrinkage=0.23
[Tune-x] 266: distribution=bernoulli; n.trees=600; interaction.depth=10; shrinkage=0.23
[Tune-y] 266: auc.test.mean=0.7037894; time: 0.9 min
[Tune-y] 266:

[Tune-y] 292: auc.test.mean=0.7387600; time: 0.2 min
[Tune-y] 292: auc.test.mean=0.7387600; time: 0.2 min
[Tune-x] 293: distribution=bernoulli; n.trees=300; interaction.depth=4; shrinkage=0.34
[Tune-x] 293: distribution=bernoulli; n.trees=300; interaction.depth=4; shrinkage=0.34
[Tune-y] 293: auc.test.mean=0.7333594; time: 0.2 min
[Tune-y] 293: auc.test.mean=0.7333594; time: 0.2 min
[Tune-x] 294: distribution=bernoulli; n.trees=400; interaction.depth=4; shrinkage=0.34
[Tune-x] 294: distribution=bernoulli; n.trees=400; interaction.depth=4; shrinkage=0.34
[Tune-y] 294: auc.test.mean=0.7281838; time: 0.3 min
[Tune-y] 294: auc.test.mean=0.7281838; time: 0.3 min
[Tune-x] 295: distribution=bernoulli; n.trees=500; interaction.depth=4; shrinkage=0.34
[Tune-x] 295: distribution=bernoulli; n.trees=500; interaction.depth=4; shrinkage=0.34
[Tune-y] 295: auc.test.mean=0.7211623; time: 0.3 min
[Tune-y] 295: auc.test.mean=0.7211623; time: 0.3 min
[Tune-x] 296: distribution=bernoulli; n.trees=600; int

[Tune-x] 322: distribution=bernoulli; n.trees=200; interaction.depth=7; shrinkage=0.34
[Tune-y] 322: auc.test.mean=0.7204046; time: 0.2 min
[Tune-x] 323: distribution=bernoulli; n.trees=300; interaction.depth=7; shrinkage=0.34
[Tune-y] 323: auc.test.mean=0.7137001; time: 0.4 min
[Tune-x] 324: distribution=bernoulli; n.trees=400; interaction.depth=7; shrinkage=0.34
[Tune-y] 324: auc.test.mean=0.7104820; time: 0.5 min
[Tune-x] 325: distribution=bernoulli; n.trees=500; interaction.depth=7; shrinkage=0.34

[Tune-x] 351: distribution=bernoulli; n.trees=100; interaction.depth=10; shrinkage=0.34
[Tune-y] 351: auc.test.mean=0.7302733; time: 0.2 min
[Tune-x] 352: distribution=bernoulli; n.trees=200; interaction.depth=10; shrinkage=0.34
[Tune-y] 352: auc.test.mean=0.7041974; time: 0.3 min
[Tune-x] 353: distribution=bernoulli; n.trees=300; interaction.depth=10; shrinkage=0.34
[Tune-y] 353: auc.test.mean=0.6960941; time: 0.5 min
[Tune-x] 354: distribution=bernoulli; n.trees=400; interaction.depth=10; shrinkage=0.34
[Tune-y] 354: auc.test.mean=0.6893250; time: 0.7 min

[Tune-y] 380: auc.test.mean=0.7013636; time: 0.6 min
[Tune-x] 381: distribution=bernoulli; n.trees=100; interaction.depth=4; shrinkage=0.45
[Tune-y] 381: auc.test.mean=0.7474360; time: 0.1 min
[Tune-x] 382: distribution=bernoulli; n.trees=200; interaction.depth=4; shrinkage=0.45
[Tune-y] 382: auc.test.mean=0.7389050; time: 0.1 min
[Tune-x] 383: distribution=bernoulli; n.trees=300; interaction.depth=4; shrinkage=0.45
[Tune-y] 383: auc.test.mean=0.7161191; time: 0.2 min
[Tune-x] 384: distribution=bernoulli; n.trees=400; int

[Tune-x] 410: distribution=bernoulli; n.trees=1000; interaction.depth=6; shrinkage=0.45
[Tune-y] 410: auc.test.mean=0.6800032; time: 1.1 min
[Tune-x] 411: distribution=bernoulli; n.trees=100; interaction.depth=7; shrinkage=0.45
[Tune-y] 411: auc.test.mean=0.7313797; time: 0.1 min
[Tune-x] 412: distribution=bernoulli; n.trees=200; interaction.depth=7; shrinkage=0.45
[Tune-y] 412: auc.test.mean=0.7147232; time: 0.3 min
[Tune-x] 413: distribution=bernoulli; n.trees=300; interaction.depth=7; shrinkage=0.45

[Tune-x] 439: distribution=bernoulli; n.trees=900; interaction.depth=9; shrinkage=0.45
[Tune-y] 439: auc.test.mean=0.6575049; time: 1.3 min
[Tune-x] 440: distribution=bernoulli; n.trees=1000; interaction.depth=9; shrinkage=0.45
[Tune-y] 440: auc.test.mean=0.6547351; time: 1.5 min
[Tune-x] 441: distribution=bernoulli; n.trees=100; interaction.depth=10; shrinkage=0.45
[Tune-y] 441: auc.test.mean=0.7208503; time: 0.2 min
[Tune-x] 442: distribution=bernoulli; n.trees=200; interaction.depth=10; shrinkage=0.45
[Tune-y] 442: auc.test.mean=0.6954037; time: 0.3 min

[Tune-y] 468: auc.test.mean=0.6863725; time: 0.4 min
[Tune-x] 469: distribution=bernoulli; n.trees=900; interaction.depth=3; shrinkage=0.56
[Tune-y] 469: auc.test.mean=0.7007418; time: 0.5 min
[Tune-x] 470: distribution=bernoulli; n.trees=1000; interaction.depth=3; shrinkage=0.56
[Tune-y] 470: auc.test.mean=0.6943738; time: 0.5 min
[Tune-x] 471: distribution=bernoulli; n.trees=100; interaction.depth=4; shrinkage=0.56
[Tune-y] 471: auc.test.mean=0.7475442; time: 0.1 min
[Tune-x] 472: distribution=bernoulli; n.trees=200; i

[Tune-x] 498: distribution=bernoulli; n.trees=800; interaction.depth=6; shrinkage=0.56
[Tune-y] 498: auc.test.mean=0.6738954; time: 0.8 min
[Tune-x] 499: distribution=bernoulli; n.trees=900; interaction.depth=6; shrinkage=0.56
[Tune-y] 499: auc.test.mean=0.6626373; time: 0.9 min
[Tune-x] 500: distribution=bernoulli; n.trees=1000; interaction.depth=6; shrinkage=0.56
[Tune-y] 500: auc.test.mean=0.6652873; time: 1.0 min
[Tune-x] 501: distribution=bernoulli; n.trees=100; interaction.depth=7; shrinkage=0.56

[Tune-x] 527: distribution=bernoulli; n.trees=700; interaction.depth=9; shrinkage=0.56
[Tune-y] 527: auc.test.mean=0.6489267; time: 1.0 min
[Tune-x] 528: distribution=bernoulli; n.trees=800; interaction.depth=9; shrinkage=0.56
[Tune-y] 528: auc.test.mean=0.6514660; time: 1.2 min
[Tune-x] 529: distribution=bernoulli; n.trees=900; interaction.depth=9; shrinkage=0.56
[Tune-y] 529: auc.test.mean=0.6540544; time: 1.5 min
[Tune-x] 530: distribution=bernoulli; n.trees=1000; interaction.depth=9; shrinkage=0.56
[Tune-y] 530: auc.test.mean=0.6533416; time: 1.7 min

[Tune-y] 556: auc.test.mean=0.6898629; time: 0.4 min
[Tune-x] 557: distribution=bernoulli; n.trees=700; interaction.depth=3; shrinkage=0.67
[Tune-y] 557: auc.test.mean=0.6759930; time: 0.5 min
[Tune-x] 558: distribution=bernoulli; n.trees=800; interaction.depth=3; shrinkage=0.67
[Tune-y] 558: auc.test.mean=0.6883732; time: 0.6 min
[Tune-x] 559: distribution=bernoulli; n.trees=900; interaction.depth=3; shrinkage=0.67
[Tune-y] 559: auc.test.mean=0.6897644; time: 0.6 min
[Tune-x] 560: distribution=bernoulli; n.trees=1000; in

[Tune-x] 586: distribution=bernoulli; n.trees=600; interaction.depth=6; shrinkage=0.67
[Tune-y] 586: auc.test.mean=0.6719797; time: 0.6 min
[Tune-x] 587: distribution=bernoulli; n.trees=700; interaction.depth=6; shrinkage=0.67
[Tune-y] 587: auc.test.mean=0.6659043; time: 0.8 min
[Tune-x] 588: distribution=bernoulli; n.trees=800; interaction.depth=6; shrinkage=0.67
[Tune-y] 588: auc.test.mean=0.6636760; time: 0.9 min
[Tune-x] 589: distribution=bernoulli; n.trees=900; interaction.depth=6; shrinkage=0.67

[Tune-x] 615: distribution=bernoulli; n.trees=500; interaction.depth=9; shrinkage=0.67
[Tune-y] 615: auc.test.mean=0.6459967; time: 0.8 min
[Tune-x] 616: distribution=bernoulli; n.trees=600; interaction.depth=9; shrinkage=0.67
[Tune-y] 616: auc.test.mean=0.6364263; time: 1.0 min
[Tune-x] 617: distribution=bernoulli; n.trees=700; interaction.depth=9; shrinkage=0.67
[Tune-y] 617: auc.test.mean=0.6530920; time: 1.1 min
[Tune-x] 618: distribution=bernoulli; n.trees=800; interaction.depth=9; shrinkage=0.67
[Tune-y] 618: auc.test.mean=0.6296511; time: 1.2 min

[Tune-y] 644: auc.test.mean=0.6919824; time: 0.2 min
[Tune-x] 645: distribution=bernoulli; n.trees=500; interaction.depth=3; shrinkage=0.78
[Tune-y] 645: auc.test.mean=0.6954349; time: 0.3 min
[Tune-x] 646: distribution=bernoulli; n.trees=600; interaction.depth=3; shrinkage=0.78
[Tune-y] 646: auc.test.mean=0.6854208; time: 0.4 min
[Tune-x] 647: distribution=bernoulli; n.trees=700; interaction.depth=3; shrinkage=0.78
[Tune-y] 647: auc.test.mean=0.6744221; time: 0.4 min
[Tune-x] 648: distribution=bernoulli; n.trees=800; int

[Tune-x] 674: distribution=bernoulli; n.trees=400; interaction.depth=6; shrinkage=0.78
[Tune-y] 674: auc.test.mean=0.6082637; time: 0.4 min
[Tune-x] 675: distribution=bernoulli; n.trees=500; interaction.depth=6; shrinkage=0.78
[Tune-y] 675: auc.test.mean=0.6623662; time: 0.6 min
[Tune-x] 676: distribution=bernoulli; n.trees=600; interaction.depth=6; shrinkage=0.78
[Tune-y] 676: auc.test.mean=0.6106553; time: 0.6 min
[Tune-x] 677: distribution=bernoulli; n.trees=700; interaction.depth=6; shrinkage=0.78

[Tune-x] 703: distribution=bernoulli; n.trees=300; interaction.depth=9; shrinkage=0.78
[Tune-y] 703: auc.test.mean=0.6393172; time: 0.5 min
[Tune-x] 704: distribution=bernoulli; n.trees=400; interaction.depth=9; shrinkage=0.78
[Tune-y] 704: auc.test.mean=0.6532935; time: 0.6 min
[Tune-x] 705: distribution=bernoulli; n.trees=500; interaction.depth=9; shrinkage=0.78
[Tune-y] 705: auc.test.mean=0.6223302; time: 0.8 min
[Tune-x] 706: distribution=bernoulli; n.trees=600; interaction.depth=9; shrinkage=0.78
[Tune-y] 706: auc.test.mean=0.6031923; time: 0.9 min

[Tune-y] 732: auc.test.mean=0.6574826; time: 0.1 min
[Tune-x] 733: distribution=bernoulli; n.trees=300; interaction.depth=3; shrinkage=0.89
[Tune-y] 733: auc.test.mean=0.6303622; time: 0.2 min
[Tune-x] 734: distribution=bernoulli; n.trees=400; interaction.depth=3; shrinkage=0.89
[Tune-y] 734: auc.test.mean=0.6514247; time: 0.2 min
[Tune-x] 735: distribution=bernoulli; n.trees=500; interaction.depth=3; shrinkage=0.89
[Tune-y] 735: auc.test.mean=0.5781212; time: 0.3 min
[Tune-x] 736: distribution=bernoulli; n.trees=600; int

[Tune-x] 762: distribution=bernoulli; n.trees=200; interaction.depth=6; shrinkage=0.89
[Tune-y] 762: auc.test.mean=0.5437323; time: 0.2 min
[Tune-x] 763: distribution=bernoulli; n.trees=300; interaction.depth=6; shrinkage=0.89
[Tune-y] 763: auc.test.mean=0.5961003; time: 0.3 min
[Tune-x] 764: distribution=bernoulli; n.trees=400; interaction.depth=6; shrinkage=0.89
[Tune-y] 764: auc.test.mean=0.5061875; time: 0.4 min
[Tune-x] 765: distribution=bernoulli; n.trees=500; interaction.depth=6; shrinkage=0.89

[Tune-x] 791: distribution=bernoulli; n.trees=100; interaction.depth=9; shrinkage=0.89
[Tune-y] 791: auc.test.mean=0.5833601; time: 0.2 min
[Tune-x] 792: distribution=bernoulli; n.trees=200; interaction.depth=9; shrinkage=0.89
[Tune-y] 792: auc.test.mean=0.5297816; time: 0.3 min
[Tune-x] 793: distribution=bernoulli; n.trees=300; interaction.depth=9; shrinkage=0.89
[Tune-y] 793: auc.test.mean=0.4854083; time: 0.4 min
[Tune-x] 794: distribution=bernoulli; n.trees=400; interaction.depth=9; shrinkage=0.89
[Tune-y] 794: auc.test.mean=0.5258489; time: 0.6 min

[Tune-y] 820: auc.test.mean=0.6292861; time: 0.4 min
[Tune-x] 821: distribution=bernoulli; n.trees=100; interaction.depth=3; shrinkage=1
[Tune-y] 821: auc.test.mean=0.6043596; time: 0.1 min
[Tune-x] 822: distribution=bernoulli; n.trees=200; interaction.depth=3; shrinkage=1
[Tune-y] 822: auc.test.mean=0.5421479; time: 0.1 min
[Tune-x] 823: distribution=bernoulli; n.trees=300; interaction.depth=3; shrinkage=1
[Tune-y] 823: auc.test.mean=0.6317058; time: 0.2 min
[Tune-x] 824: distribution=bernoulli; n.trees=400; interaction.depth=3; shrinkage=1

[Tune-y] 850: auc.test.mean=0.5331387; time: 0.9 min
[Tune-x] 851: distribution=bernoulli; n.trees=100; interaction.depth=6; shrinkage=1
[Tune-y] 851: auc.test.mean=0.4544361; time: 0.1 min
[Tune-x] 852: distribution=bernoulli; n.trees=200; interaction.depth=6; shrinkage=1
[Tune-y] 852: auc.test.mean=0.5037449; time: 0.2 min
[Tune-x] 853: distribution=bernoulli; n.trees=300; interaction.depth=6; shrinkage=1
[Tune-y] 853: auc.test.mean=0.5510013; time: 0.4 min
[Tune-x] 854: distribution=bernoulli; n.trees=400; interaction.depth=6; shrinkage=1

[Tune-y] 880: auc.test.mean=0.5085556; time: 1.3 min
[Tune-x] 881: distribution=bernoulli; n.trees=100; interaction.depth=9; shrinkage=1
[Tune-y] 881: auc.test.mean=0.5516268; time: 0.1 min
[Tune-x] 882: distribution=bernoulli; n.trees=200; interaction.depth=9; shrinkage=1
[Tune-y] 882: auc.test.mean=0.5439326; time: 0.3 min
[Tune-x] 883: distribution=bernoulli; n.trees=300; interaction.depth=9; shrinkage=1
[Tune-y] 883: auc.test.mean=0.4768061; time: 0.4 min
[Tune-x] 884: distribution=bernoulli; n.trees=400; interaction.depth=9; shrinkage=1
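
A grid log of this length is easier to read as a table. The following is a minimal sketch, assuming only base R, that copies a handful of rows from the log above into a data frame (`grid_log` and its column names are hypothetical, chosen for illustration) and then looks up the best row and the mean AUC per shrinkage value. In the notebook itself, the full table could presumably be obtained from the tuning result instead, for example with `as.data.frame(res$opt.path)`.

```r
# A few (iteration, setting, AUC) rows copied from the tuning log above
grid_log <- data.frame(
  iter      = c(293, 294, 295, 381, 382, 383, 471, 733, 821, 851),
  n.trees   = c(300, 400, 500, 100, 200, 300, 100, 300, 100, 100),
  depth     = c(4, 4, 4, 4, 4, 4, 4, 3, 3, 6),
  shrinkage = c(0.34, 0.34, 0.34, 0.45, 0.45, 0.45, 0.56, 0.89, 1, 1),
  auc       = c(0.7333594, 0.7281838, 0.7211623, 0.7474360, 0.7389050,
                0.7161191, 0.7475442, 0.6303622, 0.6043596, 0.4544361)
)

# Best setting among the copied rows (highest cross-validated AUC)
best_row <- grid_log[which.max(grid_log$auc), ]
print(best_row)

# Mean AUC per shrinkage value, to see the trend across learning rates
auc_by_shrinkage <- aggregate(auc ~ shrinkage, data = grid_log, FUN = mean)
print(auc_by_shrinkage)
```

Within these copied rows, the AUC drops as `n.trees` grows at a fixed shrinkage, and the rows with shrinkage close to 1 score lowest. Only part of the log is shown here, so this describes the visible rows rather than the whole grid.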

In [79]:
# Retrain the model with the best hyper-parameters
best_md <- mlr::train(best_learner, train_task)

Distribution not specified, assuming bernoulli ...
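
The message above is worth a second look: the grid fixed `distribution=bernoulli`, yet the retrained model reports that no distribution was specified. This suggests that the learner passed to `mlr::train` may not carry the tuned values. A minimal sketch of applying them explicitly with `setHyperPars` is given below. It uses a toy two-class subset of `iris` and `classif.rpart` purely for illustration (both are assumptions, not the notebook's data or model); `res$x` is the list of best values returned by `tuneParams`.

```r
library(mlr)

set.seed(1)

# Toy two-class task built from iris (illustration only, not the bank data)
toy <- droplevels(iris[iris$Species != "setosa", ])
task <- makeClassifTask(id = "toy", data = toy, target = "Species")

learner <- makeLearner("classif.rpart", predict.type = "prob")
params <- makeParamSet(makeIntegerParam("maxdepth", lower = 1L, upper = 3L))
rdesc <- makeResampleDesc("CV", iters = 3)

res <- tuneParams(learner, task = task, resampling = rdesc, par.set = params,
                  control = makeTuneControlGrid(resolution = 3L),
                  measures = list(mlr::auc), show.info = FALSE)

# res$x holds the best values found; apply them explicitly to the learner
tuned_learner <- setHyperPars(learner, par.vals = res$x)
print(getHyperPars(tuned_learner))

# Retrain on the full task with the tuned settings
tuned_model <- mlr::train(tuned_learner, task)
```

Printing `getHyperPars()` on the learner before training is a cheap check that the tuned values are really in place.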


In [80]:
# Make predictions on the validation data
pred <- predict(best_md, newdata=valid_processed[, -1])
performance(pred, measures=mlr::auc)

In [81]:
# Make predictions on the test data
pred <- predict(best_md, newdata=test_processed[, -1])
performance(pred, measures=mlr::auc)
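
For reference, the AUC reported by `performance()` can be understood as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes this in base R with the rank-sum (Mann-Whitney) formulation; `auc_rank` is a hypothetical helper written for illustration, not part of mlr, and the four-point example is made up.

```r
# AUC via the rank-sum (Mann-Whitney) formulation.
# `truth` is a 0/1 vector, `score` the predicted probability of class 1.
auc_rank <- function(truth, score) {
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  ranks <- rank(score)  # average ranks handle tied scores
  (sum(ranks[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(truth, score))  # 0.75
```

A value of 0.5 corresponds to random ranking, which puts the gradient boosting settings with shrinkage close to 1 in perspective: several of them score near or below 0.5 in the log (for example 0.4544361).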

### 3.4 AdaBoost

In [82]:
# Set up cross-validation
rdesc = makeResampleDesc("CV", iters=5, predict="both")

# Define the model
learner <- makeLearner("classif.ada", predict.type="prob", fix.factors.prediction=TRUE)

# Define the task
train_task <- makeClassifTask(id="bank_train", data=train_processed[, -1], target="subscribe")

# Set up hyper-parameter tuning: maximum depth of the base trees
tune_params <- makeParamSet(
    makeIntegerParam(id="maxdepth", default=30L, lower=1L, upper=30L)
)

# Grid search (default resolution: 10 values per parameter)
ctrl = makeTuneControlGrid()

# Run the hyper parameter tuning with k-fold CV
if (length(tune_params$pars) > 0) {
    # Run parameter tuning
    res <- tuneParams(learner, task=train_task, resampling=rdesc,
      par.set=tune_params, control=ctrl, measures=list(mlr::auc))
    
    # Extract the best model: apply the tuned values (res$x) to the learner
    best_learner <- setHyperPars(learner, par.vals=res$x)
    
} else {
    # Simple cross-validation
    res <- resample(learner, train_task, rdesc, measures=list(mlr::auc, setAggregation(mlr::auc, train.mean)))
    
    # No parameter for tuning, only 1 best learner
    best_learner <- learner
}

[Tune] Started tuning learner classif.ada for parameter set:
            Type len Def  Constr Req Tunable Trafo
maxdepth integer   -  30 1 to 30   -    TRUE     -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: maxdepth=1
[Tune-y] 1: auc.test.mean=0.7730635; time: 0.3 min
[Tune-x] 2: maxdepth=4
[Tune-y] 2: auc.test.mean=0.7714344; time: 0.3 min
[Tune-x] 3: maxdepth=7
[Tune-y] 3: auc.test.mean=0.7710434; time: 0.3 min
[Tune-x] 4: maxdepth=11
[Tune-y] 4: auc.test.mean=0.7724732; time: 0.3 min
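
The log tries `maxdepth` = 1, 4, 7, 11 rather than every integer from 1 to 30. This is consistent with `makeTuneControlGrid()` using its default resolution of 10, that is, 10 equally spaced points over the range, rounded to integers. The base R sketch below reproduces that sequence; `integer_grid` is a hypothetical helper, and the rounding rule is an assumption that happens to match the logged values, not a quote of mlr's internals.

```r
# Hypothetical helper: candidate values for an integer parameter on a grid
# with `resolution` equally spaced points, rounded and de-duplicated.
integer_grid <- function(lower, upper, resolution = 10) {
  unique(round(seq(lower, upper, length.out = resolution)))
}

print(integer_grid(1, 30))  # maxdepth: 1 4 7 11 14 17 20 24 27 30
print(integer_grid(3, 11))  # k: 3 4 5 6 7 8 9 10 11
```

The second call uses the range 3 to 11 that is set for `k` in the KNN section. To try every depth from 1 to 30, the resolution would have to be raised accordingly, at the cost of a longer tuning run.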

In [83]:
# Retrain the model with the best hyper-parameters
best_md <- mlr::train(best_learner, train_task)

In [84]:
# Make predictions on the validation data
pred <- predict(best_md, newdata=valid_processed[, -1])
performance(pred, measures=mlr::auc)

In [85]:
# Make predictions on the test data
pred <- predict(best_md, newdata=test_processed[, -1])
performance(pred, measures=mlr::auc)

### 3.5 KNN

In [86]:
# Set up cross-validation
rdesc = makeResampleDesc("CV", iters=5, predict="both")

# Define the model
learner <- makeLearner("classif.kknn", predict.type="prob", fix.factors.prediction=TRUE)

# Define the task
train_task <- makeClassifTask(id="bank_train", data=train_processed[, -1], target="subscribe")

# Set up hyper-parameter tuning: number of neighbours k, from 3 to 11
tune_params <- makeParamSet(makeIntegerParam("k", lower=3L, upper=11L))

ctrl = makeTuneControlGrid()

# Run the hyper parameter tuning with k-fold CV
if (length(tune_params$pars) > 0) {
    # Run parameter tuning
    res <- tuneParams(learner, task=train_task, resampling=rdesc,
      par.set=tune_params, control=ctrl, measures=list(mlr::auc))
    
    # Extract the best model: apply the tuned values (res$x) to the learner
    best_learner <- setHyperPars(learner, par.vals=res$x)
    
} else {
    # Simple cross-validation
    res <- resample(learner, train_task, rdesc, measures=list(mlr::auc, setAggregation(mlr::auc, train.mean)))
    
    # No parameter for tuning, only 1 best learner
    best_learner <- learner
}

[Tune] Started tuning learner classif.kknn for parameter set:
     Type len Def  Constr Req Tunable Trafo
k integer   -   - 3 to 11   -    TRUE     -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: k=3
[Tune-y] 1: auc.test.mean=0.6563681; time: 0.0 min
[Tune-x] 2: k=4
[Tune-y] 2: auc.test.mean=0.6696287; time: 0.0 min
[Tune-x] 3: k=5
[Tune-y] 3: auc.test.mean=0.6749266; time: 0.0 min
[Tune-x] 4: k=6
[Tune-y] 4: auc.test.mean=0.6871217; time: 0.0 min
[Tune-x] 5: k=7
[Tune-y] 5: auc.te

In [87]:
# Retrain the model with the best hyper-parameters
best_md <- mlr::train(best_learner, train_task)

In [88]:
# Make predictions on the validation data
pred <- predict(best_md, newdata=valid_processed[, -1])
performance(pred, measures=mlr::auc)

In [89]:
# Make predictions on the test data
pred <- predict(best_md, newdata=test_processed[, -1])
performance(pred, measures=mlr::auc)