# Boosting for Classifier Decision Trees

## Goal

In this tutorial, we demonstrate an ensemble method known as *boosting* for
*classifier decision trees*. We study the `HMEQ` dataset, available at [https://www.kaggle.com/datasets/ajay1735/hmeq-data](https://www.kaggle.com/datasets/ajay1735/hmeq-data), which contains information about applicants for a home equity line of credit. It contains the following variables: `BAD` (the response), `LOAN`, `MORTDUE`, `VALUE`, `REASON`, `JOB`, `YOJ`, `DEROG`, `DELINQ`, `CLAGE`, `NINQ`, `CLNO`, and `DEBTINC`.

## Load libraries and data

First, we load the required libraries. We use:

- the `rpart` package, because it contains implementations of decision trees;
- the `adabag` package, because it implements the *boosting* algorithm; and
- the `caret` package, to create the train-validation-test splits.


In [1]:
library(rpart)
library(adabag)
library(caret)


Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: foreach

Loading required package: doParallel

Loading required package: iterators

Loading required package: parallel



To load a `.csv` file, use the function `read.csv()`. We then use `factor()`
to convert the response variable, `BAD`, to a factor.


In [2]:
hmeq <- read.csv(file.path("data", "hmeq.csv"), header = TRUE)
hmeq$BAD <- factor(hmeq$BAD)
head(hmeq)


| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.36667 | 1.0 | 9.0 | |
| 2 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.83333 | 0.0 | 14.0 | |
| 3 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.46667 | 1.0 | 10.0 | |
| 4 | 1 | 1500 | | | | | | | | | | | |
| 5 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.33333 | 0.0 | 14.0 | |
| 6 | 1 | 1700 | 30548.0 | 40320.0 | HomeImp | Other | 9.0 | 0.0 | 0.0 | 101.466 | 1.0 | 8.0 | 37.11361 |
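
The blank cells in the output above suggest that several columns have missing values. As a minimal sketch of how one might count them, the block below uses a small hand-made data frame (`toy`, with hypothetical values, not the real HMEQ file). It assumes the `read.csv()` defaults, under which blank numeric fields become `NA` while blank text fields arrive as empty strings, so the two are counted separately.

```r
# Toy data frame standing in for a few HMEQ-like rows (hypothetical values).
toy <- data.frame(
  BAD     = c(1, 1, 0, 1),
  LOAN    = c(1100, 1300, 1700, 1500),
  MORTDUE = c(25860, 70053, 97800, NA),
  JOB     = c("Other", "Other", "Office", ""),
  DEBTINC = c(NA, NA, NA, 37.1),
  stringsAsFactors = FALSE
)

# Convert the response to a factor, as done for hmeq$BAD.
toy$BAD <- factor(toy$BAD)

# Number of NA values in each column.
na_counts <- colSums(is.na(toy))

# Number of empty strings in each character column (0 for other types).
empty_counts <- vapply(
  toy,
  function(col) if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L,
  integer(1)
)

print(na_counts)
print(empty_counts)
print(levels(toy$BAD))
```

Running the same two counts on `hmeq` would show which predictors are most affected before any model is fitted.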


## Create train-test split

We keep 70% of the rows for training and hold out the remaining 30% as a test set.



In [3]:
set.seed(12345)
train_idx <- createDataPartition(hmeq$BAD, p = 0.7, list = FALSE)
train_data <- hmeq[train_idx, ]
test_data  <- hmeq[-train_idx, ]


In [4]:
head(train_data)



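| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` |
| 2 | 1 | 1300 | 70053 | 68400 | HomeImp | Other | 7 | 0 | 2 | 121.83333 | 0 | 14 | |
| 3 | 1 | 1500 | 13500 | 16700 | HomeImp | Other | 4 | 0 | 0 | 149.46667 | 1 | 10 | |
| 5 | 0 | 1700 | 97800 | 112000 | HomeImp | Office | 3 | 0 | 0 | 93.33333 | 0 | 14 | |
| 7 | 1 | 1800 | 48649 | 57037 | HomeImp | Other | 5 | 3 | 2 | 77.1 | 1 | 17 | |
| 8 | 1 | 1800 | 28502 | 43034 | HomeImp | Other | 11 | 0 | 0 | 88.76603 | 0 | 8 | 36.88489 |
| 9 | 1 | 2000 | 32700 | 46740 | HomeImp | Other | 3 | 0 | 2 | 216.93333 | 1 | 12 | |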


In [5]:
head(test_data)



| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.36667 | 1.0 | 9.0 | |
| 4 | 1 | 1500 | | | | | | | | | | | |
| 6 | 1 | 1700 | 30548.0 | 40320.0 | HomeImp | Other | 9.0 | 0.0 | 0.0 | 101.466 | 1.0 | 8.0 | 37.11361 |
| 10 | 1 | 2000 | | 62250.0 | HomeImp | Sales | 16.0 | 0.0 | 0.0 | 115.8 | 0.0 | 13.0 | |
| 12 | 1 | 2000 | 20627.0 | 29800.0 | HomeImp | Office | 11.0 | 0.0 | 1.0 | 122.53333 | 1.0 | 9.0 | |
| 14 | 0 | 2000 | 64536.0 | 87400.0 | | Mgr | 2.5 | 0.0 | 0.0 | 147.13333 | 0.0 | 24.0 | |
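
When the outcome passed to `createDataPartition()` is a factor, the split is stratified: rows are sampled within each class so that both parts keep roughly the same class balance. The block below is a hand-written base R sketch of that idea on a made-up response `y` (80 zeros and 20 ones). It illustrates the principle only and is not how `caret` implements it.

```r
# Toy response with an 80/20 class imbalance (hypothetical data, not HMEQ).
set.seed(1)
y <- factor(rep(c(0, 1), times = c(80, 20)))

# Stratified 70/30 split by hand: sample 70% of the row indices within each class.
idx_by_class <- split(seq_along(y), y)
train_idx <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class proportions in each part match those of the full response.
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

The same check, `prop.table(table(train_data$BAD))` against `prop.table(table(test_data$BAD))`, can be used to confirm that the real split is balanced.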


## Perform cross-validation to compare models

Let's split the training data into ten folds:


In [6]:
folds <- createFolds(train_data$BAD, k = 10, list = TRUE)
str(folds)


List of 10
 $ Fold01: int [1:417] 32 41 54 68 80 85 94 130 145 146 ...
 $ Fold02: int [1:417] 1 24 34 35 43 52 61 93 98 101 ...
 $ Fold03: int [1:417] 17 22 26 37 82 88 105 108 117 121 ...
 $ Fold04: int [1:418] 6 7 8 9 11 12 20 49 50 56 ...
 $ Fold05: int [1:418] 2 4 21 29 31 46 48 63 81 99 ...
 $ Fold06: int [1:417] 14 16 23 42 44 59 66 77 100 112 ...
 $ Fold07: int [1:417] 13 25 38 51 53 57 58 69 70 74 ...
 $ Fold08: int [1:417] 10 15 18 30 36 45 47 55 60 64 ...
 $ Fold09: int [1:418] 5 19 27 28 65 67 72 75 97 103 ...
 $ Fold10: int [1:417] 3 33 39 40 71 76 91 95 115 141 ...
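
Each element of `folds` holds the row positions that serve as the validation set for one round, and every training row belongs to exactly one fold. The sketch below rebuilds that structure by hand for a hypothetical 25-row data set and 5 folds. It assigns rows at random without stratifying by class, so it is a simplified illustration rather than a reproduction of `createFolds()`.

```r
set.seed(1)
n <- 25   # hypothetical number of training rows
k <- 5    # hypothetical number of folds

# Assign each row to one fold at random, keeping fold sizes as equal as possible.
fold_id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold_id)

# Collect all fold members; together they should cover rows 1..n exactly once.
all_idx <- sort(unlist(folds, use.names = FALSE))

# For fold 1, the validation rows are folds[[1]]; the remaining rows train the model.
val_idx <- folds[[1]]
train_rows <- setdiff(seq_len(n), val_idx)

print(lengths(folds))
print(identical(all_idx, seq_len(n)))
```

Indexing a data frame with `-val_idx`, as the cross-validation loop does, selects the same rows as `train_rows` here.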


For each fold, we fit a single decision tree and a boosted ensemble on the other nine folds and record the misclassification error on the held-out fold:



In [7]:
# One misclassification rate per fold for each model
tree_errors <- numeric(length(folds))
boost_errors <- numeric(length(folds))

for (i in seq_along(folds)) {
  # Hold out fold i for validation; train on the remaining folds
  val_idx <- folds[[i]]
  train_fold <- train_data[-val_idx, ]
  valid_fold <- train_data[val_idx, ]

  # Decision tree (rpart default settings)
  tree_fit <- rpart(BAD ~ ., data = train_fold, method = "class")
  tree_pred <- predict(tree_fit, valid_fold, type = "class")
  tree_cm <- table(Pred = tree_pred, Obs = valid_fold$BAD)
  tree_errors[i] <- 1 - sum(diag(tree_cm))/sum(tree_cm)

  # Boosting: 100 rounds, each tree limited to depth 4
  boost_fit <- boosting(BAD ~ ., data = train_fold, mfinal = 100,
                        control = rpart.control(minsplit = 5, cp = -1, maxdepth = 4))
  boost_pred <- predict(boost_fit, valid_fold)
  boost_cm <- table(Pred = boost_pred$class, Obs = valid_fold$BAD)
  boost_errors[i] <- 1 - sum(diag(boost_cm))/sum(boost_cm)
}
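
The `control` argument in the `boosting()` call above shapes each individual tree. In `rpart.control()`, `minsplit` is the smallest node size at which a split is attempted, a negative `cp` means no split is rejected for improving the fit too little, and `maxdepth` caps the depth of the tree (the root counts as depth 0). The sketch below fits one such tree on its own, using the `kyphosis` data that ships with `rpart` as a stand-in for HMEQ, and checks its depth.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for HMEQ in this illustration.
data(kyphosis, package = "rpart")

# Same control settings as in the boosting() call.
ctrl <- rpart.control(minsplit = 5, cp = -1, maxdepth = 4)
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class", control = ctrl)

# rpart numbers nodes so that node j has children 2j and 2j + 1,
# hence the depth of node j is floor(log2(j)), with the root at depth 0.
node_ids <- as.numeric(rownames(fit$frame))
tree_depth <- max(floor(log2(node_ids)))

# Class predictions come back as a factor with the response's levels.
pred <- predict(fit, kyphosis, type = "class")

print(tree_depth)
print(levels(pred))
```

Shallow trees like this one are individually weak, which is the usual setting for boosting: the ensemble combines many of them rather than relying on one deep tree.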


Errors for the decision tree model:



In [8]:
print(tree_errors)



 [1] 0.1438849 0.1630695 0.1678657 0.1698565 0.1387560 0.1558753 0.1678657
 [8] 0.1486811 0.1674641 0.1486811
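
Each value above is a misclassification rate, computed in the loop as one minus the share of the confusion matrix that lies on its diagonal. A small worked example with made-up labels (`obs` and `pred` are hypothetical) shows the arithmetic and an equivalent shortcut.

```r
# Hypothetical observed labels and predictions for eight validation rows.
obs  <- factor(c(0, 1, 1, 1, 0, 0, 0, 0), levels = c(0, 1))
pred <- factor(c(0, 0, 1, 1, 0, 1, 0, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are observations.
cm <- table(Pred = pred, Obs = obs)

# Correct predictions sit on the diagonal (4 + 2 = 6 of 8 here).
error_from_cm <- 1 - sum(diag(cm)) / sum(cm)

# Equivalent shortcut: the share of rows where prediction and label differ.
error_direct <- mean(as.character(pred) != as.character(obs))

print(cm)
print(error_from_cm)  # 0.25
print(error_direct)   # 0.25
```

The shortcut is worth knowing because `diag()` assumes a square table, which holds only when every class appears among the predictions.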


Errors for the boosting model:



In [9]:
print(boost_errors)



 [1] 0.13189448 0.10791367 0.13669065 0.12440191 0.10765550 0.12949640
 [7] 0.09832134 0.12709832 0.14354067 0.12949640


Compare the mean error of the two models:



In [10]:
c(tree = mean(tree_errors), boost = mean(boost_errors))
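
As a rough check on the comparison, the fold errors printed earlier can be summarised directly. The sketch below re-enters those printed (rounded) values by hand under the hypothetical names `tree_cv` and `boost_cv`, so the results agree with the notebook only up to rounding.

```r
# Fold errors copied from the printed output (rounded values).
tree_cv <- c(0.1438849, 0.1630695, 0.1678657, 0.1698565, 0.1387560,
             0.1558753, 0.1678657, 0.1486811, 0.1674641, 0.1486811)
boost_cv <- c(0.13189448, 0.10791367, 0.13669065, 0.12440191, 0.10765550,
              0.12949640, 0.09832134, 0.12709832, 0.14354067, 0.12949640)

# Mean and standard deviation of the error across the ten folds.
cv_summary <- rbind(
  tree  = c(mean = mean(tree_cv),  sd = sd(tree_cv)),
  boost = c(mean = mean(boost_cv), sd = sd(boost_cv))
)
print(round(cv_summary, 4))

# Number of folds in which boosting has the lower error.
folds_won <- sum(boost_cv < tree_cv)
print(folds_won)
```

With these values the mean error is about 0.157 for the single tree and about 0.124 for boosting, and boosting has the lower error in all ten folds.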



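## Evaluate the performance of the best model on the test data

Boosting has the lower cross-validation error in every fold above, so we select it as the best model. Finally, we refit it on the full training data and evaluate its performance on the held-out test data.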


In [11]:
final_boost <- boosting(BAD ~ ., data = train_data, mfinal = 100,
                        control = rpart.control(minsplit = 5, cp = -1, maxdepth = 4))
final_pred <- predict(final_boost, test_data)
final_cm <- table(Pred = final_pred$class, Obs = test_data$BAD)


Let's compute the test error:



In [12]:
test_error <- 1 - sum(diag(final_cm))/sum(final_cm)
test_error
