# Boosting for Classifier Decision Trees

## Goal

In this tutorial, we demonstrate an ensemble method known as *boosting* for
*classifier decision trees*. We study the `HMEQ` dataset, available at [https://www.kaggle.com/datasets/ajay1735/hmeq-data](https://www.kaggle.com/datasets/ajay1735/hmeq-data), which contains information about applicants for a home equity line of credit. The response variable used throughout is `BAD`; all remaining columns are used as predictors.

## Load libraries and data

First, we load the required libraries. We use:

- the `rpart` package, which implements decision trees;
- the `adabag` package, which implements the *boosting* algorithm; and
- the `caret` package, which we use to create the train, validation, and test splits.


In [None]:
library(rpart)
library(adabag)
library(caret)
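
To make concrete what `adabag` automates, here is a minimal sketch of the boosting idea: fit a sequence of small trees, each time giving more weight to the rows the previous trees misclassified, and combine them by a weighted vote. It uses `rpart` stumps on a two-class subset of the built-in `iris` data. The helper names `toy_adaboost` and `toy_predict`, the 10 rounds, and the stump settings are choices made for this illustration; it does not attempt to reproduce `adabag`'s internals.

```r
library(rpart)

# Toy two-class data: versicolor vs. virginica from the built-in iris data,
# with the response coded as -1 / +1.
keep <- iris$Species != "setosa"
toy <- iris[keep, 1:4]
toy$y <- factor(ifelse(iris$Species[keep] == "virginica", 1, -1))

# Hypothetical helper: fit up to `rounds` weighted stumps, AdaBoost-style.
toy_adaboost <- function(data, rounds = 10) {
  y_num <- as.numeric(as.character(data$y))   # response as -1 / +1
  w <- rep(1 / nrow(data), nrow(data))        # uniform starting weights
  fits <- list()
  alpha <- numeric(0)
  for (m in seq_len(rounds)) {
    fit <- rpart(y ~ ., data = data, weights = w, method = "class",
                 control = rpart.control(maxdepth = 1, cp = -1))
    h <- as.numeric(as.character(predict(fit, data, type = "class")))
    err <- sum(w * (h != y_num))              # weighted error (weights sum to 1)
    if (err >= 0.5) break                     # no better than chance: stop
    err <- max(err, 1e-10)                    # avoid log(0) for a perfect stump
    a <- 0.5 * log((1 - err) / err)           # voting weight of this stump
    w <- w * exp(-a * y_num * h)              # up-weight misclassified rows
    w <- w / sum(w)                           # renormalise
    fits[[length(fits) + 1]] <- fit
    alpha <- c(alpha, a)
  }
  list(fits = fits, alpha = alpha)
}

# Hypothetical helper: weighted vote of all stumps.
toy_predict <- function(model, newdata) {
  score <- rep(0, nrow(newdata))
  for (m in seq_along(model$fits)) {
    h <- as.numeric(as.character(predict(model$fits[[m]], newdata, type = "class")))
    score <- score + model$alpha[m] * h
  }
  ifelse(score >= 0, 1, -1)
}

model <- toy_adaboost(toy, rounds = 10)
pred <- toy_predict(model, toy)
train_error <- mean(pred != as.numeric(as.character(toy$y)))
train_error
```

The value printed is a training error on toy data, so it says nothing about how boosting performs on `HMEQ`; the sketch is only meant to show the reweighting loop.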


To load a `.csv` file, use the function `read.csv()`. We then use `factor()`
to convert the response variable, `BAD`, to a `factor`, so that it is treated as a class label rather than a number.


In [None]:
hmeq <- read.csv(file.path("data", "hmeq.csv"), header = TRUE)
hmeq$BAD <- factor(hmeq$BAD)
head(hmeq)
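
As a small illustration of why the `factor()` step matters, the sketch below writes a tiny made-up CSV to a temporary file (it is not the `HMEQ` data, and the column `x` is hypothetical) and shows that a 0/1 column is read as an integer until it is converted.

```r
# Write a tiny, made-up CSV to a temporary file (not the HMEQ data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("BAD,x", "1,1100", "0,1300", "0,1500"), tmp)

toy_csv <- read.csv(tmp, header = TRUE)
class_before <- class(toy_csv$BAD)    # "integer": read.csv sees plain numbers

toy_csv$BAD <- factor(toy_csv$BAD)
class_after <- class(toy_csv$BAD)     # "factor": now a categorical label
levels(toy_csv$BAD)                   # the class labels, stored as strings

unlink(tmp)                           # clean up the temporary file
```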


## Create train-test split

We keep 70% of the rows for training and hold out the remaining 30% as a test set. Setting the seed makes the split reproducible.



In [None]:
set.seed(12345)
train_idx <- createDataPartition(hmeq$BAD, p = 0.7, list = FALSE)
train_data <- hmeq[train_idx, ]
test_data  <- hmeq[-train_idx, ]
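
When given a factor, `createDataPartition()` samples within each class, so the class proportions in the two parts should stay close to those of the full data. The sketch below checks this on a made-up, imbalanced response (80 zeros and 20 ones, chosen only for illustration) rather than on `HMEQ`.

```r
library(caret)

set.seed(1)
# Made-up imbalanced response: 80 zeros and 20 ones.
y <- factor(rep(c(0, 1), times = c(80, 20)))
idx <- as.vector(createDataPartition(y, p = 0.7, list = FALSE))

prop.table(table(y))          # class shares in the full data
prop.table(table(y[idx]))     # class shares in the 70% partition
prop.table(table(y[-idx]))    # class shares in the held-out 30%
```

The same three calls, applied to `hmeq$BAD`, `train_data$BAD`, and `test_data$BAD`, would give the corresponding check for the split above.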


In [None]:
head(train_data)



In [None]:
head(test_data)



## Perform cross-validation to compare models

Let's split the training data into ten folds. Each element of `folds` holds the row indices of one held-out (validation) fold:


In [None]:
folds <- createFolds(train_data$BAD, k = 10, list = TRUE)
str(folds)
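
With `list = TRUE` and the default `returnTrain = FALSE`, `createFolds()` returns the held-out indices for each fold. The sketch below, on a made-up response rather than `train_data$BAD`, checks that the folds are disjoint and together cover every row, which is what the cross-validation loop relies on.

```r
library(caret)

set.seed(1)
y <- factor(rep(c(0, 1), times = c(80, 20)))    # made-up response
toy_folds <- createFolds(y, k = 10, list = TRUE)

lengths(toy_folds)                              # size of each held-out fold
all_idx <- unlist(toy_folds)
no_overlap <- anyDuplicated(all_idx) == 0       # no row appears in two folds
full_cover <- setequal(all_idx, seq_along(y))   # every row is held out once
c(no_overlap = no_overlap, full_cover = full_cover)
```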


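Evaluate both models using 10-fold cross-validation. For each fold, we fit a single decision tree and a boosted ensemble on the other nine folds and record the misclassification error on the held-out fold: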



In [None]:
tree_errors <- numeric(length(folds))
boost_errors <- numeric(length(folds))

for (i in seq_along(folds)) {
  val_idx <- folds[[i]]
  train_fold <- train_data[-val_idx, ]
  valid_fold <- train_data[val_idx, ]

  # Decision tree
  tree_fit <- rpart(BAD ~ ., data = train_fold, method = "class")
  tree_pred <- predict(tree_fit, valid_fold, type = "class")
  tree_cm <- table(Pred = tree_pred, Obs = valid_fold$BAD)
  tree_errors[i] <- 1 - sum(diag(tree_cm))/sum(tree_cm)

  # Boosting
  boost_fit <- boosting(BAD ~ ., data = train_fold, mfinal = 100,
                        control = rpart.control(minsplit = 5, cp = -1, maxdepth = 4))
  boost_pred <- predict(boost_fit, valid_fold)
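  # Fix the factor levels so the table stays square even if a class is never predicted
  boost_cm <- table(Pred = factor(boost_pred$class, levels = levels(valid_fold$BAD)),
                    Obs = valid_fold$BAD)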
  boost_errors[i] <- 1 - sum(diag(boost_cm))/sum(boost_cm)
}
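
The error in the loop is one minus the share of observations on the diagonal of the confusion matrix, that is, the fraction of misclassified rows. The sketch below shows this on made-up labels and compares it with a direct comparison of the two label vectors. The helper `misclass_rate` is hypothetical, introduced here only for illustration; it also works when predictions are stored as character strings.

```r
# Hypothetical helper: misclassification rate straight from the label vectors.
misclass_rate <- function(pred, obs) {
  mean(as.character(pred) != as.character(obs))
}

obs  <- factor(c(0, 0, 1, 1, 0, 1))
pred <- c("0", "0", "0", "1", "0", "0")    # made-up predictions

# Giving the predictions the same levels as the observations keeps the
# table square, so diag() picks out exactly the correct predictions.
cm <- table(Pred = factor(pred, levels = levels(obs)), Obs = obs)

err_cm     <- 1 - sum(diag(cm)) / sum(cm)  # confusion-matrix route
err_direct <- misclass_rate(pred, obs)     # direct route
c(err_cm = err_cm, err_direct = err_direct)
```

Both routes give 2 errors out of 6 rows here, so the two values agree.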


Errors for the decision tree model:



In [None]:
print(tree_errors)



Errors for the boosting model:



In [None]:
print(boost_errors)



Compare the mean cross-validation error of the two models:



In [None]:
c(tree = mean(tree_errors), boost = mean(boost_errors))



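## Evaluate the performance of the best model on the test data

Finally, we refit the selected model on the full training data and evaluate it on the held-out test data. The code below assumes that boosting had the lower mean cross-validation error.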


In [None]:
final_boost <- boosting(BAD ~ ., data = train_data, mfinal = 100,
                        control = rpart.control(minsplit = 5, cp = -1, maxdepth = 4))
final_pred <- predict(final_boost, test_data)
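# Fix the factor levels so the table stays square even if a class is never predicted
final_cm <- table(Pred = factor(final_pred$class, levels = levels(test_data$BAD)),
                  Obs = test_data$BAD)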


Let's compute the misclassification error on the test data:



In [None]:
test_error <- 1 - sum(diag(final_cm))/sum(final_cm)
test_error
