# Boosting for Classifier Decision Trees



In [None]:
options(repr.plot.width = 15, repr.plot.height = 10)



## Goal

In this tutorial, we will demonstrate the use of an ensemble method known as *boosting* for
*classifier decision trees*. We will study the `HMEQ` dataset available at [https://www.kaggle.com/datasets/ajay1735/hmeq-data](https://www.kaggle.com/datasets/ajay1735/hmeq-data), which contains information about applicants who applied for a home equity line of credit. It contains the following variables:

* **BAD:** Target variable: 1 = default on loan, 0 = repaid;
* **LOAN:** amount of the loan request;
* **MORTDUE:** amount due on existing mortgage;
* **VALUE:** value of the current property;
* **REASON:** reason for the loan (home improvement, debt consolidation);
* **JOB:** job category of the applicant (e.g., `Mgr`, `Office`, `Self`, `Sales`, `Other`);
* **YOJ:** years at present job;
* **DEROG:** number of major derogatory reports;
* **DELINQ:** number of delinquent credit lines;
* **CLAGE:** age of the oldest credit line (in months);
* **NINQ:** number of recent credit inquiries;
* **CLNO:** number of existing credit lines; and
* **DEBTINC:** debt-to-income ratio.

## Load libraries and data

First, we load the required libraries. We use the `rpart` package because it provides an
implementation of decision trees, and the `adabag` package because it implements the
*boosting* algorithm.


In [None]:
library(rpart)
library(adabag)


To load a `*.csv` file, use the function `read.csv()`. In addition, use `factor()` to
convert the response variable `BAD` to a factor, so that `rpart` treats the problem as
classification rather than regression.


In [None]:
hmeq <- read.csv(file.path("data", "hmeq.csv"), header = TRUE)
hmeq$BAD <- factor(hmeq$BAD)
head(hmeq)
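
The conversion to a factor matters because `rpart()` chooses its fitting method from the type of the response when `method` is not supplied. A minimal sketch on a made-up data frame (the `toy` data and its values are hypothetical, not taken from HMEQ):

```r
library(rpart)

# Hypothetical toy data standing in for hmeq; the values are made up.
toy <- data.frame(
  BAD  = c(0, 1, 0, 0, 1, 0, 1, 0, 0, 1),
  LOAN = c(1100, 1300, 1500, 1500, 1700, 1800, 1800, 2000, 2100, 2200)
)

# Numeric response: rpart fits a regression tree.
numeric_fit <- rpart(BAD ~ LOAN, data = toy)
numeric_fit$method

# Factor response: rpart fits a classification tree.
toy$BAD <- factor(toy$BAD)
factor_fit <- rpart(BAD ~ LOAN, data = toy)
factor_fit$method
```

The first fit reports `"anova"` and the second `"class"`, which is why `BAD` is converted before any model is trained.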


## Exploratory analysis

We begin by obtaining summaries of the variables in the `hmeq` dataset.


In [None]:
summary(hmeq)
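
Alongside `summary()`, it can help to look at how unbalanced the response is and how many values are missing in each column. A small sketch on made-up data (the `toy` data frame is hypothetical; with the real data one would use `hmeq` in its place):

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  BAD     = factor(c(0, 0, 1, 0, 1, 0, 0, 0)),
  LOAN    = c(1100, 1300, NA, 1500, 1700, 1800, NA, 2000),
  DEBTINC = c(NA, 35.2, 41.0, NA, NA, 28.9, 33.1, 30.4)
)

# Share of each class in the response.
class_shares <- prop.table(table(toy$BAD))
class_shares

# Number of missing values per column.
missing_counts <- colSums(is.na(toy))
missing_counts
```

Missing predictor values do not prevent fitting, because `rpart` can route such rows using surrogate splits.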



We will analyze the applicants' repayment behaviour on the line of credit (`BAD`).

## Prepare train, validation and test data

In order to evaluate the adequacy of our models, we split the data into three groups:

- **Training data**: to fit or train the models,
- **Validation data**: to compare and select models,
- **Test data**: to evaluate the final performance.

Let's create the row indices associated with each split. We set a seed with `set.seed()` to
be able to reproduce the same train-validation-test split:


In [None]:
set.seed(45632)
test_indices <- sample(1:nrow(hmeq), 800, replace = FALSE)
remaining <- setdiff(1:nrow(hmeq), test_indices)
valid_indices <- sample(remaining, 800, replace = FALSE)
train_indices <- setdiff(remaining, valid_indices)
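
A quick sanity check on a split like this is that the three index sets are disjoint and together cover every row. The sketch below repeats the same steps on a hypothetical row count (`n_rows = 2000` is an assumption standing in for `nrow(hmeq)`):

```r
set.seed(45632)
n_rows <- 2000  # hypothetical; stands in for nrow(hmeq)
all_rows <- seq_len(n_rows)

# Same three-way split as above: 800 test rows, 800 validation rows, rest for training.
test_idx <- sample(all_rows, 800, replace = FALSE)
remaining_idx <- setdiff(all_rows, test_idx)
valid_idx <- sample(remaining_idx, 800, replace = FALSE)
train_idx <- setdiff(remaining_idx, valid_idx)

# No row may appear in two splits.
overlap <- c(length(intersect(train_idx, valid_idx)),
             length(intersect(train_idx, test_idx)),
             length(intersect(valid_idx, test_idx)))
overlap

# Every row must appear in exactly one split.
covers_all <- setequal(c(train_idx, valid_idx, test_idx), all_rows)
covers_all
```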


Now, let's obtain each data split:



In [None]:
train_data <- hmeq[train_indices, ]
valid_data <- hmeq[valid_indices, ]
test_data <- hmeq[test_indices, ]


In [None]:
head(train_data)



In [None]:
head(valid_data)



In [None]:
head(test_data)



## Classifier Decision Tree

Let's start by fitting a single decision tree with the `rpart` package and predicting on
the validation data, using an arbitrary threshold ($0.5$) on the predicted probability of
default to illustrate the method:


In [None]:
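# Fit a classification tree (BAD is a factor, so rpart uses method = "class")
initial_tree <- rpart(BAD ~ ., data = train_data)
# Column 2 holds the predicted probability of class "1" (default)
initial_probabilities <- predict(initial_tree, newdata = valid_data, type = "prob")
initial_predictions <- ifelse(initial_probabilities[, 2] > 0.5, 1, 0)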
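
The threshold decides how a predicted probability is turned into a class label, so changing it changes the predictions. A minimal sketch with made-up probabilities (the vector `default_probability` and the alternative threshold $0.3$ are hypothetical):

```r
# Hypothetical predicted probabilities of default for five applicants.
default_probability <- c(0.10, 0.40, 0.50, 0.60, 0.90)

# Threshold 0.5: only probabilities strictly above 0.5 are labelled as default.
predictions_050 <- ifelse(default_probability > 0.5, 1, 0)
predictions_050

# A lower threshold labels more applicants as default.
predictions_030 <- ifelse(default_probability > 0.3, 1, 0)
predictions_030
```

With the strict inequality, a probability of exactly $0.5$ is labelled `0`.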


Print the confusion matrix of the basic model:



In [None]:
# Fix the levels so the table stays 2 x 2 even if only one class is predicted
initial_tree_performance <- data.frame(observed = factor(valid_data$BAD),
                                       predicted = factor(initial_predictions, levels = c(0, 1)))
initial_cm <- with(initial_tree_performance, table(predicted, observed))
initial_cm


In [None]:
initial_error <- 1 - sum(diag(initial_cm)) / sum(initial_cm)
initial_error
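
The error rate is one minus the share of observations on the diagonal of the confusion matrix. A small worked example on made-up labels (both vectors are hypothetical) shows that this agrees with the direct misclassification rate, and why it is safer to fix the factor levels:

```r
# Hypothetical observed and predicted labels for eight applicants.
observed  <- factor(c(0, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))

cm <- table(predicted, observed)
cm

# Error rate from the confusion matrix and from a direct comparison.
error_from_cm <- 1 - sum(diag(cm)) / sum(cm)
error_direct <- mean(predicted != observed)
c(error_from_cm, error_direct)

# If a model predicts only class 0, fixed levels keep the table 2 x 2,
# so diag() still picks the correctly classified cells.
all_zero <- factor(rep(0, 8), levels = c(0, 1))
cm_zero <- table(predicted = all_zero, observed)
error_zero <- 1 - sum(diag(cm_zero)) / sum(cm_zero)
error_zero
```

Here six of the eight predictions are correct, so both calculations give an error rate of 0.25. The all-zero predictor misclassifies the three defaults, giving 0.375.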


## Boosting method

### Train

Now build the boosting model using the `boosting()` function from the `adabag` package. Be
careful with the number of iterations (`mfinal`): one tree is fitted per iteration, so
large values may take a long time. The hyper-parameters that control the growth of each
decision tree are passed through the `control` argument using the function
`rpart.control()`, whose arguments include:

- **minsplit**: minimum number of observations in a node to attempt a split;
- **cp**: complexity parameter used as a stopping rule: a split is attempted only if it
  improves the fit by at least a factor of `cp`; if negative, no split is ruled out on
  these grounds; and
- **maxdepth**: the maximum depth of the tree.


In [None]:
boosted_model <- boosting(BAD ~ ., data = train_data, mfinal = 100,
                          control = rpart.control(minsplit = 5, cp = -1, maxdepth = 4))
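
To see what these hyper-parameters do to a single tree, the sketch below fits trees with the same `rpart.control()` settings on `kyphosis`, a small dataset that ships with `rpart` and is used here only as a stand-in for HMEQ. The depth calculation relies on `rpart`'s node numbering, where node `k` sits at depth `floor(log2(k))`:

```r
library(rpart)

# Same control settings as in the boosting call, applied to one tree.
shallow_control <- rpart.control(minsplit = 5, cp = -1, maxdepth = 4)
shallow_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                      control = shallow_control)

# Node numbers are the row names of the frame; the root is node 1 at depth 0.
node_depth <- floor(log2(as.numeric(rownames(shallow_tree$frame))))
max(node_depth)

# With maxdepth = 1 the tree is a stump: at most one split, hence at most two leaves.
stump <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               control = rpart.control(minsplit = 5, cp = -1, maxdepth = 1))
n_leaves <- sum(stump$frame$var == "<leaf>")
n_leaves
```

Shallow trees such as these are a common choice of base learner in boosting, since the ensemble compensates for the weakness of each individual tree.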


### Predict on the validation data

Create predictions on the validation data, and print the confusion matrix:


In [None]:
boosted_predictions <- predict.boosting(boosted_model, newdata = valid_data)
boosted_predictions$confusion


Print the error rate:



In [None]:
boosted_predictions$error



Plot a trace of how the error evolves as the ensemble size grows.



In [None]:
# Named error_trace to avoid masking base::trace()
error_trace <- errorevol(boosted_model, valid_data)
plot(error_trace$error, xlab = "Ensemble size", ylab = "Error rate", type = "b", pch = 19)
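
To make the idea of an error trace concrete, here is a generic AdaBoost sketch with stumps that records the ensemble error after each round. It is an illustration under stated assumptions, not `adabag`'s internal code: it uses the `kyphosis` data from `rpart` as a stand-in, reweights observations rather than resampling them, measures the error on the training data for brevity, and all variable names are made up for this example.

```r
library(rpart)

# Recode the labels to -1 / +1, which keeps the weight update compact.
y <- ifelse(kyphosis$Kyphosis == "present", 1, -1)
n <- nrow(kyphosis)

obs_weights <- rep(1 / n, n)   # start with equal weights
n_rounds <- 20
alpha <- numeric(n_rounds)     # vote weight of each stump
ensemble_error <- numeric(n_rounds)
score <- numeric(n)            # running weighted vote of the ensemble

for (m in seq_len(n_rounds)) {
  # Fit a stump on the currently weighted data.
  stump <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 weights = obs_weights,
                 control = rpart.control(minsplit = 5, cp = -1, maxdepth = 1))
  vote <- ifelse(predict(stump, kyphosis, type = "class") == "present", 1, -1)

  # Weighted error of this stump, clipped away from 0 and 1 to keep alpha finite.
  err <- sum(obs_weights * (vote != y))
  err <- min(max(err, 1e-10), 1 - 1e-10)
  alpha[m] <- 0.5 * log((1 - err) / err)

  # Increase the weight of misclassified rows, decrease the others, renormalise.
  obs_weights <- obs_weights * exp(-alpha[m] * y * vote)
  obs_weights <- obs_weights / sum(obs_weights)

  # Error of the ensemble made of the first m stumps.
  score <- score + alpha[m] * vote
  ensemble_error[m] <- mean(sign(score) != y)
}

# Ensemble size with the lowest recorded error.
best_size <- which.min(ensemble_error)
best_size
```

Plotting `ensemble_error` against the round number gives the same kind of curve as the trace above. With a validation set in place of the training data, `which.min()` would suggest an ensemble size to keep.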


## Check performance on test data

Create predictions of the best model on the test data, and print the confusion matrix:


In [None]:
predictions_best_model <- predict.boosting(boosted_model, newdata = test_data)
predictions_best_model$confusion


Print the error rate on the test data:



In [None]:
predictions_best_model$error

