# 1. Ensemble 

- An **ensemble** method is an approach that combines many simple building-block models in order to obtain a single, potentially very powerful model. 
- These simple building block models are sometimes known as weak learners. 
- Methods 
    - Bagging 
    - Random forest 
    - Boosting 
    - Bayesian additive regression trees 

# 2. Bagging 

## 2.1 Bootstrap methods

1. Treat the observed sample $(X_1, ..., X_n)$ as the population and draw $n$ observations from it with replacement. 
2. Repeat $B$ times to obtain $B$ bootstrap sets : 
    1. Calculate the statistic of interest from each bootstrap set $(\hat{X_1}, ..., \hat{X_n})$.
    2. Aggregate the statistics over the $B$ bootstrap sets. 

```R
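# Population of training set 
seq(20) 

# Bootstrap set: draw 20 indices WITH replacement
# (without replace=TRUE, sample() only returns a permutation)
sort(sample(seq(20), 20, replace=TRUE))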
```
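The bootstrap steps can be sketched end to end in base R. This is a minimal illustration on simulated data, not part of the original example: the sample `x`, the choice of the mean as the statistic, and `B = 1000` are all assumptions made here.

```R
# Bootstrap estimate of the standard error of the sample mean (base R only)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # hypothetical observed sample
B <- 1000                           # number of bootstrap sets (assumed)

# Each bootstrap set: n draws from x WITH replacement, then the statistic
boot.means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

# Aggregate over the B bootstrap sets
boot.avg <- mean(boot.means)   # average of the bootstrap statistics
boot.se  <- sd(boot.means)     # bootstrap standard error of the mean

# Compare with the textbook formula sd(x) / sqrt(n)
c(bootstrap = boot.se, formula = sd(x) / sqrt(length(x)))
```

The two standard errors should be close, which is a quick sanity check that the resampling was done with replacement.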

## 2.2 Bagging Tree methods 

- **Bootstrap aggregation**, or **bagging**, is a general-purpose procedure for reducing the variance of a statistical learning method 
- Compute the statistic on every bootstrap set and average the results : $\frac{1}{B}\sum_{i=1}^{B} \bar{X_i}$ 
- In practice we cannot draw many training sets from the population, so we take repeated bootstrap samples from the single training set. 
- Fit the model to each of the $B$ bootstrapped training sets and average the predictions : $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^b(x)$
- For classification trees : for each test observation, we record the class predicted by each of the $B$ trees and take a majority vote; the overall prediction is the most commonly occurring class among the $B$ predictions. 
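Why averaging reduces variance can be seen from a standard result. It is stated here under the simplifying assumption that the $B$ predictions are independent with a common variance $\sigma^2$ (a symbol introduced for this illustration); bootstrapped trees are in fact correlated, so this is an idealization:

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^b(x)\right) = \frac{1}{B^2}\sum_{b=1}^{B}\mathrm{Var}\left(\hat{f}^b(x)\right) = \frac{\sigma^2}{B}
$$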

## 2.3 [Ex] Bagging Classification Tree

```R
# Heart is assumed to be loaded and preprocessed beforehand
library(tree)

# Separate training and test sets
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)

# Classification error rate for single tree 
heart.tran <- tree(AHD ~., subset=train, Heart)
heart.pred <- predict(heart.tran, Heart[test, ], type="class")
tree.err <- mean(Heart$AHD[test]!=heart.pred)
tree.err

# Bagging 
set.seed(12345)
B <- 500
n <- nrow(Heart)
Vote <- rep(0, length(test))
bag.err <- NULL 

for (i in 1:B) {
    # Bootstrap training set 
    index <- sample(train, replace=TRUE)
    heart.tran <- tree(AHD ~., Heart[index,])
    heart.pred <- predict(heart.tran, Heart[test, ], type="class")
    Vote[heart.pred=="Yes"] <- Vote[heart.pred=="Yes"] + 1
    preds <- rep("Yes", length(test))
    # Majority vote: predict "No" when fewer than half of the
    # i trees fitted so far voted "Yes"
    preds[Vote < i/2] <- "No"
    bag.err[i] <- mean(Heart$AHD[test]!=preds)
}

# Visualize bagging decision tree 
plot(bag.err, type="l", xlab="Number of Trees", col=1, ylab="Classification Error Rate")
abline(h=tree.err, lty=2, col=2)
legend("topright", c("Single tree", "Bagging"), col=c(2,1), lty=c(2,1))
```

![](Img/Ensemble1.png)

- The misclassification rate of bagging stabilizes at roughly 0.23 as the number of trees grows.

# 3. Out-of-Bag Error Estimation

- The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. 
- One can show that, on average, each bagged tree makes use of around two-thirds of the observations. 
- The remaining one-third of the observations not used to fit a given bagged tree are referred to as the **out-of-bag (OOB)** observations. 
- We can predict the response for the $i$th observation using each of the trees in which that observation was **OOB**. This will yield around $\frac{B}{3}$ predictions for the $i$th observation on average. 
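The "two-thirds / one-third" split follows from the probability that a given observation is never drawn in $n$ draws with replacement, $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$. A small base R check, where `n = 300` and `B = 500` are values assumed here for illustration:

```R
set.seed(1)
n <- 300      # hypothetical number of training observations
B <- 500      # number of bootstrap samples (assumed)

# Exact probability that a given observation is left out of one bootstrap sample
p.oob <- (1 - 1/n)^n

# Simulated OOB fraction over B bootstrap samples
oob.frac <- replicate(B, {
  index <- sample(1:n, replace = TRUE)
  length(setdiff(1:n, index)) / n
})

c(exact = p.oob, limit = exp(-1), simulated = mean(oob.frac))
```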

## 3.1 [Ex] Average misclassification rate of a single tree 

```R
# Average misclassification rate of a single tree
# Over 50 replications
set.seed(12345)
K <- 50
Err <- NULL 

for (i in 1:K) {
  train <- sample(1:nrow(Heart), nrow(Heart)*2/3) 
  test <- setdiff(1:nrow(Heart), train) 
  heart.tran <- tree(AHD ~., subset=train, Heart)
  heart.pred <- predict(heart.tran, Heart[test, ], type="class") 
  Err[i] <- mean(Heart$AHD[test]!=heart.pred)
}
summary(Err)
Tree.Err <- mean(Err)
```

|Min|1st Qu.|Median|Mean|3rd Qu.|Max|
|:---:|:---:|:---:|:---:|:---:|:---:|
|0.1616|0.2121|0.2424|0.2473|0.2727|0.3333| 

- Over 50 replications, the mean misclassification rate is 0.2473 (the median is 0.2424).

## 3.2 [Ex] Out-of-bag misclassification rate 

```R
# OOB error estimation (B, the number of trees, is reused from section 2.3)
set.seed(1234)
Valid <- Vote <- Mis <- rep(0, nrow(Heart)) 
OOB.err <- NULL

for (i in 1:B) {
  # Bootstrapping from Heart index 
  index <- sample(1:nrow(Heart), replace=TRUE)
  # OOB observations: indices that were not drawn into the bootstrap sample 
  test <- setdiff(1:nrow(Heart), unique(index))
  Valid[test] <- Valid[test] + 1
  # Train model with bootstrapped training sample 
  heart.tran <- tree(AHD ~., Heart[index,])
  # Make predictions of test sets 
  heart.pred <- predict(heart.tran, Heart[test,], type="class")
  Vote[test] <- Vote[test] + (heart.pred=="Yes")
  # Vote for test sets 
  preds <- rep("Yes", length(test))
  preds[Vote[test]/Valid[test] < 0.5] <- "No"
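  # Mark each OOB case by its current vote: -1 = misclassified, 1 = correct
  # (a logical mask avoids the test[-wh] pitfall when no case is misclassified)
  wrong <- Heart$AHD[test] != preds
  Mis[test[wrong]] <- -1
  Mis[test[!wrong]] <- 1
  OOB.err[i] <- sum(Mis==-1)/sum(Mis!=0)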
}

# View statistical reports of error 
summary(OOB.err)
summary(OOB.err[-c(1:100)])

# Visualize results 
plot(OOB.err, type="l", xlab="Number of Trees", col=1,
     ylab="Classification Error Rate", ylim=c(0.1,0.4))
abline(h=Tree.Err, lty=2, col=2)
legend("topright", c("Single tree", "OOB"), col=c(2,1), lty=c(2,1))
```

<img src="Img/OOB1.png" width="300" height="150"> 

# 4. Random Forest

- **Random forests** provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. This reduces the variance when we average the trees. 
- When building these decision trees, each time a split in a tree is considered, <span style="color:red"> a random selection of m predictors is chosen as split candidates from the full set of p predictors</span>. 
- A fresh selection of m predictors is taken at each split, and typically we choose $m=\sqrt{p}$.
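The effect of decorrelation can be made concrete with a standard variance identity. Assume, for illustration only, that each tree's prediction has variance $\sigma^2$ and that every pair of trees has correlation $\rho$ (both symbols are introduced here, not in the original notes). Then the variance of the average of $B$ trees is

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}^b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

As $B$ grows the second term vanishes, so the remaining variance is $\rho\,\sigma^2$. Restricting each split to $m$ candidate predictors lowers $\rho$, which is what the averaging cannot do on its own.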

## 4.1 [Ex] Random Forest with m predictors 

```R
# Train-test split
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2) 
test <- setdiff(1:nrow(Heart), train) 

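# Bagging: m=13, Random forest: m=1, 4, 6
# Note: this simplified version samples the m predictors once per tree,
# whereas a true random forest draws a fresh subset at every split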
B <- 500 
n <- nrow(Heart)
m <- c(1, 4, 6, 13) 
Err <- matrix(0, B, length(m)) 
Vote <- matrix(0, length(test), length(m)) 

for (i in 1:B) { 
    index <- sample(train, replace=TRUE) 
    for (k in 1:length(m)) { 
        s <- c(sort(sample(1:13, m[k])), 14) 
        tr <- tree(AHD ~., data=Heart[index, s]) 
        pr <- predict(tr, Heart[test, ], type="class") 
        Vote[pr=="Yes", k] <- Vote[pr=="Yes", k] + 1
        PR <- rep("Yes", length(test)) 
        PR[Vote[,k] < i/2] <- "No" 
        Err[i, k] <- mean(Heart$AHD[test]!=PR) 
    } 
} 

# Visualize Result 
labels <- c("m = 1", "m = 4", "m = 6", "m = 13")
matplot(Err, type="l", xlab="Number of Trees", lty=1,
col=c(1,2:4), ylab="Classification Error Rate")
legend("topright", legend=labels, col=c(1,2:4), lty=1)

# Statistical reports 
colnames(Err) <- labels
apply(Err, 2, summary)
```

![](Img/RandomForest1.png)

- m = 1 : 0.2408725 
- m = 4 : 0.2242282 
- m = 6 : 0.2242550
- m = 13 : 0.2466040
- The misclassification rate was lowest when m was 4, although m = 6 was nearly identical. 

## 4.2 [Ex] Random Forest using the randomForest package

```R
library(randomForest)

# Bagging(m=13) 
bag.heart <- randomForest(x=Heart[train, -14], y=Heart[train,14], 
                          xtest=Heart[test, -14], ytest=Heart[test,14], 
                          mtry=13, importance=TRUE) 
bag.heart
bag.conf <- bag.heart$test$confusion[1:2,1:2] 

# Misclassification error of m = 13
1 - sum(diag(bag.conf))/sum(bag.conf)

# Random forest with m=1
rf1 <- randomForest(x=Heart[train,-14], y=Heart[train,14], 
                    xtest=Heart[test,-14], ytest=Heart[test,14], 
                    mtry=1, importance=TRUE) 
rf1.conf <- rf1$test$confusion[1:2, 1:2]
1- sum(diag(rf1.conf))/sum(rf1.conf)

## Random forest with m=4
rf2 <- randomForest(x=Heart[train,-14], y=Heart[train,14],
                    xtest=Heart[test,-14], ytest=Heart[test,14],
                    mtry=4, importance=TRUE)
rf2.conf <- rf2$test$confusion[1:2,1:2]
1- sum(diag(rf2.conf))/sum(rf2.conf)

## Random forest with m=6
rf3 <- randomForest(x=Heart[train,-14], y=Heart[train,14],
                    xtest=Heart[test,-14], ytest=Heart[test,14],
                    mtry=6, importance=TRUE)
rf3.conf <- rf3$test$confusion[1:2,1:2]
1- sum(diag(rf3.conf))/sum(rf3.conf)
```

- Misclassification error of m = 13 : 0.2684564
- Misclassification error of m = 1 : 0.2147651
- Misclassification error of m = 4 : 0.2416107
- Misclassification error of m = 6 : 0.2348993 

**randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500, mtry=n, replace=TRUE)** 

- x, y : The predictors and the response, passed separately; the formula interface is also commonly used 
- xtest, ytest : If a test set is supplied, predictions and the test confusion matrix are computed during fitting
- ntree : The number of trees to grow 
- mtry : The number of predictors randomly sampled as candidates at each split 

## 4.3 [Ex] Random Forest using mtry grid search

```R
# Set Grids 
set.seed(1111)
N <- 50
CER <- matrix(0, N, 13)

# Training models : 50 replications 
for (i in 1:N) {
    train <- sample(1:nrow(Heart), floor(nrow(Heart)*2/3))
    test <- setdiff(1:nrow(Heart), train)
    for (k in 1:13) {
        # Apply random forest 
        rf <- randomForest(x=Heart[train, -14], y=Heart[train, 14], 
                           xtest=Heart[test, -14], ytest=Heart[test, 14], mtry=k)
        rfc <- rf$test$confusion[1:2,1:2]
        # Calculate misclassification rate 
        CER[i, k] <- 1-sum(diag(rfc))/sum(rfc)
    }
}
apply(CER, 2, mean)

# Visualize results 
boxplot(CER, boxwex=0.5, main="Random Forest with m = 1 to 13",
        ylab="Classification Error Rates", col="orange", ylim=c(0, 0.4))
```

<img src="Img/RandomForest2.png" height=400 width=400>

## 4.4 [Ex] Variable Importance Reports 

```R
# Scatter plot of feature importance 
varImpPlot(rf1)
varImpPlot(rf2)
varImpPlot(rf3)

# Horizontal bar plot of feature importance 
(imp1 <- importance(rf1))
(imp2 <- importance(rf2))
(imp3 <- importance(rf3))

par(mfrow=c(1,3))
# Based on MeanDecreaseAccuracy
barplot(sort(imp1[,3]), main="RF (m=1)", horiz=TRUE, col=2)
barplot(sort(imp2[,3]), main="RF (m=4)", horiz=TRUE, col=2)
barplot(sort(imp3[,3]), main="RF (m=6)", horiz=TRUE, col=2)

# Based on MeanDecreaseGini
barplot(sort(imp1[,4]), main="RF (m=1)", horiz=TRUE, col=2)
barplot(sort(imp2[,4]), main="RF (m=4)", horiz=TRUE, col=2)
barplot(sort(imp3[,4]), main="RF (m=6)", horiz=TRUE, col=2)
```

- MeanDecreaseAccuracy : The mean decrease in out-of-bag prediction accuracy when the values of the variable are randomly permuted 
- MeanDecreaseGini : The total decrease in node impurity (Gini index) from splits on the variable, averaged over all trees 
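The permutation idea behind an accuracy-based importance can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, a logistic regression (`glm`) stands in for the forest, and a held-out test set stands in for the OOB observations, so the numbers are not comparable to `importance()` output.

```R
set.seed(1)
n <- 400
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is pure noise
y <- factor(ifelse(2 * x1 + 0.5 * x2 + rnorm(n) > 0, "Yes", "No"))
dat <- data.frame(y, x1, x2, x3)
train <- sample(1:n, n / 2)
test <- setdiff(1:n, train)

# Stand-in model (assumption): logistic regression instead of a forest
fit <- glm(y ~ ., data = dat[train, ], family = binomial)

# Accuracy of the fitted model on a given data frame
acc <- function(newdata) {
  pred <- ifelse(predict(fit, newdata, type = "response") > 0.5, "Yes", "No")
  mean(pred == newdata$y)
}
base.acc <- acc(dat[test, ])

# Mean decrease in accuracy after permuting one predictor at a time
imp <- sapply(c("x1", "x2", "x3"), function(v) {
  mean(replicate(20, {
    perm <- dat[test, ]
    perm[[v]] <- sample(perm[[v]])   # break the link between v and y
    base.acc - acc(perm)
  }))
})
imp
```

A predictor that the model relies on heavily (here `x1`) loses much more accuracy under permutation than a noise variable such as `x3`.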

<img src="Img/RandomForest3.png">
<img src="Img/RandomForest4.png">
<img src="Img/RandomForest5.png">
<img src="Img/RandomForest6.png">
<img src="Img/RandomForest7.png">

## 4.5 [Ex] Assignments 

The Caravan insurance dataset consists of 85 quantitative variables and 1 factor with 2 levels. The factor is the response variable Purchase. Randomly split the data into training, validation, and test sets as follows: 

```R
set.seed(1111)
data(Caravan)
train <- sample(1:nrow(Caravan), nrow(Caravan)/2)
others <- setdiff(1:nrow(Caravan), train)
validation <- sample(others, length(others)/2)
test <- setdiff(others, validation) 
```

Fit the random forest model using the training set, where the number of predictors considered at each split ranges from 1 to 9. First, use the validation set to find the number of predictors that minimizes the classification error rate (CER). Then, compute the CER on the test set using the random forest with that optimal number of predictors. 

```R
library(randomForest)
library(ISLR)

set.seed(1111)
data(Caravan)
train <- sample(1:nrow(Caravan), nrow(Caravan)/2)
others <- setdiff(1:nrow(Caravan), train)
validation <- sample(others, length(others)/2)
test <- setdiff(others, validation) 

CER <- NULL
for (i in 1:9) { 
  rf <- randomForest(x=Caravan[train, -86], y=Caravan[train, 86],
                     xtest=Caravan[validation, -86], ytest=Caravan[validation, 86], mtry=i)  
  rfc <- rf$test$confusion[1:2,1:2]
  CER[i] <- 1-sum(diag(rfc))/sum(rfc)
}
wm <- which.min(CER)

# Calculate the misclassification rate on the test set 
rf.test <- randomForest(x=Caravan[train, -86], y=Caravan[train, 86], 
                   xtest=Caravan[test, -86], ytest=Caravan[test, 86], mtry=wm)
rfc.test <- rf.test$test$confusion[1:2,1:2]
CER.test <- 1-sum(diag(rfc.test))/sum(rfc.test)
CER.test
```

- wm : 3
- CER.test : 0.06593407

# 5. Boosting 

- **Boosting** is a general approach that can be applied to many statistical learning methods for regression or classification. 
- Boosting grows trees sequentially: each tree is grown using information from the previously grown trees. 
- The boosting approach learns slowly. 
- Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals. 
- Each of these models can be rather small, with just a few terminal nodes, determined by the parameter $d$ in the algorithm. 
- Common boosting algorithms
    - XGBoost 
    - AdaBoost 
    
    
## 5.1 Boosting algorithm 

1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set. 
2. For $b = 1, 2, ..., B$, repeat : 
    1. Fit a tree $\hat{f}^b$ with $d$ splits to the training data $(X, r)$. 
    2. Update $\hat{f}$ by adding in a shrunken version of the new tree : $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f}^b(x)$
    3. Update the residuals, $r_i \leftarrow r_i - \lambda\hat{f}^b(x_i)$.
3. Output the boosted model : $\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)$. 
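The algorithm can be traced step by step in base R. The sketch below is an assumption-laden illustration rather than the `gbm` implementation: the data are simulated, there is a single predictor, each tree is a stump ($d = 1$), and `fit.stump` / `stump.predict` are hypothetical helpers written for this example.

```R
# Boosting regression stumps (d = 1) on simulated data, base R only
set.seed(1)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)

# Hypothetical helper: best single split on x, chosen by residual sum of squares
fit.stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between sorted x values
  rss <- sapply(cuts, function(s) {
    left <- r[x < s]; right <- r[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- cuts[which.min(rss)]
  list(cut = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
# Hypothetical helper: prediction of a fitted stump
stump.predict <- function(st, x) ifelse(x < st$cut, st$left, st$right)

B <- 500          # number of trees
lambda <- 0.05    # shrinkage parameter
fhat <- rep(0, n) # step 1: f(x) = 0
r <- y            #         r_i = y_i
mse <- numeric(B)

for (b in 1:B) {
  st <- fit.stump(x, r)                   # step 2.1: fit a tree to (X, r)
  step <- lambda * stump.predict(st, x)
  fhat <- fhat + step                     # step 2.2: add the shrunken tree
  r <- r - step                           # step 2.3: update the residuals
  mse[b] <- mean((y - fhat)^2)            # training MSE after b trees
}
c(first = mse[1], last = mse[B])
```

Because each stump is a least-squares fit to the current residuals and $0 < \lambda < 1$, the training MSE never increases from one iteration to the next; test error is a separate matter and is what the tuning parameters control.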

## 5.2 Tuning parameters for boosting 

- The number of trees $B$ : Boosting can overfit if $B$ is too large, although this overfitting tends to occur slowly if at all.
- The shrinkage parameter $\lambda$ : Controls the rate at which boosting learns. Typical values are 0.01 or 0.001.
- The number of splits $d$ in each tree : Controls the complexity of the boosted ensemble. 

## 5.3 [Ex] Boosting Decision Tree : Classification 

```R
# Prerequisite : Importing library, dataset, preprocessing 
library(gbm)

# Train-Test split 
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)

# Create (0,1) response
Heart0 <- Heart
Heart0[,"AHD"] <- as.numeric(Heart$AHD)-1

# boosting (d=1)
boost.d1 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=1)

# Results of feature selection 
summary(boost.d1)

# Calculate misclassification error 
# predict() needs n.trees to know how many boosted trees to use 
yhat.d1 <- predict(boost.d1, newdata=Heart0[test, ], type="response", n.trees=1000)
phat.d1 <- rep(0, length(yhat.d1))
phat.d1[yhat.d1 > 0.5] <- 1
mean(phat.d1!=Heart0[test, "AHD"])
```

```R
# boosting (d=2)
# Training model with interaction.depth = 2
boost.d2 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=2)

# Make predictions 
yhat.d2 <- predict(boost.d2, newdata=Heart0[test, ], type="response", n.trees=1000)
phat.d2 <- rep(0, length(yhat.d2))
phat.d2[yhat.d2 > 0.5] <- 1
mean(phat.d2!=Heart0[test, "AHD"])
```

```R
# boosting (d=3)
# Training model with interaction.depth = 3
boost.d3 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=3)

# Make predictions
yhat.d3 <- predict(boost.d3, newdata=Heart0[test, ], type="response", n.trees=1000)
phat.d3 <- rep(0, length(yhat.d3))
phat.d3[yhat.d3 > 0.5] <- 1
mean(phat.d3!=Heart0[test, "AHD"])
```

```R
# boosting (d=4)
# Training model with interaction.depth = 4
boost.d4 <- gbm(AHD~., data=Heart0[train, ], n.trees=1000,
distribution="bernoulli", interaction.depth=4)

# Make predictions 
yhat.d4 <- predict(boost.d4, newdata=Heart0[test, ], type="response", n.trees=1000)
phat.d4 <- rep(0, length(yhat.d4))
phat.d4[yhat.d4 > 0.5] <- 1
mean(phat.d4!=Heart0[test, "AHD"])
```

<img src="Img/Boosting1.png">

|gbm(interaction.depth=1)|gbm(interaction.depth=2)|gbm(interaction.depth=3)|gbm(interaction.depth=4)|
|:---:|:---:|:---:|:---:|
|0.2013423|0.2013423|0.2281879|0.1879105|

## 5.4 [Ex] Hyperparameter tuning : n.trees, interaction.depth

```R
# Simulation: Boosting with d=1, 2, 3 and 4
# The number of trees: 1 to 3000

# Set grids 
set.seed(1111)
Err <- matrix(0, 3000, 4)

# Training models with n.trees, interaction.depth : 1 to 4
for (k in 1:4) {
    boost <- gbm(AHD~., data=Heart0[train, ], n.trees=3000,
distribution="bernoulli", interaction.depth=k)
    for (i in 1:3000) {
        # Make predictions of n.trees : 1 to 3000 
        yhat <- predict(boost, newdata=Heart0[test, ],
        type="response", n.trees=i)
        phat <- rep(0, length(yhat))
        phat[yhat > 0.5] <- 1
        Err[i,k] <- mean(phat!=Heart0[test, "AHD"])
    }
}

# Visualize results 
labels <- c("d = 1", "d = 2", "d = 3", "d = 4")
matplot(Err, type="l", xlab="Number of Trees", lty=1, col=1:4,
ylab="Classification Error Rate")
legend("topright", legend=labels, col=1:4, lty=1)

# View statistical reports 
colnames(Err) <- labels
apply(Err, 2, summary)
apply(Err[-c(1:100),], 2, summary)
```

<img src="Img/Boosting2.png">
<img src="Img/Boosting3.png">

- Because boosting updates the model sequentially, predictions based on only the first few trees are less accurate; the error rate settles as more trees are added. 

# 6. Bayesian Additive Regression Trees 

- **Bayesian additive regression trees(BART)** is another ensemble method that uses decision trees as its building blocks. 
- **BART** combines ideas from other ensemble methods : 
    - Each tree is constructed in a random manner as in bagging and random forest. 
    - Each tree tries to capture signals not yet accounted for by the current model as in boosting. 
- **BART** can be viewed as a Bayesian approach : 
    - $\theta_1 \sim p(\theta_1)$ : Prior distribution 
    - $\theta_1 | \theta_2, ..., \theta_k$ : Posterior distribution
    - Each time we randomly perturb a tree in order to fit the residuals, we are in fact drawing a new tree from a posterior distribution. 
    - The model is fit with a Markov chain Monte Carlo (MCMC) algorithm. 
    - Predictions from the burn-in period are discarded. 

## 6.1 BART algorithms 

- $\hat{f}^b_k(x)$ represents the prediction at $x$ for the $k$th tree used in the $b$th iteration : $k = 1, ..., K$ and $b = 1, ..., B$
- Let $\hat{f}^1_1(x) = ... = \hat{f}^1_K(x) = \frac{1}{nK}\sum_{i=1}^n y_i$
- Compute $\hat{f}^1(x) = \sum_{k=1}^K \hat{f}^1_k(x)$ 
- For $b = 2, ..., B$ : 
    - For $k = 1, 2, ..., K$ : 
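        - For $i = 1, ..., n$, compute the current partial residuals : $r_i = y_i - \sum_{k' < k} \hat{f}^b_{k'}(x_i) - \sum_{k' > k} \hat{f}^{b-1}_{k'}(x_i)$
        - Fit a new tree $\hat{f}^b_k(x)$ to $r_i$ by randomly perturbing the $k$th tree from the previous iteration, $\hat{f}^{b-1}_k(x)$. 
    - Compute $\hat{f}^b(x) = \sum_{k=1}^K \hat{f}^b_k(x)$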
- Compute the mean after $L$ burn-in samples : $\hat{f}(x) = \frac{1}{B - L}\sum_{b = L+1}^B \hat{f}^b(x)$

## 6.2 [Ex] BART : lbart/pbart

```R
# Prerequisite 
library(BART)

# Train-test split 
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)
x <- Heart[, -14]
y <- as.numeric(Heart[, 14])-1
xtrain <- x[train, ]
ytrain <- y[train]
xtest <- x[-train, ]
ytest <- y[-train]

# Logistic BART 
set.seed(11)
fit1 <- lbart(xtrain, ytrain, x.test=xtest)
names(fit1)

# Make predictions 
prob1 <- rep(0, length(ytest))
prob1[fit1$prob.test.mean > 0.5] <- 1
mean(prob1!=ytest)

# Probit BART 
set.seed(22)
fit2 <- pbart(xtrain, ytrain, x.test=xtest)

# Make Prediction 
prob2 <- rep(0, length(ytest))
prob2[fit2$prob.test.mean > 0.5] <- 1
mean(prob2!=ytest)

# Visualize results : lbart ~ pbart 
cbind(fit1$prob.test.mean, fit2$prob.test.mean)
plot(fit1$prob.test.mean, fit2$prob.test.mean, col=ytest+2,
xlab="Logistic BART", ylab="Probit BART")
abline(0, 1, lty=3, col="grey")
abline(v=0.5, lty=1, col="grey")
abline(h=0.5, lty=1, col="grey")
legend("topleft", col=c(2,3), pch=1,
legend=c("AHD = No", "AHD = Yes"))
```

<img src="Img/BART1.png">

- Misclassification error rate : 0.1946309

## 6.4 [Ex] Comparison of MSE among different models : Tree, RF, Boosting, BART

```R
# Revisit Boston data set with a quantitative response
library(MASS)
summary(Boston)
dim(Boston)

# Train-test split
set.seed(111)
train <- sample(1:nrow(Boston), floor(nrow(Boston)*2/3))
boston.test <- Boston[-train, "medv"]

# Calculate test MSE of Regression tree
library(tree)
tree.boston <- tree(medv ~ ., Boston, subset=train)
yhat <- predict(tree.boston, newdata=Boston[-train, ])
mean((yhat - boston.test)^2)

# Calculate test MSE of LSE: least squares estimates
g0 <- lm(medv ~ ., Boston, subset=train)
pred0 <- predict(g0, Boston[-train,])
mean((pred0 - boston.test)^2)

# Calculate test MSE of Bagging
library(randomForest)
g1 <- randomForest(medv ~ ., data=Boston, mtry=13, subset=train)
yhat1 <- predict(g1, newdata=Boston[-train, ])
mean((yhat1 - boston.test)^2)

# Calculate test MSE of Random Forest (m = 4)
g2 <- randomForest(medv ~ ., data=Boston, mtry=4, subset=train)
yhat2 <- predict(g2, newdata=Boston[-train, ])
mean((yhat2 - boston.test)^2)

# Calculate test MSE of Boosting (d = 4)
library(gbm)
g3 <- gbm(medv~., data = Boston[train, ], distribution="gaussian", n.trees=5000, interaction.depth=4)
yhat3 <- predict(g3, newdata=Boston[-train, ], n.trees=5000)
mean((yhat3 - boston.test)^2)

# Calculate test MSE of BART
library(BART)
g4 <- gbart(Boston[train, 1:13], Boston[train, "medv"], x.test=Boston[-train, 1:13])
yhat4 <- g4$yhat.test.mean
mean((yhat4 - boston.test)^2)
```

|Regression tree|LSE|Bagging|Random Forest|Boosting|BART|
|:---:|:---:|:---:|:---:|:---:|:---:|
|24.95443|21.52607|12.43788|8.304427|12.2265|11.68592|

## 6.5 [Ex] Simulation study of 6.4 

```R
# Simulation: 4 ensemble methods
set.seed(1111)
N <- 20
ERR <- matrix(0, N, 4)

# replicate 20 times 
for (i in 1:N) {
    train <- sample(1:nrow(Boston), floor(nrow(Boston)*2/3))
    boston.test <- Boston[-train, "medv"]
    
    # Bagging
    g1 <- randomForest(medv ~ ., data=Boston, mtry=13, subset=train)
    yhat1 <- predict(g1, newdata=Boston[-train, ])
    ERR[i,1] <- mean((yhat1 - boston.test)^2)

    # Random forest
    g2 <- randomForest(medv ~ ., data=Boston, mtry=4, subset=train)
    yhat2 <- predict(g2, newdata=Boston[-train, ])
    ERR[i, 2] <- mean((yhat2 - boston.test)^2)

    # Boosting
    g3 <- gbm(medv~., data = Boston[train, ], n.trees=5000, distribution="gaussian", interaction.depth=4)
    yhat3 <- predict(g3, newdata=Boston[-train, ], n.trees=5000)
    ERR[i, 3] <- mean((yhat3 - boston.test)^2)

    # BART
    invisible(capture.output(g4 <- gbart(Boston[train, 1:13], Boston[train, "medv"], x.test=Boston[-train, 1:13])))
    yhat4 <- g4$yhat.test.mean
    ERR[i, 4] <- mean((yhat4 - boston.test)^2)
}

# Visualize simulation results 
labels <- c("Bagging", "RF", "Boosting", "BART")
boxplot(ERR, boxwex=0.5, main="Ensemble Methods", col=2:5, names=labels, ylab="Mean Squared Errors", ylim=c(0,30))
colnames(ERR) <- labels

# Check statistical reports 
apply(ERR, 2, summary)
apply(ERR, 2, var)

# Check rankings 
RA <- t(apply(ERR, 1, rank))
RA
apply(RA, 2, table)
```

<img src="Img/BART3.png" width="400">

**<center>Mean of test MSE** 

|Bagging|RF|Boosting|BART|
|:---:|:---:|:---:|:---:|
|12.265594|12.104095|11.323421|11.057668|

**<center>Variance of test MSE** 

|Bagging|RF|Boosting|BART|
|:---:|:---:|:---:|:---:| 
|8.16965|12.691002|4.768590|7.754700|

**<center>Ranking of simulations**

|Ranks|Bagging|RF|Boosting|BART|
|:---:|:---:|:---:|:---:|:---:|
|1|3|3|8|6|
|2|4|5|3|8|
|3|6|3|6|5|
|4|7|9|3|1|


- Boosting has the smallest variance of test MSE (4.77) over the 20 replications, while BART has the lowest mean test MSE (11.06). 