# Lecture 9

## Growing Regression Trees

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/9d2b4129" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

Here is the tree on the toy data we used in the slides.

In [None]:
library(tree)
data <- data.frame(X1 = 1:4, X2 = c(1, 4, 3, 2), Y = c(1, 11, 7, 3))
toytree <- tree(Y ~ ., data, minsize = 1)
plot(toytree)
text(toytree)

We can also look at the splits, predicted response (column `yval`) and the loss
(column `dev`) for every node.

In [None]:
toytree$frame
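To see where the first split comes from, the following minimal sketch searches all candidate splits of the toy data by hand, using only base R. The helper names `rss` and `candidate.splits` are made up for this illustration and are not part of the `tree` package; we assume that the loss is the residual sum of squares and that candidate thresholds lie halfway between consecutive predictor values.

```r
# Toy data from the cell above.
toy <- data.frame(X1 = 1:4, X2 = c(1, 4, 3, 2), Y = c(1, 11, 7, 3))

# Residual sum of squares around the mean (hypothetical helper).
rss <- function(y) sum((y - mean(y))^2)

# Total RSS of the two child nodes for every candidate threshold
# of one predictor (hypothetical helper).
candidate.splits <- function(x, y) {
    xs <- sort(unique(x))
    thresholds <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints
    data.frame(threshold = thresholds,
               rss = sapply(thresholds,
                            function(t) rss(y[x < t]) + rss(y[x >= t])))
}

splits <- rbind(cbind(variable = "X1", candidate.splits(toy$X1, toy$Y)),
                cbind(variable = "X2", candidate.splits(toy$X2, toy$Y)))
print(splits)

# The split with the smallest total RSS is chosen first.
best.split <- splits[which.min(splits$rss), ]
print(best.split)
```

The smallest total loss is obtained for `X2 < 2.5`, which reduces the loss from 59 at the root to 10. You can compare these numbers with the `dev` column of `toytree$frame`.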

In the following cell we load the Hitters data.

In [None]:
library(ISLR)
Hitters <- na.omit(Hitters)
logSalary <- log(Hitters$Salary)
hist(Hitters$Salary, nclass = 20)
hist(log(Hitters$Salary), nclass = 20)

The histogram reveals that the salaries do not at all follow a Gaussian
distribution. With the log-transformation the data does not really look
Gaussian either, but at least it is slightly more symmetric than without.

Let us fit a tree and plot the first two splits, which define three regions, together with the data.

In [None]:
hitters.tree <- tree(log(Salary) ~ Years + Hits, data = Hitters)
plot(hitters.tree)
text(hitters.tree)
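# Map the log-salaries to 18 equal-width bins, so that every point gets a valid colour index.
col.idx <- cut(logSalary, breaks = 18, labels = FALSE)
plot(Hitters$Years, Hitters$Hits, ylab = "Hits", xlab = "Years",
     col = hcl.colors(18, palette = "RdYlBu", rev = TRUE)[col.idx])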
abline(v = 4.5, col = "red")
lines(c(4.5,30), c(117.5, 117.5), col = "red")

## Pruning Regression Trees

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/3415b0db" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

We can simply use the function `prune.tree` with argument `best = 3` to find the
best tree with 3 leaf nodes.

In [None]:
hitters.tree.pruned3 <- prune.tree(hitters.tree, best = 3)
plot(hitters.tree.pruned3)
text(hitters.tree.pruned3)
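According to its documentation, `prune.tree` selects the requested tree from a nested sequence of subtrees that is ordered by a cost-complexity measure, i.e. the loss plus a penalty proportional to the number of leaf nodes. The following base R sketch illustrates this trade-off with the toy data from above. The losses of the subtrees with 1 to 4 leaves are computed by hand, and `subtrees`, `best.size` and `alphas` are names chosen for this illustration only.

```r
# Nested subtrees of the toy tree: number of leaves and residual sum of squares.
# 1 leaf: no split; 2 leaves: X2 < 2.5; 3 leaves: additionally separate Y = 7
# from Y = 11; 4 leaves: every observation in its own leaf.
subtrees <- data.frame(size = 1:4, rss = c(59, 10, 2, 0))

# Size of the subtree that minimizes loss + alpha * (number of leaves).
best.size <- function(alpha) {
    cost <- subtrees$rss + alpha * subtrees$size
    subtrees$size[which.min(cost)]
}

alphas <- c(0, 3, 10, 60)
selected <- sapply(alphas, best.size)
print(data.frame(alpha = alphas, size = selected))
```

With increasing penalty `alpha` the selected tree shrinks from 4 leaves to a single leaf. Asking for `best = 3` corresponds to picking the subtree with 3 leaves from such a sequence.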

In the following we define some functions to fit the Hitters data with 9
predictors and to estimate the test error with 6-fold cross-validation. We
choose 6 folds because our training set has size 132, which is a multiple of 6,
so that every fold contains exactly 22 observations.

In [None]:
hitters.train <- function(train) {
    formula <- log(Salary) ~ Years + RBI + PutOuts + Hits + Walks + Runs + AtBat + HmRun + Assists
    tree(formula, Hitters, subset = train)
}
hitters.evaluate <- function(tree, set) {
    # Mean squared error on `set` for the pruned trees with 2 to 10 leaf nodes.
    sapply(2:10, function(i) {
        pruned <- prune.tree(tree, best = i)
        mean((log(Hitters[set, 'Salary']) - predict(pruned, Hitters[set, ]))^2)
    })
}
hitters.cv <- function(train) {
    res <- sapply(1:6, function(v) {
                            idx.test <- seq((v-1)*22 + 1, v*22) # fold index
                            this.fold.test <- train[idx.test]   # validation
                            this.fold.train <- train[-idx.test] # training
                            tree <- hitters.train(this.fold.train)
                            hitters.evaluate(tree, this.fold.test)
                        })
    rowMeans(data.frame(res))
}
hitters.train.and.evaluate <- function() {
    train <- sample(nrow(Hitters), 132)
    tree <- hitters.train(train)
    list(train = hitters.evaluate(tree, train),
         test = hitters.evaluate(tree, -train),
         cv = hitters.cv(train),
         tree = tree)
}
set.seed(1)
res <- replicate(100, hitters.train.and.evaluate()) # we run everything for 100 different training sets
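The fold arithmetic used in `hitters.cv` can be checked in isolation. The sketch below repeats the index computation for 132 training points and 6 folds and verifies that the validation folds are disjoint and cover every position exactly once. The names `n.train`, `n.folds`, `fold.size` and `folds` are introduced here for illustration only.

```r
n.train <- 132
n.folds <- 6
fold.size <- n.train / n.folds  # 22 observations per fold

# Positions (within the training set) that serve as validation set in fold v.
folds <- lapply(1:n.folds, function(v) seq((v - 1) * fold.size + 1, v * fold.size))

# Every position between 1 and 132 should appear in exactly one fold.
all.positions <- sort(unlist(folds))
print(sapply(folds, range))
print(all(all.positions == 1:n.train))
```

Because the training indices returned by `sample` are already in random order, taking consecutive blocks of 22 positions gives random folds.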

The function `hitters.train.and.evaluate` returns training and test errors, the
cross-validation estimate of the test error and the full tree itself.
We can plot individual trees. To look at other trees, change the `tree.index` to
another number between 1 and 100. Do you see how different the trees are,
depending on which training set they were fitted on?

In [None]:
tree.index <- 2
example_tree <- res[4, tree.index]$tree
plot(example_tree, col = 'darkgreen')
text(example_tree)

To plot the results including error bars we define the function `std.plot`.

In [None]:
std.plot <- function(data, x = 2:10, ...) {
    df <- data.frame(data)
    m <- rowMeans(df)
    std <- sqrt(rowMeans((df - m)^2))
    points(x, m, type = "b", ...)
    arrows(x, m - std, x, m + std, length=0.05, angle=90, code=3, ...)
}
plot(c(), ylim = c(.1, .57), xlim = c(2, 10), xlab = "Tree Size", ylab = "Mean Squared Error")
std.plot(res[1,])
std.plot(res[2,], col = "red")
std.plot(res[3,], col = "blue")
legend("bottomleft", c("train", "test", "CV"), bty = 'n',
                     col = c("black", "red", "blue"), lty = 1)
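Note that the error bars in `std.plot` are computed by dividing by the number of replicates, whereas R's built-in `sd` divides by the number of replicates minus one. The following sketch compares the two on simulated numbers. The matrix `errors` is random data made up for this check and has nothing to do with the Hitters results.

```r
set.seed(42)
# 9 tree sizes (rows) times 100 replicates (columns) of made-up errors.
errors <- matrix(rnorm(9 * 100), nrow = 9)
df <- data.frame(errors)

m <- rowMeans(df)
std.pop <- sqrt(rowMeans((df - m)^2))  # as in std.plot: divide by n
std.sample <- apply(errors, 1, sd)     # built-in: divide by n - 1

n <- ncol(errors)
# The two differ only by the factor sqrt((n - 1) / n).
print(all.equal(unname(std.pop), std.sample * sqrt((n - 1) / n)))
```

With 100 replicates the factor is about 0.995, so the difference is hardly visible in the plot.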

We can conclude from the plot above that the optimal tree size is around 3 or 4.
Let us fit a tree on the full data set, prune it to 3 leaf nodes and plot it.

In [None]:
final.tree <- prune.tree(hitters.train(1:nrow(Hitters)), best = 3)
plot(final.tree)
text(final.tree)

Finding the optimal tree size with cross-validation is sometimes also called
hyper-parameter tuning. If you are interested in seeing how the modern `tidymodels`
library handles hyper-parameter tuning, you can follow this
[link](https://www.tidymodels.org/start/tuning/).

## Classification Trees

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/fe6070e8" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

In the following cell we fit a classification tree to the `Heart` data.
The `as.factor` function is used to tell R which columns contain
categorical data.

In [None]:
Heart <- read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Heart.csv")[,-1]
Heart$AHD <- as.factor(Heart$AHD)
Heart$ChestPain <- as.factor(Heart$ChestPain)
Heart$Thal <- as.factor(Heart$Thal)
Heart$Sex <- as.factor(Heart$Sex)
heart.tree <- tree(AHD ~ ., Heart)
plot(heart.tree)
text(heart.tree)

Once you are ready, please answer the [quiz questions](https://moodle.epfl.ch/mod/quiz/view.php?id=1112503).

## Trees Versus Other Methods

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/51e1e2d4" frameborder="0" allow="fullscreen" allowfullscreen></iframe>')

## Exercises
### Conceptual

**Q1.**
(a) Draw an example (of your own invention) of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions $R_1,R_2,...$, the cutpoints $t_1,t_2,...$, and so forth.

(b) Draw an example (of your own invention) of a partition of two-dimensional
feature space that could not result from recursive binary splitting. Justify
why it cannot be the result of recursive binary splitting.

**Q2.**
![](img/8.12.png)

(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of the figure above. The numbers inside the boxes indicate the mean of $Y$ within each region.

(b) Create a diagram similar to the left-hand panel of the figure above, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

### Applied

**Q3.** In this exercise we will look at the "Carseats" data set. You can get information about it by loading `library(ISLR)` and running `?Carseats`. We will seek to predict "Sales" using trees.

(a) Split the data set into a training set and a test set.

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test error (mean squared error) do you obtain?

(c) Instead of treating "Sales" as a quantitative variable we could recode it as
a qualitative variable by classifying the sales as "low" if they are below 5, "medium" if they are at least 5 but below 9, and "high" otherwise, i.e. we will introduce the new response variable `sales.class <- as.factor(ifelse(Carseats$Sales < 5, "low", ifelse(Carseats$Sales < 9, "medium", "high")))`. Fit a classification tree to the Carseats data with response `sales.class`. Is the tree similar to the one in (b)?

(d) Use cross-validation in order to determine the optimal level of tree complexity for the tree in (c). Does pruning the tree improve the test error rate?
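If you want to check the recoding in Q3 (c) before applying it to the Carseats data, here is a small base R sketch on a few made-up sales values (the vector `sales` is hypothetical). It compares the nested `ifelse` with an equivalent call to `cut`, which is an alternative and not required for the exercise.

```r
# A few made-up sales values, including the boundary cases 5 and 9.
sales <- c(3.2, 5, 7.5, 9, 12.1)

via.ifelse <- as.factor(ifelse(sales < 5, "low",
                        ifelse(sales < 9, "medium", "high")))

# right = FALSE gives the intervals [-Inf, 5), [5, 9) and [9, Inf).
via.cut <- cut(sales, breaks = c(-Inf, 5, 9, Inf),
               labels = c("low", "medium", "high"), right = FALSE)

print(data.frame(sales, via.ifelse, via.cut))
print(all(as.character(via.ifelse) == as.character(via.cut)))
```

Both versions assign the same classes. They differ only in the order of the factor levels: `as.factor` sorts them alphabetically, whereas `cut` keeps the order low, medium, high.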

**Q4.** (optional)
Fit a classification tree to the Histopathologic Cancer Detection data set that
we studied in the last exercise of sheet 7* - part 2.
To tell R that the 0s and 1s in PCaml_y should be treated as values of a
categorical response, you may use `PCaml_y <- as.factor(PCaml_y)`.
Compare your results to the ones obtained with linear regression and
convolutional networks.