<h1 id="Contents">Contents<a href="#Contents"></a></h1>
        <ol>
        <li><a class="" href="#Tree-Based-Methods">Tree-Based Methods</a></li>
<ol><li><a class="" href="#The-Basics-of-Decision-Trees">The Basics of Decision Trees</a></li>
<ol><li><a class="" href="#Regression-Trees">Regression Trees</a></li>
<ol><li><a class="" href="#Tree-Pruning">Tree Pruning</a></li>
</ol><li><a class="" href="#Classification-Trees">Classification Trees</a></li>
<li><a class="" href="#Trees-Versus-Linear-Models">Trees Versus Linear Models</a></li>
</ol><li><a class="" href="#Bagging">Bagging</a></li>
<ol><li><a class="" href="#Out-of-Bag-Error-Estimation">Out-of-Bag Error Estimation</a></li>
</ol><li><a class="" href="#Random-Forests">Random Forests</a></li>
<li><a class="" href="#Boosting">Boosting</a></li>
</ol>
</ol>

# Tree-Based Methods

Tree-based methods involve stratifying or segmenting the predictor space
into a number of simple regions. In order to make a prediction for a given
observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used
to segment the predictor space can be summarized in a tree, these types of
approaches are known as decision tree methods.

## The Basics of Decision Trees

### Regression Trees

Roughly speaking,
there are two steps to form a decision tree:
1. We divide the predictor space—that is, the set of possible values for
$X_1, X_2, \ldots , X_p$—into J distinct and non-overlapping regions,
$R_1, R_2, \ldots , R_J$.
2. For every observation that falls into the region $R_j$, we make the same
prediction, which is simply the mean of the response values for the
training observations in $R_j$.

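How do we construct the regions $R_1, R_2, \ldots , R_J$? The goal is to find regions that minimize the residual sum of squares (RSS):
$$
\text{RSS} = \sum_{j=1}^J \sum_{i:\, x_i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
$$
where $\hat{y}_{R_j}$ is the mean response of the training observations in region $R_j$.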

Unfortunately, it is computationally infeasible to consider every
possible partition of the feature space into J boxes. For this reason, we take
a top-down, greedy approach that is known as recursive binary splitting. The
approach is top-down because it begins at the top of the tree (at which point
all observations belong to a single region) and then successively splits the
predictor space; each split is indicated via two new branches further down
on the tree. It is greedy because at each step of the tree-building process,
the best split is made at that particular step, rather than looking ahead
and picking a split that will lead to a better tree in some future step.

In order to perform recursive binary splitting, we first select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into
the regions $\{X|X_j < s \}$ and $\{X|X_j \ge s\}$ leads to the greatest possible
reduction in RSS.

That is, we consider all
predictors $X_1, X_2, \ldots , X_p$, and all possible values of the cutpoint $s$ for each of
the predictors, and then choose the predictor and cutpoint such that the
resulting tree has the lowest RSS. In greater detail, for any $j$ and $s$, we
define the pair of half-planes:
$$
R_1(j,s) = \{X|X_j < s\}\quad \text{and}\quad R_2(j,s)= \{X|X_j \ge s\}
$$
and we seek the values of $j$ and $s$ that minimize:
$$
\sum_{i:\, x_i\in R_1(j,s)}\left( y_i - \hat{y}_{R_1} \right)^2 +\sum_{i:\, x_i\in R_2(j,s)}\left( y_i - \hat{y}_{R_2} \right)^2
$$
where $\hat{y}_{R_1}$ and $\hat{y}_{R_2}$ are the mean responses of the training observations in $R_1(j,s)$ and $R_2(j,s)$.

Once we find the j and s, we repeat the process, looking for the best predictor and best
cutpoint in order to split the data further so as to minimize the RSS within
each of the resulting regions. However, this time, instead of splitting the
entire predictor space, we split one of the two previously identified regions.
We now have three regions. Again, we look to split one of these three regions
further, so as to minimize the RSS. The process continues until a stopping
criterion is reached; for instance, we may continue until no region contains
more than five observations. Once the regions $R_1, R_2, \ldots , R_J$ have been created, we predict the response
for a given test observation using the mean of the training observations in
the region to which that test observation belongs.
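
A single step of recursive binary splitting can be sketched in plain Python as follows. This is a minimal illustration, not a full tree builder: the helper names (`rss`, `best_split`) and the toy data are made up for this example, and the candidate cutpoints are assumed to be midpoints between consecutive distinct predictor values.

```python
def rss(values):
    """Residual sum of squares of `values` around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s, total_rss) for the single best binary split.

    X is a list of rows (one list of p predictor values per observation)
    and y is the list of responses. Every predictor j and every midpoint
    s between consecutive distinct values of X_j is tried.
    """
    best = None
    p = len(X[0])
    for j in range(p):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]    # R_1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] >= s]  # R_2(j, s)
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data (hypothetical): the response jumps when the first predictor exceeds ~6.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 7.0], [12.0, 6.0]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
j, s, total = best_split(X, y)
print("split on predictor", j, "at cutpoint", s, "with RSS", round(total, 3))
```

Growing a full tree amounts to applying `best_split` again inside each of the two resulting regions until a stopping criterion is met.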

#### Tree Pruning

The process described above may produce good predictions on the training
set, but is likely to overfit the data, leading to poor test set performance. One possible
alternative to the process described above is to build the tree only so long
as the decrease in the RSS due to each split exceeds some (high) threshold.
This strategy will result in smaller trees, but is too short-sighted since a
seemingly worthless split early on in the tree might be followed by a very
good split—that is, a split that leads to a large reduction in RSS later on. Therefore, a better strategy is to grow a very large tree $T_0$, and then
prune it back in order to obtain a subtree.

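*Cost complexity pruning*, also known as weakest link pruning, gives us
a way to do just this. Rather than considering every possible subtree, we
consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$, there is a subtree $T\subset T_0$ such that:
$$
\sum_{m=1}^{|T|}\sum_{i:\, x_i\in R_m}\left( y_i - \hat{y}_{R_m} \right)^2+\alpha|T|
$$

is as small as possible. Here $|T|$ is the number of terminal nodes of the tree $T$, $R_m$ is the region corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the mean of the training observations in $R_m$. The tuning parameter $\alpha$ controls the trade-off between the subtree's complexity and its fit to the training data: when $\alpha = 0$ the subtree is simply $T_0$, and as $\alpha$ increases the penalty on terminal nodes favors smaller subtrees. We can select a value of
$\alpha$ using a validation set or using cross-validation. We then return to the
full data set and obtain the subtree corresponding to $\alpha$.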
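
To see how $\alpha$ trades fit against size, the sketch below applies the penalized criterion to a handful of candidate subtrees. The `(number of terminal nodes, training RSS)` pairs are invented for illustration and do not come from a real fitted tree; the function names are likewise hypothetical.

```python
# Hypothetical nested subtrees of a large tree T_0, each summarized as
# (number of terminal nodes |T|, training RSS). The numbers are made up.
subtrees = [(1, 100.0), (2, 60.0), (3, 40.0), (5, 30.0), (8, 27.0)]


def penalized_cost(n_leaves, rss, alpha):
    """Cost complexity criterion: RSS + alpha * |T|."""
    return rss + alpha * n_leaves


def best_subtree(candidates, alpha):
    """Return the (|T|, RSS) pair with the smallest penalized cost."""
    return min(candidates, key=lambda t: penalized_cost(t[0], t[1], alpha))


for alpha in [0.0, 2.0, 10.0, 50.0]:
    n_leaves, rss = best_subtree(subtrees, alpha)
    print(f"alpha={alpha:5.1f} -> subtree with {n_leaves} terminal nodes (RSS {rss})")
```

With these numbers, $\alpha = 0$ selects the largest subtree, and increasing $\alpha$ moves the choice toward progressively smaller ones.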

>The Algorithm
>
>1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
>2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of $\alpha$.
>3. Use K-fold cross-validation to choose $\alpha$. That is, divide the training observations into $K$ folds. For each $k = 1, \ldots , K$:
>    - (a) Repeat Steps 1 and 2 on all but the $k$th fold of the training data.
>    - (b) Evaluate the mean squared prediction error on the data in the left-out $k$th fold, as a function of $\alpha$.
>
>    Average the results for each value of $\alpha$, and pick $\alpha$ to minimize the average error.
>4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$.

### Classification Trees

A classification tree is very similar to a regression tree, except that it is
used to predict a qualitative response rather than a quantitative one. Recall
that for a regression tree, the predicted response for an observation is
given by the mean response of the training observations that belong to the
same terminal node. In contrast, for a classification tree, we predict that
each observation belongs to the most commonly occurring class of training
observations in the region to which it belongs.

To grow a classification tree, we use recursive binary splitting, just as in the regression setting. However, RSS cannot be used as the criterion for making the binary splits; a natural alternative is the *classification error rate*. The classification error rate is
simply the fraction of the training observations in a region that do not
belong to the most common class:
$$
E = 1-\max_k(\hat{p}_{mk})
$$
Here $\hat{p}_{mk}$ represents the proportion of training observations in the $m$th
region that are from the $k$th class.

However, it turns out that classification
error is not sufficiently sensitive for tree-growing, and in practice two other
measures are preferable.
The Gini index is defined by
$$
G = \sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk})
$$

a measure of total variance across the K classes. It is not hard to see
that the Gini index takes on a small value if all of the $\hat{p}_{mk}$’s are close to
zero or one. For this reason the Gini index is referred to as a measure of
node purity—a small value indicates that a node contains predominantly
observations from a single class.

An alternative to the Gini index is cross-entropy, given by
$$
D = -\sum_{k=1}^K\hat{p}_{mk}\log\hat{p}_{mk}
$$

One can show that
the cross-entropy will take on a value near zero if the $\hat{p}_{mk}$’s are all near
zero or near one. Therefore, like the Gini index, the cross-entropy will take
on a small value if the mth node is pure.

When building a classification tree, either the Gini index or the cross-entropy is typically used to evaluate the quality of a particular split,
since these two approaches are more sensitive to node purity than is the
classification error rate. Any of these three approaches might be used when
pruning the tree, but the classification error rate is preferable if prediction
accuracy of the final pruned tree is the goal.
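
The three measures can be compared on a small two-class example. The sketch below is illustrative only: the class counts are hypothetical, and the helper `weighted_impurity` (the size-weighted average of a measure over the child nodes of a split) is an assumption about how one would score a split, not something defined in the text above. Both candidate splits have the same classification error rate, yet the Gini index and the cross-entropy both prefer the split that produces a pure node.

```python
import math


def proportions(counts):
    """Convert per-class counts in a node into class proportions p_mk."""
    total = sum(counts)
    return [c / total for c in counts]


def classification_error(p):
    return 1.0 - max(p)


def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)


def cross_entropy(p):
    # Classes with p_mk = 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


def weighted_impurity(children, measure):
    """Average `measure` over child nodes, weighted by node size.

    `children` is a list of per-class count lists, one per child node.
    """
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * measure(proportions(counts)) for counts in children)


# Two hypothetical splits of a node holding 400 observations of each class.
split_a = [[300, 100], [100, 300]]  # two equally mixed children
split_b = [[200, 400], [200, 0]]    # one mixed child, one pure child

for name, measure in [("error", classification_error),
                      ("gini", gini),
                      ("entropy", cross_entropy)]:
    a = weighted_impurity(split_a, measure)
    b = weighted_impurity(split_b, measure)
    print(f"{name:8s} split A: {a:.4f}   split B: {b:.4f}")
```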

>Decision trees can be constructed
even in the presence of qualitative predictor variables.

### Trees Versus Linear Models

Linear regression assumes a model of the form:
$$
f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j
$$
whereas regression trees assume a model of the form:
$$
f(X) = \sum_{m=1}^M c_m \cdot 1_{(X\in R_m)}
$$
where $R_1, \ldots, R_M$ represent a partition of the feature space.

Which model is better depends on the problem at hand. If the
relationship between the features and the response is well approximated
by a linear model, then an approach such as linear regression
will likely work well, and will outperform a method such as a regression
tree that does not exploit this linear structure. If instead there is a highly
non-linear and complex relationship between the features and the response,
as captured by the tree model, then decision trees may outperform classical
approaches.

## Bagging

Trees generally do not have the same level of predictive
accuracy as some of the other regression and classification approaches. However, by aggregating many decision trees, using methods like bagging,
random forests, and boosting, the predictive performance of trees can be
substantially improved.

Decision trees suffer from high variance.
This means that if we split the training data into two parts at random,
and fit a decision tree to both halves, the results that we get could be
quite different. Recall that averaging a set of observations reduces variance: given $n$ independent observations $Z_1, \ldots, Z_n$, each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is $\sigma^2/n$. Hence a natural way to reduce the variance, and thereby increase the prediction accuracy, of a statistical learning method is to take many training sets
from the population, build a separate prediction model using each training
set, and average the resulting predictions. In other words, we could calculate $\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)$ using $B$ separate training sets, and average
them in order to obtain a single low-variance statistical learning model,
given by
$$
\hat{f}^{\text{avg}}(x) = \frac{1}{B}\sum_{i=1}^B\hat{f}^i(x)
$$

Of course, this is not practical because we generally do not have access
to multiple training sets. Instead, we can bootstrap: we take repeated samples (with replacement) from the single training set to generate $B$ different bootstrapped training sets, fit a decision tree to each one, and average the predictions. This is called bagging.

To apply bagging to regression
trees, we simply construct B regression trees using B bootstrapped training
sets, and average the resulting predictions. These trees are grown deep,
and are not pruned. Hence each individual tree has high variance, but
low bias. Averaging these B trees reduces the variance. 
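
The procedure can be sketched in plain Python as below. This is a simplified illustration under several assumptions: the data are synthetic with a single predictor, the tree builder `fit_tree` is a bare-bones stand-in that keeps splitting until a leaf would fall below `min_leaf` observations, and all function names are invented for this example.

```python
import random


def sse(values):
    """Sum of squared deviations of `values` from their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(xs, ys, min_leaf=2):
    """Grow an unpruned regression tree on one predictor.

    Returns ('leaf', mean) or ('node', cutpoint, left_subtree, right_subtree).
    """
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, s)
    if best is None:  # no admissible split: make a leaf
        return ("leaf", sum(ys) / len(ys))
    s = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < s]
    right = [(x, y) for x, y in zip(xs, ys) if x >= s]
    return ("node", s,
            fit_tree([x for x, _ in left], [y for _, y in left], min_leaf),
            fit_tree([x for x, _ in right], [y for _, y in right], min_leaf))


def predict_tree(tree, x):
    while tree[0] == "node":
        tree = tree[2] if x < tree[1] else tree[3]
    return tree[1]


def bagged_fit(xs, ys, B, rng):
    """Fit B trees, each on a bootstrap sample drawn with replacement."""
    n = len(xs)
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def bagged_predict(trees, x):
    """Average the predictions of the individual trees."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)


# Synthetic data (hypothetical): a step from 0 to 3 at x = 2, plus noise.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [(0.0 if x < 2 else 3.0) + rng.gauss(0, 0.3) for x in xs]
trees = bagged_fit(xs, ys, B=25, rng=rng)
print(bagged_predict(trees, 1.0), bagged_predict(trees, 3.0))
```

Each tree on its own follows the noise in its bootstrap sample closely; the averaged prediction is smoother because the individual errors partly cancel.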

#### Out-of-Bag Error Estimation

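One can show
that on average, each bagged tree makes use of around two-thirds of the
observations. The remaining one-third of the observations not used to fit a
given bagged tree are referred to as the out-of-bag (OOB) observations. We
can predict the response for the $i$th observation using each of the trees in which that observation was OOB. This will yield around $B/3$ predictions
for the $i$th observation. To obtain a single prediction for the $i$th observation, we can average these predicted responses (for regression) or take a majority vote (for classification). An OOB prediction can be obtained in this way for each of the $n$ observations, from which the overall OOB MSE (for regression) or classification error (for classification) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation. It can
be shown that with $B$ sufficiently large, OOB error is virtually equivalent
to leave-one-out cross-validation error. The OOB approach for estimating
the test error is particularly convenient when performing bagging on large
data sets for which cross-validation would be computationally onerous.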
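
The "two-thirds" figure follows from a short calculation: a bootstrap sample draws $n$ observations with replacement, so a given observation is missed by every draw with probability $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$ as $n$ grows. The snippet below simply evaluates this expression; the function name is made up for the example.

```python
import math


def prob_out_of_bag(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


for n in [10, 100, 1000, 10000]:
    print(n, round(prob_out_of_bag(n), 4))
print("limit e^-1:", round(math.exp(-1), 4))
```

So roughly 63% of the observations appear in any one bootstrap sample, and roughly 37% are out-of-bag for the corresponding tree.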

## Random Forests

Random forests provide an improvement over bagged trees by way of a
small tweak that decorrelates the trees. As in bagging, we build a number
of decision trees on bootstrapped training samples. But when building these
decision trees, each time a split in a tree is considered, a random sample of
m predictors is chosen as split candidates from the full set of p predictors.
The split is allowed to use only one of those m predictors. A fresh sample of
m predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$—that
is, the number of predictors considered at each split is approximately equal
to the square root of the total number of predictors.

Considering only a few of the available predictors at each split may sound strange, but it has a clever rationale. Suppose
that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged
trees, most or all of the trees will use this strong predictor in the top split.
Consequently, all of the bagged trees will look quite similar to each other.
Hence the predictions from the bagged trees will be highly correlated, and averaging many highly correlated quantities does not reduce variance as much as averaging many uncorrelated quantities.

Random forests overcome this problem by forcing each split to consider
only a subset of the predictors. Therefore, on average $(p - m)/p$ of the
splits will not even consider the strong predictor, and so other predictors
will have more of a chance. We can think of this process as decorrelating
the trees, thereby making the average of the resulting trees less variable
and hence more reliable.
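
The $(p - m)/p$ figure can be checked with a quick simulation of the candidate-sampling step alone. The values below ($p = 16$, so $m = 4$, and predictor index 0 playing the role of the strong predictor) are arbitrary choices for illustration, and `sample_split_candidates` is a hypothetical helper rather than part of any library.

```python
import math
import random


def sample_split_candidates(p, m, rng):
    """Draw m distinct predictor indices out of p, as done afresh at each split."""
    return rng.sample(range(p), m)


p = 16
m = round(math.sqrt(p))  # m is about sqrt(p); here m = 4
strong = 0               # index of the (hypothetical) strong predictor
rng = random.Random(0)

n_splits = 20000
excluded = sum(
    strong not in sample_split_candidates(p, m, rng) for _ in range(n_splits)
)
print("simulated fraction of splits without the strong predictor:", excluded / n_splits)
print("theoretical (p - m) / p:", (p - m) / p)
```

Setting $m = p$ makes every predictor a candidate at every split, which reduces the procedure to bagging.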

## Boosting

Boosting works in a similar way to bagging, except that the trees are
grown sequentially: each tree is grown using information from previously
grown trees. Boosting does not involve bootstrap sampling; instead each
tree is fit on a modified version of the original data set. The algorithm is:

1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set.
2. For $b = 1, \dots, B$, repeat:
   - Fit a tree $\hat{f}^b$ with $d$ splits ($d + 1$ terminal nodes) to the training data $(X, r)$.
   - Update $\hat{f}$ by adding in a shrunken version of the new tree: $$\hat{f}(x)\leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$$
   - Update the residuals: $$r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$$
3. Output the boosted model: $$\hat{f}(x) = \sum_{b=1}^B\lambda \hat{f}^b(x)$$

Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially
overfitting, the boosting approach instead learns slowly. Given the current
model, we fit a decision tree to the residuals from the model. That is, we
fit a tree using the current residuals, rather than the outcome $Y$, as the response. We then add this new decision tree into the fitted function in order
to update the residuals. Each of these trees can be rather small, with just
a few terminal nodes, determined by the parameter d in the algorithm. By
fitting small trees to the residuals, we slowly improve $\hat{f}$ in areas where it
does not perform well. The shrinkage parameter $\lambda$ slows the process down
even further, allowing more and different shaped trees to attack the residuals. In general, statistical learning approaches that learn slowly tend to
perform well. Note that in boosting, unlike in bagging, the construction of
each tree depends strongly on the trees that have already been grown.
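
The boosting algorithm can be sketched in a few lines of Python for the simplest case $d = 1$, where each tree is a stump with a single split. This is an illustrative sketch under stated assumptions: one predictor, noiseless synthetic data ($y = x^2$), and arbitrary values $B = 200$ and $\lambda = 0.1$; the function names are invented for the example.

```python
def fit_stump(xs, rs):
    """Fit a tree with d = 1 split to the residuals rs.

    Returns (cutpoint, left_mean, right_mean).
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [r for x, r in zip(xs, rs) if x < s]
        right = [r for x, r in zip(xs, rs) if x >= s]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, s, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    s, lm, rm = stump
    return lm if x < s else rm


def boost(xs, ys, B, lam):
    """Run B boosting rounds; return the fitted stumps and the final residuals."""
    residuals = list(ys)                      # step 1: f_hat = 0, so r_i = y_i
    stumps = []
    for _ in range(B):                        # step 2
        stump = fit_stump(xs, residuals)      # fit to (X, r), not (X, y)
        stumps.append(stump)
        residuals = [r - lam * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return stumps, residuals


def predict(stumps, lam, x):
    """Step 3: the boosted model is the sum of the shrunken trees."""
    return sum(lam * predict_stump(st, x) for st in stumps)


# Synthetic, noiseless data (hypothetical): y = x^2 on [0, 3).
xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]
lam = 0.1
stumps, residuals = boost(xs, ys, B=200, lam=lam)

baseline_mse = sum(y ** 2 for y in ys) / len(ys)          # error of f_hat = 0
train_mse = sum(r ** 2 for r in residuals) / len(residuals)
print("MSE before boosting:", round(baseline_mse, 4))
print("MSE after boosting: ", round(train_mse, 4))
```

Because the residuals are updated in step with the model, the final residuals equal $y_i - \hat{f}(x_i)$, so the training error can be read directly from them.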