## Tree-Based Methods for Regression and Classification

Random forest is a powerful method for regression and classification. Before we cover the conceptual foundations of random forests, we need to start with the simpler methods they build on. These simpler methods are called tree-based methods.

The random forest method uses several trees when making its prediction. In graph theory, a collection of trees is called a forest, and this is where the random forest method gets its name.

In other words, to make a prediction, the random forest considers the predictions of several trees. It would not, however, be useful to have many identical trees because all these trees would presumably give you the same prediction.
This is why the trees in the forest are randomized in a way we'll come back to shortly.

Tree-based methods can be used for regression and classification. These methods involve dividing the predictor space
into simpler regions using straight lines. So we first take the entire predictor space and divide it into two regions.
We then look at each of these two smaller regions in turn, divide them into yet smaller regions, and so on, continuing until we hit some stopping criterion.
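As a minimal sketch of what a stopping criterion can look like in practice, the example below uses scikit-learn's `DecisionTreeRegressor` with two common rules: a cap on the depth of the recursion (`max_depth`) and a minimum number of training observations per region (`min_samples_leaf`). The synthetic data and the specific parameter values are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration: two predictors, noisy outcome.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, size=200)

# Two stopping rules: stop after at most 3 successive splits, and never
# create a region containing fewer than 5 training observations.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X, y)

depth = tree.get_depth()          # longest chain of successive splits
n_regions = tree.get_n_leaves()   # number of final regions (leaves)
print(depth, n_regions)
```

With a depth cap of 3, each split doubles the number of regions at most, so the tree can end up with no more than 8 regions.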

So the way we divide the predictor space into smaller regions is recursive in nature. To make a prediction for a previously unseen test observation, we find the region of the predictor space where the test observation falls. In the regression setting, we return the mean of the outcomes of the training observations in that particular region,
whereas in a classification setting we return the mode, the most common element of the outcomes of the training
observations in that region.
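The regression case can be checked directly. The sketch below, on made-up data, uses the `apply` method of a fitted scikit-learn tree to find which region a new observation falls into, then compares the mean outcome of the training observations in that region with the tree's own prediction. The variable names are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data, invented for illustration.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(100, 2))
y_train = 3 * X_train[:, 0] + rng.normal(0, 0.2, size=100)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, y_train)

# A previously unseen test observation.
x_new = np.array([[0.25, 0.60]])

# apply() returns the index of the leaf (region) each observation falls into.
leaf_of_new = tree.apply(x_new)[0]
in_same_region = tree.apply(X_train) == leaf_of_new

# Mean outcome of the training observations sharing that region.
region_mean = y_train[in_same_region].mean()
prediction = tree.predict(x_new)[0]
print(region_mean, prediction)
```

The two printed numbers should agree up to floating-point rounding. A `DecisionTreeClassifier` behaves analogously, returning the most common class in the region instead of the mean.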

When we use lines to divide the predictor space into regions, these lines must be aligned with the directions
of the axes of the predictor space. And because of this constraint, we can summarize the splitting rules
in a tree. This is also why these methods are known as decision tree methods.
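To see how axis-aligned splitting rules read as a tree, one option is scikit-learn's `export_text`, which prints a fitted tree as nested rules. This is a sketch on synthetic data; the feature names `x1` and `x2` are labels chosen here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data with axis-aligned structure: the outcome jumps when
# x1 crosses 0.5 and again when x2 crosses 0.3.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(150, 2))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + np.where(X[:, 1] > 0.3, 1.0, 0.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each printed rule compares a single predictor with a single cut point,
# which is exactly what an axis-aligned line in the predictor space is.
rules = export_text(tree, feature_names=["x1", "x2"])
print(rules)
```

Every rule in the output has the form "one predictor compared with one threshold". No rule combines several predictors, which is why the whole partition fits in a tree.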

In higher dimensions, these lines become planes or hyperplanes, so we end up dividing the predictor space into high-dimensional rectangles, or boxes.

How do we decide where to make these cuts? The basic idea is that we'd like to carve out regions in the predictor space that are maximally homogeneous in terms of their outcomes. Remember, we'll ultimately use the mean or the mode
of the outcomes falling in a given region as our predicted outcome for an unseen observation. So we can minimize error by finding maximally homogeneous regions in the predictor space.

Whenever we make a split, we consider all predictors from $x_1$ to $x_p$, and for each predictor, we consider all possible cut points. We choose the combination of predictor and cut point for which the resulting division of the predictor space has the lowest value of some criterion, usually called a loss function, that we're trying to minimize.
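The search for a single split can be sketched in a few lines of NumPy. This is a simplified illustration, not how any particular library implements it: the helper names `rss` and `best_split` are hypothetical, the loss is the residual sum of squares, and the candidate cut points are assumed to be the midpoints between consecutive distinct values of each predictor.

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the mean of y (0 for an empty region)."""
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def best_split(X, y):
    """Return (predictor index, cut point, total RSS) of the best single split."""
    best = (None, None, np.inf)
    n_predictors = X.shape[1]
    for j in range(n_predictors):
        values = np.unique(X[:, j])
        # Candidate cut points: midpoints between consecutive distinct values.
        cuts = (values[:-1] + values[1:]) / 2
        for c in cuts:
            left = X[:, j] <= c
            # Loss of the division: RSS of the left region plus RSS of the right.
            total = rss(y[left]) + rss(y[~left])
            if total < best[2]:
                best = (j, float(c), total)
    return best

# Tiny made-up data set: the outcome depends only on the first predictor.
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])

j, c, total = best_split(X, y)
print(j, c, total)
```

On this toy data the search picks the first predictor with a cut at 2.5, which separates the outcomes perfectly and gives a total RSS of zero. Growing a full tree amounts to applying the same search again inside each resulting region.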

In regression, this loss function is usually the residual sum of squares (RSS). In classification, two measures are commonly used, called the Gini index and the cross-entropy. You can find their definitions online, but the basic idea
is, again, to make cuts using a combination of predictor and cut point that makes the classes within each region as homogeneous as possible.
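For reference, here is a small sketch of the two classification measures, using their standard textbook forms. If $\hat{p}_k$ is the proportion of training observations in a region that belong to class $k$, the Gini index is $\sum_k \hat{p}_k (1 - \hat{p}_k)$ and the cross-entropy is $-\sum_k \hat{p}_k \log \hat{p}_k$. The function names below are hypothetical helpers written for this illustration.

```python
import numpy as np

def class_proportions(labels):
    """Proportion of each class that actually occurs in the region."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_index(labels):
    p = class_proportions(labels)
    return float(np.sum(p * (1 - p)))

def cross_entropy(labels):
    p = class_proportions(labels)
    # sum of p * log(1/p), which equals -sum of p * log(p)
    return float(np.sum(p * np.log(1 / p)))

pure = ["a", "a", "a", "a"]    # perfectly homogeneous region
mixed = ["a", "a", "b", "b"]   # evenly mixed region

print(gini_index(pure), gini_index(mixed))
print(cross_entropy(pure), cross_entropy(mixed))
```

Both measures are zero for a region containing a single class and grow as the classes become more mixed, so minimizing either one pushes the splits toward homogeneous regions.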

## Random Forest Predictions

```python
# Random forest estimators for regression and for classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
```
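As a minimal sketch of how a forest combines its trees, the example below fits a `RandomForestRegressor` on synthetic data and compares the forest's prediction with the average of the predictions of its individual trees, which scikit-learn exposes through the fitted `estimators_` attribute. The data and the choice of 25 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration.
rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Five previously unseen observations.
X_new = rng.uniform(0, 1, size=(5, 3))

# Predictions of each individual tree: one row per tree, one column per observation.
per_tree = np.array([t.predict(X_new) for t in forest.estimators_])

# The forest's regression prediction is the average over its trees.
forest_pred = forest.predict(X_new)
print(np.allclose(per_tree.mean(axis=0), forest_pred))

# The trees are not identical: their individual predictions vary.
print(per_tree.std(axis=0))
```

The nonzero spread across trees reflects the randomization mentioned earlier: the trees disagree with one another, and the forest averages over that disagreement.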