# Random Forest
Decision trees can suffer from high variance, which makes their results sensitive to the specific training data used.

Building multiple models from samples of your training data, called bagging, can reduce this variance, but the resulting trees tend to be highly correlated.

Random Forest is an extension of bagging. In addition to building trees from multiple samples of your training data, it constrains the features that can be used to build each tree, forcing the trees to be different. This, in turn, can give a lift in performance.

In this tutorial, you will discover how to implement the Random Forest algorithm from scratch in Python.

After completing this tutorial, you will know:
-  The difference between bagged decision trees and the random forest algorithm.
-  How to construct bagged decision trees with more variance.
-  How to apply the random forest algorithm to a predictive modeling problem.

Let's get started.

## Random Forest Algorithm

Decision trees involve the greedy selection of the best split point from the dataset at each step.

This algorithm makes decision trees susceptible to high variance if they are not pruned. This high variance can be harnessed and reduced by creating multiple trees with different samples of the training dataset (different views of the problem) and combining their predictions. This approach is called bootstrap aggregation or bagging for short.
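For classification, one common way to combine the predictions of several trees is a majority vote. The sketch below assumes each tree has already produced a class label for the same row; `bagging_vote` and the example predictions are hypothetical and only illustrate the idea.

```python
from collections import Counter

def bagging_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    # most_common(1) returns a list holding the (label, count) pair of the top label
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five bagged trees for a single row
tree_predictions = [1, 0, 1, 1, 0]
vote = bagging_vote(tree_predictions)
print(vote)  # 1
```

Using an odd number of trees avoids tied votes in a two-class problem.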

A limitation of bagging is that the same greedy algorithm is used to create each tree, so the same or very similar split points are likely to be chosen in each tree, making the trees very similar (correlated). This, in turn, makes their predictions similar, which undoes much of the diversity that bagging was meant to introduce.

We can force the decision trees to be different by limiting the features (columns) that the greedy algorithm can evaluate at each split point when creating the tree. This is called the Random Forest algorithm.

Like bagging, multiple samples of the training dataset are taken and a different tree is trained on each. The difference is that each time a split is made in the data and added to the tree, only a fixed-size random subset of attributes can be considered.

For classification problems, the type of problems we will look at in this tutorial, the number of attributes to be considered for the split is limited to the square root of the number of input features.

$$\text{number of features for split} = \sqrt{\text{total input features}}$$
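As a quick worked example of this rule, the sketch below computes the number of features per split for a dataset with 60 input features. Truncating the square root to a whole number is an assumption here (the rule itself does not say how to round), and `features_per_split` is a hypothetical helper name.

```python
from math import sqrt

def features_per_split(n_inputs):
    """Number of candidate features per split: the truncated square root."""
    # int() truncates toward zero (assumed rounding convention); keep at least 1
    return max(1, int(sqrt(n_inputs)))

n_features = features_per_split(60)
print(n_features)  # 7
```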

The result of this one small change is a set of trees that are more different from each other (less correlated), giving predictions that are more diverse and a combined prediction that often performs better than a single tree or bagging alone.

## Sonar Dataset

The dataset we will use in this tutorial is the Sonar dataset.

This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders. There are 208 observations.

It is a well-understood dataset. All of the variables are continuous and generally in the range of 0 to 1. The output variable is a string "M" for mine and "R" for rock, which will need to be converted to integers 1 and 0.
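The label conversion could look something like the sketch below. It assumes the class label sits in the last column of each row and that "M" maps to 1 and "R" to 0; the text does not fix which label gets which integer, so that assignment is a guess. The rows are made up for illustration, and `encode_labels` is a hypothetical helper.

```python
def encode_labels(rows, mapping):
    """Replace the string class label in the last column with an integer."""
    for row in rows:
        row[-1] = mapping[row[-1]]
    return rows

# Made-up rows for illustration only (real Sonar rows have 60 inputs)
rows = [[0.02, 0.37, "R"], [0.45, 0.11, "M"]]
label_map = {"M": 1, "R": 0}  # assumed assignment of integers to labels
encoded = encode_labels(rows, label_map)
print(encoded)
```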

By predicting the class with the most observations in the dataset (M or mines) the Zero Rule Algorithm can achieve an accuracy of 53%.

You can learn more about this dataset at the [UCI Machine Learning repository.](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks))

Download the dataset for free and place it in your working directory with the filename __sonar.all-data.csv__.

## Tutorial

This tutorial is broken down into 2 steps.

1. Calculating Splits.
2. Sonar Dataset Case Study.

These steps provide the foundation that you need to implement and apply the Random Forest algorithm to your own predictive modeling problems.

### 1. Calculating Splits

In a decision tree, split points are chosen by finding the attribute and the value of that attribute that results in the lowest cost.

For classification problems, this cost function is often the Gini index, which measures the purity of the groups of data created by the split point. A Gini index of 0 indicates perfect purity, where the class values are perfectly separated into two groups in the case of a two-class classification problem.
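To make the two extremes concrete, the sketch below scores a perfect split and a maximally mixed split on tiny made-up data. `weighted_gini` is a compact, hypothetical helper written only for this illustration; it assumes the class label is the last value in each row.

```python
def weighted_gini(groups, classes):
    """Gini index of a split: each group's impurity weighted by its size."""
    total = sum(len(group) for group in groups)
    score = 0.0
    for group in groups:
        if not group:
            continue  # an empty group contributes nothing
        labels = [row[-1] for row in group]
        purity = sum((labels.count(c) / len(group)) ** 2 for c in classes)
        score += (1.0 - purity) * (len(group) / total)
    return score

# Each row is [feature, class]; two groups per split
perfect = weighted_gini([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1])
mixed = weighted_gini([[[1, 0], [1, 1]], [[1, 0], [1, 1]]], [0, 1])
print(perfect, mixed)  # 0.0 0.5
```

For two classes, 0.5 is the worst possible score: each group holds an even mix of both classes.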

Finding the best split point in a decision tree involves evaluating the cost of each value in the training dataset for each input variable.

For bagging and random forest, this procedure is executed upon a sample of the training dataset, made with replacement. Sampling with replacement means that the same row may be chosen and added to the sample more than once.
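Sampling with replacement can be sketched as follows. The function name `subsample` and its `ratio` parameter are illustrative assumptions rather than part of the code in this section, and the dataset is a toy list of single-value rows.

```python
from random import randrange, seed

def subsample(dataset, ratio=1.0):
    """Draw a random sample of rows with replacement."""
    sample = []
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        # The same index can be drawn again, so rows may repeat
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

seed(1)  # fixed seed so repeated runs draw the same sample
dataset = [[i] for i in range(10)]
sample = subsample(dataset, 0.5)
print(sample)
```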

We can update this procedure for Random Forest. Instead of enumerating all values for input attributes in search of the split with the lowest cost, we can create a sample of the input attributes to consider.

This sample of input attributes can be chosen randomly and without replacement, meaning that each input attribute needs only be considered once when looking for the split point with the lowest cost.
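A minimal sketch of this step uses `random.sample` from the Python standard library, which draws without replacement. The figures of 60 inputs and 7 features per split are assumed values for illustration.

```python
import random

random.seed(1)  # fixed seed so repeated runs pick the same indices
n_inputs = 60   # assumed number of input columns
n_features = 7  # assumed number of features to consider per split

# random.sample draws without replacement, so no index repeats
features = random.sample(range(n_inputs), n_features)
print(features)
```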

Below is a function named __get_split()__ that implements this procedure. It takes two input arguments: a dataset, which may be a sample of the actual training dataset, and the number of input features to evaluate.

The helper function __test_split()__ is used to split the dataset by a candidate split point and __gini_index()__ is used to evaluate the cost of a given split by the groups of rows created.

We can see that a list of features is created by randomly selecting feature indices and adding them to a list (called __features__). This list of features is then enumerated, and specific values in the training dataset are evaluated as split points.

In [None]:
import random

# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
    
    # Count total samples
    total_samples = float(sum([len(group) for group in groups]))
    
    # Sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        
        # Avoid divide by zero
        if size == 0:
            continue
        
        # Score the group based on the score for each class
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p**2
            
        # Weight the groups score by its relative size
        gini += (1.0 - score) * (size / total_samples)
        
    return gini

# Split a dataset based on an attribute and a split value
def test_split(index, value, dataset):
    left, right = [], []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Select the best split point for a dataset
def get_split(dataset, n_features):
    # Get a list of all unique class values
    class_values = list(set(row[-1] for row in dataset))
    
    # Record best values
    b_index, b_value, b_gini, b_groups = 999, 999, 999, None
    
    # Randomly select n feature indices without replacement
    features = random.sample(range(len(dataset[0])-1), n_features)

    # Iterate over all the elements in the selected feature columns
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini   = gini_index(groups, class_values)
            
            # If the split has a lower gini score
            if gini < b_gini:
                b_index  = index
                b_value  = row[index]
                b_gini   = gini
                b_groups = groups
    
    return {'index': b_index, 'value': b_value, 'groups': b_groups}