# **Random Forests**

We’ve seen that decision trees can be powerful supervised machine learning models. However, they’re not without their weaknesses — decision trees are often prone to overfitting.

We’ve discussed some strategies to minimize this problem, like pruning, but sometimes that isn’t enough. We need to find another way to generalize our trees. This is where the concept of a *random forest* comes in handy.

A *random forest* is an ensemble machine learning technique — a random forest contains many decision trees that all work together to classify new points. When a random forest is asked to classify a new point, the random forest gives that point to each of the decision trees. Each of those trees reports their classification and the random forest returns the most popular classification. It’s like every tree gets a vote, and the most popular classification wins.

Some of the trees in the random forest may be overfit, but by making the prediction based on a large number of trees, overfitting will have less of an impact.

In this lesson, we’ll learn how the trees in a random forest get created.

<img src="RF.png" width="40%" height="40%">

## Bagging

You might be wondering how the trees in the random forest get created. After all, right now, our algorithm for creating a decision tree is deterministic — given a training set, the same tree will be made every time.

Random forests create different trees using a process known as `bagging`. Every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had `1000` rows in it, we could make a decision tree by picking `100` of those rows at random to build the tree. This way, every tree is different, but all trees will still be created from a portion of the training data.

One thing to note is that when we’re randomly selecting these `100` rows, we’re doing so `with replacement`. Picture putting all `100` rows in a bag and reaching in and grabbing one row at random. After writing down what row we picked, we put that row back in our bag.

This means that when we’re picking our `100` random rows, we could pick the same row more than once. In fact, it’s very unlikely, but all `100` randomly picked rows could all be the same row!

Because we’re picking these rows with replacement, there’s no need to shrink our bagged training set from `1000` rows to `100`. We can pick `1000` rows at random, and because we can get the same row more than once, we’ll still end up with a unique data set.

In [1]:
import numpy as np
import pandas as pd
import random
from tree import *  # our own decision tree script

In [2]:
# read in the data
with open("drug_consumption.data", "r") as f:
    attributes = []
    drug = []
    for line in f:
        attributes.append(line.strip().split(",")[1:13])
        drug.append(line.strip().split(",")[28])
        
tree = build_tree(attributes, drug)
print(print_tree(tree))

Splitting on Country of residence
--> Branch -0.09765:
 Splitting on Neuroticism score
 --> Branch -0.05188:
  Counter({'CL3': 1})
 --> Branch -0.14882:
  Counter({'CL3': 2})
 --> Branch -0.24649:
  Splitting on Age
  --> Branch -0.95197:
   Counter({'CL4': 1})
  --> Branch 0.49788:
   Counter({'CL0': 1})
 --> Branch -0.34799:
  Splitting on Extraversion score
  --> Branch 0.16767:
   Counter({'CL4': 1})
  --> Branch 0.32197:
   Counter({'CL3': 1})
  --> Branch 0.80523:
   Counter({'CL4': 1})
 --> Branch -0.58016:
  Splitting on Extraversion score
  --> Branch 0.32197:
   Counter({'CL6': 1})
  --> Branch 0.47617:
   Counter({'CL0': 1})
  --> Branch 0.63779:
   Counter({'CL2': 1})
 --> Branch -0.67825:
  Splitting on Extraversion score
  --> Branch 0.32197:
   Counter({'CL0': 1})
  --> Branch 0.63779:
   Counter({'CL1': 1})
  --> Branch 1.74091:
   Counter({'CL5': 1})
 --> Branch -0.92104:
  Counter({'CL0': 3})
 --> Branch -1.05308:
  Counter({'CL0': 1})
 --> Branch -1.19430:
  Counter(

In [3]:
# create a list of 1,000 random numbers
# (there can be repeats)
indices = [random.randint(0, 999) for i in range(1000)]

# select those observations
attributes_subset = [attributes[idx] for idx in indices]
drug_subset = [drug[idx] for idx in indices]

# build a second tree on the reduced dataset
tree_subset = build_tree(attributes_subset, drug_subset)
print(print_tree(tree_subset))

Splitting on Conscientiousness score
--> Branch -0.00665:
 Splitting on Neuroticism score
 --> Branch -0.05188:
  Counter({'CL0': 2})
 --> Branch -0.24649:
  Counter({'CL0': 2})
 --> Branch -0.34799:
  Counter({'CL3': 1})
 --> Branch -0.46725:
  Counter({'CL0': 1})
 --> Branch -0.67825:
  Splitting on Age
  --> Branch -0.95197:
   Counter({'CL3': 2})
  --> Branch 0.49788:
   Counter({'CL0': 1})
 --> Branch -0.79151:
  Counter({'CL1': 6})
 --> Branch -0.92104:
  Splitting on Age
  --> Branch -0.07854:
   Counter({'CL2': 1})
  --> Branch 1.09449:
   Counter({'CL0': 4})
 --> Branch -1.05308:
  Counter({'CL0': 2})
 --> Branch -1.19430:
  Splitting on Age
  --> Branch -0.07854:
   Counter({'CL2': 1})
  --> Branch -0.95197:
   Counter({'CL2': 2})
  --> Branch 1.09449:
   Counter({'CL1': 1})
 --> Branch -1.55078:
  Counter({'CL0': 1})
 --> Branch -1.69163:
  Counter({'CL1': 3})
 --> Branch 0.04257:
  Counter({'CL0': 1})
 --> Branch 0.13606:
  Splitting on Education
  --> Branch -0.05921:
   C

## Bagging Features

We’re now making trees based on different random subsets of our initial dataset. But we can continue to add variety to the ways our trees are created by changing the features that we use.

Recall that for our data set, the original features were the following:
- age
- gender
- education
- country of current residence
- ethnicity
- neuroticism score
- extraversion score
- openness to experience score
- agreeableness score
- conscientiousness score
- impulsiveness score
- sensation seeing score

Right now when we create a decision tree, we look at every one of those features and choose to split the data based on the feature that produces the most information gain. We could change how the tree is created by only allowing a subset of those features to be considered at each split.

For example, when finding which feature to split the data on the first time, we might randomly choose to only consider ethnicity, impulsiveness, country of current residence and age.

After splitting the data on the best feature from that subset, we’ll likely want to split again. For this next split, we’ll randomly select three features again to consider. This time those features might be age, ethnicity, openness to experience and education. We’ll continue this process until the tree is complete.

One question to consider is how to choose the number of features to randomly select. Why did we choose `4` in this example? A good rule of thumb is to randomly select the square root of the total number of features. Our drugs dataset doesn’t have a lot of features, so it's only just possible to follow this rule. But if we had a dataset with `25` features for example, we’d want to randomly select `5` features to consider at every split point.

In [4]:
# copied from `tree.py`
def find_best_split(dataset, labels):
    best_gain = 0
    best_feature = 0
    
    for feature in range(len(dataset[0])):
        data_subsets, label_subsets = split(dataset, labels, feature)
        gain = information_gain(labels, label_subsets)
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    return best_gain, best_feature


# determine the best feature to split the data on (openness to experience)
print(find_best_split(attributes_subset, drug_subset))

(0.06448837777958108, 9)


In [5]:
# modify find_best_split to only consider a subset of the features
def find_best_split(dataset, labels, num_features):
    best_gain = 0
    best_feature = 0
    features = np.random.choice(len(dataset[0]), num_features, replace=False)
    
    for feature in features:
        data_subsets, label_subsets = split(dataset, labels, feature)
        gain = information_gain(labels, label_subsets)
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    return best_gain, best_feature


# check if the best feature to split on is the same
print(find_best_split(attributes_subset, drug_subset, 4))

(0.061699617406980725, 7)


## Classify

Now that we can make different decision trees, it’s time to plant a whole forest! Let’s say we make different `8` trees using bagging and feature bagging. We can now take a new unlabeled point, give that point to each tree in the forest, and count the number of times different labels are predicted.

The trees give us their votes and the label that is predicted most often will be our final classification! For example, if we gave our random forest of 8 trees a new data point, we might get the following results:
```
["recommend", "priority", "not_recom", "recommend", "priority", "not_recom", "priority", "priority"]
```

Since the most commonly predicted classification was "priority", this would be the random forest’s final classification.

Let’s write some code that can classify an unlabelled point!

In [6]:
# choose a random observation to test
unlabelled_point = sample(attributes, 1)[0]
print(unlabelled_point)

# build a third tree on the reduced dataset using the modified find_best_split function
bagged_tree = build_bagged_tree(attributes_subset, drug_subset, 4)
pred = classify(unlabelled_point, bagged_tree)
print(pred)

['-0.07854', '-0.48246', '0.45468', '0.96082', '-0.31685', '-0.46725', '-1.76250', '-2.85950', '-1.47955', '-1.13788', '-1.37983', '0.07987']
CL0


In [7]:
# output the predictions of 20 trees
preds = []
for i in range(20):
    indices = [random.randint(0, 999) for i in range(1000)]
    # select those observations
    attributes_subset = [attributes[idx] for idx in indices]
    drug_subset = [drug[idx] for idx in indices]

    # build a tree on the reduced dataset
    bagged_tree = build_bagged_tree(attributes_subset, drug_subset, 4)
    pred = classify(unlabelled_point, bagged_tree)
    preds.append(pred)
print(preds)

# identify the majority class
final_pred = max(preds, key=preds.count)
print(final_pred)

[None, None, 'CL0', 'CL0', None, 'CL0', 'CL0', None, 'CL0', 'CL0', 'CL0', 'CL0', 'CL0', None, 'CL0', None, 'CL0', 'CL0', 'CL0', None]
CL0


## Test Set

We’re now able to create a random forest, but how accurate is it compared to a single decision tree? To answer this question we’ve split our data into a training set and test set. By building our models using the training set and testing on every data point in the test set, we can calculate the accuracy of both a single decision tree and a random forest.

We’ve given you code that calculates the accuracy of a single tree. This tree was made without using any of the bagging techniques we just learned. We created the tree by using every row from the training set once and considered every feature when splitting the data rather than a random subset.

Let’s also calculate the accuracy of a random forest and see how it compares!

In [8]:
# read in the data
with open("drug_consumption.data", "r") as f:
    data = []
    for line in f:
        attributes = line.strip().split(",")[1:13]
        attributes.append(line.strip().split(",")[28])
        data.append(attributes)

# shuffle it
random.shuffle(attributes)

# separate the data into features and labels
attributes = [x[:-1] for x in data]
drug = [x[-1] for x in data]

# split into training and test sets
X_train = attributes[:int(len(attributes)*0.8)]
y_train = drug[:int(len(attributes)*0.8)]

X_test = attributes[int(len(attributes)*0.8):]
y_test = drug[int(len(attributes)*0.8):]

# create a single tree
tree = build_tree(X_train, y_train)
tree_correct = 0

for i in range(len(X_test)):
    pred = classify(X_test[i], tree)
    if pred == y_test[i]:
        tree_correct += 1
print(tree_correct / len(X_test))

0.23872679045092837


In [9]:
# create a random forest algorithm
def build_random_forest(n, training_data, training_labels):
    trees = []
    for i in range(n):
        indices = [random.randint(0, len(training_data)-1) for x in range(len(training_data))]

        training_data_subset = [training_data[index] for index in indices]
        training_labels_subset = [training_labels[index] for index in indices]

        tree = build_bagged_tree(training_data_subset, training_labels_subset, 2)
        trees.append(tree)
    return trees

# create a random forest with 40 trees
random_forest = build_random_forest(40, X_train, y_train)
forest_correct = 0

preds = []
for i in range(len(X_test)):
    for tree in random_forest:
        preds.append(classify(X_test[i], tree))
    forest_pred = max(preds, key=preds.count)
    if forest_pred == y_test[i]:
        forest_correct += 1
print(forest_correct / len(X_test))

0.0


## Random Forests in scikit-learn

You now have the ability to make a random forest using your own decision trees. However, `scikit-learn` has a `RandomForestClassifier` class that will do all of this work for you!

`RandomForestClassifier` works almost identically to `DecisionTreeClassifier` — the `.fit()`, `.predict()`, and `.score()` methods work in exactly the same way.

When creating a `RandomForestClassifier`, you can choose how many trees to include in the random forest by using the `n_estimators` parameter like this:
```
classifier = RandomForestClassifier(n_estimators = 100)
```
We now have a very powerful machine learning model that is fairly resistant to overfitting!

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [11]:
drugs = pd.read_csv("drug_consumption.data", header=None)
# filter out unnecessary columns
drugs.drop(0, inplace=True, axis=1)
drugs.drop(drugs.iloc[:, 12:27], inplace=True, axis=1)
drugs.drop(drugs.iloc[:, -3:], inplace=True, axis=1)
drugs.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,28
0,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,-0.00665,-0.21712,-1.18084,CL0
1,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,-0.14277,-0.71126,-0.21575,CL0
2,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,-1.0145,-1.37983,0.40148,CL1
3,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,0.58489,-1.37983,-1.18084,CL0
4,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,1.30612,-0.21712,-0.21575,CL2


In [13]:
# split the data into 80% train and 20% validation
y = drugs[28]
X = drugs.drop(28, axis=1)

# ensure the data is shuffled and that the training and test sets are representative of the target distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y)
# view the first observation in the training set
print(X_train.iloc[0, :])
print('-----')
print(y_train.iloc[0])

1    -0.95197
2    -0.48246
3    -0.61113
4     0.21128
5    -0.31685
6    -0.46725
7    -0.43999
8     1.06238
9     0.13136
10   -1.38502
11   -0.21712
12    0.76540
Name: 1069, dtype: float64
-----
CL0


In [14]:
# create a random forest classifier
clf = RandomForestClassifier(n_estimators=2000, random_state=0)

clf.fit(X_train, y_train)
# assess the model's accuracy on the validation data
print(clf.score(X_test, y_test))

0.5358090185676393


## Review

Here are some of the major takeaways about random forests:

- A random forest is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
- Random forests are used to avoid overfitting. By aggregating the classification of multiple trees, having overfitted trees in a random forest is less impactful.
- Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random *with replacement*, which means a single data point can be chosen more than once. This process is known as *bagging*.
- When creating a tree in a random forest, a randomly selected subset of features are considered as candidates for the best splitting feature. If your dataset has `n` features, it is common practice to randomly select the square root of `n` features.