# Random Forest

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# fixing the randomness
import random
random.seed(42)
np.random.seed(42)

To start, we will load our data. We are looking at the random forest classifier, so we'll use the breast cancer dataset, which is a binary classification dataset.

In [3]:
# Import the breast cancer dataset from sklearn
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
from sklearn.datasets import load_breast_cancer
# load the dataset
cancer = load_breast_cancer()
print(cancer['feature_names'])

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
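As a quick sanity check that this really is a classification problem, the sketch below prints the shape of the data and the class names. It reloads the dataset so that it runs on its own; the variable names `n_samples` and `n_features` are only illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this snippet is self-contained
cancer = load_breast_cancer()

# Rows are samples, columns are the features printed above
n_samples, n_features = cancer.data.shape
print(f"{n_samples} samples, {n_features} features")

# The target is binary: each sample is labelled with one of two classes
print("classes:", list(cancer.target_names))
```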


In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,  test_size = 0.3, random_state = 42)

Creating a Random Forest classifier with scikit-learn is no different from creating any other type of classifier, such as the Decision Tree we saw last time.

In [5]:
# import the random forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
# the random forest is stochastic, so we use a random_state parameter to fix the result
clf = RandomForestClassifier(random_state=42)

And we train it with the same method as usual.

In [6]:
clf.fit(x_train, y_train)
clf.score(x_test, y_test)

0.9707602339181286
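To illustrate that the interface is shared, here is a minimal sketch that trains a Decision Tree and a Random Forest with exactly the same `fit` and `score` calls. It rebuilds the same split as above so it can run on its own; the `models` and `scores` dictionaries are just illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as in the cells above
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Both estimators expose the same fit / score methods
models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")
```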

## Part II


The `max_features` parameter sets how many features are considered when looking for the best split. It can be set manually with an integer for a fixed number of features or a float for a fraction of the features, or you can use a preset value such as `'sqrt'`, which considers $\sqrt{N}$ features for each split, with $N$ being the number of features.

The square root is the default value for the classifier.
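To see how each kind of value is turned into an actual number of features, the sketch below fits a few small forests and reads the fitted attribute `max_features_` from the first tree of each. The forest size of 5 trees and the `resolved` dictionary are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# The breast cancer dataset has N = 30 features
X, y = load_breast_cancer(return_X_y=True)

resolved = {}
for setting in [1, 0.5, "sqrt", "log2"]:
    forest = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=42)
    forest.fit(X, y)
    # Each fitted tree stores the number of features it considers per split
    resolved[setting] = forest.estimators_[0].max_features_
    print(f"max_features={setting!r} -> {resolved[setting]} features per split")
```

With 30 features, an integer is used as is, `0.5` is read as half of the features, and `'sqrt'` and `'log2'` are rounded down to whole numbers.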

In [7]:
# an integer is an absolute count: only one feature is considered at each split
clf2 = RandomForestClassifier(max_features=1, random_state=42)

In [8]:
clf2.fit(x_train, y_train)
clf2.score(x_test, y_test)

0.9766081871345029

By default, `max_depth` is `None`, so the nodes are expanded until all leaves are pure (or until another hyper-parameter, such as `min_samples_split`, stops the growth).

The scikit-learn user guide mentions this:
> Good results are often achieved when setting `max_depth=None` in combination with `min_samples_split=2` (i.e., when fully developing the trees).

However, you can still set a maximum depth by hand, with an integer. Doing so helps reduce the size of the model.
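As a rough way to measure the effect on model size, the sketch below compares the deepest tree and the total number of nodes of a fully grown forest and of a forest limited to depth 1. The helper `forest_size` is a hypothetical function written for this illustration, and 20 trees is an arbitrary choice to keep it fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

def forest_size(max_depth):
    """Return (deepest tree depth, total node count) of a small fitted forest."""
    forest = RandomForestClassifier(
        n_estimators=20, max_depth=max_depth, random_state=42).fit(X, y)
    depths = [tree.get_depth() for tree in forest.estimators_]
    nodes = sum(tree.tree_.node_count for tree in forest.estimators_)
    return max(depths), nodes

full_depth, full_nodes = forest_size(None)  # trees grown until leaves are pure
stump_depth, stump_nodes = forest_size(1)   # every tree is a single split

print(f"max_depth=None: deepest tree {full_depth}, {full_nodes} nodes in total")
print(f"max_depth=1:    deepest tree {stump_depth}, {stump_nodes} nodes in total")
```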

In [9]:
clf3 = RandomForestClassifier(max_depth=1, random_state=42)

In [10]:
clf3.fit(x_train, y_train)
clf3.score(x_test, y_test)

0.9590643274853801

The number of trees is also an important parameter. You can change it with the `n_estimators` argument.

User guide:
> The larger the better, but also the longer it will take to compute. In addition, note that results will stop getting significantly better beyond a critical number of trees. 

The default value is 100, but you can set it to any other positive integer.
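One simple way to look for that critical number is to fit forests of increasing size and compare their test accuracy. The sketch below does this on the same split as above; the list `tree_counts` is an arbitrary choice, and the exact scores depend on the scikit-learn version and the random seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

tree_counts = [1, 5, 25, 100]  # arbitrary forest sizes to compare
results = {}
for n in tree_counts:
    forest = RandomForestClassifier(n_estimators=n, random_state=42)
    forest.fit(x_train, y_train)
    # estimators_ holds the fitted trees, so its length equals n_estimators
    results[n] = (len(forest.estimators_), forest.score(x_test, y_test))
    print(f"{n:>4} trees -> accuracy {results[n][1]:.3f}")
```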

In [11]:
clf4 = RandomForestClassifier(n_estimators=5, random_state=42)

In [12]:
clf4.fit(x_train, y_train)
clf4.score(x_test, y_test)

0.9649122807017544

In [13]:
clf5 = RandomForestClassifier(n_estimators=1000, random_state=42)

In [14]:
clf5.fit(x_train, y_train)
clf5.score(x_test, y_test)

0.9707602339181286

With the `bootstrap` argument, you can choose whether each tree is trained on a bootstrap sample or on the entire dataset. The default value is `True`.

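If it is `True`, you can control the size of each bootstrap sample with the `max_samples` argument. By default it is `None`, which draws as many samples as there are in the training set (with replacement). You can set it to an int for an absolute number of samples or to a float for a fraction of the training set.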

User guide:
> A typical value of subsample is 0.5.
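The difference between an int and a float matters a lot here. The sketch below, which rebuilds the same split so it can run on its own, shows that with `max_samples=1` every tree is trained on a single sample and therefore cannot split at all, whereas `max_samples=0.4` gives each tree at most 40% of the training rows. The variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
n_train = x_train.shape[0]

# int: absolute count, so each tree is fit on one drawn sample
tiny = RandomForestClassifier(max_samples=1, random_state=42).fit(x_train, y_train)
tiny_nodes = [tree.tree_.node_count for tree in tiny.estimators_]
tiny_pred = tiny.predict(x_test)
print("nodes per tree:", set(tiny_nodes))
print("distinct predictions on the test set:", set(tiny_pred.tolist()))

# float: fraction of the training set drawn (with replacement) for each tree
partial = RandomForestClassifier(max_samples=0.4, random_state=42).fit(x_train, y_train)
# Distinct training rows reaching the root of each tree
# (a row drawn several times is counted once)
root_sizes = [int(tree.tree_.n_node_samples[0]) for tree in partial.estimators_]
print(f"training rows: {n_train}, most distinct rows seen by one tree: {max(root_sizes)}")
```

Since every tree of the first forest is a single leaf, the forest gives the same prediction for every input, so its accuracy is just the share of that class in the test set.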

In [15]:
# an integer is an absolute count: each tree is trained on a single sample
clf6 = RandomForestClassifier(max_samples=1, random_state=42)

In [16]:
clf6.fit(x_train, y_train)
clf6.score(x_test, y_test)

0.631578947368421

In [17]:
# a float is a fraction: each tree draws 40% of the training set
clf7 = RandomForestClassifier(max_samples=0.4, random_state=42)

In [18]:
clf7.fit(x_train, y_train)
clf7.score(x_test, y_test)

0.9707602339181286