# MACHINE LEARNING WITH DECISION TREES AND RANDOM FORESTS

>by Dr Juan H Klopper

- Research Fellow
- School for Data Science and Computational Thinking
- Stellenbosch University

## INTRODUCTION

__Random forests__ and __gradient boosted trees__ are commonly used machine learning (ML) techniques for classification and regression problems. They have the advantage over some other ML techniques in that the models are interpretable.

The basic building block of a random forest is a __decision tree__. The term decision tree is almost self-explanatory. The algorithm builds a tree structure by making repeated decisions on the data. As such, it is very similar to a flowchart. 

In this notebook we explore a simple decision tree and take a closer look at random forests. We start with the concept of information gain, vital to random forests.

Imagine that we have a basket of green apples, oranges, and bananas. Without examining the basket, we have very little information. To gain more information we might consider whether a fruit is orange in colour or not. This will immediately split the oranges from the green apples and the bananas. We have gained information. We see a simplified decision tree analogue in the image below.

<img src="https://drive.google.com/uc?id=1y7q1noKhje77gr7NJuXFguSrIlUOKyhc" width=600>

As the image shows, a decision tree asks questions at each __node__. The first question is the __root node__. All nodes that follow from a previous node are __child nodes__ and the node from which they originate is a __parent node__. The last nodes are also termed __leaf nodes__ or __terminal nodes__. A __branch__ is any tree structure that _flows from_ a parent node. The __depth__ of a tree is the longest path from the root node to a leaf. In the image above the depth is $2$ (there are two layers below the root node on the right).

In our image above, the leaf nodes are __pure__. They only contain a single class. We have gained information by _asking our questions_ and dividing the data set. We will see later that there are ways of calculating information gain.

Different questions could be asked of the data leading to different trees. In the image above, one of the feature variables (weight) was not even included.

Many trees can be generated together in an ensemble of trees. This leads to random forests and gradient boosted trees. Such algorithms can greatly improve on a simple decision tree.

## PACKAGES USED IN THIS NOTEBOOK

In [None]:
# The usual suspects
import numpy as np
import pandas as pd

In [None]:
# The industry-leading scikit-learn package for machine learning in Python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
import pydotplus
from IPython.display import Image

In [None]:
# Data visualisation
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
import plotly.io as pio
pio.templates.default = 'plotly_white'

In [None]:
# Two more data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%config InlineBackend.figure_format = "retina" # For Retina type displays

In [None]:
# Format tables printed to the screen (don't put this on the same line as the code)
%load_ext google.colab.data_table

## DECISION TREES

The knowledge we require to use random forests starts by understanding a decision tree. Below, we see a dataset with three categorical feature variables, each with three elements in its sample space. The target variable is dichotomous.

In [None]:
cat_1 = ['I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II', 'III', 'III', 'I', 'I', 'I', 'I', 'III', 'III', 'III', 'III', 'III', 'III']
cat_2 = ['A', 'A', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C']
cat_3 = ['2', '2', '1', '2', '1', '1', '2', '1', '2', '1', '1', '2', '2', '2', '3', '3', '2', '3', '3', '2', '3']
target = ['No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']

df = pd.DataFrame({
    'CAT1':cat_1,
    'CAT2':cat_2,
    'CAT3':cat_3,
    'Target':target
})

df

We can view the frequency of each variable's sample space elements.

In [None]:
df.CAT1.value_counts()

In [None]:
df.CAT2.value_counts()

In [None]:
df.CAT3.value_counts()

In [None]:
df.Target.value_counts()

A decision tree is similar to a flowchart. Our aim is to build a decision tree to predict the target class.

We can make our root node any of the three feature variables. We shall start with the first categorical variable. Using the `groupby` method, we can see the proportion of target classes in each child node. There are three child nodes that follow from this root node as there are three sample space elements in the `CAT1` variable.

In [None]:
df.groupby('CAT1')['Target'].value_counts()

The image below gives a visual representation of the results following from using `CAT1` as the root node.

<img src="https://drive.google.com/uc?id=117S_JJPZGnvraN3FoYbPs1Bim-sKARvA">

We have been introduced to the term _pure node_. This means that there is also an _impure node_. __Purity__ refers to class frequency. If a child node only contains a single class it is pure (as with the second child node), else it is impure.

Our aim is for the decision tree to predict the target class. If we choose `CAT1` as our root node, a value of `II` will always predict a `No` for the target class. That child node is now a leaf or terminal node. Not so for the other two nodes (`CAT1` values of `I` and `III`). We need them to _branch_ further.

If we select `CAT2` for the first child node on the left (`CAT1` being `I`) we get the following results (using `groupby` again, after selecting `I`).

In [None]:
df.loc[df.CAT1 == 'I'].groupby('CAT2')['Target'].value_counts()

Now we see that for values of `C` we get a pure node. So, if `CAT1` is `I` and then `CAT2` is `C` then our decision tree predicts a target class of `No`. What about the other child node (`III`)? Below, we also choose `CAT2` for it.

In [None]:
df.loc[df.CAT1 == 'III'].groupby('CAT2')['Target'].value_counts()

`CAT2` brings no pure nodes. So what if we choose `CAT3` instead?

In [None]:
df.loc[df.CAT1 == 'III'].groupby('CAT3')['Target'].value_counts()

All three child nodes are pure. 

We could carry on this process until all nodes are pure. This random selection might be very inefficient and the depth might be quite large. So, how do we improve on our selection of variables for our nodes? The answer is information gain.

## INFORMATION GAIN

We have seen that any variable can be chosen at a node. Given the data, a decision tree must decide on these variables.

This decision is made using __information gain__. We require maximum information gain at each node. Information gain is the difference in information before and after a node. An equation for information gain is shown in (1).

$$\text{IG} \left( \text{D}_{p} , \text{f} \right) = \text{I} \left( \text{D}_{p} \right) - \sum_{i=1}^{m}{\frac{N_{i}}{N} \text{I} \left( \text{D}_{i} \right)} \tag{1}$$

Here $\text{IG}$ is information gain given the data set of a parent node, $\text{D}_p$, and the feature, $\text{f}$. $\text{I}$ is an impurity criterion (see below). $N$ is the number of samples in the parent node, $m$ is the number of child nodes, and $\text{D}_{i}$ is the data set of the $i^{\text{th}}$ child node, which contains $N_{i}$ samples. The equation simply states that we subtract the weighted average impurity of the child nodes from that of their parent node.

Two commonly used impurity criteria are the entropy and Gini index, shown in (2) and (3).

$$ \text{I}_{\text{Entropy}} = - \sum_{i=1}^{c}{p_{i} \log_{2} \left( p_{i} \right)} \tag{2}$$

$$\text{I}_{\text{Gini}} = 1 - \sum_{i=1}^{c}{p_{i}^{2}} \tag{3}$$

Gini impurity can only be used for classification problems (categorical target variable). Here, $c$ is the number of classes and $p_{i}$ is the proportion of observations at a particular node that belong to class $i$.
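As a minimal sketch of equations (2) and (3), the two criteria can be written as small functions that take the class counts at a node. The function names `class_proportions`, `entropy`, and `gini` are our own choices for illustration and are not part of any package.

```python
import numpy as np

def class_proportions(counts):
    """Convert the class counts at a node to proportions, dropping empty classes."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]  # an empty class contributes nothing to either sum
    return counts / counts.sum()

def entropy(counts):
    """Entropy impurity, equation (2)."""
    p = class_proportions(counts)
    return float(-np.sum(p * np.log2(p)))

def gini(counts):
    """Gini impurity, equation (3)."""
    p = class_proportions(counts)
    return float(1 - np.sum(p ** 2))

# A node holding 4 observations of one class and 6 of another
print(entropy([4, 6]), gini([4, 6]))

# A pure node has zero impurity under both criteria
print(entropy([3, 0]), gini([3, 0]))
```

Both criteria are $0$ for a pure node and grow as the classes become more evenly mixed.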

We will discuss entropy in more detail. It requires us to understand the logarithm function and summation notation.

As a quick reminder of the logarithm we have (4).

$$y = \log_{2} \left( x \right) \text{ means } 2^{y} = x \tag{4}$$

The $y$ (the solution we seek) is the power to which we have to raise the base ($2$ in this case) to get $x$.

$\Sigma$ is the summation symbol. It has a subscript and a superscript. The former tells us where to start counting and the latter is where we stop. The increment is $1$. In (5) we get a look at how summation notation works for adding three numbers denoted as $x_{1}$, $x_{2}$, and $x_{3}$.

$$\sum_{i=1}^{3} \left( x_{i} \right) = x_{1} + x_{2} + x_{3} \tag{5}$$

We simply increment the value of $i$ at each step.
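Equations (4) and (5) can be checked with a few lines of code. This is only an illustration and the three numbers in `x` are made up.

```python
import numpy as np

# Equation (4): log2(8) is the power to which 2 must be raised to give 8
y = np.log2(8)
print(y, 2 ** y)

# Equation (5): summing three values x_1, x_2, x_3 (hypothetical numbers)
x = [2, 5, 7]
total = sum(x[i] for i in range(3))  # Python indices run from 0 to 2 rather than 1 to 3
print(total)
```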

Back to our equation for entropy, (2). __Shannon entropy__ is a measure of information. When we only have the data set and have not constructed a decision tree, our entropy (a measure of missing information) is high and our information is low. We need to gain information and decrease entropy (decrease the amount of missing knowledge about our target in this case). To understand this equation, we view our example from above.

We have two target classes, so $i$ starts at $1$ and goes to $c=2$. From this we have $p_{1}$, the probability of, say, `Yes` (we are free to choose), as the number of `Yes` classes at the first child node (`I`) divided by the total number of observations in that node (`Yes` + `No`), which is $10$. So $p_{1}$ is $4$ divided by $10$. For $i=2$, that is to say `No`, $p_{2}$ would be $6$ divided by $10$. The $\log_{2}$ is the logarithm base $2$. For clarity, (6) shows the entropy for the first child node in various ways. To remind us of the child nodes of the root node, we repeat the grouping below.

In [None]:
df.groupby('CAT1')['Target'].value_counts()

$$ \begin{align} &\text{I}_{\text{I}} = - {p_{1} \log_{2} p_{1}} - {p_{2} \log_{2} p_{2}} \\ &\text{I}_{\text{I}} = - p_{\text{Yes}} \log_{2} p_{\text{Yes}} - p_{\text{No}} \log_{2} p_{\text{No}} \\ &\text{I}_{\text{I}} = -\frac{4}{10} \log_{2} \frac{4}{10} - \frac{6}{10} \log_{2} \frac{6}{10} \end{align} \tag{6}$$

The numpy `log2` function calculates the logarithm base $2$.

In [None]:
cat1_I = -((4/10) * (np.log2(4/10))) - ((6/10) * (np.log2(6/10)))
cat1_I

Below, we do this for the other two child nodes in the image above. Remember that in the second node (`CAT1` = `II`) we have a pure node of three `No` classes. In the third node (`CAT1` = `III`) we have six `Yes` classes and two `No` classes.

Since the logarithm of $0$ is not defined, we do not include it in the equation.

In [None]:
# Second node
cat1_II = - ((3/3) * (np.log2(3/3)))
cat1_II

The result is $0$ (displayed as `-0.0` because of the leading minus sign).

In [None]:
# Third node
cat1_III = -((6/8) * (np.log2(6/8))) - ((2/8) * (np.log2(2/8)))
cat1_III

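Entropy ranges from $0$, where we have complete information (a pure node), to $1$ for a two-class target, where both classes are equally frequent and we have no information. With $c$ classes the maximum is $\log_{2} \left( c \right)$.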
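To see how entropy responds to the class balance in a two-class node, the sketch below evaluates equation (2) for a few proportions. The helper `binary_entropy` is a name chosen here for illustration.

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a two-class node in which a proportion p belongs to one class."""
    if p in (0, 1):
        return 0.0  # pure node: the undefined 0 * log2(0) term is left out
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(f'p = {p:4.2f}  entropy = {binary_entropy(p):.3f}')
```

The value is largest when the two classes are equally frequent and falls symmetrically towards $0$ as the node becomes pure.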

We also look at the entropy of the parent node. At the root we simply have the frequency of the target classes.

In [None]:
df.Target.value_counts()

In [None]:
# Root node
start = -((10/21) * (np.log2(10/21))) - ((11/21) * (np.log2(11/21)))
start

If we take the weighted average of the child node entropies, with each child weighted by its share of the parent's observations ($N_{i}/N$), and subtract it from the entropy of the root (parent) node, we have the information gain for choosing `CAT1` as our root node. This is equation (1) above.

In [None]:
# Information gain given CAT1 as choice for root node
# The child nodes hold 10, 3, and 8 of the 21 observations
start - ((10/21) * cat1_I + (3/21) * cat1_II + (8/21) * cat1_III)

Would it have been better to choose one of the other variables? We start by taking a look at the information gain from `CAT2` as root node.

In [None]:
df.groupby('CAT2').Target.value_counts()

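In [None]:
# Calculating the three entropies (base 2, as for the root node)
cat2_A = -((1/6) * np.log2(1/6)) - ((5/6) * np.log2(5/6))
cat2_B = -((8/11) * np.log2(8/11)) - ((3/11) * np.log2(3/11))
cat2_C = -((1/4) * np.log2(1/4)) - ((3/4) * np.log2(3/4))

# Weight each child node by its share of the 21 observations
start - ((6/21) * cat2_A + (11/21) * cat2_B + (4/21) * cat2_C)

The information gain is roughly $0.215$, which is lower than for `CAT1`. What about `CAT3`?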

In [None]:
df.groupby('CAT3').Target.value_counts()

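In [None]:
# Calculating the three entropies (base 2, as for the root node)
cat3_1 = - ((6/6) * np.log2(6/6))
cat3_2 = -((5/10) * np.log2(5/10)) - ((5/10) * np.log2(5/10))
cat3_3 = -((5/5) * np.log2(5/5))

# Weight each child node by its share of the 21 observations
start - ((6/21) * cat3_1 + (10/21) * cat3_2 + (5/21) * cat3_3)

The information gain is roughly $0.522$, the highest of the three variables. This would be the best choice for our first node.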
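The hand calculations can be wrapped in a small helper that applies equation (1) to any feature. This is a sketch and not the routine scikit-learn uses internally. The names `entropy` and `information_gain` are our own, each child node is weighted by $N_{i}/N$, and base-2 entropy is used throughout, so the values differ from those of an unweighted mean of the child entropies.

```python
import numpy as np
import pandas as pd

# The toy data set from the DECISION TREES section, rebuilt compactly
df = pd.DataFrame({
    'CAT1': ['I'] * 6 + ['II'] * 3 + ['III'] * 2 + ['I'] * 4 + ['III'] * 6,
    'CAT2': list('AAABCCAABBCABBBBBBBBC'),
    'CAT3': list('221211212112223323323'),
    'Target': ['No'] * 11 + ['Yes'] * 10,
})

def entropy(counts):
    """Base-2 entropy of a node given its class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(data, feature, target='Target'):
    """Equation (1): parent entropy minus the size-weighted child entropies."""
    parent = entropy(data[target].value_counts())
    table = pd.crosstab(data[feature], data[target])  # one row of class counts per child node
    weights = table.sum(axis=1) / len(data)           # N_i / N
    children = table.apply(entropy, axis=1)           # I(D_i)
    return parent - float((weights * children).sum())

gains = {f: information_gain(df, f) for f in ['CAT1', 'CAT2', 'CAT3']}
for f, g in gains.items():
    print(f, round(g, 3))
```

The feature with the largest value in `gains` is the one a tree using this criterion would place at the root.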

This is one algorithm used by a decision tree. It repeats this process at every branch until it reaches purity in all child nodes or until a hyperparameter setting requires it to stop branching (see later).

When can and should a decision tree stop? One obvious stopping criterion is when all the child nodes are leaves or terminal nodes, i.e. they are pure. This is a problematic approach as the depth can be large and the model will probably overfit the training data and not generalise well to unseen data.

In another method, we set a minimum information gain. Once successive branching fails to improve beyond this minimum, the decision tree terminates. We can also call a halt when child nodes would contain fewer than a set number or proportion of the observations.

While a single decision tree is relatively easy to create and understand, it does have drawbacks. We have mentioned overfitting. This is worsened by smaller data sets. __Pruning__ is a technique where the tree is made shallower. The depth can be limited in advance, or branches can be removed after a tree is fully constructed. Such pruned trees might do better on unseen data.
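In scikit-learn these stopping and pruning ideas map onto hyperparameters of `DecisionTreeClassifier`. The sketch below uses synthetic data and arbitrary, untuned values, purely to show which arguments are involved.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the hyperparameters
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=12
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=12)

# Grown until every leaf is pure
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=12).fit(X_tr, y_tr)

# Stopped early by limits set in advance
limited_tree = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=4,                 # stop at a fixed depth
    min_samples_leaf=10,         # every leaf must keep at least 10 observations
    min_impurity_decrease=0.01,  # split only if the weighted impurity drops by at least this much
    random_state=12
).fit(X_tr, y_tr)

# Fully grown first, then pruned back (cost-complexity pruning)
pruned_tree = DecisionTreeClassifier(
    criterion='entropy',
    ccp_alpha=0.01,
    random_state=12
).fit(X_tr, y_tr)

for name, tree in [('full', full_tree), ('limited', limited_tree), ('pruned', pruned_tree)]:
    print(name, tree.get_depth(), tree.get_n_leaves(), round(tree.score(X_te, y_te), 3))
```

Comparing the depth, the number of leaves, and the test score of the three trees gives a feel for the trade-off between tree size and generalisation.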

Another major drawback occurs when some feature variables contain many classes. A tree might preferentially split on this variable. __Information gain ratio__ reduces the bias a tree has for these variables by looking at the size and number of branches of each variable.
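One common definition of the gain ratio (the one used by the C4.5 algorithm) divides the information gain by the _split information_, which is the entropy of the branch sizes. The sketch below applies it to the three candidate root variables of our toy data set. The function names are our own, and scikit-learn's trees do not use this criterion.

```python
import numpy as np

def entropy(counts):
    """Base-2 entropy of a set of counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(children):
    """children: one list of class counts [No, Yes] per child node."""
    children = np.asarray(children, dtype=float)
    sizes = children.sum(axis=1)            # N_i, the size of each branch
    weights = sizes / sizes.sum()           # N_i / N
    parent = entropy(children.sum(axis=0))  # entropy of the parent's class counts
    gain = parent - sum(w * entropy(c) for w, c in zip(weights, children))
    split_info = entropy(sizes)             # grows with the number of branches
    return gain / split_info

# Class counts [No, Yes] in each child node for the three candidate root variables
splits = {
    'CAT1': [[6, 4], [3, 0], [2, 6]],
    'CAT2': [[5, 1], [3, 8], [3, 1]],
    'CAT3': [[6, 0], [5, 5], [0, 5]],
}

ratios = {name: gain_ratio(children) for name, children in splits.items()}
for name, ratio in ratios.items():
    print(name, round(ratio, 3))
```

A variable that splits the data into many small branches has a large split information, which pulls its gain ratio down.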

In the next section, we use the `DecisionTreeClassifier` from the scikit-learn package to investigate our data set.

## A DECISION TREE CLASSIFIER

The scikit-learn package provides a decision tree classifier for classification problems. We can use it on our simple data set. The process is as with the $k$ nearest neighbour classifier from the previous notebook. First, we instantiate the classifier and then fit the data to it.

First, though, we have to transform our data. The `DecisionTreeClassifier` class only works with numerical data. We use the `LabelEncoder` and `LabelBinarizer` to transcode our variable values.

In [None]:
# Instantiate the label encoder
label_encoder = LabelEncoder()

In [None]:
# Instantiate the label binarizer
label_binarizer = LabelBinarizer()

The `fit_transform` method for each of the encoders will fit and transform the data.

In [None]:
encoded_cat1 = label_encoder.fit_transform(cat_1)
encoded_cat2 = label_encoder.fit_transform(cat_2)
encoded_cat3 = label_encoder.fit_transform(cat_3)
y = label_binarizer.fit_transform(target).flatten()
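To see what the two encoders produce, the sketch below applies them to a few made-up labels. `LabelEncoder` assigns integer codes in sorted label order, and `LabelBinarizer` returns a column of zeros and ones for a two-class target. The scikit-learn documentation describes `LabelEncoder` as intended for target values, so for feature variables `OrdinalEncoder` may be the more conventional choice.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

demo_encoder = LabelEncoder()
demo_codes = demo_encoder.fit_transform(['I', 'III', 'II', 'I'])  # hypothetical labels
print(demo_encoder.classes_)  # sorted unique labels; the position is the integer code
print(demo_codes)
print(demo_encoder.inverse_transform(demo_codes))  # recover the original labels

demo_binarizer = LabelBinarizer()
demo_target = demo_binarizer.fit_transform(['No', 'Yes', 'Yes', 'No'])
print(demo_target.shape)      # a single column for a two-class target
print(demo_target.flatten())  # flattened to the one-dimensional form used for y
```

Note that the integer codes impose an order on the categories, and the tree will split on thresholds of these codes.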

Now we create a numpy array with the three encoded feature variables as its columns.

In [None]:
X = np.column_stack([encoded_cat1, encoded_cat2, encoded_cat3])

All that remains is to instantiate our classifier and fit the data.

In [None]:
# Instantiate the classifier
d_tree = DecisionTreeClassifier(criterion='entropy')

In [None]:
# Fit the data (in numpy array format)
d_tree.fit(
    X,
    y
)

If you are running this notebook on a local system then the following code will export a PNG image of the decision tree.

```
feature_names = ['Cat 1', 'Cat 2', 'Cat 3']
target_names = ['No', 'Yes']

dot_data = export_graphviz(
    d_tree,
    out_file=None,
    feature_names=feature_names,
    class_names=target_names
)

graph = pydotplus.graph_from_dot_data(dot_data)

graph.write_png('tree.png') # Save the image to disk
Image(graph.create_png()) # Display the image in the notebook
```
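If Graphviz is not available, the `export_text` function in `sklearn.tree` prints the fitted rules as plain text. The sketch below uses a small made-up data set with two integer-coded features so that it runs on its own. The feature names are placeholders.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# A small hypothetical data set: two integer-coded features and a binary target
X_small = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y_small = [0, 0, 0, 1, 1, 1]

small_tree = DecisionTreeClassifier(criterion='entropy', random_state=12)
small_tree.fit(X_small, y_small)

# Each line is a split on a feature threshold; leaves show the predicted class
tree_text = export_text(small_tree, feature_names=['Cat 1', 'Cat 2'])
print(tree_text)
```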

We can use the `predict` method to pass an unseen observation to the model.

In [None]:
d_tree.predict([[1, 2, 1]])

We can compute the accuracy of our model by passing the feature variable array to the `predict` method.

In [None]:
y_pred = d_tree.predict(X)

We use logic to return `True` and `False` values while comparing the predicted and the true target variable values. Since `True` is represented internally as a $1$, we can sum over all the Boolean values and divide by the number of observations to return the accuracy of the model.

In [None]:
np.sum(y == y_pred) / len(y_pred)

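As with the $k$ nearest neighbour classifier, we can use a confusion matrix plot to compare the model's predictions with the actual values. Since we did not hold out a test set for this small example, the plot below uses the training data.

In [None]:
# ConfusionMatrixDisplay.from_estimator requires scikit-learn 1.0 or later
metrics.ConfusionMatrixDisplay.from_estimator(d_tree,
                                              X,
                                              y);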

From this we can use all the other metrics described in the previous notebook.
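As a brief reminder of how those metrics are computed with the `metrics` module, the sketch below uses made-up label vectors rather than the output of our tree.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels (1 = Yes, 0 = No)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

acc = metrics.accuracy_score(y_true_demo, y_pred_demo)   # same as np.mean(y_true_demo == y_pred_demo)
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)  # rows: actual class, columns: predicted class
precision = metrics.precision_score(y_true_demo, y_pred_demo)
recall = metrics.recall_score(y_true_demo, y_pred_demo)

print(acc, precision, recall)
print(cm)
```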

## A DECISION TREE REGRESSOR

The scikit-learn decision tree regressor class is very similar to the classifier class. In regression problems, the target variable is continuous numerical.

To work through an example of a decision tree regression problem, we generate data using the `make_regression` function from the `datasets` module of the scikit-learn package.

In [None]:
X, y = make_regression(
    n_samples=1000, # Sample size
    n_features=4, # Total number of feature variables
    n_informative=2, # Number of feature variable that are correlated to target
    noise=0.9, # Add noise to the data
    random_state=12 # For reproducible results
)

To visualise the correlation between every pair of variables we create a scatter plot matrix after importing the data into a pandas DataFrame object.

In [None]:
columns = ['Var1', 'Var2', 'Var3', 'Var4'] # Feature variable names

regression_data = pd.DataFrame(
    X,
    columns=columns
)

regression_data['Target'] = y # Add target variable as another column

regression_data[:5] # First five observations

In [None]:
px.scatter_matrix(
    regression_data,
    title='Scatter plot matrix'
)

We note that `Var1` and `Var3` seem to be correlated to the target variable.

Below, we instantiate the regressor class and then proceed as we did above with the classification problem.

In [None]:
# Instantiate the decision tree regressor class
regressor = DecisionTreeRegressor() # All hyperparameters left at their default values

We take the added step of splitting the data into a training and a test set.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=12
)

In [None]:
x_train.shape, y_train.shape # Verifying the splitting result

Now we can fit the model and evaluate its performance.

In [None]:
dt_reg_model = regressor.fit(
    x_train,
    y_train
)

We can use the coefficient of determination, $R^{2}$, to evaluate the model given the test set.

In [None]:
dt_reg_model.score(
    x_test,
    y_test
)
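The `score` method of a regressor returns $R^{2} = 1 - \text{SS}_{\text{res}} / \text{SS}_{\text{tot}}$. The sketch below regenerates the same kind of data under its own variable names (so that it runs independently of the cells above) and checks the manual calculation against `r2_score` and `score`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_r, y_r = make_regression(
    n_samples=1000, n_features=4, n_informative=2, noise=0.9, random_state=12
)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_r, y_r, test_size=0.2, random_state=12
)

tree_reg = DecisionTreeRegressor(random_state=12).fit(Xr_train, yr_train)
yr_pred = tree_reg.predict(Xr_test)

ss_res = np.sum((yr_test - yr_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((yr_test - yr_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(yr_test, yr_pred), tree_reg.score(Xr_test, yr_test))
```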

We can calculate the predicted target values using the `predict` method.

In [None]:
y_reg_pred = dt_reg_model.predict(
    x_test
)

A scatter plot shows very good correlation between the actual and predicted target values.

In [None]:
go.Figure(
    go.Scatter(
        x=y_test,
        y=y_reg_pred,
        mode='markers',
        marker={
            'size':10
        }
    )
).update_layout(
    title='Actual vs predicted values for the test set',
    xaxis={'title':'Actual target values'},
    yaxis={'title':'Predicted target values'}
)

## RANDOM FORESTS

A single decision tree is easy to understand and to generate, but it often does not fare well on real-world data. To improve on the performance, we use ensemble techniques such as random forests. As the name suggests, a random forest is a collection of trees.

The decision trees in a random forest are all individual models and the final result is a majority vote (classification) or an average (regression) of all the predictions.

Each tree works with a random subset of the feature variables and a random sample of the observations (resampling with replacement), and the trees can be trained to various depths. These are all hyperparameters that can be set. This combination can improve the performance on real-world data.
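The sketch below sets some of these hyperparameters explicitly on a `RandomForestRegressor` and confirms that, for regression, the forest's prediction is the average of its trees' predictions. The data are synthetic and the hyperparameter values are arbitrary. In scikit-learn, `max_features` limits the features considered at each split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_f, y_f = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=0.9, random_state=12
)

forest = RandomForestRegressor(
    n_estimators=50,  # number of trees
    max_features=2,   # number of features considered at each split
    bootstrap=True,   # each tree sees a resample (with replacement) of the observations
    max_depth=6,      # maximum depth of every tree
    random_state=12
).fit(X_f, y_f)

# Predictions of the individual trees for five observations, one row per tree
tree_preds = np.array([tree.predict(X_f[:5]) for tree in forest.estimators_])

print(tree_preds.mean(axis=0))  # average over the trees
print(forest.predict(X_f[:5]))  # the forest's own prediction
```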

As an example, we will use the same data as with the decision tree regression problem above. The steps we take to generate, train, and evaluate the model should now be very familiar.

In [None]:
rf_regressor = RandomForestRegressor() # Hyperparameters are left at their default

In [None]:
# Training the model
rf_reg_model = rf_regressor.fit(
    x_train,
    y_train
)

In [None]:
# Coefficient of determination
rf_reg_model.score(
    x_test,
    y_test
)

This is almost $1.0$. Below, we look at a scatter plot of the actual versus the predicted values.

In [None]:
y_reg_pred = rf_reg_model.predict(
    x_test
)

In [None]:
go.Figure(
    go.Scatter(
        x=y_test,
        y=y_reg_pred,
        mode='markers',
        marker={
            'size':10
        }
    )
).update_layout(
    title='Actual vs predicted values for the test set',
    xaxis={'title':'Actual target values'},
    yaxis={'title':'Predicted target values'}
)
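A single train and test split gives only one estimate of performance. The `cross_val_score` and `GridSearchCV` functions imported at the top of the notebook can give a more stable estimate and search over hyperparameter values. The sketch below uses a smaller synthetic data set and a deliberately tiny, hypothetical grid to keep the run time short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

X_g, y_g = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=0.9, random_state=12
)

# Five-fold cross-validated R^2 for a forest of 50 trees
cv_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=12), X_g, y_g, cv=5
)
print(cv_scores.mean())

# A small, hypothetical grid of hyperparameter values
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=12), param_grid, cv=3)
search.fit(X_g, y_g)

print(search.best_params_, search.best_score_)
```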

## TENSORFLOW DECISION FORESTS

Google, which provides the popular TensorFlow deep learning framework, now also provides a wrapper for the Yggdrasil Decision Forests C++ libraries named `tensorflow_decision_forests`. The name differs from the traditional decision tree and random forest in that it combines both terms.

The tensorflow_decision_forests package can use numerical and categorical variables. We do not need to transform the data, i.e. standardize the numerical variables or convert categorical variables into numerical variables, except the target variable as it is used by the keras module of this package for metrics (see later). It can also manage missing data.

### INSTALLING AND IMPORTING THE PACKAGE

At the time of the creation of this notebook, it is not yet part of Colab. We have to install it first.

In [None]:
# Install this package
!pip install tensorflow_decision_forests

We can now import the package, as well as some other packages we will need.

In [None]:
import tensorflow_decision_forests as tfdf
import tensorflow as tf

The tensorflow_decision_forests package is part of the TensorFlow family. Always check what the current version is for updates and changes.

In [None]:
# Current version
tfdf.__version__

### LOADING A DATA SET

Following the official tutorials by Google, we import the well-known Palmer penguins data set directly from the internet.

In [None]:
# Download the data set
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

Colab saves this as a temporary file that we can import using pandas.

In [None]:
penguins = pd.read_csv('/tmp/penguins.csv')

Below, we inspect the data set.

In [None]:
penguins.shape # Number of observations and variables

This is a small data set with only $344$ observations. There are eight variables.

In [None]:
penguins.info() # Variable data type and missing data information

We note a few missing values, especially for the `sex` variable.

In [None]:
penguins[:5] # First five observations

There are three penguin species, with underrepresentation of the Chinstrap species.

In [None]:
penguins.species.value_counts()

In order to use the metrics, we need to transform the target variable to an integer data type.

In [None]:
classes = penguins.species.unique().tolist() # A list of the three species
classes

The `map` method and the `index` method of the `classes` list are used to convert Adelie to $0$, Gentoo to $1$, and Chinstrap to $2$.

In [None]:
penguins.species = penguins.species.map(classes.index)
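The same pattern on a short, made-up series shows what happens. `unique` keeps the labels in order of first appearance, and `list.index` returns the position of each label in that list.

```python
import pandas as pd

species_demo = pd.Series(['Adelie', 'Gentoo', 'Adelie', 'Chinstrap'])  # hypothetical values
classes_demo = species_demo.unique().tolist()      # labels in order of first appearance
codes_demo = species_demo.map(classes_demo.index)  # position of each label in the list

print(classes_demo)
print(codes_demo.tolist())
```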

### DATA SPLITTING

Instead of using the `train_test_split` function from the scikit-learn package, we write our own function to split the data. We call the function `split`. It takes two arguments: `ds` for the DataFrame object and `r` for the fraction of observations assigned to the test set. We set the default at $0.3$ or $30$% of the observations.

Inside the function we create a variable, `test_ind`, that holds one Boolean value per observation. Each value is `True` with probability `r` and `False` otherwise, so the size of the test set is only approximately `r` times the number of observations. The effect is seen in the two code cells below.

The observations marked `True` form the test set and those marked `False` form the training set.

In [None]:
def split(ds, r=0.3):
  test_ind = np.random.rand(len(ds)) < r

  return ds[~test_ind], ds[test_ind]

In [None]:
# Splitting the data
np.random.seed(12)
penguins_train, penguins_test = split(penguins)
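To see how the Boolean mask behaves, the sketch below repeats the `split` function on a made-up DataFrame of $1000$ rows. Because each row is assigned at random, the test fraction is close to, but not exactly, `r`. This simple split is also not stratified by class.

```python
import numpy as np
import pandas as pd

def split(ds, r=0.3):
    test_ind = np.random.rand(len(ds)) < r  # True with probability r
    return ds[~test_ind], ds[test_ind]      # (training set, test set)

np.random.seed(12)
demo = pd.DataFrame({'value': range(1000)})  # hypothetical data set
demo_train, demo_test = split(demo)

print(len(demo_train), len(demo_test))
print(len(demo_test) / len(demo))  # close to 0.3, but not exactly 0.3
```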

We investigate the number of observations in each set using the `shape` attribute of each DataFrame object to ensure that our split was executed correctly.

In [None]:
penguins_train.shape

In [None]:
penguins_test.shape

We also need to make sure that we have proper representation of the target classes in each set.

In [None]:
px.bar(
    penguins_train,
    x='species',
    title='Training set target class frequency',
    labels={
        'species':'Species'
    }
).update_xaxes(
    type='category'
)

In [None]:
px.bar(
    penguins_test,
    x='species',
    title='Test set target class frequency',
    labels={
        'species':'Species'
    }
).update_xaxes(
    type='category'
)

Finally, we need to transform the pandas dataframe objects to TensorFlow dataset objects using the `pd_dataframe_to_tf_dataset` function.

In [None]:
penguins_train = tfdf.keras.pd_dataframe_to_tf_dataset(
    penguins_train,
    label='species'
)

penguins_test = tfdf.keras.pd_dataframe_to_tf_dataset(
    penguins_test,
    label='species'
)

### CREATING A DECISION FOREST MODEL

Below, we instantiate a random forest model, with default hyperparameter values.

In [None]:
rf_model = tfdf.keras.RandomForestModel()

We set accuracy as metric and compile the model. This step is only required if we want to specify metrics.

In [None]:
rf_model.compile(
    metrics=['accuracy']
)

### TRAINING THE MODEL

Now we fit the data to the model.

In [None]:
rf_model.fit(
    x=penguins_train
)

The `summary` method prints information about the model.

In [None]:
rf_model.summary()

### EVALUATING THE MODEL

The `evaluate` method provides a loss and an accuracy value. Below, we evaluate the model using the test set.

In [None]:
evaluation = rf_model.evaluate(
    penguins_test,
    return_dict=True # Returning a dictionary of metrics
)

We can already see the loss and the accuracy. Below, we use the `keys` and the `values` methods to return the parts of the metrics dictionary.

In [None]:
evaluation.keys() # Metric dictionary keys

In [None]:
evaluation.values() # Metric values

### VISUALISING THE MODEL

As mentioned, decision trees and random forests are interpretable. Plotting the model shows us how information was gained.

In [None]:
tfdf.model_plotter.plot_model_in_colab(
    rf_model,
    tree_idx=0,
    max_depth=4
)

The model chose `flipper_length_mm` as the first node and split on whether the length was equal to or more than $207$. The first four decision node layers are shown.

We can inspect the feature variable importance using the `variable_importances` method.

In [None]:
rf_model.make_inspector().variable_importances()

The model can also provide a self-assessment. The `evaluation` method returns the number of samples and the accuracy.

In [None]:
rf_model.make_inspector().evaluation()

## CONCLUSION

Random forests and other ensemble techniques using decision trees have gained a lot of attention. They are easier to interpret than deep neural networks and, particularly on tabular data, often perform as well or better.