# RANDOM FOREST IN TENSORFLOW
## IRIS DATASET

Source: https://www.kaggle.com/thomascolthurst/tensorforest-on-iris/data       
    

Hi, everyone. I'm Thomas Colthurst, and I work on a random forest implementation called TensorForest. As the name suggests, TensorForest is built on top of TensorFlow, which makes it easy to use all the goodies that TensorFlow provides (feature preprocessing, distributed training, etc.). You can find out more about TensorForest by reading our NIPS 2017 paper.

This is a simple example to show you how to use TensorForest on the Iris classification task. (TensorForest also works for regression problems, but this won't cover that.)

First, let's load the data:

In [2]:
import tensorflow as tf
import numpy as np
import pandas as pd
import math
import os
from glob import glob

tf.logging.set_verbosity(tf.logging.DEBUG)

all_data = pd.read_csv("iris.csv")

train = all_data[::2]
test = all_data[1::2]

print("Training = ")
print(train[:5])

print("Test = ")
print(test[:5])

Training = 
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
6   7            4.6           3.4            1.4           0.3  Iris-setosa
8   9            4.4           2.9            1.4           0.2  Iris-setosa
Test = 
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
1   2            4.9           3.0            1.4           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
5   6            5.4           3.9            1.7           0.4  Iris-setosa
7   8            5.0           3.4            1.5           0.2  Iris-setosa
9  10            4.9           3.1            1.5           0.1  Iris-setosa


I split the data into halves for training and test. Next, let's split the training data into features and labels:

In [3]:
x_train = train.drop(['Species', 'Id'], axis=1).astype(np.float32).values
label_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
y_train = train['Species'].map(label_map).astype(np.float32).values

print("Training features =")
print(x_train[:5])
print("Training labels =")
print(y_train[:5])

Training features =
[[5.1 3.5 1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [5.  3.6 1.4 0.2]
 [4.6 3.4 1.4 0.3]
 [4.4 2.9 1.4 0.2]]
Training labels =
[0. 0. 0. 0. 0.]


Great, now let's configure the forest. We need to specify how many classes there are, how many features, how many trees we want, and the maximum size of those trees. TensorForest will intelligently set a bunch of other training parameters based on those values. In this example, we choose to override one of those -- we say we want to split any node once it has seen 50 examples.


In [8]:
params = tf.contrib.tensor_forest.python.tensor_forest.ForestHParams(
  num_classes=3, num_features=4, num_trees=50, max_nodes=1000, split_after_samples=50).fill()

print("Params =")
print(vars(params))

Params =
{'num_trees': 50, 'max_nodes': 1000, 'bagging_fraction': 1.0, 'feature_bagging_fraction': 1.0, 'num_splits_to_consider': 10, 'max_fertile_nodes': 0, 'split_after_samples': 50, 'valid_leaf_threshold': 1, 'dominate_method': 'bootstrap', 'dominate_fraction': 0.99, 'model_name': 'all_dense', 'split_finish_name': 'basic', 'split_pruning_name': 'none', 'collate_examples': False, 'checkpoint_stats': False, 'use_running_stats_method': False, 'initialize_average_splits': False, 'inference_tree_paths': False, 'param_file': None, 'split_name': 'less_or_equal', 'early_finish_check_every_samples': 0, 'prune_every_samples': 0, 'num_classes': 3, 'num_features': 4, 'bagged_num_features': 4, 'bagged_features': None, 'regression': False, 'num_outputs': 1, 'num_output_columns': 4, 'base_random_seed': 0, 'leaf_model_type': 0, 'stats_model_type': 0, 'finish_type': 0, 'pruning_type': 0, 'split_type': 0}


Everything is set up, so it's time to train the forest:

In [13]:
# Remove previous checkpoints so that we can re-run this step if necessary.
for f in glob("./*"):
    os.remove(f)
classifier = tf.contrib.tensor_forest.client.random_forest.TensorForestEstimator(
    params, model_dir="./")
classifier.fit(x=x_train, y=y_train)

PermissionError: [WinError 5] Отказано в доступе: '/$Recycle.Bin'

While the forest is training, the negative of the current average tree size is reported as the loss. (This is sort of a hack to get around the fact that random forest training isn't loss-based in the way that TensorFlow expects.)

Now let's see how well this forest does on the test data:

In [10]:
x_test = test.drop(['Species', 'Id'], axis=1).astype(np.float32).values
y_test = test['Species'].map(label_map).astype(np.float32).values

print("x_test = ")
print(x_test[:5])

print("test labels =")
print(y_test[:5])

y_out = list(classifier.predict(x=x_test))

print(y_out[:5])

x_test = 
[[4.9 3.  1.4 0.2]
 [4.6 3.1 1.5 0.2]
 [5.4 3.9 1.7 0.4]
 [5.  3.4 1.5 0.2]
 [4.9 3.1 1.5 0.1]]
test labels =
[0. 0. 0. 0. 0.]


NameError: name 'classifier' is not defined

The output comes both as soft (probabilities) and hard (single class) predictions, so there are a bunch of ways you can slice and dice them. Here are a few:

In [11]:
n = len(y_test)
out_soft = list(y['classes'] for y in y_out)
out_hard = list(y['probabilities'] for y in y_out)

print("Soft predictions:")
print(out_soft[:5])
print("Hard predictions:")
print(out_hard[:5])

soft_zipped = zip(y_test, out_soft)
hard_zipped = list(zip(y_test, out_hard))

num_correct = sum(1 for p in hard_zipped if p[0] == p[1])
print("Accuracy = %s" % (num_correct / n))

NameError: name 'y_out' is not defined