<a href="https://colab.research.google.com/github/Kerriea-star/TensorFlow-Decision-Forests/blob/main/Learning_Decision_Forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Build, train and evaluate a model with Decision Forests**

*Introduction*

Installing TF-DF

Wurlitzer is needed to display the detailed training logs in Colab (when the model is created with verbose=2).

In [22]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

In [23]:
from IPython.core.magic import register_line_magic
from IPython.display import Javascript
from IPython.display import display as ipy_display

# Some of the model training logs can cover the full
# screen if not compressed to a smaller viewport.
# This magic allows setting a max height for a cell
@register_line_magic
def set_cell_height(size):
  ipy_display(
      Javascript("google.colab.output.setIframeHeight(0, true, {maxHeight: " + str(size) + "})"))

In [24]:
# check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

Found TensorFlow Decision Forests v1.5.0


**Training a Random Forest Model**

In this section, we train, evaluate, analyze and export a multi-class classification Random Forest trained on the Palmer's Penguins dataset.

Let's download the dataset as a CSV file (header included) and load it into a pandas DataFrame:

In [25]:
# Download the Dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# load the dataset into a Pandas DataFrame
dataset_df = pd.read_csv("/tmp/penguins.csv")

# Display the first 3 examples
dataset_df.head(3)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007


The dataset contains a mix of numerical (e.g. bill_depth_mm) and categorical (e.g. island) features, some of which have missing values. TF-DF supports all of these natively (unlike NN-based models), so there is no need for preprocessing such as one-hot encoding, normalization or extra is_present features.

Labels are a bit different: Keras metrics expect integers. The label (species) is stored as a string, so let's convert it into an integer.

In [26]:
# Encode the categorical labels as integers
# Details:
# This is necessary if your classification label is represented as a
# string, since Keras expects integer classification labels.
# When using 'pd_dataframe_to_tf_dataset' (see below), this step can be skipped.

# Name of the column.
label = "species"

classes = dataset_df[label].unique().tolist()
print(f"Label classes: {classes}")

dataset_df[label] = dataset_df[label].map(classes.index)

Label classes: ['Adelie', 'Gentoo', 'Chinstrap']
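After this step the label column holds integers, so anything the model reports about a class is an index into `classes`. The sketch below reproduces the encoding on a toy list in plain Python (the list `species` is a made-up stand-in for `dataset_df["species"]`, not data from the notebook) and shows how keeping `classes` around lets you decode indices back to names.

```python
# Toy label column; in the notebook this would be dataset_df["species"].
species = ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"]

# Unique values in order of first appearance, like Series.unique().tolist().
classes = list(dict.fromkeys(species))

# Encode: each string becomes its position in `classes`.
encoded = [classes.index(name) for name in species]

# Decode: look each integer back up in `classes`.
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
```

Because the integer assigned to a class depends on the order in which the classes first appear, the mapping should be stored alongside the model rather than recomputed from a different sample of the data.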


Next, split the dataset into training and testing sets.

In [27]:
# Split the dataset into a training and testing dataset
def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas DataFrame into a training part and a testing part."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)
))

217 examples in training, 127 examples for testing.
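Note that `split_dataset` draws an independent random number for each row, so the split sizes only approximate the 70/30 ratio and change from run to run. If you need the same split every time, one option is to draw the mask from a seeded generator. The following is a minimal sketch, not part of the original notebook; the helper name `split_mask` and the seed value are assumptions chosen for illustration.

```python
import numpy as np

def split_mask(num_examples, test_ratio=0.30, seed=42):
    """Returns a boolean mask that is True for rows assigned to the test set."""
    # A seeded generator makes the draw repeatable across runs.
    rng = np.random.default_rng(seed)
    return rng.random(num_examples) < test_ratio

mask = split_mask(344)
mask_again = split_mask(344)

# The same seed yields the same split.
print("Identical masks:", bool((mask == mask_again).all()))
print(int((~mask).sum()), "examples in training,", int(mask.sum()), "examples for testing.")

# With a DataFrame, the mask would be applied as:
#   train_ds_pd, test_ds_pd = dataset_df[~mask], dataset_df[mask]
```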


Finally, convert the pandas DataFrames (pd.DataFrame) into TensorFlow datasets (tf.data.Dataset).

In [28]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

Recall that pd_dataframe_to_tf_dataset converts string labels to integers if necessary.

If you want to create the tf.data.Dataset yourself, there are a couple of things to remember:


*   The learning algorithms work with a one-epoch dataset and without shuffling.
*   The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.
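To see why the batch size does not change what the learner receives, consider that the training examples are first collected in full (the training log reports "Collect training examples") and only then used to build the trees. The plain-Python sketch below uses a hypothetical `batches` helper and integer stand-ins for the 217 training rows; it is an illustration of the idea, not TF-DF code. The value 1000 is included because it is, to my knowledge, the default batch size of `pd_dataframe_to_tf_dataset`, which would be consistent with the log line "Number of batches: 1".

```python
import math

def batches(examples, batch_size):
    """Yields consecutive, unshuffled slices of `examples` (one epoch)."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = list(range(217))  # stand-in for the 217 training rows

for batch_size in (1, 64, 1000):
    # Reading every batch once and concatenating recovers the full dataset.
    seen = [x for batch in batches(examples, batch_size) for x in batch]
    assert seen == examples  # same examples, same order, whatever the batch size
    num_batches = math.ceil(len(examples) / batch_size)
    print(f"batch_size={batch_size}: {num_batches} batch(es)")
```

Only the number of read operations differs: a batch size of 1 needs 217 reads, whereas a batch size of 1000 needs a single one, which is why very small batches can slow down dataset reading.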



**Train the model**

In [29]:
%set_cell_height 300

# Specify the model
model_1 = tfdf.keras.RandomForestModel(verbose=2)

# Train the model
model_1.fit(train_ds)

<IPython.core.display.Javascript object>

Use 2 thread(s) for training
Use /tmp/tmpj6o76ukf as temporary training directory
Reading training dataset...
Training tensor examples:
Features: {'island': <tf.Tensor 'data:0' shape=(None,) dtype=string>, 'bill_length_mm': <tf.Tensor 'data_1:0' shape=(None,) dtype=float64>, 'bill_depth_mm': <tf.Tensor 'data_2:0' shape=(None,) dtype=float64>, 'flipper_length_mm': <tf.Tensor 'data_3:0' shape=(None,) dtype=float64>, 'body_mass_g': <tf.Tensor 'data_4:0' shape=(None,) dtype=float64>, 'sex': <tf.Tensor 'data_5:0' shape=(None,) dtype=string>, 'year': <tf.Tensor 'data_6:0' shape=(None,) dtype=int64>}
Label: Tensor("data_7:0", shape=(None,), dtype=int64)
Weights: None
Normalized tensor features:
 {'island': SemanticTensor(semantic=<Semantic.CATEGORICAL: 2>, tensor=<tf.Tensor 'data:0' shape=(None,) dtype=string>), 'bill_length_mm': SemanticTensor(semantic=<Semantic.NUMERICAL: 1>, tensor=<tf.Tensor 'Cast:0' shape=(None,) dtype=float32>), 'bill_depth_mm': SemanticTensor(semantic=<Semantic.NUMERIC

[INFO 23-08-11 12:26:12.8926 UTC kernel.cc:773] Start Yggdrasil model training
[INFO 23-08-11 12:26:12.8927 UTC kernel.cc:774] Collect training examples
[INFO 23-08-11 12:26:12.8927 UTC kernel.cc:787] Dataspec guide:
column_guides {
  column_name_pattern: "^__LABEL$"
  type: CATEGORICAL
  categorial {
    min_vocab_frequency: 0
    max_vocab_count: -1
  }
}
default_column_guide {
  categorial {
    max_vocab_count: 2000
  }
  discretized_numerical {
    maximum_num_bins: 255
  }
}
ignore_columns_without_guides: false
detect_numerical_as_discretized_numerical: false

[INFO 23-08-11 12:26:12.8931 UTC kernel.cc:393] Number of batches: 1
[INFO 23-08-11 12:26:12.8931 UTC kernel.cc:394] Number of examples: 217
[INFO 23-08-11 12:26:12.8932 UTC kernel.cc:794] Training dataset:
Number of records: 217
Number of columns: 8

Number of columns by type:
	NUMERICAL: 5 (62.5%)
	CATEGORICAL: 3 (37.5%)

Columns:

NUMERICAL: 5 (62.5%)
	1: "bill_depth_mm" NUMERICAL num-nas:2 (0.921659%) mean:17.0428 min:1

Model trained in 0:00:00.250803
Compiling model...
Model compiled.


<keras.src.callbacks.History at 0x7a5c64286050>