##### Copyright 2022 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Build, train and evaluate models with TensorFlow Decision Forests

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/decision_forests/tutorials/beginner_colab"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/decision-forests/blob/main/documentation/tutorials/beginner_colab.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/decision-forests/blob/main/documentation/tutorials/beginner_colab.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/decision-forests/documentation/tutorials/beginner_colab.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>


## Introduction

Decision Forests (DF) are a family of Machine Learning algorithms for
supervised classification, regression and ranking. As the name suggests, DFs use
decision trees as a building block. Today, the two most popular DF training
algorithms are [Random Forests](https://en.wikipedia.org/wiki/Random_forest) and
[Gradient Boosted Decision Trees](https://en.wikipedia.org/wiki/Gradient_boosting).

TensorFlow Decision Forests (TF-DF) is a library for the training,
evaluation, interpretation and inference of Decision Forest models.

In this tutorial, you will learn how to:

1.  Train a multi-class classification Random Forest on a dataset containing numerical, categorical and missing features.
1.  Evaluate the model on a test dataset.
1.  Prepare the model for
    [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).
1.  Examine the overall structure of the model and the importance of each feature.
1.  Re-train the model with a different learning algorithm (Gradient Boosted Decision Trees).
1.  Use a different set of input features.
1.  Change the hyperparameters of the model.
1.  Preprocess the features.
1.  Train a model for regression.

Detailed documentation is available in the [user manual](https://github.com/tensorflow/decision-forests/tree/main/documentation).
The [example directory](https://github.com/tensorflow/decision-forests/tree/main/examples) contains other end-to-end examples.

## Installing TensorFlow Decision Forests

Install TF-DF by running the following cell.

In [1]:
!pip install tensorflow_decision_forests==1.8.1

Collecting tensorflow~=2.15.0 (from tensorflow_decision_forests==1.8.1)
  Using cached tensorflow-2.15.1-cp310-cp310-win_amd64.whl.metadata (3.4 kB)
Collecting tensorflow-intel==2.15.1 (from tensorflow~=2.15.0->tensorflow_decision_forests==1.8.1)
  Using cached tensorflow_intel-2.15.1-cp310-cp310-win_amd64.whl.metadata (4.9 kB)
Collecting tensorboard<2.16,>=2.15 (from tensorflow-intel==2.15.1->tensorflow~=2.15.0->tensorflow_decision_forests==1.8.1)
  Using cached tensorboard-2.15.2-py3-none-any.whl.metadata (1.7 kB)
Collecting keras<2.16,>=2.15.0 (from tensorflow-intel==2.15.1->tensorflow~=2.15.0->tensorflow_decision_forests==1.8.1)
  Using cached keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Using cached tensorflow-2.15.1-cp310-cp310-win_amd64.whl (2.1 kB)
Using cached tensorflow_intel-2.15.1-cp310-cp310-win_amd64.whl (300.9 MB)
Using cached keras-2.15.0-py3-none-any.whl (1.7 MB)
Using cached tensorboard-2.15.2-py3-none-any.whl (5.5 MB)
Installing collected packages: keras, tensorbo

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.16.0 requires tensorflow<2.17,>=2.16, but you have tensorflow 2.15.1 which is incompatible.


In [2]:
!pip install tf_keras==2.15.0

Collecting tf_keras==2.15.0
  Downloading tf_keras-2.15.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.15.0-py3-none-any.whl (1.7 MB)

In [4]:
!pip install tensorflow_decision_forests
# TF-DF requires Tensorflow < 2.15 or tf_keras
!pip install tf_keras








Using cached tensorflow_decision_forests-1.8.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.3 MB)


Installing collected packages: wurlitzer, tensorflow_decision_forests


Successfully installed tensorflow_decision_forests-1.8.1 wurlitzer-3.0.3




[Wurlitzer](https://pypi.org/project/wurlitzer/) is needed to display the detailed training logs in Colabs (when using `verbose=2` in the model constructor).

In [None]:
!pip install wurlitzer

## Importing libraries

In [None]:
!pip install numpy
!pip install pandas
# Note: `math` is part of the Python standard library and needs no installation.

In [5]:
import os
# Keep using Keras 2
os.environ['TF_USE_LEGACY_KERAS'] = '1'
import tensorflow_decision_forests as tfdf
import numpy as np
import pandas as pd
import tensorflow as tf
import tf_keras
import math






NotFoundError: C:\Users\robot\anaconda3\envs\tf2\lib\site-packages\tensorflow_decision_forests\tensorflow\ops\inference\inference.so not found

In [3]:
# Check the version of TensorFlow Decision Forests
print("TensorFlow Decision Forests v" + tfdf.__version__)
print("Keras v" + tf_keras.__version__)

NameError: name 'tfdf' is not defined

The hidden code cell limits the output height in Colab.


In [None]:
#@title

from IPython.core.magic import register_line_magic
from IPython.display import Javascript
from IPython.display import display as ipy_display

# Some of the model training logs can cover the full
# screen if not compressed to a smaller viewport.
# This magic allows setting a max height for a cell.
@register_line_magic
def set_cell_height(size):
  ipy_display(
      Javascript("google.colab.output.setIframeHeight(0, true, {maxHeight: " +
                 str(size) + "})"))

In [None]:
# Check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

## Training a Random Forest model

In this section, we train, evaluate, analyse and export a multi-class classification Random Forest trained on the [Palmer's Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset.

<center>
<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" width="150"/></center>

**Note:** The dataset was exported to a csv file without pre-processing: `library(palmerpenguins); write.csv(penguins, file="penguins.csv", quote=F, row.names=F)`. 

### Load the dataset and convert it into a tf.data.Dataset

This dataset is very small (344 examples) and stored as a .csv-like file. Therefore, use Pandas to load it.

**Note:** Pandas is practical as you don't have to type in the names of the input features to load them. For larger datasets (>1M examples), using the
[TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) to read the files may be better suited.

Let's download the csv file and load it:

In [None]:
# Download the dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load a dataset into a Pandas Dataframe.
dataset_df = pd.read_csv("/tmp/penguins.csv")

# Display the first 3 examples.
dataset_df.head(3)

The dataset contains a mix of numerical (e.g. `bill_depth_mm`) and categorical
(e.g. `island`) features, some of them with missing values. TF-DF supports all of these natively (unlike NN-based models), so there is no need for preprocessing such as one-hot encoding, normalization or an extra `is_present` feature.

Labels are a bit different: Keras metrics expect integers. The label (`species`) is stored as a string, so let's convert it into an integer.

In [None]:
# Encode the categorical labels as integers.
#
# Details:
# This stage is necessary if your classification label is represented as a
# string since Keras expects integer classification labels.
# When using `pd_dataframe_to_tf_dataset` (see below), this step can be skipped.

# Name of the label column.
label = "species"

classes = dataset_df[label].unique().tolist()
print(f"Label classes: {classes}")

dataset_df[label] = dataset_df[label].map(classes.index)

Next split the dataset into training and testing:

In [None]:
# Split the dataset into a training and a testing dataset.

def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas dataframe in two."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]


train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
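
The split above draws fresh random numbers on every run, so the number of training and testing examples changes between executions. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for your use case, is shown below. It uses only the standard library and works on row indices; `split_indices`, the seed value and the 344-row count are illustrative assumptions, not part of TF-DF.

```python
import random


def split_indices(num_examples, test_ratio=0.30, seed=1234):
    """Returns (train, test) row indices; the same seed gives the same split."""
    rng = random.Random(seed)
    # Each row is independently assigned to the test set with
    # probability `test_ratio`, as in `split_dataset` above.
    is_test = [rng.random() < test_ratio for _ in range(num_examples)]
    train = [i for i, flag in enumerate(is_test) if not flag]
    test = [i for i, flag in enumerate(is_test) if flag]
    return train, test


# Assumed dataset size, for illustration only.
train_idx, test_idx = split_indices(344)
repeat_train_idx, repeat_test_idx = split_indices(344)

print(f"{len(train_idx)} examples in training, {len(test_idx)} examples for testing.")
print("Same split on second call:", train_idx == repeat_train_idx)
```

With a Pandas dataframe, the index lists could then be applied with `dataset_df.iloc[train_idx]` and `dataset_df.iloc[test_idx]`.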

And finally, convert the pandas dataframes (`pd.DataFrame`) into TensorFlow datasets (`tf.data.Dataset`):

In [None]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

**Notes:** Recall that `pd_dataframe_to_tf_dataset` converts string labels to integers if necessary.

If you want to create the `tf.data.Dataset` yourself, there are a couple of things to remember:

- The learning algorithms work with a one-epoch dataset and without shuffling.
- The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.


### Train the model

In [None]:
%set_cell_height 300

# Specify the model.
model_1 = tfdf.keras.RandomForestModel(verbose=2)

# Train the model.
model_1.fit(train_ds)

Training dataset read in 0:00:03.671866. Found 248 examples.


Training model...


Standard output detected as not visible to the user e.g. running in a notebook. Creating a training log redirection. If training gets stuck, try calling tfdf.keras.set_training_logs_redirection(False).


[INFO 24-01-31 12:18:17.7858 UTC kernel.cc:771] Start Yggdrasil model training
[INFO 24-01-31 12:18:17.7858 UTC kernel.cc:772] Collect training examples
[INFO 24-01-31 12:18:17.7859 UTC kernel.cc:785] Dataspec guide:
column_guides {
  column_name_pattern: "^__LABEL$"
  type: CATEGORICAL
  categorial {
    min_vocab_frequency: 0
    max_vocab_count: -1
  }
}
default_column_guide {
  categorial {
    max_vocab_count: 2000
  }
  discretized_numerical {
    maximum_num_bins: 255
  }
}
ignore_columns_without_guides: false
detect_numerical_as_discretized_numerical: false

[INFO 24-01-31 12:18:17.7862 UTC kernel.cc:391] Number of batches: 1
[INFO 24-01-31 12:18:17.7862 UTC kernel.cc:392] Number of examples: 248
[INFO 24-01-31 12:18:17.7863 UTC kernel.cc:792] Training dataset:
Number of records: 248
Number of columns: 8

Number of columns by type:
	NUMERICAL: 5 (62.5%)
	CATEGORICAL: 3 (37.5%)

Columns:

NUMERICAL: 5 (62.5%)
	1: "bill_depth_mm" NUMERICAL num-nas:1 (0.403226%) mean:17.1433 min:1

[INFO 24-01-31 12:18:17.7865 UTC kernel.cc:822] Training config:
learner: "RANDOM_FOREST"
features: "^bill_depth_mm$"
features: "^bill_length_mm$"
features: "^body_mass_g$"
features: "^flipper_length_mm$"
features: "^island$"
features: "^sex$"
features: "^year$"
label: "^__LABEL$"
task: CLASSIFICATION
random_seed: 123456
metadata {
  framework: "TF Keras"
}
pure_serving_model: false
[yggdrasil_decision_forests.model.random_forest.proto.random_forest_config] {
  num_trees: 300
  decision_tree {
    max_depth: 16
    min_examples: 5
    in_split_min_examples_check: true
    keep_non_leaf_label_distribution: true
    num_candidate_attributes: 0
    missing_value_policy: GLOBAL_IMPUTATION
    allow_na_conditions: false
    categorical_set_greedy_forward {
      sampling: 0.1
      max_num_items: -1
      min_item_frequency: 1
    }
    growing_strategy_local {
    }
    categorical {
      cart {
      }
    }
    axis_aligned_split {
    }
    internal {
      sorting_strategy: PRESORTED


[INFO 24-01-31 12:18:17.7928 UTC random_forest.cc:802] Training of tree  1/300 (tree index:1) done accuracy:0.952941 logloss:1.69617
[INFO 24-01-31 12:18:17.7930 UTC random_forest.cc:802] Training of tree  11/300 (tree index:10) done accuracy:0.961864 logloss:0.671715
[INFO 24-01-31 12:18:17.7932 UTC random_forest.cc:802] Training of tree  21/300 (tree index:20) done accuracy:0.967078 logloss:0.503792
[INFO 24-01-31 12:18:17.7934 UTC random_forest.cc:802] Training of tree  31/300 (tree index:33) done accuracy:0.971774 logloss:0.219305
[INFO 24-01-31 12:18:17.7936 UTC random_forest.cc:802] Training of tree  42/300 (tree index:37) done accuracy:0.967742 logloss:0.0889185


[INFO 24-01-31 12:18:17.7937 UTC random_forest.cc:802] Training of tree  53/300 (tree index:54) done accuracy:0.967742 logloss:0.0857322
[INFO 24-01-31 12:18:17.7941 UTC random_forest.cc:802] Training of tree  64/300 (tree index:64) done accuracy:0.971774 logloss:0.0816141
[INFO 24-01-31 12:18:17.7944 UTC random_forest.cc:802] Training of tree  75/300 (tree index:71) done accuracy:0.971774 logloss:0.0799052
[INFO 24-01-31 12:18:17.7948 UTC random_forest.cc:802] Training of tree  87/300 (tree index:86) done accuracy:0.971774 logloss:0.0774856


[INFO 24-01-31 12:18:17.7951 UTC random_forest.cc:802] Training of tree  97/300 (tree index:96) done accuracy:0.971774 logloss:0.0774835
[INFO 24-01-31 12:18:17.7954 UTC random_forest.cc:802] Training of tree  107/300 (tree index:108) done accuracy:0.975806 logloss:0.0744989
[INFO 24-01-31 12:18:17.7957 UTC random_forest.cc:802] Training of tree  117/300 (tree index:116) done accuracy:0.975806 logloss:0.0757725


[INFO 24-01-31 12:18:17.7960 UTC random_forest.cc:802] Training of tree  127/300 (tree index:125) done accuracy:0.979839 logloss:0.0765672
[INFO 24-01-31 12:18:17.7964 UTC random_forest.cc:802] Training of tree  138/300 (tree index:138) done accuracy:0.979839 logloss:0.0770547
[INFO 24-01-31 12:18:17.7968 UTC random_forest.cc:802] Training of tree  148/300 (tree index:149) done accuracy:0.975806 logloss:0.0774622


[INFO 24-01-31 12:18:17.7970 UTC random_forest.cc:802] Training of tree  159/300 (tree index:158) done accuracy:0.975806 logloss:0.0782705
[INFO 24-01-31 12:18:17.7974 UTC random_forest.cc:802] Training of tree  169/300 (tree index:169) done accuracy:0.975806 logloss:0.0806146
[INFO 24-01-31 12:18:17.7978 UTC random_forest.cc:802] Training of tree  180/300 (tree index:178) done accuracy:0.975806 logloss:0.0809432


[INFO 24-01-31 12:18:17.7983 UTC random_forest.cc:802] Training of tree  196/300 (tree index:193) done accuracy:0.975806 logloss:0.0817119
[INFO 24-01-31 12:18:17.7987 UTC random_forest.cc:802] Training of tree  206/300 (tree index:204) done accuracy:0.975806 logloss:0.0811454
[INFO 24-01-31 12:18:17.7990 UTC random_forest.cc:802] Training of tree  216/300 (tree index:216) done accuracy:0.975806 logloss:0.0821296
[INFO 24-01-31 12:18:17.7993 UTC random_forest.cc:802] Training of tree  226/300 (tree index:223) done accuracy:0.975806 logloss:0.0817466


[INFO 24-01-31 12:18:17.7996 UTC random_forest.cc:802] Training of tree  237/300 (tree index:237) done accuracy:0.975806 logloss:0.0823455
[INFO 24-01-31 12:18:17.7999 UTC random_forest.cc:802] Training of tree  248/300 (tree index:248) done accuracy:0.975806 logloss:0.0824368
[INFO 24-01-31 12:18:17.8003 UTC random_forest.cc:802] Training of tree  260/300 (tree index:260) done accuracy:0.971774 logloss:0.0821114
[INFO 24-01-31 12:18:17.8006 UTC random_forest.cc:802] Training of tree  270/300 (tree index:271) done accuracy:0.971774 logloss:0.0829068


[INFO 24-01-31 12:18:17.8010 UTC random_forest.cc:802] Training of tree  281/300 (tree index:279) done accuracy:0.975806 logloss:0.0831618
[INFO 24-01-31 12:18:17.8013 UTC random_forest.cc:802] Training of tree  291/300 (tree index:290) done accuracy:0.975806 logloss:0.0826058
[INFO 24-01-31 12:18:17.8017 UTC random_forest.cc:802] Training of tree  300/300 (tree index:299) done accuracy:0.975806 logloss:0.0827604


[INFO 24-01-31 12:18:17.8030 UTC random_forest.cc:882] Final OOB metrics: accuracy:0.975806 logloss:0.0827604


[INFO 24-01-31 12:18:17.8037 UTC kernel.cc:919] Export model in log directory: /tmpfs/tmp/tmp3eddfnse with prefix dd78f89c05734ab8


[INFO 24-01-31 12:18:17.8072 UTC kernel.cc:937] Save model in resources


[INFO 24-01-31 12:18:17.8101 UTC abstract_model.cc:881] Model self evaluation:
Number of predictions (without weights): 248
Number of predictions (with weights): 248
Task: CLASSIFICATION
Label: __LABEL

Accuracy: 0.975806  CI95[W][0.95281 0.989412]
LogLoss: : 0.0827604
ErrorRate: : 0.0241935

Default Accuracy: : 0.451613
Default LogLoss: : 1.04913
Default ErrorRate: : 0.548387

Confusion Table:
truth\prediction
     1   2   3
1  110   1   1
2    0  86   0
3    4   0  46
Total: 248




[INFO 24-01-31 12:18:17.8203 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp3eddfnse/model/ with prefix dd78f89c05734ab8


[INFO 24-01-31 12:18:17.8331 UTC decision_forest.cc:660] Model loaded with 300 root(s), 4152 node(s), and 7 input feature(s).
[INFO 24-01-31 12:18:17.8332 UTC abstract_model.cc:1344] Engine "RandomForestGeneric" built
[INFO 24-01-31 12:18:17.8332 UTC kernel.cc:1061] Use fast generic engine


Model trained in 0:00:00.055359


Compiling model...


Model compiled.


<tf_keras.src.callbacks.History at 0x7fe5fc1edc10>

### Remarks

-   No input features are specified. Therefore, all the columns will be used as
    input features except for the label. The features used by the model are shown
    in the training logs and in the `model.summary()`.
-   DFs consume natively numerical, categorical, categorical-set features and
    missing-values. Numerical features do not need to be normalized. Categorical
    string values do not need to be encoded in a dictionary.
-   No training hyper-parameters are specified. Therefore the default
    hyper-parameters will be used. Default hyper-parameters provide
    reasonable results in most situations.
-   Calling `compile` on the model before the `fit` is optional. Compile can be
    used to provide extra evaluation metrics.
-   Training algorithms do not need validation datasets. If a validation dataset
    is provided, it will only be used to show metrics.
-   Tweak the `verbose` argument to `RandomForestModel` to control the amount of
    displayed training logs. Set `verbose=0` to hide most of the logs. Set
    `verbose=2` to show all the logs.

**Note:** A *Categorical-Set* feature is composed of a set of categorical values (while a *Categorical* is only one value). More details and examples are given later.

## Evaluate the model

Let's evaluate our model on the test dataset.

In [None]:
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")




loss: 0.0000
accuracy: 0.9688


**Remark:** The test accuracy is close to the Out-of-bag accuracy
shown in the training logs.

See the **Model Self Evaluation** section below for more evaluation methods.
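
To make the comparison with the out-of-bag numbers concrete, the headline metrics of the training logs can be re-derived by hand from the confusion table printed there. The sketch below copies that table and recomputes the accuracy. It also recomputes the "Default" rows under the assumption that they describe a model that always predicts the label prior; this interpretation is inferred from the numbers, not taken from the TF-DF documentation.

```python
import math

# Confusion table copied from the training logs
# (rows: true class, columns: predicted class).
confusion = [
    [110, 1, 1],
    [0, 86, 0],
    [4, 0, 46],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Assumed meaning of the "Default" metrics: always predict the class prior.
class_counts = [sum(row) for row in confusion]
default_accuracy = max(class_counts) / total
default_logloss = -sum(
    (count / total) * math.log(count / total) for count in class_counts)

print(f"accuracy: {accuracy:.6f}")
print(f"default accuracy: {default_accuracy:.6f}")
print(f"default logloss: {default_logloss:.5f}")
```

The recomputed values agree with the `Accuracy`, `Default Accuracy` and `Default LogLoss` lines of the logs to the printed precision.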

## Prepare this model for TensorFlow Serving.

Export the model to the SavedModel format for later re-use, e.g. with
[TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).


In [None]:
model_1.save("/tmp/my_saved_model")

## Plot the model

Plotting a decision tree and following the first branches helps in learning about decision forests. In some cases, plotting a model can even be used for debugging.

Because of the difference in the way they are trained, some models are more interesting to plot than others. Because of the noise injected during training and the depth of the trees, plotting a Random Forest is less informative than plotting a CART or the first tree of a Gradient Boosted Tree.

Nevertheless, let's plot the first tree of our Random Forest model:

In [None]:
tfdf.model_plotter.plot_model_in_colab(model_1, tree_idx=0, max_depth=3)

The root node on the left contains the first condition (`bill_depth_mm >= 16.55`), the number of examples (240) and the label distribution (the red-blue-green bar).

Examples that evaluate to true for `bill_depth_mm >= 16.55` are branched to the green path. The other ones are branched to the red path.

The deeper a node is, the more `pure` it becomes, i.e. its label distribution is biased toward a subset of classes.

**Note:** Hover the mouse over the plot for details.
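
One way to put a number on node "purity" is the Gini impurity of the label distribution: it is 0 when all examples in a node share one class and grows as the classes mix. The sketch below is only an illustration of the idea. The helper `gini_impurity` and the class counts are hypothetical, and Gini is one common purity measure, not necessarily the criterion TF-DF uses to pick splits.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given its per-class example counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Hypothetical label distributions for three classes, for illustration only.
root = [100, 80, 60]
child = [95, 5, 10]
leaf = [40, 0, 0]

for name, counts in [("root", root), ("child", child), ("leaf", leaf)]:
    print(f"{name}: counts={counts} gini={gini_impurity(counts):.3f}")
```

The impurity decreases from the mixed root to the single-class leaf, which matches the shrinking red-blue-green bars seen when following a branch in the plot.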

## Model structure and feature importance

The overall structure of the model is shown with `.summary()`. You will see:

-   **Type**: The learning algorithm used to train the model (`Random Forest` in
    our case).
-   **Task**: The problem solved by the model (`Classification` in our case).
-   **Input Features**: The input features of the model.
-   **Variable Importance**: Different measures of the importance of each
    feature for the model.
-   **Out-of-bag evaluation**: The out-of-bag evaluation of the model. This is a
    cheap and efficient alternative to cross-validation.
-   **Number of {trees, nodes} and other metrics**: Statistics about the
    structure of the decision forests.

**Remark:** The summary's content depends on the learning algorithm (e.g.
Out-of-bag is only available for Random Forest) and the hyper-parameters (e.g.
the *mean-decrease-in-accuracy* variable importance can be disabled in the
hyper-parameters).

In [None]:
%set_cell_height 300
model_1.summary()

_________________________________________________________________


 Layer (type)                Output Shape              Param #   






Total params: 1 (1.00 Byte)


Trainable params: 0 (0.00 Byte)


Non-trainable params: 1 (1.00 Byte)


_________________________________________________________________


Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (7):
	bill_depth_mm
	bill_length_mm
	body_mass_g
	flipper_length_mm
	island
	sex
	year

No weights

Variable Importance: INV_MEAN_MIN_DEPTH:
    1.    "bill_length_mm"  0.484952 ################
    2. "flipper_length_mm"  0.404391 ##########
    3.     "bill_depth_mm"  0.324956 #####
    4.            "island"  0.300894 ###
    5.       "body_mass_g"  0.272408 #
    6.               "sex"  0.245202 
    7.              "year"  0.243988 

Variable Importance: NUM_AS_ROOT:
    1. "flipper_length_mm" 125.000000 ################
    2.    "bill_length_mm" 116.000000 ##############
    3.     "bill_depth_mm" 47.000000 #####
    4.       "body_mass_g"  6.000000 
    5.            "island"  6.000000 

Variable Importance: NUM_NODES:
    1.    "bill_length_mm" 676.000000 ################
    2.     "bill_depth_mm" 397.000000 #########
    3. "flipper_length_mm" 308.000000 #######
    4.       "body_mass_g" 254.000000 

The information in ``summary`` is all available programmatically using the model inspector:

In [None]:
# The input features
model_1.make_inspector().features()

In [None]:
# The feature importances
model_1.make_inspector().variable_importances()

The content of the summary and the inspector depends on the learning algorithm (`tfdf.keras.RandomForestModel` in this case) and its hyper-parameters (e.g. `compute_oob_variable_importances=True` will trigger the computation of Out-of-bag variable importances for the Random Forest learner).

## Model Self Evaluation

During training, TF-DF models can self-evaluate even if no validation dataset is provided to the `fit()` method. The exact logic depends on the model. For example, Random Forest will use Out-of-bag evaluation while Gradient Boosted Trees will use internal train-validation.

**Note:** While this evaluation is computed during training, it is NOT computed on the training dataset and can be used as a rough estimate of model quality.
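
The idea behind out-of-bag evaluation can be sketched in a few lines. In the textbook Random Forest, each tree is trained on a bootstrap sample (drawn with replacement) of the training examples, so some examples are left out of each tree's sample and can be used to evaluate that tree. The code below only simulates the sampling; `bootstrap_oob` is a hypothetical helper, and the actual TF-DF implementation may differ in its details.

```python
import random


def bootstrap_oob(num_examples, rng):
    """Returns (in_bag, out_of_bag) example indices for one tree."""
    # Sample `num_examples` indices with replacement.
    in_bag = [rng.randrange(num_examples) for _ in range(num_examples)]
    sampled = set(in_bag)
    out_of_bag = [i for i in range(num_examples) if i not in sampled]
    return in_bag, out_of_bag


rng = random.Random(0)
num_examples = 248  # Size of the training set in the logs above.
num_trees = 300  # Default number of trees shown in the training config.

# For each example, count the trees for which it is out-of-bag.
oob_counts = [0] * num_examples
for _ in range(num_trees):
    _, out_of_bag = bootstrap_oob(num_examples, rng)
    for index in out_of_bag:
        oob_counts[index] += 1

oob_fraction = sum(oob_counts) / (num_examples * num_trees)
print(f"Average out-of-bag fraction per tree: {oob_fraction:.3f}")
print(f"Fewest trees any example is out-of-bag for: {min(oob_counts)}")
```

Roughly a third of the examples are out-of-bag for any given tree, so with hundreds of trees every training example gets predictions from many trees that never saw it.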

The model self evaluation is available with the inspector's `evaluation()`:

In [None]:
model_1.make_inspector().evaluation()

## Plotting the training logs

The training logs show the quality of the model (e.g. accuracy evaluated on the out-of-bag or validation dataset) according to the number of trees in the model. These logs are helpful to study the balance between model size and model quality.

The logs are available in multiple ways:

1. Displayed during training if the model is created with `verbose=2` (see example above).
1. At the end of the model summary i.e. `model.summary()` (see example above).
1. Programmatically, using the model inspector i.e. `model.make_inspector().training_logs()`.
1. Using [TensorBoard](https://www.tensorflow.org/tensorboard)

Let's try option 3:


In [None]:
%set_cell_height 150
model_1.make_inspector().training_logs()

Let's plot it:

In [None]:
import matplotlib.pyplot as plt

logs = model_1.make_inspector().training_logs()

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")

plt.subplot(1, 2, 2)
plt.plot([log.num_trees for log in logs], [log.evaluation.loss for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Logloss (out-of-bag)")

plt.show()

This dataset is small. You can see the model converging almost immediately.
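
The trade-off between model size and quality can also be read off numerically. The sketch below copies a few (number of trees, accuracy, logloss) triples from the training logs shown earlier and looks for the smallest forest whose out-of-bag logloss is within a tolerance of the best one. The tolerance value and the variable names are illustrative assumptions.

```python
# (number of trees, out-of-bag accuracy, out-of-bag logloss), copied from a
# subset of the training log lines shown earlier.
oob_logs = [
    (1, 0.952941, 1.69617),
    (11, 0.961864, 0.671715),
    (21, 0.967078, 0.503792),
    (31, 0.971774, 0.219305),
    (42, 0.967742, 0.0889185),
    (64, 0.971774, 0.0816141),
    (107, 0.975806, 0.0744989),
    (206, 0.975806, 0.0811454),
    (300, 0.975806, 0.0827604),
]

# Entry with the lowest logloss.
best_trees, _, best_loss = min(oob_logs, key=lambda entry: entry[2])

# Smallest forest whose logloss is within `tolerance` of the best one.
tolerance = 0.02  # Illustrative value.
smallest_trees = min(
    num_trees for num_trees, _, loss in oob_logs
    if loss <= best_loss + tolerance)

print(f"Lowest logloss {best_loss} reached with {best_trees} trees")
print(f"Within {tolerance} of it from {smallest_trees} trees onward")
```

On these logged values, a forest several times smaller than the default 300 trees already comes close to the best out-of-bag logloss.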

Let's use TensorBoard:

In [None]:
# This cell starts TensorBoard, which can be slow.
# Load the TensorBoard notebook extension
%load_ext tensorboard
# Google internal version
# %load_ext google3.learning.brain.tensorboard.notebook.extension

In [None]:
# Clear existing results (if any)
!rm -fr "/tmp/tensorboard_logs"

In [None]:
# Export the meta-data to tensorboard.
model_1.make_inspector().export_to_tensorboard("/tmp/tensorboard_logs")

In [None]:
# docs_infra: no_execute
# Start a tensorboard instance.
%tensorboard --logdir "/tmp/tensorboard_logs"

<!-- <img class="tfo-display-only-on-site" src="images/beginner_tensorboard.png"/> -->


## Re-train the model with a different learning algorithm

The learning algorithm is defined by the model class. For
example, `tfdf.keras.RandomForestModel()` trains a Random Forest, while
`tfdf.keras.GradientBoostedTreesModel()` trains a Gradient Boosted Decision
Trees model.

The learning algorithms are listed by calling `tfdf.keras.get_all_models()` or in the
[learner list](https://ydf.readthedocs.io/en/latest/cli_user_manual.html#learners-and-models).

In [None]:
tfdf.keras.get_all_models()

The descriptions of the learning algorithms and their hyper-parameters are also available in the [API reference](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf) and the built-in help:

In [None]:
# help works anywhere.
help(tfdf.keras.RandomForestModel)

# ? only works in ipython or notebooks, it usually opens on a separate panel.
tfdf.keras.RandomForestModel?

## Using a subset of features

The previous example did not specify the features, so all the columns were used
as input features (except for the label). The following example shows how to
specify input features.

In [None]:
feature_1 = tfdf.keras.FeatureUsage(name="bill_length_mm")
feature_2 = tfdf.keras.FeatureUsage(name="island")

all_features = [feature_1, feature_2]

# Note: This model is only trained with two features. It will not be as good as
# the one trained on all features.

model_2 = tfdf.keras.GradientBoostedTreesModel(
    features=all_features, exclude_non_specified_features=True)

model_2.compile(metrics=["accuracy"])
model_2.fit(train_ds, validation_data=test_ds)

print(model_2.evaluate(test_ds, return_dict=True))

Num validation examples: tf.Tensor(96, shape=(), dtype=int32)




Validation dataset read in 0:00:00.202614. Found 96 examples.


Training model...


Model trained in 0:00:00.239200


Compiling model...


Model compiled.


[INFO 24-01-31 12:18:27.9538 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpp64xoosr/model/ with prefix 3b3b3b2760024850
[INFO 24-01-31 12:18:27.9584 UTC decision_forest.cc:660] Model loaded with 51 root(s), 1589 node(s), and 2 input feature(s).
[INFO 24-01-31 12:18:27.9584 UTC abstract_model.cc:1344] Engine "GradientBoostedTreesGeneric" built
[INFO 24-01-31 12:18:27.9584 UTC kernel.cc:1061] Use fast generic engine






{'loss': 0.0, 'accuracy': 0.9270833134651184}


**Note:** As expected, the accuracy is lower than previously.

**TF-DF** attaches a **semantics** to each feature. This semantics controls how
the feature is used by the model. The following semantics are currently supported:

-   **Numerical**: Generally for quantities or counts with full ordering. For
    example, the age of a person, or the number of items in a bag. Can be a
    float or an integer. Missing values are represented with `float('nan')` or with
    an empty sparse tensor.
-   **Categorical**: Generally for a type/class in a finite set of possible values
    without ordering. For example, the color RED in the set {RED, BLUE, GREEN}.
    Can be a string or an integer. Missing values are represented as "" (empty
    string), value -2 or with an empty sparse tensor.
-   **Categorical-Set**: A set of categorical values. Great to represent
    tokenized text. Can be a string or an integer in a sparse tensor or a
    ragged tensor (recommended). The order/index of each item doesn't matter.

If not specified, the semantics is inferred from the representation type and shown in the training logs:

- int, float (dense or sparse) → Numerical semantics.
- str (dense or sparse) → Categorical semantics
- int, str (ragged) → Categorical-Set semantics
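
The inference rules above can be mimicked with a small stand-alone function. This is only an illustration of the rules as listed, working on plain Python values instead of tensors: `infer_semantic` is a hypothetical helper and not part of the TF-DF API, a Python list stands in for a ragged tensor, and the example values (including the `tokens` feature) are made up.

```python
def infer_semantic(value):
    """Guesses a feature semantic from the Python type of one value."""
    if isinstance(value, (list, tuple, set, frozenset)):
        # Stand-in for a ragged tensor of ints or strings.
        return "CATEGORICAL_SET"
    if isinstance(value, str):
        return "CATEGORICAL"
    if isinstance(value, (int, float)):
        return "NUMERICAL"
    raise TypeError(f"Unsupported feature type: {type(value).__name__}")


# Illustrative feature values; `tokens` is a made-up feature.
example = {
    "bill_length_mm": 39.1,
    "island": "Torgersen",
    "year": 2007,
    "tokens": ["small", "bird"],
}

inferred = {name: infer_semantic(value) for name, value in example.items()}
for name, semantic in inferred.items():
    print(f"{name}: {semantic}")
```

Under these rules, `year` is treated as numerical simply because it is stored as an integer, regardless of what the values mean.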

In some cases, the inferred semantics is incorrect. For example, an Enum stored as an integer is semantically categorical, but it will be detected as numerical. In this case, you should specify the semantic argument in the input. The `education_num` field of the Adult dataset is a classic example.

This dataset doesn't contain such a feature. However, for the demonstration, we will make the model treat the `year` as a categorical feature:

In [None]:
%set_cell_height 300

feature_1 = tfdf.keras.FeatureUsage(name="year", semantic=tfdf.keras.FeatureSemantic.CATEGORICAL)
feature_2 = tfdf.keras.FeatureUsage(name="bill_length_mm")
feature_3 = tfdf.keras.FeatureUsage(name="sex")
all_features = [feature_1, feature_2, feature_3]

model_3 = tfdf.keras.GradientBoostedTreesModel(features=all_features, exclude_non_specified_features=True)
model_3.compile( metrics=["accuracy"])

model_3.fit(train_ds, validation_data=test_ds)

Num validation examples: tf.Tensor(96, shape=(), dtype=int32)


Validation dataset read in 0:00:00.154900. Found 96 examples.


Training model...


Model trained in 0:00:00.213496


Compiling model...


Model compiled.


[INFO 24-01-31 12:18:28.9470 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpdtv_ods_/model/ with prefix 67767722bfd0419b
[INFO 24-01-31 12:18:28.9508 UTC decision_forest.cc:660] Model loaded with 33 root(s), 1003 node(s), and 3 input feature(s).
[INFO 24-01-31 12:18:28.9509 UTC kernel.cc:1061] Use fast generic engine


<tf_keras.src.callbacks.History at 0x7fe42066b9a0>

Note that `year` is in the list of CATEGORICAL features (unlike the first run).
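
To see why the semantic matters, the following sketch compares the partitions that a single tree condition can produce under each semantic. It assumes that a numerical condition is a threshold test and that a categorical condition is a set-membership test; the helper functions and the example `year` values are hypothetical and are not part of the TF-DF API.

```python
from itertools import combinations

# Hypothetical distinct values of the `year` feature.
years = [2007, 2008, 2009]


def numerical_partitions(values):
    """Partitions reachable by one threshold test of the form `value >= t`."""
    ordered = sorted(set(values))
    partitions = set()
    for threshold in ordered[1:]:
        right = frozenset(v for v in ordered if v >= threshold)
        left = frozenset(ordered) - right
        partitions.add(frozenset({left, right}))
    return partitions


def categorical_partitions(values):
    """Partitions reachable by one membership test of the form `value in subset`."""
    distinct = sorted(set(values))
    partitions = set()
    for size in range(1, len(distinct)):
        for subset in combinations(distinct, size):
            inside = frozenset(subset)
            outside = frozenset(distinct) - inside
            partitions.add(frozenset({inside, outside}))
    return partitions


# Separating 2008 from {2007, 2009} needs a membership test.
target = frozenset({frozenset({2007, 2009}), frozenset({2008})})
print(target in numerical_partitions(years))
print(target in categorical_partitions(years))
```

With a numerical semantic, isolating a middle value takes two successive threshold conditions, whereas a single categorical condition is enough.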

## Hyper-parameters

**Hyper-parameters** are parameters of the training algorithm that impact
the quality of the final model. They are specified in the model class
constructor. The list of hyper-parameters is visible with the *question mark* Colab command (e.g. `?tfdf.keras.GradientBoostedTreesModel`).

Alternatively, you can find them on the [TensorFlow Decision Forests GitHub](https://github.com/tensorflow/decision-forests/blob/main/tensorflow_decision_forests/keras/wrappers_pre_generated.py) or in the [Yggdrasil Decision Forests documentation](https://github.com/google/yggdrasil-decision-forests/blob/main/documentation/learners.md).

The default hyper-parameters of each algorithm approximately match those of the original publication. To ensure consistency, new features and their matching hyper-parameters are always disabled by default. That's why it is a good idea to tune your hyper-parameters.

In [None]:
# A classical but slightly more complex model.
model_6 = tfdf.keras.GradientBoostedTreesModel(
    num_trees=500, growing_strategy="BEST_FIRST_GLOBAL", max_depth=8)
model_6.fit(train_ds)

Model trained in 0:00:00.563517


Compiling model...


Model compiled.


[INFO 24-01-31 12:18:30.0216 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpjgzlsodc/model/ with prefix dc1054235a75450b
[INFO 24-01-31 12:18:30.0383 UTC decision_forest.cc:660] Model loaded with 108 root(s), 5406 node(s), and 7 input feature(s).
[INFO 24-01-31 12:18:30.0384 UTC kernel.cc:1061] Use fast generic engine


<tf_keras.src.callbacks.History at 0x7fe420554be0>

In [None]:
# A more complex, but possibly more accurate, model.
model_7 = tfdf.keras.GradientBoostedTreesModel(
    num_trees=500,
    growing_strategy="BEST_FIRST_GLOBAL",
    max_depth=8,
    split_axis="SPARSE_OBLIQUE",
    categorical_algorithm="RANDOM",
    )
model_7.fit(train_ds)



Model compiled.


<tf_keras.src.callbacks.History at 0x7fe42056b610>

As new training methods are published and implemented, combinations of hyper-parameters can emerge as good or almost-always-better than the default parameters. To avoid changing the default hyper-parameter values, these good combinations are indexed and made available as hyper-parameter templates.

For example, the `benchmark_rank1` template is the best combination on our internal benchmarks. These templates are versioned to keep training configurations stable, e.g. `benchmark_rank1@v1`.

In [None]:
# A good template of hyper-parameters.
model_8 = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")
model_8.fit(train_ds)

[INFO 24-01-31 12:18:42.2478 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpivrthe03/model/ with prefix 3f17958e0d434c5e
[INFO 24-01-31 12:18:42.3573 UTC decision_forest.cc:660] Model loaded with 900 root(s), 37318 node(s), and 7 input feature(s).
[INFO 24-01-31 12:18:42.3573 UTC kernel.cc:1061] Use fast generic engine






Model compiled.


<tf_keras.src.callbacks.History at 0x7fe42040bc70>

The available templates are listed by `predefined_hyperparameters`. Note that different learning algorithms have different templates, even if the name is similar.

In [None]:
# The hyper-parameter templates of the Gradient Boosted Tree model.
print(tfdf.keras.GradientBoostedTreesModel.predefined_hyperparameters())
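
Since this section recommends tuning hyper-parameters, it can help to enumerate candidate configurations as plain keyword-argument dictionaries before training anything. The sketch below is a generic grid enumeration, not a TF-DF tuning API; the `search_space` values and the `grid` helper are hypothetical, and only the argument names come from the constructor calls shown earlier in this section.

```python
from itertools import product

# Hypothetical search space: the names match constructor arguments used in
# this section, the candidate values are arbitrary.
search_space = {
    "num_trees": [100, 500],
    "max_depth": [4, 8],
    "growing_strategy": ["LOCAL", "BEST_FIRST_GLOBAL"],
}


def grid(space):
    """Yield one keyword-argument dict per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(grid(search_space))
print(len(candidates))
print(candidates[0])

# Each dict could then be unpacked into a model constructor, for example:
# tfdf.keras.GradientBoostedTreesModel(**candidates[0])
```

Each candidate would still need to be trained and compared on a validation dataset, which this sketch deliberately leaves out.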

## Feature Preprocessing

Pre-processing features is sometimes necessary to consume signals with complex
structures, to regularize the model or to apply transfer learning.
Pre-processing can be done in one of three ways:

1.  Preprocessing on the Pandas dataframe. This solution is easy to implement
    and generally suitable for experimentation. However, the
    pre-processing logic will not be exported in the model by `model.save()`.

2.  [Keras Preprocessing](https://keras.io/guides/preprocessing_layers/): While
    more complex than the previous solution, Keras Preprocessing is packaged in
    the model.

3.  [TensorFlow Feature Columns](https://www.tensorflow.org/tutorials/structured_data/feature_columns):
    This API is part of the TF Estimator library (not Keras) and is planned for
    deprecation. This solution is useful when reusing existing preprocessing
    code.

Note: Using [TensorFlow Hub](https://www.tensorflow.org/hub)
pre-trained embeddings is often a great way to consume text and images with
TF-DF. For example, `hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2")`. See the [Intermediate tutorial](intermediate_colab.ipynb) for more details.

In the next example, the `body_mass_g` feature is pre-processed into `body_mass_kg = body_mass_g / 1000`. The `bill_length_mm` feature is consumed without pre-processing. Note that such
monotonic transformations generally have no impact on decision forest models.

In [None]:
%set_cell_height 300

body_mass_g = tf_keras.layers.Input(shape=(1,), name="body_mass_g")
body_mass_kg = body_mass_g / 1000.0

bill_length_mm = tf_keras.layers.Input(shape=(1,), name="bill_length_mm")

raw_inputs = {"body_mass_g": body_mass_g, "bill_length_mm": bill_length_mm}
processed_inputs = {"body_mass_kg": body_mass_kg, "bill_length_mm": bill_length_mm}

# "preprocessor" contains the preprocessing logic.
preprocessor = tf_keras.Model(inputs=raw_inputs, outputs=processed_inputs)

# "model_4" contains both the pre-processing logic and the decision forest.
model_4 = tfdf.keras.RandomForestModel(preprocessing=preprocessor)
model_4.fit(train_ds)

model_4.summary()

Model trained in 0:00:00.041472


Compiling model...


Model compiled.




[INFO 24-01-31 12:18:43.8764 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpg1pw2xd5/model/ with prefix 04ad7ce2517f4691
[INFO 24-01-31 12:18:43.8929 UTC decision_forest.cc:660] Model loaded with 300 root(s), 5644 node(s), and 2 input feature(s).
[INFO 24-01-31 12:18:43.8929 UTC kernel.cc:1061] Use fast generic engine


Model: "random_forest_model_1"
_________________________________________________________________
 Layer (type)                Output Shape                                              Param #
 model (Functional)          {'body_mass_kg': (None, 1), 'bill_length_mm': (None, 1)}  0
Total params: 1 (1.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 1 (1.00 Byte)
_________________________________________________________________


Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"

Input Features (2):
	bill_length_mm
	body_mass_kg

No weights

Variable Importance: INV_MEAN_MIN_DEPTH:
    1. "bill_length_mm"  0.996678 ################
    2.   "body_mass_kg"  0.412305 

Variable Importance: NUM_AS_ROOT:
    1. "bill_length_mm" 299.000000 ################
    2.   "body_mass_kg"  1.000000 

Variable Importance: NUM_NODES:
    1. "bill_length_mm" 1415.000000 ################
    2.   "body_mass_kg" 1257.000000 

Variable Importance: SUM_SCORE:
    1. "bill_length_mm" 48426.479070 ################
    2.   "body_mass_kg" 24452.918388 



Winner takes all: true
Out-of-bag evaluation: accuracy:0.927419 logloss:0.571626
Number of trees: 300
Total number of nodes: 5644

Number of nodes by tree:
Count: 300 Average: 18.8133 StdDev: 2.91979
Min: 11 Max: 29 Ignored: 0
----------------------------------------------
[ 11, 12)  2   0.67%   0.67%
[ 12, 13)  0   0.00%   0.67%
[ 13, 14) 10   3.33%   4.00% #
[ 14, 15)  0 

The following example re-implements the same logic using TensorFlow Feature
Columns.

In [None]:
def g_to_kg(x):
  return x / 1000

feature_columns = [
    tf.feature_column.numeric_column("body_mass_g", normalizer_fn=g_to_kg),
    tf.feature_column.numeric_column("bill_length_mm"),
]

preprocessing = tf_keras.layers.DenseFeatures(feature_columns)

model_5 = tfdf.keras.RandomForestModel(preprocessing=preprocessing)
model_5.fit(train_ds)

Training model...


Model trained in 0:00:00.041174


Compiling model...


Model compiled.




[INFO 24-01-31 12:18:44.8808 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpqc51ghzt/model/ with prefix 2e9db5c6f4514133
[INFO 24-01-31 12:18:44.8970 UTC decision_forest.cc:660] Model loaded with 300 root(s), 5644 node(s), and 2 input feature(s).
[INFO 24-01-31 12:18:44.8970 UTC kernel.cc:1061] Use fast generic engine


<tf_keras.src.callbacks.History at 0x7fe42048e6a0>
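
The earlier remark that monotonic transformations such as the gram-to-kilogram conversion generally have no impact on decision forests can be checked on a toy example. The sketch below is a minimal single-split search using Gini impurity, written for illustration only; it is not TF-DF's splitter, and the data values and helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, left_indices) of the best `value < threshold` split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # No threshold can separate two equal values.
        left, right = order[:k], order[k:]
        # Weighted impurity of the two children (lower is better).
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if best is None or score < best[0]:
            best = (score, (lo + hi) / 2, frozenset(left))
    if best is None:
        raise ValueError("All values are equal; no split is possible.")
    return best[1], best[2]


# Hypothetical body masses (grams) and species labels, for illustration only.
mass_g = [3200.0, 3400.0, 3650.0, 3800.0, 4700.0, 5000.0, 5400.0, 5850.0]
species = ["Adelie", "Adelie", "Chinstrap", "Adelie",
           "Gentoo", "Gentoo", "Gentoo", "Gentoo"]
mass_kg = [g / 1000.0 for g in mass_g]

threshold_g, left_g = best_split(mass_g, species)
threshold_kg, left_kg = best_split(mass_kg, species)

print(threshold_g, threshold_kg)
print(left_g == left_kg)
```

The threshold value changes with the unit, but the examples sent to each side of the split are the same, so the resulting tree structure and predictions are unchanged.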

## Training a regression model

The previous examples train a classification model (TF-DF does not differentiate
between binary classification and multi-class classification). In the next
example, train a regression model on the
[Abalone dataset](https://archive.ics.uci.edu/ml/datasets/abalone). The
objective of this dataset is to predict the number of rings on an abalone's
shell.

**Note:** The CSV file is assembled by appending UCI's header and data files. No preprocessing was applied.

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/LivingAbalone.JPG/800px-LivingAbalone.JPG" width="200"/></center>

In [None]:
# Download the dataset.
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/abalone_raw.csv -O /tmp/abalone.csv

dataset_df = pd.read_csv("/tmp/abalone.csv")
print(dataset_df.head(3))

In [None]:
# Split the dataset into a training and testing dataset.
train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))

# Name of the label column.
label = "Rings"

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)

In [None]:
%set_cell_height 300

# Configure the model.
model_7 = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)

# Train the model.
model_7.fit(train_ds)

Compiling model...


[INFO 24-01-31 12:18:46.9600 UTC decision_forest.cc:660] Model loaded with 300 root(s), 259684 node(s), and 8 input feature(s).
[INFO 24-01-31 12:18:46.9600 UTC kernel.cc:1061] Use fast generic engine


Model compiled.


<tf_keras.src.callbacks.History at 0x7fe6240aafd0>

In [None]:
# Evaluate the model on the test dataset.
model_7.compile(metrics=["mse"])
evaluation = model_7.evaluate(test_ds, return_dict=True)

print(evaluation)
print()
print(f"MSE: {evaluation['mse']}")
print(f"RMSE: {math.sqrt(evaluation['mse'])}")



{'loss': 0.0, 'mse': 4.820103168487549}

MSE: 4.820103168487549
RMSE: 2.1954733358634875
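
For reference, the two metrics printed above can be computed by hand. The sketch below uses the standard definitions of MSE and RMSE on a handful of hypothetical ring counts and predictions; the `mean_squared_error` helper and the data are introduced for illustration and do not come from the Abalone dataset.

```python
import math


def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical ring counts and model predictions, for illustration only.
labels = [9, 10, 7, 15]
predictions = [10.0, 10.0, 9.0, 12.0]

mse = mean_squared_error(labels, predictions)
rmse = math.sqrt(mse)
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
```

Because the RMSE is expressed in the same unit as the label, the value reported by the model above corresponds to a typical error of roughly 2.2 rings.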
