<a href="https://colab.research.google.com/github/Kerriea-star/Advanced/blob/master/Learning_Decision_Forests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Build, Train and Evaluate a Model with Decision Forests**

*Introduction*

In [4]:
pip install tensorflow_decision_forests

Collecting tensorflow_decision_forests
  Downloading tensorflow_decision_forests-1.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting tensorflow~=2.13.0 (from tensorflow_decision_forests)
  Downloading tensorflow-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (524.1 MB)
Collecting wurlitzer (from tensorflow_decision_forests)
  Downloading wurlitzer-3.0.3-py3-none-any.whl (7.3 kB)
Collecting keras<2.14,>=2.13.1 (from tensorflow~=2.13.0->tensorflow_decision_forests)
  Downloading keras-2.13.1-py3-none-any.whl (1.7 MB)
Collecting tensorboard<2.14,>=2.13 (from tensorflow~=2.13.0->tens

Installing TF-DF

In [5]:
pip install wurlitzer



Wurlitzer is needed to display the detailed training logs in Colab (when using verbose=2 in the model constructor).

In [6]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

In [7]:
from IPython.core.magic import register_line_magic
from IPython.display import Javascript
from IPython.display import display as ipy_display

# Some of the model training logs can cover the full
# screen if not compressed to a smaller viewport.
# This magic allows setting a max height for a cell
@register_line_magic
def set_cell_height(size):
  ipy_display(
      Javascript("google.colab.output.setIframeHeight(0, true, {maxHeight: " + str(size) + "})"))

In [8]:
# check the version of TensorFlow Decision Forests
print("Found TensorFlow Decision Forests v" + tfdf.__version__)

Found TensorFlow Decision Forests v1.5.0


**Training a Random Forest Model**

In this section, we train, evaluate, analyze and export a multi-class classification Random Forest trained on the Palmer's Penguins dataset.

Let's download the dataset as a CSV file and load it:

In [9]:
# Download the Dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# load the dataset into a Pandas DataFrame
dataset_df = pd.read_csv("/tmp/penguins.csv")

# Display the first 3 examples
dataset_df.head(3)

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |


The dataset contains a mix of numerical (e.g. bill_depth_mm) and categorical (e.g. island) features, some with missing values. TF-DF supports all of these natively (unlike NN-based models), so there is no need for preprocessing such as one-hot encoding, normalization or an extra is_present feature.

Labels are a bit different: Keras metrics expect integers. The label (species) is stored as a string, so let's convert it into an integer.

In [10]:
# Encode the categorical labels as integers.
# Details:
# This is necessary if your classification label is represented as a
# string, since Keras expects integer classification labels.
# When using 'pd_dataframe_to_tf_dataset' (see below), this step can be skipped.

# Name of the column.
label = "species"

classes = dataset_df[label].unique().tolist()
print(f"Label classes: {classes}")

dataset_df[label] = dataset_df[label].map(classes.index)

Label classes: ['Adelie', 'Gentoo', 'Chinstrap']
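
Because the encoding is simply the position of each name in `classes`, integer labels or predictions can be mapped back to species names by indexing into the same list. A minimal sketch, using the class order printed above; the `raw_labels` sample is made up for illustration and is not taken from the dataset:

```python
# Class order as printed above (order of first appearance in the dataset).
classes = ["Adelie", "Gentoo", "Chinstrap"]

# Hypothetical sample of string labels, for illustration only.
raw_labels = ["Adelie", "Chinstrap", "Gentoo", "Adelie"]

# Encode: each name becomes its position in `classes`.
encoded = [classes.index(name) for name in raw_labels]

# Decode: index back into `classes` to recover the names.
decoded = [classes[i] for i in encoded]

print(encoded)
print(decoded)
```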


Next, split the dataset into training and testing sets.

In [11]:
# Split the dataset into a training and testing dataset
def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas dataframe into two."""
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, test_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)
))

239 examples in training, 105 examples for testing.
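
Note that `np.random.rand` is unseeded here, so the split sizes change from run to run: each row lands in the test set independently with probability `test_ratio`. The sketch below illustrates the same masking idea with a fixed seed, using only the standard library. `split_indices` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def split_indices(n, test_ratio=0.30, seed=42):
    """Returns (train_indices, test_indices) for n rows, reproducibly."""
    rng = random.Random(seed)  # fixed seed -> same split on every run
    # Each row is assigned to the test set independently with
    # probability `test_ratio`, as in split_dataset above.
    is_test = [rng.random() < test_ratio for _ in range(n)]
    train_idx = [i for i, t in enumerate(is_test) if not t]
    test_idx = [i for i, t in enumerate(is_test) if t]
    return train_idx, test_idx

train_idx, test_idx = split_indices(344)  # 344 = 239 + 105 rows
print(len(train_idx), len(test_idx))
```

The index lists could then select rows with `dataset_df.iloc[train_idx]` and `dataset_df.iloc[test_idx]`.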


And finally, convert the pandas dataframes (pd.DataFrame) into TensorFlow datasets (tf.data.Dataset):

In [12]:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)

Recall that pd_dataframe_to_tf_dataset converts string labels to integers if necessary.

If you want to create the tf.data.Dataset yourself, there are a couple of things to remember:


*   The learning algorithms work with a one-epoch dataset and without shuffling.
*   The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.



**Train the model**

In [13]:
%set_cell_height 300

# Specify the model
model_1 = tfdf.keras.RandomForestModel(verbose=2)

# Train the model
model_1.fit(train_ds)

<IPython.core.display.Javascript object>

Use 2 thread(s) for training
Use /tmp/tmpcji3rv1w as temporary training directory
Reading training dataset...
Training tensor examples:
Features: {'island': <tf.Tensor 'data:0' shape=(None,) dtype=string>, 'bill_length_mm': <tf.Tensor 'data_1:0' shape=(None,) dtype=float64>, 'bill_depth_mm': <tf.Tensor 'data_2:0' shape=(None,) dtype=float64>, 'flipper_length_mm': <tf.Tensor 'data_3:0' shape=(None,) dtype=float64>, 'body_mass_g': <tf.Tensor 'data_4:0' shape=(None,) dtype=float64>, 'sex': <tf.Tensor 'data_5:0' shape=(None,) dtype=string>, 'year': <tf.Tensor 'data_6:0' shape=(None,) dtype=int64>}
Label: Tensor("data_7:0", shape=(None,), dtype=int64)
Weights: None
Normalized tensor features:
 {'island': SemanticTensor(semantic=<Semantic.CATEGORICAL: 2>, tensor=<tf.Tensor 'data:0' shape=(None,) dtype=string>), 'bill_length_mm': SemanticTensor(semantic=<Semantic.NUMERICAL: 1>, tensor=<tf.Tensor 'Cast:0' shape=(None,) dtype=float32>), 'bill_depth_mm': SemanticTensor(semantic=<Semantic.NUMERIC

[INFO 23-08-11 13:29:46.4501 UTC kernel.cc:773] Start Yggdrasil model training
[INFO 23-08-11 13:29:46.4501 UTC kernel.cc:774] Collect training examples
[INFO 23-08-11 13:29:46.4502 UTC kernel.cc:787] Dataspec guide:
column_guides {
  column_name_pattern: "^__LABEL$"
  type: CATEGORICAL
  categorial {
    min_vocab_frequency: 0
    max_vocab_count: -1
  }
}
default_column_guide {
  categorial {
    max_vocab_count: 2000
  }
  discretized_numerical {
    maximum_num_bins: 255
  }
}
ignore_columns_without_guides: false
detect_numerical_as_discretized_numerical: false

[INFO 23-08-11 13:29:46.4508 UTC kernel.cc:393] Number of batches: 1
[INFO 23-08-11 13:29:46.4508 UTC kernel.cc:394] Number of examples: 239
[INFO 23-08-11 13:29:46.4509 UTC kernel.cc:794] Training dataset:
Number of records: 239
Number of columns: 8

Number of columns by type:
	NUMERICAL: 5 (62.5%)
	CATEGORICAL: 3 (37.5%)

Columns:

NUMERICAL: 5 (62.5%)
	1: "bill_depth_mm" NUMERICAL num-nas:1 (0.41841%) mean:17.1588 min:13

Model trained in 0:00:00.337771
Compiling model...


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code


Model compiled.


<keras.src.callbacks.History at 0x7cf03ee74190>

Remarks

*   No input features are specified. Therefore, all the columns will be used as input features except for the label. The features used by the model are shown in the training logs and in model.summary().
*   DFs natively consume numerical, categorical and categorical-set features, as well as missing values. Numerical features do not need to be normalized. Categorical string values do not need to be encoded in a dictionary.
*   No training hyper-parameters are specified. Therefore, the default hyper-parameters will be used. Default hyper-parameters provide reasonable results in most situations.
*   Calling compile on the model before the fit is optional. Compile can be used to provide extra evaluation metrics.
*   Training algorithms do not need validation sets. If a validation set is provided, it will only be used to show metrics.
*   Tweak the verbose argument to RandomForestModel to control the amount of displayed training logs. Set verbose=0 to hide most of the logs. Set verbose=2 to show all the logs.





**Evaluate the model**

In [14]:
# Let's evaluate the model on the test dataset
model_1.compile(metrics=["accuracy"])
evaluation = model_1.evaluate(test_ds, return_dict=True)
print()

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")


loss: 0.0000
accuracy: 0.9905
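
With the 105 test examples of this run, an accuracy of 0.9905 corresponds to 104 correct predictions and a single misclassified example. A quick check of that arithmetic, using only the two numbers reported above:

```python
num_test = 105       # test-set size reported by the split in this run
accuracy = 0.9905    # accuracy reported by model_1.evaluate

num_correct = round(accuracy * num_test)
num_wrong = num_test - num_correct

print(num_correct, num_wrong)
print(f"{num_correct / num_test:.4f}")
```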


Remark: The test accuracy is close to the Out-of-bag accuracy shown in the training logs.
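
Out-of-bag (OOB) evaluation is possible because, in the standard Random Forest algorithm, each tree is trained on a bootstrap sample (drawn with replacement) of the training set, so some examples are left out of that tree's sample and can act as a held-out set for it. The following is a plain-Python illustration of that idea under those assumptions, not TF-DF's implementation; `bootstrap_oob` is a hypothetical helper.

```python
import random

def bootstrap_oob(n, seed=0):
    """Draws one bootstrap sample of size n; returns (in_bag, out_of_bag) index sets."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}  # sampling with replacement
    out_of_bag = set(range(n)) - in_bag            # never drawn for this tree
    return in_bag, out_of_bag

n = 239  # number of training examples in this run
in_bag, oob = bootstrap_oob(n)
print(f"out-of-bag fraction: {len(oob) / n:.2f}")
```

For large n the out-of-bag fraction approaches 1/e (about 0.37), so each tree leaves out roughly a third of the training examples.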



In [17]:
# Prepare the model for TensorFlow Serving.
# Export the model to the SavedModel format for later re-use, e.g. with TensorFlow Serving.
model_1.save("/tmp/my_save_model")

In [15]:
# Prepare this model for TensorFlow Serving.
# Export the model to the SavedModel format for later re-use, e.g. with TensorFlow Serving.
import os
import tempfile
import requests

MODEL_DIR = "ML/models/decision_forests"
version = "1"
export_path = os.path.join(MODEL_DIR, str(version))

# save the model
model_1.save(export_path, save_format="tf")
print("\nexport_path = {}".format(export_path))
!dir {export_path}


export_path = ML/models/decision_forests/1
assets	fingerprint.pb	keras_metadata.pb  saved_model.pb  variables
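
The numeric subdirectory matters: TensorFlow Serving looks for version-numbered folders under the model's base path and, by default, serves the highest version. A small sketch for picking the next version directory follows. `next_export_path` is a hypothetical helper, and it only computes a path; the model would still be saved with `model_1.save`.

```python
import os
import tempfile

def next_export_path(model_dir):
    """Returns model_dir/<N+1>, where N is the highest numeric subdirectory (0 if none)."""
    versions = []
    if os.path.isdir(model_dir):
        versions = [int(name) for name in os.listdir(model_dir)
                    if name.isdigit() and os.path.isdir(os.path.join(model_dir, name))]
    return os.path.join(model_dir, str(max(versions, default=0) + 1))

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    first = next_export_path(base)    # no versions yet -> .../1
    os.makedirs(first)
    second = next_export_path(base)   # version 1 exists -> .../2
    print(os.path.basename(first), os.path.basename(second))
```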


In [16]:
!zip -r decision_forests.zip ML/models/decision_forests

  adding: ML/models/decision_forests/ (stored 0%)
  adding: ML/models/decision_forests/1/ (stored 0%)
  adding: ML/models/decision_forests/1/assets/ (stored 0%)
  adding: ML/models/decision_forests/1/assets/54b1583a39474b67random_forest_header.pb (deflated 91%)
  adding: ML/models/decision_forests/1/assets/54b1583a39474b67header.pb (deflated 29%)
  adding: ML/models/decision_forests/1/assets/54b1583a39474b67data_spec.pb (deflated 16%)
  adding: ML/models/decision_forests/1/assets/54b1583a39474b67done (stored 0%)
  adding: ML/models/decision_forests/1/assets/54b1583a39474b67nodes-00000-of-00001 (deflated 83%)
  adding: ML/models/decision_forests/1/fingerprint.pb (stored 0%)
  adding: ML/models/decision_forests/1/keras_metadata.pb (deflated 79%)
  adding: ML/models/decision_forests/1/saved_model.pb (deflated 86%)
  adding: ML/models/decision_forests/1/variables/ (stored 0%)
  adding: ML/models/decision_forests/1/variables/variables.index (deflated 46%)
  adding: ML/models/decision_forest
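
The same kind of archive can be produced from Python instead of the `zip` shell command, which may be convenient outside Colab. A minimal sketch using `shutil.make_archive` from the standard library; it runs on a throwaway directory containing a placeholder file, not on the real exported model.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work:
    # Stand-in for the exported model tree (placeholder file, not a real model).
    model_dir = os.path.join(work, "ML", "models", "decision_forests", "1")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "saved_model.pb"), "wb") as f:
        f.write(b"placeholder")

    # Archive ML/models/decision_forests relative to `work`, like `zip -r`.
    archive = shutil.make_archive(
        os.path.join(work, "decision_forests"), "zip",
        root_dir=work, base_dir=os.path.join("ML", "models", "decision_forests"))

    # Read the entry names back while the temporary directory still exists.
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    print(names)
```

For the notebook's model, the analogous call would be `shutil.make_archive("decision_forests", "zip", root_dir=".", base_dir="ML/models/decision_forests")`.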