
# Tuning

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/tuning.ipynb)

## Setup

In [1]:
pip install ydf -U



## What is model tuning?

**Model tuning**, also known as automated model hyperparameter optimization or AutoML, involves finding the optimal hyperparameters for a learner to maximize the performance of a model. YDF supports model tuning out-of-the-box.
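
To make the idea concrete, here is a minimal sketch of random search in plain Python. It does not use YDF and is not how YDF implements tuning; `toy_loss` is a hypothetical stand-in for "train a model with these hyperparameters and measure its loss", and the candidate values are chosen only for illustration.

```python
import random


def toy_loss(params):
    # Hypothetical objective standing in for a real training run.
    # By construction it is lowest at shrinkage=0.1 and max_depth=5.
    return (params["shrinkage"] - 0.1) ** 2 + 0.01 * (params["max_depth"] - 5) ** 2


def random_search(space, num_trials, seed=0):
    """Samples `num_trials` random configurations and keeps the best one."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(num_trials):
        # Draw one candidate value per hyperparameter.
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = toy_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


space = {"shrinkage": [0.2, 0.1, 0.05], "max_depth": [3, 4, 5, 6]}
best_params, best_loss = random_search(space, num_trials=200)
print(best_params, best_loss)
```

A real tuner replaces `toy_loss` with an actual training and validation step, which is why each trial is costly and the number of trials matters.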

YDF model tuning has two modes. You can either manually specify the hyperparameters to optimize and their candidate values, or use a pre-configured search space. The second option is simpler, while the first option gives you more control. We will demonstrate both options in this tutorial.

Tuning can be done on a single machine or across multiple machines using distributed training. **This tutorial focuses on tuning on a single machine**. Local tuning is simple to set up and can produce excellent results on small datasets.

### Distributed model tuning
Distributed tuning can be advantageous for models that take a long time to train or that have a large hyperparameter search space. It requires configuring workers and specifying the `workers` constructor argument of the learner. After the workers are set up, the model tuning strategy is the same as for tuning on a local machine. For more information, see the [distributed training tutorial](../distributed_training).

## Download dataset

We use the adult dataset.

In [None]:
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets

# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,44,Private,228057,7th-8th,4,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,40,Dominican-Republic,<=50K
1,20,Private,299047,Some-college,10,Never-married,Other-service,Not-in-family,White,Female,0,0,20,United-States,<=50K
2,40,Private,342164,HS-grad,9,Separated,Adm-clerical,Unmarried,White,Female,0,0,37,United-States,<=50K
3,30,Private,361742,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States,<=50K
4,67,Self-emp-inc,171564,HS-grad,9,Married-civ-spouse,Prof-specialty,Wife,White,Female,20051,0,30,England,>50K


## Local tuning with manually set hyper-parameters

The hyper-parameters of a learner are accessible in the API and on the [hyper-parameter page](https://ydf.readthedocs.io/en/latest/hyperparameters/). The guide [How to improve a model](https://ydf.readthedocs.io/en/latest/guide_how_to_improve_model/) also provides some recommendations on the hyper-parameters that are most impactful to optimize. In this example, we train a gradient boosted trees model and optimize the following hyper-parameters: `shrinkage`, `subsample`, and `max_depth`.

The tuning objective is automatically selected for the model. For instance, for `GradientBoostedTreesLearner` used in this example, the loss is minimized.

Let's configure the tuner:

In [None]:
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
tuner.choice("max_depth", [3, 4, 5, 6])

<ydf.learner.tuner.SearchSpace at 0x7f3eb4372310>
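
The search space above is small enough to enumerate by hand. The sketch below, in plain Python and independent of YDF, counts its distinct combinations; the `space` dictionary is only a mirror of the `tuner.choice` calls, not a YDF object.

```python
import itertools

# Candidate values copied from the tuner configuration above.
space = {
    "shrinkage": [0.2, 0.1, 0.05],
    "subsample": [1.0, 0.9, 0.8],
    "max_depth": [3, 4, 5, 6],
}

# Every combination is one value per hyperparameter.
combinations = list(itertools.product(*space.values()))
print(len(combinations))  # 3 * 3 * 4 = 36
```

With only 36 distinct combinations, 50 random trials cannot all test different configurations. For a space this small, a lower `num_trials` or a larger set of candidate values may be worth considering.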

We create a learner using this tuner, and train a model:

**Note:** Parameters that are not tuned can be specified directly on the learner.

**Note:** To print the tuning logs during tuning, enable logging with `ydf.verbose(2)`.

In [None]:
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,  # Used for all the trials.
    tuner=tuner,
)
model = learner.train(train_ds)

Train model on 22792 examples
Model trained in 0:00:03.998356


The tuning logs, which list the hyper-parameters that were tested and their scores, are available in the `tuning` tab of the model description.

In [None]:
model.describe()

trial,score,duration,shrinkage,subsample,max_depth
16,-0.574861,2.49348,0.2,1.0,5
31,-0.576405,3.53616,0.2,1.0,6
15,-0.577211,2.4727,0.1,1.0,5
33,-0.578941,3.69053,0.2,0.9,5
32,-0.579071,3.54803,0.2,0.9,6
35,-0.579637,3.99118,0.1,1.0,6
19,-0.581703,2.68832,0.2,0.8,6
34,-0.582941,3.90171,0.1,0.8,6
14,-0.583348,2.46785,0.2,0.8,5
27,-0.583466,3.23896,0.2,0.9,4
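
If the logs are copied out as text, the best trial can be picked programmatically. The sketch below uses only the standard library and the first three rows of the table above. It assumes that a higher score is better, which matches the descending order of the table; since the learner minimizes the loss and the scores are negative, the score appears to be the negated loss, but that reading is an assumption.

```python
import csv
import io

# First rows of the tuning logs, copied from the output above.
logs = """trial,score,duration,shrinkage,subsample,max_depth
16,-0.574861,2.49348,0.2,1.0,5
31,-0.576405,3.53616,0.2,1.0,6
15,-0.577211,2.4727,0.1,1.0,5
"""

rows = list(csv.DictReader(io.StringIO(logs)))

# Assumption: the trial with the highest score is the best one.
best = max(rows, key=lambda row: float(row["score"]))
print(best["trial"], best["shrinkage"], best["subsample"], best["max_depth"])
```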


The model can then be evaluated as usual.

In [None]:
model.evaluate(test_ds)

Label \ Pred,<=50K,>50K
<=50K,6974,438
>50K,781,1576

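The confusion matrix can be condensed into a few summary numbers. The following sketch recomputes accuracy and per-class recall from the counts shown above, using plain Python rather than the YDF evaluation object. It assumes that rows are labels and columns are predictions, as the `Label \ Pred` header suggests.

```python
# Counts copied from the confusion matrix above: confusion[label][prediction].
confusion = {
    "<=50K": {"<=50K": 6974, ">50K": 438},
    ">50K": {"<=50K": 781, ">50K": 1576},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Recall of a class: correctly predicted examples of that class
# divided by all examples whose label is that class.
recall = {label: row[label] / sum(row.values()) for label, row in confusion.items()}

print(f"accuracy = {accuracy:.4f}")
for label, value in recall.items():
    print(f"recall[{label}] = {value:.4f}")
```

The recall of the `>50K` class is noticeably lower than that of `<=50K`, so most of the errors are high-income examples predicted as low-income.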

## Configuring conditional hyper-parameters

There are hyper-parameters that are only relevant when other hyper-parameters are configured in a specific way. For example, when `growing_strategy=LOCAL`, it makes sense to optimize `max_depth`. However, when `growing_strategy=BEST_FIRST_GLOBAL`, it is better to optimize `max_num_nodes`. We can configure a tuner to account for these conditional dependencies.



In [None]:
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])

local_subspace = tuner.choice("growing_strategy", ["LOCAL"])
local_subspace.choice("max_depth", [3, 4, 5, 6])

global_subspace = tuner.choice("growing_strategy", ["BEST_FIRST_GLOBAL"], merge=True)
global_subspace.choice("max_num_nodes", [32, 64, 128, 256])

<ydf.learner.tuner.SearchSpace at 0x7f3f10549e50>
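
To see which configurations such a conditional space contains, the sketch below enumerates them in plain Python. It is only an illustration of the structure and does not call YDF; the `shared` and `branches` dictionaries are hypothetical mirrors of the tuner calls above.

```python
import itertools

# Hyperparameters tuned regardless of the growing strategy.
shared = {"shrinkage": [0.2, 0.1, 0.05], "subsample": [1.0, 0.9, 0.8]}

# Hyperparameters that only apply to one growing strategy.
branches = {
    "LOCAL": {"max_depth": [3, 4, 5, 6]},
    "BEST_FIRST_GLOBAL": {"max_num_nodes": [32, 64, 128, 256]},
}


def enumerate_configs(shared, branches):
    """Lists every configuration of the conditional search space."""
    configs = []
    for strategy, child in branches.items():
        names = list(shared) + list(child)
        value_lists = list(shared.values()) + list(child.values())
        for values in itertools.product(*value_lists):
            config = dict(zip(names, values))
            config["growing_strategy"] = strategy
            configs.append(config)
    return configs


configs = enumerate_configs(shared, branches)
print(len(configs))  # 3 * 3 * (4 + 4) = 72
```

No configuration contains both `max_depth` and `max_num_nodes`, which is the point of nesting each one under its own `growing_strategy` value.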

Let's tune the model and display the results.

In [None]:
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,
    tuner=tuner,
)
model = learner.train(train_ds)

Train model on 22792 examples
Model trained in 0:00:06.789261


In [None]:
model.describe()

trial,score,duration,shrinkage,subsample,growing_strategy,max_depth,max_num_nodes
31,-0.574861,5.4128,0.2,1.0,LOCAL,5.0,
10,-0.576405,2.72618,0.2,1.0,LOCAL,6.0,
18,-0.578031,3.67246,0.1,0.9,BEST_FIRST_GLOBAL,,32.0
25,-0.578941,4.434,0.2,0.9,LOCAL,5.0,
11,-0.579071,2.97415,0.2,0.9,LOCAL,6.0,
21,-0.579482,4.04769,0.1,0.9,BEST_FIRST_GLOBAL,,64.0
39,-0.579482,5.72021,0.1,0.9,BEST_FIRST_GLOBAL,,128.0
44,-0.579637,6.08383,0.1,1.0,LOCAL,6.0,
16,-0.580548,3.50807,0.1,0.8,BEST_FIRST_GLOBAL,,32.0
8,-0.582698,2.65852,0.2,1.0,BEST_FIRST_GLOBAL,,64.0


In [None]:
model.evaluate(test_ds)

Label \ Pred,<=50K,>50K
<=50K,6974,438
>50K,781,1576


## Local tuning with automatically configured hyper-parameters

If you do not want to configure the hyperparameters to optimize, you can use a preconfigured tuner.

In [None]:
tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)

Model training is similar:

In [None]:
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    num_trees=100,
    tuner=tuner,
)
model = learner.train(train_ds)

Train model on 22792 examples
Model trained in 0:00:01.745021


Inspecting the model works the same way:

In [None]:
model.describe()

trial,score,duration
0,-0.579637,1.74332


And evaluating the model:

In [None]:
model.evaluate(test_ds)

Label \ Pred,<=50K,>50K
<=50K,6985,427
>50K,796,1561
