# Tuning a Pipeline

This short guide shows how to tune a Pipeline using a [BTB](https://github.com/MLBazaar/BTB) Tuner.

For brevity, some steps are not explained here. Full details
about them can be found in the previous parts of the tutorial.

Here we will:
1. Load a dataset and a pipeline
2. Explore the pipeline tunable hyperparameters
3. Write a scoring function
4. Build a BTB Tunable and a BTB Tuner
5. Write a tuning loop

## Load dataset and the pipeline

The first step will be to load the dataset that we were using in previous tutorials.

In [1]:
from mlprimitives.datasets import load_dataset

dataset = load_dataset('census')

And load a suitable pipeline.

Note how in this case we are using the variable name `template` instead of `pipeline`,
because this will only be used as a template for the pipelines that we will create
and evaluate during the later tuning loop.

In [2]:
from mlblocks import MLPipeline

template = MLPipeline('single_table.classification.xgb')

## Explore the pipeline tunable hyperparameters

Once we have loaded the pipeline, we can extract the hyperparameters that we will tune
by calling the `get_tunable_hyperparameters` method.

In this case we will call it using `flat=True` to obtain the hyperparameters in a format
that is compatible with BTB.

In [3]:
tunable_hyperparameters = template.get_tunable_hyperparameters(flat=True)

In [4]:
tunable_hyperparameters

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): {'type': 'int', 'default': 0, 'range': [0, 100]},
 ('sklearn.impute.SimpleImputer#1', 'strategy'): {'type': 'str',
  'default': 'mean',
  'values': ['mean', 'median', 'most_frequent', 'constant']},
 ('xgboost.XGBClassifier#1', 'n_estimators'): {'type': 'int',
  'default': 100,
  'range': [10, 1000]},
 ('xgboost.XGBClassifier#1', 'max_depth'): {'type': 'int',
  'default': 3,
  'range': [3, 10]},
 ('xgboost.XGBClassifier#1', 'learning_rate'): {'type': 'float',
  'default': 0.1,
  'range': [0, 1]},
 ('xgboost.XGBClassifier#1', 'gamma'): {'type': 'float',
  'default': 0,
  'range': [0, 1]},
 ('xgboost.XGBClassifier#1', 'min_child_weight'): {'type': 'int',
  'default': 1,
  'range': [1, 10]}}
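
In the flat format, each entry is keyed by a `(block name, hyperparameter name)` tuple. As a minimal sketch, assuming a nested layout with one dictionary per block, the helper below groups flat entries by block. The name `unflatten` is hypothetical and not part of MLBlocks or BTB, and the values are illustrative only.

```python
from collections import defaultdict


def unflatten(flat):
    """Group {(block, name): value} entries into {block: {name: value}}."""
    nested = defaultdict(dict)
    for (block_name, hyperparameter_name), value in flat.items():
        nested[block_name][hyperparameter_name] = value
    return dict(nested)


# Illustrative values only, using keys that appear in the output above.
flat = {
    ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
    ('xgboost.XGBClassifier#1', 'max_depth'): 4,
    ('xgboost.XGBClassifier#1', 'n_estimators'): 200,
}

nested = unflatten(flat)
for block_name, values in nested.items():
    print(block_name, values)
```

Grouping by block this way makes it easier to see at a glance which primitives expose tunable hyperparameters.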

## Write a scoring function

To tune the pipeline we will need to evaluate its performance multiple times with different hyperparameters.

For this reason, we will start by writing a scoring function that will expect only one
input, the hyperparameters dictionary, and evaluate the performance of the pipeline using them.

In this case, the evaluation will be done using 5-fold cross validation based on the `get_splits`
method from the dataset.

In [5]:
import numpy as np

def cross_validate(hyperparameters=None):
    scores = []
    for X_train, X_test, y_train, y_test in dataset.get_splits(5):
        pipeline = MLPipeline(template.to_dict())  # Make a copy of the template
        if hyperparameters:
            pipeline.set_hyperparameters(hyperparameters)

        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        
        scores.append(dataset.score(y_test, y_pred))
        
    return np.mean(scores)

Calling this function without any arguments returns the score obtained
with the default hyperparameters.

In [6]:
default_score = cross_validate()
default_score

0.8639171383183359

Optionally, we can verify that passing a hyperparameters dictionary makes the pipeline use
the new hyperparameters, resulting in a different score.

In [7]:
hyperparameters = {
    ('xgboost.XGBClassifier#1', 'max_depth'): 4
}
cross_validate(hyperparameters)

0.8686773872402614
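
The fold averaging that `cross_validate` performs can be illustrated without MLBlocks. The sketch below is only an approximation of the idea: it uses contiguous, unshuffled folds and a toy majority-class "model", and it does not claim to reproduce how `get_splits` builds its splits. All names in it (`k_fold_indices`, `cross_val_mean`, `majority_accuracy`) and the `labels` data are hypothetical.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_mean(score_fold, n_samples, n_splits=5):
    """Average the per-fold scores returned by score_fold(train_idx, test_idx)."""
    scores = [
        score_fold(train_idx, test_idx)
        for train_idx, test_idx in k_fold_indices(n_samples, n_splits)
    ]
    return sum(scores) / len(scores)


# Made-up binary labels, for illustration only.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]


def majority_accuracy(train_idx, test_idx):
    """Predict the majority training label and return accuracy on the test fold."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for i in test_idx if labels[i] == majority)
    return hits / len(test_idx)


mean_score = cross_val_mean(majority_accuracy, len(labels), n_splits=5)
print(mean_score)
```

Every sample lands in exactly one test fold, so the mean summarizes performance over the whole dataset rather than over a single split.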

## Create a BTB Tunable

The next step is to create the BTB Tunable instance that will be tuned by the BTB Tuner.

For this we will use its `from_dict` method, passing our hyperparameters dict.

In [8]:
from btb.tuning import Tunable

tunable = Tunable.from_dict(tunable_hyperparameters)

## Create the BTB Tuner

After creating the Tunable, we need to create a Tuner to tune it.

In this case we will use the `GPTuner`, a meta-model based tuner that uses a Gaussian Process Regressor
for the optimization.

In [9]:
from btb.tuning import GPTuner

tuner = GPTuner(tunable)

Since we already know the score obtained with the default hyperparameters, and
these are likely to be reasonably good already, we can optionally inform the tuner
about their performance.

In order to obtain the default hyperparameters used before, we can call either
the template's `get_hyperparameters(flat=True)` method or `tunable.get_defaults()`.

In [10]:
defaults = tunable.get_defaults()
defaults

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): 0,
 ('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
 ('xgboost.XGBClassifier#1', 'n_estimators'): 100,
 ('xgboost.XGBClassifier#1', 'max_depth'): 3,
 ('xgboost.XGBClassifier#1', 'learning_rate'): 0.1,
 ('xgboost.XGBClassifier#1', 'gamma'): 0.0,
 ('xgboost.XGBClassifier#1', 'min_child_weight'): 1}

In [11]:
tuner.record(defaults, default_score)

## Start the Tuning loop

Once we have the tuner ready, we can start the tuning loop.

During this loop we will:

1. Ask the tuner for a new hyperparameter proposal
2. Run the `cross_validate` function to evaluate these hyperparameters
3. Record the obtained score back to the tuner
4. If the obtained score is better than the best one found so far, store the proposal

In [12]:
best_score = default_score
best_proposal = defaults

for iteration in range(10):
    print("scoring pipeline {}".format(iteration + 1))
    
    proposal = tuner.propose()
    score = cross_validate(proposal)
    
    tuner.record(proposal, score)
    
    if score > best_score:
        print("New best found: {}".format(score))
        best_score = score
        best_proposal = proposal

scoring pipeline 1
scoring pipeline 2
scoring pipeline 3
scoring pipeline 4
New best found: 0.8642241881762839
scoring pipeline 5
scoring pipeline 6
scoring pipeline 7
New best found: 0.8644390957265209
scoring pipeline 8
New best found: 0.8679095503945804
scoring pipeline 9
scoring pipeline 10


After the loop has finished, the best proposal will be stored in the `best_proposal` variable,
which can be used to generate a new pipeline instance.

In [13]:
best_proposal

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): 39,
 ('sklearn.impute.SimpleImputer#1', 'strategy'): 'most_frequent',
 ('xgboost.XGBClassifier#1', 'n_estimators'): 70,
 ('xgboost.XGBClassifier#1', 'max_depth'): 6,
 ('xgboost.XGBClassifier#1', 'learning_rate'): 0.07406443671152008,
 ('xgboost.XGBClassifier#1', 'gamma'): 0.9244108160038952,
 ('xgboost.XGBClassifier#1', 'min_child_weight'): 1}

In [14]:
best_pipeline = MLPipeline(template.to_dict())

In [15]:
best_pipeline.set_hyperparameters(best_proposal)

In [16]:
best_pipeline.fit(dataset.data, dataset.target)
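
The loop above relies on only two tuner calls, `propose` and `record`. As a minimal sketch of that contract, the code below wraps the loop in a reusable function and drives it with a stand-in random-search tuner. `RandomSearchTuner`, `tune` and `toy_score` are hypothetical names that are not part of BTB, and the scoring function is a made-up objective rather than a real cross validation.

```python
import random


class RandomSearchTuner:
    """Hypothetical stand-in with the propose/record interface used above."""

    def __init__(self, space, seed=0):
        self.space = space
        self.rng = random.Random(seed)
        self.trials = []

    def propose(self):
        """Sample one value per hyperparameter from its declared values or range."""
        proposal = {}
        for name, spec in self.space.items():
            if 'values' in spec:
                proposal[name] = self.rng.choice(spec['values'])
            elif spec['type'] == 'int':
                low, high = spec['range']
                proposal[name] = self.rng.randint(low, high)
            else:
                low, high = spec['range']
                proposal[name] = self.rng.uniform(low, high)
        return proposal

    def record(self, proposal, score):
        """Store the trial; a model-based tuner would learn from it instead."""
        self.trials.append((proposal, score))


def tune(tuner, score_function, iterations):
    """Run the propose/score/record loop and return the best trial found."""
    best_proposal, best_score = None, float('-inf')
    for _ in range(iterations):
        proposal = tuner.propose()
        score = score_function(proposal)
        tuner.record(proposal, score)
        if score > best_score:
            best_proposal, best_score = proposal, score
    return best_proposal, best_score


DEPTH = ('xgboost.XGBClassifier#1', 'max_depth')
RATE = ('xgboost.XGBClassifier#1', 'learning_rate')
space = {
    DEPTH: {'type': 'int', 'default': 3, 'range': [3, 10]},
    RATE: {'type': 'float', 'default': 0.1, 'range': [0, 1]},
}


def toy_score(proposal):
    """Made-up objective that peaks at max_depth=6, learning_rate=0.1."""
    return -abs(proposal[DEPTH] - 6) - abs(proposal[RATE] - 0.1)


tuner = RandomSearchTuner(space)
best_proposal, best_score = tune(tuner, toy_score, iterations=20)
print(best_proposal, best_score)
```

Because `tune` only depends on `propose` and `record`, the same function should work unchanged with the `GPTuner` and the `cross_validate` function defined earlier.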