# Selecting and Tuning pipelines

This guide shows you how to search for multiple pipelines that suit your problem
and then use a [BTBSession](https://mlbazaar.github.io/BTB/api/btb.session.html#btb.session.BTBSession)
to select and tune the best one.

For simplicity, some steps are not explained here. Full details
about them can be found in the previous parts of the tutorial.

Here we will:

1. Load a dataset
2. Search and load suitable templates
3. Write a scoring function
4. Build a BTBSession for our templates
5. Run the session to find the best pipeline

## Load the Dataset

The first step will be to load the dataset.

In [1]:
from utils import load_census

dataset = load_census()

In [2]:
dataset.describe()

Adult Census dataset.

    Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset.

    Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean
    records was extracted using the following conditions: ((AAGE>16) && (AGI>100) &&
    (AFNLWGT>1)&& (HRSWK>0))

    Prediction task is to determine whether a person makes over 50K a year.

    source: "UCI
    sourceURI: "https://archive.ics.uci.edu/ml/datasets/census+income"
    
Data Modality: single_table
Task Type: classification
Task Subtype: binary
Data shape: (32561, 14)
Target shape: (32561,)
Metric: accuracy_score
Extras: 


## Find and load suitable Templates

We will be using the `mlblocks.discovery.find_pipelines` function to search
for compatible pipelines.

In this case, we will be looking for `single_table.classification` pipelines.

In [3]:
from mlblocks.discovery import find_pipelines

templates = find_pipelines('single_table.classification')

In [4]:
templates

['single_table.classification',
 'single_table.classification.text',
 'single_table.classification.xgb']

Next, we will create a dictionary of `MLPipeline` instances that will be used as templates for our tuning.

In [5]:
from mlblocks import MLPipeline

templates_dict = {
    template: MLPipeline(template)
    for template in templates
}

In [6]:
templates_dict['single_table.classification.xgb']

<mlblocks.mlpipeline.MLPipeline at 0x293518790>

## Create a scoring function

In order to use a `BTBSession` we will need a function that is able to score a proposal,
which will always be a pair of template name and proposed hyperparameters.

In this case, the evaluation will be done using 5-fold cross validation over our dataset.

In [7]:
import numpy as np

def cross_validate(template_name, hyperparameters=None):
    template = templates_dict[template_name]
    scores = []
    for X_train, X_test, y_train, y_test in dataset.get_splits(5):
        pipeline = MLPipeline(template.to_dict())  # Make a copy of the template
        if hyperparameters:
            pipeline.set_hyperparameters(hyperparameters)

        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        
        scores.append(dataset.score(y_test, y_pred))
        
    return np.mean(scores)
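
The fold mechanics behind a function like `cross_validate` can be sketched without MLBlocks at all. The example below is only an illustration: `k_fold_indices` and `majority_baseline_cv` are hypothetical helpers, not part of `utils`, MLBlocks or BTB, and the sketch assumes that `dataset.get_splits(5)` behaves like an ordinary shuffled k-fold splitter. A trivial "predict the most common training label" model stands in for the pipeline.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is used for testing exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of (almost) equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def majority_baseline_cv(y, n_splits=5):
    """Average accuracy of a 'most common training label' model over the folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), n_splits):
        # "Fit": find the most frequent label in the training fold.
        majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
        # "Predict and score": accuracy of always predicting that label.
        scores.append(mean(1.0 if y[i] == majority else 0.0 for i in test_idx))
    return mean(scores)


# Toy labels, made up for illustration only.
labels = [0] * 80 + [1] * 20
baseline = majority_baseline_cv(labels)
```

A baseline of this kind is a useful reference point: a tuned pipeline should score clearly above it.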

## Set up the BTBSession

We will create another dictionary with the tunable hyperparameters of each template.
This will be used by the BTBSession to know how to tune each template.

In [8]:
tunables = {
    name: template.get_tunable_hyperparameters(flat=True)
    for name, template in templates_dict.items()
}

In [9]:
tunables['single_table.classification.xgb']

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): {'type': 'int', 'default': 0, 'range': [0, 100]},
 ('sklearn.impute.SimpleImputer#1', 'strategy'): {'type': 'str',
  'default': 'mean',
  'values': ['mean', 'median', 'most_frequent', 'constant']},
 ('xgboost.XGBClassifier#1', 'n_estimators'): {'type': 'int',
  'default': 100,
  'range': [10, 1000]},
 ('xgboost.XGBClassifier#1', 'max_depth'): {'type': 'int',
  'default': 3,
  'range': [3, 10]},
 ('xgboost.XGBClassifier#1', 'learning_rate'): {'type': 'float',
  'default': 0.1,
  'range': [0, 1]},
 ('xgboost.XGBClassifier#1', 'gamma'): {'type': 'float',
  'default': 0,
  'range': [0, 1]},
 ('xgboost.XGBClassifier#1', 'min_child_weight'): {'type': 'int',
  'default': 1,
  'range': [1, 10]}}
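
To make the structure of this dictionary more concrete, here is a small sketch that reads a flat tunable specification of the shape shown above. The helpers `default_config` and `find_problems` are hypothetical and are not part of MLBlocks or BTB. The sketch also assumes that `range` bounds are inclusive, which the output above does not state.

```python
# Excerpt of the flat tunable specification shown above.
tunable = {
    ('sklearn.impute.SimpleImputer#1', 'strategy'): {
        'type': 'str',
        'default': 'mean',
        'values': ['mean', 'median', 'most_frequent', 'constant'],
    },
    ('xgboost.XGBClassifier#1', 'max_depth'): {
        'type': 'int',
        'default': 3,
        'range': [3, 10],
    },
}


def default_config(tunable):
    """Build a {(block, hyperparameter): value} config from the declared defaults."""
    return {key: spec['default'] for key, spec in tunable.items()}


def find_problems(config, tunable):
    """Return a list of (key, reason) pairs for values outside the declared space."""
    problems = []
    for key, spec in tunable.items():
        if key not in config:
            problems.append((key, 'missing'))
            continue
        value = config[key]
        if 'range' in spec:
            low, high = spec['range']
            # Assumption: both bounds are inclusive.
            if not low <= value <= high:
                problems.append((key, 'out of range'))
        elif 'values' in spec and value not in spec['values']:
            problems.append((key, 'not an allowed value'))
    return problems


defaults = default_config(tunable)
print(defaults)
print(find_problems(defaults, tunable))
```

The keys are `(block name, hyperparameter name)` tuples, which is the same shape as the `config` entry of the proposals that the session produces.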

Then we create a `BTBSession` instance, passing it the `tunables` dictionary and the `cross_validate` function.

We will also set it to `verbose` mode, so we get better insight into what is going on.

In [10]:
from baytune.session import BTBSession

session = BTBSession(tunables, cross_validate, verbose=True)

## Run the session

After everything is set up, we can start running the tuning session passing it
the number of iterations that we want to perform.

In [11]:
session.run(5)

  0%|          | 0/5 [00:00<?, ?it/s]

Exception caught producing MLBlock mlprimitives.custom.text.TextCleaner#1
Traceback (most recent call last):
  File "/opt/anaconda3/envs/py10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'text'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sarah/Documents/git-repos/MLBlocks/mlblocks/mlpipeline.py", line 679, in _produce_block
    block_outputs = block.produce(**produce_args)
  File "/Users/sarah/Documents/git-repos/MLBlocks/mlblocks/

{'id': '0ebe8af9c06a05f39821de36d6c9ffc2',
 'name': 'single_table.classification.xgb',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 52,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 313,
  ('xgboost.XGBClassifier#1', 'max_depth'): 5,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.7119589664956909,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.944854007471167,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 10},
 'score': 0.8641320270062784}

During this loop, the BTBSession will build pipelines based on our templates and evaluate them
using our scoring function.
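
The general shape of such a loop can be sketched in plain Python. This is deliberately not BTB's algorithm: the sketch cycles through the templates in round-robin order and draws hyperparameters at random, whereas BTB applies its own selection and tuning strategies. The function `toy_session`, the toy search space and the toy scorer are all hypothetical and made up for illustration.

```python
import random


def toy_session(tunables, scorer, iterations, seed=0):
    """Simplified stand-in for a select-and-tune loop (random search, round-robin)."""
    rng = random.Random(seed)
    names = list(tunables)
    best = None
    for i in range(iterations):
        # "Select": cycle through the templates in order.
        name = names[i % len(names)]
        # "Tune": sample every hyperparameter from its declared space.
        config = {}
        for key, spec in tunables[name].items():
            if 'values' in spec:
                config[key] = rng.choice(spec['values'])
            elif spec['type'] == 'int':
                config[key] = rng.randint(*spec['range'])
            else:
                config[key] = rng.uniform(*spec['range'])
        # "Evaluate": score the (template name, hyperparameters) pair.
        proposal = {'name': name, 'config': config, 'score': scorer(name, config)}
        if best is None or proposal['score'] > best['score']:
            best = proposal
    return best


# Toy search space and scorer, made up for illustration only.
toy_tunables = {
    'template_a': {('block#1', 'x'): {'type': 'float', 'default': 0.5, 'range': [0, 1]}},
    'template_b': {('block#1', 'x'): {'type': 'float', 'default': 0.5, 'range': [0, 1]}},
}


def toy_scorer(name, config):
    bonus = 2.0 if name == 'template_b' else 0.0
    return bonus + config[('block#1', 'x')]


best = toy_session(toy_tunables, toy_scorer, iterations=10)
```

As in the real session, the scorer only ever sees a template name and a set of proposed hyperparameters, and the loop keeps track of the best-scoring proposal seen so far.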

## Evaluate results

When the session finishes running, it returns the best proposal available together with
the score it obtained.

These results are also available as the `best_proposal` attribute of the session object.

In [12]:
session.best_proposal

{'id': '0ebe8af9c06a05f39821de36d6c9ffc2',
 'name': 'single_table.classification.xgb',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 52,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 313,
  ('xgboost.XGBClassifier#1', 'max_depth'): 5,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.7119589664956909,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.944854007471167,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 10},
 'score': 0.8641320270062784}

## Continue Running

If we feel that the score can still be improved and want to keep searching, we can simply run the session again, which will continue tuning on top of the previous results.

In [13]:
session.run(10)

  0%|          | 0/10 [00:00<?, ?it/s]

{'id': '0e379b2b0932f77d9b541925a05716be',
 'name': 'single_table.classification.xgb',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 43,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 609,
  ('xgboost.XGBClassifier#1', 'max_depth'): 5,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.16947366722929258,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.8805192101300107,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 1},
 'score': 0.8727005495718071}

**NOTE**: If you look at the logs, you will notice how the BTBSession captures the errors that it finds
while executing the pipelines and automatically discards the failing templates, so that it can continue
the tuning session without wasting time on them.

The number of errors to tolerate before a template is discarded can be changed by passing the
`max_errors` argument to the `BTBSession` when it is built.
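
The idea of an error budget per template can be illustrated with a small, self-contained sketch. It is not BTB's implementation: `run_with_error_budget` is a hypothetical helper, the default of `max_errors=1` is an arbitrary choice for this example, and the rule "discard once the error count reaches `max_errors`" is an assumption about the threshold. The scorer mimics the `KeyError: 'text'` seen in the logs and returns a placeholder score otherwise.

```python
from collections import Counter


def run_with_error_budget(templates, scorer, iterations, max_errors=1):
    """Score templates in turn, dropping any that fails max_errors times."""
    errors = Counter()
    active = list(templates)
    results = []
    for i in range(iterations):
        if not active:
            break  # every template has been discarded
        name = active[i % len(active)]
        try:
            results.append((name, scorer(name)))
        except Exception:
            errors[name] += 1
            # Assumption: a template is discarded once its count reaches max_errors.
            if errors[name] >= max_errors:
                active.remove(name)
    return results, active, dict(errors)


def flaky_scorer(name):
    """Hypothetical scorer: the text template fails, the others get a placeholder score."""
    if name.endswith('.text'):
        raise KeyError('text')
    return 0.0  # placeholder, not a real result


results, active, errors = run_with_error_budget(
    ['single_table.classification.xgb', 'single_table.classification.text'],
    flaky_scorer,
    iterations=6,
    max_errors=2,
)
```

With a larger `max_errors`, a template that fails only occasionally gets more chances before it is dropped, at the cost of spending more iterations on templates that always fail.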

Isn't it cool?

## Build the best pipeline

Once we are satisfied with the results, we can then build an instance of the best pipeline
by reading the `best_proposal` attribute from the `session`.

In [14]:
best_proposal = session.best_proposal
best_proposal

{'id': '0e379b2b0932f77d9b541925a05716be',
 'name': 'single_table.classification.xgb',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 43,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 609,
  ('xgboost.XGBClassifier#1', 'max_depth'): 5,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.16947366722929258,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.8805192101300107,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 1},
 'score': 0.8727005495718071}

In [15]:
template = templates_dict[best_proposal['name']]

pipeline = MLPipeline(template.to_dict())
pipeline.set_hyperparameters(best_proposal['config'])

pipeline.fit(dataset.data, dataset.target)

## Explore other results

Optionally, if we are interested in exploring the results of the previous proposals, we can access them
through the `proposals` attribute of the `session` object.

In [16]:
list(session.proposals.values())[0:2]

[{'id': 'c2cd14c7e9470448a0eeb58a3cce327f',
  'name': 'single_table.classification',
  'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
    'max_labels'): 0,
   ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
    'lowercase'): True,
   ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
    'binary'): True,
   ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
    'max_features'): 1000,
   ('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
   ('xgboost.XGBClassifier#1', 'n_estimators'): 100,
   ('xgboost.XGBClassifier#1', 'max_depth'): 3,
   ('xgboost.XGBClassifier#1', 'learning_rate'): 0.1,
   ('xgboost.XGBClassifier#1', 'gamma'): 0.0,
   ('xgboost.XGBClassifier#1', 'min_child_weight'): 1},
  'score': 0.863978563379761},
 {'id': 'adbd189a819483ddc869ceb94513b369',
  'name': 'single_table.classification.text',
  'config': {('mlprimitives.custom.text.TextCleaner#1', 'lower'): True,
   ('mlprimitives.custom.text.TextCleane