# Selecting and Tuning Pipelines

This guide shows you how to search for multiple pipelines that fit your problem
and then use a [BTBSession](https://hdi-project.github.io/BTB/api/btb.session.html#btb.session.BTBSession)
to select and tune the best one.

Note that, for simplicity, some steps are not explained here. Full details
about them can be found in the previous parts of the tutorial.

Here we will:

1. Load a dataset
2. Search and load suitable templates
3. Write a scoring function
4. Build a BTBSession for our templates
5. Run the session to find the best pipeline

## Load the Dataset

The first step will be to load the dataset.

In [1]:
from mlprimitives.datasets import load_dataset

dataset = load_dataset('census')

In [2]:
dataset.describe()

Adult Census dataset.

    Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset.

    Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean
    records was extracted using the following conditions: ((AAGE>16) && (AGI>100) &&
    (AFNLWGT>1)&& (HRSWK>0))

    Prediction task is to determine whether a person makes over 50K a year.

    source: "UCI
    sourceURI: "https://archive.ics.uci.edu/ml/datasets/census+income"
    
Data Modality: single_table
Task Type: classification
Task Subtype: binary
Data shape: (32561, 14)
Target shape: (32561,)
Metric: accuracy_score
Extras: 


## Find and load suitable Templates

We will be using the `mlblocks.discovery.find_pipelines` function to search
for compatible pipelines.

In this case, we will be looking for `single_table/classification` pipelines.

In [3]:
from mlblocks.discovery import find_pipelines

filters = {
    'metadata.data_type': 'single_table',
    'metadata.task_type': 'classification'
}
find_pipelines(filters=filters)

['keras.Sequential.MLPBinaryClassifier',
 'keras.Sequential.MLPMultiClassClassifier',
 'single_table.classification',
 'single_table.classification.text',
 'single_table.classification.xgb',
 'sklearn.decomposition.DictionaryLearning',
 'sklearn.decomposition.FactorAnalysis',
 'sklearn.decomposition.FastICA',
 'sklearn.decomposition.KernelPCA',
 'sklearn.decomposition.PCA',
 'sklearn.decomposition.TruncatedSVD',
 'sklearn.ensemble.AdaBoostClassifier',
 'sklearn.ensemble.BaggingClassifier',
 'sklearn.ensemble.ExtraTreesClassifier',
 'sklearn.ensemble.GradientBoostingClassifier',
 'sklearn.ensemble.IsolationForest',
 'sklearn.ensemble.RandomForestClassifier',
 'sklearn.ensemble.RandomTreesEmbedding',
 'sklearn.linear_model.LogisticRegression']

Next, we will create a dictionary of `MLPipeline` instances that will be used as templates for our tuning.

In [20]:
from mlblocks import MLPipeline

templates = [
    'single_table.classification',
    'single_table.classification.text',
    'single_table.classification.xgb',
    'sklearn.ensemble.AdaBoostClassifier',
    'sklearn.ensemble.BaggingClassifier',
    'sklearn.ensemble.ExtraTreesClassifier',
    'sklearn.ensemble.GradientBoostingClassifier',
    #'sklearn.ensemble.IsolationForest',
    #'sklearn.ensemble.RandomForestClassifier',
    #'sklearn.ensemble.RandomTreesEmbedding',
    #'sklearn.linear_model.LogisticRegression'
]

templates_dict = {
    template: MLPipeline(template)
    for template in templates
}

In [21]:
templates_dict['single_table.classification']

<mlblocks.mlpipeline.MLPipeline at 0x7f6fbb494a58>

## Create a scoring function

In order to use a `BTBSession` we will need a function that is able to score a proposal,
which will always be a pair of template name and proposed hyperparameters.

In this case, the evaluation will be done using 5-fold cross validation over our dataset.

In [22]:
import numpy as np

def cross_validate(template_name, hyperparameters=None):
    template = templates_dict[template_name]
    scores = []
    for X_train, X_test, y_train, y_test in dataset.get_splits(5):
        pipeline = MLPipeline(template.to_dict())  # Make a copy of the template
        if hyperparameters:
            pipeline.set_hyperparameters(hyperparameters)

        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        
        scores.append(dataset.score(y_test, y_pred))
        
    return np.mean(scores)
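
To make the evaluation scheme more concrete, here is a minimal, library-free sketch of k-fold cross validation. It is not how `dataset.get_splits` is implemented: the helper names (`k_fold_indices`, `toy_cross_validate`, `majority_class`) and the toy data are assumptions made for illustration only, and the folds here are contiguous and unshuffled.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread the remainder over the first folds

    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def toy_cross_validate(X, y, fit_predict, n_splits=5):
    """Average the accuracy of `fit_predict` over `n_splits` folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_splits):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]

        y_pred = fit_predict(X_train, y_train, X_test)
        correct = sum(p == t for p, t in zip(y_pred, y_test))
        scores.append(correct / len(y_test))

    return mean(scores)


def majority_class(X_train, y_train, X_test):
    """Hypothetical stand-in model: always predict the most common training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Made-up toy data, only used to exercise the functions above.
X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
cv_score = toy_cross_validate(X, y, majority_class)
```

The `cross_validate` function above follows the same pattern, with the difference that a fresh `MLPipeline` is fitted on each training split and `dataset.score` computes the metric.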

## Set up the BTBSession

We will create another dictionary with the tunable hyperparameters of each template.
This will be used by the BTBSession to know how to tune each template.

In [23]:
tunables = {
    name: template.get_tunable_hyperparameters(flat=True)
    for name, template in templates_dict.items()
}

In [24]:
tunables['single_table.classification']

{('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
  'max_labels'): {'type': 'int', 'default': 0, 'range': [0, 100]},
 ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
  'lowercase'): {'type': 'bool', 'default': True},
 ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
  'binary'): {'type': 'bool', 'default': True},
 ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
  'max_features'): {'type': 'int', 'default': 1000, 'range': [1, 10000]},
 ('sklearn.impute.SimpleImputer#1', 'strategy'): {'type': 'str',
  'default': 'mean',
  'values': ['mean', 'median', 'most_frequent', 'constant']},
 ('xgboost.XGBClassifier#1', 'n_estimators'): {'type': 'int',
  'default': 100,
  'range': [10, 1000]},
 ('xgboost.XGBClassifier#1', 'max_depth'): {'type': 'int',
  'default': 3,
  'range': [3, 10]},
 ('xgboost.XGBClassifier#1', 'learning_rate'): {'type': 'float',
  'default': 0.1,
  'range': [0, 1]},
 ('xgboost.XGBClassifier#1', 'gamma'): {'type': 'float'
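
The flat format uses `(block name, hyperparameter name)` tuples as keys. As a rough sketch of how such a dictionary can be inspected, the snippet below regroups a few of the entries shown above by block and collects their default values. The helper names `group_by_block` and `default_config` are hypothetical and not part of MLBlocks or BTB.

```python
from collections import defaultdict

# A few entries copied from the output above.
flat_tunables = {
    ('sklearn.impute.SimpleImputer#1', 'strategy'): {
        'type': 'str',
        'default': 'mean',
        'values': ['mean', 'median', 'most_frequent', 'constant'],
    },
    ('xgboost.XGBClassifier#1', 'n_estimators'): {
        'type': 'int', 'default': 100, 'range': [10, 1000],
    },
    ('xgboost.XGBClassifier#1', 'max_depth'): {
        'type': 'int', 'default': 3, 'range': [3, 10],
    },
}


def group_by_block(flat):
    """Turn {(block, hyperparameter): spec} into {block: {hyperparameter: spec}}."""
    nested = defaultdict(dict)
    for (block_name, hyperparameter), spec in flat.items():
        nested[block_name][hyperparameter] = spec
    return dict(nested)


def default_config(flat):
    """Collect the default value of every hyperparameter, keeping the flat keys."""
    return {key: spec['default'] for key, spec in flat.items()}


nested = group_by_block(flat_tunables)
defaults = default_config(flat_tunables)
```

The `defaults` dictionary has the same shape as the `config` entry of the proposals shown later in this guide.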

Then we create a `BTBSession` instance, passing it the `tunables` dictionary and the `cross_validate` function.

We also set `verbose=True` to get better insight into what is going on.

In [25]:
from btb.session import BTBSession

session = BTBSession(tunables, cross_validate, verbose=True)

## Run the session

After everything is set up, we can start running the tuning session passing it
the number of iterations that we want to perform.

In [26]:
session.run(5)

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

2020-06-08 14:14:00,210 - INFO - session - Creating Tunable instance from dict.
2020-06-08 14:14:00,211 - INFO - session - Obtaining default configuration for single_table.classification
2020-06-08 14:14:05,387 - INFO - session - New optimal found: single_table.classification - 0.8639171383183359
2020-06-08 14:14:05,393 - INFO - session - Creating Tunable instance from dict.
2020-06-08 14:14:05,394 - INFO - session - Obtaining default configuration for single_table.classification.text
2020-06-08 14:14:05,537 - ERROR - mlpipeline - Exception caught producing MLBlock mlprimitives.custom.text.TextCleaner#1
Traceback (most recent call last):
  File "/home/xals/.virtualenvs/MLBlocks.clean/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2657, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs

2020-06-08 14:14:12,105 - ERROR - session - Proposal 4 - sklearn.ensemble.AdaBoostClassifier crashed with the following configuration: ('sklearn.ensemble.AdaBoostClassifier#1', 'n_estimators'): 50
('sklearn.ensemble.AdaBoostClassifier#1', 'learning_rate'): 1.0
('sklearn.ensemble.AdaBoostClassifier#1', 'algorithm'): SAMME.R
Traceback (most recent call last):
  File "/home/xals/.virtualenvs/MLBlocks.clean/lib/python3.6/site-packages/btb/session.py", line 336, in run
    score = self._scorer(tunable_name, config)
  File "<ipython-input-22-067b925bbee5>", line 11, in cross_validate
    pipeline.fit(X_train, y_train)
  File "/home/xals/Projects/MIT/MLBlocks.clean/mlblocks/mlpipeline.py", line 719, in fit
    self._fit_block(block, block_name, context)
  File "/home/xals/Projects/MIT/MLBlocks.clean/mlblocks/mlpipeline.py", line 619, in _fit_block
    block.fit(**fit_args)
  File "/home/xals/Projects/MIT/MLBlocks.clean/mlblocks/mlblock.py", line 302, in fit
    getattr(self.instance, self.fit






{'id': 'c2cd14c7e9470448a0eeb58a3cce327f',
 'name': 'single_table.classification',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 0,
  ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
   'lowercase'): True,
  ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
   'binary'): True,
  ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
   'max_features'): 1000,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 100,
  ('xgboost.XGBClassifier#1', 'max_depth'): 3,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.1,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.0,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 1},
 'score': 0.8639171383183359}

During this loop, the BTBSession will build pipelines based on our templates and evaluate them
using our scoring function.
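
As a rough mental model of that loop, the sketch below first scores each template with its default configuration and then keeps proposing new configurations while tracking the best score. This is not BTB's algorithm: plain random sampling stands in for the selection and tuning logic, and `toy_session`, `toy_tunables` and `toy_scorer` are hypothetical names with made-up values.

```python
import random


def toy_session(tunables, scorer, iterations, seed=0):
    """Simplified select-and-tune loop; random search stands in for BTB's logic."""
    rng = random.Random(seed)
    best = None
    untried = list(tunables)  # templates not yet scored with their defaults

    for _ in range(iterations):
        if untried:
            # Start every template from its default configuration.
            name = untried.pop(0)
            config = {key: spec['default'] for key, spec in tunables[name].items()}
        else:
            # Afterwards, pick a template and sample a new configuration.
            name = rng.choice(list(tunables))
            config = {
                key: rng.randint(*spec['range'])
                for key, spec in tunables[name].items()
            }

        score = scorer(name, config)
        if best is None or score > best['score']:
            best = {'name': name, 'config': config, 'score': score}

    return best


# Hypothetical templates with a single integer hyperparameter each.
toy_tunables = {
    'template_a': {('model#1', 'depth'): {'type': 'int', 'default': 3, 'range': [1, 10]}},
    'template_b': {('model#1', 'depth'): {'type': 'int', 'default': 3, 'range': [1, 10]}},
}


def toy_scorer(name, config):
    """Made-up score that peaks at depth 7 and slightly favours template_b."""
    depth = config[('model#1', 'depth')]
    bonus = 0.05 if name == 'template_b' else 0.0
    return 1.0 - abs(depth - 7) / 10 + bonus


best = toy_session(toy_tunables, toy_scorer, iterations=20)
```

The scorer has the same signature as `cross_validate`: it receives a template name and a configuration, and returns a single number where higher is better.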

## Evaluate results

When the session finishes running, it returns the best proposal available, together with its
score.

This result is also available as the `best_proposal` attribute of the `session` object.

In [27]:
session.best_proposal

{'id': 'c2cd14c7e9470448a0eeb58a3cce327f',
 'name': 'single_table.classification',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 0,
  ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
   'lowercase'): True,
  ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
   'binary'): True,
  ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
   'max_features'): 1000,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 100,
  ('xgboost.XGBClassifier#1', 'max_depth'): 3,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.1,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.0,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 1},
 'score': 0.8639171383183359}

## Continue Running

If we feel that the score can still be improved and want to keep searching, we can simply run the session again, which will continue tuning from the previous results.

In [28]:
session.run(20)

HBox(children=(FloatProgress(value=0.0, max=20.0), HTML(value='')))

2020-06-08 14:15:55,578 - INFO - session - Creating Tunable instance from dict.
2020-06-08 14:15:55,579 - INFO - session - Obtaining default configuration for sklearn.ensemble.ExtraTreesClassifier
2020-06-08 14:15:55,652 - ERROR - mlpipeline - Exception caught fitting MLBlock sklearn.ensemble.ExtraTreesClassifier#1
Traceback (most recent call last):
  File "/home/xals/Projects/MIT/MLBlocks.clean/mlblocks/mlpipeline.py", line 619, in _fit_block
    block.fit(**fit_args)
  File "/home/xals/Projects/MIT/MLBlocks.clean/mlblocks/mlblock.py", line 302, in fit
    getattr(self.instance, self.fit_method)(**fit_kwargs)
  File "/home/xals/.virtualenvs/MLBlocks.clean/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 250, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/home/xals/.virtualenvs/MLBlocks.clean/lib/python3.6/site-packages/sklearn/utils/validation.py", line 527, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "/home/xals

2020-06-08 14:18:38,741 - INFO - session - Generating new proposal configuration for single_table.classification
2020-06-08 14:19:08,970 - INFO - session - Generating new proposal configuration for single_table.classification.xgb
2020-06-08 14:20:23,832 - INFO - session - Generating new proposal configuration for single_table.classification
2020-06-08 14:20:55,544 - INFO - session - New optimal found: single_table.classification - 0.8704278836015364
2020-06-08 14:20:55,550 - INFO - session - Generating new proposal configuration for single_table.classification
2020-06-08 14:21:28,246 - INFO - session - New optimal found: single_table.classification - 0.8721784648431357
2020-06-08 14:21:28,251 - INFO - session - Generating new proposal configuration for single_table.classification.xgb
2020-06-08 14:22:34,006 - INFO - session - Generating new proposal configuration for single_table.classification
2020-06-08 14:22:54,891 - INFO - session - Generating new proposal configuration for single_




{'id': 'c2c290721e685580ec264a6351e423a1',
 'name': 'single_table.classification.xgb',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 48,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 794,
  ('xgboost.XGBClassifier#1', 'max_depth'): 4,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.18247155663126724,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.9300424341582778,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 9},
 'score': 0.8734376738867757}

**NOTE**: If you look at the logs, you will notice how the BTBSession captures the errors that it finds
while executing the pipelines and automatically discards the failing templates, so that it can continue
the tuning session without wasting time on them.

The number of errors to tolerate before discarding a template can be changed by passing the
`max_errors` argument to the `BTBSession` when it is built.

Isn't it cool?
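
The idea behind this error budget can be sketched in a few lines of plain Python. This is an illustration, not BTB's implementation: `run_with_error_budget`, the candidate list and the failing scorer are hypothetical, and the exact threshold semantics (discarding once the error count reaches `max_errors`) are an assumption.

```python
from collections import Counter


def run_with_error_budget(candidates, scorer, max_errors=1):
    """Score candidates, dropping any template that fails `max_errors` times."""
    errors = Counter()
    discarded = set()
    results = []

    for name, config in candidates:
        if name in discarded:
            continue  # do not waste time on a template that keeps crashing

        try:
            results.append((name, scorer(name, config)))
        except Exception:
            errors[name] += 1
            if errors[name] >= max_errors:
                discarded.add(name)

    return results, discarded


calls = []  # records which templates were actually evaluated


def flaky_scorer(name, config):
    """Made-up scorer that always crashes for the 'broken' template."""
    calls.append(name)
    if name == 'broken':
        raise ValueError('could not fit the pipeline')
    return 0.9


candidates = [('good', {}), ('broken', {}), ('broken', {}), ('good', {})]
results, discarded = run_with_error_budget(candidates, flaky_scorer, max_errors=1)
```

With `max_errors=1`, the failing template is evaluated only once and every later candidate that uses it is skipped, while the working template keeps being scored.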

## Build the best pipeline

Once we are satisfied with the results, we can then build an instance of the best pipeline
by reading the `best_proposal` attribute from the `session`.

In [29]:
best_proposal = session.best_proposal
best_proposal

{'id': 'c2c290721e685580ec264a6351e423a1',
 'name': 'single_table.classification.xgb',
 'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
   'max_labels'): 48,
  ('sklearn.impute.SimpleImputer#1', 'strategy'): 'median',
  ('xgboost.XGBClassifier#1', 'n_estimators'): 794,
  ('xgboost.XGBClassifier#1', 'max_depth'): 4,
  ('xgboost.XGBClassifier#1', 'learning_rate'): 0.18247155663126724,
  ('xgboost.XGBClassifier#1', 'gamma'): 0.9300424341582778,
  ('xgboost.XGBClassifier#1', 'min_child_weight'): 9},
 'score': 0.8734376738867757}

In [30]:
template = templates_dict[best_proposal['name']]

pipeline = MLPipeline(template.to_dict())
pipeline.set_hyperparameters(best_proposal['config'])

pipeline.fit(dataset.data, dataset.target)

## Explore other results

Optionally, if we are interested in exploring the results of the previous proposals, we can access them
in the `proposals` attribute of the `session` object.

In [31]:
list(session.proposals.values())[0:2]

[{'id': 'c2cd14c7e9470448a0eeb58a3cce327f',
  'name': 'single_table.classification',
  'config': {('mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
    'max_labels'): 0,
   ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
    'lowercase'): True,
   ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
    'binary'): True,
   ('mlprimitives.custom.feature_extraction.StringVectorizer#1',
    'max_features'): 1000,
   ('sklearn.impute.SimpleImputer#1', 'strategy'): 'mean',
   ('xgboost.XGBClassifier#1', 'n_estimators'): 100,
   ('xgboost.XGBClassifier#1', 'max_depth'): 3,
   ('xgboost.XGBClassifier#1', 'learning_rate'): 0.1,
   ('xgboost.XGBClassifier#1', 'gamma'): 0.0,
   ('xgboost.XGBClassifier#1', 'min_child_weight'): 1},
  'score': 0.8639171383183359},
 {'id': 'adbd189a819483ddc869ceb94513b369',
  'name': 'single_table.classification.text',
  'config': {('mlprimitives.custom.text.TextCleaner#1', 'lower'): True,
   ('mlprimitives.custom.text.TextClean