# Hyperparameter Tuning Tutorial

Machine learning models often have many hyperparameters that need to be tuned to achieve maximal performance (e.g., learning rate, dropout rate, hidden layer dimension). This motivates the need for hyperparameter tuners that intelligently search the space of hyperparameters for configurations that yield high-performing models.

To address this, MeTaL supports multiple hyperparameter tuners with an easy-to-use interface that lets users streamline the hyperparameter optimization process. This tutorial covers using MeTaL's hyperparameter tuners to tune an EndModel for maximal performance. Currently, two hyperparameter search algorithms are supported:

- <b>Random Search</b>
- <b>Hyperband</b>

The tutorial is broken down into the following sections:

1. <b>Setting up the Problem and Loading the Data</b>
2. <b>Defining the Search Space</b>
3. <b>Performing Random Search</b>
4. <b>Performing Hyperband Search</b>
5. <b>Comparing Random Search against Hyperband Search</b>

Let's begin!

## Setup
Before beginning, we first need to make sure that the `metal/` directory is on our Python path. If the following cell runs without an error, you're all set. If not, make sure that you've installed snorkel-metal with pip or conda (or that you've added the repo to your path if you're running from source; for example, by running `source add_to_path.sh` from the repository root).

In [10]:
import matplotlib
%load_ext autoreload
%autoreload 2
%matplotlib inline
import metal

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Setting up the Problem and Loading the Data

First let's set up our problem and load our data. For the purposes of this tutorial (and to keep the search process short) we use the small model we were introduced to in the basic tutorial. 

In [41]:
# Load basic tutorial data
from metal.utils import split_data
import pickle

with open("data/basics_tutorial.pkl", 'rb') as f:
    X, Y, L, D = pickle.load(f)
    
Xs, Ys, Ls, Ds = split_data(X, Y, L, D, splits=[0.8, 0.1, 0.1], stratify_by=Y, seed=123)

Let's furthermore define and train our label model like we did in the basic tutorial.

In [42]:
# Train the label model
from metal.label_model import LabelModel
label_model = LabelModel(k=2, seed=123)

label_model.train(Ls[0], Y_dev=Ys[1], n_epochs=1000, print_every=250, lr=0.01, l2=1e-1)
score = label_model.score(Ls[1], Ys[1])
scores = label_model.score(Ls[1], Ys[1], metric=['precision', 'recall', 'f1'])

from metal.label_model.baselines import MajorityLabelVoter

mv = MajorityLabelVoter(seed=123)
scores = mv.score(Ls[1], Ys[1], metric=['accuracy', 'precision', 'recall', 'f1'])
Y_train_ps = label_model.predict_proba(Ls[0])

Computing O...
Estimating \mu...
[E:0]	Train Loss: 6.036
[E:250]	Train Loss: 0.029
[E:500]	Train Loss: 0.029
[E:750]	Train Loss: 0.029
[E:999]	Train Loss: 0.029
Finished Training
Accuracy: 0.879
Precision: 0.771
Recall: 0.724
F1: 0.746
Accuracy: 0.836
Precision: 0.623
Recall: 0.841
F1: 0.716


Now let's define our EndModel and verify that it successfully runs and achieves a decent score. 

In [43]:
# Train an end model
from metal.end_model import EndModel

end_model_basic = EndModel([1000, 10, 2],
                           batchnorm=True,
                           dropout=0.5,
                           l2=0.1,
                           seed=123)

end_model_basic.train(Xs[0], Y_train_ps, Xs[1], Ys[1], l2=0.1, batch_size=256,
                      n_epochs=5, print_every=1, validation_metric='f1')


Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=10, bias=True)
    (1): ReLU()
    (2): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.5)
  )
  (2): Linear(in_features=10, out_features=2, bias=True)
)

Saving model at iteration 0 with best score 0.761
[E:0]	Train Loss: 0.561	Dev score: 0.761
Saving model at iteration 1 with best score 0.902
[E:1]	Train Loss: 0.468	Dev score: 0.902
[E:2]	Train Loss: 0.458	Dev score: 0.840
[E:3]	Train Loss: 0.451	Dev score: 0.870
[E:4]	Train Loss: 0.450	Dev score: 0.818
Restoring best model from iteration 1 with score 0.902
Finished Training
Confusion Matrix (Dev)
        y=1    y=2   
 l=1    202     0    
 l=2    44     754   


Great. Notice that our F1 is around 0.902. In the sections below, we will try to optimize the hyperparameters of this EndModel to achieve an even higher score!
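As a sanity check, the dev F1 can be recomputed by hand from the confusion matrix printed above. The sketch below assumes that rows (`l`) are predictions, columns (`y`) are gold labels, and class 1 is the positive class; these are readings of the printed table, not something confirmed by the MeTaL documentation.

```python
# Counts copied from the dev confusion matrix above.
# Assumption: rows are predicted labels (l), columns are gold labels (y),
# and class 1 is treated as the positive class.
tp = 202  # predicted 1, gold 1
fp = 0    # predicted 1, gold 2
fn = 44   # predicted 2, gold 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=1.000 recall=0.821 f1=0.902
```

Under these assumptions the result agrees with the best dev score of 0.902 reported in the training log.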

## Defining the Search Space

Before starting the hyperparameter tuning process, we need to specify the space of the hyperparameters we're searching. 

For the purposes of this tutorial, we search over the following hyperparameters:
- <b>n_epochs</b>: Integer representing the number of epochs to train
- <b>batchnorm</b>: Boolean representing whether to use batch-normalization
- <b>dropout</b>: Float representing the dropout rate
- <b>lr</b>: Float representing the learning rate for optimization
- <b>layer_out_dims</b>: The architecture of our neural network

In [62]:
search_space = {
    'n_epochs': [1, 5, 10],
    'batchnorm' : [True, False],
    'dropout': [0, .1, .2, .3, .4, .5],
    'lr': {'range': [1e-5, 1], 'scale': 'log'},
    'layer_out_dims' : [[1000,10,2], [1000, 100, 2]],
    'print_every': 5,
    'data_loader_config': [{"batch_size": 256, "num_workers": 1}],
}

Here's a breakdown of what each line in the configuration means:

- `'n_epochs': [1, 5, 10],`: This specifies that the hyperparameter tuner may train the model for either 1, 5 or 10 epochs
- `'batchnorm' : [True, False],`: This specifies that a model instantiated by the tuner may have batchnorm as either True or False
- `'dropout': [0, .1, .2, .3, .4, .5]`: Like the above, this specifies that the dropout parameter of an instantiated model may be one of 0, .1, .2, .3, .4, or .5
- `'lr': {'range': [1e-5, 1], 'scale': 'log'}`: This specifies that the learning rate of the training of a model may range from 1e-5 to 1, and that the tuner samples the learning rate on a log scale
- `'layer_out_dims' : [[1000,10,2], [1000, 100, 2]]`: This specifies that upon instantiation of the model, the structure of the fully connected network can either be [1000, 10, 2] or [1000, 100, 2]; in the latter case, this means the network takes a 1000 dimensional input, has a hidden layer with 100 features and an output layer with 2 classes
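- `'print_every': 5`: This specifies that the model should print status updates every 5 epochs of training. Because it is a single value rather than a list, it is the same for every configuration.
- `'data_loader_config': [{"batch_size": 256, "num_workers": 1}],`: This specifies that training uses a batch size of 256 and a single data-loading worker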
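To make the format concrete, here is a minimal sketch of how one configuration could be drawn from a search space written this way. It is not MeTaL's implementation: `sample_config` is a hypothetical helper, and the three rules it applies (lists are sampled uniformly, `range` dictionaries are sampled on the given scale, anything else is a fixed value) are assumptions based on the breakdown above.

```python
import math
import random


def sample_config(search_space, rng):
    """Draw one configuration from a search space (hypothetical helper)."""
    config = {}
    for name, spec in search_space.items():
        if isinstance(spec, dict) and "range" in spec:
            low, high = spec["range"]
            if spec.get("scale") == "log":
                # Sample uniformly in log10 space, then map back.
                exponent = rng.uniform(math.log10(low), math.log10(high))
                config[name] = 10 ** exponent
            else:
                config[name] = rng.uniform(low, high)
        elif isinstance(spec, list):
            # Discrete choice among the listed values.
            config[name] = rng.choice(spec)
        else:
            # A bare value is treated as fixed.
            config[name] = spec
    return config


example_space = {
    "n_epochs": [1, 5, 10],
    "batchnorm": [True, False],
    "dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "lr": {"range": [1e-5, 1], "scale": "log"},
    "layer_out_dims": [[1000, 10, 2], [1000, 100, 2]],
    "print_every": 5,
}

sampled = sample_config(example_space, random.Random(123))
print(sampled)
```

Sampling the learning rate on a log scale means each decade (1e-5 to 1e-4, 1e-4 to 1e-3, and so on) is equally likely, instead of almost all draws landing between 0.1 and 1.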

Now that our search space is defined, let's start optimizing hyperparameters!

## Performing Random Search

While simple, random search has proven to be a powerful and efficient algorithm for tuning hyperparameters (see http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf for why). Let's use the RandomSearch tuner to find a good set of hyperparameters for our EndModel. Note that although we only do hyperparameter optimization for the EndModel, the tuners may also be used to do hyperparameter optimization for LabelModels.
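The idea behind random search fits in a few lines. The sketch below is illustrative only: `toy_score` is a hypothetical stand-in for "train a model with this configuration and return its dev score", and none of it reflects how `RandomSearchTuner` is implemented.

```python
import math
import random


def toy_score(lr, dropout):
    """Hypothetical stand-in for training a model and scoring it on the dev set."""
    # Peaks at lr = 1e-2 and dropout = 0.2; purely illustrative.
    return 1.0 - 0.1 * abs(math.log10(lr) + 2) - abs(dropout - 0.2)


def random_search(n_trials, seed=0):
    """Sample n_trials configurations independently and keep the best one."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, 0),  # log-scale sample in [1e-5, 1]
            "dropout": rng.choice([0, 0.1, 0.2, 0.3, 0.4, 0.5]),
        }
        score = toy_score(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


best_score, best_config = random_search(n_trials=10, seed=123)
print(best_score, best_config)
```

Every trial is independent of the others, so the only knob is how many configurations to try.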

To start, let's import the RandomSearchTuner and instantiate it to optimize an EndModel.

In [63]:
from metal.tuners.random_tuner import RandomSearchTuner
rs_tuner = RandomSearchTuner(EndModel, seed=123)

Next let's define our training and validation datasets.

In [64]:
train_args = [Xs[0], Y_train_ps]
X_dev, Y_dev = Xs[1], Ys[1]

And just like that we're prepped to launch our random search! Performing the search is just as easy and requires just a single call to the `search` function.

Most of the arguments to the `search` function below are self-explanatory, but there are a couple of key arguments to watch out for:
- `max_search`: This specifies the number of configurations to search over. As it is set to 10 below, we search over 10 random models and return the best one
- `verbose`: This specifies whether the tuner should be verbose, and can be used to turn its logging on or off

In [65]:
best_rs_model = rs_tuner.search(search_space, X_dev, Y_dev, train_args=train_args, max_search=10, metric='f1', verbose=True)


Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=10, bias=True)
    (1): ReLU()
  )
  (2): Linear(in_features=10, out_features=2, bias=True)
)

[1] Testing {'n_epochs': 10, 'batchnorm': False, 'dropout': 0, 'layer_out_dims': [1000, 10, 2], 'data_loader_config': {'batch_size': 256, 'num_workers': 1}, 'lr': 0.3700237151852522}
Saving model at iteration 0 with best score 0.754
[E:0]	Train Loss: 0.994	Dev score: 0.754
[E:5]	Train Loss: 0.577	Dev score: 0.754
[E:9]	Train Loss: 0.577	Dev score: 0.754
Restoring best model from iteration 0 with score 0.754
Finished Training
Confusion Matrix (Dev)
        y=1    y=2   
 l=1     0      0    
 l=2    246    754   
F1: 0.000

Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=100, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1)
  )
  (2): Linear(in_features=100, out_features=2, bias=True)
)

[2] Te

Saving model at iteration 0 with best score 0.948
[E:0]	Train Loss: 0.605	Dev score: 0.948
Restoring best model from iteration 0 with score 0.948
Finished Training
Confusion Matrix (Dev)
        y=1    y=2   
 l=1    243    58    
 l=2     3     696   
F1: 0.909
[SUMMARY]
Best model: [9]
Best config: {'n_epochs': 10, 'batchnorm': False, 'dropout': 0.1, 'layer_out_dims': [1000, 100, 2], 'print_every': 5, 'data_loader_config': {'batch_size': 256, 'num_workers': 1}, 'lr': 0.003970906941573151, 'seed': 131}
Best score: 0.9938900203665988


Awesome, our best random search model achieves an F1 of ~0.994, which outperforms the model we had previously (F1 ~0.90). Can we do even better than random search, either by attaining the same score faster or by achieving a higher score? The following section walks through using the <b>Hyperband</b> tuner, which recent research has shown to be more efficient than random search.

## Performing Hyperband Search

While random search performs surprisingly well, we can be more efficient if we adaptively allocate more compute resources to configurations that perform well than to those that don't. For example, if a configuration yields a really poor model after the first epoch of training, it's unlikely to perform well even after more training, so we can terminate its training early to save compute. This is the core idea behind the <b>Hyperband</b> algorithm, which recent research has shown to outperform various algorithms including random search. (See https://arxiv.org/abs/1603.06560 if interested!)
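The early-termination idea is usually implemented as successive halving, which Hyperband runs several times ("brackets") with different trade-offs between the number of configurations and the epochs given to each. Below is a minimal sketch of a single bracket. The function names, the toy scoring function, and the factor `eta=3` are assumptions for illustration, not MeTaL's implementation.

```python
def successive_halving(configs, train_and_score, min_epochs=1, eta=3):
    """Keep the top 1/eta of configs each round; survivors get eta times more epochs."""
    epochs = min_epochs
    survivors = list(configs)
    while len(survivors) > 1:
        # Score every surviving configuration at the current epoch budget.
        scored = [(train_and_score(c, epochs), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        epochs *= eta
    return survivors[0]


def toy_train_and_score(config, epochs):
    """Hypothetical stand-in: the score approaches config['quality'] with more epochs."""
    return config["quality"] * (1 - 0.5 ** epochs)


# Nine hypothetical configurations with distinct underlying quality.
toy_configs = [{"name": f"cfg{i}", "quality": 0.1 * i} for i in range(1, 10)]
winner = successive_halving(toy_configs, toy_train_and_score)
print(winner["name"])
```

With nine configurations and `eta=3`, the rounds keep 9, then 3, then 1 configuration, so most of the training epochs are spent on the configurations that looked promising early.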

Running Hyperband is just as easy as running random search. Let's import the HyperbandTuner and instantiate it. 

Note that there is one extra argument to initialize the HyperbandTuner:
- `hyperband_epochs_budget`: This specifies the number of total epochs of training the tuner can perform in its search for a performant model. This is used to create the Hyperband search schedule.

In [71]:
from metal.tuners.hyperband_tuner import HyperbandTuner
hb_tuner = HyperbandTuner(EndModel, hyperband_epochs_budget=100, seed=123)

|           Hyperband Schedule          |
Table consists of tuples of (num configs, num_resources_per_config)which specify how many configs to run andfor how many epochs. 
Each bracket starts with a list of random configurations which is successively halved according the schedule.
See the Hyperband paper (https://arxiv.org/pdf/1603.06560.pdf) for more details.
-----------------------------------------
Bracket 0: (9, 1) (3, 4) (1, 13)
Bracket 1: (3, 4) (1, 13)
Bracket 2: (3, 13)
-----------------------------------------
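As a quick arithmetic check, the schedule printed above can be compared against the budget of 100 epochs. The sketch assumes that each tuple costs `num_configs * epochs` (that is, every entry is trained for its full epoch count); whether MeTaL accounts for the budget in exactly this way is an assumption.

```python
# Tuples of (num_configs, epochs_per_config), copied from the printed schedule.
schedule = {
    0: [(9, 1), (3, 4), (1, 13)],
    1: [(3, 4), (1, 13)],
    2: [(3, 13)],
}

# Assumption: each tuple costs num_configs * epochs_per_config epochs.
bracket_cost = {
    bracket: sum(n * r for n, r in rounds)
    for bracket, rounds in schedule.items()
}
total_epochs = sum(bracket_cost.values())

print(bracket_cost)   # {0: 34, 1: 25, 2: 39}
print(total_epochs)   # 98
```

Under this accounting, the schedule uses 98 epochs, just under the `hyperband_epochs_budget` of 100.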


We can launch the Hyperband search process using the same `search` call. Note that since the Hyperband schedule already limits the amount of compute we do, we don't have to set the `max_search` argument.

In [73]:
best_hb_model = hb_tuner.search(search_space, X_dev, Y_dev, train_args=train_args, metric='f1', verbose=True)


Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=10, bias=True)
    (1): ReLU()
  )
  (2): Linear(in_features=10, out_features=2, bias=True)
)

[0 Testing {'n_epochs': 1, 'batchnorm': False, 'dropout': 0, 'layer_out_dims': [1000, 10, 2], 'data_loader_config': {'batch_size': 256, 'num_workers': 1}, 'lr': 0.3700237151852522}
Saving model at iteration 0 with best score 0.754
[E:0]	Train Loss: 0.994	Dev score: 0.754
Restoring best model from iteration 0 with score 0.754
Finished Training
Confusion Matrix (Dev)
        y=1    y=2   
 l=1     0      0    
 l=2    246    754   
F1: 0.000

Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=100, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1)
  )
  (2): Linear(in_features=100, out_features=2, bias=True)
)

[1 Testing {'n_epochs': 1, 'batchnorm': False, 'dropout': 0.1, 'layer_out_dims': [1000, 10

Saving model at iteration 0 with best score 0.972
[E:0]	Train Loss: 0.495	Dev score: 0.972
[E:3]	Train Loss: 0.437	Dev score: 0.915
Restoring best model from iteration 0 with score 0.972
Finished Training
Confusion Matrix (Dev)
        y=1    y=2   
 l=1    221     1    
 l=2    25     753   
F1: 0.946

Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=100, bias=True)
    (1): ReLU()
    (2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.2)
  )
  (2): Linear(in_features=100, out_features=2, bias=True)
)

[11 Testing {'n_epochs': 4, 'batchnorm': True, 'dropout': 0.2, 'layer_out_dims': [1000, 100, 2], 'data_loader_config': {'batch_size': 256, 'num_workers': 1}, 'lr': 0.0004837052086066461}
Saving model at iteration 0 with best score 0.965
[E:0]	Train Loss: 0.593	Dev score: 0.965
Saving model at iteration 1 with best score 0.975
[E:3]	Train Loss: 0.431	Dev scor

Saving model at iteration 0 with best score 0.258
[E:0]	Train Loss: 0.739	Dev score: 0.258
Saving model at iteration 1 with best score 0.296
Saving model at iteration 2 with best score 0.377
Saving model at iteration 3 with best score 0.498
Saving model at iteration 4 with best score 0.649
Saving model at iteration 5 with best score 0.730
[E:5]	Train Loss: 0.674	Dev score: 0.730
Saving model at iteration 6 with best score 0.778
Saving model at iteration 7 with best score 0.783
[E:10]	Train Loss: 0.612	Dev score: 0.772
[E:12]	Train Loss: 0.594	Dev score: 0.765
Restoring best model from iteration 7 with score 0.783
Finished Training
Confusion Matrix (Dev)
        y=1    y=2   
 l=1    48     20    
 l=2    198    734   
F1: 0.306

Network architecture:
Sequential(
  (0): IdentityModule()
  (1): Sequential(
    (0): Linear(in_features=1000, out_features=100, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.5)
  )
  (2): Linear(in_features=100, out_features=2, bias=True)
)

[19 Testing {'n_

EndModel(
  (network): Sequential(
    (0): IdentityModule()
    (1): Sequential(
      (0): Linear(in_features=1000, out_features=10, bias=True)
      (1): ReLU()
      (2): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Dropout(p=0.4)
    )
    (2): Linear(in_features=10, out_features=2, bias=True)
  )
  (criteria): SoftCrossEntropyLoss()
)

Awesome, we achieved an F1 of ~0.97, which beat our initial F1 of ~0.90. Unfortunately, it did not beat the score achieved by random search. The next section will compare the performance of random search and Hyperband using the logged data.

## Comparing Random Search against Hyperband Search