## MNIST example

This notebook demonstrates and end-to-end application of the glimr package.

Using MNIST classification as a simple example, we demonstrate the steps to create a search space, model builder, and dataloader for use in tuning. This provides a concrete example of topics like using the `glimr.utils` and `glimr.keras` functions to create hyperparameters and to correctly name losses and metrics for training and reporting.

This is followed by a demonstration of the `Search` class to show how to setup and run experiments.

In [1]:
!pip install ../../glimr

Processing /Users/lcoop22/Desktop/glimr
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Installing backend dependencies ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone


Building wheels for collected packages: glimr
  Building wheel for glimr (pyproject.toml) ... [?25ldone
[?25h  Created wheel for glimr: filename=glimr-0.1.dev39+g3711710.d20230321-py3-none-any.whl size=19010 sha256=e4125d3ba5ed3d727736e4e87ce8d58e179665375e3397fda41edc153124791c
  Stored in directory: /private/var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/pip-ephem-wheel-cache-te8s93j1/wheels/58/08/39/c88c61a75aca3782dfcc11b86c8a6af860f75d1d00fddf72e2
Successfully built glimr
Installing collected packages: glimr
  Attempting uninstall: glimr
    Found existing installation: glimr 0.1.dev39+g3711710
    Uninstalling glimr-0.1.dev39+g3711710:
      Successfully uninstalled glimr-0.1.dev39+g3711710
Successfully installed glimr-0.1.dev39+g3711710.d20230321


### Creating the search space

First let's create a search space for a simple two layer network for a multiclass MNIST classifier.

For each layer we define hyperparameters for the number of units, dropout rate, and activation functions. We explore two losses for the single output task (named "mnist"), and explore a variety of gradient optimization algorithms and batch sizes.

In [3]:
# import optimization search space from glimr
from glimr.optimization import optimization_space
from glimr.utils import set_hyperparameter

# define the possible layer activations
activations = {"elu", "gelu", "linear", "relu", "selu", "sigmoid", "softplus"}

# define the layer 1 hyperparameters
layer1 = {
    "activation": activations,
    "dropout": [0.0, 0.2, 0.05],
    "units": {64, 48, 32, 16}
}

# define the task
task = {
    "activation": activations,
    "dropout": [0.0, 0.2, 0.05],
    "units": 10,
    "loss": {"categorical_hinge", "categorical_crossentropy"},
    "loss_weight": 1.0,
    "metrics": {"auc": "auc"}
}

# data loader keyword arguments to control loading, augmentation, and batching
data = {
    "batch_size": {32, 64, 128},
    "random_brightness": {True, False}, # whether to perform random brightness transformation
    "max_delta": [0.01, 0.15, 0.01]
}

# put it all together
space = {
    "layer1": layer1,
    "optimization": optimization_space(),
    "tasks": {
        "mnist": task
    },
    "data": data
}

# display space
from pprint import pprint
pprint(space, indent=4)

# define a recursive procedure for setting hyperparameters for list, set types
def recursive_set_hyperparameter(dictionary):
    for key in dictionary.keys():
        if isinstance(dictionary[key], (list, set)):
            dictionary[key] = set_hyperparameter(dictionary[key])
        elif isinstance(dictionary[key], dict):
            recursive_set_hyperparameter(dictionary[key])
            
# convert from glimr hyperparameter notation to Ray Tune hyperparameters
recursive_set_hyperparameter(space)

2023-03-21 11:49:32.002408: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


{   'data': {   'batch_size': {32, 64, 128},
                'max_delta': [0.01, 0.15, 0.01],
                'random_brightness': {False, True}},
    'layer1': {   'activation': {   'elu',
                                    'gelu',
                                    'linear',
                                    'relu',
                                    'selu',
                                    'sigmoid',
                                    'softplus'},
                  'dropout': [0.0, 0.2, 0.05],
                  'units': {64, 16, 48, 32}},
    'optimization': {   'batch': <ray.tune.search.sample.Categorical object at 0x7ffb7559b220>,
                        'beta_1': <ray.tune.search.sample.Float object at 0x7ffb600cfd00>,
                        'beta_2': <ray.tune.search.sample.Float object at 0x7ffb600cfd30>,
                        'epochs': 100,
                        'learning_rate': <ray.tune.search.sample.Float object at 0x7ffb7559bfd0>,
                        'met

### Implement the model-building function

The model-builder function transforms a sample of the space into a `tf.keras.Model`, and loss, loss weight, and metric inputs for model compilation. This is a user-defined function to provide maximum flexibility in the models that can be used with glimr.

In [5]:
from glimr.keras import keras_losses, keras_metrics
import tensorflow as tf


def builder(config):
    
    # a helper function for building layers
    def _build_layer(x, units, activation, dropout, name):
        # dense layer
        x = tf.keras.layers.Dense(units, activation=activation, name=name)(x)

        # add dropout if necessary
        if dropout > 0.0:
            x = tf.keras.layers.Dropout(dropout)(x)

        return x
    
    # create input layer
    input_layer = tf.keras.Input([784], name="input")
    
    # build layer 1
    x = _build_layer(input_layer, 
                     config["layer1"]["units"], 
                     config["layer1"]["activation"], 
                     config["layer1"]["dropout"],
                     "layer1")
    
    # build output / task layer
    task_name = list(config["tasks"].keys())[0]
    output = _build_layer(input_layer, 
                     config["tasks"][task_name]["units"], 
                     config["tasks"][task_name]["activation"], 
                     config["tasks"][task_name]["dropout"],
                     task_name)

    # build named output dict
    named = {f"{task_name}": output}

    # create model
    model = tf.keras.Model(inputs=input_layer, outputs=named)

    # create a loss dictionary using utility function
    metric_mapper = {
        "categorical_crossentropy": tf.keras.losses.CategoricalCrossentropy(from_logits=True),
        "categorical_hinge": tf.keras.losses.CategoricalHinge()
    }
    losses, loss_weights = keras_losses(config, metric_mapper)

    # create a metric dictionary using utility function
    loss_mapper = {
        "auc": tf.keras.metrics.AUC
    }
    metrics = keras_metrics(config, loss_mapper)
    
    return model, losses, loss_weights, metrics

### Create a data loading function

Write a function to load and batch mnist samples. Flatten the images and apply a one-hot encoding to the labels.

In [10]:
import numpy as np


def dataloader(batch_size, random_brightness, max_delta):
    
    # load mnist data
    train, validation = tf.keras.datasets.mnist.load_data(path="mnist.npz")
    
    # flattening function
    def mnist_flat(features):
        return features.reshape(features.shape[0], features.shape[1]*features.shape[2])

    # extract features, labels
    train_features = tf.cast(mnist_flat(train[0]), tf.float32) / 255.
    train_labels = train[1]
    validation_features = tf.cast(mnist_flat(validation[0]), tf.float32) / 255.
    validation_labels = validation[1]
    
    # build datasets
    train_ds = tf.data.Dataset.from_tensor_slices(
        (train_features, {"mnist": tf.one_hot(train_labels, 10)})
    )
    validation_ds = tf.data.Dataset.from_tensor_slices(
        (validation_features, {"mnist": tf.one_hot(validation_labels, 10)})
    )
    
    # batch
    train_ds = train_ds.shuffle(len(train_labels), reshuffle_each_iteration=True)
    train_ds = train_ds.batch(batch_size)
    validation_ds = validation_ds.batch(batch_size)
    
    # apply augmentation
    if random_brightness:
        train_ds = train_ds.map(lambda x, y: (tf.image.random_brightness(x, max_delta), y))

    return train_ds, validation_ds

### Test the search space, model builder, and dataloader

Before doing a hyperparameter search, let's test this combination to verify that the models can train.

We generate a sample configuration from the search space and build, compile, and train a model with this config.

In [7]:
from glimr.keras import keras_optimizer
import ray

# define a function for sampling a config from a space - ray will handle this automatically
def sample_space(space):
    config = {}
    for key in space:
        if isinstance(space[key], dict):
            config[key] = sample_space(space[key])
        elif isinstance(space[key], (ray.tune.search.sample.Categorical,
                                     ray.tune.search.sample.Integer,
                                     ray.tune.search.sample.Float)):
            config[key] = space[key].sample()
        else: # non sampleable value
            config[key] = space[key]
    return config

# sample a configuration
config = sample_space(space)

# display the configuration
from pprint import pprint
pprint(config, indent=4)

# build the model
model, losses, loss_weights, metrics = builder(config)

# build the optimizer
optimizer = keras_optimizer(config["optimization"])

# test compile the model
model.compile(optimizer=optimizer,
              loss=losses,
              metrics=metrics,
              loss_weights=loss_weights)

{   'data': {'batch_size': 32, 'max_delta': 0.14, 'random_brightness': False},
    'layer1': {'activation': 'linear', 'dropout': 0.05, 'units': 16},
    'optimization': {   'batch': 128,
                        'beta_1': 0.9,
                        'beta_2': 0.62,
                        'epochs': 100,
                        'learning_rate': 0.00254,
                        'method': 'sgd',
                        'momentum': 0.08,
                        'rho': 0.93},
    'tasks': {   'mnist': {   'activation': 'linear',
                              'dropout': 0.05,
                              'loss': 'categorical_hinge',
                              'loss_weight': 1.0,
                              'metrics': {'auc': 'auc'},
                              'units': 10}}}


2023-03-21 11:50:02.557117: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [11]:
# build dataset and train
train_ds, val_ds = dataloader(**config["data"])
model.fit(x=train_ds, validation_data=val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ffb61280e50>

## Using Search for hyperparameter tuning

The `Search` class implements the hyperparameter tuning process of Ray Tune. It is designed to provide sensible defaults for the many options available in Ray Tune, but also allows fine grained access to all Ray Tune options through it's class attributes. It is written ass a builder class that is incrementally changed to add tuning options for things like reporting, checkpointing, and experiment resources.

We start by setting up a basic experiment, and then demonstrate how to control tuning options through class methods and class attribute assignment.

In [12]:
import contextlib
from glimr.search import Search
import os
import tempfile

# Initialize the class using the search space, model builder, data loader, 
# and the name of the metric to optimize. The metric name for this single-task
# model has format task_metric. This is the standard convention when using
# glimr.keras.keras_metrics.
tuner = Search(space, builder, dataloader, "mnist_auc")

# setup a temporary directory to hold tune outputs
temp_dir = tempfile.TemporaryDirectory()

# run the experiment in this folder
with contextlib.redirect_stderr(open(os.devnull, "w")):
    tuner.experiment(temp_dir.name)

# cleanup the temporary folder
temp_dir.cleanup()

== Status ==
Current time: 2023-03-21 11:51:24 (running for 00:00:00.36)
Memory usage on this node: 12.2/16.0 GiB 
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 40.000: None | Iter 10.000: None
Resources requested: 2.0/8 CPUs, 0/0 GPUs, 0.0/2.93 GiB heap, 0.0/1.47 GiB objects
Result logdir: /var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/tmpj3r9fg_3/2023_03_21_11_51_15
Number of trials: 8/100 (7 PENDING, 1 RUNNING)
+-----------------------+----------+-----------------+----------+-----------------+
| Trial name            | status   | loc             | method   |   learning rate |
|-----------------------+----------+-----------------+----------+-----------------|
| trainable_9c6de_00000 | RUNNING  | 127.0.0.1:46554 | rms      |         0.00948 |
| trainable_9c6de_00001 | PENDING  |                 | adam     |         0.00517 |
| trainable_9c6de_00002 | PENDING  |                 | adadelta |         0.00583 |
| trainable_9c6de_00003 | PENDING  |                 | adadelta |       

Trial name,date,done,episodes_total,experiment_id,hostname,iterations_since_restore,mnist_auc,node_ip,pid,should_checkpoint,time_since_restore,time_this_iter_s,time_total_s,timestamp,timesteps_since_restore,timesteps_total,training_iteration,trial_id,warmup_time
trainable_9c6de_00000,2023-03-21_11-51-44,True,,a66fccae708646f58126cb06110526ed,Lees-MacBook-Pro.local,4,0.975734,127.0.0.1,46554,True,10.1528,2.04418,10.1528,1679417504,0,,4,9c6de_00000,0.0165281
trainable_9c6de_00001,2023-03-21_11-52-05,True,,bf2f59d1fe4245098c63bdb546ed9874,Lees-MacBook-Pro.local,4,0.962731,127.0.0.1,46564,True,20.6657,4.23212,20.6657,1679417525,0,,4,9c6de_00001,0.0135722
trainable_9c6de_00002,2023-03-21_11-52-35,True,,03aeb4dc7f0b4576a270420232fd806f,Lees-MacBook-Pro.local,10,0.918946,127.0.0.1,46565,True,49.8952,5.94344,49.8952,1679417555,0,,10,9c6de_00002,0.0161121
trainable_9c6de_00003,2023-03-21_11-52-28,True,,b80a6ec35ccd44f78cf9653b67f761df,Lees-MacBook-Pro.local,14,0.889055,127.0.0.1,46566,True,43.084,3.14889,43.084,1679417548,0,,14,9c6de_00003,0.0167301
trainable_9c6de_00004,2023-03-21_11-52-03,True,,a66fccae708646f58126cb06110526ed,Lees-MacBook-Pro.local,9,0.773652,127.0.0.1,46554,True,18.6769,2.04111,18.6769,1679417523,0,,9,9c6de_00004,0.0165281
trainable_9c6de_00005,2023-03-21_11-52-19,True,,a66fccae708646f58126cb06110526ed,Lees-MacBook-Pro.local,5,0.968565,127.0.0.1,46554,True,15.4952,2.80116,15.4952,1679417539,0,,5,9c6de_00005,0.0165281
trainable_9c6de_00006,2023-03-21_11-52-20,True,,bf2f59d1fe4245098c63bdb546ed9874,Lees-MacBook-Pro.local,8,0.911756,127.0.0.1,46564,True,14.7358,1.75055,14.7358,1679417540,0,,8,9c6de_00006,0.0135722
trainable_9c6de_00007,2023-03-21_11-52-30,True,,a66fccae708646f58126cb06110526ed,Lees-MacBook-Pro.local,5,0.962182,127.0.0.1,46554,True,11.5528,2.26921,11.5528,1679417550,0,,5,9c6de_00007,0.0165281
trainable_9c6de_00008,2023-03-21_11-52-30,True,,bf2f59d1fe4245098c63bdb546ed9874,Lees-MacBook-Pro.local,4,0.869706,127.0.0.1,46564,True,9.63929,1.83759,9.63929,1679417550,0,,4,9c6de_00008,0.0135722
trainable_9c6de_00009,2023-03-21_11-52-48,True,,b80a6ec35ccd44f78cf9653b67f761df,Lees-MacBook-Pro.local,4,0.668985,127.0.0.1,46566,True,19.7896,3.86557,19.7896,1679417568,0,,4,9c6de_00009,0.0167301


== Status ==
Current time: 2023-03-21 11:51:54 (running for 00:00:30.37)
Memory usage on this node: 12.3/16.0 GiB 
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 40.000: None | Iter 10.000: None
Resources requested: 8.0/8 CPUs, 0/0 GPUs, 0.0/2.93 GiB heap, 0.0/1.47 GiB objects
Current best trial: 9c6de_00000 with mnist_auc=0.9757341146469116 and parameters={'optimization/method': 'rms', 'optimization/learning_rate': 0.00948}
Result logdir: /var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/tmpj3r9fg_3/2023_03_21_11_51_15
Number of trials: 9/100 (4 PENDING, 4 RUNNING, 1 TERMINATED)
+-----------------------+------------+-----------------+----------+-----------------+-------------+
| Trial name            | status     | loc             | method   |   learning rate |   mnist_auc |
|-----------------------+------------+-----------------+----------+-----------------+-------------|
| trainable_9c6de_00001 | RUNNING    | 127.0.0.1:46564 | adam     |         0.00517 |    0.960276 |
| trainable

== Status ==
Current time: 2023-03-21 11:53:58 (running for 00:02:34.43)
Memory usage on this node: 12.2/16.0 GiB 
Using AsyncHyperBand: num_stopped=1
Bracket: Iter 40.000: None | Iter 10.000: 0.9238103777170181
Resources requested: 8.0/8 CPUs, 0/0 GPUs, 0.0/2.93 GiB heap, 0.0/1.47 GiB objects
Current best trial: 9c6de_00000 with mnist_auc=0.9757341146469116 and parameters={'optimization/method': 'rms', 'optimization/learning_rate': 0.00948}
Result logdir: /var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/tmpj3r9fg_3/2023_03_21_11_51_15
Number of trials: 32/100 (4 PENDING, 4 RUNNING, 24 TERMINATED)
+-----------------------+------------+-----------------+----------+-----------------+-------------+
| Trial name            | status     | loc             | method   |   learning rate |   mnist_auc |
|-----------------------+------------+-----------------+----------+-----------------+-------------|
| trainable_9c6de_00024 | RUNNING    | 127.0.0.1:46564 | adam     |         0.00176 |    0     

== Status ==
Current time: 2023-03-21 11:55:31 (running for 00:04:06.97)
Memory usage on this node: 11.5/16.0 GiB 
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 40.000: None | Iter 10.000: 0.9286746084690094
Resources requested: 8.0/8 CPUs, 0/0 GPUs, 0.0/2.93 GiB heap, 0.0/1.47 GiB objects
Current best trial: 9c6de_00050 with mnist_auc=0.9813962578773499 and parameters={'optimization/method': 'rms', 'optimization/learning_rate': 0.007330000000000001}
Result logdir: /var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/tmpj3r9fg_3/2023_03_21_11_51_15
Number of trials: 55/100 (4 PENDING, 4 RUNNING, 47 TERMINATED)
+-----------------------+------------+-----------------+----------+-----------------+-------------+
| Trial name            | status     | loc             | method   |   learning rate |   mnist_auc |
|-----------------------+------------+-----------------+----------+-----------------+-------------|
| trainable_9c6de_00045 | RUNNING    | 127.0.0.1:46565 | adadelta |         0.0097

== Status ==
Current time: 2023-03-21 11:57:03 (running for 00:05:39.13)
Memory usage on this node: 11.9/16.0 GiB 
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 40.000: None | Iter 10.000: 0.9286746084690094
Resources requested: 8.0/8 CPUs, 0/0 GPUs, 0.0/2.93 GiB heap, 0.0/1.47 GiB objects
Current best trial: 9c6de_00050 with mnist_auc=0.9785543084144592 and parameters={'optimization/method': 'rms', 'optimization/learning_rate': 0.007330000000000001}
Result logdir: /var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/tmpj3r9fg_3/2023_03_21_11_51_15
Number of trials: 75/100 (4 PENDING, 4 RUNNING, 67 TERMINATED)
+-----------------------+------------+-----------------+----------+-----------------+-------------+
| Trial name            | status     | loc             | method   |   learning rate |   mnist_auc |
|-----------------------+------------+-----------------+----------+-----------------+-------------|
| trainable_9c6de_00067 | RUNNING    | 127.0.0.1:46565 | sgd      |         0.0086

== Status ==
Current time: 2023-03-21 11:58:34 (running for 00:07:09.79)
Memory usage on this node: 12.2/16.0 GiB 
Using AsyncHyperBand: num_stopped=3
Bracket: Iter 40.000: None | Iter 10.000: 0.9286746084690094
Resources requested: 8.0/8 CPUs, 0/0 GPUs, 0.0/2.93 GiB heap, 0.0/1.47 GiB objects
Current best trial: 9c6de_00050 with mnist_auc=0.9785543084144592 and parameters={'optimization/method': 'rms', 'optimization/learning_rate': 0.007330000000000001}
Result logdir: /var/folders/p8/2m9hqfn51c3_zkpq1894xzf80000gn/T/tmpj3r9fg_3/2023_03_21_11_51_15
Number of trials: 97/100 (4 PENDING, 4 RUNNING, 89 TERMINATED)
+-----------------------+------------+-----------------+----------+-----------------+-------------+
| Trial name            | status     | loc             | method   |   learning rate |   mnist_auc |
|-----------------------+------------+-----------------+----------+-----------------+-------------|
| trainable_9c6de_00088 | RUNNING    | 127.0.0.1:46564 | sgd      |         0.0044

[2m[33m(raylet)[0m [2023-03-21 11:59:41,688 E 46537 13998315] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-03-21_11-51-15_850702_46267 is over 95% full, available space: 12992143360; capacity: 1000240963584. Object creation will fail if spilling is required.
[2m[33m(raylet)[0m [2023-03-21 11:59:51,783 E 46537 13998315] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-03-21_11-51-15_850702_46267 is over 95% full, available space: 12992385024; capacity: 1000240963584. Object creation will fail if spilling is required.
[2m[33m(raylet)[0m [2023-03-21 12:00:01,794 E 46537 13998315] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-03-21_11-51-15_850702_46267 is over 95% full, available space: 12992368640; capacity: 1000240963584. Object creation will fail if spilling is required.
[2m[33m(raylet)[0m [2023-03-21 12:00:11,871 E 46537 13998315] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-03-21_11-51-15_850702_46267 is over 95% full, av