# Ray Tune - Search Algorithms and Schedulers

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This notebook introduces search algorithms and schedulers, which make hyperparameter optimization (HPO) more efficient. We'll see an example that combines one search algorithm with one scheduler.

The full set of search algorithms provided by Tune is documented [here](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html), along with information about implementing your own. The full set of schedulers provided is documented [here](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html).

## About Search Algorithms

Tune integrates many [open source optimization libraries](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html), each of which defines the parameter search space in its own way. Hence, you should read the corresponding documentation for an algorithm to understand the particular details of using it.

Some of the search algorithms supported include the following:

* [Bayesian Optimization](https://github.com/fmfn/BayesianOptimization): This constrained global optimization process builds upon Bayesian inference and Gaussian processes. It attempts to find the maximum value of an unknown function in as few iterations as possible. This is a good technique for optimizing functions that are expensive to evaluate.
* [BOHB (Bayesian Optimization HyperBand)](https://github.com/automl/HpBandSter): An algorithm that both terminates bad trials and uses Bayesian optimization to improve the hyperparameter search. It is backed by the [HpBandSter](https://github.com/automl/HpBandSter) library. BOHB is intended to be paired with a specific scheduler class: [HyperBandForBOHB](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-bohb).
* [HyperOpt](http://hyperopt.github.io/hyperopt): A Python library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions.
* [Nevergrad](https://github.com/facebookresearch/nevergrad): HPO without computing gradients.

These and other algorithms are described in the [documentation](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html).

A limitation of search algorithms used by themselves is that they can't affect or stop training processes, for example by stopping trials early when they are performing poorly. Schedulers can do this, so it's common to use a compatible search algorithm with a scheduler, as we'll show in the first example.

## About Schedulers

Tune includes distributed implementations of several early-stopping algorithms, including the following:

* [Median Stopping Rule](https://research.google.com/pubs/pub46180.html): It applies the simple rule that a trial is aborted if the results are trending below the median of the previous trials.
* [HyperBand](https://arxiv.org/abs/1603.06560): It structures search as an _infinite-armed, stochastic, exploration-only, multi-armed bandit_. See the [Multi-Armed Bandits lessons](../ray-rllib/multi-armed-bandits/00-Multi-Armed-Bandits-Overview.ipynb) for information on these concepts. The infinite arms correspond to the tunable parameters. Trying values stochastically ensures quick exploration of the parameter space. Exploration-only is desirable because for HPO, we aren't interested in _exploiting_ parameter combinations we've already tried (the usual case when using MABs, where rewards are the goal). Instead, we need to explore as many new parameter combinations as possible.
* [ASHA](https://openreview.net/forum?id=S1Y7OOlRZ): This is an asynchronous version of HyperBand that improves on the latter. Hence, it is recommended over the original HyperBand implementation.
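
To make the Median Stopping Rule concrete, here is a minimal sketch in plain Python. It is not Tune's implementation. It assumes the formulation where a trial is stopped if its best score so far is below the median of the other trials' running averages at the same step. The function name `should_stop`, the `grace_period` parameter, and the scores are hypothetical and chosen only for illustration.

```python
from statistics import mean, median


def should_stop(trial_scores, other_trials_scores, grace_period=3):
    """Decide whether to stop a trial early (higher scores are better).

    trial_scores: scores reported so far by the trial being judged.
    other_trials_scores: one list of per-step scores for each other trial.
    """
    step = len(trial_scores)
    if step < grace_period:
        return False  # Too early to judge this trial.
    # Running average of every other trial that has reached the same step.
    running_averages = [
        mean(scores[:step])
        for scores in other_trials_scores
        if len(scores) >= step
    ]
    if not running_averages:
        return False  # Nothing to compare against yet.
    return max(trial_scores) < median(running_averages)


# Hypothetical accuracy curves for three earlier trials.
earlier_trials = [
    [0.50, 0.60, 0.70, 0.80],
    [0.40, 0.55, 0.65, 0.70],
    [0.30, 0.35, 0.40, 0.45],
]

print(should_stop([0.10, 0.12, 0.15], earlier_trials))  # A weak trial.
print(should_stop([0.55, 0.70, 0.85], earlier_trials))  # A strong trial.
```

Comparing against running averages rather than single readings makes the rule less sensitive to one noisy result.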

Tune also includes a distributed implementation of [Population Based Training (PBT)](https://deepmind.com/blog/population-based-training-neural-networks). When the PBT scheduler is enabled, each trial variant is treated as a member of the _population_. Periodically, top-performing trials are checkpointed, which means your [`tune.Trainable`](https://docs.ray.io/en/latest/tune/api_docs/trainable.html#tune-trainable) object (e.g., the `TrainMNIST` class we used in the previous exercise) has to support save and restore.

Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation. PBT trains a group of models (or RLlib agents) in parallel. So, unlike other hyperparameter search algorithms, PBT mutates hyperparameters during training time. This enables very fast hyperparameter discovery and also automatically discovers good [annealing](https://en.wikipedia.org/wiki/Simulated_annealing) schedules.
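
The exploit-and-explore cycle described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Tune's `PopulationBasedTraining` code: the `pbt_step` function, the `quantile` parameter, the dictionary layout of each population member, and the scores are all assumptions made for this example, and the "checkpoint" is just a list of numbers.

```python
import random


def pbt_step(population, choices, rng, quantile=0.25):
    """Run one exploit-and-explore step in place.

    population: list of dicts with keys 'config', 'weights', and 'score'.
    choices: allowed values for each hyperparameter.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * quantile))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: clone the donor's checkpoint and configuration.
        member["weights"] = list(donor["weights"])
        member["config"] = dict(donor["config"])
        # Explore: resample one hyperparameter from its allowed values.
        name = rng.choice(sorted(choices))
        member["config"][name] = rng.choice(choices[name])
    return population


choices = {"lr": [0.001, 0.01, 0.1], "momentum": [0.001, 0.01, 0.1, 0.9]}
# Hypothetical population; weights stand in for a model checkpoint.
population = [
    {"config": {"lr": 0.1, "momentum": 0.1}, "weights": [1.0, 2.0], "score": 0.95},
    {"config": {"lr": 0.01, "momentum": 0.9}, "weights": [0.5, 0.1], "score": 0.87},
    {"config": {"lr": 0.01, "momentum": 0.001}, "weights": [0.3, 0.3], "score": 0.47},
    {"config": {"lr": 0.001, "momentum": 0.9}, "weights": [0.0, 0.1], "score": 0.20},
]

pbt_step(population, choices, random.Random(0))
print(population[-1])  # The weakest member now carries the best member's weights.
```

Because the weakest member keeps training from the cloned weights with a perturbed configuration, hyperparameters change during training rather than only between trials.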

See the [Tune schedulers](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html) for a complete list and descriptions.

## Examples

Let's initialize Ray as before:

In [1]:
!../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [2]:
import ray
from ray import tune

In [3]:
ray.init(address='auto', ignore_reinit_error=True)

{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:6379',
 'object_store_address': '/tmp/ray/session_2020-07-18_09-21-16_202196_93175/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-07-18_09-21-16_202196_93175/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-07-18_09-21-16_202196_93175'}

### BOHB

BOHB (Bayesian Optimization HyperBand) is an algorithm that both terminates bad trials and also uses Bayesian Optimization to improve the hyperparameter search. The [Tune implementation](https://docs.ray.io/en/latest/tune/api_docs/suggestion.html#bohb-tune-suggest-bohb-tunebohb) is backed by the [HpBandSter library](https://github.com/automl/HpBandSter), which we must install, along with [ConfigSpace](https://automl.github.io/HpBandSter/build/html/quickstart.html#searchspace), which is used to define the search space specification:

In [4]:
!pip install hpbandster ConfigSpace





We use BOHB with the scheduler [HyperBandForBOHB](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#bohb-tune-schedulers-hyperbandforbohb).

Let's try it. We'll use the same MNIST example from the previous lesson, but this time we'll import the code from a file in this directory, `mnist.py`. Note that the implementation of `TrainMNIST` in the file has enhancements not present in the previous lesson, such as methods for saving and restoring checkpoints, which are required here because trials are paused and later resumed. See the code comments for details.

In [5]:
from mnist import ConvNet, TrainMNIST, EPOCH_SIZE, TEST_SIZE, DATA_ROOT

Import and configure the `ConfigSpace` object we need for the search algorithm.

In [6]:
import ConfigSpace as CS
from ray.tune.schedulers.hb_bohb import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

In [7]:
config_space = CS.ConfigurationSpace()

# There are also UniformIntegerHyperparameter and UniformFloatHyperparameter
# objects for defining integer and float ranges, respectively. For example:
# config_space.add_hyperparameter(
#     CS.UniformIntegerHyperparameter('foo', lower=0, upper=100))

config_space.add_hyperparameter(
    CS.CategoricalHyperparameter('lr', choices=[0.001, 0.01, 0.1]))
config_space.add_hyperparameter(
    CS.CategoricalHyperparameter('momentum', choices=[0.001, 0.01, 0.1, 0.9]))

config_space

Configuration space object:
  Hyperparameters:
    lr, Type: Categorical, Choices: {0.001, 0.01, 0.1}, Default: 0.001
    momentum, Type: Categorical, Choices: {0.001, 0.01, 0.1, 0.9}, Default: 0.001
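
As a quick sanity check on the size of this search space, the following sketch enumerates every combination of the two categorical hyperparameters with the standard library only. It does not use ConfigSpace; the variable names are chosen for this example.

```python
from itertools import product

# The same choices passed to the two CategoricalHyperparameter objects.
learning_rates = [0.001, 0.01, 0.1]
momenta = [0.001, 0.01, 0.1, 0.9]

# Every (lr, momentum) pair in the search space.
grid = list(product(learning_rates, momenta))
print(len(grid))

for lr, momentum in grid[:3]:
    print(f"lr={lr}, momentum={momentum}")
```

The space holds 3 × 4 = 12 distinct configurations. A search algorithm samples from this space, so drawing 12 samples does not by itself guarantee that each configuration is tried exactly once.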

In [8]:
experiment_metrics = dict(metric="mean_accuracy", mode="max")

search_algorithm = TuneBOHB(config_space, max_concurrent=4, **experiment_metrics)

scheduler = HyperBandForBOHB(
    time_attr='training_iteration',
    reduction_factor=4,
    max_t=200,
    **experiment_metrics)

Through experimentation, we determined that `max_t=200` is necessary to get good results. For the smallest learning rate and momentum values, it takes longer for training to converge.
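
To see how `reduction_factor` and `max_t` interact, here is a minimal sketch of successive halving, the idea that HyperBand builds on. It is not the `HyperBandForBOHB` implementation. The `successive_halving` function, the `min_t=3` starting budget, and the `toy_accuracy` learning curve are assumptions invented for this illustration and do not model MNIST training.

```python
import math
from itertools import product


def successive_halving(configs, evaluate, min_t=3, max_t=200, reduction_factor=4):
    """Keep the top 1/reduction_factor of configs at each budget level."""
    survivors = list(configs)
    budget = min_t
    history = []  # (budget, number of configs evaluated at that budget)
    while True:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((budget, len(scored)))
        if budget >= max_t or len(scored) == 1:
            return scored[0], history
        keep = max(1, math.ceil(len(scored) / reduction_factor))
        survivors = scored[:keep]
        budget = min(max_t, budget * reduction_factor)


def toy_accuracy(config, iterations):
    """Hypothetical learning curve that peaks at lr=0.1, momentum=0.1."""
    ceiling = (1.0
               - 0.3 * abs(math.log10(config["lr"]) + 1)
               - 0.2 * abs(config["momentum"] - 0.1))
    return ceiling * (1 - math.exp(-iterations / 20))


configs = [
    {"lr": lr, "momentum": momentum}
    for lr, momentum in product([0.001, 0.01, 0.1], [0.001, 0.01, 0.1, 0.9])
]
best, history = successive_halving(configs, toy_accuracy)
print(best)
print(history)
```

Each round multiplies the training budget by the reduction factor and discards most of the remaining configurations, so only the most promising ones receive long training runs. A larger `max_t` leaves room for more rounds, which matters when slow-converging configurations need many iterations before their quality shows.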

In [9]:
analysis = tune.run(TrainMNIST, 
    scheduler=scheduler, 
    search_alg=search_algorithm, 
    num_samples=12,                           # 12 trials; the space has 12 combinations, but samples can repeat.
    verbose=2,                                # Change to 0 or 1 to reduce the output.
    ray_auto_init=False                       # Don't allow Tune to initialize Ray.
)

Trial name,status,loc,lr,momentum
TrainMNIST_c4a1a4b4,RUNNING,,0.01,0.001
TrainMNIST_c4a1ce80,PENDING,,0.1,0.01
TrainMNIST_c4a1ef6e,PENDING,,0.001,0.01
TrainMNIST_c4a2102a,PENDING,,0.01,0.9


Result for TrainMNIST_c4a1a4b4:
  date: 2020-07-19_08-01-49
  done: false
  experiment_id: bc79054f24c84d0f8de02eb418e668d1
  experiment_tag: 1_lr=0.01,momentum=0.001
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 1
  mean_accuracy: 0.159375
  node_ip: 192.168.1.149
  pid: 25149
  time_since_restore: 0.20097112655639648
  time_this_iter_s: 0.20097112655639648
  time_total_s: 0.20097112655639648
  timestamp: 1595170909
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: c4a1a4b4
  
Result for TrainMNIST_c4a1ce80:
  date: 2020-07-19_08-01-49
  done: false
  experiment_id: c132d028028c4b3cb198d349abc182ef
  experiment_tag: 2_lr=0.1,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 1
  mean_accuracy: 0.5375
  node_ip: 192.168.1.149
  pid: 25148
  time_since_restore: 0.20450210571289062
  time_this_iter_s: 0.20450210571289062
  time_total_s: 0.20450210571289062
  timestamp: 1595170909
  timesteps_since_restore: 0
  training_iteration: 1
  tr

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,PAUSED,,0.01,0.001,0.296875,3.0,0.643175
TrainMNIST_c4a1ce80,PAUSED,,0.1,0.01,0.678125,3.0,0.655609
TrainMNIST_c4a1ef6e,PAUSED,,0.001,0.01,0.1125,3.0,0.701357
TrainMNIST_c4a2102a,PAUSED,,0.01,0.9,0.496875,3.0,1.01279
TrainMNIST_c5ab82b2,RUNNING,192.168.1.149:25208,0.01,0.001,0.153125,2.0,0.444333
TrainMNIST_c5b631e4,PAUSED,,0.1,0.01,0.721875,3.0,0.662265
TrainMNIST_c5c661f4,PAUSED,,0.001,0.1,0.1625,3.0,0.695368
TrainMNIST_c6437144,PAUSED,,0.001,0.9,0.046875,3.0,0.639237
TrainMNIST_c787ca1e,RUNNING,,0.01,0.1,,,
TrainMNIST_c7946260,RUNNING,,0.001,0.001,,,


Result for TrainMNIST_c787ca1e:
  date: 2020-07-19_08-01-54
  done: false
  experiment_id: 11fbaaf6b8c74dde897cc48662ce07c9
  experiment_tag: 9_lr=0.01,momentum=0.1
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 1
  mean_accuracy: 0.1125
  node_ip: 192.168.1.149
  pid: 25213
  time_since_restore: 0.27850914001464844
  time_this_iter_s: 0.27850914001464844
  time_total_s: 0.27850914001464844
  timestamp: 1595170914
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: c787ca1e
  
Result for TrainMNIST_c7946260:
  date: 2020-07-19_08-01-54
  done: false
  experiment_id: c56eadbdb27d4e3c8b8e5974716448fa
  experiment_tag: 10_lr=0.001,momentum=0.001
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 1
  mean_accuracy: 0.06875
  node_ip: 192.168.1.149
  pid: 25210
  time_since_restore: 0.2643098831176758
  time_this_iter_s: 0.2643098831176758
  time_total_s: 0.2643098831176758
  timestamp: 1595170914
  timesteps_since_restore: 0
  training_iteration: 1
  tria



[2m[36m(pid=25209)[0m 2020-07-19 08:01:57,682	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_1_lr=0.01,momentum=0.001_2020-07-19_08-01-4736q4ja5q/tmpuka90ieqrestore_from_object/checkpoint
[2m[36m(pid=25209)[0m 2020-07-19 08:01:57,682	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 0.6431751251220703, '_episodes_total': None}
[2m[36m(pid=25216)[0m 2020-07-19 08:01:57,769	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_2_lr=0.1,momentum=0.01_2020-07-19_08-01-47e02oa8w0/tmpp5oeb8s3restore_from_object/checkpoint
[2m[36m(pid=25216)[0m 2020-07-19 08:01:57,769	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 0.6556088924407959, '_episodes_total': None}
[2m[36m(pid=25211)[0m 2020-07-19 08:01:57,864	INFO 

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,RUNNING,192.168.1.149:25209,0.01,0.001,0.190625,4,1.1977
TrainMNIST_c4a1ce80,RUNNING,,0.1,0.01,0.678125,3,0.655609
TrainMNIST_c4a1ef6e,RUNNING,,0.001,0.01,0.1125,3,0.701357
TrainMNIST_c4a2102a,RUNNING,,0.01,0.9,0.496875,3,1.01279
TrainMNIST_c5ab82b2,RUNNING,,0.01,0.001,0.2625,3,0.715963
TrainMNIST_c5b631e4,RUNNING,,0.1,0.01,0.721875,3,0.662265
TrainMNIST_c5c661f4,RUNNING,,0.001,0.1,0.1625,3,0.695368
TrainMNIST_c6437144,RUNNING,,0.001,0.9,0.046875,3,0.639237
TrainMNIST_c787ca1e,PENDING,,0.01,0.1,0.19375,3,0.755553
TrainMNIST_c7946260,PENDING,,0.001,0.001,0.05,3,0.729817


Result for TrainMNIST_c4a1ef6e:
  date: 2020-07-19_08-01-58
  done: false
  experiment_id: 8ae67266fc6a41fdb8ce381454d01d81
  experiment_tag: 3_lr=0.001,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 1
  mean_accuracy: 0.121875
  node_ip: 192.168.1.149
  pid: 25211
  time_since_restore: 0.43152594566345215
  time_this_iter_s: 0.43152594566345215
  time_total_s: 1.132883071899414
  timestamp: 1595170918
  timesteps_since_restore: 0
  training_iteration: 4
  trial_id: c4a1ef6e
  
Result for TrainMNIST_c4a1ce80:
  date: 2020-07-19_08-01-58
  done: false
  experiment_id: c132d028028c4b3cb198d349abc182ef
  experiment_tag: 2_lr=0.1,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 1
  mean_accuracy: 0.63125
  node_ip: 192.168.1.149
  pid: 25216
  time_since_restore: 0.5355730056762695
  time_this_iter_s: 0.5355730056762695
  time_total_s: 1.1911818981170654
  timestamp: 1595170918
  timesteps_since_restore: 0
  training_iteration: 4
  trial_

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,PAUSED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,PAUSED,,0.1,0.01,0.9,12,4.46602
TrainMNIST_c4a1ef6e,PAUSED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,PAUSED,,0.01,0.9,0.790625,12,4.85826
TrainMNIST_c5ab82b2,PAUSED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,PAUSED,,0.1,0.01,0.7625,12,4.47754
TrainMNIST_c5c661f4,PAUSED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,PAUSED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,RUNNING,,0.01,0.1,0.19375,3,0.755553
TrainMNIST_c7946260,RUNNING,,0.001,0.001,0.05,3,0.729817


[2m[36m(pid=25253)[0m 2020-07-19 08:02:03,669	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_9_lr=0.01,momentum=0.1_2020-07-19_08-01-52431ftvft/tmphtjpexkurestore_from_object/checkpoint
[2m[36m(pid=25253)[0m 2020-07-19 08:02:03,669	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 0.7555532455444336, '_episodes_total': None}
[2m[36m(pid=25250)[0m 2020-07-19 08:02:03,705	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_10_lr=0.001,momentum=0.001_2020-07-19_08-01-52l0ihd_lv/tmp7xunz8hgrestore_from_object/checkpoint
[2m[36m(pid=25250)[0m 2020-07-19 08:02:03,705	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 3, '_timesteps_total': None, '_time_total': 0.7298171520233154, '_episodes_total': None}
[2m[36m(pid=25248)[0m 2020-07-19 08:02:03,773	INF

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,RUNNING,192.168.1.149:25256,0.1,0.01,0.66875,14,5.07019
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,RUNNING,192.168.1.149:25252,0.01,0.9,0.121875,13,5.18681
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,RUNNING,192.168.1.149:25251,0.1,0.01,0.56875,13,4.85536
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


[2m[36m(pid=25281)[0m 2020-07-19 08:02:08,668	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_11_lr=0.1,momentum=0.1_2020-07-19_08-01-52vy8bwg41/tmp8pbv78fcrestore_from_object/checkpoint
[2m[36m(pid=25281)[0m 2020-07-19 08:02:08,668	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 12, '_timesteps_total': None, '_time_total': 2.8070592880249023, '_episodes_total': None}
[2m[36m(pid=25278)[0m 2020-07-19 08:02:08,751	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_12_lr=0.01,momentum=0.01_2020-07-19_08-01-52ljc3dbjw/tmp___4vhherestore_from_object/checkpoint
[2m[36m(pid=25278)[0m 2020-07-19 08:02:08,751	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 12, '_timesteps_total': None, '_time_total': 2.723043203353882, '_episodes_total': None}
Result for TrainMNIST_c7a21fd6:
  date: 2020-07-19_08

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,RUNNING,192.168.1.149:25256,0.1,0.01,0.90625,31,9.81756
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,RUNNING,192.168.1.149:25252,0.01,0.9,0.60625,31,10.1898
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,RUNNING,192.168.1.149:25251,0.1,0.01,0.88125,31,9.88002
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


Result for TrainMNIST_c7a21fd6:
  date: 2020-07-19_08-02-14
  done: false
  experiment_id: e0685c211e29497992e5561881efbc6b
  experiment_tag: 11_lr=0.1,momentum=0.1
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 20
  mean_accuracy: 0.95
  node_ip: 192.168.1.149
  pid: 25281
  time_since_restore: 5.565382957458496
  time_this_iter_s: 0.2779850959777832
  time_total_s: 8.372442245483398
  timestamp: 1595170934
  timesteps_since_restore: 0
  training_iteration: 32
  trial_id: c7a21fd6
  
Result for TrainMNIST_c7aeb8a4:
  date: 2020-07-19_08-02-14
  done: false
  experiment_id: dc24002a5fd74f2c8a9651114b17d102
  experiment_tag: 12_lr=0.01,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 21
  mean_accuracy: 0.8
  node_ip: 192.168.1.149
  pid: 25278
  time_since_restore: 5.963758707046509
  time_this_iter_s: 0.2821178436279297
  time_total_s: 8.68680191040039
  timestamp: 1595170934
  timesteps_since_restore: 0
  training_iteration: 33
  trial_id: c7aeb8

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,PAUSED,,0.1,0.01,0.921875,48,14.5216
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,PAUSED,,0.01,0.9,0.871875,48,14.8339
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,PAUSED,,0.1,0.01,0.934375,48,14.5483
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


Result for TrainMNIST_c7aeb8a4:
  date: 2020-07-19_08-02-18
  done: true
  experiment_id: dc24002a5fd74f2c8a9651114b17d102
  experiment_tag: 12_lr=0.01,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 36
  mean_accuracy: 0.8625
  node_ip: 192.168.1.149
  pid: 25278
  time_since_restore: 9.920998573303223
  time_this_iter_s: 0.20065903663635254
  time_total_s: 12.644041776657104
  timestamp: 1595170938
  timesteps_since_restore: 0
  training_iteration: 48
  trial_id: c7aeb8a4
  
[2m[36m(pid=25282)[0m 2020-07-19 08:02:19,793	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_6_lr=0.1,momentum=0.01_2020-07-19_08-01-49azho76dy/tmp3bacuiqvrestore_from_object/checkpoint
[2m[36m(pid=25282)[0m 2020-07-19 08:02:19,793	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 48, '_timesteps_total': None, '_time_total': 14.548253297805786, '_episodes_total': None}
Result for TrainMNI

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,TERMINATED,,0.1,0.01,0.921875,48,14.5216
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,TERMINATED,,0.01,0.9,0.871875,48,14.8339
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,RUNNING,192.168.1.149:25282,0.1,0.01,0.89375,66,18.1723
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


Result for TrainMNIST_c5b631e4:
  date: 2020-07-19_08-02-25
  done: false
  experiment_id: 88c2f848b44b414b867431e2888bb8a4
  experiment_tag: 6_lr=0.1,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 26
  mean_accuracy: 0.9375
  node_ip: 192.168.1.149
  pid: 25282
  time_since_restore: 5.1743385791778564
  time_this_iter_s: 0.1844637393951416
  time_total_s: 19.722591876983643
  timestamp: 1595170945
  timesteps_since_restore: 0
  training_iteration: 74
  trial_id: c5b631e4
  
Result for TrainMNIST_c7a21fd6:
  date: 2020-07-19_08-02-25
  done: false
  experiment_id: e0685c211e29497992e5561881efbc6b
  experiment_tag: 11_lr=0.1,momentum=0.1
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 26
  mean_accuracy: 0.85625
  node_ip: 192.168.1.149
  pid: 25290
  time_since_restore: 5.293384790420532
  time_this_iter_s: 0.18712186813354492
  time_total_s: 17.94904351234436
  timestamp: 1595170945
  timesteps_since_restore: 0
  training_iteration: 74
  trial_id

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,TERMINATED,,0.1,0.01,0.921875,48,14.5216
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,TERMINATED,,0.01,0.9,0.871875,48,14.8339
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,RUNNING,192.168.1.149:25282,0.1,0.01,0.946875,92,23.0046
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


Result for TrainMNIST_c5b631e4:
  date: 2020-07-19_08-02-30
  done: false
  experiment_id: 88c2f848b44b414b867431e2888bb8a4
  experiment_tag: 6_lr=0.1,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 54
  mean_accuracy: 0.93125
  node_ip: 192.168.1.149
  pid: 25282
  time_since_restore: 10.255598783493042
  time_this_iter_s: 0.1824951171875
  time_total_s: 24.803852081298828
  timestamp: 1595170950
  timesteps_since_restore: 0
  training_iteration: 102
  trial_id: c5b631e4
  
Result for TrainMNIST_c7a21fd6:
  date: 2020-07-19_08-02-30
  done: false
  experiment_id: e0685c211e29497992e5561881efbc6b
  experiment_tag: 11_lr=0.1,momentum=0.1
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 53
  mean_accuracy: 0.93125
  node_ip: 192.168.1.149
  pid: 25290
  time_since_restore: 10.26732087135315
  time_this_iter_s: 0.17782998085021973
  time_total_s: 22.922979593276978
  timestamp: 1595170950
  timesteps_since_restore: 0
  training_iteration: 101
  trial_i

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,TERMINATED,,0.1,0.01,0.921875,48,14.5216
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,TERMINATED,,0.01,0.9,0.871875,48,14.8339
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,RUNNING,192.168.1.149:25282,0.1,0.01,0.940625,120,28.2245
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


Result for TrainMNIST_c5b631e4:
  date: 2020-07-19_08-02-35
  done: false
  experiment_id: 88c2f848b44b414b867431e2888bb8a4
  experiment_tag: 6_lr=0.1,momentum=0.01
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 80
  mean_accuracy: 0.95
  node_ip: 192.168.1.149
  pid: 25282
  time_since_restore: 15.301830530166626
  time_this_iter_s: 0.2082371711730957
  time_total_s: 29.850083827972412
  timestamp: 1595170955
  timesteps_since_restore: 0
  training_iteration: 128
  trial_id: c5b631e4
  
Result for TrainMNIST_c7a21fd6:
  date: 2020-07-19_08-02-35
  done: false
  experiment_id: e0685c211e29497992e5561881efbc6b
  experiment_tag: 11_lr=0.1,momentum=0.1
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 78
  mean_accuracy: 0.95
  node_ip: 192.168.1.149
  pid: 25290
  time_since_restore: 15.256131410598755
  time_this_iter_s: 0.2039790153503418
  time_total_s: 27.911790132522583
  timestamp: 1595170955
  timesteps_since_restore: 0
  training_iteration: 126
  trial_id: 

Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,TERMINATED,,0.1,0.01,0.921875,48,14.5216
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,TERMINATED,,0.01,0.9,0.871875,48,14.8339
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,RUNNING,192.168.1.149:25282,0.1,0.01,0.934375,145,33.2055
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


Result for TrainMNIST_c7a21fd6:
  date: 2020-07-19_08-02-40
  done: true
  experiment_id: e0685c211e29497992e5561881efbc6b
  experiment_tag: 11_lr=0.1,momentum=0.1
  hostname: DWAnyscaleMBP.local
  iterations_since_restore: 104
  mean_accuracy: 0.959375
  node_ip: 192.168.1.149
  pid: 25290
  time_since_restore: 20.302512645721436
  time_this_iter_s: 0.1692650318145752
  time_total_s: 32.958171367645264
  timestamp: 1595170960
  timesteps_since_restore: 0
  training_iteration: 152
  trial_id: c7a21fd6
  


Trial name,status,loc,lr,momentum,acc,iter,total time (s)
TrainMNIST_c4a1a4b4,TERMINATED,,0.01,0.001,0.46875,12,4.40749
TrainMNIST_c4a1ce80,TERMINATED,,0.1,0.01,0.921875,48,14.5216
TrainMNIST_c4a1ef6e,TERMINATED,,0.001,0.01,0.215625,12,4.45656
TrainMNIST_c4a2102a,TERMINATED,,0.01,0.9,0.871875,48,14.8339
TrainMNIST_c5ab82b2,TERMINATED,,0.01,0.001,0.646875,12,4.52849
TrainMNIST_c5b631e4,TERMINATED,,0.1,0.01,0.953125,152,34.5061
TrainMNIST_c5c661f4,TERMINATED,,0.001,0.1,0.15,12,4.50463
TrainMNIST_c6437144,TERMINATED,,0.001,0.9,0.2,12,4.40513
TrainMNIST_c787ca1e,TERMINATED,,0.01,0.1,0.6875,12,2.84579
TrainMNIST_c7946260,TERMINATED,,0.001,0.001,0.13125,12,2.81379


In [10]:
print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))

Best config:  {'lr': 0.1, 'momentum': 0.01}


In [11]:
analysis.dataframe().sort_values('mean_accuracy', ascending=False).head()

Unnamed: 0,mean_accuracy,done,timesteps_total,episodes_total,training_iteration,experiment_id,date,timestamp,time_this_iter_s,time_total_s,...,hostname,node_ip,time_since_restore,timesteps_since_restore,iterations_since_restore,trial_id,experiment_tag,config/lr,config/momentum,logdir
10,0.959375,True,,,152,e0685c211e29497992e5561881efbc6b,2020-07-19_08-02-40,1595170960,0.169265,32.958171,...,DWAnyscaleMBP.local,192.168.1.149,20.302513,0,104,c7a21fd6,"11_lr=0.1,momentum=0.1",0.1,0.1,/Users/deanwampler/ray_results/TrainMNIST/Trai...
5,0.953125,False,,,152,88c2f848b44b414b867431e2888bb8a4,2020-07-19_08-02-39,1595170959,0.183516,34.506141,...,DWAnyscaleMBP.local,192.168.1.149,19.957888,0,104,c5b631e4,"6_lr=0.1,momentum=0.01",0.1,0.01,/Users/deanwampler/ray_results/TrainMNIST/Trai...
1,0.921875,False,,,48,c132d028028c4b3cb198d349abc182ef,2020-07-19_08-02-17,1595170937,0.273994,14.521615,...,DWAnyscaleMBP.local,192.168.1.149,10.055592,0,36,c4a1ce80,"2_lr=0.1,momentum=0.01",0.1,0.01,/Users/deanwampler/ray_results/TrainMNIST/Trai...
3,0.871875,False,,,48,98333a4fe4bc4e7fb492a3bd604acab5,2020-07-19_08-02-17,1595170937,0.265256,14.83388,...,DWAnyscaleMBP.local,192.168.1.149,9.975619,0,36,c4a2102a,"4_lr=0.01,momentum=0.9",0.01,0.9,/Users/deanwampler/ray_results/TrainMNIST/Trai...
11,0.8625,True,,,48,dc24002a5fd74f2c8a9651114b17d102,2020-07-19_08-02-18,1595170938,0.200659,12.644042,...,DWAnyscaleMBP.local,192.168.1.149,9.920999,0,36,c7aeb8a4,"12_lr=0.01,momentum=0.01",0.01,0.01,/Users/deanwampler/ray_results/TrainMNIST/Trai...


In [12]:
analysis.dataframe()[['mean_accuracy', 'config/lr', 'config/momentum']].sort_values('mean_accuracy', ascending=False)

Unnamed: 0,mean_accuracy,config/lr,config/momentum
10,0.959375,0.1,0.1
5,0.953125,0.1,0.01
1,0.921875,0.1,0.01
3,0.871875,0.01,0.9
11,0.8625,0.01,0.01
8,0.6875,0.01,0.1
4,0.646875,0.01,0.001
0,0.46875,0.01,0.001
2,0.215625,0.001,0.01
7,0.2,0.001,0.9
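
To read the table above at a glance, the following sketch groups the displayed rows by learning rate and keeps the best accuracy for each. It uses only the standard library and the values copied from the output above; the variable names are chosen for this example. With the real DataFrame, an equivalent would probably be `analysis.dataframe().groupby('config/lr')['mean_accuracy'].max()`.

```python
from collections import defaultdict

# (mean_accuracy, lr, momentum) rows copied from the table above.
rows = [
    (0.959375, 0.1, 0.1),
    (0.953125, 0.1, 0.01),
    (0.921875, 0.1, 0.01),
    (0.871875, 0.01, 0.9),
    (0.8625, 0.01, 0.01),
    (0.6875, 0.01, 0.1),
    (0.646875, 0.01, 0.001),
    (0.46875, 0.01, 0.001),
    (0.215625, 0.001, 0.01),
    (0.2, 0.001, 0.9),
]

# Best accuracy seen for each learning rate.
best_by_lr = defaultdict(float)
for accuracy, lr, _momentum in rows:
    best_by_lr[lr] = max(best_by_lr[lr], accuracy)

for lr in sorted(best_by_lr, reverse=True):
    print(f"lr={lr}: best mean_accuracy={best_by_lr[lr]}")
```

In this run, the learning rate separates the trials more sharply than momentum does: every trial with `lr=0.1` outperforms every trial with a smaller learning rate.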


How long did it take?

In [13]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

  53.08 seconds,    0.88 minutes


The runs in the previous lesson, for the class-based and the function-based Tune APIs, took between 12 and 20 seconds (on my machine), but we only trained for 20 iterations, whereas here the best trials ran for about 150 iterations (with `max_t=200`). That also accounts for the different results, notably that much smaller momentum values, `0.01` and `0.1`, perform best here, while in the previous lesson `0.9` performed best. A smaller momentum value requires longer training, but it allows finer-grained steps toward the optimal result, so more training iterations favor a smaller momentum value. Still, the mean accuracies among the top three or four combinations are quite close.
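
The effect of momentum on the number of iterations needed can be illustrated with a toy problem. The sketch below minimizes `f(x) = 0.5 * x**2` with the common heavy-ball update `v = momentum * v + grad; x = x - lr * v`. It is only an assumption-laden illustration: the function `sgd_momentum` is hypothetical, and the noise-free quadratic does not model the MNIST loss. It shows just the first half of the argument, that a smaller momentum makes slower progress per iteration.

```python
def sgd_momentum(lr, momentum, steps, x0=1.0):
    """Minimize f(x) = 0.5 * x**2 with heavy-ball momentum; return final x."""
    x, v = x0, 0.0
    for _ in range(steps):
        grad = x                   # Gradient of 0.5 * x**2.
        v = momentum * v + grad    # Accumulate velocity.
        x = x - lr * v             # Take a step along the velocity.
    return x


for momentum in [0.001, 0.1, 0.9]:
    for steps in [20, 200]:
        distance = abs(sgd_momentum(lr=0.01, momentum=momentum, steps=steps))
        print(f"momentum={momentum}, steps={steps}: |x|={distance:.4f}")
```

With few iterations, only the large-momentum run gets close to the minimum at `x = 0`, which is consistent with `0.9` looking best under a short training budget.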

## Exercise - Population Based Training

Read the documentation on _population based training_ (linked in the **About Schedulers** section above) to understand what it is doing. The next cell configures a PBT scheduler and defines other things you'll need.

In [14]:
from ray.tune.schedulers import PopulationBasedTraining

pbt_scheduler = PopulationBasedTraining(
        time_attr='training_iteration',
        perturbation_interval=10,  # Every N time_attr units, "perturb" the parameters.
        hyperparam_mutations={
            "lr": [0.001, 0.01, 0.1],
            "momentum": [0.001, 0.01, 0.1, 0.9]
        },
        **experiment_metrics)

config = {
    "lr": 0.001,            # Use the lowest values from the previous definition
    "momentum": 0.001
}

[2m[36m(pid=26711)[0m 2020-07-19 08:43:59,909	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_2_2020-07-19_08-33-07vmh5os0q/tmpje8nicoarestore_from_object/checkpoint
[2m[36m(pid=26711)[0m 2020-07-19 08:43:59,910	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 1290, '_timesteps_total': None, '_time_total': 613.1370136737823, '_episodes_total': None}
[2m[36m(pid=26715)[0m 2020-07-19 08:44:02,606	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_0_2020-07-19_08-33-07uc57ewdo/tmpzups7il9restore_from_object/checkpoint
[2m[36m(pid=26715)[0m 2020-07-19 08:44:02,606	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 1300, '_timesteps_total': None, '_time_total': 624.1683716773987, '_episodes_total': None}
[2m[36m(pid=26716)[0m 2020-07-19 08:44:06,818	INFO trainable.py:423 -- Restored on 192.168.

2020-07-19 08:44:12,385	INFO (unknown file):0 -- gc.collect() freed 16 refs in 0.10996127999987948 seconds


[2m[36m(pid=26725)[0m 2020-07-19 08:44:14,952	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_2_2020-07-19_08-33-07vmh5os0q/tmp5fs3_sbhrestore_from_object/checkpoint
[2m[36m(pid=26725)[0m 2020-07-19 08:44:14,953	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 1330, '_timesteps_total': None, '_time_total': 631.2472841739655, '_episodes_total': None}
[2m[36m(pid=26728)[0m 2020-07-19 08:44:19,112	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/TrainMNIST/TrainMNIST_0_2020-07-19_08-33-07uc57ewdo/tmpdoe3nvmyrestore_from_object/checkpoint
[2m[36m(pid=26728)[0m 2020-07-19 08:44:19,113	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 1360, '_timesteps_total': None, '_time_total': 645.7453458309174, '_episodes_total': None}
[2m[36m(pid=26732)[0m 2020-07-19 08:44:23,103	INFO trainable.py:423 -- Restored on 192.168.

Now modify the following cell, copied from above, to use this scheduler:

1. Use the new scheduler.
2. Remove the `search_alg` argument.
3. Add the `config` argument.
4. Remove the `num_samples` argument.
5. Consider using `0` or `1` for the `verbose` argument.

Then run it. 

> **WARNING:** This will run for a LONG time.

In [None]:
analysis = tune.run(TrainMNIST, 
    scheduler=scheduler, 
    search_alg=search_algorithm, 
    num_samples=12,                           # 12 trials; the space has 12 combinations, but samples can repeat.
    verbose=2,                                # Change to 0 or 1 to reduce the output.
    ray_auto_init=False                       # Don't allow Tune to initialize Ray.
)

Look at the `analysis` data of interest, as done previously. How well does PBT work?