# Automatic Hyperparameter Tuning Example (HPO)

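1. Prepare the hyperparameter search space
2. Call the SDK's automatic hyperparameter tuning interface
3. View the experiment logs
4. View NNIBoard and wait for the tuning results
5. Interrupt and resume
6. Get the best model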

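### 1. Prepare the hyperparameter search space

Hyperparameter tuning assumes that you already have a basic understanding of the algorithm's hyperparameters: you know which ones need tuning and what range of values each can take.

Here we use a simple CNN as the example. The hyperparameters to tune are `batch-size`, `learning-rate`, and `epochs`: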

In [1]:
search_space = {
    "batch-size": { "_type": "choice", "_value": [32, 64, 128] },
    "learning-rate": { "_type": "choice", "_value": [0.001, 0.01] },
    "epochs": { "_type": "choice", "_value": [3, 6, 12] },
}
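
To get a feel for how large this search space is, the sketch below enumerates every combination using only the standard library. `enumerate_choices` is a hypothetical helper written for this illustration; it is not part of the Anylearn SDK or NNI, and it only handles the `choice` type used here.

```python
from itertools import product

# Same search space as in the cell above.
search_space = {
    "batch-size": {"_type": "choice", "_value": [32, 64, 128]},
    "learning-rate": {"_type": "choice", "_value": [0.001, 0.01]},
    "epochs": {"_type": "choice", "_value": [3, 6, 12]},
}


def enumerate_choices(space):
    """Yield every combination of a search space made only of `choice` entries."""
    names = list(space)
    for spec in space.values():
        if spec["_type"] != "choice":
            raise ValueError("this sketch only handles the `choice` type")
    for values in product(*(space[name]["_value"] for name in names)):
        yield dict(zip(names, values))


combinations = list(enumerate_choices(search_space))
print(len(combinations))  # 3 * 2 * 3 = 18
print(combinations[0])
```

The space contains 18 combinations. The experiment output in section 2 shows `hpo_max_runs = '10'`, so at most 10 trials run and not every combination can be evaluated.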

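### 2. Call the SDK's automatic hyperparameter tuning interface

Pass the search space prepared above, the local algorithm path, and the local dataset path to `run_hpo()`, along with any other parameters you need.

The interface integrates **NNI**, an open-source tool for automatic hyperparameter tuning, and wraps the steps of preparing the required Anylearn resources and invoking NNI to run the tuning.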

In [2]:
# The example CNN reports accuracy, so the optimization goal is to maximize it ("maximize")
# For metrics such as loss or error, the goal should be to minimize it ("minimize")
hpo_mode = "maximize"

In [4]:
from anylearn.applications.hpo import run_hpo
from anylearn.config import init_sdk

init_sdk("http://192.168.10.22:31888", "yhuang", "Anylearn2021!")

hpo_experiment = run_hpo(
    hpo_search_space=search_space,
    hpo_mode=hpo_mode,
    hpo_concurrency=2,
    project_name="TEST_SDK_HPO",
    algorithm_name="TEST_SDK_HPO",
    algorithm_dir="../../resources/cnn",
    dataset_dir="../../resources/fashion_mnist",
    algorithm_entrypoint="python fashion_mnist.py",
    algorithm_output="output",
    dataset_hyperparam_name="data-path",
    resource_request=[{
        'default': {
            'P-100-shared': 1,
            'CPU': 2,
            'Memory': 4,
        }
    }],
)
hpo_experiment

Local algorithm (None) has been deleted remotely, forced to re-registering algorithm.


HpoExperiment(
    project_id = 'PROJe0681dda11ecb06a9efcdf6b9861',
    algorithm_id = 'ALGO0c741dda11ec97889efcdf6b9861',
    dataset_id = 'DSETaf621dda11ec97889efcdf6b9861',
    hpo_search_space = '{'batch-size': {'_type': 'choice', '_value': [32, 64, 128]}, 'epochs': {'_type': 'choice', '_value': [3, 6, 12]}, 'learning-rate': {'_type': 'choice', '_value': [0.001, 0.01]}}',
    hpo_max_runs = '10',
    hpo_max_duration = '24h',
    hpo_tuner_name = 'TPE',
    hpo_mode = 'maximize',
    hpo_concurrency = '2',
    gpu_num = '-1',
    gpu_mem = '-1',
    created_at = '2021-09-25 16:29:43',
    hpo_id = 'a4k5y7sz',
    hpo_ip = '10.0.0.7',
    hpo_port = '8000',
    hpo_status = 'RUNNING',
    tasks = '{}',
    err = '[]',
)
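
The call above names `data-path` as `dataset_hyperparam_name`, and the search space uses the keys `batch-size`, `learning-rate`, and `epochs`. This notebook does not show how those values reach `fashion_mnist.py`. The sketch below assumes they arrive as command-line flags with the same names; that is an assumption, not documented behavior. `parse_hyperparams` and its default values are hypothetical.

```python
import argparse


def parse_hyperparams(argv=None):
    """Parse the tuned hyperparameters, ASSUMING they are passed as CLI flags."""
    parser = argparse.ArgumentParser(description="Hypothetical entrypoint arguments")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=3)
    # Assumed to receive the location of the dataset at run time.
    parser.add_argument("--data-path", type=str, default=".")
    return parser.parse_args(argv)


# Simulate one trial's arguments instead of reading sys.argv.
args = parse_hyperparams(
    ["--batch-size", "128", "--learning-rate", "0.01", "--epochs", "6", "--data-path", "/tmp/data"]
)
print(args.batch_size, args.learning_rate, args.epochs, args.data_path)
```

`argparse` turns the hyphens in flag names into underscores, so `--batch-size` is read back as `args.batch_size`.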

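#### 2-bis. The tuning experiment as a Python object

`run_hpo` returns an `HpoExperiment` instance, which serves as the profile of one tuning experiment. It records the experiment's basic metadata together with metadata about the algorithm, dataset, and other resources it uses, so that you can later recall what each experiment did, tell experiments apart, and export the model you need.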

In [5]:
# Sync the experiment object
hpo_experiment.get_detail()
hpo_experiment.get_tasks()
hpo_experiment

HpoExperiment(
    project_id = 'PROJ1fec04c611eca5e4025560493f54',
    algorithm_id = 'ALGO164404b311eca64e1ef0f663bff1',
    dataset_id = 'DSET427cff3711eb956ff2b2f0027438',
    hpo_search_space = '{'batch-size': {'_type': 'choice', '_value': [32, 64, 128]}, 'epochs': {'_type': 'choice', '_value': [3, 6, 12]}, 'learning-rate': {'_type': 'choice', '_value': [0.001, 0.01]}}',
    hpo_max_runs = '10',
    hpo_max_duration = '24h',
    hpo_tuner_name = 'TPE',
    hpo_mode = 'maximize',
    hpo_concurrency = '2',
    gpu_num = '1',
    gpu_mem = '0',
    created_at = '2021-08-24 18:35:00',
    hpo_id = '7piwjtu9',
    hpo_ip = '10.0.0.165',
    hpo_port = '8000',
    hpo_status = 'RUNNING',
    tasks = '{'TRAI1cb604c611eca5e4025560493f54': 'ziBQp', 'TRAI612004c611eca5e4025560493f54': 'OWtcE'}',
    err = '[]',
)

In [6]:
# View the algorithm used by the experiment
hpo_experiment.algorithm

Algorithm(tags='', mirror_id='MIRRtestyhuquicktraincde48001122', follows_anylearn_norm=False, entrypoint_training='python fashion_mnist.py', output_training='model', entrypoint_evaluation=None, output_evaluation=None, name='TEST_SDK_HPO', description='SDK_QUICKSTART', state=3, visibility=3, upload_time='2021-08-24 16:17:17', filename='TEST_SDK_HPO.zip', is_zipfile=True, file_path='USER6c1404b311ecb4f10e2019d0107b/algorithm/ALGO164404b311eca64e1ef0f663bff1/release', size='0', creator_id='USER6c1404b311ecb4f10e2019d0107b', node_id='cave-c31e1c203f10da7b', owner=['USER6c1404b311ecb4f10e2019d0107b'], id='ALGO164404b311eca64e1ef0f663bff1', _Algorithm__train_params=[{'name': 'dataset', 'type': 'dataset', 'suggest': 1}], required_train_params=[{'name': 'dataset', 'type': 'dataset', 'suggest': 1}], default_train_params={}, _Algorithm__evaluate_params=[{'name': 'dataset', 'type': 'dataset', 'suggest': 1}, {'name': 'model_path', 'alias': '', 'description': '', 'type': 'model', 'suggest': 1}], re

In [7]:
# View the dataset used by the experiment
hpo_experiment.dataset

Dataset(name='DSET_k5v6mg2a', description='SDK_QUICKSTART', state=3, visibility=1, upload_time='2021-08-17 16:46:53', filename='DSET_k5v6mg2a.zip', is_zipfile=True, file_path='USERfb6c6d2111eaadda13fd17feeac7/dataset/DSET427cff3711eb956ff2b2f0027438/release', size='0', creator_id='USERfb6c6d2111eaadda13fd17feeac7', node_id='cave-c31e1c203f10da7b', owner=['USERfb6c6d2111eaadda13fd17feeac7'], id='DSET427cff3711eb956ff2b2f0027438')

In [8]:
# View the project the experiment belongs to
hpo_experiment.project

Project(id='PROJ1fec04c611eca5e4025560493f54', name='TEST_SDK_HPO', description='SDK_HPO_EXPERIMENT', visibility=1, create_time='2021-08-24 18:34:57', update_time='2021-08-24 18:34:57', creator_id='USER6c1404b311ecb4f10e2019d0107b', datasets=['DSET427cff3711eb956ff2b2f0027438'], owner=['USER6c1404b311ecb4f10e2019d0107b'])

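### 3. View the experiment logs

Experiment logs fall into three categories:
1. The NNI experiment main log (nnimanager.log), which records flow information such as experiment start and stop; retrieve it with `HpoExperiment.get_log()`
2. The NNI trial output logs (stderr, stdout, and trial.log), which record the output produced while each trial runs; retrieve them with `HpoExperiment.get_trial_logs()`
3. The Anylearn training task logs, which record the algorithm's output during training; retrieve them with `HpoExperiment.get_trial_train_tasks()[i].get_full_log()`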

In [9]:
# Query the NNI experiment main log
meta_log = hpo_experiment.get_log()
[print(l) for l in meta_log]

[2021-08-24 18:34:59] INFO [ 'Datastore initialization done' ]
[2021-08-24 18:34:59] INFO [ 'RestServer start' ]
[2021-08-24 18:34:59] INFO [ 'RestServer base port is 8000' ]
[2021-08-24 18:34:59] INFO [ 'Rest server listening on: http://0.0.0.0:8000' ]
[2021-08-24 18:34:59] INFO [ 'Starting experiment: 7piwjtu9' ]
[2021-08-24 18:34:59] INFO [ 'Setup training service...' ]
[2021-08-24 18:35:00] INFO [ 'Construct local machine training service.' ]
[2021-08-24 18:35:00] INFO [ 'Setup tuner...' ]
[2021-08-24 18:35:00] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]
[2021-08-24 18:35:00] INFO [ 'Add event listeners' ]
[2021-08-24 18:35:00] INFO [ 'Run local machine training service.' ]
[2021-08-24 18:35:01] INFO [ 'NNIManager received command from dispatcher: ID, ' ]
[2021-08-24 18:35:01] INFO [ 'NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch-size": 32, "learning-rate": 0.01, "epochs": 3}, "param

[None]
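
The log lines printed above share a `[timestamp] LEVEL message` layout. The sketch below splits such lines into structured fields, which can help when filtering by level or time. It assumes each entry of `meta_log` is a plain string in that layout. `parse_log_line` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re
from datetime import datetime

# Two lines copied from the output above.
sample_log = [
    "[2021-08-24 18:34:59] INFO [ 'Datastore initialization done' ]",
    "[2021-08-24 18:35:00] INFO [ 'Change NNIManager status from: INITIALIZED to: RUNNING' ]",
]

LINE_PATTERN = re.compile(r"^\[(?P<time>[^\]]+)\]\s+(?P<level>\w+)\s+(?P<message>.*)$")


def parse_log_line(line):
    """Split one log line into (timestamp, level, message); return None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%Y-%m-%d %H:%M:%S")
    return timestamp, match.group("level"), match.group("message")


parsed = [parse_log_line(line) for line in sample_log]
for entry in parsed:
    print(entry)
```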

In [10]:
# Query the output logs of all current NNI trials
trial_logs = hpo_experiment.get_trial_logs()
[print(l) for l in trial_logs[list(trial_logs.keys())[0]]['log']]

Note that this method may be time-consuming


[2021-08-24 18:35:10] PRINT ---
[2021-08-24 18:35:10] PRINT         Project ID: PROJ1fec04c611eca5e4025560493f54
[2021-08-24 18:35:10] PRINT         ---
[2021-08-24 18:35:10] PRINT         Running trial...
[2021-08-24 18:35:10] PRINT         
[2021-08-24 18:35:10] PRINT {'name': 'HPO_TRIAL_N1_ziBQp', 'project_id': 'PROJ1fec04c611eca5e4025560493f54', 'algorithm_id': 'ALGO164404b311eca64e1ef0f663bff1', 'files': ['DSET427cff3711eb956ff2b2f0027438'], 'train_params': '{"batch-size": 128, "learning-rate": 0.01, "epochs": 6, "data-path": "$DSET427cff3711eb956ff2b2f0027438"}', 'gpu_num': '1', 'gpu_mem': '0', 'hpo_id': 'ziBQp'}
[2021-08-24 18:36:21] PRINT Intermediate metrics: [{'id': 'METR629e04c711eca5e4025560493f54', 'metric': 0.8729000091552734, 'reported_at': '2021-08-24 18:36:11', 'train_task_id': 'TRAI1cb604c611eca5e4025560493f54'}, {'id': 'METRf7d204c711eca5e4025560493f54', 'metric': 0.8895999789237976, 'reported_at': '2021-08-24 18:36:15', 'train_task_id': 'TRAI1cb604c611eca5e402556049

[None]

In [11]:
# Get all training tasks associated with the experiment
tasks = hpo_experiment.get_trial_train_tasks()
tasks

[TrainTask(name='HPO_TRIAL_N1_ziBQp', description='', state=2, visibility=1, creator_id='USER6c1404b311ecb4f10e2019d0107b', owner=['USER6c1404b311ecb4f10e2019d0107b'], project_id='PROJ1fec04c611eca5e4025560493f54', algorithm_id='ALGO164404b311eca64e1ef0f663bff1', train_params='{"batch-size": 128, "learning-rate": 0.01, "epochs": 6, "data-path": "$DSET427cff3711eb956ff2b2f0027438"}', files='DSET427cff3711eb956ff2b2f0027438', results_id='FILE413004c611eca5e4025560493f54', secret_key='TKEYf95404c611eca5e4025560493f54', create_time='2021-08-24 18:35:10', finish_time='2021-08-24 18:36:36', envs='', gpu_num=1, gpu_mem=0, hpo=False, hpo_search_space=None, final_metric=None, id='TRAI1cb604c611eca5e4025560493f54'),
 TrainTask(name='HPO_TRIAL_N2_i6IiE', description='', state=1, visibility=1, creator_id='USER6c1404b311ecb4f10e2019d0107b', owner=['USER6c1404b311ecb4f10e2019d0107b'], project_id='PROJ1fec04c611eca5e4025560493f54', algorithm_id='ALGO164404b311eca64e1ef0f663bff1', train_params='{"batc
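
In the output above, each `TrainTask` carries its parameters in `train_params` as a JSON string. The sketch below decodes such a string and separates the tuned hyperparameters from the dataset reference. `split_train_params` is a hypothetical helper, and the sample string is copied from the first task shown above.

```python
import json

# `train_params` string copied from the first TrainTask in the output above.
train_params = (
    '{"batch-size": 128, "learning-rate": 0.01, "epochs": 6, '
    '"data-path": "$DSET427cff3711eb956ff2b2f0027438"}'
)


def split_train_params(raw, dataset_key="data-path"):
    """Decode the JSON string and separate tuned hyperparameters from the dataset reference."""
    params = json.loads(raw)
    dataset_ref = params.pop(dataset_key, None)
    return params, dataset_ref


hyperparams, dataset_ref = split_train_params(train_params)
print(hyperparams)
print(dataset_ref)
```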

In [16]:
# Query the full log of the first training task
tasks[0].get_full_log()

[{'offset': 117,
  'text': 'From http://anylearn-gitea-http.anylearn-yhu:3000/xlearn/TEST_SDK_HPO'},
 {'offset': 257, 'text': ' * [new branch]      master     -> origin/master'},
 {'offset': 381,
  'text': 'HEAD is now at c559a1f Anylearn auto-commit 2021-07-15 01:25:10'},
 {'offset': 515, 'text': "Note: checking out 'origin/master'."},
 {'offset': 620, 'text': ''},
 {'offset': 690,
  'text': "You are in 'detached HEAD' state. You can look around, make experimental"},
 {'offset': 832,
  'text': 'changes and commit them, and you can discard any commits you make in this'},
 {'offset': 976,
  'text': 'state without impacting any branches by performing another checkout.'},
 {'offset': 1114, 'text': ''},
 {'offset': 1185,
  'text': 'If you want to create a new branch to retain commits you create, you may'},
 {'offset': 1328,
  'text': 'do so (now or later) by using -b with the checkout command again. Example:'},
 {'offset': 1473, 'text': ''},
 {'offset': 1544, 'text': '  git checkout -b <ne
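
The full log above comes back as a list of records with `offset` and `text` fields. For reading or searching, it may be easier to flatten it into one string. The sketch below does so, assuming `offset` gives the order of the records. `records_to_text` is a hypothetical helper, and the sample records are copied from the output above.

```python
# A few records copied from the output above, deliberately out of order.
records = [
    {"offset": 381, "text": "HEAD is now at c559a1f Anylearn auto-commit 2021-07-15 01:25:10"},
    {"offset": 117, "text": "From http://anylearn-gitea-http.anylearn-yhu:3000/xlearn/TEST_SDK_HPO"},
    {"offset": 257, "text": " * [new branch]      master     -> origin/master"},
]


def records_to_text(log_records):
    """Order records by `offset` and join their `text` fields with newlines."""
    ordered = sorted(log_records, key=lambda record: record["offset"])
    return "\n".join(record["text"] for record in ordered)


full_text = records_to_text(records)
print(full_text)
```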

### 4. View NNIBoard

In [5]:
# Get the NNIBoard URL
board = hpo_experiment.view()
board

{'url': 'http://192.168.10.22:31254'}

In [6]:
from IPython.display import display, HTML
display(HTML(f"<a href='{board['url']}' target='_blank'><h2>Open NNIBoard</h2></a>"))

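**NNIBoard shows the details of every trial, including the parameter combination it ran with, the intermediate results recorded, the running time, the status, and so on.**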

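### 5. Interrupt and resume

Automatic hyperparameter tuning takes a long time and can be interrupted: the process may be stopped deliberately, exit unexpectedly, or be lost to a crash or power failure. An interruption may prevent the tuning from finishing normally and from producing its results and best model, in which case the trials that have already completed lose their value. Being able to resume an interrupted tuning run is therefore essential.

Here we manually stop the tuning run started earlier to simulate an interruption, then obtain the experiment object again and call its `resume` method to restart the tuning.

Note: an interruption inevitably corrupts the trials that are running at that moment, but after the restart the tuning algorithm re-plans the tuning process as a whole.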

In [45]:
# Record the Anylearn project ID
project_id = hpo_experiment.project_id

# Simulate an interruption of the tuning run and the loss of the experiment object reference
hpo_experiment.stop()

project_id

'PROJ96e804b711eca5e4025560493f54'
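
A variable holding the project ID does not survive a crashed kernel or a power failure. One option is to write the ID to disk as soon as the experiment starts. The sketch below shows that idea with the standard library; `save_project_id`, `load_project_id`, and the file name `hpo_state.json` are hypothetical choices for this illustration.

```python
import json
import tempfile
from pathlib import Path


def save_project_id(path, project_id):
    """Write the project ID to a small JSON file."""
    Path(path).write_text(json.dumps({"project_id": project_id}), encoding="utf-8")


def load_project_id(path):
    """Read the project ID back from the JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["project_id"]


# A temporary directory keeps the example self-contained; use a durable path in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = Path(tmp_dir) / "hpo_state.json"
    save_project_id(state_file, "PROJ96e804b711eca5e4025560493f54")
    restored_id = load_project_id(state_file)

print(restored_id)
```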

In [22]:
# Sync the experiment object; hpo_status is now STOPPED
import time
time.sleep(5)
hpo_experiment.get_detail()
hpo_experiment.hpo_status

{'data-path': '$DSET3b9ae2f511ebaba35e9e5b63a5ec'}
<class 'dict'>


'STOPPED'
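
The cell above waits a fixed 5 seconds before checking the status. A polling loop with a timeout is a common alternative when the delay is not known in advance. The sketch below is generic: `wait_for_status` is a hypothetical helper, and the status source is a stand-in iterator rather than a real experiment.

```python
import time


def wait_for_status(fetch_status, target, timeout=60.0, interval=1.0):
    """Poll `fetch_status()` until it returns `target`; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if fetch_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the real status source: reports RUNNING twice, then STOPPED.
fake_statuses = iter(["RUNNING", "RUNNING", "STOPPED"])
reached = wait_for_status(lambda: next(fake_statuses), "STOPPED", timeout=5.0, interval=0.01)
print(reached)
```

With the experiment object from this notebook, `fetch_status` could be a small function that calls `hpo_experiment.get_detail()` and then returns `hpo_experiment.hpo_status`, as the cell above does by hand.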

In [24]:
# Obtain the tuning experiment object again
from anylearn.applications.hpo_experiment import HpoExperiment

hpo_experiment = HpoExperiment(project_id=project_id)
hpo_experiment.get_detail()
hpo_experiment

{'data-path': '$DSET3b9ae2f511ebaba35e9e5b63a5ec'}
<class 'dict'>


HpoExperiment(
    project_id = 'PROJe058e4c811eb94189ac5f1637d2a',
    algorithm_id = 'ALGO4d32e4b011eb889a6a087fa89dfd',
    dataset_id = 'DSET3b9ae2f511ebaba35e9e5b63a5ec',
    hpo_search_space = '{'batch-size': {'_type': 'choice', '_value': [32, 64, 128]}, 'epochs': {'_type': 'choice', '_value': [3, 6, 12]}, 'learning-rate': {'_type': 'choice', '_value': [0.001, 0.01]}}',
    hpo_max_runs = '10',
    hpo_max_duration = '24h',
    hpo_tuner_name = 'TPE',
    hpo_mode = 'maximize',
    hpo_concurrency = '1',
    gpu_num = '1',
    gpu_mem = '2',
    created_at = '2021-07-15 01:25:13',
    hpo_id = '52ayvcij',
    hpo_ip = '10.0.0.111',
    hpo_port = '8001',
    hpo_status = 'STOPPED',
    tasks = '{'TRAI9c74e4c811eb94189ac5f1637d2a': 'BS49h', 'TRAId7a6e4c811eb94189ac5f1637d2a': 'Dnc5v'}',
    err = '["[2021-07-15 01:30:05][PROJe058e4c811eb94189ac5f1637d2a] HTTPConnectionPool(host='localhost', port=8001): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectio

In [25]:
# Resume the experiment
hpo_experiment.resume()
time.sleep(5)
hpo_experiment.get_detail()
hpo_experiment

HpoExperiment(
    project_id = 'PROJe058e4c811eb94189ac5f1637d2a',
    algorithm_id = 'ALGO4d32e4b011eb889a6a087fa89dfd',
    dataset_id = 'DSET3b9ae2f511ebaba35e9e5b63a5ec',
    hpo_search_space = '{'batch-size': {'_type': 'choice', '_value': [32, 64, 128]}, 'epochs': {'_type': 'choice', '_value': [3, 6, 12]}, 'learning-rate': {'_type': 'choice', '_value': [0.001, 0.01]}}',
    hpo_max_runs = '10',
    hpo_max_duration = '24h',
    hpo_tuner_name = 'TPE',
    hpo_mode = 'maximize',
    hpo_concurrency = '1',
    gpu_num = '1',
    gpu_mem = '2',
    created_at = '2021-07-15 01:25:13',
    hpo_id = '52ayvcij',
    hpo_ip = '10.0.0.111',
    hpo_port = '8001',
    hpo_status = 'STOPPED',
    tasks = '{'TRAI9c74e4c811eb94189ac5f1637d2a': 'BS49h', 'TRAId7a6e4c811eb94189ac5f1637d2a': 'Dnc5v'}',
    err = '["[2021-07-15 01:30:05][PROJe058e4c811eb94189ac5f1637d2a] HTTPConnectionPool(host='localhost', port=8001): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectio

### 6. Get the best model

Once tuning has finished, call the experiment object's `export_best_model` method to export the best model to a local directory.

In [29]:
hpo_experiment.export_best_model(local_save_path="../../_tmp/")

Alternatively, use the `transform_best_model` method to store the best model directly in the Anylearn backend, so that it can later be referenced by its ID.

In [30]:
hpo_experiment.transform_best_model(model_name="HPO_EXAMPLE_CNN")

Model(algorithm_id='ALGO4d32e4b011eb889a6a087fa89dfd', name='HPO_EXAMPLE_CNN', description='', state=3, visibility=1, upload_time='2021-07-15 01:39:12', filename='.', is_zipfile=False, file_path='USERfb6c6d2111eaadda13fd17feeac7/model/MODEe7cee4ca11eb929a3ea893258ff7/release', size='0', creator_id='USERfb6c6d2111eaadda13fd17feeac7', node_id='cave-c31e1c203f10da7b', owner=['USERfb6c6d2111eaadda13fd17feeac7'], id='MODEe7cee4ca11eb929a3ea893258ff7')

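If needed, you can also call the experiment object's `get_best_train_task` method to obtain the trial that achieved the best metric (a `TrainTask` instance) and perform finer-grained operations on it.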

In [16]:
train_task = hpo_experiment.get_best_train_task()
train_task.get_final_metric()
train_task

TrainTask(name='HPO_TRIAL_N9_zkQQJ', description='', state=2, visibility=1, creator_id='USER6c1404b311ecb4f10e2019d0107b', owner=['USER6c1404b311ecb4f10e2019d0107b'], project_id='PROJ1fec04c611eca5e4025560493f54', algorithm_id='ALGO164404b311eca64e1ef0f663bff1', train_params='{"batch-size": 128, "learning-rate": 0.001, "epochs": 12, "data-path": "$DSET427cff3711eb956ff2b2f0027438"}', files='DSET427cff3711eb956ff2b2f0027438', results_id='FILE983a04c811eca4ac265fe0cb6be1', secret_key='TKEY3e5204c811eca4ac265fe0cb6be1', create_time='2021-08-24 18:43:45', finish_time='2021-08-24 18:45:36', envs='', gpu_num=1, gpu_mem=0, hpo=False, hpo_search_space=None, final_metric=0.9265999794006348, id='TRAI4e7204c811eca4ac265fe0cb6be1')
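
What counts as "best" depends on `hpo_mode`: the highest metric for `"maximize"`, the lowest for `"minimize"`. The sketch below illustrates that selection on plain `(name, metric)` pairs. It is not how `get_best_train_task` is implemented, which this notebook does not show. `pick_best` is a hypothetical helper, and only the first entry uses a value from the output above; the other entries are made up for illustration.

```python
def pick_best(results, mode):
    """Return the (name, metric) pair with the best metric for the given mode."""
    if mode not in ("maximize", "minimize"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Trials without a final metric (for example, still running) are ignored.
    scored = [(name, metric) for name, metric in results if metric is not None]
    if not scored:
        raise ValueError("no trial has a final metric yet")
    chooser = max if mode == "maximize" else min
    return chooser(scored, key=lambda pair: pair[1])


results = [
    ("HPO_TRIAL_N9_zkQQJ", 0.9265999794006348),  # value taken from the output above
    ("hypothetical_trial_a", 0.88),  # made-up value for illustration
    ("hypothetical_trial_b", None),  # made-up trial with no final metric
]

best = pick_best(results, "maximize")
print(best)
```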