# Distributed Hyper-parameter Optimization on Cori notebook

In this notebook we will develop an example of hyper-parameter optimization (HPO) on Cori.

Let's develop the notebook in stages:
1. Invoke Python from the notebook and run a single training job on the login node.
2. Launch the single job on the batch system using the SlurmJob API.
3. Develop the HPO logic.
4. Demonstrate distributed HPO with multi-node training.

In [87]:
import os
import yaml
import numpy as np

In [111]:
from utils.slurm_helpers import SlurmJob

## Environment setup

Let's start by editing the environment directly from the notebook.
Alternatively, these settings could go in the kernel JSON file.

In [12]:
os.environ['PATH'] = '/usr/common/software/pytorch/v0.4.1/bin:' + os.environ['PATH']
os.environ['LD_LIBRARY_PATH'] = '/usr/common/software/pytorch/v0.4.1/lib:' + os.environ['LD_LIBRARY_PATH']
os.environ['MPICH_MAX_THREAD_SAFETY'] = 'multiple'
os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'
os.environ['KMP_BLOCKTIME'] = '1'
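
The cell above reads `os.environ['LD_LIBRARY_PATH']` directly, which raises a `KeyError` if the variable is not already set. A minimal sketch of a more tolerant way to prepend entries follows; the helper name `prepend_to_path_var` is hypothetical and is not part of this repository, and the demo runs on a plain dict so the real environment is left alone.

```python
import os


def prepend_to_path_var(name, new_entry, env=None):
    """Prepend new_entry to a colon-separated variable, tolerating an unset variable."""
    env = os.environ if env is None else env
    current = env.get(name, '')
    # Skip the separator when the variable is unset, so no empty entry is created
    env[name] = new_entry + ':' + current if current else new_entry
    return env[name]


# Demonstrate on a plain dict rather than the real environment
demo_env = {'PATH': '/usr/bin'}
prepend_to_path_var('PATH', '/usr/common/software/pytorch/v0.4.1/bin', env=demo_env)
prepend_to_path_var('LD_LIBRARY_PATH', '/usr/common/software/pytorch/v0.4.1/lib', env=demo_env)
print(demo_env)
```

Calling the helper without `env` would modify `os.environ`, as the cell above does.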

In [3]:
cd ..

/global/u2/s/sfarrell/WorkAreas/dl_science_benchmarks/pytorch-examples


## Generate configuration

In [103]:
def build_config(output_dir, conv_sizes, dense_sizes,
                 learning_rate=0.001, optimizer='Adam',
                 batch_size=64, n_epochs=1):
    data_config = dict(name='mnist', data_path='$SCRATCH/pytorch-mnist/data')
    experiment_config = dict(name='basic', output_dir=output_dir)
    model_config = dict(
        model_type='cnn_classifier',
        input_shape=[1, 28, 28], n_classes=10,
        conv_sizes=conv_sizes, dense_sizes=dense_sizes,
        optimizer=optimizer, learning_rate=learning_rate
    )
    train_config = dict(batch_size=batch_size, n_epochs=n_epochs)
    return dict(data_config=data_config, experiment_config=experiment_config,
                model_config=model_config, train_config=train_config)

In [71]:
hpo_dir = os.path.expandvars('$SCRATCH/pytorch-examples/mnist-hpo')
os.makedirs(hpo_dir, exist_ok=True)

In [107]:
config = build_config(output_dir=os.path.join(hpo_dir, 'output'),
                      conv_sizes=[8, 16, 32], dense_sizes=[])

In [108]:
# Serialize the configuration to a temporary file
config_file = os.path.join(hpo_dir, 'test.yaml')
with open(config_file, 'w') as f:
    yaml.dump(config, f)

## Run training locally

In [110]:
!python ./main.py $config_file

2018-10-15 13:28:19,639 INFO Initializing
2018-10-15 13:28:19,646 INFO Configuration: {'data_config': {'data_path': '$SCRATCH/pytorch-mnist/data', 'name': 'mnist'}, 'experiment_config': {'name': 'basic', 'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-hpo/output'}, 'model_config': {'conv_sizes': [8, 16, 32], 'dense_sizes': [], 'input_shape': [1, 28, 28], 'learning_rate': 0.001, 'model_type': 'cnn_classifier', 'n_classes': 10, 'optimizer': 'Adam'}, 'train_config': {'batch_size': 64, 'n_epochs': 1}}
2018-10-15 13:28:20,225 INFO Loaded 60000 training samples
2018-10-15 13:28:20,226 INFO Loaded 10000 validation samples
2018-10-15 13:28:20,274 INFO Model: 
CNNClassifier(
  (conv_net): Sequential(
    (0): Conv2d(1, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(8, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool

## Run training on batch system

In [112]:
# Job configuration
job_config = dict(
    node_type='haswell',
    n_nodes=1,
    qos='interactive',
    time=30,
)

In [113]:
job = SlurmJob(**job_config)

Launched in background. Redirecting stdin to /dev/null
salloc: Pending job allocation 15769472
salloc: job 15769472 queued and waiting for resources
salloc: job 15769472 has been allocated resources
salloc: Granted job allocation 15769472
salloc: Waiting for resource configuration
salloc: Nodes nid00046 are ready for job



In [115]:
out, err = job.submit_task('python ./main.py %s' % config_file).communicate()

In [116]:
# Our python logging currently goes to stderr
print(err.decode())

Launched in background. Redirecting stdin to /dev/null
2018-10-15 13:45:47,406 INFO Initializing
2018-10-15 13:45:47,925 INFO Configuration: {'data_config': {'data_path': '$SCRATCH/pytorch-mnist/data', 'name': 'mnist'}, 'experiment_config': {'name': 'basic', 'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-hpo/output'}, 'model_config': {'conv_sizes': [8, 16, 32], 'dense_sizes': [], 'input_shape': [1, 28, 28], 'learning_rate': 0.001, 'model_type': 'cnn_classifier', 'n_classes': 10, 'optimizer': 'Adam'}, 'train_config': {'batch_size': 64, 'n_epochs': 1}}
2018-10-15 13:45:48,601 INFO Loaded 60000 training samples
2018-10-15 13:45:48,601 INFO Loaded 10000 validation samples
2018-10-15 13:45:48,678 INFO Model: 
CNNClassifier(
  (conv_net): Sequential(
    (0): Conv2d(1, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(8, 16, kernel_size=(3, 3), stride=(
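
The `communicate()` call and `err.decode()` above suggest that `submit_task` returns something that behaves like a `subprocess.Popen` handle; that is an assumption, since the `SlurmJob` implementation is not shown here. The sketch below uses plain `subprocess` with a stand-in child process to illustrate why the training log shows up in `err` rather than `out`: Python's default logging handler writes to stderr.

```python
import subprocess
import sys

# Stand-in child process that logs one INFO message.
# logging.basicConfig installs a handler that writes to stderr by default.
child_code = (
    "import logging; "
    "logging.basicConfig(level=logging.INFO); "
    "logging.info('Initializing')"
)

proc = subprocess.Popen([sys.executable, '-c', child_code],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()  # blocks until the child exits

print('stdout:', repr(out.decode()))
print('stderr:', repr(err.decode()))
```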

In [117]:
# End the allocation
del job

## Define evaluation

To evaluate a model, we load the training summaries it saved to its output directory and take the best validation accuracy.

In [118]:
def get_val_acc(config):
    output_dir = os.path.expandvars(config['experiment_config']['output_dir'])
    summaries = np.load(os.path.join(output_dir, 'summaries.npz'))
    return summaries['valid_acc'].max()

In [119]:
get_val_acc(config)

0.9532
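
If a trial has not finished, or failed, `summaries.npz` will not exist and `np.load` will raise. Here is a sketch of a variant that returns `None` in that case. The name `best_valid_acc` is hypothetical, and the accuracies written to the temporary file are made-up values used only to exercise the function.

```python
import os
import tempfile

import numpy as np


def best_valid_acc(output_dir, key='valid_acc'):
    """Return the best validation accuracy in summaries.npz, or None if the file is missing."""
    path = os.path.join(os.path.expandvars(output_dir), 'summaries.npz')
    if not os.path.exists(path):
        return None
    # NpzFile is a context manager, so the file handle is closed promptly
    with np.load(path) as summaries:
        return float(summaries[key].max())


with tempfile.TemporaryDirectory() as tmp_dir:
    # Made-up per-epoch accuracies, not results from this notebook
    np.savez(os.path.join(tmp_dir, 'summaries.npz'),
             valid_acc=np.array([0.5, 0.75, 0.625]))
    found = best_valid_acc(tmp_dir)
    missing = best_valid_acc(os.path.join(tmp_dir, 'no_such_trial'))

print(found, missing)
```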

## Define HP sets

In [123]:
n_hpo_trials = 2

# Hyper-parameters for model config
c1 = np.random.choice([4, 8, 16], size=n_hpo_trials)
c2 = np.random.choice([4, 8, 16], size=n_hpo_trials)
c3 = np.random.choice([8, 16, 32], size=n_hpo_trials)
lr = np.random.choice([0.0001, 0.001, 0.01], size=n_hpo_trials)
conv_sizes = np.stack([c1, c2, c3], axis=1)

# Training config
batch_size = 64
n_epochs = 2
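
The `np.random.choice` calls above are unseeded, so re-running the cell draws a different set of trials. As an alternative, the sketch below draws from the same candidate values with the standard-library `random` module and a fixed seed. This is a swapped-in technique, not what the notebook ran, and the `sample_trials` helper is hypothetical. It also yields plain Python ints and floats rather than NumPy values, which tend to be easier to serialize.

```python
import random


def sample_trials(n_trials, seed=0):
    """Draw n_trials random hyper-parameter sets, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    trials = []
    for _ in range(n_trials):
        trials.append(dict(
            conv_sizes=[rng.choice([4, 8, 16]),
                        rng.choice([4, 8, 16]),
                        rng.choice([8, 16, 32])],
            learning_rate=rng.choice([0.0001, 0.001, 0.01]),
        ))
    return trials


trials = sample_trials(2, seed=0)
for trial in trials:
    print(trial)
```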

In [129]:
# Build the configurations for the HPO tasks
configs = [
    build_config(output_dir=os.path.join(hpo_dir, 'hp_%i' % i),
                 conv_sizes=conv_sizes[i], dense_sizes=[], learning_rate=lr[i],
                 batch_size=batch_size, n_epochs=n_epochs)
    for i in range(n_hpo_trials)
]

## Launch HP tasks to batch system

In [136]:
# Job configuration
job_config = dict(
    node_type='haswell',
    n_nodes=1,
    qos='interactive',
    time=30,
)

# Fix thread settings for remote job
#os.environ['OMP_NUM_THREADS'] = '32'

In [137]:
# Start the job
job = SlurmJob(**job_config)

Launched in background. Redirecting stdin to /dev/null
salloc: Pending job allocation 15770421
salloc: job 15770421 queued and waiting for resources
salloc: job 15770421 has been allocated resources
salloc: Granted job allocation 15770421



In [138]:
results = []
for i, config in enumerate(configs):
    output_dir = config['experiment_config']['output_dir']
    
    # Write the configuration to file
    os.makedirs(output_dir, exist_ok=True)
    config_file = os.path.join(output_dir, 'config.yaml')
    with open(config_file, 'w') as f:
        yaml.dump(config, f)
    
    # Submit the task
    results.append(job.submit_task('python ./main.py %s' % config_file))
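
The loop above stores each handle returned by `submit_task` without waiting on it, which suggests the tasks can run concurrently inside the allocation. The same launch-then-collect pattern can be sketched with plain `subprocess`, assuming `submit_task` behaves like `subprocess.Popen`. The child commands here are trivial stand-ins for the training runs.

```python
import subprocess
import sys

# Stand-ins for the training commands; each child just prints its trial index
commands = [[sys.executable, '-c', 'print(%d)' % i] for i in range(3)]

# Popen returns immediately, so all children are started before any is waited on
procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
         for cmd in commands]

# Collect results in submission order; communicate() waits for each child to exit
outputs = []
for proc in procs:
    out, err = proc.communicate()
    outputs.append((proc.returncode, out.decode().strip()))

print(outputs)
```

Checking the return code of each handle is a cheap way to spot a failed trial before trying to read its summaries.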

In [146]:
!sacct | tail -n 4

15770421     allocation interacti+    dasrepo         64    RUNNING      0:0 
15770421.ex+     extern               dasrepo         64    RUNNING      0:0 
15770421.0       python               dasrepo          1  COMPLETED      0:0 
15770421.1       python               dasrepo          1  COMPLETED      0:0 


In [147]:
stdout, stderr = results[0].communicate()

In [149]:
print(stderr.decode())

Launched in background. Redirecting stdin to /dev/null
2018-10-15 15:13:39,559 INFO Initializing
2018-10-15 15:13:39,626 INFO Configuration: {'data_config': {'data_path': '$SCRATCH/pytorch-mnist/data', 'name': 'mnist'}, 'experiment_config': {'name': 'basic', 'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-hpo/hp_0'}, 'model_config': {'conv_sizes': array([4, 8, 8]), 'dense_sizes': [], 'input_shape': [1, 28, 28], 'learning_rate': 0.0001, 'model_type': 'cnn_classifier', 'n_classes': 10, 'optimizer': 'Adam'}, 'train_config': {'batch_size': 64, 'n_epochs': 2}}
2018-10-15 15:13:40,228 INFO Loaded 60000 training samples
2018-10-15 15:13:40,228 INFO Loaded 10000 validation samples
2018-10-15 15:13:40,324 INFO Model: 
CNNClassifier(
  (conv_net): Sequential(
    (0): Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(4, 8, kernel_size=(3, 3), strid

In [150]:
get_val_acc(configs[0])

0.8872

In [151]:
get_val_acc(configs[1])

0.9315
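
With the accuracy of each trial in hand, the last step of a simple random search is to pick the best one. A minimal sketch, using the two accuracies reported above; the `select_best` helper is hypothetical.

```python
def select_best(val_accs):
    """Return (index, accuracy) of the trial with the highest validation accuracy."""
    best = max(range(len(val_accs)), key=lambda i: val_accs[i])
    return best, val_accs[best]


# Validation accuracies reported above for hp_0 and hp_1
val_accs = [0.8872, 0.9315]
best_trial, best_acc = select_best(val_accs)
print('Best trial: hp_%i (valid_acc=%.4f)' % (best_trial, best_acc))
```

In the notebook itself, the list could be built with `[get_val_acc(c) for c in configs]`.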

## Scratch