# Distributed Hyper-parameter Optimization on Cori notebook

In this notebook we develop a Cori example of distributed hyper-parameter optimization (HPO) with distributed training.

We will demonstrate the following:
1. Invoking the command-line training script from the notebook.
2. Launching a single job on the batch system using the SlurmJob API.
3. Running distributed HPO with multi-node training using SlurmJob.


In [1]:
cd ..

/global/u2/s/sfarrell/WorkAreas/jupyter-dl/pytorch-examples


In [2]:
import os
import yaml
import numpy as np

In [3]:
from utils.slurm_helpers import SlurmJob

## Environment setup

Let's start by editing our environment directly.
Alternatively, these settings could go in the kernel JSON file.

In [4]:
os.environ['PATH'] = '/usr/common/software/pytorch/v0.4.1/bin:' + os.environ['PATH']
os.environ['LD_LIBRARY_PATH'] = '/usr/common/software/pytorch/v0.4.1/lib:' + os.environ['LD_LIBRARY_PATH']
os.environ['MPICH_MAX_THREAD_SAFETY'] = 'multiple'
os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'
os.environ['KMP_BLOCKTIME'] = '1'
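
Prepending to path-like variables is a recurring pattern, and re-running a cell like the one above adds the same entry again. The sketch below shows one way to make this idempotent and tolerant of an unset variable. `prepend_env_path` is a hypothetical helper name, not part of this repository, and the demonstration runs on a plain dict so the real environment is left untouched.

```python
import os

def prepend_env_path(name, new_dir, environ=os.environ):
    """Prepend new_dir to a path-like variable.

    Hypothetical helper (not part of the original notebook): it tolerates
    the variable being unset and does not add the same entry twice.
    """
    current = environ.get(name, '')
    parts = [p for p in current.split(os.pathsep) if p]
    if new_dir not in parts:
        parts.insert(0, new_dir)
    environ[name] = os.pathsep.join(parts)
    return environ[name]

# Demonstrate on a plain dict rather than the real environment
demo_env = {}
lib_dir = '/usr/common/software/pytorch/v0.4.1/lib'
prepend_env_path('LD_LIBRARY_PATH', lib_dir, demo_env)
prepend_env_path('LD_LIBRARY_PATH', lib_dir, demo_env)  # second call is a no-op
print(demo_env['LD_LIBRARY_PATH'])
```

Passing `os.environ` (the default) instead of `demo_env` would apply the same change to the notebook process.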

In [5]:
hpo_dir = os.path.expandvars('$SCRATCH/pytorch-examples/mnist-hpo')
os.makedirs(hpo_dir, exist_ok=True)

## Useful functions

In [6]:
def build_config(output_dir, conv_sizes, dense_sizes,
                 learning_rate=0.001, optimizer='Adam',
                 batch_size=64, n_epochs=1):
    data_config = dict(name='mnist', data_path='$SCRATCH/pytorch-mnist/data')
    experiment_config = dict(name='basic', output_dir=output_dir)
    model_config = dict(
        model_type='cnn_classifier',
        input_shape=[1, 28, 28], n_classes=10,
        conv_sizes=conv_sizes, dense_sizes=dense_sizes,
        optimizer=optimizer, learning_rate=learning_rate
    )
    train_config = dict(batch_size=batch_size, n_epochs=n_epochs)
    return dict(data_config=data_config, experiment_config=experiment_config,
                model_config=model_config, train_config=train_config)

def write_config(config, file):
    os.makedirs(os.path.dirname(file), exist_ok=True)
    with open(file, 'w') as f:
        yaml.dump(config, f)

def get_val_acc(config):
    output_dir = os.path.expandvars(config['experiment_config']['output_dir'])
    summaries = np.load(os.path.join(output_dir, 'summaries.npz'))
    return summaries['valid_acc'].max()
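
Note that `build_config` stores `data_path` with a literal `$SCRATCH`, while `get_val_acc` expands variables at read time. `os.path.expandvars` leaves unset variables in place instead of raising, so a missing variable can go unnoticed. A minimal sketch of a stricter variant follows; `resolve_dir` is a hypothetical name used only for illustration, and the `DEMO_*` variables are throwaway names so the example does not depend on `$SCRATCH`.

```python
import os

def resolve_dir(path):
    """Expand environment variables and fail loudly if any remain unexpanded.

    Hypothetical helper, not part of the original notebook.
    """
    expanded = os.path.expandvars(path)
    if '$' in expanded:
        raise ValueError('Unresolved environment variable in %r' % path)
    return expanded

# Throwaway variable names, used only for this demonstration
os.environ['DEMO_SCRATCH'] = '/tmp/demo-scratch'
os.environ.pop('DEMO_UNSET', None)

print(resolve_dir('$DEMO_SCRATCH/pytorch-examples/mnist-hpo'))
try:
    resolve_dir('$DEMO_UNSET/data')
except ValueError as err:
    print('Error:', err)
```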

## Run training locally

In [7]:
output_dir = os.path.expandvars('$SCRATCH/pytorch-examples/mnist-jupyter')
config = build_config(output_dir=output_dir,
                      conv_sizes=[8, 16, 32], dense_sizes=[])

# Serialize the configuration to a YAML file in the output directory
config_file = os.path.join(output_dir, 'config.yaml')
write_config(config, config_file)

In [8]:
!python ./main.py $config_file

2018-10-16 17:18:52,485 INFO Initializing
2018-10-16 17:18:52,488 INFO Configuration: {'data_config': {'data_path': '$SCRATCH/pytorch-mnist/data', 'name': 'mnist'}, 'experiment_config': {'name': 'basic', 'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-jupyter'}, 'model_config': {'conv_sizes': [8, 16, 32], 'dense_sizes': [], 'input_shape': [1, 28, 28], 'learning_rate': 0.001, 'model_type': 'cnn_classifier', 'n_classes': 10, 'optimizer': 'Adam'}, 'train_config': {'batch_size': 64, 'n_epochs': 1}}
2018-10-16 17:18:52,664 INFO Loaded 60000 training samples
2018-10-16 17:18:52,665 INFO Loaded 10000 validation samples
2018-10-16 17:18:52,684 INFO Model: 
CNNClassifier(
  (conv_net): Sequential(
    (0): Conv2d(1, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(8, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(

In [9]:
print('Validation set accuracy:', get_val_acc(config))

Validation set accuracy: 0.9683


## Run training on batch system

In [17]:
# Job configuration
job_config = dict(
    node_type='haswell',
    n_nodes=1,
    qos='interactive',
    time=30,
)

In [18]:
job = SlurmJob(**job_config)

Launched in background. Redirecting stdin to /dev/null
salloc: Pending job allocation 15801422
salloc: job 15801422 queued and waiting for resources
salloc: job 15801422 has been allocated resources
salloc: Granted job allocation 15801422
salloc: Waiting for resource configuration
salloc: Nodes nid00082 are ready for job



In [19]:
out, err = job.submit_task('python ./main.py %s' % config_file).communicate()

In [20]:
# Our Python logging currently goes to stderr
print(err)

Launched in background. Redirecting stdin to /dev/null
2018-10-16 17:24:54,186 INFO Initializing
2018-10-16 17:24:54,224 INFO Configuration: {'data_config': {'data_path': '$SCRATCH/pytorch-mnist/data', 'name': 'mnist'}, 'experiment_config': {'name': 'basic', 'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-jupyter'}, 'model_config': {'conv_sizes': [8, 16, 32], 'dense_sizes': [], 'input_shape': [1, 28, 28], 'learning_rate': 0.001, 'model_type': 'cnn_classifier', 'n_classes': 10, 'optimizer': 'Adam'}, 'train_config': {'batch_size': 64, 'n_epochs': 1}}
2018-10-16 17:24:54,955 INFO Loaded 60000 training samples
2018-10-16 17:24:54,955 INFO Loaded 10000 validation samples
2018-10-16 17:24:55,007 INFO Model: 
CNNClassifier(
  (conv_net): Sequential(
    (0): Conv2d(1, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(8, 16, kernel_size=(3, 3), stride=(1, 

In [21]:
print('Validation set accuracy:', get_val_acc(config))

Validation set accuracy: 0.9642


In [22]:
# End the allocation
del job

## Define HP sets

In [25]:
n_hpo_trials = 32

# Hyper-parameters for model config
c1 = np.random.choice([4, 8, 16], size=n_hpo_trials)
c2 = np.random.choice([4, 8, 16], size=n_hpo_trials)
c3 = np.random.choice([8, 16, 32], size=n_hpo_trials)
lr = np.random.choice([0.0001, 0.001, 0.01], size=n_hpo_trials)
conv_sizes = np.stack([c1, c2, c3], axis=1)

# Training config
batch_size = 64
n_epochs = 4

In [26]:
# Build the configurations for the HPO tasks
configs = [
    build_config(output_dir=os.path.join(hpo_dir, 'hp_%i' % i),
                 conv_sizes=conv_sizes[i], dense_sizes=[], learning_rate=lr[i],
                 batch_size=batch_size, n_epochs=n_epochs)
    for i in range(n_hpo_trials)
]
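
The cells above draw the hyper-parameters with `np.random.choice` and no seed, so re-running them produces a different set of trials, and the sampled values remain NumPy types (the logged configuration further down shows `array(...)` for `conv_sizes`). As an alternative sketch, not the approach used in this notebook, the same search space can be sampled with the standard-library `random` module using a fixed seed, which yields plain Python ints and floats. The function name `sample_hp_sets` and the `seed` parameter are assumptions for illustration.

```python
import random

def sample_hp_sets(n_trials, seed=0):
    """Draw random-search hyper-parameter sets as plain Python types.

    Illustrative sketch: mirrors the choices used above, but with a seeded
    standard-library generator instead of np.random.
    """
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    trials = []
    for _ in range(n_trials):
        trials.append(dict(
            conv_sizes=[rng.choice([4, 8, 16]),
                        rng.choice([4, 8, 16]),
                        rng.choice([8, 16, 32])],
            learning_rate=rng.choice([0.0001, 0.001, 0.01]),
        ))
    return trials

hp_sets = sample_hp_sets(32)
print(len(hp_sets))
print(hp_sets[0])
```

Each entry could then be passed to `build_config` as `conv_sizes=hp['conv_sizes']` and `learning_rate=hp['learning_rate']`.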

## Launch HP tasks to batch system

In [30]:
# Job configuration
job_config = dict(
    node_type='haswell',
    n_nodes=16,
    qos='interactive',
    time='2:00:00',
)

# Set the OpenMP thread count for the remote job
os.environ['OMP_NUM_THREADS'] = '32'

In [31]:
# Start the job
job = SlurmJob(**job_config)

Launched in background. Redirecting stdin to /dev/null
salloc: Pending job allocation 15801963
salloc: job 15801963 queued and waiting for resources
salloc: job 15801963 has been allocated resources
salloc: Granted job allocation 15801963



In [32]:
# Multi-node training configuration
n_nodes_per_task = 2
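
With a 16-node allocation and 2 nodes per task, only some of the 32 trials can run at once. The short calculation below estimates how many tasks fit concurrently and how many rounds are needed, assuming every task takes a similar time and that Slurm holds the remaining steps until nodes free up. `plan_waves` is a hypothetical helper name, not part of `SlurmJob`.

```python
import math

def plan_waves(n_nodes, n_nodes_per_task, n_tasks):
    """Return (concurrent tasks, number of rounds) for a fixed allocation."""
    if not 1 <= n_nodes_per_task <= n_nodes:
        raise ValueError('n_nodes_per_task must be between 1 and n_nodes')
    concurrent = n_nodes // n_nodes_per_task
    waves = math.ceil(n_tasks / concurrent)
    return concurrent, waves

# Values taken from this notebook: 16 nodes, 2 nodes per task, 32 trials
print(plan_waves(16, 2, 32))  # (8, 4)
```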

In [None]:
results = []
for i, config in enumerate(configs):
    # Write the configuration to file
    output_dir = config['experiment_config']['output_dir']
    config_file = os.path.join(output_dir, 'config.yaml')
    write_config(config, config_file)
    # Submit the task
    results.append(job.submit_task('python ./main.py -d %s' % config_file, n_nodes=n_nodes_per_task))

In [34]:
jobid = job.jobid

In [39]:
!sacct -j $jobid

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
15801963     allocation interacti+    dasrepo       1024    RUNNING      0:0 
15801963.ex+     extern               dasrepo       1024    RUNNING      0:0 
15801963.0       python               dasrepo          2  COMPLETED      0:0 
15801963.1       python               dasrepo          2  COMPLETED      0:0 
15801963.2       python               dasrepo          2  COMPLETED      0:0 
15801963.3       python               dasrepo          2  COMPLETED      0:0 
15801963.4       python               dasrepo          2  COMPLETED      0:0 
15801963.5       python               dasrepo          2  COMPLETED      0:0 
15801963.6       python               dasrepo          2  COMPLETED      0:0 
15801963.7       python               dasrepo          2  COMPLETED      0:0 
15801963.8       python               dasrepo          2  COMPLE

In [36]:
# Wait and gather all the results
outputs = [r.communicate() for r in results]
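
Before reading accuracies it can help to check that every task actually succeeded. The sketch below assumes that `submit_task` returns an object that behaves like `subprocess.Popen`, which the `.communicate()` call suggests but the notebook does not confirm. `failed_tasks` is a hypothetical helper, and the demonstration uses short stand-in Python processes instead of real training tasks.

```python
import subprocess
import sys

def failed_tasks(procs):
    """Return the indices of processes that exited with a non-zero code.

    Intended to be called after the outputs have been gathered, so that
    wait() returns immediately for finished processes.
    """
    failed = []
    for i, proc in enumerate(procs):
        if proc.wait() != 0:
            failed.append(i)
    return failed

# Stand-in processes: the second one exits with an error code
demo = [
    subprocess.Popen([sys.executable, '-c', 'import sys; sys.exit(%d)' % code])
    for code in (0, 1, 0)
]
print(failed_tasks(demo))  # [1]
```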

In [38]:
# Show the full output from one job
print(outputs[0][1])

Launched in background. Redirecting stdin to /dev/null
srun: Job 15801963 step creation temporarily disabled, retrying
srun: Step created for job 15801963
2018-10-16 17:48:57,071 INFO Initializing
2018-10-16 17:49:00,587 INFO Initializing
2018-10-16 17:49:00,632 INFO MPI rank 0
2018-10-16 17:49:00,631 INFO MPI rank 1
2018-10-16 17:49:00,655 INFO Configuration: {'data_config': {'data_path': '$SCRATCH/pytorch-mnist/data', 'name': 'mnist'}, 'experiment_config': {'name': 'basic', 'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-hpo/hp_0'}, 'model_config': {'conv_sizes': array([ 8,  8, 32]), 'dense_sizes': [], 'input_shape': [1, 28, 28], 'learning_rate': 0.01, 'model_type': 'cnn_classifier', 'n_classes': 10, 'optimizer': 'Adam'}, 'train_config': {'batch_size': 64, 'n_epochs': 4}}
2018-10-16 17:49:01,081 INFO Loaded 60000 training samples
2018-10-16 17:49:01,081 INFO Loaded 10000 validation samples
2018-10-16 17:49:02,401 INFO Loaded 60000 training samples
2018-10-16 17:49

In [40]:
# Gather the validation set accuracies
val_accs = np.array([get_val_acc(config) for config in configs])
print('Validation set accuracies:', val_accs)

Validation set accuracies: [0.9852 0.9798 0.987  0.9485 0.94   0.9708 0.9811 0.9371 0.9849 0.9843
 0.9708 0.9832 0.981  0.9674 0.937  0.9763 0.9352 0.9832 0.9706 0.8802
 0.9826 0.9283 0.9132 0.9181 0.9638 0.9495 0.9855 0.9851 0.9248 0.9848
 0.9706 0.9571]
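
Beyond the single best and worst trials, it is often useful to look at the top few. A minimal sketch using only built-in functions is shown below; `top_k_trials` is a hypothetical helper name, and the sample values are the first eight accuracies from the output above.

```python
def top_k_trials(val_accs, k=3):
    """Return (trial_index, accuracy) pairs for the k best trials."""
    ranked = sorted(enumerate(val_accs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# First eight accuracies from the output above
accs = [0.9852, 0.9798, 0.987, 0.9485, 0.94, 0.9708, 0.9811, 0.9371]
for idx, acc in top_k_trials(accs):
    print('hp_%i: %.4f' % (idx, acc))
# hp_2: 0.9870
# hp_0: 0.9852
# hp_6: 0.9811
```

Passing the full `val_accs` array works the same way, since `enumerate` accepts any iterable.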


In [41]:
# Best model configuration
configs[val_accs.argmax()]

{'data_config': {'name': 'mnist', 'data_path': '$SCRATCH/pytorch-mnist/data'},
 'experiment_config': {'name': 'basic',
  'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-hpo/hp_2'},
 'model_config': {'model_type': 'cnn_classifier',
  'input_shape': [1, 28, 28],
  'n_classes': 10,
  'conv_sizes': array([16, 16, 16]),
  'dense_sizes': [],
  'optimizer': 'Adam',
  'learning_rate': 0.001},
 'train_config': {'batch_size': 64, 'n_epochs': 4}}

In [44]:
# Worst model configuration
configs[val_accs.argmin()]

{'data_config': {'name': 'mnist', 'data_path': '$SCRATCH/pytorch-mnist/data'},
 'experiment_config': {'name': 'basic',
  'output_dir': '/global/cscratch1/sd/sfarrell/pytorch-examples/mnist-hpo/hp_19'},
 'model_config': {'model_type': 'cnn_classifier',
  'input_shape': [1, 28, 28],
  'n_classes': 10,
  'conv_sizes': array([4, 4, 8]),
  'dense_sizes': [],
  'optimizer': 'Adam',
  'learning_rate': 0.0001},
 'train_config': {'batch_size': 64, 'n_epochs': 4}}

In [45]:
del job