# Hyperparameter Tuning Using Your Own Keras/TensorFlow Container

This notebook shows how to build your own Keras (TensorFlow) container, test it locally using the SageMaker Python SDK's local mode, and bring it to SageMaker for training with hyperparameter tuning.

The model used for this notebook is a ResNet model, trained with the CIFAR-10 dataset. The example is based on https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py

## Set up the notebook instance to support local mode
Currently, you need to install docker-compose to use local mode (i.e., to test the container on the notebook instance without pushing it to ECR).

In [62]:
!/bin/bash setup.sh

## Set up the environment
We will set up a few things before starting the workflow. 

1. Get the execution role, which is passed to SageMaker so that it can access your resources, such as the S3 bucket.
2. Specify the S3 bucket and prefix where the training dataset and model artifacts are stored.

In [5]:
import os
import numpy as np
import tempfile

import tensorflow as tf

import sagemaker
import boto3
from sagemaker.estimator import Estimator

region = boto3.Session().region_name

sagemaker_session = sagemaker.Session()
smclient = boto3.client('sagemaker')

bucket = sagemaker_session.default_bucket()  # S3 bucket name; must be in the same region as `region` above
prefix = 'sagemaker/hpo-keras-seedling'

role = sagemaker.get_execution_role()

In [6]:
bucket

'sagemaker-us-west-2-127230316473'

In [7]:
# Upload Data to S3
sagemaker_session.upload_data(path='npX_keras.pkl', bucket=bucket, key_prefix='data/hpo-keras-seedling/train')
sagemaker_session.upload_data(path='oh_npY.pkl', bucket=bucket, key_prefix='data/hpo-keras-seedling/train')

's3://sagemaker-us-west-2-127230316473/data/hpo-keras-seedling/train/oh_npY.pkl'

In [11]:
!aws s3 ls s3://sagemaker-us-west-2-127230316473/data/hpo-keras-seedling/train --recursive --human-readable

2018-08-01 04:16:52  317.3 MiB data/hpo-keras-seedling/train/npX_keras.pkl
2018-08-01 04:16:55  876.9 KiB data/hpo-keras-seedling/train/oh_npY.pkl


## Complete source code
- [trainer/start.py](trainer/start.py): The Keras model and training script
- [trainer/environment.py](trainer/environment.py): Contains information about the SageMaker environment
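
For orientation, SageMaker hands a custom training container its hyperparameters as a JSON file at `/opt/ml/input/config/hyperparameters.json` (every value serialized as a string) and mounts each File-mode input channel under `/opt/ml/input/data/<channel>`. The following is a minimal sketch of how a training script might read them. The helper names `load_hyperparameters` and `channel_dir`, and the default values, are hypothetical and are not taken from `trainer/environment.py`.

```python
import json
import os

# Standard locations inside a SageMaker training container.
CONFIG_DIR = '/opt/ml/input/config'
DATA_DIR = '/opt/ml/input/data'


def load_hyperparameters(path=os.path.join(CONFIG_DIR, 'hyperparameters.json')):
    """Read hyperparameters and cast them from strings to numeric types.

    Hypothetical helper: the keys and defaults mirror the ones used in this
    notebook, not necessarily those in trainer/environment.py.
    """
    raw = {}
    if os.path.exists(path):
        with open(path) as f:
            raw = json.load(f)  # SageMaker serializes every value as a string
    return {
        'batch_size': int(raw.get('batch_size', 32)),
        'learning_rate': float(raw.get('learning_rate', 0.0001)),
        'epochs': int(raw.get('epochs', 1)),
    }


def channel_dir(channel, data_dir=DATA_DIR):
    """Return the directory where a File-mode input channel is mounted."""
    return os.path.join(data_dir, channel)
```

Falling back to defaults when the file is missing lets the same script run outside SageMaker, which can be convenient while debugging.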

## Building the image
We will build the Docker image from the TensorFlow images on Docker Hub. The full list of TensorFlow image tags can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/


In [131]:
import shlex
import subprocess

def get_image_name(ecr_repository, tensorflow_version_tag):
    return '%s:tensorflow-%s' % (ecr_repository, tensorflow_version_tag)

def build_image(name, version):
    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)
    subprocess.check_call(shlex.split(cmd))

# Version tags can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/
# e.g., the latest CPU version is 'latest', while the latest GPU version is 'latest-gpu'
tensorflow_version_tag = 'latest'

account = boto3.client('sts').get_caller_identity()['Account']
    
ecr_repository = "%s.dkr.ecr.%s.amazonaws.com/test" % (account, region)  # your ECR repository, which you should have created before running the notebook

image_name = get_image_name(ecr_repository, tensorflow_version_tag)

print('building image:'+image_name)
build_image(image_name, tensorflow_version_tag)

building image:127230316473.dkr.ecr.us-west-2.amazonaws.com/test:tensorflow-latest


### Setting the hyperparameters

In [109]:
hyperparameters = dict(batch_size=32, learning_rate=.0001, epochs=1)
hyperparameters

{'batch_size': 32, 'learning_rate': 0.0001, 'epochs': 1}

In [132]:
# channels
train_data_location = 's3://{}/data/hpo-keras-seedling/train'.format(bucket)  # same prefix the data was uploaded to
channels = {'train': train_data_location}

## Pushing the container to ECR
Now that we've tested the container locally and it works fine, we can move on to hyperparameter tuning. Before kicking off the tuning job, you need to push the Docker image to ECR. The ECR repository was specified earlier in this notebook (the `ecr_repository` variable); make sure it points to your own repository.

In [111]:
# create our own repository 
# !aws ecr create-repository --repository-name test

In [133]:
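def push_image(name):
    # `aws ecr get-login` was removed in AWS CLI v2; `get-login-password`
    # is available in v2 and in recent v1 releases.
    cmd = 'aws ecr get-login-password --region ' + region
    password = subprocess.check_output(shlex.split(cmd))

    # The registry host is the part of the image name before the first '/'
    registry = name.split('/')[0]
    subprocess.run(['docker', 'login', '--username', 'AWS', '--password-stdin', registry],
                   input=password, check=True)

    cmd = 'docker push %s' % name
    subprocess.check_call(shlex.split(cmd))

print('pushing image:' + image_name)
push_image(image_name)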

pushing image:127230316473.dkr.ecr.us-west-2.amazonaws.com/test:tensorflow-latest


## Specify hyperparameter tuning job configuration
*Note: with the default settings below, the hyperparameter tuning job can take 20 to 30 minutes to complete. You can customize the code to get better results, for example by increasing the total number of training jobs or the number of epochs, with the understanding that the tuning time will increase accordingly.*

Now you configure the tuning job by defining a JSON object that you pass as the value of the `HyperParameterTuningJobConfig` parameter to the `create_hyper_parameter_tuning_job` call. In this JSON object, you specify:
* The ranges of hyperparameters you want to tune
* The limits of the resource the tuning job can consume 
* The objective metric for the tuning job

In [146]:
import json
from time import gmtime, strftime

tuning_job_name = 'seedling-keras-HPO-' + strftime("%d-%H-%M-%S", gmtime())

print(tuning_job_name)

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "0.0005",
          "MinValue": "0.0001",
          "Name": "learning_rate",          
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "100",
          "MinValue": "50",
          "Name": "epochs",          
        },
        {
          "MaxValue": "64",
          "MinValue": "32",
          "Name": "batch_size",          
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 4,
      "MaxParallelTrainingJobs": 2
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "acc",
      "Type": "Maximize"
    }
  }


seedling-keras-HPO-02-00-51-19
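
Because the parameter bounds are passed as strings, a swapped `MinValue`/`MaxValue` pair is easy to miss. Below is a small sanity check you could run before submitting the job. It is only a sketch: `check_tuning_job_config` is a hypothetical helper, it covers just the two conditions shown, and it is not a substitute for the validation SageMaker performs.

```python
def check_tuning_job_config(config):
    """Return a list of problems found in a tuning job config dict (empty if none)."""
    problems = []
    ranges = config.get('ParameterRanges', {})
    for key in ('ContinuousParameterRanges', 'IntegerParameterRanges'):
        for r in ranges.get(key, []):
            # Bounds are strings in the API request, so convert before comparing.
            if float(r['MinValue']) > float(r['MaxValue']):
                problems.append('%s: MinValue exceeds MaxValue' % r['Name'])
    limits = config.get('ResourceLimits', {})
    if limits.get('MaxParallelTrainingJobs', 0) > limits.get('MaxNumberOfTrainingJobs', 0):
        problems.append('MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs')
    return problems


# A trimmed-down config with the same ranges and limits as in this notebook.
example_config = {
    'ParameterRanges': {
        'ContinuousParameterRanges': [
            {'Name': 'learning_rate', 'MinValue': '0.0001', 'MaxValue': '0.0005'},
        ],
        'IntegerParameterRanges': [
            {'Name': 'epochs', 'MinValue': '50', 'MaxValue': '100'},
            {'Name': 'batch_size', 'MinValue': '32', 'MaxValue': '64'},
        ],
    },
    'ResourceLimits': {'MaxNumberOfTrainingJobs': 4, 'MaxParallelTrainingJobs': 2},
}

print(check_tuning_job_config(example_config))
```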


## Specify training job configuration
Now you configure the training jobs that the tuning job launches by defining a JSON object that you pass as the value of the `TrainingJobDefinition` parameter to the `create_hyper_parameter_tuning_job` call.
In this JSON object, you specify:
* Metrics that the training jobs emit
* The container image for the algorithm to train
* The input configuration for your training and test data
* Configuration for the output of the algorithm
* The values of any algorithm hyperparameters that are not tuned in the tuning job
* The type of instance to use for the training jobs
* The stopping condition for the training jobs

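This example defines one metric, `acc`, which SageMaker extracts from the TensorFlow container's log output with the regex `accuracy: ([0-9\.]+)`. It is also the objective metric of the tuning job.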

In [147]:
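training_image = image_name

# S3 location where each training job writes its model artifacts
output_location = 's3://{}/{}/output'.format(bucket, prefix)

print('training artifacts will be uploaded to: {}'.format(output_location))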

training_job_definition = {
    "AlgorithmSpecification": {
      "MetricDefinitions": [
        {
          "Name": "acc",
          "Regex": "accuracy: ([0-9\\.]+)"
        }
      ],
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": channels['train'],
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        },
        {
            "ChannelName": "test",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": channels['train'],
                    "S3DataDistributionType": "FullyReplicated"
                }
            },            
            "CompressionType": "None",
            "RecordWrapperType": "None"            
        }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m5.xlarge",
      "VolumeSizeInGB": 50
    },
    "RoleArn": role,
    "StaticHyperParameters": {
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}


training artifacts will be uploaded to: s3://sagemaker-us-west-2-127230316473/sagemaker/hpo-keras-seedling/output
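
SageMaker obtains the objective value by applying the `MetricDefinitions` regex to the container's log output, so it is worth checking the pattern against the lines your script actually prints. The sketch below applies the same regex to a few hypothetical log lines; the exact wording your container logs is an assumption here and depends on the Keras version and on what `trainer/start.py` prints.

```python
import re

METRIC_REGEX = "accuracy: ([0-9\\.]+)"  # same pattern as in MetricDefinitions


def extract_metric(log_line, pattern=METRIC_REGEX):
    """Return the first captured value as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines, for illustration only.
print(extract_metric("Test accuracy: 0.7512"))           # matches
print(extract_metric("loss: 1.2345 - acc: 0.5678"))      # no match: the label is 'acc'
print(extract_metric("val_accuracy: 0.4321"))            # also matches inside 'val_accuracy'
```

If a training job never prints a matching line, it reports no value for the objective metric, so a mismatch between the log label and the regex is a common thing to rule out first.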


## Create and launch a hyperparameter tuning job
Now you can launch the hyperparameter tuning job by calling the `create_hyper_parameter_tuning_job` API. Pass the name and the JSON objects you created in the previous steps as the values of the parameters. After the tuning job is created, you can describe it to see its progress in the next step, and you can go to SageMaker console->Jobs to check the progress of each training job that has been created.

In [148]:
len(tuning_job_name)

30
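
The length check above matters because tuning job names are short: as far as we know, they are limited to 32 characters and may contain only alphanumeric characters and hyphens. A minimal sketch of a fuller check follows; the limit and the pattern are stated as assumptions, and `is_valid_tuning_job_name` is a hypothetical helper.

```python
import re

# Assumed constraints on HyperParameterTuningJobName: at most 32 characters,
# alphanumerics separated by hyphens, starting and ending with an alphanumeric.
MAX_NAME_LENGTH = 32
NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9])*$')


def is_valid_tuning_job_name(name):
    """Check a candidate tuning job name against the assumed constraints."""
    return len(name) <= MAX_NAME_LENGTH and bool(NAME_PATTERN.match(name))


print(is_valid_tuning_job_name('seedling-keras-HPO-02-00-51-19'))
```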

In [149]:
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                               HyperParameterTuningJobConfig = tuning_job_config,
                                               TrainingJobDefinition = training_job_definition)

{'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-west-2:127230316473:hyper-parameter-tuning-job/seedling-keras-hpo-02-00-51-19',
 'ResponseMetadata': {'RequestId': '71c5b440-4daa-4e9c-8562-3f90174ab7ac',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '71c5b440-4daa-4e9c-8562-3f90174ab7ac',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '131',
   'date': 'Thu, 02 Aug 2018 00:51:40 GMT'},
  'RetryAttempts': 0}}

Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully and is `InProgress`.

In [150]:
smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name)['HyperParameterTuningJobStatus']

'InProgress'
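
If you would rather wait for the job than re-run the cell above by hand, a simple polling loop works. The sketch below is one way to do it: `wait_for_tuning_job` is a hypothetical helper that takes a status-returning callable, so it can be tried without calling AWS, and the set of terminal statuses is an assumption based on the status values the API is known to return.

```python
import time

# Statuses after which the tuning job will not change any further (assumed).
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_tuning_job(get_status, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or max_polls is reached.

    Returns the last status seen.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status in TERMINAL_STATUSES:
            return status
        if max_polls is not None and polls >= max_polls:
            return status
        sleep(poll_seconds)


# Usage with the client and job name from this notebook (not executed here):
# wait_for_tuning_job(
#     lambda: smclient.describe_hyper_parameter_tuning_job(
#         HyperParameterTuningJobName=tuning_job_name)['HyperParameterTuningJobStatus'])
```

Passing `max_polls` keeps the notebook cell from blocking indefinitely when a job runs for hours.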

In [145]:
import pandas as pd

tuner = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner.dataframe()
full_df

| | FinalObjectiveValue | TrainingElapsedTimeSeconds | TrainingEndTime | TrainingJobName | TrainingJobStatus | TrainingStartTime | batch_size | epochs | learning_rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | 16249.0 | 2018-08-02 00:47:25+00:00 | seedling-keras-HPO-01-16-13-27-004-a9727c74 | Stopped | 2018-08-01 20:16:36+00:00 | 37.0 | 96.0 | 0.000155 |
| 1 | | 16684.0 | 2018-08-02 00:47:27+00:00 | seedling-keras-HPO-01-16-13-27-003-0773193e | Stopped | 2018-08-01 20:09:23+00:00 | 95.0 | 96.0 | 0.000259 |
| 2 | 2.364892 | 13441.0 | 2018-08-01 19:59:16+00:00 | seedling-keras-HPO-01-16-13-27-002-90d75a2b | Completed | 2018-08-01 16:15:15+00:00 | 61.0 | 63.0 | 0.000297 |
| 3 | 2.424883 | 14100.0 | 2018-08-01 20:10:17+00:00 | seedling-keras-HPO-01-16-13-27-001-ab6f7f64 | Completed | 2018-08-01 16:15:17+00:00 | 67.0 | 66.0 | 0.000933 |


## Analyze tuning job results - after tuning job is completed
Please refer to "HPO_Analyze_TuningJob_Results.ipynb" to see example code to analyze the tuning job results.