# GCMC Hyperparameter Tuning with Amazon SageMaker and DGL with MXNet backend
_**Creating a Hyperparameter Tuning Job for a DGL Network**_
___


## Contents
1. [Background](#Background)  
2. [Setup](#Setup)  
3. [Code](#Code)  
4. [Tune](#Tune)  
5. [Wrap-up](#Wrap-up)  

## Background
This example notebook shows how to train a [Graph Convolutional Matrix Completion](https://arxiv.org/abs/1706.02263) (GCMC) network, a graph neural network model, using DGL with the MXNet backend on the [MovieLens dataset](https://grouplens.org/datasets/movielens/). It uses SageMaker's hyperparameter tuning to launch multiple training jobs with different hyperparameter combinations and find the set that gives the best model performance. This is an important step in the machine learning process, as hyperparameter settings can have a large impact on model accuracy. In this example, we'll use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to create a hyperparameter tuning job for a SageMaker estimator.

## Setup
This notebook was created and tested on an ml.p3.2xlarge notebook instance.

Before you start, make sure the following are in place:
 * You can successfully run the GCMC example, and the image \{account\}.dkr.ecr.\{region\}.amazonaws.com/sagemaker-dgl-gcmc:latest exists in Amazon ECR under your account and region.
 * An S3 bucket and prefix to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
 * The IAM role ARN used to give training and hosting access to your data. See the documentation for more details on creating these. If the role is not associated with the current notebook instance, or if more than one role is required for training and/or hosting, replace sagemaker.get_execution_role() with the appropriate full IAM role ARN string(s).

In [None]:
import sagemaker

# Setup session
sess = sagemaker.Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

# Location to put your custom code.
custom_code_upload_location = 'customcode'

# Location where results of model training are saved.
model_artifacts_location = 's3://{}/artifacts'.format(bucket)

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment. 
role = sagemaker.get_execution_role()

Now we'll import the Python libraries we'll need.

In [None]:
import boto3
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

## Code
To run Docker containers with SageMaker, we need to provide a Python script for the container to run. In this example, train.py provides all the code we need for training a SageMaker model.

In [None]:
!cat train.py

Once we've specified and tested our training script to ensure it works, we can start our tuning job. Testing can be done in either local mode or using SageMaker training. 
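
As a rough illustration of what such a script needs at its entry point, here is a minimal sketch of hyperparameter parsing. It assumes the container passes hyperparameters to the script as command-line flags, which may not match how your image is set up. The flag names mirror the hyperparameter names used later in this notebook. The function name `parse_hyperparameters` and the default values for the tuned parameters are placeholders, not taken from train.py.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters that arrive as command-line flags (an assumption)."""
    parser = argparse.ArgumentParser()
    # Fixed hyperparameters that this notebook sets on the estimator.
    parser.add_argument('--data_name', type=str, default='ml-1m')
    parser.add_argument('--save_dir', type=str, default='/opt/ml/model/')
    # Hyperparameters that the tuning job varies; defaults here are placeholders.
    parser.add_argument('--gcn_agg_accum', type=str, default='sum',
                        choices=['sum', 'stack'])
    parser.add_argument('--train_lr', type=float, default=0.01)
    parser.add_argument('--gen_r_num_basis_func', type=int, default=2)
    # Tolerate extra flags that the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(['--train_lr', '0.05', '--gcn_agg_accum', 'stack'])
print(args.train_lr, args.gcn_agg_accum, args.data_name)  # 0.05 stack ml-1m
```

Using `parse_known_args` rather than `parse_args` keeps the sketch from failing on flags it does not recognize.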

## Tune
Similar to running a single training job in SageMaker, we define our training estimator by passing in the code scripts, IAM role, (per job) hardware configuration, and any hyperparameters we're not tuning.

We assume you already have your own GCMC Docker image in Amazon ECR, built by following the steps in mxnet_gcmc.ipynb.

In [None]:
from sagemaker.mxnet.estimator import MXNet

# Set target dgl-docker name
docker_name='sagemaker-dgl-gcmc'

CODE_PATH = '../dgl_gcmc'
CODE_ENTRY = 'train.py'
#code_location = sess.upload_data(CODE_PATH, bucket=bucket, key_prefix=custom_code_upload_location)

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, docker_name)
print(image)

params = {}
params['data_name'] = 'ml-1m'
# set output to SageMaker ML output
params['save_dir'] = '/opt/ml/model/'
estimator = MXNet(entry_point=CODE_ENTRY,
                  source_dir=CODE_PATH,
                  role=role,
                  train_instance_count=1,
                  train_instance_type='ml.p3.2xlarge',
                  image_name=image,
                  hyperparameters=params,
                  sagemaker_session=sess)

Once we've defined our estimator we can specify the hyperparameters we'd like to tune and their possible values. We have three different types of hyperparameters.
  * Categorical parameters need to take one value from a discrete set. We define this by passing the list of possible values to CategoricalParameter(list)
  * Continuous parameters can take any real number value between the minimum and maximum value, defined by ContinuousParameter(min, max)
  * Integer parameters can take any integer value between the minimum and maximum value, defined by IntegerParameter(min, max)
  
Note, if possible, it's almost always best to specify a value as the least restrictive type. For example, tuning a threshold-style hyperparameter (call it thresh; it is not part of this notebook's search space) as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning it as a categorical parameter with possible values of 0.01, 0.1, 0.15, or 0.2.
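
To make the three types concrete, the plain-Python sketch below draws one candidate value from each kind of range. It is only an illustration of what each type permits: it uses uniform random draws from the standard library, whereas SageMaker's tuner chooses values with its own search strategy. The helper names are hypothetical, and the sketch treats both ends of each range as inclusive.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_categorical(values):
    """Pick exactly one value from a discrete set."""
    return rng.choice(values)


def sample_continuous(low, high):
    """Pick any real number between low and high."""
    return rng.uniform(low, high)


def sample_integer(low, high):
    """Pick any whole number between low and high, both ends included."""
    return rng.randint(low, high)


# One candidate configuration over the ranges used in this notebook.
candidate = {
    'gcn_agg_accum': sample_categorical(['sum', 'stack']),
    'train_lr': sample_continuous(0.001, 0.1),
    'gen_r_num_basis_func': sample_integer(1, 3),
}
print(candidate)
```

The continuous draw can land anywhere in its interval, which is why a continuous range is less restrictive than a short list of categorical values.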

In [None]:
hyperparameter_ranges = {'gcn_agg_accum': CategoricalParameter(['sum', 'stack']),
                         'train_lr': ContinuousParameter(0.001, 0.1),
                         'gen_r_num_basis_func': IntegerParameter(1, 3)}

Next we'll specify the objective metric that we'd like to tune and its definition. This includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of our training job.

In [None]:
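# Despite its name, this metric captures the best test RMSE printed by the
# training script (the single capture group in the regex), so lower is better.
objective_metric_name = 'Validation-accuracy'
metric_definitions = [{'Name': 'Validation-accuracy',
                       'Regex': 'Best Iter Idx=[0-9\\.]+, Best Valid RMSE=[0-9\\.]+, Best Test RMSE=([0-9\\.]+)'}]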
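
Before launching paid training jobs, it can help to check the regex locally. The sketch below applies the same pattern with Python's `re` module to a log line. The log line is hypothetical, written only to match the pattern, and is not real output from the training script.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
pattern = r'Best Iter Idx=[0-9\.]+, Best Valid RMSE=[0-9\.]+, Best Test RMSE=([0-9\.]+)'

# Hypothetical log line, constructed to match the pattern above.
log_line = 'Best Iter Idx=1200, Best Valid RMSE=0.8412, Best Test RMSE=0.8523'

match = re.search(pattern, log_line)
# The single capture group holds the value reported as the objective metric.
test_rmse = float(match.group(1))
print(test_rmse)  # 0.8523
```

Only the last number is wrapped in parentheses, so the captured value is the test RMSE rather than the validation RMSE.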


Now, we'll create a HyperparameterTuner object, which we pass:

 * The training estimator we created above
 * Our hyperparameter ranges
 * Objective metric name and definition
 * Number of training jobs to run in total and how many training jobs should be run simultaneously. Running more jobs in parallel finishes tuning sooner but may sacrifice accuracy, because each new job has fewer completed results to learn from. We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short).
 * Whether we should maximize or minimize our objective metric. The default is 'Maximize', but the value captured by our regex is an RMSE, where lower is better, so we pass objective_type='Minimize'.
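
The trade-off between total and parallel jobs can be worked out with a little arithmetic. The sketch below is one reading of the 10% guideline, treating it as "at most one tenth, but never fewer than one". Both helper names are hypothetical.

```python
import math


def suggested_parallel_jobs(max_jobs, divisor=10):
    """Cap parallel jobs at one tenth of the total, but never below 1."""
    return max(1, max_jobs // divisor)


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Approximate number of back-to-back batches needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


print(suggested_parallel_jobs(10))   # 1
print(suggested_parallel_jobs(100))  # 10
print(sequential_rounds(10, 2))      # 5
```

With 10 total jobs and 2 in parallel, the tuner gets roughly 5 rounds of results to learn from. A single job at a time would give 10 rounds at the cost of a longer wall-clock time.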

In [None]:
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            objective_type='Minimize',
                            max_jobs=10,
                            max_parallel_jobs=2)

And finally, we can start our tuning job by calling .fit().

In [None]:
tuner.fit()

Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully and is InProgress.

In [None]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']
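
If you would rather wait for the job than check it once, a small polling loop is one option. The sketch below is a hypothetical helper, not part of the SageMaker SDK. It takes any function that returns the current status string, so it can be tried here with a stand-in instead of a real AWS call.

```python
import time

# Status strings after which the tuning job will not change state again.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_tuning_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_status() until a terminal status appears or max_polls is reached."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status


# Stand-in for the describe call above, so the sketch runs without AWS access.
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_tuning_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)  # Completed
```

In the notebook, `get_status` could wrap the `describe_hyper_parameter_tuning_job` call shown above and return its `'HyperParameterTuningJobStatus'` field.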

## Wrap-up
Now that we've started our hyperparameter tuning job, it will run in the background and we can close this notebook. Once it has finished, we can go to the SageMaker console to analyze the results.
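
The results can also be inspected in code. The sketch below picks the best job from a list of training job summaries shaped like the `'TrainingJobSummaries'` entries returned by boto3's `list_training_jobs_for_hyper_parameter_tuning_job`. The helper name, job names, and metric values are hypothetical sample data, not results from this notebook.

```python
def best_training_job(summaries, minimize=True):
    """Return the completed job summary with the best final objective value."""
    scored = [s for s in summaries
              if s.get('TrainingJobStatus') == 'Completed'
              and 'FinalHyperParameterTuningJobObjectiveMetric' in s]
    if not scored:
        return None

    def objective(summary):
        return summary['FinalHyperParameterTuningJobObjectiveMetric']['Value']

    return min(scored, key=objective) if minimize else max(scored, key=objective)


# Hypothetical summaries; real ones would come from the boto3 call named above.
sample_summaries = [
    {'TrainingJobName': 'gcmc-tuning-001', 'TrainingJobStatus': 'Completed',
     'TunedHyperParameters': {'gcn_agg_accum': 'sum', 'train_lr': '0.01'},
     'FinalHyperParameterTuningJobObjectiveMetric':
         {'MetricName': 'Validation-accuracy', 'Value': 0.8523}},
    {'TrainingJobName': 'gcmc-tuning-002', 'TrainingJobStatus': 'Completed',
     'TunedHyperParameters': {'gcn_agg_accum': 'stack', 'train_lr': '0.005'},
     'FinalHyperParameterTuningJobObjectiveMetric':
         {'MetricName': 'Validation-accuracy', 'Value': 0.8391}},
    {'TrainingJobName': 'gcmc-tuning-003', 'TrainingJobStatus': 'Failed',
     'TunedHyperParameters': {'gcn_agg_accum': 'sum', 'train_lr': '0.09'}},
]

best = best_training_job(sample_summaries, minimize=True)
print(best['TrainingJobName'])  # gcmc-tuning-002
```

Because the objective here is an RMSE, the sketch selects the minimum. Jobs that failed or reported no metric are skipped.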

For more detail on Amazon SageMaker's Hyperparameter Tuning, please refer to the AWS documentation.