# Training SageMaker Models for Molecular Property Prediction Using DGL with PyTorch Backend

The **SageMaker Python SDK** makes it easy to train DGL models. In this example, we train a simple graph neural network for molecular toxicity prediction using [DGL](https://github.com/dmlc/dgl) and the Tox21 dataset.

The dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear
receptors and stress response pathways. Each target yields a binary classification problem. Because each compound can be represented as a graph, with atoms as nodes and bonds as edges, we can model the task as graph classification.
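
To make the graph view concrete, here is a minimal pure-Python sketch of how a molecule can be turned into a graph and summarized as one vector. It is an illustration only, not the model in `main.py`: the helper names (`build_adjacency`, `message_passing_step`, `mean_readout`) and the toy molecule are assumptions made for this example. A real graph neural network would also apply learned weights at each step and pass the graph-level vector to a classifier that emits one score per target.

```python
from typing import Dict, List, Tuple


def build_adjacency(num_atoms: int, bonds: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Return an undirected adjacency list: atoms are nodes, bonds are edges."""
    adjacency = {atom: [] for atom in range(num_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def message_passing_step(features: List[List[float]],
                         adjacency: Dict[int, List[int]]) -> List[List[float]]:
    """Update each atom vector to itself plus the sum of its neighbors' vectors."""
    updated = []
    for atom, own in enumerate(features):
        total = list(own)
        for neighbor in adjacency[atom]:
            for i, value in enumerate(features[neighbor]):
                total[i] += value
        updated.append(total)
    return updated


def mean_readout(features: List[List[float]]) -> List[float]:
    """Average all atom vectors into one fixed-size vector for the whole molecule."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


# Hypothetical toy molecule: the heavy atoms of ethanol, C-C-O.
# Each atom has a one-hot feature vector [is_carbon, is_oxygen].
bonds = [(0, 1), (1, 2)]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

adjacency = build_adjacency(3, bonds)
hidden = message_passing_step(features, adjacency)
graph_vector = mean_readout(hidden)
print(graph_vector)
```

The readout step is what makes this graph classification rather than node classification: molecules of any size collapse to a vector of the same length, so a single classifier can score all 12 targets for every compound.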

## Setup

We first define a few variables that are used later in the example.

In [None]:
import sagemaker
from sagemaker import get_execution_role

# Set up the SageMaker session
sess = sagemaker.Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sess.default_bucket()

# Location to put your custom code.
custom_code_upload_location = 'customcode'

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment. 
role = get_execution_role()

## Training Script

`main.py` contains all the code needed to train the model on SageMaker.

In [None]:
!cat main.py

## Bring Your Own Image for SageMaker

This example needs the rdkit library to handle the Tox21 dataset. A DGL 0.4 GPU Docker image with rdkit pre-installed is available on Docker Hub under the dgllib registry (dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit). You can pull it and push it to your own AWS ECR repository; the following script does this for you. You can skip this step if you already have a DGL Docker image in your ECR.

In [None]:
%%sh
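# Pull the prebuilt DGL image from Docker Hub.
default_docker_name="dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit"
docker pull $default_docker_name

# Name used for both the local image and the ECR repository.
docker_name=sagemaker-dgl-pytorch-gcn-tox21

# Build the image for this example from the local Dockerfile.
docker build -t $docker_name -f gcn_tox21.Dockerfile .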

account=$(aws sts get-caller-identity --query Account --output text)
echo $account
region=$(aws configure get region)
region=${region:-us-east-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${docker_name}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${docker_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${docker_name}" > /dev/null
fi

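# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` exists only in AWS CLI v1. With AWS CLI v2,
# use `aws ecr get-login-password` piped into `docker login` instead.
$(aws ecr get-login --region ${region} --no-include-email)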

docker tag ${docker_name} ${fullname}

docker push ${fullname}

## SageMaker's Estimator Class

The SageMaker Estimator allows us to run a training job on a single machine in SageMaker, using CPU- or GPU-based instances.

When we create the estimator, we pass in the URI of our Docker image and our IAM execution role. We also provide a few other parameters. `train_instance_count` and `train_instance_type` determine the number and type of SageMaker instances used for the training job. Hyperparameters are passed to the training script as a dict of values; here the dict also carries the filename of our training script under the `entrypoint` key. See `main.py` for how they are handled.

The entrypoint of the SageMaker Docker image (e.g., dgllib/dgl-sagemaker-gpu:dgl_0.4_pytorch_1.2.0_rdkit) is a `train` script under /usr/bin/. The train script in the DGL image provided above reads the name of the real entrypoint from the hyperparameters and runs it from the 'training-code' data channel (/opt/ml/input/data/training-code/).
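
The lookup described above can be sketched in a few lines of Python. This is not the actual `train` script shipped in the image; `resolve_entrypoint` is a hypothetical helper written for illustration. It assumes the hyperparameters are read from the JSON file that SageMaker places at /opt/ml/input/config/hyperparameters.json inside the container, and it uses temporary directories in place of /opt/ml so the example can run locally.

```python
import json
import os
import sys
import tempfile


def resolve_entrypoint(hyperparameters_path: str, channel_dir: str) -> str:
    """Return the full path of the script named by the 'entrypoint' hyperparameter."""
    with open(hyperparameters_path) as f:
        hyperparameters = json.load(f)
    # SageMaker writes hyperparameter values to this file as strings.
    script_name = hyperparameters["entrypoint"]
    script_path = os.path.join(channel_dir, script_name)
    if not os.path.isfile(script_path):
        raise FileNotFoundError("entrypoint not found: " + script_path)
    return script_path


# Local stand-ins for /opt/ml/input/config/hyperparameters.json and
# /opt/ml/input/data/training-code/ (assumed layout for this sketch).
with tempfile.TemporaryDirectory() as root:
    config_path = os.path.join(root, "hyperparameters.json")
    channel_dir = os.path.join(root, "training-code")
    os.makedirs(channel_dir)
    with open(config_path, "w") as f:
        json.dump({"entrypoint": "main.py"}, f)
    with open(os.path.join(channel_dir, "main.py"), "w") as f:
        f.write("print('training')\n")

    resolved = resolve_entrypoint(config_path, channel_dir)
    resolved_name = os.path.basename(resolved)
    # A wrapper script could then launch the real entrypoint with this command.
    command = [sys.executable, resolved]
    print(command)
```

This indirection is why the estimator below passes `main.py` as the `entrypoint` hyperparameter and uploads the script as the `training-code` channel: the image stays generic, and the training code can change without rebuilding it.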

For this example, we will choose one ml.p3.2xlarge instance.

In [None]:
import boto3

# Set target dgl-docker name
docker_name='sagemaker-dgl-pytorch-gcn-tox21'

CODE_PATH = 'main.py'
code_location = sess.upload_data(CODE_PATH, bucket=bucket, key_prefix=custom_code_upload_location)

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, docker_name)
print(image)

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'entrypoint': CODE_PATH},
    sagemaker_session=sess,
)

## Running the Training Job

After constructing the Estimator object, we call `fit` to start the training job. The `training-code` channel points to the script we uploaded to S3.

In [None]:
estimator.fit({'training-code': code_location})

## Output
You can find the training output in the SageMaker console by searching for the training job and looking up the address under 'S3 model artifact'.
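
The same address is also available in the notebook as `estimator.model_data` once `fit` has finished. As a minimal sketch, the helper below splits such an S3 URI into a bucket and key so the artifact can be downloaded. `split_s3_uri` and the example URI are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split an s3://bucket/key URI into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI; in the notebook, use estimator.model_data instead.
example_uri = "s3://example-bucket/example-job/output/model.tar.gz"
artifact_bucket, artifact_key = split_s3_uri(example_uri)
print(artifact_bucket, artifact_key)

# With AWS credentials configured, the artifact could then be fetched with:
#   import boto3
#   boto3.client("s3").download_file(artifact_bucket, artifact_key, "model.tar.gz")
```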