# Molecular Property Prediction

#### This exercise is part of *Chapter 8* in the book *Applied Machine Learning for Healthcare and Lifesciences on AWS*. Make sure you have completed the steps outlined in the prerequisites section and the initial steps in the section *Building a molecular property prediction model on Sagemaker* of *Chapter 8* before starting this exercise.

In this exercise, we will train a molecular property prediction model on SageMaker using a custom training container. We will run the training in two modes:
1. **Local Mode**: In this mode, we will test our custom container by training a single model for Human Intestinal Absorption (HIA) prediction.
2. **SageMaker training mode**: In this mode, we will train multiple ADME (absorption, distribution, metabolism, and excretion) models on a GPU instance on SageMaker.

We will then download the trained models locally. 


Let's begin by building and pushing our Docker container image. First, let's look at our example Dockerfile.

In [None]:
!pygmentize Dockerfile

# Part of the implementation of this container is based on the Amazon SageMaker Apache MXNet container.
# https://github.com/aws/sagemaker-mxnet-container

FROM ubuntu:16.04

LABEL maintainer="Amazon AI"

# Defining some variables used at build time to install Python3
ARG PYTHON=python3
ARG PYTHON_PIP=python3-pip
ARG PIP=pip3
ARG PYTHON_VERSION=3.6.6

# Install some handful libraries like curl, wget, git, build-essential, zlib
RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get update && apt-get install -y --no-install-recommends \

We start from a base `ubuntu:16.04` image, install the necessary base software, and then add custom libraries such as `RDKit` and `DeepPurpose`. Finally, we copy our training scripts to `/opt/ml/code/`, the location from which SageMaker picks up the training code.






Let's now look at our local training script.

In [None]:
!pygmentize train_local.py

# Portions of this script is borrowed from https://github.com/mims-harvard/TDC/blob/main/tutorials/TDC_104_ML_Model_DeepPurpose.ipynb

from DeepPurpose import utils, CompoundPred
from tdc.single_pred import ADME
import argparse
import os

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.enviro

This is the script that we run locally to test our training container. It trains a single model for HIA prediction. Note that when we create the estimator, we pass the hyperparameter `sagemaker_program` so that SageMaker picks the correct script to run.
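To make the environment-variable defaults more concrete, here is a minimal sketch of how a training script might read its locations. `SM_MODEL_DIR`, `--model_dir`, and `--train` appear in the listing above. The variable name `SM_CHANNEL_TRAIN` and the fallback paths are assumptions for illustration only, because the listing is cut off before the default for `--train` is shown.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse training-script arguments, falling back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # SM_MODEL_DIR is set inside the training container. Using .get() with a
    # fallback lets the script also start outside a container.
    parser.add_argument("--model_dir", type=str,
                        default=env.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Assumption: the variable name and fallback path here are illustrative;
    # the original script's default for --train is not visible in the listing.
    parser.add_argument("--train", type=str,
                        default=env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # parse_known_args tolerates extra hyperparameters the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(argv=[], env={"SM_MODEL_DIR": "/tmp/model"})
print(args.model_dir)
```

Reading the environment through `.get()` is a design choice for this sketch: indexing `os.environ[...]` directly, as the original script does, fails fast when the script runs outside a SageMaker container.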


Let's now look at the script for Sagemaker training.

In [None]:
!pygmentize train_sm.py

# Portions of this script is borrowed from https://github.com/mims-harvard/TDC/blob/main/tutorials/TDC_104_ML_Model_DeepPurpose.ipynb

from tdc.utils import retrieve_dataset_names
from tdc.single_pred import ADME
from DeepPurpose import utils, CompoundPred
import argparse
import os
from shutil import make_archive
import boto3

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir'

This script also accepts an S3 output location for our trained model. This script trains multiple ADME models in a loop and uploads them to an output location on S3. 
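The train-in-a-loop, archive, and upload flow described above can be sketched as follows. This is not the code of `train_sm.py`, whose listing is truncated. The helper names (`train_all`, `archive_models`, `upload_archive`) and the `train_one` callback are hypothetical, and the real training call is replaced by a stand-in so the sketch runs without `DeepPurpose`, `tdc`, or AWS credentials. The `<name>_model` directory naming and the S3 key `ADME/models/models.zip` follow what this notebook shows when the models are downloaded.

```python
import tempfile
from pathlib import Path
from shutil import make_archive


def train_all(dataset_names, train_one, models_root):
    """Train one model per dataset and save each under <name>_model/."""
    models_root = Path(models_root)
    for name in dataset_names:
        model_dir = models_root / f"{name}_model"
        model_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical callback: fits a model for `name` and saves it in model_dir.
        train_one(name, model_dir)
    return models_root


def archive_models(models_root, archive_base):
    """Zip the whole models directory and return the path of the .zip file."""
    return make_archive(str(archive_base), "zip", root_dir=str(models_root))


def upload_archive(zip_path, bucket, key="ADME/models/models.zip"):
    """Upload the archive to S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it
    boto3.client("s3").upload_file(str(zip_path), bucket, key)


def fake_train(name, model_dir):
    # Stand-in for real training: writes a placeholder file instead of a model.
    (model_dir / "model.txt").write_text(f"placeholder for {name}")


with tempfile.TemporaryDirectory() as tmp:
    root = train_all(["hia_hou", "caco2_wang"], fake_train, Path(tmp) / "models")
    zip_path = archive_models(root, Path(tmp) / "models")
    names = sorted(p.name for p in root.iterdir())
    zip_name = Path(zip_path).name

print(names)
print(zip_name)
```

Inside a real training job, `upload_archive(zip_path, bucket)` would be called with the bucket name passed to the script.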

Let's now build and push the container using the following shell script.

In [None]:
%%sh

docker_name=sagemaker-deeppurpose
account=$(aws sts get-caller-identity --query Account --output text)
echo $account
region=$(aws configure get region)

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${docker_name}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${docker_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${docker_name}" > /dev/null
fi

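# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` exists only in AWS CLI v1. With AWS CLI v2, use instead:
#   aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
$(aws ecr get-login --region ${region} --no-include-email)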
docker build -t $docker_name -f Dockerfile .
docker tag ${docker_name} ${fullname}
docker push ${fullname}

485822383573
Login Succeeded

Step 1/23 : FROM ubuntu:16.04
 ---> b6f507652425
Step 2/23 : LABEL maintainer="Amazon AI"
 ---> Using cache
 ---> 7d3810176a2e
Step 3/23 : ARG PYTHON=python3
 ---> Using cache
 ---> 683c419be179
Step 4/23 : ARG PYTHON_PIP=python3-pip
 ---> Using cache
 ---> b8624329a0e4
Step 5/23 : ARG PIP=pip3
 ---> Using cache
 ---> a42942582dae
Step 6/23 : ARG PYTHON_VERSION=3.6.6
 ---> Using cache
 ---> 1a0cabeefc40
Step 7/23 : RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common &&     add-apt-repository ppa:deadsnakes/ppa -y &&     apt-get update && apt-get install -y --no-install-recommends         build-essential         ca-certificates         curl         wget         git         libopencv-dev         openssh-client         openssh-server         vim         zlib1g-dev &&     rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 0ea7cc8cd8de
Step 8/23 : RUN wget https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



Now that we have the container, we will use it to train. Let's first import some required libraries and designate the default S3 bucket.

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.local import LocalSession
import boto3

# Regular session: provides the default S3 bucket and is used later for the managed training job
sess = sagemaker.Session()
bucket = sess.default_bucket()
role = get_execution_role()
# Local session: runs the training container on this machine through Docker
sagemaker_session = LocalSession()

We will now create an estimator using our custom container. We specify the local training script through the `sagemaker_program` hyperparameter and set `instance_type` to `local` so that training runs on this machine.

In [None]:
docker_name = "sagemaker-deeppurpose"


account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, docker_name)
print(image)
task_tags = [{"Key": "ML Task", "Value": "deeppurpose"}]
estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    instance_type="local",
    tags=task_tags,
    sagemaker_session=sagemaker_session,
    hyperparameters={"sagemaker_program": "train_local.py"}
)

485822383573.dkr.ecr.us-east-1.amazonaws.com/sagemaker-deeppurpose:latest


Next, let's train our model.

In [None]:
estimator.fit()

Creating g0rsc63n6p-algo-1-e6soy ... 
Creating g0rsc63n6p-algo-1-e6soy ... done
Attaching to g0rsc63n6p-algo-1-e6soy
[36mg0rsc63n6p-algo-1-e6soy |[0m   from cryptography.hazmat.backends import default_backend
[36mg0rsc63n6p-algo-1-e6soy |[0m 2022-07-29 01:02:49,120 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36mg0rsc63n6p-algo-1-e6soy |[0m 2022-07-29 01:02:49,121 sagemaker-training-toolkit INFO     Failed to parse hyperparameter sagemaker_program value train_local.py to Json.
[36mg0rsc63n6p-algo-1-e6soy |[0m Returning the value itself
[36mg0rsc63n6p-algo-1-e6soy |[0m 2022-07-29 01:02:49,134 sagemaker-training-toolkit INFO     instance_groups entry not present in resource_config
[36mg0rsc63n6p-algo-1-e6soy |[0m 2022-07-29 01:02:49,141 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[36mg0rsc63n6p-algo-1-e6soy |[0m 2022-07-29 01:02:49,141 sagemaker-training-toolkit INFO     Failed to parse hyperparam

Now that we have verified that the training works, we will use this container to train multiple ADME models. This time, we will define the script file to be the Sagemaker script `train_sm.py` and we also provide an output bucket where we will upload the trained models.

In [None]:
# Use the regular session `sess` (not the local session) so that the job runs on a managed SageMaker GPU instance
estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    instance_type="ml.p2.xlarge",
    tags=task_tags,
    sagemaker_session=sess,
    hyperparameters={"sagemaker_program": "train_sm.py", "models_output_bucket": bucket }
)

Let's train!

In [None]:
estimator.fit()

2022-07-29 01:03:25 Starting - Starting the training job...
2022-07-29 01:03:50 Starting - Preparing the instances for trainingProfilerReport-1659056605: InProgress
.........
2022-07-29 01:05:21 Downloading - Downloading input data......
  from cryptography.hazmat.backends import default_backend
2022-07-29 01:07:37,608 sagemaker-training-toolkit INFO     Failed to parse hyperparameter models_output_bucket value sagemaker-us-east-1-485822383573 to Json.
Returning the value itself
2022-07-29 01:07:37,609 sagemaker-training-toolkit INFO     Failed to parse hyperparameter sagemaker_program value train_sm.py to Json.
Returning the value itself
2022-07-29 01:07:37,648 sagemaker-training-toolkit INFO     Failed to parse hyperparameter models_output_bucket value sagemaker-us-east-1-485822383573 to Json.
Returning the value itself
2022-07-29 01:07:37,648 sagemaker-training-toolkit INFO     Failed to parse hyperparameter sagemaker_pr

We have now trained all 21 ADME models. Let's download the models locally and examine them.

In [None]:
s3 = boto3.client('s3')
s3.download_file(bucket, 'ADME/models/models.zip', 'models.zip')
!unzip models.zip -d models/

Archive:  models.zip
   creating: models/bbb_martins_model/
   creating: models/bioavailability_ma_model/
   creating: models/caco2_wang_model/
   creating: models/clearance_hepatocyte_az_model/
   creating: models/clearance_microsome_az_model/
   creating: models/cyp1a2_veith_model/
   creating: models/cyp2c19_veith_model/
   creating: models/cyp2c9_substrate_carbonmangels_model/
   creating: models/cyp2c9_veith_model/
   creating: models/cyp2d6_substrate_carbonmangels_model/
   creating: models/cyp2d6_veith_model/
   creating: models/cyp3a4_substrate_carbonmangels_model/
   creating: models/cyp3a4_veith_model/
   creating: models/half_life_obach_model/
   creating: models/hia_hou_model/
   creating: models/hydrationfreeenergy_freesolv_model/
   creating: models/lipophilicity_astrazeneca_model/
   creating: models/pgp_broccatelli_model/
   creating: models/ppbr_az_model/
   creating: models/solubility_aqsoldb_model/
   creating: models/vdss_lombardo_model/
  inflating: models/half_lif

In [None]:
! ls models/

bbb_martins_model		      cyp3a4_substrate_carbonmangels_model
bioavailability_ma_model	      cyp3a4_veith_model
caco2_wang_model		      half_life_obach_model
clearance_hepatocyte_az_model	      hia_hou_model
clearance_microsome_az_model	      hydrationfreeenergy_freesolv_model
cyp1a2_veith_model		      lipophilicity_astrazeneca_model
cyp2c19_veith_model		      pgp_broccatelli_model
cyp2c9_substrate_carbonmangels_model  ppbr_az_model
cyp2c9_veith_model		      solubility_aqsoldb_model
cyp2d6_substrate_carbonmangels_model  vdss_lombardo_model
cyp2d6_veith_model


As you can see, all 21 models are now available locally. You can deploy them either locally or on SageMaker.
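
If you prefer to check the extracted directories programmatically rather than reading the `ls` output, a small helper along these lines can be used. It is a sketch that relies only on the `<dataset>_model` naming visible in the listing above. `list_models` is a hypothetical name, and the demonstration uses a throwaway directory so it runs anywhere.

```python
import tempfile
from pathlib import Path


def list_models(models_root):
    """Map each dataset name to its model directory, based on the `_model` suffix."""
    suffix = "_model"
    return {
        p.name[: -len(suffix)]: p
        for p in sorted(Path(models_root).iterdir())
        if p.is_dir() and p.name.endswith(suffix)
    }


# Demonstration on a temporary directory with two placeholder model folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["hia_hou_model", "caco2_wang_model"]:
        (Path(tmp) / name).mkdir()
    found = sorted(list_models(tmp))

print(found)
```

In this notebook, `len(list_models("models"))` should return 21 if every model directory was extracted.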

This concludes our exercise. 