# Training MMDetection Mask-RCNN Model on SageMaker Distributed Cluster

## Motivation
[MMDetection](https://github.com/open-mmlab/mmdetection) is a popular open-source deep learning framework focused on computer vision models and use cases. It provides higher-level APIs for model training and inference, reports [state-of-the-art benchmarks](https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo) for a variety of model architectures, and ships an extensive Model Zoo.

In this notebook, we build a custom training container with the MMDetection library and then train a Mask-RCNN model from scratch on the [COCO2017 dataset](https://cocodataset.org/#home), using the SageMaker distributed [training feature](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) to reduce training time.

### Preconditions
- To run this notebook, you need the COCO 2017 training and validation datasets uploaded to an S3 bucket that the Amazon SageMaker service can access.


## Building Training Container

Amazon SageMaker lets you bring your own (BYO) containers for training, data processing, and inference. In our case, we need to build a custom training container, which will be pushed to the [ECR service](https://aws.amazon.com/ecr/) in your AWS account.

To do this, we need to log in to the public ECR that hosts the SageMaker base images and to your private ECR repository.

In [1]:
import sagemaker, boto3

session = sagemaker.Session()
region = session.boto_region_name
account = boto3.client('sts').get_caller_identity().get('Account')
bucket = session.default_bucket()

In [2]:
# login to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# login to your private ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
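
Both login commands target a registry hostname built from an account ID and a region. The sketch below shows that composition in isolation. The helper names `ecr_registry` and `image_uri` are hypothetical, the account `123456789012` is a placeholder, and the hostname pattern is assumed to apply to the standard AWS partition only.

```python
# Account ID of the Deep Learning Containers registry used in the login cell above.
DLC_ACCOUNT = "763104351884"


def ecr_registry(account: str, region: str) -> str:
    """Return the ECR registry hostname for an account and region (standard partition)."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com"


def image_uri(account: str, region: str, repository: str, tag: str = "latest") -> str:
    """Return a full image URI of the form <registry>/<repository>:<tag>."""
    return f"{ecr_registry(account, region)}/{repository}:{tag}"


if __name__ == "__main__":
    # Placeholder values; in the notebook they come from the SageMaker session and STS.
    print(ecr_registry(DLC_ACCOUNT, "us-west-2"))
    print(image_uri("123456789012", "us-west-2", "gabsa-training"))
```

The second helper produces the same string that the notebook later assembles with `str.format` for the `image` variable.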


In [2]:
! pygmentize -l docker Dockerfile.training

ARG REGISTRY_URI
FROM ${REGISTRY_URI}/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04

RUN mkdir -p /opt/ml/model

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
ENV G_ABSA /opt/ml/code/Generative-ABSA-SageMaker

##########################################################################################
# SageMaker requirements
##########################################################################################
## install flask
RUN pip install networkx==2.3 flask gevent gunicorn boto3 transformers==4.6.0 datasets
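
The Dockerfile declares `ARG REGISTRY_URI`, so the base-image registry has to be supplied at build time. The build script shown in this notebook is truncated before its `docker build` step, so the following is only a sketch of how such a command could be assembled. The function name `docker_build_cmd` is hypothetical, and the command is printed rather than executed.

```python
import shlex
from typing import List


def docker_build_cmd(image: str, tag: str, dockerfile: str, registry_uri: str) -> List[str]:
    """Assemble a `docker build` command that fills the Dockerfile's REGISTRY_URI argument."""
    return [
        "docker", "build",
        "-t", f"{image}:{tag}",
        "-f", dockerfile,
        "--build-arg", f"REGISTRY_URI={registry_uri}",
        ".",
    ]


if __name__ == "__main__":
    # Registry hostname of the Deep Learning Containers account used in the login cell.
    cmd = docker_build_cmd(
        "gabsa-training", "latest", "Dockerfile.training",
        "763104351884.dkr.ecr.us-west-2.amazonaws.com",
    )
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to run it
```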

In [14]:
! ./build_and_push.sh gabsa-training latest Dockerfile.training

set -e
# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
image=$1

if [ "$image" == "" ]
then
    echo "Use image name absa"
    image="absa"
fi

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
    exit 255
fi

# Get the region defined in the current configuration
region=$(aws configure get region)
#regions=$(aws ec2 describe-regions --all-regions --query "Regions[].{Name:RegionName}" --output text)

#for region in $regions; do

#aws s3 cp s3://aws-solutions-${region}/spot-bot-models/cars/model.tar.gz ./
#tar zxvf model.tar.gz
# TODO: update regional location based on https://amazonaws-china.com/releasenotes/available-deep-learning-containers-

In [4]:
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

In [5]:
from time import gmtime, strftime

prefix_input = 'gabsa-input'
prefix_output = 'gabsa-output'

In [6]:
container = "gabsa-training" # your container name
tag = "latest"
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, container, tag)

In [17]:
hyperparameters = {
    "task" : "tasd", 
    "data_root" : "/opt/ml/input/data/training",
    "dataset" : "rest16",
    "n_gpu":"4",
    "model_name_or_path" : "t5-base", 
    "paradigm": "extraction",
    "gradient_accumulation_steps": "2",
    "eval_batch_size" :"16",
    "train_batch_size" :"2",
    "learning_rate" :"3e-4",
    "num_train_epochs":"2",
    "out_dir":"/opt/ml/output",
    "nodes":"2",
}
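
The training log further down shows these hyperparameters reaching the entry script as `--key value` arguments in alphabetical order. The sketch below mimics that behavior and parses the result with `argparse`. It is not the training toolkit's actual implementation. The helper names are hypothetical, and the effective batch size assumes one process per GPU with `train_batch_size` applied per device.

```python
import argparse
from typing import Dict, List


def hyperparameters_to_argv(hyperparameters: Dict[str, str]) -> List[str]:
    """Flatten a hyperparameter dict into ['--key', 'value', ...], sorted by key."""
    argv: List[str] = []
    for key in sorted(hyperparameters):
        argv += [f"--{key}", str(hyperparameters[key])]
    return argv


def build_parser() -> argparse.ArgumentParser:
    """Parser for a subset of the hyperparameters, with explicit type conversion."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--task", type=str)
    parser.add_argument("--n_gpu", type=int)
    parser.add_argument("--nodes", type=int)
    parser.add_argument("--train_batch_size", type=int)
    parser.add_argument("--gradient_accumulation_steps", type=int)
    parser.add_argument("--learning_rate", type=float)
    return parser


def effective_batch_size(args: argparse.Namespace) -> int:
    """Samples per optimizer step, assuming one process per GPU (an assumption)."""
    return args.train_batch_size * args.gradient_accumulation_steps * args.n_gpu * args.nodes


if __name__ == "__main__":
    hp = {"task": "tasd", "dataset": "rest16", "n_gpu": "4", "nodes": "2",
          "train_batch_size": "2", "gradient_accumulation_steps": "2",
          "learning_rate": "3e-4"}
    # parse_known_args tolerates keys the parser does not declare (here: dataset).
    args, unknown = build_parser().parse_known_args(hyperparameters_to_argv(hp))
    print(args, unknown, effective_batch_size(args))
```

Because every value arrives as a string, the `type=` converters matter: without them, `n_gpu` would be `"4"` rather than `4`.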

In [13]:
!aws s3 cp --recursive ../data s3://$bucket/data

upload: ../data/aope/raw-data/16res/test.tsv to s3://sagemaker-us-west-2-847380964353/data/aope/raw-data/16res/test.tsv
upload: ../data/aope/laptop14/dev.txt to s3://sagemaker-us-west-2-847380964353/data/aope/laptop14/dev.txt
upload: ../data/aope/raw-data/15res/train.tsv to s3://sagemaker-us-west-2-847380964353/data/aope/raw-data/15res/train.tsv
upload: ../data/aope/laptop14/test.txt to s3://sagemaker-us-west-2-847380964353/data/aope/laptop14/test.txt
upload: ../data/aope/laptop14/train.txt to s3://sagemaker-us-west-2-847380964353/data/aope/laptop14/train.txt
upload: ../data/aope/rest14/dev.txt to s3://sagemaker-us-west-2-847380964353/data/aope/rest14/dev.txt
upload: ../data/aope/raw-data/14res/test.tsv to s3://sagemaker-us-west-2-847380964353/data/aope/raw-data/14res/test.tsv
upload: ../data/aope/rest15/dev.txt to s3://sagemaker-us-west-2-847380964353/data/aope/rest15/dev.txt
upload: ../data/aope/raw-data/15res/test.tsv to s3://sagemaker-us-west-2-847380964353/data/aope/raw-data/15res

In [18]:
est = sagemaker.estimator.Estimator(image,
                                          role=role,
                                          train_instance_count=2,
                                          train_instance_type='ml.p3.8xlarge',
                                          train_volume_size=100,
                                          output_path="s3://{}/{}".format(bucket, prefix_output),
                                          hyperparameters = hyperparameters, 
                                          sagemaker_session=session
)

est.fit({"training" : "s3://"+bucket+"/data/"})

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
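
The warnings above come from the version 1 argument names. In `sagemaker>=2` they are `instance_count`, `instance_type`, and `volume_size`, and the image is passed as `image_uri`. The sketch below collects the same settings under the new names. The helper `estimator_kwargs_v2` is hypothetical and only builds a dictionary, so it runs without the SDK or AWS credentials.

```python
from typing import Any, Dict


def estimator_kwargs_v2(image_uri: str, role: str, bucket: str, prefix_output: str,
                        hyperparameters: Dict[str, str]) -> Dict[str, Any]:
    """Return the Estimator settings from the cell above, using sagemaker>=2 names."""
    return {
        "image_uri": image_uri,              # first positional argument in the cell above
        "role": role,
        "instance_count": 2,                 # was train_instance_count
        "instance_type": "ml.p3.8xlarge",    # was train_instance_type
        "volume_size": 100,                  # was train_volume_size (GB)
        "output_path": f"s3://{bucket}/{prefix_output}",
        "hyperparameters": hyperparameters,
    }


# In the notebook (requires the sagemaker package and AWS credentials):
#   est = sagemaker.estimator.Estimator(
#       sagemaker_session=session,
#       **estimator_kwargs_v2(image, role, bucket, prefix_output, hyperparameters),
#   )
```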


2022-01-10 07:30:20 Starting - Starting the training job...
2022-01-10 07:30:46 Starting - Launching requested ML instances
ProfilerReport-1641799820: InProgress
.........
2022-01-10 07:32:15 Starting - Preparing the instances for training......
2022-01-10 07:33:17 Downloading - Downloading input data
2022-01-10 07:33:17 Training - Downloading the training image...................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-10 07:36:23,405 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-01-10 07:36:23,436 sagemaker-training-toolkit INFO     Failed to parse hyperparameter data_root value /opt/ml/input/data/training to Json.
Returning the value itself
2022-01-10 07:36:23,436 sagemaker-training-toolkit INFO     Failed to parse hyperparameter paradigm value extraction to Json.
Returning the value itself

UnexpectedStatusException: Error for Training job gabsa-training-2022-01-10-07-30-20-217: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 G_ABSA_train.py --data_root /opt/ml/input/data/training --dataset rest16 --eval_batch_size 16 --gradient_accumulation_steps 2 --learning_rate 0.0003 --model_name_or_path t5-base --n_gpu 4 --nodes 2 --num_train_epochs 2 --out_dir /opt/ml/output --paradigm extraction --task tasd --train_batch_size 2"
Traceback (most recent call last):
  File "G_ABSA_train.py", line 279, in save_model
    checkpoint_path = os.path.join(work_dir, last_model)
  File "/opt/conda/lib/python3.6/posixpath.py", line 94, in join
    genericpath._check_arg_types('join', a, *p)
  File "/opt/conda/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G_ABSA_train.py", line 361, in <module>
    save_model(args.output_dir, os.environ['SM_MOD
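
The `TypeError` says that `os.path.join` received `None`. One plausible reading, not confirmed by the truncated traceback, is that the job passed `--out_dir` while the script read `args.output_dir`, leaving the directory unset. A defensive way to resolve the directory could look like the sketch below. The function `resolve_output_dir` is hypothetical. The fallbacks are the `SM_MODEL_DIR` environment variable that SageMaker training containers set, and the `/opt/ml/model` path that the Dockerfile creates.

```python
import os
from typing import Optional


def resolve_output_dir(cli_value: Optional[str],
                       env_var: str = "SM_MODEL_DIR",
                       default: str = "/opt/ml/model") -> str:
    """Pick the first non-empty directory: CLI value, then environment, then default."""
    for candidate in (cli_value, os.environ.get(env_var), default):
        if candidate:
            return candidate
    # Only reachable when the caller also passes an empty default.
    raise ValueError(f"No output directory given and {env_var} is not set")


if __name__ == "__main__":
    # With a missing CLI value, the join no longer receives None.
    work_dir = resolve_output_dir(None)
    print(os.path.join(work_dir, "last_model.pt"))  # file name is a placeholder
```

Failing early with a descriptive `ValueError` also makes the cause easier to spot than the nested exceptions in the log.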

In [5]:
!nvidia-smi


Fri Jan  7 10:02:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   44C    P0    37W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   38C    P0    37W / 300W |      0MiB / 16160MiB |      0%      Default |
|       
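
The hyperparameters earlier in this notebook hard-code `n_gpu` and `nodes`. Inside a SageMaker training container the same information is usually available from the `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` environment variables, so a script could derive the layout instead. The sketch below assumes those variables are present and uses a hypothetical helper name. The ranking convention (sorted host names) is an assumption, not something the training script in this notebook is known to do.

```python
import json
from typing import Dict, Mapping


def distributed_layout(env: Mapping[str, str]) -> Dict[str, int]:
    """Derive node count, GPUs per node, node rank, and world size from SM_* variables."""
    hosts = sorted(json.loads(env["SM_HOSTS"]))      # e.g. ["algo-1", "algo-2"]
    gpus_per_node = int(env["SM_NUM_GPUS"])
    node_rank = hosts.index(env["SM_CURRENT_HOST"])  # position in the sorted host list
    return {
        "nodes": len(hosts),
        "gpus_per_node": gpus_per_node,
        "node_rank": node_rank,
        "world_size": len(hosts) * gpus_per_node,
    }


if __name__ == "__main__":
    # Simulated environment for two ml.p3.8xlarge instances (4 GPUs each).
    fake_env = {"SM_HOSTS": '["algo-1", "algo-2"]',
                "SM_CURRENT_HOST": "algo-2",
                "SM_NUM_GPUS": "4"}
    print(distributed_layout(fake_env))
```

In a real job, pass `os.environ` instead of the simulated mapping.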