# Extending SageMaker Training Container

## Overview
In this notebook, we will learn how to use a SageMaker container as the base image for a custom container image. Extending a pre-built container is beneficial in the following scenarios:
- you need to add dependencies (for instance, ones that must be compiled from source) or significantly reconfigure the runtime environment.
- you want to minimize the development and testing effort for your container and rely, for the most part, on base container functionality that AWS has already tested.

## Problem Statement
We will reuse the code assets from the previous notebook in this chapter, where we trained and deployed an NLP model to classify articles based on their content. However, unlike the previous container, we will modify the runtime environment and *install the latest HuggingFace Transformers directly from its GitHub repository*. This modification will be implemented in our custom container image.

## Developing Training Container

First, we need to identify which base image to use. AWS publishes the list of all available Deep Learning Containers here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md

Since we plan to reinstall the HuggingFace Transformers library from scratch anyway, we can choose a PyTorch base image. At the time of writing, the latest PyTorch SageMaker container was `763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.0-gpu-py38-cu111-ubuntu20.04`.

*Note: this container URI is for the AWS `us-east-1` region and will be different in other AWS regions. Please consult the AWS page referenced above for the correct URI for your region.*
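
To make the regional dependency concrete, the sketch below composes the base image URI from its parts. The helper name `pytorch_training_image_uri` is hypothetical, and the code assumes the registry account ID shown above also applies in the target region. That holds for many commercial regions but not all of them, so always verify the result against the AWS list of available images.

```python
# Assumption: this registry account ID is valid for the target region.
# Some regions use a different account ID (and some a different domain),
# so check the AWS "available images" page before relying on the result.
DLC_ACCOUNT = "763104351884"
DEFAULT_TAG = "1.9.0-gpu-py38-cu111-ubuntu20.04"


def pytorch_training_image_uri(region, tag=DEFAULT_TAG, account=DLC_ACCOUNT):
    """Compose a PyTorch training container URI for the given region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/pytorch-training:{tag}"


print(pytorch_training_image_uri("us-east-1"))
print(pytorch_training_image_uri("eu-west-1"))
```

If you change the region, remember to update the `FROM` line of the Dockerfile and the `docker login` commands accordingly.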

To build a new container, we will need to:
- create a Dockerfile with runtime instructions.
- build the container image locally.
- push the new container image to a container registry. In this example, we will use Amazon Elastic Container Registry (ECR), a managed AWS service that is well integrated with the SageMaker ecosystem.
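
The build and push steps above roughly correspond to three Docker commands. The sketch below only assembles and prints them; it does not run anything. The helper `build_and_push_commands` and the account ID `123456789012` are hypothetical placeholders, and the code assumes that the ECR repository already exists and that you have already logged in to the registry.

```python
import shlex


def build_and_push_commands(account, region, image_name, tag, dockerfile, context="."):
    """Return the docker commands to build, tag, and push an image to ECR."""
    local_image = f"{image_name}:{tag}"
    remote_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"
    return [
        # Build the image locally from the given Dockerfile and build context.
        ["docker", "build", "-t", local_image, "-f", dockerfile, context],
        # Tag the local image with its full ECR URI.
        ["docker", "tag", local_image, remote_image],
        # Push the tagged image to the registry.
        ["docker", "push", remote_image],
    ]


commands = build_and_push_commands(
    account="123456789012",  # placeholder account ID
    region="us-east-1",
    image_name="extended-pytorch-training",
    tag="latest",
    dockerfile="2_sources/Dockerfile.training",
)
for command in commands:
    print(shlex.join(command))
```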


### Reviewing Dockerfile
Let's take a look at the key components of our Dockerfile (please run the `pygmentize` cell in this section to render the Dockerfile content):
- we use the SageMaker PyTorch image as a base.
- install the latest HuggingFace Transformers from GitHub.
- copy our training script from the previous lab into the container.
- define the `SAGEMAKER_SUBMIT_DIRECTORY` and `SAGEMAKER_PROGRAM` environment variables, so SageMaker knows which training script to execute at container start.
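
To illustrate the role of the two environment variables, the sketch below shows how they combine into the path of the script to run. This is a simplified illustration, not the actual code of the SageMaker training toolkit, and `resolve_entry_point` is a hypothetical helper.

```python
import os
import posixpath


def resolve_entry_point(env=os.environ):
    """Join the submit directory and program name into the script path."""
    submit_dir = env["SAGEMAKER_SUBMIT_DIRECTORY"]
    program = env["SAGEMAKER_PROGRAM"]
    # Containers are Linux-based, so POSIX path rules are used explicitly.
    return posixpath.join(submit_dir, program)


# Values taken from the Dockerfile in this notebook.
container_env = {
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/code",
    "SAGEMAKER_PROGRAM": "train.py",
}
print(resolve_entry_point(container_env))
```

This is also why the Dockerfile copies `train.py` to `$SAGEMAKER_SUBMIT_DIRECTORY/$SAGEMAKER_PROGRAM`: the script must sit exactly where the two variables say it does.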


In [25]:
!pygmentize -l docker 2_sources/Dockerfile.training

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.0-gpu-py38-cu111-ubuntu20.04

RUN pip3 install git+https://github.com/huggingface/transformers

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM train.py

COPY 1_sources/train.py $SAGEMAKER_SUBMIT_DIRECTORY/$SAGEMAKER_PROGRAM


### Building and Pushing Container Image

Once our Dockerfile is ready, we need to build the container image and push it to the registry. We start by authenticating with ECR.

In [31]:
import sagemaker, boto3
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
account = boto3.client('sts').get_caller_identity().get('Account')

In [None]:
# log in to the SageMaker ECR registry that hosts the Deep Learning Containers
!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# log in to your private ECR registry
!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {account}.dkr.ecr.us-east-1.amazonaws.com

Now we are ready to build the container and push it to ECR. This repo provides a utility script, `build_and_push.sh`, to automate this process.

In [None]:
image_name = "extended-pytorch-training"
tag = "latest"

!./build_and_push.sh {image_name} {tag} 2_sources/Dockerfile.training

### Scheduling Training Job

We now have our extended PyTorch container in ECR and are ready to run a SageMaker training job. The training job configuration is similar to the Script Mode example, with one notable difference: instead of the `HuggingFace` estimator object, we use the generic SageMaker `Estimator`, which supports custom images. Note that you need to update the `image_uri` parameter with the URI of your image in ECR. You can find it by navigating to the "ECR" service in your AWS Console and locating the extended container there.
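
As an alternative to looking up the URI in the console, you can compose it yourself, since ECR image URIs follow the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below assumes that `build_and_push.sh` pushed the image to a repository named after `image_name` in `us-east-1`; the script is not shown here, so please verify this against your ECR console. The helper `ecr_image_uri` and the account ID are hypothetical placeholders.

```python
def ecr_image_uri(account, region, image_name, tag):
    """Compose the URI of an image stored in a private ECR repository."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"


# Placeholder account ID; in this notebook you would pass the `account`,
# `image_name`, and `tag` variables defined in the earlier cells.
image_uri = ecr_image_uri("123456789012", "us-east-1", "extended-pytorch-training", "latest")
print(image_uri)
```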

In [14]:
hyperparameters = {
    "epochs": 1,
    # the two parameters below may need to be updated if a non-GPU instance is used for training
    "per-device-train-batch-size": 16,
    "per-device-eval-batch-size": 64,
    "warmup-steps": 100,
    "logging-steps": 100,
    "weight-decay": 0.01
}
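
In script mode, SageMaker hands these hyperparameters to the training script as command-line arguments of the form `--name value`. The sketch below imitates that hand-off and parses the arguments with `argparse`. It is an illustration only: `to_cli_args` and `parse_args` are hypothetical helpers, the real `train.py` from the previous notebook may parse its arguments differently, and only three of the hyperparameters are declared here to keep the example short.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...]."""
    cli_args = []
    for name, value in hyperparameters.items():
        cli_args += [f"--{name}", str(value)]
    return cli_args


def parse_args(argv):
    """Parse a subset of the hyperparameters; unknown ones are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per-device-train-batch-size", type=int, default=16)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    args, _unknown = parser.parse_known_args(argv)
    return args


example = {"epochs": 3, "per-device-train-batch-size": 8, "weight-decay": 0.05, "warmup-steps": 100}
args = parse_args(to_cli_args(example))
# argparse turns dashes in option names into underscores in attribute names.
print(args.epochs, args.per_device_train_batch_size, args.weight_decay)
```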

In [19]:
# Please provide the S3 URIs of the train and test datasets from the "Script Mode" example
train_dataset_uri="s3://<YOUR S3 BUCKET>/newsgroups/train_dataset.csv"
test_dataset_uri="s3://<YOUR S3 BUCKET>/newsgroups/test_dataset.csv"

In [None]:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<UPDATE WITH YOUR IMAGE URI FROM ECR>",
    hyperparameters=hyperparameters,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    role=role
)

estimator.fit({
    "train":train_dataset_uri,
    "test":test_dataset_uri
})
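
The keys passed to `fit()` define input channels. Inside the training container, each channel is made available under `/opt/ml/input/data/<channel>`, and the SageMaker training toolkit also exposes the location through an `SM_CHANNEL_<NAME>` environment variable. The sketch below shows how a training script could locate the data; `channel_dir` is a hypothetical helper, and the file names assume that the datasets keep their original S3 object names.

```python
import os
import posixpath


def channel_dir(channel, env=os.environ):
    """Return the local directory of an input channel inside the container."""
    default_dir = f"/opt/ml/input/data/{channel}"
    return env.get(f"SM_CHANNEL_{channel.upper()}", default_dir)


# An empty environment is passed here so the example falls back to the
# default channel locations and behaves the same outside a container.
train_file = posixpath.join(channel_dir("train", env={}), "train_dataset.csv")
test_file = posixpath.join(channel_dir("test", env={}), "test_dataset.csv")
print(train_file)
print(test_file)
```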

## Summary
In this notebook, you learned how to extend the SageMaker PyTorch training container to address specific runtime requirements with no code changes in the training script and minimal development effort.

In the next example, we will learn how to build a SageMaker-compatible container using the official TensorFlow image. This approach allows for maximum flexibility but requires more development effort.