# Extending SageMaker Training Container

## Overview
In this notebook we will learn how to use a SageMaker container as the base image for your own custom container image. Modifying pre-built containers can be beneficial in the following scenarios:
- you need to add dependencies (for instance, ones that must be compiled from source) or significantly reconfigure the runtime environment.
- you want to minimize the development and testing effort for your container and rely, for the most part, on the base container functionality already tested by AWS.

## Problem Statement
We will reuse the code assets from the previous notebook in this chapter, where we trained and deployed an NLP model to classify articles based on their content. This time, however, we will modify the runtime environment and install the latest HuggingFace Transformers directly from its GitHub repository. This modification will be implemented in our custom container image.

## Notice On Support
This notebook assumes that you build the container in an nvidia-docker runtime environment, in other words, a runtime environment with an NVIDIA GPU available. If you don't have one, you can switch to CPU-based containers. See the `INSTANCE_TYPE` parameter below, which defines whether the GPU or CPU version of the container is used.

## Developing Training Container

First, we need to identify which base image we will use. AWS publishes the list of all available Deep Learning Containers here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md

Since we plan to reinstall the HuggingFace Transformers library from scratch anyway, we can choose the PyTorch base image. We start by retrieving the URI of the SageMaker PyTorch training container. For this, we first define the framework versions and then use the `image_uris.retrieve()` utility to get the container URI.

In [23]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
#INSTANCE_TYPE = "ml.p2.xlarge" # uncomment this if your runtime has nvidia-docker (GPU)
INSTANCE_TYPE = "ml.m5.xlarge" # CPU-based instance; comment this out if you switch to GPU

In [22]:
import sagemaker

session = sagemaker.Session()
container_uri = sagemaker.image_uris.retrieve(
    "pytorch",
    session.boto_region_name,
    version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    image_scope="training",
    instance_type=INSTANCE_TYPE,
)
print(container_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.2-cpu-py38
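The returned URI packs several pieces of information: the registry account, the region, the repository name, and a tag that encodes the framework version, device type, and Python version. As an illustration only, the hypothetical helper `parse_image_uri` below (it is not part of the SageMaker SDK) splits a URI of this shape into its components. It assumes the URI always carries a tag and follows the standard `amazonaws.com` host layout.

```python
from typing import NamedTuple


class ImageUri(NamedTuple):
    account: str
    region: str
    repository: str
    tag: str


def parse_image_uri(uri: str) -> ImageUri:
    """Split a URI of the form
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>.

    Hypothetical helper for illustration; assumes a tag is present.
    """
    registry, _, image = uri.partition("/")
    repository, _, tag = image.rpartition(":")
    # Host layout: [account, "dkr", "ecr", region, "amazonaws", "com"]
    host_parts = registry.split(".")
    return ImageUri(host_parts[0], host_parts[3], repository, tag)


parsed = parse_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.2-cpu-py38"
)
print(parsed)
```

Inspecting the tag this way is a quick check that the CPU or GPU flavor of the base image matches the instance type you intend to train on.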


To build a new container we will need to:
- create a Dockerfile with runtime instructions.
- build the container image locally.
- push the new container image to a container registry. In this example we will use Elastic Container Registry (ECR), a managed AWS service that is well integrated with the SageMaker ecosystem.
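The steps above can be sketched as a sequence of Docker CLI commands. The function `build_and_push_commands` below is a hypothetical illustration and does not reproduce the contents of the `build_and_push.sh` script used later in this notebook. The account ID `123456789012` is a dummy value, and the sketch assumes the ECR repository already exists and that you are logged in to the registry.

```python
from typing import List


def build_and_push_commands(
    account: str, region: str, image_name: str, tag: str, dockerfile: str
) -> List[List[str]]:
    """Return the build, tag, and push commands for a private ECR repository.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    local = f"{image_name}:{tag}"
    remote = f"{account}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"
    return [
        # Build the image locally from the given Dockerfile
        ["docker", "build", "-t", local, "-f", dockerfile, "."],
        # Give the local image a name that points at the ECR repository
        ["docker", "tag", local, remote],
        # Upload the image to ECR
        ["docker", "push", remote],
    ]


commands = build_and_push_commands(
    "123456789012",  # dummy account ID
    "us-east-1",
    "extended-pytorch-training",
    "latest",
    "2_sources/Dockerfile.training",
)
for cmd in commands:
    print(" ".join(cmd))
```

Each command could then be executed with `subprocess.run(cmd, check=True)` so that a failing step stops the sequence.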


### Reviewing Dockerfile
Let's take a look at the key components of our Dockerfile (please execute the cell below to render the Dockerfile content):
- we use the SageMaker PyTorch image as a base.
- we install the latest HuggingFace Transformers from GitHub.
- we copy our training script from the previous lab into the container.
- we define the `SAGEMAKER_SUBMIT_DIRECTORY` and `SAGEMAKER_PROGRAM` environment variables, so SageMaker knows which training script to execute at container start.
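The `FROM` line of the Dockerfile rendered in the next cell contains the placeholder `<REPLACE_WITH_YOUR_CONTAINER_URI>`, which has to be replaced with the base image URI retrieved earlier. You can edit the file by hand; alternatively, a minimal sketch of doing it programmatically could look as follows. The helper `render_dockerfile` is hypothetical, and the template string is a shortened stand-in for the real file.

```python
PLACEHOLDER = "<REPLACE_WITH_YOUR_CONTAINER_URI>"


def render_dockerfile(template: str, base_image_uri: str) -> str:
    """Substitute the base image URI into a Dockerfile template.

    Hypothetical helper; raises if the placeholder is missing so that
    an already-edited file is not silently left unchanged.
    """
    if PLACEHOLDER not in template:
        raise ValueError(f"placeholder {PLACEHOLDER} not found in template")
    return template.replace(PLACEHOLDER, base_image_uri)


# Shortened stand-in for 2_sources/Dockerfile.training
template = (
    "FROM <REPLACE_WITH_YOUR_CONTAINER_URI>\n"
    "RUN pip3 install git+https://github.com/huggingface/transformers\n"
)
rendered = render_dockerfile(
    template,
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.2-cpu-py38",
)
print(rendered)
```

In the notebook you would read the template from `2_sources/Dockerfile.training`, pass `container_uri` as the second argument, and write the result back before building.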

In [1]:
!pygmentize -l docker 2_sources/Dockerfile.training

FROM <REPLACE_WITH_YOUR_CONTAINER_URI>

RUN pip3 install git+https://github.com/huggingface/transformers

ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
ENV SAGEMAKER_PROGRAM train.py

COPY 1_sources/train.py  $SAGEMAKER_SUBMIT_DIRECTORY/$SAGEMAKER_PROGRAM


### Building and Pushing Container Image

Once our Dockerfile is ready, we need to build the container image and push it to the registry. We start by authenticating with ECR.

In [25]:
import sagemaker, boto3
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
account = boto3.client('sts').get_caller_identity().get('Account')
region = session.boto_region_name

In [26]:
# log in to the SageMaker ECR registry that hosts the Deep Learning Containers
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# log in to your private ECR registry
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

Login Succeeded

Logging in with your password grants your terminal complete access to your account. 
For better security, log in with a limited-privilege personal access token. Learn more at https://docs.docker.com/go/access-tokens/
Login Succeeded

Logging in with your password grants your terminal complete access to your account. 
For better security, log in with a limited-privilege personal access token. Learn more at https://docs.docker.com/go/access-tokens/


Now we are ready to build the container image and push it to ECR. To automate this process, this repo provides a utility script, `build_and_push.sh`.

In [27]:
image_name = "extended-pytorch-training"
tag = "latest"

!./build_and_push.sh {image_name} {tag} 2_sources/Dockerfile.training

Working in region us-east-1

usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:

  aws help
  aws <command> help
  aws <command> <subcommand> help

aws: error: argument operation: Invalid choice, valid choices are:

batch-check-layer-availability           | batch-delete-image                      
batch-get-image                          | complete-layer-upload                   
create-repository                        | delete-lifecycle-policy                 
delete-registry-policy                   | delete-repository                       
delete-repository-policy                 | describe-image-scan-findings            
describe-images                          | describe-registry                       
describe-repositories                    | get-authorization-token                 
get-download-url-for-layer               | get-lifecycle-policy                    
get-lifecycle-policy-preview             | get-registry-

### Scheduling Training Job

We now have our extended PyTorch container in ECR and are ready to run a SageMaker training job. The training job configuration is similar to the Script Mode example, with one notable difference: instead of the `HuggingFace` estimator we will use the generic SageMaker `Estimator`, which allows working with custom images. Note that you need to set the `image_uri` parameter to the URI of your image in ECR. You can find it by navigating to the "ECR" service in your AWS Console and locating the extended container there.
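As an alternative to looking the URI up in the console, it can be composed from values already known in this notebook, since private ECR image URIs follow a fixed pattern. The sketch below is a hypothetical helper; it assumes the image was pushed under the same name and tag that were passed to the build script, and that you work in a standard AWS partition (for example, the China regions use a different domain suffix). The account ID shown is a dummy value.

```python
def ecr_image_uri(account: str, region: str, image_name: str, tag: str) -> str:
    """Compose the URI of an image stored in a private ECR repository.

    Hypothetical helper; assumes the standard amazonaws.com domain.
    """
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"


example_uri = ecr_image_uri(
    "123456789012",  # dummy account ID
    "us-east-1",
    "extended-pytorch-training",
    "latest",
)
print(example_uri)
```

In the notebook you would pass the `account`, `region`, `image_name`, and `tag` variables defined in the earlier cells.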

In [14]:
hyperparameters = {
    "epochs":1,
    # the 2 params below may need to be updated if a non-GPU instance is used for training
    "per-device-train-batch-size":16, 
    "per-device-eval-batch-size":64,
    "warmup-steps":100,
    "logging-steps":100,
    "weight-decay":0.01    
}
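In SageMaker framework containers, hyperparameters are typically handed to the training script as command-line arguments of the form `--name value`. The training script from the previous lab is not shown here, so the parser below is only an assumed sketch of how such a script might read these values; `to_cli_args` and `parse_args` are hypothetical names.

```python
import argparse
from typing import Dict, List, Sequence


def to_cli_args(hyperparameters: Dict[str, object]) -> List[str]:
    """Mimic how hyperparameters are turned into `--name value` arguments."""
    args: List[str] = []
    for name, value in hyperparameters.items():
        args.extend([f"--{name}", str(value)])
    return args


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    """Assumed shape of the argument parser inside a training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per-device-train-batch-size", type=int, default=16)
    parser.add_argument("--per-device-eval-batch-size", type=int, default=64)
    parser.add_argument("--warmup-steps", type=int, default=100)
    parser.add_argument("--logging-steps", type=int, default=100)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    return parser.parse_args(argv)


example_hyperparameters = {
    "epochs": 1,
    "per-device-train-batch-size": 16,
    "per-device-eval-batch-size": 64,
    "warmup-steps": 100,
    "logging-steps": 100,
    "weight-decay": 0.01,
}
args = parse_args(to_cli_args(example_hyperparameters))
# argparse exposes hyphenated option names with underscores
print(args.per_device_train_batch_size, args.weight_decay)
```

Note that the keys use hyphens, while `argparse` exposes them as attributes with underscores, which is why the dictionary keys and the script's attribute names differ.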

In [19]:
# Please provide the S3 URIs of the train and test datasets from the "Script Mode" example
train_dataset_uri="s3://<YOUR S3 BUCKET>/newsgroups/train_dataset.csv"
test_dataset_uri="s3://<YOUR S3 BUCKET>/newsgroups/test_dataset.csv"

In [None]:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<UPDATE WITH YOUR IMAGE URI FROM ECR>",
    hyperparameters=hyperparameters,
    instance_type=INSTANCE_TYPE,  # defined earlier; must match the CPU/GPU flavor of the base image
    instance_count=1,
    role=role
)

estimator.fit({
    "train":train_dataset_uri,
    "test":test_dataset_uri
})
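The keys passed to `fit()` become input channel names. Inside the training container, each channel is typically made available under `/opt/ml/input/data/<channel>`, and the SageMaker training toolkit exposes the path through an `SM_CHANNEL_<NAME>` environment variable. The hypothetical helper `channel_dir` below sketches how a training script could resolve these locations; the fallback path is used when the variable is not set, for example during local testing.

```python
import os
from typing import Mapping, Optional


def channel_dir(channel_name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the local directory of a SageMaker input channel.

    Hypothetical helper: prefers the SM_CHANNEL_<NAME> variable and falls
    back to the conventional /opt/ml/input/data/<channel> location.
    """
    env = os.environ if environ is None else environ
    default = f"/opt/ml/input/data/{channel_name}"
    return env.get(f"SM_CHANNEL_{channel_name.upper()}", default)


# With an empty environment the conventional path is returned
print(channel_dir("train", environ={}))
print(channel_dir("test", environ={}))
```

With the channels used above, a script would look for `train_dataset.csv` inside the `train` channel directory and `test_dataset.csv` inside the `test` one.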

## Summary
In this notebook, you learned how to extend the SageMaker PyTorch training container to address specific runtime requirements with no code changes in the training script and minimal development effort.

In the next example we will learn how to build a SageMaker-compatible container using the official TensorFlow image. That approach offers maximum flexibility but requires more development effort.