# Extending SageMaker Training Container

## Overview
In this example, we will learn how to extend a pre-built SageMaker container. This is useful in scenarios such as:
- You need to add dependencies (for instance, ones that must be compiled from source) or significantly reconfigure the runtime environment (e.g., update the CUDA version or configuration).
- You want to minimize the development and testing effort for your container and rely, for the most part, on base container functionality that AWS has already tested.

In this notebook, we will use a SageMaker container as the base image for a custom container image. We will modify the runtime environment and install the latest HuggingFace Transformers framework from the GitHub `main` branch.

### Prerequisites

1. This sample assumes that you have AWS CLI v2 installed. Refer to this article for installation details: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
2. To push containers to a private Amazon Elastic Container Registry ("ECR") repository, make sure that your current IAM role has sufficient permissions for this operation. Refer to this article for details: https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-push.html


## Developing Training Container

First off, we need to identify which base image to use. AWS publishes the list of available Deep Learning Containers here: https://github.com/aws/deep-learning-containers/blob/master/available_images.md

Since we plan to reinstall the HuggingFace Transformers library from scratch anyway, we can choose the PyTorch base image. We start by retrieving the URI of the SageMaker PyTorch training container. For this, we first define the framework versions, then use the `image_uris.retrieve()` utility to get the container URI for the specified Python and PyTorch versions and the target training instance type.

In [None]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
INSTANCE_TYPE = "ml.p2.xlarge"

In [None]:
import sagemaker

session = sagemaker.Session()
container_uri = sagemaker.image_uris.retrieve(
    "pytorch",
    session.boto_region_name,
    version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    image_scope="training",
    instance_type=INSTANCE_TYPE,
)
print(f"Pre-built SageMaker DL container: {container_uri}")

## Reviewing Dockerfile

To build a new container, we need to:
- Create a Dockerfile with runtime instructions.
- Build the container image locally.
- Push the new container image to a container registry. In this example we use Amazon Elastic Container Registry (ECR), a managed AWS service that is well integrated with the SageMaker ecosystem.

Let's take a look at the key components of our Dockerfile (execute the cell below to render its content):
- We use the SageMaker PyTorch image as the base. Please update the base image in the Dockerfile with the URI from `container_uri`.
- We install the latest HuggingFace Transformers framework from the GitHub `main` branch.
- We copy our training script from the previous lab into the container.
- We define the `SAGEMAKER_SUBMIT_DIRECTORY` and `SAGEMAKER_PROGRAM` environment variables, so SageMaker knows which training script to execute at container start.
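
The four Dockerfile components listed above can be sketched as a small Python helper that renders the Dockerfile text for a given base image URI. This is only an illustration under stated assumptions, not the content of `2_sources/Dockerfile.training`: the script name `train.py`, the code directory `/opt/ml/code`, and the helper name `render_dockerfile` are hypothetical.

```python
def render_dockerfile(base_image_uri, script_name="train.py", code_dir="/opt/ml/code"):
    """Return Dockerfile text that extends `base_image_uri`.

    `script_name` and `code_dir` are illustrative assumptions; adjust them
    to match your own training script and layout.
    """
    lines = [
        # Start from the pre-built SageMaker PyTorch training image
        f"FROM {base_image_uri}",
        "",
        "# Install the latest Transformers from the GitHub main branch",
        "RUN pip install git+https://github.com/huggingface/transformers",
        "",
        "# Copy the training script into the image",
        f"COPY {script_name} {code_dir}/{script_name}",
        "",
        "# Tell SageMaker where the code lives and which script to run",
        f"ENV SAGEMAKER_SUBMIT_DIRECTORY={code_dir}",
        f"ENV SAGEMAKER_PROGRAM={script_name}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder URI for illustration; in the notebook you would pass `container_uri`
dockerfile_text = render_dockerfile("<account>.dkr.ecr.<region>.amazonaws.com/pytorch-training:<tag>")
print(dockerfile_text)
```

Rendering the file from `container_uri` is one way to avoid editing the base image URI by hand; writing the result to disk is left out here so that the sketch does not overwrite the Dockerfile shipped with this example.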

In [None]:
!pygmentize -l docker 2_sources/Dockerfile.training

### Building and Pushing Container Image

Once our Dockerfile is ready, we need to build the container image and push it to the registry. We start by authenticating with the AWS-owned ECR registry that hosts the Deep Learning Containers and with your private ECR registry, which will host our extended image. For this, we first retrieve the `account` ID from AWS STS and the `region` from the SageMaker session instance.

In [None]:
import sagemaker, boto3
from sagemaker import get_execution_role

session = sagemaker.Session()
role = get_execution_role()
account = boto3.client('sts').get_caller_identity().get('Account')
region = session.boto_region_name

Next, we run `docker login` for both registries: the one hosting the Deep Learning Containers and your private one.
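
Each login targets a registry host of the form `<account>.dkr.ecr.<region>.amazonaws.com`. As a minimal sketch, the host of the base image registry could be derived from `container_uri` instead of being typed by hand. The helper name `parse_ecr_uri` and the example tag below are assumptions made for illustration.

```python
import re

# Matches <account>.dkr.ecr.<region>.amazonaws.com[.cn]/<repository>[:<tag>]
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[^:@]+)(?::(?P<tag>.+))?$"
)


def parse_ecr_uri(uri):
    """Split an ECR image URI into account, region, repository, tag, and registry host."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not an ECR image URI: {uri!r}")
    parts = match.groupdict()
    # The registry host is everything before the first slash
    parts["registry"] = uri.split("/", 1)[0]
    return parts


# Example URI with a made-up tag; in the notebook you would pass `container_uri`
example = parse_ecr_uri("763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:example-tag")
print(example["registry"])
```

The returned `registry` value can then be used as the last argument of `docker login`.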


In [None]:
# Log in to the ECR registry that hosts the SageMaker Deep Learning Containers
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# Log in to your private ECR registry
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

Now we are ready to build the container and push it to the private ECR registry. This repo provides a utility script, `build_and_push.sh`, to automate this process.
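
The contents of `build_and_push.sh` are not shown in this notebook. As a rough sketch of what a build-and-push sequence typically involves, the helper below only assembles the commands without running them. The function name, the `latest` tag, and the `.` build context are assumptions, and the real script may differ.

```python
import shlex


def build_and_push_commands(account, region, image_name, dockerfile, tag="latest"):
    """Return the CLI commands for a typical ECR build-and-push flow (not executed here)."""
    registry = f"{account}.dkr.ecr.{region}.amazonaws.com"
    full_name = f"{registry}/{image_name}:{tag}"
    return [
        # Create the target repository (returns an error if it already exists)
        ["aws", "ecr", "create-repository", "--repository-name", image_name, "--region", region],
        # Build the image locally from the given Dockerfile; "." as build context is an assumption
        ["docker", "build", "-t", image_name, "-f", dockerfile, "."],
        # Tag the local image with the fully qualified ECR name
        ["docker", "tag", image_name, full_name],
        # Push the tagged image to the private registry
        ["docker", "push", full_name],
    ]


# Placeholder account and region, for illustration only
for command in build_and_push_commands(
    "123456789012", "us-east-1", "extended-pytorch-training", "2_sources/Dockerfile.training"
):
    print(shlex.join(command))
```

To actually run the sequence, each command list could be passed to `subprocess.run(command, check=True)` after the `docker login` step has succeeded.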


In [None]:
image_name = "extended-pytorch-training"
image_uri = f"{account}.dkr.ecr.{region}.amazonaws.com/{image_name}"

!./build_and_push.sh {image_name} 2_sources/Dockerfile.training

### Scheduling Training Job

Now that our extended PyTorch container is in ECR, we are ready to run a SageMaker training job. The training job configuration is similar to the Script Mode example, with one notable difference: instead of the `HuggingFace` estimator we use the generic SageMaker `Estimator`, which allows working with custom images.
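
With the SageMaker training toolkit that ships in the base image, the hyperparameters configured for the job reach the training script as command-line arguments (for example `--epochs 1`), and each input channel is exposed through an environment variable such as `SM_CHANNEL_TRAIN`. The sketch below shows how a script might read them. It assumes the script uses `argparse` with argument names that mirror the hyperparameter keys; the actual training script from the previous lab may be organized differently.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed by SageMaker as --key value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per-device-train-batch-size", type=int, default=16)
    parser.add_argument("--per-device-eval-batch-size", type=int, default=64)
    parser.add_argument("--warmup-steps", type=int, default=100)
    parser.add_argument("--logging-steps", type=int, default=100)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    # Channel directories; the variables are unset (None) outside a training job
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test-dir", default=os.environ.get("SM_CHANNEL_TEST"))
    return parser.parse_args(argv)


# Simulate the arguments a training job would pass to the script
args = parse_args(["--epochs", "2", "--weight-decay", "0.05"])
print(args.epochs, args.weight_decay, args.per_device_train_batch_size)
```

Note that `argparse` converts dashes in option names to underscores, so `--per-device-train-batch-size` is read as `args.per_device_train_batch_size`.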

In [None]:
hyperparameters = {
    "epochs": 1,
    # The two batch-size parameters below may need to be updated if a non-GPU instance is used for training
    "per-device-train-batch-size": 16,
    "per-device-eval-batch-size": 64,
    "warmup-steps": 100,
    "logging-steps": 100,
    "weight-decay": 0.01,
}

In [None]:
# Provide the S3 URIs of the train and test datasets from the "Script Mode" example (`1_Using_SageMaker_Script_Mode.ipynb` notebook)
train_dataset_uri = "s3://<YOUR S3 BUCKET>/newsgroups/train_dataset.csv"
test_dataset_uri = "s3://<YOUR S3 BUCKET>/newsgroups/test_dataset.csv"

In [None]:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    hyperparameters=hyperparameters,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    role=role
)

estimator.fit({
    "train":train_dataset_uri,
    "test":test_dataset_uri
})

### Resource Cleanup

Execute the cell below to delete the ECR repository created in this notebook, together with the images it contains.

In [None]:
import boto3

# Delete the ECR repository; force=True also removes the images stored in it
ecr = boto3.client("ecr")
ecr.delete_repository(repositoryName=image_name, force=True)

## Summary
In this notebook, you learned how to extend the SageMaker PyTorch training container to address specific runtime requirements with no code changes in the training script and minimal development effort.

In the next example, we will learn how to build a SageMaker-compatible container from the official TensorFlow image. This approach allows maximum flexibility but requires more development effort.