In [None]:
!./build_and_push.sh yolov5

## Testing your algorithm on your local machine

When you're packaging your first algorithm to use with Amazon SageMaker, you probably want to test it yourself to make sure it's working correctly. We use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to test both locally and on SageMaker. For more examples with the SageMaker Python SDK, see [Amazon SageMaker Examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk). In order to test our algorithm, we need our dataset.

# Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new technology. But they are not difficult and can significantly simplify the deployment of your software packages.

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine, except that the container creates a fully self-contained environment for the program to run in. Containers are isolated from each other and from the host environment, so the way your program is set up is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like `conda` or `virtualenv` because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands and environment variables.

A Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run simultaneously on the same physical or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. An example is provided below. You can build your Docker images based on Docker images built by yourself or by others, which can simplify things quite a bit.

Docker has become very popular in programming and `devops` communities due to its flexibility and its well-defined specification of how code can be run in its containers. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in one way for training and another, slightly different, way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [`Dockerfile` reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### How Amazon SageMaker runs your Docker container

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container.

* In this example, we don't define an `ENTRYPOINT` in the `Dockerfile`, so Docker runs the command [`train` at training time](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html) and [`serve` at serving time](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html). In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the `Dockerfile`, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the `Dockerfile` and ignore (or verify) the first argument passed in.
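The second option above, a single `ENTRYPOINT` program that inspects its first argument, could look roughly like the sketch below. The function names `run_training`, `run_serving`, and `dispatch` are hypothetical placeholders, not part of this example's container.

```python
def run_training():
    # Placeholder: a real container would start its training loop here.
    return "training"


def run_serving():
    # Placeholder: a real container would start its web server here.
    return "serving"


def dispatch(argv):
    """Call the handler for the mode in argv[1], which SageMaker sets to 'train' or 'serve'."""
    handlers = {"train": run_training, "serve": run_serving}
    if len(argv) < 2 or argv[1] not in handlers:
        # Exit with a usage message when the mode is missing or unknown.
        raise SystemExit("usage: entrypoint [train|serve]")
    return handlers[argv[1]]()
```

In an actual entry point script, the last line would be something like `dispatch(sys.argv)` after `import sys`.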

#### Running your container during training

When Amazon SageMaker runs training, your `train` script is run, as in a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:

```
    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure
```

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values are always strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to `CreateTrainingJob`, but it's generally important that channels match algorithm expectations. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
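Because hyperparameter values always arrive as strings, a training script typically converts them right after reading the config files. The sketch below shows one way to do that; the helper names and the `epochs`/`lr` hyperparameters in the usage note are illustrative assumptions, not part of this example's `train` script.

```python
import json
import os


def load_training_config(base_dir="/opt/ml"):
    """Read hyperparameters.json and resourceConfig.json from the input config directory."""
    config_dir = os.path.join(base_dir, "input", "config")

    with open(os.path.join(config_dir, "hyperparameters.json")) as f:
        raw_hyperparameters = json.load(f)  # every value is a string

    with open(os.path.join(config_dir, "resourceConfig.json")) as f:
        resource_config = json.load(f)

    return raw_hyperparameters, resource_config


def parse_hyperparameters(raw, defaults):
    """Cast each string value to the type of its default; missing keys keep the default.

    Only suitable for int, float, and str defaults: bool("False") is True,
    so boolean flags would need explicit handling.
    """
    parsed = dict(defaults)
    for name, default in defaults.items():
        if name in raw:
            parsed[name] = type(default)(raw[name])
    return parsed
```

For example, `parse_hyperparameters(raw, {"epochs": 10, "lr": 0.1})` would turn `{"epochs": "5"}` into `{"epochs": 5, "lr": 0.1}`.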

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker packages any files in this directory into a compressed tar archive file. This file is made available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file are returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it is ignored.
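One way to honor both conventions is to wrap the training function so that any exception ends up in the `failure` file. This is a minimal sketch under that assumption; `run_with_failure_report` and `train_fn` are hypothetical names, and the `base_dir` parameter exists only so the wrapper can be tried outside a container.

```python
import os
import traceback


def run_with_failure_report(train_fn, base_dir="/opt/ml"):
    """Run train_fn(model_dir); on error, write the reason to output/failure and return 1."""
    model_dir = os.path.join(base_dir, "model")
    output_dir = os.path.join(base_dir, "output")
    os.makedirs(model_dir, exist_ok=True)
    try:
        # train_fn is expected to save its model files under model_dir.
        train_fn(model_dir)
    except Exception as err:
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write("Exception during training: {}\n{}".format(err, traceback.format_exc()))
        return 1  # intended as a non-zero process exit code
    return 0
```

A `train` script could end with `sys.exit(run_with_failure_report(my_train))`, so that a failed run also exits with a non-zero status.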

#### Running your container during hosting

Hosting has a very different model from training because hosting responds to inference requests that come in via HTTP. In this example, we use [TensorFlow Serving](https://www.tensorflow.org/serving/); however, the hosting solution can be customized. One example is the [Python serving stack within the `scikit learn` example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).

Amazon SageMaker uses two URLs in the container:

* `/ping` receives `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in as well. 
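To make the two routes concrete, here is a minimal sketch of a WSGI application that answers both. It uses only the standard library and returns a placeholder response instead of running a model, so it illustrates the contract rather than this example's actual serving stack.

```python
import json


def app(environ, start_response):
    """Minimal WSGI app exposing the /ping and /invocations routes."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")

    if path == "/ping" and method == "GET":
        # Health check: a 200 status tells SageMaker the container is up.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b""]

    if path == "/invocations" and method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = environ["wsgi.input"].read(length)
        # Placeholder "prediction": a real container would run the model on payload.
        body = json.dumps({
            "received_bytes": len(payload),
            "content_type": environ.get("CONTENT_TYPE", ""),
        }).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

For a quick local try-out, `wsgiref.simple_server.make_server("", 8080, app).serve_forever()` would serve it on port 8080, the port SageMaker sends requests to; a production container would normally put a proper web server in front.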

The container has the model files in the same place that they were written to during training:

    /opt/ml
    `-- model
        `-- <model files>



## SageMaker Python SDK Local Training
To represent our training, we use the `Estimator` class, which needs to be configured with five parameters. The names below are those of version 2 of the SDK, as used in the code later in this notebook (version 1 called them `train_instance_count`, `train_instance_type`, and `image_name`).
1. `role` - our AWS IAM execution role.
2. `instance_count` - number of instances to use for training.
3. `instance_type` - type of instance to use for training. For training locally, we specify `local` or `local_gpu`.
4. `image_uri` - our custom PyTorch Docker image we created.
5. `hyperparameters` - hyperparameters we want to pass.
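As an illustration of the five settings listed above, the sketch below collects them for a local run and passes them to `Estimator`. The helper names are hypothetical, and the SDK import is deferred to the second function so the settings can be inspected without `sagemaker` installed. Local mode also assumes Docker is available on the machine.

```python
def local_estimator_kwargs(role, image_uri, hyperparameters=None, use_gpu=False):
    """Collect the five Estimator settings for a local-mode training run."""
    return {
        "role": role,
        "instance_count": 1,
        "instance_type": "local_gpu" if use_gpu else "local",
        "image_uri": image_uri,
        "hyperparameters": dict(hyperparameters or {}),
    }


def build_local_estimator(**kwargs):
    """Create the Estimator; requires the SageMaker Python SDK (v2 parameter names)."""
    from sagemaker.estimator import Estimator
    return Estimator(**kwargs)
```

Usage might look like `build_local_estimator(**local_estimator_kwargs(role, "yolov5:latest"))`, where the image tag is an assumption about how the local image was named.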

Let's start with setting up our IAM role. We make use of a helper function within the Python SDK. This function throws an exception if run outside of a SageMaker notebook instance, as it gets metadata from the notebook instance. If running outside, you must provide an IAM role with proper access stated above in [Permissions](#Permissions).

In [28]:
from sagemaker import get_execution_role

role = get_execution_role()

## Fit, Deploy, Predict

Now that the rest of our estimator is configured, we can call `fit()` with the path to our local CIFAR10 dataset prefixed with `file://`. This invokes our PyTorch container with `train` and passes in our hyperparameters and other metadata as JSON files in `/opt/ml/input/config` within the container to our program entry point defined in the Dockerfile.

After our training has succeeded, our training algorithm outputs our trained model within the /opt/ml/model directory, which is used to handle predictions.

We can then call `deploy()` with an instance count of 1 and an instance type of `local`. This invokes our PyTorch container with `serve`, which sets up our container to handle prediction requests as defined [here](https://github.com/aws/sagemaker-pytorch-container/blob/master/src/sagemaker_pytorch_container/serving.py#L103). What is returned is a predictor, which is used to make inferences against our trained model.

After our prediction, we can delete our endpoint.

We recommend testing your training algorithm locally first, as it provides quicker iterations and better debuggability.
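The fit, deploy, predict, and delete sequence described above can be sketched as a single function. `local_round_trip` is a hypothetical helper: it accepts any object with the `Estimator` methods `fit()` and `deploy()`, and the payload format is whatever the serving container expects.

```python
import os


def local_round_trip(estimator, data_dir, payload):
    """Fit on local files, deploy locally, predict once, and always delete the endpoint."""
    # Local mode reads training data from a file:// URI instead of S3.
    inputs = "file://" + os.path.abspath(data_dir)
    estimator.fit(inputs)

    predictor = estimator.deploy(initial_instance_count=1, instance_type="local")
    try:
        return predictor.predict(payload)
    finally:
        # Clean up even if the prediction raises.
        predictor.delete_endpoint()
```

Putting the cleanup in `finally` means a failed prediction does not leave a local serving container running.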

# Part 2: Training and Hosting your Algorithm in Amazon SageMaker
Once you have your container packaged, you can use it to train and serve models. Let's do that with the algorithm we made above.

## Set up the environment
Here we specify the bucket to use and the role that is used for working with SageMaker.

In [29]:
# S3 prefix
prefix = 'DEMOyolov5'

## Data and model configuration preparation

In [10]:
!aws s3 cp s3://lhr/best.pt ./data/weights/
!aws s3 cp s3://lhr/yolov5x6.pt ./data/weights/
!aws s3 cp s3://lhr/dp_hj2.zip ./data/ 
!aws s3 cp s3://lhr/haitian.yaml ./data/cfg/
!aws s3 cp s3://lhr/train-args-haitian.json ./data/cfg/
%cd data
!unzip dp_hj2.zip > /dev/null 
!rm -rf images
!rm -rf labels
!cp -r ./dp_hj2/images ./images
!cp -r ./dp_hj2/labels ./labels
!rm -rf ./dp_hj2
%cd ..

/home/ec2-user/SageMaker/yolov5_sagemaker/data
/home/ec2-user/SageMaker/yolov5_sagemaker


## Create the session

The session remembers our connection parameters to SageMaker. We use it to perform all of our SageMaker operations.

In [30]:
import sagemaker as sage

sess = sage.Session()

## Upload the data for training

We will use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [31]:
WORK_DIRECTORY = '/home/ec2-user/SageMaker/yolov5_sagemaker/data/'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

inputs = {'cfg': data_location+'/cfg', 'weights': data_location+'/weights', 'images': data_location+'/images', 'labels': data_location+'/labels'}

print(inputs)

{'cfg': 's3://sagemaker-us-east-2-847380964353/DEMOyolov5/cfg', 'weights': 's3://sagemaker-us-east-2-847380964353/DEMOyolov5/weights', 'images': 's3://sagemaker-us-east-2-847380964353/DEMOyolov5/images', 'labels': 's3://sagemaker-us-east-2-847380964353/DEMOyolov5/labels'}
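Each key of the `inputs` dictionary becomes a channel, and in File mode each channel is copied to `/opt/ml/input/data/<channel_name>` inside the training container. The sketch below makes that mapping explicit; both helper names are hypothetical, and the default channel names simply mirror the ones used above.

```python
import posixpath


def build_channels(data_location, names=("cfg", "weights", "images", "labels")):
    """Build the fit() inputs dict: one S3 prefix per channel name."""
    root = data_location.rstrip("/")
    return {name: "{}/{}".format(root, name) for name in names}


def channel_container_paths(inputs, base_dir="/opt/ml/input/data"):
    """Map each channel name to the directory where its files appear in the container."""
    return {name: posixpath.join(base_dir, name) for name in inputs}
```

With the channels above, a file uploaded under the `cfg` prefix would show up in the container under `/opt/ml/input/data/cfg/`.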


## Training on SageMaker
Training a model on SageMaker with the Python SDK works much like training locally: we change `instance_type` from `local` to one of our [supported EC2 instance types](https://aws.amazon.com/sagemaker/pricing/instance-types/).

In addition, we must now specify the ECR image URL, which we just pushed above.

Finally, our local training dataset has to be in Amazon S3 and the S3 URL to our dataset is passed into the `fit()` call.

Let's first fetch our ECR image URL that corresponds to the image we just built and pushed.

In [45]:
import boto3

client = boto3.client('sts')
account = client.get_caller_identity()['Account']

my_session = boto3.session.Session()
region = my_session.region_name

algorithm_name = 'yolov5'

if region.startswith('cn'):
    ecr_image = '{}.dkr.ecr.{}.amazonaws.com.cn/{}:latest'.format(account, region, algorithm_name)
else:
    ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)

print(ecr_image)

847380964353.dkr.ecr.us-east-2.amazonaws.com/yolov5:latest


In [46]:
from sagemaker.estimator import Estimator

hyperparameters = {}

instance_type = 'ml.p3.2xlarge'  # GPU instance type used for the training job

estimator = Estimator(role=role,
                      instance_count=1,
                      instance_type=instance_type,
                      image_uri=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(inputs)

2021-12-20 16:23:47 Starting - Starting the training job...
2021-12-20 16:24:10 Starting - Launching requested ML instancesProfilerReport-1640017427: InProgress
......
2021-12-20 16:25:11 Starting - Preparing the instances for training......
2021-12-20 16:26:15 Downloading - Downloading input data...
2021-12-20 16:26:31 Training - Downloading the training image..............................
2021-12-20 16:31:48 Training - Training image download completed. Training in progress..
/opt/ml/input/data/cfg/haitian.yaml
/opt/ml/input/data/cfg/yolov5x6.yaml
/opt/ml/input/data/weights/yolov5x6.pt
TensorFlow installation not found - running with reduced feature set.
NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784
TensorBoard 2.7.0 at http://ip-10-0-119-218.us-east-2.compute.internal:6006/ (Press CTRL+C to 

In [None]:
instance_type = "ml.g4dn.xlarge"
predictor = estimator.deploy(1, instance_type)

---

In [40]:
import json

import boto3

# Read the test image as raw bytes; they are sent as the request body.
with open("./test.jpg", "rb") as fp:
    body = fp.read()
# To send an in-memory OpenCV frame instead (requires `import cv2`):
# body = cv2.imencode(".jpg", frame)[1].tobytes()

runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
response = runtime.invoke_endpoint(
    EndpointName='yolov5loc2',
    Body=body,
    ContentType='application/x-image',
)
# The response body is a stream; decode it and parse the JSON predictions.
predictions = json.loads(response["Body"].read().decode())
print(len(predictions))
print(predictions)

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
". See https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logEventViewer:group=/aws/sagemaker/Endpoints/yolov5loc2 in account 847380964353 for more information.
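A 500 like the one above is raised inside the container, so the actual cause is in the endpoint's CloudWatch logs. On the client side, the call can be wrapped so that a failure yields a structured result instead of a traceback. This is a minimal sketch: `invoke_image_endpoint` is a hypothetical helper, and it assumes the raised error carries a `response` dictionary in the shape of botocore's `ClientError`.

```python
import json


def invoke_image_endpoint(runtime, endpoint_name, image_bytes):
    """Call invoke_endpoint and return either the parsed predictions or the error details."""
    try:
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=image_bytes,
            ContentType="application/x-image",
        )
    except Exception as err:
        # botocore errors expose response["Error"] with "Code" and "Message".
        details = getattr(err, "response", {}).get("Error", {})
        return {
            "ok": False,
            "code": details.get("Code", type(err).__name__),
            "message": details.get("Message", str(err)),
        }
    return {"ok": True, "predictions": json.loads(response["Body"].read().decode())}
```

Here `runtime` would be the `boto3.client("sagemaker-runtime", ...)` object created in the cell above.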

## Optional cleanup
When you're done with the endpoint, you should clean it up.

All of the training jobs, models and endpoints we created can be viewed through the SageMaker console of your AWS account.

In [None]:
predictor.delete_endpoint()

# Reference
- [How Amazon SageMaker interacts with your Docker container for training](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)
- [How Amazon SageMaker interacts with your Docker container for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html)
- [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
- [Dockerfile](https://docs.docker.com/engine/reference/builder/)
- [scikit-bring-your-own](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb)
- [SageMaker PyTorch container](https://github.com/aws/sagemaker-pytorch-container)