# SageMaker Processing for Video Frame Extraction (Container)

In this follow-on notebook, we'll tackle the same problem with a custom container image to improve job execution speed and cost.


## Step 0: Pre-requisites

This notebook creates an ECR repository and pushes an image to it, so if you're running in SageMaker you'll need to **grant the notebook instance permission to use ECR**.

The simplest way to get started (with a permissive configuration that production users may want to restrict further):

- Go to the "Notebook instances" tab of the SageMaker console
- Find this notebook instance in the list, and click on the hyperlinked notebook name to go to the details page
- Scroll down to the "Permissions and encryption" section, and click on the hyperlinked "IAM role ARN", which will open the IAM role details screen in a new tab
- Click the blue "Attach Policies" button, search for the `AmazonEC2ContainerRegistryPowerUser` policy, and attach it to the role
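
The same attachment can also be scripted. The following is a minimal sketch, assuming it is run with credentials that are allowed to call `iam:AttachRolePolicy` (the notebook's own execution role usually is not). The helper names `role_name_from_arn` and `attach_ecr_policy` are hypothetical, and the IAM client is passed in by the caller:

```python
# ARN of the AWS-managed policy named in the steps above.
ECR_POWER_USER_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPowerUser"


def role_name_from_arn(role_arn):
    """Return the role name, i.e. the last path segment of an IAM role ARN."""
    return role_arn.split("/")[-1]


def attach_ecr_policy(iam_client, role_arn):
    """Attach the ECR power-user managed policy to the given role.

    `iam_client` is expected to behave like `boto3.client("iam")`.
    """
    iam_client.attach_role_policy(
        RoleName=role_name_from_arn(role_arn),
        PolicyArn=ECR_POWER_USER_POLICY_ARN,
    )


# Hypothetical usage, from a session with IAM administration rights:
#   import boto3
#   attach_ecr_policy(boto3.client("iam"), "<the notebook's IAM role ARN>")
```

Passing the client in as an argument keeps the helper easy to exercise without touching a real AWS account.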


## Step 1: Dependencies


In [None]:
%load_ext autoreload
%autoreload 1

# Built-Ins:
import os
from string import Template

# External Dependencies:
import boto3
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor


## Step 2: Re-use previous notebook setup

We downloaded the sample data and set configurations like the `BUCKET_NAME` in the previous notebook, so we won't repeat ourselves here! Just reload the config and init the libraries as before.


In [None]:
%store -r BUCKET_NAME
%store -r INPUT_PREFIX
%store -r OUTPUT_PREFIX

if not os.path.isdir(f"{os.path.abspath('')}/{INPUT_PREFIX}"):
    raise RuntimeError("You need to run the non-ECR notebook's setup and data download first!")


In [None]:
role = sagemaker.get_execution_role()
session = boto3.session.Session()
region = session.region_name


## Step 3: Connect to container registries

We will build on the SageMaker scikit-learn base container, so we need to know the URI where it lives and log in to its registry to pull it. We will also create a new image, so we need to log in to our own registry to push it.

The base [Processor](https://sagemaker.readthedocs.io/en/stable/processing.html) interface of the SageMaker Python SDK takes an `image_uri` parameter, whereas the standard `SKLearnProcessor` used in the previous notebook only needs the scikit-learn framework version. The SDK must therefore derive the container URI from the version internally, and the [(open source) implementation of SKLearnProcessor](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/sklearn/processing.py) shows us how.

Here we follow much the same approach as the SDK to derive the base image URI. We can't simply hard-code it, because it varies with AWS Region, framework version, and processor type:


In [None]:
framework_version = "0.20.0"
image_tag = "{}-{}-{}".format(framework_version, "cpu", "py3")
image_uri = sagemaker.fw_registry.default_framework_uri("scikit-learn", region, image_tag)
# (Note, some other frameworks use the `sagemaker.fw_utils.create_image_uri()` function instead)

base_host_account_id = image_uri.partition(".")[0]
print(image_uri)
print(f"Base image host account: {base_host_account_id}")
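
To make the `partition(".")` step above less opaque, here is a minimal sketch of how a tag-based ECR image URI breaks down into its parts. The function name `parse_ecr_image_uri`, the regular expression, and the example URI (with a placeholder account ID and repository name) are illustrative assumptions, not part of the SageMaker SDK:

```python
import re

# <account>.dkr.ecr.<region>.amazonaws.com[.cn]/<repository>[:<tag>]
_ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?"
    r"/(?P<repository>[^:]+)(?::(?P<tag>.+))?$"
)


def parse_ecr_image_uri(uri):
    """Split a tag-based ECR image URI into account, region, repository and tag."""
    match = _ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognised ECR image URI: {uri}")
    return match.groupdict()


# Placeholder URI for illustration only:
parts = parse_ecr_image_uri(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:0.20.0-cpu-py3"
)
print(parts["account"], parts["region"], parts["repository"], parts["tag"])
```

The account ID at the front of the hostname identifies the registry, which is why it is the value we need for logging in.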


This host account is the first ECR registry we'll need to log in to (to pull the base image); and our own account is the other (to push our custom image). Here we open an ECR client and resolve our own account ID:


In [None]:
# ECR is a separate service, so we'll need another service client:
ecr_client = session.client("ecr")
# We also want our account ID for the purposes of logging in to our own ECR:
account_id = session.client("sts").get_caller_identity().get("Account")


Finally, we log in to the two registries.

The `aws ecr get-login` command (available in AWS CLI version 1; version 2 replaces it with `aws ecr get-login-password`) returns, along with some other text, an executable `docker login` command containing temporary credentials for our current AWS session.

The cell below runs the AWS CLI command and then executes the Docker CLI command it returns:


In [None]:
login_cmd = f"$(aws ecr get-login --registry-ids {base_host_account_id} {account_id} --no-include-email | sed 's|https://||')"
!eval "$login_cmd"
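
For intuition about what the login step hands to Docker: the boto3 ECR client's `get_authorization_token()` call returns a base64-encoded token that decodes to `user:password`, and these are the values `docker login` needs. The sketch below decodes a made-up token offline; `decode_ecr_token` is a hypothetical helper and the token is fabricated purely for illustration:

```python
import base64


def decode_ecr_token(authorization_token):
    """Decode a base64 ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # Split on the first colon only, in case the password itself contains colons.
    username, _, password = decoded.partition(":")
    return username, password


# A fabricated token standing in for the `authorizationToken` field of a real response:
fake_token = base64.b64encode(b"AWS:example-password").decode("ascii")
username, password = decode_ecr_token(fake_token)
print(username)
```

With a real token, the password would then be supplied to `docker login` for the registry in question.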


## Step 4: Build and upload the custom container image

This source repository uses a template Dockerfile because the base image URI is dynamically calculated as above, so our first step is to resolve the variable to create a concrete Dockerfile:


In [None]:
with open("container/Dockerfile.tpl", "r") as tplfile:
    with open("container/Dockerfile", "w") as dockerfile:
        template = Template(tplfile.read())
        dockerfile.write(
            template.substitute({
                "BASE_IMAGE": image_uri
            }),
        )
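
As a quick illustration of the substitution step, the sketch below renders a one-line template in memory. The template text and the image URI are assumptions for illustration; the real `container/Dockerfile.tpl` is not reproduced here:

```python
from string import Template

# Hypothetical one-line template; `${BASE_IMAGE}` is the placeholder syntax of string.Template.
example_template = Template("FROM ${BASE_IMAGE}\n")

rendered = example_template.substitute(
    {"BASE_IMAGE": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:0.20.0-cpu-py3"}
)
print(rendered)

# substitute() raises KeyError if a placeholder has no value, which catches typos early:
try:
    example_template.substitute({})
except KeyError as missing:
    print(f"Missing template variable: {missing}")
```

Using `substitute()` rather than `safe_substitute()` means an unresolved placeholder fails loudly instead of producing a Dockerfile that Docker cannot build.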


Now we simply run a standard Docker build:


In [None]:
container_name = "smskcv"
!docker rmi --force $container_name
!docker build -t $container_name:latest ./container


The parent image and the tagged child image should now both be available here in our notebook instance:


In [None]:
!docker images


Now our image is built!

We'll create an ECR repository with the AWS CLI, then tag the image with the repository URI and push it using the standard Docker CLI:


In [None]:
!aws ecr create-repository --repository-name $container_name
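
Note that creating a repository that already exists returns an error, for example when re-running this notebook. A minimal sketch of an idempotent alternative using boto3 follows; `ensure_repository` is a hypothetical helper, and `ecr_client` is assumed to behave like `session.client("ecr")`:

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository if needed and return its URI either way."""
    try:
        response = ecr_client.create_repository(repositoryName=name)
        return response["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository is already there, so just look up its URI.
        response = ecr_client.describe_repositories(repositoryNames=[name])
        return response["repositories"][0]["repositoryUri"]


# Hypothetical usage in this notebook:
#   repository_uri = ensure_repository(session.client("ecr"), "smskcv")
```

The returned URI is the repository part of the image URI, to which a tag such as `:latest` is appended before pushing.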


In [None]:
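# Re-use `container_name` so the repository, local image and pushed image stay in sync:
target_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{container_name}:latest"
!docker tag $container_name:latest $target_uri
!docker push $target_uri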


## Step 5: Use our new container for a processing job!

Now that we have our custom container, we can use it in place of the built-in scikit-learn container by substituting `ScriptProcessor` for `SKLearnProcessor` and adding a couple of extra arguments.

Otherwise, the process to create and run the processor is essentially the same as before.

**Note: You might want to delete the old `OUTPUT_PREFIX` folder from S3 to convince yourself that the job below re-populates it.**


In [None]:
processor = ScriptProcessor(
    role=role,
    image_uri=target_uri,  # Need to tell SageMaker where to find the custom image
    command=["python3"],  # Because it's a custom image, we need to specify the start command
    instance_type="ml.t3.medium",
    volume_size_in_gb=5,  # We don't need the whole default allocation for this small data set!
    instance_count=2,  # We can parallelize the processing to boost overall speed
)


In [None]:
# This command will block while the job runs and output the logs:
processor.run(
    code="getframes.py",
    inputs=[
        ProcessingInput(
            source=f"s3://{BUCKET_NAME}/{INPUT_PREFIX}",
            destination="/opt/ml/processing/input/videos",
            # By default, each input will be "FullyReplicated": copied in full to every instance.
            # This is great for any common reference data, but to parallelize processing of the
            # main dataset we can use "ShardedByS3Key" to split the data between instances instead:
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="frames",
            source="/opt/ml/processing/frames",
            destination=f"s3://{BUCKET_NAME}/{OUTPUT_PREFIX}",
        ),
    ],
    arguments=["--frames-per-second", "0"],
)
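
To build intuition for `ShardedByS3Key`, the sketch below splits a list of object keys between instances. The round-robin rule, the `shard_keys` helper, and the file names are all assumptions for illustration; the actual assignment is performed by SageMaker and may differ. The point is only that each instance receives a disjoint subset of the input rather than a full copy:

```python
def shard_keys(keys, instance_count):
    """Distribute keys across instances round-robin (an illustrative rule only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical input objects under INPUT_PREFIX:
example_keys = ["video_a.mp4", "video_b.mp4", "video_c.mp4", "video_d.mp4", "video_e.mp4"]
example_shards = shard_keys(example_keys, instance_count=2)

for instance_index, shard in enumerate(example_shards):
    print(f"Instance {instance_index} would process: {shard}")
```

One consequence for the processing script is that it should handle whatever files it finds in its input folder, without assuming the whole data set is present on any one instance.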


## Clean-Up:

As before, be aware of the resources that may continue to incur charges:

* This notebook instance
* The S3 input and output locations where we've stored data, and
* Any new processing metadata saved to the SageMaker default bucket for this region

If you use this notebook instance for other things, you might also want to remove locally cached Docker images to free disk space, as below:


In [None]:
!docker image prune -a -f
