# SageMaker Processing for Video Frame Extraction (Inline)

This example shows how to use SageMaker Processing to extract frame images from video files in batch.

Our extractor implementation requires the open-source library [OpenCV](https://opencv.org/), which is **not installed in the built-in SageMaker Scikit-Learn processing container**. We show two ways to solve this:

1. (This notebook) Simply use **inline commands** at the top of our Python script to install OS-level dependencies and the Python OpenCV library each time a job starts
2. (Next notebook) Create a **custom container image** using the SageMaker built-in as a base, pre-installing the dependencies

The second option (custom container) reduces the run time, and therefore the cost, of each Processing job, while the first (inline install) is simpler to get working and avoids introducing the Elastic Container Registry (ECR) service.
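
As a rough illustration of the inline approach, the top of a processing script could install its dependencies before importing `cv2`. This is a minimal sketch and not the contents of `getframes.py`: the package names (`libgl1`, `opencv-python`) and the use of `apt-get` are assumptions about what the base image needs, and the `runner` parameter is a hypothetical hook that makes dry runs possible.

```python
import subprocess
import sys

# Assumed package names; adjust to whatever your base image actually needs.
OS_PACKAGES = ["libgl1"]
PIP_PACKAGES = ["opencv-python"]


def build_install_commands(os_packages=OS_PACKAGES, pip_packages=PIP_PACKAGES):
    """Return the install commands (as argument lists) to run before importing cv2."""
    commands = []
    if os_packages:
        commands.append(["apt-get", "update"])
        commands.append(["apt-get", "install", "-y", *os_packages])
    if pip_packages:
        # Use the interpreter running this script so packages land in the right environment.
        commands.append([sys.executable, "-m", "pip", "install", *pip_packages])
    return commands


def install_dependencies(runner=subprocess.check_call):
    """Run each install command in order, raising if any of them fails."""
    for command in build_install_commands():
        runner(command)
```

In a job script you would call `install_dependencies()` once at the very top and only then `import cv2`. Because this runs on every job start, the install time is paid on every run.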


## Step 1: Dependencies and Configuration


In [None]:
# Note that although OpenCV isn't included in the optimized container images, it actually is 
# present here in the standard notebook conda_python3 kernel!
#
# In general though, we should prefer doing heavy pre-processing work in jobs over notebooks
# to best utilize resources (since the resources for jobs are active only while the job is running)
import cv2


In [None]:
%load_ext autoreload
%autoreload 1

# Built-Ins:
import io
import os
import zipfile

# External Dependencies:
import boto3
import requests
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor


In [None]:
BUCKET_NAME = sagemaker.Session().default_bucket()  # (Or an existing bucket's name, if you prefer)
%store BUCKET_NAME
INPUT_PREFIX = "videos"  # The folder in the bucket (and locally) where raw videos will live
%store INPUT_PREFIX
OUTPUT_PREFIX = "frames"  # The base folder in the bucket where output frames will be written
%store OUTPUT_PREFIX

os.makedirs(INPUT_PREFIX, exist_ok=True)


In [None]:
role = sagemaker.get_execution_role()
session = boto3.session.Session()
region = session.region_name
bucket = session.resource("s3").Bucket(BUCKET_NAME)

bucket_region = \
    session.client("s3").head_bucket(Bucket=BUCKET_NAME)["ResponseMetadata"]["HTTPHeaders"]["x-amz-bucket-region"]
assert (
    bucket_region == region
), f"Your S3 bucket {BUCKET_NAME} and this notebook need to be in the same region."


## Step 2: Push our source data into S3

We'll be using a small collection of CC-0/public domain videos as an example, but you can replace this with whatever you'd like to process.

The end result must be that the `INPUT_PREFIX` folder of your `BUCKET_NAME` bucket contains one or more video files in a format supported by OpenCV's `VideoCapture`. Nested folders are not supported by this sample code.
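
Because nested folders are not supported, it can help to check the object keys before starting a job. The helper below is a small sketch using only the standard library. `find_nested_keys` is a hypothetical name, and in practice the key list would come from listing the bucket (for example with `bucket.objects.filter(Prefix=INPUT_PREFIX)` in boto3).

```python
def find_nested_keys(keys, prefix):
    """Return the keys that sit in a sub-folder below `prefix` rather than directly in it."""
    base = prefix.rstrip("/") + "/"
    nested = []
    for key in keys:
        if not key.startswith(base):
            continue  # Not under the input prefix at all
        remainder = key[len(base):]
        if "/" in remainder:
            nested.append(key)
    return nested


example_keys = ["videos/a.mp4", "videos/b.mp4", "videos/extra/c.mp4"]
print(find_nested_keys(example_keys, "videos"))  # ['videos/extra/c.mp4']
```

An empty result means every object lives directly under the prefix, which is the layout this sample expects.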


In [None]:
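# Download the sample archive, failing fast on any HTTP error:
response = requests.get("https://archive.org/compress/pigeons_sp/formats=512KB%20MPEG4&file=/pigeons_sp.zip")
response.raise_for_status()
vidzip = zipfile.ZipFile(io.BytesIO(response.content))
# Extract each file in the archive into the local input folder:
for fname in vidzip.namelist():
    with open(f"{INPUT_PREFIX}/{fname}", "wb") as f:
        f.write(vidzip.read(fname))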


In [None]:
!aws s3 sync $INPUT_PREFIX s3://$BUCKET_NAME/$INPUT_PREFIX


## Step 3: Run a SageMaker Processing Job

Here we use the SageMaker Processing built-in Scikit-Learn container to run our job. This saves the complexity and cost of setting up a custom container image in ECR, but it means that our job needs to install OpenCV and its dependencies every time it runs, which adds to the job time and therefore to the SageMaker compute costs.


In [None]:
processor = SKLearnProcessor(
    framework_version="0.20.0",
    role=role,
    instance_type="ml.t3.medium",
    volume_size_in_gb=5,  # We don't need the whole default allocation for this small data set!
    instance_count=2,  # We can parallelize the processing to boost overall speed
)


In [None]:
# This command will block while the job runs and output the logs:
processor.run(
    code="getframes.py",
    inputs=[
        ProcessingInput(
            source=f"s3://{BUCKET_NAME}/{INPUT_PREFIX}",
            destination="/opt/ml/processing/input/videos",
            # By default, each input will be "FullyReplicated": copied in full to every instance.
            # This is great for any common reference data, but to parallelize processing of the
            # main dataset we can use "ShardedByS3Key" to split the data between instances instead:
            s3_data_distribution_type="ShardedByS3Key",
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="frames",
            source="/opt/ml/processing/frames",
            destination=f"s3://{BUCKET_NAME}/{OUTPUT_PREFIX}",
        ),
    ],
    arguments=["--frames-per-second", "0"],
)
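
To build intuition for the difference between the two distribution types mentioned in the comments above, the following sketch mimics the idea in plain Python. It is an illustration only: the round-robin assignment is an assumption made for simplicity, and the way SageMaker actually assigns keys to instances may differ. `distribute` is a hypothetical helper, not part of the SageMaker SDK.

```python
def distribute(keys, instance_count, sharded):
    """Return one list of keys per instance."""
    if not sharded:
        # "FullyReplicated": every instance receives a full copy of the input.
        return [list(keys) for _ in range(instance_count)]
    # "ShardedByS3Key" (illustrative): each key goes to exactly one instance.
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


videos = ["a.mp4", "b.mp4", "c.mp4"]
print(distribute(videos, 2, sharded=False))  # both instances see all three files
print(distribute(videos, 2, sharded=True))   # the files are split between the instances
```

With sharding, each video is processed once in total rather than once per instance, which is what lets a second instance reduce the overall job time.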


## Step 4: Querying Job Status

If we run a job in non-blocking mode, or just want to review it after it completes, we can fetch the job's status as shown below:


In [None]:
preprocessing_job_description = processor.jobs[-1].describe()
preprocessing_job_description
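
The description returned above is a dictionary whose `ProcessingJobStatus` field reports the job's current state. For a job started in non-blocking mode (`processor.run(..., wait=False)`), one option is to poll that field until it reaches a terminal value. The helper below is a minimal sketch: `wait_for_job`, its default intervals, and the injectable `sleep` argument are all hypothetical choices, not part of the SageMaker SDK.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call `describe()` until the job reaches a terminal status, then return that description."""
    for _ in range(max_polls):
        description = describe()
        if description["ProcessingJobStatus"] in TERMINAL_STATUSES:
            return description
        sleep(poll_seconds)
    raise TimeoutError("Job did not finish within the polling budget")
```

In this notebook it could be called as `wait_for_job(processor.jobs[-1].describe)`, passing the bound method so that each poll fetches a fresh description.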


## Clean-Up

The Processing container is shut down by SageMaker as soon as the job completes, so the only ongoing costs to be aware of are the S3 data storage and this running notebook instance.

Be aware, though, that our default configuration here also pushes some job metadata to the SageMaker default bucket in this region (`sagemaker-{regionname}-{accountid}`). Why not go and check out what's saved?
