# Image Dataset Augmentation using SageMaker Processing

In this notebook we will learn how to preprocess data in a distributed fashion using SageMaker Processing.

We will use the `450 Bird Species` computer vision dataset, which contains multiple images for each bird species. We will then augment the original dataset with modified versions of the images (rotated, cropped, resized) to increase the dataset size and image variability. For image transformation we will use the `Keras` module (part of the TensorFlow library). Finally, we will run the processing job on multiple SageMaker compute nodes.


### Prerequisites
In this example we will build a processing container from scratch. Make sure that you have `docker` installed.

## Getting Data
Download the dataset from Kaggle (requires a free account): https://www.kaggle.com/gpiosenka/100-bird-species/ and unzip it to a local directory next to this notebook. To keep costs and execution time manageable, we will use only the `test` split to produce augmented images. We start by uploading the test split to S3. The SageMaker `S3Uploader` class is convenient for this.

In [2]:

import sagemaker
from sagemaker import get_execution_role

# IAM role and session used by the SageMaker calls in this notebook
role = get_execution_role()
sess = sagemaker.Session()

# Account ID and region are needed later to build the ECR image URI
account_id = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_region_name

Below we upload the test split of the original dataset and the associated class dictionary to Amazon S3.

In [7]:
from sagemaker.s3 import S3Uploader


original_data_dir = "450_birds"
split = "test"

dataset_uri = S3Uploader.upload(f"./{original_data_dir}/{split}", f"s3://{sess.default_bucket()}/{original_data_dir}/{split}")

class_dict_file = "class_dict.csv"

class_dict_uri = S3Uploader.upload(f"./{original_data_dir}/{class_dict_file}", f"s3://{sess.default_bucket()}/{original_data_dir}")

In [None]:
print(f"{split} split data has been uploaded to {dataset_uri}")
print(f"class dictionary has been uploaded to {class_dict_uri}")
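As an optional sanity check on what was uploaded, the local split can be summarized per class. The sketch below assumes the unzipped split follows a `<split>/<class_name>/<image>` layout and that images use common extensions; both are assumptions about the dataset, and `count_images_per_class` is a hypothetical helper, not part of the SageMaker SDK.

```python
from collections import Counter
from pathlib import Path

# Assumption: images in the dataset use one of these extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for a <split>/<class_name>/<image> layout."""
    counts = Counter()
    for path in Path(split_dir).glob("*/*"):
        # Only count regular files that look like images.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            counts[path.parent.name] += 1
    return dict(counts)
```

Calling `count_images_per_class(f"./{original_data_dir}/{split}")` gives a per-class count that can be compared with the object listing under `dataset_uri`.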

## Building Processing Container

SageMaker Processing provides two pre-built containers:
- PySpark container with dependencies to run Spark computations 
- Scikit-learn container

You can also bring your own (BYO) processing container with virtually any runtime configuration. In our example we will use TensorFlow image augmentation functionality, specifically the `Keras` module, so we will build our processing container from scratch using the `slim-buster` Python image as a base. We then install all required Python dependencies and copy the processing code into the container. Note that SageMaker starts processing containers with a command similar to `docker run <image-uri>`, so we need to specify an entrypoint in our Dockerfile.

Run the cell below to familiarize yourself with the processing container's Dockerfile.

In [None]:
! pygmentize -l docker 2_dockerfile.processor

Next, we build our container and push it to Amazon ECR (a managed container registry from AWS). Execute the cell below to authenticate your Docker client with ECR.

In [None]:
# login to your private ECR
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com

In [None]:
image_name = "keras-processing"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}"

! ./build_and_push.sh {image_name} 2_dockerfile.processor

## Developing Processing Script

As part of the processing container we also need to provide a script with the processing logic. Execute the cell below to review the processing code.

Here are the key highlights of the processing script:
- We use the Keras `Dataset` class to load the dataset from a directory.
- We use the `ImageDataGenerator` class to generate modified versions of the original images.
- We then iterate over batches of data and save the batches, generated on the fly, to disk according to the expected dataset directory hierarchy.

In [None]:
! pygmentize -O linenos=1  2_sources/processing.py
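The last highlight, saving batches into a class-per-directory hierarchy, can be illustrated without TensorFlow. This is a minimal sketch and not the code from `2_sources/processing.py`: `save_batch` is a hypothetical helper, and it assumes images arrive as already-encoded bytes and labels as integer class indices.

```python
from pathlib import Path


def save_batch(images, labels, class_names, output_dir, start_index=0):
    """Write one batch to <output_dir>/<class_name>/<n>.jpg and return the paths.

    images: iterable of already-encoded image bytes (assumption; a real script
            would first encode the arrays produced by the augmentation step).
    labels: iterable of integer class indices aligned with `images`.
    """
    written = []
    for offset, (image_bytes, label) in enumerate(zip(images, labels)):
        # One sub-directory per class keeps the layout loadable "from directory".
        class_dir = Path(output_dir) / class_names[label]
        class_dir.mkdir(parents=True, exist_ok=True)
        target = class_dir / f"{start_index + offset}.jpg"
        target.write_bytes(image_bytes)
        written.append(target)
    return written
```

Passing a running `start_index` from batch to batch keeps file names unique within one node's output directory.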

## Running Processing Jobs
Once we have the BYO container and processing code, we are ready to schedule a processing job. First, we need to instantiate a `Processor` object with the basic job configuration, such as the number and type of instances and the container image. In our example, we want to distribute our processing task across several compute nodes, so we set the number of instances to `2`.

In [None]:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput


sklearn_processor = Processor(
    image_uri=image_uri,
    role=role,
    instance_count=2,
    base_job_name="image-augmentation",
    sagemaker_session=sess,
    instance_type="ml.m5.xlarge",
)



We then configure locations for input and output data using the `ProcessingInput` and `ProcessingOutput` classes. Note that in this example we have two types of input data, which require slightly different configuration:
- Image dataset. Since we want to use multiple nodes for our processing task, we need to split the input images evenly between the processing nodes. To achieve this, we set `s3_data_distribution_type="ShardedByS3Key"`. SageMaker will attempt to split the S3 objects (in our case, images) evenly between the processing nodes.
- Class lookup file. We need to have this file on each node. For this, we set `s3_data_distribution_type="FullyReplicated"`, so SageMaker automatically downloads a full copy to each compute node.
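To build intuition for the difference between the two distribution types, the sketch below models them on a list of S3 keys. It is only an illustrative model: how SageMaker actually assigns objects to instances is decided by the service, and the round-robin assignment, `shard_keys`, and `replicate_keys` are assumptions made for this example.

```python
def shard_keys(keys, instance_count):
    """Illustrative model of sharding: deal sorted keys round-robin to instances."""
    shards = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(keys)):
        shards[i % instance_count].append(key)
    return shards


def replicate_keys(keys, instance_count):
    """Illustrative model of full replication: every instance gets every key."""
    return [sorted(keys) for _ in range(instance_count)]


# Hypothetical object keys, used only to show the shape of the result.
example_keys = [f"450_birds/test/ROBIN/{i}.jpg" for i in range(1, 6)]
for node, shard in enumerate(shard_keys(example_keys, 2)):
    print(f"node {node} receives {len(shard)} of {len(example_keys)} objects")
```

Under sharding each node sees only its own subset of images, which is why the class lookup file has to be replicated rather than sharded.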

Execute the cell below to start the SageMaker Processing job. It will take several minutes to complete.

In [None]:
lookup_location = "/opt/ml/processing/lookup"
data_location = "/opt/ml/processing/input"
output_location = "/opt/ml/processing/output"


sklearn_processor.run(
    inputs=[
        # Images are split across the processing nodes
        ProcessingInput(
            source=dataset_uri,
            destination=data_location,
            s3_data_distribution_type="ShardedByS3Key",
        ),
        # The class dictionary is copied in full to every node
        ProcessingInput(
            source=class_dict_uri,
            destination=lookup_location,
            s3_data_distribution_type="FullyReplicated",
        ),
    ],
    outputs=[ProcessingOutput(source=output_location)],
    arguments=[
        "--data_location", data_location,
        "--lookup_location", lookup_location,
        "--output_location", output_location,
        "--batch_size", "32",
        "--max_samples", "10",
        "--max_augmentations", "5",
    ],
)
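The `arguments` list is passed to the container entrypoint as command-line arguments, always as strings. A minimal sketch of how a processing script might parse them with `argparse` is shown below; the flag names come from the cell above, while the types and defaults are assumptions and the actual `2_sources/processing.py` may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed via `arguments`; types and defaults are assumed."""
    parser = argparse.ArgumentParser(description="Image augmentation job (sketch)")
    parser.add_argument("--data_location", type=str, required=True)
    parser.add_argument("--lookup_location", type=str, required=True)
    parser.add_argument("--output_location", type=str, required=True)
    # Numeric values arrive as strings, so argparse converts them to int.
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--max_samples", type=int, default=None)
    parser.add_argument("--max_augmentations", type=int, default=5)
    return parser.parse_args(argv)


# Example: the same values the notebook passes to the processing job.
args = parse_args([
    "--data_location", "/opt/ml/processing/input",
    "--lookup_location", "/opt/ml/processing/lookup",
    "--output_location", "/opt/ml/processing/output",
    "--batch_size", "32",
    "--max_samples", "10",
    "--max_augmentations", "5",
])
```

Accepting an explicit `argv` list keeps the parser easy to exercise from a notebook cell, while `parse_args()` with no argument falls back to `sys.argv` inside the container.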

## Summary

This example should give you an intuition for how SageMaker Processing can be used for your data processing needs. SageMaker Processing is also flexible enough to run arbitrary tasks, such as batch inference, data aggregation, and analytics.