# Distributed data processing

In this notebook, we will learn how to preprocess data in a distributed fashion using SageMaker Processing.

Download the dataset from Kaggle (requires a free account): https://www.kaggle.com/gpiosenka/100-bird-species/ and unzip it to a local directory next to this notebook.

To keep cost and execution time manageable, we will use only the "test" split to produce augmented images. We start by uploading the test split to S3. It is convenient to use the SageMaker SDK's upload functionality (`S3Uploader`) for this.

In [2]:

import sagemaker
from sagemaker import get_execution_role

# Resolve the IAM execution role attached to this notebook environment.
role = get_execution_role()
sess = sagemaker.Session()
account_id = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_region_name

In [7]:
from sagemaker.s3 import S3Uploader


original_data_dir = "315_birds"
split = "test"

dataset_uri = S3Uploader.upload(f"./{original_data_dir}/{split}", f"s3://{sess.default_bucket()}/{original_data_dir}/{split}")

In [5]:
class_dict_file = "class_dict.csv"

class_dict_uri = S3Uploader.upload(f"./{original_data_dir}/{class_dict_file}", f"s3://{sess.default_bucket()}/{original_data_dir}")

In [8]:
print(f"{split} split data has been uploaded to {dataset_uri}")
print(f"class dictionary has been uploaded to {class_dict_uri}")

test split data has been uploaded to s3://sagemaker-us-east-1-941656036254/315_birds/test/
class dictionary has been uploaded to s3://sagemaker-us-east-1-941656036254/315_birds/class_dict.csv


# Scheduling a processing job

In [91]:
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com

Login Succeeded


In [106]:
! chmod +x build_and_push.sh
! ./build_and_push.sh "keras-processing" "latest" "2_dockerfile.processor"

Working in region us-east-1
Image URI 941656036254.dkr.ecr.us-east-1.amazonaws.com/keras-processing:latest
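
The processing job in this notebook passes its settings to the container as command-line arguments. The contents of the processing script are not shown here, so the following is only a minimal sketch of how such a script might parse them. The flag names are taken from the job's `arguments` list; the `parse_args` helper, the default values, and the types are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed through the processing job's `arguments` list.

    Hypothetical helper: defaults and types are assumptions, not the
    actual processing script.
    """
    parser = argparse.ArgumentParser(description="Image augmentation (illustrative sketch)")
    parser.add_argument("--data_location", type=str, default="/opt/ml/processing/input")
    parser.add_argument("--lookup_location", type=str, default="/opt/ml/processing/lookup")
    parser.add_argument("--output_location", type=str, default="/opt/ml/processing/output")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--max_samples", type=int, default=10)
    parser.add_argument("--max_augmentations", type=int, default=5)
    return parser.parse_args(argv)


# Arguments always arrive as strings; argparse converts the numeric ones to int.
args = parse_args(["--batch_size", "32", "--max_samples", "10", "--max_augmentations", "5"])
print(args.data_location, args.batch_size, args.max_samples, args.max_augmentations)
```

Because every job argument is passed as a string, declaring `type=int` in the parser keeps the conversion in one place instead of scattering `int(...)` calls through the script.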


In [11]:
container_name = "keras-processing"
container_tag = "latest"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{container_name}:{container_tag}"
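
The job defined next sets `s3_data_distribution_type="ShardedByS3Key"` on the image input, so each instance receives only a subset of the objects under the S3 prefix rather than a full copy. The sketch below illustrates the idea with a simple round-robin split. The `shard_keys` helper and the object keys are hypothetical, and round-robin is an assumption chosen for illustration; it is not a description of how SageMaker actually assigns keys to instances.

```python
def shard_keys(keys, instance_count):
    """Split S3 keys into one disjoint subset per instance.

    Round-robin assignment is an illustrative assumption only.
    """
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical object keys: two classes with three images each.
keys = [f"315_birds/test/CLASS_{c}/{i}.jpg" for c in "AB" for i in range(1, 4)]
shards = shard_keys(keys, instance_count=2)

for instance, shard in enumerate(shards):
    print(f"instance {instance}: {len(shard)} objects")
```

The practical consequence is that a processing script running under this mode should treat its local input directory as a partial view of the dataset, and should not assume every class or file is present on every instance.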

In [13]:
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

lookup_location = "/opt/ml/processing/lookup"
data_location = "/opt/ml/processing/input"
output_location = "/opt/ml/processing/output"

# Generic Processor that runs our custom container on two instances.
processor = Processor(
    image_uri=image_uri,
    role=role,
    instance_count=2,
    base_job_name="image-augmentation",
    sagemaker_session=sess,
    instance_type="ml.m5.xlarge",
)

processor.run(
    inputs=[
        # Image data is split across instances by S3 key.
        ProcessingInput(
            source=dataset_uri,
            destination=data_location,
            s3_data_distribution_type="ShardedByS3Key",
        ),
        # The class dictionary is copied in full to every instance (default distribution).
        ProcessingInput(
            source=class_dict_uri,
            destination=lookup_location,
        ),
    ],
    outputs=[ProcessingOutput(source=output_location)],
    arguments=[
        "--data_location", data_location,
        "--lookup_location", lookup_location,
        "--output_location", output_location,
        "--batch_size", "32",
        "--max_samples", "10",
        "--max_augmentations", "5",
    ],
)


Job Name:  image-augmentation-2021-11-27-23-44-06-455
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-941656036254/315_birds/test/', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-941656036254/315_birds/class_dict.csv', 'LocalPath': '/opt/ml/processing/lookup', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-941656036254/image-augmentation-2021-11-27-23-44-06-455/output/output-1', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
............................2021-11-27 23:48:32.160111: W tensorflow/stream_executor/pla