# Distributed Data Parallel Training on Amazon SageMaker
In this notebook we use a Vision Transformer to classify images from the `horse or human` dataset, available at https://laurencemoroney.com/datasets.html. We download both the training and validation datasets provided on the site.

Note: 
- Kernel: `PyTorch 1.8 Python 3.6 CPU Optimized`
- Instance Type: `ml.m5.xlarge`

In [2]:
## Download data
!curl -o train.zip https://storage.googleapis.com/laurencemoroney-blog.appspot.com/horse-or-human.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  142M  100  142M    0     0  96.2M      0  0:00:01  0:00:01 --:--:-- 96.2M


In [3]:
!curl -o validation.zip https://storage.googleapis.com/laurencemoroney-blog.appspot.com/validation-horse-or-human.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.9M  100 10.9M    0     0  20.3M      0 --:--:-- --:--:-- --:--:-- 20.3M


In [4]:
## Unzip file
import zipfile
with zipfile.ZipFile("train.zip","r") as train_zip_ref:
    train_zip_ref.extractall("data/train")
    
with zipfile.ZipFile("validation.zip","r") as val_zip_ref:
    val_zip_ref.extractall("data/validation")
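
Before moving on, it can help to confirm that both archives extracted as expected. The following is a minimal sketch, assuming each split contains one sub-folder per class (later cells reference `data/validation/horses/`); the helper name `count_images_per_class` is hypothetical and not part of the original notebook.

```python
import os

def count_images_per_class(root):
    """Return {class_folder: number_of_files} for each sub-folder of root."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files next to the class folders
        counts[entry] = sum(
            1 for name in os.listdir(class_dir)
            if os.path.isfile(os.path.join(class_dir, name))
        )
    return counts

# Only report on splits that actually exist on disk.
for split in ("data/train", "data/validation"):
    if os.path.isdir(split):
        print(split, count_images_per_class(split))
```

A large imbalance between class folders, or an empty split, usually means a download or extraction step needs to be repeated.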

## Convert images to High Resolution
We start by converting our images to high resolution using the Hugging Face model [EdsrModel](https://huggingface.co/eugenesiow/edsr-base) from the `super-image` library. 
This step is optional. Its purpose is to mimic real-world image datasets used in high-performance computing, where a single image may be megabytes in size.

In [2]:
!pip install datasets super-image
!python3 -m pip install --upgrade sagemaker

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)
     |████████████████████████████████| 325 kB 8.0 MB/s            
Collecting super-image
  Downloading super_image-0.1.6-py3-none-any.whl (85 kB)
     |████████████████████████████████| 85 kB 109.2 MB/s            
Collecting xxhash
  Downloading xxhash-3.0.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (211 kB)
     |████████████████████████████████| 211 kB 110.1 MB/s            
Collecting tqdm>=4.62.1
  Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 109.6 MB/s            
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 104.1 MB/s            
Collecting responses<0.19
  Downloading responses-0.17.0-py2.py3-none-any.whl (38 kB)
Collecting torch==1.9.0
  Downloadin

The images in the `horse-or-human` dataset are about `178KB` each. We convert them to a higher resolution, and the resulting size is expected to be close to `2MB`.

In [3]:
from super_image import EdsrModel, ImageLoader
from PIL import Image
import requests

model = EdsrModel.from_pretrained('eugenesiow/edsr-base', scale=4)

https://huggingface.co/eugenesiow/edsr-base/resolve/main/pytorch_model_4x.pt


In [11]:
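import os

# Upscale every image under data/validation/ and overwrite it in place.
# This cell processes data/validation/ only.
folder_dir = "data/validation/"
for folder in os.listdir(folder_dir):
    folder_path = os.path.join(folder_dir, folder)
    if not os.path.isdir(folder_path):
        continue  # skip stray files next to the class folders
    for image_file in os.listdir(folder_path):
        path = os.path.join(folder_path, image_file)
        image = Image.open(path)
        inputs = ImageLoader.load_image(image)  # PIL image -> model input tensor
        preds = model(inputs)                   # upscaled output (scale=4)
        ImageLoader.save_image(preds, path)     # overwrites the original file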

In [12]:
# quick check on image size for the last image converted by the model.
import os

file_size = os.path.getsize(path)
print("File Size is :", file_size/1000000, "MB")

File Size is : 0.481116 MB
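
The check above looks at a single file, which may not be representative of the whole dataset. As a rough sketch, the helper below (a hypothetical `directory_size_stats`, not part of the original notebook) walks a folder and reports the file count, mean size, and largest size, using the same 1 MB = 1,000,000 bytes convention as the cell above.

```python
import os

def directory_size_stats(root):
    """Walk root and return (file_count, mean_size_mb, max_size_mb)."""
    sizes = [
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, filenames in os.walk(root)
        for name in filenames
    ]
    if not sizes:
        return 0, 0.0, 0.0
    return len(sizes), sum(sizes) / len(sizes) / 1e6, max(sizes) / 1e6

# Only run the report if the folder exists in the current working directory.
if os.path.isdir("data/validation"):
    count, mean_mb, max_mb = directory_size_stats("data/validation")
    print(f"{count} files, mean {mean_mb:.3f} MB, max {max_mb:.3f} MB")
```

Running it before and after the conversion step gives a direct view of how much the upscaling changed the dataset size.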


## Optional: Duplicate files to increase the number of images for testing in later sections

In [9]:
# import os
# import shutil

# # repeat the same code for other folders as well.
# source_folder = r"./data/validation/horses/"
# destination_folder = r"./data/validation/horses/"
# i=0
# # fetch all files
# for file_name in os.listdir(source_folder):
#     # construct full file path
#     i=i+1
#     source = source_folder + file_name
#     destination = destination_folder + str(i) + '_' + file_name
#     # copy only files
#     if os.path.isfile(source):
#         shutil.copy(source, destination)
#         print('copied', destination)

copied ./data/validation/horses/1_horse5-235.png
copied ./data/validation/horses/2_horse5-550.png
copied ./data/validation/horses/3_horse1-554.png
copied ./data/validation/horses/4_horse6-153.png
copied ./data/validation/horses/5_horse5-514.png
copied ./data/validation/horses/6_horse3-255.png
copied ./data/validation/horses/7_horse1-455.png
copied ./data/validation/horses/8_horse5-181.png
copied ./data/validation/horses/9_horse2-544.png
copied ./data/validation/horses/10_horse5-002.png
copied ./data/validation/horses/11_horse1-204.png
copied ./data/validation/horses/12_horse1-105.png
copied ./data/validation/horses/13_horse4-102.png
copied ./data/validation/horses/14_horse1-411.png
copied ./data/validation/horses/15_horse6-089.png
copied ./data/validation/horses/16_horse1-510.png
copied ./data/validation/horses/17_horse5-275.png
copied ./data/validation/horses/18_horse4-159.png
copied ./data/validation/horses/19_horse3-484.png
copied ./data/validation/horses/20_horse2-368.png
copied ./
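
The duplication cell has to be edited and re-run for each class folder. One possible way to make it reusable is sketched below. The function name `duplicate_files`, the `copies` parameter, and the `copy{n}_` file prefix are all assumptions for illustration and differ from the `{i}_` prefix used above.

```python
import os
import shutil

def duplicate_files(folder, copies=1):
    """Copy every file in folder `copies` times and return the new paths."""
    # Snapshot the listing first so newly created copies are not copied again.
    originals = [
        name for name in sorted(os.listdir(folder))
        if os.path.isfile(os.path.join(folder, name))
    ]
    created = []
    for n in range(1, copies + 1):
        for name in originals:
            destination = os.path.join(folder, f"copy{n}_{name}")
            shutil.copy(os.path.join(folder, name), destination)
            created.append(destination)
    return created

# Example (uncomment to run against the extracted data):
# duplicate_files("data/validation/horses", copies=2)
```

With `copies=2`, a folder of N files ends up with 3N files, since the originals are kept.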

## Upload data to S3

In [3]:
%%time

import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()

prefix = 'horse-or-human'

role = get_execution_role() 
client = boto3.client('sts')
account = client.get_caller_identity()['Account']
print(f'AWS account:{account}')

session = boto3.session.Session()
region = session.region_name
print(f'AWS region:{region}')

AWS account:706553727873
AWS region:us-west-2
CPU times: user 336 ms, sys: 17 ms, total: 353 ms
Wall time: 1.18 s


In [7]:
from sagemaker.s3 import S3Uploader
# s3_input_data = f's3://{bucket}/{prefix}/data'
s3_train_data = S3Uploader.upload('data/train',f's3://{bucket}/{prefix}/data/train')
s3_val_data = S3Uploader.upload('data/validation',f's3://{bucket}/{prefix}/data/validation')
print('s3 train data path: ', s3_train_data)
print('s3 validation data path: ', s3_val_data)

s3 train data path:  s3://sagemaker-us-west-2-706553727873/horse-or-human/data/train
s3 validation data path:  s3://sagemaker-us-west-2-706553727873/horse-or-human/data/validation


In [8]:
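## Define PyTorch Estimator
from sagemaker.pytorch import PyTorch

# Regexes that SageMaker applies to the training logs to extract metrics.
# Raw strings keep the backslash in `\.` from being read as a string escape.
metric_definitions = [
    {'Name': 'train:error', 'Regex': r'loss : ([0-9\.]+)'},
    {'Name': 'validation:error', 'Regex': r'val_loss : ([0-9\.]+)'},
]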
estimator = PyTorch(entry_point='train.py',
                    source_dir='src',
                    role=role,
                    instance_count=1,
                    instance_type='ml.p3.16xlarge',
                    framework_version='1.8.0',
                    py_version='py3',
                    sagemaker_session=sagemaker_session,
                    hyperparameters={'epochs':10,
                                     'batch_size':32, 
                                     'lr':3e-5,
                                     'gamma': 0.7},
                    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
                    debugger_hook_config=False,
                    metric_definitions = metric_definitions,
                   )
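
It can be worth testing the metric regexes locally before launching a job. The sketch below uses Python's `re` module on a hypothetical log line; the real format depends on what `src/train.py` prints, and SageMaker's own matching may behave differently from `re.findall`. It illustrates that the `loss : ...` pattern also matches inside `val_loss : ...`, and shows one possible stricter variant using a word boundary (`\b`), whose support in SageMaker is an assumption here.

```python
import re

# Patterns from the estimator's metric_definitions.
train_pattern = r"loss : ([0-9\.]+)"
val_pattern = r"val_loss : ([0-9\.]+)"

# Hypothetical stricter variant: "_" is a word character, so \b does not
# match between "val_" and "loss", which excludes the validation value.
strict_train_pattern = r"\bloss : ([0-9\.]+)"

# Hypothetical log line; not taken from the actual training output.
sample_line = "Epoch : 1 - loss : 0.6931 - val_loss : 0.7012"

train_matches = re.findall(train_pattern, sample_line)
val_matches = re.findall(val_pattern, sample_line)
strict_matches = re.findall(strict_train_pattern, sample_line)

print("train pattern:       ", train_matches)   # captures both numbers
print("validation pattern:  ", val_matches)     # captures only val_loss
print("strict train pattern:", strict_matches)  # captures only loss
```

If the training script logs both values, a stricter training pattern may keep the `train:error` metric from picking up validation losses.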

In [9]:
import torch
device = torch.device("cuda")
device

device(type='cuda')
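
With data parallelism, each GPU runs its own copy of the model on a different slice of the data. The estimator above requests a single `ml.p3.16xlarge` instance, which has 8 GPUs. The sketch below works out the resulting global batch size, assuming `src/train.py` treats the `batch_size` hyperparameter as a per-GPU value (this depends on the script and is not confirmed by the notebook). The lookup table and function name are hypothetical.

```python
# GPUs per instance for the instance type used in this notebook.
GPUS_PER_INSTANCE = {"ml.p3.16xlarge": 8}

def global_batch_size(per_gpu_batch_size, instance_type, instance_count=1):
    """Samples processed per optimizer step across all data-parallel workers."""
    # One worker process per GPU, on every instance.
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    return per_gpu_batch_size * world_size

# Values taken from the estimator definition: batch_size=32, instance_count=1.
print(global_batch_size(32, "ml.p3.16xlarge", instance_count=1))  # 256
```

If the script instead divides `batch_size` across workers, the global batch stays at 32 and each GPU sees 4 samples per step.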

In [10]:
%%time
from sagemaker.inputs import TrainingInput

train = TrainingInput(s3_train_data, content_type='image/png',input_mode='File')
val = TrainingInput(s3_val_data, content_type='image/png',input_mode='File')
estimator.fit({'train':train, 'val': val})

2022-04-04 23:39:50 Starting - Starting the training job...
2022-04-04 23:40:14 Starting - Launching requested ML instancesProfilerReport-1649115589: InProgress
.........
2022-04-04 23:41:35 Starting - Preparing the instances for training......
2022-04-04 23:42:51 Downloading - Downloading input data......
2022-04-04 23:43:35 Training - Downloading the training image....................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-04-04 23:47:05,081 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-04-04 23:47:05,158 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-04-04 23:47:05,164 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2022-04-04 23:47:05,165 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-04-04 23:47: