# MNIST Training using PyTorch

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)

---

## Background

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial will show how to train and test an MNIST model on SageMaker using PyTorch.

For more information about PyTorch in SageMaker, see the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by creating a SageMaker session and specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).


In [10]:
install_needed = True  # should only be True once
install_needed = False

In [11]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install --upgrade pip 
    !{sys.executable} -m pip install -U sagemaker smdebug
    IPython.Application.instance().kernel.do_shutdown(True)

In [12]:
import sagemaker
from sagemaker.debugger import Rule, ProfilerRule, rule_configs
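# The three *ProfilingConfig classes are used when building the profiler configuration later on.
from sagemaker.debugger import (
    ProfilerConfig,
    FrameworkProfile,
    DetailedProfilingConfig,
    DataloaderProfilingConfig,
    PythonProfilingConfig,
)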
import time

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-mnist'

role = sagemaker.get_execution_role()

In [13]:
sagemaker.__version__

'2.24.2'

## Data
### Getting the data



In [14]:
from torchvision import datasets, transforms

datasets.MNIST('data', download=True, transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
]))

Dataset MNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )

### Uploading the data to S3
We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies that location; we will use it later when we start the training job.


In [15]:
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-322537213286/sagemaker/DEMO-pytorch-mnist


## Train
### Training script
The `mnist.py` script provides all the code we need for training and hosting a SageMaker model, including the `model_fn` function that loads a model for hosting.
The training script is very similar to one you might run outside of SageMaker, but you can access useful properties of the training environment through environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.
  These artifacts are uploaded to S3 for model hosting.
* `SM_NUM_GPUS`: The number of GPUs available in the current container.
* `SM_CURRENT_HOST`: The name of the current container on the container network.
* `SM_HOSTS`: A JSON-encoded list containing all the hosts.

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.
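The argument handling described above can be sketched as follows. This is a minimal illustration, not the actual contents of `mnist.py`: the `--lr` hyperparameter, the default values, and the local fallback paths are assumptions chosen so the script also runs outside a SageMaker container.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker environment settings."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments, e.g. `--epochs 50`.
    # The default of 10 is an assumption for this sketch.
    parser.add_argument("--epochs", type=int, default=10)
    # Hypothetical extra hyperparameter, for illustration only.
    parser.add_argument("--lr", type=float, default=0.01)

    # Environment-derived settings. The fallbacks are assumptions that only
    # matter when the script runs outside a SageMaker container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    parser.add_argument("--num-gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--current-host", default=os.environ.get("SM_CURRENT_HOST", "localhost"))
    # SM_HOSTS is a JSON-encoded list, so decode it with json.loads.
    parser.add_argument(
        "--hosts",
        type=json.loads,
        default=json.loads(os.environ.get("SM_HOSTS", '["localhost"]')),
    )

    return parser.parse_args(argv)


# Example: parse an explicit argument list instead of sys.argv.
args = parse_args(["--epochs", "50"])
print(args.epochs, args.model_dir, args.hosts)
```

Reading the environment variables as argument defaults keeps the same script usable both locally and inside a training job, since any value can still be overridden on the command line.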

Because SageMaker imports the training script, you should put your training code in a main guard (`if __name__=='__main__':`) if you are using the same script to host your model, as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.
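The effect of the main guard can be seen with a small stand-alone experiment. The script text, the `train` function, and the `import_script` helper below are hypothetical stand-ins, not the contents of `mnist.py` or the way SageMaker itself loads the file; the sketch only shows that importing a guarded script makes `model_fn` available without running the training code.

```python
import importlib.util
import os
import tempfile

# Hypothetical stand-in for a combined training/hosting script.
SCRIPT = '''
TRAINING_RAN = False

def model_fn(model_dir):
    # A real script would load and return the trained model from model_dir.
    return "model loaded from " + model_dir

def train():
    global TRAINING_RAN
    TRAINING_RAN = True

if __name__ == "__main__":
    train()  # runs only when the file is executed as a script
'''


def import_script(path, module_name="user_script"):
    """Import a Python file by path, the way any importing process would."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "guarded_script.py")
    with open(script_path, "w") as f:
        f.write(SCRIPT)
    module = import_script(script_path)

# On import, __name__ is "user_script", so the guarded block is skipped.
print(module.TRAINING_RAN)
print(module.model_fn("/opt/ml/model"))
```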

For example, the script run by this notebook:

In [16]:
!pygmentize mnist.py

import argparse
import json
import logging
import os
# import sagemaker_containers
import sys
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import t

In [37]:
!python mnist.py --epochs 20

The model starts training on the local host without SageMaker TrainingJob.
Get train data loader
Get test data loader
Processes 60000/60000 (100%) of train data
Processes 10000/10000 (100%) of test data
Test set: Average loss: 0.1925, Accuracy: 94% (9423/10000)

Test set: Average loss: 0.1232, Accuracy: 96% (9611/10000)

Test set: Average loss: 0.0980, Accuracy: 97% (9703/10000)

Test set: Average loss: 0.0827, Accuracy: 97% (9740/10000)

Test set: Average loss: 0.0726, Accuracy: 98% (9777/10000)

Test set: Average loss: 0.0699, Accuracy: 98% (9779/10000)

Test set: Average loss: 0.0607, Accuracy: 98% (9819/10000)

Test set: Average loss: 0.0544, Accuracy: 98% (9831/10000)

Test set: Average loss: 0.0529, Accuracy: 98% (9835/10000)

Test set: Average loss: 0.0512, Accuracy: 98% (9836/10000)

Test set: Average loss: 0.0528, Accuracy: 98% (9832/10000)

Test set: Average loss: 0.0487, Accuracy: 98% (9845/10000)

Test set: Average loss: 0.0445, Accuracy: 99% (9858/10000)

Test set: Average

### Run training in SageMaker Data Parallel


Multi-GPU distributed training on AWS can use both data parallelism and model parallelism. The example below focuses on data parallelism.

SageMaker Distributed Data Parallel: provides a data-parallel distributed training algorithm optimized for AWS SageMaker, using the AWS network infrastructure and Balanced Fusion Buffers.



In [38]:
metric_definitions=[
     {'Name': 'train:Loss', 'Regex': 'Loss: (.*?),'},
     {'Name': 'test:Accuracy', 'Regex': 'Accuracy: (.*?)%'},
]
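To see what these metric regular expressions capture, the sketch below applies them to log lines with Python's `re.search`. This only approximates how the service scans the logs, which is an assumption here; `extract_metrics` is a hypothetical helper, and the training log line is invented for illustration. The test log line is copied from the local run earlier in this notebook.

```python
import re

metric_definitions = [
    {"Name": "train:Loss", "Regex": "Loss: (.*?),"},
    {"Name": "test:Accuracy", "Regex": "Accuracy: (.*?)%"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: captured text} for every regex that matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = match.group(1)
    return found


# Line taken from the local training output shown earlier.
test_line = "Test set: Average loss: 0.1925, Accuracy: 94% (9423/10000)"
# Hypothetical training log line; the real script's format may differ.
train_line = "Train Epoch: 1 Loss: 0.2345, Step: 100"

print(extract_metrics(test_line, metric_definitions))
print(extract_metrics(train_line, metric_definitions))
```

Note that the patterns are case-sensitive: `Loss: (.*?),` does not match the lowercase `loss:` in the test line, so only `test:Accuracy` is captured there. The loss pattern also needs a comma after the value in order to match.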

In [39]:
rules=[ 
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

In [40]:
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        start_step=5,num_steps=10,
        detailed_profiling_config=DetailedProfilingConfig(start_step=2, num_steps=1),
        dataloader_profiling_config=DataloaderProfilingConfig(start_step=3, num_steps=1),
        python_profiling_config=PythonProfilingConfig(start_step=4, num_steps=1), # cprofile / Pyinstrument
    )
)

In [41]:
distribution = {"smdistributed": {
                    "dataparallel": {
                            "enabled": True
                    }
               }
             }

In [42]:
instance_type = 'ml.p3.16xlarge'
instance_count = 1
entry_point = 'mnist_smdp.py'

In [47]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point=entry_point,
                    role=role,
                    framework_version='1.6.0',
                    py_version='py36',
                    instance_count=instance_count,
                    instance_type=instance_type,
                    distribution=distribution,
                    metric_definitions=metric_definitions,
                    profiler_config=profiler_config,
                    rules=rules,
                    use_spot_instances=True,
                    max_wait=3*60*60,
                    max_run=3*60*60,
                    hyperparameters={
                        'epochs': 50
                    }
                   )

After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.


In [48]:
job_name = "training-job-{}".format(int(time.time()))

estimator.fit({'training': inputs}, job_name=job_name)

2021-02-04 14:35:12 Starting - Starting the training job...
2021-02-04 14:35:37 Starting - Launching requested ML instances
LossNotDecreasing: InProgress
ProfilerReport: InProgress
.........
2021-02-04 14:37:09 Starting - Preparing the instances for training.........
2021-02-04 14:38:42 Downloading - Downloading input data
2021-02-04 14:38:42 Training - Downloading the training image.................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-02-04 14:41:27,217 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-02-04 14:41:27,296 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.

2021-02-04 14:41:44 Training - Training image download completed. Training in progress.
2021-02-04 14:41:33,542 sagemaker_pytorch_container.training INFO     Invoking SMDataParallel
2021-02-04 14:41:33,543 sa

## Downloading the Debugger Profiling Report
The profiling report rule generates an HTML report, `profiler-report.html`. The report summarizes the built-in rules and gives recommendations for next steps. It is stored in the S3 bucket; run the cells below to download it to the notebook. For details, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).

In [58]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

You will find the profiler report in s3://sagemaker-us-east-1-322537213286/training-job-1612449311/rule-output


In [59]:
!aws s3 ls {rule_output_path}/ProfilerReport/profiler-output/

                           PRE profiler-reports/
2021-02-04 14:45:33     428098 profiler-report.html
2021-02-04 14:45:33     292281 profiler-report.ipynb


In [64]:
import os

output_dir = './output'
!rm -rf $output_dir

profile_output = output_dir+'/ProfilerReport'

if not os.path.exists(profile_output):
    os.makedirs(profile_output)

In [65]:
!aws s3 cp {rule_output_path}/ProfilerReport/profiler-output/ {output_dir}/ProfilerReport/ --recursive

download: s3://sagemaker-us-east-1-322537213286/training-job-1612449311/rule-output/ProfilerReport/profiler-output/profiler-report.html to output/ProfilerReport/profiler-report.html
download: s3://sagemaker-us-east-1-322537213286/training-job-1612449311/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json to output/ProfilerReport/profiler-reports/Dataloader.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1612449311/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to output/ProfilerReport/profiler-reports/CPUBottleneck.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1612449311/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to output/ProfilerReport/profiler-reports/BatchSize.json
download: s3://sagemaker-us-east-1-322537213286/training-job-1612449311/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json to output/ProfilerReport/profiler-reports

In [66]:
from IPython.display import display, HTML

display(HTML('<b>ProfilerReport : <a href="{}profiler-report.html">Profiler Report</a></b>'.format(output_dir+"/ProfilerReport/")))


## Host
### Create endpoint
After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a SageMaker Endpoint -- a hosted prediction service that we can use to perform inference.

As mentioned above, the `mnist.py` script contains the required implementation of `model_fn`. We are going to use the default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fn` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a CPU model, similar to what we did in `mnist.py`. Here we will deploy the model to a single `ml.m4.xlarge` instance.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

### Evaluate
We can now use this predictor to classify hand-written digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the `predictor`.

In [None]:
from IPython.display import HTML
HTML(open("input.html").read())

In [None]:
import numpy as np

image = np.array([data], dtype=np.float32)
response = predictor.predict(image)
prediction = response.argmax(axis=1)[0]
print(prediction)

### Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
predictor.delete_endpoint()