# Grant Perkins CS 4513 Final Project

In this project, I developed a distributed machine learning solution with AWS SageMaker.

Features:
 - Distributed computing:
   + One virtual computer runs this Jupyter Notebook
   + One virtual computer stores the dataset
   + N virtual computers run the training job
   + One virtual computer runs the inference job (small load)
 - Machine Learning:
   + Trains a custom neural network to recognize digits in the MNIST dataset
   + Customizable hyperparameters (epochs, batch size, learning rate)
   + Runs inference (classification) on an arbitrary input image

I started this project on 4/30/2021, for CS 4513.

## Section 1: Data

In this section, I download the dataset and upload it to Amazon S3, the AWS service for long-term object storage.
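For orientation, the upload step in this section uses `Session.upload_data` on a directory, which returns an S3 URI of the form `s3://{bucket}/{key_prefix}`. When no bucket is passed, the SageMaker session falls back to its default bucket, named `sagemaker-{region}-{account_id}`. The sketch below only reproduces that string format; the helper names, the region, and the account ID are placeholders invented for illustration and are not taken from this project.

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Name pattern of the default bucket a SageMaker session uses."""
    return f"sagemaker-{region}-{account_id}"


def directory_upload_uri(bucket_name: str, key_prefix: str) -> str:
    """URI format returned when a whole directory is uploaded."""
    return f"s3://{bucket_name}/{key_prefix}"


# Placeholder region and account ID (assumptions, not real values).
uri = directory_upload_uri(
    default_bucket_name("us-east-1", "123456789012"),
    "CS4513MNIST",  # the key prefix used in this notebook
)
print(uri)
```

The returned URI is what later gets passed to `estimator.fit` as the training input location.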

In [1]:
N_workers = 16  # must be a multiple of 8 (each ml.p3.16xlarge instance has 8 GPUs)
epochs = 10000
batch_size = 64  # one batch per epoch
learning_rate = 0.000125

hyperparameters = {
    "epochs": epochs,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}
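SageMaker passes each entry of `hyperparameters` to the training entry point as a command-line argument (for example `--epochs 10000`). The contents of `test.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might read them. The function name `parse_hyperparameters` and the default values are assumptions made for illustration.

```python
import argparse
from typing import List, Optional


def parse_hyperparameters(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Read the three hyperparameters from command-line style arguments."""
    parser = argparse.ArgumentParser()
    # Defaults are hypothetical; the notebook always supplies all three values.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    # parse_known_args ignores any extra arguments the platform may add,
    # such as a model directory.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments produced by the hyperparameters dict above.
args = parse_hyperparameters(
    ["--epochs", "10000", "--batch_size", "64", "--learning_rate", "0.000125"]
)
print(args.epochs, args.batch_size, args.learning_rate)
```

Calling `parse_hyperparameters()` with no argument reads `sys.argv`, which is how it would be used inside a real training script.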

In [2]:
import tensorflow as tf
import sagemaker as sage
from sagemaker.tensorflow import TensorFlow

In [3]:
# key prefix (folder) within the session's default S3 bucket
bucket = "CS4513MNIST"
# temp download location of MNIST dataset
mnist_directory = "/home/ec2-user/.keras/datasets"
sess = sage.Session()
role = sage.get_execution_role()

In [4]:
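# download one copy of MNIST per worker into the Keras cache (mnist_directory)
for i in range(N_workers):
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % i)
# upload the whole cache directory to S3; returns the S3 URI of the upload
data_location = sess.upload_data(mnist_directory, key_prefix=bucket)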

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz

## Section 2: Training

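Now that the data is in the correct place, I can train a neural network on the dataset. Training runs on `N_workers // 8` separate computers, each an `ml.p3.16xlarge` instance with 8 GPUs, so `N_workers` GPU workers train in parallel. The number of computers can be any natural number, which is why `N_workers` must be a multiple of 8.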
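The general idea behind data-parallel training is that each worker computes a gradient on its own shard of the batch, and the per-worker gradients are averaged before the weights are updated. The sketch below illustrates that idea in plain Python on a one-parameter linear model. It is not the SageMaker data parallel library's implementation and not the network trained in this project; the model, the toy data, and all function names are assumptions chosen to keep the example small.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a 1-D linear model y ~ w * x


def gradient(w: float, samples: Sequence[Sample]) -> float:
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def shard(samples: Sequence[Sample], n_workers: int) -> List[Sequence[Sample]]:
    """Split the batch into n_workers equal, contiguous shards."""
    if len(samples) % n_workers != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    size = len(samples) // n_workers
    return [samples[i * size:(i + 1) * size] for i in range(n_workers)]


def data_parallel_step(
    w: float, samples: Sequence[Sample], n_workers: int, learning_rate: float
) -> float:
    """One SGD step where each worker handles one shard of the batch."""
    # Each worker computes a gradient on its own shard.
    local_grads = [gradient(w, s) for s in shard(samples, n_workers)]
    # Averaging the local gradients stands in for the all-reduce step.
    avg_grad = sum(local_grads) / n_workers
    return w - learning_rate * avg_grad


# Toy batch drawn from y = 3x (illustrative data, not MNIST).
batch = [(float(x), 3.0 * x) for x in range(1, 17)]
w_parallel = data_parallel_step(0.0, batch, n_workers=8, learning_rate=0.000125)
w_single = 0.0 - 0.000125 * gradient(0.0, batch)
print(abs(w_parallel - w_single) < 1e-12)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so adding workers changes how the work is divided rather than the update itself.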

In [9]:
estimator = TensorFlow(
    base_job_name='mnist-cs4513',
    source_dir='code',
    entry_point='test.py',
    role=role,
    py_version='py37',
    framework_version='2.4.1',
    hyperparameters=hyperparameters,
    instance_count=N_workers // 8,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sess,
    distribution={
        'smdistributed': {
            'dataparallel': {
                'enabled': True
            }
        }
    }
)

estimator.fit(data_location)

2021-05-01 16:47:32 Starting - Starting the training job...
2021-05-01 16:47:34 Starting - Launching requested ML instances.........
ProfilerReport-1619887652: InProgress
2021-05-01 16:49:29 Starting - Preparing the instances for training.........
2021-05-01 16:50:50 Downloading - Downloading input data...
2021-05-01 16:51:30 Training - Downloading the training image...........
2021-05-01 16:53:15.410871: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-05-01 16:53:15.417079: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-05-01 16:53:15.535354: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-01 16:53:15.650645: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initia