# Grant Perkins CS 4513 Final Project

In this project, I developed a distributed machine learning solution with AWS SageMaker.

Features:
 - Distributed computing:
   + One virtual computer runs this Jupyter Notebook
   + One virtual computer stores the dataset
   + N virtual computers run the training job
 - Machine Learning:
   + Trains a custom neural network to recognize digits in the MNIST dataset
   + Customizable hyperparameters (epochs, batch size, learning rate)

I started this project on 4/30/2021, for CS 4513.

## Section 1: Data

In this section, I set the training hyperparameters, download the MNIST dataset, and upload it to Amazon S3, AWS's long-term object storage service.

In [None]:
# create hyperparameters
N_workers = 16  # must be a multiple of 8 (the job uses N_workers // 8 instances)
epochs = 10000
batch_size = 64  # one batch per epoch
learning_rate = 0.000125

hyperparameters = {
    "epochs": epochs,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}
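
In SageMaker script mode, the entries of `hyperparameters` reach the entry point as command-line arguments (for example `--epochs 10000`). The contents of `mnist.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might read them. The function name `parse_hyperparameters` and the default values are hypothetical.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the three hyperparameters defined in the notebook.

    Hypothetical helper: the real mnist.py may read them differently.
    """
    parser = argparse.ArgumentParser()
    # Argument names mirror the dictionary keys in the notebook.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    # parse_known_args ignores any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training job might pass to the script.
args = parse_hyperparameters(
    ["--epochs", "10000", "--batch_size", "64", "--learning_rate", "0.000125"]
)
print(args.epochs, args.batch_size, args.learning_rate)
```

Using `parse_known_args` rather than `parse_args` is a defensive choice here: the script keeps working if additional arguments are supplied alongside the three above.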

In [None]:
# import dependencies
import tensorflow as tf
import sagemaker as sage
from sagemaker.tensorflow import TensorFlow

In [None]:
# set up SageMaker connections

# key prefix (folder) within the session's default S3 bucket
bucket = "CS4513MNIST"
# temp download location of MNIST dataset
mnist_directory = "/home/ec2-user/.keras/datasets"
sess = sage.Session()
role = sage.get_execution_role()

In [None]:
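# download one copy of MNIST per worker into ~/.keras/datasets (mnist-0.npz, mnist-1.npz, ...)
for i in range(N_workers):
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % i)
# upload the whole directory to S3; returns the S3 URI of the uploaded data
data_location = sess.upload_data(mnist_directory, key_prefix=bucket)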

## Section 2: Training

Now that the data is in S3, I can train a neural network on it. The training job runs `N_workers` workers spread across `N_workers // 8` separate computers, so `N_workers` must be a positive multiple of 8.
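
The estimator in this section requests `N_workers // 8` instances, and integer division silently rounds down, so a worker count such as 12 would quietly become a single instance. A small check like the one sketched here could catch that before a job is launched. The helper name `instance_count_for` and the constant `GPUS_PER_INSTANCE` are hypothetical and not part of the original notebook; the value 8 assumes the `ml.p3.16xlarge` instance type, which has 8 GPUs.

```python
GPUS_PER_INSTANCE = 8  # assumption: ml.p3.16xlarge, 8 GPUs per instance


def instance_count_for(n_workers, gpus_per_instance=GPUS_PER_INSTANCE):
    """Return how many instances are needed for n_workers workers.

    Raises ValueError instead of rounding down when n_workers does not
    fill a whole number of instances.
    """
    if n_workers <= 0 or n_workers % gpus_per_instance != 0:
        raise ValueError(
            "n_workers must be a positive multiple of %d, got %d"
            % (gpus_per_instance, n_workers)
        )
    return n_workers // gpus_per_instance


# With the notebook's setting of 16 workers, two instances are requested.
print(instance_count_for(16))
```

The result could be passed as `instance_count` in place of the bare `N_workers // 8` expression.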

In [None]:
"""
Make the estimator. It creates a training job that distributes the work across `N_workers` workers on `N_workers // 8` computers.
Training runs on GPUs. The job runs `mnist.py`, and the training set is distributed among the workers. See `mnist.py` for more details.
"""
estimator = TensorFlow(
    base_job_name='mnist-cs4513',
    source_dir='code',
    entry_point='mnist.py',
    role=role,
    py_version='py37',
    framework_version='2.4.1',
    hyperparameters=hyperparameters,
    instance_count=N_workers // 8,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sess,
    distribution={
        'smdistributed': {
            'dataparallel': {
                'enabled': True
            }
        }
    }
)

estimator.fit(data_location)