# Project Overview and Goals

_Team Members: Vamsi Banda, Rhett D’souza, Lukas Justen, and Keith Pallo_

In this project our team set out to train a deep neural network by leveraging a PyTorch stack on top of AWS. Our goals were to gain experience in wrangling a large dataset and using some of the newer technologies that have gained traction in industry.


Our overarching goal was to perform image classification on the CIFAR-10 dataset. This is a classic research dataset consisting of 60,000 32x32 color images (50,000 for training and 10,000 for testing) with 10 discrete output classes. Some example images from the dataset can be seen here.

<center> Example of Cifar 10 Data</center>

![Cifar_sample](documentation_images/cifar_examples.jpg)

Our project consists of several Jupyter Notebooks and Python scripts, each of which runs in a different AWS service in the cloud. This notebook serves as the primary documentation and overview of the completed work, and all referenced files can be found in the associated directories as described below.

Link to the CIFAR-10 website: https://www.cs.toronto.edu/~kriz/cifar.html

# Overview of Architecture

AWS has many different options for training, validating, and deploying deep learning models, so at the start of the project it was a somewhat daunting task to choose a particular architecture.

However, after looking through the available options, our team decided to use the newer SageMaker service, which is what AWS calls a "managed service". SageMaker allows developers to quickly bundle together different AWS services, such as classic compute (called EC2) and object storage (called S3), using purpose-built commands. All of the code can be run from a SageMaker Notebook instance, which can contain multiple file types, including Jupyter Notebooks that import the SageMaker SDK. Additionally, because this is a managed service, a significant amount of setup, such as making sure the correct packages are installed, has already been handled. This is a huge advantage of the system, as we do not have to be concerned with configuration issues on the remote systems.

For our particular application we chose to train on CIFAR-10 by uploading the raw data to our own S3 bucket, using EC2 to train the model, and then storing the trained model parameters in the same S3 bucket. We also configured deployment of our model using the AWS services Lambda and API Gateway so that we can test our trained models. Finally, we created a simple Android application using Google's Flutter framework, which is included in our submission for reference as well. Below is a general overview of the architecture; we go into further depth on deployment in the following sections.

<img src="documentation_images/sagemaker-architecture.png" height="600" width="600">

To complete our task we created several AWS source files, which we have included for reference. Below is a list of their names and purposes.

### Notable Files
     
     ├── documentation_images     (directory)      # Holds reference images
     
     ├── aws_full_reference_files (directory)
          ├── DeepLearningModel.ipynb              # Contains code to setup connections 
          ├── pytorch_cifar.py                     # Core model code
          
     ├── android_app (directory)
          ├── app-release.apk                      # Android app (created from Flutter)
     

# Setting up our environment - Sagemaker Connections and Setup

The code below is executed in a SageMaker Jupyter Notebook instance to set up our S3 bucket and prepare the data for training and testing. It is a good example of the SageMaker commands available, which let us interface directly with other AWS services.

```python

# Import SageMaker packages and define the associated role for this particular instance
import sagemaker
from sagemaker import get_execution_role

# Import PyTorch and torchvision for data loading and preprocessing
import torch
import torchvision
import torchvision.transforms as transforms

sess = sagemaker.Session()

role = get_execution_role()
bucket = 'dlbucket435'
data_key = 'cifar-10-python.tar.gz'
model_out = 'model_1'
data_location = 's3://{}/{}'.format(bucket, data_key)
download_in = './cifar'

# Data preparation for data viewing, testing and usage in the main script (pytorch_cifar.py).

# Transformation functions: convert to tensors, then normalize each channel
transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Set up output classes
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Load CIFAR-10 from the local directory, downloading a fresh copy if needed, and apply the transformation
trainset = torchvision.datasets.CIFAR10(root=download_in, train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100, shuffle=False, num_workers=2)

# Compress the local copy of CIFAR-10 into a tarball (the copy already on AWS can be used instead)
!tar -zcvf cifar_10.tar.gz ./cifar

# Upload the raw data to S3; the estimator's fit function downloads it later during training via the inputs URI
inputs = sess.upload_data(path='cifar', bucket=bucket)
print('input spec (in this case, just an S3 path): {}'.format(inputs))
```
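To make the effect of the `transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` step concrete, the sketch below applies the same per-channel arithmetic, `(x - mean) / std`, in plain Python. `ToTensor` scales 8-bit pixel values to `[0, 1]`, so with a mean and standard deviation of 0.5 the values end up in `[-1, 1]`. The helper name `normalize_pixel` is ours, chosen only for illustration.

```python
def normalize_pixel(value_8bit, mean=0.5, std=0.5):
    """Mimic ToTensor followed by Normalize for a single channel value."""
    scaled = value_8bit / 255.0      # ToTensor: [0, 255] -> [0.0, 1.0]
    return (scaled - mean) / std     # Normalize: (x - mean) / std


# Darkest, middle, and brightest 8-bit values
for v in (0, 128, 255):
    print(v, round(normalize_pixel(v), 3))
```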

# Training our Model - Modified LeNet and VGG

To train our PyTorch model in the SageMaker environment, we must create a SageMaker PyTorch estimator that calls an entry point script. The script must implement the key methods required by the SageMaker estimator interface, as noted in the [SageMaker documentation](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/pytorch).
```python

from sagemaker.pytorch import PyTorch
# Define the estimator
estimator = PyTorch(entry_point='pytorch_cifar.py',
                            role=role,
                            framework_version='1.0.0',
                            train_instance_count=1,
                            train_instance_type='ml.c4.8xlarge',
                            output_path='s3://dlbucket435/model/'
                            )

# Fit the model with the training data
estimator.fit({'training':inputs})
```
The `fit` function initiates the training job: it spins up the AWS compute instance (in our case an `ml.c4.8xlarge`), feeds the data from the S3 bucket to the training job, and starts training.
The trained model, along with any logs or checkpoints that the entry point script writes to the local `model_dir` directory, is saved to the path specified by `output_path` when the PyTorch estimator is instantiated. In this case, we send it to the S3 bucket __dlbucket435__.

In the entry point script, we define our PyTorch models to be trained, the cost function, optimizers, data I/O, the training procedure, model saving, model loading and prediction.
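As a rough illustration of how an entry point script can find its data and output locations, the sketch below reads the environment variables that SageMaker training containers set (`SM_MODEL_DIR`, and `SM_CHANNEL_TRAINING` for a channel named `training`). It is not the contents of `pytorch_cifar.py`; the `--epochs` and `--batch-size` arguments, their defaults, and the local fallback paths are assumptions made for the example.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters; SageMaker passes estimator hyperparameters as CLI arguments
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=100)
    # Directory whose contents are packaged and uploaded to output_path after training
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    # Local directory where the 'training' channel from fit({'training': inputs}) is downloaded
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', './cifar'))
    return parser.parse_args(argv)


args = parse_args([])
print(args.model_dir, args.data_dir)
```

Falling back to local paths when the variables are absent lets the same script run outside SageMaker for quick debugging.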

We trained two networks: a modified version of LeNet and VGG11. We used the Adam optimizer with a cross-entropy loss function. For reference, the model currently deployed to our endpoint (and hence used by our application) is VGG11.
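For readers less familiar with the loss, the following sketch computes softmax cross-entropy for a single sample in plain Python. It only illustrates the formula `-log(softmax(logits)[target])`; in the project itself the loss comes from PyTorch, and the function name `cross_entropy` here is our own.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# An untrained 10-class model that outputs equal scores has a loss of ln(10)
uniform = cross_entropy([0.0] * 10, target=3)
print(round(uniform, 4))  # 2.3026
```

A loss near 2.3 at the start of training is therefore what one would expect for a 10-class problem such as CIFAR-10.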

Class for the wider version (32 channels and 64 channels) of LeNet:
```python
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 64, 5)
        self.fc1 = nn.Linear(64 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```
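The `64 * 5 * 5` input size of `fc1` follows from the spatial size of the feature map after the two convolution and pooling stages. The sketch below traces a 32x32 CIFAR-10 image through those stages using the standard output-size formula; the helper `conv_out` is ours and is not part of the project code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)             # conv1 (5x5, no padding): 32 -> 28
size = conv_out(size, kernel=2, stride=2)   # 2x2 max pool:            28 -> 14
size = conv_out(size, kernel=5)             # conv2 (5x5, no padding): 14 -> 10
size = conv_out(size, kernel=2, stride=2)   # 2x2 max pool:            10 -> 5

flat_features = 64 * size * size            # 64 output channels from conv2
print(size, flat_features)  # 5 1600
```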
Class for VGG architectures, with VGG11 being selected and trained:
```python
cfg = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
    }

class VGG(nn.Module):
    def __init__(self, vgg_name):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfg[vgg_name])
        self.classifier = nn.Linear(512, 10)

    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
        return out

    def _make_layers(self, cfg):
        layers = []
        in_channels = 3
        for x in cfg:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        layers += [nn.AvgPool2d(kernel_size=1, stride=1)]
        return nn.Sequential(*layers)
```
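The classifier is a single `nn.Linear(512, 10)` because, for 32x32 inputs, the five max-pooling layers reduce the feature map to 1x1 with 512 channels. The sketch below walks the `VGG11` configuration to confirm this; `trace_shape` is a hypothetical helper written for this explanation.

```python
VGG11 = [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M']


def trace_shape(layer_cfg, in_channels=3, size=32):
    """Return (channels, spatial size) after the feature layers."""
    for x in layer_cfg:
        if x == 'M':
            size //= 2        # 2x2 max pool with stride 2 halves height and width
        else:
            in_channels = x   # 3x3 conv with padding=1 keeps the spatial size
    return in_channels, size


channels, size = trace_shape(VGG11)
print(channels, size, channels * size * size)  # 512 1 512
```

The final `nn.AvgPool2d(kernel_size=1, stride=1)` leaves the shape unchanged, so it does not affect this count.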

We then load the training data (similar to the `trainloader` shown earlier) from the temporary local directory where it is placed after being downloaded from S3, normalize the data, and begin training.
Refer to the entry point script (`pytorch_cifar.py`) for a line-by-line definition of the training process.

After the model has been successfully trained, we load the SageMaker PyTorchModel, and deploy the model onto an endpoint to be used for inference.

```python
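from sagemaker.pytorch import PyTorchModel

# Point the model at the artifacts produced by the training job
pytorch_model = PyTorchModel(
    model_data='s3://dlbucket435/model/sagemaker-pytorch-2019-03-18-09-59-04-255/output/model.tar.gz',
    role=role,
    entry_point='pytorch_cifar.py')

# Deploy the model to a real-time inference endpoint
predictor = pytorch_model.deploy(instance_type='ml.c4.8xlarge', initial_instance_count=1)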
```

The `deploy` function spins up an `ml.c4.8xlarge` compute instance and deploys the model, using the `model_fn` function implemented in the entry point script to load the model artifacts that training stored in S3 (under `output_path`).

This endpoint is then used in the next section, to serve the PyTorch model for live inference.

# Deployment - Lambda and API Gateway

![alt text](AppGatewayLambdaSagemaker.png "Architecture for the deployment part of our project.")

### Mobile Application

We used the mobile app development framework Flutter to build a small mobile app that takes a picture and samples it down to a 32 x 32 image, which is then classified by our model. To invoke the inference function of the model, we needed something that allows our app to connect to AWS. Luckily, AWS provides its users with API Gateway. After the app invokes that API and receives a classification of the image, it displays the result. The Android version of our application has been provided if testing is desired.

### API Gateway

As already mentioned, API Gateway allows our app to connect to the model we created using Amazon SageMaker. API Gateway provides a small web interface that lets you build a simple REST API. We use that gateway to connect to a Lambda function, which handles the user's request to classify the image. We use a simple POST method to upload a base64-encoded image.
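To illustrate the request format, here is a minimal sketch of how a client could build the POST body, assuming the image is sent as raw 8-bit pixel bytes in channel-first order (3 x 32 x 32) under a `data` key, which is what the Lambda function in the next section reads. The function name `build_request_body` and the all-black placeholder image are ours; the actual app is written with Flutter.

```python
import base64
import json

IMAGE_BYTES = 3 * 32 * 32  # channels x height x width, one byte per value


def build_request_body(pixels):
    """Encode raw uint8 pixel bytes as the JSON body expected by the API."""
    if len(pixels) != IMAGE_BYTES:
        raise ValueError('expected {} bytes, got {}'.format(IMAGE_BYTES, len(pixels)))
    encoded = base64.b64encode(pixels).decode('ascii')
    return json.dumps({'data': encoded})


dummy = bytes(IMAGE_BYTES)  # all-black placeholder image
body = build_request_body(dummy)

# Round trip: decoding the payload recovers the original bytes
decoded = base64.decodebytes(json.loads(body)['data'].encode())
print(len(decoded))  # 3072
```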

### Lambda Function

AWS provides users with Lambda functions that run code in response to events. In our case, this event is a request received by API Gateway. Lambda automatically manages the compute resources for its users, which makes it very convenient and scalable to use Lambda functions for inference. Our Lambda function consists of one Python script with a single function that is called in response to the received event. The script converts the image into the proper representation for the trained network. After SageMaker returns a classification, the Lambda function returns the response to API Gateway and finally to the mobile application.

```python
import json
import base64
import os
import boto3
import numpy as np

# Name of the deployed SageMaker endpoint, configured as a Lambda environment variable
ENDPOINT_NAME = os.environ['ENDPOINT_NAME']
runtime = boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    # The request carries the base64-encoded raw pixel bytes under the 'data' key
    payload = event['data']
    raw = base64.decodebytes(payload.encode())
    # Interpret the bytes as a batch containing one 3 x 32 x 32 uint8 image
    pixels = np.frombuffer(raw, dtype=np.uint8).reshape((1, 3, 32, 32))
    body = json.dumps(pixels.tolist())

    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='application/json',
                                       Body=body)

    result = json.loads(response['Body'].read().decode())

    return {
        'statusCode': 200,
        'body': result
    }
```
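On the receiving side, the response still has to be turned into a label. The sketch below assumes the endpoint returns one row of 10 raw scores per image, ordered like the `classes` tuple defined during setup; that response shape is an assumption, and the example scores are made up purely for illustration.

```python
CLASSES = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


def top_class(scores):
    """Return the class name with the highest score for one image."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASSES[best]


# Made-up scores for a single image; index 3 ('cat') is the largest
example_body = [[0.1, 0.3, -1.2, 2.5, 0.0, 1.9, -0.4, 0.2, 0.6, -0.8]]
print(top_class(example_body[0]))  # cat
```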


### Sagemaker Endpoint

The model created by the SageMaker notebook and the PyTorch Python script is stored in an S3 bucket in our AWS account. Instead of saving a single model, we could also store different versions of it. If a bug turned up in one model, we could easily switch the model that the Lambda function uses for inference. All in all, a great source of capability for our system, and for AWS in general, is that the components described above can be plugged into each other like Legos. This makes it very convenient to build a deep learning pipeline using Amazon SageMaker and AWS!

### Sample Application Output

<img src="documentation_images/SampleOutput.png" height="700" width="700">

# Lessons Learned

Throughout this project, we had several key learnings around the implementation of large deep learning infrastructure. 

Firstly, we experienced a steep learning curve in getting things going on the AWS cloud platform. Although AWS has done a good job of documenting specific use cases, it can be hard to figure out where to start. For example, there are several different ways to collaborate across multiple users (AWS Organizations, IAM Users, etc.), but in order to determine the best method, we had to reach out to an experienced AWS developer in our network. Additionally, debugging a system that uses a managed service can be quite difficult, because a significant amount of configuration has been abstracted away. This difficulty extends to potentially buggy deployments, where there can be a 5-minute lead time between sending a new deployment and testing whether it is operational.

However, despite these difficulties, our group very much enjoyed learning about AWS, and the benefits are massive. For example, deploying our model to an application was extremely easy, a task that would have been daunting if we had not used the platform. Additionally, the ability to "plug and play" with our models was astounding, and after solving some initial system OS issues, it became clear how anyone (ranging from an individual to a startup to a Fortune 50 company) could very quickly get going with deep learning at scale.

Furthermore, we would like to thank Professor Aggelos Katsaggelos for an amazing Deep Learning course experience! 