# Compiling Models with Neuron SDK

In this tutorial, we compile a model with the Neuron SDK and run inference on an AWS Inferentia instance. We first compile a `ResNet50` model and run inference with a batch size of 1. After that, we tune model performance by using multiple NeuronCores with `torch.neuron.DataParallel` and its dynamic batching capabilities.

### Prerequisites 

1. **Selecting an instance.** To run this example, you need to run this notebook on an `inf1.6xlarge` instance. At the time of writing, SageMaker Notebook Instances and SageMaker Studio Notebooks don't support Inferentia-based instances, so you will need to use an AWS EC2 instance instead. We recommend using the latest Deep Learning AMI GPU PyTorch image, which comes with a Jupyter environment pre-installed.

2. **Setting up the Neuron SDK.** Follow the Neuron SDK setup guide to install the SDK and its dependencies. Refer to the latest documentation here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuron/setup/pytorch-install.html

3. **Using the correct Jupyter kernel.** When using this notebook, make sure that you have selected the `Python (Neuron PyTorch)` kernel.

## Compile Model for Neuron SDK
Run the following steps to compile ResNet50 with the Neuron SDK:

1. We start by importing the required libraries, including `torch_neuron`, and downloading the ResNet50 model locally:

In [None]:
import torch
from torchvision import models, transforms, datasets
import torch_neuron

model = models.resnet50(pretrained=True)
# set model into eval mode
model.eval()

2. Next, we analyze the model to check whether any of its operators are not supported by Inferentia and the Neuron SDK. For this, we use a dummy input image (an all-zeros tensor of the expected shape). Since the ResNet50 model is supported, the output of this command should confirm that all the model operators are supported:

In [None]:
image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)
torch.neuron.analyze_model(model, example_inputs=[image])

3. Now we are ready to compile the model by running the following command. The output shows the compilation statistics (such as the number of supported operators) and the overall compilation status:

In [None]:
model_neuron = torch.neuron.trace(model, example_inputs=[image])

4. Since the Neuron SDK compiles the model into a TorchScript program, saving and loading the model is similar to what you would do in regular PyTorch:

In [None]:
model_neuron.save("resnet50_neuron.pt")

# model_neuron = torch.jit.load('resnet50_neuron.pt') # loading compiled model

## Running Inference 

Let's test our compiled model. In the example below, we run inference using both the CPU model and the compiled Neuron model. We then compare the predicted labels from the two models to verify that they are the same.

**Important:** Do not perform inference with a Neuron traced model on an instance without Neuron support, as the results will not be calculated properly.

### Define Helper Functions
Before we begin, we need to define functions to preprocess images and benchmark inference. 

1. We define a basic image preprocessing function that loads a sample image and labels, normalizes and batches the image, and transforms the image into a tensor for inference using the compiled Neuron model:

In [None]:
import os  # used below to build the dataset path

import numpy as np

def preprocess(batch_size=1, num_neuron_cores=1):
    # Define a normalization function using the ImageNet mean and standard deviation
    normalize = transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225])

    # Resize the sample image to [1, 3, 224, 224], normalize it, and turn it into a tensor
    eval_dataset = datasets.ImageFolder(
        os.path.dirname("./torch_neuron_test/"),
        transforms.Compose([
        transforms.Resize([224, 224]),
        transforms.ToTensor(),
        normalize,
        ])
    )
    image, _ = eval_dataset[0]
    image = torch.tensor(image.numpy()[np.newaxis, ...])

    # Create a "batched" image with enough images to go on each of the available NeuronCores
    # batch_size is the per-core batch size
    # num_neuron_cores is the number of NeuronCores being used
    batch_image = image
    for i in range(batch_size * num_neuron_cores - 1):
        batch_image = torch.cat([batch_image, image], 0)
     
    return batch_image

2. We also need to define a benchmarking function to compare inference performance:

In [None]:
from time import time

def benchmark(model, image):
    print('Input image shape is {}'.format(list(image.shape)))
    
    # The first inference loads the model so exclude it from timing 
    results = model(image)
    
    # Collect throughput and latency metrics
    latency = []
    throughput = []

    # Run inference for 100 iterations and calculate metrics
    num_infers = 100
    for _ in range(num_infers):
        delta_start = time()
        results = model(image)
        delta = time() - delta_start
        latency.append(delta)
        throughput.append(image.size(0)/delta)
    
    # Calculate and print the model throughput and latency
    print("Avg. Throughput: {:.0f}, Max Throughput: {:.0f}".format(np.mean(throughput), np.max(throughput)))
    print("Latency P50: {:.0f}".format(np.percentile(latency, 50)*1000.0))
    print("Latency P90: {:.0f}".format(np.percentile(latency, 90)*1000.0))
    print("Latency P95: {:.0f}".format(np.percentile(latency, 95)*1000.0))
    print("Latency P99: {:.0f}\n".format(np.percentile(latency, 99)*1000.0))

3. Below we download a sample image and the ImageNet class labels, which we'll use for benchmarking and accuracy verification:

In [None]:
import json
import os
from urllib import request

os.makedirs("./torch_neuron_test/images", exist_ok=True)
request.urlretrieve("https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg",
                    "./torch_neuron_test/images/kitten_small.jpg")

request.urlretrieve("https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json","imagenet_class_index.json")
idx2label = []

with open("imagenet_class_index.json", "r") as read_file:
    class_idx = json.load(read_file)
    idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]

### Test Compiled Model Accuracy

Let's check whether the compiled model produces the same predictions as the non-compiled model. Execute the cell below to run both the CPU ResNet50 model and the compiled Neuron model on the sample image and compare their top-5 predictions. We expect the predictions of both models to match.

In [None]:
import torch
from torchvision import models, transforms, datasets
import torch_neuron

# Get a sample image
image = preprocess()

# Run inference using the CPU model
output_cpu = model(image)

# Load the compiled Neuron model
model_neuron = torch.jit.load('resnet50_neuron.pt')

# Run inference using the Neuron model
output_neuron = model_neuron(image)

# Verify that the CPU and Neuron predictions are the same by comparing
# the top-5 results
top5_cpu = output_cpu[0].sort()[1][-5:]
top5_neuron = output_neuron[0].sort()[1][-5:]

# Lookup and print the top-5 labels
top5_labels_cpu = [idx2label[idx] for idx in top5_cpu]
top5_labels_neuron = [idx2label[idx] for idx in top5_neuron]
print("CPU top-5 labels: {}".format(top5_labels_cpu))
print("Neuron top-5 labels: {}".format(top5_labels_neuron))
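Instead of comparing the printed labels by eye, the check can be automated. The following is a minimal sketch in plain Python, assuming the model outputs have been converted to lists of scores (for example with `output_cpu[0].tolist()`). The helper names `top_k_indices` and `top_k_match` and the toy scores are hypothetical and not part of the Neuron SDK.

```python
def top_k_indices(scores, k=5):
    """Return the indices of the k largest scores, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_match(scores_a, scores_b, k=5):
    """Return True when both score lists select the same set of top-k classes."""
    return set(top_k_indices(scores_a, k)) == set(top_k_indices(scores_b, k))


# Toy scores standing in for output_cpu[0].tolist() and output_neuron[0].tolist()
cpu_scores = [0.1, 2.5, 0.3, 1.9, 0.05, 3.2, 0.7]
neuron_scores = [0.12, 2.48, 0.31, 1.91, 0.04, 3.19, 0.69]

print(top_k_indices(cpu_scores, k=3))               # [5, 1, 3]
print(top_k_match(cpu_scores, neuron_scores, k=3))  # True
```

Comparing sets rather than ordered lists tolerates small numerical differences that swap the order of two nearly tied classes.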

## Benchmark Different Model Configurations

The Neuron SDK provides several performance optimization capabilities. Below we run a set of experiments with different model configurations.

### Running Model On Multiple Neuron Cores

To fully leverage the Inferentia hardware, we want to use all available NeuronCores. The inf1.xlarge and inf1.2xlarge instances have 4 NeuronCores each, an inf1.6xlarge has 16 NeuronCores, and an inf1.24xlarge has 64 NeuronCores. For maximum performance, we can use `torch.neuron.DataParallel` to utilize all available NeuronCores. Neuron DataParallel implements data parallelism at the module level by duplicating the Neuron model on all available NeuronCores and distributing data across the different cores for parallelized inference.

In the example below, we set the number of NeuronCores to 4. Change this value to match the number of NeuronCores available on your Inferentia instance.

In [None]:
# For an inf1.xlarge or inf1.2xlarge set num_neuron_cores=4;
# for an inf1.6xlarge set it to 16; for an inf1.24xlarge set it to 64.
num_neuron_cores = 4

model_neuron_parallel = torch.neuron.DataParallel(model_neuron)

# Get sample image with batch size=1 per NeuronCore
batch_size = 1

image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)

# Benchmark the model
benchmark(model_neuron_parallel, image)

The benchmark results should be similar to the following:
```python
Input image shape is [4, 3, 224, 224]
Avg. Throughput: 551, Max Throughput: 562
Latency P50: 7
Latency P90: 7
Latency P95: 7
Latency P99: 7
```
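
To make the data-parallel behavior described in this section more concrete, here is a minimal conceptual sketch in plain Python. It is not how `torch.neuron.DataParallel` is implemented: the helpers `split_across_cores` and `data_parallel` and the `toy_model` function are hypothetical, and the replicas run one after another here, whereas the real NeuronCores work concurrently.

```python
def split_across_cores(batch, num_cores):
    """Split a list of samples into at most num_cores contiguous chunks."""
    chunk_size = max(1, -(-len(batch) // num_cores))  # ceiling division
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]


def data_parallel(replicas, batch):
    """Send one chunk to each model replica and concatenate the results in order."""
    outputs = []
    for replica, chunk in zip(replicas, split_across_cores(batch, len(replicas))):
        outputs.extend(replica(chunk))
    return outputs


def toy_model(samples):
    """Stand-in for a compiled model: doubles every input value."""
    return [2 * s for s in samples]


replicas = [toy_model] * 4   # one model copy per (hypothetical) NeuronCore
batch = list(range(8))       # 8 samples, so 2 per core

print(split_across_cores(batch, 4))    # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(data_parallel(replicas, batch))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The output order matches the input order, which is why the benchmark can treat the parallel model as a drop-in replacement for the single-core one.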

### Running Inference With Different Batch Size
In this experiment, we will compile our model to run inference on batched samples to improve throughput. 

Note that dynamic batching with a small compile-time batch size can result in sub-optimal throughput, because it involves slicing tensors into chunks and iteratively sending data to the hardware. Using a larger batch size at compilation time can use the Inferentia hardware more efficiently and maximize throughput. You can explore the tradeoff between individual request latency and total throughput by tuning the input batch size.
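
As a rough illustration of the slicing cost mentioned above, the sketch below counts how many chunks a fixed number of inputs breaks into for different compile-time batch sizes. The function `device_calls` is a hypothetical helper, and the model of "one device call per chunk" is a simplifying assumption rather than a description of the Neuron runtime internals.

```python
import math


def device_calls(total_samples, compiled_batch_size):
    """Number of fixed-size chunks needed to cover total_samples inputs."""
    return math.ceil(total_samples / compiled_batch_size)


total_samples = 20
for compiled_batch_size in (1, 5):
    calls = device_calls(total_samples, compiled_batch_size)
    print("compiled batch size {}: {} chunks".format(compiled_batch_size, calls))
# compiled batch size 1: 20 chunks
# compiled batch size 5: 4 chunks
```

Fewer, larger chunks mean less per-call overhead, at the price of each call taking longer to return.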

In the following example, we recompile our model with a batch size of 5 and run it using `torch.neuron.DataParallel` to saturate the Inferentia hardware.

1. We start by recompiling the model for a batch size of 5:

In [None]:
# Create an input with batch size 5 for compilation
batch_size = 5
image = torch.zeros([batch_size, 3, 224, 224], dtype=torch.float32)

# Recompile the ResNet50 model for inference with batch size 5
model_neuron = torch.neuron.trace(model, example_inputs=[image])

# Export to saved model
model_neuron.save("resnet50_neuron_b{}.pt".format(batch_size))

2. Let's benchmark inference for the newly compiled model:

In [None]:
# Load compiled Neuron model
model_neuron = torch.jit.load("resnet50_neuron_b{}.pt".format(batch_size))

# Create DataParallel model
model_neuron_parallel = torch.neuron.DataParallel(model_neuron)

# Get sample images with batch size=5 per NeuronCore
image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)

# Benchmark the model
benchmark(model_neuron_parallel, image)

The benchmark’s output should look as follows:

```python
Input image shape is [20, 3, 224, 224]
Avg. Throughput: 979, Max Throughput: 998
Latency P50: 20
Latency P90: 21
Latency P95: 21
Latency P99: 24
```

Note that while per-request latency has increased, the overall throughput has also increased, as expected.
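
The relationship between the two benchmark outputs can be checked with simple arithmetic: throughput is roughly the number of images per request divided by the request latency. The sketch below plugs in the input shapes and P50 latencies from the sample outputs above. The helper name `estimated_throughput` is hypothetical, and because the benchmark averages per-iteration throughput instead of using the P50 latency, the estimates only approximate the reported averages (551 and 979).

```python
def estimated_throughput(images_per_request, latency_ms):
    """Images per second for one request of the given size and latency."""
    return images_per_request * 1000.0 / latency_ms


# Batch size 1 per core on 4 cores: 4 images per request, P50 latency 7 ms
unbatched = estimated_throughput(4, 7)
# Batch size 5 per core on 4 cores: 20 images per request, P50 latency 20 ms
batched = estimated_throughput(20, 20)

print("Unbatched: {:.0f} images/s".format(unbatched))  # Unbatched: 571 images/s
print("Batched:   {:.0f} images/s".format(batched))    # Batched:   1000 images/s
print("Speedup:   {:.2f}x".format(batched / unbatched))  # Speedup:   1.75x
```

Latency grows by about 3x while the request size grows by 5x, which is where the throughput gain comes from.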

### Running Inference with NeuronCore Pipeline
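In our last experiment, we use the NeuronCore Pipeline feature, which shards a single model across several NeuronCores and streams inference requests through them in a pipelined manner.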

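1. We first recompile the model with the `--neuroncore-pipeline-cores` compiler option and then benchmark it:

In [None]:
# Number of NeuronCores to use in pipeline mode
neuroncore_pipeline_cores = 4

# Get a sample batch to use as the example input for compilation
image = preprocess(batch_size=batch_size, num_neuron_cores=num_neuron_cores)

# Compile with --neuroncore-pipeline-cores set to the value chosen above
neuron_pipeline_model = torch.neuron.trace(model,
                                           example_inputs=[image],
                                           verbose=1,
                                           compiler_args=['--neuroncore-pipeline-cores', str(neuroncore_pipeline_cores)]
                                          )

# Benchmark the pipelined model (this must run after compilation)
benchmark(neuron_pipeline_model, image)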

In [None]:
# Pass the batched tensor directly; unpacking it with * would split it into per-image arguments
outputs = neuron_pipeline_model(image)