# Exercise 2: Benchmarking

In this exercise, you will implement benchmarking code for testing performance.

The workload will be the same vehicle detection code as exercise 1. 
But in this exercise, you will be repeating the process and timing it.

## Step 1: Converting the Model

You will be using the `mobilenet-ssd` model again for the benchmark code.

Use the model downloader and model optimizer to generate a model for **FP16**.

In [None]:
%%bash
# Download and convert mobilenet-ssd model for FP16
/opt/intel/openvino/deployment_tools/tools/model_downloader/downloader.py --print

## Step 2: Inference Benchmark Scripts

In this step, you will be writing the benchmarking code for testing various hardware available to you on the DevCloud.
The hardware that you will test includes devices that perform best when there is more than one request.
So, as discussed in the video, you need to take advantage of the asynchronous inference mode to spawn the optimal number of inference requests.
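
The benefit of keeping several requests in flight can be sketched without any device at all. The example below is only a stand-in, not the Inference Engine API: `fake_infer` is a hypothetical function that sleeps to imitate device latency, and a thread pool imitates the "start all requests, then wait for all of them" pattern.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_infer(image, latency_s=0.05):
    """Hypothetical stand-in for one inference request; sleeps to mimic device latency."""
    time.sleep(latency_s)
    return len(image)

def run_sequential(images):
    """Run one request at a time and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_infer(img) for img in images]
    return results, time.perf_counter() - start

def run_overlapped(images):
    """Start every request first, then wait for all of them."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(images)) as pool:
        futures = [pool.submit(fake_infer, img) for img in images]  # start all
        results = [f.result() for f in futures]                     # wait for all
    return results, time.perf_counter() - start

images = ["img"] * 4
seq_results, seq_s = run_sequential(images)
ovl_results, ovl_s = run_overlapped(images)
print("sequential: {:.3f} s, overlapped: {:.3f} s".format(seq_s, ovl_s))
```

With four simulated requests, the overlapped run should finish in roughly the time of one request, while the sequential run takes about four times as long.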

For this exercise, you will be jumping straight to creating scripts for running the workload in the queue.
The scripts will be in two parts: `utils.py`, which holds helper functions such as the image preprocessing function, and `main.py`, where the benchmarking occurs.

### utils.py

Follow the instructions to complete `utils.py`.

*(2.1)* Complete the `prepImage()` function, which is used to prepare the image for inference. The code here should be the exact same as in exercise 1.

*(2.2)* In the `createExecNetwork()` function, create an instance of the IECore object and load the CPU extension specified by the variable `extension`. The solution is identical to the implementation in exercise 1. Optionally, add an `if` check to see if "CPU" appears in the device list. While it is safe to load the extension even if you are not using the CPU, it is good practice to add a check so that unnecessary extensions are not loaded.

*(2.3)* In the `createExecNetwork()` function, create an instance of ExecutableNetwork from `ie_net` with the optimal number of requests and return it. Remember that you must first create an ExecutableNetwork object (with the default `num_requests`) to get the optimal number of requests. See the slides for video 2 of course 2 for more details.
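
For the optional check in (2.2), one possible way to decide whether the CPU is among the target devices is sketched below. The helper names `device_list` and `uses_cpu` are hypothetical and not part of the exercise files; the device strings are the ones used later in this exercise.

```python
def device_list(device):
    """Split a device string such as "HETERO:FPGA,CPU" into ["FPGA", "CPU"]."""
    # Drop an optional mode prefix like "HETERO:" and split the remaining names.
    _, _, names = device.rpartition(":")
    return names.split(",")

def uses_cpu(device):
    """Return True when the CPU is one of the devices named in the string."""
    return "CPU" in device_list(device)

for device in ["CPU", "GPU", "HETERO:FPGA,CPU", "HDDL"]:
    print(device, uses_cpu(device))
```

A plain substring test (`"CPU" in device`) would also work for these four strings; splitting the string first just makes the intent explicit.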

In [None]:
%%writefile utils.py
import cv2
from openvino.inference_engine import IECore, IENetwork

def prepImage(original_image, ie_net):

    ##! (2.1) Find n, c, h, w from net !##
    
    # Reshaping data
    input_image = cv2.resize(original_image, (w, h))
    input_image = input_image.transpose((2, 0, 1))
    # reshape() returns a new array, so the result must be assigned back
    input_image = input_image.reshape((n, c, h, w))

    return input_image

def getCount(detected_objects, prob_threshold=0.5):
    detected_count = 0
    for obj in detected_objects[0][0]:
        # Count only objects whose probability exceeds the specified threshold
        if obj[2] > prob_threshold:
            detected_count+=1
    return detected_count

def createExecNetwork(ie_net, device):
    ##! (2.2) Create IECore !##
    
    ##! (2.2) Load the CPU extension (optional: check if it is needed) !##
    extension = '/opt/intel/openvino/deployment_tools/inference_engine/lib/intel64/libcpu_extension_avx2.so'

    ##! (2.3) Create ExecutableNetwork object and find the optimal number of requests !##

    ##! (2.3) Recreate the ExecutableNetwork with num_requests set to the optimal number of requests !##
    
    ##! (2.3) return the ExecutableNetwork !##

### main.py 
Next is `main.py`.
For this implementation, follow the approach where preprocessing and postprocessing are also repeated as many times as there are requests.
While it is not strictly necessary to repeat the preprocessing and postprocessing steps, doing so gives you per-image timings that you can compare directly.
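
The per-image timing arithmetic used in `main.py` can be sketched in isolation. The helper names `per_image_ms` and `summarize` are hypothetical, and the sample values are placeholders rather than measurements; the formulas mirror the `(elapsed)/num_requests*1000` and `mean +- stdev` expressions in the script.

```python
import statistics

def per_image_ms(elapsed_s, num_requests):
    """Convert the wall time for a batch of requests into milliseconds per image."""
    return elapsed_s / num_requests * 1000

def summarize(samples_ms):
    """Format the samples as 'mean +- stdev' with three significant digits."""
    return "{:.3g} +- {:.3g}".format(
        statistics.mean(samples_ms), statistics.stdev(samples_ms)
    )

# Placeholder numbers: 0.2 s spent on 4 requests is 50 ms per image.
print(per_image_ms(0.2, 4))
print(summarize([10.0, 12.0, 14.0]))
```

Because every stage is divided by the same `num_requests`, the preprocessing, inference, and postprocessing figures are all expressed per image and can be compared side by side.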

Follow the instructions to complete `main.py`.

*(2.4)* Create the IENetwork object with the FP16 version of the `mobilenet-ssd` model that you downloaded and converted in Step 1. Do not change the variable name `ie_net` for this file. Then find the names of the input layer and output layer.

*(2.5)* Start asynchronous processing on all request slots for images from prepped_images. 

*(2.6)* Wait for all request slots to complete.

*(2.7)* Get the number of vehicles from each inference request with `getCount()` function, and save the result in an array. You will likely need to access the result through the `outputs` attribute of the requests. See slides from course 1 video 7 for more. This array is used for a sanity check to make sure all inference requests return the same number of detected vehicles.
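
To see what the counting and the sanity check in (2.7) operate on, here is a small sketch that runs a copy of `getCount()` on a hand-written detection array. The array is made up for illustration, and the row layout `[image_id, label, confidence, x_min, y_min, x_max, y_max]` is an assumption; the only thing `getCount()` itself relies on is that index 2 holds the probability.

```python
def getCount(detected_objects, prob_threshold=0.5):
    """Copy of the helper in utils.py: count detections above the threshold."""
    detected_count = 0
    for obj in detected_objects[0][0]:
        if obj[2] > prob_threshold:
            detected_count += 1
    return detected_count

# Made-up output with nested shape [1][1][N][7]; two rows exceed 0.5.
fake_output = [[[
    [0, 1, 0.92, 0.10, 0.20, 0.30, 0.40],
    [0, 1, 0.61, 0.50, 0.55, 0.70, 0.80],
    [0, 1, 0.08, 0.05, 0.05, 0.10, 0.10],
]]]

# Pretend three request slots produced the same output, as in the benchmark loop.
result_list = [getCount(fake_output) for _ in range(3)]
assert all(x == result_list[0] for x in result_list), "Results did not match"
print(result_list)
```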

In [None]:
%%writefile main.py
import time
from openvino.inference_engine import IENetwork, IECore
from utils import *
import cv2
import sys
import os
import statistics

# Getting the device as commandline argument
device = sys.argv[1]

##! (2.4) create IENetwork object for mobilenet-ssd and set it to ie_net !##
ie_net = None

##! (2.4) get the input and output layer names !##

image_path = "cars_1900_first_frame.jpg"
original_image = cv2.imread(image_path)

iter_ = 500
prep_time = []
infer_time = []
postp_time = []
for i in range(iter_):
    # Preprocessing image. 
    prep_start = time.time()
    prepped_images = []
    for slot_id in range(num_requests):
        prepped_images.append(prepImage(original_image, ie_net))
    prep_time.append((time.time()-prep_start)/num_requests*1000)
    
    infer_start = time.time()
    for req_slot in range(num_requests):
        ##! (2.5) Run asynchronous inference. !##

    for req_slot in range(num_requests):
        ##! (2.6) Wait for asynchronous inference to complete. !##
    infer_time.append((time.time()-infer_start)/num_requests*1000)
    
    postp_start = time.time()
    result_list = [0]*num_requests  # Python way of creating a 0 array of length 'num_requests'
    for req_slot in range(num_requests):
        ##! (2.7) Run getCount to get the vehicle count and store it in result_list !##
        result_list[req_slot] = None
    postp_time.append((time.time()-postp_start)/num_requests*1000)

    # Sanity check to make sure all results are identical. Abort if it does not match
    assert all([x == result_list[0] for x in result_list]), "Results for the inference requests did not match"
    
# writing the results to a file
if not os.path.exists("results"):
    os.makedirs("results")
with open("results/{}_val.txt".format(device), "w") as f:
    prep_avg    = statistics.mean(prep_time)
    prep_stdev  = statistics.stdev(prep_time)
    infer_avg   = statistics.mean(infer_time)
    infer_stdev = statistics.stdev(infer_time)
    postp_avg   = statistics.mean(postp_time)
    postp_stdev = statistics.stdev(postp_time)
    f.write("Inference running on: {} \n".format(device))
    f.write("Number of requests: {} \n".format(num_requests))
    f.write("Inference time per image (ms): {:.3g} +- {:.3g}\n".format(infer_avg, infer_stdev))
    f.write("Preprocessing time per image (ms): {:.3g} +- {:.3g}\n".format(prep_avg, prep_stdev))
    f.write("Postprocessing time per image (ms): {:.3g} +- {:.3g}\n".format(postp_avg, postp_stdev))

## Step 3: Running Inference Benchmarks

With the benchmark scripts in hand you are ready to begin running benchmarks on the DevCloud.
The commands for running the job will be provided to you, just like in exercise 1.


With that said, there are some differences to note for the job submission in this exercise.
In exercise 1, the command to run the job was piped to `qsub` through the `echo` command.
For this exercise, you will pass the commands for the job to `qsub` through a bash script.
The reason for this shift is that you will be using an FPGA machine for the benchmarks, and FPGAs require an additional step beyond executing `main.py`.
As discussed in the videos, FPGAs require "programs" in the form of bit-streams to be loaded.
OpenVINO provides a pre-built bit-stream for the `mobilenet-ssd` model.
So the commands to load it have to be added to the bash script and run only if an FPGA is used.

### job file

Run the following cell to create the bash script `job.sh` to be used for benchmarking.

In [None]:
%%writefile job.sh

# The default path for the job is your home directory, so we change directory to where the files are.
cd $PBS_O_WORKDIR
DEVICE=$1

# Check if FPGA is used 
if grep -q FPGA <<<"$DEVICE"; then
    # Environment variables and compilation for edge compute nodes with FPGAs
    source /opt/intel/init_openvino.sh
    aocl program acl0 /opt/intel/openvino/bitstreams/a10_vision_design_sg1_bitstreams/2019R3_PV_PL1_FP16_MobileNet_Clamp.aocx
fi
    
# Running the object detection code
python3 main.py $DEVICE

This bash script takes one argument, which specifies the device to use. 
The bit-stream is only loaded if "FPGA" appears in the device argument.
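
The branching in `job.sh` can be mirrored as a dry run in Python, which may help when checking which steps a given device string triggers. The function `job_steps` is hypothetical and only builds a list of strings; it does not execute anything, and the bit-stream path is abbreviated to a placeholder.

```python
def job_steps(device):
    """Return the shell commands job.sh would run for the given device string."""
    steps = ["cd $PBS_O_WORKDIR"]
    if "FPGA" in device:  # same test as: grep -q FPGA <<<"$DEVICE"
        steps.append("source /opt/intel/init_openvino.sh")
        steps.append("aocl program acl0 <bit-stream .aocx path from job.sh>")
    steps.append("python3 main.py {}".format(device))
    return steps

for device in ["CPU", "HETERO:FPGA,CPU"]:
    print(device)
    for step in job_steps(device):
        print("   ", step)
```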

### Job queue submission

As in exercise 1, the command for submitting the job has been provided for you. 
The two main differences for this command are that it gets the commands to run from `job.sh`, and that the argument for `job.sh` is set by the `-F` flag of the `qsub` command.
Once again, see the DevCloud documentation page for more information on using the `qsub` command.
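
As a rough illustration of how the pieces of the submission command fit together, the sketch below assembles the same argument list used in the cells of this step. The helper `build_qsub_command` is hypothetical; it only builds and prints the command and does not submit anything.

```python
import shlex

def build_qsub_command(node, device, script="job.sh", job_name="obj_det"):
    """Assemble the qsub arguments: -l picks the node, -F passes the device to the script, -N names the job."""
    return [
        "qsub", script,
        "-l", "nodes=1:{}".format(node),
        "-F", device,
        "-N", job_name,
    ]

cmd = build_qsub_command("skylake", "CPU")
print(" ".join(shlex.quote(part) for part in cmd))
```

Swapping in another node name and device string, for example `iei-mustang-f100-a10` with `HETERO:FPGA,CPU`, yields the corresponding command for that hardware.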

Additionally, the `waitForJob` function has been provided for you like in exercise 1. 
This version, however, has an argument `show_stdio` that lets you choose whether stdout and stderr are printed.
The results of the benchmark will be printed regardless, so stdout and stderr are primarily for your convenience in debugging if there is an issue.

Running the following cell will submit the job for processing on the CPU.

In [None]:
from notebook_utils import waitForJob
job_name_cpu = !qsub job.sh -l nodes=1:skylake -F "CPU" -N obj_det
print("Waiting on job to complete. This may take some time")
waitForJob(job_name_cpu, "CPU", show_stdio=True)

If the run on CPU was successful, it is time to try out the other devices.
One of the big advantages of the job queue system is that you can have multiple jobs running at once, so you can benchmark the remaining systems in one go.

Run the following cell to run the benchmark on GPU, FPGA, and VPU.

**Note:** FPGA is set to `HETERO` mode with CPU, as there are some layers that are not supported by FPGA. Also, HDDL will take longer to complete than others because it has a large optimal number of inference requests, and consequently has more images to process than others.

In [None]:
job_name_gpu = !qsub job.sh -l nodes=1:intel-hd-530 -F "GPU" -N obj_det
job_name_fpga = !qsub job.sh -l nodes=1:iei-mustang-f100-a10 -F "HETERO:FPGA,CPU" -N obj_det
job_name_hddl = !qsub job.sh -l nodes=1:iei-mustang-v100-mx8 -F "HDDL" -N obj_det
print("Waiting on jobs to complete. This may take some time")
waitForJob(job_name_gpu, "GPU", show_stdio=False)
waitForJob(job_name_fpga, "HETERO:FPGA,CPU", show_stdio=False)
waitForJob(job_name_hddl, "HDDL", show_stdio=False)

Now run the following cell to get a side-by-side comparison of the inference time per image.
**The quiz will ask you which device had the best (lowest) inference time per image value.**

In [None]:
from notebook_utils import summaryPlot
summaryPlot('results', 'Target device', 'Inference time per image (ms)', "Inference performance (lower the better)")
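
If you want to read the numbers directly rather than from the plot, the result files written by `main.py` can be parsed with a few lines of Python. This is only a sketch: `parse_inference_ms` and `fastest_device` are hypothetical helpers, and the report text and device names below are placeholders, not real measurements.

```python
import re

# Matches the line written by main.py: "Inference time per image (ms): <mean> +- <stdev>"
LINE_RE = re.compile(r"Inference time per image \(ms\): (\S+) \+- (\S+)")

def parse_inference_ms(text):
    """Return (mean, stdev) in milliseconds from the text of one result file."""
    match = LINE_RE.search(text)
    if match is None:
        raise ValueError("no inference-time line found")
    return float(match.group(1)), float(match.group(2))

def fastest_device(reports):
    """Return the key of the report with the lowest mean inference time."""
    return min(reports, key=lambda name: parse_inference_ms(reports[name])[0])

# Placeholder reports in the same format as results/<device>_val.txt.
sample_reports = {
    "DEVICE_A": "Number of requests: 2 \nInference time per image (ms): 12.5 +- 0.8\n",
    "DEVICE_B": "Number of requests: 4 \nInference time per image (ms): 7.25 +- 0.3\n",
}
print(fastest_device(sample_reports))
```

To use it on real output, read each `results/<device>_val.txt` file into a string and pass a dictionary of those strings to `fastest_device`.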

Congratulations! You now have the performance benchmark on 4 types of devices. 
Of course, these numbers are not the full story; you need to consider other factors like power consumption and cost if these are important for your particular deployment.
But these benchmarks will be a key component in that decision-making process.