# Model Serving

Centralized model serving can be a significant design win for your business products and applications. Hosting models in a central location reduces memory usage, because each model is loaded once instead of once per application, and can be designed to reduce inter-device communication.

This example uses [NVIDIA's Triton Inference Server](https://github.com/triton-inference-server/server) to serve the model shown in the previous section.

First, let's export the model to a serving format:

In [None]:
!git clone https://github.com/cvlab-stonybrook/DM-Count

In [None]:
import torch
import onnxruntime
import numpy as np
import onnx
import gdown
import sys
import os

sys.path.insert(0, os.path.join(os.getcwd(), 'DM-Count/'))

import models

model_path = "model.pth"
url = "https://drive.google.com/uc?id=1nnIHPaV9RGqK8JHL645zmRvkNrahD9ru"
gdown.download(url, model_path, quiet=False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# load model
model = models.vgg19() # DM-Count is VGG19 based
model.load_state_dict(torch.load(model_path, map_location=device))
model.eval()

# send model to compute device
model = model.to(device)

We will now export the trained model for deployment, using the ONNX format. ONNX is an open format built to represent machine learning models; read more about it [here](https://onnx.ai/). Triton can handle many different model formats and can even be used to serve custom Python scripts.

Model formats for serving will be covered in more detail in a later section!

In [None]:
target_input_width = 1280
target_input_height = 800

dummy_input = torch.rand(1, 3, target_input_height, target_input_width).to(device)

torch.onnx.export(model,  # model being run
                  dummy_input,  # model test input
                  "model.onnx",  # where to save the model (can be a file or file-like object)
                  opset_version=16,  # the ONNX opset version to target
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['input'],  # the model's input names
                  output_names=['output_0', 'output_1'],  # the model's output names
                  dynamic_axes={'input': {0: 'batch_size'},  # variable length axes
                                'output_0': {0: 'batch_size'}, 
                                'output_1': {0: 'batch_size'}
                               }
                  )

One nice feature of Triton is the [ability to have it "poll" a model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-poll) to see whether anything has changed. All that needs to be done is to copy the model into the `model_repository` directory. You can read more about the expected layout [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md#repository-layout).

Triton has already been set up and is running within the `docker-compose` network created when starting this all up. Its hostname is `triton-inference-server`.

Okay, let's get this model into Triton!

In [None]:
!mkdir -p model_repository/dmcount_onnx/1/ # create the folders for the model name and the model version
!cp model.onnx model_repository/dmcount_onnx/1/ # copy the file into the version folder
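
The same layout can be produced from Python, which is handy when publishing is scripted. This is a minimal sketch, not part of the original workflow: `publish_model` is a hypothetical helper name, and it assumes the repository follows the `<repository>/<model name>/<version>/model.onnx` layout described in the Triton documentation linked above.

```python
import shutil
from pathlib import Path


def publish_model(model_file, repository, model_name, version=1):
    """Copy model_file to <repository>/<model_name>/<version>/model.onnx.

    Hypothetical helper: mirrors the mkdir/cp shell commands above.
    """
    version_dir = Path(repository) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    target = version_dir / "model.onnx"  # default file name for ONNX models
    shutil.copyfile(model_file, target)
    return target
```

Calling `publish_model("model.onnx", "model_repository", "dmcount_onnx")` would reproduce the shell cell above; passing `version=2` would place a second version next to the first.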

Once the model is in the repository, Triton automatically generates a model config and loads the model. Let's query Triton to see more about it!

In [None]:
import tritonclient.grpc as grpcclient # can use http or grpc

# create the client
inference_server_url = "triton-inference-server:8001"
triton_client = grpcclient.InferenceServerClient(url=inference_server_url)

# find out info about model
model_name = "dmcount_onnx"
triton_client.get_model_config(model_name)
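
Because Triton discovers the model by polling, there can be a short delay between copying the file and the model becoming available, so the query above may fail if it runs too early. The sketch below waits for readiness first. `wait_until_ready` is a hypothetical helper; it relies only on the client's `is_server_ready()` and `is_model_ready()` methods, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_ready(client, model_name, timeout_s=60.0, interval_s=1.0):
    """Poll until the server and `model_name` report ready, or time out.

    Returns True if the model became ready, False if `timeout_s` elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if client.is_server_ready() and client.is_model_ready(model_name):
                return True
        except Exception:
            # The server may still be starting; treat errors as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the client created above, `wait_until_ready(triton_client, model_name)` could be called before `get_model_config`.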

You can also create a custom config to control other parameters such as the batch size or the maximum number of requests. See [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) for more.

For now, though, we are going to run inference with the model!

The config above lists the input and output names of the model. Let's use this information to build a function for inference.
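
As a rough illustration of what a custom config might look like, the sketch below renders a `config.pbtxt` as text. Everything numeric here is an assumption: `render_config` is a hypothetical helper, the batch size and queue delay are arbitrary, and the output dimensions are placeholders that should be copied from the auto-generated config. With `max_batch_size` greater than zero, the `dims` entries leave out the batch dimension.

```python
def render_config(name, max_batch_size=8, input_dims=(3, 800, 1280),
                  output_dims=(1, 100, 160)):
    """Return the text of a minimal config.pbtxt for the exported ONNX model.

    output_dims is an assumed placeholder; use the values reported by Triton.
    """
    def dims(values):
        return "[ " + ", ".join(str(v) for v in values) + " ]"

    def tensor(tensor_name, values):
        return (
            "  {\n"
            f'    name: "{tensor_name}"\n'
            "    data_type: TYPE_FP32\n"
            f"    dims: {dims(values)}\n"
            "  }"
        )

    outputs = ",\n".join(tensor(n, output_dims) for n in ("output_0", "output_1"))
    return (
        f'name: "{name}"\n'
        'platform: "onnxruntime_onnx"\n'
        f"max_batch_size: {max_batch_size}\n"
        f"input [\n{tensor('input', input_dims)}\n]\n"
        f"output [\n{outputs}\n]\n"
        "dynamic_batching {\n"
        "  max_queue_delay_microseconds: 100\n"
        "}\n"
    )
```

The rendered text would be saved as `model_repository/dmcount_onnx/config.pbtxt`, next to the version folder.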

In [None]:
def infer(image):
    # create input
    input_name = "input"
    inputs = [grpcclient.InferInput(input_name, image.shape, "FP32")]
    inputs[0].set_data_from_numpy(image)

    output_names = ["output_0", "output_1"]
    outputs = [grpcclient.InferRequestedOutput(n) for n in output_names]

    results = triton_client.infer(model_name, inputs, outputs=outputs) # send the query

    output_0, output_1 = [results.as_numpy(o) for o in output_names]

    # only the first output (the density map) is used downstream
    return output_0

In [None]:
import cv2
import numpy as np

def detect_crowd(original_image):
    '''
    Function for counting crowds in a single image
    '''
        
    resized_image = cv2.resize(original_image, (target_input_width, target_input_height))
    
    # preprocessing    
    image_rgb = resized_image[...,::-1] # BGR to RGB
    image = image_rgb.astype(np.float32)

    image = image/255
    image = np.transpose(image, (2, 0, 1))  # HWC to CHW

    image = np.expand_dims(image, axis=0) # add batch dimension

    # inference
    output = infer(image)
    
    # post-processing
    crowd_count = int(np.sum(output).item())
    
    heatmap = output[0, 0]
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-5)
    heatmap = (heatmap * 255).astype(np.uint8)
    heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)
    heatmap = cv2.resize(heatmap, (target_input_width, target_input_height))
        
    overlayed = cv2.addWeighted(resized_image, 0.7, heatmap, 0.5, 0)
    
    return overlayed, crowd_count



We will now use the same Gradio interface as in the first section, this time backed by the new model serving setup!

In [None]:
import gradio as gr

gr.close_all() # close any Gradio apps that are still running

iface = gr.Interface(fn=detect_crowd,
                     inputs=[
                         gr.Image(label="Image of Crowd"),
                     ],
                     outputs=[
                         gr.Image(label="Predicted Density Map"),
                         gr.Label(label="Predicted Count"),
                     ],
                     examples=[
                         ["sample_images/busy-road.jpg"],
                         ["sample_images/concert-crowd.jpg"],
                         ["sample_images/group-photo.jpg"],
                         ["sample_images/mountains.jpg"],
                     ],
                     title="Crowd Detection App",
                     description="A simple app.",
                     )
iface.launch(server_name="0.0.0.0", server_port=7860)

In [None]:
# clean up: shut down the kernel and request a restart

import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)