# TensorRT Runtime

This example walks through the basic use case of:
  1. initializing the infer-runtime
  2. loading a model
  3. allocating resources
  4. inspecting the input/output bindings of the model
  5. evaluating the model using async futures
  6. testing for correctness

In [None]:
import os
import time
import numpy as np
import wurlitzer

import trtlab
import infer_test_utils as utils

# this lets us capture stdout and stderr from the backend C++ infer-runtime
display_output = wurlitzer.sys_pipes

In [None]:
!trtexec --onnx=/work/models/onnx/mnist-v1.3/model.onnx --saveEngine=/work/models/onnx/mnist-v1.3/mnist-v1.3.engine

## 1. Initialize infer-runtime

The most important option when initializing the infer-runtime is the maximum number of concurrent executions allowed at any given time. This value is tunable for your application. Lower settings reduce latency; higher settings increase throughput. Evaluate how your model performs using ...TODO-this-notebook...

In [None]:
with display_output():
    # the module is imported as `trtlab` above; the name `infer` is never defined
    models = trtlab.InferenceManager(max_executions=2)
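
To build intuition for what a concurrency cap does, the sketch below uses only the Python standard library, not the infer-runtime. A `ThreadPoolExecutor` with `max_workers=2` stands in for `max_executions=2`, and `fake_inference` is a hypothetical placeholder that sleeps instead of running a model. It records how many tasks run at the same time, which should never exceed the cap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_EXECUTIONS = 2  # stand-in for max_executions; not the trtlab setting itself

lock = threading.Lock()
state = {"in_flight": 0, "peak": 0}  # tasks running now, and the highest count seen


def fake_inference(request_id):
    """Hypothetical placeholder for one inference; sleeps instead of computing."""
    with lock:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
    time.sleep(0.01)  # pretend to do work
    with lock:
        state["in_flight"] -= 1
    return request_id


with ThreadPoolExecutor(max_workers=MAX_EXECUTIONS) as pool:
    futures = [pool.submit(fake_inference, i) for i in range(8)]
    completed = [f.result() for f in futures]

print("requests completed:", len(completed))
print("peak concurrent executions:", state["peak"])
```

Raising `MAX_EXECUTIONS` in this toy lets more requests overlap, which is the throughput side of the trade-off; how latency responds for a real model depends on the hardware and has to be measured.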

## 2. Register a Model

To register a model, associate a `model_name` with a path to a TensorRT engine file. The returned object is an `InferRunner`. Use it to submit work to the backend inference queue.

In [None]:
with display_output():
    mnist = models.register_tensorrt_engine("mnist", "/work/models/onnx/mnist-v1.3/mnist-v1.3.engine")

## 3. Allocate Resources

Before you can submit inference requests, you need to allocate some internal resources. Do this any time new models are registered. There may be a runtime performance interruption if you update the resources while the queue is full.

In [None]:
with display_output():
    models.update_resources()

## 4. Inspect Model

Query the `InferRunner` to see what it expects as inputs and what it returns as outputs.

In [None]:
mnist.input_bindings()

In [None]:
mnist.output_bindings()
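
A minimal sketch of how binding metadata can drive input construction, runnable without the runtime. The `bindings` dict here is a hypothetical stand-in: the only field this notebook relies on is `'shape'`, and the `[1, 28, 28]` shape for `Input3` matches the request submitted in section 5. `random_inputs` and `check_inputs` are illustrative helpers, not part of `trtlab`.

```python
import numpy as np

# Hypothetical stand-in for mnist.input_bindings(); only 'shape' is assumed.
bindings = {"Input3": {"shape": [1, 28, 28]}}


def random_inputs(bindings):
    """Build one random array per binding, matching its declared shape."""
    return {name: np.random.random_sample(info["shape"])
            for name, info in bindings.items()}


def check_inputs(bindings, inputs):
    """Raise ValueError if a binding is missing or has the wrong shape."""
    for name, info in bindings.items():
        if name not in inputs:
            raise ValueError("missing input binding: {}".format(name))
        expected_shape = tuple(info["shape"])
        if tuple(inputs[name].shape) != expected_shape:
            raise ValueError("shape mismatch for {}: got {}, expected {}".format(
                name, inputs[name].shape, expected_shape))


inputs = random_inputs(bindings)
check_inputs(bindings, inputs)  # passes silently when everything matches
print({name: arr.shape for name, arr in inputs.items()})
```

Validating shapes on the Python side before submitting tends to give clearer error messages than a failure inside the backend.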

## 5. Submit Infer Requests

`InferRunner.infer` accepts a dict of numpy arrays that match the input description, submits the inference request to the backend compute engine, and returns a future to a dict of numpy arrays.

This means the method returns almost immediately, but the inference is not necessarily complete at that point. Use `get()` to wait for the result. This is a blocking call.

In [None]:
result = mnist.infer(Input3=np.random.random_sample([1,28,28]))
result # result is a future

In [None]:
result = result.get()
result # result is the value of the future - dict of np arrays

In [None]:
with display_output():
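    # perf_counter measures wall-clock time, including time spent blocked in get()
    start = time.perf_counter()
    result = mnist.infer(**{k: np.random.random_sample(v['shape']) for k, v in mnist.input_bindings().items()})
    print("Queue Time: {}".format(time.perf_counter() - start))
    result = result.get()
    # measured from start, so this figure includes the queue time above
    print("Compute Time: {}".format(time.perf_counter() - start))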
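
The submit-then-wait pattern in this section can be tried without a GPU. This sketch uses `concurrent.futures` from the standard library as an analogy; its `Future.result()` plays the role that `get()` plays here. `slow_double` is a hypothetical stand-in for an inference call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_double(x):
    """Hypothetical stand-in for an inference call that takes a while."""
    time.sleep(0.05)
    return 2 * x


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_double, 21)  # returns right away with a future
    submit_time = time.perf_counter() - start

    value = future.result()                # blocks until the work is done
    total_time = time.perf_counter() - start

print("submit took {:.4f}s, result ready after {:.4f}s".format(submit_time, total_time))
print("value:", value)
```

The gap between the two timings is the window in which the caller is free to prepare the next request or submit more work before blocking.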

## 6. Test for Correctness

Load the test images and expected results. Thanks to the [ONNX Model Zoo](https://github.com/onnx/models/tree/master/mnist) for this example.

In [None]:
inputs = utils.load_inputs("/work/models/onnx/mnist-v1.3/test_data_set_0")
expected = utils.load_outputs("/work/models/onnx/mnist-v1.3/test_data_set_0")

In [None]:
utils.mnist_image(inputs[0]).show()
expected[0]


Submit the images to the inference queue, then wait for each result to be returned.

In [None]:
results = [mnist.infer(Input3=image) for image in inputs]  # `image` avoids shadowing the built-in `input`
results = [r.get() for r in results]

Check the results.

TODO: update the utils to return dictionaries instead of arrays.

In [None]:
for actual, e in zip(results, expected):
    for key, val in actual.items():
        # flatten the output tensor to (1, 10): one score per digit class
        output = val.reshape((1, 10))
        np.testing.assert_almost_equal(output, e, decimal=3)
        print("Test Passed")
        print("Output Binding Name: {}; shape: {}".format(key, val.shape))
        print("Result: {}".format(np.argmax(utils.softmax(output))))
        # output # show the raw tensor
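
The check above relies on `utils.softmax`, which is not shown in this notebook. Below is a minimal sketch of a numerically stable softmax in NumPy, offered as an assumption about what such a helper does rather than its actual implementation. The `logits` values are made up for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Made-up raw scores for the ten digit classes, shaped (1, 10) like the outputs here.
logits = np.array([[-1.2, 0.3, 2.5, 9.1, -3.0, 0.0, 1.1, -0.4, 4.2, 0.7]])
probs = softmax(logits)

# The probabilities should sum to 1 along the class axis.
np.testing.assert_almost_equal(probs.sum(), 1.0, decimal=6)

predicted = int(np.argmax(probs))
print("Predicted digit: {}".format(predicted))
```

Softmax preserves the ordering of its inputs, so `np.argmax` gives the same class on the raw scores as on the probabilities; the softmax step matters only when the probabilities themselves are of interest.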