# 4.0 Using the model

Monitor real-time inference in action with a question-answering NLP task.

**[4.1 Inference API Overview](#4.1-Inference-API-Overview)**<br>
**[4.2 Preparing the Request](#4.2-Preparing-the-Request)**<br>
**[4.3 Querying the Server](#4.3-Querying-the-Server)**<br>
**[4.4 Post-Processing the Response](#4.4-Post-Processing-the-Response)**<br>

In [1]:
import os
import json
import argparse
import numpy as np
import tritonclient.http

The first step is to initialize the client by pointing it towards our server:

In [2]:
try:
    triton_client = tritonclient.http.InferenceServerClient(url="triton:8000", verbose=True)
except Exception as e:
    print("channel creation failed: " + str(e))

Next, check that the server is live and ready, and that version 1 of our model is loaded and ready to serve requests:

In [3]:
modelName = "bertQA-torchscript"
print(triton_client.is_server_live())
print(triton_client.is_server_ready())
print(triton_client.is_model_ready(modelName,"1"))

GET /v2/health/live, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
True
GET /v2/health/ready, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
True
GET /v2/models/bertQA-torchscript/versions/1/ready, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-length': '0', 'content-type': 'text/plain'}>
True
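
If the server is still starting up, the checks above may fail or report `False`. The following is a minimal sketch of polling until everything is ready. The helper name `wait_until_ready` and its timeout parameters are illustrative assumptions, not part of the Triton client library; it relies only on the three health methods used above.

```python
import time


def wait_until_ready(client, model_name, model_version="1",
                     timeout_s=30.0, poll_interval_s=1.0):
    """Poll until the server and the model report ready, or time out.

    `client` is any object exposing is_server_live(), is_server_ready() and
    is_model_ready(name, version), such as the triton_client created above.
    Returns True once everything is ready, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if (client.is_server_live()
                    and client.is_server_ready()
                    and client.is_model_ready(model_name, model_version)):
                return True
        except Exception as exc:  # e.g. connection refused during startup
            print("server not reachable yet: " + str(exc))
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

With the objects defined in this notebook, the call would look like `wait_until_ready(triton_client, modelName)`.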


Finally, inspect the metadata returned by the server:

In [4]:
triton_client.get_server_metadata()

GET /v2, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '285'}>
bytearray(b'{"name":"triton","version":"2.34.0","extensions":["classification","sequence","model_repository","model_repository(unload_dependents)","schedule_policy","model_configuration","system_shared_memory","cuda_shared_memory","binary_tensor_data","parameters","statistics","trace","logging"]}')


{'name': 'triton',
 'version': '2.34.0',
 'extensions': ['classification',
  'sequence',
  'model_repository',
  'model_repository(unload_dependents)',
  'schedule_policy',
  'model_configuration',
  'system_shared_memory',
  'cuda_shared_memory',
  'binary_tensor_data',
  'parameters',
  'statistics',
  'trace',
  'logging']}

# 4.1 Inference API Overview

Since we have been working with a neural network built to do question answering, we'll run an example query against our server. To start, let's investigate the shape of the input and output data that the server will use:

In [5]:
triton_client.get_model_metadata(modelName)

GET /v2/models/bertQA-torchscript, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '379'}>
bytearray(b'{"name":"bertQA-torchscript","versions":["1"],"platform":"pytorch_libtorch","inputs":[{"name":"input__0","datatype":"INT64","shape":[-1,384]},{"name":"input__1","datatype":"INT64","shape":[-1,384]},{"name":"input__2","datatype":"INT64","shape":[-1,384]}],"outputs":[{"name":"output__0","datatype":"FP32","shape":[-1,384]},{"name":"output__1","datatype":"FP32","shape":[-1,384]}]}')


{'name': 'bertQA-torchscript',
 'versions': ['1'],
 'platform': 'pytorch_libtorch',
 'inputs': [{'name': 'input__0', 'datatype': 'INT64', 'shape': [-1, 384]},
  {'name': 'input__1', 'datatype': 'INT64', 'shape': [-1, 384]},
  {'name': 'input__2', 'datatype': 'INT64', 'shape': [-1, 384]}],
 'outputs': [{'name': 'output__0', 'datatype': 'FP32', 'shape': [-1, 384]},
  {'name': 'output__1', 'datatype': 'FP32', 'shape': [-1, 384]}]}

You should have received a response similar to the one below: <br/>
<img width=1000 src="images/DataFormat.png"/>

The server indicated that it expects three input tensors:
- input__0 being the input_ids
- input__1 being the segment_ids
- input__2 being the input_mask

The server will respond with:
- output__0 being the start logits
- output__1 being the end logits
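
The metadata can also be read programmatically rather than by eye. The sketch below copies the response shown above into a dictionary and summarizes it. The helper names `summarize_io` and `shape_matches` are hypothetical and only illustrate how the `-1` in a shape acts as a wildcard for the batch dimension.

```python
def summarize_io(metadata):
    """Map each tensor name to its (datatype, shape) for inputs and outputs."""
    return {
        kind: {t["name"]: (t["datatype"], t["shape"]) for t in metadata[kind]}
        for kind in ("inputs", "outputs")
    }


def shape_matches(actual, expected):
    """Check a concrete shape against a metadata shape; -1 matches any size."""
    return len(actual) == len(expected) and all(
        e == -1 or a == e for a, e in zip(actual, expected)
    )


# Metadata copied from the server response above.
metadata = {
    "name": "bertQA-torchscript",
    "versions": ["1"],
    "platform": "pytorch_libtorch",
    "inputs": [
        {"name": "input__0", "datatype": "INT64", "shape": [-1, 384]},
        {"name": "input__1", "datatype": "INT64", "shape": [-1, 384]},
        {"name": "input__2", "datatype": "INT64", "shape": [-1, 384]},
    ],
    "outputs": [
        {"name": "output__0", "datatype": "FP32", "shape": [-1, 384]},
        {"name": "output__1", "datatype": "FP32", "shape": [-1, 384]},
    ],
}

io = summarize_io(metadata)
for kind, tensors in io.items():
    for name, (datatype, shape) in tensors.items():
        print(kind, name, datatype, shape)

# A batch of one sequence of 384 tokens fits; a 512-token sequence does not.
fits = shape_matches([1, 384], io["inputs"]["input__0"][1])
too_long = shape_matches([1, 512], io["inputs"]["input__0"][1])
print(fits, too_long)
```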

We now need to preprocess our question and context into the format required by the server.

# 4.2 Preparing the Request

Start by creating the question and the context in which the answer should be found:

In [6]:
question = "Most antibiotics target bacteria and don't affect what class of organisms? "
context = "Within the genitourinary and gastrointestinal tracts, commensal flora serve as biological barriers by " +\
        "competing with pathogenic bacteria for food and space and, in some cases, by changing the conditions in " +\
        "their environment, such as pH or available iron. This reduces the probability that pathogens will " +\
        "reach sufficient numbers to cause illness. However, since most antibiotics non-specifically target bacteria" +\
        "and do not affect fungi, oral antibiotics can lead to an overgrowth of fungi and cause conditions such as a" +\
        "vaginal candidiasis (a yeast infection). There is good evidence that re-introduction of probiotic flora, such " +\
        "as pure cultures of the lactobacilli normally found in unpasteurized yogurt, helps restore a healthy balance of" +\
        "microbial populations in intestinal infections in children and encouraging preliminary data in studies on bacterial " +\
        "gastroenteritis, inflammatory bowel diseases, urinary tract infection and post-surgical infections. " 

Next, import some additional utilities that hide the boilerplate logic necessary for data transformation:

In [7]:
import sys
sys.path.insert(0,'/dli/task/client')
from tokenization import BertTokenizer
from inference import preprocess_tokenized_text,parse_answer

This section of code transforms the data into the required format:

In [8]:
tokenizer = BertTokenizer("/dli/task/vocab", do_lower_case=True, max_len=512) 
doc_tokens = context.split()
query_tokens = tokenizer.tokenize(question)

tensors_for_inference, tokens_for_postprocessing = preprocess_tokenized_text(doc_tokens, 
                                    query_tokens, 
                                    tokenizer, 
                                    max_seq_length=384, 
                                    max_query_length=64)

dtype = np.int64
input_ids = np.array(tensors_for_inference.input_ids, dtype=dtype)[None,...] # make bs=1
segment_ids = np.array(tensors_for_inference.segment_ids, dtype=dtype)[None,...] # make bs=1
input_mask = np.array(tensors_for_inference.input_mask, dtype=dtype)[None,...] # make bs=1
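
The helper `preprocess_tokenized_text` hides how the three tensors are laid out. As a rough illustration, the sketch below assumes the conventional BERT question-answering layout: `[CLS]`, the question, `[SEP]`, the context, `[SEP]`, then zero padding. The function `build_bert_inputs` and its default token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding) are assumptions made for this example, not the course's implementation, which may also handle details such as sliding windows over long contexts.

```python
import numpy as np


def build_bert_inputs(query_ids, context_ids, max_seq_length=384,
                      cls_id=101, sep_id=102, pad_id=0):
    """Lay out one question/context pair as [CLS] q [SEP] ctx [SEP] + padding.

    Assumes the question has already been truncated so that it fits.
    """
    # Truncate the context so the whole sequence fits into max_seq_length.
    room = max(0, max_seq_length - len(query_ids) - 3)  # [CLS] and two [SEP]
    context_ids = context_ids[:room]

    input_ids = [cls_id] + query_ids + [sep_id] + context_ids + [sep_id]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(query_ids) + 2) + [1] * (len(context_ids) + 1)
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(input_ids)

    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return input_ids, segment_ids, input_mask


# Toy example with a short sequence length so the layout is easy to read.
ids, seg, mask = build_bert_inputs([2087, 24479], [2306, 1996, 8991],
                                   max_seq_length=12)
batch = np.array(ids, dtype=np.int64)[None, ...]  # add the batch dimension
print(batch.shape)
print(ids)
print(seg)
print(mask)
```

The `[None, ...]` indexing is the same trick used in the cell above: it prepends an axis of length one so that a single sequence becomes a batch of size one.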

Finally, we copy the data into the structures required by Triton. Note that the tensor names, data types, and tensor dimensions must match those reported by the Triton server in the model metadata earlier:

In [9]:
inputs = []
inputs.append(tritonclient.http.InferInput('input__0', [1, len(input_ids[0])], "INT64"))
inputs.append(tritonclient.http.InferInput('input__1', [1, len(segment_ids[0])], "INT64"))
inputs.append(tritonclient.http.InferInput('input__2', [1, len(input_mask[0])], "INT64"))


inputs[0].set_data_from_numpy(input_ids, binary_data=False)
inputs[1].set_data_from_numpy(segment_ids, binary_data=False)
inputs[2].set_data_from_numpy(input_mask, binary_data=False)

Inspecting one of the inputs reveals the new data representation, which was tokenized and converted to the numerical format as required by the network:

In [10]:
inputs[0]._get_tensor()

{'name': 'input__0',
 'shape': [1, 384],
 'datatype': 'INT64',
 'data': [101,
  2087,
  24479,
  4539,
  10327,
  1998,
  2123,
  1005,
  1056,
  7461,
  2054,
  2465,
  1997,
  11767,
  1029,
  102,
  2306,
  1996,
  8991,
  9956,
  9496,
  24041,
  1998,
  3806,
  13181,
  18447,
  19126,
  22069,
  1010,
  4012,
  3549,
  12002,
  10088,
  3710,
  2004,
  6897,
  13500,
  2011,
  6637,
  2007,
  26835,
  2594,
  10327,
  2005,
  2833,
  1998,
  2686,
  1998,
  1010,
  1999,
  2070,
  3572,
  1010,
  2011,
  5278,
  1996,
  3785,
  1999,
  2037,
  4044,
  1010,
  2107,
  2004,
  6887,
  2030,
  2800,
  3707,
  1012,
  2023,
  13416,
  1996,
  9723,
  2008,
  26835,
  2015,
  2097,
  3362,
  7182,
  3616,
  2000,
  3426,
  7355,
  1012,
  2174,
  1010,
  2144,
  2087,
  24479,
  2512,
  1011,
  4919,
  4539,
  10327,
  5685,
  2079,
  2025,
  7461,
  15289,
  1010,
  8700,
  24479,
  2064,
  2599,
  2000,
  2019,
  2058,
  26982,
  1997,
  15289,
  1998,
  3426,
  3785,
  2107,
  2004

Although it is possible to fetch every output tensor associated with the request, it is good practice to fetch only the ones we need in order to minimize bandwidth. We do that by specifying the requested outputs:

In [11]:
outputs = []
outputs.append(
        tritonclient.http.InferRequestedOutput('output__0', binary_data=False))
outputs.append(
        tritonclient.http.InferRequestedOutput('output__1', binary_data=False))
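
The `binary_data=False` flag used above asks for tensor values as JSON, which keeps the verbose log readable but is not the most compact encoding for floating-point data. The sketch below gives a rough feel for the difference by comparing a JSON encoding of one FP32 output tensor with its raw bytes. The random values stand in for real logits, and the JSON structure only approximates what the server sends.

```python
import json

import numpy as np

# Stand-in for one output tensor: a batch of one with 384 FP32 logits.
rng = np.random.default_rng(0)
logits = rng.normal(loc=-6.0, scale=0.5, size=(1, 384)).astype(np.float32)

# JSON encoding: every value is written out as decimal text.
json_payload = json.dumps({"name": "output__0",
                           "data": logits.flatten().tolist()})
json_size = len(json_payload.encode("utf-8"))

# Binary encoding: 4 bytes per FP32 element.
binary_size = len(logits.tobytes())

print("JSON bytes:  ", json_size)
print("binary bytes:", binary_size)
```

For a tutorial, readable JSON is a reasonable choice; for high-throughput clients, switching to `binary_data=True` is worth measuring.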

# 4.3 Querying the Server

Let us now issue a request to the server. The <code>outputs</code> parameter is optional; if it is not specified, all output tensors are returned.

In [12]:
results = triton_client.infer(modelName,
                                  inputs,
                                  outputs=outputs)

POST /v2/models/bertQA-torchscript/infer, headers {}
{"inputs":[{"name":"input__0","shape":[1,384],"datatype":"INT64","data":[101,2087,24479,4539,10327,1998,2123,1005,1056,7461,2054,2465,1997,11767,1029,102,2306,1996,8991,9956,9496,24041,1998,3806,13181,18447,19126,22069,1010,4012,3549,12002,10088,3710,2004,6897,13500,2011,6637,2007,26835,2594,10327,2005,2833,1998,2686,1998,1010,1999,2070,3572,1010,2011,5278,1996,3785,1999,2037,4044,1010,2107,2004,6887,2030,2800,3707,1012,2023,13416,1996,9723,2008,26835,2015,2097,3362,7182,3616,2000,3426,7355,1012,2174,1010,2144,2087,24479,2512,1011,4919,4539,10327,5685,2079,2025,7461,15289,1010,8700,24479,2064,2599,2000,2019,2058,26982,1997,15289,1998,3426,3785,2107,2004,10927,24965,27467,9032,6190,1006,1037,21957,8985,1007,1012,2045,2003,2204,3350,2008,2128,1011,4955,1997,4013,26591,10088,1010,2107,2004,5760,8578,1997,1996,18749,3406,3676,6895,6894,5373,2179,1999,4895,19707,2618,28405,10930,27390,2102,1010,7126,9239,1037,7965,5703,1997,7712,3217,2110

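Note that <code>results</code> and <code>outputs</code> are different objects: <code>results</code> is the <code>InferResult</code> returned by the inference call, while <code>outputs</code> is still the list of <code>InferRequestedOutput</code> descriptors built earlier. A notebook cell displays only its last expression, so only <code>outputs</code> is shown below: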

In [13]:
results
outputs

[<tritonclient.http._requested_output.InferRequestedOutput at 0x7f33145b50f0>,
 <tritonclient.http._requested_output.InferRequestedOutput at 0x7f33145b49d0>]

# 4.4 Post-Processing the Response

The results in our case are just the logits of the start and end positions. Let's process them further to obtain a human-readable result. We start by copying the vectors to NumPy to make further processing easier:

In [14]:
# Extract the start and end logits from the response as NumPy arrays.
output0_data = results.as_numpy('output__0')
output1_data = results.as_numpy('output__1')

Let's inspect the output...

In [15]:
output0_data

array([[-6.2304993, -5.988884 , -6.157663 , -6.06822  , -4.5110598,
        -6.1771555, -5.9670124, -6.233505 , -5.894256 , -5.6070166,
        -6.033555 , -6.302067 , -6.143382 , -6.211335 , -6.3532295,
        -6.214389 , -6.0064845, -6.23081  , -5.773906 , -6.385979 ,
        -6.442009 , -6.5102725, -6.268529 , -5.816762 , -6.2957683,
        -6.293145 , -6.4921436, -6.3164907, -6.4416842, -5.2130537,
        -6.26847  , -6.562448 , -6.120388 , -6.3689384, -6.228873 ,
        -6.0389414, -5.9520264, -6.278862 , -5.933717 , -6.097395 ,
        -0.4255836, -4.250477 , -1.1577065, -6.1508856, -5.9867973,
        -6.2784433, -6.035305 , -6.371573 , -6.259742 , -6.109337 ,
        -6.069693 , -6.3170094, -6.376259 , -6.140277 , -5.981711 ,
        -6.312574 , -6.2149525, -6.25721  , -6.2140055, -6.0792603,
        -6.480047 , -6.1744103, -6.160393 , -5.91033  , -6.3136454,
        -6.0063744, -5.0427103, -4.8972197, -5.936635 , -5.89367  ,
        -6.2945185, -6.1085763, -6.223854 , -4.4

...and convert it into a human-readable format.

In [16]:
start_logits = output0_data[0].tolist()
end_logits = output1_data[0].tolist()

answer, answers = parse_answer(doc_tokens, tokens_for_postprocessing, 
                                 start_logits, end_logits)

# print result
print()
print(answer)
print()
print(json.dumps(answers, indent=4))


fungi

[
    {
        "text": "fungi",
        "probability": 1.0,
        "start_logit": 6.444554328918457,
        "end_logit": 6.382648468017578
    }
]
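
The internals of `parse_answer` are not shown in this notebook. As an illustration of the general idea, the sketch below ranks candidate answer spans by the sum of their start and end logits and turns the scores into relative probabilities with a softmax. The function `best_spans`, its parameters, and the toy logits are assumptions made for this example and may differ from what `parse_answer` actually does.

```python
import math


def best_spans(start_logits, end_logits, n_best=3, max_answer_length=30):
    """Rank (start, end) token spans by the sum of their logits."""
    # Consider only the n_best highest-scoring start and end positions.
    top_starts = sorted(range(len(start_logits)),
                        key=lambda i: start_logits[i], reverse=True)[:n_best]
    top_ends = sorted(range(len(end_logits)),
                      key=lambda i: end_logits[i], reverse=True)[:n_best]

    candidates = []
    for s in top_starts:
        for e in top_ends:
            # A valid span ends at or after its start and is not too long.
            if s <= e < s + max_answer_length:
                candidates.append((start_logits[s] + end_logits[e], s, e))
    candidates.sort(reverse=True)
    if not candidates:
        return []

    # Softmax over the candidate scores gives a relative probability per span.
    top_score = candidates[0][0]
    weights = [math.exp(score - top_score) for score, _, _ in candidates]
    total = sum(weights)
    return [
        {"start": s, "end": e, "probability": w / total}
        for (_, s, e), w in zip(candidates, weights)
    ]


# Toy logits over five tokens; the fourth token clearly stands out.
tokens = ["do", "not", "affect", "fungi", ","]
start = [-6.0, -5.5, -4.0, 6.4, -5.0]
end = [-6.1, -5.9, -5.0, 6.3, -4.5]

spans = best_spans(start, end)
best = spans[0]
answer_text = " ".join(tokens[best["start"]:best["end"] + 1])
print(answer_text, best["probability"])
```

When one span dominates, as in the real response above, its softmax probability is very close to 1, which is consistent with the reported `"probability": 1.0`.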
