In [66]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="https://upload.wikimedia.org/wikipedia/en/6/6d/Nvidia_image_logo.svg" style="width: 90px; float: right;">

# QA Inference on BERT using TensorRT Inference Server

## 1. Overview

Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

The original paper can be found here: https://arxiv.org/abs/1810.04805.

### 1.a Learning objectives

This notebook demonstrates:

 *  Inference on a Question Answering (QA) task with a BERT Base/Large model
 *  The use of fine-tuned NVIDIA BERT models
 *  The use of a BERT model with TensorRT Inference Server

## 2. Requirements

Please refer to the README file.

## 3. BERT Inference: Question Answering

We can run inference on a fine-tuned BERT model for tasks like Question Answering.

Here we use a BERT model fine-tuned on the SQuAD 2.0 dataset, which combines 100,000+ question-answer pairs on 500+ articles with over 50,000 new, unanswerable questions.

### 3.a Paragraph and Queries

The paragraph and the questions can be customized by changing the text below. Note that models with a short maximum sequence length need a shorter paragraph.

#### Paragraph:

In this example we ask our BERT model questions related to the following paragraph:

**The Apollo Program** "The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975."

**Short Version**: "The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975."

The questions and their expected answers are shown below:

*  Q1: "What project put the first Americans into space?"
  *  A1: "Project Mercury"
*  Q2: "What program was created to carry out these projects and missions?"
  *  A2: "The Apollo program"
*  Q3: "What year did the first manned Apollo flight occur?"
  *  A3: "1968"
*  Q4: "What President is credited with the original notion of putting Americans in space?"
  *  A4: "John F. Kennedy"
*  Q5: "Who did the U.S. collaborate with on an Earth orbit mission in 1975?"
  *  A5: "Soviet Union"
*  Q6: "How long did Project Apollo run?"
  *  A6: "1961 to 1972"
*  Q7: "What program helped develop space travel techniques that Project Apollo used?"
  *  A7: "Gemini Mission"
*  Q8: "What space station supported three manned missions in 1973-1974?"
  *  A8: "Skylab"

In [67]:
paragraph_text = "The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975."

# Short paragraph version for BERT models with max sequence length of 128
short_paragraph_text = "The Apollo program was the third United States human spaceflight program. First conceived as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was dedicated to President John F. Kennedy's national goal of landing a man on the Moon. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972 followed by the Apollo-Soyuz Test Project a joint Earth orbit mission with the Soviet Union in 1975."

#### Question:

In [68]:
#question_text = "What project put the first Americans into space?"
#question_text =  "What year did the first manned Apollo flight occur?"
question_text =  "What President is credited with the original notion of putting Americans in space?"
#question_text =  "Who did the U.S. collaborate with on an Earth orbit mission in 1975?"

## 4. Data Preprocessing

Let's convert the paragraph and the question to BERT input with the help of the tokenizer:

In [69]:
import os
import time
import numpy as np
import data_processing as dp
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

# The maximum number of tokens for the question. Questions longer than this will be truncated to this length.
max_query_length = 64

# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128

# The maximum total input sequence length after WordPiece tokenization. 
# Sequences longer than this will be truncated, and sequences shorter than this will be padded.
max_seq_length = 128

# Extract tokens from the paragraph
doc_tokens = dp.convert_doc_tokens(short_paragraph_text)

# Extract features from the paragraph and question
features = dp.convert_examples_to_features(doc_tokens, question_text, tokenizer, max_seq_length, doc_stride, max_query_length)
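To make the preprocessing step concrete, here is a minimal sketch of the standard BERT question-answering input layout: `[CLS] question [SEP] paragraph [SEP]`, zero-padded to `max_seq_length`. It uses a whitespace split in place of WordPiece, a toy vocabulary, and a hypothetical `build_inputs` helper. The internals of `dp.convert_examples_to_features` are not shown in this notebook and may differ.

```python
import numpy as np

def build_inputs(question_tokens, doc_tokens, vocab, max_seq_length):
    """Hypothetical helper: lay out one question/paragraph pair as BERT inputs.

    Assumes the question already fits in the sequence (the notebook
    truncates it to max_query_length beforehand).
    """
    # Keep only as many paragraph tokens as fit next to [CLS] and two [SEP] markers.
    max_doc_tokens = max_seq_length - len(question_tokens) - 3
    doc_tokens = doc_tokens[:max_doc_tokens]

    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    # Unknown tokens map to the [UNK] id.
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(tokens)

    padding = [0] * (max_seq_length - len(tokens))
    return {
        "input_ids": np.array(input_ids + padding, dtype=np.int32),
        "segment_ids": np.array(segment_ids + padding, dtype=np.int32),
        "input_mask": np.array(input_mask + padding, dtype=np.int32),
    }

# Toy vocabulary for illustration only (not the contents of vocab.txt).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "who": 4, "flew": 5, "apollo": 6}

toy = build_inputs("who flew".split(), "apollo flew in 1968".split(),
                   toy_vocab, max_seq_length=10)
print(toy["input_ids"])
print(toy["segment_ids"])
print(toy["input_mask"])
```

All three arrays have the same fixed length, which is what lets the server declare each input with a fixed dimension.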

## 5. Inference

We use the TensorRT Inference Server Python API, which makes communicating with the inference server straightforward.
This example uses the HTTP protocol; gRPC is also supported and easy to enable.

In [70]:
from builtins import range
from tensorrtserver.api import *

model_name = "bert_tf_v2_large_fp16_128_v2"
model_version = -1
batch_size = 1
url = '34.66.157.22:8000'

protocol = ProtocolType.from_str('http')
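For the gRPC alternative mentioned above, the main differences on the client side are the protocol string and the port. The sketch below assumes the server was started with its default ports (8000 for HTTP, 8001 for gRPC) and uses a hypothetical `endpoint_for` helper to pick the matching one; the host name is a placeholder.

```python
# Default ports of a TensorRT Inference Server deployment (assumption:
# the server was started without overriding them).
DEFAULT_PORTS = {"http": 8000, "grpc": 8001}

def endpoint_for(host, protocol_name):
    """Hypothetical helper: return 'host:port' for the chosen protocol."""
    key = protocol_name.lower()
    if key not in DEFAULT_PORTS:
        raise ValueError("unsupported protocol: {}".format(protocol_name))
    return "{}:{}".format(host, DEFAULT_PORTS[key])

grpc_url = endpoint_for("localhost", "grpc")
http_url = endpoint_for("localhost", "http")
print(grpc_url, http_url)

# With the client library installed, the gRPC variant of the cell above
# would then use:
#   protocol = ProtocolType.from_str('grpc')
#   url = grpc_url
```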

In [71]:
# Create a health context to query the live and ready state of the server.
health_ctx = ServerHealthContext(url, protocol=protocol, http_headers='', verbose=True)

In [72]:
print("Health for model {}".format(model_name))
print("Live: {}".format(health_ctx.is_live()))
print("Ready: {}".format(health_ctx.is_ready()))

Health for model bert_tf_v2_large_fp16_128_v2
Live: True
Ready: True


In [73]:
# Create a status context and get server status
status_ctx = ServerStatusContext(url, protocol, model_name, http_headers='', verbose=True)

print("Status for model {}".format(model_name))
print(status_ctx.get_server_status())

Status for model bert_tf_v2_large_fp16_128_v2
id: "inference:0"
version: "1.6.0"
uptime_ns: 15688767642199
model_status {
  key: "bert_tf_v2_large_fp16_128_v2"
  value {
    config {
      name: "bert_tf_v2_large_fp16_128_v2"
      platform: "tensorflow_savedmodel"
      version_policy {
        latest {
          num_versions: 1
        }
      }
      max_batch_size: 1
      input {
        name: "unique_ids"
        data_type: TYPE_INT32
        dims: 1
        reshape {
        }
      }
      input {
        name: "segment_ids"
        data_type: TYPE_INT32
        dims: 128
      }
      input {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: 128
      }
      input {
        name: "input_mask"
        data_type: TYPE_INT32
        dims: 128
      }
      output {
        name: "end_logits"
        data_type: TYPE_FP32
        dims: 128
      }
      output {
        name: "start_logits"
        data_type: TYPE_FP32
        dims: 128
      }
      instance_g

In [74]:
# Create the inference context for the model.
infer_ctx = InferContext(url, protocol, model_name, model_version, http_headers='', verbose=True)

In [75]:
# Create the data for the four input tensors.
# Set the first to a unique integer ID and the others to the values obtained from pre-processing.
unique_ids = np.int32([1])
segment_ids = features["segment_ids"]
input_ids = features["input_ids"]
input_mask = features["input_mask"]
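The server status above lists each input as `TYPE_INT32`, with `dims: 1` for `unique_ids` and `dims: 128` for the other three. A quick client-side check can catch a mismatch before a request is sent. The following is a minimal sketch using a hypothetical `check_inputs` helper and zero-filled stand-in arrays; it is not part of the client API.

```python
import numpy as np

# Expected per-request shapes, copied from the model config printed by the server.
EXPECTED_SHAPES = {
    "unique_ids": (1,),
    "segment_ids": (128,),
    "input_ids": (128,),
    "input_mask": (128,),
}

def check_inputs(tensors):
    """Hypothetical helper: return a list of problems (empty if none)."""
    problems = []
    for name, shape in EXPECTED_SHAPES.items():
        if name not in tensors:
            problems.append("{}: missing".format(name))
            continue
        array = np.asarray(tensors[name])
        if array.dtype != np.int32:
            problems.append("{}: dtype {} is not int32".format(name, array.dtype))
        if array.shape != shape:
            problems.append("{}: shape {} is not {}".format(name, array.shape, shape))
    return problems

# Stand-in data: one well-formed set of inputs and one with a wrong input_ids.
good = {name: np.zeros(shape, dtype=np.int32) for name, shape in EXPECTED_SHAPES.items()}
bad = dict(good, input_ids=np.zeros(64, dtype=np.int64))

good_problems = check_inputs(good)
bad_problems = check_inputs(bad)
print(good_problems)
print(bad_problems)
```

In the notebook, the same check could be applied to a dictionary built from `unique_ids`, `segment_ids`, `input_ids` and `input_mask`.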

In [76]:
print("\nRunning Inference...")
eval_start_time = time.time()

# Send inference request to the inference server. Get results for
# both output tensors.
result = infer_ctx.run({ 'unique_ids' : (unique_ids,),
                         'segment_ids' : (segment_ids,),
                         'input_ids' : (input_ids,),
                         'input_mask' : (input_mask,) },
                       { 'end_logits' : InferContext.ResultFormat.RAW,
                         'start_logits' : InferContext.ResultFormat.RAW }, 
                       batch_size)

eval_time_elapsed = time.time() - eval_start_time

print("-----------------------------")
print("Running Inference in {:.3f} Sentences/Sec".format(1.0/eval_time_elapsed))
print("-----------------------------")


Running Inference...
-----------------------------
Running Inference in 17.378 Sentences/Sec
-----------------------------
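A single timed request gives a noisy throughput figure, since it can include connection setup and other first-request overhead. Below is a minimal sketch of a steadier measurement, assuming a hypothetical `measure_throughput` helper that wraps any zero-argument callable (for example, a lambda around `infer_ctx.run(...)`). A sleeping stand-in replaces the real request so the sketch runs without a server.

```python
import time

def measure_throughput(request_fn, batch_size=1, warmup=2, iterations=10):
    """Hypothetical helper: average sentences/sec over several timed calls."""
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        request_fn()

    start = time.perf_counter()
    for _ in range(iterations):
        request_fn()
    elapsed = time.perf_counter() - start

    return batch_size * iterations / elapsed

# Stand-in for a real inference request.
calls = []

def fake_request():
    calls.append(1)
    time.sleep(0.001)

sentences_per_sec = measure_throughput(fake_request, warmup=2, iterations=5)
print("{:.1f} sentences/sec over {} calls".format(sentences_per_sec, len(calls)))
```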


## 6. Post-Processing

In [77]:
# We expect two results, one per output tensor, each with batch size 1.
end_logits = result['end_logits'][0]
start_logits = result['start_logits'][0]

In [78]:
# The total number of n-best predictions to generate in the nbest_predictions.json output file
n_best_size = 20

# The maximum length of an answer that can be generated. This is needed 
#  because the start and end predictions are not conditioned on one another
max_answer_length = 30

(prediction, nbest_json, scores_diff_json) = \
    dp.get_predictions(doc_tokens, features, start_logits, end_logits, n_best_size, max_answer_length)

print("Answer: '{}'".format(prediction))
print("with prob: {:.3f}%".format(nbest_json[0]['probability'] * 100.0))

Answer: 'John F. Kennedy'
with prob: 77.919%
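For intuition on what the post-processing does, here is a minimal sketch of the usual SQuAD-style span selection: score every (start, end) pair by the sum of its logits, skip pairs that are reversed or longer than `max_answer_length`, keep the `n_best_size` highest, and apply a softmax over those scores. The `best_spans` helper and the toy logits are illustrative. `dp.get_predictions` is not shown in this notebook and may differ, for example in how it maps token positions back to the original text.

```python
import numpy as np

def best_spans(start_logits, end_logits, n_best_size, max_answer_length):
    """Illustrative helper: return (start, end, probability) tuples, best first."""
    start_logits = np.asarray(start_logits, dtype=np.float64)
    end_logits = np.asarray(end_logits, dtype=np.float64)

    candidates = []
    for start in range(len(start_logits)):
        # Only consider ends at or after the start and within the length limit.
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            candidates.append((score, start, end))

    # Keep the n_best_size highest-scoring spans.
    candidates.sort(key=lambda c: c[0], reverse=True)
    candidates = candidates[:n_best_size]

    # Softmax over the retained scores (shifted by the max for stability).
    scores = np.array([c[0] for c in candidates])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return [(s, e, float(p)) for (_, s, e), p in zip(candidates, probs)]

# Toy logits: the start peaks at position 2, the end at position 4.
toy_start = [0.1, 0.2, 5.0, 0.3, 0.1, 0.0]
toy_end = [0.0, 0.1, 0.4, 0.2, 6.0, 0.1]

spans = best_spans(toy_start, toy_end, n_best_size=3, max_answer_length=4)
top_start, top_end, top_prob = spans[0]
print(top_start, top_end, round(top_prob, 3))
```

Because the probability is normalized over the retained candidates only, its value depends on `n_best_size` as well as on the logits.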
