In [None]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="https://upload.wikimedia.org/wikipedia/en/6/6d/Nvidia_image_logo.svg" style="width: 90px; float: right;">

# QA Inference on BERT using TensorRT

## 1. Overview

Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

The original paper can be found here: https://arxiv.org/abs/1810.04805.


### 1.a Learning objectives

This notebook demonstrates:
- Inference on a Question Answering (QA) task with a BERT Base/Large model
- The use of fine-tuned NVIDIA BERT models
- The use of a BERT model with TensorRT (TRT)

## 2. Requirements

Please refer to the README file.

## 3. BERT Inference: Question Answering

We can run inference on a fine-tuned BERT model for tasks like Question Answering.

Here we use a BERT model fine-tuned on the [SQuAD 2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/), which combines 100,000+ question-answer pairs on 500+ articles with over 50,000 new, unanswerable questions.

### 3.a Paragraph and Queries

The paragraph and the questions can be customized by changing the text below:

#### Paragraph:

In [1]:
paragraph_text = "The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975."

#### Question:

In [2]:
question_text = "What project put the first Americans into space?"
#question_text =  "What year did the first manned Apollo flight occur?"
#question_text =  "What President is credited with the original notion of putting Americans in space?"
#question_text =  "Who did the U.S. collaborate with on an Earth orbit mission in 1975?"

In this example we ask our BERT model questions related to the following paragraph:

**The Apollo Program**
_"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975."_

The questions and their expected answers are shown below:

 - **Q1:** "What project put the first Americans into space?" 
  - **A1:** "Project Mercury"
 - **Q2:** "What program was created to carry out these projects and missions?"
  - **A2:** "The Apollo program"
 - **Q3:** "What year did the first manned Apollo flight occur?"
  - **A3:** "1968"
 - **Q4:** "What President is credited with the original notion of putting Americans in space?"
  - **A4:** "John F. Kennedy"
 - **Q5:** "Who did the U.S. collaborate with on an Earth orbit mission in 1975?"
  - **A5:** "Soviet Union"
 - **Q6:** "How long did Project Apollo run?"
  - **A6:** "1961 to 1972"
 - **Q7:** "What program helped develop space travel techniques that Project Apollo used?"
  - **A7:** "Gemini Mission"
 - **Q8:** "What space station supported three manned missions in 1973-1974?"
  - **A8:** "Skylab"
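
The expected answers above can also be kept in code, which makes it easy to check a prediction. The following is a minimal sketch that is not part of the original notebook: `expected_answers`, `normalize`, and `is_match` are hypothetical names, only three of the pairs are included, and the comparison is a simple case- and article-insensitive exact match rather than the official SQuAD evaluation.

```python
import re
import string

# Hypothetical lookup table built from a few of the question/answer pairs above.
expected_answers = {
    "What project put the first Americans into space?": "Project Mercury",
    "What year did the first manned Apollo flight occur?": "1968",
    "Who did the U.S. collaborate with on an Earth orbit mission in 1975?": "Soviet Union",
}

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_match(question, prediction):
    """Return True if the prediction equals the expected answer after normalization."""
    return normalize(prediction) == normalize(expected_answers[question])
```

Once a prediction is available, `is_match(question_text, prediction)` would report whether it agrees with the expected answer for that question.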

## Data Preprocessing
Let's convert the paragraph and the question to BERT input with the help of the tokenizer:
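
To make the target of this conversion concrete, the sketch below shows the usual BERT question-answering layout: `[CLS] question [SEP] paragraph [SEP]`, with segment ids that separate the two parts and a mask that marks padding. It is an illustration under assumptions, not the code in `data_processing`: `build_bert_input` is a hypothetical helper, whitespace splitting stands in for WordPiece, and a real pipeline would also map the tokens to vocabulary ids.

```python
def build_bert_input(query_tokens, doc_tokens, max_seq_length):
    """Assemble [CLS] query [SEP] document [SEP] and pad to max_seq_length."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("input is longer than max_seq_length")
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(tokens)
    padding = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    segment_ids = segment_ids + [0] * padding
    input_mask = input_mask + [0] * padding
    return tokens, segment_ids, input_mask

# Whitespace splitting stands in for WordPiece tokenization in this toy example.
toy_query = "what project put americans into space ?".split()
toy_doc = "project mercury put the first americans in space .".split()
toy_tokens, toy_segment_ids, toy_input_mask = build_bert_input(
    toy_query, toy_doc, max_seq_length=24)
```

All three lists have length `max_seq_length`, which is why the engine inputs used later share one fixed shape.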

In [4]:
%pip install tensorflow_text

Collecting tensorflow_text
  Downloading tensorflow_text-2.3.0-cp37-cp37m-manylinux1_x86_64.whl (2.6 MB)
Collecting tensorflow<2.4,>=2.3.0
  Downloading tensorflow-2.3.0-cp37-cp37m-manylinux2010_x86_64.whl (320.4 MB)
Installing collected packages: tensorflow, tensorflow-text
Successfully installed tensorflow-2.3.0 tensorflow-text-2.3.0
Note: you may need to restart the kernel to use updated packages.


In [5]:
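# Note: the PyPI package named `bert` is not Google's BERT code (it pulls in
# erlastic, an Erlang serialization library). The tokenizer used below comes
# from the tokenization.py file downloaded in the next cell.
%pip install bert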

Collecting bert
  Downloading bert-2.2.0.tar.gz (3.5 kB)
Collecting erlastic
  Downloading erlastic-2.0.0.tar.gz (6.8 kB)
Building wheels for collected packages: bert, erlastic
  Building wheel for bert (setup.py) ... done
  Created wheel for bert: filename=bert-2.2.0-py3-none-any.whl size=3754 sha256=4ab9e0198851fa11d3e2b70ab0da55e86c780437f531fafb6dfba746a512a2fe
  Stored in directory: /home/fabriziomilo/.cache/pip/wheels/bb/31/1b/c05f362e347429b7436954d1a2280fe464731e8f569123a848
  Building wheel for erlastic (setup.py) ... done
  Created wheel for erlastic: filename=erlastic-2.0.0-py3-none-any.whl size=6787 sha256=5ba0a79c77a1c6a619c4b8c242f670a2dc91f31682df2b1cd9294642e7d77bd4
  Stored in directory: /home/fabriziomilo/.cache/pip/wheels/94/f1/b4/0b98b1e94775da6a0b1130e342d22af05cd269e1172c19f40f
Successfully built bert erlastic
Installing collected packages: erlastic, bert
Successfully installed bert-2.2.0 erlastic-2.0.0
Note: you may need to restart the kernel to use updated packages.

In [17]:
!wget -O tokenizer.py https://raw.githubusercontent.com/google-research/bert/master/tokenization.py

--2020-08-04 14:02:48--  https://raw.githubusercontent.com/google-research/bert/master/tokenization.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12257 (12K) [text/plain]
Saving to: ‘tokenizer.py’


2020-08-04 14:02:48 (87.5 MB/s) - ‘tokenizer.py’ saved [12257/12257]



In [18]:
import tokenizer

In [20]:
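import tensorflow as tf

# tokenization.py was written for TensorFlow 1.x and calls tf.gfile;
# alias it to its TensorFlow 2.x location so the module keeps working.
tf.gfile = tf.io.gfile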

In [21]:
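import data_processing as dp
# tokenizer.py is the tokenization.py file downloaded above. Importing it under
# its original name keeps the module reachable after `tokenizer` is rebound to
# a FullTokenizer instance, so this cell can be re-run.
import tokenizer as tokenization

# Large
# tokenizer = tokenization.FullTokenizer(vocab_file="./data/uncased_L-24_H-1024_A-16/vocab.txt", do_lower_case=True)
# Base
tokenizer = tokenization.FullTokenizer(vocab_file="./data/uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)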

# The maximum number of tokens for the question. Questions longer than this will be truncated to this length.
max_query_length = 64

# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128

# The maximum total input sequence length after WordPiece tokenization.
# Sequences longer than this will be truncated, and sequences shorter
# than this will be padded.
max_seq_length = 384

# Extract tokens from the paragraph
doc_tokens = dp.convert_doc_tokens(paragraph_text)

# Extract features from the paragraph and question
features = dp.convert_examples_to_features(doc_tokens, question_text, tokenizer, max_seq_length, doc_stride, max_query_length)


NotFoundError: ./data/uncased_L-12_H-768_A-12/vocab.txt; No such file or directory
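
The `doc_stride` and `max_seq_length` comments above describe a sliding window over the tokenized paragraph. The sketch below illustrates that idea following the scheme used in the reference BERT SQuAD script. Whether `dp.convert_examples_to_features` chunks the document in exactly this way is an assumption, and `split_into_spans` is a hypothetical helper that works on token counts only.

```python
def split_into_spans(num_doc_tokens, num_query_tokens, max_seq_length=384, doc_stride=128):
    """Return (start, length) windows that cover a tokenized document."""
    # Three positions are reserved for [CLS] and the two [SEP] tokens.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0:
        raise ValueError("the query leaves no room for document tokens")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        # Advance by the stride so that consecutive windows overlap.
        start += min(length, doc_stride)
    return spans

short_spans = split_into_spans(num_doc_tokens=300, num_query_tokens=10)
long_spans = split_into_spans(num_doc_tokens=600, num_query_tokens=10)
```

A paragraph that fits in one window yields a single span, while a longer one yields overlapping spans, so an answer cut off at the edge of one window can still appear whole in the next.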

## TensorRT Inference

In [None]:
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

In [None]:
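import ctypes

# Load the TensorRT plugin library and the demo's shared libraries.
# RTLD_GLOBAL makes their symbols visible to libraries loaded afterwards.
nvinfer = ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)
cm = ctypes.CDLL("./build/libcommon.so", mode=ctypes.RTLD_GLOBAL)
pg = ctypes.CDLL("./build/libbert_plugins.so", mode=ctypes.RTLD_GLOBAL)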

In [None]:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time

# For this example we are going to use batch size 1
max_batch_size = 1

# Load the Large BERT Engine
# with open("./bert_python.engine", "rb") as f, \
#    trt.Runtime(TRT_LOGGER) as runtime, \
#    runtime.deserialize_cuda_engine(f.read()) as engine, \
#    engine.create_execution_context() as context:

# Load the Base BERT Engine
with open("./bert_python_base.engine", "rb") as f, \
    trt.Runtime(TRT_LOGGER) as runtime, \
    runtime.deserialize_cuda_engine(f.read()) as engine, \
    engine.create_execution_context() as context:

    print("List engine binding:")
    for binding in engine:
        print(" - {}: {}, Shape {}, {}".format(
            "Input" if engine.binding_is_input(binding) else "Output",
            binding,
            engine.get_binding_shape(binding),
            engine.get_binding_dtype(binding)))

    
    def binding_nbytes(binding):
        return trt.volume(engine.get_binding_shape(binding)) * engine.get_binding_dtype(binding).itemsize
    
    # Allocate device memory for inputs and outputs.
    d_inputs = [cuda.mem_alloc(binding_nbytes(binding)) for binding in engine if engine.binding_is_input(binding)]
    h_output = cuda.pagelocked_empty(tuple(engine.get_binding_shape(3)), dtype=np.float32)
    d_output = cuda.mem_alloc(h_output.nbytes)

    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()

    print("\nRunning Inference...")
    eval_start_time = time.time()

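    # Copy inputs. `features` is the object built in the preprocessing cell; it is
    # assumed here to expose "input_ids", "segment_ids" and "input_mask" arrays.
    cuda.memcpy_htod_async(d_inputs[0], features["input_ids"], stream)
    cuda.memcpy_htod_async(d_inputs[1], features["segment_ids"], stream)
    cuda.memcpy_htod_async(d_inputs[2], features["input_mask"], stream)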

    # Run inference
    context.execute_async(bindings=[int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from GPU
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()

    eval_time_elapsed = time.time() - eval_start_time
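
The buffer sizes in the cell above come from multiplying the number of elements in a binding by the size of one element. The sketch below reproduces that arithmetic with NumPy only, so it runs without a GPU or an engine file. `buffer_nbytes` is a hypothetical stand-in for `binding_nbytes`, and the shapes are assumptions chosen to match `max_seq_length = 384`, not values read from the engine.

```python
import numpy as np

def buffer_nbytes(shape, dtype):
    """Bytes needed for a dense buffer with the given shape and element type."""
    volume = int(np.prod(shape, dtype=np.int64))  # number of elements
    return volume * np.dtype(dtype).itemsize

# Assumed shapes: one 384-token int32 input and a (384, 2) float32 output
# holding a start logit and an end logit per token.
input_nbytes = buffer_nbytes((384,), np.int32)
output_nbytes = buffer_nbytes((384, 2), np.float32)
```

With these assumed shapes, each input buffer needs 1536 bytes and the output buffer needs 3072 bytes.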

## Data Post-Processing

Now that we have the inference results, let's extract the actual answer to our question.

In [None]:
start_logits = h_output[:, 0]
end_logits = h_output[:, 1]

# The total number of n-best predictions to generate in the nbest_predictions.json output file
n_best_size = 20

# The maximum length of an answer that can be generated. This is needed 
#  because the start and end predictions are not conditioned on one another
max_answer_length = 30


prediction, nbest_json, scores_diff_json = dp.get_predictions(
    doc_tokens, features, start_logits, end_logits, n_best_size, max_answer_length)


print("-----------------------------")
print("Running Inference in {:.3f} Sentences/Sec".format(1.0/eval_time_elapsed))
print("-----------------------------")
    
print("Answer: '{}'".format(prediction))
print("with prob: {:.3f}%".format(nbest_json[0]['probability']*100.0))
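
The role of `n_best_size` and `max_answer_length` can be illustrated with a simplified span search. This is a sketch under assumptions and not the implementation of `dp.get_predictions`: `best_span` is a hypothetical function that scores each candidate span as the sum of its start and end logits, keeps only valid spans, and applies a softmax over the candidates. It ignores details such as excluding question tokens and handling unanswerable questions.

```python
import numpy as np

def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (start, end, probability) of the highest-scoring valid span."""
    start_logits = np.asarray(start_logits, dtype=np.float64)
    end_logits = np.asarray(end_logits, dtype=np.float64)
    # Consider only the n_best_size highest-scoring start and end positions.
    top_starts = np.argsort(start_logits)[::-1][:n_best_size]
    top_ends = np.argsort(end_logits)[::-1][:n_best_size]
    candidates = []
    for s in top_starts:
        for e in top_ends:
            # A valid span ends at or after its start and is not too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append((start_logits[s] + end_logits[e], int(s), int(e)))
    if not candidates:
        return None
    candidates.sort(reverse=True)
    scores = np.array([c[0] for c in candidates])
    # Softmax over the candidate scores gives a probability per span.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    _, start, end = candidates[0]
    return start, end, float(probs[0])

toy_result = best_span([0.1, 5.0, 0.2, 0.0], [0.0, 0.3, 4.0, 0.1])
```

Because the start and end positions are predicted independently, the validity check is what prevents spans that end before they start or that run longer than `max_answer_length`.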

