In [None]:
# Copyright 2021 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<a href="https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/notebooks/bert_squad_tf_inference_colab.ipynb#scrollTo=5hRb96NKE3X0" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# BERT Question Answering Inference with Mixed Precision


## 1. Overview

Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

The original paper can be found here: https://arxiv.org/abs/1810.04805.

NVIDIA's BERT 19.10 is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.

### 1.a Learning objectives

This notebook demonstrates:
- Inference on QA task with BERT Large model
- The use/download of fine-tuned NVIDIA BERT models
- Use of Mixed Precision for Inference

## 2. Requirements

### 2.a GPU

Before running this notebook, please set the Colab runtime environment to GPU via the menu *Runtime => Change runtime type => GPU*.

This demo will work on any NVIDIA GPU with CUDA cores, though for improved FP16 inference, a Volta, Turing or newer generation GPU with Tensor cores is desired.  On Google Colab, this normally means a T4 GPU. If you are assigned an older K80 GPU, another trial at another time might give you a T4 GPU.
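To check which GPU the runtime was assigned without reading the full `nvidia-smi` table, the device name can be queried directly. The following is a minimal sketch, not part of the original notebook: the helper name `list_gpu_names` is hypothetical, and it assumes the `nvidia-smi` binary is on the `PATH` (it returns an empty list otherwise).

```python
import subprocess


def list_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if unavailable."""
    try:
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        # nvidia-smi is not installed or not executable.
        return []
    if proc.returncode != 0:
        # Driver not loaded or no GPU attached to this runtime.
        return []
    return [line.strip() for line in proc.stdout.splitlines() if line.strip()]


gpu_names = list_gpu_names()
print(gpu_names if gpu_names else "No NVIDIA GPU detected")
```

If the list shows a K80, reconnecting to a new runtime at another time may yield a T4, as noted above.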

In [None]:
# Select the TensorFlow 1.x runtime on Google Colab
%tensorflow_version 1.x
import tensorflow
print(tensorflow.__version__)

In [None]:
!nvidia-smi

### 2.b Download the required files from the NVIDIA GitHub repository:

In [None]:
!wget -nc -q --show-progress -O ./master.zip \
https://github.com/NVIDIA/DeepLearningExamples/archive/master.zip
!unzip -q -n -d . ./master.zip 

In [None]:
import os

WORKSPACE_DIR='./DeepLearningExamples-master/TensorFlow/LanguageModeling/BERT/'
os.chdir(WORKSPACE_DIR)
print(os.getcwd())

## 3. BERT Inference: Question Answering

We can run inference on a fine-tuned BERT model for tasks like Question Answering.

Here we use a BERT model fine-tuned on the [SQuAD 2.0 dataset](https://rajpurkar.github.io/SQuAD-explorer/), which combines 100,000+ question-answer pairs on 500+ articles with over 50,000 new, unanswerable questions.

### 3.a Paragraph and Queries

In this example we will ask our BERT model questions related to the following paragraph:

**The Apollo Program**
_"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975."_

The questions and their expected answers are shown below:

 - **Q1:** "What project put the first Americans into space?" 
  - **A1:** "Project Mercury"
 - **Q2:** "What program was created to carry out these projects and missions?"
  - **A2:** "The Apollo program"
 - **Q3:** "What year did the first manned Apollo flight occur?"
  - **A3:** "1968"
 - **Q4:** "What President is credited with the original notion of putting Americans in space?"
  - **A4:** "John F. Kennedy"
 - **Q5:** "Who did the U.S. collaborate with on an Earth orbit mission in 1975?"
  - **A5:** "Soviet Union"
 - **Q6:** "How long did Project Apollo run?"
  - **A6:** "1961 to 1972"
 - **Q7:** "What program helped develop space travel techniques that Project Apollo used?"
  - **A7:** "Gemini Mission"
 - **Q8:** "What space station supported three manned missions in 1973-1974?"
  - **A8:** "Skylab"
  
---

The paragraph and the questions can be easily customized by changing the code below:

---
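For more than a handful of questions, the same structure can be generated rather than typed by hand. The sketch below is illustrative only: `build_squad_input` is a hypothetical helper, not a function from the repository, and it produces the same layout as the `input.json` cell that follows (one article, one paragraph, questions with ids `Q1`, `Q2`, and so on).

```python
import json


def build_squad_input(title, context, questions):
    """Build a SQuAD-style dict for one paragraph and a list of questions."""
    qas = [
        {"question": question, "id": "Q{}".format(index)}
        for index, question in enumerate(questions, start=1)
    ]
    return {
        "data": [
            {"title": title, "paragraphs": [{"context": context, "qas": qas}]}
        ]
    }


example = build_squad_input(
    "Project Apollo",
    "Project Mercury was followed by the two-man Project Gemini.",
    ["What project followed Project Mercury?"],
)
print(json.dumps(example, indent=2))
```

Writing the result with `json.dump(example, f)` to `input.json` would be an alternative to the `%%writefile` cell.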

In [None]:
%%writefile input.json
{"data": 
 [
     {"title": "Project Apollo",
      "paragraphs": [
          {"context":"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.", 
           "qas": [
               { "question": "What project put the first Americans into space?", 
                 "id": "Q1"
               },
               { "question": "What program was created to carry out these projects and missions?",
                 "id": "Q2"
               },
               { "question": "What year did the first manned Apollo flight occur?",
                 "id": "Q3"
               },                
               { "question": "What President is credited with the original notion of putting Americans in space?",
                 "id": "Q4"
               },
               { "question": "Who did the U.S. collaborate with on an Earth orbit mission in 1975?",
                 "id": "Q5"
               },
               { "question": "How long did Project Apollo run?",
                 "id": "Q6"
               },               
               { "question": "What program helped develop space travel techniques that Project Apollo used?",
                 "id": "Q7"
               },                
               {"question": "What space station supported three manned missions in 1973-1974?",
                 "id": "Q8"
               }                
]}]}]}

In [None]:
import sys

working_dir = os.getcwd()
data_dir = os.path.join(working_dir, 'data/download')
if working_dir not in sys.path:
    sys.path.append(working_dir)

In [None]:
input_file = os.path.join(working_dir, 'input.json')

### 3.b Mixed Precision

Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of tensor cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.

For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
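
The trade-off behind these techniques can be seen with a few numbers. The sketch below is not from the original notebook: it uses Python's `struct` module (format code `e`, IEEE 754 half precision) to round values to FP16, and the helper name `to_fp16` is hypothetical. It illustrates how small updates and small magnitudes can be lost in half precision, which is why mixed precision keeps some values in single precision.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# FP16 has an 11-bit significand, so above 2048 not every integer is
# representable and a +1 update is rounded away.
print(to_fp16(2048.0 + 1.0))

# Very small magnitudes underflow to zero in FP16.
print(to_fp16(1e-8))

# Scaling the value before the conversion keeps it representable;
# dividing afterwards (in higher precision) recovers an approximation.
scale = 1024.0
print(to_fp16(1e-8 * scale) / scale)
```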

In this notebook we control mixed precision execution with the following environment variable:

In [None]:
import os
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1" 

## 4. Fine-Tuned NVIDIA BERT TF Models

Based on the model size, we have the following two default configurations of BERT.

| **Model** | **Hidden layers** | **Hidden unit size** | **Attention heads** | **Feedforward filter size** | **Max sequence length** | **Parameters** |
|:---------:|:----------:|:----:|:---:|:--------:|:---:|:----:|
|BERTBASE |12 encoder| 768| 12|4 x  768|512|110M|
|BERTLARGE|24 encoder|1024| 16|4 x 1024|512|330M|
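
The parameter counts in the table can be approximated from the other columns. The following sketch is an illustration, not code from the repository: it assumes the uncased WordPiece vocabulary of 30,522 tokens, two segment types, and a pooler layer, and the function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, max_seq_len=512,
                     vocab_size=30522, type_vocab=2):
    """Approximate the number of weights in a BERT encoder (assumed layout)."""
    ffn = 4 * hidden  # feedforward filter size from the table
    # Token, position and segment embeddings plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_seq_len + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per encoder layer.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


for name, layers, hidden in [("BERT Base", 12, 768), ("BERT Large", 24, 1024)]:
    count = bert_param_count(layers, hidden)
    print("{}: {:.0f}M parameters".format(name, count / 1e6))
```

Under these assumptions the totals come out near 110M and 335M, close to the rounded figures in the table. The number of attention heads does not change the count, since the heads partition the same projection matrices.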

We will take advantage of the fine-tuned models available on NGC (NVIDIA GPU Cloud, https://ngc.nvidia.com).
Among the many configurations available, we will download this one:

 - **bert_tf_ckpt_large_qa_squad2_amp_384**

It is trained on the SQuAD 2.0 dataset.

In [None]:
# bert_tf_ckpt_large_qa_squad2_amp_384
DATA_DIR_FT = os.path.join(data_dir, 'finetuned_large_model')
!mkdir -p $DATA_DIR_FT    
!wget --content-disposition -O $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip  \
https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_ckpt_large_qa_squad2_amp_384/versions/19.03.1/zip \
&& unzip -n -d $DATA_DIR_FT/ $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip \
&& rm $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip

In the code that follows we will refer to this model.

Download the Google pretrained weights and vocab file:

In [None]:
os.chdir("./data")
from GooglePretrainedWeightDownloader import GooglePretrainedWeightDownloader
gd = GooglePretrainedWeightDownloader(data_dir)
gd.download()
os.chdir("..")

We need the horovod package:

In [None]:
try:
    __import__("horovod")
except ImportError:
    os.system("pip install --no-cache-dir horovod")

## 5. Running QA task inference

In order to run QA inference we will follow step-by-step the flow implemented in run_squad.py.
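
One step of that flow is worth illustrating: a context longer than the model's input is split into overlapping windows, controlled by `max_seq_length` and `doc_stride` in the configuration below. The sketch that follows is a simplified illustration of this sliding-window idea, not the code in `run_squad.py`. The function name `split_into_windows` is hypothetical, and it assumes three positions are reserved for special tokens (one `[CLS]` and two `[SEP]`).

```python
def split_into_windows(num_doc_tokens, num_query_tokens,
                       max_seq_length=384, doc_stride=128):
    """Return (start, length) windows that cover a tokenized context."""
    # Room left for context tokens after the query and special tokens.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    if max_tokens_for_doc <= 0 or doc_stride <= 0:
        raise ValueError("no room left for context tokens")

    windows = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        windows.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return windows


# A 500-token context with a 10-token question needs several windows.
print(split_into_windows(500, 10))
```

Because consecutive windows overlap, an answer that straddles one window boundary can still appear whole in a neighbouring window.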

Configuration:

In [None]:
import run_squad
import json
import tensorflow as tf
import modeling
import tokenization
import time
import random

tf.logging.set_verbosity(tf.logging.INFO)

# Create the output directory where all the results are saved.
output_dir = os.path.join(working_dir, 'results')
tf.gfile.MakeDirs(output_dir)

# The config json file corresponding to the pre-trained BERT model.
# This specifies the model architecture.
# NOTE: this path assumes bert_config.json was unpacked into the
# 'finetuned_large_model' directory created in section 4.
bert_config_file = os.path.join(data_dir, 'finetuned_large_model/bert_config.json')

# The vocabulary file that the BERT model was trained on
# (assumed to sit in the same directory as the checkpoint).
vocab_file = os.path.join(data_dir, 'finetuned_large_model/vocab.txt')

# Initial checkpoint: the fine-tuned BERT Large model
init_checkpoint = os.path.join(data_dir, 'finetuned_large_model/model.ckpt')


# Whether to lower case the input text. 
# Should be True for uncased models and False for cased models.
do_lower_case = True
  
# Total batch size for predictions
predict_batch_size = 1
params = dict([('batch_size', predict_batch_size)])

# The maximum total input sequence length after WordPiece tokenization. 
# Sequences longer than this will be truncated, and sequences shorter than this will be padded.
max_seq_length = 384

# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128

# The maximum number of tokens for the question. 
# Questions longer than this will be truncated to this length.
max_query_length = 64

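# Workaround for running inside a notebook: Jupyter passes a '-f <kernel file>'
# argument, so a dummy 'f' flag is defined to keep flag parsing from failing.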
flags = tf.flags

if 'f' not in tf.flags.FLAGS: 
    tf.app.flags.DEFINE_string('f', '', 'kernel')
FLAGS = flags.FLAGS

verbose_logging = True
# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False
version_2_with_negative = False

# The total number of n-best predictions to generate in the nbest_predictions.json output file.
n_best_size = 20

# The maximum length of an answer that can be generated. 
# This is needed  because the start and end predictions are not conditioned on one another.
max_answer_length = 30

Let's define the tokenizer and create the model for the estimator:

In [None]:
# Validate the casing config consistency with the checkpoint name.
tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)

# Create the tokenizer.
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

# Load the configuration from file
bert_config = modeling.BertConfig.from_json_file(bert_config_file)

def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    unique_ids = features["unique_ids"]
    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]

    (start_logits, end_logits) = run_squad.create_model(
        bert_config=bert_config,
        is_training=False,
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        use_one_hot_embeddings=False)

    tvars = tf.trainable_variables()

    initialized_variable_names = {}
    (assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
    output_spec = None
    predictions = {"unique_ids": unique_ids,
                   "start_logits": start_logits,
                   "end_logits": end_logits}
    output_spec = tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    return output_spec

config = tf.ConfigProto(log_device_placement=True) 

run_config = tf.estimator.RunConfig(
      model_dir=None,
      session_config=config,
      save_checkpoints_steps=1000,
      keep_checkpoint_max=1)

estimator = tf.estimator.Estimator(
  model_fn=model_fn,
  config=run_config,
  params=params)

### 5.a Inference
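
After prediction, each token position has a start logit and an end logit, and `write_predictions` turns them into answer text. The core idea can be sketched as follows. This is a simplified illustration, not the repository's implementation: `best_span` is a hypothetical helper that only considers the `n_best_size` highest-scoring start and end positions and rejects spans that end before they start or exceed `max_answer_length`. Those checks are needed because the start and end predictions are not conditioned on one another.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_length=30):
    """Return (score, start, end) of the best valid span, or None."""
    def top_indexes(logits):
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    best = None
    for start in top_indexes(start_logits):
        for end in top_indexes(end_logits):
            if end < start:
                continue  # the span would end before it starts
            if end - start + 1 > max_answer_length:
                continue  # the span is longer than allowed
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# The highest end logit (position 0) precedes the highest start logit
# (position 1), so that pairing is rejected.
print(best_span([0.1, 5.0, 0.2, 0.3], [4.0, 0.1, 3.0, 0.2]))
```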

In [None]:
eval_examples = run_squad.read_squad_examples(
        input_file=input_file, is_training=False)

eval_writer = run_squad.FeatureWriter(
    filename=os.path.join(output_dir, "eval.tf_record"),
    is_training=False)

eval_features = []
def append_feature(feature):
    eval_features.append(feature)
    eval_writer.process_feature(feature)


# Convert the examples into input features and write them to the TFRecord file.
run_squad.convert_examples_to_features(
    examples=eval_examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=False,
    output_fn=append_feature)

eval_writer.close()

tf.logging.info("***** Running predictions *****")
tf.logging.info("  Num orig examples = %d", len(eval_examples))
tf.logging.info("  Num split examples = %d", len(eval_features))
tf.logging.info("  Batch size = %d", predict_batch_size)

predict_input_fn = run_squad.input_fn_builder(
    input_file=eval_writer.filename,
    batch_size=predict_batch_size,
    seq_length=max_seq_length,
    is_training=False,
    drop_remainder=False)

all_results = []
eval_hooks = [run_squad.LogEvalRunHook(predict_batch_size)]
eval_start_time = time.time()
for result in estimator.predict(
        predict_input_fn, yield_single_examples=True, hooks=eval_hooks, checkpoint_path=init_checkpoint):
    unique_id = int(result["unique_ids"])
    start_logits = [float(x) for x in result["start_logits"].flat]
    end_logits = [float(x) for x in result["end_logits"].flat]
    all_results.append(
      run_squad.RawResult(
          unique_id=unique_id,
          start_logits=start_logits,
          end_logits=end_logits))

eval_time_elapsed = time.time() - eval_start_time

time_list = eval_hooks[-1].time_list
time_list.sort()
eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = eval_hooks[-1].count * predict_batch_size
avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup

tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f Inference Time W/O start up overhead = %0.2f "
                "Sentences processed = %d", eval_time_elapsed, eval_time_wo_startup,
                num_sentences)
tf.logging.info("Inference Performance = %0.4f sentences/sec", avg_sentences_per_second)
tf.logging.info("-----------------------------")

output_prediction_file = os.path.join(output_dir, "predictions.json")
output_nbest_file = os.path.join(output_dir, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(output_dir, "null_odds.json")

run_squad.write_predictions(eval_examples, eval_features, all_results,
                  n_best_size, max_answer_length,
                  do_lower_case, output_prediction_file,
                  output_nbest_file, output_null_log_odds_file,
                  version_2_with_negative, verbose_logging)

tf.logging.info("Inference Results:")

# Here we show only the prediction results, nbest prediction is also available in the output directory
results = ""
with open(output_prediction_file, 'r') as json_file:
    data = json.load(json_file)
    for question in eval_examples:
        results += "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(question.qas_id, question.question_text, data[question.qas_id])


from IPython.display import display, HTML
display(HTML("<table><tr><th>Id</th><th>Question</th><th>Answer</th></tr>{}</table>".format(results)))        

## 6. What's next

Now that you are familiar with running QA inference on BERT using mixed precision, you may want to try
your own paragraphs and queries.

You may also want to take a look at the notebook __bert_squad_tf_finetuning.ipynb__, available in the same directory, to see how to run fine-tuning on BERT.