In [1]:
# Copyright 2022 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<img src="https://developer.download.nvidia.com/tesla/notebook_assets/nv_logo_torch_trt_resnet_notebook.png" style="width: 90px; float: right;">

# Masked Language Modeling (MLM) with Hugging Face BERT Transformer

## Learning objectives

This notebook demonstrates how to compile a TorchScript module of a pretrained Hugging Face BERT transformer with Torch-TensorRT, and how to benchmark the compiled module to measure the speedup obtained.

## Contents
1. [Requirements](#1)
2. [BERT Overview](#2)
3. [Creating TorchScript modules](#3)
4. [Compiling with Torch-TensorRT](#4)
5. [Benchmarking](#5)
6. [Conclusion](#6)

<a id="1"></a>
## 1. Requirements

NVIDIA's NGC provides a PyTorch Docker container that includes both PyTorch and Torch-TensorRT. Starting with version `22.05-py3`, the [latest PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) can be used to run this notebook.

Otherwise, you can follow the steps in `notebooks/README` to prepare a Docker container yourself, within which you can run this demo notebook.

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [3]:
from transformers import BertTokenizer, BertForMaskedLM
import torch
import timeit
import numpy as np
import torch_tensorrt
import torch.backends.cudnn as cudnn

<a id="2"></a>
## 2. BERT Overview

Transformers comprise a class of deep learning algorithms employing self-attention; broadly speaking, the models learn large matrices of numbers, each element of which denotes how important one component of input data is to another. Since their introduction in 2017, transformers have enjoyed widespread adoption, particularly in natural language processing, but also in computer vision problems. This is largely because they are easier to parallelize than the sequence models which attention mechanisms were originally designed to augment. 
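To make the idea of "how important one component of input data is to another" concrete, here is a minimal pure-Python sketch of scaled dot-product attention weights. It is a simplification: the token vectors are made-up values, and the learned query/key projection matrices that a real transformer applies are omitted.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one row of attention weights per query vector."""
    dim = len(keys[0])
    rows = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        rows.append(softmax(scores))
    return rows

# Three toy 2-dimensional token vectors (illustrative values, not from BERT).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1, and entry `[i][j]` can be read as how strongly token `i` attends to token `j`.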

Hugging Face is a company that maintains a huge repository of pre-trained transformer models. The company also provides tools for integrating those models into PyTorch code and running inference with them.

One of the most popular transformer models is BERT (Bidirectional Encoder Representations from Transformers). First developed at Google and released in 2018, it has been used in Google's search engine and has become a standard baseline for NLP experiments. BERT was originally trained on two tasks: next sentence prediction and masked language modeling (MLM), which aims to predict hidden words in sentences. In this notebook, we will use Hugging Face's `bert-base-uncased` model (the base-size BERT model; "uncased" means that input text is lowercased before tokenization) for MLM.

<a id="3"></a>
## 3. Creating TorchScript modules  

First, create a pretrained BERT tokenizer from the `bert-base-uncased` model

In [4]:
enc = BertTokenizer.from_pretrained('bert-base-uncased')

Create dummy inputs, which will be used later to trace the model into TorchScript.

In [5]:
batch_size = 4

batched_indexed_tokens = [[101, 64]*64]*batch_size
batched_segment_ids = [[0, 1]*64]*batch_size
batched_attention_masks = [[1, 1]*64]*batch_size

tokens_tensor = torch.tensor(batched_indexed_tokens)
segments_tensor = torch.tensor(batched_segment_ids)
attention_masks_tensor = torch.tensor(batched_attention_masks)

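Obtain a BERT masked language model from Hugging Face configured for TorchScript export (`torchscript=True`), then use the dummy inputs to trace it. The untraced model is referred to below as the (scripted) TorchScript model.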

In [6]:
mlm_model_ts = BertForMaskedLM.from_pretrained('bert-base-uncased', torchscript=True)
# Positional inputs follow BertForMaskedLM.forward: input_ids, attention_mask, token_type_ids
traced_mlm_model = torch.jit.trace(mlm_model_ts, [tokens_tensor, attention_masks_tensor, segments_tensor])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Define 4 masked sentences, with 1 word in each sentence hidden from the model. Fluent English speakers will probably be able to guess the masked words, but just in case, they are `'capital'`, `'language'`, `'innings'`, and `'mathematics'`.

Also create a list containing the position of the masked word within each sentence. Given Python's 0-based indexing convention, the numbers are each higher by 1 than might be expected. This is because the token at index 0 in each sentence is a beginning-of-sentence token, denoted `[CLS]` when entered explicitly. 

In [7]:
masked_sentences = ['Paris is the [MASK] of France.', 
                    'The primary [MASK] of the United States is English.', 
                    'A baseball game consists of at least nine [MASK].', 
                    'Topology is a branch of [MASK] concerned with the properties of geometric objects that remain unchanged under continuous transformations.']
pos_masks = [4, 3, 9, 6]
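The values in `pos_masks` can be sanity-checked with a rough sketch that prepends `[CLS]` and looks up the index of `[MASK]`. The helper name `approximate_mask_positions` is hypothetical, and the sketch assumes every word and punctuation mark maps to exactly one token; real WordPiece tokenization may split rarer words into several pieces, in which case the positions would shift.

```python
import re

def approximate_mask_positions(sentences):
    """Return the index of [MASK] in each sentence, counting [CLS] as index 0.

    Assumption: one token per word or punctuation mark (not true WordPiece).
    """
    positions = []
    for sentence in sentences:
        # Keep [MASK] intact, then split into words and single punctuation marks.
        pieces = re.findall(r"\[MASK\]|\w+|[^\w\s]", sentence)
        tokens = ["[CLS]"] + pieces + ["[SEP]"]
        positions.append(tokens.index("[MASK]"))
    return positions

sentences = [
    "Paris is the [MASK] of France.",
    "The primary [MASK] of the United States is English.",
    "A baseball game consists of at least nine [MASK].",
    "Topology is a branch of [MASK] concerned with the properties of geometric "
    "objects that remain unchanged under continuous transformations.",
]
approx_positions = approximate_mask_positions(sentences)
print(approx_positions)  # [4, 3, 9, 6]
```

A more robust alternative is to search each row of the tokenizer's `input_ids` for the tokenizer's `mask_token_id`, which does not depend on how words are split.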

Pass the masked sentences into the (scripted) TorchScript MLM model and verify that it predicts the expected words for the masked positions.

Because the sentences are of different lengths, we must specify the `padding` argument in calling our encoder/tokenizer. There are several possible padding strategies, but we'll use `'max_length'` padding with `max_length=128`. Later, when we compile an optimized version of the model with Torch-TensorRT, the optimized model will expect inputs of length 128, hence our choice of padding strategy and length here. 
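As a rough illustration of what `'max_length'` padding produces, the following pure-Python sketch right-pads one sequence of token ids and builds the matching attention mask. The helper `pad_to_max_length` is hypothetical and is not part of the tokenizer API; the ids 101, 102, 103, and 0 are the `[CLS]`, `[SEP]`, `[MASK]`, and `[PAD]` ids in the `bert-base-uncased` vocabulary, while the other ids are placeholders.

```python
PAD_ID = 0  # [PAD] id in the bert-base-uncased vocabulary

def pad_to_max_length(token_ids, max_length=128, pad_id=PAD_ID):
    """Right-pad one sequence and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError("sequence is longer than max_length")
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

# Hypothetical short sequence: [CLS], three placeholder word ids, [MASK], [SEP].
example = [101, 2001, 2002, 2003, 103, 102]
input_ids, attention_mask = pad_to_max_length(example)
print(len(input_ids), sum(attention_mask))  # 128 6
```

Padding every sentence to the same fixed length is what lets all four inputs share the static `[batch_size, 128]` shape.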

In [8]:
encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
outputs = mlm_model_ts(**encoded_inputs)
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


Pass the masked sentences into the traced MLM model and verify that it predicts the expected words for the masked positions.

Note the difference in how the `encoded_inputs` are passed into the model in the following cell compared to the previous one. If you examine `encoded_inputs`, you'll find that it's a dictionary with 3 keys, `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`, each with a PyTorch tensor as an associated value. The traced model will accept `**encoded_inputs` as an input, but the Torch-TensorRT-optimized model (to be defined later) will not, so the tensors are passed positionally instead. Positional arguments must follow the parameter order of `BertForMaskedLM.forward` (`input_ids`, `attention_mask`, `token_type_ids`), which differs from the key order of the dictionary.

In [9]:
encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
outputs = traced_mlm_model(encoded_inputs['input_ids'], encoded_inputs['attention_mask'], encoded_inputs['token_type_ids'])
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.
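The difference between keyword and positional passing can be seen with a small stand-in function. This is only a sketch: `forward` here is a hypothetical stub whose first three parameters are assumed to mirror those of `BertForMaskedLM.forward` (worth confirming against the installed `transformers` version), and strings stand in for tensors.

```python
def forward(input_ids, attention_mask=None, token_type_ids=None):
    """Stand-in that reports which value landed in which parameter."""
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

# Stand-in for the tokenizer output, using the same key order as the tokenizer.
encoded_inputs = {"input_ids": "IDS", "token_type_ids": "TYPES", "attention_mask": "MASK"}

by_name = forward(**encoded_inputs)              # matched by key, order irrelevant
by_position = forward(*encoded_inputs.values())  # matched by dictionary order

print(by_name["attention_mask"])      # MASK
print(by_position["attention_mask"])  # TYPES
```

Keyword unpacking is matched by name, so it is safe regardless of order; positional passing silently swaps the two tensors unless the arguments are reordered by hand.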


<a id="4"></a>
## 4. Compiling with Torch-TensorRT

In [10]:
trt_model = torch_tensorrt.compile(traced_mlm_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32)],
    enabled_precisions= {torch.float32}, # Run with 32-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)

The compiler is going to use the user setting Int32
This conflict may cause an error at runtime due to partial compilation being enabled and therefore
compatibility with PyTorch's data type convention is required.
If you do indeed see errors at runtime either:
- Remove the dtype spec for input_ids
- Disable partial compilation by setting require_full_compilation to True
The compiler is going to use the user setting Int32
This conflict may cause an error at runtime due to partial compilation being enabled and therefore
compatibility with PyTorch's data type convention is required.
If you do indeed see errors at runtime either:
- Remove the dtype spec for attention_mask.1
- Disable partial compilation by setting require_full_compilation to True
The compiler is going to use the user setting Int32
This conflict may cause an error at runtime due to partial compilation being enabled and therefore
compatibility with PyTorch's data type convention is required.
If you do indeed see errors at ru

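Pass the masked sentences into the compiled model and verify that it predicts the expected words for the masked positions. The inputs are cast to `int32` and moved to the GPU to match the input specification given to the compiler.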

In [11]:
enc_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}
# Positional order follows BertForMaskedLM.forward: input_ids, attention_mask, token_type_ids
output_trt = trt_model(enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])
most_likely_token_ids_trt = [torch.argmax(output_trt[i, pos, :]) for i, pos in enumerate(pos_masks)] 
unmasked_tokens_trt = enc.decode(most_likely_token_ids_trt).split(' ')
unmasked_sentences_trt = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens_trt)]
for sentence in unmasked_sentences_trt:
    print(sentence)

WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size


Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


In [12]:
trt_model_fp16 = torch_tensorrt.compile(traced_mlm_model, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32)],
    enabled_precisions= {torch.half}, # Run with 16-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)

The compiler is going to use the user setting Int32
This conflict may cause an error at runtime due to partial compilation being enabled and therefore
compatibility with PyTorch's data type convention is required.
If you do indeed see errors at runtime either:
- Remove the dtype spec for input_ids
- Disable partial compilation by setting require_full_compilation to True
The compiler is going to use the user setting Int32
This conflict may cause an error at runtime due to partial compilation being enabled and therefore
compatibility with PyTorch's data type convention is required.
If you do indeed see errors at runtime either:
- Remove the dtype spec for attention_mask.1
- Disable partial compilation by setting require_full_compilation to True
The compiler is going to use the user setting Int32
This conflict may cause an error at runtime due to partial compilation being enabled and therefore
compatibility with PyTorch's data type convention is required.
If you do indeed see errors at ru

<a id="5"></a>
## 5. Benchmarking

This function passes the inputs into the model and runs inference `num_loops` times after a 20-iteration warm-up, then returns a list of length `num_loops` containing the time in seconds that each inference run took.

In [13]:
def timeGraph(model, input_tensor1, input_tensor2, input_tensor3, num_loops=50):
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(20):
            features = model(input_tensor1, input_tensor2, input_tensor3)

    torch.cuda.synchronize()

    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(num_loops):
            start_time = timeit.default_timer()
            features = model(input_tensor1, input_tensor2, input_tensor3)
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)
            # print("Iteration {}: {:.6f} s".format(i, end_time - start_time))

    return timings

WARNING: [Torch-TensorRT] - There may be undefined behavior using dynamic shape and aten::size


This function prints the number of input batches the model is able to process each second and summary statistics of the model's latency.

In [14]:
def printStats(graphName, timings, batch_size):
    times = np.array(timings)
    steps = len(times)
    speeds = batch_size / times
    time_mean = np.mean(times)
    time_med = np.median(times)
    time_99th = np.percentile(times, 99)
    time_std = np.std(times, ddof=0)
    speed_mean = np.mean(speeds)
    speed_med = np.median(speeds)

    msg = ("\n%s =================================\n"
            "batch size=%d, num iterations=%d\n"
            "  Median text batches/second: %.1f, mean: %.1f\n"
            "  Median latency: %.6f, mean: %.6f, 99th_p: %.6f, std_dev: %.6f\n"
            ) % (graphName,
                batch_size, steps,
                speed_med, speed_mean,
                time_med, time_mean, time_99th, time_std)
    print(msg)
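One detail of `printStats` worth noting is that it computes a throughput for every iteration and then averages those values, which is not the same as dividing the batch size by the mean latency. The sketch below uses made-up timings (not measurements) to show the difference.

```python
import statistics

batch_size = 4
timings = [0.002, 0.004, 0.004]  # made-up latencies in seconds

# Per-iteration throughput, averaged afterwards (what printStats reports).
speeds = [batch_size / t for t in timings]
mean_of_speeds = statistics.mean(speeds)

# Throughput derived from the mean latency instead.
speed_from_mean_latency = batch_size / statistics.mean(timings)

print(round(mean_of_speeds, 1))
print(round(speed_from_mean_latency, 1))
```

The mean of per-iteration throughputs is pulled upward by fast iterations, so it is never lower than the throughput derived from the mean latency. The two figures are close only when the latencies vary little, as in the runs in this notebook.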

In [15]:
cudnn.benchmark = True

Benchmark the (scripted) TorchScript model on GPU

In [16]:
timings = timeGraph(mlm_model_ts.cuda(), enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 574.3, mean: 572.6
  Median latency: 0.006966, mean: 0.006986, 99th_p: 0.007236, std_dev: 0.000073



Benchmark the traced model on GPU

In [17]:
timings = timeGraph(traced_mlm_model.cuda(), enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 930.7, mean: 929.4
  Median latency: 0.004298, mean: 0.004304, 99th_p: 0.004388, std_dev: 0.000023



Benchmark the compiled FP32 model on GPU

In [18]:
timings = timeGraph(trt_model, enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 1249.2, mean: 1240.6
  Median latency: 0.003202, mean: 0.003513, 99th_p: 0.011851, std_dev: 0.002356



Benchmark the compiled FP16 model on GPU

In [19]:
timings = timeGraph(trt_model_fp16, enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 1773.5, mean: 1769.5
  Median latency: 0.002255, mean: 0.002261, 99th_p: 0.002302, std_dev: 0.000024



<a id="6"></a>
## 6. Conclusion

In this notebook, we have walked through the complete process of compiling TorchScript models with Torch-TensorRT for masked language modeling with Hugging Face's `bert-base-uncased` transformer and testing the performance impact of the optimization. With Torch-TensorRT on an NVIDIA A100 GPU, we observe the speedups indicated below. These numbers will vary from GPU to GPU (and from implementation to implementation, depending on the ops used), and we encourage you to try the latest generation of data center GPUs for maximum acceleration.

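Speedups relative to the scripted model, computed from the median throughput reported above:

- Scripted (GPU): 1.00x
- Traced (GPU): 1.62x
- Torch-TensorRT (FP32): 2.18x
- Torch-TensorRT (FP16): 3.09x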
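The speedups can be recomputed from the median throughputs printed in the benchmark cells. The sketch below hard-codes those printed values, so it reflects only the single run shown in this notebook; ratios from a different run or GPU can differ by a few hundredths or more.

```python
# Median text batches/second, copied from the benchmark outputs above.
median_throughput = {
    "Scripted (GPU)": 574.3,
    "Traced (GPU)": 930.7,
    "Torch-TensorRT (FP32)": 1249.2,
    "Torch-TensorRT (FP16)": 1773.5,
}

baseline = median_throughput["Scripted (GPU)"]
speedups = {name: value / baseline for name, value in median_throughput.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```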

### What's next
Now it's time to try Torch-TensorRT on your own model. If you run into any issues, you can file them at https://github.com/NVIDIA/Torch-TensorRT. Your involvement will help future development of Torch-TensorRT.