## NOTEBOOK: Profile a BERT-Tiny Model

This tutorial validates, compiles, and profiles a Bert-Tiny model for inference on Envise using the Idiom Software Stack.

The model used in this tutorial is mrm8488/bert-tiny-finetuned-squadv2. You can find more information about this model on Hugging Face.

Run this Jupyter notebook in an environment that has a GPU instance.

The model will traverse through the following stages in the developer flow:

**Export model to ONNX**
    
    The original model is exported to ONNX before validating operator coverage. 

**Validate the model**
    
    The operator coverage tool is invoked at this stage and checks for supported and unsupported ONNX operators in the model. 

**Compile the model** 
    
    The compile() Idiom API is invoked at this stage and the ONNX model is compiled for Envise.

**Profile the model** 
    
    The profile() Idiom API is invoked at this stage and the model is executed at runtime in an Envise-simulated environment to collect performance metrics.

**SYSTEM COMPONENT MINIMUM REQUIREMENTS**

* CPU: Any x86-64 architecture with 4 cores
* RAM: 64 GB memory
* GPU: One NVIDIA 2080

#### Install Dependencies
This step takes under one minute.

In [1]:
!pip install -r requirements.txt



#### Set up Imports 

In [2]:
# Standard imports
import os
import sys
import argparse

import numpy as np
from pathlib import Path
from typing import Mapping
from collections import OrderedDict

# HuggingFace imports
import datasets
from transformers.onnx.convert import export
from transformers.onnx.config import OnnxConfig
from transformers import BertForQuestionAnswering, BertTokenizer

# Lightmatter imports
import idiom



#### Define Inputs and Outputs

In [3]:
# OnnxConfig is an abstract class, so we need a concrete subclass that
# names each input/output tensor and its dynamic dimensions. These names
# are emitted into the ONNX file.
class BertOnnxConfig(OnnxConfig):
    def __init__(self, config, task):
        super().__init__(config,task)

    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict(
            {
                "input_ids":      {0: "batch", 1: "sequence"},
                "attention_mask": {0: "batch", 1: "sequence"},
                "token_type_ids": {0: "batch", 1: "sequence"}
            }
        )

    @property
    def outputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict(
            {
                "start_logits": {0: "batch", 1: "sequence"},
                "end_logits":   {0: "batch", 1: "sequence"}
            }
        )

#### Define a Function to Encode Inputs

In [4]:
def encode_batch(tokenizer, questions, contexts, seq_len):
    '''
    Calls tokenizer.encode_plus() for each given question+context pair. All
    samples are encoded to length <seq_len>; shorter inputs are zero-padded
    and longer inputs are truncated.

    Params:
        tokenizer:             Tokenizer to use for encoding
        questions (list[str]): Set of questions
        contexts  (list[str]): Set of contexts (length must match questions)
        seq_len (int):         Fixed length of encoded samples

    Returns: Dictionary of [batch_size x seq_len] tensors (np.array) for
        token IDs, segment IDs, and attention masks.
    '''

    input_ids = []
    tkn_types = []
    attn_mask = []

    for q, c in zip(questions, contexts):
        # Encode one question+context pair to a fixed-length [1 x seq_len] sample
        inputs = tokenizer.encode_plus(
            q, c,
            return_tensors='np',
            truncation=True,
            padding='max_length',
            max_length=seq_len,
        )
        input_ids.append(inputs['input_ids'])
        tkn_types.append(inputs['token_type_ids'])
        attn_mask.append(inputs['attention_mask'])

    input_ids = np.vstack(input_ids)
    tkn_types = np.vstack(tkn_types)
    attn_mask = np.vstack(attn_mask)

    return {
        'input_ids' : input_ids,
        'token_type_ids' : tkn_types,
        'attention_mask' : attn_mask
    }
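
The padding and truncation behavior described in the docstring can be illustrated without a tokenizer. The following is a minimal sketch, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made-up values, and unlike a real BERT tokenizer it does not preserve the trailing separator token when truncating.

```python
import numpy as np

def pad_or_truncate(token_ids, seq_len, pad_id=0):
    """Return a [1 x seq_len] array: long inputs are cut, short ones zero-padded."""
    ids = list(token_ids)[:seq_len]               # truncate to at most seq_len
    ids = ids + [pad_id] * (seq_len - len(ids))   # pad up to exactly seq_len
    return np.array([ids], dtype=np.int64)

# Hypothetical token IDs, one short sample and one long sample
short_sample = pad_or_truncate([101, 2054, 102], seq_len=5)
long_sample = pad_or_truncate([101, 2054, 2003, 1037, 3231, 102], seq_len=5)

# Stacking the per-sample rows gives a [batch_size x seq_len] tensor,
# which is the layout encode_batch() returns for each key.
batch = np.vstack([short_sample, long_sample])
print(batch.shape)
```

Because every sample has the same length, `np.vstack` can stack them into one rectangular array per input name.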


#### Initialize the Profiling Parameters

In [5]:
num_batches = 1
batch_size = 2
sequence_length = 384

#### Download Model Parameters

In [6]:
compile_dir = 'compiled_tiny_bert'
onnx_file = compile_dir + '/model.onnx'

os.makedirs(compile_dir,exist_ok=True)


hf_model_name = 'mrm8488/bert-tiny-finetuned-squadv2'

print('Downloading model parameters...')
model     = BertForQuestionAnswering.from_pretrained(hf_model_name).eval()
tokenizer = BertTokenizer.from_pretrained(hf_model_name)


Downloading model parameters...


#### Get Dataset

In [7]:
# Download dataset from HuggingFace hub
print('Downloading SQUADv2 dataset...')
squad = datasets.load_dataset('squad_v2', split='validation')

Reusing dataset squad_v2 (/home/auro/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)


Downloading SQUADv2 dataset...


#### Encode Inputs

In [8]:
# Encode plain-text paragraphs & questions into token IDs, segment IDs, and attention masks
batches = []
print('Encoding inputs...')

samples = list(squad)  # materialize the dataset once, not on every iteration
for i in range(num_batches):
    batch = samples[i*batch_size:(i+1)*batch_size]
    questions = [q['question'] for q in batch]
    contexts  = [q['context']  for q in batch]
    encoded_inputs = encode_batch(tokenizer,questions,contexts,sequence_length)
    batches.append(encoded_inputs)

Encoding inputs...
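
Before compiling and profiling, it can help to confirm that every encoded batch has the expected layout. The sketch below is an optional sanity check under stated assumptions: `check_batches` is a hypothetical helper (not part of Idiom), and the demo data is a zero-filled stand-in with the same keys and shapes that `encode_batch` produces.

```python
import numpy as np

EXPECTED_KEYS = ("input_ids", "token_type_ids", "attention_mask")

def check_batches(batches, batch_size, seq_len):
    """Raise ValueError if a batch lacks a tensor or has the wrong shape.

    Returns the number of batches checked.
    """
    expected = (batch_size, seq_len)
    for i, batch in enumerate(batches):
        for key in EXPECTED_KEYS:
            if key not in batch:
                raise ValueError(f"batch {i}: missing '{key}'")
            shape = np.asarray(batch[key]).shape
            if shape != expected:
                raise ValueError(
                    f"batch {i}: '{key}' has shape {shape}, expected {expected}"
                )
    return len(batches)

# Stand-in data with the same layout as the output of encode_batch()
demo_batches = [{k: np.zeros((2, 384), dtype=np.int64) for k in EXPECTED_KEYS}]
checked = check_batches(demo_batches, batch_size=2, seq_len=384)
print(f"{checked} batch(es) OK")
```

In the notebook, the same call would be `check_batches(batches, batch_size, sequence_length)`.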


#### Export the Model to ONNX

In [9]:
print('Exporting model to ONNX...')
config = BertOnnxConfig(model.config, task='question-answering')
export(tokenizer,model,config,opset=12,output=Path(onnx_file))

Exporting model to ONNX...




(['input_ids', 'attention_mask', 'token_type_ids'],
 ['start_logits', 'end_logits'])

#### Validate Model

The model must be validated for operator coverage. In this step the ONNX model is scanned, and the output lists which of its operations the compiler supports and which it does not.

The ONNX file path that is being validated is: `compiled_tiny_bert/model.onnx`

The ``idiom.cc.onnx.check_op_cov`` API command invokes the Operator Coverage functionality. Here, it accepts two arguments: the path to an ONNX model, and a .json file that defines the ONNX inputs.


In [10]:
from idiom.cc.onnx import check_op_cov
check_op_cov('compiled_tiny_bert/model.onnx', onnx_define_inputs="bert-inputs.json")

2022-08-18 19:05:17,623 - check-op-cov - INFO - check-op-cov v0.5.0
Date and time: August 18, 2022 19:05:17
Source model path: /idiom-eap/examples/00-getting-started/tutorials/bert-tiny/perf/compiled_tiny_bert/model.onnx
2022-08-18 19:05:17,624 - check-op-cov - INFO - Output files will be saved in /idiom-eap/examples/00-getting-started/tutorials/bert-tiny/perf/compiled_tiny_bert/model_opcov
2022-08-18 19:05:17,625 - check-op-cov - INFO - Running operator coverage tool...
2022-08-18 19:05:17,781 - check-op-cov - INFO - Finished running operator coverage tool.
2022-08-18 19:05:17,790 - check-op-cov - INFO - General messages from the compiler:
ONNX opset version 12
Setting parameter 'batch' to 1 from input declaration.
Setting parameter 'sequence' to 384 from input declaration.
ONNX IR version 7
ONNX producer "pytorch" version 1.10
ONNX model version 0

2022-08-18 19:05:17,791 - check-op-cov - INFO - 224/224 operators from 21 op types passed. All ops are supported!
2022-08-18 19:05:17,792
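
If the coverage result needs to gate later steps in a script, the summary line can be parsed from captured log text. This is a minimal sketch that assumes the summary keeps the wording shown in the output above (`check-op-cov v0.5.0`); `parse_op_coverage` is a hypothetical helper, and how the log text is captured is left to the caller.

```python
import re

# Pattern based on the summary line printed in the log above (an assumption
# about its format; it may differ in other versions of the tool).
SUMMARY_RE = re.compile(r"(\d+)/(\d+) operators from (\d+) op types passed")

def parse_op_coverage(log_text):
    """Extract operator counts from check-op-cov log text, or return None."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        return None
    passed, total, op_types = (int(g) for g in match.groups())
    return {
        "passed": passed,
        "total": total,
        "op_types": op_types,
        "all_supported": passed == total,
    }

log_line = (
    "2022-08-18 19:05:17,791 - check-op-cov - INFO - "
    "224/224 operators from 21 op types passed. All ops are supported!"
)
summary = parse_op_coverage(log_line)
print(summary)
```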

#### Compile

The ``idiom.cc.onnx.compile`` API command invokes the Idiom Compiler, which compiles an ONNX model for Envise. Its two main arguments are listed below; the call in this notebook also passes the batch size and a list of compiler flags.

* **output_directory** 

    Directory where the output files will get stored after compilation.

* **onnx_file_path**
    
    Path to the ONNX model.

In [11]:
compile_flags = [
    f'--onnx-declare-input=input_ids[{batch_size},{sequence_length}]'
]

# Import the submodule rather than the function, so Python's built-in compile() is not shadowed
import idiom.cc.onnx
print('Starting compiling...')
idiom.cc.onnx.compile(compile_dir, onnx_file, batch_size, compile_flags)
print('Done compiling')


Starting compiling...
Done compiling
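
The compile flag in the cell above is a plain string built from an input name and its shape. The sketch below shows one way to generate such strings from a dictionary of shapes. `declare_input_flags` is a hypothetical helper, and it only reproduces the flag format used in this notebook; whether the compiler needs or accepts declarations for the other inputs is an assumption to verify against the Idiom documentation.

```python
def declare_input_flags(input_shapes):
    """Build '--onnx-declare-input=name[d0,d1,...]' strings from a shape dict."""
    flags = []
    for name, dims in input_shapes.items():
        dim_list = ",".join(str(d) for d in dims)
        flags.append(f"--onnx-declare-input={name}[{dim_list}]")
    return flags

# Same values as batch_size and sequence_length in this notebook
compile_flags_sketch = declare_input_flags({"input_ids": (2, 384)})
print(compile_flags_sketch)
```

For the single `input_ids` entry, the result matches the `compile_flags` list defined in the cell above.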


#### Profile Model

The ``idiom.runtime.profile`` API command invokes the profiler. It profiles the execution of the model at runtime and reports performance metrics for Envise, such as Inferences Per Second (IPS) and latency.

It accepts three arguments:

* **Compiled Model Directory**

    The directory where the compilation output resides.

* **Input data**

    A sequence of dictionaries containing model inputs. 

* **Batch size**

    The number of samples within a batch. This value is used to compute performance metrics.
    
    Note that this tutorial uses a batch size of **2**, as set in `batch_size` above and shown in the performance report below.

In [12]:
import idiom.runtime
print('Profiling inferencing...')
idiom.runtime.profile(compile_dir, batches, detailed_report=True)

Profiling inferencing...
Profiling compiled_tiny_bert
    Loading model
    Profiling model execution
        Running batch 1 of 1

Performance Report

Source model path: compiled_tiny_bert
Batch size: 2
Number of Envises: 0.5

+------------------------------------+-------------------------+----------------------+
|         Measurement Scope          |   Inferences per Second |   Batch Latency (ms) |
+====================================+=========================+======================+
|  System Performance (CPU Compute,  |                      32 |                63.42 |
| Envise Compute, and Data Transfer) |                         |                      |
+------------------------------------+-------------------------+----------------------+
|  Envise Compute and Data Transfer  |                    5136 |                 0.39 |
+------------------------------------+-------------------------+----------------------+
|        Only Envise Compute         |                    7543 |                 0.27 |
+------------------------------------+-------------------------+----------------------+
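
The two columns of the report appear to be related by a simple formula: inferences per second is approximately the batch size divided by the batch latency in seconds. The sketch below checks this against the first row. It is an interpretation of the printed numbers, not a description of how the profiler computes them, and `inferences_per_second` is a hypothetical helper.

```python
def inferences_per_second(batch_size, batch_latency_ms):
    """Throughput implied by a batch size and a per-batch latency in milliseconds."""
    return batch_size / (batch_latency_ms / 1000.0)

# System Performance row: batch size 2, batch latency 63.42 ms
system_ips = inferences_per_second(2, 63.42)
print(round(system_ips))  # 32, matching the report
```

The other two rows agree with the same formula only approximately, because the displayed latencies (0.39 ms and 0.27 ms) are rounded to two decimal places.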

#### Conclusion

This tutorial showed how to validate, compile, and profile a ``Bert-Tiny`` model, and how to measure its performance metrics in an Envise-simulated environment.