## Exporting a BERT/SQuAD model to ONNX
You can export any Fireball model to ONNX using the [exportToOnnx](https://interdigitalinc.github.io/Fireball/html/source/model.html#fireball.model.Model.exportToOnnx) function. This notebook shows how to use this function to create an ONNX model. It assumes that a trained BERT/SQuAD model already exists in the ```Models``` directory. Please refer to the notebook [Question Answering (BERT/SQuAD)](BertSquad.ipynb) for more information about training and using a BERT/SQuAD model.

Fireball can also export models with a reduced number of parameters, pruned models, and quantized models. Please refer to the following notebooks for more information:

- [Reducing number of parameters of BERT/SQuAD Model](BertSquad-Reduce.ipynb)
- [Pruning BERT/SQuAD Model](BertSquad-Prune.ipynb)
- [Quantizing BERT/SQuAD Model](BertSquad-Quantize.ipynb)


Note: Fireball uses the [onnx](https://github.com/onnx/onnx) Python package to export models to ONNX. We also use [onnxruntime](https://onnxruntime.ai) here to run and evaluate the ONNX models.

## Load a pretrained model

In [1]:
from fireball import Model

orgFileName = "Models/BertSquadRRPRQR.fbm"  # Reduced - Retrained - Pruned - Retrained - Quantized - Retrained

model = Model.makeFromFile(orgFileName, gpus='0')
model.printLayersInfo()
model.initSession()


Reading from "Models/BertSquadRRPRQR.fbm" ... Done.
Creating the fireball model "Bert-SQuAD" ... Done.

Scope            InShape       Comments                 OutShape      Activ.   Post Act.        # of Params
---------------  ------------  -----------------------  ------------  -------  ---------------  -----------
IN_EMB           ≤512 2                                 ≤512 768      None                      15,974,954 
S1_L1_LN         ≤512 768                               ≤512 768      None     DO:0.1           1,536      
S2_L1_BERT       ≤512 768      768/3072, 12 heads       ≤512 768      GELU                      2,838,763  
S2_L2_BERT       ≤512 768      768/3072, 12 heads       ≤512 768      GELU                      2,899,141  
S2_L3_BERT       ≤512 768      768/3072, 12 heads       ≤512 768      GELU                      2,910,253  
S2_L4_BERT       ≤512 768      768/3072, 12 heads       ≤512 768      GELU                      3,037,189  
S2_L5_BERT       ≤512 768      

## Export the model and check the exported ONNX model

In [2]:
onnxFileName = orgFileName.replace(".fbm",".onnx")

doc = ("This is the question answering model based on BERTbase and fine-tuned on SQuAD dataset. " + 
       "The inputs are a list of token IDs and a list of token types based on word-piece vocabulary embedding " +
       "scheme. The token IDs list must start with a [CLS] and end with an [SEP] code. The question tokens and " +
       "context tokens must also be separated by another [SEP] code. If the inputs are fed to the model in " +
       "batches of more than one, they must be padded with the [PAD] code so that they all have the same " +
       "length. The token types input must have 0's for question tokens and 1's for context tokens.")
model.exportToOnnx(onnxFileName, runQuantized=True, modelDocStr=doc)

# Check the exported model. This throws exceptions if something is wrong with the exported model.
import onnx
from onnx import shape_inference

onnxModel = onnx.load(onnxFileName)
onnx.checker.check_model(onnxModel)


Exporting to ONNX model "Models/BertSquadRRPRQR.onnx" ... 
    Processed all 16 layers.                                     
    Saving to "Models/BertSquadRRPRQR.onnx" ... Done.
Done (63.53 Sec.)
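
The input layout described in the documentation string above can be sketched in plain Python. This is only an illustration: the helper names `buildInputs` and `padBatch` are hypothetical and not part of Fireball, and the sketch assumes that `[PAD]` maps to ID 0 (as in the standard BERT word-piece vocabulary) and that padded positions of the token types input are also filled with 0.

```python
def buildInputs(questionTokens, contextTokens):
    """Return (tokens, tokenTypes) for one question/context pair."""
    tokens = ["[CLS]"] + questionTokens + ["[SEP]"] + contextTokens + ["[SEP]"]
    # [CLS], the question, and the first [SEP] get type 0; the rest get type 1.
    tokenTypes = [0] * (len(questionTokens) + 2) + [1] * (len(contextTokens) + 1)
    return tokens, tokenTypes

def padBatch(sequences, padValue=0):
    """Right-pad every list in `sequences` to the length of the longest one.

    padValue=0 is an assumption: the ID of [PAD] in the standard BERT vocabulary.
    """
    maxLen = max(len(seq) for seq in sequences)
    return [seq + [padValue] * (maxLen - len(seq)) for seq in sequences]

tokens, tokenTypes = buildInputs(["when", "?"], ["founded", "in", "1972"])
print(tokens)
print(tokenTypes)
print(padBatch([[101, 7, 102], [101, 102]]))   # made-up IDs, for illustration only
```

With a real tokenizer, the token lists would first be converted to IDs, and both the IDs and the token types of every sequence in the batch would be padded to the same length.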


## Using netron to visualize the exported model
We can now visualize the model's network structure using the [netron](https://github.com/lutzroeder/netron) package.

In [3]:
import netron
import platform

if platform.system() == 'Darwin':      # Running on macOS
    netron.start(onnxFileName)
else:
    import socket
    hostIp = socket.gethostbyname(socket.gethostname())
    netron.start(onnxFileName, address=(hostIp,8084))

Serving 'Models/BertSquadRRPRQR.onnx' at http://10.1.16.58:8084


## Running inference on the exported model
To verify the exported model, we can now run inference on it. Here we have a "context", which is a paragraph about InterDigital copied from Wikipedia, and three questions related to that context. We use our exported ONNX model to answer the questions.

**Note:** We could use the "Tokenizer" included in Fireball, but to show that the following code is independent of Fireball, we use Google's original tokenizer from [here](https://github.com/google-research/bert/blob/master/tokenization.py).

In [4]:
context = r"""
InterDigital is a technology research and development company that provides wireless and video technologies for 
mobile devices, networks, and services worldwide. Founded in 1972, InterDigital is listed on NASDAQ and is 
included in the S&P SmallCap 600. InterDigital had 2020 revenue of $359 million and a portfolio of about 
32,000 U.S. and foreign issued patents and patent applications.
"""

print(context)
questions = [
    "When was InterDigital established?",
    "How much was InterDigital's revenue in 2020?",
    "What does InterDigital provide?",
]

import tokenization
import os
tokenizer = tokenization.FullTokenizer(os.path.expanduser("~")+"/data/SQuAD/vocab.txt")

import numpy as np
import onnxruntime as ort

# Create a CPU inference session from the ONNX model loaded and checked above.
options = ort.SessionOptions()
options.intra_op_num_threads = 4
session = ort.InferenceSession(onnxModel.SerializeToString(), options, providers=['CPUExecutionProvider'])

contextTokens = tokenizer.tokenize(context)
for i, question in enumerate(questions):
    questionTokens = tokenizer.tokenize(question)
    allTokens = ["[CLS]"] + questionTokens + ["[SEP]"] + contextTokens + ["[SEP]"]
    tokIds = tokenizer.convert_tokens_to_ids(allTokens)
    tokTypes = [0]*(len(questionTokens)+2) + [1]*(len(contextTokens)+1)
    
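    # Run the model on a batch of one sequence.
    startLogits, endLogits = session.run(['StartLogits','EndLogits'],{'TokIds':[tokIds], 'TokTypes':[tokTypes]})
    startTok, endTok = np.argmax(startLogits), np.argmax(endLogits)
    # Shift from positions in the full input to positions in contextTokens
    # by skipping [CLS], the question tokens, and the first [SEP].
    startTok -= len(questionTokens) + 2
    endTok -= len(questionTokens) + 2
    # Join the answer tokens and merge word pieces marked with "##".
    answer = ' '.join(contextTokens[int(startTok):int(endTok+1)]).replace(" ##","")
    print("\nQ%d: %s\n    %s"%(i+1, question, answer))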


InterDigital is a technology research and development company that provides wireless and video technologies for 
mobile devices, networks, and services worldwide. Founded in 1972, InterDigital is listed on NASDAQ and is 
included in the S&P SmallCap 600. InterDigital had 2020 revenue of $359 million and a portfolio of about 
32,000 U.S. and foreign issued patents and patent applications.


Q1: When was InterDigital established?
    1972

Q2: How much was InterDigital's revenue in 2020?
    $ 359 million

Q3: What does InterDigital provide?
    wireless and video technologies for mobile devices , networks , and services worldwide
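
The inference loop above takes the argmax of the start and end logits independently, which works for these questions but can in principle return an end position that precedes the start. As a sketch of one common alternative (not what this notebook uses), the hypothetical helper `bestSpan` below picks the highest-scoring pair with start ≤ end and a bounded length, and `joinWordPieces` repeats the notebook's `##` merging step. The tokens and logits are made-up values for illustration only.

```python
def bestSpan(startLogits, endLogits, maxLen=30):
    """Return (start, end) maximizing startLogits[start] + endLogits[end]
    subject to start <= end < start + maxLen."""
    best, bestScore = (0, 0), float("-inf")
    for s, sLogit in enumerate(startLogits):
        for e in range(s, min(s + maxLen, len(endLogits))):
            score = sLogit + endLogits[e]
            if score > bestScore:
                best, bestScore = (s, e), score
    return best

def joinWordPieces(tokens):
    """Merge word pieces marked with a leading '##' into whole words."""
    return " ".join(tokens).replace(" ##", "")

# Made-up example: the independent argmax would give start=4, end=2 (invalid).
contextTokens = ["founded", "in", "1972", ",", "inter", "##di", "##git", "##al"]
startLogits = [0.1, 0.2, 0.3, 0.0, 2.5, 0.1, 0.0, 0.2]
endLogits = [0.0, 0.1, 2.0, 0.1, 0.2, 0.1, 0.3, 1.5]

start, end = bestSpan(startLogits, endLogits)
print(start, end)
print(joinWordPieces(contextTokens[start:end + 1]))
```

The double loop is quadratic in the sequence length, bounded by `maxLen`; for sequences of at most 512 tokens this is usually cheap compared to running the model itself.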


## Also look at

[Exporting BERT/SQuAD Model to CoreML](BertSquad-CoreML.ipynb)

[Exporting BERT/SQuAD Model to TensorFlow](BertSquad-TF.ipynb)

---

[Fireball Playgrounds](../Contents.ipynb)

[Question Answering (BERT/SQuAD)](BertSquad.ipynb)

[Reducing number of parameters of BERT/SQuAD Model](BertSquad-Reduce.ipynb)

[Pruning BERT/SQuAD Model](BertSquad-Prune.ipynb)

[Quantizing BERT/SQuAD Model](BertSquad-Quantize.ipynb)
