# Quantizing the BERT/SQuAD Model
This notebook shows how to quantize a pre-trained Fireball model using Codebook Quantization. It assumes 
that a trained ```BERT/SQuAD``` model already exists in the ```Models``` directory. Please refer to the notebook
[Question Answering (BERT/SQuAD)](BertSquad.ipynb) for more info about training and using a BERT/SQuAD model.

To quantize a low-rank model, first use [this notebook](BertSquad-Reduce.ipynb)
to reduce the number of parameters in ```BERT/SQuAD```.

Model quantization reduces the size of the model by using fewer bits for each floating 
point parameter. Fireball uses a codebook quantization method based on the K-Means clustering algorithm.
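
To make the idea concrete, here is a minimal NumPy sketch of codebook quantization with one-dimensional K-Means (Lloyd's algorithm). It is an illustration only, not Fireball's implementation: the function name `kmeans_codebook`, the evenly spaced initialization, and the fixed iteration count are all assumptions.

```python
import numpy as np

def kmeans_codebook(values, bits, iters=20):
    """Cluster a flat array of floats into 2**bits centroids (Lloyd's algorithm)."""
    flat = np.asarray(values, dtype=np.float32).ravel()
    k = 2 ** bits
    # Assumption: start with centroids spread evenly over the value range.
    codebook = np.linspace(flat.min(), flat.max(), k, dtype=np.float32)
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = flat[indices == c]
            if members.size:
                codebook[c] = members.mean()
    # Final assignment against the updated centroids.
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, indices.astype(np.uint16)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=4096).astype(np.float32)  # synthetic tensor
codebook, indices = kmeans_codebook(weights, bits=4)
restored = codebook[indices]                 # dequantized tensor
mse = float(np.mean((weights - restored) ** 2))
print(f"codebook size: {codebook.size}, MSE: {mse:.2e}")
```

After quantization, only the small codebook and one short index per parameter need to be stored, which is where the size reduction comes from.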

[quantizeModel](https://interdigitalinc.github.io/Fireball/html/source/model.html#fireball.model.Model.quantizeModel) is a class method that takes the input and output file names for the 
quantization process, along with quantization parameters such as ```minBits```, ```maxBits```, 
and ```mseUb```.

Fireball can create models with 2-bit to 12-bit quantization (codebook sizes 4 to 4096). For the quantized
model to be compatible with [CoreML](https://developer.apple.com/documentation/coreml), the codebook size must be a power of 2 and less than or equal to 256, and only "weight" parameters may be quantized (not biases).
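
The three CoreML conditions can be written as a small check. Both helper names below are hypothetical and are not part of the Fireball API; the sketch only restates the rules from the paragraph above.

```python
def codebook_size(bits):
    """Number of codebook entries addressable with `bits` bits per parameter."""
    return 2 ** bits

def is_coreml_compatible(size, weights_only):
    """Check the three stated conditions (hypothetical helper, not a Fireball API)."""
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and size <= 256 and weights_only

for bits in (2, 8, 12):
    size = codebook_size(bits)
    print(bits, size, is_coreml_compatible(size, weights_only=True))
```

By this check, the settings used in the quantization cell of this notebook (```minBits=2```, ```maxBits=8```, ```weightsOnly=True```) stay inside the CoreML-compatible range.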

## Quantizing a pretrained model
The code in the following cell quantizes the model specified by ```orgFileName``` and creates a new quantized model.

For each parameter tensor of the model, we try 2 to 8 quantization bits and keep the best quantization that satisfies the specified MSE upper bound (```mseUb```).

For a smaller model, increase ```mseUb```; for better model accuracy (at the cost of a larger model),
use a smaller ```mseUb```.
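
The sketch below shows one way such a per-tensor search could work: try each bit width in order and stop at the first one whose reconstruction MSE is within the bound. It is an assumption about the selection logic, not Fireball's code. To stay short it uses an evenly spaced codebook as a stand-in for K-Means, and the fallback to the largest bit width when nothing meets the bound is also assumed.

```python
import numpy as np

def quantize_uniform(flat, bits):
    """Stand-in quantizer: evenly spaced codebook over the value range (not K-Means)."""
    codebook = np.linspace(flat.min(), flat.max(), 2 ** bits)
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, indices

def smallest_bits_within_bound(flat, min_bits, max_bits, mse_ub):
    """Return (bits, mse) for the fewest bits whose MSE is at or below mse_ub.

    Assumption: if no width meets the bound, the result for max_bits is returned.
    """
    for bits in range(min_bits, max_bits + 1):
        codebook, indices = quantize_uniform(flat, bits)
        mse = float(np.mean((flat - codebook[indices]) ** 2))
        if mse <= mse_ub:
            break
    return bits, mse

rng = np.random.default_rng(0)
tensor = rng.normal(0.0, 0.05, size=2048)   # synthetic tensor
loose = smallest_bits_within_bound(tensor, 2, 8, mse_ub=1e-3)
tight = smallest_bits_within_bound(tensor, 2, 8, mse_ub=1e-5)
print("loose bound ->", loose[0], "bits; tight bound ->", tight[0], "bits")
```

A looser bound can never require more bits than a tighter one, which is the trade-off between model size and accuracy described above.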



In [1]:
from fireball import Model

orgFileName = "Models/BertSquadRRPR.fbm"    # Reduced - Retrained - Pruned - Retrained
quantizedFileName = orgFileName.replace('.fbm', 'Q.fbm')  # Append 'Q' to the filename for "Quantized"
qResults = Model.quantizeModel(orgFileName, quantizedFileName,
                               minBits=2, maxBits=8, mseUb=.00001, reuseEmptyClusters=True, 
                               weightsOnly=True, verbose=True)


Reading model parameters from "Models/BertSquadRRPR.fbm" ... Done.
Quantizing 271 tensors using 36 workers ... 
   Quantization Parameters:
        mseUb .............. 1e-05
        pdfFactor .......... 0.1
        reuseEmptyClusters . True
        weightsOnly ........ True
        minBits ............ 2
        maxBits ............ 8
Quantization complete (61.05 Sec.).
Now saving to "Models/BertSquadRRPRQ.fbm" ... Done.

Size of Data: 223,350,332 -> 64,623,504 bytes
Model File Size: 223,370,708 -> 64,648,420 bytes


Compare the data size before and after quantization: the parameter data shrinks from about 223 MB to about 65 MB, roughly 29% of the original size.
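
The reduction can be computed directly from the sizes reported in the output above. The variable names are only for illustration.

```python
# Sizes in bytes, copied from the quantization output above.
data_before, data_after = 223_350_332, 64_623_504
file_before, file_after = 223_370_708, 64_648_420

data_ratio = data_before / data_after
file_ratio = file_before / file_after
print(f"data: {data_ratio:.2f}x smaller ({100 * data_after / data_before:.1f}% of original)")
print(f"file: {file_ratio:.2f}x smaller")
```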

## Evaluate the quantized model
Let's see the impact on model performance.

In [2]:
from fireball.datasets.squad import SquadDSet
gpus = "upto4"

trainDs,testDs = SquadDSet.makeDatasets("Train,Test", batchSize=128, version=1 )
model = Model.makeFromFile(quantizedFileName, testDs=testDs, gpus=gpus)   
model.initSession()
results = model.evaluate()


Initializing tokenizer from "/data/SQuAD/vocab.txt" ... Done. (Vocab Size: 30522)

Reading from "Models/BertSquadRRPRQ.fbm" ... Done.
Creating the fireball model "Bert-SQuAD" ... Done.
  Processed 10833 Samples. (Time: 77.99 Sec.)                              

    Exact Match: 77.909
    f1:          86.268



## Re-train and evaluate
Fireball can retrain the quantized models by modifying the quantization codebooks. The following cell uses the training dataset to train the quantized model. It then evaluates the re-trained model and saves it to a new file.
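
As a rough illustration of what "modifying the quantization codebooks" can mean, the sketch below keeps the index of every weight fixed and updates only the shared codebook entries. Each entry receives the sum of the gradients of the weights that point at it. This is an assumed update rule with made-up numbers, shown with plain gradient descent; the source does not describe Fireball's exact procedure.

```python
import numpy as np

# A tiny quantized tensor: 6 weights sharing a 3-entry codebook.
codebook = np.array([-0.1, 0.0, 0.2])
indices = np.array([0, 2, 1, 2, 0, 2])
weights = codebook[indices]          # dequantized weights used in the forward pass

# Made-up dLoss/dWeight values standing in for backpropagation results.
weight_grads = np.array([0.5, -1.0, 0.3, 2.0, 0.1, -0.4])

# Each codebook entry accumulates the gradients of the weights assigned to it.
codebook_grads = np.zeros_like(codebook)
np.add.at(codebook_grads, indices, weight_grads)

learning_rate = 0.01                 # illustrative value only
new_codebook = codebook - learning_rate * codebook_grads
print(codebook_grads, new_codebook)
```

Because the indices never change, the retrained model keeps the same bit width per tensor and therefore roughly the same file size.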


In [3]:
model = Model.makeFromFile(quantizedFileName, trainDs=trainDs, testDs=testDs,
                           batchSize=32, numEpochs=2,
                           learningRate=5e-9, optimizer='Adam',
                           saveBest=False,
                           gpus=gpus)
model.printNetConfig()
model.initSession()
model.train()
results = model.evaluate()

retrainedFileName = quantizedFileName.replace('.fbm', 'R.fbm')  # Append 'R' to the filename for "Re-trained"
model.save(retrainedFileName)


Reading from "Models/BertSquadRRPRQ.fbm" ... Done.
Creating the fireball model "Bert-SQuAD" ... Done.

Network configuration:
  Input:                     A tuple of TokenIds and TokenTypes.
  Output:                    2 logit vectors (with length ≤ 512) for start and end indexes of the answer.
  Network Layers:            16
  Tower Devices:             GPU0, GPU1, GPU2, GPU3
  Total Network Parameters:  53,047,718
  Total Parameter Tensors:   271
  Trainable Tensors:         271
  Training Samples:          87,844
  Test Samples:              10,833
  Num Epochs:                2
  Batch Size:                32
  L2 Reg. Factor:            0.0001
  Global Drop Rate:          0   
  Learning Rate:             0.000000005  
  Optimizer:                 Adam

+--------+---------+---------------+-----------+-------------------+
| Epoch  | Batch   | Learning Rate | Loss      | Valid/Test Acc.   |
+--------+---------+---------------+-----------+-------------------+
| 1      | 2745    | 0

## Compress the quantized model
To reduce the model file size even more, you can use the [compressModel](https://interdigitalinc.github.io/Fireball/html/source/model.html#fireball.model.Model.compressModel) class method to compress the network parameters using arithmetic coding. This process is lossless and does not affect the model performance.

Please note that while compressing a model makes it smaller, it takes longer to load a compressed model because each model parameter needs to go through the additional step of entropy decoding.

In [5]:
compressedFileName = retrainedFileName.replace('.fbm', '.fbmc')
qResults = Model.compressModel(retrainedFileName, compressedFileName)


Reading model parameters from "Models/BertSquadRRPRQR.fbm" ... Done.
Compressing 271 tensors using 36 workers ... 
Finished compressing model parameters (821.25 Sec.)
Now saving to "Models/BertSquadRRPRQR.fbmc" ... Done.
Model File Size: 64,648,394 -> 38,917,033 bytes
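
In the run above, the file shrinks to about 60% of its quantized size. Entropy coding can do this because codebook indices are usually not used equally often. The sketch below estimates the Shannon entropy of an index stream, which is the lower bound (in bits per index) that an ideal entropy coder approaches. The index distributions are synthetic and purely hypothetical; no data from the actual model is used.

```python
import numpy as np

def entropy_bits_per_symbol(indices):
    """Shannon entropy (bits/symbol) of the empirical index distribution."""
    counts = np.bincount(indices)
    probs = counts[counts > 0] / indices.size
    return float(-(probs * np.log2(probs)).sum())

rng = np.random.default_rng(0)
# Hypothetical 4-bit indices with skewed usage (a few entries dominate).
skewed = rng.choice(16, size=10_000, p=np.r_[0.4, 0.2, 0.1, np.full(13, 0.3 / 13)])
# Hypothetical 4-bit indices where every entry is equally likely.
uniform = rng.integers(0, 16, size=10_000)

print(f"skewed:  {entropy_bits_per_symbol(skewed):.2f} bits/index (stored as 4)")
print(f"uniform: {entropy_bits_per_symbol(uniform):.2f} bits/index (stored as 4)")
```

The more skewed the index usage, the further the entropy falls below the stored bit width, and the more a lossless coder can save.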


## See also

[Exporting BERT/SQuAD Model to ONNX](BertSquad-ONNX.ipynb)

[Exporting BERT/SQuAD Model to TensorFlow](BertSquad-TF.ipynb)

[Exporting BERT/SQuAD Model to CoreML](BertSquad-CoreML.ipynb)

---

[Fireball Playgrounds](../Contents.ipynb)

[Question Answering (BERT/SQuAD)](BertSquad.ipynb)

[Reducing number of parameters of BERT/SQuAD Model](BertSquad-Reduce.ipynb)

[Pruning BERT/SQuAD Model](BertSquad-Prune.ipynb)
