# Quantizing the BERT/SQuAD Model
This notebook shows how to quantize a pre-trained Fireball model using Codebook Quantization. It assumes 
that a trained ```BERT/SQuAD``` model already exists in the ```Models``` directory. Please refer to the notebook
[Question Answering (BERT/SQuAD)](BertSquad.ipynb) for more info about training and using a BERT/SQuAD model.

If you want to quantize a low-rank model, first use [this notebook](BertSquad-Reduce.ipynb)
to reduce the number of parameters in ```BERT/SQuAD```.

Model quantization reduces the size of the model by using fewer bits for each floating 
point parameter. Fireball uses a codebook quantization method based on the K-Means clustering algorithm.
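
As a rough illustration of the idea, the sketch below clusters the values of one tensor with a plain 1-D K-Means (Lloyd's algorithm) in NumPy and keeps only a small codebook plus one index per value. It is not Fireball's implementation: the function name `quantizeTensor`, the evenly spaced initialization, and the random test tensor are all assumptions made for this example.

```python
import numpy as np

def quantizeTensor(weights, bits, iters=20):
    """Quantize a tensor to a 2**bits-entry codebook using 1-D K-Means."""
    flat = np.asarray(weights, dtype=np.float32).ravel()
    k = 2 ** bits
    # Assumption: start with centroids spread evenly over the value range.
    codebook = np.linspace(flat.min(), flat.max(), k).astype(np.float32)
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its assigned values.
        for c in range(k):
            members = flat[indices == c]
            if members.size:
                codebook[c] = members.mean()
    # Final assignment against the updated codebook.
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, indices.reshape(np.shape(weights))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)  # stand-in tensor

codebook, idx = quantizeTensor(w, bits=4)
wHat = codebook[idx]                       # de-quantize with a table lookup
mse = float(np.mean((w - wHat) ** 2))
print("codebook size:", codebook.size, " mse:", mse)
```

With 4 bits, each value is stored as an index into a 16-entry codebook instead of a 32-bit float, which is where the size reduction comes from.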

```quantizeModel``` is a class method that takes the input and output file names for the 
quantization process, along with quantization parameters such as ```minBits```, ```maxBits```, ```mseUb```, and ```pdfFactor```.

Fireball can create models with 2-bit to 12-bit quantization (codebook sizes 4 to 4096). For the quantized
model to be compatible with ```CoreML```, the codebook size must be a power of 2 no larger than 256, and only "weight" parameters may be quantized (not biases).
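
The relationship between bit width and codebook size, and the compatibility rule stated above, can be sketched in a few lines of plain Python. The helper names `codebookSize` and `isCoreMlCompatible` are hypothetical and are not part of Fireball or of any CoreML tooling; they only encode the constraint as described in this notebook.

```python
def codebookSize(bits):
    """Number of codebook entries addressable with the given bit width."""
    return 1 << bits

def isCoreMlCompatible(size, weightsOnly=True):
    """Apply the rule from the text: power of 2, at most 256, weights only."""
    isPowerOfTwo = size > 0 and (size & (size - 1)) == 0
    return isPowerOfTwo and size <= 256 and weightsOnly

# Bit widths supported by Fireball according to the text (2 to 12).
sizes = {bits: codebookSize(bits) for bits in range(2, 13)}
compatibleBits = [bits for bits, size in sizes.items() if isCoreMlCompatible(size)]

print(sizes)
print("Bit widths meeting the stated rule:", compatibleBits)
```

Since every codebook size here is already a power of 2, the rule reduces to keeping the bit width at 8 or below, which matches the ```maxBits=8``` setting used in this notebook.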

## Quantizing a pretrained model
The code in the following cell quantizes the model specified by ```orgFileName``` and creates a new quantized model.

For each parameter tensor of the model, Fireball tries quantization bit widths from 2 to 8 and keeps the best quantization that satisfies the specified MSE upper bound (```mseUb```).

To get stronger quantization (a smaller model), increase ```mseUb```; to get better performance (a larger model),
use a smaller ```mseUb```.
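
The per-tensor search can be pictured as the loop below. This is a sketch under several assumptions: it uses a simple uniform quantizer as a stand-in for Fireball's K-Means codebook, it treats "best" as the smallest bit width that meets the bound, and it falls back to the largest bit width when no candidate meets the bound. The names `uniformQuantize` and `pickBits` are hypothetical, and Fireball's actual behavior may differ on all of these points.

```python
import numpy as np

def uniformQuantize(flat, bits):
    """Stand-in quantizer: 2**bits evenly spaced levels (not K-Means)."""
    levels = np.linspace(flat.min(), flat.max(), 2 ** bits)
    step = levels[1] - levels[0]
    idx = np.clip(np.round((flat - levels[0]) / step), 0, len(levels) - 1)
    return levels[idx.astype(int)]

def pickBits(weights, mseUb, minBits=2, maxBits=8):
    """Return (bits, mse) for the first bit width whose MSE is within mseUb."""
    flat = np.asarray(weights, dtype=np.float64).ravel()
    mse = float("inf")
    for bits in range(minBits, maxBits + 1):
        mse = float(np.mean((flat - uniformQuantize(flat, bits)) ** 2))
        if mse <= mseUb:
            return bits, mse
    # Assumption: if nothing meets the bound, keep the largest bit width.
    return maxBits, mse

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=10_000)     # stand-in weight tensor

tightBits, tightMse = pickBits(w, mseUb=1e-5)
looseBits, looseMse = pickBits(w, mseUb=1e-3)
print("mseUb=1e-5 ->", tightBits, "bits;  mseUb=1e-3 ->", looseBits, "bits")
```

A looser bound is met by a smaller codebook, so the tensor needs fewer bits per value. A tighter bound pushes the search toward larger codebooks.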



In [2]:
from fireball import Model

# orgFileName = "Models/SSD512.fbm"        # Original model
# orgFileName = "Models/BertSquadP.fbm"       # Pruned
# orgFileName = "Models/BertSquadPR.fbm"      # Pruned - Retrained
# orgFileName = "Models/BertSquadR.fbm"       # Reduced
# orgFileName = "Models/BertSquadRP.fbm"      # Reduced - Pruned
# orgFileName = "Models/BertSquadRPR.fbm"     # Reduced - Pruned - Retrained
# orgFileName = "Models/BertSquadRR.fbm"      # Reduced - Retrained
# orgFileName = "Models/BertSquadRRP.fbm"     # Reduced - Retrained - Pruned
orgFileName = "Models/BertSquadRRPR.fbm"    # Reduced - Retrained - Pruned - Retrained

quantizedFileName = orgFileName.replace('.fbm', 'Q.fbm')  # Append 'Q' to the filename for "Quantized"
qResults = Model.quantizeModel(orgFileName, quantizedFileName,
                               minBits=2, maxBits=8, mseUb=.00001, reuseEmptyClusters=True)


Reading model parameters from "Models/BertSquadRRPR.fbm" ... Done.
Quantizing 272 tensors using 76 workers ... 
   Quantization Parameters:
        mseUb .............. 1e-05
        pdfFactor .......... 0.1
        reuseEmptyClusters . True
        weightsOnly ........ True
        minBits ............ 2
        maxBits ............ 8
Quantization complete (23.41 Sec.).
Now saving to "Models/BertSquadRRPRQ.fbm" ... Done.

Size of Data: 177,506,054 -> 53,384,271 bytes
Model File Size: 177,526,445 -> 53,409,238 bytes


Compare the data size before and after quantization: it drops from 177,506,054 to 53,384,271 bytes, which is roughly 30% of the original size.

## Evaluate the quantized model
Let's see the impact on model performance.

In [3]:
from fireball.datasets.squad import SquadDSet
gpus = "0,1,2,3"

trainDs,testDs = SquadDSet.makeDatasets("Train,Test", batchSize=128, version=1 )
model = Model.makeFromFile(quantizedFileName, testDs=testDs, gpus=gpus)   
model.initSession()
results = model.evaluate()


Initializing tokenizer from "/data/SQuAD/vocab.txt" ... Done. (Vocab Size: 30522)

Reading from "Models/BertSquadRRPRQ.fbm" ... Done.
Creating the fireball model "Bert-SQuAD" ... Done.
  Processed 10833 Samples. (Time: 64.02 Sec.)                              

    Exact Match: 77.692
    f1:          85.868



## Re-train and evaluate
Fireball can retrain the quantized models by modifying the quantization codebooks. The following cell uses the training dataset to train the quantized model.

If the trained model specified by ```quantizedFileName``` is already available in the ```Models``` directory, this cell shows the results of the last training. To force the training to run again, uncomment the line at the beginning of the cell that deletes the existing file. Note that retraining can take up to 2 hours on a 4-GPU machine.
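
The cell as shown here does not include the deletion line. A minimal sketch of such a guard, using only the Python standard library, might look like the following. The helper name `removeIfExists` is hypothetical, and which model file it should target is an assumption you would need to check against your own notebook.

```python
import os
import tempfile

def removeIfExists(path):
    """Delete `path` if it exists; return True when a file was removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False

# Demonstrate on a throwaway file rather than on a real model file.
with tempfile.TemporaryDirectory() as tmpDir:
    demoPath = os.path.join(tmpDir, "demo.fbm")
    open(demoPath, "wb").close()
    removedFirst = removeIfExists(demoPath)   # the file existed
    removedAgain = removeIfExists(demoPath)   # the file is already gone

print(removedFirst, removedAgain)
```

In the notebook, a call such as `removeIfExists(...)` would be placed, commented out, at the top of the training cell and enabled only when a fresh training run is wanted.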

In [4]:
model = Model.makeFromFile(quantizedFileName, trainDs=trainDs, testDs=testDs,
                           batchSize=32, numEpochs=2,
                           learningRate=5e-9, optimizer='Adam',
                           saveBest=False,
                           gpus=gpus)
model.printNetConfig()
model.initSession()
model.train()
results = model.evaluate()

retrainedFileName = quantizedFileName.replace('.fbm', 'R.fbm')  # Append 'R' to the filename for "Re-trained"
model.save(retrainedFileName)


Reading from "Models/BertSquadRRPRQ.fbm" ... Done.
Creating the fireball model "Bert-SQuAD" ... Done.

Network configuration:
  Input:                     A tuple of TokenIds and TokenTypes.
  Output:                    2 logit vectors (with length ≤ 512) for start and end indexes of the answer.
  Network Layers:            16
  Tower Devices:             GPU0, GPU1, GPU2, GPU3
  Total Network Parameters:  41,818,535
  Total Parameter Tensors:   272
  Trainable Tensors:         272
  Training Samples:          87,844
  Test Samples:              10,833
  Num Epochs:                2
  Batch Size:                32
  L2 Reg. Factor:            0.0001
  Global Drop Rate:          0   
  Learning Rate:             0.000000005  
  Optimizer:                 Adam

+--------+---------+---------------+-----------+-------------------+
| Epoch  | Batch   | Learning Rate | Loss      | Valid/Test Acc.   |
+--------+---------+---------------+-----------+-------------------+
| 1      | 2746    | 0

## Compress the quantized model
To reduce the model file size even more, you can use the ``compressModel`` class method to compress the network parameters using arithmetic coding. This process is lossless and does not affect the model performance.

Please note that while compressing a model makes it smaller, it takes longer to load a compressed model because each model parameter needs to go through the additional step of entropy decoding.
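
To see why entropy coding can shrink an already quantized model, note that an arithmetic coder can approach the empirical entropy of the symbols it encodes. The sketch below estimates that entropy for a made-up stream of 4-bit codebook indices and compares it with the fixed 4 bits per index. The index distribution is purely hypothetical, and this is an estimate of the achievable size, not Fireball's coder.

```python
import math
import random
from collections import Counter

def entropyBitsPerSymbol(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

random.seed(0)
# Hypothetical 4-bit indices whose usage is concentrated around the middle
# codebook entries, so the distribution is far from uniform.
indices = [min(15, max(0, round(random.gauss(8, 2)))) for _ in range(10_000)]

fixedBits = 4
entropy = entropyBitsPerSymbol(indices)
estimatedRatio = fixedBits / entropy

print(f"entropy: {entropy:.2f} bits/index, estimated ratio: {estimatedRatio:.2f}x")
```

The more unevenly the codebook entries are used, the lower the entropy and the more the indices can be compressed without any loss.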

In [6]:
compressedFileName = retrainedFileName.replace('.fbm', '.fbmc')
qResults = Model.compressModel(retrainedFileName, compressedFileName)


Reading model parameters from "Models/BertSquadRRPRQR.fbm" ... Done.
Compressing 272 tensors using 76 workers ... 
Finished compressing model parameters (402.73 Sec.)
Now saving to "Models/BertSquadRRPRQR.fbmc" ... Done.
Model File Size: 53,409,212 -> 32,925,854 bytes
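
The file sizes reported in the cell outputs of this notebook can be combined into overall reduction factors. The short calculation below uses only the byte counts printed above; the stage labels are just descriptive names chosen for this example.

```python
# Model file sizes in bytes, copied from the cell outputs in this notebook.
fileSizes = {
    "original (RRPR)": 177_526_445,
    "quantized": 53_409_238,
    "quantized + retrained": 53_409_212,
    "compressed": 32_925_854,
}

baseline = fileSizes["original (RRPR)"]
ratios = {stage: baseline / size for stage, size in fileSizes.items()}

for stage, size in fileSizes.items():
    print(f"{stage:<22} {size:>12,} bytes   {ratios[stage]:5.2f}x vs. original")

# Gain contributed by the compression step alone.
compressionGain = fileSizes["quantized + retrained"] / fileSizes["compressed"]
print(f"compression step alone: {compressionGain:.2f}x")
```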


## Where do I go from here?

[Exporting BERT/SQuAD Model to ONNX](BertSquad-ONNX.ipynb)

[Exporting BERT/SQuAD Model to TensorFlow](BertSquad-TF.ipynb)

[Exporting BERT/SQuAD Model to CoreML](BertSquad-CoreML.ipynb)

---

[Fireball Playgrounds](../Contents.ipynb)

[Question Answering (BERT/SQuAD)](BertSquad.ipynb)

[Reducing number of parameters of BERT/SQuAD Model](BertSquad-Reduce.ipynb)

[Pruning BERT/SQuAD Model](BertSquad-Prune.ipynb)
