# Quantizing RNN Models

In this example, we show how to quantize recurrent models.  
Starting from a pretrained `model.RNNModel`, we convert the built-in PyTorch implementation of LSTM to our own modular implementation.  
The pretrained model was generated with:  
```time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6```  
We replace the LSTM because the inner operations of the PyTorch implementation are not accessible to us, and we want to quantize these operations as well.  
Afterwards, we try different techniques to quantize the whole model.  
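
To make concrete which inner operations are meant, here is a minimal sketch of a single LSTM cell step, written with the standard LSTM equations on scalars. It is an illustration only, not the `DistillerLSTM` code; the function name `lstm_cell_step` and the dict-based weights are assumptions made for this example. The point is that the linear operations, the `sigmoid`/`tanh` activations and the element-wise operations are separate steps, so each can be quantized on its own once it is exposed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x, h_prev, c_prev, w_x, w_h, b):
    """One step of a scalar LSTM cell (hypothetical helper, standard equations).

    w_x, w_h and b are dicts keyed by gate name: 'i', 'f', 'g', 'o'.
    """
    # Linear operations: one pre-activation per gate.
    pre = {gate: w_x[gate] * x + w_h[gate] * h_prev + b[gate] for gate in "ifgo"}

    # Non-linear activations.
    i = sigmoid(pre["i"])    # input gate
    f = sigmoid(pre["f"])    # forget gate
    g = math.tanh(pre["g"])  # candidate cell state
    o = sigmoid(pre["o"])    # output gate

    # Element-wise operations.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# With all-zero weights every gate is 0.5 and the candidate is 0.
zeros = {gate: 0.0 for gate in "ifgo"}
h, c = lstm_cell_step(x=1.0, h_prev=0.0, c_prev=2.0, w_x=zeros, w_h=zeros, b=zeros)
print(h, c)
```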

_NOTE_: We use `tqdm` to display progress bars. Since it is not listed in `requirements.txt`, install it with
`pip install tqdm`.

In [1]:
from model import DistillerRNNModel, RNNModel
from data import Corpus
import torch
from torch import nn
import distiller
from distiller.modules import DistillerLSTM as LSTM
from tqdm import tqdm # for pretty progress bar
import numpy as np

### Preprocess the data:

In [2]:
corpus = Corpus('./data/wikitext-2/')

In [3]:
def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)
device = 'cuda:0'
batch_size = 20
eval_batch_size = 10
train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

### Loading the model and converting to our own implementation.

In [4]:
rnn_model = torch.load('./checkpoint.pth.tar.best')
rnn_model = rnn_model.to(device)
rnn_model

RNNModel(
  (drop): Dropout(p=0.65)
  (encoder): Embedding(33278, 1500)
  (rnn): LSTM(1500, 1500, num_layers=2, dropout=0.65)
  (decoder): Linear(in_features=1500, out_features=33278, bias=True)
)

Here we convert the PyTorch LSTM implementation to our own by calling `LSTM.from_pytorch_impl`:

In [5]:
def manual_model(pytorch_model_: RNNModel):
    nlayers, ninp, nhid, ntoken, tie_weights = \
        pytorch_model_.nlayers, \
        pytorch_model_.ninp, \
        pytorch_model_.nhid, \
        pytorch_model_.ntoken, \
        pytorch_model_.tie_weights

    model = DistillerRNNModel(nlayers=nlayers, ninp=ninp, nhid=nhid, ntoken=ntoken, tie_weights=tie_weights).to(device)
    model.eval()
    model.encoder.weight = nn.Parameter(pytorch_model_.encoder.weight.clone().detach())
    model.decoder.weight = nn.Parameter(pytorch_model_.decoder.weight.clone().detach())
    model.decoder.bias = nn.Parameter(pytorch_model_.decoder.bias.clone().detach())
    model.rnn = LSTM.from_pytorch_impl(pytorch_model_.rnn)

    return model

man_model = manual_model(rnn_model)
torch.save(man_model, 'manual.checkpoint.pth.tar')
man_model

DistillerRNNModel(
  (encoder): Embedding(33278, 1500)
  (rnn): DistillerLSTM(1500, 1500, num_layers=2, dropout=0.65, bidirectional=False)
  (decoder): Linear(in_features=1500, out_features=33278, bias=True)
)

### Batching the data for evaluation:

In [6]:
sequence_len = 35
def get_batch(source, i):
    seq_len = min(sequence_len, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target

hidden = rnn_model.init_hidden(eval_batch_size)
data, targets = get_batch(test_data, 0)

### Check that the conversion has succeeded:

In [7]:
rnn_model.eval()
man_model.eval()
y_t, h_t = rnn_model(data, hidden)
y_p, h_p = man_model(data, hidden)

print("Max error in y: %f" % (y_t-y_p).abs().max().item())

Max error in y: 0.000011


### Defining the evaluation:

In [8]:
criterion = nn.CrossEntropyLoss()
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)
    

def evaluate(model, data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        # The line below was fixed as per: https://github.com/pytorch/examples/issues/214
        for i in tqdm(range(0, data_source.size(0), sequence_len)):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
            hidden = repackage_hidden(hidden)
    return total_loss / len(data_source)

# Quantizing the model:

## Collect activation statistics:

The quantizer uses activation statistics to determine how large the quantization range is. The larger the range, the larger the round-off error after quantization, which leads to a drop in accuracy.  
Our goal is to make the range as small as possible while still containing the vast majority of our data.  
We then divide the range into chunks of equal size, according to the number of bits, and transform the data according to the resulting scale factor.  
Read more on scale factor calculation [in our docs](https://nervanasystems.github.io/distiller/algo_quantization.html).
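
As a rough illustration of the scale factor computation, here is a minimal pure-Python sketch of symmetric and asymmetric linear quantization for a single value. It assumes the convention in which the scale factor multiplies the floating-point value; the helper names are made up for this example and are not part of the Distiller API.

```python
def symmetric_scale(min_val, max_val, num_bits=8):
    """Scale for a symmetric range [-max_abs, max_abs]; the zero point is 0."""
    max_abs = max(abs(min_val), abs(max_val))
    return (2 ** (num_bits - 1) - 1) / max_abs


def asymmetric_params(min_val, max_val, num_bits=8):
    """Scale and zero point that map [min_val, max_val] onto [0, 2^bits - 1]."""
    scale = (2 ** num_bits - 1) / (max_val - min_val)
    zero_point = round(scale * min_val)
    return scale, zero_point


def quantize(x, scale, zero_point=0):
    # Python's round() rounds halves to even; a real implementation may differ.
    return round(scale * x - zero_point)


def dequantize(q, scale, zero_point=0):
    return (q + zero_point) / scale


scale, zero_point = asymmetric_params(-1.0, 3.0)
q = quantize(1.0, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

The round trip through `quantize` and `dequantize` is off by at most one step of size `1 / scale`, which is why a smaller range (larger scale) means a smaller round-off error.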

The class `QuantCalibrationStatsCollector` collects the statistics for defining the range $r = max - min$.  

On each forward pass, the collector records the following for the inputs and outputs of each layer:
- absolute min and max over all batches (stored in `min`, `max`)
- per-batch min and max, averaged over all batches (stored in `avg_min`, `avg_max`)
- mean
- std
- shape of the output tensor  

Any of these values can be used to define the quantization range; for example, we can use the absolute `min` and `max`.
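
The difference between the absolute and the averaged boundaries can be sketched with plain Python lists standing in for per-batch activations. This only illustrates the two definitions listed above, using a hypothetical `collect_min_max` helper and made-up data; it does not reproduce the collector's implementation.

```python
def collect_min_max(batches):
    """Absolute and batch-averaged min/max over a list of batches."""
    batch_mins = [min(batch) for batch in batches]
    batch_maxs = [max(batch) for batch in batches]
    return {
        "min": min(batch_mins),                         # absolute min
        "max": max(batch_maxs),                         # absolute max
        "avg_min": sum(batch_mins) / len(batch_mins),   # mean of per-batch mins
        "avg_max": sum(batch_maxs) / len(batch_maxs),   # mean of per-batch maxes
    }


# Made-up activations; the second batch contains one outlier.
batches = [[-0.5, 0.2, 0.9], [-0.4, 0.1, 8.0], [-0.6, 0.3, 1.1]]
stats = collect_min_max(batches)
print(stats)
```

A single outlier batch determines the absolute `max`, while the averaged `avg_max` stays closer to the typical per-batch values.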

In [9]:
import os
from distiller.data_loggers import QuantCalibrationStatsCollector, collector_context

man_model = torch.load('./manual.checkpoint.pth.tar')
distiller.utils.assign_layer_fq_names(man_model)
collector = QuantCalibrationStatsCollector(man_model)

if not os.path.isfile('manual_lstm_pretrained_stats.yaml'):
    with collector_context(collector) as collector:
        val_loss = evaluate(man_model, val_data)
        collector.save('manual_lstm_pretrained_stats.yaml')

## Quantize Model:
  
We quantize the model after training has completed.  
First we check the perplexity of the baseline model, so that we have a reference for judging how good the quantization is.
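
For reference, the perplexity reported in this notebook is the exponential of the average cross-entropy loss (in nats) returned by `evaluate`. A small sketch using only the standard library; the `perplexity` helper is a name chosen for this example:

```python
import math


def perplexity(avg_cross_entropy):
    """Perplexity is exp of the average per-token cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy)


# Because of the exponential, a small increase in loss is a larger relative
# increase in perplexity: +0.1 nats is roughly +10.5% perplexity.
ratio = perplexity(4.56) / perplexity(4.46)
print(round(ratio, 3))
```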

In [10]:
from distiller.quantization import PostTrainLinearQuantizer, LinearQuantMode
from copy import deepcopy

# Load and evaluate the baseline model.
man_model = torch.load('./manual.checkpoint.pth.tar')
val_loss = evaluate(man_model, val_data)
print('val_loss:%8.2f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

100%|██████████| 622/622 [00:23<00:00, 26.72it/s]

val_loss:    4.46	|	 ppl:   86.78





Now we do our magic - __Quantizing the model__.  
The quantizer replaces the layers in our model with their quantized versions.  
We can see that our model has changed:

In [11]:
# Define the quantizer
quantizer = PostTrainLinearQuantizer(
    deepcopy(man_model),
    model_activation_stats='./manual_lstm_pretrained_stats.yaml')

# Quantizer magic:
quantizer.prepare_model()

In [12]:
quantizer.model

DistillerRNNModel(
  (encoder): RangeLinearEmbeddingWrapper(
    (wrapped_module): Embedding(33278, 1500)
  )
  (rnn): DistillerLSTM(1500, 1500, num_layers=2, dropout=0.65, bidirectional=False)
  (decoder): RangeLinearQuantParamLayerWrapper(
    mode=SYMMETRIC, num_bits_acts=8, num_bits_params=8, num_bits_accum=32, clip_acts=NONE, per_channel_wts=False
    preset_activation_stats=True
    w_scale=126.2964, w_zero_point=0.0000
    in_scale=127.0004, in_zero_point=0.0000
    out_scale=3.6561, out_zero_point=0.0000
    (wrapped_module): Linear(in_features=1500, out_features=33278, bias=True)
  )
)

In [13]:
val_loss = evaluate(quantizer.model.to(device), val_data)

100%|██████████| 622/622 [02:06<00:00,  5.71it/s]


In [14]:
print('val_loss:%8.2f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

val_loss:    4.64	|	 ppl:  103.26


As we can see, the perplexity has increased considerably, meaning that our quantization has damaged the accuracy of the model.  
Let's try quantizing each channel separately, and making the quantization range asymmetric.  
We also manually replaced the `min`, `max` boundaries in the stats file; the edited copy is the `manual_lstm_pretrained_stats_new.yaml` loaded in the next cell.  
The idea is that the quantizer takes the absolute `min`, `max` boundaries by default, and in the original file many of the activations had a very large range. A large range makes the quantization steps large, whereas we want them small, since the step size determines the round-off error.  
The activations in every LSTM are either `sigmoid` or `tanh`. Since these are bounded by $[0,1]$ and $[-1,1]$ respectively, and they saturate very quickly, we can clip their inputs to the range $[-6,6]$.
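
The saturation argument can be checked numerically. The sketch below uses only the standard library and a hypothetical `clip` helper to measure how much the `sigmoid` and `tanh` outputs can change when their inputs are clipped to $[-6,6]$, and compares that with one 8-bit step of each output range:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def clip(x, lo=-6.0, hi=6.0):
    return max(lo, min(hi, x))


# Largest change in the output caused by clipping the input to [-6, 6].
# Both functions are monotonic, so the worst case is approached as x -> +/-inf.
sigmoid_clip_error = 1.0 - sigmoid(6.0)
tanh_clip_error = 1.0 - math.tanh(6.0)

# One 8-bit quantization step over each function's output range.
sigmoid_step = (1.0 - 0.0) / 255
tanh_step = (1.0 - (-1.0)) / 255

print(sigmoid_clip_error, sigmoid_step)
print(tanh_clip_error, tanh_step)
# Example: an input far outside the range changes the output only slightly.
print(sigmoid(100.0) - sigmoid(clip(100.0)))
```

In both cases the clipping error stays below one 8-bit step of the output, which is why narrowing the input range costs little.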

In [15]:
quantizer = PostTrainLinearQuantizer(
    deepcopy(man_model),
    model_activation_stats='./manual_lstm_pretrained_stats_new.yaml',
    mode=LinearQuantMode.ASYMMETRIC_SIGNED,
    per_channel_wts=True
)
quantizer.prepare_model()
quantizer.model

DistillerRNNModel(
  (encoder): RangeLinearEmbeddingWrapper(
    (wrapped_module): Embedding(33278, 1500)
  )
  (rnn): DistillerLSTM(1500, 1500, num_layers=2, dropout=0.65, bidirectional=False)
  (decoder): RangeLinearQuantParamLayerWrapper(
    mode=ASYMMETRIC_SIGNED, num_bits_acts=8, num_bits_params=8, num_bits_accum=32, clip_acts=NONE, per_channel_wts=True
    preset_activation_stats=True
    w_scale=PerCh, w_zero_point=PerCh
    in_scale=127.5069, in_zero_point=1.0000
    out_scale=5.0241, out_zero_point=48.0000
    (wrapped_module): Linear(in_features=1500, out_features=33278, bias=True)
  )
)

In [16]:
val_loss = evaluate(quantizer.model.to(device), val_data)
print('val_loss:%8.2f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

100%|██████████| 622/622 [02:09<00:00,  5.13it/s]

val_loss:    4.61	|	 ppl:  100.92





A tiny bit better, but still not good. Let us try the half-precision version of the model:

In [17]:
model_fp16 = deepcopy(man_model).half()
val_loss = evaluate(model_fp16, val_data)
print('val_loss: %8.6f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

100%|██████████| 622/622 [00:29<00:00, 21.17it/s]

val_loss: 4.463242	|	 ppl:   86.77





The result is very close to our original model! That means that the round-off from linear quantization is what hurts our accuracy. Let's then try quantizing everything except the element-wise operations, as stated in 
[`Effective Quantization Methods for Recurrent Neural Networks`](https://arxiv.org/abs/1611.10176); in the cell below, the encoder and decoder are also kept in half precision:

In [18]:
overrides_yaml = """
.*eltwise.*:
    fp16: true
encoder:
    fp16: true
decoder:
    fp16: true
"""
overrides = distiller.utils.yaml_ordered_load(overrides_yaml)
quantizer = PostTrainLinearQuantizer(
    deepcopy(man_model),
    model_activation_stats='./manual_lstm_pretrained_stats_new.yaml',
    mode=LinearQuantMode.ASYMMETRIC_SIGNED,
    overrides=overrides,
    per_channel_wts=True
)
quantizer.prepare_model()
val_loss = evaluate(quantizer.model.to(device), val_data)
print('val_loss:%8.6f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

100%|██████████| 622/622 [01:20<00:00,  8.19it/s]

val_loss:4.463708	|	 ppl:   86.81





In [19]:
quantizer.model

DistillerRNNModel(
  (encoder): FP16Wrapper(
    (wrapped_module): Embedding(33278, 1500)
  )
  (rnn): DistillerLSTM(1500, 1500, num_layers=2, dropout=0.65, bidirectional=False)
  (decoder): FP16Wrapper(
    (wrapped_module): Linear(in_features=1500, out_features=33278, bias=True)
  )
)

The accuracy is still holding up very well, even though we quantized the inner linear layers!  
Now, let's try choosing different boundaries for `min`, `max`.  
Instead of using the absolute ones, we take the average over all batches (`avg_min`, `avg_max`), which indicates where the boundaries usually lie. This is done by setting the `clip_acts` parameter to `ClipMode.AVG` or `"AVG"` in the quantizer constructor:

In [20]:
overrides_yaml = """
encoder:
    fp16: true
decoder:
    fp16: true
"""
overrides = distiller.utils.yaml_ordered_load(overrides_yaml)
quantizer = PostTrainLinearQuantizer(
    deepcopy(man_model),
    model_activation_stats='./manual_lstm_pretrained_stats.yaml',
    mode=LinearQuantMode.ASYMMETRIC_SIGNED,
    overrides=overrides,
    per_channel_wts=True,
    clip_acts="AVG"
)
quantizer.prepare_model()
val_loss = evaluate(quantizer.model.to(device), val_data)
print('val_loss:%8.6f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

100%|██████████| 622/622 [02:31<00:00,  3.80it/s]

val_loss:4.487813	|	 ppl:   88.93
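
The trade-off behind averaged boundaries can be sketched in plain Python: a narrower range gives smaller quantization steps for typical values, at the price of saturating outliers. The `fake_quantize` helper and the data are made up for this illustration and do not reproduce Distiller's implementation.

```python
def fake_quantize(x, lo, hi, num_bits=8):
    """Clamp x to [lo, hi], quantize to 2^num_bits levels, then dequantize."""
    scale = (2 ** num_bits - 1) / (hi - lo)
    clamped = max(lo, min(hi, x))
    return round((clamped - lo) * scale) / scale + lo


def mean_abs_error(values, lo, hi):
    return sum(abs(v - fake_quantize(v, lo, hi)) for v in values) / len(values)


typical = [i / 7.0 for i in range(-7, 8)]  # 15 values spread over [-1, 1]
outlier = 50.0                             # one rare, very large activation

# Range taken from the absolute min/max, which includes the outlier.
wide = mean_abs_error(typical, -1.0, outlier)
# Narrower range that ignores the outlier.
narrow = mean_abs_error(typical, -1.0, 1.0)

print(wide, narrow)
# The outlier itself saturates at the upper boundary of the narrow range.
print(fake_quantize(outlier, -1.0, 1.0))
```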





Great! Even though we quantized all of the layers except the embedding and the decoder, we got almost no accuracy penalty. Let's try quantizing them as well:

In [21]:
quantizer = PostTrainLinearQuantizer(
    deepcopy(man_model),
    model_activation_stats='./manual_lstm_pretrained_stats_new.yaml',
    mode=LinearQuantMode.ASYMMETRIC_SIGNED,
    per_channel_wts=True,
    clip_acts="AVG"
)
quantizer.prepare_model()
val_loss = evaluate(quantizer.model.to(device), val_data)
print('val_loss:%8.6f\t|\t ppl:%8.2f' % (val_loss, np.exp(val_loss)))

100%|██████████| 622/622 [02:24<00:00,  4.84it/s]

val_loss:4.487492	|	 ppl:   88.90





In [22]:
quantizer.model

DistillerRNNModel(
  (encoder): RangeLinearEmbeddingWrapper(
    (wrapped_module): Embedding(33278, 1500)
  )
  (rnn): DistillerLSTM(1500, 1500, num_layers=2, dropout=0.65, bidirectional=False)
  (decoder): RangeLinearQuantParamLayerWrapper(
    mode=ASYMMETRIC_SIGNED, num_bits_acts=8, num_bits_params=8, num_bits_accum=32, clip_acts=AVG, per_channel_wts=True
    preset_activation_stats=True
    w_scale=PerCh, w_zero_point=PerCh
    in_scale=129.4670, in_zero_point=1.0000
    out_scale=9.9393, out_zero_point=56.0000
    (wrapped_module): Linear(in_features=1500, out_features=33278, bias=True)
  )
)

Here we see that quantizing with the right boundaries can sometimes give slightly better results than using floating-point operations, even in half precision (perplexity 88.90 here versus 88.93 with the encoder and decoder in FP16).

## Conclusion

Choosing the right boundaries for quantization was crucial for achieving almost no degradation in the accuracy of the LSTM.  
  
Here we showed how to use the Distiller quantization API to quantize an RNN model, by converting the PyTorch implementation into a modular one and then quantizing each layer separately.