# Quantizing Neural Machine Translation Models

We continue our quest to quantize every neural network!  
In this chapter: __Google's Neural Machine Translation (GNMT) model__.  
A brief summary: using stacked LSTMs and an attention mechanism, this model encodes a sentence into a list of vectors and then decodes it into tokens of the target language until an end token is reached.  
To read more - refer to <a id="ref-1" href="#cite-wu2016google">Google's paper</a>.

# Table of Contents
* [Quantizing Neural Machine Translation Models](#Quantizing-Neural-Machine-Translation-Models)
	* [Getting the resources](#Getting-the-resources)
	* [Preparing The Model For Quantization](#Preparing-The-Model-For-Quantization)
	* [Loading the model](#Loading-the-model)
	* [Evaluation of the model](#Evaulation-of-the-model)
	* [Quantizing the model](#Quantizing-the-model)
		* [Collecting the statistics](#Collecting-the-statistics)
		* [Defining the Quantizer](#Defining-the-Quantizer)
		* [Quantizing the model](#Quantizing-the-model)
		* [Evaluating the quantized model](#Evaluating-the-quantized-model)
		* [Finding the right quantization](#Finding-the-right-quantization)


## Getting the resources

In this project, we modified the [`mlperf/training/rnn_translator`](https://github.com/mlperf/training/tree/master/rnn_translator) project to enable quantization of the GNMT model.  
The instructions to download and set up the required environment for this task are in `README.md` (located in the current directory).  
Download the pretrained model using the command:

In [1]:
# Uncomment the line below to download the pretrained model:
#! wget https://zenodo.org/record/2581623/files/model_best.pth

At this point, you should have everything ready to start quantizing!

## Preparing The Model For Quantization

To fully quantize the model, we modify it according to the instructions laid out in the Distiller [documentation](https://nervanasystems.github.io/distiller/prepare_model_quant.html). This mostly amounts to making sure every quantizable operation is invoked via a dedicated PyTorch Module. You can compare the code under `seq2seq/models` in this example with the [original](https://github.com/mlperf/training/tree/master/rnn_translator/pytorch/seq2seq/models).

For example, in `seq2seq/models/attention.py`, we added the following code to the `__init__` function of the `BahdanauAttention` class:

```python
# Adding submodules for basic ops to allow quantization:
self.eltwiseadd_qk = EltwiseAdd()
self.eltwiseadd_norm_bias = EltwiseAdd()
self.eltwisemul_norm_scaler = EltwiseMult()
self.matmul_score = Matmul()
self.context_matmul = BatchMatmul()
```

We're creating modules for operations that were invoked directly in the `forward` function in the original code. This enables Distiller to detect these operations and replace them with quantized counterparts.


## Loading the model

In [2]:
import torch
import torch.nn as nn
import distiller
from distiller.modules import DistillerLSTM
from distiller.quantization import PostTrainLinearQuantizer
from ast import literal_eval
from itertools import zip_longest
from copy import deepcopy

from seq2seq import models
from seq2seq.inference.inference import Translator
from seq2seq.utils import AverageMeter
import subprocess
import os
import seq2seq.data.config as config
from seq2seq.data.dataset import ParallelDataset
# Import utilities from the example:
from translate import grouper, write_output, checkpoint_from_distributed, unwrap_distributed
from itertools import takewhile
from tqdm import tqdm
import logging
logging.disable(logging.INFO)  # Disables mlperf output

import warnings
warnings.filterwarnings(action='default', module='distiller.quantization')
warnings.filterwarnings(action='default', module='distiller.quantization.range_linear')

In [3]:
# Define some constants
batch_first=True
batch_size=128
beam_size=10
cov_penalty_factor=0.1
dataset_dir='./data'
input='./data/newstest2014.tok.clean.bpe.32000.en'
len_norm_const=5.0
len_norm_factor=0.6
max_seq_len=80
model='model_best.pth'
output='output_file'
print_freq=1
reference='./data/newstest2014.de'

In [4]:
# Loading the model
checkpoint = torch.load('./model_best.pth', map_location={'cuda:0': 'cpu'})
vocab_size = checkpoint['tokenizer'].vocab_size
model_config = dict(vocab_size=vocab_size, math=checkpoint['config'].math,
                    **literal_eval(checkpoint['config'].model_config))
model_config['batch_first'] = batch_first
model = models.GNMT(**model_config)

In [5]:
state_dict = checkpoint['state_dict']
if checkpoint_from_distributed(state_dict):
    state_dict = unwrap_distributed(state_dict)

model.load_state_dict(state_dict)
torch.cuda.set_device(0)
model = model.cuda()
model.eval()

GNMT(
  (encoder): ResidualRecurrentEncoder(
    (rnn_layers): ModuleList(
      (0): LSTM(1024, 1024, batch_first=True, bidirectional=True)
      (1): LSTM(2048, 1024, batch_first=True)
      (2): LSTM(1024, 1024, batch_first=True)
      (3): LSTM(1024, 1024, batch_first=True)
    )
    (dropout): Dropout(p=0.2)
    (embedder): Embedding(32317, 1024, padding_idx=0)
    (eltwiseadd_residuals): ModuleList(
      (0): EltwiseAdd()
      (1): EltwiseAdd()
    )
  )
  (decoder): ResidualRecurrentDecoder(
    (att_rnn): RecurrentAttention(
      (rnn): LSTM(1024, 1024, batch_first=True)
      (attn): BahdanauAttention(
        (linear_q): Linear(in_features=1024, out_features=1024, bias=False)
        (linear_k): Linear(in_features=1024, out_features=1024, bias=False)
        (dropout): Dropout(p=0)
        (eltwiseadd_qk): EltwiseAdd()
        (eltwiseadd_norm_bias): EltwiseAdd()
        (eltwisemul_norm_scaler): EltwiseMult()
        (tanh): Tanh()
        (matmul_score): Matmul()
       

<a id="Evaulation-of-the-model"></a>
## Evaluation of the model

In [6]:
tokenizer = checkpoint['tokenizer']


test_data = ParallelDataset(
    src_fname=os.path.join(dataset_dir, config.SRC_TEST_FNAME),
    tgt_fname=os.path.join(dataset_dir, config.TGT_TEST_FNAME),
    tokenizer=tokenizer,
    min_len=0,
    max_len=150,
    sort=False)

def get_loader():
    return test_data.get_loader(batch_size=batch_size,
                                   batch_first=True,
                                   shuffle=False,
                                   num_workers=0,
                                   drop_last=False,
                                   distributed=False)
def get_translator(model):
    return Translator(model,
                       tokenizer,
                       beam_size=beam_size,
                       max_seq_len=max_seq_len,
                       len_norm_factor=len_norm_factor,
                       len_norm_const=len_norm_const,
                       cov_penalty_factor=cov_penalty_factor,
                       cuda=True)
torch.cuda.empty_cache()

In [7]:
def evaluate(model, test_path, num_batches=None):
    test_file = open(test_path, 'w', encoding='UTF-8')
    model.eval()
    translator = get_translator(model)
    stats = {}
    loader = get_loader()
    total_batches = len(loader)
    if num_batches is None:
        num_batches = total_batches
    num_batches = min(num_batches, total_batches)
    loader = iter(loader)
    for i in tqdm(range(num_batches)):
        src, tgt, indices = next(loader)
        src, src_length = src
        if translator.batch_first:
            batch_size = src.size(0)
        else:
            batch_size = src.size(1)
        bos = [translator.insert_target_start] * (batch_size * beam_size)
        bos = torch.LongTensor(bos)
        if translator.batch_first:
            bos = bos.view(-1, 1)
        else:
            bos = bos.view(1, -1)
        src_length = torch.LongTensor(src_length)
        stats['total_enc_len'] = int(src_length.sum())
        src = src.cuda()
        src_length = src_length.cuda()
        bos = bos.cuda()
        with torch.no_grad():
            context = translator.model.encode(src, src_length)
            context = [context, src_length, None]
            if beam_size == 1:
                generator = translator.generator.greedy_search
            else:
                generator = translator.generator.beam_search
            preds, lengths, counter = generator(batch_size, bos, context)
        stats['total_dec_len'] = lengths.sum().item()
        stats['iters'] = counter
        preds = preds.cpu()
        lengths = lengths.cpu()
        output = []
        for idx, pred in enumerate(preds):
            end = lengths[idx] - 1
            pred = pred[1: end]
            pred = pred.tolist()
            out = translator.tok.detokenize(pred)
            output.append(out)
        output = [output[indices.index(i)] for i in range(len(output))]
        for line in output:
            test_file.write(line)
            test_file.write('\n')
        total_tokens = stats['total_dec_len'] + stats['total_enc_len']
    test_file.close()
    if num_batches < total_batches:
        print("Can't calculate BLEU when evaluating partial dataset")
        return
    # run moses detokenizer
    print("Calculating BLEU score...")
    detok_path = os.path.join(dataset_dir, config.DETOKENIZER)
    detok_test_path = test_path + '.detok'

    with open(detok_test_path, 'w') as detok_test_file, \
            open(test_path, 'r') as test_file:
        subprocess.run(['perl', detok_path], stdin=test_file,
                       stdout=detok_test_file, stderr=subprocess.DEVNULL)
    # run sacrebleu
    reference_path = os.path.join(dataset_dir,
                                  config.TGT_TEST_TARGET_FNAME)
    sacrebleu = subprocess.run(['sacrebleu --input {} {} --score-only -lc --tokenize intl'.
                                format(detok_test_path, reference_path)],
                               stdout=subprocess.PIPE, shell=True)
    bleu = float(sacrebleu.stdout.strip())
    print('BLEU on test dataset: {}'.format(bleu))

In [8]:
evaluate(model, output)

100%|██████████| 24/24 [00:48<00:00,  1.77s/it]


Calculating BLEU score...
BLEU on test dataset: 22.16


## Quantizing the model

As already noted, we modified the `mlperf` model into a modular implementation so that we can quantize every operation in the graph.  
However, the default `nn.LSTM` is implemented in C++/CUDA, so its internal operations are not accessible from Python and we can't quantize them properly. This is why we convert each `nn.LSTM` to a `DistillerLSTM`, an entirely modular implementation of the LSTM that is functionally identical to the original `nn.LSTM`.  
This is done by calling `DistillerLSTM.from_pytorch_impl` for a single `nn.LSTM`, or  
`convert_model_to_distiller_lstm` for an entire model containing multiple LSTMs.


In [9]:
from distiller.modules import convert_model_to_distiller_lstm
model = convert_model_to_distiller_lstm(model)
evaluate(model, output)

100%|██████████| 24/24 [01:54<00:00,  4.02s/it]


Calculating BLEU score...
BLEU on test dataset: 22.16


### Collecting the statistics

The quantizer uses activation statistics to define the quantization range of each layer. We collect them with `collect_quant_stats`, a helper built around `QuantCalibrationStatsCollector`, like this:

In [10]:
import os
from distiller.data_loggers import collect_quant_stats

stats_file = './acts_quantization_stats.yaml'

if not os.path.isfile(stats_file): # Collect stats.
    model_copy = deepcopy(model)
    distiller.utils.assign_layer_fq_names(model_copy)
    
    def eval_for_stats(model):
        evaluate(model, output + '.temp', num_batches=None)
    collect_quant_stats(model_copy, eval_for_stats, save_dir='.')
    del model_copy
    torch.cuda.empty_cache()

100%|██████████| 24/24 [46:47<00:00, 102.09s/it]


Calculating BLEU score...


  0%|          | 0/24 [00:00<?, ?it/s]

BLEU on test dataset: 22.16


100%|██████████| 24/24 [12:23<00:00, 26.94s/it]


Calculating BLEU score...
BLEU on test dataset: 22.16


### Defining the Quantizer

A Distiller `Quantizer` object replaces each submodule in a model with its quantized counterpart, using a `replacement_factory`.  
`Quantizer.replacement_factory` is a dictionary that maps a module type (e.g. `nn.Linear` or `nn.Conv2d`) to a function. This function takes a module and a quantization configuration, and returns a quantized version of that module.
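
The idea of a type-to-function mapping can be sketched in plain Python. This is only an illustration of the pattern, not Distiller's implementation: the classes `Linear`, `EltwiseAdd` and `QuantWrapper`, and the helpers `replace_layer` and `quantize_submodules`, are hypothetical stand-ins, and the "model" is a flat dictionary rather than a real `nn.Module` hierarchy.

```python
class Linear:
    """Stand-in for a layer with parameters (hypothetical, not torch.nn.Linear)."""


class EltwiseAdd:
    """Stand-in for a parameter-free operation wrapped as a module."""


class QuantWrapper:
    """Hypothetical wrapper that would quantize the inputs/outputs of `wrapped`."""

    def __init__(self, wrapped, num_bits):
        self.wrapped = wrapped
        self.num_bits = num_bits


def replace_layer(module, num_bits):
    """Build the quantized counterpart of `module`."""
    return QuantWrapper(module, num_bits)


# Maps a module type to the function that builds its quantized counterpart.
replacement_factory = {
    Linear: replace_layer,
    EltwiseAdd: replace_layer,
}


def quantize_submodules(submodules, num_bits=8):
    """Return a copy of `submodules` where every known type is replaced."""
    result = {}
    for name, module in submodules.items():
        factory = replacement_factory.get(type(module))
        # Types without an entry in the factory are left untouched.
        result[name] = factory(module, num_bits) if factory else module
    return result


# A toy "model": names mapped to submodules. `object()` plays an unsupported op.
model = {"linear_q": Linear(), "eltwiseadd_qk": EltwiseAdd(), "tanh": object()}
quantized = quantize_submodules(model)
for name, module in quantized.items():
    print(name, "->", type(module).__name__)
```

Only operations that exist as distinct objects can be looked up by type in this way, which is why the model had to be rewritten so that every operation is its own module.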

In [11]:
# Basic quantizer definition
quantizer = PostTrainLinearQuantizer(deepcopy(model), 
                                    mode="SYMMETRIC",  # As was suggested in GNMT's paper
                                    model_activation_stats=stats_file)
# We take a look at the replacement factory:
for t, rf in quantizer.replacement_factory.items():
    if rf is not None:
        print("Replacing '{}' modules using '{}' function".format(t.__name__, rf.__name__))

Replacing 'Conv2d' modules using 'replace_param_layer' function
Replacing 'Conv3d' modules using 'replace_param_layer' function
Replacing 'Linear' modules using 'replace_param_layer' function
Replacing 'Concat' modules using 'replace_non_param_layer' function
Replacing 'EltwiseAdd' modules using 'replace_non_param_layer' function
Replacing 'EltwiseMult' modules using 'replace_non_param_layer' function
Replacing 'Matmul' modules using 'replace_non_param_layer' function
Replacing 'BatchMatmul' modules using 'replace_non_param_layer' function
Replacing 'Embedding' modules using 'replace_embedding' function


### Quantizing the model

This is done by calling `quantizer.prepare_model()` with a dummy input:

In [12]:
dummy_input = (torch.ones(1, 2).to(dtype=torch.long),
               torch.ones(1).to(dtype=torch.long),
               torch.ones(1, 2).to(dtype=torch.long))
quantizer.prepare_model(dummy_input)
quantizer.model

Will perform specific optimization for the DistillerLSTM modules, but any other potential opportunities for optimization in the model will be ignored.




GNMT(
  (encoder): ResidualRecurrentEncoder(
    (rnn_layers): ModuleList(
      (0): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=True)
      (1): DistillerLSTM(2048, 1024, num_layers=1, dropout=0.00, bidirectional=False)
      (2): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=False)
      (3): DistillerLSTM(1024, 1024, num_layers=1, dropout=0.00, bidirectional=False)
    )
    (dropout): Dropout(p=0.2)
    (embedder): RangeLinearEmbeddingWrapper(
      (wrapped_module): Embedding(32317, 1024, padding_idx=0)
    )
    (eltwiseadd_residuals): ModuleList(
      (0): RangeLinearQuantEltwiseAddWrapper(
        output_quant_settings=(num_bits=8 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)
        accum_quant_settings=(num_bits=32 ; quant_mode=SYMMETRIC ; clip_mode=NONE ; clip_n_stds=None ; clip_half_range=False ; per_channel=False)
        requires_quantized_inputs=True
          inputs

If you'd like to know how these functions replace the modules, we recommend reading their source code in  
`{DISTILLER_ROOT}/distiller/quantization/range_linear.py:PostTrainLinearQuantizer`.

### Evaluating the quantized model

In [13]:
#torch.cuda.empty_cache()
evaluate(quantizer.model, output)

100%|██████████| 24/24 [15:15<00:00, 33.37s/it]


Calculating BLEU score...
BLEU on test dataset: 18.04


### Finding the right quantization

As we can see, the fully quantized model lost about 4 BLEU points (from 22.16 to 18.04), so we want to try additional strategies to quantize it better.  
Symmetric quantization uses the smallest zero-centered range that contains all of the activations:
$$
    M = \max \{ |\text{acts}|\},\, \text{range}_{symmetric} = [-M, M]
$$
If the activations are not centered around zero, part of this range is never used and we waste resolution. With asymmetric quantization we may get better results:

In [14]:
# Basic quantizer definition
quantizer = PostTrainLinearQuantizer(deepcopy(model), 
                                    mode="ASYMMETRIC_SIGNED",  
                                    model_activation_stats=stats_file)
quantizer.prepare_model()
evaluate(quantizer.model, output)

100%|██████████| 24/24 [17:19<00:00, 33.57s/it]


Calculating BLEU score...
BLEU on test dataset: 18.4


Here we quantized asymmetrically, meaning our range still holds all the activations, but it can be narrower than in the symmetric case.  
The formula is:
$$
    \text{range}_{asymmetric} = \left[\min\{ \text{acts}\}, \max \{ \text{acts}\}\right] 
    \subset \text{range}_{symmetric}
$$
And we indeed got a slightly better result.  
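
To make the difference between the two ranges concrete, here is a minimal pure-Python sketch. It is not how Distiller computes its quantization parameters: the helpers `symmetric_range`, `asymmetric_range` and `step_size` are hypothetical, the activations are made up, and the step size uses the simplifying assumption that the range is split into $2^{8} - 1$ equal steps.

```python
def symmetric_range(acts):
    """Return [-M, M], where M is the largest absolute activation."""
    m = max(abs(a) for a in acts)
    return -m, m


def asymmetric_range(acts):
    """Return [min(acts), max(acts)]."""
    return min(acts), max(acts)


def step_size(lo, hi, num_bits=8):
    """Width of one quantization step when [lo, hi] is split into 2**num_bits - 1 steps."""
    return (hi - lo) / (2 ** num_bits - 1)


# Made-up activations that are skewed towards positive values.
acts = [-0.5, 0.1, 0.7, 1.3, 2.0, 3.5]

sym = symmetric_range(acts)
asym = asymmetric_range(acts)

print("symmetric range :", sym, "step:", step_size(*sym))
print("asymmetric range:", asym, "step:", step_size(*asym))
```

For skewed data like this, the asymmetric range is narrower, so the same number of bits yields a finer step.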
However, some of the activations seen during evaluation are outliers, meaning they lie far outside the range of most other values. We're going to address this in two ways:
1. Quantize the weights of each channel separately, so that each channel gets a range that fits it better. We'll add the argument `per_channel_wts=True`.
2. Limit the quantization range of the activations to a smaller one, thus clamping these outliers.
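
The effect of the first option can be illustrated with a small pure-Python sketch, under stated assumptions: the weights are made up, each row stands for one output channel, and `quantize_symmetric` and `max_error` are hypothetical helpers that quantize to signed 8-bit integers and immediately dequantize. Distiller's actual per-channel implementation may differ in its details.

```python
def quantize_symmetric(values, max_abs, num_bits=8):
    """Quantize to signed integers over [-max_abs, max_abs], then dequantize."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = q_max / max_abs
    return [round(v * scale) / scale for v in values]


def max_error(original, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(o - r) for o, r in zip(original, restored))


# Made-up weights: each row is one output channel.
# The second channel has much larger values than the first.
weights = [
    [0.010, -0.020, 0.015],
    [1.500, -2.000, 0.750],
]

# Per-tensor: one range shared by all channels.
tensor_max = max(abs(w) for row in weights for w in row)
per_tensor = [quantize_symmetric(row, tensor_max) for row in weights]

# Per-channel: each row gets its own range.
per_channel = [quantize_symmetric(row, max(abs(w) for w in row)) for row in weights]

print("channel 0 error, per-tensor :", max_error(weights[0], per_tensor[0]))
print("channel 0 error, per-channel:", max_error(weights[0], per_channel[0]))
```

With a shared range, the small-valued channel is represented by only a handful of integer levels; giving it its own range reduces its rounding error considerably.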

We'll try using the same technique as in `quantize_lstm.ipynb`: clipping the activations according to the average range recorded for them.  


$$
    m = \underset{b\in\text{batches}}{\text{avg}}\left\{\min_{b}\{\text{acts}\}\right\},\,
    M = \underset{b\in\text{batches}}{\text{avg}}\left\{\max_{b}\{\text{acts}\}\right\}
$$


$$
    \text{range}_{clipped} = [m,M] \subset \text{range}_{asymmetric} \subset \text{range}_{symmetric}
$$
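
As a rough illustration of the formulas above, the following pure-Python sketch averages the per-batch minimum and maximum and clamps values to that range. The batches are made up, and `averaged_range` and `clip` are hypothetical helpers, not Distiller functions.

```python
def averaged_range(batches):
    """Average the per-batch minimum and maximum over all batches."""
    mins = [min(b) for b in batches]
    maxs = [max(b) for b in batches]
    return sum(mins) / len(mins), sum(maxs) / len(maxs)


def clip(values, lo, hi):
    """Clamp every value into [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


# Made-up activations from three calibration batches; the last one has an outlier.
batches = [
    [-1.0, 0.2, 1.1],
    [-0.8, 0.0, 0.9],
    [-1.2, 0.3, 9.0],
]

full_min = min(min(b) for b in batches)
full_max = max(max(b) for b in batches)
avg_min, avg_max = averaged_range(batches)

print("full range    :", (full_min, full_max))
print("averaged range:", (avg_min, avg_max))
print("clipped batch :", clip(batches[2], avg_min, avg_max))
```

Values outside the averaged range are clamped to its edges, as the last printed line shows for the outlier.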

This is done by specifying `clip_acts="AVG"` in the quantizer. 

In [15]:
# Basic quantizer definition
quantizer = PostTrainLinearQuantizer(deepcopy(model), 
                                    mode="ASYMMETRIC_SIGNED",  
                                    model_activation_stats=stats_file,
                                    per_channel_wts=True,
                                    clip_acts="AVG")
quantizer.prepare_model()
evaluate(quantizer.model, output)

100%|██████████| 24/24 [15:55<00:00, 34.44s/it]


Calculating BLEU score...
BLEU on test dataset: 9.44


Oh no! This is bad... It turns out that by clamping the outliers we actually "removed" useful features from important layers such as the attention layer. The attention layer contains a softmax, which relies on large input values to assign the correct importance scores. Let's try clipping everywhere except in the attention layer:

In [16]:
# No clipping in the attention layer
overrides_yaml = """
.*att_rnn.attn.*:
    clip_acts: NONE # Quantize without clipping
"""
overrides = distiller.utils.yaml_ordered_load(overrides_yaml)
# Basic quantizer definition
quantizer = PostTrainLinearQuantizer(deepcopy(model), 
                                    mode="ASYMMETRIC_SIGNED",  
                                    model_activation_stats=stats_file,
                                    overrides=overrides,
                                    per_channel_wts=True,
                                    clip_acts="AVG")
quantizer.prepare_model()
evaluate(quantizer.model, output)

100%|██████████| 24/24 [15:22<00:00, 33.23s/it]


Calculating BLEU score...
BLEU on test dataset: 16.71


The accuracy is somewhat "restored", but we would still like to get a score as close to the original model as possible. How about quantizing the `classifier` without clipping as well?

In [17]:
# No clipping in the attention layer and in the final classifier
overrides_yaml = """
.*att_rnn.attn.*:
    clip_acts: NONE # Quantize without clipping
decoder.classifier.classifier:
    clip_acts: NONE # Quantize without clipping
"""
overrides = distiller.utils.yaml_ordered_load(overrides_yaml)
# Basic quantizer definition
quantizer = PostTrainLinearQuantizer(deepcopy(model), 
                                    mode="ASYMMETRIC_SIGNED",  
                                    model_activation_stats=stats_file,
                                    overrides=overrides,
                                    per_channel_wts=True,
                                    clip_acts="AVG")
quantizer.prepare_model()
evaluate(quantizer.model, output)

100%|██████████| 24/24 [16:43<00:00, 39.45s/it]


Calculating BLEU score...
BLEU on test dataset: 21.48


Finally, some good results: 21.48 BLEU, compared with 22.16 for the original model. So now we know better which layers are sensitive to clipping and which benefit from it.  

# References

<a id="cite-wu2016google"/><sup><a href=#ref-1>[^]</a></sup>Wu, Yonghui and Schuster, Mike and Chen, Zhifeng and Le, Quoc V and Norouzi, Mohammad and Macherey, Wolfgang and Krikun, Maxim and Cao, Yuan and Gao, Qin and Macherey, Klaus and others. 2016. _Google's neural machine translation system: Bridging the gap between human and machine translation_.



<!--bibtex

@article{wu2016google,
  title={Google's neural machine translation system: Bridging the gap between human and machine translation},
  author={Wu, Yonghui and Schuster, Mike and Chen, Zhifeng and Le, Quoc V and Norouzi, Mohammad and Macherey, Wolfgang and Krikun, Maxim and Cao, Yuan and Gao, Qin and Macherey, Klaus and others},
  journal={arXiv preprint arXiv:1609.08144},
  year={2016}
}

-->