We explore ways to reduce computation and model size for neural machine translation. With the development of binary weight networks and XNOR networks in vision, we attempt to extend that work to machine translation. In particular, we evaluate how binary convolutions can be used in translation models and what effect they have on performance and model size.
Although our analysis is done on the Multi30k dataset, our code supports the following datasets:
- WMT 14 EN-FR
- IWSLT
- Multi30k
We implement four baseline models to compare our binarized models against:
1. An encoder-decoder model that encodes the source language with an LSTM and presents the final hidden state to the decoder, which decodes the output from that state alone.
2. An encoder-decoder model like the first, but at every decoder step it applies an attention mechanism over all encoder outputs, conditioned on the current decoder hidden state (a minimal sketch follows this list).
3. The same model as above, but using QRNNs (Quasi-Recurrent Neural Networks, developed by Salesforce Research) instead of LSTMs. QRNNs should be much faster since they rely on lower-level convolutions and can be parallelized further than the attention RNN.
4. A fully convolutional model (implemented by FAIR) that, rather than using RNNs, stacks convolutional layers for both the encoder and the decoder, along with attention.
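For concreteness, here is a minimal PyTorch sketch of the attention baseline: an LSTM encoder, and a decoder that attends over all encoder outputs at each step. The module names, dimensions, and the simple dot-product attention score are illustrative assumptions, not the exact implementation in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) -> outputs: (batch, src_len, hid_dim)
        outputs, state = self.lstm(self.embed(src))
        return outputs, state

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, tgt_tok, state, enc_outputs):
        # One decoder step: tgt_tok is (batch, 1), state comes from the
        # previous step (or from the encoder at the first step).
        dec_out, state = self.lstm(self.embed(tgt_tok), state)        # (batch, 1, hid)
        # Attention over all encoder outputs, conditioned on the
        # current decoder hidden state (dot-product scoring).
        scores = torch.bmm(dec_out, enc_outputs.transpose(1, 2))      # (batch, 1, src_len)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, enc_outputs)                     # (batch, 1, hid)
        logits = self.out(torch.cat([dec_out, context], dim=-1))      # (batch, 1, vocab)
        return logits, state
```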
We implement two variants of binarized networks to compare performance.
This model is the same as the convolutional model implemented above, with one key difference: all the weights are represented as a binary tensor β and a real-valued scaling vector α, such that W ≈ β · α. The benefit is that a convolution can then be estimated as (I · β) · α, so the bulk of the multiply-accumulate work is done against binary weights.
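A minimal sketch of this approximation, assuming (as in XNOR-Net) that α is the per-filter mean absolute value of the real weights; the actual binarization code in this repo may differ:

```python
import torch
import torch.nn.functional as F

def binarize_weight(W):
    # W: (out_channels, in_channels, kernel) real-valued conv weights.
    # beta is the element-wise sign of W; alpha is one scale per filter,
    # taken as the mean absolute value of that filter's weights.
    alpha = W.abs().mean(dim=(1, 2), keepdim=True)
    beta = torch.sign(W)
    return beta, alpha

def binary_weight_conv1d(I, W, padding=0):
    # Estimate conv(I, W) as conv(I, beta) * alpha.
    beta, alpha = binarize_weight(W)
    return F.conv1d(I, beta, padding=padding) * alpha.view(1, -1, 1)

I = torch.randn(8, 64, 20)          # (batch, in_channels, seq_len)
W = torch.randn(128, 64, 3)         # (out_channels, in_channels, kernel width)
approx = binary_weight_conv1d(I, W, padding=1)
exact = F.conv1d(I, W, padding=1)   # full-precision result, for comparison
```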
This model extends the binary weight network: the input is binarized as well, so the convolutions can be estimated as (sign(I) · sign(β)) · α, which allows the underlying dot products to be computed with XNOR and bit-count operations.
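Under the same assumptions as the sketch above, the XNOR variant only changes the input side: both convolution operands are signs, so in principle the dot products reduce to XNOR plus popcount. The float convolution below merely emulates the approximation:

```python
import torch
import torch.nn.functional as F

def xnor_conv1d(I, W, padding=0):
    # Estimate conv(I, W) as (sign(I) conv sign(W)) * alpha.
    # On suitable hardware the binary-binary convolution could be done
    # with XNOR + popcount; here it is emulated in floating point.
    alpha = W.abs().mean(dim=(1, 2), keepdim=True)      # per-filter scale
    return F.conv1d(torch.sign(I), torch.sign(W), padding=padding) * alpha.view(1, -1, 1)

I = torch.randn(8, 64, 20)      # (batch, in_channels, seq_len)
W = torch.randn(128, 64, 3)     # (out_channels, in_channels, kernel width)
out = xnor_conv1d(I, W, padding=1)
```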
Other stats can be found in this issue.
We compare the model sizes of two different sets of models: first, the models we ran our Multi30k experiments on; then, larger models. Since our dataset is quite a bit smaller than typical translation corpora, we also estimate the sizes of models used for larger translation datasets such as WMT, using the hyperparameters reported in their papers.
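A back-of-the-envelope way to see where the size comparison comes from (the 50M parameter count below is hypothetical, purely for illustration):

```python
def model_size_mb(num_params, bits_per_weight):
    # parameters * bits per parameter, converted to megabytes
    return num_params * bits_per_weight / 8 / 1e6

num_params = 50_000_000                  # hypothetical parameter count
print(model_size_mb(num_params, 32))     # ~200 MB at full precision
print(model_size_mb(num_params, 1))      # ~6.25 MB with 1-bit weights
# (binarized models also store a small number of float scaling factors alpha)
```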
A shortcut to do all the setup:
# creates a virtual environment and downloads the data
$ bash setup.sh
To set up the Python code, create a Python 3 environment with the following:
# create a virtual environment
$ python3 -m venv env
# activate environment
$ source env/bin/activate
# install all requirements
$ pip install -r requirements.txt
If you add a new package, you will have to update requirements.txt with the following command:
# add new packages
$ pip freeze > requirements.txt
And if you want to deactivate the virtual environment:
# deactivate the virtual env
$ deactivate
# if using python 3.7.x, no official TensorFlow distribution is available, so use this on macOS:
$ pip install https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.0-py3-none-any.whl
# use this on Linux:
$ pip install https://github.com/adrianodennanni/tensorflow-1.12.0-cp37-cp37m-linux_x86_64/blob/master/tensorflow-1.12.0-cp37-cp37m-linux_x86_64.whl?raw=true
- XNOR-Net: Paper
- Multi-bit Quantization Networks: Paper
- Binarized LSTM Language Model: Paper
- Fairseq Convolutional Sequence to Sequence Learning: Paper
- Quasi-Recurrent Neural Networks: Paper
- WMT 14 Translation Task: Paper
- Attention Is All You Need: Paper
- Imagination Improves Multimodal Translation: Paper
- Multi30k Dataset: Paper
- IWSLT: Paper