Attention is All you Need (Transformer)

This repository is deprecated. Please refer to the updated codebase here:

This repository implements, in PyTorch, the Transformer model introduced in the paper "Attention is All you Need", as described in its NIPS 2017 version:

The overall model architecture is as shown in the figure:

The code in this repository implements the following features:

  • Positional Encoding
  • Multi-Head Dot-Product Attention
  • Positional Attention from "Non-Autoregressive Neural Machine Translation"
  • Label Smoothing
  • Warm-up steps based training of Adam Optimizer
  • Shared weights of embedding and softmax layers
  • Beam Search with length normalisation
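Two of the features listed above have closed-form definitions in the "Attention is All you Need" paper: the sinusoidal positional encoding and the warm-up learning-rate schedule used with Adam. The following is a minimal pure-Python sketch of those two formulas, not the code from this repository; the function names are hypothetical, and the defaults `d_model=512` and `warmup_steps=4000` are the paper's base settings rather than values taken from this codebase.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    """
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def warmup_learning_rate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a 1-based training step.

    The rate grows linearly for the first warmup_steps steps, then
    decays proportionally to the inverse square root of the step.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    pe = positional_encoding(max_len=4, d_model=8)
    print(pe[0])  # position 0: sin(0) = 0.0 and cos(0) = 1.0 alternate
    for step in (1, 1000, 4000, 16000):
        print(step, warmup_learning_rate(step))
```

With these defaults the learning rate peaks at step 4000, where the two arguments of `min` are equal, and falls off afterwards.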

Software Requirements

The required packages can be installed using the requirements file:

pip install -r requirements.txt


Step 1: Preprocessing:

python -i data/ja_en -s-train train-big.ja -t-train train-big.en -s-valid dev.ja -t-valid dev.en -s-test test.ja -t-test test.en --save_data demo

Step 2: Train and Evaluate the model:

python -i data/ja_en --data demo --wbatchsize 4096 --batchsize 60 --tied --beam_size 5 --epoch 40 --layers 6 --multi_heads 8 --gpu 0
python -i data/ja_en --data demo --batchsize 60 --beam_size 5 --model_file "results/model.ckpt" --src data/ja_en/test.ja --gpu 0


Statistics of the datasets included in the data directory:

| Dataset | Train Set | Dev Set | Test Set |
|---|---|---|---|
| Japanese-English | 148,850 | 500 | 500 |
| IWSLT'15 English-Vietnamese | 133,317 | 1,553 | 1,268 |
| IWSLT'16 German-English | 98,132 | 887 | 1,565 |


All experiments were performed on a Titan Xp GPU with 12 GB of memory. BLEU scores are computed using beam search.
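This README does not state which length-normalisation formula the beam search uses. As an illustration only, the sketch below assumes one common variant: each hypothesis is ranked by its total log-probability divided by `length ** alpha`. The function `beam_search`, the toy model `toy_step`, and the parameter `alpha` are all hypothetical names introduced for this example and are not taken from the repository.

```python
import math


def beam_search(step_log_probs, beam_size, eos, max_len, alpha=1.0):
    """Toy beam search with length-normalised final ranking.

    step_log_probs(prefix) must return a dict mapping each candidate
    next token to its log-probability given the tuple `prefix`.
    Assumption: final score = total log-prob / (length ** alpha).
    """
    beams = [((), 0.0)]  # (token tuple, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_p in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the beam_size best expansions by raw log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    best = max(finished, key=lambda h: h[1] / (len(h[0]) ** alpha))
    return best[0], best[1] / (len(best[0]) ** alpha)


def toy_step(prefix):
    """Hypothetical two-token model, used only to exercise the search."""
    if not prefix:
        return {"a": math.log(0.4), "</s>": math.log(0.6)}
    return {"a": math.log(0.05), "</s>": math.log(0.95)}


if __name__ == "__main__":
    # alpha=1.0 divides by length, so the two-token hypothesis wins.
    print(beam_search(toy_step, beam_size=2, eos="</s>", max_len=3, alpha=1.0)[0])
    # prints ('a', '</s>')
    # alpha=0.0 disables normalisation, so the shortest hypothesis wins.
    print(beam_search(toy_step, beam_size=2, eos="</s>", max_len=3, alpha=0.0)[0])
    # prints ('</s>',)
```

Without normalisation, every extra token adds a negative log-probability, so shorter outputs tend to be preferred; dividing by a function of the length is one way to offset that bias.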

Ja->En translation

Dataset URL

Method Layers BLEU (dev) BLEU (test) Parameters Words / Sec
Transformer (self) 1 33.16 36.52 32.5 M 60.1K
Transformer (self) 6 34.65 69.3 M 15.5K
BiLSTM encoder (OpenNMT-py) 1 29.55 41.3 M 31.5K
LSTM encoder (OpenNMT-py) 1 30.15 41.8 M 35.5K
Transformer (OpenNMT-py) 1 26.83 42.3 M 52.5K
BiLSTM encoder (XNMT) 1 29.58 31.39 9.1K* (Target Words)
Transformer (XNMT) 1 25.55 2.2K (Target Words)

*One epoch completes in around 180 seconds.

En->Vi translation

Dataset URL

Method Layers BLEU (dev) BLEU (test) Parameters Words / Sec
Transformer (self) 1 21.96 41.2 M 57.8K
Transformer (self) 2 22.96 48.5 M 40.2K
BiLSTM encoder (OpenNMT-py) 1 21.99 53.5 M 30.5K
LSTM encoder (OpenNMT-py) 1 21.04 53.9 M 29.5K
Transformer (OpenNMT-py) 1 19.26 55.3 M 48.5K
BiLSTM encoder (XNMT) 1 21.31 23.87 7.2K (Target Words)
Transformer (XNMT) 1

De->En translation (Dev Set BLEU Scores)

Dataset URL. This dataset is provided in tokenised and lowercased form (tokenised using NLTK).

Method Layers BLEU (dev) BLEU (test) Parameters Words / Sec
Transformer (self) 1 21.91 54.5 M 44.5K
Transformer (self) 2
BiLSTM encoder (OpenNMT-py) 1 23.10 23.71 73.7 M
LSTM encoder (OpenNMT-py) 1
Transformer (OpenNMT-py) 1
BiLSTM encoder (XNMT) 1 22.87 23.43 8K
Transformer (XNMT) 1
