There have been a lot of known problems caused by using different software versions. Please check your versions before opening issues or emailing me.


  • Python >= 3.4
  • unzip, wget

Python Packages

  • tensorflow-gpu >= 1.5.0
  • spaCy >= 2.0.0
  • tqdm
  • ujson


To download and preprocess the data, run

# download SQuAD and Glove
# preprocess the data
python --mode prepro

Hyper parameters are stored in To debug/train/test the model, run

python --mode debug/train/test

To get the official score, run

python ~/data/squad/dev-v1.1.json log/answer/answer.json

The default directory for TensorBoard log files is log/event.

See release for trained model.

Detailed Implementation

  • The original paper uses additive attention, which consumes lots of memory. This project adopts scaled multiplicative attention presented in Attention Is All You Need.
  • This project adopts variational dropout presented in A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.
  • To solve the degradation problem in stacked RNNs, the outputs of each layer are concatenated to produce the final output.
  • When the loss on the dev set increases over a certain period, the learning rate is halved.
  • During prediction, the project adopts the search method presented in Machine Comprehension Using Match-LSTM and Answer Pointer.
  • To address efficiency issues, this implementation uses a bucketing method (contributed by xiongyifan) and CudnnGRU. The bucketing method can speed up training, but lowers the F1 score by 0.3%.
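The memory argument in the first bullet can be illustrated with a minimal sketch of scaled multiplicative (dot-product) attention for a single query. Additive attention builds a hidden-size intermediate for every query/key pair before reducing it to a score, while the multiplicative form only needs one dot product per pair. The function names `softmax` and `scaled_dot_attention` are illustrative and not taken from this repository, and the sketch uses plain Python lists rather than TensorFlow tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_attention(query, keys, values):
    """Attend from one query vector over lists of key and value vectors.

    Each score is a dot product divided by sqrt(d), where d is the
    query/key dimension. Illustrative helper, not the project's code.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the values.
    dim = len(values[0])
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return context, weights
```

A key that matches the query more closely receives a larger weight, and the weights always sum to one.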
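The variational dropout mentioned in the list differs from ordinary dropout in that one mask is sampled per sequence and reused at every time step. The sketch below shows only that idea on nested lists; `variational_dropout`, `keep_prob`, and the `rng` argument are hypothetical names chosen for illustration, not the project's API.

```python
import random


def variational_dropout(sequence, keep_prob, rng):
    """Apply one shared dropout mask to every time step of a sequence.

    sequence: list of time steps, each a list of floats of equal length.
    Illustrative helper; kept units are scaled by 1 / keep_prob.
    """
    dim = len(sequence[0])
    # Sample the mask once, not once per time step.
    mask = [
        1.0 / keep_prob if rng.random() < keep_prob else 0.0
        for _ in range(dim)
    ]
    return [[x * m for x, m in zip(step, mask)] for step in sequence]
```

Because the mask is shared, a feature that is dropped at the first step stays dropped for the whole sequence.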
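The learning-rate rule from the list can be sketched as a small stateful schedule. The README does not say how long the "certain period" is, so the `patience` parameter below (the number of consecutive evaluations without improvement before halving) is an assumption, as is the class name `HalvingSchedule`.

```python
class HalvingSchedule:
    """Halve the learning rate when the dev loss stops improving.

    Hypothetical helper: `patience` is an assumed stand-in for the
    unspecified period mentioned in the README.
    """

    def __init__(self, learning_rate, patience):
        self.learning_rate = learning_rate
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_evals = 0

    def update(self, dev_loss):
        """Record one dev evaluation and return the current learning rate."""
        if dev_loss < self.best_loss:
            self.best_loss = dev_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.learning_rate /= 2.0
                self.bad_evals = 0
        return self.learning_rate
```

With `patience=2`, two consecutive evaluations that fail to beat the best dev loss cut the rate in half, and the counter then starts over.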
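The prediction-time search in the list avoids picking the start and end positions independently, which can yield an end before the start. A minimal sketch, assuming the search maximizes the product of start and end probabilities over valid spans of bounded length, is shown below. The name `best_span` and the `max_len` limit are illustrative assumptions rather than values read from this repository.

```python
def best_span(start_probs, end_probs, max_len):
    """Return the (start, end) pair maximizing start_probs[i] * end_probs[j].

    Only spans with i <= j and at most max_len tokens are considered.
    Illustrative helper, not the project's implementation.
    """
    best = (0, 0)
    best_score = -1.0
    for i, p_start in enumerate(start_probs):
        last = min(i + max_len, len(end_probs))
        for j in range(i, last):
            score = p_start * end_probs[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best, best_score
```

For start probabilities `[0.1, 0.6, 0.3]` and end probabilities `[0.7, 0.1, 0.2]`, the independent argmax would give start 1 and end 0, which is not a valid span. The joint search returns the span from position 1 to position 2.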
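Bucketing, as referred to in the last bullet, generally means grouping examples of similar length so that each batch needs less padding. The sketch below shows one simple way to do this with fixed length boundaries. It is not the contributed implementation, and `bucket_batches`, `boundaries`, and `batch_size` are hypothetical names.

```python
def bucket_batches(examples, boundaries, batch_size):
    """Group token lists into batches of similar length.

    boundaries: ascending length thresholds; an example longer than k of
    them goes to bucket k. Illustrative helper only.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for example in examples:
        index = sum(1 for bound in boundaries if len(example) > bound)
        buckets[index].append(example)
    batches = []
    for bucket in buckets:
        # Slice each bucket into batches; the last one may be smaller.
        for k in range(0, len(bucket), batch_size):
            batches.append(bucket[k:k + batch_size])
    return batches
```

Batches built this way are no longer drawn uniformly from the whole training set, which is one plausible reason for a small change in score, though the README does not state the cause.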



EM F1
original paper 71.1 79.5
this project 71.07 79.51

Training Time (s/it)

Native Native + Bucket Cudnn Cudnn + Bucket
E5-2640 6.21 3.56 - -
TITAN X 2.56 1.31 0.41 0.28


These settings may increase the score but are not used in the model by default. You can turn these settings on in