Structured-Self-Attentive-Sentence-Embedding

An open-source implementation of the paper "A Structured Self-Attentive Sentence Embedding" (Lin et al., 2017), published by IBM and MILA: https://arxiv.org/abs/1703.03130

Requirements

PyTorch: http://pytorch.org/

spaCy: https://spacy.io/

Please refer to https://github.com/pytorch/examples/tree/master/snli for information about obtaining the GloVe model in PyTorch format (.pt). The saved model should be a tuple (dict, torch.FloatTensor, int): the first element maps each word to its index, the second is a word_count * dim tensor holding the word embeddings, and the third is the dimension of the word embeddings.
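
If you prefer to build the .pt file yourself, a minimal sketch such as the following produces a tuple with that layout (this script is not part of the repository; the file names are illustrative):

```python
import torch

def build_glove_pt(txt_path, out_path):
    """Convert a raw GloVe text file into the (dict, FloatTensor, int) tuple described above."""
    word2idx, vectors = {}, []
    with open(txt_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word2idx[parts[0]] = len(word2idx)
            vectors.append([float(x) for x in parts[1:]])
    emb = torch.FloatTensor(vectors)                     # word_count x dim
    torch.save((word2idx, emb, emb.size(1)), out_path)  # (dict, FloatTensor, int)

build_glove_pt('glove.42B.300d.txt', 'glove.42B.300d.pt')
```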

Usage

Tokenization

python tokenizer-yelp.py --input [Yelp dataset] --output [output path, will be a json file] --dict [output dictionary path, will be a json file]
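
As a rough illustration of what the spaCy side of this step looks like (the actual logic lives in tokenizer-yelp.py; the "text"/"stars" field names follow the Yelp review JSON, and the spaCy model name and output keys are assumptions):

```python
import json
import spacy

nlp = spacy.load('en_core_web_sm')  # assumption: any installed English spaCy model works here

def tokenize_review(json_line):
    # One line of the Yelp dataset is one JSON review with a "text" and a "stars" field.
    review = json.loads(json_line)
    tokens = [t.text.lower() for t in nlp.tokenizer(review['text'])]
    return {'label': int(review['stars']), 'text': tokens}
```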

Training Model

python train.py \
--emsize [word embedding size default 300] \
--nhid [hidden layer size, default 300] \
--nlayers [number of hidden layers in the Bi-LSTM, default 2] \
--attention-unit [attention unit number, d_a in the paper, default 350] \
--attention-hops [hop number, r in the paper, default 1] \
--dropout [dropout ratio, default 0.5] \
--nfc [hidden layer size for MLP in the classifier, default 512] \
--lr [learning rate, default 0.001] \
--epochs [epoch number for training, default 40] \
--seed [random seed for reproducibility, default 1111] \
--log-interval [the interval for reporting training loss, default 200] \
--batch-size [size of a batch in training procedure, default 32] \
--optimizer [type of the optimizer, default Adam] \
--penalization-coeff [coefficient of the Frobenius Norm penalization term, default 1.0] \
--class-number [number of classes for the final classification] \
--save [path to save model] \
--dictionary [location of the dictionary generated by the tokenizer] \
--word-vector [location of the initial word vector, e.g. GloVe, should be a torch .pt model] \
--train-data [location of training data, should be in the same format as the tokenizer output] \
--val-data [development set] \
--test-data [location of testing dataset] \
--cuda [use GPU for training; omit this flag when training on CPU]
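
For reference, several of these flags map directly onto the attention mechanism of the paper: --attention-unit is d_a, --attention-hops is r, and --penalization-coeff scales the Frobenius-norm term added to the cross-entropy loss. A compact sketch of that mechanism, following the equations in the paper rather than copying models.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    def __init__(self, nhid, attention_unit, attention_hops):
        super().__init__()
        self.ws1 = nn.Linear(2 * nhid, attention_unit, bias=False)        # W_s1
        self.ws2 = nn.Linear(attention_unit, attention_hops, bias=False)  # W_s2

    def forward(self, H):
        # H: Bi-LSTM outputs, shape [batch, seq_len, 2 * nhid]
        A = F.softmax(self.ws2(torch.tanh(self.ws1(H))), dim=1)  # attention over tokens
        A = A.transpose(1, 2)                                    # [batch, r, seq_len]
        M = torch.bmm(A, H)                                      # sentence embedding, [batch, r, 2 * nhid]
        return M, A

def frobenius_penalty(A, coeff=1.0):
    # ||A A^T - I||_F^2, scaled by --penalization-coeff and added to the loss
    AAT = torch.bmm(A, A.transpose(1, 2))             # [batch, r, r]
    I = torch.eye(A.size(1), device=A.device)
    return coeff * ((AAT - I) ** 2).sum(dim=(1, 2)).mean()
```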

Differences between the paper and our implementation

  1. For faster Python-based tokenization, we use spaCy instead of the Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.shtml).

  2. For efficiency, we crop the Yelp comments to a maximum length of 500.

Example Experimental Command and Result

We followed Lin et al. (2017) to generate the dataset and obtained the following result:

python train.py --train-data "data/train.json" --val-data "data/dev.json" --test-data "data/test.json" --cuda --emsize 300 --nhid 300 --nfc 300 --dropout 0.5 --attention-unit 350 --epochs 10 --lr 0.001 --clip 0.5 --dictionary "data/Yelp/data/dict.json" --word-vector "data/GloVe/glove.42B.300d.pt" --save "models/model-medium.pt" --batch-size 50 --class-number 5 --optimizer Adam --attention-hops 4 --penalization-coeff 1.0 --log-interval 100
# test loss (cross entropy loss, without the Frobenius norm penalization) 0.7544
# test accuracy: 0.6690