A PyTorch reimplementation, based on daandouwe's parser, of the neural dependency parsers described in the following papers:
- Deep Biaffine Attention for Neural Dependency Parsing
- Parsing Thai Social Data: A New Challenge for Thai NLP
- KITTHAI: Thai Elementary Discourse Unit Segmentation
- Thai Dependency Parsing with Character Embedding
You can train on the Penn Treebank converted to Stanford Dependencies. We assume you have the PTB in the standard train/dev/test splits in CoNLL format, stored in one directory and named train.conll, dev.conll, and test.conll.
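For reference, each sentence in such a file has one token per line with ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and sentences are separated by blank lines. A toy example (the values are illustrative, not taken from this repo):

1	Ms.	_	NNP	NNP	_	2	nn	_	_
2	Haag	_	NNP	NNP	_	3	nsubj	_	_
3	plays	_	VBZ	VBZ	_	0	root	_	_
4	Elianti	_	NNP	NNP	_	3	dobj	_	_
5	.	_	.	.	_	3	punct	_	_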
First, extract a vocabulary:
mkdir vocab
./preprocess.py --data your/ptb/conll/dir --out vocab
Then, train a default model with the following arguments:
mkdir log checkpoints
./main.py train --data your/ptb/conll/dir
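For example, a run that also uses pretrained GloVe vectors and CNN character embeddings (all flags are documented in the option list below; the epoch and batch-size values are just illustrative) might look like:

./main.py train --data your/ptb/conll/dir --use-glove --use-chars --char-encoder cnn --epochs 30 --batch-size 32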
Training can be interrupted at any moment with Control-C; the current model will then be evaluated on the development set.
The following options are available:
usage: main.py {train,predict} [...]
Biaffine graph-based dependency parser
positional arguments:
{train,predict}
optional arguments:
-h, --help show this help message and exit
Data:
--data DATA location of the data corpus
--vocab VOCAB location of the preprocessed vocabulary
--disable-length-ordered
do not order sentences by length so batches have more
padding
Embedding options:
--use-glove use pretrained glove embeddings
--use-chars use character level word embeddings
--char-encoder {rnn,cnn,transformer}
type of character encoder used for word embeddings
--filter-factor FILTER_FACTOR
controls output size of cnn character embedding
--disable-words do not use words as input
--disable-tags do not use tags as input
--word-emb-dim WORD_EMB_DIM
size of word embeddings
--tag-emb-dim TAG_EMB_DIM
size of tag embeddings
--emb-dropout EMB_DROPOUT
dropout used on embeddings
Encoder options:
--encoder {rnn,cnn,transformer,none}
type of sentence encoder used
RNN options:
--rnn-type {RNN,GRU,LSTM}
type of rnn
--rnn-hidden RNN_HIDDEN
number of hidden units in rnn
--rnn-num-layers RNN_NUM_LAYERS
number of layers
--batch-first BATCH_FIRST
whether rnn input tensors are batch-first
--rnn-dropout RNN_DROPOUT
dropout used in rnn
CNN options:
--cnn-num-layers CNN_NUM_LAYERS
number of convolution layers
--kernel-size KERNEL_SIZE
size of convolution kernel
--cnn-dropout CNN_DROPOUT
dropout used in cnn
Transformer options:
--N N number of transformer layers
--d-model D_MODEL transformer model (hidden) size
--d-ff D_FF transformer feed-forward size
--h H number of attention heads
--trans-dropout TRANS_DROPOUT
dropout used in transformer
Biaffine classifier arguments:
--mlp-arc-hidden MLP_ARC_HIDDEN
number of hidden units in arc MLP
--mlp-lab-hidden MLP_LAB_HIDDEN
number of hidden units in label MLP
--mlp-dropout MLP_DROPOUT
dropout used in mlps
Training arguments:
--multi-gpu enable training on multiple GPUs
--lr LR initial learning rate
--epochs EPOCHS number of epochs of training
--batch-size BATCH_SIZE
batch size
--seed SEED random seed
--disable-cuda disable cuda
--print-every PRINT_EVERY
report interval
--plot-every PLOT_EVERY
plot interval
--logdir LOGDIR directory to log losses
--checkpoints CHECKPOINTS
path to save the final model
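The --mlp-arc-hidden and --mlp-lab-hidden options size the MLPs that feed the biaffine scorers of Dozat and Manning (2017). As a minimal sketch of the arc-scoring step (the class name and shapes here are illustrative assumptions, not this repo's actual module):

import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    # Minimal sketch of the biaffine arc scorer of Dozat & Manning (2017);
    # illustrative only, not this repository's exact API.
    def __init__(self, dim):
        super().__init__()
        # The extra input column appended in forward() plays the role of the
        # head-side bias term in the paper.
        self.weight = nn.Parameter(torch.empty(dim, dim + 1))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, head, dep):
        # head, dep: (batch, n, dim) outputs of the arc-head and arc-dep MLPs.
        ones = dep.new_ones(dep.size(0), dep.size(1), 1)
        dep = torch.cat([dep, ones], dim=-1)             # (batch, n, dim + 1)
        # scores[b, j, i]: plausibility of token j being the head of token i.
        return head @ self.weight @ dep.transpose(1, 2)  # (batch, n, n)

For instance, scoring a toy batch of 2 sentences of 10 tokens each:

scorer = BiaffineArcScorer(400)
h = torch.randn(2, 10, 400)
print(scorer(h, h).shape)  # torch.Size([2, 10, 10])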
Requirements:
python>=3.6.0
torch>=0.3.0
numpy
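With pip, for example, these can be installed as:

pip install torch numpy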
TODO:
- Add MST algorithm for decoding.
- Write predicted parses to a conll file (see the sketch after this list).
- A couple of full runs of the model for results.
- Enable multi-GPU training
- Work on character-level embedding of words (CNN or LSTM).
Done:
- Implement RNN options: RNN, GRU, (RAN?)
- Character level word embeddings: CNN
- Character level word embeddings: RNN
- Different encoder: Transformer.
- Different encoder: CNN (again see spaCy's parser).
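For the conll-output item above, a hypothetical helper (the function name and the per-token tuple layout are assumptions, not this repo's code) could look like:

def write_conll(path, sentences):
    # sentences: a list of sentences, each a list of (form, tag, head, deprel)
    # tuples. Columns follow CoNLL-X:
    # ID FORM LEMMA CPOS POS FEATS HEAD DEPREL PHEAD PDEPREL.
    with open(path, 'w') as f:
        for sentence in sentences:
            for i, (form, tag, head, deprel) in enumerate(sentence, 1):
                f.write('\t'.join(
                    [str(i), form, '_', tag, tag, '_',
                     str(head), deprel, '_', '_']) + '\n')
            f.write('\n')  # a blank line terminates each sentence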
References:

@inproceedings{9045639,
  author={Singkul, Sattaya and Khampingyot, Borirat and Maharattamalai, Nattasit and Taerungruang, Supawat and Chalothorn, Tawunrat},
  booktitle={2019 14th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
  title={Parsing Thai Social Data: A New Challenge for Thai NLP},
  year={2019},
  pages={1-7},
  doi={10.1109/iSAI-NLP48611.2019.9045639}
}

@inproceedings{8930002,
  author={Singkul, Sattaya and Woraratpanya, Kuntpong},
  booktitle={2019 11th International Conference on Information Technology and Electrical Engineering (ICITEE)},
  title={Thai Dependency Parsing with Character Embedding},
  year={2019},
  pages={1-5},
  doi={10.1109/ICITEED.2019.8930002}
}

@misc{dozat2017deep,
  title={Deep Biaffine Attention for Neural Dependency Parsing},
  author={Timothy Dozat and Christopher D. Manning},
  year={2017},
  eprint={1611.01734},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}