This is an edited version of SuPar which adds the ability to train and evaluate dependency parser models using ELMo embeddings. Both monolingual and cross-lingual setups are supported. The cross-lingual option supports embeddings mapped with vecmap
(https://github.com/artetxem/vecmap, https://github.com/EMBEDDIA/vecmap-changes), MUSE
(https://github.com/facebookresearch/MUSE) and ELMoGAN
(https://github.com/EMBEDDIA/elmogan).
This version currently does not support any embeddings other than ELMo. To install, run:
$ git clone https://github.com/EMBEDDIA/parser && cd parser && git checkout elmo
$ python setup.py install
The main script for training and evaluating the parser is located at supar/parsers/biaffine_dependency.py; run it with --help to see all available options.
The original SuPar README follows below.
SuPar provides a collection of state-of-the-art syntactic parsing models with Biaffine Parser (Dozat and Manning, 2017) as the basic architecture:
- Biaffine Dependency Parser (Dozat and Manning, 2017)
- CRFNP Dependency Parser (Koo et al., 2007; Ma and Hovy, 2017)
- CRF Dependency Parser (Zhang et al., 2020a)
- CRF2o Dependency Parser (Zhang et al., 2020a)
- CRF Constituency Parser (Zhang et al., 2020b)
You can load released pretrained models for the above parsers and obtain dependency/constituency parsing trees very conveniently, as detailed in Usage.
The implementations of several popular and well-known algorithms, like MST (ChuLiu/Edmonds), Eisner, CKY, MatrixTree, TreeCRF, are also integrated in this package.
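As a rough illustration of what the MatrixTree component computes (a standalone sketch in plain PyTorch, not SuPar's internal API; the function name and score layout are invented for the example), the partition function over all non-projective trees is a determinant of a modified graph Laplacian built from exponentiated arc scores:

import torch

def matrix_tree_log_partition(scores):
    # scores: [n+1, n+1] arc scores where scores[h, m] scores head h -> modifier m
    # and index 0 is the artificial root
    A = scores[1:, 1:].exp()       # word-to-word arc weights
    r = scores[0, 1:].exp()        # root-to-word arc weights
    L = torch.diag(A.sum(0)) - A   # graph Laplacian (the diagonal of A cancels out)
    L[0] = r                       # single-root construction of Koo et al. (2007)
    return torch.logdet(L)         # log of the total weight of all trees

scores = torch.randn(6, 6)         # artificial root + 5 words, random scores
print(matrix_tree_log_partition(scores))

Differentiating this log-partition with respect to the arc scores yields arc marginals, which is the standard route by which TreeCRF-style models expose probabilities.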
Besides the POS tag embeddings used by the vanilla Biaffine Parser as auxiliary encoder inputs, SuPar can optionally utilize CharLSTM/BERT layers to produce character/subword-level features. Among these, CharLSTM is the default option, which avoids the extra requirement of generating POS tags as well as the inefficiency of BERT.
The BERT module in SuPar extracts BERT representations from pretrained models in transformers. It is also compatible with other language models such as XLNet, RoBERTa and ELECTRA.
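For intuition only, here is a minimal standalone sketch of pulling subword-level features from a pretrained model via transformers; this is not SuPar's internal BERT module, and the subword pooling and projection that SuPar applies on top are omitted:

import torch
from transformers import AutoTokenizer, AutoModel

name = 'bert-base-cased'    # e.g. 'roberta-base' or 'xlnet-base-cased' would also work here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer('She enjoys playing tennis .', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
subword_features = outputs[0]    # last hidden states, shape [1, n_subwords, hidden_size]
print(subword_features.shape)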
The CRF models for Dependency/Constituency parsing are our recent works published in ACL 2020 and IJCAI 2020 respectively. If you are interested in them, please cite:
@inproceedings{zhang-etal-2020-efficient,
title = {Efficient Second-Order {T}ree{CRF} for Neural Dependency Parsing},
author = {Zhang, Yu and Li, Zhenghua and Zhang, Min},
booktitle = {Proceedings of ACL},
year = {2020},
url = {https://www.aclweb.org/anthology/2020.acl-main.302},
pages = {3295--3305}
}
@inproceedings{zhang-etal-2020-fast,
title = {Fast and Accurate Neural {CRF} Constituency Parsing},
author = {Zhang, Yu and Zhou, Houquan and Li, Zhenghua},
booktitle = {Proceedings of IJCAI},
year = {2020},
doi = {10.24963/ijcai.2020/560},
url = {https://doi.org/10.24963/ijcai.2020/560},
pages = {4046--4053}
}
SuPar can be installed via pip:
$ pip install -U supar
Installing from source is also supported:
$ git clone https://github.com/yzhangcs/parser && cd parser
$ python setup.py install
As a prerequisite, the following requirements should be satisfied:
- python: 3.7
- pytorch: >= 1.4
- transformers: >= 3.1
Currently, SuPar provides pretrained models for English and Chinese.
English models are trained on Penn Treebank (PTB) with 39,832 training sentences, while Chinese models are trained on Penn Chinese Treebank version 7 (CTB7) with 46,572 training sentences.
The performance and parsing speed of these models are listed in the following table. Note that punctuation is ignored in all evaluation metrics for PTB, but retained for CTB7.
Dataset | Type | Name | Metric | Performance | Speed (Sents/s)
---|---|---|---|---|---
PTB | Dependency | biaffine-dep-en | UAS/LAS | 96.03 / 94.37 | 1826.77
PTB | Dependency | biaffine-dep-bert-en | UAS/LAS | 96.69 / 95.15 | 646.66
PTB | Dependency | crfnp-dep-en | UAS/LAS | 96.01 / 94.42 | 2197.15
PTB | Dependency | crf-dep-en | UAS/LAS | 96.12 / 94.50 | 652.41
PTB | Dependency | crf2o-dep-en | UAS/LAS | 96.14 / 94.55 | 465.64
PTB | Constituency | crf-con-en | F1 | 94.18 | 923.74
PTB | Constituency | crf-con-bert-en | F1 | 95.26 | 503.99
CTB7 | Dependency | biaffine-dep-zh | UAS/LAS | 88.77 / 85.63 | 1155.50
CTB7 | Dependency | biaffine-dep-bert-zh | UAS/LAS | 91.81 / 88.94 | 395.28
CTB7 | Dependency | crfnp-dep-zh | UAS/LAS | 88.78 / 85.64 | 1323.75
CTB7 | Dependency | crf-dep-zh | UAS/LAS | 88.98 / 85.84 | 354.65
CTB7 | Dependency | crf2o-dep-zh | UAS/LAS | 89.35 / 86.25 | 217.09
CTB7 | Constituency | crf-con-zh | F1 | 88.67 | 639.27
CTB7 | Constituency | crf-con-bert-zh | F1 | 91.40 | 300.15
All results were measured on a machine with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz and an Nvidia GeForce GTX 1080 Ti GPU.
SuPar is very easy to use. You can download a pretrained model and run syntactic parsing over sentences with a few lines of code:
>>> from supar import Parser
>>> parser = Parser.load('biaffine-dep-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], prob=True, verbose=False)
100%|####################################| 1/1 00:00<00:00, 85.15it/s
The call to parser.predict will return an instance of supar.utils.Dataset containing the predicted syntactic trees. For dependency parsing, you can either access each sentence held in dataset or an individual field of all the trees.
>>> print(dataset.sentences[0])
1 She _ _ _ _ 2 nsubj _ _
2 enjoys _ _ _ _ 0 root _ _
3 playing _ _ _ _ 2 xcomp _ _
4 tennis _ _ _ _ 3 dobj _ _
5 . _ _ _ _ 2 punct _ _
>>> print(f"arcs: {dataset.arcs[0]}\n"
f"rels: {dataset.rels[0]}\n"
f"probs: {dataset.probs[0].gather(1,torch.tensor(dataset.arcs[0]).unsqueeze(1)).squeeze(-1)}")
arcs: [2, 0, 2, 3, 2]
rels: ['nsubj', 'root', 'xcomp', 'dobj', 'punct']
probs: tensor([1.0000, 0.9999, 0.9642, 0.9686, 0.9996])
Probabilities can be returned along with the results if prob=True.
For CRF parsers, marginal probabilities are available if mbr=True, i.e., when using MBR decoding.
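As a sketch (the exact keyword arguments accepted by predict may differ between versions, so treat this as an assumption rather than the documented call), a CRF dependency parser could be asked for marginals like this:
>>> crf = Parser.load('crf-dep-en')
>>> dataset = crf.predict([['She', 'enjoys', 'playing', 'tennis', '.']],
...                       prob=True, mbr=True, verbose=False)
>>> dataset.probs[0]  # arc marginals under the TreeCRF rather than plain softmax scores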
Note that SuPar requires pre-tokenized sentences as inputs. If you'd like to parse untokenized raw text, you can call nltk.word_tokenize to do the tokenization first:
>>> import nltk
>>> text = nltk.word_tokenize('She enjoys playing tennis.')
>>> print(parser.predict([text], verbose=False).sentences[0])
100%|####################################| 1/1 00:00<00:00, 74.20it/s
1 She _ _ _ _ 2 nsubj _ _
2 enjoys _ _ _ _ 0 root _ _
3 playing _ _ _ _ 2 xcomp _ _
4 tennis _ _ _ _ 3 dobj _ _
5 . _ _ _ _ 2 punct _ _
If there are many sentences to parse, SuPar also supports loading them from a file and saving the results to the pred file if specified.
>>> dataset = parser.predict('data/ptb/test.conllx', pred='pred.conllx')
2020-07-25 18:13:50 INFO Loading the data
2020-07-25 18:13:52 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:13:52 INFO Making predictions on the dataset
100%|####################################| 13/13 00:01<00:00, 10.58it/s
2020-07-25 18:13:53 INFO Saving predicted results to pred.conllx
2020-07-25 18:13:54 INFO 0:00:01.335261s elapsed, 1809.38 Sents/s
Please make sure the file is in CoNLL-X format. If some fields are missing, you can use underscores as placeholders. An interface is provided for converting a list of tokens into a CoNLL-X format string.
>>> from supar.utils import CoNLL
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1 She _ _ _ _ _ _ _ _
2 enjoys _ _ _ _ _ _ _ _
3 playing _ _ _ _ _ _ _ _
4 tennis _ _ _ _ _ _ _ _
5 . _ _ _ _ _ _ _ _
For Universal Dependencies (UD), CoNLL-U files are also accepted; comment lines in the file are preserved before prediction and restored during post-processing.
>>> import os
>>> import tempfile
>>> text = '''# text = But I found the location wonderful and the neighbors very kind.
1\tBut\t_\t_\t_\t_\t_\t_\t_\t_
2\tI\t_\t_\t_\t_\t_\t_\t_\t_
3\tfound\t_\t_\t_\t_\t_\t_\t_\t_
4\tthe\t_\t_\t_\t_\t_\t_\t_\t_
5\tlocation\t_\t_\t_\t_\t_\t_\t_\t_
6\twonderful\t_\t_\t_\t_\t_\t_\t_\t_
7\tand\t_\t_\t_\t_\t_\t_\t_\t_
7.1\tfound\t_\t_\t_\t_\t_\t_\t_\t_
8\tthe\t_\t_\t_\t_\t_\t_\t_\t_
9\tneighbors\t_\t_\t_\t_\t_\t_\t_\t_
10\tvery\t_\t_\t_\t_\t_\t_\t_\t_
11\tkind\t_\t_\t_\t_\t_\t_\t_\t_
12\t.\t_\t_\t_\t_\t_\t_\t_\t_
'''
>>> path = os.path.join(tempfile.mkdtemp(), 'data.conllx')
>>> with open(path, 'w') as f:
... f.write(text)
...
>>> print(parser.predict(path, verbose=False).sentences[0])
100%|####################################| 1/1 00:00<00:00, 68.60it/s
# text = But I found the location wonderful and the neighbors very kind.
1 But _ _ _ _ 3 cc _ _
2 I _ _ _ _ 3 nsubj _ _
3 found _ _ _ _ 0 root _ _
4 the _ _ _ _ 5 det _ _
5 location _ _ _ _ 6 nsubj _ _
6 wonderful _ _ _ _ 3 xcomp _ _
7 and _ _ _ _ 6 cc _ _
7.1 found _ _ _ _ _ _ _ _
8 the _ _ _ _ 9 det _ _
9 neighbors _ _ _ _ 11 dep _ _
10 very _ _ _ _ 11 advmod _ _
11 kind _ _ _ _ 6 conj _ _
12 . _ _ _ _ 3 punct _ _
Constituency trees can be parsed in a similar manner. The returned dataset holds all predicted trees represented as nltk.Tree objects.
>>> parser = Parser.load('crf-con-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], verbose=False)
100%|####################################| 1/1 00:00<00:00, 75.86it/s
>>> print(f"trees:\n{dataset.trees[0]}")
trees:
(TOP
(S
(NP (_ She))
(VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
(_ .)))
>>> dataset = parser.predict('data/ptb/test.pid', pred='pred.pid')
2020-07-25 18:21:28 INFO Loading the data
2020-07-25 18:21:33 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:21:33 INFO Making predictions on the dataset
100%|####################################| 13/13 00:02<00:00, 5.30it/s
2020-07-25 18:21:36 INFO Saving predicted results to pred.pid
2020-07-25 18:21:36 INFO 0:00:02.455740s elapsed, 983.82 Sents/s
Analogous to dependency parsing, a sentence can be conveniently transformed into an empty nltk.Tree:
>>> from supar.utils import Tree
>>> print(Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], root='TOP'))
(TOP (_ She) (_ enjoys) (_ playing) (_ tennis) (_ .))
To train a model from scratch, it is preferable to use the command-line interface, which is more flexible and customizable. Here are some training examples:
# Biaffine Dependency Parser
# some common and default arguments are stored in config.ini
$ python -m supar.cmds.biaffine_dependency train -b -d 0 \
-c config.ini \
-p exp/ptb.biaffine.dependency.char/model \
-f char
# to use BERT, `-f` and `--bert` (default to bert-base-cased) should be specified
# if you'd like to use XLNet, you can type `--bert xlnet-base-cased`
$ python -m supar.cmds.biaffine_dependency train -b -d 0 \
-p exp/ptb.biaffine.dependency.bert/model \
-f bert \
--bert bert-base-cased
# CRF Dependency Parser
# for CRF dependency parsers, you should use `--proj` to discard all non-projective training instances
# optionally, you can use `--mbr` to perform MBR decoding
$ python -m supar.cmds.crf_dependency train -b -d 0 \
-p exp/ptb.crf.dependency.char/model \
-f char \
--mbr \
--proj
# CRF Constituency Parser
# training the CRF constituency parser is similar to training the dependency parsers
$ python -m supar.cmds.crf_constituency train -b -d 0 \
-p exp/ptb.crf.constituency.char/model -f char \
--mbr
For more instructions on training, please type python -m supar.cmds.<parser> train -h.
Alternatively, SuPar provides some equivalent command entry points registered in setup.py: biaffine-dependency, crfnp-dependency, crf-dependency, crf2o-dependency and crf-constituency.
$ biaffine-dependency train -b -d 0 -c config.ini -p exp/ptb.biaffine.dependency.char/model -f char
To accommodate large models, distributed training is also supported:
$ python -m torch.distributed.launch --nproc_per_node=4 --master_port=10000 \
-m supar.cmds.biaffine_dependency train -b -d 0,1,2,3 \
-p exp/ptb.biaffine.dependency.char/model \
-f char
You can consult the PyTorch documentation and tutorials for more details.
The evaluation process resembles prediction:
>>> parser = Parser.load('biaffine-dep-en')
>>> loss, metric = parser.evaluate('data/ptb/test.conllx')
2020-07-25 20:59:17 INFO Loading the data
2020-07-25 20:59:19 INFO
Dataset(n_sentences=2416, n_batches=11, n_buckets=8)
2020-07-25 20:59:19 INFO Evaluating the dataset
2020-07-25 20:59:20 INFO loss: 0.2326 - UCM: 61.34% LCM: 50.21% UAS: 96.03% LAS: 94.37%
2020-07-25 20:59:20 INFO 0:00:01.253601s elapsed, 1927.25 Sents/s
- Timothy Dozat and Christopher D. Manning. 2017. Deep Biaffine Attention for Neural Dependency Parsing.
- Tao Ji, Yuanbin Wu and Man Lan. 2019. Graph-based Dependency Parsing with Graph Neural Networks.
- Terry Koo, Amir Globerson, Xavier Carreras and Michael Collins. 2007. Structured Prediction Models via the Matrix-Tree Theorem.
- Xuezhe Ma and Eduard Hovy. 2017. Neural Probabilistic Model for Non-projective MST Parsing.
- Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig and Eduard Hovy. 2018. Stack-Pointer Networks for Dependency Parsing.
- Xinyu Wang, Jingxian Huang, and Kewei Tu. 2019. Second-Order Semantic Dependency Parsing with End-to-End Neural Networks.
- Yu Zhang, Houquan Zhou and Zhenghua Li. 2020. Fast and Accurate Neural CRF Constituency Parsing.
- Yu Zhang, Zhenghua Li and Min Zhang. 2020. Efficient Second-Order TreeCRF for Neural Dependency Parsing.