Dependency parser using ELMo embeddings, based on deep biaffine attention.

EMBEDDIA/supar-elmo

SuPar ELMo

This is an edited version of SuPar that adds the ability to train and evaluate dependency parser models using ELMo embeddings. Both monolingual and cross-lingual settings are supported. The cross-lingual option supports mapping embeddings with vecmap (https://github.com/artetxem/vecmap https://github.com/EMBEDDIA/vecmap-changes), MUSE (https://github.com/facebookresearch/MUSE) and ELMoGAN (https://github.com/EMBEDDIA/elmogan).

This version currently supports ELMo embeddings only; no other embedding type can be used. To install, run:

$ git clone https://github.com/EMBEDDIA/parser && cd parser && git checkout elmo
$ python setup.py install

The main script to train and evaluate the parser is located at supar/parsers/biaffine_dependency.py. Run that file with --help appended to see all the options. The original SuPar readme follows below.

SuPar

SuPar provides a collection of state-of-the-art syntactic parsing models with Biaffine Parser (Dozat and Manning, 2017) as the basic architecture: the biaffine, CRFNP, CRF and CRF2o dependency parsers and the CRF constituency parser.

You can load released pretrained models for the above parsers and obtain dependency/constituency parsing trees very conveniently, as detailed in Usage.

Implementations of several well-known algorithms, such as MST (Chu-Liu/Edmonds), Eisner, CKY, MatrixTree and TreeCRF, are also integrated in this package.
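To give a feel for one of the algorithms named above, here is a minimal pure-Python sketch of first-order Eisner decoding, which finds the highest-scoring projective dependency tree. It is a textbook version written for illustration and is not SuPar's batched implementation; the function name `eisner` and the `scores[head][dependent]` layout (index 0 being an artificial root) are assumptions made for this example.

```python
def eisner(scores):
    """Return the head of each token (1-based, 0 = root) for the best projective tree.

    scores[h][m] is the score of the arc h -> m; index 0 is an artificial root.
    This sketch does not restrict the root to a single child.
    """
    n = len(scores)
    neg_inf = float("-inf")
    # comp / inc hold the best score of a complete / incomplete span (i, j).
    # Direction 1: the head is the left end i. Direction 0: the head is the right end j.
    comp = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    inc = [[[neg_inf, neg_inf] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[None, None] for _ in range(n)] for _ in range(n)]
    inc_bp = [[[None, None] for _ in range(n)] for _ in range(n)]

    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Incomplete span: join two complete halves, then add an arc between i and j.
            r = max(range(i, j), key=lambda k: comp[i][k][1] + comp[k + 1][j][0])
            base = comp[i][r][1] + comp[r + 1][j][0]
            inc[i][j][0], inc_bp[i][j][0] = base + scores[j][i], r  # arc j -> i
            inc[i][j][1], inc_bp[i][j][1] = base + scores[i][j], r  # arc i -> j
            # Complete span headed by j: complete left part plus incomplete right part.
            r = max(range(i, j), key=lambda k: comp[i][k][0] + inc[k][j][0])
            comp[i][j][0], comp_bp[i][j][0] = comp[i][r][0] + inc[r][j][0], r
            # Complete span headed by i: incomplete left part plus complete right part.
            r = max(range(i + 1, j + 1), key=lambda k: inc[i][k][1] + comp[k][j][1])
            comp[i][j][1], comp_bp[i][j][1] = inc[i][r][1] + comp[r][j][1], r

    # Follow the back-pointers from the full span headed by the root.
    heads = [0] * n
    stack = [(0, n - 1, 1, True)]
    while stack:
        i, j, direction, complete = stack.pop()
        if i == j:
            continue
        if complete:
            r = comp_bp[i][j][direction]
            if direction == 0:
                stack += [(i, r, 0, True), (r, j, 0, False)]
            else:
                stack += [(i, r, 1, False), (r, j, 1, True)]
        else:
            r = inc_bp[i][j][direction]
            if direction == 0:
                heads[i] = j
            else:
                heads[j] = i
            stack += [(i, r, 1, True), (r + 1, j, 0, True)]
    return heads[1:]


# Toy scores that reward exactly the arcs of the tree for "She enjoys playing tennis ."
gold = [2, 0, 2, 3, 2]
scores = [[0.0] * 6 for _ in range(6)]
for dep, head in enumerate(gold, 1):
    scores[head][dep] = 1.0
print(eisner(scores))  # [2, 0, 2, 3, 2]
```

The chart has O(n^2) cells and each cell scans O(n) split points, which gives the usual O(n^3) running time.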

Besides the POS tag embeddings used by the vanilla Biaffine Parser as auxiliary inputs to the encoder, SuPar can optionally use CharLSTM/BERT layers to produce character/subword-level features. CharLSTM is the default option, which avoids both the additional requirement of generating POS tags and the inefficiency of BERT. The BERT module in SuPar extracts BERT representations from pretrained models in transformers. It is also compatible with other language models such as XLNet, RoBERTa and ELECTRA.

The CRF models for Dependency/Constituency parsing are our recent works published in ACL 2020 and IJCAI 2020 respectively. If you are interested in them, please cite:

@inproceedings{zhang-etal-2020-efficient,
  title     = {Efficient Second-Order {T}ree{CRF} for Neural Dependency Parsing},
  author    = {Zhang, Yu and Li, Zhenghua and Zhang, Min},
  booktitle = {Proceedings of ACL},
  year      = {2020},
  url       = {https://www.aclweb.org/anthology/2020.acl-main.302},
  pages     = {3295--3305}
}

@inproceedings{zhang-etal-2020-fast,
  title     = {Fast and Accurate Neural {CRF} Constituency Parsing},
  author    = {Zhang, Yu and Zhou, Houquan and Li, Zhenghua},
  booktitle = {Proceedings of IJCAI},
  year      = {2020},
  doi       = {10.24963/ijcai.2020/560},
  url       = {https://doi.org/10.24963/ijcai.2020/560},
  pages     = {4046--4053}
}

Installation

SuPar can be installed via pip:

$ pip install -U supar

Alternatively, you can install from source:

$ git clone https://github.com/yzhangcs/parser && cd parser
$ python setup.py install

As a prerequisite, the following requirements should be satisfied:

Performance

Currently, SuPar provides pretrained models for English and Chinese. English models are trained on Penn Treebank (PTB) with 39,832 training sentences, while Chinese models are trained on Penn Chinese Treebank version 7 (CTB7) with 46,572 training sentences.

The performance and parsing speed of these models are listed in the following table. Notably, punctuation is ignored in all evaluation metrics for PTB, but retained for CTB7.

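| Dataset | Type | Name | Metric | Performance | Speed (Sents/s) |
| --- | --- | --- | --- | --- | --- |
| PTB | Dependency | biaffine-dep-en | UAS/LAS | 96.03/94.37 | 1826.77 |
| PTB | Dependency | biaffine-dep-bert-en | UAS/LAS | 96.69/95.15 | 646.66 |
| PTB | Dependency | crfnp-dep-en | UAS/LAS | 96.01/94.42 | 2197.15 |
| PTB | Dependency | crf-dep-en | UAS/LAS | 96.12/94.50 | 652.41 |
| PTB | Dependency | crf2o-dep-en | UAS/LAS | 96.14/94.55 | 465.64 |
| PTB | Constituency | crf-con-en | F1 | 94.18 | 923.74 |
| PTB | Constituency | crf-con-bert-en | F1 | 95.26 | 503.99 |
| CTB7 | Dependency | biaffine-dep-zh | UAS/LAS | 88.77/85.63 | 1155.50 |
| CTB7 | Dependency | biaffine-dep-bert-zh | UAS/LAS | 91.81/88.94 | 395.28 |
| CTB7 | Dependency | crfnp-dep-zh | UAS/LAS | 88.78/85.64 | 1323.75 |
| CTB7 | Dependency | crf-dep-zh | UAS/LAS | 88.98/85.84 | 354.65 |
| CTB7 | Dependency | crf2o-dep-zh | UAS/LAS | 89.35/86.25 | 217.09 |
| CTB7 | Constituency | crf-con-zh | F1 | 88.67 | 639.27 |
| CTB7 | Constituency | crf-con-bert-zh | F1 | 91.40 | 300.15 |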

All results were obtained on a machine with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz and an Nvidia GeForce GTX 1080 Ti GPU.
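For readers unfamiliar with the UAS/LAS metrics reported above, the following sketch shows how they are commonly computed for one sentence, including the option to skip punctuation as done for PTB. It is an illustration only, not SuPar's metric code; the function name `attachment_scores` and the `skip` mask are hypothetical.

```python
def attachment_scores(gold_heads, gold_rels, pred_heads, pred_rels, skip=None):
    """Return (UAS, LAS) in percent.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the relation label to be correct. Tokens with skip[i] == True
    (for example punctuation) are left out of both counts.
    """
    total = uas = las = 0
    rows = zip(gold_heads, gold_rels, pred_heads, pred_rels)
    for i, (g_head, g_rel, p_head, p_rel) in enumerate(rows):
        if skip is not None and skip[i]:
            continue
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    return 100.0 * uas / total, 100.0 * las / total


gold_heads = [2, 0, 2, 3, 2]
gold_rels = ["nsubj", "root", "xcomp", "dobj", "punct"]
pred_heads = [2, 0, 2, 2, 2]  # "tennis" attached to the wrong head
pred_rels = ["nsubj", "root", "ccomp", "dobj", "punct"]  # wrong label on "playing"
is_punct = [False, False, False, False, True]

print(attachment_scores(gold_heads, gold_rels, pred_heads, pred_rels))            # (80.0, 60.0)
print(attachment_scores(gold_heads, gold_rels, pred_heads, pred_rels, is_punct))  # (75.0, 50.0)
```

Because punctuation is usually attached correctly, excluding it tends to lower the scores slightly, which is one reason the PTB and CTB7 numbers are not directly comparable.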

Usage

SuPar is very easy to use. You can download the pretrained model and run syntactic parsing over sentences with a few lines of code:

>>> from supar import Parser
>>> parser = Parser.load('biaffine-dep-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], prob=True, verbose=False)
100%|####################################| 1/1 00:00<00:00, 85.15it/s

The call to parser.predict will return an instance of supar.utils.Dataset containing the predicted syntactic trees. For dependency parsing, you can either access each sentence held in dataset or an individual field of all the trees.

>>> print(dataset.sentences[0])
1       She     _       _       _       _       2       nsubj   _       _
2       enjoys  _       _       _       _       0       root    _       _
3       playing _       _       _       _       2       xcomp   _       _
4       tennis  _       _       _       _       3       dobj    _       _
5       .       _       _       _       _       2       punct   _       _

>>> import torch
>>> print(f"arcs:  {dataset.arcs[0]}\n"
...       f"rels:  {dataset.rels[0]}\n"
...       f"probs: {dataset.probs[0].gather(1,torch.tensor(dataset.arcs[0]).unsqueeze(1)).squeeze(-1)}")
arcs:  [2, 0, 2, 3, 2]
rels:  ['nsubj', 'root', 'xcomp', 'dobj', 'punct']
probs: tensor([1.0000, 0.9999, 0.9642, 0.9686, 0.9996])

Probabilities can be returned along with the results if prob=True. As for CRF parsers, marginals are available if mbr=True, i.e., using MBR decoding.
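The `gather` call in the example above can be hard to read at first. The sketch below reproduces the same selection with plain Python lists, under the assumption that row `i` of the probability matrix is a distribution over candidate heads for token `i + 1`, with column 0 standing for the root. The helper name `head_probabilities` and the toy numbers are made up for illustration.

```python
def head_probabilities(probs, arcs):
    """Pick, for every token, the probability assigned to its predicted head.

    probs[i][h] is assumed to be the probability that token i + 1 has head h
    (h = 0 is the root); arcs[i] is the predicted head of token i + 1.
    This mirrors probs.gather(1, arcs.unsqueeze(1)).squeeze(-1) in PyTorch.
    """
    return [row[head] for row, head in zip(probs, arcs)]


# Two tokens, three candidate heads each (root, token 1, token 2).
toy_probs = [
    [0.1, 0.2, 0.7],    # token 1: most likely head is token 2
    [0.9, 0.05, 0.05],  # token 2: most likely head is the root
]
toy_arcs = [2, 0]
print(head_probabilities(toy_probs, toy_arcs))  # [0.7, 0.9]
```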

Note that SuPar requires pre-tokenized sentences as inputs. If you'd like to parse un-tokenized raw texts, you can call nltk.word_tokenize to do the tokenization first:

>>> import nltk
>>> text = nltk.word_tokenize('She enjoys playing tennis.')
>>> print(parser.predict([text], verbose=False).sentences[0])
100%|####################################| 1/1 00:00<00:00, 74.20it/s
1       She     _       _       _       _       2       nsubj   _       _
2       enjoys  _       _       _       _       0       root    _       _
3       playing _       _       _       _       2       xcomp   _       _
4       tennis  _       _       _       _       3       dobj    _       _
5       .       _       _       _       _       2       punct   _       _

If there are many sentences to parse, SuPar also supports loading them from a file and, if pred is specified, saving the results to that file.

>>> dataset = parser.predict('data/ptb/test.conllx', pred='pred.conllx')
2020-07-25 18:13:50 INFO Loading the data
2020-07-25 18:13:52 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:13:52 INFO Making predictions on the dataset
100%|####################################| 13/13 00:01<00:00, 10.58it/s
2020-07-25 18:13:53 INFO Saving predicted results to pred.conllx
2020-07-25 18:13:54 INFO 0:00:01.335261s elapsed, 1809.38 Sents/s
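The Sents/s figure in the last log line is simply the number of sentences divided by the elapsed time. A small sketch that reproduces the arithmetic from the log above (the helper name `sents_per_second` is hypothetical):

```python
from datetime import timedelta


def sents_per_second(n_sentences, elapsed):
    """Throughput in sentences per second for a timedelta of elapsed time."""
    return n_sentences / elapsed.total_seconds()


# 2416 sentences parsed in 0:00:01.335261
elapsed = timedelta(seconds=1, microseconds=335261)
print(f"{sents_per_second(2416, elapsed):.2f} Sents/s")  # 1809.38 Sents/s
```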

Please make sure the file is in CoNLL-X format. If some fields are missing, you can use underscores as placeholders. An interface is provided for the transformation from text to CoNLL-X format string.

>>> from supar.utils import CoNLL
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _
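If you want to see what such a conversion involves, the sketch below builds the same ten-column layout by hand: a 1-based index, the word form, and underscores for the eight remaining fields, separated by tabs. It is a stand-in written for illustration and makes no claim about how `CoNLL.toconll` is implemented; the name `to_conll` is hypothetical.

```python
def to_conll(tokens):
    """Render a tokenized sentence as a CoNLL-X string with empty annotations."""
    lines = []
    for index, token in enumerate(tokens, 1):
        # ID, FORM, then eight placeholder columns
        fields = [str(index), token] + ["_"] * 8
        lines.append("\t".join(fields))
    # Sentences in CoNLL files are terminated by a newline
    return "\n".join(lines) + "\n"


print(to_conll(["She", "enjoys", "playing", "tennis", "."]))
```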

For Universal Dependencies (UD), CoNLL-U files are also accepted. Comment lines in the file are set aside before prediction and restored during post-processing.

>>> import os
>>> import tempfile
>>> text = '''# text = But I found the location wonderful and the neighbors very kind.
1\tBut\t_\t_\t_\t_\t_\t_\t_\t_
2\tI\t_\t_\t_\t_\t_\t_\t_\t_
3\tfound\t_\t_\t_\t_\t_\t_\t_\t_
4\tthe\t_\t_\t_\t_\t_\t_\t_\t_
5\tlocation\t_\t_\t_\t_\t_\t_\t_\t_
6\twonderful\t_\t_\t_\t_\t_\t_\t_\t_
7\tand\t_\t_\t_\t_\t_\t_\t_\t_
7.1\tfound\t_\t_\t_\t_\t_\t_\t_\t_
8\tthe\t_\t_\t_\t_\t_\t_\t_\t_
9\tneighbors\t_\t_\t_\t_\t_\t_\t_\t_
10\tvery\t_\t_\t_\t_\t_\t_\t_\t_
11\tkind\t_\t_\t_\t_\t_\t_\t_\t_
12\t.\t_\t_\t_\t_\t_\t_\t_\t_

'''
>>> path = os.path.join(tempfile.mkdtemp(), 'data.conllx')
>>> with open(path, 'w') as f:
...     f.write(text)
...
>>> print(parser.predict(path, verbose=False).sentences[0])
100%|####################################| 1/1 00:00<00:00, 68.60it/s
# text = But I found the location wonderful and the neighbors very kind.
1       But     _       _       _       _       3       cc      _       _
2       I       _       _       _       _       3       nsubj   _       _
3       found   _       _       _       _       0       root    _       _
4       the     _       _       _       _       5       det     _       _
5       location        _       _       _       _       6       nsubj   _       _
6       wonderful       _       _       _       _       3       xcomp   _       _
7       and     _       _       _       _       6       cc      _       _
7.1     found   _       _       _       _       _       _       _       _
8       the     _       _       _       _       9       det     _       _
9       neighbors       _       _       _       _       11      dep     _       _
10      very    _       _       _       _       11      advmod  _       _
11      kind    _       _       _       _       6       conj    _       _
12      .       _       _       _       _       3       punct   _       _
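In the output above, the comment line and the empty node `7.1` pass through unchanged while the regular tokens receive heads and relations. One way to get this behavior is to set such lines aside before parsing and put them back afterwards. The sketch below illustrates the idea under that assumption; it is not SuPar's code, and `split_conllu` and `merge_conllu` are hypothetical names.

```python
def split_conllu(block):
    """Separate parseable words from lines that must be kept verbatim.

    Returns (words, kept), where kept maps a line position to the raw line for
    comments and for rows whose ID is not a plain integer (empty nodes such as
    7.1 or multiword ranges such as 1-2).
    """
    words, kept = [], {}
    for pos, line in enumerate(block.strip("\n").split("\n")):
        columns = line.split("\t")
        if line.startswith("#") or not columns[0].isdigit():
            kept[pos] = line
        else:
            words.append(columns[1])
    return words, kept


def merge_conllu(rows, kept):
    """Interleave predicted rows with the lines set aside by split_conllu."""
    rows = iter(rows)
    total = len(kept) + sum(1 for _ in [])  # placeholder keeps the name explicit
    total = max(kept, default=-1) + 1
    merged = []
    pos = 0
    pending = list(rows)
    while pending or pos < total:
        if pos in kept:
            merged.append(kept[pos])
        else:
            merged.append(pending.pop(0))
        pos += 1
    return merged


block = "# text = a b\n1\ta\t_\n1.1\tx\t_\n2\tb\t_\n"
words, kept = split_conllu(block)
print(words)                                        # ['a', 'b']
print(merge_conllu(["1\ta\t2", "2\tb\t0"], kept))
```

Only `words` would be handed to the parser; the positions stored in `kept` make it possible to restore the original line order once predictions are available.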

Constituency trees can be parsed in a similar manner. The returned dataset holds all predicted trees represented using nltk.Tree objects.

>>> parser = Parser.load('crf-con-en')
>>> dataset = parser.predict([['She', 'enjoys', 'playing', 'tennis', '.']], verbose=False)
100%|####################################| 1/1 00:00<00:00, 75.86it/s
>>> print(f"trees:\n{dataset.trees[0]}")
trees:
(TOP
  (S
    (NP (_ She))
    (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis)))))
    (_ .)))
>>> dataset = parser.predict('data/ptb/test.pid', pred='pred.pid')
2020-07-25 18:21:28 INFO Loading the data
2020-07-25 18:21:33 INFO
Dataset(n_sentences=2416, n_batches=13, n_buckets=8)
2020-07-25 18:21:33 INFO Making predictions on the dataset
100%|####################################| 13/13 00:02<00:00,  5.30it/s
2020-07-25 18:21:36 INFO Saving predicted results to pred.pid
2020-07-25 18:21:36 INFO 0:00:02.455740s elapsed, 983.82 Sents/s

Analogous to dependency parsing, a sentence can be transformed to an empty nltk.Tree conveniently:

>>> from supar.utils import Tree
>>> print(Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], root='TOP'))
(TOP (_ She) (_ enjoys) (_ playing) (_ tennis) (_ .))
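The flat tree printed above has a very simple shape: every token is wrapped in a placeholder preterminal `_` and attached directly to the root. A minimal string-based sketch that produces the same bracketing is shown below. It does not use nltk and is not how `Tree.totree` is implemented; the name `to_bracketed` is hypothetical, and tokens that contain parentheses would need escaping, which this sketch does not handle.

```python
def to_bracketed(tokens, root="TOP", tag="_"):
    """Build a flat bracketed tree string with a placeholder tag per token."""
    leaves = " ".join(f"({tag} {token})" for token in tokens)
    return f"({root} {leaves})"


print(to_bracketed(["She", "enjoys", "playing", "tennis", "."]))
# (TOP (_ She) (_ enjoys) (_ playing) (_ tennis) (_ .))
```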

Training

To train a model from scratch, it is preferred to use the command-line option, which is more flexible and customizable. Here are some training examples:

# Biaffine Dependency Parser
# some common and default arguments are stored in config.ini
$ python -m supar.cmds.biaffine_dependency train -b -d 0  \
    -c config.ini  \
    -p exp/ptb.biaffine.dependency.char/model  \
    -f char
# to use BERT, `-f` and `--bert` (default to bert-base-cased) should be specified
# if you'd like to use XLNet, you can type `--bert xlnet-base-cased`
$ python -m supar.cmds.biaffine_dependency train -b -d 0  \
    -p exp/ptb.biaffine.dependency.bert/model  \
    -f bert  \
    --bert bert-base-cased

# CRF Dependency Parser
# for CRF dependency parsers, you should use `--proj` to discard all non-projective training instances
# optionally, you can use `--mbr` to perform MBR decoding
$ python -m supar.cmds.crf_dependency train -b -d 0  \
    -p exp/ptb.crf.dependency.char/model  \
    -f char  \
    --mbr  \
    --proj

# CRF Constituency Parser
# the training of CRF constituency parser behaves like dependency parsers
$ python -m supar.cmds.crf_constituency train -b -d 0  \
    -p exp/ptb.crf.constituency.char/model -f char  \
    --mbr

For more instructions on training, please type python -m supar.cmds.<parser> train -h.
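The `--proj` flag mentioned in the CRF dependency example discards non-projective training instances. A tree is projective when no two arcs cross once the words are laid out in sentence order. The sketch below shows one straightforward way to test this; it is an illustrative quadratic check, not the filter SuPar uses, and `is_projective` is a hypothetical name.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the head of token i + 1, with 0 denoting the artificial root
    placed before the first token.
    """
    # Represent each arc by its left and right end points
    arcs = [(min(head, dep), max(head, dep)) for dep, head in enumerate(heads, 1)]
    for left_a, right_a in arcs:
        for left_b, right_b in arcs:
            # Two arcs cross when exactly one end of the second lies strictly inside the first
            if left_a < left_b < right_a < right_b:
                return False
    return True


print(is_projective([2, 0, 2, 3, 2]))  # True: the tree from the Usage example
print(is_projective([3, 0, 2]))        # False: arc 3 -> 1 crosses arc 0 -> 2
```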

Alternatively, SuPar provides some equivalent command entry points registered in setup.py: biaffine-dependency, crfnp-dependency, crf-dependency, crf2o-dependency and crf-constituency.

$ biaffine-dependency train -b -d 0 -c config.ini -p exp/ptb.biaffine.dependency.char/model -f char

To accommodate large models, distributed training is also supported:

$ python -m torch.distributed.launch --nproc_per_node=4 --master_port=10000  \
    -m supar.cmds.biaffine_dependency train -b -d 0,1,2,3  \
    -p exp/ptb.biaffine.dependency.char/model  \
    -f char

You can consult the PyTorch documentation and tutorials for more details.

Evaluation

The evaluation process resembles prediction:

>>> parser = Parser.load('biaffine-dep-en')
>>> loss, metric = parser.evaluate('data/ptb/test.conllx')
2020-07-25 20:59:17 INFO Loading the data
2020-07-25 20:59:19 INFO
Dataset(n_sentences=2416, n_batches=11, n_buckets=8)
2020-07-25 20:59:19 INFO Evaluating the dataset
2020-07-25 20:59:20 INFO loss: 0.2326 - UCM: 61.34% LCM: 50.21% UAS: 96.03% LAS: 94.37%
2020-07-25 20:59:20 INFO 0:00:01.253601s elapsed, 1927.25 Sents/s
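The log above reports UCM and LCM next to UAS and LAS. Assuming these stand for unlabeled and labeled complete match, they measure the share of sentences in which every head (and, for LCM, every relation as well) is predicted correctly. A minimal sketch under that assumption, with the hypothetical helper name `complete_match`:

```python
def complete_match(gold, pred):
    """Return (UCM, LCM) in percent over a list of sentences.

    Each sentence is a list of (head, relation) pairs, one per token.
    """
    n_sentences = len(gold)
    ucm = sum(
        [head for head, _ in g] == [head for head, _ in p]
        for g, p in zip(gold, pred)
    )
    lcm = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * ucm / n_sentences, 100.0 * lcm / n_sentences


gold = [[(2, "nsubj"), (0, "root")], [(0, "root"), (1, "punct")]]
pred = [[(2, "nsubj"), (0, "root")], [(0, "root"), (1, "dep")]]
print(complete_match(gold, pred))  # (100.0, 50.0)
```

A single wrong label is enough to make a whole sentence count as a miss for LCM, which is why these figures sit far below the token-level UAS and LAS.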
