The repository contains code for Universal Dependencies according to BERT: both more specific and more general
Our modification of the Universal Dependencies annotation is applied with UDApi. To install UDApi, follow the instructions from UDApi. We have created a custom block that performs the CoNLL-U modifications. To use it:
- Clone the UDApi repository
- Copy the file attentionconvert.py to udapi-python/udapi/block/ud
- Follow the steps in Install Udapi for developers
- Run in a command line:
udapy read.Conllu files=<path-to-conllu> ud.AttentionConvert write.Conllu > <path-to-converted-conllu>
Note that this step is optional. However, it is necessary to reproduce our results.
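To check what the conversion changed, it can help to read the head and relation columns of the original and converted files side by side. The following is a minimal sketch of a reader, assuming the standard 10-column, tab-separated CoNLL-U layout; the function name read_conllu and the sample sentence are illustrative and are not part of this repository or of UDApi.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, head, deprel) tuples.

    Assumes standard CoNLL-U columns:
    ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
    Assumes HEAD is an integer for every regular token.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ...".
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Illustrative sample, not taken from any treebank.
sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
```

For a real file, pass an open file object instead of the sample, for example read_conllu(open(path, encoding="utf-8")).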
The code and instructions for running BERT over text and extracting the resulting attention maps were created by Kevin Clark and were adapted for this project. The original code is available at Attention Analysis Clark et al.
The input data should be a JSON file containing a list of dicts, each one corresponding to a single example to be passed into BERT. Each dict must contain exactly one of the following fields:
- "text": A string.
- "words": A list of strings. Needed if you want word-level rather than token-level attention.
- "tokens": A list of strings corresponding to BERT wordpiece tokenization.
If the present field is "tokens," the script expects [CLS]/[SEP] tokens
to be already added; otherwise it adds these tokens to the
beginning/end of the text automatically.
Note that if an example is longer than max_sequence_length tokens after BERT wordpiece tokenization, attention maps will not be extracted for it.
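As an illustration of the input format described above, the following sketch builds a small input file that uses the "words" field. The helper names validate and write_examples, the output path, and the example sentences are placeholders chosen for this sketch; they do not exist in the repository.

```python
import json

INPUT_FIELDS = ("text", "words", "tokens")

# Placeholder sentences; each dict carries exactly one input field.
examples = [
    {"words": ["The", "cat", "sat", "on", "the", "mat", "."]},
    {"words": ["Dogs", "bark", "."]},
]


def validate(example):
    """Return the name of the single input field present in an example."""
    present = [key for key in INPUT_FIELDS if key in example]
    if len(present) != 1:
        raise ValueError(f"expected exactly one input field, got {present}")
    return present[0]


def write_examples(items, path):
    """Check every example, then write the list of dicts as JSON."""
    for item in items:
        validate(item)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(items, handle, ensure_ascii=False, indent=2)
```

Calling write_examples(examples, "examples.json") produces a file that can be passed as <path-to-your-data>.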
Attention extraction is run with
python attention-analysis-clark-etal/extract_attention.py --preprocessed-data-file <path-to-your-data> --bert_dir <directory-containing-BERT-model> --max-sequence-length 256
The following optional arguments can also be added:
- --max_sequence_length: Maximum input sequence length after tokenization (default is 128).
- --batch_size: Batch size when running BERT over examples (default is 16).
- --debug: Use a tiny BERT model for fast debugging.
- --cased: Do not lowercase the input text.
- --word_level: Compute word-level instead of token-level attention (see Section 4.1 of the paper).
The list of attention matrices will be saved to <path-to-your-data>_attentions.npz. The file will be referred to as <path-to-attentions> in the next steps.
Wordpiece tokenized sentences will be saved to <path-to-your-data>_source.txt. The file will be referred to as <path-to-wordpieces> in the next steps.
Select syntactic head ensembles for each Universal Dependencies syntactic relation:
python3 head-ensembles/head_ensemble.py <attention-matrices> <bpe-tokenized-sentences> <path-to-conllu> -j <path-to-head-ensembles>
<attention-matrices> and <bpe-tokenized-sentences> are the files generated in the previous step (<path-to-attentions> and <path-to-wordpieces>, respectively). <path-to-conllu> is the path to the CoNLL-U file used for evaluation, optionally converted with UDApi beforehand.
A dictionary is produced with syntactic labels as keys and head ensembles as values. Each head ensemble contains fields:
- ensemble: list of pairs [layer_index, head_index] of the heads selected for the ensemble
- max_metric: metric result for the head ensemble on the evaluation CoNLL-U file (dependency accuracy by default)
- metric_history: metric result in each step of the selection process
- max_ensemble_size: the limit of the number of heads in an ensemble
- relation_label: the same as a dictionary key
If the argument --json is provided, the dictionary is saved in JSON format.
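The saved dictionary can be inspected with a few lines of Python. The sketch below assumes only the field names listed above; the relation labels, head indices, and scores in the toy dictionary are made up for illustration and are not results from the paper, and the exact key format written by the script may differ.

```python
import json


def load_ensembles(path):
    """Load the head ensemble dictionary written with --json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_ensembles(ensembles):
    """Return (relation_label, heads, max_metric) tuples, best metric first."""
    rows = []
    for label, entry in ensembles.items():
        heads = [tuple(pair) for pair in entry["ensemble"]]
        rows.append((label, heads, entry["max_metric"]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up values for illustration only.
toy = {
    "nsubj": {
        "ensemble": [[7, 10], [4, 5]],
        "max_metric": 0.6,
        "metric_history": [0.5, 0.6],
        "max_ensemble_size": 4,
        "relation_label": "nsubj",
    },
    "det": {
        "ensemble": [[3, 9]],
        "max_metric": 0.9,
        "metric_history": [0.9],
        "max_ensemble_size": 4,
        "relation_label": "det",
    },
}
```

For a real run, replace toy with load_ensembles applied to the file given as <path-to-head-ensembles>.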
Other arguments for the script:
- --metric: metric to optimize in head ensemble selection (currently only DepAcc is supported)
- --num-heads: the maximal size of each head ensemble (by default: 4)
- --sentences: indices of the sentences used for selection.
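To make the DepAcc metric more concrete, here is a minimal sketch under stated assumptions: the attention matrices of the heads in an ensemble are averaged, rows index dependents and columns index candidate heads, and an edge counts as correct when the gold head receives the largest weight in the dependent's row. The function names, the matrix orientation, and the toy numbers are assumptions of this sketch; the implementation in head_ensemble.py may differ.

```python
def average_heads(matrices):
    """Element-wise mean of equally sized square attention matrices."""
    count = len(matrices)
    size = len(matrices[0])
    return [
        [sum(matrix[i][j] for matrix in matrices) / count for j in range(size)]
        for i in range(size)
    ]


def dependency_accuracy(attention, gold_edges):
    """Fraction of gold (dependent, head) pairs recovered by row-wise argmax.

    Assumption: attention[dependent][head] holds the weight of interest.
    """
    if not gold_edges:
        return 0.0
    hits = 0
    for dependent, head in gold_edges:
        row = attention[dependent]
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += predicted == head
    return hits / len(gold_edges)


# Toy 3x3 attention matrices and gold edges, made up for illustration.
head_a = [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
head_b = [[0.2, 0.6, 0.2], [0.1, 0.2, 0.7], [0.1, 0.8, 0.1]]
gold = [(0, 1), (2, 1)]
```

On this toy input, head_a alone recovers one of the two edges, while the average of head_a and head_b recovers both.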
Construct dependency trees from the head ensembles selected in the previous step and evaluate their UAS and LAS on a CoNLL-U file:
python head-ensembles/extract_trees.py <attention-matrices> <bpe-tokenized-sentences> <path-to-conllu> <path-to-head-ensembles>
The results are printed to standard output. We use different CoNLL-U files for head ensemble selection (EuroParl with the UD modification) and for dependency tree construction and evaluation (PUD without UD modifications).
Other arguments for the script:
- --sentences: indices of the sentences used for selection.
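For reference, UAS is the share of words whose predicted head matches the gold head, and LAS additionally requires the relation label to match. The sketch below computes both from aligned lists; the function name and the toy annotations are illustrative, and extract_trees.py may apply further conventions (for example, in how punctuation or the root is handled) that this sketch does not model.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for aligned lists of (head_index, relation_label)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0, 0.0
    # UAS: the head matches. LAS: the head and the label both match.
    unlabeled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled = sum(g == p for g, p in zip(gold, predicted))
    return unlabeled / len(gold), labeled / len(gold)


# Toy annotations for a four-word sentence, made up for illustration.
gold_tree = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
predicted_tree = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
```

Here three of the four heads are correct and two of those also carry the correct label, so attachment_scores(gold_tree, predicted_tree) gives a UAS of 0.75 and a LAS of 0.5.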
- Install the required packages with pip. Follow the instructions in Universal Dependencies Modification to install UDApi with our custom block.
- Download CoNLL-U files from the Universal Dependencies website. For instance, for Japanese, use the GSD train treebank ja_gsd-ud-train.conllu for head selection and ja_pud-ud-test.conllu for evaluation. Save the files to the resources directory.
- Download a BERT model from BERT GitHub to <directory-containing-BERT-model>. Then extract the attention matrices by running the bash script:
source scripts/extract_attention.sh ja_gsd-ud-train ja_pud-ud-test <directory-containing-BERT-model>
- Run head selection and tree extraction by running the bash script. Results will be saved in the results directory:
source scripts/pipeline_eval.sh ja_gsd-ud-train ja_pud-ud-test
@misc{limisiewicz2020universal,
title={Universal Dependencies according to BERT: both more specific and more general},
author={Tomasz Limisiewicz and Rudolf Rosa and David Mare\v{c}ek},
year={2020},
eprint={2004.14620},
archivePrefix={arXiv},
primaryClass={cs.CL}
}