
Meta-learning for Low-Resource Dependency Parsing

Data

All of the experiments were conducted on Universal Dependencies (https://universaldependencies.org/).

Setting up the environment

  1. Clone this repository:
git clone https://github.com/Chung-I/maml-parsing.git
  2. Set up the conda environment:
conda create -n maml-parsing python=3.6
conda activate maml-parsing
  3. Install Python package requirements:
pip install -r requirements.txt

Pre-training

  • UD_GT: Root path of the ground-truth Universal Dependencies treebank files used for evaluation.
  • UD_ROOT: Root path of the treebank files used for training. If you train on ground-truth Universal Dependencies treebank files, simply set it to the same value as UD_GT. If you would rather use your own POS tagger's output as input features for training, put all POS-tagged conllu files in a single folder and set UD_ROOT to it. For those who would like to compare their results with the paper, we provide Universal Dependencies v2.2 preprocessed by the stanfordnlp package, which uses that pipeline's predicted POS tags for training.
  • CONFIG_NAME: JSON file storing the training configuration, such as dataset paths, model hyperparameter settings, and the training schedule. See "Delexicalized parsing models" and "Lexicalized parsing models" below for example configuration files to choose from.
  • Normal usage: Simply extract Universal Dependencies v2.2 to some folder, then set UD_GT="folder/**/" and UD_ROOT="folder/**/".
UD_GT="path/to/your/ud-treebanks-v2.2/**/" UD_ROOT="path/to/your/pos-tagged/conllu-files/" python -W ignore run.py train $CONFIG_NAME -s <serialization_dir> --include-package src

Delexicalized parsing models:

Lexicalized parsing models:

Hyperparameters:

  • num_gradient_accumulation_steps: number of meta-learning inner-loop steps.

Zero-shot Transfer

  • UD_GT: Same as pre-training.
  • UD_ROOT: Root path of the treebank files used for testing. If ground-truth text segmentation and POS tags are used as inputs to the parser, simply set it to the same value as UD_GT. Users who would like to compare their results with CoNLL 2018 shared task submissions, which are scored not only on parser accuracy but also on the whole preprocessing pipeline (tokenization, lemmatization, POS/morphological feature tagging, multi-word token expansion) that precedes dependency parsing, can run their own preprocessing pipeline on raw text, put all preprocessed conllu files in a single folder, and set UD_ROOT to it; the parser will read the test files in that folder to generate system output. For users who do not want to develop their own preprocessing pipeline but still want to compare their results with CoNLL 2018 submissions, we provide Universal Dependencies v2.2 preprocessed by the stanfordnlp pipeline (stanfordnlp package). Universal Dependencies v2.5 preprocessed by the stanza pipeline (stanza package) is also provided for users who would like to parse treebanks in UD v2.5 and compare their results with Stanza, Stanford's multilingual NLP system trained on UD v2.5.
  • EPOCH_NUM: Which pre-training epoch checkpoint to perform zero-shot transfer from.
  • ZS_LANG: Language code of the target transfer language (e.g. wo, te, cop, etc.).
  • SUFFIX: Suffix of folder names storing results.
  • <serialization_dir>: Directory of model to perform zero-shot transfer from. For example, if one would like to perform zero-shot transfer from the pos-only multi-task baseline model, simply extract pre-trained model multi-pos.tar.gz and set <serialization_dir> to that folder.
UD_GT="path/to/your/ud-treebanks-v2.x/**/" UD_ROOT="path/to/your/preprocessed/conllu-files/" bash zs-eval.sh <serialization_dir> $EPOCH_NUM $ZS_LANG 0 $SUFFIX

Results will be stored in log dir: <serialization_dir>_${EPOCH_NUM}_${ZS_LANG}_${SUFFIX}.
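
For example, with arbitrary illustrative values (wo is the Wolof language code mentioned above, and ckpts/multi-pos stands in for wherever you extracted the pre-trained model):

UD_GT="$HOME/ud-treebanks-v2.2/**/" \
UD_ROOT="$HOME/preprocessed/conllu-files/" \
bash zs-eval.sh ckpts/multi-pos 10 wo 0 zs

Following the naming scheme above, the results of this run would be written to ckpts/multi-pos_10_wo_zs.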

Fine-tuning

  • UD_GT: Same as pre-training.
  • UD_ROOT: Same as zero-shot transfer.
  • EPOCH_NUM: Which pre-training epoch checkpoint to perform fine-tuning from.
  • FT_LANG: Language code of the target transfer language (e.g. wo, te, cop, etc.).
  • NUM_EPOCHS: Number of epochs to fine-tune for.
  • SUFFIX: Suffix of folder names storing results.
  • <serialization_dir>: Directory of model to perform fine-tuning from. For example, if one would like to perform fine-tuning from the pos-only multi-task baseline model, simply extract pre-trained model multi-pos.tar.gz and set <serialization_dir> to that folder.
UD_GT="path/to/your/ud-treebanks-v2.x/**/" UD_ROOT="path/to/your/preprocessed/testset/" bash fine-tune.sh <serialization_dir> $EPOCH_NUM $FT_LANG $NUM_EPOCHS $SUFFIX

Results will be stored in log dir: <serialization_dir>_${EPOCH_NUM}_${FT_LANG}_${SUFFIX}.
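
For example, again with illustrative values (te is the Telugu language code; 20 fine-tuning epochs is an arbitrary choice):

UD_GT="$HOME/ud-treebanks-v2.2/**/" \
UD_ROOT="$HOME/preprocessed/testset/" \
bash fine-tune.sh ckpts/multi-pos 10 te 20 ft

Following the naming scheme above, the results of this run would be stored in ckpts/multi-pos_10_te_ft.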

Files in log directory

  • train-result.conllu: System prediction of training set ($UD_GT/$ZS_LANG*-train.conllu).
  • dev-result.conllu: System prediction of development set ($UD_GT/$ZS_LANG*-dev.conllu).
  • result.conllu: System prediction of testing set ($UD_ROOT/$ZS_LANG*-test.conllu).
  • result-gt.conllu: System prediction of testing set ($UD_GT/$ZS_LANG*-test.conllu).
  • result.txt: Performance (LAS, UAS, etc.) of result.conllu computed by utils/conll18_ud_eval.py, the official evaluation script of the CoNLL 2018 Shared Task.
  • result-gt.txt: Performance (LAS, UAS, etc.) of result-gt.conllu computed by utils/error_analysis.py, a modified version of the CoNLL 2018 Shared Task script that additionally reports scores grouped by sentence length (LASlen[sentence length lower bound][sentence length upper bound]) and by dependency length (LASdep[dependency length]).
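
To recompute the test-set scores yourself, the official evaluation script can be invoked directly, with the gold file first and the system output second. The gold file below is the one for the Wolof treebank and the log directory is the one from the zero-shot example above; substitute your own paths:

python utils/conll18_ud_eval.py -v \
  path/to/your/ud-treebanks-v2.2/UD_Wolof-WTB/wo_wtb-ud-test.conllu \
  ckpts/multi-pos_10_wo_zs/result.conllu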
