COMBO is a jointly trained tagger, lemmatizer and dependency parser.
.gitignore
README.md
encoders.py
main.py
models.py
mst.py
parser.py
requirements.txt
utils.py

COMBO

COMBO is a jointly trained neural tagger, lemmatizer and dependency parser implemented in Python 3 using the Keras framework. It took part in the CoNLL 2018 Universal Dependencies shared task and ranked 3rd/4th in the official evaluation.

Paper

COMBO is described in the paper Semi-Supervised Neural System for Tagging, Parsing and Lematization.

Usage

Training your own model:

python main.py --mode autotrain --train train_data.conllu --valid valid_data.conllu --embed external_embedding.txt --model model_name.pkl --force_trees
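Before starting a long training run it can help to check that the training and validation files parse cleanly. The sketch below is not part of COMBO; it assumes the `.conllu` files follow the standard CoNLL-U layout (ten tab-separated columns per token line, `#` comment lines, blank lines between sentences). The names `read_conllu` and `summarize` are hypothetical helpers introduced for illustration.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names (an assumption about the input files).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} tab-separated columns, got {len(columns)}"
            )
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence  # the input did not end with a blank line


def summarize(path: str) -> Dict[str, int]:
    """Count sentences and syntactic words in a CoNLL-U file."""
    sentences = 0
    words = 0
    with open(path, encoding="utf-8") as handle:
        for sent in read_conllu(handle):
            sentences += 1
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            words += sum(1 for tok in sent if tok["id"].isdigit())
    return {"sentences": sentences, "words": words}
```

Calling `summarize("train_data.conllu")` returns the sentence and word counts, and a malformed token line raises `ValueError` instead of surfacing later during training.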

Making predictions:

python main.py --mode predict --test test_data.conllu --pred output_path.conllu --model model_name.pkl
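The two commands above can also be driven from a Python script, for example to train and then predict in one go. This is a minimal sketch, not part of COMBO: it only uses the flags shown in this section, assumes it is run from the repository root where `main.py` lives, and the function names `train_command`, `predict_command` and `run` are hypothetical.

```python
import subprocess
import sys
from typing import List


def train_command(train: str, valid: str, embed: str, model: str) -> List[str]:
    """Argument list mirroring the training command from this section."""
    return [
        sys.executable, "main.py",  # reuse the interpreter running this script
        "--mode", "autotrain",
        "--train", train,
        "--valid", valid,
        "--embed", embed,
        "--model", model,
        "--force_trees",
    ]


def predict_command(test: str, pred: str, model: str) -> List[str]:
    """Argument list mirroring the prediction command from this section."""
    return [
        sys.executable, "main.py",
        "--mode", "predict",
        "--test", test,
        "--pred", pred,
        "--model", model,
    ]


def run(command: List[str]) -> None:
    """Run one command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

A typical sequence would be `run(train_command("train_data.conllu", "valid_data.conllu", "external_embedding.txt", "model_name.pkl"))` followed by `run(predict_command("test_data.conllu", "output_path.conllu", "model_name.pkl"))`. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.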

Trained models

Models trained on Universal Dependencies (UD) treebanks:

| Language | Treebank | LAS | MLAS | BLEX | Model size |
|---|---|---|---|---|---|
| Afrikaans | af_afribooms | 84.72 | 72.91 | 74.98 | 377 MB |
| Ancient Greek | grc_perseus | 74.20 | 53.30 | 54.29 | 101 MB |
| Ancient Greek | grc_proiel | 76.45 | 59.95 | 67.47 | 101 MB |
| Arabic | ar_padt | 71.95 | 62.75 | 64.38 | 737 MB |
| Armenian | hy_armtdp | 28.15 | 5.02 | 11.25 | 738 MB |
| Basque | eu_bdt | 83.12 | 68.82 | 77.96 | 737 MB |
| Bulgarian | bg_btb | 89.36 | 81.10 | 79.98 | 738 MB |
| Buryat | bxr_bdt | 15.16 | 1.09 | 1.92 | 90 MB |
| Catalan | ca_ancora | 90.54 | 83.11 | 85.20 | 737 MB |
| Chinese | zh_gsd | 63.92 | 53.48 | 57.84 | 744 MB |
| Croatian | hr_set | 86.32 | 71.12 | 79.74 | 737 MB |
| Czech | cs_cac | 90.72 | 83.27 | 86.69 | 740 MB |
| Czech | cs_fictree | 91.83 | 84.23 | 87.81 | 740 MB |
| Czech | cs_pdt | 90.34 | 84.04 | 86.96 | 740 MB |
| Danish | da_ddt | 83.43 | 74.22 | 77.58 | 737 MB |
| Dutch | nl_alpino | 87.15 | 74.93 | 77.06 | 737 MB |
| Dutch | nl_lassysmall | 84.27 | 72.65 | 75.44 | 737 MB |
| English | en_ewt | 82.31 | 73.33 | 76.52 | 737 MB |
| English | en_gum | 82.82 | 73.24 | 73.57 | 737 MB |
| English | en_lines | 80.33 | 72.25 | 74.01 | 737 MB |
| Estonian | et_edt | 83.46 | 75.79 | 72.07 | 738 MB |
| Finnish | fi_ftb | 86.89 | 78.42 | 81.06 | 739 MB |
| Finnish | fi_tdt | 85.93 | 78.65 | 72.39 | 739 MB |
| French | fr_gsd | 85.42 | 77.08 | 79.72 | 738 MB |
| French | fr_sequoia | 88.99 | 81.48 | 84.67 | 738 MB |
| French | fr_spoken | 74.31 | 63.43 | 65.34 | 738 MB |
| Galician | gl_ctg | 81.17 | 68.15 | 73.60 | 736 MB |
| Galician | gl_treegal | 73.21 | 52.88 | 62.86 | 736 MB |
| German | de_gsd | 77.43 | 54.28 | 68.59 | 738 MB |
| Gothic | got_proiel | 65.87 | 50.81 | 59.30 | 48 MB |
| Greek | el_gdt | 88.49 | 76.15 | 78.57 | 738 MB |
| Hebrew | he_htb | 63.69 | 50.26 | 53.58 | 737 MB |
| Hindi | hi_hdtb | 91.43 | 76.23 | 86.29 | 593 MB |
| Hungarian | hu_szeged | 79.47 | 66.09 | 72.51 | 737 MB |
| Indonesian | id_gsd | 78.40 | 67.30 | 75.10 | 737 MB |
| Irish | ga_idt | 69.24 | 37.31 | 47.32 | 206 MB |
| Italian | it_isdt | 91.03 | 83.18 | 84.76 | 737 MB |
| Italian | it_postwita | 73.99 | 61.14 | 62.98 | 737 MB |
| Japanese | ja_gsd | 73.69 | 57.82 | 60.62 | 743 MB |
| Kazakh | kk_ktb | 22.38 | 4.40 | 7.86 | 738 MB |
| Korean | ko_gsd | 80.66 | 74.49 | 66.13 | 741 MB |
| Korean | ko_kaist | 84.88 | 76.92 | 72.40 | 743 MB |
| Kurmanji | kmr_mg | 21.95 | 2.26 | 5.01 | 45 MB |
| Latin | la_ittb | 85.54 | 79.84 | 83.51 | 526 MB |
| Latin | la_perseus | 68.07 | 49.77 | 52.75 | 526 MB |
| Latin | la_proiel | 70.08 | 56.82 | 64.94 | 526 MB |
| Latvian | lv_lvtb | 80.71 | 66.22 | 71.80 | 637 MB |
| North Sámi | sme_giella | 57.16 | 39.66 | 45.03 | 47 MB |
| Norwegian | no_bokmaal | 89.33 | 79.51 | 84.68 | 737 MB |
| Norwegian | no_nynorsk | 88.36 | 79.32 | 82.89 | 737 MB |
| Norwegian | no_nynorsklia | 68.26 | 57.51 | 60.98 | 737 MB |
| Old Church Slavonic | cu_proiel | 71.14 | 56.52 | 66.04 | 48 MB |
| Old French | fro_srcmf | 84.81 | 76.75 | 81.20 | 52 MB |
| Persian | fa_seraji | 86.14 | 80.30 | 76.29 | 737 MB |
| Polish | pl_lfg | 94.62 | 86.44 | 89.31 | 737 MB |
| Polish | pl_sz | 91.38 | 80.45 | 85.59 | 737 MB |
| Polish | poleval2018 | 86.11 | 76.18 | 79.86 | 115 MB |
| Portuguese | pt_bosque | 87.57 | 74.31 | 80.31 | 737 MB |
| Romanian | ro_rrt | 85.31 | 76.84 | 79.54 | 737 MB |
| Russian | ru_syntagrus | 91.10 | 85.37 | 87.16 | 741 MB |
| Russian | ru_taiga | 74.24 | 61.59 | 64.36 | 741 MB |
| Serbian | sr_set | 87.27 | 73.79 | 79.92 | 738 MB |
| Slovak | sk_snk | 83.76 | 63.97 | 75.34 | 54 MB |
| Slovenian | sl_ssj | 85.72 | 75.07 | 81.11 | 737 MB |
| Slovenian | sl_sst | 58.12 | 45.93 | 50.94 | 737 MB |
| Spanish | es_ancora | 89.68 | 82.60 | 84.51 | 737 MB |
| Swedish | sv_lines | 81.97 | 66.26 | 77.01 | 737 MB |
| Swedish | sv_talbanken | 85.89 | 77.68 | 80.74 | 737 MB |
| Turkish | tr_imst | 63.54 | 52.51 | 58.89 | 737 MB |
| Ukrainian | uk_iu | 84.71 | 69.88 | 77.97 | 738 MB |
| Upper Sorbian | hsb_ufal | 21.30 | 1.45 | 4.53 | 139 MB |
| Urdu | ur_udtb | 81.53 | 55.70 | 72.49 | 485 MB |
| Uyghur | ug_udt | 63.10 | 40.71 | 52.76 | 165 MB |
| Vietnamese | vi_vtb | 42.53 | 35.11 | 38.47 | 736 MB |
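For readers unfamiliar with the LAS column: the labeled attachment score counts a word as correct when both its head and its dependency relation match the gold annotation. The sketch below is a simplified illustration, not the evaluation used to produce the table. It assumes the gold and predicted CoNLL-U files have identical tokenization, whereas the official shared-task scorer first aligns system words to gold words and reports an F1 score. Relation subtypes are ignored here (for example `nsubj:pass` counts as `nsubj`). The names `arcs` and `las` are hypothetical.

```python
from typing import Iterable, List, Tuple

Arc = Tuple[str, str]  # (HEAD, DEPREL) of one syntactic word


def arcs(lines: Iterable[str]) -> List[Arc]:
    """Collect (HEAD, DEPREL) pairs for every syntactic word in CoNLL-U lines."""
    result: List[Arc] = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        # Keep word lines only: comments, blank lines, multiword-token
        # ranges ("1-2") and empty nodes ("1.1") fail this check.
        if len(columns) == 10 and columns[0].isdigit():
            head = columns[6]
            deprel = columns[7].split(":")[0]  # drop the relation subtype
            result.append((head, deprel))
    return result


def las(gold: List[Arc], predicted: List[Arc]) -> float:
    """Percentage of words whose head and relation both match the gold arc."""
    if len(gold) != len(predicted):
        raise ValueError("this simplified LAS requires identical tokenization")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)
```

With two files on disk, the score would be `las(arcs(open("gold.conllu", encoding="utf-8")), arcs(open("output_path.conllu", encoding="utf-8")))`, where `gold.conllu` is a placeholder file name.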

License

Citation

@InProceedings{rybak-wrblewska:2018:K18-2,
  author    = {Rybak, Piotr  and  Wr{\'{o}}blewska, Alina},
  title     = {Semi-Supervised Neural System for Tagging, Parsing and Lematization},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {45--54},
  url       = {http://www.aclweb.org/anthology/K18-2004}
}