# emBERT

emtsv module for pre-trained Transformer-based models. It provides tagging models based on Hugging Face's transformers package.

emBERT defines the following tools:

| Name(s) | Task | Training corpus | F1 score |
|---------|------|-----------------|----------|
| `bert-ner` | NER | Szeged NER corpus | **97.08%** |
| `bert-basenp` | base NP chunking | Szeged TreeBank 2.0 | **95.58%** |
| `bert-np` (or `bert-chunk`) | maximal NP chunking | Szeged TreeBank 2.0 | **95.05%** |

(The results in bold are state-of-the-art for Hungarian.)
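Once installed as an emtsv module, the tools are selected by name in the emtsv pipeline. The invocation below is only a sketch: it assumes emtsv's usual comma-separated module syntax and that tokenized input is the only prerequisite of `bert-ner`; the exact module chain is described in the emtsv documentation.

```
# Hypothetical example: tokenize the input sentence with emtsv, then tag
# named entities with the bert-ner model. Module prerequisites may differ.
echo "Kovács János Budapesten dolgozik." | python3 ./main.py tok,bert-ner > output.tsv
```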

Due to their size (a little over 700 MB apiece), the models are stored in a separate repository. emBERT-models is a submodule of this repository, so if it is cloned recursively (with git LFS available), the models are downloaded as well:

```
git clone --recursive https://github.com/DavidNemeskey/emBERT.git
```
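If git LFS has not been set up on the machine yet, it has to be enabled once before cloning. This is a standard git-lfs step, not specific to this repository:

```
git lfs install
```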

Alternatively, the models can be obtained via emtsv's `download_models.py` script.
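The exact arguments of that script, if any, are described in the emtsv documentation; a bare invocation from the emtsv directory might look like:

```
# Run from the emtsv checkout; see the emtsv docs for the precise arguments.
python3 download_models.py
```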

If you use emBERT in your work, please cite the following paper (in Hungarian; see the link for the BibTeX entry):

Nemeskey Dávid Márk 2020. Egy emBERT próbáló feladat. In Proceedings of the 16th Conference on Hungarian Computational Linguistics (MSZNY 2020). pp. 409-418.

## Training

Should the need arise to train a better model, or to build one for a different domain, the `train_embert.py` script can be used to fine-tune a BERT model on a token classification task. An example run that reproduces the chunking results (given the same train-validation-test split and a GPU with at least 11 GB of memory) is:

```
train_embert.py --data_dir ~/data/chunking/szeged_max_bioe1_100/ \
    --bert_model SZTAKI-HLT/hubert-base-cc --task_name szeged_bioes_chunk \
    --data_format tsv --output_dir bert_np --do_train --max_seq_length 384 \
    --num_train_epochs=4 --train_batch_size 10 --learning_rate "1e-5" \
    --do_eval --eval_batch_size 1 --use_viterbi --seed 42
```

Note that if the model is trained on a new tag set, the tag set has to be added to `embert/processors.py`.