Download the models here:
Finnish-English bilingual BERT models (biBERTs) are versions of Google's BERT model trained on both English and Finnish.
Two versions of biBERT have been trained: one with a custom 70k wordpiece vocabulary, the other with an 80k vocabulary. These sizes are roughly the sum of the vocabularies of Google's English BERT (30k) and FinBERT (50k). The vocabulary makes no distinction between English and Finnish wordpieces, i.e., the models use a single joint vocabulary for both languages.
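To illustrate what a joint vocabulary means in practice, here is a toy sketch of greedy longest-match-first WordPiece tokenization over a vocabulary containing English and Finnish pieces side by side. The tiny vocabulary and the helper function below are hypothetical, purely for illustration; the real biBERT vocabularies have 70k/80k entries.

```python
def wordpiece_tokenize(word, vocab):
    """Split a word into wordpieces by greedy longest-match-first search."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: unknown token
        pieces.append(piece)
        start = end
    return pieces

# A tiny joint vocabulary: English and Finnish pieces coexist,
# with nothing marking which language a piece belongs to.
vocab = {"talo", "##ssa", "house", "##s", "play", "##ing"}

print(wordpiece_tokenize("talossa", vocab))  # ['talo', '##ssa'] ("in the house")
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
```

Because the vocabulary is shared, the same tokenizer handles text in either language (or mixed text) without any language identification step.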
The models' performance has been evaluated on both English and Finnish benchmarks; the Finnish benchmarks are the same ones used to evaluate FinBERT.
biBERTs are trained on both English and Finnish data (no parallel corpus is used). For English, Wikipedia and a reconstructed BookCorpus were used for pre-training. For Finnish, the pre-training data are the same as FinBERT's: Finnish news, online discussion, and internet crawl data.
[code][Yle data] [Ylilauta data]
Evaluation on the FiNER corpus (Ruokolainen et al. 2019)
Model | Accuracy |
---|---|
FinBERT | 92.40% |
biBERT 70k | 92.34% |
biBERT 80k | 92.23% |
Multilingual BERT | 90.29% |
FiNER-tagger (rule-based) | 86.82% |
(FiNER tagger results from Ruokolainen et al. 2019)
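The accuracy figures above are token-level tag accuracy: the fraction of tokens whose predicted tag matches the gold annotation. A minimal sketch with hypothetical toy tag sequences (not the actual evaluation script):

```python
def tag_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    assert len(gold) == len(pred), "sequences must be aligned"
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Toy NER example: one of five tokens is mistagged.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O",     "O", "B-LOC", "O"]
print(tag_accuracy(gold, pred))  # 0.8
```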
Evaluation on three Finnish corpora annotated with Universal Dependencies part-of-speech tags: the Turku Dependency Treebank (TDT), FinnTreeBank (FTB), and Parallel UD treebank (PUD)
Model | TDT | FTB | PUD |
---|---|---|---|
FinBERT | 98.23% | 98.39% | 98.08% |
biBERT 70k | 98.17% | 98.30% | 98.08% |
biBERT 80k | 98.14% | 98.16% | 98.07% |
Multilingual BERT | 96.97% | 95.87% | 97.58% |
Evaluation on dependency parsing, using the same corpora as in POS tagging: TDT, FTB, and PUD.
Labeled attachment score (LAS) parsing results are reported for both predicted (p.seg.) and gold (g.seg.) segmentation.
Model | TDT (p.seg. / g.seg.) | FTB (p.seg. / g.seg.) | PUD (p.seg. / g.seg.) |
---|---|---|---|
FinBERT | 91.93% / 93.56% | 92.16% / 93.95% | 92.54% / 93.10% |
biBERT 70k | 91.35% / 92.93% | 91.68% / 93.49% | 92.18% / 92.74% |
biBERT 80k | 91.59% / 93.16% | 91.76% / 93.50% | 92.37% / 93.02% |
Multilingual BERT | 86.32% / 87.99% | 85.52% / 87.46% | 89.18% / 89.75% |
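The labeled attachment score used above counts a token as correct only if both its predicted head and its dependency label match the gold annotation. A minimal sketch, assuming gold-aligned tokenization and hypothetical toy trees (the real evaluation uses the standard UD tooling):

```python
def las(gold, pred):
    """Labeled attachment score over (head_index, dep_label) pairs.

    A token scores only if both its head and its label are correct.
    """
    assert len(gold) == len(pred), "trees must have the same tokens"
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

# Toy 4-token sentence: one token gets the right head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "amod")]
print(las(gold, pred))  # 0.75
```

With predicted segmentation (p.seg.), system tokens must first be aligned to gold tokens, and segmentation mistakes cost attachment points, which is why the p.seg. scores in the table are consistently below the g.seg. scores.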