biBERT

Quickstart

Download the models here:

What's this?

Finnish-English bilingual BERT models (biBERTs) are versions of Google's BERT model pretrained from scratch on both English and Finnish.

Two versions of biBERT have been trained: one with a custom 70k wordpiece vocabulary and one with an 80k vocabulary. These sizes roughly match the sum of the vocabularies of Google's English BERT (30k) and FinBERT (50k). The vocabulary makes no distinction between English and Finnish wordpieces; each model uses a single joint vocabulary for both languages.
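As a minimal sketch of what the joint vocabulary means in practice (assuming the downloaded checkpoint has been converted to the Hugging Face transformers format; the local path "biBERT-80k" below is hypothetical):

```python
from transformers import BertTokenizer

# Hypothetical local path: assumes the biBERT checkpoint has been
# converted to the Hugging Face transformers format.
tokenizer = BertTokenizer.from_pretrained("biBERT-80k")

# English and Finnish map into the same wordpiece inventory; there is
# no language ID and no per-language vocabulary.
print(tokenizer.tokenize("The cat sat on the mat."))
print(tokenizer.tokenize("Kissa istui matolla."))
```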

The models' performance has been evaluated on both English and Finnish benchmarks; the Finnish benchmarks are the same ones used to evaluate FinBERT.

Data

biBERTs are trained on both English and Finnish data; no parallel corpus is used. For English, pre-training used Wikipedia and a reconstructed BookCorpus. For Finnish, the same data used to train FinBERT were used: Finnish news, online discussion, and an internet crawl.

Results

Document classification

(Figure: learning curves for Yle and Ylilauta document classification.)

[code] [Yle data] [Ylilauta data]
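The linked [code] is the authors' actual evaluation setup; the sketch below only illustrates the general fine-tuning plumbing with Hugging Face transformers. The checkpoint path, label count, and class index are assumptions, not taken from the repository:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical local checkpoint path; num_labels would match the task
# (10 here is an illustrative assumption, not the Yle/Ylilauta spec).
tokenizer = BertTokenizer.from_pretrained("biBERT-80k")
model = BertForSequenceClassification.from_pretrained("biBERT-80k", num_labels=10)

# One illustrative training step; a real run iterates over the corpus.
batch = tokenizer(["Esimerkkiuutinen kotimaan taloudesta."],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([3])  # hypothetical gold class index
loss = model(**batch, labels=labels).loss
loss.backward()
print(float(loss))
```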

Named Entity Recognition

Evaluation on the FiNER corpus (Ruokolainen et al., 2019)

Model                      Accuracy
FinBERT                    92.40%
biBERT 70k                 92.34%
biBERT 80k                 92.23%
Multilingual BERT          90.29%
FiNER-tagger (rule-based)  86.82%

(FiNER-tagger results are from Ruokolainen et al., 2019.)

[code] [data]
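Again, the linked [code] is the authors' evaluation; the following is only a hedged sketch of token-classification inference with a biBERT checkpoint. The path and label set are hypothetical, and the real FiNER inventory has more entity types:

```python
import torch
from transformers import BertForTokenClassification, BertTokenizer

# Truncated illustrative label set; the full FiNER scheme is larger.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = BertTokenizer.from_pretrained("biBERT-70k")  # hypothetical path
model = BertForTokenClassification.from_pretrained("biBERT-70k",
                                                   num_labels=len(labels))

# Without fine-tuning, the classification head is randomly initialized;
# this only shows the inference plumbing, not real NER output.
enc = tokenizer("Sanna Marin vieraili Turussa.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, seq_len, num_labels)
tags = [labels[i] for i in logits.argmax(-1)[0].tolist()]
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
print(list(zip(tokens, tags)))
```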

Part-of-speech tagging

Evaluation on three Finnish corpora annotated with Universal Dependencies part-of-speech tags: the Turku Dependency Treebank (TDT), FinnTreeBank (FTB), and Parallel UD treebank (PUD)

Model              TDT     FTB     PUD
FinBERT            98.23%  98.39%  98.08%
biBERT 70k         98.17%  98.30%  98.08%
biBERT 80k         98.14%  98.16%  98.07%
Multilingual BERT  96.97%  95.87%  97.58%

[code] [data]

Dependency parsing

Evaluation on the same corpora used for POS tagging: TDT, FTB, and PUD

Labeled attachment score (LAS) for predicted (p.seg.) and gold (g.seg.) segmentation.

Model              TDT p.seg.  TDT g.seg.  FTB p.seg.  FTB g.seg.  PUD p.seg.  PUD g.seg.
FinBERT            91.93%      93.56%      92.16%      93.95%      92.54%      93.10%
biBERT 70k         91.35%      92.93%      91.68%      93.49%      92.18%      92.74%
biBERT 80k         91.59%      93.16%      91.76%      93.50%      92.37%      93.02%
Multilingual BERT  86.32%      87.99%      85.52%      87.46%      89.18%      89.75%

[code] [data]
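For reference, LAS counts a token as correct only when both its predicted head and its dependency label match the gold annotation. Under predicted segmentation (p.seg.), the system and gold tokenizations must additionally be aligned; the toy sketch below sidesteps that by assuming gold segmentation:

```python
def las(gold, pred):
    """gold, pred: one (head_index, dep_label) pair per token,
    assuming gold segmentation so the token lists line up."""
    assert len(gold) == len(pred)
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold)

# Toy three-token sentence: the third token gets the wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]
print(f"LAS = {las(gold, pred):.2%}")   # -> LAS = 66.67%
```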
