Welcome to project Julibert, Softcatalà's playground for BERT models
BERT models were introduced by Google in 2018 and achieved state-of-the-art performance on a number of natural language understanding tasks:
- GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
- SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0.
- SWAG (Situations With Adversarial Generations)
Google has published two sets of models:
- Single-language models for English and Chinese
- Multilingual models where a single model covers 104 languages
Multilingual models have been shown to perform poorly compared to single-language models. Several language communities, such as French, Finnish, and Spanish, have been working on creating language-specific models that outperform Google's multilingual models.
BERT presents several problems for minority languages:
- It's expensive to train: Training of BERT-base was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERT-large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.
- It's expensive to run predictions (inference)
Since BERT was published, several derived versions have appeared to address these problems: RoBERTa, MiniBERT, etc.
This project has two goals:
- Publish a Catalan model at https://huggingface.co/models and make it available to the NLP community.
- Explore our hypothesis that BERT-like models can be leveraged to improve LanguageTool's grammar correction capabilities, essentially by using BERT to judge whether a word is plausible at a given position in a sentence (see the sketch below).
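As a rough illustration, a fill-mask pipeline can be used to check whether a candidate word appears among the model's top completions for a masked position. This is a minimal sketch, not the actual LanguageTool integration: the is_plausible helper, the example sentence, and the "top predictions" criterion are illustrative assumptions, and it assumes the julibert/ directory has been downloaded as described below.

from transformers import pipeline

# Point the pipeline at the unzipped julibert/ directory (see the usage
# instructions below).
fill_mask = pipeline("fill-mask", model="julibert/", tokenizer="julibert/")

# Hypothetical helper: is `word` among the model's top completions for the
# <mask> position? If not, the word may be suspicious in that context.
def is_plausible(masked_sentence, word):
    predictions = fill_mask(masked_sentence)
    return any(word in p["sequence"] for p in predictions)

# Hypothetical example: check whether "juguen" (they play) fits the context.
print(is_plausible("Els nens <mask> al parc.", "juguen"))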
We trained the model following the SpanBERTa instructions.
The Python training scripts are in this repository. The main settings were (a minimal sketch of the setup follows this list):
- Corpus: OSCAR Catalan corpus (3.8 GB)
- Tokenizer: ByteLevelBPETokenizer
- Model type: RoBERTa
- Vocabulary size: 50265
- Steps: 500000
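For orientation only, the core of such a setup (following the SpanBERTa tutorial and the parameters above) looks roughly like this. The corpus path, min_frequency, and layer/head counts are assumptions, and the actual 500,000-step masked-language-model training loop run by the repository scripts is omitted.

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train the byte-level BPE tokenizer on the corpus (the path is a placeholder).
corpus_files = [str(p) for p in Path("data/oscar_ca/").glob("*.txt")]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=corpus_files, vocab_size=50265, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.save_model("julibert/")

# RoBERTa model initialized from scratch with the vocabulary size above.
config = RobertaConfig(vocab_size=50265, max_position_embeddings=514,
                       num_attention_heads=12, num_hidden_layers=12)
model = RobertaForMaskedLM(config=config)
print(model.num_parameters())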
https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-10-11.zip
A Catalan ALBERT (A Lite BERT). See: https://github.com/codegram/calbert
- Corpus: OSCAR Catalan corpus (3.8 GB)
- Tokenizer: SentencePiece
- Model type: ALBERT
- Vocabulary size: 30000
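For comparison, an ALBERT configuration with this vocabulary size can be declared in transformers as follows. This is only an illustrative sketch; the real training code lives in the calbert repository linked above.

from transformers import AlbertConfig, AlbertForMaskedLM

# ALBERT configuration with the vocabulary size above; the remaining
# hyperparameters are transformers defaults, not necessarily Calbert's.
config = AlbertConfig(vocab_size=30000)
model = AlbertForMaskedLM(config=config)
print(model.num_parameters())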
From Linux command line:
wget https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-11-10.zip
unzip julibert-2020-11-10.zip
pip install transformers torch
From Python 3:
from transformers import pipeline

# Point the pipeline at the unzipped julibert/ directory.
fill_mask = pipeline(
    "fill-mask",
    model="julibert/",
    tokenizer="julibert/"
)

# Predict the most likely words for the masked position.
predict = fill_mask("El tribunal considera provat que els acusats van <mask> gairebé 24 milions d'euros.")
print(predict)
Result:
[
{'sequence': "<s>El tribunal considera provat que els acusats van costar gairebé 24 milions d'euros.</s>", 'score': 0.33576342463493347, 'token': 14808 },
{'sequence': "<s>El tribunal considera provat que els acusats van invertir gairebé 24 milions d'euros.</s>", 'score': 0.06258589774370193, 'token': 14388 },
{'sequence': "<s>El tribunal considera provat que els acusats van pagar gairebé 24 milions d'euros.</s>", 'score': 0.05679689720273018, 'token': 4030},
{'sequence': "<s>El tribunal considera provat que els acusats van guanyar gairebé 24 milions d'euros.</s>", 'score': 0.03947337344288826, 'token': 3246},
{'sequence': "<s>El tribunal considera provat que els acusats van recaptar gairebé 24 milions d'euros.</s>", 'score': 0.035779498517513275, 'token': 14638}
Contact: Jordi Mas (jmas@softcatala.org)