Softcatala/julibert

Introduction

Welcome to project Julibert, Softcatalà's playground for BERT models.

BERT models were introduced by Google in 2018 and achieved state-of-the-art performance on a number of natural language understanding tasks:

  • GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
  • SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0
  • SWAG (Situations With Adversarial Generations)

Google has published two sets of models:

  • Single-language models for English and Chinese
  • Multilingual models where a single model covers 104 languages

Multilingual models have been shown to perform poorly compared to single-language models. Several language communities, such as the French, Finnish, and Spanish ones, have been working on creating language-specific models that outperform Google's multilingual models.

Challenges for minority languages

BERT poses several problems for minority languages:

  • It's expensive to train: Training of BERT-base was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERT-large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.
  • It's expensive to run predictions (inference)

Since BERT was published, several derived versions have appeared that address these problems: RoBERTa, ALBERT, DistilBERT, etc.

Goals of the project

This project has two goals:

Create a BERT-like model for the Catalan language

Publish a Catalan model at https://huggingface.co/models and make it available to the NLP community.

Evaluate its use as part of our grammar correction system

Our hypothesis is that we can leverage BERT-like models to improve LanguageTool's grammar correction capabilities, essentially using the model to judge whether a given word is plausible at its position in a sentence. A sketch of this idea follows.
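
The sketch below is a minimal illustration of that idea, not the actual LanguageTool integration. It masks a candidate word and checks whether the model still ranks it among its top predictions. It assumes the model directory from the Usage section below, a recent version of transformers where the fill-mask pipeline accepts a top_k argument and returns a token_str field, and the word_is_plausible helper is hypothetical.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="julibert/", tokenizer="julibert/")

def word_is_plausible(sentence, word, top_k=50):
    # Replace the first occurrence of the word with the mask token
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    # Ask the model for its top_k candidates for the masked position
    predictions = fill_mask(masked, top_k=top_k)
    # The word is plausible if the model predicts it back
    return any(p["token_str"].strip() == word for p in predictions)

print(word_is_plausible("Els acusats van pagar gairebé 24 milions d'euros.", "pagar"))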

Models

RoBERTa model

We trained the model following the SpanBERTa instructions.

The Python training scripts are in this repository.

  • Corpus: OSCAR Catalan corpus (3.8 GB)
  • Tokenizer: ByteLevelBPETokenizer (see the sketch below)
  • Model type: RoBERTa
  • Vocabulary size: 50265
  • Steps: 500000

The trained model can be downloaded from:

https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-10-11.zip
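
For reference, here is a minimal sketch of how a tokenizer and model with the configuration above are typically initialized with the Hugging Face libraries. This is not the exact training script; the corpus file name is illustrative.

import os
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train a byte-level BPE tokenizer on the corpus (file name is hypothetical)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_ca.txt"],
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("julibert", exist_ok=True)
tokenizer.save_model("julibert")

# Initialize a RoBERTa model for masked-language-model pretraining
config = RobertaConfig(vocab_size=50265)
model = RobertaForMaskedLM(config)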

Calbert model

A Catalan ALBERT (A Lite BERT). See: https://github.com/codegram/calbert

  • Corpus: OSCAR Catalan corpus (3.8 GB)
  • Tokenizer: SentencePiece (see the sketch below)
  • Model type: ALBERT
  • Vocabulary size: 30000
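
As with the RoBERTa model, the tokenizer step can be sketched briefly; this assumes the sentencepiece Python package and an illustrative corpus file name, not the exact calbert training setup.

import sentencepiece as spm

# Train a SentencePiece model on the corpus (file name is hypothetical)
spm.SentencePieceTrainer.train(
    input="oscar_ca.txt",
    model_prefix="calbert",
    vocab_size=30000,
)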

Usage

From a Linux command line, download and unpack the model, then install the dependencies:

wget https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-11-10.zip
unzip julibert-2020-11-10.zip

pip install transformers torch

From Python 3:

from transformers import pipeline

# Load the fill-mask pipeline with the local Julibert model and tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="julibert/",
    tokenizer="julibert/"
)

# Predict the most likely words for the masked position
predict = fill_mask("El tribunal considera provat que els acusats van <mask> gairebé 24 milions d'euros.")
print(predict)

Result:

[
 {'sequence': "<s>El tribunal considera provat que els acusats van costar gairebé 24 milions d'euros.</s>", 'score': 0.33576342463493347, 'token': 14808},
 {'sequence': "<s>El tribunal considera provat que els acusats van invertir gairebé 24 milions d'euros.</s>", 'score': 0.06258589774370193, 'token': 14388},
 {'sequence': "<s>El tribunal considera provat que els acusats van pagar gairebé 24 milions d'euros.</s>", 'score': 0.05679689720273018, 'token': 4030},
 {'sequence': "<s>El tribunal considera provat que els acusats van guanyar gairebé 24 milions d'euros.</s>", 'score': 0.03947337344288826, 'token': 3246},
 {'sequence': "<s>El tribunal considera provat que els acusats van recaptar gairebé 24 milions d'euros.</s>", 'score': 0.035779498517513275, 'token': 14638}
]

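The predictions come back sorted by descending score, so taking the best suggestion is straightforward:

# The first entry is the highest-scoring prediction
best = predict[0]
print(best['sequence'], best['score'])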

Contact

Email: Jordi Mas (jmas@softcatala.org)
