Introduction

Welcome to project Julibert, Softcatalà's playground for BERT models.

BERT models were introduced by Google in 2018 and achieved state-of-the-art performance on a number of natural language understanding tasks:

  • GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks)
  • SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0.
  • SWAG (Situations With Adversarial Generations)

Google has published two sets of models:

  • Single-language models for English and Chinese
  • Multilingual models where a single model covers 104 languages

It has been shown that the multilingual models perform poorly compared to single-language models. Several linguistic communities, such as French, Finnish or Spanish, have been working on creating language-specific models that outperform Google's multilingual models.

Challenges for minority languages

BERT presents several problems for minority languages:

  • It's expensive to train: Training of BERT-base was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total), and training of BERT-large was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.
  • It's expensive to run predictions (inference)

Since BERT was published, several derived versions have appeared that address these problems: RoBERTa, ALBERT, etc.

Goals of the project

This project has two goals:

Create a BERT-like model for the Catalan language

Publish a Catalan model at https://huggingface.co/models and make it available to the NLP community.

Evaluate its use as part of our grammar correction system

Our hypothesis is that we can leverage BERT-like models to improve LanguageTool's grammar correction capabilities, essentially using BERT to estimate whether a word is plausible at a given position in a sentence.
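
As an illustration of this idea, the sketch below masks each word of a sentence in turn and asks the model how plausible the original word is at that position. This is only a minimal sketch, not part of the repository: the julibert/ model directory is the one unpacked in the Usage section below, and the suspicious_words helper and its probability threshold are illustrative assumptions.

from transformers import pipeline

# Load the fill-mask pipeline from the local model directory (see the Usage section).
fill_mask = pipeline("fill-mask", model="julibert/", tokenizer="julibert/")

def suspicious_words(sentence, threshold=0.001):
    # Illustrative helper: flag words the model considers very unlikely in context.
    words = sentence.split()
    flagged = []
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + ["<mask>"] + words[i + 1:])
        # Ask for the score of the original word at the masked position.
        prediction = fill_mask(masked, targets=[word])[0]
        if prediction["score"] < threshold:
            flagged.append((word, prediction["score"]))
    return flagged

print(suspicious_words("El tribunal considera provat que els acusats van pagar gairebé 24 milions d'euros."))

In a grammar checker, the flagged words could then be handed to LanguageTool's rules to produce a concrete suggestion.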

Models

Roberta model

We trained the model following the SpanBERTa instructions.

The Python scripts are in this repository. A minimal sketch of the tokenizer training step follows the parameter list below.

  • Corpus: Oscar Catalan Corpus (3.8 GB)
  • Tokenizer: ByteLevelBPETokenizer
  • Model type: Roberta
  • Vocabulary size: 50265
  • Steps: 500000

https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-10-11.zip
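
The following is a minimal sketch of the tokenizer training step, assuming the Oscar Catalan corpus is available as a plain-text file named oscar_ca.txt (the file path and min_frequency value are illustrative; the vocabulary size and special tokens follow the parameters listed above).

import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the plain-text corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_ca.txt"],   # one document per line; file name is an assumption
    vocab_size=50265,         # vocabulary size used for the Julibert model
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save vocab.json and merges.txt so they can be used when training the Roberta model.
os.makedirs("julibert", exist_ok=True)
tokenizer.save_model("julibert")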

Calbert model

A Catalan ALBERT (A Lite BERT). See: https://github.com/codegram/calbert. A loading sketch follows the parameter list below.

  • Corpus: Oscar Catalan Corpus (3.8 GB)
  • Tokenizer: SentencePiece
  • Model type: ALBERT
  • Vocabulary size: 30000
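
A minimal sketch of loading Calbert through the transformers library; the model id codegram/calbert-base-uncased is an assumption based on the repository above, and the actual published name may differ. ALBERT models use [MASK] as the mask token.

from transformers import pipeline

# Load the Calbert fill-mask pipeline from the Hugging Face Hub (model id assumed).
calbert_fill_mask = pipeline(
    "fill-mask",
    model="codegram/calbert-base-uncased",
    tokenizer="codegram/calbert-base-uncased"
)

print(calbert_fill_mask("M'agrada molt la [MASK] catalana."))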

Usage

From Linux command line:

wget https://www.softcatala.org/pub/softcatala/julibert/julibert-2020-11-10.zip
unzip julibert-2020-11-10.zip 


pip install transformers torch

From Python 3:

from transformers import pipeline

# Load the fill-mask pipeline from the unpacked model directory
fill_mask = pipeline(
    "fill-mask",
    model="julibert/",
    tokenizer="julibert/"
)

# Ask the model to fill in the masked word
predict = fill_mask("El tribunal considera provat que els acusats van <mask> gairebé 24 milions d'euros.")
print(predict)

Result:

[
{'sequence': "<s>El tribunal considera provat que els acusats van costar gairebé 24 milions d'euros.</s>", 'score': 0.33576342463493347, 'token': 14808 },
{'sequence': "<s>El tribunal considera provat que els acusats van invertir gairebé 24 milions d'euros.</s>", 'score': 0.06258589774370193, 'token': 14388 },
{'sequence': "<s>El tribunal considera provat que els acusats van pagar gairebé 24 milions d'euros.</s>", 'score': 0.05679689720273018, 'token': 4030}, 
{'sequence': "<s>El tribunal considera provat que els acusats van guanyar gairebé 24 milions d'euros.</s>", 'score': 0.03947337344288826, 'token': 3246}, 
{'sequence': "<s>El tribunal considera provat que els acusats van recaptar gairebé 24 milions d'euros.</s>", 'score': 0.035779498517513275, 'token': 14638}


Contact

Jordi Mas: jmas@softcatala.org
