This repository contains the scripts to train neural machine translation models with OpenNMT, as well as the models published by Softcatalà.

For more information about training see the TRAINING document.

The corpora used to train these models are available here:

And here are the tools that Softcatalà uses to serve these models in production:


| Language pair | SC model BLEU | SC Flores200 BLEU | Google BLEU | Meta NLLB200 BLEU | Opus-MT BLEU | Sentences | Download model |
|---|---|---|---|---|---|---|---|
| German-Catalan | 34.8 | 28.9 | 35.5 | 30.7 | 18.5 | 3142257 | |
| Catalan-German | 28.5 | 25.4 | 32.9 | 29.1 | 15.8 | 3142257 | |
| English-Catalan | 46.9 | 43.8 | 46.0 | 41.7 | 29.8 | 7856208 | |
| Catalan-English | 47.4 | 43.5 | 47.0 | 48.0 | 29.6 | 7856208 | |
| French-Catalan | 41.3 | 31.6 | 37.3 | 33.3 | 27.2 | 2566302 | |
| Catalan-French | 41.4 | 35.4 | 41.7 | 39.6 | 27.9 | 2566302 | |
| Galician-Catalan | 74.1 | 31.4 | 36.5 | 33.2 | N/A | 2710149 | |
| Catalan-Galician | 80.7 | 31.9 | 33.1 | 31.7 | N/A | 2710149 | |
| Italian-Catalan | 39.7 | 26.5 | 30.6 | 27.8 | 22.0 | 2584598 | |
| Catalan-Italian | 36.2 | 24.5 | 27.5 | 26.0 | 19.2 | 2584598 | |
| Japanese-Catalan | 24.9 | 17.8 | 23.4 | N/A | N/A | 1997740 | |
| Catalan-Japanese | 21.3 | 19.8 | 32.5 | N/A | N/A | 1997740 | |
| Dutch-Catalan | 30.4 | 20.3 | 27.1 | 24.8 | 15.8 | 2208538 | |
| Catalan-Dutch | 27.6 | 18.2 | 23.4 | 21.8 | 13.4 | 2208538 | |
| Occitan-Catalan | 74.9 | 32.5 | N/A | 36.2 | N/A | 2711350 | |
| Catalan-Occitan | 78.8 | 28.9 | N/A | 27.8 | N/A | 2711350 | |
| Portuguese-Catalan | 41.6 | 33.9 | 38.7 | 34.5 | 28.1 | 2043019 | |
| Catalan-Portuguese | 39.0 | 32.3 | 40.0 | 36.5 | 27.5 | 2043019 | |
| Spanish-Catalan | 88.8 | 22.6 | 23.6 | 25.8 | 22.5 | 7596985 | |
| Catalan-Spanish | 87.5 | 24.2 | 24.2 | 25.5 | 23.2 | 7596985 | |


  • SC model BLEU is the Softcatalà model's BLEU score on the test split of the training corpus (train/dev/test).
  • SC Flores200 BLEU is the Softcatalà model's BLEU score on the Flores200 benchmark dataset. This provides an external evaluation.
  • Google BLEU is the BLEU score of Google Translate on the Flores200 benchmark.
  • Opus-MT BLEU is the BLEU score of the Opus-MT models on the Flores200 benchmark (our ambition is to outperform them).
  • Sentences is the number of sentences in the corpus used for training.
  • Meta NLLB200 refers to the nllb-200-3.3B model from Meta. This is a very slow model, and its distilled version performs significantly worse.


  • All models are based on TransformerRelative, and SentencePiece is used as the tokenizer.
  • We use Sacrebleu with the 13a tokenizer to calculate BLEU scores.
  • These models are used in production on modest hardware (CPU). As a result, they balance precision against latency. BLEU scores could be improved by about 1 point, but at a significant latency cost at inference.
  • BLEU is the most popular metric for evaluating machine translation, but it is broadly acknowledged to be imperfect. It is estimated to have a ~80% correlation with human judgment.
  • Flores200 has some limitations. It was produced by translating from English into many of the other languages. When you use Flores200 to benchmark, for example, Catalan-Spanish translation, consider that the Catalan and Spanish texts were each produced by translating from English. The resulting Spanish and Catalan translations differ from what a translator would produce when translating directly from Spanish to Catalan. In summary, Flores200 is more reliable for benchmarks where English is the source or target language.
  • The Occitan model is based on the Languedocian variant.
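
To make the metric above more concrete, here is a minimal sketch of what a corpus-level BLEU score computes: clipped n-gram precision for orders 1 to 4, combined by a geometric mean and scaled by a brevity penalty. This is not Sacrebleu and will not reproduce the table's numbers: it assumes whitespace tokenization instead of 13a, a single reference per sentence, and no smoothing. The function name `simple_corpus_bleu` is hypothetical and used only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), whitespace-tokenized, one reference per sentence."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # n-grams in the hypotheses, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Each hypothesis n-gram is credited at most as often as it occurs in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # this sketch applies no smoothing
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100.0; any missing or extra n-gram lowers the score. For real evaluations, use Sacrebleu as described above so that scores are comparable.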

Structure of the models

Description of the directories contained in each model's zip file:

  • tensorflow: model exported in Tensorflow format
  • ctranslate2: model exported in CTranslate2 format (used for inference)
  • metadata: description of the model
  • tokenizer: SentencePiece models for both languages
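
As a quick sanity check after unpacking, the layout above can be verified with a few lines of Python. This is only a sketch: the directory names come from the list above, while the helper name `missing_model_dirs` and the example path `eng-cat` are assumptions for illustration.

```python
from pathlib import Path

# Directory names taken from the list above
EXPECTED_DIRS = ("tensorflow", "ctranslate2", "metadata", "tokenizer")


def missing_model_dirs(model_root):
    """Return the expected subdirectories that are absent under model_root."""
    root = Path(model_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

For example, `missing_model_dirs("eng-cat")` returns an empty list when all four directories are present in an unpacked `eng-cat` model.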

Using the models

You can use the models with CTranslate2, which offers fast inference.

At Softcatalà we have also built command-line tools to translate TXT and PO files. See:

Download the model and unpack it:

Install dependencies:

pip3 install ctranslate2 pyonmttok

Simple translation using Python:

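import ctranslate2

# Load the CTranslate2 model from the unpacked directory
translator = ctranslate2.Translator("eng-cat/ctranslate2/")

# The input must already be split into SentencePiece tokens
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
# ['▁Hola', '▁món', '!']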

Simple tokenization & translation using Python:

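import ctranslate2
import pyonmttok

# Tokenize the input with the model's SentencePiece model
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path="eng-cat/tokenizer/sp_m.model")
tokens, _ = tokenizer.tokenize("Hello world!")

translator = ctranslate2.Translator("eng-cat/ctranslate2/")
translated = translator.translate_batch([tokens])

# Detokenize the best hypothesis back into plain text
print(tokenizer.detokenize(translated[0].hypotheses[0]))
# Hola món!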
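
The "▁" prefix on tokens such as "▁Hola" is SentencePiece's marker for a preceding space. The sketch below illustrates, in simplified form, how such pieces map back to plain text. It is not the pyonmttok implementation, which handles more cases, and the function name `detokenize_pieces` is hypothetical.

```python
def detokenize_pieces(pieces):
    """Join SentencePiece-style pieces; the '▁' marker stands for a preceding space."""
    return "".join(pieces).replace("▁", " ").strip()


print(detokenize_pieces(["▁Hola", "▁món", "!"]))  # Hola món!
```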

Training the models

To train the models you need a GPU.

Training on a machine

First you need to install the necessary packages:

make install

After this, download all the corpora:

make get-corpus

To train the English-Catalan model, type:

make train-eng-cat

Training using a Jupyter notebook

We recommend using Kaggle, which provides Jupyter notebooks with GPU access.

We have a Jupyter notebook that trains simple models so you can learn how to use this toolset.

