Create an engine from scratch

Davide Caroselli edited this page Aug 7, 2018 · 14 revisions

How to prepare your data

The easiest way to improve translation quality is to add more in-domain data. MMT uses standard sentence-aligned corpora, provided either as TMX files or as pairs of parallel text files (one per language) that together represent a memory.

Note: by design, ModernMT requires corpus files to be UTF-8 encoded. If your files use a different encoding, convert them to UTF-8 before using them.
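One possible way to meet the UTF-8 requirement is the standard iconv tool. The sketch below is not part of MMT: the helper names to_utf8 and is_utf8 are hypothetical, and you need to know (or guess) the source encoding of your files yourself.

```shell
# Convert a text file from a known source encoding to UTF-8.
# Usage: to_utf8 <source-encoding> <input-file> <output-file>
to_utf8() {
  iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Succeed if the file decodes cleanly as UTF-8, fail otherwise.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

For example, `to_utf8 ISO-8859-1 wmt10.fr wmt10.utf8.fr` would write a converted copy, and `is_utf8 wmt10.utf8.fr` can then be used to confirm the result before training.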

Example:

data/europarl.tmx
data/wmt10.en
data/wmt10.fr

In general:

memory-name.(2-letter ISO language code | 5-character RFC 3066 tag)

Note: memory-name must contain only characters in [a-zA-Z0-9], with no spaces.
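The naming rule above can be checked mechanically before training. This is a minimal sketch, not an MMT feature: the function name is_valid_corpus_name is hypothetical, and the regular expression assumes that a 5-character RFC 3066 tag has the shape of en-US (two letters, a hyphen, two letters) and that TMX files follow the same memory-name rule.

```shell
# Succeed if the file name looks like <memory-name>.<lang> or <memory-name>.tmx,
# where <memory-name> uses only [a-zA-Z0-9].
# Assumption: <lang> is either a 2-letter code (en) or a 5-character tag (en-US).
is_valid_corpus_name() {
  basename "$1" |
    grep -Eq '^[a-zA-Z0-9]+\.(tmx|[a-z]{2}|[a-z]{2}-[a-zA-Z]{2})$'
}
```

Running it over a data directory, for instance with `for f in data/*; do is_valid_corpus_name "$f" || echo "bad name: $f"; done`, lists the files that would need renaming.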

Get more data

If you need more data there is a good collection here:

Creating a large translation model

If you want to try creating a 1B-word engine, you can download the WMT 10 corpus from here:

$ wget http://www.statmt.org/wmt10/training-giga-fren.tar

Untar the archive, place the unzipped giga-fren.release2.XX corpus in a training directory (e.g. wmt-train-dir), and run:

$ ./mmt create en fr wmt-train-dir

The corpus contains 575,799,111 source tokens and 1,247,735,635 total words.
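The unpacking step described above can be sketched as a small helper. It assumes the tar archive contains gzip-compressed files, which is what "unzipped" suggests but which you should verify on your copy; the function name unpack_corpus is hypothetical.

```shell
# Extract a tar archive into a training directory, then decompress
# any .gz files found inside it.
# Usage: unpack_corpus <archive.tar> <training-dir>
unpack_corpus() {
  mkdir -p "$2" &&
    tar -xf "$1" -C "$2" &&
    find "$2" -type f -name '*.gz' -exec gunzip {} +
}
```

With the download from above, this would be called as `unpack_corpus training-giga-fren.tar wmt-train-dir`, followed by the `./mmt create en fr wmt-train-dir` command.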

Engine Creation

Once you have collected all the data you want to use for your engine, you need to create it.
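Before creating the engine, a quick sanity check on parallel files can catch misaligned data early. The sketch below assumes one sentence per line, as is conventional for parallel plain-text corpora, so both files of a pair should have the same number of lines. The function name same_line_count is hypothetical and the check is not something MMT requires you to run.

```shell
# Succeed only if both files of a parallel pair have the same number of lines.
# Returns 2 if either file cannot be read.
same_line_count() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  # tr strips the padding that some wc implementations add around the count
  src_lines=$(wc -l < "$1" | tr -d ' ')
  tgt_lines=$(wc -l < "$2" | tr -d ' ')
  [ "$src_lines" -eq "$tgt_lines" ]
}
```

For the example files shown earlier, `same_line_count data/wmt10.en data/wmt10.fr || echo "line counts differ"` would flag a pair that is no longer aligned.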

To create an engine, just open a shell and run:

$ ./mmt create <your_source_language> <your_target_language> /path/to/your/data

By default, this process runs indefinitely until it is interrupted, either by pressing Ctrl+C or by running the equivalent command kill -2 <process-pid> from another shell.

If you want your engine to handle more than one language pair, edit the <your_mmt_home>/engines/<your_engine_name>/engine.xconf configuration file: under the engine XML element, add a languages child element containing a pair node for each language pair you need. You can find more information and examples in our advanced configurations page.

Then start your engine and use the corresponding REST API to import the data for your new language pairs.