Create an engine from scratch
The easiest way to improve translation quality is to add more in-domain data. MMT uses standard sentence-aligned corpora, provided either as TMX files or as pairs of parallel files (one per language) representing memories.
Note: by design, ModernMT requires corpus files to be UTF-8 encoded. If your files use a different encoding, convert them to UTF-8 before using them.
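As a minimal sketch of that conversion, the shell functions below check whether a file is already valid UTF-8 and convert it with iconv if it is not. The function names and the ISO-8859-1 source encoding in the usage lines are illustrative assumptions, not part of ModernMT; substitute the actual encoding of your files.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ModernMT) for preparing corpus files.

# Succeeds if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Convert file $2 from encoding $1 to UTF-8, writing the result to $3.
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2" > "$3"
}

# Usage (assumes the input is ISO-8859-1; adjust to your data):
#   is_utf8 data/wmt10.fr || {
#       to_utf8 ISO-8859-1 data/wmt10.fr /tmp/wmt10.fr && mv /tmp/wmt10.fr data/wmt10.fr
#   }
```

Writing to a temporary file and moving it back keeps the original file name, so the memory name is unchanged.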
Example:
data/europarl.tmx
data/wmt10.en
data/wmt10.fr
In general:
memory-name.(2-letter ISO language code|5-character RFC 3066 tag)
Note: memory-name must be [a-zA-Z0-9] only, without spaces.
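The naming rule for parallel files can be checked before training with a small shell function. This is only a sketch: the function name is hypothetical, and it assumes the 5-character form is two letters, a hyphen, and two letters (for example en-US). TMX files are not covered by this check.

```shell
#!/bin/sh
# Hypothetical helper (not part of ModernMT): succeeds if a file name looks
# like memory-name.<lang>, where memory-name uses [a-zA-Z0-9] only and <lang>
# is a 2-letter code (en) or a 5-character tag (en-US).
# Assumption: the 5-character form is two letters, a hyphen, two letters.
is_valid_corpus_name() {
    printf '%s\n' "$(basename "$1")" |
        grep -Eq '^[a-zA-Z0-9]+\.([a-zA-Z]{2}|[a-zA-Z]{2}-[a-zA-Z]{2})$'
}

# Usage: report files in a training directory that break the convention.
#   for f in data/*; do
#       is_valid_corpus_name "$f" || echo "unexpected name: $f"
#   done
```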
If you need more data, a good collection is available here:
- Parallel data: Opus Website (moses format)
If you want to try creating a 1-billion-word engine, you can download the WMT 10 corpus from here:
$ wget http://www.statmt.org/wmt10/training-giga-fren.tar
Untar the archive, place the unzipped giga-fren.release2.XX corpus files in a training directory (e.g. wmt-train-dir), and run:
$ ./mmt create en fr wmt-train-dir
The corpus contains 575,799,111 source tokens and 1,247,735,635 total words.
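The unpacking step above can be sketched as a small shell function. The function name is hypothetical, and the sketch assumes the archive holds gzip-compressed corpus files at its top level; check the actual contents of the archive before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: extract a tar archive into a training directory and
# decompress any .gz files found at the top level of that directory.
unpack_corpus() {
    archive="$1"
    train_dir="$2"
    mkdir -p "$train_dir" || return 1
    tar -xf "$archive" -C "$train_dir" || return 1
    for f in "$train_dir"/*.gz; do
        # Skip the unexpanded pattern when no .gz files are present.
        if [ -e "$f" ]; then
            gunzip "$f" || return 1
        fi
    done
}

# Usage:
#   unpack_corpus training-giga-fren.tar wmt-train-dir
#   ./mmt create en fr wmt-train-dir
```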
Once you have collected all the data you want to use for your engine, you need to create it.
To create an engine, just open a shell and run:
./mmt create <your_source_language> <your_target_language> /path/to/your/data
By default, this process runs indefinitely, until you press Ctrl+C or send the equivalent signal by running kill -2 <process-pid> in a shell.
If you want your engine to handle more than one language pair, edit your <your_mmt_home>/engines/<your_engine_name>/engine.xconf configuration file and add, under the XML element engine, a child element languages containing the needed pair nodes.
You can find information and examples on this in our advanced configurations page.
Then start your engine and import the data for your new language pairs into it using the corresponding REST API.