TransIns - Document Translation with Markup Reinsertion

V1.0.0

written by Jörg Steffen
DFKI MLT-Lab

Overview

TransIns uses the Okapi framework to parse documents of supported formats (MS Office, OpenOffice, HTML and plain text) into a representation that preserves the document markup and allows access to the document’s text content on a sentence level. The sentences, without the markup, are translated by the neural machine translation framework MarianNMT using translation models provided by OPUS-MT. Afterwards, the markup is reinserted into the translated sentences based on token alignments. Finally, a translated document is provided in the original format.

We implement the following strategies for reinserting markup into the translated sentence using the tokens' alignments:

  • mtrain: A strategy that assigns markup to the token next to it, as described in a paper by Matthias Müller and implemented in the Zurich NLP mtrain Python package

  • mtrain++: An improved version of the mtrain strategy

  • Complete Mapping Strategy (CMS): A strategy assigning markup to all tokens in the markup’s scope
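
To illustrate how token alignments can drive markup reinsertion, here is a minimal sketch in the spirit of the complete mapping idea. It is a simplified illustration, not the TransIns implementation. It assumes that alignments are given as (source index, target index) pairs, that the markup's scope is a set of source token indices, and that each contiguous run of tagged target tokens is wrapped in its own tag pair. The function name and data layout are hypothetical.

```python
def reinsert_tag(target_tokens, scope, alignments, open_tag, close_tag):
    """Wrap each run of target tokens aligned to a source token in `scope`.

    target_tokens: list of translated tokens
    scope:         set of source token indices covered by the markup
    alignments:    iterable of (source_index, target_index) pairs
    """
    # Target positions that inherit the markup from an aligned source token.
    tagged = {t for s, t in alignments if s in scope}

    out = []
    inside = False
    for i, token in enumerate(target_tokens):
        if i in tagged and not inside:
            out.append(open_tag)   # a tagged run starts here
            inside = True
        elif i not in tagged and inside:
            out.append(close_tag)  # the tagged run ended before this token
            inside = False
        out.append(token)
    if inside:
        out.append(close_tag)      # run reaches the end of the sentence
    return out


# Source: "ein <b>grünes</b> Haus", translation: "une maison verte".
# The adjective moves to the end, and the markup follows it.
source_scope = {1}                        # index of "grünes"
alignment = [(0, 0), (1, 2), (2, 1)]      # ein-une, grünes-verte, Haus-maison
print(reinsert_tag(["une", "maison", "verte"], source_scope,
                   alignment, "<b>", "</b>"))
```

Because every token in the markup's scope contributes its alignment, the tag survives word reordering; a strategy that only looks at the token next to the tag has less information to work with in such cases.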

An online demonstrator of TransIns is available here.

Citation

If you use TransIns in your research, please cite the following paper:

@inproceedings{steffen-van-genabith-2021-transins,
    title = "{T}rans{I}ns: Document Translation with Markup Reinsertion",
    author = {Steffen, J{\"o}rg  and
      van Genabith, Josef},
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.4",
    pages = "28--34"
}

Requirements

TransIns was tested on openSUSE Leap 15.1. For local deployment, the following software is required:

  • Java 11 JDK or higher

  • Maven build tool, version 3.5 or higher

  • Conda package and environment management system, version 4.9.2 or higher

  • Marian NMT neural machine translation framework, version 1.7

TransIns also provides Docker images for easy deployment (see Docker Deployment). This requires Docker and docker-compose.

For efficient performance, we recommend running Marian NMT on a server with at least one GPU. Our setup uses two GeForce RTX 2080 Ti GPUs with 11 GB of memory each.

Content

This distribution consists of the following files:

  • README.adoc: this readme file

  • pom.xml: Maven build tool descriptor file

  • environment.yml: Conda environment definition file

  • Dockerfile-*: Docker image descriptions for the three TransIns components

  • build-transins-docker-images.sh: a convenience script for building TransIns Docker images

  • download-opus-models.sh: script for downloading publicly available OPUS MT models for the translation directions de → en, en → de, de → fr and fr → de and deploying them in TransIns

  • docker-compose.yml: docker-compose file to run TransIns in Docker containers

  • evaluation: German evaluation documents with annotated translations to French and English, using both TransIns and 3rd party translation services

  • src/main/java/*: Java source code of the core TransIns component, including an embedded server for web page and REST API

  • src/main/python: Python source code of the pre- and postprocessing server

  • src/main/web: Source files of the TransIns web page

  • src/main/resources:

    • transIns.cfg: TransIns configuration file; includes configuration of embedded server for web page and REST API as well as Marian NMT translator configurations for each supported translation direction

    • okapi: configuration files of Okapi modules

    • apertium-translator.cfg, microsoft-translator.cfg: Okapi connector configuration files for alternative translation services

    • logback.xml: TransIns logger configuration

    • log4j2.xml, commons-logging.properties, simplelog.properties: logger configuration for third-party libraries

  • src/main/resources-misc:

    • prepostprocessing: configuration folder of pre-/postprocessing server; run ./download-opus-models.sh to put required resources for pre-/postprocessing here

    • marian-nmt: configuration folder of Marian NMT translation server; contains a subfolder for each supported translation direction; run ./download-opus-models.sh to put required MT models here

  • src/test/java: Java source code of unit tests and demos

  • src/test/resources: files used in demos

Installation

TransIns consists of the following components:

  • the TransIns core component, including web server and REST API

  • the server for handling pre- and postprocessing of sentences to translate; one instance handles all supported translation directions

  • the Marian NMT server for translating sentences; one instance for each supported translation direction

Quickstart

If you have a server with two GPUs, each with at least 8 GB of memory, run the following commands in the top level folder of the distribution:

  • ./build-transins-docker-images.sh

  • ./download-opus-models.sh

  • docker-compose up -d

TransIns is now available at http://localhost:7777, providing de → en, en → de, de → fr and fr → de translations.

TransIns Core Component

To compile the Java source code for the TransIns core component, run mvn clean install in the top level folder of the distribution (where pom.xml is located). Start the server providing the web page and REST API by running the following command:

java -cp src/main/resources:target/*:target/lib/* \
  de.dfki.mlt.transins.server.TransInsServer

Configuration files are loaded from the Java classpath via src/main/resources/. Please note that the connection settings in transIns.cfg are set by default for deployment in Docker and have to be adapted for local deployment.

The TransIns web page can be accessed at http://localhost:7777; the REST API is available at http://localhost:7777/* (see REST API for more details).

TransIns Pre-/Postprocessing Server

To run the pre-/postprocessing server, several Python packages have to be installed first. We suggest creating a dedicated Conda environment by running conda create --name transins and then switching to it by running conda activate transins. Install the required packages by running the following Conda commands:

conda install -c anaconda flask=2.2.2
conda install -c anaconda waitress=1.4.3
conda install -c conda-forge sacremoses=0.0.43
conda install -c conda-forge sentencepiece=0.1.95
conda install -c anaconda pip=20.2.4
pip install subword-nmt==0.3.7

Alternatively, the creation of the Conda environment and the installation of the required packages can also be done by a single Conda command using the environment definition file: conda env create -f environment.yml

Start the pre-/postprocessing server by running the following command:

python src/main/python/PrePostProcessingServer.py \
  --config_folder src/main/resources-misc/prepostprocessing \
  --port 5000

The configuration of the pre-/postprocessing server is loaded from the file config.ini in the provided configuration folder. Make sure you have run ./download-opus-models.sh to put the required resources for pre-/postprocessing into the configuration folder.

The pre-/postprocessing server is now accessible via a REST API running at http://localhost:5000.

TransIns Marian NMT Server

Install Marian NMT on your system following these instructions. Start the translation server for a specific translation direction by running the following command in the top level folder of the distribution (assuming that marian-server is on your $PATH):

marian-server --config src/main/resources-misc/marian-nmt/<trans-dir>/config.yml

Make sure you have run ./download-opus-models.sh to put the required models into subfolders of marian-nmt. Please note that the provided config.yml configurations assume a GPU with at least 4 GB of free memory.

The Marian NMT translation server is now accessible via a web socket running at ws://localhost:8080/translate.
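
For a quick manual check of a running Marian NMT server, a small web socket client can be used. The following is a minimal sketch, modeled on the example client shipped with Marian. It assumes the third-party websocket-client package is installed, that sentences are sent as one newline-separated message, and that the input has already been preprocessed in the way the model expects (in TransIns this is the job of the pre-/postprocessing server). The helper names are hypothetical.

```python
def make_batch(sentences):
    """Join sentences into one newline-separated message for the server."""
    return "\n".join(s.strip() for s in sentences) + "\n"


def translate_batch(sentences, url="ws://localhost:8080/translate"):
    """Send one batch to the Marian server and return the translated lines."""
    # Imported lazily so that make_batch works without the package installed.
    # Assumption: installed via `pip install websocket-client`.
    from websocket import create_connection

    ws = create_connection(url)
    try:
        ws.send(make_batch(sentences))
        # The server answers with one translated line per input line.
        return ws.recv().rstrip("\n").split("\n")
    finally:
        ws.close()


if __name__ == "__main__":
    for line in translate_batch(["Hallo Welt !"]):
        print(line)
```

Raw, untokenized input will most likely produce poor translations, because the models expect the same preprocessing that was applied during training.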

Docker Deployment

Instead of installing the TransIns components as described above, we also provide Docker images for easy deployment. The Docker image for each component is defined in the corresponding Dockerfile. Build the TransIns Docker images by running ./build-transins-docker-images.sh in the top level folder of the distribution.

Start TransIns with all supported translation directions (de → en, en → de, de → fr, fr → de) by running docker-compose up -d. Each TransIns component runs in a separate container within a Docker network. Note that the configuration folders of both the pre-/postprocessing server and the Marian NMT servers are passed as bind mounts to the corresponding Docker containers.

By default, the MT models for de → en and en → de are deployed on GPU 0 and the MT models for de → fr and fr → de are deployed on GPU 1. Each deployed model requires about 4 GB of free GPU memory. If necessary for your local GPU setup, adapt the device_ids parameter of the transins-marian containers in docker-compose.yml.

REST API

TransIns provides a RESTful API for querying the translation service asynchronously. This API is also used by the web page.

The REST endpoint for getting the supported translation directions is /getTranslationDirections. Sending a GET request returns a JSON array of strings where each string represents a translation direction in the format <sourceLang>-<targetLang>.
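
As a minimal sketch of a client for this endpoint, the following Python snippet fetches the directions and splits each string into a (source, target) pair. It uses only the standard library; the function names and the default base URL http://localhost:7777 are assumptions for illustration.

```python
import json
from urllib.request import urlopen


def parse_directions(payload):
    """Turn the JSON array of '<sourceLang>-<targetLang>' strings into pairs."""
    return [tuple(direction.split("-", 1)) for direction in json.loads(payload)]


def fetch_directions(base_url="http://localhost:7777"):
    """Query a running TransIns service for its translation directions."""
    with urlopen(base_url + "/getTranslationDirections") as response:
        return parse_directions(response.read().decode("utf-8"))


if __name__ == "__main__":
    for source, target in fetch_directions():
        print(f"{source} -> {target}")
```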

The REST endpoint for sending a document to translate is /translate. The query has to be sent as a POST request encoded as multipart/form-data with the following fields:

  • file: the document file to translate

  • transDir: the translation direction; use the same format as returned by the getTranslationDirections endpoint

  • enc: the encoding of the document; the translated document will use the same encoding

  • strategy: the markup reinsertion strategy to use; possible values are MTRAIN, MTRAIN_IMPROVED and COMPLETE_MAPPING (default if strategy is not provided)
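
The fields listed above can also be assembled programmatically. The following is a minimal sketch that builds the multipart/form-data body by hand with the Python standard library and wraps it in a POST request. The function names, the generic application/octet-stream content type for the file part and the default base URL are assumptions for illustration; the request is only built here, not sent.

```python
import uuid
from urllib.request import Request


def build_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        parts.append(head.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    # The document itself goes into the "file" field.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(head.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def translate_request(file_name, file_bytes, trans_dir, enc,
                      strategy="COMPLETE_MAPPING",
                      base_url="http://localhost:7777"):
    """Build (but do not send) the POST request for the /translate endpoint."""
    fields = {"transDir": trans_dir, "enc": enc, "strategy": strategy}
    body, content_type = build_multipart(fields, file_name, file_bytes)
    return Request(base_url + "/translate", data=body,
                   headers={"Content-Type": content_type}, method="POST")
```

Passing the returned request to urllib.request.urlopen sends it to the service. How the token is encoded in the response body is not covered by this sketch, so response handling is left out.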

If successful, the service returns a token which is required to retrieve the translated document with a second query. That query has to be sent as a GET request to the /getTranslation REST endpoint, with the token as a path parameter. Please note that the translated document may not be retrievable immediately, as the translation may take some time. If the translation is not yet available, the second call returns a 202 HTTP response code.

If required, the token can also be used to cancel a translation and/or force the deletion of all associated files on the server. A delete query has to be sent as a DELETE request to the /deleteTranslation REST endpoint with the token as a path parameter.
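
Since the translation is retrieved asynchronously, a client typically polls /getTranslation until the document is ready. The following is a minimal sketch of such a loop. It assumes that a finished translation is answered with HTTP status 200 and treats any status other than 200 or 202 as an error; the function names, the polling interval and the attempt limit are hypothetical choices.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def http_fetch(token, base_url="http://localhost:7777"):
    """Return (status_code, body) for one GET on /getTranslation/<token>."""
    try:
        with urlopen(f"{base_url}/getTranslation/{token}") as response:
            return response.status, response.read()
    except HTTPError as error:
        return error.code, b""


def wait_for_translation(fetch, token, interval=2.0, max_attempts=30):
    """Poll until the translated document is available and return its bytes.

    fetch is any callable mapping a token to (status_code, body), for
    example http_fetch above or a stub in a test.
    """
    for _ in range(max_attempts):
        status, body = fetch(token)
        if status == 200:      # assumption: 200 means the document is ready
            return body
        if status != 202:      # 202 means "not yet available": keep waiting
            raise RuntimeError(f"unexpected HTTP status {status}")
        time.sleep(interval)
    raise TimeoutError(f"no translation after {max_attempts} attempts")
```

Calling wait_for_translation(http_fetch, token) with the token returned by /translate blocks until the document is ready or the attempt limit is reached. Passing the fetch function as a parameter keeps the polling logic testable without a running server.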

To test the REST service, use the curl and wget tools.

The following GET query retrieves the supported translation directions from a TransIns service running on port 7777 at localhost:

curl -i -X GET localhost:7777/getTranslationDirections

This returns the JSON array ["de-en", "en-de", "de-fr", "fr-de"].

A POST query to translate an MS Office document MyDoc.docx from German to French would look like this:

curl -i -X POST -H "Content-Type: multipart/form-data" \
  -F "file=@MyDoc.docx" -F "transDir=de-fr" -F "enc=windows-1252" \
  -F "strategy=COMPLETE_MAPPING" \
  localhost:7777/translate

This returns a token, for example cbVHK6U2oJIO8hCPvU4LR6dL3FSt2oU0nw9VBbFo, that must be used in the second GET query to retrieve the translated document:

wget -S --content-disposition \
  localhost:7777/getTranslation/cbVHK6U2oJIO8hCPvU4LR6dL3FSt2oU0nw9VBbFo

In order to delete the files on the server, use this DELETE query:

curl -i -X DELETE \
  localhost:7777/deleteTranslation/cbVHK6U2oJIO8hCPvU4LR6dL3FSt2oU0nw9VBbFo