TransIns - Document Translation with Markup Reinsertion

V1.0.0

written by Jörg Steffen
DFKI MLT-Lab

Overview

TransIns uses the Okapi framework to parse documents of supported formats (MS Office, OpenOffice, HTML and plain text) into a representation that preserves the document markup and allows access to the document’s text content on a sentence level. The sentences, without the markup, are translated by the neural machine translation framework MarianNMT using translation models provided by OPUS-MT. Afterwards, the markup is reinserted into the translated sentences based on token alignments. Finally, a translated document is provided in the original format.

We implement the following strategies for reinserting markup into the translated sentence using the tokens' alignments:

mtrain: A strategy assigning markup to the token next to it, as described in this paper of Matthias Müller and implemented in the Zurich NLP mtrain Python package
mtrain++: An improved version of the MTRAIN strategy
Complete Mapping Strategy (CMS): A strategy assigning markup to all tokens in the markup’s scope

An online demonstrator of TransIns is available here.

Citation

If you use TransIns in your research, please cite the following paper:

@inproceedings{steffen-van-genabith-2021-transins,
    title = "{T}rans{I}ns: Document Translation with Markup Reinsertion",
    author = {Steffen, J{\"o}rg  and
      van Genabith, Josef},
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.4",
    pages = "28--34"
}

Requirements

TransIns was tested on openSUSE Leap 15.1. For local deployment, the following software is required:

Java 11 JDK or higher
Maven build tool, version 3.5 or higher
Conda package and environment management system, version 4.9.2 or higher
Marian NMT neural machine translation framework, version 1.7

TransIns also provides Docker images for easy deployment (see Docker Deployment). This requires

Docker, version 19.03 or higher
Docker Compose, version 1.28.5 or higher

For efficient performance, we recommend to run Marian NMT on a server with at least one GPU. Our setup uses two GeForce RTX 2080 Ti with 11GB RAM each.

Content

This distribution consists of the following files:

README.adoc: this readme file
pom.xm: Maven build tool descriptor file
environment.yml: Conda environment definition file
Dockerfile-*: Docker image descriptions for the three TransIns components
build-transins-docker-images.sh: a convenience script for building TransIns Docker images
download-opus-models.sh: script for downloading publicly available OPUS MT models for the translation directions de → en, en → de, de → fr and fr → de and deploying them in TransIns
docker-compose.yml: docker-compose file to run TransIns in Docker containers
evaluation: German evaluation documents with annotated translations to French and English, using both TransIns and 3rd party translation services
src/main/java/*: Java source code of the core TransIns component, including an embedded server for web page and REST API
src/main/python: Python source code of the pre- and postprocessing server
src/main/web: Source files of the TransIns web page
src/main/resources:
- transIns.cfg: TransIns configuration file; includes configuration of embedded server for web page and REST API as well as Marian NMT translator configurations for each supported translation direction
- okapi: configuration files of Okapi modules
- apertium-translator.cfg, microsoft-translator.cfg: Okapi connector configuration files for alternative translation services
- logback.xml: TransIns logger configuration
- log4j2.xml, commons-logging.properties, simplelog.properties: logger configuration for third-party libraries
src/main/resources-misc:
- prepostprocessing: configuration folder of pre-/postprocessing server; run ./download-opus-models.sh to put required resources for pre-/postprocessing here
- marian-nmt: configuration folder of Marian NMT translation server; contains a subfolder for each supported translation direction; run ./download-opus-models.sh to put required MT models here
src/test/java: Java source code of unit tests and demos
src/test/resources: files used in demos

Installation

TransIns consists of the following components:

the TransIns core component, including web server and REST API
the server for handling pre- and postprocessing of sentences to translate; one instance handles all supported translation directions
the Marian NMT server for translating sentences; one instance for each supported translation direction

Quickstart

If you have a server with two GPUs, each with at least 8 GB RAM, run the following commands in the top level folder of the distribution:

./build-transins-docker-images.sh
./download-opus-models.sh
docker-compose up -d

TransIns is now available at http://localhost:7777 providing de → en, en → de, de → fr and fr→de translations.

TransIns Core Component

To compile the Java source code for the TransIns core component, run mvn clean install in the top level folder of the distribution (where pom.xml is located). Start the server providing the web page and REST API by running the following command:

java -cp src/main/resources:target/*:target/lib/* \
  de.dfki.mlt.transins.server.TransInsServer

Configuration files are loaded from Java classpath via src/main/resources/. Please note that the connection settings in transIns.cfg are set by default for deployment in Docker and have to be adapted for local deployment.

The TransIns web page can be accessed at http://localhost:7777, the REST API is available at http://localhost:7777/* (see REST API for more details).

TransIns Pre-/Postprocessing Server

To run the pre-/postprocessing server, several Python packages have to be installed first. We suggest to create a dedicated Conda environment by running conda create --name transins and then switching to it by running conda activate transins. Install the required packages by running the following Conda commands:

conda install -c anaconda flask=2.2.2
conda install -c anaconda waitress=1.4.3
conda install -c conda-forge sacremoses=0.0.43
conda install -c conda-forge sentencepiece=0.1.95
conda install -c anaconda pip=20.2.4
pip install subword-nmt==0.3.7

Alternatively, the creation of the Conda environment and the installation of the required packages can also be done by a single Conda command using the environment definition file: conda env create -f environment.yml

Start the pre-/postprocessing server by running the following command:

python src/main/python/PrePostProcessingServer.py \
  --config_folder src/main/resources/prepostprocessing \
  --port 5000

The configuration of the pre-/postprocessing server is loaded from the file config.ini in the provided configuration folder. Make sure you have run ./download-opus-models.sh to put the required resources for pre-/postprocessing into the configuration folder.

The pre-/postprocessing server is now accessible via a REST API running at http://localhost:5000.

TransIns Marian NMT Server

Install Marian NMT on your system following these instructions. Start the translation server for a specific translation direction by running the following command in the top level folder of the distribution (assuming that marian-server is on your $PATH)

marian-server --config src/main/resources-misc/marian-nmt/<trans-dir>/config.yml

Make sure you have run ./download-opus-models.sh to put the required models into subfolders of marian-nmt. Please note that the provided config.yml configurations assume a GPU with at least 4 GB of free memory.

The Marian NMT translation server is now accessible via a web socket running at ws://localhost:8080/translate.

Docker Deployment

Instead of installing the TransIns components as described above, we also provide Docker images for easy deployment. The Docker image for each component is defined in the corresponding Dockerfile. Build the TransIns Docker images by running ./build-transins-docker-images.sh in the top level folder of the distribution.

Start TransIns with all supported translation directions (de → en, en → de, de → fr, fr → de) by running docker-compose up -d. Each TransIns component runs in a separate container within a Docker network. Note that the configuration folders of both the pre-/postprocessing server as well as the Marian NMT servers are passed as bind mounts to the corresponding Docker containers.

By default, the MT models for de → en and en → de are deployed at GPU 0 and the MT models for de → fr and fr → de are deployed at GPU 1. Each deployed models requires ~ 4 GB of free GPU memory. Please adapt the device_ids parameter of the transins-marian containers in docker-compose.yml, if necessary for your local GPU setup.

REST API

TransIns provides a RESTful API that allows to query the translation service in an asynchronous way. This API is also used by the web page.

The REST endpoint for getting the supported translation directions is /getTranslationDirections. Sending a GET request returns a JSON array of strings where each string represents a translation direction in the format <sourceLang>-<targetLang>.

The REST endpoint for sending a document to translate is /translate. The query has to be sent as POST request encoded as multipart/form-data with the following fields:

file the file name of the document to translate
transDir the translation direction; use the same format as returned by the getTranslationDirections endpoint
enc the encoding of the document; the translated document will use the same encoding
strategy the markup reinsertion strategy to use; possible values are MTRAIN, MTRAIN_IMPROVED and COMPLETE_MAPPING (default if strategy is not provided)

If successful, the service returns a token which is required to retrieve the translated document with a second query. That query has to be sent as GET request to the /getTranslation REST endpoint. It requires the token as path parameter. Please note that it is not guaranteed that the translated document can be retrieved immediately, as the translation may take some time. If the translation is not yet available, the second call returns a 202 HTTP response code.

If required, the token can also be used to cancel a translation and/or force the deletion of all associated files on the server. A delete query has to be sent as DELETE request to the /deleteTranslation REST endpoint with the token as path parameter.

To test the REST service, use the curl and wget tools.

The following GET query retrieves the supported translation directions from a TransIns service running on port 7777 at localhost:

curl -i -X GET localhost:7777/getTranslationDirections

This would return a JSON array ["de-en", "en-de", "de-fr","fr-de"].

A POST query to translate an MS Office document MyDoc.docx from German to French would look like this:

curl -i -X POST -H "Content-Type: multipart/form-data" \
  -F "file=@MyDoc.docx" -F "transDir=de-fr" -F "enc=windows-1252" \
  -F "strategy=COMPLETE_MAPPING" \
  localhost:7777/translate

This returns a token cbVHK6U2oJIO8hCPvU4LR6dL3FSt2oU0nw9VBbFo that must be used in the second GET query to retrieve the translated document:

wget -S --content-disposition \
  localhost:7777/getTranslation/cbVHK6U2oJIO8hCPvU4LR6dL3FSt2oU0nw9VBbFo

In order to delete the files on the server, use this DELETE query:

curl -i -X DELETE \
  localhost:7777/deleteTranslation/cbVHK6U2oJIO8hCPvU4LR6dL3FSt2oU0nw9VBbFo

Name		Name	Last commit message	Last commit date
Latest commit History 221 Commits
evaluation		evaluation
src		src
.gitignore		.gitignore
.travis.yml		.travis.yml
Dockerfile-MarianNMT		Dockerfile-MarianNMT
Dockerfile-MarianNMT-CPU		Dockerfile-MarianNMT-CPU
Dockerfile-MarianNMT-CPU.dockerignore		Dockerfile-MarianNMT-CPU.dockerignore
Dockerfile-MarianNMT.dockerignore		Dockerfile-MarianNMT.dockerignore
Dockerfile-PrePostprocessing		Dockerfile-PrePostprocessing
Dockerfile-PrePostprocessing.dockerignore		Dockerfile-PrePostprocessing.dockerignore
Dockerfile-Server		Dockerfile-Server
Dockerfile-Server.dockerignore		Dockerfile-Server.dockerignore
LICENSE.txt		LICENSE.txt
README.adoc		README.adoc
TransIns_Paper_EMNLP_2021.pdf		TransIns_Paper_EMNLP_2021.pdf
build-transins-docker-images.sh		build-transins-docker-images.sh
docker-compose.yml		docker-compose.yml
download-opus-models.sh		download-opus-models.sh
environment.yml		environment.yml
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TransIns - Document Translation with Markup Reinsertion

Overview

Citation

Requirements

Content

Installation

Quickstart

TransIns Core Component

TransIns Pre-/Postprocessing Server

TransIns Marian NMT Server

Docker Deployment

REST API

About

Releases

Packages

Languages

License

DFKI-MLT/TransIns

Folders and files

Latest commit

History

Repository files navigation

TransIns - Document Translation with Markup Reinsertion

Overview

Citation

Requirements

Content

Installation

Quickstart

TransIns Core Component

TransIns Pre-/Postprocessing Server

TransIns Marian NMT Server

Docker Deployment

REST API

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages