Improving the Transformer Translation Model with Document-Level Context

Introduction

This is the implementation of our work, which extends the Transformer to integrate document-level context [paper]. The implementation is built on top of THUMT.

Usage

Note: The workflow is not yet user-friendly and may be improved later.

  1. Train a standard Transformer model; please refer to the THUMT user manual for details. Suppose that model_baseline/model.ckpt-30000 performs best on the validation set.

  2. Generate a dummy improved Transformer model with the following command:

python THUMT/thumt/bin/trainer_ctx.py --inputs [source corpus] [target corpus] \
                                      --context [context corpus] \
                                      --vocabulary [source vocabulary] [target vocabulary] \
                                      --output model_dummy --model contextual_transformer \
                                      --parameters train_steps=1

  3. Generate the initial model by merging the standard Transformer model into the dummy model, then create a checkpoint file:

python THUMT/thumt/script/combine_add.py --model model_dummy/model.ckpt-0 \
                                         --part model_baseline/model.ckpt-30000 --output train
printf 'model_checkpoint_path: "new-0"\nall_model_checkpoint_paths: "new-0"' > train/checkpoint

  4. Train the improved Transformer model with the following command:

python THUMT/thumt/bin/trainer_ctx.py --inputs [source corpus] [target corpus] \
                                      --context [context corpus] \
                                      --vocabulary [source vocabulary] [target vocabulary] \
                                      --output train --model contextual_transformer \
                                      --parameters start_steps=30000,num_context_layers=1

  5. Translate with the improved Transformer model:

python THUMT/thumt/bin/translator_ctx.py --inputs [source corpus] --context [context corpus] \
                                         --output [translation result] \
                                         --vocabulary [source vocabulary] [target vocabulary] \
                                         --model contextual_transformer --checkpoints [model path] \
                                         --parameters num_context_layers=1
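
Before running the training commands above, it may help to confirm that the source, context, and target files have the same number of lines, since each context line is paired with the source sentence on the same line. The following is a minimal sketch of such a check. The helper names `count_lines` and `check_aligned` are hypothetical and are not part of THUMT or this repository.

```python
from typing import Dict, Sequence


def count_lines(path: str) -> int:
    """Count lines in a file; an empty line still counts as one line."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_aligned(paths: Sequence[str]) -> int:
    """Return the shared line count, or raise ValueError if the files differ.

    Hypothetical helper: it only compares line counts and does not inspect
    the content of the corpora.
    """
    if not paths:
        raise ValueError("at least one file path is required")
    counts: Dict[str, int] = {p: count_lines(p) for p in paths}
    if len(set(counts.values())) > 1:
        details = ", ".join(f"{p}: {n}" for p, n in counts.items())
        raise ValueError(f"line counts differ ({details})")
    return counts[paths[0]]
```

For example, `check_aligned(["train.src", "train.ctx", "train.trg"])` would return the common line count, or raise an error that lists the count for each file.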

Citation

Please cite the following paper if you use the code:

@InProceedings{Zhang:18,
  author    = {Zhang, Jiacheng and Luan, Huanbo and Sun, Maosong and Zhai, Feifei and Xu, Jingfang and Zhang, Min and Liu, Yang},
  title     = {Improving the Transformer Translation Model with Document-Level Context},
  booktitle = {Proceedings of EMNLP},
  year      = {2018},
}

FAQ

  1. What is the context corpus?

The context corpus file contains one context per line. Normally, the context of a sentence consists of the several source sentences that precede it within the same document. For example, if the original document-level corpus is:

==== source ====
<document id=XXX>
<seg id=1>source sentence #1</seg>
<seg id=2>source sentence #2</seg>
<seg id=3>source sentence #3</seg>
<seg id=4>source sentence #4</seg>
</document>

==== target ====
<document id=XXX>
<seg id=1>target sentence #1</seg>
<seg id=2>target sentence #2</seg>
<seg id=3>target sentence #3</seg>
<seg id=4>target sentence #4</seg>
</document>

The inputs to our system should then be prepared as follows (assuming that the 2 preceding source sentences are used as context):

==== train.src ==== (source corpus)
source sentence #1
source sentence #2
source sentence #3
source sentence #4

==== train.ctx ==== (context corpus)
(the first line is empty)
source sentence #1
source sentence #1 source sentence #2 (there is only a space between the two sentences)
source sentence #2 source sentence #3

==== train.trg ==== (target corpus)
target sentence #1
target sentence #2
target sentence #3
target sentence #4
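
The conversion above can be sketched in Python. This is only an illustration, not a script shipped with this repository: the function names `parse_documents` and `build_context` are hypothetical, and the regular expressions assume the simplified `<document>` / `<seg>` markup shown in the example, so real corpora may need a proper parser.

```python
import re
from typing import List


def parse_documents(text: str) -> List[List[str]]:
    """Split the illustrative markup into a list of documents,
    each document being a list of sentences (assumed format)."""
    docs = []
    for body in re.findall(r"<document[^>]*>(.*?)</document>", text, flags=re.S):
        segs = re.findall(r"<seg[^>]*>(.*?)</seg>", body, flags=re.S)
        docs.append([s.strip() for s in segs])
    return docs


def build_context(sentences: List[str], num_context: int = 2) -> List[str]:
    """Return one context line per sentence of a single document.

    Line i joins up to `num_context` preceding sentences with a single
    space; the first sentence of the document gets an empty line.
    """
    contexts = []
    for i in range(len(sentences)):
        preceding = sentences[max(0, i - num_context):i]
        contexts.append(" ".join(preceding))
    return contexts


def build_corpus(docs: List[List[str]], num_context: int = 2):
    """Flatten all documents into parallel source and context lines.

    Context is rebuilt per document, so it never crosses a document boundary.
    """
    src_lines, ctx_lines = [], []
    for sentences in docs:
        src_lines.extend(sentences)
        ctx_lines.extend(build_context(sentences, num_context))
    return src_lines, ctx_lines
```

Writing `src_lines` and `ctx_lines` to `train.src` and `train.ctx`, one entry per line, would reproduce the layout shown above. The target side needs no context: its sentences are written in the same order to `train.trg`.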