Metric for Automatic Machine Translation Evaluation
We submitted this metric to the WMT19 Metrics Shared Task.
Paper: http://www.statmt.org/wmt19/pdf/53/WMT60.pdf
Requirements

- Python >= 3.6.0
- TensorFlow >= 1.11.0
- Clone the BERT repository (https://github.com/google-research/bert) and add it to your Python path:
  export PYTHONPATH="path to bert dir:$PYTHONPATH"
- Download the BERT model fine-tuned on MRPC and point TUNED_MODEL_DIR at it:
  export TUNED_MODEL_DIR="path to fine-tuned BERT model"
Usage

1. Prepare the test set

Put the test set in data/orig (file names: src, out, ref). A quick alignment check is sketched after this step.
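The snippet below is a minimal sanity check, assuming src, out, and ref are plain-text files with one line-aligned segment per line (the usual WMT convention); it is not part of this repository.

```python
from pathlib import Path

# Count segments in each test-set file under data/orig.
counts = {}
for name in ("src", "out", "ref"):
    with open(Path("data/orig") / name, encoding="utf-8") as f:
        counts[name] = sum(1 for _ in f)

# The three files must be line-aligned, i.e. equal in length.
assert len(set(counts.values())) == 1, f"line counts differ: {counts}"
print(counts)
```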
2. Make pseudo-references

Translate the source side of the test set with an off-the-shelf MT system and put the outputs in data/pseudo_references/.
Note: Do not use an off-the-shelf MT system whose output is contained in the test set (i.e., one of the systems under evaluation).
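As an illustration only, the sketch below produces pseudo-references with a MarianMT model from the Hugging Face transformers library (requires transformers and PyTorch, neither of which is a dependency of this repository). The model name is an arbitrary German-to-English example and the output file name pseudo_ref is hypothetical; any off-the-shelf MT system satisfying the note above will do.

```python
from transformers import MarianMTModel, MarianTokenizer

# Illustrative off-the-shelf MT system (hypothetical de-en choice).
model_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

with open("data/orig/src", encoding="utf-8") as f:
    sources = [line.strip() for line in f]

# Translate in small batches and write one pseudo-reference per line.
with open("data/pseudo_references/pseudo_ref", "w", encoding="utf-8") as out:
    for i in range(0, len(sources), 16):
        batch = tokenizer(sources[i:i + 16], return_tensors="pt",
                          padding=True, truncation=True)
        outputs = model.generate(**batch)
        for sent in tokenizer.batch_decode(outputs, skip_special_tokens=True):
            out.write(sent + "\n")
```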
3. Filtering with BERT

sh script/filter.sh

Pseudo-references with paraphrase scores are written to data/sim_scores.
Filtered pseudo-references are written to data/filtered_pseudo_references/. The idea behind the filtering is illustrated below.
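The filtering itself is done by script/filter.sh; the sketch below only illustrates the underlying idea, under two assumptions that are ours, not the repository's: that the score file holds one paraphrase score per line, aligned with the (hypothetical) pseudo_ref file from the previous step, and that 0.5 is a reasonable cut-off.

```python
THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper

# Assumed format: one pseudo-reference per line, one score per line.
with open("data/pseudo_references/pseudo_ref", encoding="utf-8") as f:
    pseudo_refs = [line.rstrip("\n") for line in f]
with open("data/sim_scores", encoding="utf-8") as f:
    scores = [float(line) for line in f]

# Keep only pseudo-references that BERT judges to be paraphrases.
kept = [p for p, s in zip(pseudo_refs, scores) if s >= THRESHOLD]
print(f"kept {len(kept)} of {len(pseudo_refs)} pseudo-references")
```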
4. Evaluate

Evaluate the score with a metric that allows the use of multiple references.
To evaluate with sentence BLEU, download the Moses binaries and run

sh scripts/evaluate.sh [language] [path to moses folder]

The generated output_score file contains the sentence-BLEU score for each sentence.
Note: If you evaluate with other metrics, use metrics that take all references into account at once, rather than taking the maximum score over each single reference; see the toy comparison below.
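To make the note concrete, the toy sketch below uses NLTK (not a dependency of this repository) to score one hypothesis against two references jointly, and contrasts that with the per-reference maximum the note warns against.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hyp = "the cat sat on the mat".split()
refs = [
    "the cat is on the mat".split(),  # gold reference
    "a cat sat on the mat".split(),   # kept pseudo-reference
]
smooth = SmoothingFunction().method1

# Correct: one score that takes all references into account at once.
joint = sentence_bleu(refs, hyp, smoothing_function=smooth)

# What the note warns against: the best score over single references.
per_ref_max = max(sentence_bleu([r], hyp, smoothing_function=smooth)
                  for r in refs)

print(f"multi-reference BLEU: {joint:.3f}, "
      f"max single-reference BLEU: {per_ref_max:.3f}")
```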