
NLP2CT/TaU

 
 


TaU

TaU: Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization

Paper Slides Poster

Overview

Prerequisites

This work would not have been possible without the amazing code from COMET and mt-metrics-eval.

cd comet
pip install poetry
poetry install
git clone https://github.com/google-research/mt-metrics-eval.git # Our version: bdda529ce4fae9
cd mt-metrics-eval
pip install .
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download 

Test-time Adaptation

  • Run with configurations

    1. Create the experiment script:
    cd tau/
    python create_tau_exps.py --config_file configs/{comet-da/comet-mqm/comet-qe-mqm}.yaml
    2. Run the generated script:
    cd tau/
    sh run_{comet-da/comet-mqm/comet-qe-mqm}.sh
  • CLI Usage Example:

    MTME_DATA=${HOME}/.mt-metrics-eval/mt-metrics-eval-v2/
    SAVE=results
    SYSTEM=Online-W.txt
    python tau.py \
        -s ${MTME_DATA}/wmt21.tedtalks/sources/en-de.txt \
        -r ${MTME_DATA}/wmt21.tedtalks/references/en-de.refA.txt -t ${SYSTEM} \
        --to_json ${SAVE}/${SYSTEM}.json \
        --lr 1e-4 --mc_dropout 30 --component ln --adapt-epoch 1 --batch_size 16 --quiet
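When several system outputs need to be scored, the command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_tau_command` is a hypothetical helper, it only reuses the flags and hyperparameter values shown in the CLI example, and it assumes the reference file lives under the same MTME data directory as the source file.

```python
import shlex


def build_tau_command(mtme_data, testset, lang_pair, ref_name, system_file, save_dir):
    """Return the argument list for one tau.py run, mirroring the CLI example.

    Hypothetical helper: all flags and values are copied from the README
    example; nothing here is taken from the tau.py source itself.
    """
    base = mtme_data.rstrip("/")  # avoid a double slash when joining paths
    return [
        "python", "tau.py",
        "-s", f"{base}/{testset}/sources/{lang_pair}.txt",
        "-r", f"{base}/{testset}/references/{lang_pair}.{ref_name}.txt",
        "-t", system_file,
        "--to_json", f"{save_dir}/{system_file}.json",
        "--lr", "1e-4",
        "--mc_dropout", "30",
        "--component", "ln",
        "--adapt-epoch", "1",
        "--batch_size", "16",
        "--quiet",
    ]


cmd = build_tau_command(
    mtme_data="/home/user/.mt-metrics-eval/mt-metrics-eval-v2/",  # assumed location
    testset="wmt21.tedtalks",
    lang_pair="en-de",
    ref_name="refA",
    system_file="Online-W.txt",
    save_dir="results",
)
print(shlex.join(cmd))  # shell-quoted command line for inspection
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from inside the `tau/` directory, once per system output file.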

Meta-evaluation

  • Recommendation: Please use the "unit test" to confirm that the meta-evaluation script reproduces the baseline results reported in the official WMT report. To do so, check the following function in tau/meta_eval_results.py:
    def verify():
      ...
      file = "" # Please replace the path with your MTME data path
      ...
  • If you use the generated script to run the experiments, you can evaluate the correlation performance with:
    cd tau/
    python meta_eval_results.py [comet-da-results/wmt21.tedtalks/en-de] # An example
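For readers unfamiliar with meta-evaluation, the idea is to measure how well metric scores agree with human judgments. The sketch below computes a Pearson correlation in plain Python as one common agreement measure. It is an illustration only: the correlation statistics actually reported by `meta_eval_results.py` are determined by that script and mt-metrics-eval, and the score lists here are made-up toy values, not results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (one score per MT system).
metric_scores = [0.10, 0.25, 0.40, 0.55]
human_scores = [1.0, 2.5, 4.0, 5.5]  # exactly linear in metric_scores

r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

A metric whose scores rank and space the systems the same way human raters do yields a correlation close to 1; adaptation is useful when it moves this number up on the target domain.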

Supplementary

Here is a valuable question raised after publication: How does TaU perform on low-resource languages and other benchmarks without tuning?

  • We recognize and value language diversity. However, out-of-domain benchmarks for MT evaluation are scarce. We therefore ran further experiments on the earlier WMT20-News benchmark with the same learning rate (without tuning on development data). The results are as follows:

                Pl-En  Ru-En  Ta-En  Zh-En  En-Pl  En-Ru  En-Ta  En-Zh
    COMET-DA     34.5   83.6   76.4   93.1   80.0   92.5   79.8    0.7
    +TaU         34.6   84.0   77.4   93.4   79.0   91.6   75.3    1.2
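To make the comparison easier to read, the per-language-pair change can be computed directly from the numbers in the table. This is a small arithmetic sketch using only the values reported above; the variable names are illustrative.

```python
lang_pairs = ["Pl-En", "Ru-En", "Ta-En", "Zh-En", "En-Pl", "En-Ru", "En-Ta", "En-Zh"]
comet_da = [34.5, 83.6, 76.4, 93.1, 80.0, 92.5, 79.8, 0.7]
with_tau = [34.6, 84.0, 77.4, 93.4, 79.0, 91.6, 75.3, 1.2]

# Difference (+TaU minus COMET-DA), rounded to the table's one-decimal precision.
deltas = {
    lp: round(tau - base, 1)
    for lp, base, tau in zip(lang_pairs, comet_da, with_tau)
}

for lp, delta in deltas.items():
    print(f"{lp}: {delta:+.1f}")

improved = [lp for lp, delta in deltas.items() if delta > 0]
decreased = [lp for lp, delta in deltas.items() if delta < 0]
print("improved:", improved)
print("decreased:", decreased)
```

With these values, the four into-English pairs and En-Zh improve, while En-Pl, En-Ru, and En-Ta decrease, with En-Ta showing the largest change (-4.5).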

Environment (For reference)

Citation

@inproceedings{zhan-etal-2023-test,
    title = "Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization",
    author = "Zhan, Runzhe  and
      Liu, Xuebo  and
      Wong, Derek F.  and
      Zhang, Cuilian  and
      Chao, Lidia S.  and
      Zhang, Min",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.47",
    pages = "807--820",
}
