ATO: Augmenting Text to Increase Translation Difficulty

William Kalikman*, Šimon Sukup*, Michal Tešnar, Vilém Zouhar (*equal contribution)
ETH Zurich


We propose Adversarial Translation Optimization (ATO), a gradient-based method for making source texts harder to translate without LLM prompting, human curation, or task-specific training. We use the Sentinel-src-25 difficulty estimator's gradient signal, combined with beam search and a differentiable fluency term, to iteratively replace tokens in a seed text so that it becomes harder to translate while remaining grammatically plausible.

📄 Paper · 🗄️ Dataset

@inproceedings{kalikman2026augmenting,
  title     = {Augmenting Text to Increase Translation Difficulty},
  author    = {Kalikman, William and Sukup, {\v{S}}imon and Te{\v{s}}nar, Michal and Zouhar, Vil{\'e}m},
  year      = {2026},
  note      = {ETH Zurich}
}

How It Works

Figure: one beam-search iteration of ATO — candidates are expanded via gradient-guided token substitutions, pruned by the objective, and a random subset is injected for diversity.

At each iteration, ATO computes token-level gradients of the objective (translation difficulty + fluency), samples a pool of single-token substitution candidates across all beam members, scores them, and retains the best B survivors plus R random injections for diversity. See the paper for full details.
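In outline, one iteration looks like the toy sketch below. This is an illustration only, not the repository's optimizer: candidate generation here is random substitution over a small integer vocabulary rather than gradient-guided, and `objective` stands in for the Sentinel+CoLA score.

```python
import random

def beam_step(beams, objective, vocab, pool_size=16, B=4, R=2, rng=random):
    """One ATO-style iteration: expand, score, prune, inject randoms."""
    # Expand: single-token substitutions drawn across all beam members.
    candidates = []
    for seq in beams:
        for _ in range(pool_size // len(beams)):
            pos = rng.randrange(len(seq))
            tok = rng.choice(vocab)
            candidates.append(seq[:pos] + (tok,) + seq[pos + 1:])
    # Score and keep the best B survivors (higher objective = harder to translate).
    candidates.sort(key=objective, reverse=True)
    survivors = candidates[:B]
    # Inject R random candidates from the remainder for diversity.
    rest = candidates[B:]
    survivors += rng.sample(rest, min(R, len(rest)))
    return survivors

# Toy run: sequences are tuples of token ids; the objective prefers high ids.
beams = [(1, 2, 3), (4, 5, 6)]
new_beams = beam_step(beams, objective=sum, vocab=list(range(10)), B=3, R=1)
```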


Methods

Two variants of ATO are implemented:

ATO-Direct
    Single-phase beam search over the whole-word XLM-R vocabulary (~10k tokens).
    Objective: Sentinel difficulty + CoLA fluency + cosine cohesion; the final
    output is selected by lowest Qwen perplexity.
    Entry point: euler_infra/beam_search/run_experiment_cola.py

ATO-TwoPhase
    Phase 1: beam search over an English subword vocabulary (~25k tokens) with
    the Sentinel+CoLA objective. Phase 2: beam search in Qwen token space
    (~87k tokens) optimising perplexity.
    Entry point: euler_infra/beam_search/run_experiment_qwen_phased.py

Hyperparameters for both methods are documented in configs/ato_direct.yaml and configs/ato_twophase.yaml.
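The composite objective can be pictured as a weighted sum of its terms. The weights and function name below are illustrative assumptions for exposition, not the repository's actual values (those live in the configs):

```python
def ato_objective(difficulty, fluency, cohesion,
                  w_fluency=0.5, w_cohesion=0.25):
    """Score a candidate: higher = harder to translate yet still plausible.

    difficulty : Sentinel-src-25 difficulty estimate (maximised)
    fluency    : CoLA grammaticality probability in [0, 1] (maximised)
    cohesion   : cosine similarity to the seed text in [-1, 1] (maximised)
    """
    return difficulty + w_fluency * fluency + w_cohesion * cohesion

# A fluent candidate outranks an equally difficult but disfluent one.
a = ato_objective(difficulty=0.8, fluency=0.9, cohesion=0.7)
b = ato_objective(difficulty=0.8, fluency=0.2, cohesion=0.7)
assert a > b
```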


Repository Structure

.
├── configs/
│   ├── ato_direct.yaml               # ATO-Direct hyperparameters
│   └── ato_twophase.yaml             # ATO-TwoPhase hyperparameters
├── euler_infra/
│   └── beam_search/
│       ├── run_experiment_cola.py    # ATO-Direct entry point
│       └── run_experiment_qwen_phased.py  # ATO-TwoPhase entry point
├── scripts/
│   ├── finetune_cola.py              # fine-tune CoLA grammaticality classifier
│   └── sentence_compare.py           # visualise optimisation trajectory for one sentence
├── src/
│   ├── breaking_mt_sigils/           # differentiable objectives (Sentinel, CoLA, Qwen PPL)
│   ├── constraints/                  # vocabulary filter files
│   └── optimizers/
│       ├── beam_gcg.py               # BeamSearchGCGOptimizer (ATO-Direct)
│       └── beam_gcg_phased.py        # BeamSearchGCGOptimizerPhased (ATO-TwoPhase)
├── translation/
│   ├── translate/                    # Translation backends (NLLB, TranslateGemma, Gemini)
│   ├── evaluate/                     # Metrics (XCOMET-XL, MetricX-24) and visualisation
│   └── requirements.txt              # Separate dependency set for translation
└── requirements.txt                  # Full environment freeze (ATO optimisers)

Installation

git clone https://github.com/wskal/breaking-mt.git
cd breaking-mt
pip install -r requirements.txt

Critical version requirements (already pinned in requirements.txt):

  • transformers==4.35.2 — 4.36+ breaks compatibility
  • numpy<2 — required for PyTorch 2.2.2

Model Dependencies

Three models are required:

1. Sentinel-src-25 — translation difficulty estimator (auto-downloaded from HuggingFace):

Prosho/sentinel-src-25

2. CoLA-finetuned XLM-RoBERTa-large — grammaticality classifier (auto-downloaded):

wskal/cola-xlm-roberta-large

To fine-tune it yourself instead:

python scripts/finetune_cola.py

3. Qwen/Qwen2.5-72B — served locally via vLLM for perplexity scoring:

vllm serve Qwen/Qwen2.5-72B \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --dtype bfloat16 > qwen.log 2> qwen.err &
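Perplexity here is exp of the mean negative token log-probability, computed from the per-token logprobs the served model returns. The helper below is an illustrative sketch of that arithmetic, not code from this repo:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean(logprobs))."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns every token probability 1/4 has perplexity ≈ 4.
ppl = perplexity([math.log(0.25)] * 6)
```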

Log in to HuggingFace before downloading models:

huggingface-cli login

Running ATO-Direct

Parameters are set at the top of the script (see configs/ato_direct.yaml for a documented reference). With the vLLM server running:

python euler_infra/beam_search/run_experiment_cola.py

Results are written to a timestamped subdirectory under results/.


Running ATO-TwoPhase

Parameters are set at the top of the script (see configs/ato_twophase.yaml for a documented reference). With the vLLM server running:

python euler_infra/beam_search/run_experiment_qwen_phased.py

Translation & Evaluation Pipeline

ATO-augmented texts are translated into multiple target languages and scored with two reference-free metrics. The pipeline lives in translation/. Each script reads a results CSV and writes it out (via --output) with new columns appended, so results accumulate in one table rather than being scattered across separate files.

Environment Setup

Main environment — handles translation (NLLB, TranslateGemma, Gemini) and XCOMET scoring:

python -m venv translation-venv
source translation-venv/bin/activate
pip install -r translation/requirements.txt
pip install unbabel-comet          # XCOMET-XL

MetricX environment — requires a pinned, isolated Conda environment due to conflicting dependencies:

conda create -n metricx-env python=3.10 -y
conda activate metricx-env
git clone https://github.com/google-research/metricx metricx/
pip install "transformers[torch]==4.30.2" sentencepiece==0.1.99 datasets==2.13.1
pip install "git+https://github.com/google-research/mt-metrics-eval"
pip install protobuf==3.20.3 fsspec==2023.6.0 "numpy<2.0"

Apply this one-line fix to the MetricX predict script (required to avoid a padding error):

# In metricx/metricx24/predict.py, inside the Trainer initialisation:
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=transformers.DataCollatorWithPadding(tokenizer),  # ← add this line
)

For Gemini Flash, place your API key in a .env file at the repo root:

GEMINI_API_KEY=your_key_here

Step 1: Translate

Translate all text columns of a results CSV to a target language (translated columns are appended as <original_col>_<model>_<lang3>):

# NLLB-200-3.3B — runs locally; uses NLLB language codes
python translation/translate.py \
    --input  results/all_methods_25.csv \
    --output results/all_methods_25_nllb_deu.csv \
    --model  nllb \
    --src-lang eng_Latn \
    --tgt-lang deu_Latn

# google/translategemma-27b-it — uses ISO 639-1 codes
python translation/translate.py \
    --input  results/all_methods_25.csv \
    --output results/all_methods_25_translategemma_de.csv \
    --model  translategemma \
    --src-lang en \
    --tgt-lang de

# Gemini-3-Flash — uses language names; API key required
python translation/translate.py \
    --input  results/all_methods_25.csv \
    --output results/all_methods_25_gemini_de.csv \
    --model  gemini \
    --src-lang English \
    --tgt-lang German \
    --batch-size 10

To translate only specific columns, pass --columns:

python translation/translate.py \
    --input  results/all_methods_25.csv \
    --output results/all_methods_25_nllb_ces.csv \
    --model  nllb --src-lang eng_Latn --tgt-lang ces_Latn \
    --columns original_sentence ATO-Direct ATO-TwoPhase
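The appended column names follow the `<original_col>_<model>_<lang3>` pattern described above; a tiny helper (illustrative, not part of translate.py) reproduces them when selecting columns for the evaluation step:

```python
def translated_col(original_col, model, lang3):
    """Name of the column translate.py appends for one (model, language) run."""
    return f"{original_col}_{model}_{lang3}"

cols = [translated_col(c, "nllb", "deu") for c in ("ATO-Direct", "ATO-TwoPhase")]
print(cols)  # → ['ATO-Direct_nllb_deu', 'ATO-TwoPhase_nllb_deu']
```

These are exactly the names passed to --mt-cols in the evaluation commands below.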

Step 2: Evaluate

XCOMET-XL (Unbabel/XCOMET-XL, reference-free quality estimation; run in the main venv):

python translation/evaluate/xcomet.py \
    --input  results/all_methods_25_nllb_deu.csv \
    --output results/all_methods_25_nllb_deu_xcomet.csv \
    --source-col original_sentence \
    --mt-cols ATO-Direct_nllb_deu ATO-TwoPhase_nllb_deu \
    --batch-size 8

MetricX-24 (google/metricx-24-hybrid-xl-v2p6, reference-free QE; run in metricx-env):

conda activate metricx-env
python translation/evaluate/metricx.py \
    --input  results/all_methods_25_nllb_deu.csv \
    --output results/all_methods_25_nllb_deu_metricx.csv \
    --source-col original_sentence \
    --mt-cols ATO-Direct_nllb_deu ATO-TwoPhase_nllb_deu \
    --batch-size 8

Scores are written as new columns: xcomet_<mt_col> and metricx_<mt_col>. To specify independent source columns per method, use --pairs source_col:mt_col instead.
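A scored CSV can then be summarised directly. A stdlib-only sketch that averages one score column (the path and column name are placeholders following the convention above):

```python
import csv

def mean_score(path, col):
    """Average a numeric score column of a scored results CSV, skipping blanks."""
    with open(path, newline="") as f:
        values = [float(row[col]) for row in csv.DictReader(f) if row[col]]
    return sum(values) / len(values)

# e.g. mean_score("results/all_methods_25_nllb_deu_xcomet.csv",
#                 "xcomet_ATO-Direct_nllb_deu")
```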

Step 3: Visualise

After scoring, generate bar-chart comparisons across methods and target languages:

from translation.evaluate.visualize import load_scores, plot_scores

scores = load_scores()          # reads all scored CSVs from translation/results/
fig    = plot_scores(scores)
fig.savefig("figs_to_upload/translation_quality.pdf", bbox_inches="tight")

Charts compare Original, DIPPER, Qwen-72B, ATO-Direct, and ATO-TwoPhase across Czech, German, Icelandic, Russian, and Spanish using both XCOMET and MetricX, with a normalised panel relative to the unmodified baseline.

SLURM Jobs (ETH Euler Cluster)

Batch scripts are in euler_infra/translation/:

Script Purpose
job_nllb_{ces,deu,isl,rus,spa}.sbatch NLLB translation for each of the five target languages
job_translategemma_deu.sbatch TranslateGemma translation to German
job_gemini_deu.sbatch Gemini Flash translation to German
job_xcomet.sbatch XCOMET-XL scoring (parametric $LANG)
job_metricx.sbatch MetricX-24 scoring (parametric $LANG)

Submit a parametric scoring job with:

sbatch --export=LANG=deu euler_infra/translation/job_xcomet.sbatch
sbatch --export=LANG=deu euler_infra/translation/job_metricx.sbatch

Cluster / SLURM

The euler_infra/ directory contains SLURM job scripts and setup guides for the ETH Euler cluster. These are not required for general use — see euler_infra/README.md for details.


License

MIT
