William Kalikman*, Šimon Sukup*, Michal Tešnar, Vilém Zouhar (*equal contribution)
ETH Zurich
We propose Adversarial Translation Optimization (ATO), a gradient-based method for making source texts harder to translate without LLM prompting, human curation, or task-specific training. We use the Sentinel-src-25 difficulty estimator's gradient signal, combined with beam search and a differentiable fluency term, to iteratively replace tokens in a seed text so that it becomes harder to translate while remaining grammatically plausible.
```bibtex
@inproceedings{kalikman2026augmenting,
  title  = {Augmenting Text to Increase Translation Difficulty},
  author = {Kalikman, William and Sukup, {\v{S}}imon and Te{\v{s}}nar, Michal and Zouhar, Vil{\'e}m},
  year   = {2026},
  note   = {ETH Zurich}
}
```

At each iteration, ATO computes token-level gradients of the objective (translation difficulty + fluency), samples a pool of single-token substitution candidates across all beam members, scores them, and retains the best B survivors plus R random injections for diversity. See the paper for full details.
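One iteration of this loop can be sketched as follows. This is a toy illustration: the objective, vocabulary, and constants below are stand-ins, and the real implementation derives its candidate pool from the gradient signal rather than enumerating a vocabulary.

```python
import random

BEAM_SIZE = 4  # B survivors kept per iteration (hypothetical value)
N_RANDOM = 2   # R random injections for diversity (hypothetical value)

def score(text: str) -> float:
    """Stand-in objective. The real objective combines Sentinel-src-25
    difficulty with a differentiable fluency term."""
    return sum(ord(c) for c in text) % 97  # toy scoring function

def ato_iteration(beam: list[str], vocabulary: list[str]) -> list[str]:
    """One ATO iteration: propose single-token substitutions across all
    beam members, keep the best B, then add R random injections."""
    candidates = set(beam)
    for text in beam:
        tokens = text.split()
        for pos in range(len(tokens)):
            # ATO picks replacements via the gradient signal; the toy
            # version simply enumerates the whole vocabulary.
            for tok in vocabulary:
                mutated = tokens[:pos] + [tok] + tokens[pos + 1:]
                candidates.add(" ".join(mutated))
    ranked = sorted(candidates, key=score, reverse=True)
    survivors = ranked[:BEAM_SIZE]
    k = min(N_RANDOM, max(0, len(ranked) - BEAM_SIZE))
    injections = random.sample(ranked[BEAM_SIZE:], k=k)
    return survivors + injections

beam = ["the cat sat on the mat"]
vocab = ["dog", "philosopher", "quietly", "mat"]
new_beam = ato_iteration(beam, vocab)
```

Because the original text always stays in the candidate pool, the best beam member's objective never decreases across iterations.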
Two variants of ATO are implemented:
| Method | Description | Entry point |
|---|---|---|
| ATO-Direct | Single-phase beam search over whole-word XLM-R vocabulary (~10k tokens). Objective: Sentinel difficulty + CoLA fluency + cosine cohesion. Output selected by lowest Qwen perplexity. | euler_infra/beam_search/run_experiment_cola.py |
| ATO-TwoPhase | Phase 1: beam search over English subword vocabulary (~25k tokens) with Sentinel+CoLA objective. Phase 2: beam search in Qwen token space (~87k tokens) optimising perplexity. | euler_infra/beam_search/run_experiment_qwen_phased.py |
Hyperparameters for both methods are documented in configs/ato_direct.yaml and configs/ato_twophase.yaml.
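The ranking objective and final selection can be sketched as follows. This is a minimal illustration: the weights, function names, and scores are hypothetical stand-ins, and the real values live in the config files.

```python
import math

# Hypothetical weights; the documented values are in configs/ato_direct.yaml.
W_DIFFICULTY = 1.0   # Sentinel-src-25 difficulty (maximised)
W_FLUENCY = 0.5      # CoLA grammaticality (keeps candidates plausible)
W_COHESION = 0.25    # cosine similarity to the seed text

def ato_objective(difficulty: float, fluency: float, cohesion: float) -> float:
    """Candidate score during beam search: higher is better."""
    return W_DIFFICULTY * difficulty + W_FLUENCY * fluency + W_COHESION * cohesion

def select_output(finalists: dict[str, list[float]]) -> str:
    """ATO-Direct picks its final output among beam finalists by lowest
    perplexity, i.e. exp of the mean negative token log-probability."""
    def ppl(logprobs: list[float]) -> float:
        return math.exp(-sum(logprobs) / len(logprobs))
    return min(finalists, key=lambda text: ppl(finalists[text]))

# Toy finalists mapped to per-token log-probabilities (hypothetical values).
finalists = {
    "fluent hard sentence": [-1.2, -0.9, -1.0],
    "garbled hard sentence": [-5.0, -6.5, -4.8],
}
best = select_output(finalists)
```

The perplexity tie-break favours the finalist the language model finds most natural, which is why the grammatically plausible candidate wins here.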
```
.
├── configs/
│   ├── ato_direct.yaml                  # ATO-Direct hyperparameters
│   └── ato_twophase.yaml                # ATO-TwoPhase hyperparameters
├── euler_infra/
│   └── beam_search/
│       ├── run_experiment_cola.py       # ATO-Direct entry point
│       └── run_experiment_qwen_phased.py  # ATO-TwoPhase entry point
├── scripts/
│   ├── finetune_cola.py                 # fine-tune CoLA grammaticality classifier
│   └── sentence_compare.py              # visualise optimisation trajectory for one sentence
├── src/
│   ├── breaking_mt_sigils/              # differentiable objectives (Sentinel, CoLA, Qwen PPL)
│   ├── constraints/                     # vocabulary filter files
│   └── optimizers/
│       ├── beam_gcg.py                  # BeamSearchGCGOptimizer (ATO-Direct)
│       └── beam_gcg_phased.py           # BeamSearchGCGOptimizerPhased (ATO-TwoPhase)
├── translation/
│   ├── translate/                       # Translation backends (NLLB, TranslateGemma, Gemini)
│   ├── evaluate/                        # Metrics (XCOMET-XL, MetricX-24) and visualisation
│   └── requirements.txt                 # Separate dependency set for translation
└── requirements.txt                     # Full environment freeze (ATO optimisers)
```
```bash
git clone https://github.com/wskal/breaking-mt.git
cd breaking-mt
pip install -r requirements.txt
```

Critical version requirements (already pinned in requirements.txt):

- `transformers==4.35.2` — 4.36+ breaks compatibility
- `numpy<2` — required for PyTorch 2.2.2
Three models are required:
1. Sentinel-src-25 — translation difficulty estimator (auto-downloaded from HuggingFace): `Prosho/sentinel-src-25`
2. CoLA-finetuned XLM-RoBERTa-large — grammaticality classifier (auto-downloaded): `wskal/cola-xlm-roberta-large`

   To fine-tune it yourself instead:

   ```bash
   python scripts/finetune_cola.py
   ```

3. Qwen/Qwen2.5-72B — served locally via vLLM for perplexity scoring:
```bash
vllm serve Qwen/Qwen2.5-72B \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --dtype bfloat16 > qwen.log 2> qwen.err &
```

Log in to HuggingFace before downloading models:

```bash
huggingface-cli login
```

Parameters are set at the top of the script (see configs/ato_direct.yaml for a documented reference). With the vLLM server running:
```bash
python euler_infra/beam_search/run_experiment_cola.py
```

Results are written to a timestamped subdirectory under results/.
Parameters are set at the top of the script (see configs/ato_twophase.yaml for a documented reference). With the vLLM server running:
```bash
python euler_infra/beam_search/run_experiment_qwen_phased.py
```

ATO-augmented texts are translated into multiple target languages and scored with two reference-free metrics. The pipeline lives in translation/. Each script appends new columns to the input CSV rather than producing new files.
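The append-columns convention can be sketched in plain Python; the row contents and helper name below are hypothetical, not code from the repo:

```python
# Hypothetical row from a results CSV, represented as a dict.
row = {"original_sentence": "The cat sat on the mat."}

def append_translation(row: dict, col: str, model: str, lang3: str, text: str) -> None:
    """Scripts append a column named <original_col>_<model>_<lang3>
    rather than writing a new file containing only the translations."""
    row[f"{col}_{model}_{lang3}"] = text

append_translation(row, "original_sentence", "nllb", "deu",
                   "Die Katze saß auf der Matte.")
```

Keeping every stage in one CSV means downstream scoring scripts only need the column names to find source/translation pairs.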
Main environment — handles translation (NLLB, TranslateGemma, Gemini) and XCOMET scoring:

```bash
python -m venv translation-venv
source translation-venv/bin/activate
pip install -r translation/requirements.txt
pip install unbabel-comet  # XCOMET-XL
```

MetricX environment — requires a pinned, isolated Conda environment due to conflicting dependencies:
```bash
conda create -n metricx-env python=3.10 -y
conda activate metricx-env
git clone https://github.com/google-research/metricx metricx/
pip install "transformers[torch]==4.30.2" sentencepiece==0.1.99 datasets==2.13.1
pip install "git+https://github.com/google-research/mt-metrics-eval"
pip install protobuf==3.20.3 fsspec==2023.6.0 "numpy<2.0"
```

Apply this one-line fix to the MetricX predict script (required to avoid a padding error):
```python
# In metricx/metricx24/predict.py, inside the Trainer initialisation:
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=transformers.DataCollatorWithPadding(tokenizer),  # ← add this line
)
```

For Gemini Flash, place your API key in a .env file at the repo root:

```
GEMINI_API_KEY=your_key_here
```
Translate all text columns of a results CSV to a target language (translated columns are appended as <original_col>_<model>_<lang3>):
```bash
# NLLB-200-3.3B — runs locally; uses NLLB language codes
python translation/translate.py \
    --input results/all_methods_25.csv \
    --output results/all_methods_25_nllb_deu.csv \
    --model nllb \
    --src-lang eng_Latn \
    --tgt-lang deu_Latn

# google/translategemma-27b-it — uses ISO 639-1 codes
python translation/translate.py \
    --input results/all_methods_25.csv \
    --output results/all_methods_25_translategemma_de.csv \
    --model translategemma \
    --src-lang en \
    --tgt-lang de

# Gemini-3-Flash — uses language names; API key required
python translation/translate.py \
    --input results/all_methods_25.csv \
    --output results/all_methods_25_gemini_de.csv \
    --model gemini \
    --src-lang English \
    --tgt-lang German \
    --batch-size 10
```

To translate only specific columns, pass --columns:
```bash
python translation/translate.py \
    --input results/all_methods_25.csv \
    --output results/all_methods_25_nllb_ces.csv \
    --model nllb --src-lang eng_Latn --tgt-lang ces_Latn \
    --columns original_sentence ATO-Direct ATO-TwoPhase
```

XCOMET-XL (Unbabel/XCOMET-XL, reference-free quality estimation; run in the main venv):
```bash
python translation/evaluate/xcomet.py \
    --input results/all_methods_25_nllb_deu.csv \
    --output results/all_methods_25_nllb_deu_xcomet.csv \
    --source-col original_sentence \
    --mt-cols ATO-Direct_nllb_deu ATO-TwoPhase_nllb_deu \
    --batch-size 8
```

MetricX-24 (google/metricx-24-hybrid-xl-v2p6, reference-free QE; run in metricx-env):
```bash
conda activate metricx-env
python translation/evaluate/metricx.py \
    --input results/all_methods_25_nllb_deu.csv \
    --output results/all_methods_25_nllb_deu_metricx.csv \
    --source-col original_sentence \
    --mt-cols ATO-Direct_nllb_deu ATO-TwoPhase_nllb_deu \
    --batch-size 8
```

Scores are written as new columns: xcomet_<mt_col> and metricx_<mt_col>. To specify independent source columns per method, use --pairs source_col:mt_col instead.
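The pairs syntax maps each argument onto a (source, mt) tuple; a minimal sketch of how such an argument could be parsed (`parse_pairs` is hypothetical, not a function from the repo):

```python
def parse_pairs(pairs: list[str]) -> list[tuple]:
    """Split each 'source_col:mt_col' argument into a (source, mt) tuple."""
    return [tuple(p.split(":", 1)) for p in pairs]

parsed = parse_pairs(["original_sentence:ATO-Direct_nllb_deu",
                      "ATO-TwoPhase:ATO-TwoPhase_nllb_deu"])
```

This lets each augmented column be scored against its own source text instead of a single shared --source-col.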
After scoring, generate bar-chart comparisons across methods and target languages:
```python
from translation.evaluate.visualize import load_scores, plot_scores

scores = load_scores()  # reads all scored CSVs from translation/results/
fig = plot_scores(scores)
fig.savefig("figs_to_upload/translation_quality.pdf", bbox_inches="tight")
```

Charts compare Original, DIPPER, Qwen-72B, ATO-Direct, and ATO-TwoPhase across Czech, German, Icelandic, Russian, and Spanish using both XCOMET and MetricX, with a normalised panel relative to the unmodified baseline.
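The baseline normalisation can be sketched as follows (the scores below are hypothetical, and this is not the repo's plotting code):

```python
# Hypothetical XCOMET scores per method for one target language;
# higher means better translation quality, so ATO should score lower.
scores = {"Original": 0.92, "DIPPER": 0.88,
          "ATO-Direct": 0.74, "ATO-TwoPhase": 0.70}

# Normalise each method relative to the unmodified baseline, so the
# panel shows the fraction of baseline quality each method retains.
normalised = {m: s / scores["Original"] for m, s in scores.items()}
```

Normalising per language makes methods comparable across languages whose absolute metric ranges differ.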
Batch scripts are in euler_infra/translation/:
| Script | Purpose |
|---|---|
| `job_nllb_{ces,deu,isl,rus,spa}.sbatch` | NLLB translation for each of the five target languages |
| `job_translategemma_deu.sbatch` | TranslateGemma translation to German |
| `job_gemini_deu.sbatch` | Gemini Flash translation to German |
| `job_xcomet.sbatch` | XCOMET-XL scoring (parametric `$LANG`) |
| `job_metricx.sbatch` | MetricX-24 scoring (parametric `$LANG`) |
Submit a parametric scoring job with:
```bash
sbatch --export=LANG=deu euler_infra/translation/job_xcomet.sbatch
sbatch --export=LANG=deu euler_infra/translation/job_metricx.sbatch
```

The euler_infra/ directory contains SLURM job scripts and setup guides for the ETH Euler cluster. These are not required for general use — see euler_infra/README.md for details.
MIT
