This is the official implementation of the paper: "Learning to Route Languages for Multilingual Preference Optimization". LRPO is an online preference optimization method for multilingual LLMs that treats the rollout language as a selectable training variable. LRPO has three main components:
- Language-routed rollouts: for each training question, LRPO generates a group of responses in multiple target languages under a fixed rollout budget.
- Calibrated multilingual rewards: generated responses are compared with high-quality references using cross-lingual semantic similarity, then calibrated so scores are more comparable across language pairs.
- Trainable language router: a contextual multi-armed bandit learns topic- and region-conditioned language preferences and balances exploration with exploitation during training.
LRPO builds on verl, so most distributed training, rollout, checkpointing, and logging behavior follows the upstream verl interface.
Create a Python environment with CUDA-compatible PyTorch, then install the package in editable mode:
git clone https://github.com/Guochry/LRPO.git && cd LRPO
pip install -e .
pip install -r requirements.txt
# Optional vLLM rollout backend
pip install ".[vllm]"Reward code may require additional assets, depending on the reward function you use. Specify these assets in verl/utils/reward_score/calibrated_rs.py:
- a language identification model (e.g., fastText LID);
- multilingual embedding models (e.g., mmBERT) or reward models.
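As a rough illustration of how these assets fit together, the sketch below combines a fastText LID check with a multilingual embedding model to produce a calibrated cross-lingual similarity score. The model choices, calibration offsets, and function name are assumptions made for illustration, not the actual interface of calibrated_rs.py.

```python
# Illustrative sketch of a calibrated cross-lingual similarity reward.
# Model names, the calibration table, and the function name are placeholders,
# not the repo's defaults in verl/utils/reward_score/calibrated_rs.py.
import fasttext
import numpy as np
from sentence_transformers import SentenceTransformer

LID_MODEL = fasttext.load_model("lid.176.bin")                 # fastText language ID
EMBEDDER = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual embeddings

# Per language-pair offsets so similarity scores become more comparable
# across language pairs (values here are made up for illustration).
CALIBRATION = {("en", "zh"): 0.05, ("en", "en"): 0.0}

def calibrated_reward(response: str, reference: str,
                      reference_lang: str, target_lang: str) -> float:
    # Reject responses that are not in the requested rollout language.
    labels, _ = LID_MODEL.predict(response.replace("\n", " "))
    detected = labels[0].replace("__label__", "")
    if detected != target_lang:
        return 0.0

    # Cross-lingual semantic similarity between response and reference.
    emb = EMBEDDER.encode([response, reference], normalize_embeddings=True)
    similarity = float(np.dot(emb[0], emb[1]))

    # Shift by a language-pair offset, then clip to [0, 1].
    offset = CALIBRATION.get((reference_lang, target_lang), 0.0)
    return max(0.0, min(1.0, similarity - offset))
```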
LRPO uses the verl parquet format. Each row should provide the fields needed for rollout, reward computation, and language routing:
- prompt: the training prompt
- reward_model.ground_truth: the reference provided for reward computation
- ability: topic label used by the language router
- extra_info.language: language in which the example is originally written
- extra_info.region: region label for regional examples, if applicable
Preprocessing examples are available in examples/data_preprocess/. Treat them as templates and adapt them to your own dataset paths and schema.
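For orientation, the snippet below sketches how a single row with these fields could be written to parquet with pandas. The chat format, field values, and output path are illustrative assumptions; follow the preprocessing scripts for the exact schema of your task.

```python
# Minimal illustration of one LRPO training row; all values are placeholders
# and the exact schema should follow the scripts in examples/data_preprocess/.
import pandas as pd

rows = [
    {
        "prompt": [{"role": "user", "content": "What festival is held in Kyoto every July?"}],
        "reward_model": {"ground_truth": "The Gion Matsuri, held throughout July in Kyoto."},
        "ability": "culture",            # topic label read by the language router
        "extra_info": {
            "language": "en",            # language the example was originally written in
            "region": "Kyoto, Japan",    # region label for regional examples
        },
    }
]

pd.DataFrame(rows).to_parquet("train.parquet")
```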
The main training entry point is verl.trainer.main_ppo with LRPO-specific routing options. See examples/grpo_trainer/run_lrpo.sh for a concrete launch script. Replace the data paths, checkpoint paths, reward-asset paths, logging paths, and PYTHONPATH before using it outside the original environment.
The training script introduces several LRPO-specific hyperparameters for the language router:
| Option | Meaning |
|---|---|
| +data.dynamic_lang_policy | Enables the online language router. |
| +data.lang_policy_alpha | Exponential moving average update rate for router values. |
| +data.lang_policy_update_every | Number of reward steps to buffer before updating the router. |
| +data.lang_policy_temperature_init | Initial softmax temperature for language sampling. |
| +data.lang_policy_temperature_min | Minimum annealed sampling temperature. |
| +data.lang_policy_temperature_decay | Temperature decay rate. |
| +data.lang_policy_epsilon_init | Initial epsilon-greedy exploration rate. |
| +data.lang_policy_epsilon_min | Minimum exploration rate. |
| +data.lang_policy_epsilon_decay | Exploration decay rate. |
| +data.lang_policy_orig_lang_min | Minimum number of original-language rollouts kept for each prompt group. |
| +data.lang_policy_group_norm | Reward normalization for router updates, for example center or zscore. |
| +data.lang_policy_log_path | Optional JSONL path for router probability logs. |
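To make these options concrete, here is a simplified sketch of the kind of contextual bandit they control: per-(topic, language) values updated by an exponential moving average, softmax sampling with an annealed temperature, epsilon-greedy exploration with a floor, and a minimum number of original-language rollouts per group. It is only an illustration of the mechanism, not the router implementation in this repo.

```python
# Simplified, self-contained sketch of the bandit the options above configure.
# Option names appear in comments for orientation; this is not the repo's router class.
import math
import random
from collections import defaultdict

class LanguageRouter:
    def __init__(self, languages, alpha=0.1,
                 temp=1.0, temp_min=0.1, temp_decay=0.995,
                 eps=0.2, eps_min=0.01, eps_decay=0.995):
        self.languages = languages
        self.alpha = alpha                                   # lang_policy_alpha
        self.temp, self.temp_min, self.temp_decay = temp, temp_min, temp_decay
        self.eps, self.eps_min, self.eps_decay = eps, eps_min, eps_decay
        self.values = defaultdict(float)                     # per (topic, language) value

    def sample(self, topic, n_rollouts, orig_lang, orig_lang_min=1):
        """Choose a language for each rollout in a prompt group, keeping at
        least orig_lang_min rollouts in the example's original language."""
        picks = [orig_lang] * min(orig_lang_min, n_rollouts)
        scores = [self.values[(topic, lang)] / self.temp for lang in self.languages]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]          # softmax over values
        while len(picks) < n_rollouts:
            if random.random() < self.eps:                   # epsilon-greedy exploration
                picks.append(random.choice(self.languages))
            else:
                picks.append(random.choices(self.languages, weights=weights)[0])
        return picks

    def update(self, topic, lang, group_normalized_reward):
        """EMA value update from a (group-normalized) reward, then anneal the
        sampling temperature and exploration rate."""
        key = (topic, lang)
        self.values[key] = ((1 - self.alpha) * self.values[key]
                            + self.alpha * group_normalized_reward)
        self.temp = max(self.temp_min, self.temp * self.temp_decay)
        self.eps = max(self.eps_min, self.eps * self.eps_decay)
```

Per the table above, the trainer buffers rewards for lang_policy_update_every reward steps and normalizes them according to lang_policy_group_norm before applying an update of this kind.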
In addition to existing multilingual benchmarks, this project introduces CARE (Pro), a cross-lingual and cross-cultural evaluation benchmark for realistic multilingual information needs. CARE (Pro) targets two settings that are often underrepresented in standard benchmarks:
- fine-grained insider regional knowledge, where questions require local, city-, town-, or community-level knowledge rather than broad country-level facts;
- cross-cultural information seeking, where users ask about another region or culture from a foreign-language perspective.
The dataset is publicly available at geyang627/care_pro.
To evaluate on CARE (Pro), generate model responses for each question and compare each response against the gold reference with the LLM-as-a-judge prompt in evaluation/prompt.txt. The judge assigns one of four labels:
- CORRECT
- CORRECT_BUT_WRONG_LANGUAGE
- INCORRECT
- NOT_ATTEMPTED
Only CORRECT counts toward accuracy. CORRECT_BUT_WRONG_LANGUAGE is tracked separately to distinguish language errors from factual errors, but it is treated as incorrect in the final accuracy.
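Given the judge labels (one per line in a text file, an assumed storage format), the final accuracy can be aggregated as follows:

```python
# Aggregate LLM-as-a-judge labels into the final CARE (Pro) accuracy.
# Assumes one label per line in a plain-text file (an illustrative format).
from collections import Counter

with open("judge_labels.txt") as f:
    labels = [line.strip() for line in f if line.strip()]

counts = Counter(labels)
accuracy = counts["CORRECT"] / len(labels)
wrong_language = counts["CORRECT_BUT_WRONG_LANGUAGE"] / len(labels)

print(f"accuracy: {accuracy:.3f} (only CORRECT counts)")
print(f"correct but wrong language: {wrong_language:.3f}")
```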
@misc{lrpo,
title = {Learning to Route Languages for Multilingual Preference Optimization},
author = {Geyang Guo and Hiromi Wakaki and Yuki Mitsufuji and Alan Ritter and Wei Xu},
year = {2026},
note = {ICML}
}