Learning to Route Languages for Multilingual Preference Optimization

[Figure: LRPO overview]

This is the official implementation for the ICML 2026 paper "Learning to Route Languages for Multilingual Preference Optimization". LRPO is an online preference optimization method for multilingual LLMs that treats the rollout language as a selectable training variable. It has three main components:

  • Language-routed rollouts: for each training question, LRPO generates a group of responses in multiple target languages under a fixed rollout budget.
  • Calibrated multilingual rewards: generated responses are compared with high-quality references using cross-lingual semantic similarity, then calibrated so scores are more comparable across language pairs.
  • Trainable language router: a contextual multi-armed bandit learns topic- and region-conditioned language preferences and balances exploration with exploitation during training (a minimal sketch follows this list).
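
As orientation for what the router does, here is a minimal, self-contained sketch of a contextual bandit over rollout languages, assuming EMA value estimates per (topic, language) arm and temperature-annealed softmax sampling with epsilon-greedy exploration. The class and method names are illustrative, not the repository's implementation; the real router is configured through the +data.lang_policy_* options described under Training.

import math
import random
from collections import defaultdict

class LanguageRouter:
    """Illustrative contextual bandit over rollout languages.

    Keeps one EMA value estimate per (context, language) arm, where the
    context is a topic or region label, and samples languages from a
    temperature-annealed softmax with epsilon-greedy exploration.
    """

    def __init__(self, languages, alpha=0.1, temperature=1.0,
                 temp_min=0.1, temp_decay=0.999,
                 epsilon=0.2, eps_min=0.01, eps_decay=0.999):
        self.languages = languages
        self.alpha = alpha                      # EMA update rate
        self.temperature = temperature
        self.temp_min, self.temp_decay = temp_min, temp_decay
        self.epsilon = epsilon
        self.eps_min, self.eps_decay = eps_min, eps_decay
        self.values = defaultdict(float)        # (context, lang) -> value

    def sample(self, context):
        # Epsilon-greedy branch: explore a uniformly random language.
        if random.random() < self.epsilon:
            return random.choice(self.languages)
        # Otherwise sample from a softmax over this context's arm values.
        logits = [self.values[(context, lang)] / self.temperature
                  for lang in self.languages]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        return random.choices(self.languages, weights=weights)[0]

    def update(self, context, lang, reward):
        # EMA update of the chosen arm, then anneal both schedules.
        key = (context, lang)
        self.values[key] += self.alpha * (reward - self.values[key])
        self.temperature = max(self.temp_min,
                               self.temperature * self.temp_decay)
        self.epsilon = max(self.eps_min, self.epsilon * self.eps_decay)

In a training loop, router.sample(topic) would pick the rollout language for a prompt group, and router.update(topic, lang, reward) would fold the calibrated group reward back into the chosen arm's value.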

LRPO builds on verl, so most distributed training, rollout, checkpointing, and logging behavior follows the upstream verl interface.

Quick Start

Create a Python environment with CUDA-compatible PyTorch, then install the package in editable mode:

git clone https://github.com/Guochry/LRPO.git && cd LRPO
pip install -e .
pip install -r requirements.txt

# Optional vLLM rollout backend
pip install ".[vllm]"

Depending on the reward function you use, the reward code may require additional assets; specify their paths in verl/utils/reward_score/calibrated_rs.py (a hedged sketch follows the list below):

  • a language identification model (e.g., fastText LID);
  • multilingual embedding models (e.g., mmBERT) or reward models.
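
For orientation, here is a hedged sketch of what such a calibrated reward could look like, assuming a fastText LID model and any multilingual embedder exposing an encode(list[str]) method (e.g., a sentence-transformers model). The per-language z-score calibration is one plausible reading of "calibrated so scores are more comparable across language pairs", not the repository's exact formula.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def language_matches(response, expected_lang, lid_model):
    # fastText LID returns labels like "__label__en"; fastText's predict()
    # rejects newlines, hence the replace.
    labels, _ = lid_model.predict(response.replace("\n", " "))
    return labels[0] == f"__label__{expected_lang}"

def calibrated_reward(response, reference, lang, encoder, lang_stats):
    # Raw reward: cross-lingual semantic similarity between the response
    # and the high-quality reference.
    emb = encoder.encode([response, reference])
    raw = cosine(emb[0], emb[1])
    # Hypothetical calibration: z-score against per-language running
    # statistics so rewards are comparable across language pairs.
    mean, std = lang_stats.get(lang, (0.0, 1.0))
    return (raw - mean) / max(std, 1e-6)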

Data

LRPO uses the verl parquet format. Each row should provide the fields needed for rollout, reward computation, and language routing:

  • prompt
  • reward_model.ground_truth: the high-quality reference response used for reward scoring
  • ability: topic label used by the language router
  • extra_info.language: language in which the example is originally written
  • extra_info.region: region label for regional examples, if applicable

Preprocessing examples are available in examples/data_preprocess/. Treat them as templates and adapt them to your own dataset paths and schema.
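
As a concrete illustration, a row with the expected fields could be written like this (all values are placeholders; the chat-message prompt format assumes verl's usual convention):

import pandas as pd

# One illustrative row in the verl parquet schema described above.
rows = [{
    "prompt": [{"role": "user", "content": "What is the local name of ...?"}],
    "reward_model": {"ground_truth": "A high-quality reference response."},
    "ability": "geography",                           # topic label for the router
    "extra_info": {"language": "ja", "region": "Kansai"},
}]

pd.DataFrame(rows).to_parquet("train.parquet")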

Training

The main training entry point is verl.trainer.main_ppo with LRPO-specific routing options. See examples/grpo_trainer/run_lrpo.sh for a concrete launch script. Replace the data paths, checkpoint paths, reward-asset paths, logging paths, and PYTHONPATH before using it outside the original environment.

The training script introduces several LRPO-specific hyperparameters for the language router:

  • +data.dynamic_lang_policy: enables the online language router.
  • +data.lang_policy_alpha: exponential-moving-average update rate for router values.
  • +data.lang_policy_update_every: number of reward steps to buffer before updating the router.
  • +data.lang_policy_temperature_init: initial softmax temperature for language sampling.
  • +data.lang_policy_temperature_min: minimum annealed sampling temperature.
  • +data.lang_policy_temperature_decay: temperature decay rate.
  • +data.lang_policy_epsilon_init: initial epsilon-greedy exploration rate.
  • +data.lang_policy_epsilon_min: minimum exploration rate.
  • +data.lang_policy_epsilon_decay: exploration decay rate.
  • +data.lang_policy_orig_lang_min: minimum number of original-language rollouts kept in each prompt group.
  • +data.lang_policy_group_norm: reward normalization for router updates (e.g., center or zscore).
  • +data.lang_policy_log_path: optional JSONL path for router probability logs.
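
These options are Hydra-style overrides on top of the standard verl trainer configuration. As a hedged illustration of how they combine into a launch command (the numeric values below are placeholders, not recommended settings; real runs also need the usual verl data, model, and rollout overrides from run_lrpo.sh):

import subprocess
import sys

# Placeholder router settings; tune for your own run.
router_overrides = {
    "+data.dynamic_lang_policy": "true",
    "+data.lang_policy_alpha": "0.1",
    "+data.lang_policy_update_every": "8",
    "+data.lang_policy_temperature_init": "1.0",
    "+data.lang_policy_temperature_min": "0.1",
    "+data.lang_policy_temperature_decay": "0.999",
    "+data.lang_policy_epsilon_init": "0.2",
    "+data.lang_policy_epsilon_min": "0.01",
    "+data.lang_policy_epsilon_decay": "0.999",
    "+data.lang_policy_orig_lang_min": "1",
    "+data.lang_policy_group_norm": "center",
    "+data.lang_policy_log_path": "logs/router_probs.jsonl",
}

cmd = [sys.executable, "-m", "verl.trainer.main_ppo"]
cmd += [f"{key}={value}" for key, value in router_overrides.items()]
# Append the usual verl data/model/rollout/trainer overrides here
# (see examples/grpo_trainer/run_lrpo.sh for the full set).
subprocess.run(cmd, check=True)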

Evaluation

In addition to existing multilingual benchmarks, this project introduces CARE (Pro), a cross-lingual and cross-cultural evaluation benchmark for realistic multilingual information needs. CARE (Pro) targets two settings that are often underrepresented in standard benchmarks:

  • fine-grained insider regional knowledge, where questions require local, city-, town-, or community-level knowledge rather than broad country-level facts;
  • cross-cultural information seeking, where users ask about another region or culture from a foreign-language perspective.

The dataset is publicly available at geyang627/care_pro. To evaluate on CARE (Pro), generate model responses for each question and compare each response against the gold reference with the LLM-as-a-judge prompt in evaluation/prompt.txt. The judge assigns one of four labels:

  • CORRECT
  • CORRECT_BUT_WRONG_LANGUAGE
  • INCORRECT
  • NOT_ATTEMPTED

Only CORRECT counts as correct. CORRECT_BUT_WRONG_LANGUAGE is reported separately so that language fidelity can be distinguished from factual correctness, but it is scored as incorrect in the final accuracy.
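
A minimal scoring helper under this protocol (it assumes the judge output has already been parsed into one label string per question):

from collections import Counter

def care_pro_accuracy(labels):
    """Accuracy over judge labels: only CORRECT counts as correct.

    CORRECT_BUT_WRONG_LANGUAGE is reported separately but scored as
    incorrect, matching the evaluation protocol above.
    """
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts["CORRECT"] / n,
        "wrong_language_rate": counts["CORRECT_BUT_WRONG_LANGUAGE"] / n,
        "not_attempted_rate": counts["NOT_ATTEMPTED"] / n,
    }

# Example:
# care_pro_accuracy(["CORRECT", "INCORRECT", "CORRECT_BUT_WRONG_LANGUAGE"])
# -> {'accuracy': 0.333..., 'wrong_language_rate': 0.333..., ...}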

Citation

@misc{lrpo,
  title = {Learning to Route Languages for Multilingual Preference Optimization},
  author = {Geyang Guo and Hiromi Wakaki and Yuki Mitsufuji and Alan Ritter and Wei Xu},
  year = {2026},
  note = {ICML}
}
