This is the implementation of the paper "Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction" by Haoyu Wang, Yuxin Chen, Liang Luo, Buyun Zhang, Ellie Wen, and Pan Li.
We recommend installing the dependencies with conda.
conda create -n itpo python=3.10
conda activate itpo

To install the dependencies, run the following commands:
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.57.0
pip install uv
uv pip install vllm==0.10.2 --torch-backend=cu128

The implementation of ITPO is based on VeRL version 0.5.0.dev. To install the corresponding version of VeRL, run the following commands:
mkdir ITPO
cd ITPO
git clone https://github.com/verl-project/verl.git
cd verl
git checkout ddd86f52 # The corresponding commit hash of VeRL 0.5.0.dev
cd ..
pip install -e verl/
pip install --extra-index-url https://miropsota.github.io/torch_packages_builder flash_attn==2.8.3+pt2.8.0cu128
pip install ray==2.49.2
pip install litellm==1.79.0
pip install nltk
pip install click==8.2.1

Download the codebase from this repository, and put the itpo folder into the verl/recipe folder of the VeRL repo.
For the remaining files, put each one into its corresponding verl/verl/xxx folder.
[The Huggingface Hub DNS] When downloading a new model/tokenizer, the installed Xet version may be incompatible with the current version of transformers. In this case, refer to this issue to download a compatible Xet version.
[Click incompatible with Ray] Install the Click version recommended in this issue to fix it.
[Prebuilt wheels of Flash Attention] If you encounter an error when installing Flash Attention, refer to this issue to download prebuilt wheels instead.
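After installation, you can sanity-check that the pinned versions above are actually the ones in your environment. This is a small stdlib sketch (not part of the ITPO codebase); adjust the PINS dict if you intentionally deviate from the versions in this README.

```python
from importlib.metadata import version, PackageNotFoundError

# Pinned versions taken from the install commands above.
PINS = {
    "torch": "2.8.0",
    "transformers": "4.57.0",
    "vllm": "0.10.2",
    "ray": "2.49.2",
    "litellm": "1.79.0",
    "click": "8.2.1",
}

def check_pins(pins=PINS, get_version=None):
    """Return {package: (expected, found)} for every mismatch.

    found is None when the package is not installed. get_version can be
    injected for testing; by default it queries the current environment.
    """
    if get_version is None:
        def get_version(name):
            try:
                return version(name)
            except PackageNotFoundError:
                return None
    return {name: (want, got)
            for name, want in pins.items()
            if (got := get_version(name)) != want}

if __name__ == "__main__":
    for name, (want, got) in sorted(check_pins().items()):
        print(f"{name}: expected {want}, found {got or 'not installed'}")
```

An empty result means every pinned package matches; anything printed is worth fixing before launching training.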
cd ITPO
mkdir data
cd data
mkdir collab
cd ..
cd ..
cd verl
python recipe/itpo/process_data/process_dataset.py --dataset collabllm/collabllm-multiturn-medium-large --local_dir [$HOME PATH of ITPO]/ITPO/data/collab/medium --dataset_type rl

One could also organize MTMedDialogue in the same format as above and process it the same way.
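The exact output schema is produced by process_dataset.py; for a custom corpus such as MTMedDialogue, the key step is flattening each dialogue into the same multi-turn chat-message format before processing. A hypothetical sketch (the speaker labels and helper name are illustrative assumptions, not the script's actual interface):

```python
def to_chat_messages(turns):
    """Convert alternating (speaker, text) turns into OpenAI-style messages.

    Speaker labels are an assumption for illustration: user-side speakers
    (e.g. "user", "patient") map to "user", model-side speakers
    (e.g. "assistant", "doctor") map to "assistant".
    """
    role_map = {"user": "user", "patient": "user",
                "assistant": "assistant", "doctor": "assistant"}
    messages = []
    for speaker, text in turns:
        role = role_map.get(speaker.lower())
        if role is None:
            raise ValueError(f"unknown speaker label: {speaker!r}")
        messages.append({"role": role, "content": text})
    return messages
```

Dialogues flattened this way can then be written out with the same processing script, keeping a single code path for both datasets.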
After data preparation, the folder structure should look like the following:
ITPO
├── data
│ ├── collab
│ │ ├── math
│ │ │ ├── rl_train.parquet
│ │ │ ├── ...
│ │ ...
├── verl
│ ├── recipe/itpo
└── ...
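A quick way to confirm the layout matches the tree above is a small stdlib check (a sketch; extend EXPECTED_DIRS with your own dataset sub-folders such as data/collab/medium):

```python
from pathlib import Path

# Directories taken from the folder tree above.
EXPECTED_DIRS = [
    "data/collab",
    "verl/recipe/itpo",
]

def missing_paths(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]

if __name__ == "__main__":
    # Run from the ITPO root; an empty list means the layout is complete.
    print(missing_paths("."))
```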
The following steps describe how to run the code: setting up the auxiliary servers, training, and evaluation.
First, set up the user simulator and the LLM judge as OpenAI-compatible servers via vLLM.
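Once a server is up, you can sanity-check the endpoint with a plain OpenAI-style chat request before wiring it into training. This is a minimal stdlib sketch, not part of the ITPO codebase; the port and model name follow the launch command below.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, temperature=0.0):
    """Build (url, body) for an OpenAI-compatible /v1/chat/completions call."""
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages,
                       "temperature": temperature}).encode("utf-8")
    return url, body

def chat(base_url, model, messages):
    """Send the request and return the assistant's reply text."""
    url, body = build_chat_request(base_url, model, messages)
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the vLLM server to be running):
# chat("http://localhost:8001", "Qwen/Qwen2.5-14B-Instruct",
#      [{"role": "user", "content": "Hello"}])
```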
For example:

CUDA_VISIBLE_DEVICES=0,1 nohup python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-14B-Instruct --port 8001 --tensor-parallel-size 2 >./vllm_8001.log 2>&1 </dev/null &

Then launch training with the desired script:

bash recipe/itpo/scripts/$Name_of_Script.sh

To evaluate the trained model, run:

bash recipe/itpo/scripts/eval/eval.sh

If you find this work useful for your research, please consider citing the following paper:
@article{wang2026implicit,
title = {Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction},
author = {Wang, Haoyu and Chen, Yuxin and Luo, Liang and Zhang, Buyun and Wen, Ellie and Li, Pan},
journal = {arXiv preprint arXiv:2603.23550},
year = {2026}
}