This is the implementation of the paper "Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction" by Haoyu Wang, Yuxin Chen, Liang Luo, Buyun Zhang, Ellie Wen, and Pan Li.
We recommend installing the dependencies with conda.
conda create -n itpo python=3.10
conda activate itpo

To install the dependencies, run the following commands:
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.57.0
pip install uv
uv pip install vllm==0.10.2 --torch-backend=cu128

The implementation of ITPO is based on VeRL version 0.5.0.dev. To install the corresponding version of VeRL, run the following commands:
mkdir ITPO
cd ITPO
git clone https://github.com/verl-project/verl.git
cd verl
git checkout ddd86f52 # The corresponding commit hash of VeRL 0.5.0.dev
cd ..
pip install -e verl/
pip install --extra-index-url https://miropsota.github.io/torch_packages_builder flash_attn==2.8.3+pt2.8.0cu128
pip install ray==2.49.2
pip install litellm==1.79.0
pip install nltk
pip install click==8.2.1

Download the codebase from this repository, and put the itpo folder into the verl/recipe folder of the VeRL repo.
For the remaining files, put each one into its corresponding verl/verl/xxx folder.
[The Huggingface Hub DNS] When downloading a new model/tokenizer, the installed Xet version may be incompatible with the current version of transformers. In this case, refer to this issue to download a compatible Xet version.
[Click incompatible with Ray] Install the Click version recommended in this issue to fix it.
[Prebuilt wheels of Flash Attention] If you encounter an error when installing Flash Attention, refer to this issue to download prebuilt wheels instead.
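After installation, you can sanity-check that the pinned versions above are actually the ones in your environment. This is a small stdlib sketch (not part of the ITPO codebase); adjust the PINS dict if you intentionally deviate from the versions in this README.

```python
from importlib.metadata import version, PackageNotFoundError

# Pinned versions taken from the install commands above.
PINS = {
    "torch": "2.8.0",
    "transformers": "4.57.0",
    "vllm": "0.10.2",
    "ray": "2.49.2",
    "litellm": "1.79.0",
    "click": "8.2.1",
}

def check_pins(pins=PINS, get_version=None):
    """Return {package: (expected, found)} for every mismatch.

    found is None when the package is not installed. get_version can be
    injected for testing; by default it queries the current environment.
    """
    if get_version is None:
        def get_version(name):
            try:
                return version(name)
            except PackageNotFoundError:
                return None
    return {name: (want, got)
            for name, want in pins.items()
            if (got := get_version(name)) != want}

if __name__ == "__main__":
    for name, (want, got) in sorted(check_pins().items()):
        print(f"{name}: expected {want}, found {got or 'not installed'}")
```

An empty result means every pinned package matches; anything printed is worth fixing before launching training.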
cd ITPO
mkdir data
cd data
mkdir collab
cd ..
cd ..
cd verl
python recipe/itpo/process_data/process_dataset.py --dataset collabllm/collabllm-multiturn-medium-large --local_dir [$HOME PATH of ITPO]/ITPO/data/collab/medium --dataset_type rl

One could also organize MTMedDialogue in the same format as above and process it the same way.
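The exact output schema is produced by process_dataset.py; for a custom corpus such as MTMedDialogue, the key step is flattening each dialogue into the same multi-turn chat-message format before processing. A hypothetical sketch (the speaker labels and helper name are illustrative assumptions, not the script's actual interface):

```python
def to_chat_messages(turns):
    """Convert alternating (speaker, text) turns into OpenAI-style messages.

    Speaker labels are an assumption for illustration: user-side speakers
    (e.g. "user", "patient") map to "user", model-side speakers
    (e.g. "assistant", "doctor") map to "assistant".
    """
    role_map = {"user": "user", "patient": "user",
                "assistant": "assistant", "doctor": "assistant"}
    messages = []
    for speaker, text in turns:
        role = role_map.get(speaker.lower())
        if role is None:
            raise ValueError(f"unknown speaker label: {speaker!r}")
        messages.append({"role": role, "content": text})
    return messages
```

Dialogues flattened this way can then be written out with the same processing script, keeping a single code path for both datasets.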
After data preparation, the folder structure should look like the following:
ITPO
├── data
│ ├── collab
│ │ ├── math
│ │ │ ├── rl_train.parquet
│ │ │ ├── ...
│ │ ...
├── verl
│ ├── recipe/itpo
└── ...
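A quick way to confirm the layout matches the tree above is a small stdlib check (a sketch; extend EXPECTED_DIRS with your own dataset sub-folders such as data/collab/medium):

```python
from pathlib import Path

# Directories taken from the folder tree above.
EXPECTED_DIRS = [
    "data/collab",
    "verl/recipe/itpo",
]

def missing_paths(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]

if __name__ == "__main__":
    # Run from the ITPO root; an empty list means the layout is complete.
    print(missing_paths("."))
```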
The following steps describe how to run the code: setting up the auxiliary servers, training, and evaluation.
First, set up the user simulator and the LLM judge as OpenAI-compatible servers via vLLM.
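Once a server is up, you can sanity-check the endpoint with a plain OpenAI-style chat request before wiring it into training. This is a minimal stdlib sketch, not part of the ITPO codebase; the port and model name follow the launch command below.

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages, temperature=0.0):
    """Build (url, body) for an OpenAI-compatible /v1/chat/completions call."""
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages,
                       "temperature": temperature}).encode("utf-8")
    return url, body

def chat(base_url, model, messages):
    """Send the request and return the assistant's reply text."""
    url, body = build_chat_request(base_url, model, messages)
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the vLLM server to be running):
# chat("http://localhost:8001", "Qwen/Qwen2.5-14B-Instruct",
#      [{"role": "user", "content": "Hello"}])
```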
For example:

CUDA_VISIBLE_DEVICES=0,1 nohup python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-14B-Instruct --port 8001 --tensor-parallel-size 2 >./vllm_8001.log 2>&1 </dev/null &

Then launch training with the desired script:

bash recipe/itpo/scripts/$Name_of_Script.sh

To evaluate the trained model, run:

bash recipe/itpo/scripts/eval/eval.sh

If you find this work useful for your research, please consider citing the following paper:
@article{wang2026implicit,
title = {Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction},
author = {Wang, Haoyu and Chen, Yuxin and Luo, Liang and Zhang, Buyun and Wen, Ellie and Li, Pan},
journal = {arXiv preprint arXiv:2603.23550},
year = {2026}
}