Skip to content

Conversation

@gxlvera
Copy link
Contributor

@gxlvera gxlvera commented Dec 17, 2025

Goal

VLM Multi-turn (related to #1075)

TODO / Status

Rollout

  • Created a custom multi-turn rollout function in examples/vlm_multi_turn/rollout.py
    • Pluggable interactive env (env path specified via rollout argument --custom-config-path)
    • Early-stop logic
      • max_turns (specified via rollout argument --custom-config-path)
      • max_new_token cap
    • loss_mask / rollout_log_probs
      • loss_mask = 1 on assistant tokens
      • loss_mask = 0 on user/observation tokens
      • rollout_log_probs padded to match
      • initial sample.prompt stays unmasked

Interactive environment

  • Custom env split from rollout: examples/vlm_multi_turn/env_geo3k.py
    • build_env/ reset / step / format_observation functions for per-turn feedback

Data & dataset

Experiment Result

Trained Qwen3-VL-2B-Instruct with FSDP backend on the geo3k dataset with multi-turn reasoning, using GRPO.
vlm_multi_turn_geo3k_reward

type=int,
default=None,
help="Maximum turns for multi-turn custom rollout (e.g., Sokoban). Defaults to rollout implementation config.",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to pass these 2 configs through --custom-config-path?

Copy link
Contributor Author

@gxlvera gxlvera Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it sounds neater. I have pushed the change, thanks!

@zhaochenyang20
Copy link
Collaborator

Nice done Xiaole!

@yogesh1801
Copy link

@gxlvera are you working on OpenCUA part? can i help with it?

@zhaochenyang20
Copy link
Collaborator

Great job so far!

@gxlvera
Copy link
Contributor Author

gxlvera commented Dec 22, 2025

@gxlvera are you working on OpenCUA part? can i help with it?

Hi, you could try to support OpenCUA's AgentNet dataset. Note that if you want to implement the online interaction, maybe you need an os sandbox for simulation. It's OK if you stick with offline mode (without interaction) although I personally don't think that would work well.

@yogesh1801
Copy link

Sure @gxlvera can try that

@gxlvera gxlvera changed the title [VLM] Multi-turn with Sokoban Dataset [VLM] Multi-turn Dec 27, 2025
@yogesh1801
Copy link

@gxlvera can you help with openCUA a bit? where can i connect u?

@gxlvera gxlvera marked this pull request as ready for review December 29, 2025 15:32
@gxlvera
Copy link
Contributor Author

gxlvera commented Dec 29, 2025

@gxlvera can you help with openCUA a bit? where can i connect u?

Hi, you could DM me at gxlvera@gmail.com~

@gxlvera gxlvera changed the title [VLM] Multi-turn [VLM] geo3k multi-turn Jan 2, 2026
@zhaochenyang20 zhaochenyang20 changed the title [VLM] geo3k multi-turn [VLM] end2end geo3k multi-turn RL of VLM Recipe Jan 2, 2026
@zhaochenyang20
Copy link
Collaborator

We shall also have a Megatron version,. But FSDP works cool!

@zhaochenyang20
Copy link
Collaborator

I evaluated the performance, which works well to me.

image

@zhaochenyang20 zhaochenyang20 merged commit 0878cd0 into THUDM:main Jan 2, 2026
19 of 34 checks passed
kafkayu pushed a commit to kafkayu/slime that referenced this pull request Jan 8, 2026
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants