[VLM] end2end geo3k multi-turn RL of VLM Recipe #1141

gxlvera · 2025-12-17T21:34:39Z

Goal

VLM Multi-turn (related to #1075)

TODO / Status

Rollout

Created a custom multi-turn rollout function in examples/vlm_multi_turn/rollout.py
- Pluggable interactive env (env path specified via rollout argument --custom-config-path)
- Early-stop logic
  - max_turns (specified via rollout argument --custom-config-path)
  - max_new_token cap
- loss_mask / rollout_log_probs
  - loss_mask = 1 on assistant tokens
  - loss_mask = 0 on user/observation tokens
  - rollout_log_probs padded to match
  - initial sample.prompt stays unmasked

Interactive environment

Custom env split from rollout: examples/vlm_multi_turn/env_geo3k.py
- build_env/ reset / step / format_observation functions for per-turn feedback

Data & dataset

Support geo3k multimodal dataset: https://huggingface.co/datasets/chenhegu/geo3k_imgurl
OpenCUA not supported yet (may require a more complex interactive env, e.g. https://github.com/xlang-ai/OSWorld)

Experiment Result

Trained Qwen3-VL-2B-Instruct with FSDP backend on the geo3k dataset with multi-turn reasoning, using GRPO.

…ds refine (THUDM#1075)

…HUDM#1075)

zhuzilin · 2025-12-18T02:41:53Z

slime/utils/arguments.py

+                type=int,
+                default=None,
+                help="Maximum turns for multi-turn custom rollout (e.g., Sokoban). Defaults to rollout implementation config.",
+            )


is it possible to pass these 2 configs through --custom-config-path?

Yes, it sounds neater. I have pushed the change, thanks!

zhaochenyang20 · 2025-12-19T04:09:17Z

Nice done Xiaole!

yogesh1801 · 2025-12-20T12:02:19Z

@gxlvera are you working on OpenCUA part? can i help with it?

zhaochenyang20 · 2025-12-21T20:46:22Z

Great job so far!

gxlvera · 2025-12-22T01:00:51Z

@gxlvera are you working on OpenCUA part? can i help with it?

Hi, you could try to support OpenCUA's AgentNet dataset. Note that if you want to implement the online interaction, maybe you need an os sandbox for simulation. It's OK if you stick with offline mode (without interaction) although I personally don't think that would work well.

yogesh1801 · 2025-12-23T10:40:55Z

Sure @gxlvera can try that

…al_token cap (THUDM#1075)

…rn cap (THUDM#1075)

yogesh1801 · 2025-12-28T15:54:10Z

@gxlvera can you help with openCUA a bit? where can i connect u?

…catenation

gxlvera · 2025-12-29T15:52:58Z

@gxlvera can you help with openCUA a bit? where can i connect u?

Hi, you could DM me at gxlvera@gmail.com~

examples/geo3k_vlm_multi_turn/rollout.py

…HUDM#1141)

)

zhaochenyang20 · 2026-01-02T19:14:34Z

We shall also have a Megatron version,. But FSDP works cool!

zhaochenyang20 · 2026-01-02T22:44:23Z

I evaluated the performance, which works well to me.

Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

gxlvera added 2 commits December 17, 2025 21:06

add VLM multi turn sokoban, early-termination and reward function nee…

e559b6b

…ds refine (THUDM#1075)

vlm multiturn add apply-chat-template in rollout and training script (T…

6f32d2a

…HUDM#1075)

gxlvera mentioned this pull request Dec 17, 2025

[Roadmap][VLM] Support VLM Multi-Turn #1075

Open

3 tasks

zhuzilin reviewed Dec 18, 2025

View reviewed changes

gxlvera and others added 3 commits December 18, 2025 05:35

Merge branch 'THUDM:main' into main

db576c3

pass multi-turn arguments as custom config (THUDM#1075)

84f9973

update README (THUDM#1075)

a5a19cf

gxlvera and others added 2 commits December 20, 2025 12:10

Merge branch 'THUDM:main' into main

02f6b49

remove unused _fmt_tokens() in vlm multi-turn rollout.py (THUDM#1075)

00e4d1d

Merge branch 'THUDM:main' into main

21344db

gxlvera and others added 7 commits December 25, 2025 00:47

Merge branch 'THUDM:main' into main

02491cc

rollout switch to delta token from retokenization (#1705)

6c1e823

Merge branch 'THUDM:main' into main

1912a4b

cap rollout tokens to max_new_tokens and max_turn only, abort max_tot…

70d277d

…al_token cap (THUDM#1075)

Merge branch 'main' of https://github.com/gxlvera/slime

a144a39

change sample.status from TRUNCATED to COMPLETED when reaching max-tu…

d6704ca

…rn cap (THUDM#1075)

simplified set max_turns (THUDM#1075)

04ef398

gxlvera changed the title ~~[VLM] Multi-turn with Sokoban Dataset~~ [VLM] Multi-turn Dec 27, 2025

gxlvera and others added 3 commits December 29, 2025 14:25

Merge branch 'THUDM:main' into main

801a126

replace sokoban with geo3k (THUDM#1075)

8b8f8ee

update multiturn rollout tokenization process and multimodal data con…

add4743

…catenation

gxlvera marked this pull request as ready for review December 29, 2025 15:32

revert changes in types.py (THUDM#1141)

d4fee83

zhaochenyang20 requested changes Dec 31, 2025

View reviewed changes

examples/geo3k_vlm_multi_turn/rollout.py Outdated Show resolved Hide resolved

examples/geo3k_vlm_multi_turn/rollout.py Outdated Show resolved Hide resolved

gxlvera and others added 12 commits January 1, 2026 04:47

Merge branch 'THUDM:main' into main

010f363

unified max_turn between rollout default config and geo3k env config (T…

fefbe8f

…HUDM#1141)

Merge branch 'THUDM:main' into main

35255c3

rename vlm_multi_turn to geo3k_vlm_multi_turn (THUDM#1141)

884e273

change dataset download mode, delete process script (THUDM#1141)

9161ae3

delete rollout/env default config, forced user input for max_turns

d751867

fix rollout_log_prob handling (THUDM#1141)

b8c35e3

add preamble trimming for _encode_observation_for_generation (THUDM#1141

04b59ff

)

delete getattr() except get env observation formatter (THUDM#1141)

6677a95

optimized _ merge_multimodal_train_inputs

c002060

enforce strict regex extraction for tool call content (THUDM#1141)

ec83bae

support megatron in geo3k vlm multi turn training script (THUDM#1141)

3a1fcb1

gxlvera changed the title ~~[VLM] Multi-turn~~ [VLM] geo3k multi-turn Jan 2, 2026

polish for better code style (THUDM#1141)

322959e

zhaochenyang20 changed the title ~~[VLM] geo3k multi-turn~~ [VLM] end2end geo3k multi-turn RL of VLM Recipe Jan 2, 2026

zhaochen20 added 3 commits January 2, 2026 10:28

remove getattr with base class OOP

b557910

fix namespace

ffdca54

fix with OOP

a25030a

zhaochenyang20 approved these changes Jan 2, 2026

View reviewed changes

zhaochenyang20 added ci run-ci-short run-ci-long run-ci-precision run-ci-ckpt run-ci-fault-tolerance labels Jan 2, 2026

zhaochenyang20 merged commit 0878cd0 into THUDM:main Jan 2, 2026
19 of 34 checks passed

kafkayu pushed a commit to kafkayu/slime that referenced this pull request Jan 8, 2026

[VLM] end2end geo3k multi-turn RL of VLM Recipe (THUDM#1141)

963a75a

Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

[VLM] end2end geo3k multi-turn RL of VLM Recipe #1141

[VLM] end2end geo3k multi-turn RL of VLM Recipe #1141

Uh oh!

Conversation

gxlvera commented Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Goal

TODO / Status

Rollout

Interactive environment

Data & dataset

Experiment Result

Uh oh!

zhuzilin Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

gxlvera Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhaochenyang20 commented Dec 19, 2025

Uh oh!

yogesh1801 commented Dec 20, 2025

Uh oh!

zhaochenyang20 commented Dec 21, 2025

Uh oh!

gxlvera commented Dec 22, 2025

Uh oh!

yogesh1801 commented Dec 23, 2025

Uh oh!

yogesh1801 commented Dec 28, 2025

Uh oh!

gxlvera commented Dec 29, 2025

Uh oh!

Uh oh!

Uh oh!

zhaochenyang20 commented Jan 2, 2026

Uh oh!

zhaochenyang20 commented Jan 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

gxlvera commented Dec 17, 2025 •

edited

Loading

gxlvera Dec 18, 2025 •

edited

Loading