[VLM] end2end geo3k multi-turn RL of VLM Recipe #1141
slime/utils/arguments.py
Outdated
    type=int,
    default=None,
    help="Maximum turns for multi-turn custom rollout (e.g., Sokoban). Defaults to rollout implementation config.",
)
Is it possible to pass these two configs through --custom-config-path?
Yes, it sounds neater. I have pushed the change, thanks!
Nicely done, Xiaole!
@gxlvera Are you working on the OpenCUA part? Can I help with it?
Great job so far!
Hi, you could try to support OpenCUA's AgentNet dataset. Note that if you want to implement online interaction, you may need an OS sandbox for simulation. It's OK to stick with offline mode (without interaction), although I personally don't think that would work well.
Sure @gxlvera can try that
@gxlvera Can you help with OpenCUA a bit? Where can I connect with you?
Hi, you could DM me at gxlvera@gmail.com~ |
We shall also have a Megatron version. But FSDP works cool!
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

Goal
VLM Multi-turn (related to #1075)
TODO / Status
Rollout
- examples/vlm_multi_turn/rollout.py
- max_turns (specified via rollout argument --custom-config-path)
- loss_mask / rollout_log_probs
  - loss_mask = 1 on assistant tokens
  - loss_mask = 0 on user/observation tokens
  - rollout_log_probs padded to match
  - sample.prompt stays unmasked
Interactive environment
- examples/vlm_multi_turn/env_geo3k.py
- build_env / reset / step / format_observation functions for per-turn feedback
Data & dataset
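To make the masking rule from the Rollout and Interactive environment items concrete, here is a minimal sketch of a multi-turn loop. It is not the code in rollout.py or env_geo3k.py: `ToyEnv`, `tokenize`, `run_episode`, the whitespace tokenizer, the placeholder log-prob value, and the choice to pad log-probs with 0.0 on observation tokens are all assumptions made for illustration. Only the method names reset / step / format_observation, the max_turns setting, and the 1/0 masking rule come from the list above. Prompt tokens are left out of the sketch entirely.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical environment exposing reset / step / format_observation."""
    answer: str
    max_turns: int = 3
    turn: int = 0

    def reset(self) -> None:
        self.turn = 0

    def step(self, response: str) -> tuple[str, bool]:
        """Return (feedback, done) for one assistant response."""
        self.turn += 1
        if response.strip() == self.answer:
            return "correct", True
        return "incorrect, try again", self.turn >= self.max_turns

    def format_observation(self, feedback: str) -> str:
        return f"[observation] {feedback}"


def tokenize(text: str) -> list[str]:
    # Whitespace tokenizer, used only to keep the sketch self-contained.
    return text.split()


def run_episode(env: ToyEnv, responses: list[str]):
    """Build response-side tokens, loss_mask, and log-probs of equal length."""
    env.reset()
    tokens, loss_mask, log_probs = [], [], []
    for response in responses:
        resp_tokens = tokenize(response)
        tokens += resp_tokens
        loss_mask += [1] * len(resp_tokens)       # assistant tokens: trained on
        log_probs += [-0.5] * len(resp_tokens)    # placeholder values (assumed)
        feedback, done = env.step(response)
        if done:
            break
        obs_tokens = tokenize(env.format_observation(feedback))
        tokens += obs_tokens
        loss_mask += [0] * len(obs_tokens)        # observation tokens: masked out
        log_probs += [0.0] * len(obs_tokens)      # padding so lengths match (assumed)
    return tokens, loss_mask, log_probs


if __name__ == "__main__":
    tokens, loss_mask, log_probs = run_episode(ToyEnv(answer="42"), ["maybe 41", "42"])
    for tok, m, lp in zip(tokens, loss_mask, log_probs):
        print(f"{tok:15s} mask={m} logp={lp}")
```

Keeping the three lists the same length means a per-token loss can be multiplied by loss_mask directly, so environment feedback never contributes gradient.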
Experiment Result
Trained Qwen3-VL-2B-Instruct with FSDP backend on the geo3k dataset with multi-turn reasoning, using GRPO.
