GLAD-RUC/R2VLM

[CVPR 2026] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

📑 Paper    |    🤗 Model & Data

Model Framework

We propose R2VLM, a Recurrent Reasoning Vision-Language Model for long-horizon embodied task progress estimation.

Training

We leverage LLaMA-Factory for supervised fine-tuning and verl for reinforcement learning. To support our data format, we modify the multi-turn rollout algorithm in verl by removing historical contexts from previous turns and introducing history chain-of-thought reasoning. Our modified verl code is available at dhcpack/verl-0.5.0-r2vlm.
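The rollout change described above can be illustrated with a small sketch. It is not the code from dhcpack/verl-0.5.0-r2vlm; it shows one possible reading, in which each turn's prompt contains only the current observation plus the chain-of-thought produced in earlier turns, rather than the full earlier conversation. The names `RecurrentState`, `build_turn_prompt`, `recurrent_rollout`, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RecurrentState:
    """Carries only the reasoning from earlier turns, not their full prompts."""
    history_cot: List[str] = field(default_factory=list)


def build_turn_prompt(task: str, observation: str, state: RecurrentState) -> str:
    """Build a prompt from the current observation and prior reasoning only."""
    history = "\n".join(
        f"[turn {i + 1}] {cot}" for i, cot in enumerate(state.history_cot)
    ) or "(none)"
    return (
        f"Task: {task}\n"
        f"Previous reasoning:\n{history}\n"
        f"Current observation: {observation}\n"
        "Reason about the progress so far, then estimate it."
    )


def recurrent_rollout(
    task: str,
    observations: List[str],
    generate: Callable[[str], str],  # hypothetical stand-in for the policy model
) -> Tuple[List[str], RecurrentState]:
    """Run one episode; earlier observations never re-enter later prompts."""
    state = RecurrentState()
    prompts: List[str] = []
    for observation in observations:
        prompt = build_turn_prompt(task, observation, state)
        prompts.append(prompt)
        # Only the generated reasoning is carried forward to the next turn.
        state.history_cot.append(generate(prompt))
    return prompts, state
```

Under this reading, the prompt length grows with the accumulated reasoning rather than with the raw visual context of every previous turn.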

You can obtain the training code by initializing the submodules:

git submodule update --init --recursive

These two training frameworks are tracked in this repository as Git submodules under code/framework/:

  • code/framework/LLaMA-Factory
  • code/framework/verl
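A quick way to confirm that the submodules were actually fetched is to check that both directories are non-empty, since an uninitialized submodule is left as an empty directory. The helper below is a minimal sketch; the function name `check_submodules` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import Dict

# Submodule paths as listed in this README.
SUBMODULES = ("code/framework/LLaMA-Factory", "code/framework/verl")


def check_submodules(repo_root: str) -> Dict[str, bool]:
    """Return, per submodule path, whether its directory exists and has content."""
    root = Path(repo_root)
    status: Dict[str, bool] = {}
    for rel in SUBMODULES:
        path = root / rel
        # An uninitialized submodule is an empty (or missing) directory.
        status[rel] = path.is_dir() and any(path.iterdir())
    return status
```

Calling `check_submodules(".")` from the repository root should report `True` for both paths once `git submodule update --init --recursive` has completed.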

For installation, follow the requirements of each training framework. After setting up both frameworks, you can train R2VLM using Qwen2.5-VL-7B-Instruct as the base model.

Pretrained weights

We release both SFT and RL models trained on the Alfred progress estimation dataset. Training was conducted on 8×A100 GPUs.

| Model | Base Model | Training Stage | Link | Training Time |
| --- | --- | --- | --- | --- |
| R2VLM-Alfred-SFT | Qwen2.5-VL-7B | SFT | Hugging Face | 50 hours |
| R2VLM-Alfred-RL | R2VLM-Alfred-SFT | RL | Hugging Face | 75 hours |

Datasets

We release the Alfred progress estimation dataset at zhangyuelin/alfred_progress_data. The SFT split includes cold-start chain-of-thought annotations generated with Qwen2.5-VL-72B.

The corresponding video data should be obtained from the official Alfred repository: askforalfred/alfred.

Benchmarks

We release the Alfred benchmark at zhangyuelin/alfred_progress_bench.
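The training dataset and the benchmark can be fetched programmatically. The sketch below assumes that both identifiers refer to Hugging Face dataset repositories and that the `huggingface_hub` package is installed. The helper names and the `data/` target directory are hypothetical, and the ALFRED video data still has to be obtained separately from askforalfred/alfred.

```python
from pathlib import Path
from typing import Dict

# Repository IDs taken from this README; treating them as Hugging Face
# dataset repos is an assumption.
DATASET_REPOS: Dict[str, str] = {
    "train": "zhangyuelin/alfred_progress_data",
    "bench": "zhangyuelin/alfred_progress_bench",
}


def local_dir_for(name: str, target_root: str = "data") -> Path:
    """Map a release name to a local directory named after the repo."""
    return Path(target_root) / DATASET_REPOS[name].split("/")[-1]


def download_release(name: str, target_root: str = "data") -> Path:
    """Download one release ("train" or "bench") and return its local path."""
    # Imported lazily so the helpers above work without the dependency.
    from huggingface_hub import snapshot_download

    local_dir = local_dir_for(name, target_root)
    snapshot_download(
        repo_id=DATASET_REPOS[name],
        repo_type="dataset",
        local_dir=str(local_dir),
    )
    return local_dir
```

For example, `download_release("bench")` would place the benchmark files under `data/alfred_progress_bench`.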

Citation

If you find our work helpful, please cite as:

@article{zhang2026recurrent,
  title={Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress},
  author={Zhang, Yuelin and Cheng, Sijie and Li, Chen and Li, Zongzhao and Huang, Yuxin and Liu, Yang and Huang, Wenbing},
  journal={arXiv preprint arXiv:2603.17312},
  year={2026}
}
