[CVPR 26] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
📑 Paper | 🤗 Model & Data
We propose R2VLM, a Recurrent Reasoning Vision-Language Model for long-horizon embodied task progress estimation.
We leverage LLaMA-Factory for supervised fine-tuning and verl for reinforcement learning. To support our data format, we modify the multi-turn rollout algorithm in verl by removing historical contexts from previous turns and introducing history chain-of-thought reasoning. Our modified verl code is available at dhcpack/verl-0.5.0-r2vlm.
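The recurrent rollout idea described above can be illustrated with a minimal sketch. This is not the code from the modified verl; `TurnOutput`, `build_prompt`, `recurrent_rollout`, and the prompt wording are hypothetical names chosen for illustration, and observations are plain strings standing in for video segments. The point is only that each turn sees the current observation plus the chain-of-thought carried over from the previous turn, rather than the full multi-turn context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TurnOutput:
    reasoning: str   # chain-of-thought produced at this turn
    progress: float  # estimated task progress in [0, 1]


# Hypothetical model interface: maps a prompt string to a TurnOutput.
ModelFn = Callable[[str], TurnOutput]


def build_prompt(task: str, observation: str, history_cot: str) -> str:
    """Build a single-turn prompt that carries only the previous reasoning."""
    return (
        f"Task: {task}\n"
        f"Previous reasoning: {history_cot or 'None'}\n"
        f"Current observation: {observation}\n"
        "Estimate the task progress."
    )


def recurrent_rollout(model: ModelFn, task: str, observations: List[str]) -> List[TurnOutput]:
    """Run one turn per observation, passing forward only the last chain-of-thought."""
    history_cot = ""
    outputs: List[TurnOutput] = []
    for obs in observations:
        # Earlier prompts and observations are dropped; only history_cot survives.
        out = model(build_prompt(task, obs, history_cot))
        outputs.append(out)
        history_cot = out.reasoning
    return outputs
```

Because the prompt is rebuilt from scratch each turn, its length stays roughly constant regardless of how many turns have elapsed, which is what makes this pattern attractive for long-horizon tasks.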
You can obtain the training code by cloning the submodules:
git submodule update --init --recursive
These two training frameworks are tracked in this repository as Git submodules under code/framework/:
- code/framework/LLaMA-Factory
- code/framework/verl
Please refer to the requirements of the corresponding training frameworks for installation. After setting up both frameworks, you can train R2VLM using Qwen2.5-VL-7B-Instruct as the base model.
We release both SFT and RL models trained on the Alfred progress estimation dataset. Training was conducted on 8×A100 GPUs.
| Model | Base Model | Training Stage | Link | Training Time |
|---|---|---|---|---|
| R2VLM-Alfred-SFT | Qwen2.5-VL-7B | SFT | Hugging Face | 50 hours |
| R2VLM-Alfred-RL | R2VLM-Alfred-SFT | RL | Hugging Face | 75 hours |
We release the Alfred progress estimation dataset at zhangyuelin/alfred_progress_data. The SFT split includes cold-start chain-of-thought annotations generated with Qwen2.5-VL-72B.
The corresponding video data should be obtained from the official Alfred repository: askforalfred/alfred.
We release the Alfred benchmark at zhangyuelin/alfred_progress_bench.
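One possible way to fetch the two releases named above is through the `huggingface_hub` client. The sketch below assumes both repositories are hosted as Hugging Face dataset repos (`repo_type="dataset"`); the `RELEASES` mapping, the `download_releases` helper, and the `data/` target directory are hypothetical and not part of this repository.

```python
from pathlib import Path

# Repository ids come from this README; treating both as Hugging Face
# *dataset* repos is an assumption.
RELEASES = {
    "train": "zhangyuelin/alfred_progress_data",
    "bench": "zhangyuelin/alfred_progress_bench",
}


def download_releases(root: str = "data") -> dict:
    """Download each release into root/<name> and return the local paths."""
    # Imported lazily so the module loads without the dependency installed
    # (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    paths = {}
    for name, repo_id in RELEASES.items():
        local_dir = Path(root) / name
        paths[name] = snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",  # assumption, see note above
            local_dir=str(local_dir),
        )
    return paths

# Example (requires network access):
#   paths = download_releases("data")
```

The video data itself is not included in these downloads and still has to be obtained from askforalfred/alfred.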
If you find our work helpful, please cite as:
@article{zhang2026recurrent,
title={Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress},
author={Zhang, Yuelin and Cheng, Sijie and Li, Chen and Li, Zongzhao and Huang, Yuxin and Liu, Yang and Huang, Wenbing},
journal={arXiv preprint arXiv:2603.17312},
year={2026}
}
