[CVPR 26] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

We propose R²VLM, Recurrent Reasoning Vision-Language Model for long-horizon embodied task progress estimation.

Training

We leverage LLaMA-Factory for supervised fine-tuning and verl for reinforcement learning. To support our data format, we modify the multi-turn rollout algorithm in verl by removing historical contexts from previous turns and introducing history chain-of-thought reasoning. Our modified verl code is available at dhcpack/verl-0.5.0-r2vlm.

You can obtain the training code by cloning the submodules:

git submodule update --init --recursive

These two training frameworks are tracked in this repository as Git submodules under code/framework/:

code/framework/LLaMA-Factory
code/framework/verl

Please refer to the requirements of the corresponding training frameworks for installation. After setting up both frameworks, you can train R²VLM using Qwen2.5-VL-7B-Instruct as the base model.

Pretrained weights

We release both SFT and RL models trained on the Alfred progress estimation dataset. Training was conducted on 8·A100.

Model	Base Model	Training Stage	Link	Training Time
R2VLM-Alfred-SFT	Qwen2.5-VL-7B	SFT	Hugging Face	50 hours
R2VLM-Alfred-RL	R2VLM-Alfred-SFT	RL	Hugging Face	75 hours

Datasets

We release the Alfred progress estimation dataset at zhangyuelin/alfred_progress_data. The SFT split includes cold-start chain-of-thought annotations generated with Qwen2.5-VL-72B.

The corresponding video data should be obtained from the official Alfred repository: askforalfred/alfred.

Benchmarks

We release the Alfred benchmark at zhangyuelin/alfred_progress_bench.

Citation

If you find our work helpful, please cite as:

@article{zhang2026recurrent,
  title={Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress},
  author={Zhang, Yuelin and Cheng, Sijie and Li, Chen and Li, Zongzhao and Huang, Yuxin and Liu, Yang and Huang, Wenbing},
  journal={arXiv preprint arXiv:2603.17312},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets		assets
code		code
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

[CVPR 26] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

Training

Pretrained weights

Datasets

Benchmarks

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

[CVPR 26] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

Training

Pretrained weights

Datasets

Benchmarks

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages