Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning
This is the official code for the paper Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning by Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen (accepted as a poster at AAAI 2026).
TL;DR: In this paper, we introduce the Chain of Evidence (CoE) paradigm, which models stepwise inference by grounding each chain-of-thought (CoT) reasoning step. To realize CoE, we propose Look As You Think (LAT), a two-stage reinforcement learning (RL) framework that trains VLMs to unify CoT reasoning and visual grounding by generating a progressive reasoning process paired with an aligned visual attribution for each referenced element.
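As a concrete illustration of the CoE idea, the sketch below parses a trace in which each reasoning step cites a bounding box on the source document image. The `<step>`/`<evidence>` tag names and the trace format are assumptions for illustration, not the paper's actual output schema.

```python
import re

# Hypothetical CoE-style trace: each reasoning step carries a visual evidence
# region as a bounding box (x1, y1, x2, y2) on the document image.
# The tag names below are illustrative assumptions, not the paper's schema.
sample = (
    "<step>The table on page 3 lists the 2023 revenue."
    "<evidence>[120, 340, 560, 410]</evidence></step>"
    "<step>Revenue grew from 1.2M to 1.5M, a 25% increase."
    "<evidence>[130, 420, 550, 470]</evidence></step>"
)

def parse_coe(text):
    """Extract (reasoning, bounding_box) pairs from a CoE-style trace."""
    pattern = r"<step>(.*?)<evidence>\[(.*?)\]</evidence></step>"
    pairs = []
    for reasoning, box in re.findall(pattern, text, flags=re.DOTALL):
        coords = [int(v) for v in box.split(",")]
        pairs.append((reasoning.strip(), coords))
    return pairs

steps = parse_coe(sample)  # two (reasoning, bbox) pairs
```

Pairing each step with its own box is what makes the chain verifiable: a checker can inspect every intermediate claim against the cited region, not just the final answer.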
If you find this repository or paper useful, you can cite:
```bibtex
@misc{liu2025lookthinkunifyingreasoning,
  title={Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning},
  author={Shuochen Liu and Pengfei Luo and Chao Zhang and Yuhao Chen and Haotian Zhang and Qi Liu and Xin Kou and Tong Xu and Enhong Chen},
  year={2025},
  eprint={2511.12003},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.12003},
}
```
The required dependencies and their versions can be found in requirements.txt.
To install all required packages along with their dependencies, run:

```shell
# python >= 3.10
pip install -r requirements.txt
```

1. Download Data
Prepare the VISA datasets. Place the downloaded datasets under the /data/visa/(paper/wiki/fine-web) directories, and modify the paths as necessary to match your local environment.
To obtain images for the multi-candidate setup, please run /src/image_address.py.
2. Cold start
```shell
bash scripts/sft_inference.sh
```

Important:
- After each training session, merge the LoRA parameters by executing the following code.
- For multi-image training scenarios, initialize the multi-image model from the single-image trained version. Subsequently, perform supervised fine-tuning (SFT) on the multi-image CoE data in $\mathcal{D}_{\text{final}}$, fine-tuning only the LoRA adapter of the language model while keeping the vision transformer (ViT) frozen to minimize GPU memory consumption.
```python
from peft import PeftModel

# Load the trained LoRA adapter onto the base model, merge its weights
# into the base weights, and save the standalone merged checkpoint.
model = PeftModel.from_pretrained(model, lora_name_or_path)
model = model.merge_and_unload()
model.save_pretrained("merged_model_path")
```
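The "freeze the ViT, train only the language model's LoRA adapter" setup above amounts to a parameter-selection rule. Below is a minimal, framework-free sketch of that rule; the parameter names (`visual.*`, `language_model.*`, `lora_*`) are assumptions about the model's naming and will differ between VLMs, so check `model.named_parameters()` for the real prefixes.

```python
# Stand-in for a framework parameter: only the requires_grad flag matters here.
class Param:
    def __init__(self):
        self.requires_grad = True

# Hypothetical VLM parameter names; real names depend on the backbone
# (many VLMs use a "visual." prefix for the vision tower).
named_parameters = {
    "visual.blocks.0.attn.qkv.weight": Param(),
    "visual.merger.mlp.0.weight": Param(),
    "language_model.layers.0.self_attn.q_proj.lora_A.weight": Param(),
    "language_model.layers.0.self_attn.q_proj.weight": Param(),
}

def freeze_for_multi_image_sft(params):
    """Keep only the language model's LoRA weights trainable; freeze
    everything else (including the ViT) to save GPU memory."""
    for name, p in params.items():
        p.requires_grad = name.startswith("language_model.") and "lora_" in name

freeze_for_multi_image_sft(named_parameters)
trainable = [n for n, p in named_parameters.items() if p.requires_grad]
```

With a real PyTorch model the same loop runs over `model.named_parameters()`; only the name-matching rule needs to be adapted.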
3. Reinforcement Learning
```shell
bash scripts/grpo.sh
```

After SFT training, the LoRA parameters need to be merged into the base model, and the model_name_or_path should be updated accordingly.
4. Evaluate Model
```shell
python src/evaluation.py
```

During evaluation, manually specify the model and the corresponding LoRA parameters to reproduce the results, and select the appropriate evaluation settings and dataset names.
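Evaluating visual evidence attribution typically means comparing predicted evidence boxes against gold annotations. The sketch below shows a standard intersection-over-union (IoU) check; the metric choice and any acceptance threshold are illustrative, not necessarily the paper's exact evaluation protocol.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

pred = (100, 100, 200, 200)
gold = (150, 100, 250, 200)
score = iou(pred, gold)  # intersection 5000, union 15000 -> 1/3
```

A prediction is then usually counted as correct when its IoU with the gold box exceeds a fixed threshold (0.5 is a common convention in grounding benchmarks).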
More details and analyses about experimental results can be found in our paper.
Our code has been developed based on VLM-R1 and VISA. We thank the authors for their valuable work.

