Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning
This is the official code for the paper Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning by Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen (accepted as a poster at AAAI 2026).
TL;DR: In this paper, we introduce the Chain of Evidence (CoE) paradigm, which models stepwise inference by grounding each chain-of-thought (CoT) reasoning step. To realize CoE, we propose Look As You Think (LAT), a two-stage reinforcement learning (RL) framework that trains VLMs to unify CoT reasoning and visual grounding by generating a progressive reasoning process paired with an aligned visual attribution for each referenced element.
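As a concrete illustration of the CoE idea, the sketch below parses a trace in which each reasoning step cites a bounding box on the source document image. The `<step>`/`<evidence>` tag names and the trace format are assumptions for illustration, not the paper's actual output schema.

```python
import re

# Hypothetical CoE-style trace: each reasoning step carries a visual evidence
# region as a bounding box (x1, y1, x2, y2) on the document image.
# The tag names below are illustrative assumptions, not the paper's schema.
sample = (
    "<step>The table on page 3 lists the 2023 revenue."
    "<evidence>[120, 340, 560, 410]</evidence></step>"
    "<step>Revenue grew from 1.2M to 1.5M, a 25% increase."
    "<evidence>[130, 420, 550, 470]</evidence></step>"
)

def parse_coe(text):
    """Extract (reasoning, bounding_box) pairs from a CoE-style trace."""
    pattern = r"<step>(.*?)<evidence>\[(.*?)\]</evidence></step>"
    pairs = []
    for reasoning, box in re.findall(pattern, text, flags=re.DOTALL):
        coords = [int(v) for v in box.split(",")]
        pairs.append((reasoning.strip(), coords))
    return pairs

steps = parse_coe(sample)  # two (reasoning, bbox) pairs
```

Pairing each step with its own box is what makes the chain verifiable: a checker can inspect every intermediate claim against the cited region, not just the final answer.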
If you find this repository or paper useful, you can cite:
```bibtex
@misc{liu2025lookthinkunifyingreasoning,
  title={Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning},
  author={Shuochen Liu and Pengfei Luo and Chao Zhang and Yuhao Chen and Haotian Zhang and Qi Liu and Xin Kou and Tong Xu and Enhong Chen},
  year={2025},
  eprint={2511.12003},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.12003},
}
```
The required dependencies and their versions can be found in requirements.txt.
To install all required packages along with their dependencies, run:

```shell
# python >= 3.10
pip install -r requirements.txt
```

1. Download Data
Prepare the VISA datasets. Place the downloaded datasets under the /data/visa/(paper/wiki/fine-web) directories, and modify the paths as necessary to match your local environment.
To obtain images for the multi-candidate setup, please run /src/image_address.py.
2. Cold start
```shell
bash scripts/sft_inference.sh
```

Important:
- After each training session, merge the LoRA parameters by executing the following code.
- For multi-image training scenarios, initialize the multi-image model from the single-image trained version. Subsequently, perform supervised fine-tuning (SFT) on the multi-image CoE data in $\mathcal{D}_{\text{final}}$, fine-tuning only the LoRA adapter of the language model while keeping the vision transformer (ViT) frozen to minimize GPU memory consumption.
```python
from peft import PeftModel

# Load the trained LoRA adapter onto the base model, merge its weights
# into the base weights, and save the standalone merged checkpoint.
model = PeftModel.from_pretrained(model, lora_name_or_path)
model = model.merge_and_unload()
model.save_pretrained("merged_model_path")
```
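The "freeze the ViT, train only the language model's LoRA adapter" setup above amounts to a parameter-selection rule. Below is a minimal, framework-free sketch of that rule; the parameter names (`visual.*`, `language_model.*`, `lora_*`) are assumptions about the model's naming and will differ between VLMs, so check `model.named_parameters()` for the real prefixes.

```python
# Stand-in for a framework parameter: only the requires_grad flag matters here.
class Param:
    def __init__(self):
        self.requires_grad = True

# Hypothetical VLM parameter names; real names depend on the backbone
# (many VLMs use a "visual." prefix for the vision tower).
named_parameters = {
    "visual.blocks.0.attn.qkv.weight": Param(),
    "visual.merger.mlp.0.weight": Param(),
    "language_model.layers.0.self_attn.q_proj.lora_A.weight": Param(),
    "language_model.layers.0.self_attn.q_proj.weight": Param(),
}

def freeze_for_multi_image_sft(params):
    """Keep only the language model's LoRA weights trainable; freeze
    everything else (including the ViT) to save GPU memory."""
    for name, p in params.items():
        p.requires_grad = name.startswith("language_model.") and "lora_" in name

freeze_for_multi_image_sft(named_parameters)
trainable = [n for n, p in named_parameters.items() if p.requires_grad]
```

With a real PyTorch model the same loop runs over `model.named_parameters()`; only the name-matching rule needs to be adapted.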
3. Reinforcement Learning
```shell
bash scripts/grpo.sh
```

After SFT training, the LoRA parameters need to be merged into the base model, and the model_name_or_path should be updated accordingly.
4. Evaluate Model
```shell
python src/evaluation.py
```

During evaluation, manually specify the model and the corresponding LoRA parameters to reproduce the results, and select the appropriate evaluation settings and dataset names.
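Evaluating visual evidence attribution typically means comparing predicted evidence boxes against gold annotations. The sketch below shows a standard intersection-over-union (IoU) check; the metric choice and any acceptance threshold are illustrative, not necessarily the paper's exact evaluation protocol.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

pred = (100, 100, 200, 200)
gold = (150, 100, 250, 200)
score = iou(pred, gold)  # intersection 5000, union 15000 -> 1/3
```

A prediction is then usually counted as correct when its IoU with the gold box exceeds a fixed threshold (0.5 is a common convention in grounding benchmarks).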
More details and analyses about experimental results can be found in our paper.
Our code has been developed based on VLM-R1 and VISA. We thank the authors for their valuable work.

