Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang · Yuhang Zang · Xiaoyi Dong · Yuhang Cao
Haodong Duan · Dahua Lin · Jiaqi Wang

Corresponding authors.

📢 News

🌈 Overview

We integrate Visual Intelligence into ARC-AGI to leverage the respective advantages of vision and text: vision supports global pattern abstraction and verification, whereas language specializes in precise execution.

We achieve this by introducing two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR) which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction.
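The two strategies can be read as a small control loop. What follows is a minimal sketch under stated assumptions, not the repository's implementation: `summarize_rule_visually`, `apply_rule_textually`, and `verify_visually` are hypothetical callables standing in for model calls, and `max_rounds` is an assumed parameter.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def solve_with_synergy(
    train_pairs: List[Pair],
    test_input: Grid,
    summarize_rule_visually: Callable[[List[Pair]], str],
    apply_rule_textually: Callable[[str, Grid, Optional[str]], Grid],
    verify_visually: Callable[[List[Pair], Grid, Grid], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Grid:
    """Hypothetical control flow: VLSR first, then MSSC as a retry loop."""
    # VLSR, step 1: the visual modality abstracts the global pattern.
    rule = summarize_rule_visually(train_pairs)
    feedback: Optional[str] = None
    prediction: Grid = []
    for _ in range(max_rounds):
        # VLSR, step 2: the textual modality executes the rule cell by cell.
        prediction = apply_rule_textually(rule, test_input, feedback)
        # MSSC: switch to vision to verify the text-based answer.
        ok, feedback = verify_visually(train_pairs, test_input, prediction)
        if ok:
            break
    return prediction
```

The loop returns the last prediction even if verification never succeeds; a real system might instead fall back to the first answer or report the failure.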

[Figure: Method]

🛠️ Inference

Prepare your environment

git clone https://github.com/InternLM/Arc-VL
cd Arc-VL
conda create -n arcvl python=3.11
conda activate arcvl
pip install -r requirements.txt

Modify setup_api_key.sh and fill in your base_url and API keys. Then activate it by running:

source setup_api_key.sh

Prepare the data. The datasets can be downloaded from the following links:

ARC-AGI: https://github.com/fchollet/ARC-AGI

BARC: https://github.com/xu3kev/BARC

Re-ARC: https://github.com/michaelhodel/re-arc
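For orientation, each task in the ARC-AGI repository is a JSON file with `train` and `test` lists whose entries hold `input` and `output` grids (lists of lists of integers 0 to 9). The sketch below reads one such file; `load_arc_task` is a hypothetical helper rather than a function from this repository, and BARC and Re-ARC may use a different layout that needs its own loader.

```python
import json
from pathlib import Path
from typing import List, Tuple

Grid = List[List[int]]


def load_arc_task(path: str) -> Tuple[List[Tuple[Grid, Grid]], List[Grid]]:
    """Read one ARC-AGI task file into (train pairs, test inputs)."""
    task = json.loads(Path(path).read_text())
    # Demonstration pairs that define the transformation.
    train_pairs = [(ex["input"], ex["output"]) for ex in task["train"]]
    # Test inputs whose outputs the model has to predict.
    test_inputs = [ex["input"] for ex in task["test"]]
    return train_pairs, test_inputs
```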

Specify the test dataset, the test model, and the dataset path, then run our vision-language synergy reasoning with the following command.

python inference.py --dataset_name="arc-agi" --model="gpt-4o" --data_path="Your_data_path" \
    --result_file="result_arcagi_4o.json" \
    --save_root="images/ARC-AGI/"
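The `--save_root` argument suggests that grids are rendered to images for the visual steps, while the textual steps see the grid as text. The sketch below illustrates both views of one grid under stated assumptions: `PALETTE`, `grid_to_text`, and `grid_to_pixels` are hypothetical names, and the colors and cell size are illustrative and may differ from what inference.py actually uses.

```python
from typing import List, Tuple

Grid = List[List[int]]
RGB = Tuple[int, int, int]

# Illustrative palette for the ten ARC color indices (an assumption).
PALETTE: List[RGB] = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]


def grid_to_text(grid: Grid) -> str:
    """Textual view: one row per line, cells separated by spaces."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)


def grid_to_pixels(grid: Grid, cell: int = 8) -> List[List[RGB]]:
    """Visual view: each cell becomes a cell-by-cell block of one color."""
    pixels: List[List[RGB]] = []
    for row in grid:
        # Repeat each color horizontally, then repeat the row vertically.
        pixel_row = [PALETTE[c] for c in row for _ in range(cell)]
        pixels.extend(list(pixel_row) for _ in range(cell))
    return pixels
```

The pixel array can be handed to any image library to write a file; the text form keeps exact cell values, which is what precise execution relies on.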

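Finally, score the inference results. Set --input_file to the result file written by the inference step (result_arcagi_4o.json in the command above).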

python score.py --input_file="result.json" --output_file="result_scored.json"
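ARC-style tasks are conventionally scored by exact match: a prediction counts only if the whole output grid equals the target. The sketch below shows that metric in isolation; `exact_match_accuracy` is a hypothetical helper, and the actual format of the result file and the logic in score.py are not shown here and may differ.

```python
from typing import List

Grid = List[List[int]]


def exact_match_accuracy(predictions: List[Grid], targets: List[Grid]) -> float:
    """Fraction of predicted grids that equal their target grid exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Nested list equality compares shape and every cell value.
    correct = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return correct / len(targets)
```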

Cases

We analyze in depth the outputs of different models (GPT-4o, Gemini-2.5-Pro-thinking-8192, o4-mini) when they use visual thinking versus textual thinking on ARC-AGI tasks. Visual thinking shows several distinct advantages, such as integrating 2D structural information, taking a global perspective, and perceiving long-range relationships.

[Figures: case1, case2, case3, case4]

✒️ Citation

If you find this project useful, please kindly cite:

@article{zhang2025think,
  title={Think Visually, Reason Textually: Vision-Language Synergy in ARC},
  author={Zhang, Beichen and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2511.15703},
  year={2025}
}

📄 License

Code License

Usage and License Notices: The code is intended and licensed for research use only.
