Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang · Yuhang Zang^† · Xiaoyi Dong · Yuhang Cao
Haodong Duan · Dahua Lin · Jiaqi Wang^†

^†Corresponding authors.

📢 News

🚀 [2025/11/26] We have released the inference code.
🚀 [2025/11/19] We have released the paper "Think Visually, Reason Textually: Vision-Language Synergy in ARC".

🌈 Overview

We integrate Visual Intelligence into ARC-AGI to leverage the respective advantages of vision and text: vision supports global pattern abstraction and verification, whereas language specializes in precise execution.

We achieve this with two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which uses vision to verify text-based reasoning for intrinsic error correction.
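The control flow implied by these two strategies can be sketched roughly as follows. This is a minimal illustration, not the code in `method.py`: the function `synergy_solve`, its callable parameters, the `max_rounds` limit, and the toy stand-ins are all hypothetical names, and real use would replace the stand-ins with calls to a vision-language model.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, output grid)


def synergy_solve(
    examples: List[Example],
    test_input: Grid,
    summarize_rule: Callable[[List[Example]], str],
    apply_rule: Callable[[str, Grid, Optional[str]], Grid],
    verify: Callable[[List[Example], Grid, Grid], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,  # hypothetical retry budget
) -> Grid:
    """Sketch: vision summarizes the rule, text applies it, vision verifies."""
    # VLSR step 1 (visual modality): abstract the global pattern.
    rule = summarize_rule(examples)
    # VLSR step 2 (textual modality): execute the rule cell by cell.
    candidate = apply_rule(rule, test_input, None)
    # MSSC: switch back to vision to check the textual answer, retry on failure.
    for _ in range(max_rounds):
        ok, feedback = verify(examples, test_input, candidate)
        if ok:
            break
        candidate = apply_rule(rule, test_input, feedback)
    return candidate


# Toy stand-ins so the sketch runs without any model or API key.
def toy_summarize(examples: List[Example]) -> str:
    return "mirror each row"


def toy_apply(rule: str, grid: Grid, feedback: Optional[str]) -> Grid:
    if feedback is None:
        return [row[:] for row in grid]  # deliberately wrong first attempt
    return [row[::-1] for row in grid]


def toy_verify(examples: List[Example], test_input: Grid, candidate: Grid):
    ok = candidate == [row[::-1] for row in test_input]
    return ok, None if ok else "output is not mirrored"


result = synergy_solve([([[1, 2]], [[2, 1]])], [[1, 2, 3]],
                       toy_summarize, toy_apply, toy_verify)
```

In this toy run the first textual attempt fails verification, the feedback is passed back to the textual step, and the second attempt is accepted.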

🛠️ Inference

Prepare your environment

git clone https://github.com/InternLM/Arc-VL
cd Arc-VL
conda create -n arcvl python==3.11
conda activate arcvl
pip install -r requirements.txt

Modify setup_api_key.sh and fill in your base_url and API keys. Activate it by running:

source setup_api_key.sh

Prepare the data. The datasets can be downloaded from the following links:

ARC-AGI: https://github.com/fchollet/ARC-AGI

BARC: https://github.com/xu3kev/BARC

Re-ARC: https://github.com/michaelhodel/re-arc
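For orientation, each task in the ARC-AGI repository is a JSON file with `train` and `test` lists of `input`/`output` grid pairs, where a grid is a list of rows of integers 0-9. The sketch below parses and sanity-checks one such task. The helper names (`check_grid`, `parse_task`, `load_task`) are hypothetical and are not taken from `arc_dataset.py`; BARC and Re-ARC files may be laid out differently, so treat this as an assumption for those datasets.

```python
import json
from pathlib import Path
from typing import Dict, List

Grid = List[List[int]]


def check_grid(grid: Grid) -> None:
    """Raise ValueError unless grid is a non-empty rectangle of integers 0-9."""
    if not grid or not grid[0]:
        raise ValueError("grid must be non-empty")
    width = len(grid[0])
    for row in grid:
        if len(row) != width:
            raise ValueError("grid rows must have equal length")
        if any(not isinstance(c, int) or not 0 <= c <= 9 for c in row):
            raise ValueError("grid cells must be integers in 0..9")


def parse_task(raw: str) -> Dict[str, list]:
    """Parse one ARC-AGI task from JSON text and validate every grid in it."""
    task = json.loads(raw)
    for split in ("train", "test"):
        for pair in task[split]:
            for grid in pair.values():
                check_grid(grid)
    return task


def load_task(path) -> Dict[str, list]:
    """Read a task file from disk; the path is supplied by the caller."""
    return parse_task(Path(path).read_text(encoding="utf-8"))


# Small inline task so the sketch runs without downloading anything.
raw_task = json.dumps({
    "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    "test": [{"input": [[0, 0], [1, 1]], "output": [[1, 1], [0, 0]]}],
})
task = parse_task(raw_task)
```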

Specify the test dataset, the model, and the dataset path, then run our vision-language synergy reasoning with the following command:

python inference.py --dataset_name="arc-agi" --model="gpt-4o" --data_path="Your_data_path" \
    --result_file="result_arcagi_4o.json" \
    --save_root="images/ARC-AGI/"
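The `--save_root` argument suggests that grids are rendered to image files for the visual modality. As a rough illustration of that idea only (this is not the repository's `draw.py`), the sketch below turns a grid into a binary PPM image using the standard library. The function name `grid_to_ppm`, the `cell` size, and the colour palette are assumptions; the palette follows the colours commonly used to display ARC grids.

```python
from typing import List

Grid = List[List[int]]

# Assumed palette: one RGB triple per cell value 0-9.
ARC_PALETTE = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]


def grid_to_ppm(grid: Grid, cell: int = 16) -> bytes:
    """Render a grid as a binary PPM (P6) image, one cell x cell square per value."""
    height, width = len(grid), len(grid[0])
    header = f"P6 {width * cell} {height * cell} 255\n".encode("ascii")
    body = bytearray()
    for row in grid:
        line = bytearray()
        for value in row:
            # Repeat the cell colour horizontally.
            line.extend(bytes(ARC_PALETTE[value]) * cell)
        # Repeat the finished pixel row vertically.
        body.extend(bytes(line) * cell)
    return header + bytes(body)


image = grid_to_ppm([[0, 1]], cell=2)
```

Writing the returned bytes to a file with a `.ppm` extension gives an image that most viewers can open; a real pipeline would more likely use an imaging library and PNG output.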

Finally, score the inference results, passing the result file produced by the previous step as --input_file:

python score.py --input_file="result.json" --output_file="result_scored.json"
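ARC-style evaluation is usually exact match: a prediction counts only if every cell of the predicted grid equals the target grid. The sketch below shows that metric under an assumed record layout. The keys `prediction` and `target` and the function names are hypothetical, and the actual schema written by `inference.py` and read by `score.py` may differ.

```python
from typing import Iterable, List, Mapping

Grid = List[List[int]]


def exact_match(prediction: Grid, target: Grid) -> bool:
    """True only if both grids have the same shape and identical cells."""
    return prediction == target


def accuracy(records: Iterable[Mapping[str, Grid]]) -> float:
    """Fraction of records whose predicted grid exactly matches the target."""
    records = list(records)
    if not records:
        return 0.0
    correct = sum(exact_match(r["prediction"], r["target"]) for r in records)
    return correct / len(records)


# Hypothetical records: the first is correct, the second differs in one cell.
demo_records = [
    {"prediction": [[1, 0], [0, 1]], "target": [[1, 0], [0, 1]]},
    {"prediction": [[1, 1], [0, 0]], "target": [[1, 1], [0, 1]]},
]
demo_accuracy = accuracy(demo_records)
```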

Cases

We conduct an in-depth analysis of the specific outputs of different models (GPT-4o, Gemini-2.5-Pro-thinking-8192, o4-mini) when employing visual thinking versus textual thinking in the ARC-AGI task. Visual thinking demonstrates numerous unique advantages, such as the integration of 2D structural information, a global perspective, and long-range perception capabilities.

✒️ Citation

If you find this project useful, please kindly cite:

@article{zhang2025think,
  title={Think Visually, Reason Textually: Vision-Language Synergy in ARC},
  author={Zhang, Beichen and Zang, Yuhang and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2511.15703},
  year={2025}
}

📄 License

Usage and License Notices: The code is intended and licensed for research use only.
