Welcome to Optical Reasoning! 👋 This repository accompanies "Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text", a framework that treats images as a standalone reasoning medium. It supports typographic-based optical reasoning for compact rationale rendering and graphical-based optical reasoning for structured visual rationales. The repository also provides scripts for preparing visual rationales and reproducing experimental results.
🖨️ Typographic-Based Optical Reasoning (T-OR) renders the interleaved-modal rationale sequence into a compact typographic image with XeLaTeX.
🎨 Graphical-Based Optical Reasoning (G-OR) transforms the interleaved-modal rationale sequence into a unified image-based rationale that organizes reasoning with text, graphical elements, and spatial layouts.
- [2026.06] Initial release of Optical Reasoning.
- 🚀 Quick Start
- ✨ How It Works
- 🪐 Key Features
- 🔥 News
- 🗂️ Project Structure
- 🌱 Acknowledgements
- 📚 Citation
conda create -n optical-reasoning python=3.11 -y
conda activate optical-reasoning
pip install -U pip
pip install -r requirements.txt
The typographic (T-OR) renderer is implemented with XeLaTeX, so a TeX distribution that provides xelatex must be installed:
# Ubuntu / Debian
apt-get update
apt-get install -y texlive-xetex texlive-latex-extra texlive-fonts-recommended
# macOS
brew install --cask mactex-no-gui
Check the installation:
xelatex --version
cp src/configs/profiles_example.yaml src/configs/profiles.yaml
models:
  gpt5.1:
    api_key: ""
    base_url: ""
    model: "gpt-5.1-2025-11-13"
    temperature: 0.0
  llmjudge:
    api_key: ""
    base_url: ""
    model: "deepseek-chat"
    temperature: 0.0
  nano-banana-pro:
    api_key: ""
    base_url: ""
    model: "nano-banana-pro"
    temperature: 0.0
Tip: The rationales used in the paper can be downloaded from Optical-Reasoning-4k.
If you want to build visual rationales from textual rationales, start from the JSONL format below.
{
"id": "sample-001",
"problem": "Question text.",
"solution": "Reasoning rationale.",
"answer": "A",
"reasoning_token": 512
}
The fields are:
- id: unique example identifier.
- problem: input question or problem statement.
- solution: textual rationale to be rendered.
- answer: ground-truth answer.
- reasoning_token: token count of the textual rationale in solution.
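A quick sanity check of the input file can catch missing fields before rendering. The sketch below is not part of the repository: the helper names `validate_record` and `validate_jsonl_text` are hypothetical, and the assumption that `answer` is always a string (and `reasoning_token` an integer) is inferred only from the sample record above.

```python
import json

# Field names come from the record format above; the types are assumptions
# inferred from the sample record, not a schema published by the repository.
REQUIRED_FIELDS = {
    "id": str,
    "problem": str,
    "solution": str,
    "answer": str,
    "reasoning_token": int,
}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


def validate_jsonl_text(text: str) -> dict:
    """Validate JSONL content (one JSON object per line).

    Returns a mapping from 1-based line number to the problems on that line.
    """
    report = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[line_no] = [f"invalid JSON: {exc.msg}"]
            continue
        issues = validate_record(record)
        if issues:
            report[line_no] = issues
    return report
```

To check a file, pass its contents, for example `validate_jsonl_text(Path("data/aqua_rat/aqua_rat.jsonl").read_text(encoding="utf-8"))`. Records that do not yet carry `reasoning_token` will be reported until that field has been added.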
Tip: Use src/utils/add_reasoning_tokens.py to add the reasoning_token field to each JSONL record:
python src/utils/add_reasoning_tokens.py data/<dataset>/<dataset>.jsonl
By default, the script updates the input file in place. Use --output-path <output.jsonl> to write to a separate file.
The generated T-OR and G-OR rationales follow this folder structure:
data/
└── <dataset>/
    ├── <dataset>.jsonl
    ├── T-OR/
    │   ├── output.jsonl
    │   └── images/
    └── G-OR/
        ├── output.jsonl
        └── images/
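Before running inference, it can help to confirm that a dataset directory matches this layout. The following is a minimal sketch using only the standard library; `missing_rationale_paths` is a hypothetical helper name, not a script shipped with the repository.

```python
from pathlib import Path


def missing_rationale_paths(data_root: Path, dataset: str) -> list:
    """List the expected files and folders that do not exist yet.

    The expected entries mirror the folder structure documented above:
    <dataset>.jsonl plus output.jsonl and images/ under T-OR/ and G-OR/.
    """
    dataset_dir = data_root / dataset
    expected = [dataset_dir / f"{dataset}.jsonl"]
    for variant in ("T-OR", "G-OR"):
        expected.append(dataset_dir / variant / "output.jsonl")
        expected.append(dataset_dir / variant / "images")
    return [path for path in expected if not path.exists()]
```

For example, `missing_rationale_paths(Path("data"), "aqua_rat")` returns an empty list once both T-OR and G-OR rationales have been generated.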
T-OR renders the rationale into a compact typographic image while preserving the original order of the reasoning content.
DATASET=aqua_rat \
INPUT_JSONL=data/aqua_rat/aqua_rat.jsonl \
OUTPUT_DIR=data/aqua_rat/T-OR \
OUTPUT_JSONL=data/aqua_rat/T-OR/output.jsonl \
bash scripts/render_typographic.sh
Important: For T-OR rendering, the textual rationale in the solution field must be valid LaTeX that compiles without syntax errors.
- Reads rationales from the solution field.
- Searches for a compact and readable typographic layout under the reasoning_token budget.
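The idea of searching for a layout under a token budget can be illustrated with a small grid search. This is only a sketch under loudly stated assumptions, not the repository's renderer: the real pipeline measures pages produced by XeLaTeX, whereas here the page height is estimated from a character count, the visual-token cost assumes one token per 28×28 pixel patch, and the selection rule (largest font first, then fewest tokens) is a guess at what "compact and readable" might mean. All names and constants are hypothetical.

```python
import math
from itertools import product
from typing import Optional

PATCH_SIZE_PX = 28  # hypothetical: one visual token per 28x28 pixel patch


def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Visual tokens for an image under the assumed patch-grid cost model."""
    return math.ceil(width_px / PATCH_SIZE_PX) * math.ceil(height_px / PATCH_SIZE_PX)


def estimate_height_px(num_chars: int, width_px: int, font_px: int) -> int:
    """Rough text-block height: glyphs ~0.5 em wide, lines ~1.2 em tall (assumed)."""
    chars_per_line = max(1, int(width_px / (0.5 * font_px)))
    num_lines = math.ceil(num_chars / chars_per_line)
    return math.ceil(num_lines * 1.2 * font_px)


def search_layout(
    num_chars: int,
    token_budget: int,
    widths_px=(448, 672, 896),
    font_sizes_px=(10, 12, 14, 16, 20),
) -> Optional[dict]:
    """Grid-search text width and font size; return the best layout within budget.

    Returns None when no candidate fits the budget.
    """
    best = None
    best_key = None
    for width_px, font_px in product(widths_px, font_sizes_px):
        height_px = estimate_height_px(num_chars, width_px, font_px)
        tokens = estimate_image_tokens(width_px, height_px)
        if tokens > token_budget:
            continue  # over budget, discard this candidate
        # Prefer the largest font (readability), then the fewest tokens (compactness).
        key = (font_px, -tokens)
        if best_key is None or key > best_key:
            best_key = key
            best = {
                "width_px": width_px,
                "font_px": font_px,
                "height_px": height_px,
                "tokens": tokens,
            }
    return best
```

Calling `search_layout(num_chars=len(solution), token_budget=reasoning_token)` would pick a candidate for one record; a tighter budget pushes the search toward smaller fonts.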
G-OR generates a structured visual rationale by composing reasoning steps into graphical panels.
DATASET=aqua_rat \
INPUT_JSONL=data/aqua_rat/aqua_rat.jsonl \
OUTPUT_BASE=data/aqua_rat/G-OR \
OUTPUT_JSONL=data/aqua_rat/G-OR/output.jsonl \
PROFILE=nano-banana-pro \
bash scripts/render_graphical.sh
- Uses the configured generation profile, such as PROFILE=nano-banana-pro.
- Converts the problem, rationale, and optional visual inputs into a step-aligned graphical rationale.
For optical reasoning, T-OR takes the problem text together with the rendered typographic rationale image, while G-OR takes the problem text together with the generated graphical rationale image.
Run inference on T-OR:
PROFILE=gpt5.1 \
INPUT_JSONL=data/aqua_rat/T-OR/output.jsonl \
OUTPUT_DIR=outputs/aqua_rat/T-OR \
OUTPUT_JSONL=outputs/aqua_rat/T-OR/infer_gpt5.1.jsonl \
bash scripts/infer_typographic.sh
Run inference on G-OR:
PROFILE=gpt5.1 \
INPUT_JSONL=data/aqua_rat/G-OR/output.jsonl \
OUTPUT_DIR=outputs/aqua_rat/G-OR \
OUTPUT_JSONL=outputs/aqua_rat/G-OR/infer_gpt5.1.jsonl \
bash scripts/infer_graphical.sh
The example below shows the intended inference pattern after an optical rationale image has already been generated. The MLLM receives the original problem text together with the T-OR or G-OR rationale image, then returns the final answer in \boxed{ANSWER} format.
import json
import subprocess
from pathlib import Path
# 1) Prepare one example with the original problem and a generated rationale image.
# Replace this path with an existing T-OR or G-OR rationale image.
rationale_image = Path("data/aqua_rat/T-OR/images/sample-001.png").resolve()
example = {
    "id": "sample-001",
    "problem": "A rectangle has length 12 and width 5. What is its area?",
    "image_path": str(rationale_image),
}
input_jsonl = Path("outputs/tmp/usage_example_input.jsonl")
output_jsonl = Path("outputs/tmp/usage_example_output.jsonl")
input_jsonl.parent.mkdir(parents=True, exist_ok=True)
input_jsonl.write_text(json.dumps(example) + "\n", encoding="utf-8")
# 2) Run image-based reasoning inference.
subprocess.run(
    [
        "python",
        "src/run.py",
        "infer",
        "--data",
        str(input_jsonl),
        "--output",
        str(output_jsonl),
        "--profile",
        "gpt5.1",
        "--task-type",
        "img_reasoning",
        "--max-tokens",
        "256",
        "--no-evaluate",
    ],
    check=True,
)
# 3) Inspect the model answer.
result = json.loads(output_jsonl.read_text(encoding="utf-8").splitlines()[0])
print(result["prediction"])
print(result["parsed_prediction"])
Note: Configure src/configs/profiles.yaml before running the example. For multimodal datasets, add question_image to the JSONL row when the original question also includes an input image.
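Because the model is asked to answer in \boxed{ANSWER} format, a raw prediction can also be post-processed independently of the pipeline. The function below is a minimal sketch of such an extractor, not the repository's own parser (whose behaviour is not documented here); `extract_boxed` is a hypothetical name. It takes the last \boxed{...} in the text and tracks brace depth so that nested LaTeX such as \frac{1}{2} survives.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text.

    Returns None when there is no \\boxed{ marker or its braces never close.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for index in range(content_start, len(text)):
        char = text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:index].strip()
    return None  # unbalanced braces
```

Applied to a prediction such as `"The area is \\boxed{60}."`, the function returns `"60"`.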
Text reasoning receives the problem followed by the rationale, and free reasoning asks the model to solve the problem step by step.
python src/run.py infer \
--data data/<dataset>/<dataset>.jsonl \
--output outputs/<dataset>/text_reasoning/infer_<model>.jsonl \
--profile <model> \
--task-type text_reasoning
Use --task-type no_reasoning or --task-type free_reasoning for the other text baselines.
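How the three text baselines differ can be pictured as three prompt templates. The sketch below is illustrative only: the task-type names come from the commands above, but the prompt wording, the function name `build_text_prompt`, and the assumption that no_reasoning asks for the answer directly are not taken from the repository.

```python
def build_text_prompt(task_type: str, problem: str, rationale: str = "") -> str:
    """Compose an illustrative prompt for one of the text baselines.

    The instruction wording is hypothetical; only the ordering
    (problem first, then rationale for text_reasoning) follows the README.
    """
    answer_format = "Give the final answer in \\boxed{ANSWER} format."
    if task_type == "no_reasoning":
        # Assumed behaviour: ask for the answer without any rationale.
        return f"{problem}\n\n{answer_format}"
    if task_type == "text_reasoning":
        # The problem is followed by the provided textual rationale.
        return f"{problem}\n\nRationale:\n{rationale}\n\n{answer_format}"
    if task_type == "free_reasoning":
        # The model produces its own reasoning.
        return f"{problem}\n\nSolve the problem step by step. {answer_format}"
    raise ValueError(f"unknown task type: {task_type}")
```

The image-based setting (`img_reasoning`) is deliberately left out, since there the rationale is supplied as an image rather than as prompt text.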
The main experiment script evaluates five models on five benchmarks under seven settings: no reasoning, text reasoning, and T-OR with reasoning token ratios of 0.2, 0.4, 0.6, 0.8, and 1.0.
First, configure the model profiles in src/configs/profiles.yaml and prepare the benchmark datasets and full T-OR rationales under data/. Each benchmark must contain the full T-OR data at:
data/<dataset>/T-OR/
├── output.jsonl
└── images/
Generate the T-OR datasets for reasoning token ratios 0.2, 0.4, 0.6, and 0.8 from the full T-OR data:
bash scripts/generate_tor_ratio_data.sh
The generated datasets are written to data/<dataset>/T-OR-<ratio>/. The original T-OR directory is used for ratio 1.0.
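The directory convention for ratio datasets can be expressed as a small lookup. This is a sketch with a hypothetical helper name, `tor_dir_for_ratio`; it assumes the ratio appears in the directory name exactly as written (for example `T-OR-0.2`), which the README implies but does not spell out.

```python
from pathlib import PurePosixPath


def tor_dir_for_ratio(dataset: str, ratio: str, data_root: str = "data") -> PurePosixPath:
    """Map a reasoning token ratio to its T-OR data directory.

    Ratio 1.0 uses the original T-OR directory; other ratios use
    T-OR-<ratio>. The ratio is kept as a string to preserve its formatting.
    """
    if float(ratio) == 1.0:
        name = "T-OR"
    else:
        name = f"T-OR-{ratio}"
    return PurePosixPath(data_root) / dataset / name
```

For instance, `tor_dir_for_ratio("aqua_rat", "0.4")` gives `data/aqua_rat/T-OR-0.4`, while ratio `"1.0"` resolves to `data/aqua_rat/T-OR`.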
After preparing all ratio datasets, run the main experiments:
bash scripts/main_exp.sh
By default, the script runs 175 experiments and writes predictions and evaluation metrics to:
outputs/main_exp/<model>/<benchmark>/<setting>/
├── output.jsonl
└── metrics.json
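The figure of 175 experiments follows from the matrix itself: 5 models × 5 benchmarks × 7 settings, where the seven settings are no reasoning, text reasoning, and T-OR at five token ratios. The sketch below enumerates that matrix using the model and benchmark names that appear in this README's environment-variable example; the setting labels such as `img_reasoning@0.2` are hypothetical and need not match the directory names the script actually writes.

```python
from itertools import product

MODELS = "gpt5.1 claude-sonnet-4.5 kimi-k2.5 gemini2.5 qwen3vl".split()
BENCHMARKS = "aqua_rat gpqa gsm8k scienceqa_img zebra-cot".split()
TOKEN_RATIOS = "0.2 0.4 0.6 0.8 1.0".split()


def list_settings() -> list:
    """Two text settings plus one image-reasoning setting per token ratio."""
    settings = ["no_reasoning", "text_reasoning"]
    # Hypothetical label format for the T-OR settings.
    settings += [f"img_reasoning@{ratio}" for ratio in TOKEN_RATIOS]
    return settings


def list_experiments() -> list:
    """Every (model, benchmark, setting) combination in the default matrix."""
    return list(product(MODELS, BENCHMARKS, list_settings()))


if __name__ == "__main__":
    experiments = list_experiments()
    print(len(list_settings()), "settings,", len(experiments), "experiments")
```

Shrinking any of the three lists reduces the run count multiplicatively, which is useful for a quick smoke test on one model and one benchmark.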
The experiment matrix can be changed through environment variables:
MODELS_STR="gpt5.1 claude-sonnet-4.5 kimi-k2.5 gemini2.5 qwen3vl" \
BENCHMARKS_STR="aqua_rat gpqa gsm8k scienceqa_img zebra-cot" \
TASK_TYPES_STR="no_reasoning text_reasoning img_reasoning" \
TOKEN_RATIOS_STR="0.2 0.4 0.6 0.8 1.0" \
bash scripts/main_exp.sh
Optical Reasoning explores the bold idea of using images as a standalone reasoning medium for both language and multimodal tasks.
- Optical Reasoning: formulates a unified interleaved-modal rationale sequence and maps it into an image, allowing the model to derive final answers directly from visual reasoning tokens rather than textual ones.
- Typographic-Based (T-OR): optimizes visual layouts by searching over text width and font size to render rationales into compact, high-density typographic images under a strictly controllable reasoning-token budget.
- Graphical-Based (G-OR): decomposes rationales into distinct reasoning steps and assigns them to specific visual panels, creating a step-aligned composition that naturally unifies textual rationales, graphical elements, and spatial layouts.
├── scripts/
│   ├── generate_tor_ratio_data.sh
│   ├── infer_graphical.sh
│   ├── infer_typographic.sh
│   ├── main_exp.sh
│   ├── render_graphical.sh
│   └── render_typographic.sh
│
└── src/
    ├── run.py
    ├── configs/
    │   └── profiles_example.yaml
    ├── inference/
    │   ├── base_predictor.py
    │   ├── evaluation.py
    │   └── predictor.py
    ├── render/
    │   ├── graphical_render.py
    │   └── typographic_render.py
    └── utils/
        ├── add_reasoning_tokens.py
        ├── image_scaling.py
        └── token_sizing.py
This project is licensed under the MIT License. Please refer to the LICENSE file for more details.
@misc{bian2026opticalreasoningrethinkingimages,
title={Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text},
author={Yutong Bian and Dongjie Cheng and Heming Xia and Yongqi Li and Wenjie Li},
year={2026},
eprint={2606.09585},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.09585},
}
