Skip to content

ModalityDance/Optical-Reasoning

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

arXiv Paper HF Papers HF Dataset

Welcome to Optical Reasoning! 👋 This repository accompanies "Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text", a framework that treats images as a standalone reasoning medium. It supports typographic-based optical reasoning for compact rationale rendering and graphical-based optical reasoning for structured visual rationales. The repository also provides scripts for preparing visual rationales and reproducing experimental results.

vision

🪐 Key Features

🖨️ Typographic-Based Optical Reasoning T-OR renders the interleaved-modal rationale sequence into a compact typographic image with XeLaTeX.

🎨 Graphical-Based Optical Reasoning G-OR transforms the interleaved-modal rationale sequence into a unified image-based rationale that organizes reasoning with text, graphical elements, and spatial layouts.

🔥 News

  • [2026.06] Initial release of Optical Reasoning.

📑 Table of Contents

🚀 Quick Start

1. Installation

Create environment

conda create -n optical-reasoning python=3.11 -y
conda activate optical-reasoning

pip install -U pip
pip install -r requirements.txt

Install XeLaTeX

The renderer is implemented with XeLaTeX.

# Ubuntu / Debian
apt-get update
apt-get install -y texlive-xetex texlive-latex-extra texlive-fonts-recommended

# macOS
brew install --cask mactex-no-gui

Check the installation:

xelatex --version

Model profiles

cp src/configs/profiles_example.yaml src/configs/profiles.yaml
models:
  gpt5.1:
    api_key: ""
    base_url: ""
    model: "gpt-5.1-2025-11-13"
    temperature: 0.0
  llmjudge:
    api_key: ""
    base_url: ""
    model: "deepseek-chat"
    temperature: 0.0
  nano-banana-pro:
    api_key: ""
    base_url: ""
    model: "nano-banana-pro"
    temperature: 0.0

2. Rationales

Tip

The rationales used in the paper can be downloaded from the Optical-Reasoning-4k.

Prepare Your Own Rationales

If you want to build visual rationales from textual rationales, start from the JSONL format below.

{
  "id": "sample-001",
  "problem": "Question text.",
  "solution": "Reasoning rationale.",
  "answer": "A",
  "reasoning_token": 512
}

The fields are:

  • id: unique example identifier.
  • problem: input question or problem statement.
  • solution: textual rationale to be rendered.
  • answer: ground-truth answer.
  • reasoning_token: token count of the textual rationale in solution.

Tip

Use src/utils/add_reasoning_tokens.py to add the reasoning_token field to each JSONL record:

python src/utils/add_reasoning_tokens.py data/<dataset>/<dataset>.jsonl

By default, the script updates the input file in place. Use --output-path <output.jsonl> to write to a separate file.

The generated T-OR and G-OR rationales follow this folder structure:

data/
  └── <dataset>/
      ├── <dataset>.jsonl
      ├── T-OR/
      │   ├── output.jsonl
      │   └── images/
      └── G-OR/
          ├── output.jsonl
          └── images/

T-OR Rationales

T-OR renders the rationale into a compact typographic image while preserving the original order of the reasoning content.

DATASET=aqua_rat \
INPUT_JSONL=data/aqua_rat/aqua_rat.jsonl \
OUTPUT_DIR=data/aqua_rat/T-OR \
OUTPUT_JSONL=data/aqua_rat/T-OR/output.jsonl \
bash scripts/render_typographic.sh

Important

For T-OR rendering, the textual rationale in the solution field must be LaTeX text without syntax errors.

  • Reads rationales from the solution field.
  • Searches for a compact and readable typographic layout under the reasoning_token budget.

G-OR Rationales

G-OR generates a structured visual rationale by composing reasoning steps into graphical panels.

DATASET=aqua_rat \
INPUT_JSONL=data/aqua_rat/aqua_rat.jsonl \
OUTPUT_BASE=data/aqua_rat/G-OR \
OUTPUT_JSONL=data/aqua_rat/G-OR/output.jsonl \
PROFILE=nano-banana-pro \
bash scripts/render_graphical.sh
  • Uses the configured generation profile, such as PROFILE=nano-banana-pro.
  • Converts the problem, rationale, and optional visual inputs into a step-aligned graphical rationale.

3. Inference

Optical reasoning

For optical reasoning, T-OR takes the problem text together with the rendered typographic rationale image, while G-OR takes the problem text together with the generated graphical rationale image.

Run inference on T-OR:

PROFILE=gpt5.1 \
INPUT_JSONL=data/aqua_rat/T-OR/output.jsonl \
OUTPUT_DIR=outputs/aqua_rat/T-OR \
OUTPUT_JSONL=outputs/aqua_rat/T-OR/infer_gpt5.1.jsonl \
bash scripts/infer_typographic.sh

Run inference on G-OR:

PROFILE=gpt5.1 \
INPUT_JSONL=data/aqua_rat/G-OR/output.jsonl \
OUTPUT_DIR=outputs/aqua_rat/G-OR \
OUTPUT_JSONL=outputs/aqua_rat/G-OR/infer_gpt5.1.jsonl \
bash scripts/infer_graphical.sh

Usage Example

The example below shows the intended inference pattern after an optical rationale image has already been generated. The MLLM receives the original problem text together with the T-OR or G-OR rationale image, then returns the final answer in \boxed{ANSWER} format.

import json
import subprocess
from pathlib import Path

# 1) Prepare one example with the original problem and a generated rationale image.
#    Replace this path with an existing T-OR or G-OR rationale image.
rationale_image = Path("data/aqua_rat/T-OR/images/sample-001.png").resolve()

example = {
    "id": "sample-001",
    "problem": "A rectangle has length 12 and width 5. What is its area?",
    "image_path": str(rationale_image),
}

input_jsonl = Path("outputs/tmp/usage_example_input.jsonl")
output_jsonl = Path("outputs/tmp/usage_example_output.jsonl")
input_jsonl.parent.mkdir(parents=True, exist_ok=True)
input_jsonl.write_text(json.dumps(example) + "\n", encoding="utf-8")

# 2) Run image-based reasoning inference.
subprocess.run(
    [
        "python",
        "src/run.py",
        "infer",
        "--data",
        str(input_jsonl),
        "--output",
        str(output_jsonl),
        "--profile",
        "gpt5.1",
        "--task-type",
        "img_reasoning",
        "--max-tokens",
        "256",
        "--no-evaluate",
    ],
    check=True,
)

# 3) Inspect the model answer.
result = json.loads(output_jsonl.read_text(encoding="utf-8").splitlines()[0])
print(result["prediction"])
print(result["parsed_prediction"])

Note

Configure src/configs/profiles.yaml before running the example. For multimodal datasets, add question_image to the JSONL row when the original question also includes an input image.

Text baselines

Text reasoning receives the problem followed by the rationale, and free reasoning asks the model to solve the problem step by step.

python src/run.py infer \
  --data data/<dataset>/<dataset>.jsonl \
  --output outputs/<dataset>/text_reasoning/infer_<model>.jsonl \
  --profile <model> \
  --task-type text_reasoning

Use --task-type no_reasoning or --task-type free_reasoning for the other text baselines.


4. Reproduction

The main experiment script evaluates five models on five benchmarks under seven settings: no reasoning, text reasoning, and T-OR with reasoning token ratios of 0.2, 0.4, 0.6, 0.8, and 1.0.

First, configure the model profiles in src/configs/profiles.yaml and prepare the benchmark datasets and full T-OR rationales under data/. Each benchmark must contain the full T-OR data at:

data/<dataset>/T-OR/
├── output.jsonl
└── images/

Generate the T-OR datasets for reasoning token ratios 0.2, 0.4, 0.6, and 0.8 from the full T-OR data:

bash scripts/generate_tor_ratio_data.sh

The generated datasets are written to data/<dataset>/T-OR-<ratio>/. The original T-OR directory is used for ratio 1.0.

After preparing all ratio datasets, run the main experiments:

bash scripts/main_exp.sh

By default, the script runs 175 experiments and writes predictions and evaluation metrics to:

outputs/main_exp/<model>/<benchmark>/<setting>/
├── output.jsonl
└── metrics.json

The experiment matrix can be changed through environment variables:

MODELS_STR="gpt5.1 claude-sonnet-4.5 kimi-k2.5 gemini2.5 qwen3vl" \
BENCHMARKS_STR="aqua_rat gpqa gsm8k scienceqa_img zebra-cot" \
TASK_TYPES_STR="no_reasoning text_reasoning img_reasoning" \
TOKEN_RATIOS_STR="0.2 0.4 0.6 0.8 1.0" \
bash scripts/main_exp.sh

✨ How It Works

Optical Reasoning explores the bold idea of using images as a standalone reasoning medium for both language and multimodal tasks.

  • Optical Reasoning: formulates a unified interleaved-modal rationale sequence and maps it into an image, allowing the model to derive final answers directly from visual reasoning tokens rather than textual ones.
  • Typographic-Based (T-OR): optimizes visual layouts by searching over text width and font size to render rationales into compact, high-density typographic images under a strictly controllable reasoning-token budget.
  • Graphical-Based (G-OR): decomposes rationales into distinct reasoning steps and assigns them to specific visual panels, creating a step-aligned composition that naturally unifies textual rationales, graphical elements, and spatial layouts.

🗂️ Project Structure

├── scripts/
│   ├── generate_tor_ratio_data.sh
│   ├── infer_graphical.sh
│   ├── infer_typographic.sh
│   ├── main_exp.sh
│   ├── render_graphical.sh
│   └── render_typographic.sh
│
└── src/
    ├── run.py
    ├── configs/
    │   └── profiles_example.yaml
    ├── inference/
    │   ├── base_predictor.py
    │   ├── evaluation.py
    │   └── predictor.py
    ├── render/
    │   ├── graphical_render.py
    │   └── typographic_render.py
    └── utils/
        ├── add_reasoning_tokens.py
        ├── image_scaling.py
        └── token_sizing.py

🌱 Acknowledgements

XeTex Gsm8k AquaRat ScienceQA Zebra-CoT

This project is licensed under the MIT License. Please refer to the LICENSE file for more details.

📚 Citation

@misc{bian2026opticalreasoningrethinkingimages,
      title={Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text}, 
      author={Yutong Bian and Dongjie Cheng and Heming Xia and Yongqi Li and Wenjie Li},
      year={2026},
      eprint={2606.09585},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.09585}, 
}

About

"Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors