TIR-Bench (Thinking-with-Images Reasoning) is an environment for evaluating agentic reasoning capabilities that require programmatic image manipulation. It contains 1,215 examples across 13 diverse tasks requiring tool use for image processing, such as zooming, rotating, drawing auxiliary lines, and image segmentation. Questions are multi-choice or free-form.
- Agentic thinking-with-images reasoning
- Programmatic image manipulation (zoom, rotate, draw)
- Multi-choice and free-form answer evaluation
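As a rough illustration of what "programmatic image manipulation" means here, the sketch below crops a region (the first step of a zoom) and rotates an image by 90 degrees. It assumes a toy representation, a nested list of grayscale pixel values, and the helper names `crop` and `rotate90_cw` are hypothetical; the benchmark's tasks operate on real image files, typically through an imaging library.

```python
from typing import List

Grid = List[List[int]]  # row-major grayscale pixels; a stand-in for a real image


def crop(img: Grid, top: int, left: int, height: int, width: int) -> Grid:
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def rotate90_cw(img: Grid) -> Grid:
    """Rotate the image 90 degrees clockwise."""
    # Reverse the row order, then transpose: the bottom row becomes the first column.
    return [list(col) for col in zip(*img[::-1])]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
zoomed = crop(image, 0, 1, 2, 2)   # right-hand 2x2 region
rotated = rotate90_cw(image)       # 3 rows x 2 columns after rotation
```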
Agents are given a standard environment with no sandbox or file system access.
There is one split in this environment:
- test: 1,215 tasks (665 multi-choice, 550 free-form)
The 13 task categories include Jigsaw Puzzle, Rotated OCR, and Spot-the-Difference, among other image manipulation tasks.
Single-turn evaluation with LLM-graded rewards. The agent submits an answer via the submit_answer tool. Answers are graded by gpt-5-mini using task-type-specific guidelines for multiple choice, numeric, text, spatial/visual, and boolean answers. Reward is 1.0 if correct, 0.0 if incorrect.
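Because each episode yields a binary reward, accuracy is just the mean reward. The following is a minimal sketch of that aggregation, assuming results are collected as `(answer_type, reward)` pairs; the function name `accuracy_by_type` and the record layout are illustrative and not part of the environment.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_type(results: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean reward per answer type, plus an "overall" entry."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for answer_type, reward in results:
        for key in (answer_type, "overall"):
            totals[key] += reward
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in counts}


# Hypothetical rewards from four episodes (1.0 = correct, 0.0 = incorrect).
episodes = [
    ("multi-choice", 1.0),
    ("multi-choice", 0.0),
    ("free-form", 1.0),
    ("free-form", 1.0),
]
scores = accuracy_by_type(episodes)
```

Reporting the multi-choice and free-form subsets separately can help when comparing against chance-level baselines, which only apply to the multi-choice questions.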
The data is a single dataset.parquet file, with images sourced from the Hugging Face dataset Agents-X/TIR-Bench. It is stored on the OpenReward platform.
| Tool | Description |
|---|---|
| submit_answer | Submit an answer (text, letter, or numeric). LLM-graded evaluation. Ends the episode. |
Single-turn. The agent views the image(s) and task prompt, then submits one answer.
TIR-Bench evaluates agentic thinking-with-images abilities. Accuracy for representative models:
| Model | Accuracy |
|---|---|
| o3 (tool-use) | 46% |
| Gemini-2.5-Pro (non-agentic) | 28.9% |
| GPT-4o (non-agentic) | <20% |
The non-agentic models score below 30%, while o3 with tool use reaches 46%. This gap suggests that strong performance depends on thinking-with-images capabilities with tool use.
An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."} when creating a session.
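One way to avoid hard-coding the key is to build the secrets mapping from an environment variable. The sketch below assumes the key lives in `OPENAI_API_KEY`; the helper `build_secrets` is hypothetical, and the session-creation call itself is omitted because its exact name is not given here.

```python
import os
from typing import Dict, Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the secrets mapping expected by the grader from environment variables."""
    key = env.get("OPENAI_API_KEY", "")  # assumed variable name
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM-based grading needs it.")
    return {"openai_api_key": key}
```

The resulting dictionary is what would be passed as the `secrets` argument when creating a session.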
Agents in TIR-Bench solve visual reasoning tasks in a standard environment. The environment does not present direct safety risks.
@article{tirbench2025,
title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
author={Agents-X Project},
journal={arXiv preprint arXiv:2511.01833},
year={2025}
}