TIRBench

Description

TIR-Bench (Thinking-with-Images Reasoning) is an environment for evaluating agentic reasoning that requires programmatic image manipulation. It contains 1,215 examples across 13 task categories that call for tool use in image processing, such as zooming, rotating, drawing auxiliary lines, and segmenting images. Questions are multi-choice or free-form.

Capabilities

Agentic thinking-with-images reasoning
Programmatic image manipulation (zoom, rotate, draw)
Multi-choice and free-form answer evaluation
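As a rough illustration of the manipulations named above (zoom, rotate, draw), the following sketch uses Pillow. It is not part of the environment, which exposes only the answer-submission tool; it assumes the agent has Pillow available on its own side, and the helper names `zoom`, `rotate`, and `draw_line` are hypothetical.

```python
from PIL import Image, ImageDraw


def zoom(img, box, scale=2):
    """Crop `box` = (left, upper, right, lower) and upscale it by `scale`."""
    region = img.crop(box)
    return region.resize((region.width * scale, region.height * scale))


def rotate(img, degrees):
    """Rotate counter-clockwise; expand=True grows the canvas so nothing is clipped."""
    return img.rotate(degrees, expand=True)


def draw_line(img, start, end, color=(255, 0, 0), width=1):
    """Return a copy of an RGB image with an auxiliary line from `start` to `end`."""
    out = img.copy()
    ImageDraw.Draw(out).line([start, end], fill=color, width=width)
    return out


if __name__ == "__main__":
    # Synthetic stand-in for a task image (a real image would come from the dataset).
    image = Image.new("RGB", (40, 20), "white")
    print(zoom(image, (0, 0, 10, 10), scale=3).size)
    print(rotate(image, 90).size)
    print(draw_line(image, (0, 10), (39, 10)).getpixel((20, 10)))
```

Each helper returns a new image and leaves its input unchanged, so intermediate views can be kept and compared while reasoning about a task.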

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0.

Tasks

There is one split in this environment:

test: 1,215 tasks (665 multi-choice, 550 free-form)

Tasks span 13 reasoning categories, including jigsaw puzzles, rotated OCR, and spot-the-difference, among other image manipulation tasks.

Reward Structure

Single-turn evaluation with LLM-graded rewards. The agent submits an answer via the `submit_answer` tool. Answers are graded by gpt-5-mini using task-type-specific guidelines for multiple choice, numeric, text, spatial/visual, and boolean answers. Reward is 1.0 if correct, 0.0 if incorrect.
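Because each episode yields a reward of either 1.0 or 0.0, accuracy over a run is simply the mean reward. The sketch below illustrates that arithmetic; the function names are hypothetical, and the boolean verdict stands in for whatever the LLM grader actually returns, which this README does not specify.

```python
def reward_from_verdict(is_correct: bool) -> float:
    """Map a grader verdict to the binary reward described above."""
    return 1.0 if is_correct else 0.0


def accuracy(rewards):
    """Mean of per-episode binary rewards, as a fraction in [0, 1]."""
    if not rewards:
        raise ValueError("no rewards to average")
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    verdicts = [True, False, True, True]  # assumed grader outputs, for illustration only
    rewards = [reward_from_verdict(v) for v in verdicts]
    print(accuracy(rewards))  # 0.75
```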

Data

`dataset.parquet`, with images sourced from the Hugging Face dataset Agents-X/TIR-Bench. The data is stored on the OpenReward platform.

Tools

| Tool | Description |
| --- | --- |
| `submit_answer` | Submit an answer (text, letter, or numeric). LLM-graded evaluation. Ends the episode. |

Time Horizon

Single-turn. The agent views the image(s) and task prompt, then submits one answer.

Environment Difficulty

TIR-Bench evaluates agentic thinking-with-images abilities:

| Model | Accuracy |
| --- | --- |
| o3 (tool-use) | 46% |
| Gemini-2.5-Pro (non-agentic) | 28.9% |
| GPT-4o (non-agentic) | <20% |

Traditional non-agentic models perform poorly, demonstrating that strong performance requires thinking-with-images capabilities with tool use.

Other Environment Requirements

An OpenAI API key is required for LLM-based grading. Pass it via `secrets={"openai_api_key": "..."}` when creating a session.
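One way to avoid hard-coding the key is to build the `secrets` mapping from an environment variable. This is a minimal sketch: the helper name `build_secrets` and the variable name `OPENAI_API_KEY` are assumptions, and the session-creation call itself is not shown because this README does not document it.

```python
import os


def build_secrets(env=None):
    """Build the secrets mapping, reading the key from an environment-like mapping."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")  # assumed variable name
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; the LLM grader needs it")
    return {"openai_api_key": key}


if __name__ == "__main__":
    # Pass the result as the `secrets` argument when creating a session.
    secrets = build_secrets({"OPENAI_API_KEY": "sk-example"})
    print(sorted(secrets))
```

Failing early with a clear error is preferable here, since a missing key would otherwise only surface when the grader is first invoked.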

Safety

Agents in TIR-Bench solve visual reasoning tasks in a standard environment. The environment does not present direct safety risks.

Citation

@article{tirbench2025,
  title={TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning},
  author={Agents-X Project},
  journal={arXiv preprint arXiv:2511.01833},
  year={2025}
}

