LongCoT

Benchmarking Long-Horizon Chain-of-Thought Reasoning

Correspondence: sumeet.motwani@eng.ox.ac.uk, charles.london@cs.ox.ac.uk

LongCoT is a benchmark for measuring how well frontier LLMs can sustain coherent reasoning across extended chains of thought. Every problem pairs a short input with a long reasoning output (often spanning tens to hundreds of thousands of tokens) where individual steps are tractable in isolation but the difficulty emerges from composition: keeping plans on track, tracking state, propagating constraints, and recovering from mistakes without external tools or scaffolding.

The benchmark contains ~2,500 expert-designed problems across five domains (logic, computer science, chemistry, chess, mathematics), each with deterministic verification.

Domains

Domain	Questions	Verification
`logic`	500	Programmatic (simulates/validates solution)
`cs`	500	JSON string match
`chemistry`	500	Regex-based SMILES extraction + canonical match (+ optional LLM fallback)
`chess`	500	Template-aware deterministic parsing (integer/FEN/SAN/move-dict) + engine checks
`math`	500	Deterministic math comparison (+ optional LLM fallback)

Each domain has questions at three difficulty levels: easy, medium, hard. Two benchmarks are defined over this split:

LongCoT-Mini — the easy subset (~500 questions), suitable for quick evaluation. Selected with --difficulty longcot-mini.
LongCoT — medium + hard (~2,000 questions), the full benchmark. Selected with --difficulty longcot.

Installation

LongCoT uses uv for dependency management. Install uv if you don't already have it (curl -LsSf https://astral.sh/uv/install.sh | sh), then:

uv sync

This creates a local .venv, installs LongCoT in editable mode, and pulls in all runtime dependencies (including the OpenAI, Anthropic, and Google GenAI SDKs used by the inference CLI) pinned to uv.lock. Pass --no-dev if you want to skip the test dependencies.

Run commands inside the environment with uv run (e.g. uv run python run_inference.py ...), or activate the venv manually with source .venv/bin/activate.

The math and chemistry fallback judges use Gemini and read GEMINI_API_KEY or GOOGLE_API_KEY.

Running inference

run_inference.py submits prompts to an LLM provider in parallel and writes one JSONL result per question to responses/.

# LongCoT-Mini (easy only) on GPT-5.2
uv run python run_inference.py --difficulty longcot-mini --config oai_gpt52

# Full LongCoT (medium + hard) on an OpenRouter model
uv run python run_inference.py --difficulty longcot --config or_deepseek

# A single domain, single difficulty
uv run python run_inference.py --domain math --difficulty hard --config anthropic_sonnet

# Sanity check without hitting the API
uv run python run_inference.py --config gemini_pro --dry-run --max-questions 5

Valid --difficulty values: easy, medium, hard, longcot-mini (alias for easy), longcot (alias for medium + hard). Omitting the flag runs all three difficulties.

Configs

Configs are YAML files in src/configs/; pass the filename stem as --config. Built-in configs:

Config	Provider	Model
`oai_gpt52`	OpenAI	gpt-5.2
`anthropic_sonnet`	Anthropic	claude-opus-4-6
`gemini_pro`	Gemini	gemini-3-pro
`or_deepseek`	OpenRouter	deepseek-v3.2
`or_glm`	OpenRouter	GLM
`or_grok`	OpenRouter	Grok
`or_kimi`	OpenRouter	Kimi

API keys are declared in the configs with ${ENV_VAR} placeholders (e.g. ${OPENAI_API_KEY}) and resolved at runtime. Copy any of the existing YAML files to add a new provider/model.

Output

Each response is saved as one JSON line with question_id, successful, response_text, model, usage, and (when the provider returns it) reasoning. Default output path is responses/<domain>_<difficulty>_<model>_<timestamp>.jsonl.

Running analysis

run_eval.py takes a JSONL of responses, verifies each against the reference answers, and writes a results JSON to results/.

uv run python run_eval.py responses/math_hard_gpt-5.2_20260226.jsonl

Reported metrics:

Metric	Description
`correct`	Verified as correct
`incorrect`	Verified as wrong
`failed`	API error (no response returned)
`wrong_formatting`	Response did not include `solution = ...` (counted separately; verification is still attempted)
`accuracy`	`correct / (correct + incorrect)` — excludes failed calls
`overall_accuracy`	`correct / total`

LLM fallback judges

For math problems where deterministic comparison is inconclusive and for chemistry responses where SMILES extraction fails, a Gemini fallback judge is invoked. Set GEMINI_API_KEY (or GOOGLE_API_KEY) to enable it. Pass --no-fallback to disable — scores may be lower without it. When fallback is enabled but no key is set, a warning is printed and those borderline cases fail closed.

Python API

If you want to integrate LongCoT into your own harness rather than use the CLI:

import longcot

# Load questions (optionally filter by domain / difficulty)
questions = longcot.load_questions(domain="logic", difficulty="easy")

# Each question is a dataclass with a uniform shape across domains
q = questions[0]
q.question_id   # "Sudoku_easy_1"
q.domain        # "logic"
q.difficulty    # "easy"
q.prompt        # Full prompt text to send to an LLM
q.problem       # dict (template metadata) or None
q.answer        # reference answer payload or None

# Call your LLM with q.prompt, then verify
correct = longcot.verify(q, response_text)
# -> bool

# Batch verification
results = longcot.verify_batch(questions, responses)

Fallback judges can be configured via VerifyOptions:

options = longcot.VerifyOptions(
    math=longcot.MathVerifyOptions(enable_fallback=True),
    chemistry=longcot.ChemistryVerifyOptions(enable_fallback=True),
)
correct = longcot.verify(q, response_text, options=options)

All domains use solution = ... as the answer format; per-question prompts spell out the expected shape of the value.

Submissions

We welcome community submissions to the LongCoT leaderboard.

To submit results:

Evaluate your model on the full benchmark using the default evaluation settings.
Include your model name, provider, and per-question outputs in a results file.
Open a pull request on the benchmark repository with the results and enough information for us to reproduce the run.

Example submission format:

{
  "model": "Your Model Name",
  "provider": "Your Organisation",
  "type": "open|closed",
  "results": [
    {"question_id": "Sudoku_easy_1", "response": "...", "correct": true}
  ]
}

We verify submissions before adding them to the leaderboard.

Authors

Sumeet Ramesh Motwani*, Daniel Nichols*, Charles London*, Peggy Li*, Fabio Pizzati*,
Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran,
Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr, Natasha Jaques, Ameya Prabhu,
Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt

University of Oxford · Lawrence Livermore National Laboratory · MBZUAI · KAUST ·
Hexo AI · University of Washington · University of Tübingen

*Joint first authors

Citation

If you use LongCoT in your work, please cite:

@article{motwani2026longcot,
  title         = {LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning},
  author        = {Motwani, Sumeet Ramesh and Nichols, Daniel and London, Charles and Li, Peggy and Pizzati, Fabio and Blake, Acer and Hammoud, Hasan and McDonald, Tavish and Naik, Akshat and Ivanova, Alesia and Baskaran, Vignesh and Laptev, Ivan and Glatt, Ruben and Ben-Nun, Tal and Torr, Philip and Jaques, Natasha and Prabhu, Ameya and Bartoldson, Brian and Kailkhura, Bhavya and Schroeder de Witt, Christian},
  year          = {2026},
  eprint        = {2604.14140},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2604.14140}
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
src		src
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
run_eval.py		run_eval.py
run_inference.py		run_inference.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LongCoT

Domains

Installation

Running inference

Configs

Output

Running analysis

LLM fallback judges

Python API

Submissions

Authors

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

LongCoT

Domains

Installation

Running inference

Configs

Output

Running analysis

LLM fallback judges

Python API

Submissions

Authors

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages