Important
🌟 If you find this repository useful, please consider giving it a star!
🔥 News
- [2026/05] We have released the full codebase, including the benchmark evaluation framework, data synthesis pipeline, SFT training scripts, and RL configuration files. The Benchmark and Data, STT-Agent-SFT, and STT-Agent-RL are now available. We have also released a preprint of our paper on arXiv.
This repository contains the official implementation of the paper "STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics".
We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect state shifts and construct revised execution strategies. Beyond benchmarking, we propose an iterative trajectory refinement technique that eliminates three recurring failure modes from training data, and combine it with online RL to produce STT-Agent, which outperforms many open-source frontier models on STT-Arena.
Figure 1: Overview of the STT-Arena construction pipeline. The pipeline consists of three stages: (1) Environment Curation, (2) Spatio-Temporal Dynamic Injection, and (3) Dual-Agent Assessment, followed by human-in-the-loop review to produce the final 227 benchmark instances.
Figure 2: Statistics of STT-Arena across nine spatio-temporal sub-types and four difficulty levels.
Figure 3: Overall Pass@1 performance of all evaluated models on STT-Arena. Results are grouped into four categories: closed-source LLMs, open-source LLMs, more efficient LLMs, and STT-Agent variants. Even the best-performing model, Claude-4.6-Opus, achieves only 35.39%, underscoring the fundamental difficulty of spatio-temporal dynamic reasoning.
STT-Arena organizes tasks into 3 major categories and 9 subcategories of spatio-temporal conflict:
| Category | ID | Conflict Type | Example |
|---|---|---|---|
| Temporal | T1 | Window Expiry | Hotel booking timeout during multi-step planning. |
| Temporal | T2 | Priority Reorder | Flight rescheduled earlier, invalidating transfers. |
| Temporal | T3 | Quota Reset | Promotional tickets expire at midnight mid-session. |
| Spatial | S1 | Site Mismatch | Reserved vehicle relocated to a different depot. |
| Spatial | S2 | Dependency Block | Warehouse lockdown halts all downstream dispatch. |
| Spatial | S3 | Route Restriction | New customs rules block a cross-border corridor. |
| Spatio-Temporal | ST1 | Resource Shift | Peak demand moves cars away from residential zones. |
| Spatio-Temporal | ST2 | Failure Cascade | Hub outage disrupts regional warehouse inventory. |
| Spatio-Temporal | ST3 | Handoff Failure | Medical sample misses window due to clock drift. |
Tasks are further stratified by four difficulty levels: Easy, Medium, Hard, and Impossible.
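The taxonomy above can be captured as a small lookup table, which is handy when grouping results by conflict type or difficulty. This is a minimal sketch: the class and function names (`Category`, `Difficulty`, `ConflictType`, `category_of`) are illustrative assumptions and are not taken from the repository; only the IDs, names, and levels come from the table and text above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TEMPORAL = "Temporal"
    SPATIAL = "Spatial"
    SPATIO_TEMPORAL = "Spatio-Temporal"


class Difficulty(Enum):
    EASY = "Easy"
    MEDIUM = "Medium"
    HARD = "Hard"
    IMPOSSIBLE = "Impossible"


@dataclass(frozen=True)
class ConflictType:
    id: str
    name: str
    category: Category


# IDs and names copied from the taxonomy table; the structure itself is hypothetical.
CONFLICT_TYPES = {
    c.id: c
    for c in [
        ConflictType("T1", "Window Expiry", Category.TEMPORAL),
        ConflictType("T2", "Priority Reorder", Category.TEMPORAL),
        ConflictType("T3", "Quota Reset", Category.TEMPORAL),
        ConflictType("S1", "Site Mismatch", Category.SPATIAL),
        ConflictType("S2", "Dependency Block", Category.SPATIAL),
        ConflictType("S3", "Route Restriction", Category.SPATIAL),
        ConflictType("ST1", "Resource Shift", Category.SPATIO_TEMPORAL),
        ConflictType("ST2", "Failure Cascade", Category.SPATIO_TEMPORAL),
        ConflictType("ST3", "Handoff Failure", Category.SPATIO_TEMPORAL),
    ]
}


def category_of(conflict_id: str) -> Category:
    """Return the major category for a conflict-type ID such as 'ST2'."""
    return CONFLICT_TYPES[conflict_id].category
```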
The benchmark is built via a three-stage automated pipeline with human-in-the-loop validation:
- Stage 1 — Environment Curation: Real-world seed queries are collected from API-Bank, ToolAce, and Dolci, filtered by statefulness and spatio-temporal sensitivity, and synthesized into validated executable Python environments.
- Stage 2 — Spatio-Temporal Dynamic Injection: Each static environment is augmented with a blueprint-guided conflict trigger, producing dynamic tasks across nine conflict types and four difficulty levels.
- Stage 3 — Dual-Agent Assessment: A planning agent and a checking agent jointly validate each instance against three behavioral invariants, followed by LLM-based consistency auditing and manual human review.
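To make the idea of an injected trigger concrete, the toy environment below imitates a Window Expiry (T1) conflict: a room hold silently lapses while the agent is busy with other tool calls. It is a sketch under stated assumptions and not the repository's environment format; `HotelEnv`, `HoldExpired`, the logical clock, and both plan functions are hypothetical names invented for illustration.

```python
class HoldExpired(Exception):
    """Raised when a booking is attempted on a hold that has lapsed."""


class HotelEnv:
    """Toy stateful environment with a Window Expiry (T1) style trigger."""

    def __init__(self, hold_ttl=2):
        self.clock = 0                 # logical time, advanced by each state-changing call
        self.hold_ttl = hold_ttl       # how many clock units a hold stays valid
        self.hold_expires_at = None
        self.booked = False

    def _tick(self):
        self.clock += 1

    def hold_room(self):
        self.hold_expires_at = self.clock + self.hold_ttl
        self._tick()
        return self.hold_expires_at

    def search_flights(self):
        # An unrelated planning step that still consumes time.
        self._tick()
        return ["FL100", "FL200"]

    def hold_is_active(self):
        # Read-only status check; does not advance the clock.
        return self.hold_expires_at is not None and self.clock < self.hold_expires_at

    def confirm_booking(self):
        if not self.hold_is_active():
            raise HoldExpired("hold lapsed before confirmation")
        self.booked = True
        self._tick()
        return "confirmed"


def stale_plan(env):
    """Executes the original plan without re-checking state."""
    env.hold_room()
    env.search_flights()
    return env.confirm_booking()


def adaptive_plan(env):
    """Re-checks the hold before committing and re-acquires it if needed."""
    env.hold_room()
    env.search_flights()
    if not env.hold_is_active():
        env.hold_room()
    return env.confirm_booking()
```

With `hold_ttl=2`, `stale_plan` raises `HoldExpired`, while `adaptive_plan` notices the state shift, renews the hold, and completes the booking.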
Guided by three recurring failure modes identified in trajectory analysis — Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification — we propose an iterative trajectory refinement approach that post-processes training trajectories in three sequential stages, each targeting one failure mode.
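The sequential structure of such a refinement pass can be sketched as three stages applied in order, one per failure mode. Everything below is an assumption made for illustration: the step schema (`type`, `cause`), the detector heuristics, and the choice to filter out offending trajectories are hypothetical, and the actual logic in `SFT/refine_train_trajectory.py` may instead rewrite trajectories or use different criteria.

```python
from typing import Callable, Dict, List, Tuple

Step = Dict[str, str]        # hypothetical schema: {"type": ..., "cause": ...}
Trajectory = List[Step]


def has_stale_state_execution(traj: Trajectory) -> bool:
    """Toy check: a tool call directly follows a trigger with no fresh observation."""
    return any(
        prev["type"] == "trigger" and nxt["type"] == "call"
        for prev, nxt in zip(traj, traj[1:])
    )


def has_misdiagnosis(traj: Trajectory) -> bool:
    """Toy check: a diagnosis names a cause that no trigger in the trajectory has."""
    causes = [s["cause"] for s in traj if s["type"] == "trigger"]
    diagnoses = [s["cause"] for s in traj if s["type"] == "diagnose"]
    return any(d not in causes for d in diagnoses)


def lacks_post_adaptation_verification(traj: Trajectory) -> bool:
    """Toy check: a trigger fired but nothing is verified after the final call."""
    if not any(s["type"] == "trigger" for s in traj):
        return False
    last_call = max((i for i, s in enumerate(traj) if s["type"] == "call"), default=-1)
    return not any(s["type"] == "verify" for s in traj[last_call + 1:])


# One stage per failure mode, applied in sequence.
STAGES: List[Tuple[str, Callable[[Trajectory], bool]]] = [
    ("stale_state_execution", has_stale_state_execution),
    ("misdiagnosis_of_dynamic_triggers", has_misdiagnosis),
    ("missing_post_adaptation_verification", lacks_post_adaptation_verification),
]


def refine(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories that pass every stage (filtering is an assumption)."""
    kept = list(trajectories)
    for _name, detector in STAGES:
        kept = [t for t in kept if not detector(t)]
    return kept
```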
| Model | Easy | Medium | Hard | Impossible | Overall | Avg. Calls |
|---|---|---|---|---|---|---|
| Qwen-3-4B (baseline) | 18.31 | 9.46 | 2.82 | 10.00 | 10.57 | 7.63 |
| STT-Agent (w/o refine) | 28.17 | 16.92 | 11.86 | 47.01 | 23.10 | 32.70 |
| STT-Agent (refine) | 26.76 | 17.41 | 13.56 | 61.11 | 25.11 | 15.30 |
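Per-level and overall pass rates like those in the table can be computed from per-task outcomes. The sketch below assumes one attempt per task and an Overall score taken over all tasks rather than as the unweighted mean of the four levels; the record format and the `pass_at_1` helper are hypothetical and not the repository's evaluation code.

```python
from collections import defaultdict


def pass_at_1(records):
    """Compute Pass@1 (in percent) per difficulty level and over all tasks.

    `records` is an iterable of (difficulty, passed) pairs, one per task.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for difficulty, passed in records:
        totals[difficulty] += 1
        passes[difficulty] += int(passed)
    per_level = {d: 100.0 * passes[d] / totals[d] for d in totals}
    overall = 100.0 * sum(passes.values()) / sum(totals.values())
    return per_level, overall


# Made-up outcomes for illustration only (not benchmark results).
records = [
    ("Easy", True), ("Easy", False),
    ("Hard", False), ("Hard", False), ("Hard", False), ("Hard", True),
]
per_level, overall = pass_at_1(records)
print(per_level, round(overall, 2))
```

Because levels contain different numbers of tasks, the overall rate in this example differs from the simple average of the per-level rates.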
# Clone the repository
git clone https://github.com/YOUR-ORG/STT-Arena.git
cd STT-Arena
# Install dependencies
pip install -r requirements.txt

Note: The benchmark data is available on HuggingFace. Place the downloaded data under evaluation/benchmark.
# Run the evaluation script directly; see the instructions inside the script for the parameters to fill in.
export OPENAI_API_KEY="Your api key"
export OPENAI_BASE_URL="Your base url"
bash run.sh

Note: Pre-built SFT trajectories and RL task sets are available on HuggingFace. Run the pipeline below only if you wish to synthesize new data. To do so, first download the source data (Dolci, API-Bank, ToolAce) and place it under environment_curation/source_data.
# Stage 1: Environment Curation
# Filter seed queries and synthesize static environments
bash environment_curation/stage1_environment_curation.sh
# Stage 2: Spatio-Temporal Dynamic Injection
# Assign conflict types and generate dynamic tasks
bash dynamic_injection/stage2_dynamic_injection.sh
# Stage 3: Dual-Agent Assessment
# Validate instances and generate checklists
bash dual_agent_check/stage3_dual_agent_assessment.sh

SFT:
# We use the LlamaFactory framework to train the base models
# First, convert the data to the LlamaFactory format
bash SFT/prepare_llamafactory_data.sh
# Then place the formatted data under the data folder in LlamaFactory and add the following entry to the dataset_info.json file
"sft_refine": {
  "file_name": "sft_refine.json",
  "formatting": "alpaca",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "history": "history",
    "system": "system"
  }
}
# Finally, start SFT using the SFT/qwen3.yaml configuration

RL:
# We use the ROLL framework for online RL
# A configuration file is provided in the RL folder; follow the ROLL documentation to start RL training

# Apply iterative trajectory refinement to raw SFT trajectories
python SFT/refine_train_trajectory.py

For questions or feedback, please open a GitHub issue or contact Tingfeng Hui.
If you find this work helpful, please cite our paper:
@misc{hui2026sttarenarealisticenvironmenttoolusing,
  title={STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics},
  author={Tingfeng Hui and Hao Xu and Pengyu Zhu and Hongsheng Xin and Kun Zhan and Sen Su and Chunxiao Liu and Ning Miao},
  year={2026},
  eprint={2605.18548},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.18548},
}

