STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Important

🌟 If you find this repository useful, please consider giving it a star!

🔥 News

[2026/05] We have released the full codebase, including the benchmark evaluation framework, data synthesis pipeline, SFT training scripts, and RL configuration files. The Benchmark and Data, STT-Agent-SFT, and STT-Agent-RL are now available. We have also released a preprint of the paper on arXiv.

This repository contains the official implementation of the paper "STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics".

We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect state shifts and construct revised execution strategies. Beyond benchmarking, we propose an iterative trajectory refinement technique that eliminates three recurring failure modes from training data, and combine it with online RL to produce STT-Agent, which outperforms many open-source frontier models on STT-Arena.

Figure 1: Overview of the STT-Arena construction pipeline. The pipeline consists of three stages: (1) Environment Curation, (2) Spatio-Temporal Dynamic Injection, and (3) Dual-Agent Assessment, followed by human-in-the-loop review to produce the final 227 benchmark instances.

Figure 2: Statistics of STT-Arena across nine spatio-temporal sub-types and four difficulty levels.

📊 Benchmark Results

Figure 3: Overall Pass@1 performance of all evaluated models on STT-Arena. Results are grouped into four categories: closed-source LLMs, open-source LLMs, more efficient LLMs, and STT-Agent variants. Even the best-performing model, Claude-4.6-Opus, achieves only 35.39%, underscoring the fundamental difficulty of spatio-temporal dynamic reasoning.

🗂️ Task Taxonomy

STT-Arena organizes tasks into 3 major categories and 9 subcategories of spatio-temporal conflict:

Category	ID	Conflict Type	Example
Temporal	T1	Window Expiry	Hotel booking timeout during multi-step planning.
Temporal	T2	Priority Reorder	Flight rescheduled earlier, invalidating transfers.
Temporal	T3	Quota Reset	Promotional tickets expire at midnight mid-session.
Spatial	S1	Site Mismatch	Reserved vehicle relocated to a different depot.
Spatial	S2	Dependency Block	Warehouse lockdown halts all downstream dispatch.
Spatial	S3	Route Restriction	New customs rules block a cross-border corridor.
Spatio-Temporal	ST1	Resource Shift	Peak demand moves cars away from residential zones.
Spatio-Temporal	ST2	Failure Cascade	Hub outage disrupts regional warehouse inventory.
Spatio-Temporal	ST3	Handoff Failure	Medical sample misses window due to clock drift.

Tasks are further stratified by four difficulty levels: Easy, Medium, Hard, and Impossible.
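
For readers who want to filter or group tasks programmatically, the taxonomy above can be written down as a small lookup table. This is only an illustrative sketch: the names `ConflictType`, `TAXONOMY`, `DIFFICULTY_LEVELS`, and `ids_in_category` are hypothetical and are not taken from the repository, which may store this information differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictType:
    category: str
    name: str


# IDs, categories, and names copied from the taxonomy table above.
TAXONOMY = {
    "T1": ConflictType("Temporal", "Window Expiry"),
    "T2": ConflictType("Temporal", "Priority Reorder"),
    "T3": ConflictType("Temporal", "Quota Reset"),
    "S1": ConflictType("Spatial", "Site Mismatch"),
    "S2": ConflictType("Spatial", "Dependency Block"),
    "S3": ConflictType("Spatial", "Route Restriction"),
    "ST1": ConflictType("Spatio-Temporal", "Resource Shift"),
    "ST2": ConflictType("Spatio-Temporal", "Failure Cascade"),
    "ST3": ConflictType("Spatio-Temporal", "Handoff Failure"),
}

DIFFICULTY_LEVELS = ("Easy", "Medium", "Hard", "Impossible")


def ids_in_category(category: str) -> list:
    """Return the conflict IDs that belong to one major category."""
    return [cid for cid, ctype in TAXONOMY.items() if ctype.category == category]
```

For example, `ids_in_category("Spatial")` returns the three spatial conflict IDs in table order.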

🏗️ STT-Arena Construction Pipeline

The benchmark is built via a three-stage automated pipeline with human-in-the-loop validation:

Stage 1 — Environment Curation: Real-world seed queries are collected from API-Bank, ToolAce, and Dolci, filtered by statefulness and spatio-temporal sensitivity, and synthesized into validated executable Python environments.
Stage 2 — Spatio-Temporal Dynamic Injection: Each static environment is augmented with a blueprint-guided conflict trigger, producing dynamic tasks across nine conflict types and four difficulty levels.
Stage 3 — Dual-Agent Assessment: A planning agent and a checking agent jointly validate each instance against three behavioral invariants, followed by LLM-based consistency auditing and manual human review.
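
To make the idea of an injected trigger concrete, here is a toy sketch of a T1 (Window Expiry) conflict. It is not the repository's environment API: the class `HotelEnv`, its methods, and both agent functions are hypothetical names invented for illustration. The environment advances a step counter on every tool call and silently expires a room hold, so a plan that was valid at the start fails midway unless the agent notices and adapts.

```python
class HotelEnv:
    """Toy stateful environment with a room hold that expires (hypothetical)."""

    def __init__(self, hold_expires_at_step: int):
        self.step = 0
        self.hold_expires_at_step = hold_expires_at_step
        self.hold_active = True
        self.booked = False

    def _tick(self) -> None:
        # Every tool call advances the clock; the trigger fires once it is due.
        self.step += 1
        if self.step >= self.hold_expires_at_step:
            self.hold_active = False

    def check_hold(self) -> bool:
        self._tick()
        return self.hold_active

    def renew_hold(self) -> None:
        self._tick()
        self.hold_active = True
        self.hold_expires_at_step = self.step + 3

    def confirm_booking(self) -> str:
        self._tick()
        if not self.hold_active:
            return "error: hold expired"
        self.booked = True
        return "ok"


def naive_agent(env: HotelEnv) -> str:
    """Follows the original plan and never reacts to the state shift."""
    env.check_hold()
    return env.confirm_booking()


def adaptive_agent(env: HotelEnv) -> str:
    """Detects the failed call, renews the hold, and retries."""
    env.check_hold()
    result = env.confirm_booking()
    if result != "ok":
        env.renew_hold()
        result = env.confirm_booking()
    return result
```

With `HotelEnv(hold_expires_at_step=2)`, the naive agent ends with the expiry error while the adaptive agent completes the booking.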

🤖 Iterative Trajectory Refinement

Guided by three recurring failure modes identified in trajectory analysis — Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification — we propose an iterative trajectory refinement approach that post-processes training trajectories in three sequential stages, each targeting one failure mode.
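
The sequential structure described above can be sketched as a chain of trajectory-to-trajectory functions. This is a minimal illustration under stated assumptions, not the logic of `SFT/refine_train_trajectory.py`: the step format (a dict with a `kind` field), the function names, and the toy rule for the verification stage are all hypothetical. Only one stage is shown; the other two would plug into the same list.

```python
from typing import Callable, Dict, List

Step = Dict[str, str]
Trajectory = List[Step]
Stage = Callable[[Trajectory], Trajectory]


def add_post_adaptation_verification(traj: Trajectory) -> Trajectory:
    """Toy stage: ensure every 'adapt' step is followed by a 'verify' step."""
    out: Trajectory = []
    for i, step in enumerate(traj):
        out.append(step)
        next_kind = traj[i + 1]["kind"] if i + 1 < len(traj) else None
        if step["kind"] == "adapt" and next_kind != "verify":
            out.append({"kind": "verify", "note": "re-check state after adapting"})
    return out


def refine(traj: Trajectory, stages: List[Stage]) -> Trajectory:
    """Apply the stages in order; each stage sees the previous stage's output."""
    for stage in stages:
        traj = stage(traj)
    return traj
```

Writing each stage as a pure function keeps the passes independent, so a stage can be tested or reordered on its own.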

Model	Easy	Medium	Hard	Impossible	Overall	Avg. Calls
Qwen-3-4B (baseline)	18.31	9.46	2.82	10.00	10.57	7.63
STT-Agent (w/o refine)	28.17	16.92	11.86	47.01	23.10	32.70
STT-Agent (refine)	26.76	17.41	13.56	61.11	25.11	15.30
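
The effect of refinement in the table above can be read off with a little arithmetic. The sketch below copies the Overall and Avg. Calls columns and computes the gain in score points and the relative reduction in tool calls; the helper names are hypothetical, and it assumes the Overall column is a percentage score.

```python
# Overall score and average tool calls, copied from the table above.
ROWS = {
    "Qwen-3-4B (baseline)": {"overall": 10.57, "avg_calls": 7.63},
    "STT-Agent (w/o refine)": {"overall": 23.10, "avg_calls": 32.70},
    "STT-Agent (refine)": {"overall": 25.11, "avg_calls": 15.30},
}


def point_gain(before: str, after: str) -> float:
    """Absolute difference in Overall score, in percentage points."""
    return ROWS[after]["overall"] - ROWS[before]["overall"]


def call_reduction_pct(before: str, after: str) -> float:
    """Relative drop in average tool calls, as a percentage of `before`."""
    b = ROWS[before]["avg_calls"]
    a = ROWS[after]["avg_calls"]
    return 100.0 * (b - a) / b


if __name__ == "__main__":
    print(round(point_gain("Qwen-3-4B (baseline)", "STT-Agent (refine)"), 2))
    print(round(point_gain("STT-Agent (w/o refine)", "STT-Agent (refine)"), 2))
    print(round(call_reduction_pct("STT-Agent (w/o refine)", "STT-Agent (refine)"), 1))
```

On these numbers, the refined model gains roughly 14.5 points over the baseline and about 2 points over the unrefined variant, while using about 53% fewer tool calls than the unrefined variant.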

🛠️ Usage

1. Installation

# Clone the repository
git clone https://github.com/YOUR-ORG/STT-Arena.git
cd STT-Arena

# Install dependencies
pip install -r requirements.txt

2. Benchmark Evaluation

Note: The benchmark data is available on HuggingFace. Place the downloaded data under evaluation/benchmark.

# Run the evaluation script directly; see the instructions inside the script for the parameters to fill in.
export OPENAI_API_KEY="Your api key"
export OPENAI_BASE_URL="Your base url"
bash run.sh
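
Before launching the script, it can help to confirm that the two environment variables are set and that the benchmark data is in place. The following pre-flight check is an optional sketch and is not part of the repository; the function name `preflight` is hypothetical, and it assumes only what the steps above state (the two variable names and the `evaluation/benchmark` location).

```python
import os
from pathlib import Path

REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "OPENAI_BASE_URL")


def preflight(benchmark_dir="evaluation/benchmark", env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    path = Path(benchmark_dir)
    if not path.is_dir() or not any(path.iterdir()):
        problems.append(f"{benchmark_dir} is missing or empty")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Run it from the repository root so that the relative benchmark path resolves correctly.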

3. Data Synthesis Pipeline

Note: Pre-built SFT trajectories and RL task sets are available on HuggingFace. Run the pipeline below only if you wish to synthesize new data. To do so, first download the source data (dolci, api-bank, toolace) and place it under environment_curation/source_data.

# Stage 1: Environment Curation
# Filter seed queries and synthesize static environments
bash environment_curation/stage1_environment_curation.sh

# Stage 2: Spatio-Temporal Dynamic Injection
# Assign conflict types and generate dynamic tasks
bash dynamic_injection/stage2_dynamic_injection.sh

# Stage 3: Dual-Agent Assessment
# Validate instances and generate checklists
bash dual_agent_check/stage3_dual_agent_assessment.sh

4. STT-Agent Training

SFT:

# We use the LlamaFactory framework to train the base models.
# First, prepare the data in LlamaFactory format:
bash SFT/prepare_llamafactory_data.sh

# Then place the formatted data in LlamaFactory's data folder and add the following entry to its dataset_info.json file:
"sft_refine": {
    "file_name": "sft_refine.json",
    "formatting": "alpaca",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output",
        "history": "history",
        "system": "system"
    }
}
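
If you prefer not to edit the JSON by hand, the entry above can be merged into the file with a few lines of Python. This helper is a hypothetical convenience and is not shipped with the repository or with LlamaFactory; the function name `register_dataset` is invented, and the path to `dataset_info.json` is an assumption you must adapt to your LlamaFactory checkout.

```python
import json
from pathlib import Path

# Same entry as shown above.
SFT_REFINE_ENTRY = {
    "file_name": "sft_refine.json",
    "formatting": "alpaca",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output",
        "history": "history",
        "system": "system",
    },
}


def register_dataset(dataset_info_path, name="sft_refine", entry=None):
    """Add or overwrite one dataset entry, leaving other entries untouched."""
    entry = SFT_REFINE_ENTRY if entry is None else entry
    path = Path(dataset_info_path)
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = entry
    path.write_text(
        json.dumps(info, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return info
```

Calling it twice is harmless, because the second call simply overwrites the same key.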

# Finally, start SFT using the SFT/qwen3.yaml configuration.

RL:

# We use the ROLL framework for online RL.
# The configuration files are provided in the RL folder; follow the ROLL documentation to launch RL training.

5. Trajectory Refinement

# Apply iterative trajectory refinement to raw SFT trajectories
python SFT/refine_train_trajectory.py

✉️ Contact

For questions or feedback, please open a GitHub issue or contact Tingfeng Hui.

🖊️ Citation

If you find this work helpful, please cite our paper:

@misc{hui2026sttarenarealisticenvironmenttoolusing,
      title={STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics}, 
      author={Tingfeng Hui and Hao Xu and Pengyu Zhu and Hongsheng Xin and Kun Zhan and Sen Su and Chunxiao Liu and Ning Miao},
      year={2026},
      eprint={2605.18548},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.18548}, 
}
