TL;DR:
- Over the weekend, quite unexpectedly, I made a multi-agent AI system that places slightly higher than Claude Code on Stanford's TerminalBench leaderboard (13th place).
- This AI system consists of an orchestration agent that dispatches multiple explorer and coder agents to do all the work.
- The orchestrator explicitly defines what knowledge artifacts subagents must return, then reuses and synthesises these artifacts across future tasks - creating compound intelligence where each action builds meaningfully on previous discoveries.
- 🆕 NEW: You can now use this system on your own repositories! See the Getting Started section below.
The orchestrator acts as the brain of the operation - it receives the user's task but never touches code directly. Instead, it:
- Analyses the task and breaks it into focused subtasks
- Dispatches explorer agents to understand the system
- Delegates implementation work to coder agents with precise instructions
- Verifies all changes through additional explorer agents
- Maintains the context store with all discovered knowledge
The orchestrator's lack of direct code access forces proper delegation and verification patterns, leading to more strategic solutions.
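To make the flow concrete, here is a minimal sketch of that delegation loop, assuming hypothetical `ExplorerAgent`, `CoderAgent`, context-store, and task-manager interfaces; it illustrates the pattern described above rather than the repository's actual implementation:

```python
class Orchestrator:
    """Sketch only: coordinates subagents without ever touching code itself."""

    def __init__(self, explorer, coder, context_store, task_manager):
        self.explorer = explorer      # read-only investigator (hypothetical interface)
        self.coder = coder            # implementation agent (hypothetical interface)
        self.store = context_store    # persistent knowledge artifacts
        self.tasks = task_manager     # operational memory of attempts

    def run(self, user_task: str) -> None:
        for subtask in self.decompose(user_task):             # 1. break into focused subtasks
            findings = self.explorer.investigate(subtask)      # 2. understand the system
            self.store.add_all(findings)                        #    persist the discoveries
            contexts = self.store.select_for(subtask)           # 3. inject only what is needed
            self.coder.implement(subtask, contexts=contexts)    #    delegate the actual edits
            report = self.explorer.verify(subtask)              # 4. verify with a read-only pass
            self.tasks.record(subtask, report)                  #    track progress and failures

    def decompose(self, user_task: str) -> list[str]:
        # In the real system this would be an LLM call; stubbed for illustration.
        return [user_task]
```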
For a full breakdown of this project's code structure, see here
TerminalBench is a brilliant benchmark created by Stanford and the Laude Institute to quantify agents' ability to complete complex tasks in the terminal. My Orchestrator system achieved 13th place on the leaderboard, demonstrating competitive performance against leading AI coding assistants.
I ran the Orchestrator evaluations with both Claude-4-Sonnet and Qwen3-Coder-480B-A35B:
This image shows Qwen3-Coder's performance on the benchmark; the screenshot towards the top of this README shows Sonnet-4's performance.
One of the most striking results is the difference in token usage between Sonnet-4 and Qwen3-Coder.
The table below shows the total tokens (input and output) processed across the TerminalBench evaluation run (5 attempts at 80 tasks = 400 trajectories).
| Model | Success Rate | Total Evaluation Cost | Token Usage |
|---|---|---|---|
| Claude Sonnet-4 | 37.0% | $263.56* | 93.2M tokens |
| Qwen3-Coder | 19.7% | $217.83 | 14.7M tokens |

*Claude Sonnet-4 costs reflect heavy caching usage, reducing actual API costs.
While all agents use the same underlying LLM, each operates with its own context window, specialised system message, and distinct toolset. This creates functionally different agents optimised for their specific roles.
System message
Role: Strategic coordinator and persistent intelligence layer
Capabilities: Task decomposition, context management, subagent delegation
Tools: Task creation, subagent launching, context store management
Restrictions: Cannot read or modify code directly - operates purely at architectural level
The orchestrator maintains the complete picture across all tasks, tracking discoveries and progress. It crafts precise task descriptions that explicitly specify what contexts subagents should return, ensuring focused and valuable information gathering.
Trust Calibration Strategy:
The orchestrator employs adaptive delegation based on task complexity (see the sketch after this list):
- Low Complexity Tasks: Grants extremely high autonomy to the coder agent for simple modifications and bug fixes
- Medium/Large Tasks: Maintains strong trust but uses iterative decomposition - breaking complex problems into atomic, verifiable steps
- Verification Philosophy: Uses explorer agents liberally to verify progress, especially when tasks involve critical functionality
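As a rough illustration of how complexity might map onto delegation choices, here is a small sketch; the thresholds and field names are my own assumptions, not the project's actual logic:

```python
def delegation_policy(complexity: str) -> dict:
    """Map estimated task complexity to a delegation strategy (illustrative only)."""
    if complexity == "low":
        # Simple modifications and bug fixes: one coder agent with high autonomy.
        return {"coder_autonomy": "high", "decompose": False, "verification_passes": 1}
    if complexity == "medium":
        # Keep trust in the coder high, but break work into atomic, verifiable steps.
        return {"coder_autonomy": "high", "decompose": True, "verification_passes": 1}
    # Large or critical tasks: iterative decomposition plus liberal explorer verification.
    return {"coder_autonomy": "medium", "decompose": True, "verification_passes": 2}
```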
System message
Role: Read-only investigation and verification specialist
Capabilities: System exploration, code analysis, test execution, verification
Tools: File reading, search operations (grep/glob), bash commands, temporary script creation
Restrictions: Cannot modify existing files - strictly read-only operations
Explorers gather intelligence about the codebase, verify implementations, and discover system behaviors. They create knowledge artifacts that eliminate redundant exploration for future agents.
System message
Role: Implementation specialist with write access
Capabilities: Code creation/modification, refactoring, bug fixes, system changes
Tools: Full file operations (read/write/edit), bash commands, search operations
Restrictions: None - full system access for implementation tasks
Coders transform architectural vision into working code. They receive focused tasks with relevant contexts and implement solutions while maintaining code quality and conventions.
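The three roles can be pictured as profiles over the same underlying LLM, differing only in system message and permitted tools. The tool names below are indicative of the restrictions described above, not the repository's exact identifiers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    role: str
    system_message: str
    tools: tuple[str, ...]


ORCHESTRATOR = AgentProfile(
    role="orchestrator",
    system_message="Coordinate: decompose tasks, delegate, verify. Never touch code directly.",
    tools=("create_task", "launch_subagent", "manage_context_store"),
)

EXPLORER = AgentProfile(
    role="explorer",
    system_message="Investigate and verify. Strictly read-only.",
    tools=("read_file", "grep", "glob", "bash", "write_temp_script"),
)

CODER = AgentProfile(
    role="coder",
    system_message="Implement the requested change while following existing conventions.",
    tools=("read_file", "write_file", "edit_file", "bash", "grep", "glob"),
)
```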
I introduced a novel approach to multi-agent coordination through the Context Store - a persistent knowledge layer that transforms isolated agent actions into coherent problem-solving. Unlike traditional multi-agent systems where agents operate in isolation, my architecture enables sophisticated knowledge accumulation and sharing.
The Context Store Pattern:
- Orchestrator-Directed Discovery: The orchestrator explicitly specifies what contexts it needs from each subagent, ensuring focused and relevant information gathering and implementation reporting
- Knowledge Artifacts: Subagents create discrete, reusable context items based on the orchestrator's requirements
- Persistent Memory: Contexts persist across agent interactions, building a comprehensive system understanding
- Selective Injection: The orchestrator injects only the relevant contexts into each new task, eliminating redundant discovery and giving the subagent everything it needs to complete its task
- Compound Intelligence: Each action builds meaningfully on previous discoveries, so problem-solving capability compounds across the run rather than resetting with each subagent
Key Benefits:
- Eliminates Redundant Work: Subagents never need to rediscover information that has already been captured
- Reduces Context Window Load: Agents receive only the specific contexts they need
- Enables Complex Solutions: Multi-step problems that no single agent could solve become tractable
- Maintains Focus: Each subagent operates with a clean, focused context window
This architecture ensures that every piece of discovered information becomes a permanent building block for future tasks, creating a system that genuinely learns and adapts throughout the problem-solving process.
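Here is a minimal sketch of the Context Store pattern; the `Context` and `ContextStore` classes and their fields are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    id: str          # e.g. "auth_module_layout"
    content: str     # the knowledge artifact a subagent returned
    created_by: str  # which subagent produced it


class ContextStore:
    def __init__(self):
        self._items: dict[str, Context] = {}

    def add(self, ctx: Context) -> None:
        """Persist a knowledge artifact so future tasks can reuse it."""
        self._items[ctx.id] = ctx

    def inject(self, context_ids: list[str]) -> str:
        """Selective injection: render only the contexts a new task actually needs."""
        selected = [self._items[cid] for cid in context_ids if cid in self._items]
        return "\n\n".join(f"## {c.id}\n{c.content}" for c in selected)


# An explorer's findings become a building block for a later coder task.
store = ContextStore()
store.add(Context("auth_module_layout", "auth/ contains views.py, tokens.py, ...", "explorer-1"))
coder_prompt = "Fix the token refresh bug.\n\n" + store.inject(["auth_module_layout"])
```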
The orchestrator maintains a comprehensive task management system that tracks all subagent activities:
Core Functions:
- Progress Tracking: Monitors task status (pending, completed, failed) across potentially hundreds of coordinated actions
- Failure Recovery: Captures failure reasons to enable strategic adaptation and intelligent retries
- Workflow Orchestration: Maintains clear audit trails of what's been attempted, preventing redundant work
- Strategic Planning: Enables systematic decomposition of complex problems into verifiable subtasks
The task manager serves as the orchestrator's operational memory - while the context store holds discovered knowledge, the task manager tracks the journey of discovery itself. This dual-layer system ensures the orchestrator always knows both what it has learned AND how it learned it, enabling sophisticated multi-step solutions that build intelligently on previous attempts.
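The dual-layer idea can be sketched as follows: the context store holds what was learned, while a task manager like the one below tracks how the work proceeded. The class and method names are assumed for illustration:

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class TaskRecord:
    description: str
    status: Status = Status.PENDING
    failure_reason: str = ""


class TaskManager:
    def __init__(self):
        self.tasks: dict[str, TaskRecord] = {}

    def create(self, task_id: str, description: str) -> None:
        self.tasks[task_id] = TaskRecord(description)

    def complete(self, task_id: str) -> None:
        self.tasks[task_id].status = Status.COMPLETED

    def fail(self, task_id: str, reason: str) -> None:
        # Captured failure reasons drive strategic adaptation and intelligent retries.
        self.tasks[task_id].status = Status.FAILED
        self.tasks[task_id].failure_reason = reason

    def audit_trail(self) -> list[tuple[str, str]]:
        """What has been attempted so far, preventing redundant work."""
        return [(task_id, record.status.value) for task_id, record in self.tasks.items()]
```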
During early evaluations I noticed that, while the system was on track to complete extremely complex tasks, it used so many subagents to get there that the task would time out.
The orchestrator therefore now follows a philosophy of time-efficient execution, recognising that wasted time often stems from poor task specification rather than slow execution (see the sketch after this list):
Prevention Principles:
- Front-Loading Precision: The orchestrator spends time crafting exact task descriptions rather than iterating on vague ones
- Context Completeness: Always over-provides context rather than under-providing, preventing subagents from rediscovering known information
- Explicit Expectations: Every task specifies exactly what contexts should be returned, eliminating unfocused exploration
- Tight Scoping: Defines clear boundaries - what to do AND what not to do, preventing scope creep
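For instance, a front-loaded task description might look like the template below; the section headings and field names are my own assumptions about what such a specification could contain, not the repository's actual format:

```python
CODER_TASK_TEMPLATE = """\
## Task
{objective}

## Provided contexts (do NOT rediscover these)
{injected_contexts}

## Scope
Do: {in_scope}
Do NOT: {out_of_scope}

## Contexts to return on completion
{expected_contexts}
"""

task_description = CODER_TASK_TEMPLATE.format(
    objective="Add retry logic to the payment client (payments/client.py).",
    injected_contexts="payment_client_layout, http_error_conventions",
    in_scope="modify payments/client.py and its unit tests only",
    out_of_scope="refactor unrelated modules or change public interfaces",
    expected_contexts="summary_of_changes, new_test_names",
)
```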
Installation:
```bash
# Clone the repository
git clone https://github.com/Suicynic/multi-agent-coding-system.git
cd multi-agent-coding-system

# Run the installer
bash install.sh

# Set your API key
export LITELLM_API_KEY="your-api-key-here"
```
Usage:
```bash
# Analyze your repository to see what the orchestrator can help with
python3 repo_analyzer.py

# Use the orchestrator on any task
python3 orchestrator_cli.py "Add comprehensive unit tests for the authentication module"

# Work on a different directory
python3 orchestrator_cli.py "Fix bugs in payment processing" --directory /path/to/your/project

# Use a different model
python3 orchestrator_cli.py "Add documentation" --model "openai/gpt-4"
```
What you can do:
- 🧪 Add comprehensive test suites with proper coverage and mocking
- 📚 Generate documentation with examples and API references
- 🐛 Fix bugs and issues with intelligent debugging and analysis
- ✨ Add new features with proper planning and implementation
- 🔧 Refactor code for better maintainability and performance
- 🔒 Improve security with vulnerability fixes and best practices
- 🎨 Enhance code quality with linting, formatting, and best practices
For detailed usage, examples, and troubleshooting, see USAGE.md.
```bash
uv sync
./run_terminal_bench_eval.sh
```
See /tests for testing different models on simple tasks.
- `orchestrator_cli.py` - Main CLI interface for using the orchestrator on any repository
- `repo_analyzer.py` - Tool to analyze repositories and suggest tasks
- `orchestrator_config.py` - Configuration management system
- `USAGE.md` - Comprehensive usage guide with examples
- `examples/` - Usage examples for different programming languages
- `src/` - Core orchestrator and agent implementation
- `tests/` - Test suite and model evaluation scripts
- When I originally ran the evaluations, I saw that my result would place me 12th. By the time of submission (24 hours later), my agent placed 13th. Just 48 hours after that, it had dropped to 15th! Such is the fascinating rate of progress in AI.
- Thank you to Taras for supporting the compute costs of my experiments
- Thank you to all the amazing teams at Anthropic that enabled me to leverage Claude-4-Sonnet, which is such an incredible model
- Likewise to the Qwen team for releasing Qwen3-Coder to the world, completely open source
- Thank you to the Claude Code team for inspiring my agentic tool use implementation and philosophy
- Thank you to the LiteLLM team for such a simple-to-use package, allowing me to switch providers with ease
- Thank you to OpenRouter for such a great routing service, allowing me to switch models with ease