TL;DR:
- Over the weekend, quite unexpectedly, I made a multi-agent AI system that places slightly higher than Claude Code on Stanford's TerminalBench leaderboard (13th place).
- This AI system consists of an orchestration agent that dispatches multiple explorer and coder agents to do all the work.
- The orchestrator explicitly defines what knowledge artifacts subagents must return, then reuses and synthesises these artifacts across future tasks - creating compound intelligence where each action builds meaningfully on previous discoveries.
- 🆕 NEW: You can now use this system on your own repositories! See the Getting Started section below.
The orchestrator acts as the brain of the operation - it receives the user's task but never touches code directly. Instead, it:
- Analyses the task and breaks it into focused subtasks
- Dispatches explorer agents to understand the system
- Delegates implementation work to coder agents with precise instructions
- Verifies all changes through additional explorer agents
- Maintains the context store with all discovered knowledge
The orchestrator's lack of direct code access forces proper delegation and verification patterns, leading to more strategic solutions.
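To make the flow concrete, here is a minimal sketch of that delegation loop, assuming hypothetical `ExplorerAgent`, `CoderAgent`, context-store, and task-manager interfaces; it illustrates the pattern described above rather than the repository's actual implementation:

```python
class Orchestrator:
    """Sketch only: coordinates subagents without ever touching code itself."""

    def __init__(self, explorer, coder, context_store, task_manager):
        self.explorer = explorer      # read-only investigator (hypothetical interface)
        self.coder = coder            # implementation agent (hypothetical interface)
        self.store = context_store    # persistent knowledge artifacts
        self.tasks = task_manager     # operational memory of attempts

    def run(self, user_task: str) -> None:
        for subtask in self.decompose(user_task):             # 1. break into focused subtasks
            findings = self.explorer.investigate(subtask)      # 2. understand the system
            self.store.add_all(findings)                        #    persist the discoveries
            contexts = self.store.select_for(subtask)           # 3. inject only what is needed
            self.coder.implement(subtask, contexts=contexts)    #    delegate the actual edits
            report = self.explorer.verify(subtask)              # 4. verify with a read-only pass
            self.tasks.record(subtask, report)                  #    track progress and failures

    def decompose(self, user_task: str) -> list[str]:
        # In the real system this would be an LLM call; stubbed for illustration.
        return [user_task]
```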
For a full breakdown of this project's code structure, see here
TerminalBench is a brilliant benchmark created by Stanford and the Laude Institute to quantify agents' ability to complete complex tasks in the terminal. My Orchestrator system achieved 13th place on the leaderboard, demonstrating competitive performance against leading AI coding assistants.
I ran the Orchestrator evaluations with both Claude-4-Sonnet and Qwen3-Coder-480B-A35B:
This image shows Qwen3-Coder's performance on the benchmark; the screenshot towards the top of this README shows Sonnet-4's performance.
One of the most striking results is the difference in token usage between Sonnet-4 and Qwen3-Coder.
The table below shows the total tokens (input and output) processed across the TerminalBench evaluation run (5 attempts at 80 tasks = 400 trajectories).
| Model | Success Rate | Total Evaluation Cost | Token Usage |
|---|---|---|---|
| Claude Sonnet-4 | 37.0% | $263.56* | 93.2M tokens |
| Qwen3-Coder | 19.7% | $217.83 | 14.7M tokens |

*Claude Sonnet-4 costs reflect heavy caching usage, reducing actual API costs.
While all agents use the same underlying LLM, each operates with its own context window, specialised system message, and distinct toolset. This creates functionally different agents optimised for their specific roles.
System message
Role: Strategic coordinator and persistent intelligence layer
Capabilities: Task decomposition, context management, subagent delegation
Tools: Task creation, subagent launching, context store management
Restrictions: Cannot read or modify code directly - operates purely at architectural level
The orchestrator maintains the complete picture across all tasks, tracking discoveries and progress. It crafts precise task descriptions that explicitly specify what contexts subagents should return, ensuring focused and valuable information gathering.
Trust Calibration Strategy:
The orchestrator employs adaptive delegation based on task complexity (see the sketch after this list):
- Low Complexity Tasks: Grants extremely high autonomy to the coder agent for simple modifications and bug fixes
- Medium/Large Tasks: Maintains strong trust but uses iterative decomposition - breaking complex problems into atomic, verifiable steps
- Verification Philosophy: Uses explorer agents liberally to verify progress, especially when tasks involve critical functionality
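As a rough illustration of how complexity might map onto delegation choices, here is a small sketch; the thresholds and field names are my own assumptions, not the project's actual logic:

```python
def delegation_policy(complexity: str) -> dict:
    """Map estimated task complexity to a delegation strategy (illustrative only)."""
    if complexity == "low":
        # Simple modifications and bug fixes: one coder agent with high autonomy.
        return {"coder_autonomy": "high", "decompose": False, "verification_passes": 1}
    if complexity == "medium":
        # Keep trust in the coder high, but break work into atomic, verifiable steps.
        return {"coder_autonomy": "high", "decompose": True, "verification_passes": 1}
    # Large or critical tasks: iterative decomposition plus liberal explorer verification.
    return {"coder_autonomy": "medium", "decompose": True, "verification_passes": 2}
```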
System message
Role: Read-only investigation and verification specialist
Capabilities: System exploration, code analysis, test execution, verification
Tools: File reading, search operations (grep/glob), bash commands, temporary script creation
Restrictions: Cannot modify existing files - strictly read-only operations
Explorers gather intelligence about the codebase, verify implementations, and discover system behaviors. They create knowledge artifacts that eliminate redundant exploration for future agents.
System message
Role: Implementation specialist with write access
Capabilities: Code creation/modification, refactoring, bug fixes, system changes
Tools: Full file operations (read/write/edit), bash commands, search operations
Restrictions: None - full system access for implementation tasks
Coders transform architectural vision into working code. They receive focused tasks with relevant contexts and implement solutions while maintaining code quality and conventions.
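The three roles can be pictured as profiles over the same underlying LLM, differing only in system message and permitted tools. The tool names below are indicative of the restrictions described above, not the repository's exact identifiers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    role: str
    system_message: str
    tools: tuple[str, ...]


ORCHESTRATOR = AgentProfile(
    role="orchestrator",
    system_message="Coordinate: decompose tasks, delegate, verify. Never touch code directly.",
    tools=("create_task", "launch_subagent", "manage_context_store"),
)

EXPLORER = AgentProfile(
    role="explorer",
    system_message="Investigate and verify. Strictly read-only.",
    tools=("read_file", "grep", "glob", "bash", "write_temp_script"),
)

CODER = AgentProfile(
    role="coder",
    system_message="Implement the requested change while following existing conventions.",
    tools=("read_file", "write_file", "edit_file", "bash", "grep", "glob"),
)
```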
I introduced a novel approach to multi-agent coordination through the Context Store - a persistent knowledge layer that transforms isolated agent actions into coherent problem-solving. Unlike traditional multi-agent systems where agents operate in isolation, my architecture enables sophisticated knowledge accumulation and sharing.
The Context Store Pattern:
- Orchestrator-Directed Discovery: The orchestrator explicitly specifies what contexts it needs from each subagent, ensuring focused and relevant information gathering and implementation reporting
- Knowledge Artifacts: Subagents create discrete, reusable context items based on the orchestrator's requirements
- Persistent Memory: Contexts persist across agent interactions, building a comprehensive system understanding
- Selective Injection: The orchestrator injects only the relevant contexts into each new task, eliminating redundant discovery and giving the subagent everything it needs to complete its task
- Compound Intelligence: Each action builds meaningfully on previous discoveries, so problem-solving capability compounds across the run rather than resetting with each subagent
Key Benefits:
- Eliminates Redundant Work: Subagents never need to rediscover information that has already been captured
- Reduces Context Window Load: Agents receive only the specific contexts they need
- Enables Complex Solutions: Multi-step problems that no single agent could solve become tractable
- Maintains Focus: Each subagent operates with a clean, focused context window
This architecture ensures that every piece of discovered information becomes a permanent building block for future tasks, creating a system that genuinely learns and adapts throughout the problem-solving process.
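Here is a minimal sketch of the Context Store pattern; the `Context` and `ContextStore` classes and their fields are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    id: str          # e.g. "auth_module_layout"
    content: str     # the knowledge artifact a subagent returned
    created_by: str  # which subagent produced it


class ContextStore:
    def __init__(self):
        self._items: dict[str, Context] = {}

    def add(self, ctx: Context) -> None:
        """Persist a knowledge artifact so future tasks can reuse it."""
        self._items[ctx.id] = ctx

    def inject(self, context_ids: list[str]) -> str:
        """Selective injection: render only the contexts a new task actually needs."""
        selected = [self._items[cid] for cid in context_ids if cid in self._items]
        return "\n\n".join(f"## {c.id}\n{c.content}" for c in selected)


# An explorer's findings become a building block for a later coder task.
store = ContextStore()
store.add(Context("auth_module_layout", "auth/ contains views.py, tokens.py, ...", "explorer-1"))
coder_prompt = "Fix the token refresh bug.\n\n" + store.inject(["auth_module_layout"])
```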
The orchestrator maintains a comprehensive task management system that tracks all subagent activities:
Core Functions:
- Progress Tracking: Monitors task status (pending, completed, failed) across potentially hundreds of coordinated actions
- Failure Recovery: Captures failure reasons to enable strategic adaptation and intelligent retries
- Workflow Orchestration: Maintains clear audit trails of what's been attempted, preventing redundant work
- Strategic Planning: Enables systematic decomposition of complex problems into verifiable subtasks
The task manager serves as the orchestrator's operational memory - while the context store holds discovered knowledge, the task manager tracks the journey of discovery itself. This dual-layer system ensures the orchestrator always knows both what it has learned AND how it learned it, enabling sophisticated multi-step solutions that build intelligently on previous attempts.
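The dual-layer idea can be sketched as follows: the context store holds what was learned, while a task manager like the one below tracks how the work proceeded. The class and method names are assumed for illustration:

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class TaskRecord:
    description: str
    status: Status = Status.PENDING
    failure_reason: str = ""


class TaskManager:
    def __init__(self):
        self.tasks: dict[str, TaskRecord] = {}

    def create(self, task_id: str, description: str) -> None:
        self.tasks[task_id] = TaskRecord(description)

    def complete(self, task_id: str) -> None:
        self.tasks[task_id].status = Status.COMPLETED

    def fail(self, task_id: str, reason: str) -> None:
        # Captured failure reasons drive strategic adaptation and intelligent retries.
        self.tasks[task_id].status = Status.FAILED
        self.tasks[task_id].failure_reason = reason

    def audit_trail(self) -> list[tuple[str, str]]:
        """What has been attempted so far, preventing redundant work."""
        return [(task_id, record.status.value) for task_id, record in self.tasks.items()]
```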
During early evaluations I noticed that, while the system was on track to complete extremely complex tasks, it used so many subagents to get there that the task would time out.
The orchestrator therefore now follows a philosophy of time-efficient execution, recognising that wasted time often stems from poor task specification rather than slow execution (see the sketch after this list):
Prevention Principles:
- Front-Loading Precision: The orchestrator spends time crafting exact task descriptions rather than iterating on vague ones
- Context Completeness: Always over-provides context rather than under-providing, preventing subagents from rediscovering known information
- Explicit Expectations: Every task specifies exactly what contexts should be returned, eliminating unfocused exploration
- Tight Scoping: Defines clear boundaries - what to do AND what not to do, preventing scope creep
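For instance, a front-loaded task description might look like the template below; the section headings and field names are my own assumptions about what such a specification could contain, not the repository's actual format:

```python
CODER_TASK_TEMPLATE = """\
## Task
{objective}

## Provided contexts (do NOT rediscover these)
{injected_contexts}

## Scope
Do: {in_scope}
Do NOT: {out_of_scope}

## Contexts to return on completion
{expected_contexts}
"""

task_description = CODER_TASK_TEMPLATE.format(
    objective="Add retry logic to the payment client (payments/client.py).",
    injected_contexts="payment_client_layout, http_error_conventions",
    in_scope="modify payments/client.py and its unit tests only",
    out_of_scope="refactor unrelated modules or change public interfaces",
    expected_contexts="summary_of_changes, new_test_names",
)
```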
Installation:
```bash
# Clone the repository
git clone https://github.com/Suicynic/multi-agent-coding-system.git
cd multi-agent-coding-system

# Run the installer
bash install.sh

# Set your API key
export LITELLM_API_KEY="your-api-key-here"
```
Usage:
```bash
# Analyze your repository to see what the orchestrator can help with
python3 repo_analyzer.py

# Use the orchestrator on any task
python3 orchestrator_cli.py "Add comprehensive unit tests for the authentication module"

# Work on a different directory
python3 orchestrator_cli.py "Fix bugs in payment processing" --directory /path/to/your/project

# Use a different model
python3 orchestrator_cli.py "Add documentation" --model "openai/gpt-4"
```
What you can do:
- 🧪 Add comprehensive test suites with proper coverage and mocking
- 📚 Generate documentation with examples and API references
- 🐛 Fix bugs and issues with intelligent debugging and analysis
- ✨ Add new features with proper planning and implementation
- 🔧 Refactor code for better maintainability and performance
- 🔒 Improve security with vulnerability fixes and best practices
- 🎨 Enhance code quality with linting, formatting, and best practices
For detailed usage, examples, and troubleshooting, see USAGE.md.
```bash
uv sync
./run_terminal_bench_eval.sh
```
See /tests for testing different models on simple tasks.
- `orchestrator_cli.py` - Main CLI interface for using the orchestrator on any repository
- `repo_analyzer.py` - Tool to analyze repositories and suggest tasks
- `orchestrator_config.py` - Configuration management system
- `USAGE.md` - Comprehensive usage guide with examples
- `examples/` - Usage examples for different programming languages
- `src/` - Core orchestrator and agent implementation
- `tests/` - Test suite and model evaluation scripts
- When I originally ran the evaluations, I saw that my result would place me 12th. By the time of submission (24 hours later), my agent placed 13th. Just 48 hours after that, it had dropped to 15th! Such is the fascinating rate of progress in AI.
- Thank you to Taras for supporting the compute costs of my experiments
- Thank you to all the amazing teams at Anthropic that enabled me to leverage Claude-4-Sonnet, which is such an incredible model
- Likewise to the Qwen team for releasing Qwen3-Coder to the world, completely open source
- Thank you to the Claude Code team for inspiring my agentic tool use implementation and philosophy
- Thank you to the LiteLLM team for such a simple-to-use package, allowing me to switch providers with ease
- Thank you to OpenRouter for such a great routing service, allowing me to switch models with ease