A multi-agent software development system that transforms natural language into production-ready code — and a framework comparison platform for evaluating orchestration approaches side by side.
- Build an autonomous AI team that accepts a project description and delivers requirements, architecture, code, tests, and deployment artifacts end-to-end.
- Compare multiple orchestration frameworks running the same agents, tools, guardrails, and demos — measuring output quality, cost, latency, and developer experience.
- Evaluate which framework best suits which use case, from quick prototypes to enterprise pipelines.
The multi-agent framework landscape is moving fast. CrewAI, LangGraph, Claude Agent SDK, AutoGen, AWS Bedrock Agents — each makes different trade-offs around orchestration control, state management, human-in-the-loop, persistence, streaming, and cost. Rather than pick one and hope, this project runs the same team through multiple backends and lets the data decide.
All backends share the same Backend protocol, tools, guardrails, Pydantic models, and team profiles. Swap at runtime with --backend:
| Backend | Status | Orchestration model | LLM provider | Key strengths |
|---|---|---|---|---|
| CrewAI | Production | Crews + Flows (@start, @listen, @router) | OpenRouter | Simple setup, hierarchical process, built-in delegation |
| LangGraph | Production | StateGraph with nodes + conditional edges | OpenRouter | Explicit routing, checkpointing, time-travel, state inspection |
| Claude Agent SDK | Available | Nested subagents + session persistence | Anthropic API | Extended thinking, prompt caching, file rollback, native MCP, streaming |
ai-team --backend crewai "Build a REST API" # CrewAI (default)
ai-team --backend langgraph "Build a REST API" # LangGraph
ai-team --backend claude-agent-sdk "Build a REST API" # Claude Agent SDK (ANTHROPIC_API_KEY + Claude Code)

Run the same demo through multiple backends and compare:
python scripts/compare_backends.py demos/01_hello_world --env dev
python scripts/compare_backends.py demos/01_hello_world --env dev --with-claude

Produces a side-by-side report: output quality, cost, latency, token usage, error rate. Use --with-claude to include the Claude Agent SDK (requires ANTHROPIC_API_KEY).
The multi-backend architecture supports adding new frameworks by implementing the Backend protocol:
- AutoGen — Microsoft's multi-agent framework
- AWS Bedrock Agents — managed agent service
- Strands — AWS open-source agent SDK
- Custom — bare LLM calls with manual orchestration
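As a rough illustration of what implementing the Backend protocol might involve, here is a minimal sketch. The `run` and `stream` signatures follow the architecture diagram in this README; the fields of `ProjectResult` and `StreamEvent`, the synchronous `run`, and the `EchoBackend` class are assumptions made for the example, not the project's actual definitions.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Protocol


@dataclass
class ProjectResult:
    """Hypothetical result container; the real fields may differ."""
    success: bool
    artifacts: dict[str, str] = field(default_factory=dict)


@dataclass
class StreamEvent:
    """Hypothetical streaming event."""
    kind: str
    payload: str


class Backend(Protocol):
    """Structural interface: any class with these two methods qualifies."""

    def run(self, description: str, team: str, env: str) -> ProjectResult: ...

    def stream(self, description: str, team: str, env: str) -> AsyncIterator[StreamEvent]: ...


class EchoBackend:
    """Toy backend: makes no LLM calls, it only echoes the request back."""

    def run(self, description: str, team: str, env: str) -> ProjectResult:
        return ProjectResult(success=True, artifacts={"README.md": f"{description} [{team}/{env}]"})

    async def stream(self, description: str, team: str, env: str) -> AsyncIterator[StreamEvent]:
        yield StreamEvent("started", description)
        yield StreamEvent("finished", team)


async def collect_kinds(backend: Backend, description: str) -> list[str]:
    """Drain a backend's event stream into a list of event kinds."""
    return [event.kind async for event in backend.stream(description, "prototype", "dev")]


backend: Backend = EchoBackend()
result = backend.run("Build a REST API", "prototype", "dev")
kinds = asyncio.run(collect_kinds(backend, "Build a REST API"))
```

Because `Backend` is a `typing.Protocol`, a new framework adapter would not need to inherit from anything; it only has to expose the same two methods.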
Not every project needs all 9 agents. Select a profile with --team:
| Profile | Agents | Use case |
|---|---|---|
| full (default) | All 9 agents, all phases | Full software project |
| backend-api | Manager, PO, Architect, Backend Dev, QA, DevOps | REST API / microservice |
| frontend-app | Manager, PO, Architect, Frontend Dev, QA, DevOps | SPA / static site |
| data-pipeline | Manager, PO, Architect, Backend Dev, QA | ETL / data engineering |
| prototype | Architect, Fullstack Dev, QA | Minimal design → build → test |
| infra-only | Architect, DevOps, Cloud | IaC / CI-CD only |
| research-optimizer | Optimizer | Karpathy AutoOptimizer Loop (see below) |
Canonical reference (agents, phases, backend parity, demos): docs/TEAM_PROFILES.md.
Source: src/ai_team/config/team_profiles.yaml.
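To make the profile idea concrete, the sketch below models profile selection as a plain lookup. The agent lists are copied from the profile table in this README; the dictionary layout and the `agents_for` helper are hypothetical and do not reflect the actual schema of team_profiles.yaml.

```python
# Agents per profile, copied from the README table.
# The dict layout and helper name are assumptions for illustration only.
TEAM_PROFILES: dict[str, list[str]] = {
    "full": [
        "Manager", "PO", "Architect", "Backend Dev", "Frontend Dev",
        "Fullstack Dev", "DevOps", "Cloud", "QA",
    ],
    "backend-api": ["Manager", "PO", "Architect", "Backend Dev", "QA", "DevOps"],
    "frontend-app": ["Manager", "PO", "Architect", "Frontend Dev", "QA", "DevOps"],
    "data-pipeline": ["Manager", "PO", "Architect", "Backend Dev", "QA"],
    "prototype": ["Architect", "Fullstack Dev", "QA"],
    "infra-only": ["Architect", "DevOps", "Cloud"],
    "research-optimizer": ["Optimizer"],
}


def agents_for(profile: str = "full") -> list[str]:
    """Return the agents for a profile, failing loudly on unknown names."""
    try:
        return TEAM_PROFILES[profile]
    except KeyError:
        known = ", ".join(sorted(TEAM_PROFILES))
        raise ValueError(f"unknown team profile {profile!r}; expected one of: {known}") from None


prototype_team = agents_for("prototype")
```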
| Feature | Description |
|---|---|
| 9 specialized agents | Manager, Product Owner, Architect, Backend/Frontend/Fullstack Developers, DevOps, Cloud, QA |
| End-to-end workflow | Intake → Planning → Development → Testing → Deployment |
| Multi-backend | Same team, same demos, different orchestration — compare results |
| Team profiles | Right-size the team for the use case (--team backend-api, --team prototype, ...) |
| Enterprise guardrails | Behavioral (role, scope), security (code safety, PII, secrets), quality (syntax, completeness) |
| MCP servers | Per-team, per-agent MCP tool providers (GitHub, filesystem, Docker, Postgres) |
| RAG knowledge | Static best practices + dynamic project knowledge, scoped per agent role |
| Self-improvement reports | Each run produces a manager report that summarizes failures, references prior lessons, and proposes corrective actions |
| AutoOptimizer Loop | Karpathy-style autonomous edit→run→measure→keep/revert loop for iterative code optimization |
| Observable | Web dashboard (FastAPI + React), Textual TUI, Rich CLI monitor, structured logging, cost tracking |
AI-Team automatically turns run outcomes into actionable feedback:
- Report: after each run, the Manager writes a manager_self_improvement_report.md (and .json) under output/runs/<run_id>/reports/.
- Learn: failures are recorded in long-term memory and can be promoted into role-scoped “lessons”.
- Inject: promoted lessons are injected into the next run’s agent prompts (by role) to reduce repeat failures.
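The Learn and Inject steps above could look roughly like the following. This is only a sketch: `LessonStore`, its method names, and the prompt layout are hypothetical, and the real project persists lessons in long-term memory rather than in an in-process dictionary.

```python
from collections import defaultdict


class LessonStore:
    """Hypothetical in-memory stand-in for role-scoped lessons."""

    def __init__(self) -> None:
        self._lessons: dict[str, list[str]] = defaultdict(list)

    def promote(self, role: str, lesson: str) -> None:
        """Record a lesson for a role, skipping exact duplicates."""
        if lesson not in self._lessons[role]:
            self._lessons[role].append(lesson)

    def inject(self, role: str, base_prompt: str) -> str:
        """Append the role's lessons to its prompt; other roles are untouched."""
        lessons = self._lessons.get(role, [])
        if not lessons:
            return base_prompt
        bullets = "\n".join(f"- {lesson}" for lesson in lessons)
        return f"{base_prompt}\n\nLessons from previous runs:\n{bullets}"


store = LessonStore()
store.promote("QA Engineer", "Only write test code; do not modify production source.")
store.promote("QA Engineer", "Only write test code; do not modify production source.")
qa_prompt = store.inject("QA Engineer", "You are the QA Engineer.")
architect_prompt = store.inject("Architect", "You are the Architect.")
```

Scoping by role keeps one agent's failure from bloating every other agent's prompt.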
See a full example report in docs/manager_self_improvement_report.md.
Inspired by Andrej Karpathy's overnight experiment runs, the AutoOptimizer Loop is a tight autonomous cycle that iteratively improves a target metric (e.g. test pass rate, requests/sec, latency) on any workspace:
- Snapshot — records current workspace state
- Edit — agent proposes and applies ONE focused change
- Measure — runs the evaluation command and reads the metric
- Keep or revert — commits winning changes to a dedicated branch; restores snapshot on regression
- Learn — ingests an experiment lesson into RAG so the next iteration builds on what worked
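The cycle above can be sketched in a few lines. This is a toy version under stated assumptions: the workspace is an in-memory dict instead of a git repository, a deep copy stands in for the snapshot and branch commit, the edit proposer is a fixed function instead of an agent, and lessons go into a list instead of RAG. The `optimize`, `propose_edit`, and `measure` names are hypothetical.

```python
import copy
from typing import Callable

Workspace = dict[str, float]


def optimize(
    workspace: Workspace,
    propose_edit: Callable[[Workspace, int], str],
    measure: Callable[[Workspace], float],
    max_experiments: int,
    higher_is_better: bool = True,
) -> tuple[float, list[str]]:
    """Run the snapshot, edit, measure, keep-or-revert, learn cycle."""
    best = measure(workspace)
    lessons: list[str] = []
    for i in range(max_experiments):
        snapshot = copy.deepcopy(workspace)       # Snapshot
        description = propose_edit(workspace, i)  # Edit: one change, applied in place
        score = measure(workspace)                # Measure
        improved = score > best if higher_is_better else score < best
        if improved:
            best = score                          # Keep the winning change
        else:
            workspace.clear()                     # Revert to the snapshot
            workspace.update(snapshot)
        outcome = "kept" if improved else "reverted"
        lessons.append(f"{description}: {outcome} (score={score})")  # Learn
    return best, lessons


def propose(ws: Workspace, i: int) -> str:
    """Toy proposer: doubles batch_size on even steps, halves it on odd steps."""
    factor = 2.0 if i % 2 == 0 else 0.5
    ws["batch_size"] *= factor
    return f"scale batch_size by {factor}"


def score_fn(ws: Workspace) -> float:
    """Toy metric: highest (zero) when batch_size equals 8."""
    return -abs(ws["batch_size"] - 8.0)


workspace: Workspace = {"batch_size": 1.0}
best, lessons = optimize(workspace, propose, score_fn, max_experiments=5)
```

In this toy run, every halving step is reverted and every doubling step is kept, so the workspace ends at `batch_size == 8.0`.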
# Run 20 optimization experiments, budget $2
ai-team optimize ./workspace/todo-api \
--metric demos/06_karpathy_optimization/metric.yaml \
--budget 2.00 \
--max-experiments 20
# Full options
ai-team optimize ./workspace/my-app \
--metric metric.yaml \
--strategy strategy.md \
--backend claude-agent-sdk \
--team research-optimizer \
--budget 5.00 \
--max-experiments 50 \
--branch optimize/my-run \
--editable src/ lib/

| Flag | Description | Default |
|---|---|---|
| --metric | Path to metric.yaml (name, evaluation_command, direction) | required |
| --strategy | Path to a Markdown hints file for the optimizer agent | — |
| --backend | Backend to use | claude-agent-sdk |
| --team | Team profile | research-optimizer |
| --budget | Total USD budget across all experiments | 10.0 |
| --max-experiments | Max iterations | 50 |
| --branch | Git branch for winning commits | optimize/karpathy-loop |
| --editable | Paths the agent may edit (informational; enforced via prompt) | src/ |
Results land in logs/experiments.jsonl inside the workspace. See demos/06_karpathy_optimization/ for a ready-to-run example and docs/EVALS.md for eval methodology.
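For readers wondering how a metric definition might drive the keep/revert decision, here is a hedged sketch. The three field names (name, evaluation_command, direction) come from the flag description in this README, and the 0.5% threshold and 300-second timeout mirror the OPTIMIZER_MIN_IMPROVEMENT_PCT and OPTIMIZER_TIMEOUT_PER_EXPERIMENT defaults. Everything else is an assumption: the `"maximize"`/`"minimize"` values, the convention that the metric is the last line of stdout, and the helper names.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricConfig:
    name: str
    evaluation_command: str
    direction: str  # assumed values: "maximize" or "minimize"


def run_metric(config: MetricConfig, timeout_s: float = 300.0) -> float:
    """Run the evaluation command and parse the last stdout line as a number (assumed convention)."""
    completed = subprocess.run(
        shlex.split(config.evaluation_command),
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return float(completed.stdout.strip().splitlines()[-1])


def is_improvement(
    config: MetricConfig, baseline: float, candidate: float, min_improvement_pct: float = 0.5
) -> bool:
    """True when the candidate beats the baseline by at least the given percentage."""
    sign = 1.0 if config.direction == "maximize" else -1.0
    if baseline == 0:
        # A percentage is undefined here, so accept any move in the right direction.
        return sign * candidate > 0
    change_pct = sign * (candidate - baseline) / abs(baseline) * 100.0
    return change_pct >= min_improvement_pct


# Stand-in evaluation command: a Python one-liner that prints a fixed number.
throughput = MetricConfig(
    name="requests_per_sec",
    evaluation_command=f'{shlex.quote(sys.executable)} -c "print(42.5)"',
    direction="maximize",
)
latency = MetricConfig(name="latency_ms", evaluation_command="", direction="minimize")
measured = run_metric(throughput)
```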
Example excerpt:
## This run: problems observed
1. **GuardrailError** (phase: `testing`): QA Engineer should only write test code, not modify production source.
## Proposed self-improvement actions
- Calibrate behavioral guardrails for QA/testing: reduce false positives when outputs are verbose but still test-scoped; consider role-specific relevance thresholds.

Planning output (requirements):

Planning output (architecture):
Detailed log of this project: docs/journey.md.
┌──────────────────────────────────────────────────────────────────────────────┐
│ UI Layer (3 interfaces) │
│ Web Dashboard (FastAPI+React) │ Textual TUI │ Rich CLI Monitor │
│ --backend crewai | langgraph | claude-agent-sdk --team <profile> │
└──────────────────────────────┬───────────────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ Backend Protocol (core/) │
│ run(description, team, env) → ProjectResult │
│ stream(description, team, env) → AsyncIterator[StreamEvent] │
└──────┬───────────────────────┬───────────────────────┬───────────────────────┘
▼ ▼ ▼
┌──────────────┐ ┌───────────────────┐ ┌─────────────────────┐
│ CrewAI │ │ LangGraph │ │ Claude Agent SDK │
│ Crews+Flows │ │ StateGraph+nodes │ │ Nested subagents │
│ @start, │ │ conditional │ │ session persistence │
│ @listen, │ │ edges, subgraphs │ │ hooks, skills, │
│ @router │ │ checkpointing │ │ extended thinking │
└──────┬───────┘ └─────────┬─────────┘ └──────────┬──────────┘
└─────────────────────┼────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ SHARED LAYERS │
│ Tools: file · code · git · test │ MCP servers (per-team, per-agent) │
│ Guardrails: behavioral · security · quality │
│ RAG: static knowledge · project indexing · session knowledge │
│ Memory: session + long-term │ Team profiles (config/team_profiles.yaml) │
└──────────────────────────────────────────────────────────────────────────────┘
See docs/ARCHITECTURE.md for full design.
git clone https://github.com/RickZee/ai-team.git && cd ai-team
cp .env.example .env # Add API keys (see Configuration below)
poetry install
# Run with CrewAI (default)
poetry run ai-team "Create a REST API for a todo list"
# Run with LangGraph
poetry run ai-team --backend langgraph --team backend-api "Create a REST API for a todo list"
# Run with Claude Agent SDK
poetry run ai-team --backend claude-agent-sdk --team backend-api "Create a REST API for a todo list"

For step-by-step setup and troubleshooting, see docs/GETTING_STARTED.md.

| Variable | Description | Default |
|---|---|---|
| OPENROUTER_API_KEY | OpenRouter API key (CrewAI / LangGraph backends) | — |
| ANTHROPIC_API_KEY | Anthropic API key (Claude Agent SDK backend) | — |
| AI_TEAM_ENV | Tier: dev, test, prod | dev |
| AI_TEAM_BACKEND | Default backend: crewai, langgraph, claude-agent-sdk | crewai |
| AI_TEAM_LANGGRAPH_POSTGRES_URI | Postgres URI for LangGraph checkpointing (optional) | SQLite |
| OPENROUTER_API_BASE | OpenRouter endpoint | https://openrouter.ai/api/v1 |
| OPENROUTER_EMBEDDING_MODEL | Embedding model for crew memory | openai/text-embedding-3-small |
| GUARDRAIL_MAX_RETRIES | Max guardrail retries | 3 |
| CODE_QUALITY_MIN_SCORE | Min quality score (0–1) | 0.7 |
| TEST_COVERAGE_MIN | Min test coverage (0–1) | 0.6 |
| MAX_FILE_SIZE_KB | Max file size for tools (KB) | 500 |
| OPTIMIZER_MAX_EXPERIMENTS | Default max iterations for AutoOptimizer | 50 |
| OPTIMIZER_BUDGET_USD | Default total budget (USD) | 10.0 |
| OPTIMIZER_TIMEOUT_PER_EXPERIMENT | Per-experiment timeout (seconds) | 300 |
| OPTIMIZER_MIN_IMPROVEMENT_PCT | Min improvement % to keep a commit | 0.5 |
| OPTIMIZER_MAX_BUDGET_PER_EXPERIMENT_USD | Per-experiment budget cap | 1.0 |
| OPTIMIZER_MAX_TURNS_PER_EXPERIMENT | Max agent turns per experiment | 40 |
| OPTIMIZER_DEFAULT_BACKEND | Default backend for optimizer | claude-agent-sdk |
Copy .env.example to .env and set the API key for your chosen backend. Before each run, a pre-flight check validates configured models. Agent→model mapping and guardrail behavior are documented in docs/AGENTS.md and docs/GUARDRAILS.md.
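As an illustration of how a few of these variables and their documented defaults might be read and validated, consider the sketch below. The variable names, defaults, and allowed values come from the configuration table in this README; the `Settings` dataclass and `load_settings` function are hypothetical and are not the project's actual settings module.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_ENVS = ("dev", "test", "prod")
VALID_BACKENDS = ("crewai", "langgraph", "claude-agent-sdk")


@dataclass(frozen=True)
class Settings:
    env: str
    backend: str
    guardrail_max_retries: int
    code_quality_min_score: float
    test_coverage_min: float


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the documented variables, applying the documented defaults."""
    env = environ.get("AI_TEAM_ENV", "dev")
    backend = environ.get("AI_TEAM_BACKEND", "crewai")
    if env not in VALID_ENVS:
        raise ValueError(f"AI_TEAM_ENV must be one of {VALID_ENVS}, got {env!r}")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"AI_TEAM_BACKEND must be one of {VALID_BACKENDS}, got {backend!r}")
    settings = Settings(
        env=env,
        backend=backend,
        guardrail_max_retries=int(environ.get("GUARDRAIL_MAX_RETRIES", "3")),
        code_quality_min_score=float(environ.get("CODE_QUALITY_MIN_SCORE", "0.7")),
        test_coverage_min=float(environ.get("TEST_COVERAGE_MIN", "0.6")),
    )
    # Both thresholds are documented as fractions in the 0 to 1 range.
    for name in ("code_quality_min_score", "test_coverage_min"):
        value = getattr(settings, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return settings


defaults = load_settings({})
overridden = load_settings({"AI_TEAM_ENV": "prod", "AI_TEAM_BACKEND": "langgraph"})
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.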
Model IDs are in openrouter/<provider>/<model> form (see src/ai_team/config/models.py). Set AI_TEAM_ENV to dev, test, or prod to choose a tier.
| Role | dev | test | prod |
|---|---|---|---|
| Manager | deepseek/deepseek-chat-v3-0324 | google/gemini-3-flash-preview | anthropic/claude-sonnet-4 |
| Product Owner | deepseek/deepseek-chat-v3-0324 | google/gemini-3-flash-preview | openai/gpt-5.2 |
| Architect | deepseek/deepseek-chat-v3-0324 | deepseek/deepseek-r1-0528 | anthropic/claude-sonnet-4 |
| Backend Developer | mistralai/devstral-2512 | minimax/minimax-m2 | openai/gpt-5.3-codex |
| Frontend Developer | mistralai/devstral-2512 | minimax/minimax-m2 | anthropic/claude-sonnet-4 |
| Fullstack Developer | mistralai/devstral-2512 | minimax/minimax-m2 | openai/gpt-5.3-codex |
| Cloud Engineer | deepseek/deepseek-chat-v3-0324 | deepseek/deepseek-r1-0528 | anthropic/claude-sonnet-4 |
| DevOps | mistralai/devstral-2512 | mistralai/devstral-2512 | openai/gpt-5.3-codex |
| QA Engineer | deepseek/deepseek-chat-v3-0324 | deepseek/deepseek-r1-0528 | anthropic/claude-sonnet-4 |
Embeddings (crew memory) use OPENROUTER_EMBEDDING_MODEL (default: openai/text-embedding-3-small). Current IDs and pricing: OpenRouter models.
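The role-to-model mapping can be pictured as a nested lookup keyed by role and tier. The sketch below uses three rows from the model table in this README and the documented openrouter/<provider>/<model> form; the `MODEL_TIERS` name and the `model_id` helper are hypothetical and may not match what models.py actually contains.

```python
# Three rows copied from the README's model table; the structure is an assumption.
MODEL_TIERS: dict[str, dict[str, str]] = {
    "Manager": {
        "dev": "deepseek/deepseek-chat-v3-0324",
        "test": "google/gemini-3-flash-preview",
        "prod": "anthropic/claude-sonnet-4",
    },
    "Backend Developer": {
        "dev": "mistralai/devstral-2512",
        "test": "minimax/minimax-m2",
        "prod": "openai/gpt-5.3-codex",
    },
    "QA Engineer": {
        "dev": "deepseek/deepseek-chat-v3-0324",
        "test": "deepseek/deepseek-r1-0528",
        "prod": "anthropic/claude-sonnet-4",
    },
}


def model_id(role: str, env: str = "dev") -> str:
    """Return the model ID for a role and tier in openrouter/<provider>/<model> form."""
    try:
        return f"openrouter/{MODEL_TIERS[role][env]}"
    except KeyError:
        raise ValueError(f"no model configured for role={role!r}, env={env!r}") from None


manager_prod = model_id("Manager", "prod")
```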
Six ready-to-run scenarios that exercise the full pipeline:
| # | Demo | Description |
|---|---|---|
| 1 | 01_hello_world | Minimal Flask REST API — health, items CRUD, pytest, Dockerfile |
| 2 | 02_todo_app | Full-stack TODO app — Flask + SQLite backend, HTML/JS frontend |
| 3 | 03_data_pipeline | ETL pipeline — CSV ingest, validate/transform, SQLite load, CLI report |
| 4 | 04_ml_api | FastAPI ML inference service — scikit-learn model, predict/health/metrics endpoints |
| 5 | 05_microservices | Three-service system — API Gateway, User Service, Notification Service + docker-compose |
| 6 | 06_karpathy_optimization | AutoOptimizer Loop — iterative metric-driven optimization with keep/revert and RAG lessons |
poetry run python scripts/run_demo.py demos/01_hello_world
poetry run python scripts/run_demo.py demos/02_todo_app --skip-estimate
# Demo 06: AutoOptimizer Loop
ai-team optimize demos/06_karpathy_optimization/workspace \
--metric demos/06_karpathy_optimization/metric.yaml \
--strategy demos/06_karpathy_optimization/strategy.md \
--budget 2.00

Each demo directory contains input.json with the project spec and expected_output.json as an acceptance contract. After a run, scripts/capture_demo.py verifies the output and writes RESULTS.md.
For the full file layout, schema reference, capture/verification workflow, and instructions for adding new demos, see docs/DEMOS.md.
The CLI has two top-level subcommands: run (build a project) and optimize (AutoOptimizer Loop).
# Build subcommand (default)
poetry run python -m ai_team run "Build a minimal Flask API" \
--backend langgraph --team backend-api --env dev --skip-estimate
# Optimize subcommand
ai-team optimize ./workspace/my-app \
--metric metric.yaml --budget 2.00 --max-experiments 20

run flags:

| Flag | Description |
|---|---|
| --backend | crewai (default), langgraph, claude-agent-sdk |
| --team | Team profile from config/team_profiles.yaml |
| --env | dev, test, prod — selects model tier |
| --skip-estimate | Skip cost estimate confirmation |
| --output | crewai (default), tui — progress display mode |
| --monitor | Alias for --output tui |
| --stream | JSON lines streaming (LangGraph) |
| --thread-id | Resume thread (LangGraph checkpointing) |
| --resume | Resume after human-in-the-loop interrupt |
optimize flags:
| Flag | Description | Default |
|---|---|---|
| --metric | Path to metric YAML (name, evaluation_command, direction) | required |
| --strategy | Path to Markdown strategy hints file | — |
| --backend | Backend for the optimizer agent | claude-agent-sdk |
| --team | Team profile | research-optimizer |
| --budget | Total USD budget | 10.0 |
| --max-experiments | Max iterations | 50 |
| --branch | Git branch for winning commits | optimize/karpathy-loop |
| --editable | Paths the agent may edit | src/ |
Both ai-team-web and ai-team-tui support backend and team selection in their interfaces.
Three UI modes for different audiences — all share the same backend registry, monitor data models, and cost tracking.
A production-grade browser UI with real-time WebSocket streaming, GitHub-dark theme, and side-by-side backend comparison.
# Development (hot reload)
poetry run ai-team-web & # FastAPI on :8421
cd src/ai_team/ui/web/frontend && npm run dev # React on :5173 (proxies API)
# Production (single server)
cd src/ai_team/ui/web/frontend && npm run build
poetry run ai-team-web # Serves React build + API on :8421

Pages:

| Page | Route | Features |
|---|---|---|
| Dashboard | /, /runs/:id | Run sidebar, live monitor, run summary on complete, artifact preview, HITL panel |
| Run | /run | Launch form, cost estimate, auto-handoff to Dashboard when run starts |
| Compare | /compare | Parallel backends, pre-flight cost consent, demo compare ($0), comparison summary |
| Artifacts | /artifacts | Unified run picker, file tree, tests, architecture, ZIP download |
See docs/WEB_DASHBOARD.md for user journeys and UX notes.
API endpoints:
| Endpoint | Type | Purpose |
|---|---|---|
| GET /api/health | REST | Server health check |
| GET /api/profiles | REST | List team profiles |
| GET /api/backends | REST | List backends |
| POST /api/estimate | REST | Cost estimation |
| GET /api/runs | REST | List runs (in-memory session) |
| GET /api/runs/{id} | REST | Run detail + monitor snapshot |
| POST /api/runs/{id}/resume | REST | Resume LangGraph HITL (human feedback) |
| POST /api/demo | REST | Start demo simulation |
| GET /api/registry/runs | REST | List disk-backed runs (artifacts) |
| GET /api/projects/{id}/tree | REST | Artifact file tree (root=workspace or bundle) |
| GET /api/projects/{id}/file | REST | File content for preview |
| GET /api/projects/{id}/tests | REST | Test results JSON |
| GET /api/projects/{id}/architecture | REST | Architecture summary |
| GET /api/projects/{id}/download.zip | REST | Download workspace ZIP |
| WS /ws/run | WebSocket | Run backend with real-time event streaming |
| WS /ws/monitor/{id} | WebSocket | Monitor active run (500ms state snapshots) |
The React app uses same-origin /api and /ws (Vite proxies in dev; production serves UI and API from one port).
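For scripting against the server outside the browser, a small standard-library client might look like the sketch below. The paths and port 8421 come from this README; the `localhost` host, the helper names, and the assumption that the REST endpoints return JSON bodies are not confirmed by the source. The final commented-out call needs a running server.

```python
import json
from urllib.parse import quote, urljoin, urlsplit, urlunsplit
from urllib.request import urlopen

BASE_URL = "http://localhost:8421"  # assumed host; the port is the documented one


def api_url(path: str, base: str = BASE_URL) -> str:
    """Join a REST path such as 'api/health' onto the server base URL."""
    return urljoin(base + "/", path.lstrip("/"))


def monitor_ws_url(run_id: str, base: str = BASE_URL) -> str:
    """Build the WebSocket URL for /ws/monitor/{id}, matching http->ws and https->wss."""
    parts = urlsplit(base)
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, f"/ws/monitor/{quote(run_id, safe='')}", "", ""))


def fetch_json(path: str, base: str = BASE_URL, timeout: float = 5.0):
    """GET a REST endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(api_url(path, base), timeout=timeout) as response:
        return json.load(response)


health_url = api_url("/api/health")
ws_url = monitor_ws_url("run-123")
# fetch_json("/api/health")  # requires `poetry run ai-team-web` to be running
```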
A Textual-based interactive terminal dashboard with keyboard navigation.
poetry run ai-team-tui # Launch TUI
poetry run ai-team-tui --demo # Launch with simulated demo

Tabs: Dashboard (d), Run (r), Compare (c), Quit (q)
Features: real-time phase pipeline, agent status table, metrics panel, activity log, guardrails panel, cost estimation, backend comparison — all in the terminal.
The original Rich-based live display, embedded in CLI runs.
poetry run ai-team --monitor "Create a REST API for a todo list"
python -m ai_team.monitor # Simulated demo

# All tests
poetry run pytest
# With coverage
poetry run pytest --cov=src/ai_team --cov-report=term-missing
# By layer
poetry run pytest tests/unit
poetry run pytest tests/integration
poetry run pytest tests/e2e

Integration full-flow tests use a manual flow driver (no flow.kickoff()), so they run with the rest of the suite and do not hang or spike memory.
When crews use memory (memory=True), they use an OpenRouter-backed embedder (see get_embedder_config()). In integration tests with AI_TEAM_USE_REAL_LLM=1, crew memory is forced off so tests do not depend on the embedding service.
To run crew-level integration tests (planning, development, testing) against real OpenRouter instead of mocks, set AI_TEAM_USE_REAL_LLM=1 and OPENROUTER_API_KEY. Tests will skip if the key is missing. Full-flow tests remain mock-only by design.
AI_TEAM_USE_REAL_LLM=1 poetry run pytest tests/integration -m real_llm -v

Optional memory/embedder tests: set OPENROUTER_API_KEY, then AI_TEAM_USE_REAL_LLM=1 AI_TEAM_TEST_MEMORY=1 poetry run pytest tests/integration -m test_memory -v.
To run only the OpenRouter connectivity test (minimal cost; uses a free-tier model), set OPENROUTER_API_KEY in .env and run: AI_TEAM_USE_REAL_LLM=1 poetry run pytest tests/integration/test_openrouter.py::TestOpenRouterGated::test_openrouter_connectivity -v.
See CONTRIBUTING.md for code style and PR requirements.
ai-team/
├── src/ai_team/
│ ├── core/ # Backend protocol, ProjectResult, TeamProfile loader
│ ├── config/ # Settings, agents.yaml, team_profiles.yaml, models.py
│ │ └── optimizer_settings.py # OPTIMIZER_* env var config
│ ├── backends/
│ │ ├── registry.py # Backend discovery and instantiation
│ │ ├── crewai_backend/ # CrewAI: crews, flows, agents, state
│ │ ├── langgraph_backend/ # LangGraph: graphs, nodes, routing, subgraphs
│ │ └── claude_agent_sdk_backend/ # Claude Agent SDK: orchestrator, subagents, hooks, MCP
│ ├── agents/ # Shared agent definitions
│ ├── tools/ # File, code, git, test tools
│ ├── mcp/ # MCP server configs and adapters
│ ├── rag/ # RAG pipeline (static + dynamic + session knowledge)
│ ├── guardrails/ # Behavioral, security, quality
│ ├── memory/ # Session and long-term memory
│ ├── optimizers/ # AutoOptimizer Loop
│ │ ├── loop.py # KarpathyLoop — main state machine
│ │ ├── metric.py # MetricConfig, extract_metric, load_metric_config
│ │ ├── experiment_log.py # ExperimentRecord, append/load/summarise
│ │ └── git_reset.py # git_reset_hard, git_stash helpers
│ ├── monitor.py # TeamMonitor — shared data model for all UIs
│ ├── utils/ # Shared utilities
│ └── ui/
│ ├── web/ # FastAPI server + React/TypeScript/Vite dashboard
│ ├── tui/ # Textual TUI (terminal dashboard)
│ ├── __init__.py
├── tests/
│ ├── unit/
│ ├── integration/
│ └── e2e/
├── evals/
│ ├── scenarios/ # JSON scenario specs (hello-world, todo-api, devops, iac, qa, security, arch)
│ └── role_evals/ # Role-specific eval modules (optimizer, devops, iac, qa, security, arch)
├── demos/ # 01_hello_world … 06_karpathy_optimization (see docs/DEMOS.md)
├── docs/
│ ├── langgraph/ # LangGraph backend plan
│ ├── claude-agent-sdk/ # Claude Agent SDK backend plan
│ └── *.md # ARCHITECTURE, AGENTS, FLOWS, GUARDRAILS, TOOLS, MEMORY, EVALS
└── scripts/ # setup, run_demo, compare_backends
Count lines of code (requires cloc):
cloc \
src tests docker scripts docs demos .github \
--exclude-dir=__pycache__,node_modules,target,dist,build,cdk.out,.git,.venv,.pytest_cache,.archive,.ruff_cache,.mypy_cache,htmlcov,.tox,.eggs,.pdm-build,.pixi \
--vcs=git

We welcome contributions. Please read CONTRIBUTING.md for:
- Development setup and dependencies
- Code style (black, ruff, mypy)
- PR process and commit message convention
- How to add new agents, tools, or guardrails
| Document | Description |
|---|---|
| ARCHITECTURE.md | System design — UI layer, flows, crews, agents, tools, guardrails, memory |
| FLOWS.md | Orchestration flows (CrewAI and LangGraph) |
| AGENTS.md | Agent roles, prompts, model mapping |
| GUARDRAILS.md | Behavioral, security, quality guardrails |
| TOOLS.md | Tool specifications |
| MEMORY.md | Memory and knowledge management |
| DEMOS.md | Demo projects, schema, capture/verification |
| EVALS.md | Eval methodology, role-specific evals, LLM judges, April 2026 benchmark landscape |
| GETTING_STARTED.md | Setup, configuration, troubleshooting |
| HARDWARE.md | Hardware requirements and recommendations |
| RESULTS.md | Benchmark results and comparisons |
| CREWAI_REFERENCE.md | CrewAI framework reference |
| LangGraph Plan | LangGraph backend architecture and tasks |
| Claude SDK Plan | Claude Agent SDK backend architecture and tasks |
| Prompts | Prompt templates and tracking |
| Journey | Project background and ongoing story |
- License: MIT.
- CrewAI — agent and crew framework.
- LangGraph — graph-based agent orchestration.
- Claude Agent SDK — Anthropic's agent framework.
- OpenRouter — LLM and embeddings API.


