Multi-model consensus swarm orchestration for the Copilot CLI. Launch 50–250+ AI agents across 15 models with Shadow Score Spec L2 validation, all from one command.
Learn more and see the website here: dubsopenhub.github.io/swarm-command
Never used the CLI before? No problem.
- Open your terminal
- Paste this:
```
curl -fsSL https://raw.githubusercontent.com/DUBSOpenHub/swarm-command/main/quickstart.sh | bash
```
- When Copilot opens, type:
```
swarm command
```
Requires an active Copilot subscription.
Swarm Command is for tasks that are too big, risky, or cross-cutting for one model:
- Need one answer from many perspectives? It fans your task out across a layered swarm.
- Need confidence, not vibes? It uses cross-review + consensus scoring.
- Need hidden quality checks? It validates bundles with sealed acceptance criteria.
- Need speed at scale? Designed for parallel execution: agents work simultaneously, not sequentially.
- Need zero setup? No servers, no API keys, no build step.
If your task spans architecture + implementation + testing + docs + integration, this is exactly what Swarm Command is built for.
Swarm Command is a multi-model swarm orchestration skill for the Copilot CLI that launches 50 to 250+ AI agents across 15 different models to solve complex tasks through hierarchical fan-out, cross-family review, and consensus-gated synthesis.
Give it a task (architecture, refactoring, testing, docs, or integration) and it decomposes the mission into domains, dispatches Commanders, Squad Leads, and Workers, validates outputs against sealed acceptance criteria, and synthesizes a final answer from collective intelligence instead of single-model intuition.
One model gives you one perspective.
For small tasks, that's perfect. For high-stakes tasks, it's fragile:
- the model may miss cross-cutting risks,
- the task may exceed one context window,
- the output may sound confident without being complete,
- and you have no independent check that the answer actually satisfies the mission.
Swarm Command solves that by turning one request into a structured swarm process: split, parallelize, review, validate, converge.
These systems are complementary, not competitors.
| If you need to... | Use | Why |
|---|---|---|
| Solve one complex task with layered consensus inside your current Copilot CLI session | Swarm Command | Best when you want decomposition, cross-model review, shadow validation, and one synthesized answer |
| Run parallel coding workstreams across terminals or branches | Stampede | Best when the goal is execution throughput across independent task lanes |
| Run a many-model tournament to pressure-test ideas and rank options | Havoc Hackathon | Best when you want competitive ideation, elimination rounds, and judged synthesis |
Rule of thumb:
- Choose Swarm Command for consensus execution.
- Choose Stampede for parallel implementation.
- Choose Havoc Hackathon for idea tournaments and comparative judging.
- True swarm: 50 to 250+ agents, not 3–5
- 5-layer hierarchy: Nexus → Commander → Squad Lead → Worker → Reviewer
- Cross-model diversity: Claude + GPT families mixed within every pod
- Consensus scoring: 4-stage gate-then-rank with CONSENSUS / MAJORITY / CONFLICT tiers
- Shadow Score: Shadow Score Spec L2 conformance. Sealed acceptance criteria generated before commanders execute, validated after, hardened on failure.
- Depth Guard: 5 laws + 3-layer enforcement prevent runaway agent spawning
- Circuit breaker: 3-state FSM with 5-level recovery escalation
- Parallel by design: agents execute concurrently with hierarchical fan-out and pipeline overlap
- Cost-controlled: 1024:1 token compression, wave deployment, hard cost ceilings, and cheap workers
- Zero infrastructure: no servers, no API keys, no build step
Running 250+ agents sounds expensive. It isn't, because every layer is engineered to minimize spend.
Context shrinks at every layer. The Nexus holds 128K tokens; by the time instructions reach a worker, they're 128 tokens. Parents strip rationale, narrow file scope, and tighten constraints so children only receive the bytes they need.
```
Nexus       128K tokens → 4K task brief
Commander    64K tokens → 2K context capsule
Squad Lead   32K tokens → 512-token shard
Worker        8K tokens → 128-token micro-brief
```
A three-state FSM (CLOSED → OPEN → HALF-OPEN) monitors every layer. If too many agents fail (50–60% threshold), the breaker trips: no new agents spawn, costs stop climbing, and a recovery probe tests the waters before the swarm resumes.
5-level recovery escalation: Retry → Simplify → Model Swap → Scope Reduce → Graceful Degrade.
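A minimal sketch of such a breaker, assuming a rolling failure window and a 55% trip threshold (the class, names, and exact numbers are ours; the real protocol lives in protocols/circuit-breaker.md):

```python
class CircuitBreaker:
    """Illustrative 3-state breaker: CLOSED -> OPEN -> HALF-OPEN."""

    def __init__(self, trip_threshold=0.55, window=20, min_samples=5):
        self.state = "CLOSED"
        self.results = []            # rolling window: True = success
        self.trip_threshold = trip_threshold
        self.window = window
        self.min_samples = min_samples

    def record(self, ok):
        self.results = (self.results + [ok])[-self.window:]
        fail_rate = self.results.count(False) / len(self.results)
        if self.state == "HALF-OPEN":
            # One probe decides: success closes the breaker, failure reopens it.
            self.state = "CLOSED" if ok else "OPEN"
        elif (self.state == "CLOSED"
              and len(self.results) >= self.min_samples
              and fail_rate >= self.trip_threshold):
            self.state = "OPEN"      # trip: stop spawning, stop spending

    def probe(self):
        if self.state == "OPEN":
            self.state = "HALF-OPEN"  # allow a single recovery probe

    def allow_spawn(self):
        return self.state != "OPEN"
```

Once tripped, `allow_spawn()` returns False for every layer below, which is what actually caps the cost of a failing run.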
Agents don't all launch at once. Each pod deploys in three waves with health gates between them:
- Wave 1 (Canary): 1 agent verifies the task is feasible
- Wave 2 (Probe): 3 agents test for rate limits and bulk viability
- Wave 3 (Remainder): the full pod deploys only if both gates pass
If the canary fails, the full pod never deploys. One cheap test prevents many expensive failures.
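The wave logic can be sketched as follows; `deploy` stands in for whatever launches `size` agents and reports how many succeeded, and the gate ratios are our reading of "1 canary, 3 probes" (not the repo's actual API):

```python
def deploy_pod(deploy, pod_size=25):
    """Deploy a pod in canary/probe/remainder waves with health gates."""
    waves = [(1, 1.0),               # Wave 1: the canary must fully succeed
             (3, 2 / 3),             # Wave 2: probes must mostly succeed
             (pod_size - 4, 0.0)]    # Wave 3: remainder, no further gate
    launched = 0
    for size, gate in waves:
        succeeded = deploy(size)     # launch `size` agents, count successes
        launched += size
        if succeeded / size < gate:  # gate failed: the full pod never deploys
            return launched, False
    return launched, True
```

A healthy pod runs all three waves (`deploy_pod(lambda n: n)` launches all 25 agents); a dead canary stops the run after a single cheap agent (`deploy_pod(lambda n: 0)` launches only 1).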
| Guard | What it does |
|---|---|
| Timeout cascade | 90s → 60s → 40s → 30s per layer, so children always finish before parents |
| Token ceiling | 128K / 64K / 32K / 8K per layer |
| Output size cap | 4K / 1K / 512 / 256 tokens per layer |
| Retry budget | Workers: 0 retries. Squad Leads: 1 retry. |
| Concurrent agent cap | Max 50 agents launching simultaneously |
| Cost ceiling | $5 / $10 / $20 hard cap; kills all agents if breached |
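As a sketch, a spawn request would be checked against those limits before any child launches (limit values from the table above; the function and names are illustrative, not the repo's API):

```python
# Depth Guard limits, taken from the table above (names are ours).
LIMITS = {
    "max_spawn_depth": 3,                        # L3 workers are leaves
    "timeouts_s": [90, 60, 40, 30],              # per layer, L0 -> L3
    "token_ceilings": [128_000, 64_000, 32_000, 8_000],
    "max_concurrent_agents": 50,
}

def may_spawn(depth, active_agents):
    """True if a parent at `depth` may spawn a child within the guard limits."""
    return (depth < LIMITS["max_spawn_depth"]
            and active_agents < LIMITS["max_concurrent_agents"])
```

The point of checking before spawning (rather than after) is that a violation costs nothing: the child simply never exists.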
| Scale | Agents | Typical Cost | Hard Cap | Wall-Clock |
|---|---|---|---|---|
| SS-50 | ~36–52 | $2.50 | $5 | ~30s |
| SS-100 | ~89 | $5.50 | $10 | ~45s |
| SS-250 | ~316 | $10 | $20 | ~65–90s |
- Workers are the cheapest models: Haiku and GPT-Mini at L3, 10× cheaper than Opus
- Expensive reasoning stays at the top: Opus and Sonnet only at Commander/Nexus level
- Context compresses monotonically: each layer receives a fraction of its parent's tokens
- Failed work stops early: circuit breakers and canary gates prevent runaway spend
Before the full diagrams, here's the mental model:
- Nexus reads the mission and splits it into domains.
- Commanders own each domain and dispatch sub-work.
- Workers do tiny atomic tasks in parallel.
- Reviewers + Shadow Score decide what survives into the final answer.
```
You ask one question
         ↓
Nexus decomposes the mission
         ↓
Commanders split by domain
         ↓
Workers execute atomic tasks in parallel
         ↓
Reviewers score + Shadow Score validates
         ↓
Nexus emits one final bundle
```
If you want the visual deep dive, jump to docs/architecture.md or docs/architecture-diagrams.md.
```
                         ┌─────────────────┐
L0                       │    NEXUS (1)    │  claude-opus-4.6
                         │ 128K ctx budget │  Task decomposition + final synthesis
                         └────────┬────────┘
                                  │
             ┌────────────────────┼────────────────────┐
             │                    │                    │
       ┌─────┴─────┐        ┌─────┴─────┐        ┌─────┴─────┐
L1     │ CMD-ARCH  │        │ CMD-IMPL  │  ...   │ CMD-INTG  │  × 5 Commanders
       │  64K ctx  │        │  64K ctx  │        │  64K ctx  │  Domain specialists
       └─────┬─────┘        └─────┬─────┘        └─────┬─────┘
             │                    │                    │
     ┌──┴──┐   ┌──┴──┐         ┌──┴──┐
L2   │SQ-1 │   │SQ-2 │   ...   │SQ-10│  × 10 per Commander = 50 Squad Leads
     │ 32K │   │ 32K │         │ 32K │  Micro-task decomposition + canary deploy
     └──┬──┘   └──┬──┘         └──┬──┘
        │         │               │
     ┌──┴──┐   ┌──┴──┐         ┌──┴──┐
L3   │ W×5 │   │ W×5 │   ...   │ W×5 │  × 5 per Squad Lead = 250 Workers
     │ 8K  │   │ 8K  │         │ 8K  │  Atomic execution (LEAF: no spawning)
     └─────┘   └─────┘         └─────┘
                  ┌──────────────┐
L4                │ REVIEWERS×10 │  Cross-review mesh (pipeline overlap)
                  │   16K ctx    │  4-axis sealed scoring + consensus tiers
                  └──────────────┘
       + SHADOW SCORING (sealed acceptance criteria, Shadow Score Spec L2)
```
```
T+0s     T+2s     T+5s       T+12s       T+45s      T+65s     T+80s  T+90s
  │        │        │           │           │          │         │      │
  ▼        ▼        ▼           ▼           ▼          ▼         ▼      ▼
┌─────┐ ┌──────┐ ┌────────┐ ┌──────────┐ ┌────────┐ ┌───────┐ ┌────┐ ┌────┐
│NEXUS│→│CMDs  │→│SQUAD   │→│WORKERS   │ │REVIEW  │ │MERGE  │ │VOTE│ │EMIT│
│BOOT │ │SPAWN │ │LEADS   │ │EXECUTE   │ │MESH    │ │RESULTS│ │    │ │    │
│     │ │      │ │+ CANARY│ │(parallel)│ │(overlap│ │       │ │    │ │    │
│     │ │      │ │VERIFY  │ │          │ │ start) │ │       │ │    │ │    │
└─────┘ └──────┘ └────────┘ └──────────┘ └────────┘ └───────┘ └────┘ └────┘
  2s       3s       7s          33s         20s       15s      10s     5s
```
```
CONTEXT DOWN (shrinking)                 RESULTS UP (compressing)
========================                 ========================
L0  Full Task Brief ─ 4K tokens ──►      Final Report ─ 4K tokens
          │                                    ▲
L1  Context Capsule ─ 2K tokens ──►      Bundle ─ 1K tokens
          │                                    ▲
L2  Shard ─ 512 tokens ──►               Atom Set ─ 512 tokens
          │                                    ▲
L3  Micro-Brief ─ 128 tokens ──►         Atom ─ 256 tokens
          │                                    ▲
L4  Review Capsule ─ 1K tokens ──►       Score Card ─ 512 tokens
```
| Scale | Agents | Commanders | Workers | Reviewers | Best For | Wall-Clock |
|---|---|---|---|---|---|---|
| SS-50 | ~36–52 | 2-3 | 30-45 | 3 | Fast bounded tasks | ~30s |
| SS-100 | ~89 | 5 | 75 | 8 | Multi-file features and reviews | ~45s |
| SS-250 | ~316 | 5 | 250 | 10 | Repo-wide or high-stakes work | ~65–90s |
Do you need a fast second opinion on 1–2 files?
→ SS-50
Do you need a serious answer for a multi-file feature or subsystem?
→ SS-100
Do you need repo-wide coverage, compliance-grade review, or maximum consensus?
→ SS-250
The default is SS-100. Say `swarm command ss-250` for full deployment or `swarm command ss-50` for quick tasks.
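Those three questions reduce to a simple decision rule; a sketch (the file-count thresholds are our illustrative reading of the chooser, not values from the skill):

```python
def choose_scale(files, high_stakes=False):
    """Map the scale-chooser questions above onto a swarm scale."""
    if high_stakes or files > 20:
        return "ss-250"   # repo-wide, compliance-grade, maximum consensus
    if files > 2:
        return "ss-100"   # multi-file feature or subsystem (the default)
    return "ss-50"        # fast second opinion on 1-2 files
```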
See docs/scaling.md for cost breakdowns, chooser guidance, and a deeper decision matrix.
Curated highlights; see docs/use-cases.md for the full gallery.
**Stack Trace Whisperer**
```
swarm command ss-50 "Diagnose this error and give the 3 most likely root causes with fixes: [paste error]"
```
Three fast expert panels race on runtime, dependency, and logic hypotheses. You get ranked diagnoses, not a single guess.
**Explain Like I Own It**
```
swarm command ss-50 "I just inherited this codebase. Explain src/core/: what does each piece do, where are the landmines?"
```
Great for onboarding: architecture map, event flow, and hidden footguns in one brief.
**Performance Profiler's Shortcut**
```
swarm command ss-50 "Find the performance bottlenecks in this file with optimized versions: [paste hot-path file]"
```
Ideal when you need a prioritized hit list before opening a profiler.
**Zero-Downtime Auth Rewrite**
```
swarm command "Migrate our session auth to JWT + refresh tokens across API, web app, DB, and tests"
```
Architecture, implementation, testing, docs, and rollout risk all get separate ownership before synthesis.
**Legacy Service Extraction**
```
swarm command "Extract the billing module from our monolith into a service with minimal downtime"
```
Produces migration phases, interface boundaries, contract tests, and rollback paths.
**Offline Sync Feature**
```
swarm command "Design offline-first sync for our field app: local cache, conflict resolution, API changes, UX, and tests"
```
Covers data model, UX states, conflict semantics, and integration testing in parallel.
**Zero-Day Security Sweep**
```
swarm command ss-250 "Full security audit: every file, every dependency, every injection surface, with a CVSS-scored vulnerability report"
```
Best for broad-surface analysis where missing even one category matters.
**Compliance Fortress**
```
swarm command ss-250 "Audit for GDPR, HIPAA, SOC2, PCI-DSS compliance: every gap, every control, remediation tickets"
```
Turns a giant policy problem into parallel control checks with one synthesized risk summary.
**Living Runbook Generator**
```
swarm command ss-250 "Read every service, every pipeline, every config, and generate the complete operations manual"
```
Excellent when tribal knowledge has to become documentation fast.
- "What's the CLI flag for X?" → Ask a single agent
- Rename one variable → Manual edit or a single agent
- Prod is down and seconds matter → Follow the human runbook first
- Writing a single-voice email → One persona is better than a committee
- Step-through debugging → Sequential work beats consensus here
```
mkdir -p ~/.copilot/skills/swarm-command ~/.copilot/agents && \
curl -sL https://raw.githubusercontent.com/DUBSOpenHub/swarm-command/main/skills/swarm-command/SKILL.md \
  -o ~/.copilot/skills/swarm-command/SKILL.md && \
curl -sL https://raw.githubusercontent.com/DUBSOpenHub/swarm-command/main/agents/swarm-command.agent.md \
  -o ~/.copilot/agents/swarm-command.agent.md && \
echo "Swarm Command installed. Open Copilot CLI and type: swarm command"
```

Verify integrity (optional):

```
shasum -a 256 ~/.copilot/skills/swarm-command/SKILL.md
shasum -a 256 ~/.copilot/agents/swarm-command.agent.md
```

Security note: we recommend inspecting quickstart.sh before piping it to bash. You can also use the manual install above instead.
```
git clone https://github.com/DUBSOpenHub/swarm-command.git
cd swarm-command
chmod +x quickstart.sh && ./quickstart.sh
```

If you're new, read in this order:
- This README: what it is, when to use it, and how to run it
- docs/learning-path.md: beginner, operator, and architect reading tracks
- docs/architecture.md: the conceptual system model
- docs/scaling.md: which scale to choose and what it costs
- docs/use-cases.md: vivid prompts and expected outcomes
- docs/consensus.md + docs/shadow-scoring.md: the deep mechanics
- I just want to try it: README → install → run `swarm command`
- I want to operate it well: README → learning path → scaling → use cases
- I want to understand the design: README → architecture → consensus → shadow scoring
No. Swarm Command runs through your active Copilot subscription. No separate servers, queues, or key management required.
Use SS-50 for bounded, fast tasks. Use SS-100 for most real software work. Use SS-250 when the task is repo-wide, high-stakes, or needs maximum coverage and consensus.
Append a personality mode after the scale to adjust how the swarm operates:
```
swarm command ss-100 thorough "audit auth module"
swarm command ss-250 fast "quick scan of README"
```

| Mode | Workers | Timeout | Models | Retry | Best For |
|---|---|---|---|---|---|
| `balanced` (default) | 5 per squad | 1.0× | mixed | 1 | Most tasks |
| `thorough` | 5 per squad | 1.5× | opus/sonnet | 2 | High-stakes, complex analysis |
| `fast` | 3 per squad | 0.6× | haiku only | 0 | Quick iteration, cost-sensitive |
| `creative` | 4 per squad | 1.0× | max diversity | 1 | Brainstorming, novel problems |
| `cautious` | 5 per squad | 1.2× | sonnet | 2 | Ambiguous tasks, high conflict risk |
Because diversity helps. Different model families catch different failure modes. Swarm Command intentionally mixes them so agreement means more than self-consistency.
Disagreement is preserved, scored, and escalated. Squad Leads and Commanders mark results as CONSENSUS, MAJORITY, CONFLICT, or UNIQUE, then Nexus arbitrates the unresolved pieces.
It is a hidden acceptance test: criteria are generated before execution, kept sealed from the swarm, then used to validate outputs afterward.
It can produce plans, analyses, patches, documentation, tests, and rollout guidance depending on how you invoke it. The point is not blind automation; the point is reviewable, consensus-backed output.
Avoid it for tiny edits, urgent incident response where every second matters, or tasks that need one strong voice rather than many perspectives.
Swarm Command came out of a simple question: what if one Copilot CLI session could behave less like one assistant and more like a disciplined organization?
The design evolved from SwarmSpeed 250 experiments into a layered system with:
- a single Nexus orchestrator,
- domain-owning Commanders,
- decomposing Squad Leads,
- leaf-node Workers,
- and independent Reviewers.
The turning point was a self-analysis run later documented in docs/shadow-scoring.md: sealed judges rated a design highly even though it contained critical arithmetic errors. That exposed a core truth of multi-agent systems: review alone is not validation.
That failure drove the big ideas that now define this repo:
- Shadow scoring so hidden criteria can catch what the swarm forgot to optimize for
- Depth Guard so recursion never turns into agent explosion
- Token compression so higher-level intent survives while lower layers stay cheap
- Cross-family review so agreement means more than "the same model said it twice"
In other words: Swarm Command is not just a big swarm. It is a swarm that learned from its own failure modes.
See what a completed swarm run looks like: Example Output
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
        S W A R M   C O M P L E T E
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Results Summary
- Domains completed: 5/5
- Consensus tier: CONSENSUS (4) · MAJORITY (1)
- Overall confidence: 0.77
- Agents deployed: 89
- Wall-clock time: 72s
- Shadow Score: 20.0% Moderate (8 pass · 2 fail)
```
Swarm Command implements Shadow Score Spec L2 conformance: sealed acceptance criteria generated before commanders execute, validated after, hardened on failure.
Formula: Shadow Score = (sealed_failures / sealed_total) × 100
| Shadow Score | Level | Action |
|---|---|---|
| 0% | Perfect | All sealed criteria passed |
| 1–15% | Minor | Proceed normally |
| 16–30% | Moderate | Attach Gap Report, warn |
| 31–50% | Significant | Quarantine bundle, hardening cycle |
| > 50% | Critical | Reject bundle from synthesis |
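The formula and action bands translate directly into code; a sketch (function and return-value names are ours):

```python
def shadow_score(sealed_failures, sealed_total):
    """Shadow Score = (sealed_failures / sealed_total) * 100."""
    return sealed_failures / sealed_total * 100

def shadow_action(score):
    """Map a score onto the action bands in the table above."""
    if score == 0:
        return "pass"            # all sealed criteria passed
    if score <= 15:
        return "proceed"         # minor: proceed normally
    if score <= 30:
        return "gap-report"      # moderate: attach Gap Report, warn
    if score <= 50:
        return "quarantine"      # significant: hardening cycle
    return "reject"              # critical: drop bundle from synthesis
```

The example run above (8 pass, 2 fail out of 10 sealed criteria) scores `shadow_score(2, 10) == 20.0`, which lands in the Moderate band.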
Sealed-envelope protocol:
- Phase 1.5: Nexus generates sealed acceptance criteria from the task
- Phases 2–5: Commanders execute without seeing those criteria
- Phase 6: Validate outputs, compute the Shadow Score, produce a Gap Report
- Hardening: if the score is > 15%, share failure messages only, for one fix cycle
See docs/shadow-scoring.md for the full protocol.
A 4-stage consensus pipeline merges the best work from hundreds of agents:
- Worker Self-Score: each worker emits confidence + self-score with its atom
- Squad Lead Local Merge: groups atoms by sub-task, classifies as CONSENSUS / MAJORITY / CONFLICT
- Commander Domain Merge: trimmed mean across squads, applies the consensus formula
- Nexus Cross-Domain Synthesis: median-of-3 judging and final arbitration
Consensus formula:
score = 0.40 × confidence + 0.30 × evidence + 0.15 × scope + 0.15 × coverage − min(0.30, conflict_rate × 0.30)
| Tier | Condition | Action |
|---|---|---|
| CONSENSUS | ≥ 70% agreement | Auto-accept |
| MAJORITY | ≥ 50% agreement | Accept with dissent note |
| CONFLICT | < 50% agreement | Nexus arbitration |
| UNIQUE | No overlap | Keep if evidence ≥ 7/10 |
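The formula and the first three tiers can be sketched directly (function names are ours; UNIQUE handling, which depends on overlap detection, is omitted for brevity):

```python
def consensus_score(confidence, evidence, scope, coverage, conflict_rate):
    """The weighted consensus formula above; all inputs in [0, 1]."""
    penalty = min(0.30, conflict_rate * 0.30)
    return (0.40 * confidence + 0.30 * evidence
            + 0.15 * scope + 0.15 * coverage) - penalty

def consensus_tier(agreement):
    """Map an agreement ratio onto the tier table above."""
    if agreement >= 0.70:
        return "CONSENSUS"   # auto-accept
    if agreement >= 0.50:
        return "MAJORITY"    # accept with dissent note
    return "CONFLICT"        # escalate to Nexus arbitration
```

Note the penalty is capped at 0.30, so even a fully conflicted result with perfect axis scores still scores 0.70 rather than collapsing to zero.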
See docs/consensus.md for the full mechanics.
All tunables live in config.yml. Key settings:
```yaml
consensus:
  threshold_consensus: 0.70
  threshold_majority: 0.50
depth_guard:
  max_spawn_depth: 3
  max_workers_per_squad_lead: 5
circuit_breaker:
  timeout_cascade: [90, 60, 40, 30]
shadow_scoring:
  enabled: true
  spec_version: "1.0.0"
  conformance_level: "L2"
  sealed_criteria_count: 10   # max; per-scale: SS-50=6, SS-100=8, SS-250=10
hardening:
  enabled: true               # SS-50 overrides to disabled
  threshold: 15
```

See docs/scaling.md for full scaling configuration and cost estimates.
| Role | Models |
|---|---|
| Nexus | claude-opus-4.6 |
| Commanders (pool: 9) | claude-opus-4.6, claude-opus-4.5, claude-opus-4.6-1m, claude-sonnet-4.6, claude-sonnet-4.5, claude-sonnet-4, gpt-5.4, gpt-5.2, gpt-5.1 |
| Squad Leads (SS-250 only) | claude-haiku-4.5, gpt-5.4-mini |
| Workers (pool: 6) | claude-haiku-4.5, gpt-5.4-mini, gpt-5-mini, gpt-4.1, gpt-5.3-codex, gpt-5.2-codex |
| Reviewers (7 pairs) | claude-opus-4.6 ↔ gpt-5.4, claude-opus-4.5 ↔ gpt-5.2, claude-opus-4.6-1m ↔ gpt-5.1, claude-sonnet-4.6 ↔ gpt-5.3-codex, claude-sonnet-4.5 ↔ gpt-5.2-codex, claude-sonnet-4 ↔ gpt-5.4-mini, claude-haiku-4.5 ↔ gpt-5-mini |
```
swarm-command/
├── README.md                    # Overview, install, comparison, FAQ
├── AGENTS.md                    # Agent/skill descriptions
├── CONTRIBUTING.md              # Contribution guidelines
├── catalog.yml                  # Skill metadata
├── config.yml                   # All tunables
├── LICENSE                      # MIT
├── SECURITY.md                  # Security policy
├── quickstart.sh                # One-line installer
├── .github/
│   ├── copilot-instructions.md  # AI agent instructions for this repo
│   ├── workflows/ci.yml         # CI: YAML lint + SKILL.md sync check
│   └── skills/swarm-command/SKILL.md  # Skill discovery path
├── agents/
│   └── swarm-command.agent.md   # Standalone agent version
├── skills/swarm-command/
│   └── SKILL.md                 # Core skill
├── templates/
│   ├── commander.md             # Commander prompt template
│   ├── worker.md                # Worker prompt template
│   ├── reviewer.md              # Cross-reviewer prompt template
│   └── squad-lead.md            # Squad Lead prompt template
├── protocols/
│   ├── depth-guard.md           # 5 Laws + 3-layer enforcement
│   ├── circuit-breaker.md       # 3-state FSM + 5-level recovery
│   ├── context-capsule.md       # JSON schemas for data structures
│   └── meta-reviewer.md         # Reviewer quality gate protocol
└── docs/
    ├── architecture.md          # Architecture overview
    ├── architecture-diagrams.md # Mermaid diagrams
    ├── consensus.md             # Consensus algorithm deep dive
    ├── example-output.md        # Sample completed swarm run output
    ├── learning-path.md         # Recommended reading order
    ├── scaling.md               # Scale chooser + cost estimates
    ├── shadow-scoring.md        # Shadow scoring protocol
    └── use-cases.md             # Expanded prompt gallery
```
MIT: use it, fork it, build on it.
This project implements Shadow Score Spec L2: sealed acceptance criteria generated before execution, validated after, hardened on failure.
Created by @DUBSOpenHub with the GitHub Copilot CLI.