Mobius

What if your AI agents competed to get better — automatically?

Mobius is an adversarial swarm orchestrator that pits AI agents against each other, judges them with a cross-family panel, and evolves the winners — across Anthropic, Google, and OpenAI.

Most agent frameworks run one model, hope for the best, and call it done. Mobius takes a different approach: competition drives quality.

Run multiple agents in parallel, have independent judges score them, and evolve the winners.

5 agents (configurable) tackle every task in parallel — different providers, different strategies
3 judges from different model families score the outputs (reduces single-provider bias)
Elo ratings track who's actually good, not who's hyped
Evolution breeds winners into new variants; underperformers get retired
Memory remembers which agents won on similar tasks, so selection gets smarter over time

When Mobius is worth it:

You're uncertain which model/provider is best for your task
Output quality varies between runs and you need consistency
You want long-term performance tracking as models change
You're willing to spend 3-5x more tokens for statistically better outputs

For simple tasks with predictable outputs, a single good model is fine. Mobius is for problems where consistency and optimization matter.

Task → Memory Query → Selector → Swarm (parallel) → Judge Panel → Elo Update
                                                           ↓
                                                Evolve / Retire / Promote

Quick start

Prerequisites

Python 3.12+
At least one LLM API key (more keys = better judge diversity):
- Anthropic (Claude) — recommended, included in default judge panel
- Google (Gemini) — optional, adds cross-family judging
- OpenAI (GPT) — optional, adds cross-family judging
Claude Code subscription — optional, enables free agent seeding and judging via skills

Install & run

pip install -e ".[dev]"
cp .env.example .env          # Add your API keys (see .env.example for all options)
mobius init                    # Create database

Bootstrap agents (choose one):

# Option A: API-driven (~$1.50 one-time cost, automated)
mobius bootstrap

# Option B: Claude Code (free, requires any Claude Code subscription)
# In Claude Code, type: /mobius-seed

# Option C: Domain-specific (uses Opus API to analyze your codebase, ~$0.50)
mobius scout ./my-project --count 5

Run your first competition:

mobius run "Build a CLI that converts CSV to JSON"

Demo

$ mobius run "Write a Python LRU Cache with O(1) operations"

Starting competition (19 agents in pool, selecting best 5)
Strategy: diverse (memory matches: 3)
  Gemini Flash Coder (google/gemini-2.5-flash)
  GPT Mini Coder (openai/gpt-4o-mini)
  Methodical Coder (anthropic/claude-haiku-4-5)
  OpenRouter Wildcard (openrouter/gemini-2.5-flash)       (wildcard)

┌──────────────────────┬────────────┬──────────────┬───────────┐
│ Agent                │ Provider   │ Status       │ Preview   │
├──────────────────────┼────────────┼──────────────┼───────────┤
│ Gemini Flash Coder   │ google     │ completed    │ class LRU │
│ GPT Mini Coder       │ openai     │ completed    │ from coll │
│ Methodical Coder     │ anthropic  │ completed    │ class Nod │
│ OpenRouter Wildcard  │ openrouter │ completed    │ import co │
└──────────────────────┴────────────┴──────────────┴───────────┘

┌─────────────── Judge Panel ───────────────┐
│ Agent               │ Score │ Winner      │
├─────────────────────┼───────┼─────────────┤
│ Gemini Flash Coder  │  29.0 │   WINNER    │
│ GPT Mini Coder      │  24.0 │             │
│ Methodical Coder    │  22.0 │             │
│ OpenRouter Wildcard │  22.0 │             │
└─────────────────────┴───────┴─────────────┘

$ mobius scout ./my-project --count 5
Scouting ./my-project...
  Read: README.md, pyproject.toml, CLAUDE.md
  Sampled: 5 source files
Created: Protocol Engineer (anthropic/claude-sonnet-4-6)   specs=[coding, architecture]
Created: Test Specialist (anthropic/claude-sonnet-4-6)     specs=[testing, debugging]
Created: Dashboard Dev (anthropic/claude-sonnet-4-6)       specs=[frontend, coding]
Created: Spec Auditor (anthropic/claude-sonnet-4-6)        specs=[security, code-review]
Created: Perf Optimizer (google/gemini-2.5-flash)          specs=[backend, algorithms]

Scout created 5 agents for my-project.

Record your own: asciinema rec demo.cast then run some competitions. Scripts in scripts/.

Why Mobius?

Problem	How Mobius solves it	Trade-off
Which model is best?	Run all in parallel; judges pick best per task	3-5x token cost vs. single call
High output variance	Consensus scoring from 3 judges reduces outliers	Consensus can be noisy with small panels; ties resolved by score aggregation
Creative tasks lack ground truth	Cross-provider judges (Claude + Gemini + GPT) reduce vendor bias	Judges add ~$0.10 per match
Agents degrade silently	Elo tracks performance over time; losers get evolved or retired	Requires match history; cold start is random
Selection is manual guesswork	Vector memory recalls what worked on similar past tasks	Embedding similarity isn't perfect for all task types

Commands

mobius run "task"               # Run a competition
mobius train "task" --rounds 5  # Iterative training on a single challenge
mobius loop --rounds 10         # Self-improvement loop across varied tasks
mobius leaderboard              # Elo rankings
mobius scout ./src              # Auto-generate domain agents from your code
mobius evolve backend           # Improve underperformers in a specialization
mobius explain                  # Show last match's judge reasoning
mobius stats                    # Overview
mobius agent list               # Browse agents
mobius agent show <slug>        # Agent details
mobius agent export <slug>      # Export agent as .claude/agents/ markdown

Claude Code Skills (Free)

If you use Claude Code with any Claude Code subscription, these skills replace the paid API equivalents:

Skill	Replaces	What it does
`/mobius-seed`	`mobius bootstrap`	Opus creates agents directly — same quality, $0
`/mobius-run --free`	`mobius run`	Haiku subagents compete, Opus judges — $0
`/mobius-judge`	API judge panel	Opus evaluates outputs locally — $0
`/mobius-audit`	manual testing	Health checks and guided exploration

How it works

Agents

Agents are stored in a SQLite registry with system prompts, provider configs, and Elo ratings. Each agent has:

A provider (Anthropic, Google, OpenAI, OpenRouter)
A specialization (backend, frontend, algorithms, etc.)
An Elo rating that updates after every match
A generation — evolved agents track their lineage

Selection

Three strategies pick agents for each competition:

Specialist — top performers on similar past tasks (memory-driven)
Diverse — maximize provider and specialization variety
Ensemble — balanced mix of both

Judging

A panel of three models from different families scores each output on correctness, quality, and completeness. Consensus scoring prevents any single provider from biasing results.

Evolution

After every N matches, Mobius:

Identifies underperformers (low win rate over recent matches)
Refines underperformers using judge feedback via Opus
Retires agents on long losing streaks
Promotes consistent winners to champion status

Memory

After each competition, the winning agent's task is embedded (all-MiniLM-L6-v2) and stored. Future selections query this vector memory to find agents that won on similar past tasks — so the system gets smarter with every match.

Architecture

Request → Execution → Judgment → Learning

┌─────────────────────────────────────────────────────────────┐
│ CLI (cli.py) — Typer command handlers                       │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Orchestrator (orchestrator.py) — Coordinates a match        │
│  1. Select agents via selector.py (uses memory.py)          │
│  2. Run agents in parallel via swarm.py                     │
│  3. Judge outputs via judge.py + runner.py                  │
│  4. Update Elo ratings via tournament.py                    │
│  5. Log results to database (db.py)                         │
└─────────────────────────────────────────────────────────────┘

Provider abstraction — All model calls go through providers/ (anthropic, google, openai, openrouter), each implementing the same async interface. Concurrency is controlled at the swarm layer via semaphore.

Data layer — Pydantic models (models.py), SQLite with WAL mode (db.py), agent CRUD (registry.py), and vector similarity via sqlite-vec (memory.py + local Sentence-Transformers embeddings).

Evolution — agent_builder.py uses Opus to create/refine agents; tournament.py handles Elo math; selector.py picks agents via Specialist, Diverse, or Ensemble strategies.

Cost

Costs below are for typical tasks (500-2000 tokens per agent attempt):

Component	Models	Cost	Notes
Swarm (5 agents)	Haiku / Flash / GPT-4o-mini	~$0.01-0.05	Parallel; scales with task length
Judge panel	Opus + Gemini Pro + GPT-4o	~$0.05-0.15	Evaluates all 5 outputs
Full competition	5 agents + 3 judges	~$0.10-0.25	One round
Bootstrap (one-time)	Opus	~$1.50	Creates initial agent pool
Scout	Opus	~$0.50	Analyzes codebase, creates domain agents
Vector embeddings	Sentence-Transformers	$0	Runs locally, no API cost

Claude Code users: /mobius-seed and /mobius-judge use your Opus subscription directly — same quality, zero API cost. CLI commands (mobius run, mobius bootstrap, mobius scout) still require API keys.

Costs estimated March 2026. Verify current pricing in your provider dashboards.

Configuration

Env variable	Default	What it does
`MOBIUS_DATA_DIR`	`data`	Where the database lives
`MOBIUS_SWARM_SIZE`	`5`	Agents per competition
`MOBIUS_BUDGET_USD`	`50.0`	Global spending cap

These are the env-overridable settings. Additional parameters (Elo K-factor, promotion thresholds, embedding model, evolution triggers, retirement streaks, and more) can be tuned directly in config.py.

Contributing

See CONTRIBUTING.md for guidelines.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.claude		.claude
.github		.github
experiments		experiments
labs		labs
src/mobius		src/mobius
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
ideas.md		ideas.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mobius

Quick start

Prerequisites

Install & run

Demo

Why Mobius?

Commands

Claude Code Skills (Free)

How it works

Agents

Selection

Judging

Evolution

Memory

Architecture

Request → Execution → Judgment → Learning

Cost

Configuration

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Mobius

Quick start

Prerequisites

Install & run

Demo

Why Mobius?

Commands

Claude Code Skills (Free)

How it works

Agents

Selection

Judging

Evolution

Memory

Architecture

Request → Execution → Judgment → Learning

Cost

Configuration

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages