Evolve fix, parallel judging, Docker sandbox, agentic tasks #10

AaronGoldsmith wants to merge 6 commits into main
Conversation
…stricter parsing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…t artifacts
- Replace sequential judge evaluation with asyncio.gather (3x latency improvement)
- Add cleanup script for dead weight agents and false champion flags
- Add 25 diverse competition tasks for mass match testing
- Add mobius-evolve skill, tree-solve skill, and custom agent definitions
- Gitignore .tree-workspace/, runtime artifacts, and experiment files

- New .claude/agents/competition-tasks.md for generating tool-heavy tasks
- 15 tasks across 3 tiers: tool-heavy, agentic reasoning, multi-agent
- Stress-tested for Windows/Git Bash, 10-turn limit, 30s timeouts
- Tier 3 tasks include future multi-agent collaboration notes

- Agents run inside disposable python:3.12-slim containers (no network, 512MB limit)
- Enable with `mobius run --sandbox` or MOBIUS_SANDBOX=true env var
- Container lifecycle managed per competition (create → exec → destroy)
- Agent context auto-detects sandboxed environment (Linux/workspace)
- Zero overhead: warm container with docker exec (~50ms per command)
- Configurable: image, memory limit, network access via MobiusConfig
Pull request overview
This PR upgrades Mobius’ competition loop with safer agent execution, faster multi-model judging, improved agent evolution mechanics, and new task packs/scripts to support more agentic evaluations.
Changes:
- Add Docker sandbox mode for agent tool execution (`mobius run --sandbox` / `MOBIUS_SANDBOX=true`).
- Parallelize cross-family judge evaluation using `asyncio.gather()`.
- Rework `mobius evolve` to target underperformers and add an evaluator–optimizer (self-critique) loop; add new competition task JSON files and a registry cleanup script.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| src/mobius/providers/tools.py | Introduces sandbox container lifecycle + routes tool commands through docker exec. |
| src/mobius/orchestrator.py | Creates/destroys sandbox per competition and sets global sandbox state. |
| src/mobius/runner.py | Adjusts environment context banner when sandbox is active. |
| src/mobius/config.py | Adds sandbox configuration options and env var wiring. |
| src/mobius/cli.py | Adds --sandbox flag to mobius run; enhances mobius evolve options/flow. |
| src/mobius/judge.py | Runs judge models in parallel and handles per-judge shuffle metadata. |
| src/mobius/agent_builder.py | Adds critique_refinement() for evolve self-critique loop. |
| src/mobius/db.py | Removes prior migration that injected “Bash” into agent tools. |
| scripts/competition_tasks_agentic.json | Adds 15 tool-heavy, multi-tier agentic competition tasks. |
| scripts/competition_tasks.json | Adds a baseline set of competition tasks. |
| scripts/cleanup_agents.py | Adds a registry cleanup utility for zero-match agents/champion flags. |
| CLAUDE.md | Documents sandbox mode usage and configuration. |
| .gitignore | Adds patterns for tree workspace and other artifacts. |
| .claude/skills/tree-solve/SKILL.md | Adds “tree-solve” skill documentation. |
| .claude/skills/mobius-evolve/SKILL.md | Adds a local Opus evolve workflow skill doc. |
| .claude/skills/mobius-evolve/scripts/load_underperformers.py | Adds helper script to load underperformers + feedback for evolution. |
| .claude/agents/tree-solver.md | Adds recursive tree-solver agent definition. |
| .claude/agents/depth-test.md | Adds a minimal recursion test agent definition. |
| .claude/agents/competition-tasks.md | Adds an agent definition for generating competition tasks. |
```python
subprocess.run(
    ["docker", "start", name],
    capture_output=True, text=True, timeout=10,
)
```
docker start is executed but its return code is ignored. If the container fails to start, _active_containers is still populated and later docker exec calls will fail in confusing ways. It’d be safer to check the docker start result (and raise / clean up) before marking the sandbox active.
Suggested change:

```python
start_result = subprocess.run(
    ["docker", "start", name],
    capture_output=True, text=True, timeout=10,
)
if start_result.returncode != 0:
    # Best-effort cleanup of the created container if it fails to start
    try:
        subprocess.run(
            ["docker", "rm", "-f", name],
            capture_output=True, text=True, timeout=15,
        )
    except Exception as e:
        logger.warning("Failed to clean up sandbox %s after start error: %s", name, e)
    raise RuntimeError(f"Failed to start sandbox: {start_result.stderr.strip()}")
```
```python
sandbox = sandbox or _current_sandbox
try:
    if sandbox and sandbox in _active_containers:
        result = subprocess.run(
            ["docker", "exec", sandbox, "bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
            encoding="utf-8", errors="replace",
        )
    else:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout, cwd=working_dir or os.getcwd(),
            encoding="utf-8", errors="replace",
        )
```
If sandbox is set but the name is not present in _active_containers, run_command silently falls back to running the command on the host. That breaks the isolation guarantee of --sandbox and can lead to untrusted agent commands executing on the host. Prefer failing closed here (return an explicit error / raise) when sandbox was requested but is unavailable.
```json
{
  "task": "Write a Python script at /tmp/mobius-task-gitanalyze/analyze.py that analyzes the Mobius git repository (at C:/Users/aargo/Development/mobius). It should produce a JSON report containing: (1) total commits per author, (2) lines of code per file extension, (3) the 5 most-changed files by commit count, (4) average commits per day over the last 30 days, (5) list of files that exist in the repo but have never been committed (untracked). Run the script and save the report to report.json. Verify the data by spot-checking at least 2 metrics against direct git commands.",
  "category": "explore-and-analyze",
  "tier": 1,
  "tools_required": ["Bash"],
  "verification": "Judge checks: (1) analyze.py runs without errors, (2) report.json contains all 5 sections with plausible data, (3) spot-check verification shows metrics match actual git output, (4) script uses subprocess to call git (not a library)",
  "setup": "mkdir -p /tmp/mobius-task-gitanalyze"
}
```
Several tasks hard-code a developer-specific absolute path (C:/Users/aargo/Development/mobius). That will fail on other machines/CI and is incompatible with the new Linux Docker sandbox (/workspace). Consider rewriting these tasks to reference the repo root relative to the working directory (or /workspace when sandboxed).
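One way the task texts could stay portable is a small resolution helper; the `MOBIUS_REPO` variable name and the `/workspace` mount point are assumptions (the latter matching the sandbox convention described elsewhere in this PR), not existing Mobius config:

```python
import os
from pathlib import Path

def repo_root() -> Path:
    """Resolve the repo root portably instead of hard-coding a developer path.

    Preference order (all hypothetical): an explicit MOBIUS_REPO env var,
    then the sandbox bind mount at /workspace, then the current directory.
    """
    env = os.environ.get("MOBIUS_REPO")
    if env:
        return Path(env)
    sandbox_mount = Path("/workspace")
    if sandbox_mount.is_dir():
        return sandbox_mount
    return Path.cwd()
```

Task prompts could then say "the repository root" and let the harness substitute the resolved path.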
| "task": "Write a Bash script at /tmp/mobius-task-monitor/sysmon.sh that collects system metrics on Windows (CPU via 'wmic cpu get loadpercentage', memory via 'wmic OS get FreePhysicalMemory,TotalVisibleMemorySize', disk usage via 'df' or 'wmic logicaldisk', process count via 'tasklist | wc -l') every 2 seconds for 10 seconds, writes each snapshot as a JSON line to metrics.jsonl, then generates a summary report (min/max/avg for each metric) to summary.txt. The script must handle missing commands gracefully (try wmic first, fall back to PowerShell 'Get-Process | Measure' or similar if unavailable). Run it and verify the output files are correct and well-formed JSON.", | ||
| "category": "infrastructure", | ||
| "tier": 1, | ||
| "tools_required": ["Bash"], | ||
| "verification": "Judge checks: (1) sysmon.sh runs without errors, (2) metrics.jsonl has 5+ valid JSON lines, (3) summary.txt has correct min/max/avg calculations, (4) script handles at least one missing-tool fallback", | ||
| "setup": "mkdir -p /tmp/mobius-task-monitor" |
This task is Windows-specific (wmic/tasklist/PowerShell) but the PR introduces a Linux Docker sandbox mode by default. If these tasks are meant to be runnable under --sandbox, they should be rewritten for Linux tooling; otherwise consider tagging them as host-only / non-sandbox tasks so they don’t fail systematically.
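For a Linux rewrite of this task, the natural metric sources are `/proc/loadavg` and `/proc/meminfo`. A sketch of the parsing (the parsers take raw text so they work anywhere; only `snapshot()` assumes a Linux `/proc`):

```python
from pathlib import Path

def parse_loadavg(text: str) -> float:
    """1-minute load average from the first field of /proc/loadavg."""
    return float(text.split()[0])

def parse_meminfo(text: str) -> dict:
    """Extract MemTotal and MemFree (in kB) from /proc/meminfo-style text."""
    out = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "MemFree"):
            out[key] = int(rest.split()[0])
    return out

def snapshot() -> dict:
    """One metrics snapshot; Linux-only, since it reads /proc directly."""
    return {
        "load1": parse_loadavg(Path("/proc/loadavg").read_text()),
        **parse_meminfo(Path("/proc/meminfo").read_text()),
    }
```

A sandbox-friendly version of the task would ask for snapshots of this shape written as JSON lines, with `ps -e | wc -l` or `/proc` stat files replacing `tasklist` and `wmic`.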
| """Clean dead weight agents and fix champion flags in the Mobius registry.""" | ||
|
|
||
| import argparse | ||
| import json |
json is imported but never used in this script. Removing unused imports helps keep the script minimal and avoids implying JSON output is supported when it isn’t.
Suggested change: remove the unused `import json` line.
```python
sandbox_name = create_sandbox(
    image=self.config.sandbox_image,
    memory_limit=self.config.sandbox_memory_limit,
    network=self.config.sandbox_network,
)
```
create_sandbox() is called without any linkage to working_dir, so even if the container starts successfully, agent tool commands won’t have access to the host workspace unless the sandbox implementation bind-mounts it. Consider threading working_dir into sandbox creation so the swarm’s working directory is mounted at /workspace in the container.
Suggested change:

```python
    network=self.config.sandbox_network,
    working_dir=working_dir,
```
```python
from mobius.providers.google import GoogleProvider
from mobius.providers.openai import OpenAIProvider
from mobius.providers.openrouter import OpenRouterProvider
from mobius.providers.tools import _current_sandbox
```
Importing _current_sandbox (a private module global) couples the runner to provider internals and makes sandbox state harder to evolve safely. Prefer a small public API in providers.tools (e.g., is_sandbox_active() / get_sandbox()) rather than importing an underscored variable.
```python
return {
    "pass": bool(data["pass"]),
```
"pass": bool(data["pass"]) will treat non-empty strings like "false" as True, which can incorrectly mark a failed critique as passing. It’s safer to require a real JSON boolean, or explicitly handle string values ("true"/"false") during normalization.
Suggested change:

```python
raw_pass = data["pass"]
if isinstance(raw_pass, bool):
    normalized_pass = raw_pass
elif isinstance(raw_pass, str):
    lowered = raw_pass.strip().lower()
    if lowered == "true":
        normalized_pass = True
    elif lowered == "false":
        normalized_pass = False
    else:
        logger.warning(
            "Critique returned non-boolean 'pass' value: %r", raw_pass
        )
        return None
else:
    logger.warning(
        "Critique returned unsupported 'pass' type: %r (%s)",
        raw_pass,
        type(raw_pass).__name__,
    )
    return None
return {
    "pass": normalized_pass,
```
| "task": "Reverse-engineer the Mobius agent selection algorithm by reading the source code at C:/Users/aargo/Development/mobius. Then write a simulation at /tmp/mobius-task-selection/simulate.py that models 100 competitions with 10 agents starting at Elo 1000. Track how Elo ratings drift, identify whether the selection algorithm has any bias (does it favor certain agents unfairly?), and determine after how many rounds the rankings stabilize. Output a report with the Elo trajectory data and your analysis to report.txt. The simulation should use the same Elo update formula as the real code.", | ||
| "category": "explore-and-analyze", | ||
| "tier": 2, | ||
| "tools_required": ["Bash"], | ||
| "verification": "Judge checks: (1) simulation uses the actual Elo formula from Mobius source (not a generic one), (2) simulate.py runs and produces trajectory data, (3) report.txt contains analysis of bias and stabilization with supporting data, (4) at least one non-obvious insight about the selection algorithm", | ||
| "setup": "mkdir -p /tmp/mobius-task-selection" |
This task hard-codes C:/Users/aargo/Development/mobius for codebase inspection, which won’t exist in CI or sandboxed Linux runs. Consider changing it to a relative path (repo root) or /workspace so it’s portable.
|
Superseded: split into 3 focused PRs (evolve+judging, sandbox, tasks/skills) with all review comments addressed. |
Summary

- Parallel judging with `asyncio.gather()` for 3x speedup on cross-family judging (Opus + Gemini Pro + GPT-4o).
- `mobius run --sandbox` routes agent execution through disposable containers (no network, 512MB limit, Linux environment).
- New agentic competition tasks in `scripts/competition_tasks_agentic.json`.
- Gitignore additions (`.tree-workspace/`, experiment artifacts).

Test plan

- `mobius run "simple task"` works without sandbox
- `mobius run "simple task" --sandbox` works with Docker running
- `mobius evolve` produces gen-2 agents that compete successfully
- `python scripts/cleanup_agents.py --dry-run` shows correct candidates
- `pytest tests/ -v`

🤖 Generated with Claude Code