
Evolve fix, parallel judging, Docker sandbox, agentic tasks #10

Closed
AaronGoldsmith wants to merge 6 commits into main from fix/evolve-agentic-eval

Conversation

@AaronGoldsmith
Owner

Summary

  • Fix evolve self-critique loop: Clean feedback extraction, proper iteration, stricter parsing. First gen-2 agents created and proven competitive.
  • Parallelize judge panel: asyncio.gather() for 3x speedup on cross-family judging (Opus + Gemini Pro + GPT-4o).
  • Docker sandbox: mobius run --sandbox routes agent execution through disposable containers (no network, 512MB limit, Linux environment).
  • 15 agentic competition tasks: Multi-tier challenges (tool-heavy, reasoning, multi-agent) in scripts/competition_tasks_agentic.json.
  • Registry cleanup: Script to retire dead-weight agents and clear false champion flags.
  • Gitignore cleanup: Simplified to patterns only (.tree-workspace/, experiment artifacts).
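The parallel-judging pattern from the summary above can be sketched with `asyncio.gather()`. This is a minimal illustration, not the Mobius implementation: the judge names are just strings and `score_once` is a stand-in for a real per-judge API call.

```python
import asyncio

async def score_once(judge: str, transcript: str) -> dict:
    # Stand-in for a per-judge model API call.
    await asyncio.sleep(0.01)  # simulate network latency
    return {"judge": judge, "score": len(transcript) % 10}

async def judge_panel(transcript: str) -> list[dict]:
    judges = ["opus", "gemini-pro", "gpt-4o"]
    # asyncio.gather runs all judge calls concurrently, so wall-clock time
    # is bounded by the slowest judge rather than the sum of all three.
    return await asyncio.gather(*(score_once(j, transcript) for j in judges))

results = asyncio.run(judge_panel("example transcript"))
```

`gather` preserves input order, so results line up with the judge list regardless of which call finishes first.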

Test plan

  • mobius run "simple task" works without sandbox
  • mobius run "simple task" --sandbox works with Docker running
  • mobius evolve produces gen-2 agents that compete successfully
  • Judge panel runs all 3 judges in parallel (check timing)
  • python scripts/cleanup_agents.py --dry-run shows correct candidates
  • Existing tests pass: pytest tests/ -v

🤖 Generated with Claude Code

AaronGoldsmith and others added 6 commits March 15, 2026 22:37
…stricter parsing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t artifacts

- Replace sequential judge evaluation with asyncio.gather (3x latency improvement)
- Add cleanup script for dead weight agents and false champion flags
- Add 25 diverse competition tasks for mass match testing
- Add mobius-evolve skill, tree-solve skill, and custom agent definitions
- Gitignore .tree-workspace/, runtime artifacts, and experiment files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New .claude/agents/competition-tasks.md for generating tool-heavy tasks
- 15 tasks across 3 tiers: tool-heavy, agentic reasoning, multi-agent
- Stress-tested for Windows/Git Bash, 10-turn limit, 30s timeouts
- Tier 3 tasks include future multi-agent collaboration notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Agents run inside disposable python:3.12-slim containers (no network, 512MB limit)
- Enable with `mobius run --sandbox` or MOBIUS_SANDBOX=true env var
- Container lifecycle managed per competition (create → exec → destroy)
- Agent context auto-detects sandboxed environment (Linux/workspace)
- Minimal overhead: warm container reused via docker exec (~50ms per command)
- Configurable: image, memory limit, network access via MobiusConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
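The create → exec → destroy lifecycle described in the commit above can be sketched with plain `docker` CLI calls. This is an illustrative outline, not the Mobius implementation: the container naming scheme, the `sleep infinity` keepalive, and the helper names are assumptions.

```python
import subprocess
import uuid

def create_sandbox(image: str = "python:3.12-slim", memory: str = "512m") -> str:
    # Start a long-lived idle container; commands are injected later via docker exec.
    name = f"mobius-sandbox-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["docker", "run", "-d", "--name", name, "--network", "none",
         "--memory", memory, image, "sleep", "infinity"],
        check=True, capture_output=True, text=True, timeout=60,
    )
    return name

def exec_in_sandbox(name: str, command: str, timeout: int = 30) -> str:
    # Reusing the warm container keeps per-command overhead low.
    result = subprocess.run(
        ["docker", "exec", name, "bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout

def destroy_sandbox(name: str) -> None:
    # Force-remove tears down the container and its filesystem in one call.
    subprocess.run(["docker", "rm", "-f", name],
                   capture_output=True, text=True, timeout=30)
```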
Copilot AI review requested due to automatic review settings March 21, 2026 15:28

Copilot AI left a comment


Pull request overview

This PR upgrades Mobius’ competition loop with safer agent execution, faster multi-model judging, improved agent evolution mechanics, and new task packs/scripts to support more agentic evaluations.

Changes:

  • Add Docker sandbox mode for agent tool execution (mobius run --sandbox / MOBIUS_SANDBOX=true).
  • Parallelize cross-family judge evaluation using asyncio.gather().
  • Rework mobius evolve to target underperformers and add an evaluator–optimizer (self-critique) loop; add new competition task JSON files and a registry cleanup script.
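The evaluator-optimizer (self-critique) loop mentioned above can be sketched as a simple refine-until-pass iteration. The function shape below is a hedged illustration; the actual `critique_refinement()` in `agent_builder.py` may differ.

```python
def critique_refinement(draft: str, critique_fn, refine_fn, max_iters: int = 3) -> str:
    # Evaluator-optimizer loop: critique the draft, refine on failure,
    # and stop as soon as the critic passes it (or iterations run out).
    for _ in range(max_iters):
        verdict = critique_fn(draft)
        if verdict["pass"]:
            break
        draft = refine_fn(draft, verdict["feedback"])
    return draft
```

Bounding the loop with `max_iters` guards against a critic that never passes.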

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| src/mobius/providers/tools.py | Introduces sandbox container lifecycle and routes tool commands through docker exec. |
| src/mobius/orchestrator.py | Creates/destroys sandbox per competition and sets global sandbox state. |
| src/mobius/runner.py | Adjusts environment context banner when sandbox is active. |
| src/mobius/config.py | Adds sandbox configuration options and env var wiring. |
| src/mobius/cli.py | Adds --sandbox flag to mobius run; enhances mobius evolve options/flow. |
| src/mobius/judge.py | Runs judge models in parallel and handles per-judge shuffle metadata. |
| src/mobius/agent_builder.py | Adds critique_refinement() for evolve self-critique loop. |
| src/mobius/db.py | Removes prior migration that injected "Bash" into agent tools. |
| scripts/competition_tasks_agentic.json | Adds 15 tool-heavy, multi-tier agentic competition tasks. |
| scripts/competition_tasks.json | Adds a baseline set of competition tasks. |
| scripts/cleanup_agents.py | Adds a registry cleanup utility for zero-match agents/champion flags. |
| CLAUDE.md | Documents sandbox mode usage and configuration. |
| .gitignore | Adds patterns for tree workspace and other artifacts. |
| .claude/skills/tree-solve/SKILL.md | Adds "tree-solve" skill documentation. |
| .claude/skills/mobius-evolve/SKILL.md | Adds a local Opus evolve workflow skill doc. |
| .claude/skills/mobius-evolve/scripts/load_underperformers.py | Adds helper script to load underperformers and feedback for evolution. |
| .claude/agents/tree-solver.md | Adds recursive tree-solver agent definition. |
| .claude/agents/depth-test.md | Adds a minimal recursion test agent definition. |
| .claude/agents/competition-tasks.md | Adds an agent definition for generating competition tasks. |


Comment on lines +50 to +53
```python
subprocess.run(
    ["docker", "start", name],
    capture_output=True, text=True, timeout=10,
)
```

Copilot AI Mar 21, 2026


docker start is executed but its return code is ignored. If the container fails to start, _active_containers is still populated and later docker exec calls will fail in confusing ways. It’d be safer to check the docker start result (and raise / clean up) before marking the sandbox active.

Suggested change

```python
start_result = subprocess.run(
    ["docker", "start", name],
    capture_output=True, text=True, timeout=10,
)
if start_result.returncode != 0:
    # Best-effort cleanup of the created container if it fails to start
    try:
        subprocess.run(
            ["docker", "rm", "-f", name],
            capture_output=True, text=True, timeout=15,
        )
    except Exception as e:
        logger.warning("Failed to clean up sandbox %s after start error: %s", name, e)
    raise RuntimeError(f"Failed to start sandbox: {start_result.stderr.strip()}")
```

Comment on lines +102 to +115
```python
sandbox = sandbox or _current_sandbox
try:
    if sandbox and sandbox in _active_containers:
        result = subprocess.run(
            ["docker", "exec", sandbox, "bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
            encoding="utf-8", errors="replace",
        )
    else:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout, cwd=working_dir or os.getcwd(),
            encoding="utf-8", errors="replace",
        )
```

Copilot AI Mar 21, 2026


If sandbox is set but the name is not present in _active_containers, run_command silently falls back to running the command on the host. That breaks the isolation guarantee of --sandbox and can lead to untrusted agent commands executing on the host. Prefer failing closed here (return an explicit error / raise) when sandbox was requested but is unavailable.

Comment on lines +34 to +40
```json
{
  "task": "Write a Python script at /tmp/mobius-task-gitanalyze/analyze.py that analyzes the Mobius git repository (at C:/Users/aargo/Development/mobius). It should produce a JSON report containing: (1) total commits per author, (2) lines of code per file extension, (3) the 5 most-changed files by commit count, (4) average commits per day over the last 30 days, (5) list of files that exist in the repo but have never been committed (untracked). Run the script and save the report to report.json. Verify the data by spot-checking at least 2 metrics against direct git commands.",
  "category": "explore-and-analyze",
  "tier": 1,
  "tools_required": ["Bash"],
  "verification": "Judge checks: (1) analyze.py runs without errors, (2) report.json contains all 5 sections with plausible data, (3) spot-check verification shows metrics match actual git output, (4) script uses subprocess to call git (not a library)",
  "setup": "mkdir -p /tmp/mobius-task-gitanalyze"
```

Copilot AI Mar 21, 2026


Several tasks hard-code a developer-specific absolute path (C:/Users/aargo/Development/mobius). That will fail on other machines/CI and is incompatible with the new Linux Docker sandbox (/workspace). Consider rewriting these tasks to reference the repo root relative to the working directory (or /workspace when sandboxed).
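One portable pattern along the lines of that comment is to resolve the repo root at runtime instead of hard-coding it. This is a sketch; the `MOBIUS_REPO` environment variable name is hypothetical, not an existing Mobius setting.

```python
import os
from pathlib import Path

def repo_root() -> Path:
    # Precedence: explicit env var, sandbox convention (/workspace), then cwd.
    env = os.environ.get("MOBIUS_REPO")
    if env:
        return Path(env)
    if Path("/workspace").is_dir():
        return Path("/workspace")
    return Path.cwd()
```

Tasks could then reference `repo_root()` (or the equivalent shell logic) and run unchanged on the developer machine, CI, and the Linux sandbox.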

Comment on lines +11 to +16
```json
"task": "Write a Bash script at /tmp/mobius-task-monitor/sysmon.sh that collects system metrics on Windows (CPU via 'wmic cpu get loadpercentage', memory via 'wmic OS get FreePhysicalMemory,TotalVisibleMemorySize', disk usage via 'df' or 'wmic logicaldisk', process count via 'tasklist | wc -l') every 2 seconds for 10 seconds, writes each snapshot as a JSON line to metrics.jsonl, then generates a summary report (min/max/avg for each metric) to summary.txt. The script must handle missing commands gracefully (try wmic first, fall back to PowerShell 'Get-Process | Measure' or similar if unavailable). Run it and verify the output files are correct and well-formed JSON.",
"category": "infrastructure",
"tier": 1,
"tools_required": ["Bash"],
"verification": "Judge checks: (1) sysmon.sh runs without errors, (2) metrics.jsonl has 5+ valid JSON lines, (3) summary.txt has correct min/max/avg calculations, (4) script handles at least one missing-tool fallback",
"setup": "mkdir -p /tmp/mobius-task-monitor"
```

Copilot AI Mar 21, 2026


This task is Windows-specific (wmic/tasklist/PowerShell) but the PR introduces a Linux Docker sandbox mode by default. If these tasks are meant to be runnable under --sandbox, they should be rewritten for Linux tooling; otherwise consider tagging them as host-only / non-sandbox tasks so they don’t fail systematically.

Comment thread scripts/cleanup_agents.py
```python
"""Clean dead weight agents and fix champion flags in the Mobius registry."""

import argparse
import json
```

Copilot AI Mar 21, 2026


json is imported but never used in this script. Removing unused imports helps keep the script minimal and avoids implying JSON output is supported when it isn’t.

Suggested change

```diff
-import json
```

```python
sandbox_name = create_sandbox(
    image=self.config.sandbox_image,
    memory_limit=self.config.sandbox_memory_limit,
    network=self.config.sandbox_network,
```

Copilot AI Mar 21, 2026


create_sandbox() is called without any linkage to working_dir, so even if the container starts successfully, agent tool commands won’t have access to the host workspace unless the sandbox implementation bind-mounts it. Consider threading working_dir into sandbox creation so the swarm’s working directory is mounted at /workspace in the container.

Suggested change

```diff
 network=self.config.sandbox_network,
+working_dir=working_dir,
```

Comment thread src/mobius/runner.py
```python
from mobius.providers.google import GoogleProvider
from mobius.providers.openai import OpenAIProvider
from mobius.providers.openrouter import OpenRouterProvider
from mobius.providers.tools import _current_sandbox
```

Copilot AI Mar 21, 2026


Importing _current_sandbox (a private module global) couples the runner to provider internals and makes sandbox state harder to evolve safely. Prefer a small public API in providers.tools (e.g., is_sandbox_active() / get_sandbox()) rather than importing an underscored variable.

Comment on lines +351 to +352
```python
return {
    "pass": bool(data["pass"]),
```

Copilot AI Mar 21, 2026


"pass": bool(data["pass"]) will treat non-empty strings like "false" as True, which can incorrectly mark a failed critique as passing. It’s safer to require a real JSON boolean, or explicitly handle string values ("true"/"false") during normalization.

Suggested change

```python
raw_pass = data["pass"]
if isinstance(raw_pass, bool):
    normalized_pass = raw_pass
elif isinstance(raw_pass, str):
    lowered = raw_pass.strip().lower()
    if lowered == "true":
        normalized_pass = True
    elif lowered == "false":
        normalized_pass = False
    else:
        logger.warning(
            "Critique returned non-boolean 'pass' value: %r", raw_pass
        )
        return None
else:
    logger.warning(
        "Critique returned unsupported 'pass' type: %r (%s)",
        raw_pass,
        type(raw_pass).__name__,
    )
    return None
return {
    "pass": normalized_pass,
```

Comment on lines +75 to +80
```json
"task": "Reverse-engineer the Mobius agent selection algorithm by reading the source code at C:/Users/aargo/Development/mobius. Then write a simulation at /tmp/mobius-task-selection/simulate.py that models 100 competitions with 10 agents starting at Elo 1000. Track how Elo ratings drift, identify whether the selection algorithm has any bias (does it favor certain agents unfairly?), and determine after how many rounds the rankings stabilize. Output a report with the Elo trajectory data and your analysis to report.txt. The simulation should use the same Elo update formula as the real code.",
"category": "explore-and-analyze",
"tier": 2,
"tools_required": ["Bash"],
"verification": "Judge checks: (1) simulation uses the actual Elo formula from Mobius source (not a generic one), (2) simulate.py runs and produces trajectory data, (3) report.txt contains analysis of bias and stabilization with supporting data, (4) at least one non-obvious insight about the selection algorithm",
"setup": "mkdir -p /tmp/mobius-task-selection"
```

Copilot AI Mar 21, 2026


This task hard-codes C:/Users/aargo/Development/mobius for codebase inspection, which won’t exist in CI or sandboxed Linux runs. Consider changing it to a relative path (repo root) or /workspace so it’s portable.

Comment thread src/mobius/db.py
@AaronGoldsmith
Owner Author

Superseded: split into 3 focused PRs (evolve+judging, sandbox, tasks/skills) with all review comments addressed.
