
Evolve fix, parallel judging, Docker sandbox, agentic tasks #10

Closed
AaronGoldsmith wants to merge 6 commits into main from fix/evolve-agentic-eval

Conversation

@AaronGoldsmith
Owner

Summary

  • Fix evolve self-critique loop: Clean feedback extraction, proper iteration, stricter parsing. First gen-2 agents created and proven competitive.
  • Parallelize judge panel: asyncio.gather() for 3x speedup on cross-family judging (Opus + Gemini Pro + GPT-4o).
  • Docker sandbox: mobius run --sandbox routes agent execution through disposable containers (no network, 512MB limit, Linux environment).
  • 15 agentic competition tasks: Multi-tier challenges (tool-heavy, reasoning, multi-agent) in scripts/competition_tasks_agentic.json.
  • Registry cleanup: Script to retire dead-weight agents and clear false champion flags.
  • Gitignore cleanup: Simplified to patterns only (.tree-workspace/, experiment artifacts).
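The parallel-judging pattern from the summary above can be sketched with `asyncio.gather()`. This is a minimal illustration, not the Mobius implementation: the judge names are just strings and `score_once` is a stand-in for a real per-judge API call.

```python
import asyncio

async def score_once(judge: str, transcript: str) -> dict:
    # Stand-in for a per-judge model API call.
    await asyncio.sleep(0.01)  # simulate network latency
    return {"judge": judge, "score": len(transcript) % 10}

async def judge_panel(transcript: str) -> list[dict]:
    judges = ["opus", "gemini-pro", "gpt-4o"]
    # asyncio.gather runs all judge calls concurrently, so wall-clock time
    # is bounded by the slowest judge rather than the sum of all three.
    return await asyncio.gather(*(score_once(j, transcript) for j in judges))

results = asyncio.run(judge_panel("example transcript"))
```

`gather` preserves input order, so results line up with the judge list regardless of which call finishes first.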

Test plan

  • mobius run "simple task" works without sandbox
  • mobius run "simple task" --sandbox works with Docker running
  • mobius evolve produces gen-2 agents that compete successfully
  • Judge panel runs all 3 judges in parallel (check timing)
  • python scripts/cleanup_agents.py --dry-run shows correct candidates
  • Existing tests pass: pytest tests/ -v

🤖 Generated with Claude Code

AaronGoldsmith and others added 6 commits March 15, 2026 22:37
…stricter parsing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t artifacts

- Replace sequential judge evaluation with asyncio.gather (3x latency improvement)
- Add cleanup script for dead weight agents and false champion flags
- Add 25 diverse competition tasks for mass match testing
- Add mobius-evolve skill, tree-solve skill, and custom agent definitions
- Gitignore .tree-workspace/, runtime artifacts, and experiment files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New .claude/agents/competition-tasks.md for generating tool-heavy tasks
- 15 tasks across 3 tiers: tool-heavy, agentic reasoning, multi-agent
- Stress-tested for Windows/Git Bash, 10-turn limit, 30s timeouts
- Tier 3 tasks include future multi-agent collaboration notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Agents run inside disposable python:3.12-slim containers (no network, 512MB limit)
- Enable with `mobius run --sandbox` or MOBIUS_SANDBOX=true env var
- Container lifecycle managed per competition (create → exec → destroy)
- Agent context auto-detects sandboxed environment (Linux/workspace)
- Minimal overhead: warm container reused via docker exec (~50ms per command)
- Configurable: image, memory limit, network access via MobiusConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
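The create → exec → destroy lifecycle described in the commit above can be sketched with plain `docker` CLI calls. This is an illustrative outline, not the Mobius implementation: the container naming scheme, the `sleep infinity` keepalive, and the helper names are assumptions.

```python
import subprocess
import uuid

def create_sandbox(image: str = "python:3.12-slim", memory: str = "512m") -> str:
    # Start a long-lived idle container; commands are injected later via docker exec.
    name = f"mobius-sandbox-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["docker", "run", "-d", "--name", name, "--network", "none",
         "--memory", memory, image, "sleep", "infinity"],
        check=True, capture_output=True, text=True, timeout=60,
    )
    return name

def exec_in_sandbox(name: str, command: str, timeout: int = 30) -> str:
    # Reusing the warm container keeps per-command overhead low.
    result = subprocess.run(
        ["docker", "exec", name, "bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout

def destroy_sandbox(name: str) -> None:
    # Force-remove tears down the container and its filesystem in one call.
    subprocess.run(["docker", "rm", "-f", name],
                   capture_output=True, text=True, timeout=30)
```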
Copilot AI review requested due to automatic review settings March 21, 2026 15:28

Copilot AI left a comment


Pull request overview

This PR upgrades Mobius’ competition loop with safer agent execution, faster multi-model judging, improved agent evolution mechanics, and new task packs/scripts to support more agentic evaluations.

Changes:

  • Add Docker sandbox mode for agent tool execution (mobius run --sandbox / MOBIUS_SANDBOX=true).
  • Parallelize cross-family judge evaluation using asyncio.gather().
  • Rework mobius evolve to target underperformers and add an evaluator–optimizer (self-critique) loop; add new competition task JSON files and a registry cleanup script.
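The evaluator-optimizer (self-critique) loop mentioned above can be sketched as a simple refine-until-pass iteration. The function shape below is a hedged illustration; the actual `critique_refinement()` in `agent_builder.py` may differ.

```python
def critique_refinement(draft: str, critique_fn, refine_fn, max_iters: int = 3) -> str:
    # Evaluator-optimizer loop: critique the draft, refine on failure,
    # and stop as soon as the critic passes it (or iterations run out).
    for _ in range(max_iters):
        verdict = critique_fn(draft)
        if verdict["pass"]:
            break
        draft = refine_fn(draft, verdict["feedback"])
    return draft
```

Bounding the loop with `max_iters` guards against a critic that never passes.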

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| src/mobius/providers/tools.py | Introduces sandbox container lifecycle and routes tool commands through docker exec. |
| src/mobius/orchestrator.py | Creates/destroys sandbox per competition and sets global sandbox state. |
| src/mobius/runner.py | Adjusts environment context banner when sandbox is active. |
| src/mobius/config.py | Adds sandbox configuration options and env var wiring. |
| src/mobius/cli.py | Adds --sandbox flag to mobius run; enhances mobius evolve options/flow. |
| src/mobius/judge.py | Runs judge models in parallel and handles per-judge shuffle metadata. |
| src/mobius/agent_builder.py | Adds critique_refinement() for evolve self-critique loop. |
| src/mobius/db.py | Removes prior migration that injected "Bash" into agent tools. |
| scripts/competition_tasks_agentic.json | Adds 15 tool-heavy, multi-tier agentic competition tasks. |
| scripts/competition_tasks.json | Adds a baseline set of competition tasks. |
| scripts/cleanup_agents.py | Adds a registry cleanup utility for zero-match agents/champion flags. |
| CLAUDE.md | Documents sandbox mode usage and configuration. |
| .gitignore | Adds patterns for tree workspace and other artifacts. |
| .claude/skills/tree-solve/SKILL.md | Adds "tree-solve" skill documentation. |
| .claude/skills/mobius-evolve/SKILL.md | Adds a local Opus evolve workflow skill doc. |
| .claude/skills/mobius-evolve/scripts/load_underperformers.py | Adds helper script to load underperformers and feedback for evolution. |
| .claude/agents/tree-solver.md | Adds recursive tree-solver agent definition. |
| .claude/agents/depth-test.md | Adds a minimal recursion test agent definition. |
| .claude/agents/competition-tasks.md | Adds an agent definition for generating competition tasks. |


Comment on lines +50 to +53
```python
subprocess.run(
    ["docker", "start", name],
    capture_output=True, text=True, timeout=10,
)
```

Copilot AI Mar 21, 2026


docker start is executed but its return code is ignored. If the container fails to start, _active_containers is still populated and later docker exec calls will fail in confusing ways. It’d be safer to check the docker start result (and raise / clean up) before marking the sandbox active.

Suggested change

```python
start_result = subprocess.run(
    ["docker", "start", name],
    capture_output=True, text=True, timeout=10,
)
if start_result.returncode != 0:
    # Best-effort cleanup of the created container if it fails to start
    try:
        subprocess.run(
            ["docker", "rm", "-f", name],
            capture_output=True, text=True, timeout=15,
        )
    except Exception as e:
        logger.warning("Failed to clean up sandbox %s after start error: %s", name, e)
    raise RuntimeError(f"Failed to start sandbox: {start_result.stderr.strip()}")
```

Comment on lines +102 to +115
```python
sandbox = sandbox or _current_sandbox
try:
    if sandbox and sandbox in _active_containers:
        result = subprocess.run(
            ["docker", "exec", sandbox, "bash", "-c", command],
            capture_output=True, text=True, timeout=timeout,
            encoding="utf-8", errors="replace",
        )
    else:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout, cwd=working_dir or os.getcwd(),
            encoding="utf-8", errors="replace",
        )
```

Copilot AI Mar 21, 2026


If sandbox is set but the name is not present in _active_containers, run_command silently falls back to running the command on the host. That breaks the isolation guarantee of --sandbox and can lead to untrusted agent commands executing on the host. Prefer failing closed here (return an explicit error / raise) when sandbox was requested but is unavailable.

Comment on lines +34 to +40
```json
{
  "task": "Write a Python script at /tmp/mobius-task-gitanalyze/analyze.py that analyzes the Mobius git repository (at C:/Users/aargo/Development/mobius). It should produce a JSON report containing: (1) total commits per author, (2) lines of code per file extension, (3) the 5 most-changed files by commit count, (4) average commits per day over the last 30 days, (5) list of files that exist in the repo but have never been committed (untracked). Run the script and save the report to report.json. Verify the data by spot-checking at least 2 metrics against direct git commands.",
  "category": "explore-and-analyze",
  "tier": 1,
  "tools_required": ["Bash"],
  "verification": "Judge checks: (1) analyze.py runs without errors, (2) report.json contains all 5 sections with plausible data, (3) spot-check verification shows metrics match actual git output, (4) script uses subprocess to call git (not a library)",
  "setup": "mkdir -p /tmp/mobius-task-gitanalyze"
```

Copilot AI Mar 21, 2026


Several tasks hard-code a developer-specific absolute path (C:/Users/aargo/Development/mobius). That will fail on other machines/CI and is incompatible with the new Linux Docker sandbox (/workspace). Consider rewriting these tasks to reference the repo root relative to the working directory (or /workspace when sandboxed).
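One portable pattern along the lines of that comment is to resolve the repo root at runtime instead of hard-coding it. This is a sketch; the `MOBIUS_REPO` environment variable name is hypothetical, not an existing Mobius setting.

```python
import os
from pathlib import Path

def repo_root() -> Path:
    # Precedence: explicit env var, sandbox convention (/workspace), then cwd.
    env = os.environ.get("MOBIUS_REPO")
    if env:
        return Path(env)
    if Path("/workspace").is_dir():
        return Path("/workspace")
    return Path.cwd()
```

Tasks could then reference `repo_root()` (or the equivalent shell logic) and run unchanged on the developer machine, CI, and the Linux sandbox.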

Comment on lines +11 to +16
```json
"task": "Write a Bash script at /tmp/mobius-task-monitor/sysmon.sh that collects system metrics on Windows (CPU via 'wmic cpu get loadpercentage', memory via 'wmic OS get FreePhysicalMemory,TotalVisibleMemorySize', disk usage via 'df' or 'wmic logicaldisk', process count via 'tasklist | wc -l') every 2 seconds for 10 seconds, writes each snapshot as a JSON line to metrics.jsonl, then generates a summary report (min/max/avg for each metric) to summary.txt. The script must handle missing commands gracefully (try wmic first, fall back to PowerShell 'Get-Process | Measure' or similar if unavailable). Run it and verify the output files are correct and well-formed JSON.",
"category": "infrastructure",
"tier": 1,
"tools_required": ["Bash"],
"verification": "Judge checks: (1) sysmon.sh runs without errors, (2) metrics.jsonl has 5+ valid JSON lines, (3) summary.txt has correct min/max/avg calculations, (4) script handles at least one missing-tool fallback",
"setup": "mkdir -p /tmp/mobius-task-monitor"
```

Copilot AI Mar 21, 2026


This task is Windows-specific (wmic/tasklist/PowerShell) but the PR introduces a Linux Docker sandbox mode by default. If these tasks are meant to be runnable under --sandbox, they should be rewritten for Linux tooling; otherwise consider tagging them as host-only / non-sandbox tasks so they don’t fail systematically.

Comment thread scripts/cleanup_agents.py
```python
"""Clean dead weight agents and fix champion flags in the Mobius registry."""

import argparse
import json
```

Copilot AI Mar 21, 2026


json is imported but never used in this script. Removing unused imports helps keep the script minimal and avoids implying JSON output is supported when it isn’t.

Suggested change

```diff
-import json
```

```python
sandbox_name = create_sandbox(
    image=self.config.sandbox_image,
    memory_limit=self.config.sandbox_memory_limit,
    network=self.config.sandbox_network,
```

Copilot AI Mar 21, 2026


create_sandbox() is called without any linkage to working_dir, so even if the container starts successfully, agent tool commands won’t have access to the host workspace unless the sandbox implementation bind-mounts it. Consider threading working_dir into sandbox creation so the swarm’s working directory is mounted at /workspace in the container.

Suggested change

```diff
 network=self.config.sandbox_network,
+working_dir=working_dir,
```

Comment thread src/mobius/runner.py
```python
from mobius.providers.google import GoogleProvider
from mobius.providers.openai import OpenAIProvider
from mobius.providers.openrouter import OpenRouterProvider
from mobius.providers.tools import _current_sandbox
```

Copilot AI Mar 21, 2026


Importing _current_sandbox (a private module global) couples the runner to provider internals and makes sandbox state harder to evolve safely. Prefer a small public API in providers.tools (e.g., is_sandbox_active() / get_sandbox()) rather than importing an underscored variable.

Comment on lines +351 to +352
```python
return {
    "pass": bool(data["pass"]),
```

Copilot AI Mar 21, 2026


"pass": bool(data["pass"]) will treat non-empty strings like "false" as True, which can incorrectly mark a failed critique as passing. It’s safer to require a real JSON boolean, or explicitly handle string values ("true"/"false") during normalization.

Suggested change

```python
raw_pass = data["pass"]
if isinstance(raw_pass, bool):
    normalized_pass = raw_pass
elif isinstance(raw_pass, str):
    lowered = raw_pass.strip().lower()
    if lowered == "true":
        normalized_pass = True
    elif lowered == "false":
        normalized_pass = False
    else:
        logger.warning(
            "Critique returned non-boolean 'pass' value: %r", raw_pass
        )
        return None
else:
    logger.warning(
        "Critique returned unsupported 'pass' type: %r (%s)",
        raw_pass,
        type(raw_pass).__name__,
    )
    return None
return {
    "pass": normalized_pass,
```

Comment on lines +75 to +80
```json
"task": "Reverse-engineer the Mobius agent selection algorithm by reading the source code at C:/Users/aargo/Development/mobius. Then write a simulation at /tmp/mobius-task-selection/simulate.py that models 100 competitions with 10 agents starting at Elo 1000. Track how Elo ratings drift, identify whether the selection algorithm has any bias (does it favor certain agents unfairly?), and determine after how many rounds the rankings stabilize. Output a report with the Elo trajectory data and your analysis to report.txt. The simulation should use the same Elo update formula as the real code.",
"category": "explore-and-analyze",
"tier": 2,
"tools_required": ["Bash"],
"verification": "Judge checks: (1) simulation uses the actual Elo formula from Mobius source (not a generic one), (2) simulate.py runs and produces trajectory data, (3) report.txt contains analysis of bias and stabilization with supporting data, (4) at least one non-obvious insight about the selection algorithm",
"setup": "mkdir -p /tmp/mobius-task-selection"
```

Copilot AI Mar 21, 2026


This task hard-codes C:/Users/aargo/Development/mobius for codebase inspection, which won’t exist in CI or sandboxed Linux runs. Consider changing it to a relative path (repo root) or /workspace so it’s portable.

Comment thread src/mobius/db.py
@AaronGoldsmith
Owner Author

Superseded: split into 3 focused PRs (evolve+judging, sandbox, tasks/skills) with all review comments addressed.
