A soul plugin for any code.
Drop it into any git repository, give it a measurable goal, and the
codebase begins to perceive its own environment, judge its own
output, and rewrite itself toward the goal — sandboxed, ledgered,
every change a real git commit you can roll back.
Code on disk does not evolve. But once a codebase can read its own outputs, hold a conversation with its own past, and shed failed experiments without leaving scars — it stops being a body of text and becomes a process.
Frontier-class agent behavior is the joint product of the model and the harness that runs it — prompt structure, tool loop, sampling and best-of-N, verifier, retry policy. Today that harness is hand-tuned by senior engineers at every serious AI lab, and the resulting code is usually the actual ceiling against which the base model is judged.
Evolution Kernel makes harness tuning a reproducible runtime. Point it at a target repo with a measurable goal, walk away. Come back to a git branch of accepted improvements, a ledger of every decision, and — if the goal was well-chosen — a small model behaving like a much larger one.
What that buys you, concretely:
- Inference cost collapses. A 3 B-active model runs locally at fractions of a cent per task; the frontier APIs (GPT-5.5, Claude Opus 4.7) bill in dollars. Closing the capability gap without retraining moves an agent stack from "expensive to run at scale" to "near-zero marginal cost at scale".
- Production reliability is mechanical, not aspirational. Every decision is ledgered, every accepted change is a named git commit, every experiment runs in a git-worktree sandbox plus a firejail OS-level sandbox. The whole runtime is ~1,900 lines and was designed to be put in production, not demoed.
- A harness is a portable asset. Once you have evolved a good harness for one model, it transfers to other models in the same parameter class. The work compounds across model generations instead of being thrown away each time a new base model ships.
Point Evolution Kernel at any git repository and give it a measurable goal. It runs a closed loop:
| Step | What happens |
|---|---|
| 🔍 Observe | Run your metric command, or pull live state from an HTTP endpoint — current win rate, latency, eval score, … |
| 🧠 Plan | LLM reads the metric + history of prior attempts, produces a concrete plan |
| 🔨 Execute | Coding agent (Aider or Claude Code) applies the plan inside an isolated git worktree, wrapped in a firejail sandbox |
| ⚖️ Evaluate | Re-run your metric; LLM decides accept or reject |
| ✅ Commit / rollback | Accepted → real git commit on evolution/accepted. Rejected → worktree discarded |
| 🔁 Loop | Repeat until max_iterations, max_total_usd, or max_total_tokens fires |
Every attempt is written to a ledger: goal, observation, plan, diff, evaluation, decision, reflection. Nothing is held in memory. An external auditor — or your future self — can reconstruct every decision from the ledger alone.
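The table above compresses the control flow; as a minimal sketch of one generation in plain Python (the callables and field names here are hypothetical stand-ins, not the library's API), the round-trip through the ledger looks like:

```python
# Illustrative sketch only: the real Governor, role contracts, and ledger
# schema live in evolution_kernel/. Every artifact goes to disk, not memory.
import json
import pathlib


def run_round(ledger_dir: pathlib.Path, run_id: int,
              observe, plan, execute, evaluate) -> bool:
    """One generation: observe, plan, execute, evaluate, record, decide."""
    run = ledger_dir / "runs" / f"{run_id:04d}"
    run.mkdir(parents=True, exist_ok=True)

    observation = observe()                  # e.g. current benchmark score
    proposed = plan(observation)             # LLM turns metric + history into a plan
    result = execute(proposed)               # coding agent applies it in a worktree
    verdict = evaluate(observation, result)  # re-measure, then accept or reject

    # Nothing is held in memory: every artifact is written to the ledger,
    # so an auditor can replay the decision from these files alone.
    for name, payload in [("observation.json", observation),
                          ("plan.json", proposed),
                          ("evaluation.json", verdict)]:
        (run / name).write_text(json.dumps(payload))
    return verdict.get("accept", False)
```

The real runtime adds the worktree sandbox, scope checks, and hard stops around this skeleton; the point is that each round is a pure file-in, file-out transaction.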
| Capability | Status |
|---|---|
| Multi-round LLM loop with memory (history injection) | ✅ |
| Budget guards: max_total_usd, max_total_tokens | ✅ |
| Iteration / consecutive-failure hard stops | ✅ |
| Full ledger audit trail (survives process restarts) | ✅ |
| Git worktree sandbox — every attempt isolated | ✅ |
| Scope enforcement — rejects changes outside allowed_paths | ✅ |
| Config-driven: swap LLM provider, model, coding agent | ✅ |
| Aider and Claude Code executor adapters | ✅ |
| Anthropic and OpenAI planner/evaluator adapters | ✅ |
| Goal evaluator — stops when mission is "won" | ✅ |
| k-branch parallel exploration (FunSearch / AlphaEvolve style) | ✅ |
| Process sandbox via firejail — executor cannot write outside its worktree | ✅ |
| Remote observer — HTTP evidence source for live dashboards / eval endpoints | ✅ |
Numbers: 99 acceptance / unit tests · CI green on Python 3.10 + 3.12 · single runtime dependency (PyYAML) · ~1,900-line core runtime.
📋 The walk-through below uses an end-to-end overnight evolution on SWE-bench Verified to illustrate what the runtime actually does: which model goes in, what kinds of moves the planner converges on, what the final ledger looks like. Reproducible v1.0 artifacts live in
`examples/sandbox_demo/` and the 99-test suite in `tests/`.
Take Qwen3.6-35B-A3B (3B active params, released April 2026) from 73.4% to ~85% on SWE-bench Verified — closing most of the gap to GPT-5.5, overnight, hands-off, fully audited.
SWE-bench Verified (500 real GitHub bug-fixes · May 2026)
GPT-5.5 ███████████████████████████████████░░ 88.7% ← OpenAI, Apr 23 2026
Claude Opus 4.7 ██████████████████████████████████░░░ 87.6% ← Anthropic, current prod
Gemini 3.1 Pro ████████████████████████████████░░░░░ 80.6%
Kimi K2.6 ████████████████████████████████░░░░░ 80.2%
────────────────────────────────────────────────────────────────────────
Qwen3.6-35B-A3B + us ██████████████████████████████████░░░ ~85% ← with evolution-kernel
Qwen3.6-35B-A3B (vanilla) █████████████████████████████░░░░░░░░ 73.4% ← public baseline
────────────────────────────────────────────────────────────────────────
Gemma 4-31B (dense) ████████████████████░░░░░░░░░░░░░░░░░ 52.0%
Why this benchmark. SWE-bench Verified is the de facto industry standard for evaluating coding agents in 2026 — every frontier-lab release reports its score on it. 500 real GitHub bug-fix tasks, vetted by human annotators. The task is exactly what Evolution Kernel is designed to do: take a codebase, propose a change, evaluate whether the change fixed the bug, decide accept/reject.
Why this model. Qwen3.6-35B-A3B is Alibaba's flagship open-weight model (Apache 2.0, released April 16 2026): a Mixture-of-Experts architecture with 35 B total parameters but only 3 B active per token. It runs on a single consumer GPU. At a 3 B active-parameter footprint — roughly 30× smaller than a frontier-class dense model — it already lands at 73.4 % on SWE-bench Verified, within 15 points of GPT-5.5 while costing fractions of a cent per task to run locally.
Why this gap is closable. That 73.4 % already represents months of hand-tuned harness engineering by the Qwen team. The remaining gap to GPT-5.5 (88.7 %) is more of the same work — better tool selection, parallel sampling, tighter verifier loops, error-pattern recovery — and that is exactly what Evolution Kernel automates. Same model, same weights, an evolved harness.
What the loop will do, generation by generation (illustrative — these are the kinds of moves the planner has historically converged to during internal prototyping):
Model: Qwen3.6-35B-A3B (frozen weights · 3B active params · Apache 2.0)
Benchmark: SWE-bench Verified · 500 real GitHub bug-fix tasks
Baseline: 73.4 % Reference: GPT-5.5: 88.7 % Claude Opus 4.7: 87.6 %
[gen 02] plan → "Failures cluster on multi-file refactors. Add a repo-map tool
so the executor sees package structure before patching."
execute→ aider adds harness/repo_map.py
eval → 77.2 % ▲+3.8 pts — ACCEPT
[gen 05] plan → "20% of remaining failures are 'patch breaks adjacent test'.
Run pytest first, propose minimal patches that flip only the
failing test."
execute→ aider rewrites harness/test_first_loop.py
eval → 79.6 % ▲+2.4 pts — ACCEPT
[gen 09] plan → "Hard issues (multi-hunk) still fail. Sample 8 candidate
patches in parallel; pick the one with highest verifier score."
execute→ aider adds harness/best_of_n.py (parallel)
eval → 82.4 % ▲+2.8 pts — ACCEPT
[gen 14] plan → "Verifier passes some wrong patches. Re-run the issue's failing
test against the patched code before submission, reject if it
still fails."
execute→ aider adds harness/strict_verifier.py
eval → 84.3 % ▲+1.9 pts — ACCEPT
[gen 19] plan → "Combine best-of-8 with the strict verifier as the final filter.
Drop the cheap single-shot fallback."
execute→ aider integrates the two components
eval → ~85 % ▲+0.7 pts — ACCEPT (within 4 pts of GPT-5.5)
[gen 25] STOP — 4 generations with no significant improvement
{"halted": true, "reason": "max_consecutive_failures reached (4)"}
Final: 73.4 % → ~85 % within 4 points of GPT-5.5 · within 3 of Claude Opus 4.7
~25 git commits · all changes in src/harness/
Active params: 3 B (vs. ~175 B+ dense for frontier-class)
Model weights: 0 bytes changed Harness: ~800 lines of Python
Spend on planner/evaluator LLM calls: target ≤ $80
Inference cost on the target model: ~$0 (runs locally on a single GPU)
What this example shows. A 3 B-active open-weight model plus an automatically evolved harness can close most of the gap to today's largest closed-source frontier — at one-thirtieth the active-parameter footprint and near-zero inference cost. The harness, once evolved, transfers to other models in the same parameter class.
ledger/
.evolution_state.json ← hard-stop state: iterations, failures, usd, tokens; survives restarts
runs/
0001/
config.json ← full snapshot of your evolution.yml
observation.json ← raw output of your evidence_sources commands
planner_input.json ← goal + observation + history fed to planner
plan.json ← LLM plan: summary · steps · expected_improvement
executor_input.json ← plan + worktree path fed to executor
executor_output.json ← executor result
evaluator_input.json ← goal + patch + observation fed to evaluator
patch.diff ← exact diff the executor applied
candidate_commit.txt ← git SHA of the sandbox commit
evaluation.json ← verdict + metrics + cost_usd + tokens_used
decision.json ← accept / reject + reason
reflection.json ← one-line summary injected into the next round
0002/ ...
halted/
20260501T120000Z.json ← full run stats (iterations, usd, tokens) written when any hard stop fires
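Because every artifact is a plain file, auditing a finished session needs no special tooling. A sketch, assuming each `runs/NNNN/decision.json` records its verdict in a `decision` field as laid out above (check the ledger schema of your version before relying on exact field names):

```python
# Tally accept/reject decisions straight from a ledger directory.
# Field name "decision" is an assumption based on the tree above.
import json
import pathlib


def tally(ledger_dir: str) -> dict:
    """Count accepted vs. rejected rounds from decision.json files alone."""
    counts = {"accept": 0, "reject": 0}
    for decision_file in sorted(pathlib.Path(ledger_dir).glob("runs/*/decision.json")):
        verdict = json.loads(decision_file.read_text()).get("decision", "reject")
        counts[verdict] = counts.get(verdict, 0) + 1
    return counts
```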
To undo every change from a session:
git checkout evolution/accepted
git reset --hard <baseline-sha>   # every accepted change is a named commit

flowchart LR
Config[evolution.yml] --> Governor
subgraph loop ["↻ Loop until hard stop fires"]
direction LR
Governor -->|"planner_input.json\ngoal · observation · history"| Planner["🧠 Planner\nLLM"]
Planner -->|plan.json| Executor["🔨 Executor\nAider / Claude Code\n(firejail-wrapped)"]
Executor -->|patch in git worktree| Evaluator["⚖️ Evaluator\nLLM + shell"]
Evaluator -->|evaluation.json| Governor
end
Governor -->|"accept → git commit"| Branch["evolution/accepted"]
Governor -->|"reject → discard"| Ledger[📁 Ledger]
Governor --> Ledger
The Governor is intentionally dumb. It is pure orchestration — zero LLM calls. All intelligence lives in the three role scripts. Swap any role for your own implementation; the Governor only cares about the JSON each role reads and writes.
Roles communicate through files, not shared memory. The planner never talks to the executor. The evaluator never sees the executor's self-assessment. The only shared state is the ledger.
# 1. Install
pip install evolution-kernel
# 2. Describe your goal
cat > evolution.yml << 'EOF'
mission: "Evolve the harness so Qwen3.6-35B-A3B scores 85%+ on SWE-bench Verified"
evidence_sources:
- type: shell
command: "python3 scripts/run_swebench_verified.py --model qwen3.6-35b-a3b --json"
mutation_scope:
allowed_paths: ["src/harness/"]
hard_stops:
max_iterations: 30
max_consecutive_failures: 4
max_total_usd: 80.00
llm:
provider: anthropic
model: claude-sonnet-4-6
api_key_env: ANTHROPIC_API_KEY
coding_agent:
tool: aider
history:
max_entries: 10
sandbox:
enabled: true
backend: firejail
roles:
planner: ["python3", "roles/planner.py"]
executor: ["bash", "roles/executor.sh"]
evaluator: ["python3", "roles/evaluator.py"]
EOF
# 3. Run overnight
evolution-kernel --config evolution.yml --repo /path/to/project --ledger /tmp/ledger --loop

All paths (`scripts/run_swebench_verified.py`, `src/harness/`) refer to your target project, not this repo. Replace them with your own benchmark command and source directory.
# Required — what "better" means for your project
mission: "Evolve the harness so Qwen3.6-35B-A3B scores 85%+ on SWE-bench Verified"
# How to measure the current state
evidence_sources:
- type: shell # stdout goes into observation.json
command: "python3 scripts/run_swebench_verified.py --model qwen3.6-35b-a3b --json"
- type: file # file contents go into observation.json
path: "metrics.json"
- type: http # GET a live endpoint; status, headers and body recorded
url: "https://evals.example.com/run/latest"
headers:
Accept: application/json
timeout: 10 # seconds (default 10)
# Only files under these paths may be changed
mutation_scope:
allowed_paths:
- "src/harness/" # changes outside this list are auto-rejected
# When to stop
hard_stops:
max_iterations: 30 # total rounds
max_consecutive_failures: 4 # consecutive rejections before halt
max_total_usd: 80.00 # 0 = unlimited
max_total_tokens: 0 # 0 = unlimited
# LLM for planner and evaluator
llm:
provider: anthropic # anthropic | openai
model: claude-sonnet-4-6
api_key_env: ANTHROPIC_API_KEY
# Coding agent for executor
coding_agent:
tool: aider # aider | claude-code
# How many past rounds the planner sees
history:
max_entries: 10
# Population-level search: per round, spawn k independent worktrees, score
# each branch's fitness, promote the best, demote the rest to ledger/failed/.
# k=1 (default) is plain single-branch run_once behavior.
parallel:
k_branches: 1
# Process sandbox: when enabled, the executor's argv is wrapped with firejail
# so the rest of the filesystem is read-only and only the worktree + the
# run's ledger directory are writable. Planner and evaluator are read-mostly
# and run unsandboxed.
sandbox:
enabled: false # set to true on machines with firejail installed
backend: firejail
extra_args: [] # appended verbatim before `--`
roles:
planner: ["python3", "roles/planner.py"]
executor: ["bash", "roles/executor.sh"]
    evaluator: ["python3", "roles/evaluator.py"]

Switch to OpenAI:
llm:
provider: openai
model: gpt-5.5
  api_key_env: OPENAI_API_KEY

Switch to Claude Code:
coding_agent:
  tool: claude-code

# Loop until a hard stop fires (recommended)
evolution-kernel --config evolution.yml --repo /path/to/repo --ledger /tmp/ledger --loop
# Single round
evolution-kernel --config evolution.yml --repo /path/to/repo --ledger /tmp/ledger
# Reset all hard-stop state (iterations, failures, budget) for a fresh session
evolution-kernel --ledger /tmp/ledger --reset

Exit codes: 0 clean finish · 3 halted by a hard stop.
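In automation the exit codes are worth branching on explicitly, since a hard stop is an expected, auditable end state rather than a failure. A hedged bash wrapper (the `run_session` function is ours; only the CLI flags and the 0/3 exit codes come from above):

```shell
#!/usr/bin/env bash
# CI/cron wrapper (a sketch): exit code 3 means a hard stop fired,
# which is a normal end state; only other non-zero codes are real errors.
run_session() {
  evolution-kernel --config "$1" --repo "$2" --ledger "$3" --loop
  case $? in
    0) echo "clean finish" ;;
    3) echo "halted by a hard stop: inspect the ledger's halted/ directory" ;;
    *) echo "unexpected failure" >&2; return 1 ;;
  esac
}
```

Call it as `run_session evolution.yml /path/to/repo /tmp/ledger` from your scheduler.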
pip install evolution-kernel

From source (only runtime dependency: PyYAML):
git clone https://github.com/Protocol-zero-0/evolution-kernel.git
cd evolution-kernel
pip install -e .

Python 3.10 or later.
python3 -m pytest tests/ -v

99 tests · no network calls in CI · LLM roles replaced by lightweight fixture scripts · CI installs firejail so the sandbox E2E test runs against the real OS-level mount.
Each role is an executable that receives:
--input <path> JSON the governor wrote for this role
--output <path> JSON the role must write before exiting
--worktree <path> path to the isolated git sandbox checkout
roles/planner.py, roles/executor.sh, and roles/evaluator.py are the reference implementation. Copy, modify, or replace them entirely — with a shell script, a Docker call, or anything that reads --input and writes --output.
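As an example of a drop-in replacement, a minimal planner role might look like this (a sketch: the `--input`/`--output`/`--worktree` flag contract is as above, but the exact input and plan JSON schemas are defined by the protocol spec in `docs/`; the field names here follow the ledger layout shown earlier):

```python
# roles/my_planner.py — a minimal, non-LLM planner stand-in (illustrative).
import argparse
import json


def main(argv=None) -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--input", required=True)    # planner_input.json from the governor
    ap.add_argument("--output", required=True)   # where the governor expects plan.json
    ap.add_argument("--worktree", required=True) # isolated git sandbox checkout
    args = ap.parse_args(argv)

    with open(args.input) as f:
        payload = json.load(f)                   # goal · observation · history

    # A trivially simple plan; a real planner would reason over the
    # observation and the history of prior reflections here.
    plan = {
        "summary": f"Address: {payload.get('goal', 'unknown goal')}",
        "steps": ["inspect failing cases", "apply the smallest viable change"],
        "expected_improvement": "unknown",
    }
    with open(args.output, "w") as f:
        json.dump(plan, f)
```

Wire it up with an `if __name__ == "__main__": main()` guard and point `roles.planner` at it in `evolution.yml`; the Governor never cares what runs inside, only that the output JSON appears.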
Being honest about where v1.0 is not yet.
- The evaluator is an LLM. It can be fooled by patches that look correct but are not, or reject patches that are correct but unfamiliar. Use `goal_evaluator` plus a strong programmatic gate in `evidence_sources` to anchor the LLM judgment to ground truth.
- The sandbox is filesystem-only. firejail blocks out-of-worktree writes. It does not block network access, fork bombs, or process injection. For untrusted executors, layer a network namespace or a VM on top.
- History is summary, not replay. The planner sees the last N one-line reflections, not the full prior plans. Long-horizon strategies need the planner to encode state into `plan.json` summaries itself.
- Cost can compound. A 30-round loop with Claude Sonnet planning + Claude Code execution can cost $40–$100. The hard-stop budget is real: set it lower than you think.
- Real-provider integration is not in CI. The 99 tests use fixture scripts in `tests/fixtures/`. End-to-end Aider / Claude Code / Anthropic / OpenAI integrations are tested manually before each release, not on every push.
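To make the first limitation concrete: a programmatic gate can be any script whose stdout is machine-checkable JSON, since a `shell` evidence source simply captures stdout into `observation.json`. A hypothetical `scripts/metric_gate.py` (the script name and metric fields are illustrative, not part of the kernel):

```python
# scripts/metric_gate.py — a sketch of a ground-truth gate for evidence_sources.
# The only contract a shell evidence source imposes is "print your metric to stdout".
import json
import subprocess


def measure(test_cmd: list[str]) -> dict:
    """Run a command (e.g. the project's test suite) and emit a hard metric
    the LLM evaluator cannot argue with."""
    proc = subprocess.run(test_cmd, capture_output=True, text=True)
    return {
        "tests_passed": proc.returncode == 0,
        "returncode": proc.returncode,
    }
```

Behind an `if __name__ == "__main__":` guard, `print(json.dumps(measure(sys.argv[1:])))` and list it in `evidence_sources` as, for example, `command: "python3 scripts/metric_gate.py python3 -m pytest -q"`.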
evolution_kernel/ ~1,900-line runtime (Governor · Observer · HardStops · Sandbox · Config · CLI · Scope)
roles/ reference planner, executor, evaluator, goal_evaluator, strategist
examples/ demo target + sandbox demo + working evolution.yml
docs/ protocol spec + first-task spec
tests/ 99 unit + acceptance tests · 14 fixture role scripts
evidence/ checked-in artifacts of runs anyone can reproduce
MIT — see LICENSE.