Evolving full agent blueprints through execution-grounded genetic algorithms: not just prompts, but tools, memory, planning, and self-evaluation.
Grounded Agent Forge is the next evolution of execution-grounded prompt optimization. Where the original grounded_evolution evolved text prompts to generate better code, this project evolves complete agent blueprints: full specifications for autonomous AI agents, including system prompts, tool definitions, memory architectures, planning strategies, and self-evaluation mechanisms.
timeline
title The Evolution of Agent Evolution
autoresearch-ai-agent-skeleton : Lexical-only prompt scoring (400+ keyword signals)
grounded_evolution : Execution-grounded validation (AST + pytest + flake8)
grounded_agent_forge : Full agent blueprint evolution in Docker sandbox
| Feature | Impact |
|---|---|
| Agent-Level Evolution | Not just prompts: entire agent architectures evolve through genetic algorithms |
| Docker Sandboxing | Every generated agent executes in an isolated container; real execution metrics drive fitness |
| Multi-Objective Fitness | Agents scored on correctness, efficiency, tool-use accuracy, planning depth, and self-evaluation |
| Meta-Evolution | The evolutionary strategy itself evolves: crossover rates, mutation operators, and selection pressure adapt over time |
| Task Specialization | Populations diversify into specialist agents for different problem domains |
| Real-Time Dashboard | Web-based visualization of evolution progress, agent scores, and population dynamics |
+------------------------------------------------------------------+
|                       grounded_agent_forge                       |
|                           (THIS REPO)                            |
| Evolves full agent blueprints (prompt + tools + memory +         |
| planning + self-eval) in Docker sandbox with multi-objective     |
| fitness, meta-evolution, and task specialization.                |
|                                                                  |
| Agent-level evolution           Docker sandboxed execution       |
| 8+ fitness dimensions           Self-tuning meta-evolution       |
| Real-time dashboard             Task specialization              |
+------------------------------------------------------------------+
                                 ^
                                 |  builds on · evolves from
+------------------------------------------------------------------+
|                        grounded_evolution                        |
|           (github.com/NullLabTests/grounded_evolution)           |
| Evolves text prompts with execution-grounded validation via AST  |
| parse, pytest, and flake8. Two-loop system: lexical + grounded.  |
|                                                                  |
| 203 evolution cycles            Best score: 39/80                |
| 7 benchmark tasks               127 mutations + 76 crossovers    |
+------------------------------------------------------------------+
                                 ^
                                 |  builds on · evolves from
+------------------------------------------------------------------+
|                  autoresearch-ai-agent-skeleton                  |
| Lexical-only prompt evolution with 400+ keyword signals across   |
| 19 categories. 5 genetic mutation strategies. Meta-signal        |
| injection via auto_evolve.py.                                    |
|                                                                  |
| 218 prompts evolved             Best lexical score: 1000/1000    |
| 400+ keyword signals            5 mutation strategies            |
+------------------------------------------------------------------+
| Capability | Lexical-Only | Grounded Evolution | Grounded Agent Forge |
|---|---|---|---|
| Keyword prompt scoring | Yes (400+ signals) | Yes (400+ signals) | Yes (400+ signals) |
| Execution-grounded validation | No | Yes (AST + pytest + flake8) | Yes (full Docker sandbox) |
| Evolves prompts | Yes | Yes | Yes |
| Evolves agent blueprints | No | No | Yes |
| Docker sandbox isolation | No | No | Yes |
| Multi-objective fitness | No | No | Yes (8+ dimensions) |
| Meta-evolution | Signal injection | Signal injection | Full strategy evolution |
| Task specialization | No | No | Yes |
| Real-time dashboard | No | No | Yes |
| Self-evaluation in agents | No | No | Yes |
| Tool-use validation | No | No | Yes |
| Planning depth scoring | No | No | Yes |
| Infinite research loop | No (finite) | Yes | Yes |
| Auto-commit on improvement | โ | โ | โ |
This project was built using DeepSeek V4 as the primary coding model.
+----------------------------------------------------------------------+
|                         GROUNDED AGENT FORGE                         |
|                                                                      |
|  +--------------------------+      +------------------------------+  |
|  | orchestrator.py          | ---> | agent_spec_generator.py      |  |
|  | - Main evolution loop    |      | - Generates agent blueprints |  |
|  | - Selection & mutation   |      | - System prompt + tools      |  |
|  | - Parallel generation    |      | - Memory + planning config   |  |
|  +------------+-------------+      +--------------+---------------+  |
|               |                                   |                  |
|               v                                   v                  |
|  +--------------------------+      +------------------------------+  |
|  | full_agent_evaluator     |      | Docker Sandbox               |  |
|  | - Multi-objective score  | ---> | - Isolated container exec    |  |
|  | - 8 fitness dimensions   |      | - Tool-use validation        |  |
|  | - Benchmark execution    |      | - Planning evaluation        |  |
|  +------------+-------------+      +------------------------------+  |
|               |                                                      |
|               v                                                      |
|  +--------------------------+                                        |
|  | meta_evolver.py          | ---> Self-tuning evolution strategy    |
|  | - Adaptive mutation      |                                        |
|  | - Weight optimization    |                                        |
|  | - Novelty-driven explore |                                        |
|  +--------------------------+                                        |
|                                                                      |
|  +--------------------------+                                        |
|  | dashboard/               | ---> Real-time evolution visualization |
|  | main.py                  |      (FastAPI + Web UI)                |
|  +--------------------------+                                        |
+----------------------------------------------------------------------+
graph TB
    subgraph Forge["Agent Forge Loop"]
        direction TB
        A["Agent Blueprint<br/>Population"] --> B["orchestrator.py<br/>Select + Mutate"]
        B --> C["agent_spec_generator.py<br/>LLM → Full Agent Spec"]
        C --> D["Docker Sandbox<br/>Build + Run Agent"]
        D --> E["full_agent_evaluator.py<br/>Multi-Objective Score"]
        E --> F["meta_evolver.py<br/>Tune Evolution Strategy"]
        F --> G["Update Population<br/>+ Persist to DB"]
        G --> A
    end
    subgraph Dashboard["Real-Time Visualization"]
        DASH["dashboard/main.py<br/>FastAPI + Charts"]
    end
    E -->|"fitness data"| DASH
    DASH -->|"control signals"| B
quadrantChart
title Fitness Dimension Weights
x-axis "Low Impact" --> "High Impact"
y-axis "Easy to Measure" --> "Hard to Measure"
quadrant-1 "Core Metrics"
quadrant-2 "Quality Signals"
quadrant-3 "Secondary"
quadrant-4 "Long-term"
Correctness: [0.9, 0.3]
Tool-Use: [0.6, 0.5]
Planning: [0.5, 0.7]
Code-Quality: [0.4, 0.2]
Memory: [0.3, 0.6]
Self-Eval: [0.3, 0.8]
Efficiency: [0.2, 0.4]
Prompt-Quality: [0.1, 0.1]
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 30% | Does the agent solve the task correctly? |
| Tool-Use Accuracy | 15% | Does the agent call tools with valid arguments? |
| Planning Depth | 15% | Does the agent decompose problems into steps? |
| Code Quality | 10% | AST validity, project structure, linting |
| Memory Effectiveness | 10% | Does the agent use memory to maintain context? |
| Self-Evaluation | 10% | Does the agent correctly assess its own outputs? |
| Efficiency | 5% | Token efficiency, round-trips to completion |
| Prompt Quality | 5% | Lexical signal coverage (legacy metric) |
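The weights in the table above combine into a single fitness value as a weighted sum. The following is a minimal sketch, assuming each dimension has already been normalized to the range 0 to 1. The dictionary keys and the `aggregate_fitness` name are illustrative and are not taken from the codebase.

```python
# Weights copied from the fitness table; the key names are illustrative.
FITNESS_WEIGHTS = {
    "correctness": 0.30,
    "tool_use": 0.15,
    "planning": 0.15,
    "code_quality": 0.10,
    "memory": 0.10,
    "self_eval": 0.10,
    "efficiency": 0.05,
    "prompt_quality": 0.05,
}


def aggregate_fitness(scores: dict) -> float:
    """Return the weighted sum of per-dimension scores, each expected in [0, 1]."""
    total = 0.0
    for name, weight in FITNESS_WEIGHTS.items():
        # Assumption: a dimension that was not measured counts as 0.
        value = scores.get(name, 0.0)
        # Clamp so one out-of-range metric cannot dominate the total.
        value = min(1.0, max(0.0, value))
        total += weight * value
    return total


# Example: an agent that solves the task but scores nothing elsewhere.
only_correct = aggregate_fitness({"correctness": 1.0})
```

Because the weights sum to 1, an agent with a perfect score on every dimension would receive a total of 1.0 under this scheme.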
- Python 3.12+
- Docker (for sandboxed agent execution)
- LLM API key (DeepSeek, OpenAI, or any OpenAI-compatible provider)
# Clone the repository
git clone git@github.com:NullLabTests/grounded_agent_forge.git
cd grounded_agent_forge
# Create virtual environment
python -m venv .venv && source .venv/bin/activate
# Install base + forge extras
pip install -e ".[forge]"
# Configure your LLM provider
cp .env.example .env
# Edit .env with your API key and model preferences

# Start the infinite agent evolution loop (two ways):
python -m agent_forge.orchestrator
# OR use the shell wrapper:
bash run_forge_loop.sh

uvicorn dashboard.main:app --reload --port 8000
# Open http://localhost:8000

| Variable | Default | Description |
|---|---|---|
| `LLM_API_KEY` | (none) | LLM provider API key |
| `LLM_MODEL` | `deepseek-chat` | Model name |
| `LLM_BASE_URL` | `https://api.deepseek.com/v1` | API endpoint |
| `FORGE_DB_URL` | `sqlite+aiosqlite:///forge_population.db` | Population database |
| `SANDBOX_TIMEOUT` | `300` | Docker sandbox timeout (seconds) |
| `MAX_PARALLEL_GENERATIONS` | `3` | Concurrent agent generations |
| `HUMAN_APPROVAL` | `false` | Require manual approval before execution |
| `DASHBOARD_PORT` | `8000` | Dashboard server port |
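The documented defaults can be read at startup using only the standard library. The sketch below is an assumption about how such a loader might look, not the project's actual code: `ForgeConfig` and `load_config` are hypothetical names, and the set of strings accepted as true for `HUMAN_APPROVAL` is a guess.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ForgeConfig:  # hypothetical container, not the project's actual class
    llm_api_key: Optional[str]
    llm_model: str
    llm_base_url: str
    forge_db_url: str
    sandbox_timeout: int
    max_parallel_generations: int
    human_approval: bool
    dashboard_port: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ForgeConfig:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    truthy = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "true"
    return ForgeConfig(
        llm_api_key=env.get("LLM_API_KEY"),  # no default; None when unset
        llm_model=env.get("LLM_MODEL", "deepseek-chat"),
        llm_base_url=env.get("LLM_BASE_URL", "https://api.deepseek.com/v1"),
        forge_db_url=env.get("FORGE_DB_URL", "sqlite+aiosqlite:///forge_population.db"),
        sandbox_timeout=int(env.get("SANDBOX_TIMEOUT", "300")),
        max_parallel_generations=int(env.get("MAX_PARALLEL_GENERATIONS", "3")),
        human_approval=env.get("HUMAN_APPROVAL", "false").strip().lower() in truthy,
        dashboard_port=int(env.get("DASHBOARD_PORT", "8000")),
    )


# Passing a plain dict instead of os.environ makes the loader easy to test.
defaults = load_config({})
```

Parsing the integers and the boolean once, at load time, means a malformed value such as `SANDBOX_TIMEOUT=five` fails immediately rather than in the middle of an evolution run.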
orchestrator.py is the central evolution loop coordinator, the brain of the forge.
+-----------------------------------------------------+
|                   orchestrator.py                   |
|                                                     |
|  +----------+   +-----------------+   +----------+  |
|  | Load pop |-->| Select champion |-->|  Mutate  |  |
|  +----------+   +-----------------+   +-----+----+  |
|                                             |       |
|                                             v       |
|  +----------+   +-----------------+   +----------+  |
|  | Persist  |<--|  Track fitness  |<--| Evaluate |  |
|  +----------+   +-----------------+   +----------+  |
+-----------------------------------------------------+
- Loads/persists agent blueprint population from database
- Tournament selection with elitism
- Mutation and crossover scheduling
- Parallel generation management
- Fitness tracking and convergence detection
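Tournament selection with elitism, as listed above, can be sketched in a few lines. This is an illustrative version only: the function name, the default `elite_count` and `tournament_size`, and the `mutate(parent, rng)` callback signature are assumptions, and crossover is left out to keep the example short.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def next_generation(
    population: List[T],
    fitness: Callable[[T], float],
    mutate: Callable[[T, random.Random], T],
    elite_count: int = 1,
    tournament_size: int = 3,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Build a new population of the same size as the old one."""
    if rng is None:
        rng = random.Random()
    # Elitism: the best individuals are copied over unchanged.
    ranked = sorted(population, key=fitness, reverse=True)
    new_population = ranked[:elite_count]
    # Tournament selection: sample a few contenders, keep the fittest,
    # and add a mutated copy of it until the population is full again.
    while len(new_population) < len(population):
        k = min(tournament_size, len(population))
        contenders = rng.sample(population, k)
        winner = max(contenders, key=fitness)
        new_population.append(mutate(winner, rng))
    return new_population


# Toy usage: individuals are plain floats and fitness is the value itself.
toy_population = [0.1, 0.9, 0.5, 0.3]
toy_next = next_generation(
    toy_population,
    fitness=lambda x: x,
    mutate=lambda x, rng: x,  # identity mutation keeps the example deterministic
    rng=random.Random(0),
)
```

A larger `tournament_size` increases selection pressure, since weak individuals are less likely to win a tournament; elitism guarantees that the best fitness in the population never decreases between generations.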
agent_spec_generator.py generates full agent specifications from evolved blueprints. An agent spec includes:
| Component | Description |
|---|---|
| System Prompt | Core identity, behavior instructions, and constraints |
| Tool Definitions | Function schemas the agent can call (JSON schema) |
| Memory Architecture | Short-term, long-term, and working memory configuration |
| Planning Strategy | Chain-of-thought, ReAct, or tree-of-thought configuration |
| Self-Evaluation Criteria | How the agent judges its own outputs |
| Output Schema | Expected response format and structure |
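One way to picture a blueprint with the six components above is as a small serializable record. The sketch below is hypothetical: the `AgentBlueprint` class, its field names, and the validation rules are assumptions made for illustration, not the schema used by `agent_spec_generator.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Strategy names mirror the table above; the exact identifiers are assumed.
PLANNING_STRATEGIES = {"chain_of_thought", "react", "tree_of_thought"}


@dataclass
class AgentBlueprint:  # hypothetical shape, for illustration only
    system_prompt: str
    tools: List[Dict[str, Any]] = field(default_factory=list)  # JSON-schema style definitions
    memory: Dict[str, Any] = field(default_factory=dict)
    planning_strategy: str = "react"
    self_eval_criteria: List[str] = field(default_factory=list)
    output_schema: Dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the blueprint is structurally incomplete."""
        if not self.system_prompt.strip():
            raise ValueError("system_prompt must not be empty")
        if self.planning_strategy not in PLANNING_STRATEGIES:
            raise ValueError(f"unknown planning strategy: {self.planning_strategy}")
        for tool in self.tools:
            if "name" not in tool or "parameters" not in tool:
                raise ValueError("each tool needs 'name' and 'parameters'")

    def to_json(self) -> str:
        """Serialize to JSON so the blueprint can be stored or diffed."""
        return json.dumps(asdict(self), sort_keys=True)


blueprint = AgentBlueprint(
    system_prompt="You are a careful coding agent.",
    tools=[{"name": "run_tests", "parameters": {"type": "object", "properties": {}}}],
    memory={"short_term_turns": 8, "long_term": False},
    self_eval_criteria=["all tests pass"],
)
blueprint.validate()
```

Keeping the blueprint as plain data means mutation and crossover operators can work on individual fields, and two blueprints can be compared by diffing their JSON.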
full_agent_evaluator.py is the multi-objective fitness evaluator, the forge's quality gate.
          Agent Spec
               |
               v
+-----------------------------+
| Build Docker Container      |
|   - Install dependencies    |
|   - Configure environment   |
+--------------+--------------+
               v
+-----------------------------+
| Execute Against Benchmarks  |
|   - Task completion check   |
|   - Tool call validation    |
|   - Planning analysis       |
+--------------+--------------+
               v
+-----------------------------+
| Score Across 8 Dimensions   |
|   - Correctness (30%)       |
|   - Tool-Use (15%)          |
|   - Planning (15%)          |
|   - plus 5 more metrics     |
+-----------------------------+
- Builds Docker containers from agent specs
- Executes agents against benchmark tasks
- Scores across 8+ fitness dimensions
- Handles sandbox timeouts and failures gracefully
- Logs detailed per-dimension metrics
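Handling sandbox timeouts and failures gracefully usually means turning every outcome into a structured result instead of letting an exception escape. The sketch below shows one way to do that with `subprocess`; it is an assumption about the approach, not the evaluator's real code. The function names, the result dictionary, and the choice of `--network none` and a 512 MB memory cap are illustrative.

```python
import subprocess
from typing import Any, Callable, Dict, List


def docker_run_command(image: str, command: List[str], memory: str = "512m") -> List[str]:
    """Build a `docker run` invocation with networking disabled and a memory cap."""
    return ["docker", "run", "--rm", "--network", "none", "--memory", memory, image, *command]


def run_in_sandbox(
    image: str,
    command: List[str],
    timeout_s: int = 300,
    runner: Callable[..., Any] = subprocess.run,  # injectable so tests need no Docker
) -> Dict[str, Any]:
    """Run the command and always return a result dict, even on timeout or error."""
    argv = docker_run_command(image, command)
    try:
        proc = runner(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The agent exceeded its time budget; report it instead of raising.
        return {"status": "timeout", "exit_code": None, "stdout": ""}
    except OSError as exc:
        # For example, the docker binary is not installed.
        return {"status": "error", "exit_code": None, "stdout": str(exc)}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "exit_code": proc.returncode, "stdout": proc.stdout}
```

The default `timeout_s` of 300 matches the documented `SANDBOX_TIMEOUT`. One caveat: when the timeout fires, `subprocess.run` kills the `docker` client process, which does not necessarily stop the container itself, so a production evaluator would also need to stop or remove the container explicitly.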
meta_evolver.py is the evolution strategy optimizer: the forge that forges itself.
+------------------------------------+
|          meta_evolver.py           |
|                                    |
|  Input: population fitness deltas  |
|                                    |
|  +------------------------------+  |
|  | Track operator success       |  |
|  | per operator                 |  |
|  +--------------+---------------+  |
|                 v                  |
|  +------------------------------+  |
|  | Adjust probabilities         |  |
|  | up-weight winners            |  |
|  | down-weight losers           |  |
|  +--------------+---------------+  |
|                 v                  |
|  +------------------------------+  |
|  | Detect stagnation            |  |
|  | if flat -> novelty search    |  |
|  +--------------+---------------+  |
|                 v                  |
|  Output: new evolution config      |
+------------------------------------+
- Tracks which mutation/crossover operators produce the best fitness gains
- Adjusts operator probabilities in real-time (self-tuning weights)
- Evolves the evolution strategy itself (meta-level adaptation)
- Detects stagnation and introduces novelty-driven exploration
- Persists strategy state across runs
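The two core ideas above, re-weighting operators by their recent fitness gains and detecting stagnation, can be sketched as follows. This is a simplified illustration under stated assumptions: the multiplicative update rule, the `step` and `floor` values, and the window-based stagnation test are choices made for the example, not the rules implemented in `meta_evolver.py`.

```python
from typing import Dict, List


def update_operator_weights(
    weights: Dict[str, float],
    mean_gain: Dict[str, float],
    step: float = 0.2,
    floor: float = 0.05,
) -> Dict[str, float]:
    """Up-weight operators with positive mean fitness gain, down-weight the rest."""
    adjusted = {}
    for op, weight in weights.items():
        gain = mean_gain.get(op, 0.0)
        if gain > 0:
            weight *= 1.0 + step
        elif gain < 0:
            weight *= 1.0 - step
        # The floor is applied before renormalizing so that no operator's
        # probability collapses to zero and it can still be sampled later.
        adjusted[op] = max(weight, floor)
    total = sum(adjusted.values())
    return {op: w / total for op, w in adjusted.items()}


def is_stagnant(best_history: List[float], window: int = 5, min_gain: float = 1e-3) -> bool:
    """True if the best fitness improved by less than min_gain over the last `window` generations."""
    if len(best_history) <= window:
        return False  # not enough history to judge yet
    return best_history[-1] - best_history[-1 - window] < min_gain


new_weights = update_operator_weights(
    {"mutation": 0.5, "crossover": 0.5},
    {"mutation": 0.04, "crossover": -0.01},
)
```

In a loop like the one described, a `True` result from `is_stagnant` would be the trigger for switching to novelty-driven exploration, for instance by temporarily rewarding blueprints that differ from the current population.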
dashboard/main.py is a FastAPI-based web dashboard providing:
| Feature | Description |
|---|---|
| Population View | Real-time visualization of the agent population |
| Fitness Trajectory | Score over time across all dimensions |
| Agent Inspector | Compare blueprint specs side-by-side |
| Dimension Breakdown | Per-dimension score distribution |
| Evolution Controls | Pause, resume, and manual trigger |
grounded_agent_forge/
├── README.md                     # This file
├── LICENSE                       # MIT license
├── pyproject.toml                # Project metadata + dependencies
├── AGENTS.md                     # Agent collaboration conventions
├── CHANGELOG.md                  # Release history
├── CONTRIBUTING.md               # How to contribute
├── SECURITY.md                   # Security policy
├── .env.example                  # Environment template
├── .gitignore                    # Git ignore rules
│
├── agent_forge/                  # Core forge modules (primary)
│   ├── __init__.py
│   ├── orchestrator.py           # Evolution loop coordinator
│   ├── agent_spec_generator.py   # Agent blueprint generator
│   ├── full_agent_evaluator.py   # Multi-objective fitness evaluator
│   └── meta_evolver.py           # Strategy adaptation
│
├── dashboard/                    # Real-time web dashboard
│   └── main.py                   # FastAPI application
│
├── run_forge_loop.sh             # Shell automation wrapper
│
├── .github/                      # CI/CD + community
│   ├── workflows/
│   │   ├── ci.yml                # Lint + import checks
│   │   └── badge.yml             # Dynamic score badge
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   ├── feature_request.md
│   │   └── config.yml
│   ├── dependabot.yml
│   ├── FUNDING.yml
│   └── CODEOWNERS
│
├── docs/                         # Documentation
├── experiments/                  # Experiment outputs
├── benchmarks/                   # Task definitions
│
├── evaluator/                    # (legacy) Grounded evolution evaluator
├── population/                   # (legacy) Evolved prompts
├── memory/                       # (legacy) Evolution state
├── analysis/                     # (legacy) Visualization scripts
├── generator.py                  # (legacy) LLM code generation
├── infinite_research_loop.py     # (legacy) Grounded evolution loop
├── mutation_engine.py            # (legacy) Prompt mutation operators
└── population_manager.py         # (legacy) Population persistence
Note: Modules marked "(legacy)" are carried forward from grounded_evolution. They remain functional, but the primary development focus is on agent_forge/.
Grounded Agent Forge explores the frontier of evolutionary software optimization:
| Research Direction | Description |
|---|---|
| Blueprint-Level Evolution | Moving from prompt text optimization to full agent architecture evolution |
| Execution-Grounded Multi-Objective Fitness | Real Docker sandbox execution across 8+ fitness dimensions |
| Meta-Evolutionary Adaptation | The evolutionary strategy itself evolves, preventing stagnation |
| Task Specialization | Populations naturally diversify into domain-specific agent archetypes |
| Self-Evaluating Agents | Agents that can assess their own output quality are rewarded |
mindmap
  root((Agent Forge))
    Blueprint Evolution
      System prompts
      Tool definitions
      Memory architectures
      Planning strategies
    Execution Grounding
      Docker sandbox
      Real execution metrics
      Multi-objective scoring
    Meta Evolution
      Self-tuning weights
      Strategy adaptation
      Novelty search
    Task Specialization
      Domain clustering
      Niche formation
      Pareto optimization
    Dashboard
      Real-time viz
      Population analysis
      Control interface
Grounded Agent Forge is not:
- A claim of AGI or sentience
- A self-conscious or self-aware system
- Runaway recursive self-improvement

It is a well-scoped experimental system for studying how genetic algorithms can evolve complete agent architectures, with real execution validation in isolated sandboxes.
We welcome contributions! See CONTRIBUTING.md for details.
Quick start for contributors:
# Fork & clone
git clone git@github.com:YOUR_USERNAME/grounded_agent_forge.git
# Install dev dependencies
pip install -e ".[forge]" ruff
# Lint your code
ruff check agent_forge/ dashboard/
# Open a PR

MIT. See LICENSE.
| Contribution | Link |
|---|---|
| Predecessor | grounded_evolution: execution-grounded prompt evolution platform with 203 evolution cycles |
| Inspiration | autoresearch by Andrej Karpathy: the original lexical prompt evolution concept |
| Built Using | DeepSeek V4 as the primary coding model for this project |