NullLabTests/grounded_agent_forge

⚒️ Grounded Agent Forge

Evolving full agent blueprints through execution-grounded genetic algorithms: not just prompts, but tools, memory, planning, and self-evaluation.

Status: Active · License: MIT · Python 3.12+ · Docker · FastAPI · SQLAlchemy



Navigation · Overview · Project Lineage · Architecture · Quick Start · Modules · Project Structure · Research · Contributing


✦ Overview

Grounded Agent Forge is the next evolution of execution-grounded prompt optimization. Where the original grounded_evolution evolved text prompts to generate better code, this project evolves complete agent blueprints: full specifications for autonomous AI agents, including system prompts, tool definitions, memory architectures, planning strategies, and self-evaluation mechanisms.

timeline
    title The Evolution of Agent Evolution
    autoresearch-ai-agent-skeleton : Lexical-only prompt scoring (400+ keyword signals)
    grounded_evolution              : Execution-grounded validation (AST + pytest + flake8)
    grounded_agent_forge            : Full agent blueprint evolution in Docker sandbox

What Makes This Different

| Feature | Impact |
|---|---|
| 🧬 Agent-Level Evolution | Not just prompts: entire agent architectures evolve through genetic algorithms |
| 📦 Docker Sandboxing | Every generated agent executes in an isolated container; real execution metrics drive fitness |
| 🎯 Multi-Objective Fitness | Agents scored on correctness, efficiency, tool-use accuracy, planning depth, and self-evaluation |
| 🔄 Meta-Evolution | The evolutionary strategy itself evolves: crossover rates, mutation operators, and selection pressure adapt over time |
| 🧩 Task Specialization | Populations diversify into specialist agents for different problem domains |
| 📊 Real-Time Dashboard | Web-based visualization of evolution progress, agent scores, and population dynamics |

🧬 Project Lineage

┌────────────────────────────────────────────────────────────────────┐
│                        grounded_agent_forge                        │
│                            (THIS REPO)                             │
│  Evolves full agent blueprints (prompt + tools + memory +          │
│  planning + self-eval) in Docker sandbox with multi-objective      │
│  fitness, meta-evolution, and task specialization.                 │
│                                                                    │
│  🏗️ Agent-level evolution        📦 Docker sandboxed execution     │
│  🎯 8+ fitness dimensions        🔄 Self-tuning meta-evolution     │
│  📊 Real-time dashboard          🧩 Task specialization            │
└────────────────────────────────────────────────────────────────────┘
                              ▲
                              │ builds on · evolves from
┌────────────────────────────────────────────────────────────────────┐
│                         grounded_evolution                         │
│            (github.com/NullLabTests/grounded_evolution)            │
│  Evolves text prompts with execution-grounded validation via AST   │
│  parse, pytest, and flake8. Two-loop system: lexical + grounded.   │
│                                                                    │
│     203 evolution cycles         🏆 Best score: 39/80              │
│  🔬 7 benchmark tasks            🔄 127 mutations + 76 crossovers  │
└────────────────────────────────────────────────────────────────────┘
                              ▲
                              │ builds on · evolves from
┌────────────────────────────────────────────────────────────────────┐
│                   autoresearch-ai-agent-skeleton                   │
│  Lexical-only prompt evolution with 400+ keyword signals across    │
│  19 categories. 5 genetic mutation strategies. Meta-signal         │
│  injection via auto_evolve.py.                                     │
│                                                                    │
│     218 prompts evolved          🏆 Best lexical score: 1000/1000  │
│  🔤 400+ keyword signals         🧬 5 mutation strategies          │
└────────────────────────────────────────────────────────────────────┘

Capability Comparison

| Capability | Lexical-Only | Grounded Evolution | 🚀 Grounded Agent Forge |
|---|---|---|---|
| Keyword prompt scoring | ✅ 400+ signals | ✅ 400+ signals | ✅ 400+ signals |
| Execution-grounded validation | ❌ | ✅ AST + pytest + flake8 | ✅ Full Docker sandbox |
| Evolves prompts | ✅ | ✅ | ✅ |
| Evolves agent blueprints | ❌ | ❌ | ✅ |
| Docker sandbox isolation | ❌ | ❌ | ✅ |
| Multi-objective fitness | ❌ | ❌ | ✅ (8+ dimensions) |
| Meta-evolution | ✅ signal injection | ✅ signal injection | ✅ full strategy evolution |
| Task specialization | ❌ | ❌ | ✅ |
| Real-time dashboard | ❌ | ❌ | ✅ |
| Self-evaluation in agents | ❌ | ❌ | ✅ |
| Tool-use validation | ❌ | ❌ | ✅ |
| Planning depth scoring | ❌ | ❌ | ✅ |
| Infinite research loop | ❌ (finite) | ✅ | ✅ |
| Auto-commit on improvement | ❌ | ✅ | ✅ |

This project was built using DeepSeek V4 as the primary coding model.


🏗️ Architecture

High-Level System Design

┌───────────────────────────────────────────────────────────────────────┐
│                         GROUNDED AGENT FORGE                          │
│                                                                       │
│  ┌──────────────────────────┐    ┌────────────────────────────────┐   │
│  │    orchestrator.py       │───▶│   agent_spec_generator.py      │   │
│  │  ─ Main evolution loop   │    │  ─ Generates agent blueprints  │   │
│  │  ─ Selection & mutation  │    │  ─ System prompt + tools       │   │
│  │  ─ Parallel generation   │    │  ─ Memory + planning config    │   │
│  └───────────┬──────────────┘    └───────────────┬────────────────┘   │
│              │                                   │                    │
│              ▼                                   ▼                    │
│  ┌──────────────────────────┐    ┌────────────────────────────────┐   │
│  │   full_agent_evaluator   │    │        Docker Sandbox          │   │
│  │  ─ Multi-objective score │───▶│  ─ Isolated container exec     │   │
│  │  ─ 8 fitness dimensions  │    │  ─ Tool-use validation         │   │
│  │  ─ Benchmark execution   │    │  ─ Planning evaluation         │   │
│  └───────────┬──────────────┘    └────────────────────────────────┘   │
│              │                                                        │
│              ▼                                                        │
│  ┌──────────────────────────┐                                         │
│  │      meta_evolver.py     │───▶ Self-tuning evolution strategy      │
│  │  ─ Adaptive mutation     │                                         │
│  │  ─ Weight optimization   │                                         │
│  │  ─ Novelty-driven explore│                                         │
│  └──────────────────────────┘                                         │
│                                                                       │
│  ┌──────────────────────────┐                                         │
│  │      dashboard/          │───▶ Real-time evolution visualization   │
│  │      main.py             │     (FastAPI + Web UI)                  │
│  └──────────────────────────┘                                         │
└───────────────────────────────────────────────────────────────────────┘

Evolution Cycle

graph TB
    subgraph Forge["⚒️ Agent Forge Loop"]
        direction TB
        A["🧬 Agent Blueprint<br/>Population"] --> B["🎯 orchestrator.py<br/>Select + Mutate"]
        B --> C["🤖 agent_spec_generator.py<br/>LLM → Full Agent Spec"]
        C --> D["📦 Docker Sandbox<br/>Build + Run Agent"]
        D --> E["📊 full_agent_evaluator.py<br/>Multi-Objective Score"]
        E --> F["🧠 meta_evolver.py<br/>Tune Evolution Strategy"]
        F --> G["💾 Update Population<br/>+ Persist to DB"]
        G --> A
    end

    subgraph Dashboard["📈 Real-Time Visualization"]
        DASH["🖥️ dashboard/main.py<br/>FastAPI + Charts"]
    end

    E -->|"fitness data"| DASH
    DASH -->|"control signals"| B

Multi-Objective Fitness Dimensions

quadrantChart
    title Fitness Dimension Weights
    x-axis "Low Impact" --> "High Impact"
    y-axis "Easy to Measure" --> "Hard to Measure"
    quadrant-1 "Core Metrics"
    quadrant-2 "Quality Signals"
    quadrant-3 "Secondary"
    quadrant-4 "Long-term"
    Correctness: [0.9, 0.3]
    Tool-Use: [0.6, 0.5]
    Planning: [0.5, 0.7]
    Code-Quality: [0.4, 0.2]
    Memory: [0.3, 0.6]
    Self-Eval: [0.3, 0.8]
    Efficiency: [0.2, 0.4]
    Prompt-Quality: [0.1, 0.1]

| Dimension | Weight | What It Measures |
|---|---|---|
| 🎯 Correctness | 30% | Does the agent solve the task correctly? |
| 🔧 Tool-Use Accuracy | 15% | Does the agent call tools with valid arguments? |
| 🧩 Planning Depth | 15% | Does the agent decompose problems into steps? |
| Code Quality | 10% | AST validity, project structure, linting |
| 🧠 Memory Effectiveness | 10% | Does the agent use memory to maintain context? |
| 🔍 Self-Evaluation | 10% | Does the agent correctly assess its own outputs? |
| ⚡ Efficiency | 5% | Token efficiency, round-trips to completion |
| 📖 Prompt Quality | 5% | Lexical signal coverage (legacy metric) |
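
One way to combine these weights into a single fitness value is a plain weighted sum. The sketch below is illustrative only: the dictionary keys, the `aggregate_fitness` name, and the assumption that each dimension is scored in [0, 1] are not taken from the repository, and the evaluator may aggregate differently.

```python
# Weights copied from the table above; the key names are hypothetical.
FITNESS_WEIGHTS = {
    "correctness": 0.30,
    "tool_use": 0.15,
    "planning": 0.15,
    "code_quality": 0.10,
    "memory": 0.10,
    "self_eval": 0.10,
    "efficiency": 0.05,
    "prompt_quality": 0.05,
}


def aggregate_fitness(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores.

    Assumes every score lies in [0, 1]; out-of-range values are clamped
    and missing dimensions count as 0.
    """
    total = 0.0
    for name, weight in FITNESS_WEIGHTS.items():
        value = min(1.0, max(0.0, scores.get(name, 0.0)))
        total += weight * value
    return total
```

Because the eight weights sum to 100%, an agent that scores 1.0 on every dimension reaches an aggregate of 1.0, and one that only solves the task correctly contributes 0.30.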

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • Docker (for sandboxed agent execution)
  • LLM API key: DeepSeek, OpenAI, or any OpenAI-compatible provider

Setup

# Clone the repository
git clone git@github.com:NullLabTests/grounded_agent_forge.git
cd grounded_agent_forge

# Create virtual environment
python -m venv .venv && source .venv/bin/activate

# Install base + forge extras
pip install -e ".[forge]"

# Configure your LLM provider
cp .env.example .env
# Edit .env with your API key and model preferences

Run the Forge

# Start the infinite agent evolution loop (two ways):
python -m agent_forge.orchestrator

# OR use the shell wrapper:
bash run_forge_loop.sh

Launch the Dashboard

uvicorn dashboard.main:app --reload --port 8000
# Open → http://localhost:8000

Configuration

| Variable | Default | Description |
|---|---|---|
| LLM_API_KEY | (none) | LLM provider API key |
| LLM_MODEL | deepseek-chat | Model name |
| LLM_BASE_URL | https://api.deepseek.com/v1 | API endpoint |
| FORGE_DB_URL | sqlite+aiosqlite:///forge_population.db | Population database |
| SANDBOX_TIMEOUT | 300 | Docker sandbox timeout (seconds) |
| MAX_PARALLEL_GENERATIONS | 3 | Concurrent agent generations |
| HUMAN_APPROVAL | false | Require manual approval before execution |
| DASHBOARD_PORT | 8000 | Dashboard server port |

📦 Modules

⚒️ agent_forge/orchestrator.py

The central evolution loop coordinator: the brain of the forge.

┌──────────────────────────────────────┐
│         orchestrator.py              │
│                                      │
│  ┌─────────┐  ┌──────────┐  ┌─────┐  │
│  │ Load    │─▶│ Select   │─▶│ Mu- │  │
│  │ pop     │  │ champion │  │ tate│  │
│  └─────────┘  └──────────┘  └──┬──┘  │
│                                ▼     │
│  ┌─────────┐  ┌──────────┐  ┌─────┐  │
│  │ Per-    │◀─│ Track    │◀─│ Eval│  │
│  │ sist    │  │ fitness  │  │ uate│  │
│  └─────────┘  └──────────┘  └─────┘  │
└──────────────────────────────────────┘
  • Loads/persists agent blueprint population from database
  • Tournament selection with elitism
  • Mutation and crossover scheduling
  • Parallel generation management
  • Fitness tracking and convergence detection

🤖 agent_forge/agent_spec_generator.py

Generates full agent specifications from evolved blueprints. An agent spec includes:

| Component | Description |
|---|---|
| 🧠 System Prompt | Core identity, behavior instructions, and constraints |
| 🛠️ Tool Definitions | Function schemas the agent can call (JSON schema) |
| 💾 Memory Architecture | Short-term, long-term, and working memory configuration |
| 🗺️ Planning Strategy | Chain-of-thought, ReAct, or tree-of-thought configuration |
| 🔍 Self-Evaluation Criteria | How the agent judges its own outputs |
| Output Schema | Expected response format and structure |
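
To make the six components concrete, the sketch below models a blueprint as plain dataclasses that serialize to JSON. The class names, field names, and strategy identifiers are hypothetical and chosen only for illustration; the repository's actual spec format is not shown in this README.

```python
import json
from dataclasses import asdict, dataclass, field

# Assumed identifiers for the three planning styles named in the table.
PLANNING_STRATEGIES = {"chain_of_thought", "react", "tree_of_thought"}


@dataclass
class ToolDefinition:
    name: str
    description: str
    parameters: dict  # JSON Schema object describing the tool's arguments


@dataclass
class AgentBlueprint:
    system_prompt: str
    tools: list[ToolDefinition] = field(default_factory=list)
    memory: dict = field(default_factory=lambda: {"short_term": True, "long_term": False})
    planning_strategy: str = "react"
    self_eval_criteria: list[str] = field(default_factory=list)
    output_schema: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.planning_strategy not in PLANNING_STRATEGIES:
            raise ValueError(f"unknown planning strategy: {self.planning_strategy!r}")

    def to_json(self) -> str:
        """Serialize the whole blueprint, including nested tool definitions."""
        return json.dumps(asdict(self), indent=2)
```

A flat, serializable structure like this is convenient for evolution because each field can be mutated or crossed over independently and the result persisted as a single JSON document.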

📊 agent_forge/full_agent_evaluator.py

Multi-objective fitness evaluator: the forge's quality gate.

Agent Spec
    │
    ▼
┌─────────────────────────────┐
│  Build Docker Container     │
│  └─ Install dependencies    │
│  └─ Configure environment   │
└──────────┬──────────────────┘
           ▼
┌─────────────────────────────┐
│  Execute Against Benchmarks │
│  └─ Task completion check   │
│  └─ Tool call validation    │
│  └─ Planning analysis       │
└──────────┬──────────────────┘
           ▼
┌─────────────────────────────┐
│  Score Across 8 Dimensions  │
│  └─ Correctness (30%)       │
│  └─ Tool-Use (15%)          │
│  └─ Planning (15%)          │
│  └─ + 5 more metrics        │
└─────────────────────────────┘
  • Builds Docker containers from agent specs
  • Executes agents against benchmark tasks
  • Scores across 8+ fitness dimensions
  • Handles sandbox timeouts and failures gracefully
  • Logs detailed per-dimension metrics
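
Graceful handling of timeouts and failures might look like the following sketch, which wraps any command in a time limit and converts every failure mode into a structured result instead of an exception. `SandboxResult` and `run_sandboxed` are hypothetical names; in the forge the command would presumably be a Docker invocation, which is an assumption not shown here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class SandboxResult:
    ok: bool
    reason: str
    stdout: str = ""


def run_sandboxed(cmd: list[str], timeout_s: float) -> SandboxResult:
    """Run `cmd` with a wall-clock limit and never raise on failure.

    Assumption: the caller passes the full command line, for example a
    container invocation; this helper only enforces the timeout.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return SandboxResult(ok=False, reason="timeout")
    except OSError as exc:
        # Raised, for example, when the executable does not exist.
        return SandboxResult(ok=False, reason=f"launch failed: {exc}")
    if proc.returncode != 0:
        return SandboxResult(ok=False, reason=f"exit code {proc.returncode}", stdout=proc.stdout)
    return SandboxResult(ok=True, reason="completed", stdout=proc.stdout)


if __name__ == "__main__":
    print(run_sandboxed([sys.executable, "-c", "print('hello')"], timeout_s=10))
```

A failed or timed-out run can then be scored as zero fitness rather than aborting the whole evolution loop.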

🧠 agent_forge/meta_evolver.py

Evolution strategy optimizer: the forge that forges itself.

┌──────────────────────────────────┐
│         meta_evolver.py          │
│                                  │
│  Input: population fitness deltas│
│                                  │
│  ┌────────────────────────────┐  │
│  │ Track operator success     │  │
│  │ per operator               │  │
│  └──────────┬─────────────────┘  │
│             ▼                    │
│  ┌────────────────────────────┐  │
│  │ Adjust probabilities       │  │
│  │ up-weight winners          │  │
│  │ down-weight losers         │  │
│  └──────────┬─────────────────┘  │
│             ▼                    │
│  ┌────────────────────────────┐  │
│  │ Detect stagnation          │  │
│  │ if flat → novelty search   │  │
│  └──────────┬─────────────────┘  │
│             ▼                    │
│  Output: new evolution config    │
└──────────────────────────────────┘
  • Tracks which mutation/crossover operators produce the best fitness gains
  • Adjusts operator probabilities in real-time (self-tuning weights)
  • Evolves the evolution strategy itself (meta-level adaptation)
  • Detects stagnation and introduces novelty-driven exploration
  • Persists strategy state across runs
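
The self-tuning behavior described above can be approximated with an exponential moving average of each operator's fitness gain plus a simple plateau check. Everything in this sketch (the `OperatorTuner` class, the `learning_rate` and `floor` parameters, and the `is_stagnant` window and threshold) is a hypothetical illustration rather than the module's real logic.

```python
class OperatorTuner:
    """Tracks a running credit per operator and turns it into probabilities."""

    def __init__(self, operators: list[str], learning_rate: float = 0.2, floor: float = 0.05):
        self.learning_rate = learning_rate
        self.floor = floor  # keeps every operator selectable
        self.credit = {op: 0.0 for op in operators}

    def record(self, operator: str, fitness_delta: float) -> None:
        """Blend the latest gain into the operator's moving average.
        Negative deltas count as zero gain."""
        gain = max(0.0, fitness_delta)
        old = self.credit[operator]
        self.credit[operator] = (1 - self.learning_rate) * old + self.learning_rate * gain

    def probabilities(self) -> dict[str, float]:
        """Normalize floor + credit so the values sum to 1."""
        raw = {op: self.floor + c for op, c in self.credit.items()}
        total = sum(raw.values())
        return {op: value / total for op, value in raw.items()}


def is_stagnant(best_history: list[float], window: int = 5, epsilon: float = 1e-3) -> bool:
    """True when the best fitness improved by less than epsilon over the last `window` generations."""
    if len(best_history) <= window:
        return False
    return best_history[-1] - best_history[-1 - window] < epsilon
```

When `is_stagnant` returns true, a caller could temporarily switch to novelty-driven exploration, for instance by rewarding blueprints that differ from the current population instead of rewarding raw fitness.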

📈 dashboard/main.py

FastAPI-based web dashboard providing:

| Feature | Description |
|---|---|
| 📊 Population View | Real-time visualization of the agent population |
| 📈 Fitness Trajectory | Score over time across all dimensions |
| 🔍 Agent Inspector | Compare blueprint specs side-by-side |
| 🎯 Dimension Breakdown | Per-dimension score distribution |
| 🎮 Evolution Controls | Pause, resume, and manual trigger |

📁 Project Structure

grounded_agent_forge/
├── README.md                         # This file
├── LICENSE                           # MIT license
├── pyproject.toml                    # Project metadata + dependencies
├── AGENTS.md                         # Agent collaboration conventions
├── CHANGELOG.md                      # Release history
├── CONTRIBUTING.md                   # How to contribute
├── SECURITY.md                       # Security policy
├── .env.example                      # Environment template
├── .gitignore                        # Git ignore rules
│
├── agent_forge/                      # ⚒️ Core forge modules (primary)
│   ├── __init__.py
│   ├── orchestrator.py               # Evolution loop coordinator
│   ├── agent_spec_generator.py       # Agent blueprint generator
│   ├── full_agent_evaluator.py       # Multi-objective fitness evaluator
│   └── meta_evolver.py               # Strategy adaptation
│
├── dashboard/                        # 📊 Real-time web dashboard
│   └── main.py                       # FastAPI application
│
├── run_forge_loop.sh                 # Shell automation wrapper
│
├── .github/                          # 🔄 CI/CD + community
│   ├── workflows/
│   │   ├── ci.yml                    # Lint + import checks
│   │   └── badge.yml                 # Dynamic score badge
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   ├── feature_request.md
│   │   └── config.yml
│   ├── dependabot.yml
│   ├── FUNDING.yml
│   └── CODEOWNERS
│
├── docs/                             # 📚 Documentation
├── experiments/                      # 🔬 Experiment outputs
├── benchmarks/                       # 📋 Task definitions
│
├── evaluator/                        # (legacy) Grounded evolution evaluator
├── population/                       # (legacy) Evolved prompts
├── memory/                           # (legacy) Evolution state
├── analysis/                         # (legacy) Visualization scripts
├── generator.py                      # (legacy) LLM code generation
├── infinite_research_loop.py         # (legacy) Grounded evolution loop
├── mutation_engine.py                # (legacy) Prompt mutation operators
└── population_manager.py             # (legacy) Population persistence

Note: Modules marked "(legacy)" are carried forward from grounded_evolution. They remain functional but the primary development focus is on agent_forge/.


🔬 Research Context

Grounded Agent Forge explores the frontier of evolutionary software optimization:

| Research Direction | Description |
|---|---|
| 🧬 Blueprint-Level Evolution | Moving from prompt text optimization to full agent architecture evolution |
| 📦 Execution-Grounded Multi-Objective Fitness | Real Docker sandbox execution across 8+ fitness dimensions |
| 🔄 Meta-Evolutionary Adaptation | The evolutionary strategy itself evolves, preventing stagnation |
| 🧩 Task Specialization | Populations naturally diversify into domain-specific agent archetypes |
| 🔍 Self-Evaluating Agents | Agents that can assess their own output quality are rewarded |

mindmap
  root((Agent Forge))
    Blueprint Evolution
      System prompts
      Tool definitions
      Memory architectures
      Planning strategies
    Execution Grounding
      Docker sandbox
      Real execution metrics
      Multi-objective scoring
    Meta Evolution
      Self-tuning weights
      Strategy adaptation
      Novelty search
    Task Specialization
      Domain clustering
      Niche formation
      Pareto optimization
    Dashboard
      Real-time viz
      Population analysis
      Control interface

What This Is NOT

  • ❌ A claim of AGI or sentience
  • ❌ A self-conscious or self-aware system
  • ❌ Runaway recursive self-improvement

✅ It is a well-scoped experimental system for studying how genetic algorithms can evolve complete agent architectures, with real execution validation in isolated sandboxes.


🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for details.

Quick start for contributors:

# Fork & clone
git clone git@github.com:YOUR_USERNAME/grounded_agent_forge.git

# Install dev dependencies
pip install -e ".[forge]" ruff

# Lint your code
ruff check agent_forge/ dashboard/

# Open a PR

📄 License

MIT; see LICENSE.


🙏 Credits

| Contribution | Link |
|---|---|
| 🧬 Predecessor | grounded_evolution: execution-grounded prompt evolution platform with 203 evolution cycles |
| 📜 Inspiration | autoresearch by Andrej Karpathy: the original lexical prompt evolution concept |
| 🤖 Built Using | DeepSeek V4 as the primary coding model for this project |

Made with 🧬 by NullLabTests · Evolution is the ultimate optimizer
