HarnessX
Compose. Adapt. Evolve.

Compose the Harness, define the Agent.
From zero-code to full customization: one core, X entry points.


Overview • Architecture • Quick Start • Benchmarks • Roadmap • Chinese documentation


🔭 Overview

The harness, not just the model, determines agent performance. The same base model produces dramatically different results depending on how context is managed, how tools are orchestrated, how errors are recovered, and how evaluation signals feed back.

HarnessX is a harness foundry: forge any number of agent harnesses from reusable processors and bundles, pair each with any model, and evolve them through training, all without rewriting the agent.

Most frameworks have solved model swapping. Behavior swapping remains expensive: switching from a coding agent to a research agent, or adding memory or guardrails, means rewriting the agent.

HarnessX solves this with one clean separation:

agent = model.agentic(harness)
  • ModelConfig: provider routing, fallback, per-role model assignment
  • HarnessConfig: the full behavior pipeline (tools, memory, processors, trace, sandbox)

The X in HarnessX stands for eXtensible Behavior Composition, meaning you can compose, adapt, and evolve harnesses without rewriting the agent:

🧩 Compose: a 9-dimension behavior pipeline in which every behavior is a Processor, combined with the | operator.

⚙️ Adapt: the harness observes its own performance and automatically searches for better harness configurations.

🚀 Evolve: every run produces reward-annotated trajectories that feed SFT / RL training.
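To make the Compose idea concrete, here is a minimal sketch of how a | operator can chain behaviors in Python. It is an illustration only, not the HarnessX API: the Processor class, the dict-based state, and the two example functions are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # hypothetical: a plain dict standing in for agent state


@dataclass(frozen=True)
class Processor:
    """Hypothetical processor: a named function from state to state."""

    name: str
    fn: Callable[[State], State]

    def __call__(self, state: State) -> State:
        return self.fn(state)

    def __or__(self, other: "Processor") -> "Processor":
        # `a | b` builds a new processor that runs a first, then b.
        return Processor(
            name=f"{self.name}|{other.name}",
            fn=lambda state: other(self(state)),
        )


def add_system_prompt(state: State) -> State:
    # Return a copy of the state with a system prompt attached.
    return {**state, "system": "You are a careful assistant."}


def truncate_history(state: State) -> State:
    # Keep only the two most recent messages.
    return {**state, "history": state.get("history", [])[-2:]}


pipeline = Processor("system_prompt", add_system_prompt) | Processor(
    "truncate", truncate_history
)
result = pipeline({"history": ["m1", "m2", "m3"]})
print(pipeline.name)
print(result)
```

Because each stage returns a new state rather than mutating its input, stages can be reordered or swapped without touching the others, which is the property the | composition relies on.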


πŸ—οΈ Architecture

HarnessX Architecture

β†’ See docs/architecture.md for the full 9-dimension behavior pipeline, processor hook points, and composition API.


🚀 Quick Start


Install

One-click install (interactive: asks before installing uv, Node.js, and the optional IM Gateway):

curl -sSf https://raw.githubusercontent.com/Darwin-Agent/HarnessX/main/scripts/install.sh | bash

Non-interactive (installs everything without prompts):

curl -sSf https://raw.githubusercontent.com/Darwin-Agent/HarnessX/main/scripts/install.sh | bash -s -- --all

Both commands install uv, Python 3.12, harnessx, and (if Node.js is available) the Harness Lab frontend. After installation, reload your shell or run source ~/.bashrc (or source ~/.zshrc on macOS).

Manual install with uv
uv python install 3.12
uv venv --python 3.12 .venv
source .venv/bin/activate
uv pip install -e .
# Build frontend (required for hx lab)
cd frontend && npm install && npm run build && cd ..

CLI

export ANTHROPIC_API_KEY=sk-...

hx "Research 2026 AI agent trends and write a structured report"
hx -p "Write a Python fizzbuzz"     # non-interactive, print and exit
hx -c path/to/config.yaml           # load a YAML config
hx --resume <run_id>                # resume a previous session
hx lab                              # open the Lab UI at localhost:8000

IM Gateway

Connect your agent to Feishu, Telegram, Slack, Discord, or DingTalk with a single service. The gateway ships with a built-in React console for managing channels, sessions, and workspaces.

hx-gateway start   # start the gateway (configured in ~/.harnessx/gateway.yaml)

→ See gateway/README.md for setup, channel configuration, and architecture.

Python SDK

Minimal runnable example
import asyncio

from harnessx import BaseTask, HarnessConfig
from harnessx.core.model_config import ModelConfig
from harnessx.providers.anthropic_provider import AnthropicProvider


async def main():
    # Model side: which provider and model serve the main role.
    model = ModelConfig(main=AnthropicProvider("claude-sonnet-4-6"))
    # Behavior side: bind a default harness to the model to get an agent.
    agent = model.agentic(HarnessConfig())
    # Run a single task and print the final answer.
    result = await agent.run(BaseTask(description="What is 2 + 2?"))
    print(result.final_output)


asyncio.run(main())

📊 Benchmarks

HarnessX provides two evolution loops that systematically improve agent performance on any benchmark:

  • Harness Evolution: a meta-harness analyzes trajectories and automatically searches for better processor combinations, prompt strategies, and tool configurations, without changing the model.
  • Model Evolution: reward-annotated trajectories from harness runs feed RL fine-tuning (via VERL), improving the model itself.
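As a rough illustration of how reward-annotated trajectories could feed training, the sketch below keeps only high-reward runs and serializes them as JSONL records for SFT. The Step and Trajectory classes, the to_sft_jsonl helper, the record fields, and the 0.8 threshold are all assumptions made for this example; they are not the HarnessX trajectory format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Step:
    role: str
    content: str


@dataclass
class Trajectory:
    """Hypothetical record of one run: the task, its steps, and a scalar reward."""

    task: str
    steps: List[Step]
    reward: float


def to_sft_jsonl(trajectories: List[Trajectory], min_reward: float = 0.8) -> str:
    # Keep only trajectories whose reward clears the threshold,
    # then emit one JSON object per line.
    records = [
        json.dumps({"prompt": t.task, "messages": [asdict(s) for s in t.steps]})
        for t in trajectories
        if t.reward >= min_reward
    ]
    return "\n".join(records)


runs = [
    Trajectory("What is 2 + 2?", [Step("assistant", "4")], reward=1.0),
    Trajectory("What is 2 + 2?", [Step("assistant", "5")], reward=0.0),
]
sft_data = to_sft_jsonl(runs)
print(sft_data)
```

An RL pipeline would typically keep the low-reward runs as well, since the reward itself is the training signal; the filter shown here applies to the SFT case only.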

The two loops compose: evolve the harness first, then evolve the model on top. Below are results on the GAIA benchmark. See benchmarks/README.md for additional benchmarks and adapter details.

Harness Evolution (Qwen 3.5 9B)

Starting from a default harness (R0, 33%), the meta-harness discovers better configurations round by round, reaching 47% by R3: a +14pp gain with zero model changes. → Reproduce: recipe/gaia_evolver/

[Figure: Harness Evolution Config, Round 0 to Round 3]
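The round-by-round search described above can be pictured with a toy loop. This sketch uses a simple greedy strategy (try every single-field change, keep the best, repeat), which is a stand-in chosen for brevity and not the meta-harness algorithm. The option names and the integer scores are invented for illustration and are not GAIA results.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, str]


def evolve(
    baseline: Config,
    options: Dict[str, Tuple[str, ...]],
    score: Callable[[Config], float],
    rounds: int = 3,
) -> Tuple[Config, float]:
    """Greedy search: each round applies the single-field change that scores best."""
    best, best_score = dict(baseline), score(baseline)
    for _ in range(rounds):
        candidates = [
            {**best, field: value}
            for field, values in options.items()
            for value in values
            if best[field] != value
        ]
        if not candidates:
            break
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score <= best_score:
            break  # no single change improves the score; stop early
        best, best_score = top, top_score
    return best, best_score


# Hypothetical configuration space and toy scoring function.
OPTIONS = {
    "memory": ("none", "summary"),
    "verifier": ("off", "self_check"),
    "prompt": ("plain", "plan_first"),
}
TOY_POINTS = {"summary": 3, "self_check": 2, "plan_first": 1}


def toy_score(config: Config) -> int:
    return sum(TOY_POINTS.get(value, 0) for value in config.values())


baseline = {"memory": "none", "verifier": "off", "prompt": "plain"}
best, best_score = evolve(baseline, OPTIONS, toy_score, rounds=3)
print(best, best_score)
```

In a real setting the score function would be a benchmark run, so each evaluation is expensive and a smarter search strategy than exhaustive single-field sweeps becomes worthwhile.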

Harness Evolution (GPT-5)

The same approach scales to frontier models. Overall GAIA accuracy rises from 62% to 84% after evolution, with gains across all five domains. → Reproduce: recipe/gaia_evolver/

[Figure: Harness Evolution, per-domain accuracy]

Model-Harness Co-Evolution (Qwen 3.5 9B)

When the two loops run together, the gains compound: harness evolution lifts the baseline from 33.97% to 41.67%, and model evolution pushes it further to 55.77%, a +64% relative improvement over the baseline, all on a 9B model. → Reproduce: recipe/verl_harnessX/

[Figure: Model-Harness Co-Evolution]
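The relative improvement quoted above follows directly from the three accuracy figures. A short check, using only the numbers stated in the text:

```python
baseline = 33.97    # default harness
harness = 41.67     # after harness evolution
co_evolved = 55.77  # after model evolution on top

harness_gain_pp = harness - baseline   # gain from the harness loop, in percentage points
model_gain_pp = co_evolved - harness   # additional gain from the model loop
relative = (co_evolved - baseline) / baseline * 100  # relative to the baseline

print(f"harness: +{harness_gain_pp:.2f} pp")
print(f"model:   +{model_gain_pp:.2f} pp")
print(f"overall: +{relative:.1f}% relative")
```

The absolute gain is 21.80 percentage points; dividing by the 33.97% baseline gives roughly 64%.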


📁 Project Structure

HarnessX/
├── harnessx/                  # 🧠 Core framework
│   ├── core/                  #    Harness, Builder, RunLoop, State, Events, Trajectory
│   ├── processors/            #    7 categories × multiple processors
│   │   ├── context/           #    📝 System prompt, history, user wrapper
│   │   ├── control/           #    🛡️ 13 safety & reliability processors
│   │   ├── evaluation/        #    📊 LLM judge, PRM, self-verify
│   │   ├── memory/            #    🧠 Extraction, retrieval, 5 strategies
│   │   ├── multi_model/       #    🔗 Model routing
│   │   ├── observability/     #    🔭 OTel, checkpoints, metrics
│   │   └── tools/             #    🔧 Skill loader, schema adapter, filters
│   ├── providers/             # 🔌 6 model backends + agentic mixin
│   ├── plugins/               # 🧩 Plugin base, discovery, builtins, dimensions
│   │   └── dimensions/
│   │       └── light_memory/  # 🧠 Light-Memory (self-developed)
│   ├── tools/                 # ⚒️ Tool registry, builtins
│   ├── sandbox/               # 📦 Local, Docker, E2B
│   ├── tracing/               # 📡 Journal, OTel, null tracer
│   ├── rl/                    # 🧬 RLConfigSpec, TaskBuilder
│   ├── bundles/               # 📦 Pre-composed capability bundles
│   ├── api/                   # 🌐 FastAPI + SSE for Lab UI
│   └── cli.py                 # ⌨️ CLI entry point (hx)
├── benchmarks/                # 📊 4 integrated + 3 ongoing benchmarks
├── recipe/                    # 🧪 slime (RL training recipe)
├── examples/                  # 📖 coding / research / assistant / custom_processor
├── extensions/                # 🔌 Skills (docx, pdf, pptx, xlsx)
├── frontend/                  # 🖥️ Lab UI (React + TypeScript + Tailwind)
└── tests/                     # ✅ Unit, integration, E2E

🗺️ Roadmap

For detailed design notes and motivation behind planned items, see ROADMAP.

| Phase | Focus | Status |
|-------|-------|--------|
| 1 | Core: 9-dimension behavior pipeline, 13 processors, multi-provider, SFT/RL bridge, 4 benchmarks, Lab UI | current |
| 2 | Meta-opt: Bayesian Optimization, Meta-Harness, auto config search | in progress |
| 3 | Self-evolution: closed-loop training, HarnessHUB community marketplace | planned |
| 4 | Memory: multimodal backends, third-party integrations (VERL, SuperMemory, OpenVKing) | planned |

In-Repo Implementations

  • Light-Memory: file-based memory with time-decay, daily compression, git versioning (harnessx/plugins/dimensions/light_memory/)
  • Slime RL recipe: SGLang rollout adapter + token annotation + GRPO training pipeline (recipe/slime/)
  • MetaHarness: agent observes its own trajectories and proposes harness config changes; observer harness + meta-agent + sandboxed promotion loop
  • LoCoMo benchmark: long-context memory evaluation covering session recall, cross-turn consistency, and compaction fidelity
  • Bayesian Optimization: surrogate model search over the ~10^6-configuration harness space
  • HarnessHUB: community platform to publish, version, and pull HarnessConfig bundles (hx pull coding-agent@v1.2; Lab UI panel; private registries)
  • Multimodal Memory: CLIP-based image/video memory backend via the plugin system
  • Harness Memory Evolution: closed loop (trajectories → RL fine-tuning → better model → better harness); population-level mutation + data flywheel
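The time-decay behavior mentioned for Light-Memory in the list above can be illustrated with a generic exponential half-life weighting. This is a sketch of the general technique only: the decayed_score function, the 7-day half-life, and the sample notes are assumptions, and the actual Light-Memory scoring may differ.

```python
from datetime import datetime, timedelta


def decayed_score(
    relevance: float,
    written_at: datetime,
    now: datetime,
    half_life_days: float = 7.0,  # assumed value, for illustration only
) -> float:
    """Halve a memory's relevance for every `half_life_days` of age."""
    age_days = (now - written_at).total_seconds() / 86400
    return relevance * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 1, 15)
notes = {
    "fresh note": decayed_score(1.0, now, now),
    "week-old note": decayed_score(1.0, now - timedelta(days=7), now),
    "two-week-old note": decayed_score(1.0, now - timedelta(days=14), now),
}
# Rank memories from most to least relevant after decay.
ranked = sorted(notes, key=notes.get, reverse=True)
print(notes)
print(ranked)
```

A shorter half-life favors recent context, while a longer one keeps older memories competitive during retrieval.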

Third-Party Integrations (opt-in, live in recipe/)

  • VERL: connect HarnessX rollouts to distributed PPO / GRPO training loops
  • MemPalace: structured episodic memory backend
  • SuperMemory: cloud-backed semantic memory via the plugin system
  • OpenVKing: vector-knowledge-graph memory for entity-rich domains
  • Memory quality metrics: retrieval precision / recall surfaced through HarnessJournal
  • Data synthesis pipeline: controlled SFT / preference-dataset generation with diversity constraints
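For the memory quality metrics listed above, retrieval precision and recall have standard definitions. The sketch below computes them for a single query; the precision_recall helper and the memory IDs are hypothetical and not tied to HarnessJournal.

```python
from typing import Iterable, Tuple


def precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical memory IDs: four retrieved, three actually relevant.
precision, recall = precision_recall(
    retrieved=["m1", "m2", "m3", "m4"],
    relevant=["m2", "m4", "m7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved items are relevant, and two of the three relevant items were found.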

🀝 Contributing

HarnessX is fully open-source under the MIT License. Contributions are welcome for:

  • 🧩 New processors: behavior modules for unexplored dimensions
  • 🧠 New memory backends: via the plugin system
  • 📊 New benchmark adapters: follow the benchmarks/ pattern
  • 🧪 RL training recipes: recipe/
  • 🖥️ Lab UI improvements

Please read CONTRIBUTING.md first.


@software{harnessx2026,
  title   = {HarnessX: A Composable, Self-Evolving Agent Harness Foundry},
  author  = {Darwin Agent Team},
  year    = {2026},
  url     = {https://github.com/Darwin-Agent/HarnessX},
  license = {MIT},
}

HARNESSX: Compose. Adapt. Evolve.
Built with care by the Darwin Agent Team
