Evaluate agents, run RL rollouts, and collect rollout data across any agent and any sandbox — one API, no bespoke microservice per pairing.
| Any agent | Any sandbox |
|---|---|
| Claude Code · Codex · Aider · OpenHands · your own | SWE-bench images · custom Docker · Daytona · E2B · your own backend |

⇣ bridged by ⇣ `await sandbox.remote(fn, *args, **kwargs)`

Agentix is small on purpose. The whole framework is two operations:
| | You write | You get |
|---|---|---|
| Bundle | `agentix build [path]` | A deploy-ready image with your code and its dependencies |
| Remote call | `await sandbox.remote(fn, ...)` | The return value of `fn`, executed inside the sandbox |

fn is any importable Python callable — an agent, a shell helper, a
scorer, or a whole multi-step rollout. Args travel in, the typed return
value comes back out. There is no fixed RPC surface to conform to and no
base class for your code to inherit.

    from app import run

    result = await sandbox.remote(run, input="hello")

Side traffic rides along automatically: stdlib logging from inside the
sandbox replays into your host logs, and OTel-shaped /trace spans
capture every step — ready for eval dashboards and RL buffers.
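
As an illustration of what the `app.run` used in the call above could look like, here is a minimal sketch. The `RunResult` fields and the body of `run` are assumptions made up for this example and are not part of Agentix; the only things taken from the text are that `fn` is an ordinary importable callable with a typed return value, and that stdlib `logging` calls made inside it are replayed on the host.

```python
# app.py: a hypothetical module; none of these names come from Agentix itself.
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class RunResult:
    """Illustrative return type: any plain, typed value would do."""
    output: str
    n_chars: int


async def run(input: str) -> RunResult:
    # Ordinary stdlib logging; per the text above, this is replayed on the host.
    logger.info("run() received %d characters", len(input))
    output = input.upper()
    return RunResult(output=output, n_chars=len(output))


if __name__ == "__main__":
    # Called directly here; inside a session the same function would be
    # passed to sandbox.remote(run, input=...).
    print(asyncio.run(run(input="hello")))
```

Because `run` has no base class and no framework imports, it can be unit-tested in-process exactly as shown before it is ever bundled.
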

    pip install agentixx agentix-runtime-basic agentix-provider-docker

Build a bundle once (this takes a few minutes); after that, every remote call takes seconds. From examples/hello-world:

    cd examples/hello-world
    uv sync
    uv run agentix build . --output dist/hello-world.bundle.tar
    BUNDLE=$(uv run agentix deploy docker dist/hello-world.bundle.tar --format json | jq -r .bundle)
    uv run python main.py --bundle "$BUNDLE"

The host code is just provider → session → remote call:

    from agentix.bash import run
    from agentix.provider.base import SandboxConfig
    from agentix.provider.docker import DockerProvider

    # BUNDLE holds the bundle reference printed by `agentix deploy` above.
    config = SandboxConfig(image="python:3.13-slim", bundle=BUNDLE)
    async with DockerProvider().session(config) as sandbox:
        result = await sandbox.remote(run, command="echo hello from $(uname -a)")

Build a cross-arch bundle by passing `--platform linux/amd64` to both `agentix build` and `agentix deploy`. Full walkthrough: quickstart.

The point of a single call surface is that an eval or RL loop is assembled from one primitive. The agent, the environment setup, and the scorer are all just functions you remote-call:
| You have | You expose | You call |
|---|---|---|
| An agent (Claude Code, Codex, OpenHands, …) | `async def run(...) -> RunResult` | `await sandbox.remote(run, ...)` |
| Shell, files, repo setup | `async def run(command: str) -> BashResult` | `await sandbox.remote(bash_run, ...)` |
| A benchmark or reward model | `async def score(...) -> Score` | `await sandbox.remote(score, ...)` |

examples/run-swe-rollouts is the
full loop end to end: sandbox agent run → patch extraction → SWE-bench
harness score → one rollout log per instance.
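
The wiring in the table above can be sketched as follows. Because a real provider needs a built bundle, this sketch uses `LocalSandbox`, a hypothetical stand-in whose `remote` simply awaits the function in-process. `RunResult`, `Score`, `agent_run`, `score`, and `rollout` are likewise illustrative names and shapes, not Agentix types.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class RunResult:
    patch: str


@dataclass
class Score:
    resolved: bool


class LocalSandbox:
    """Hypothetical stand-in: same call shape as sandbox.remote, but in-process."""

    async def remote(
        self, fn: Callable[..., Awaitable[Any]], *args: Any, **kwargs: Any
    ) -> Any:
        return await fn(*args, **kwargs)


async def agent_run(task: str) -> RunResult:
    # A real agent would edit a repository; this one returns a placeholder patch.
    return RunResult(patch=f"fix for {task}")


async def score(patch: str, expected: str) -> Score:
    # A real scorer would run a benchmark harness; this one compares strings.
    return Score(resolved=(patch == expected))


async def rollout(sandbox: LocalSandbox, task: str) -> dict:
    # Agent and scorer both go through the same remote-call primitive.
    result = await sandbox.remote(agent_run, task=task)
    verdict = await sandbox.remote(score, patch=result.patch, expected=f"fix for {task}")
    return {"task": task, "patch": result.patch, "resolved": verdict.resolved}


if __name__ == "__main__":
    print(asyncio.run(rollout(LocalSandbox(), "issue-1")))
```

The loop body never distinguishes between "agent" and "scorer"; swapping in a different agent or benchmark means passing a different function.
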
vs. sandbox runners (swe-rex,
E2B, Daytona, Harbor). A runner hands you a box and a fixed way to reach
into it — a predefined RPC surface, or "run a shell / docker exec
command" plus a vendor SDK. Anything richer means squeezing your logic
through that narrow hole. Agentix inverts it: the bundle installs your
real Python, and sandbox.remote(fn, ...) calls any importable
function and returns its typed value. A backend decides where the box
runs; Agentix decides what you can call inside it — so you layer it on
top of Docker, E2B, or Daytona.
| | swe-rex · E2B · Daytona · Harbor | Agentix |
|---|---|---|
| Reach into the sandbox | Fixed RPC surface, or shell / `docker exec` + vendor SDK | `await sandbox.remote(fn, ...)`: any importable function |
| Sandbox logs & stdout | Scrape command output | stdlib `logging` auto-bridged to the host over `/log` |
| Observability | Bring your own | `/trace` spans (OTel-shaped) for every step |
| Model under test | Whatever the agent's SDK speaks | `abridge` translates Claude ⇄ OpenAI ⇄ Gemini, so any agent runs on any model |

vs. rollout-as-a-service (ProRL-Agent-Server). ProRL popularized an HTTP server with task-specific handlers and token trajectories for RL trainers. Agentix shares the decoupling — training stays separate from rollout execution — with a lighter surface.
| | ProRL-Agent-Server | Agentix |
|---|---|---|
| Add a new task | Implement a handler, register it | Write a function, install it |
| Call a rollout | HTTP request to the service | await sandbox.remote(fn, ...) |
| Trajectories | Token-in / token-out over the service API | Captured by abridge as rollout logs |
| Sweet spot | HPC-scale multi-turn RL fleets | Teams wiring eval + RL data without a platform team |
Both designs are powerful at HPC scale. Agentix targets the much larger
set of research and product teams that want await remote(fn) with fewer
moving parts.
- One API for everything. Agent, tool, or scorer: the same `await sandbox.remote(fn, ...)`.
- Bundles from a normal Python project. `agentix build` reads `pyproject.toml`; an optional `default.nix` adds system binaries.
- Backends you choose. Local Docker/Podman, Daytona, E2B, Apptainer, or your own `SandboxProvider`.
- Sandbox logs on the host. `print` and stdlib `logging` from any remote call replay into your host logging tree over `/log`, with no scraping of command output.
- Tracing built in. OTel-shaped `/trace` spans for every step, the same across agents and environments; ship them anywhere with `agentix-trace-otel`.
- Any model behind any agent. `abridge` translates between Claude, OpenAI, and Gemini, so an agent that speaks one provider can be evaluated against any model, and the host captures the trajectory (token-in / token-out) for RL.

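
To make the log-bridging idea concrete, here is a standard-library-only sketch of replaying records captured elsewhere into a host logging tree. It is not Agentix's implementation: the `/log` wire format is not described here, so the dict payload, `serialize_record`, and `replay_on_host` are assumptions for illustration.

```python
import logging


def serialize_record(record: logging.LogRecord) -> dict:
    # Sandbox side (assumed): reduce a record to plain, transportable fields.
    return {
        "name": record.name,
        "levelno": record.levelno,
        "levelname": record.levelname,
        "msg": record.getMessage(),  # message with arguments already applied
    }


def replay_on_host(payload: dict) -> None:
    # Host side (assumed): rebuild the record and hand it to the same-named
    # logger, so the host's own levels and handlers apply.
    record = logging.makeLogRecord(payload)
    host_logger = logging.getLogger(record.name)
    if host_logger.isEnabledFor(record.levelno):
        host_logger.handle(record)


class ListHandler(logging.Handler):
    """Collects formatted messages so the replay can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


# Simulate a record produced inside the sandbox.
sandbox_record = logging.LogRecord(
    name="app.agent", level=logging.INFO, pathname="app.py", lineno=1,
    msg="step %d done", args=(3,), exc_info=None,
)

# Host logging tree: a handler on the parent logger "app".
handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
parent_logger = logging.getLogger("app")
parent_logger.setLevel(logging.INFO)
parent_logger.addHandler(handler)

replay_on_host(serialize_record(sandbox_record))
```

Routing by logger name is what lets the host's existing configuration (levels, formatters, handlers on parent loggers) govern sandbox output without any parsing of command output.
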
One monorepo, separate PyPI packages. The core is agentixx; everything
else is an optional plugin under plugins/.
| Package | Role |
|---|---|
| `agentix-runtime-basic` | `agentix.bash`, file ops, sandbox primitives |
| `agentix-provider-docker` · `-daytona` · `-e2b` · `-apptainer` | Sandbox backends |
| `agentix-runner` | `run_rollouts(...)`: batch eval/rollout orchestration |
| `agentix-dataset-swe` | SWE-bench task images + official-harness scoring |
| `agentix-agent-claude-code` · `-mini-swe-agent` · `-qwen-code` | Agent adapters |
| `agentix-bridge` | Model translation + rollout → RL buffer capture (`abridge`) |
| `agentix-trace-otel` | Export `/trace` spans to any OTLP backend |

Drop a directory under plugins/ and it becomes a workspace member;
uv sync --all-packages installs it editable.
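
For intuition about the batch orchestration role listed for `agentix-runner` in the table above, here is a generic asyncio sketch of running many rollouts with bounded concurrency. It does not reproduce the signature or behavior of `run_rollouts(...)`, which is not documented here; `run_batch`, `fake_rollout`, and `max_concurrency` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_batch(
    rollout: Callable[[str], Awaitable[dict]],
    instance_ids: Iterable[str],
    max_concurrency: int = 4,
) -> list[dict]:
    """Run one rollout per instance, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(instance_id: str) -> dict:
        async with semaphore:
            try:
                return await rollout(instance_id)
            except Exception as exc:  # keep one failure from sinking the batch
                return {"instance_id": instance_id, "error": repr(exc)}

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(i) for i in instance_ids))


async def fake_rollout(instance_id: str) -> dict:
    # Stand-in for "open a session, remote-call the agent, remote-call the scorer".
    if instance_id == "bad":
        raise RuntimeError("sandbox failed to start")
    await asyncio.sleep(0)
    return {"instance_id": instance_id, "resolved": True}


if __name__ == "__main__":
    rows = asyncio.run(run_batch(fake_rollout, ["a", "bad", "b"], max_concurrency=2))
    for row in rows:
        print(row)
```

Recording failures as rows rather than raising keeps one log entry per instance, which matches the "one rollout log per instance" shape described earlier.
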

    git clone https://github.com/Agentiix/Agentix
    cd Agentix
    uv sync --all-packages --all-extras
    uv run pytest
    uv run ruff check agentix/ tests/

This repo is a uv workspace: core, plugins, and examples share one lockfile, so an edit to any member is live in the shared venv with no publish cycle. See ARCHITECTURE.md for how bundles and remote calls work under the hood.