Role Foundry is a framework for training AI apprentices under honest, holdout-aware evaluation. The core unit is a generation: each evaluated generation leaves an inspectable provenance chain — receipt bundle, evaluation context, score deltas, and a promotion decision. Promoted public generations can then be staged for ERC-8004 issuance on Base through a thin Role Foundry-owned Python mint path.
The framework handles the training loop. The role defines what the apprentice learns.
The current alpha demo ships one concrete role: a Software Engineer apprentice that implements Role Foundry product slices under public-regression and local private-holdout review. Robin + Neo are the teachers. The apprentice builds the system it is being trained by.
Honest scope note: The currently shipped curriculum slices are frontend/product-heavy because that is what the alpha app exposes. The “Software Engineer” framing reflects the intended breadth — code review, regression prevention, documentation honesty — not a claim that all those curriculum families are shipped today.
- Name: Software Engineer Apprentice
- Job: Ship coherent, judge-facing Role Foundry product slices
- Teachers: Robin + Neo
- Constraints: standalone repo, demo mode first, no auth, no Privy, no fake live integrations
docker compose up serves a static demo with pre-baked data that shows:
- Vision/system overview — explains the general framework vs the current concrete role
- Apprentice definition — the Software Engineer apprentice is learning to build Role Foundry itself
- Public curriculum — visible training slices like rewriting the apprentice story, clarifying curriculum vs holdouts, exposing score deltas, and adding proof bundles
- Holdout integrity story — judge-visible holdout categories in the demo plus a local-only scaffold for fresh teacher-only rewrites outside the public repo
- Two judged runs — Run 2 clearly improves over Run 1
- Teacher scorecard — per-scenario teacher notes plus aggregate score across public curriculum and holdout-facing review
- Failure → curriculum loop — failed holdout themes become the next public teaching themes without exposing hidden prompt text
- Iteration history — score deltas over time stay visible in both the UI and stored run data
- Proof bundle — receipt summary, changed files, policy snapshot, and transcript excerpt
- Portable identity path — promoted public generations can be drafted for ERC-8004 issuance on Base without faking a wallet transaction
This is the point of Role Foundry: make capability visible with honest evaluation instead of vibes.
- Teachers define a role — what good work looks like for this apprentice
- Role Foundry publishes public curriculum — scenarios the apprentice can practice on
- Role Foundry keeps fresh hidden holdouts teacher-only — the public repo carries the contract/template/tests for that lane, not the private prompts themselves
- The apprentice ships a generation — copy, UI, scorecard, or artifact surface
- Teacher judges the generation — receipts, scorecard context, and aggregate score become part of the generation record
- Later generations record deltas — the next run makes the better/equal/worse movement explicit
- Humans decide what gets promoted — public curriculum themes and readiness evidence can move forward without leaking hidden prompt text
- Promoted public generations can be staged for ERC-8004 issuance — Base is the current portable-identity target
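The record-and-delta steps above can be sketched as a small data structure. This is a minimal illustration, not the repo's schema: `GenerationRecord`, its fields, and `compare` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationRecord:
    """Hypothetical shape for one evaluated generation (not the repo's schema)."""
    generation_id: str
    aggregate_score: float          # teacher's aggregate score for this run
    receipt_paths: tuple[str, ...]  # pointers to receipt artifacts
    promoted: bool = False          # set only by an explicit human decision


def compare(previous: Optional[GenerationRecord],
            current: GenerationRecord,
            tolerance: float = 1e-9) -> str:
    """Label the movement between two generations as better/equal/worse."""
    if previous is None:
        return "baseline"  # nothing to compare against yet
    delta = current.aggregate_score - previous.aggregate_score
    if abs(delta) <= tolerance:
        return "equal"
    return "better" if delta > 0 else "worse"
```

Keeping `promoted` as a plain field that nothing in the comparison sets mirrors the rule that humans, not score deltas, decide promotion.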
Teachers can extend the holdout lane with manually curated episodes from external sources like SWE-bench, Playwright docs, or code-review guides. These stay teacher-only and never enter the public repo or student-visible curriculum. See docs/swe-bench-holdout-extension.md for the process and constraints.
SWE-bench usage is intentionally small and teacher-only: at most 5-10 manually rewritten episodes per extension round, stored in the existing gitignored private holdout path. This is not bulk integration or public curriculum.
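A teacher could enforce the size limit with a small local check before adding a round. The sketch below assumes a hypothetical episode shape with `id` and `manually_rewritten` fields. The real authoring workflow lives in docs/swe-bench-holdout-extension.md and may differ.

```python
MAX_EPISODES_PER_ROUND = 10  # upper end of the "5-10" range stated above


def validate_extension_round(episodes: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the round looks acceptable."""
    problems = []
    if len(episodes) > MAX_EPISODES_PER_ROUND:
        problems.append(
            f"too many episodes: {len(episodes)} > {MAX_EPISODES_PER_ROUND}"
        )
    for episode in episodes:
        # 'manually_rewritten' is an assumed flag, not a documented field.
        if not episode.get("manually_rewritten", False):
            problems.append(
                f"episode {episode.get('id', '?')} is not marked as manually rewritten"
            )
    return problems
```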
| | Demo mode | Live mode |
|---|---|---|
| What runs | Static UI with pre-baked apprentice data | Configured read-only browser shell + optional Clawith / runner-bridge receipts |
| Requirements | Docker (web container + local services) | liveDataUrl export for the browser shell; add Clawith image + model credentials for actual runs |
| Good for | Judges, walkthroughs, design review | Inspecting exported run state honestly, then actual training / evaluation once the backend is wired |
| Current status | Shipping now | Browser shell now consumes configured exports; native live storage/browser fan-out is still pending |
cp .env.example .env
docker compose up
open http://localhost:8080

This starts the static web demo plus Postgres and Redis.
To exercise the browser live shell against the committed alpha-loop sample:
http://localhost:8080/?mode=live&liveDataUrl=live-read-model.alpha-loop.sample.json
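If the export lives at a different path, the same URL can be built programmatically so the query string is escaped correctly. This is a small sketch: `live_shell_url` is a hypothetical helper, and only the `mode` and `liveDataUrl` parameters come from the URL above.

```python
from urllib.parse import urlencode


def live_shell_url(live_data_url: str, base: str = "http://localhost:8080/") -> str:
    """Build a browser live-shell URL for a given read-model export."""
    query = urlencode({"mode": "live", "liveDataUrl": live_data_url})
    return f"{base}?{query}"


if __name__ == "__main__":
    print(live_shell_url("live-read-model.alpha-loop.sample.json"))
```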
To start the optional backend-side live mode (requires an external Clawith image):
docker compose --profile live up

See docs/clawith-integration.md for prerequisites and the full integration guide.
For the narrow real external-executor proof lane, see docs/clawith-vibecosystem-real-path.md and submission/clawith-vibecosystem-roundtrip-proof.manifest.json.
The first honest runner-bridge slice is now in the repo. It is intentionally small:
- python3 -m runner_bridge.cli drives one run lifecycle
- LocalReplayRunner is the zero-secret backend that writes a transcript and artifact bundle
- optional teacher_evaluation input produces a teacher scorecard, public curriculum themes, and iteration history deltas
- the bridge stores a redacted request.json plus a raw request.private.json so sealed holdout prompts stay out of student-facing artifacts
- the bridge also emits a receipt provenance pack (receipts/manifest.json, baseline/candidate/evaluation exports, receipts/evidence-index.json, and receipts/summary.md) so judges can trace a run back to its source artifacts without changing the scoring semantics
- if you pass --clawith-url, the bridge patches run state into a Clawith-compatible control plane
- if you omit --clawith-url, you can still exercise the artifact/transcript contract locally
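The redacted/raw request split can be pictured with a minimal sketch. The file names request.json and request.private.json come from the list above. The `sealed_holdout_prompts` key, the `redacted_keys` marker, and the `write_request_pair` helper are assumptions made for illustration, not the bridge's actual contract.

```python
import json
from pathlib import Path

# Hypothetical field name for teacher-only content.
PRIVATE_KEYS = {"sealed_holdout_prompts"}


def write_request_pair(request: dict, run_dir: Path) -> None:
    """Write the raw request privately and a redacted copy for student-facing use."""
    run_dir.mkdir(parents=True, exist_ok=True)
    redacted = {k: v for k, v in request.items() if k not in PRIVATE_KEYS}
    # Record which keys were removed so the redaction itself stays visible.
    redacted["redacted_keys"] = sorted(k for k in request if k in PRIVATE_KEYS)
    (run_dir / "request.private.json").write_text(json.dumps(request, indent=2))
    (run_dir / "request.json").write_text(json.dumps(redacted, indent=2))
```

Listing the removed keys in the public copy keeps the redaction inspectable without exposing the prompt text.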
Examples:
python3 -m runner_bridge.cli \
  --request runner_bridge/examples/first-live-run.json \
  --clawith-url http://localhost:3000 \
  --clawith-secret "$CLAWITH_SECRET"

python3 -m runner_bridge.cli \
  --request runner_bridge/examples/teacher-eval-loop.json \
  --clawith-url http://localhost:3000 \
  --clawith-secret "$CLAWITH_SECRET"

Artifacts land under runtime/runs/<run_id>/.
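After a run, a quick local check can confirm that the expected receipt files are present. The sketch below assumes the files named earlier sit directly under the run directory. That layout and the `missing_artifacts` helper are illustrative assumptions, so adjust the paths to match the actual output.

```python
from pathlib import Path

# File names taken from the receipt provenance pack described above;
# their placement relative to the run directory is assumed.
EXPECTED = (
    "request.json",
    "receipts/manifest.json",
    "receipts/evidence-index.json",
    "receipts/summary.md",
)


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifact paths that do not exist under run_dir."""
    return [rel for rel in EXPECTED if not (run_dir / rel).is_file()]
```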
See docs/runner-bridge.md for the control-plane patch contract, teacher scorecard extension, the public benchmark-pack prompt path, comparison receipt flow, and the local/mockable fallback path.
There is now a first honest bridge-mediated autoresearch alpha loop:
python3 -m runner_bridge.autoresearch_alpha \
  --request runner_bridge/examples/autoresearch-alpha-public-loop.json

What it proves today:
- a real baseline → candidate student → candidate teacher-eval lifecycle
- a concrete better/equal/worse comparison receipt
- artifact coverage across all three stages
- an explicit integrity gate that allows public-regression claims while blocking fake sealed-eval claims
That last point matters. The public benchmark pack is usable now, but the current teacher-only families are still marked blocked_pending_rewrite, so the repo cannot honestly claim a fresh sealed holdout path yet. The alpha loop says that plainly instead of faking it.
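The gate's behavior can be illustrated with a short sketch. Only the `blocked_pending_rewrite` status comes from the text above. The claim names and the `gate_claim` function are hypothetical, and the real gate in runner_bridge may apply more checks.

```python
BLOCKED = "blocked_pending_rewrite"


def gate_claim(claim: str, teacher_family_status: dict[str, str]) -> tuple[bool, str]:
    """Return (allowed, reason) for a requested claim. Claim names are illustrative."""
    if claim == "public_regression":
        return True, "public benchmark pack is usable"
    if claim == "sealed_eval":
        blocked = sorted(
            name for name, status in teacher_family_status.items() if status == BLOCKED
        )
        if blocked:
            return False, "teacher-only families pending rewrite: " + ", ".join(blocked)
        # This sketch never auto-approves a sealed-eval claim.
        return False, "sealed-eval claims are not granted by this sketch"
    return False, f"unknown claim: {claim}"
```

The sketch deliberately has no path that returns `True` for `sealed_eval`, so an empty or fully rewritten status map cannot turn into an accidental sealed claim.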
There is also now a separate local-only private holdout scaffold:
- benchmarks/private-holdout-pack-template.json defines the public-safe shape only
- benchmarks/private-holdout-pack/ is gitignored for real teacher-only material
- tests/test_private_holdout_separation.py proves tracked artifacts stay clean
- runner_bridge.autoresearch_alpha can now hydrate a local private-holdout teacher lane from private_holdout_manifest while keeping student-visible artifacts redacted
That scaffold is now enough for a local private-holdout alpha run once fresh episodes are authored locally.
Allowed now: fresh hidden holdouts in a gitignored local manifest, real reruns, and receipts that keep teacher-only content out of tracked and student-visible artifacts.
Still blocked: sealed-eval claims, sealed certification, tamper-proof evaluation, and any claim that a third party independently sealed the holdouts.
The local-only shape is:
python3 -m runner_bridge.autoresearch_alpha \
  --request benchmarks/private-holdout-pack/local-private-holdout-alpha-loop.request.json

That request file stays local-only, points private_holdout_manifest at the gitignored manifest, and references holdout episodes by id so the bridge can hydrate teacher-only prompts into request.private.json only.
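Hydration by id can be sketched as a lookup that only ever enriches the private view. The manifest shape (`episodes` with `id` and `prompt`), the `holdout_episode_ids` key, and `hydrate_holdouts` are assumptions for illustration. The actual contract is defined in specs/012-private-holdout-pack.md.

```python
def hydrate_holdouts(request: dict, manifest: dict) -> tuple[dict, dict]:
    """Return (student_view, private_view); prompt text appears only in the latter."""
    prompts = {episode["id"]: episode["prompt"] for episode in manifest["episodes"]}
    ids = request["holdout_episode_ids"]
    missing = [episode_id for episode_id in ids if episode_id not in prompts]
    if missing:
        raise KeyError(f"holdout ids not found in manifest: {missing}")
    student_view = dict(request)  # keeps ids only, never prompt text
    private_view = {
        **request,
        "holdout_prompts": {episode_id: prompts[episode_id] for episode_id in ids},
    }
    return student_view, private_view
```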
The repo now ships a narrow Python-native path that turns the generation-provenance chain into a portable identity handoff on Base through the agent0-sdk / agent0-py flow:
- runner_bridge/product_integrations.py — after each evaluated generation, writes a local ERC-8004 registration draft, completion template, and a canonical Python mint contract tied back to the existing receipt/scorecard artifacts. No onchain writes.
- runner_bridge/erc8004_agent0.py — explicit live-mint helper: SDK(chainId, rpcUrl, signer, registryOverrides?) → createAgent(...) → setMetadata(...) → register(tokenUri) → wait_confirmed().
Target chains: Base Sepolia (chain id 84532, review/demo default) and Base Mainnet (chain id 8453, explicit submission target). Both are env-driven via BASE_SEPOLIA_RPC_URL / BASE_MAINNET_RPC_URL. Registry overrides remain optional via BASE_SEPOLIA_REGISTRY / BASE_MAINNET_REGISTRY if the SDK defaults ever need to be overridden.
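The env-driven chain selection might look like the following sketch. The chain ids and environment variable names are the ones listed above. `ChainTarget`, `resolve_chain`, and the `base-sepolia` / `base-mainnet` keys are hypothetical and not taken from runner_bridge/erc8004_agent0.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ChainTarget:
    name: str
    chain_id: int
    rpc_url: Optional[str]            # None means "not configured yet"
    registry_override: Optional[str]  # None means "use SDK defaults"


# name -> (chain id, RPC env var, registry override env var)
_CHAINS = {
    "base-sepolia": (84532, "BASE_SEPOLIA_RPC_URL", "BASE_SEPOLIA_REGISTRY"),
    "base-mainnet": (8453, "BASE_MAINNET_RPC_URL", "BASE_MAINNET_REGISTRY"),
}


def resolve_chain(name: str = "base-sepolia",
                  env: Mapping[str, str] = os.environ) -> ChainTarget:
    """Resolve a chain target from environment variables; Sepolia is the default."""
    chain_id, rpc_var, registry_var = _CHAINS[name]
    return ChainTarget(
        name=name,
        chain_id=chain_id,
        rpc_url=env.get(rpc_var) or None,
        registry_override=env.get(registry_var) or None,
    )
```

Passing `env` explicitly makes the resolution easy to test without touching the real process environment.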
What is real now: registration drafts, completion templates, the Python mint helper module, wired-vs-pending diagnostics, and a reviewer-visible story that promoted/public generations are the ones eligible for public issuance.
What is pending: agent0-sdk availability in the Python environment, a configured Base RPC URL, a hosted public token URI for the draft JSON, an explicit promoted/public decision, and a real confirmed mint. Live mint stays off by default behind ROLE_FOUNDRY_ERC8004_ENABLE_LIVE_MINT=1. No minting has been claimed or faked.
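The wired-vs-pending idea can be sketched as a function that lists every unmet precondition instead of attempting a mint. Only the `ROLE_FOUNDRY_ERC8004_ENABLE_LIVE_MINT` flag and the pending items come from the text above. The function name and its parameters are assumptions, and SDK availability is passed in as a boolean so that no import name is guessed.

```python
from typing import Mapping, Optional

LIVE_MINT_FLAG = "ROLE_FOUNDRY_ERC8004_ENABLE_LIVE_MINT"


def pending_for_live_mint(env: Mapping[str, str], *, sdk_available: bool,
                          rpc_url_var: str, token_uri: Optional[str],
                          promoted: bool) -> list[str]:
    """Return the unmet preconditions; an empty list means nothing is pending."""
    pending = []
    if env.get(LIVE_MINT_FLAG) != "1":
        pending.append(f"{LIVE_MINT_FLAG}=1 is not set")
    if not sdk_available:
        pending.append("agent0-sdk is not available in this environment")
    if not env.get(rpc_url_var):
        pending.append(f"{rpc_url_var} is not configured")
    if not token_uri:
        pending.append("no hosted public token URI for the draft JSON")
    if not promoted:
        pending.append("no explicit promoted/public decision")
    return pending
```

An empty result only means the preconditions are met. A confirmed mint is still a separate, explicit step.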
app/agent0_base_adapter.mjs remains in-repo as a historical browser-side experiment, but it is no longer the canonical repo path.
See docs/erc8004-base-agent0-adapter.md for usage and specs/013-erc8004-base-agent0-adapter.md for the full spec.
This repo is intentionally honest about what is not wired yet:
- the browser live shell is read-only — it consumes configured exports / receipts, but it does not chase native run storage or claim upstream Clawith parity
- only one local/mockable runner path is implemented today (LocalReplayRunner); teacher scorecards and iteration history are real contracts, but Claude/Codex-backed adapters still need wiring
- the committed alpha-loop browser fixture is a sample/read-model export, not proof that a fully real baseline → candidate → teacher-eval loop has already executed end to end on this branch
- no auth, no Privy, no fake consumer OAuth path
- no live artifact viewer backed by run storage fan-out
Live mode can now seed the repo's Clawith-compatible seam and drive bridge-mediated runs, and the browser shell can consume configured read-model / alpha-loop exports. That is still deliberately narrow: it does not claim that stock upstream Clawith natively accepts Role Foundry seed writes. Demo mode remains first-class and judge-friendly on its own.
Clawith is the live control plane. It is profile-gated in docker-compose.yml and can be started with --profile live. In the full system it owns:
- agent registry
- run registry
- scenario and holdout storage
- evaluation store and scorecards
- approvals, scheduling, and audit trails
For the hackathon MVP, Clawith integration is wired at the Docker layer and the browser now has a narrow read-only shell for configured exports. It still does not claim native upstream parity or full live artifact browsing. See docs/runner-bridge.md for the bridge pattern and docs/clawith-integration.md for live-mode setup.
| Runner | Role | Why |
|---|---|---|
| Claude + vibecosystem | Student / builder | Strong for implementation-heavy slices |
| Codex | Teacher / critic / evaluator | Independent model family for judging |
| Deterministic scripts | Verifier | Cheap pass/fail checks |
Using different model families for building and judging reduces correlated self-grading.
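A configuration check can make that separation explicit. In this sketch the runner names come from the table above, while the `family` labels, the `RUNNERS` mapping, and `shares_model_family` are hypothetical.

```python
# Runner names follow the table above; 'family' labels are illustrative.
RUNNERS = {
    "student": {"runner": "Claude + vibecosystem", "family": "claude"},
    "teacher": {"runner": "Codex", "family": "codex"},
}


def shares_model_family(runners: dict) -> bool:
    """True when student and teacher come from the same model family."""
    student = runners["student"]["family"].strip().lower()
    teacher = runners["teacher"]["family"].strip().lower()
    return student == teacher
```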
- docs/milestones.md — spec-first milestone rail and current delivery status
- docs/v1-mvp-plan.md — build slices
- docs/clawith-integration.md — live-mode setup, prerequisites, image contract, and read-only probe lane
- docs/clawith-adapter-bringup.md — seam-to-upstream mapping matrix and adapter-first bring-up prereqs
- docs/clawith-vibecosystem-real-path.md — smallest real external gateway + Claude/vibecosystem roundtrip lane
- docs/runner-bridge.md — bridge path, teacher evaluation contract, comparison receipts, and explicit auth deferral
- docs/public-benchmark-pack-v1.md — public-safe benchmark pack scope, blocked families, and local private-holdout path
- docs/software-engineer-curriculum-sources.md — narrow public source inventory for the software-engineering apprentice
- docs/teacher-source-curriculum-workflow.md — discover → curate → promote workflow for teacher-driven curriculum extension
- docs/private-holdout-authoring.md — local-only teacher workflow for authoring and auditing fresh holdouts
- docs/swe-bench-holdout-extension.md — teacher-only process for small manually curated SWE-bench-derived holdout episodes
- docs/conversation-log.md — curated build log for the submission
- submission/ — final submission packaging templates and review checklists
- docs/erc8004-base-agent0-adapter.md — ERC-8004 Base / agent0-sdk adapter usage and claim boundary
- docs/agent-town-connection.md — Agent Town relationship
- docs/synthesis-hackathon-ideation.md — ideation and ranking
- docs/synthesis-hackathon-stack-architecture.md — architecture notes
- specs/008-public-benchmark-pack-v1.md — public benchmark pack contract for the current alpha spine
- specs/009-clawith-readiness-probe.md — adapter-first upstream readiness probe
- specs/010-autoresearch-alpha-public-loop.md — the first executable public alpha loop with integrity gate
- specs/011-live-ui-read-model.md — read-only browser adapter for configured live/read-model exports
- specs/012-private-holdout-pack.md — local-only private holdout contract without shipping teacher material
- specs/013-erc8004-base-agent0-adapter.md — ERC-8004 Base / agent0-sdk adapter spec
GPL-3.0