Agent Skill for governed RL reward design, curriculum design, and hyperparameter tuning.
AutoRL is a contract-first Agent Skill / Codex skill package for RL engineers who want agents to automate bounded reinforcement-learning task iteration in oh-my-codex (OMX) workflows. It helps agents reason about, constrain, and verify changes around an existing training stack, especially when iterating on:
- reward design;
- curriculum design;
- trainer and optimizer hyperparameters;
- other explicitly approved training-surface changes.
It is not a trainer or runtime. AutoRL is a reusable control and documentation layer for agent-assisted RL improvement: clarify the mission, bind the task contract, execute only through a known external runtime, collect evidence, verify the result, then decide what happens next.
AutoRL is intentionally skill-only and contract-first. This repository does not provide an embedded RL runtime, framework adapter, scheduler, queue, dispatcher, retry daemon, long-running monitor, experiment launcher, checkpoint store, or log backend. Instead, it provides contracts, templates, gates, knowledge packs, and workflow guardrails for agents working with your existing RL project.
Reinforcement-learning projects fail in ways that ordinary software workflows often miss:
- reward hacking can look like progress until behavior is inspected;
- "one more run" quickly becomes untraceable experiment history;
- hidden launch commands make results hard to reproduce;
- agents can drift from analysis into unauthorized training edits;
- long-running jobs can tempt tools into acting like implicit schedulers or monitors.
AutoRL addresses those failure modes with one rule:
Every loop must be bounded by a task contract, grounded in a task-local cache, executed through an external runbook, and closed with artifact-backed evidence, verification, and a decision.
| Highlight | What it gives you |
|---|---|
| Contract-first gates | Plan and Execute proceed only after the task contract, metric spec, runbook, surface map, and cache state agree. |
| Task-local `.autorl/` cache | Task facts, runbooks, metrics, loop state, manifests, and decisions live under `.autorl/tasks/<task_slug>/`, not inside the reusable skill package. |
| Evidence-first decisions | Each iteration should produce retained artifacts, a feedback packet, a verification verdict, and a continue / stop / rollback / narrow / escalate decision. |
| External-runtime boundary | AutoRL records commands, artifacts, stop rules, resource facts, and outcomes; real training still runs through your validated external runtime. |
| Worktree isolation | Baseline-relative training-surface edits use fresh task-cache worktrees after explicit RL-loop entry authorization. |
| Source-traceable RL knowledge | Domain packs for mobile robot RL, manipulation RL, and WBC / legged RL help diagnose reward, curriculum, and tuning choices with applicability limits. |
| Anti-scheduler by design | Helpers may record bounded file-backed leases, but AutoRL must not become a scheduler, queue, monitor, dispatcher, or retry service. |
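The "bounded file-backed lease" idea can be sketched as follows. This is an illustrative Python helper, not part of the AutoRL package; the file layout and field names (`run_id`, `expires_at`) are assumptions. The point is that a lease is a recorded fact checked at read time, never a daemon, timer, or retry loop:

```python
import json
import time
from pathlib import Path

# Hypothetical helper: record a bounded, file-backed lease for one external run.
# An AutoRL-style tool only writes and reads this fact; it never polls or retries.

def write_lease(path: Path, run_id: str, ttl_seconds: float) -> None:
    """Record who holds the lease and when it expires. No daemon, no timer."""
    lease = {"run_id": run_id, "expires_at": time.time() + ttl_seconds}
    path.write_text(json.dumps(lease))

def lease_is_active(path: Path) -> bool:
    """Check the recorded fact at read time; expiry is passive, not enforced."""
    if not path.exists():
        return False
    lease = json.loads(path.read_text())
    return time.time() < lease["expires_at"]
```

Because expiry is passive, nothing in this sketch ever waits or wakes up on its own; an expired lease is simply a stale fact the next reader observes.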
AutoRL is a skill and contract surface for an OMX-driven agent workflow. Use it from the RL repository you want to improve; do not treat this repository as a standalone trainer.
AutoRL expects oh-my-codex (OMX) to be available. OMX provides the $autoresearch workflow that drives the validator-gated research loop around this skill.
```
omx setup
omx doctor
```

Use the skill workflow invocation form:

```
omx --madmax --xhigh
$autoresearch "Use AutoRL, xxx"   # supported workflow skill
```
Make this repository available to your Codex skills environment. The skill entrypoint is:

```
SKILL.md -> name: autorl
```
Run from the repository that contains the RL task you want to improve. Use OMX $autoresearch as the outer loop and ask it to use AutoRL as the contract and RL-iteration skill surface.
Recommended intake:
```
$deep-interview --autoresearch

Mission: use the AutoRL skill to govern a bounded RL iteration.
Goal: improve rough-terrain locomotion success rate.
Allowed surfaces: reward terms and curriculum only.
Budget: at most 3 runs, no more than 2 hours each.
Runtime: use the project training command, but verify the runbook first.
Success: higher success rate without worse fall rate or unsafe behavior.
Validation mode: prompt-architect-artifact.
```
Then execute the validator-gated loop:
```
$autoresearch

Use the AutoRL skill to discover the RL stack, materialize the .autorl task cache, bind the task contract, and proceed only through the validated external runbook.
```
Within the $autoresearch loop, AutoRL guides the workflow through:
- Discover the RL stack and relevant knowledge packs.
- Interview for mission, constraints, non-goals, metrics, budgets, approvals, and runtime details.
- Materialize concrete task-cache artifacts under `.autorl/tasks/<task_slug>/`.
- Contract the approved scope, allowed surfaces, runbook, metric spec, and stop rules.
Training execution or training-surface modification starts only after the user explicitly asks to enter the RL training loop and the task contract adjudicates Execute as allowed.
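The Execute gate described above can be sketched in a few lines. This is an illustrative Python function, not the package's actual adjudication logic; the field names (`entry_authorized`, `allowed_surfaces`, `runs_used`, `max_runs`) are hypothetical stand-ins for facts that would live in `task-contract.md`:

```python
# Illustrative sketch of the Execute gate, assuming a parsed task contract.
# Field names are hypothetical; the real adjudication record is task-contract.md.

def adjudicate_execute(contract: dict, proposed_surface: str) -> str:
    """Return 'allowed' or a 'blocked: ...' reason for one proposed change."""
    if not contract.get("entry_authorized"):
        return "blocked: user has not authorized RL-loop entry"
    if proposed_surface not in contract.get("allowed_surfaces", []):
        return f"blocked: surface '{proposed_surface}' is out of contract"
    if contract.get("runs_used", 0) >= contract.get("max_runs", 0):
        return "blocked: run budget exhausted"
    return "allowed"
```

Note that every branch blocks by default; execution is the only outcome that requires all contract facts to agree.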
After an external run completes, AutoRL returns through:
Evidence -> Verify -> Decide -> Handoff
No hidden "next run." No silent scope expansion. No unreviewed training edits.
```mermaid
flowchart LR
  A[Discover] --> B[Interview]
  B --> C[Materialize .autorl task cache]
  C --> D[Contract]
  D --> E[Plan one bounded change]
  E --> F[Execute via external runtime]
  F --> G[Evidence]
  G --> H[Verify]
  H --> I[Decide]
  I --> J{Continue?}
  J -->|yes, after gate| E
  J -->|stop / rollback / escalate| K[Handoff]
```
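The Decide step in the loop above can be sketched as a pure function from a verification verdict and budget facts to one of the five outcomes. The verdict labels and thresholds here are illustrative assumptions, not the package's normative decision policy:

```python
# Hypothetical Decide step: map a verification verdict plus budget facts
# to one of AutoRL's five outcomes. Labels and policy are illustrative.

def decide(verdict: str, regressed: bool, runs_left: int) -> str:
    if verdict == "failed" and regressed:
        return "rollback"    # metrics got worse; undo the bounded change
    if verdict == "failed":
        return "narrow"      # no harm done, but the change missed; shrink scope
    if verdict == "inconclusive":
        # out of budget with no signal: hand the call back to the operator
        return "escalate" if runs_left == 0 else "continue"
    # verdict == "passed"
    return "stop" if runs_left == 0 else "continue"
```

Keeping the decision a pure function of recorded facts is what makes it auditable: the same evidence bundle always yields the same verdict.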
Long-running training uses an external orchestration handoff. AutoRL records loop state, run params, resource facts, artifact expectations, stop/cancel details, and externally consumed completion/status facts. Waiting and next-operation timing belong to the external runtime, operator, or OMX $autoresearch workflow, not to AutoRL core.
AutoRL separates reusable package logic from task-specific operational facts.
A materialized task typically looks like this:
```
.autorl/
  tasks/
    <task_slug>/
      task-profile.md       # mission, constraints, non-goals
      surface-map.md        # allowed reward/curriculum/hyperparameter surfaces
      runbook.md            # concrete external runtime commands and artifacts
      metric-spec.md        # success, regression, and rollback metrics
      task-contract.md      # package-local Plan / Execute adjudication record
      state.json            # compact cache facts, schema, freshness, loop refs
      artifact-manifest.md  # retained logs, checkpoints, metrics, outputs
      worktrees/            # fresh isolated worktrees for authorized edits
      loops/                # loop state, run params, evidence, decisions
```
This cache is the operational authority for one concrete task instance. It does not redefine the AutoRL skill package.
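As a concrete illustration, a `state.json` for such a task might look like the following. Every field name here is an assumption for the example, not a normative schema:

```json
{
  "schema_version": 1,
  "task_slug": "rough-terrain-locomotion",
  "contract_ref": "task-contract.md",
  "cache_fresh": true,
  "active_loop": "loops/0003",
  "runs_used": 2,
  "max_runs": 3,
  "last_decision": "continue"
}
```

The file stays small on purpose: it holds compact facts and references, while the full evidence lives in the sibling markdown artifacts and `loops/` entries.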
AutoRL includes task-agnostic RL guidance that agents can cite during planning, evidence analysis, verification, and decision-making.
| Pack | Use it for |
|---|---|
| `modules/knowledge/mobile-robot-rl/` | Navigation, obstacle avoidance, social navigation, exploration, docking, and learned local planning. |
| `modules/knowledge/manipulation-rl/` | Dexterous hands, grasp / lift / place, insertion, assembly, tactile tasks, bimanual tasks, and deformable-object tasks. |
| `modules/knowledge/wbc-rl/` | Whole-body control, humanoids, quadrupeds, command tracking, terrain curricula, and sim-to-real robustness. |
| `modules/knowledge/stack-guidance/` | Framework discovery and stack-specific non-destructive validation patterns. |
Knowledge packs are advisory evidence, not governing authority. When they influence a plan or verdict, preserve their rule IDs, source IDs, applicability notes, exclusions, evidence levels, and uncertainty notes. They must not override the task contract, metric spec, runbook, or observed evidence.
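A knowledge card that satisfies these traceability requirements might look like the fragment below. The rule, IDs, and field names are invented for illustration; the actual card format in the knowledge packs may differ:

```markdown
## Rule WBC-RL-014: Penalize foot-slip velocity, not contact count
- Source: S-023 (legged-locomotion reward-shaping survey)
- Applies to: quadruped and humanoid terrain curricula with contact sensing
- Excludes: wheeled bases; tasks without reliable contact estimation
- Evidence level: empirical, multi-paper
- Uncertainty: slip thresholds are platform-specific; tune per robot
```

When a plan or verdict cites such a card, the rule ID and source ID travel with the citation so the influence stays auditable.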
```
.
├── SKILL.md       # Codex skill entrypoint and behavioral contract
├── README.md      # project overview and quick start
├── decisions/     # ADR-style rationale
├── docs/          # architecture, examples, feature list, module docs
└── modules/
    ├── governance/  # authority model, lifecycle gates, contract adjudication
    ├── task/        # task-cache layout, materialization, state, worktrees
    ├── planning/    # intake, discovery, iteration plans, evidence, verification, decisions
    ├── execution/   # external invocation and resource boundaries, not a runtime
    └── knowledge/   # stack guidance, domain packs, glossary, deferred-runtime guardrails
```
Recommended reading path:
1. `SKILL.md` – the executable skill contract.
2. `docs/modular-architecture.md` – module boundaries and workflow chaining.
3. `docs/feature-list.md` – must / should / could scope and anti-bloat rules.
4. `docs/task-example.md` – example lifecycle and artifact flow.
5. `modules/governance/authority-model/architecture.md` – source-of-truth order and runtime boundary.
| AutoRL does | AutoRL does not |
|---|---|
| Clarifies mission, constraints, metrics, budgets, and allowed surfaces. | Guess hidden runtime commands or task facts. |
| Materializes task-local contracts and cache artifacts. | Store downstream task profiles inside the skill package. |
| Plans one bounded reward, curriculum, or hyperparameter change. | Silently expand into unrelated training-surface edits. |
| Requires explicit RL-loop entry authorization before execution. | Launch training just because a plan exists. |
| Invokes only validated external runbooks. | Implement an embedded trainer, adapter, scheduler, queue, or monitor. |
| Compresses artifacts into feedback and verification records. | Treat a reward curve as sufficient proof of success. |
| Decides whether to continue, stop, roll back, narrow scope, or escalate. | Auto-resume or keep working while an external job runs. |
AutoRL should remain:
- Task-agnostic – reusable across downstream RL projects.
- Contract-first – phase transitions are explicit and auditable.
- Evidence-backed – claims cite artifacts, metrics, logs, or retained records.
- External-runtime friendly – built to work around your existing launcher, cluster, container, or job system.
- Anti-bloat – no package-owned runtime engine unless a future ADR explicitly changes the project direction.
- Agent-safe – bounded subagents may inspect, prepare, collect, and verify, but they do not independently authorize Plan or Execute.
High-quality contributions should strengthen the contract surface without sneaking in runtime ownership.
Good first contribution areas include:
- clearer task-cache examples;
- additional non-destructive validation probes;
- better metric and evidence templates;
- source-traceable knowledge cards with applicability limits;
- documentation and diagrams that make gates easier to follow.
Please avoid adding runtime engines, framework adapters, launch schedulers, long-running monitors, artifact backends, or hidden task-specific assumptions. If the project needs to move beyond the current skill-only boundary, introduce that transition through an ADR first.
AutoRL aims to make RL iteration feel less like gambling and more like engineering discipline:
- one task;
- one contract;
- one bounded change;
- one external runbook;
- one evidence bundle;
- one verified decision.
If you want AI agents to improve RL systems safely, reproducibly, and with auditability, AutoRL is designed for that goal.