A Claude Code plugin that wraps a decompose → plan → implement → review → checkpoint loop around long, multi-phase coding work.
Mantra: mechanical verification > AI judgment > human checkpoint. Push as much as possible into hooks; reserve human attention for what hooks and AI review cannot catch.
Status: v1.1 (post-m8-qc-html postmortem). Single-round reviews with HITL split, per-leaf commit budget, dual-track codex review, full state machine.
- Big tasks fall apart silently. Complexity compounds: 20 decision points × 80 % per-step accuracy ≈ 1 % end-to-end correct. Without a contract that every downstream step is checked against, "the agent finished" usually means "the agent got tired".
- Premature task completion. Training is still running, tests have not been run, acceptance bullets are not satisfied, and Claude marks the task done anyway. A mechanical gate, not the human, needs to refuse the commit.
- Doc–code drift goes unnoticed. An implementation step changes the code but does not update the top-level decomposition. The next session sees the inconsistency and reverts the code (real CWF-author experience). Treating the decomposition as the contract — and forcing every plan/review to read it — closes that loop.
- Human-in-the-loop is either too noisy or too sparse. Asking for input at every step is exhausting; full-auto runs drift unwatched. Devstack splits cognitive load: heavy effort at decompose time (one human-led pass), light dispatch at review time (5 single-letter options), bounded checkpoint surface.
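The compounding figure in the first bullet can be checked directly. The following is a minimal sketch, assuming every decision point succeeds independently with the same probability; the function name `end_to_end_accuracy` is illustrative and not part of devstack.

```shell
#!/usr/bin/env bash
# Probability that every step in a chain succeeds, assuming independent steps
# with identical per-step accuracy (an illustrative model, not a devstack API).
end_to_end_accuracy() {
  local steps="$1" per_step="$2"
  awk -v n="$steps" -v p="$per_step" 'BEGIN { printf "%.4f\n", p ^ n }'
}

end_to_end_accuracy 20 0.8   # prints 0.0115, i.e. roughly 1 %
```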
# In a Claude Code session, in your project directory:
/init # one-shot opt-in + status print
/decompose <high-level goal> # human-led contract authoring
/auto-implement # run plan → impl → review per leaf
That's the whole happy path. Each leaf phase will:
- Plan: Claude drafts `plan.md` with typed task IDs and mechanical acceptance.
- PlanReview: codex (read-only) verifies the plan against the decomposition.
- Implement: Claude (or codex-goal) executes one task at a time; `task-completion-gate.sh` runs your acceptance predicates on every commit.
- PhaseReview: two codex passes run concurrently: Track A (code quality) and Track B (contract alignment).
- Checkpoint: when HITL is on, you dispatch with a single letter: `y / r / n / f / aN`.
Every leaf produces exactly 2 NEW commits: <slug>/phase X.Y: plan approved and <slug>/phase X.Y: implementation complete. Commits are how state is recovered across sessions.
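Because state is recovered from commits, the subject format matters. The sketch below recognises the two per-leaf subjects named above. It is an illustration only: the function name `is_leaf_commit_subject` and the slug character set `[a-z0-9-]` are assumptions, and it does not cover any other reserved subjects the real gate accepts.

```shell
#!/usr/bin/env bash
# Returns 0 when a commit subject matches one of the two per-leaf subjects
# named above. The slug character set ([a-z0-9-]) is an assumption.
is_leaf_commit_subject() {
  local re='^[a-z0-9-]+/phase [0-9]+\.[0-9]+: (plan approved|implementation complete)$'
  [[ "$1" =~ $re ]]
}

is_leaf_commit_subject "auth/phase 1.2: plan approved" && echo "leaf commit"
is_leaf_commit_subject "fix typo in README" || echo "not a leaf commit"
```

Filtering `git log --format=%s` through a check like this is one way a phase history could be read back from the log.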
human codex (read-only) codex / claude (read-write)
│ │ │
/init ────► opt-in marker + workflow.config.json │ │
│ │ │
/decompose ► author decomposition-<slug>.md ────────────► plan/contract review ──┐ │
│ │ │ │
▼ │ │ │
approved? ──no──► /decompose amend │ │ │
│ yes │ │ │
/auto-implement ─────────────────────────────────────────►│ │ │
│ │ │ │
├── for each leaf ──► Plan ─────────────────►│ │ │
│ │ codex plan review │ │
│ ▼ │ │
│ PlanReviewWait (HITL on) │ │
│ y/r/n/f/aN │ │
│ │ │ │
│ Implement (engine: claude | codex-goal) ─────│───────────►│
│ │ task-gate after every commit │
│ ▼ │
│ PhaseReview (Track A + Track B run concurrently) │
│ │ │
│ ReviewWait (HITL on) │
│ y/r/n/f/aN │
│ │ │
│ CheckpointWait (HITL on) ─► next leaf │
▼
/status read-only 5-section report (any time)
/auto-resume resume from Paused (handles ReviewWait / PlanReviewWait)
/auto-abort terminal abort, preserves commits
/rollback delete commits with auto backup branch + 2-stage confirm
a3 at any wait dispatcher means accept this one and auto-skip the next 3 waits — the bridge between fully-supervised and fully-unattended runs.
devstack is opinionated. The opinions came from running real multi-phase work through earlier versions and watching exactly where the wheels came off. Eight things it does that adjacent plugins do not (or do less rigorously):
- Formalised state machine. 18 transitions (E1–E18), 11 abort conditions, JSON-schema-validated state file, single-writer (`hooks/lib/workflow-lib.sh::transition`). Other plugins express state through marker files (coarse, drifts) or carry no formal state at all. Illegal transitions error loudly instead of silently continuing, so bugs surface near their cause.
- Dual-track concurrent codex review. Track A audits code quality (bug / security / perf / maintainability). Track B audits contract alignment against the decomposition (silent_drop / scope_creep / contract_break / acceptance_miss). Splitting the two prevents the single-reviewer attention dilution that single-track loops suffer from. P1/P2/P3 grading + Track B's explicit `decomposition: amended` exception are scars from earlier rounds.
- Cross-model review. codex reviews Claude's code. Most loops are Claude-reviews-Claude (same model family, shared blind spots) or self-evaluation (`/goal` family). Heterogeneous reviewer raises the upper bound on caught defects.
- Per-leaf commit budget enforced as a Bash hook. `commit-budget-gate.sh` (PreToolUse on Bash) accepts only the four reserved commit subjects and lets non-slug commits pass. The result: each leaf produces exactly 2 NEW workflow commits, the git log is itself a readable work record, and there is nothing to clean up afterward.
- Multi-decomposition management. Slug-isolated state with a `current-decomposition` switch lets a project run several decompositions in flight (auth + payments + migration on the same repo). Adjacent plugins assume one decomposition per project.
- Real test coverage. 16 shell test scripts cover the state machine, abort conditions, every hook, every command, all 6 acceptance predicate forms, the commit budget, resume, status, and rollback. Plus `dogfood-demo.sh`, fixtures, and `verify-equivalence.sh`. Industrial-grade test surface for a workflow plugin.
- First-class long-running processes. ML training, large builds, and headless agents are registered via the `longproc` library and gated by a `Stop` hook so Claude cannot end the session while a tracked subprocess is `ACTIVE` or `STUCK`. Other plugins ignore this regime entirely.
- Refused the auto-fix loop. v1.1 deliberately set `review.rounds = 1` and added the HITL dispatcher (`y/r/n/f/aN`). Ralph-style infinite-fix loops have one well-known failure mode: burn budget on the wrong direction. Devstack chose human dispatch over more retries. That refusal needed having tried it the other way first.
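To make the "single writer, illegal transitions error loudly" idea concrete, here is a minimal sketch. The state names come from the flow diagram above, but the transition pairs are a hypothetical subset, not the real E1–E18 table, and `try_transition` is an illustrative name rather than the function in `workflow-lib.sh`.

```shell
#!/usr/bin/env bash
# Illustrative guard: only listed from->to pairs are legal. The pairs below are
# a hypothetical subset inferred from the flow diagram, not the real E1-E18 table.
is_legal_transition() {
  case "$1->$2" in
    "Plan->PlanReviewWait"|"PlanReviewWait->Implement"|\
    "Implement->PhaseReview"|"PhaseReview->ReviewWait"|\
    "ReviewWait->CheckpointWait"|"CheckpointWait->Plan"|\
    "PlanReviewWait->Paused"|"ReviewWait->Paused") return 0 ;;
    *) return 1 ;;
  esac
}

# Single writer: every state change goes through this function, and an
# illegal change fails with a message instead of being silently applied.
try_transition() {
  local from="$1" to="$2"
  if ! is_legal_transition "$from" "$to"; then
    echo "illegal transition: $from -> $to" >&2
    return 1
  fi
  echo "$to"
}

try_transition Plan PlanReviewWait          # prints: PlanReviewWait
try_transition Implement Plan || echo "rejected"
```

Routing every change through one function is what lets a bad transition fail at the point where it is attempted.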
devstack is a Claude Code plugin. The published repository is the plugin itself (this directory becomes the plugin root).
# 1. Clone
git clone git@github.com:Pelion-AI/devstack.git ~/code/devstack
# 2. Probe deps (need git + jq; codex >= 0.128.0 enables codex-goal engine)
bash ~/code/devstack/scripts/check-deps.sh
# 3. Register with Claude Code (path-based marketplace shown; see MIGRATION.md for alternatives)
jq --arg path "$HOME/code/devstack" '
.extraKnownMarketplaces["devstack-local"] = { "source": { "source": "path", "path": $path } }
| .enabledPlugins["devstack@devstack-local"] = true
' ~/.claude/settings.json > /tmp/s.json && mv /tmp/s.json ~/.claude/settings.json
# 4. Restart Claude Code so it loads hooks/hooks.json + the v1.1 commands.

Other registration paths (direct enabledPlugins, ~/.claude/plugins/ symlink, manual user-scope hooks) are documented in MIGRATION.md.
The plugin is inert in a project until you opt in. From a Claude Code session:
/init
That writes .claude/workflow-enabled and .claude/workflow.config.json and prints a one-screen status summary. Idempotent — re-runnable.
Kill-switch one project without affecting others:
touch .claude/workflow-emergency-disable

The disable marker takes precedence over workflow-enabled.
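The precedence between the two markers can be sketched as a small predicate. The marker paths are the ones named above; the function name `workflow_is_active` is hypothetical, and the real hooks may perform additional checks.

```shell
#!/usr/bin/env bash
# Sketch of the marker precedence: the emergency-disable marker wins over the
# opt-in marker. The function name is hypothetical; the marker paths are from above.
workflow_is_active() {
  local project_dir="${1:-.}"
  # Kill-switch present: inert regardless of the opt-in marker.
  [ -e "$project_dir/.claude/workflow-emergency-disable" ] && return 1
  # Otherwise active only if the project has opted in.
  [ -e "$project_dir/.claude/workflow-enabled" ]
}

if workflow_is_active "$PWD"; then
  echo "devstack hooks apply here"
else
  echo "devstack is inert here"
fi
```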
| command | purpose |
|---|---|
| `/init` | Opt-in this project + print status |
| `/decompose` | Author or amend a decomposition |
| `/auto-implement` | Run the workflow loop (resumable) |
| `/auto-resume` | Resume from Paused (handles ReviewWait / PlanReviewWait) |
| `/auto-abort` | Terminal abort; preserves commits |
| `/status` | Read-only 5-section analytical report |
| `/rollback` | Delete commits with auto backup branch + 2-stage confirm |
| `/auto-long-process <on\|off>` | Toggle longproc Stop-hook gate |
| `/human-in-the-loop <on\|off>` | Toggle HITL checkpoint pauses |
| `/spawn-teamagents <on\|off>` | Toggle teamagents intent (engine still deferred in v1.1) |
workflow-decompose, workflow-plan, workflow-implement, workflow-review, workflow-codex-goal, long-running-processes. Each opens with a "Critical Must-Do" TL;DR and embeds the relevant contracts (LeafPhaseSpec, PlanTask, the 6 mechanical acceptance predicate forms).
| event | hook | purpose |
|---|---|---|
| PreToolUse (Bash) | `commit-budget-gate.sh` | Enforce 2-NEW-commits-per-leaf via reserved subjects |
| PostToolUse (ExitPlanMode) | `codex-review-plan.sh` | Single-track codex plan review (~10 min budget); disk fallback for empty payload |
| PostToolUse (Bash) | `task-completion-gate.sh` | lint / test / 6-form acceptance / longproc / todo / commit_prefix / loop checks |
| Stop | `stop-phase-gate.sh` | Block session end while ACTIVE/STUCK longproc subprocesses are tracked |
| SessionStart | `session-start-resume.sh` | When status=Paused, surface /status / /auto-resume / /auto-abort |
| SubagentStop | `subagent-review-save.sh` | Persist subagent review-shaped output to `<slug>/phase-<X.Y>/review.md` |
- `engine: claude` (default): Claude executes one task at a time per the `workflow-implement` SKILL's 8-step flow; per-task `git commit` triggers the gate.
- `engine: codex-goal`: spawns a headless `codex /goal` subprocess, classified into 5 termination states; runs a post-codex gate after `achieved` because codex's own commits do not trigger PostToolUse(Bash).
- `engine: teamagents`: config switch only in v1.1; real parallel-Agent-Teams runtime is deferred.
workflow.config.json ships with v1.1 defaults. Toggle the three on/off switches at the top via the /auto-long-process, /human-in-the-loop, /spawn-teamagents commands; everything else is edited directly.
{
"human_in_the_loop": "on",
"auto_long_process": "on",
"spawn_teamagents": "off",
"plan": { "codex_review": true, "rounds": 1, "review_wait_on_hitl": true, "fix_iterations": 2 },
"implement": { "default_engine": "claude", "task_failure_threshold": 3, "large_diff_warn_threshold": 800,
"codex_goal": { "default_budget_tokens": 80000, "max_budget_tokens": 200000,
"monitoring_interval_seconds": 30 } },
"review": { "method": "codex", "codex_timeout_seconds": 600, "rounds": 1,
"review_wait_on_hitl": true, "phase_review_dual_track": true, "fix_iterations": 2 },
"auto_mode": { "max_duration_minutes": 240, "max_idle_minutes": 30,
"budget_limited_action": "pause", "budget_extend_max_count": 1 },
"rollback": { "require_clean_tree": true, "auto_backup_branch": true },
"hooks": { "session_start_resume_prompt": true, "subagent_review_save": true }
}

When human_in_the_loop=on, plan review and phase review halt at PlanReviewWait / ReviewWait instead of auto-fix-looping:
| key | meaning |
|---|---|
| `y` | Accept the review and apply P1 fixes |
| `r` | Re-run the same review (use when you suspect codex hallucinated) |
| `n` | Accept review as final, skip fixes, advance |
| `f` | Write feedback.md, transition to Paused; resume after editing |
| `aN` | Accept + auto-skip the next N review-wait pauses |
aN is the bridge: read the first leaf's review carefully, then use a4 to let the next four review waits pass unattended.
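The dispatcher keys are simple enough to parse mechanically. The sketch below maps a key to an action label and an auto-skip count. The action labels and the function name `parse_dispatch` are hypothetical stand-ins for the table rows, not devstack identifiers.

```shell
#!/usr/bin/env bash
# Maps a dispatcher key to "<action> <auto_skip_count>". Action names are
# hypothetical labels for the rows of the table above.
parse_dispatch() {
  local key="$1" count
  case "$key" in
    y) echo "apply-fixes 0" ;;
    r) echo "rerun-review 0" ;;
    n) echo "accept-no-fixes 0" ;;
    f) echo "feedback-pause 0" ;;
    a[0-9]*)
      # Everything after the leading "a" must be a plain non-negative integer.
      count="${key#a}"
      if [[ "$count" =~ ^[0-9]+$ ]]; then
        echo "accept $count"
      else
        return 1
      fi ;;
    *) return 1 ;;
  esac
}

parse_dispatch a4   # prints: accept 4
parse_dispatch y    # prints: apply-fixes 0
```

Unknown keys return non-zero rather than falling back to a default, so a typo at the prompt never advances the run.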
.
├── .claude-plugin/plugin.json ← manifest
├── README.md ← this file
├── MIGRATION.md ← upgrade path from user-scope hooks
├── commands/ ← 10 slash commands
├── skills/ ← 6 SKILLs (decompose / plan / implement / review / codex-goal / longproc)
├── hooks/
│ ├── hooks.json ← 6 event registrations
│ ├── commit-budget-gate.sh ← PreToolUse on Bash
│ ├── codex-review-plan.sh ← PostToolUse on ExitPlanMode (with disk fallback)
│ ├── codex-track-{a,b}-prompt.md ← dual-track prompt definitions
│ ├── task-completion-gate.sh ← PostToolUse on Bash (6 acceptance forms)
│ ├── stop-phase-gate.sh ← Stop hook
│ ├── session-start-resume.sh ← SessionStart hook
│ ├── subagent-review-save.sh ← SubagentStop hook
│ └── lib/ ← workflow-lib.sh (transition), codex-review.sh, codex-goal.sh, longproc-lib.sh
├── scripts/
│ ├── init-project.sh
│ ├── check-deps.sh
│ ├── codex-goal-runner.sh
│ ├── codex-track-{a,b}.sh
│ └── rollback.sh
├── templates/
│ ├── workflow.config.json ← v1.1 schema
│ ├── auto-state.schema.json ← legal state combinations
│ └── decomposition.md.template
└── tests/ ← 16 shell tests + dogfood + fixtures + verify-equivalence
# Syntax check
bash -n hooks/*.sh hooks/lib/*.sh scripts/*.sh tests/*.sh
# Run every test
for t in tests/test-*.sh tests/dogfood-demo.sh; do bash "$t"; done
# End-to-end equivalence (real codex; takes a few minutes; skips codex-goal smoke if codex < 0.128.0)
bash tests/verify-equivalence.sh

- codex-goal requires codex ≥ 0.128.0. Below that, `codex-goal-runner.sh` refuses with a clear error. Plan/PhaseReview hooks still work on older codex.
- `engine: teamagents` is config-only. `/auto-implement` step 8 still ESCALATEs on teamagents tasks; real parallel Agent Teams runtime is deferred.
- Single-session, single-turn execution. A long `/auto-implement` shares one Claude context window, with no auto-compact between leaves. For very long runs, plan smaller decompositions or use HITL `f` to reset between leaves.
- `/rollback` uses `git reset --hard`. The removed commits remain on the auto-created backup branch and, until git's reflog expiry, in `git reflog` (git's defaults are 90 days, or 30 days for entries no longer reachable from the current tip). Full purge is not done automatically.
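The backup-then-reset pattern behind `/rollback` can be sketched as follows. This is not the contents of `scripts/rollback.sh`: the function name, the backup branch naming scheme, and the omission of the 2-stage confirmation are all assumptions made to keep the example short.

```shell
#!/usr/bin/env bash
# Sketch: snapshot the current tip on a backup branch, then drop the last N commits.
# The branch naming scheme and function name are assumptions, not devstack's own.
rollback_with_backup() {
  local count="$1" backup
  # Refuse to run on a dirty tree, since reset --hard would discard the changes.
  if [ -n "$(git status --porcelain)" ]; then
    echo "working tree is not clean; refusing to roll back" >&2
    return 1
  fi
  backup="backup/rollback-$(date +%Y%m%d-%H%M%S)"
  # The backup branch keeps the dropped commits reachable after the reset.
  git branch "$backup" || return 1
  git reset --hard "HEAD~${count}" >/dev/null || return 1
  echo "$backup"
}
```

Calling `rollback_with_backup 2` inside a repository would print the backup branch name. `git reset --hard <that-branch>` would restore the dropped commits.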
Author: see .claude-plugin/plugin.json. License: TBD until v1.0 release (treat as all rights reserved).