Three orchestration workflows for AI-assisted software engineering with Claude Code, tested head-to-head over 6 weeks and 80+ experiments.
Companion to the video: Workflows That Ship — I tested three orchestration approaches. The results weren't what I expected. Full write-up: doryzidon.com/blog/workflows-that-ship
I built the same AGENTS.md / CLAUDE.md scorer with each of the three workflows, then deployed all three so you can run your own configs against them and compare. Same job; the only variable is the workflow that produced it.
| Workflow | Deployed scorer |
|---|---|
| Oneshot | claude-scorer-oneshot-v7-latest.fly.dev |
| Light agentic with review | claude-scorer-light-v7-latest.fly.dev |
| TDD-style pipeline | claude-scorer-tdd-v7-latest.fly.dev |
Drop in an AGENTS.md file you've actually shipped. The disagreements between the three are where the interesting questions live.
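The request/response shape of the deployed scorers isn't pinned down here, so this is a sketch: assuming each scorer returns a JSON-style mapping of per-criterion scores, a small helper can surface exactly where the three diverge. The workflow names, criteria, and numbers below are hypothetical, not outputs of the real apps.

```python
def disagreements(scores: dict[str, dict[str, float]],
                  threshold: float = 1.0) -> dict[str, dict[str, float]]:
    """Return criteria where any two scorers differ by more than `threshold`."""
    criteria = set().union(*(s.keys() for s in scores.values()))
    out = {}
    for c in sorted(criteria):
        vals = {wf: s[c] for wf, s in scores.items() if c in s}
        if len(vals) >= 2 and max(vals.values()) - min(vals.values()) > threshold:
            out[c] = vals
    return out

# Hypothetical per-criterion breakdowns from the three deployed scorers:
scores = {
    "oneshot": {"clarity": 8.0, "structure": 9.0, "testability": 6.0},
    "light":   {"clarity": 7.5, "structure": 8.5, "testability": 8.5},
    "tdd":     {"clarity": 8.0, "structure": 8.5, "testability": 8.0},
}
print(disagreements(scores))  # only "testability" diverges past the threshold
```

With real responses you'd fetch each breakdown over HTTP first; the diffing logic stays the same.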
All three use the Claude Agent SDK under the hood for fresh-context, repeatable runs. The difference is how much orchestration sits on top of the same plan.
| Workflow | What it is | How to run |
|---|---|---|
| Oneshot | Skip the pipeline. Hand the plan to a single Claude Code session and let it ship. | Open Claude Code, ask it to implement plans/<your-plan>.md end-to-end |
| Light agentic | dev_cycle pipeline with `--review-mode static` — IMPLEMENT → VERIFY (lint + pytest) → COMMIT per step, static checks only | `uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode static` |
| TDD red/green | dev_cycle pipeline with `--review-mode agent` or `full` — same loop but with LLM review at every gate plus a final eng-review pass | `uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode agent` |
The "three workflows" are really one pipeline (dev_cycle) with a knob (--review-mode) plus the option to skip it entirely (oneshot). Same plan, same agents, different rigor.
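This isn't the actual `run.py`, but the control flow the knob implies can be sketched in a few lines. Function names and step strings here are illustrative stand-ins, not the pipeline's real API:

```python
from typing import Callable

def dev_cycle(steps: list[str],
              implement: Callable[[str], None],
              verify_static: Callable[[str], bool],
              review_llm: Callable[[str], bool],
              commit: Callable[[str], None],
              review_mode: str = "static") -> list[str]:
    """One pass over the plan's steps. 'static' gates on lint + tests only;
    'agent'/'full' adds an LLM review at every gate."""
    completed = []
    for step in steps:
        implement(step)                       # IMPLEMENT
        if not verify_static(step):           # VERIFY: lint + pytest
            raise RuntimeError(f"static checks failed at {step!r}")
        if review_mode in ("agent", "full") and not review_llm(step):
            raise RuntimeError(f"LLM review rejected {step!r}")
        commit(step)                          # COMMIT
        completed.append(step)
    return completed

# Stubbed run, just to show the control flow:
log = []
done = dev_cycle(
    steps=["add scorer endpoint", "wire rubric"],
    implement=lambda s: log.append(("impl", s)),
    verify_static=lambda s: True,
    review_llm=lambda s: True,
    commit=lambda s: log.append(("commit", s)),
    review_mode="agent",
)
print(done)
```

Oneshot is the degenerate case: one step, no gates.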
See WORKFLOWS.md for how to run each end-to-end.
```
.claude/
├── agents/        ← sde, test-eng, sys-arch (TDD roles)
└── skills/        ← plan, dev-cycle, run-tests, python-reviewer
agent_tools/
└── dev_cycle/     ← TDD pipeline (Python, uv-managed) — the only workflow
                     that needed dedicated orchestration code
conventions/       ← coding standards the agents follow
                     (python-coding.md, workflow-contract.md,
                      orchestration-flow.md)
plans/             ← real plans I gave each workflow
discussions/       ← session notes from the comparison runs
runs/
├── comparison-2026-04-24/ ← deploy + run logs from the head-to-head test
└── vault/                 ← dev_cycle pipeline state per scorer rebuild
```
Everyone is shipping orchestration frameworks. The pitch is always the same: "slop versus methodical process — here's the right way."
I wanted receipts.
So I built a Claude scorer (an agents.md / CLAUDE.md evaluator) and ran it through three workflows on the same plan:
- Oneshot — single agent, takes the plan, ships
- Light agentic with review — merged steps, each builds itself, lightweight gates
- TDD red/green — system architect → test engineer → SDE → reviewer
Then I measured. Six weeks. 80+ experiments. The results are in runs/.
- Oneshot: ~7 minutes, 64 tests, scored 8.15 / 10
- Light agentic: scored 77 / 100 with review gates that caught real bugs
- TDD: generated 610+ tests, but no clearly better output
More tests didn't mean better software. Sometimes more orchestration just means more code that does the same thing — slower and more expensively.
- Plans really matter. Strong plan, strong result. Skip this and no workflow saves you.
- Ground truth + acceptance criteria are non-negotiable. Without them you can't tell if anything actually worked.
- Don't outsmart the model. Models keep getting better; verify their output instead of over-engineering around them.
- Use the SDK for control. Repeatable runs, fresh context per query — even on a oneshot.
- Orchestration is complicated and costs time. It has value, but expect heavy experimentation.
- Don't believe the hype. Measure, don't trust the demos.
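The "ground truth + acceptance criteria" point becomes concrete when the plan's criteria are written as executable checks rather than prose, so a run either passes or it doesn't. A minimal sketch for a scorer like this one; the functions, tolerance, and labeled examples are hypothetical:

```python
def check_score_range(score: float) -> bool:
    # Criterion: every score lands on the rubric's 0-10 scale.
    return 0.0 <= score <= 10.0

def check_ground_truth(score_fn, labeled: list[tuple[str, float]],
                       tolerance: float = 1.5) -> bool:
    # Criterion: on hand-labeled AGENTS.md files, the scorer stays
    # within `tolerance` points of the human label.
    return all(abs(score_fn(doc) - label) <= tolerance for doc, label in labeled)

# Stub scorer + tiny labeled set, just to show the shape:
fake_scorer = lambda doc: 7.0
labeled = [("good agents.md text", 8.0), ("weak agents.md text", 6.0)]
print(check_score_range(fake_scorer("x")), check_ground_truth(fake_scorer, labeled))
```

Checks like these are what lets you compare three workflows at all: without them, "it works" is a vibe, not a measurement.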
See WORKFLOWS.md for the full step-by-step on running each of the three workflows on your own plan.
The short version:
- Install Claude Code and copy `.claude/` into your project.
- For the TDD workflow, also `uv sync` in `agent_tools/dev_cycle/`.
- Run `/plan` to produce a plan with acceptance criteria.
- Run oneshot, light agentic (`/dev-cycle` interactively), or TDD (`uv run --directory agent_tools/dev_cycle python run.py <plan>`).
- The Claude scorer app code itself. It was built three different ways during these runs and lived only in deployed Fly.io apps. The receipts of what got built are the run logs in `runs/`, not the source code.
- API keys, tokens, deploy secrets. The pre-publication scan was clean; if you spot one, file an issue.
MIT — see LICENSE.
This repo is a snapshot tied to a specific experiment + video. If something looks stale or you want a newer cut, open an issue.