Three orchestration workflows for AI-assisted software engineering with Claude Code, tested head-to-head over 6 weeks and 80+ experiments.
Companion to the video: Workflows That Ship — I tested three orchestration approaches. The results weren't what I expected. Full write-up: doryzidon.com/blog/workflows-that-ship
I built the same AGENTS.md / CLAUDE.md scorer with each of the three workflows, then deployed all three so you can run your own configs against them and compare. Same job; the only variable is the workflow that produced it.
| Workflow | Deployed scorer |
|---|---|
| Oneshot | claude-scorer-oneshot-v7-latest.fly.dev |
| Light agentic with review | claude-scorer-light-v7-latest.fly.dev |
| TDD-style pipeline | claude-scorer-tdd-v7-latest.fly.dev |
Drop in an AGENTS.md file you've actually shipped. The disagreements between the three are where the interesting questions live.
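The request/response shape of the deployed scorers isn't pinned down here, so this is a sketch: assuming each scorer returns a JSON-style mapping of per-criterion scores, a small helper can surface exactly where the three diverge. The workflow names, criteria, and numbers below are hypothetical, not outputs of the real apps.

```python
def disagreements(scores: dict[str, dict[str, float]],
                  threshold: float = 1.0) -> dict[str, dict[str, float]]:
    """Return criteria where any two scorers differ by more than `threshold`."""
    criteria = set().union(*(s.keys() for s in scores.values()))
    out = {}
    for c in sorted(criteria):
        vals = {wf: s[c] for wf, s in scores.items() if c in s}
        if len(vals) >= 2 and max(vals.values()) - min(vals.values()) > threshold:
            out[c] = vals
    return out

# Hypothetical per-criterion breakdowns from the three deployed scorers:
scores = {
    "oneshot": {"clarity": 8.0, "structure": 9.0, "testability": 6.0},
    "light":   {"clarity": 7.5, "structure": 8.5, "testability": 8.5},
    "tdd":     {"clarity": 8.0, "structure": 8.5, "testability": 8.0},
}
print(disagreements(scores))  # only "testability" diverges past the threshold
```

With real responses you'd fetch each breakdown over HTTP first; the diffing logic stays the same.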
All three use the Claude Agent SDK under the hood for fresh-context, repeatable runs. The difference is how much orchestration sits on top of the same plan.
| Workflow | What it is | How to run |
|---|---|---|
| Oneshot | Skip the pipeline. Hand the plan to a single Claude Code session and let it ship. | Open Claude Code, ask it to implement plans/<your-plan>.md end-to-end |
| Light agentic | dev_cycle pipeline with `--review-mode static` — IMPLEMENT → VERIFY (lint + pytest) → COMMIT per step, static checks only | `uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode static` |
| TDD red/green | dev_cycle pipeline with `--review-mode agent` or `full` — same loop but with LLM review at every gate plus a final eng-review pass | `uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode agent` |
The "three workflows" are really one pipeline (dev_cycle) with a knob (--review-mode) plus the option to skip it entirely (oneshot). Same plan, same agents, different rigor.
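This isn't the actual `run.py`, but the control flow the knob implies can be sketched in a few lines. Function names and step strings here are illustrative stand-ins, not the pipeline's real API:

```python
from typing import Callable

def dev_cycle(steps: list[str],
              implement: Callable[[str], None],
              verify_static: Callable[[str], bool],
              review_llm: Callable[[str], bool],
              commit: Callable[[str], None],
              review_mode: str = "static") -> list[str]:
    """One pass over the plan's steps. 'static' gates on lint + tests only;
    'agent'/'full' adds an LLM review at every gate."""
    completed = []
    for step in steps:
        implement(step)                       # IMPLEMENT
        if not verify_static(step):           # VERIFY: lint + pytest
            raise RuntimeError(f"static checks failed at {step!r}")
        if review_mode in ("agent", "full") and not review_llm(step):
            raise RuntimeError(f"LLM review rejected {step!r}")
        commit(step)                          # COMMIT
        completed.append(step)
    return completed

# Stubbed run, just to show the control flow:
log = []
done = dev_cycle(
    steps=["add scorer endpoint", "wire rubric"],
    implement=lambda s: log.append(("impl", s)),
    verify_static=lambda s: True,
    review_llm=lambda s: True,
    commit=lambda s: log.append(("commit", s)),
    review_mode="agent",
)
print(done)
```

Oneshot is the degenerate case: one step, no gates.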
See WORKFLOWS.md for how to run each end-to-end.
```
.claude/
├── agents/        ← sde, test-eng, sys-arch (TDD roles)
└── skills/        ← plan, dev-cycle, run-tests, python-reviewer
agent_tools/
└── dev_cycle/     ← TDD pipeline (Python, uv-managed) — the only workflow
                     that needed dedicated orchestration code
conventions/       ← coding standards the agents follow
                     (python-coding.md, workflow-contract.md,
                      orchestration-flow.md)
plans/             ← real plans I gave each workflow
discussions/       ← session notes from the comparison runs
runs/
├── comparison-2026-04-24/ ← deploy + run logs from the head-to-head test
└── vault/                 ← dev_cycle pipeline state per scorer rebuild
```
Everyone is shipping orchestration frameworks. The pitch is always the same: "slop versus methodical process — here's the right way."
I wanted receipts.
So I built a Claude scorer (an agents.md / CLAUDE.md evaluator) and ran it through three workflows on the same plan:
- Oneshot — single agent, takes the plan, ships
- Light agentic with review — merged steps, each builds itself, lightweight gates
- TDD red/green — system architect → test engineer → SDE → reviewer
Then I measured. Six weeks. 80+ experiments. The results are in runs/.
- Oneshot: ~7 minutes, 64 tests, scored 8.15 / 10
- Light agentic: scored 77 / 100 with review gates that caught real bugs
- TDD: generated 610+ tests, but no clearly better output
More tests didn't mean better software. Sometimes more orchestration just means more code that does the same thing — slower and more expensively.
- Plans really matter. Strong plan, strong result. Skip this and no workflow saves you.
- Ground truth + acceptance criteria are non-negotiable. Without them you can't tell if anything actually worked.
- Don't outsmart the model. Models keep getting better; verify their output instead of over-engineering around them.
- Use the SDK for control. Repeatable runs, fresh context per query — even on a oneshot.
- Orchestration is complicated and costs time. It has value, but expect heavy experimentation.
- Don't believe the hype. Measure, don't trust the demos.
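The "ground truth + acceptance criteria" point becomes concrete when the plan's criteria are written as executable checks rather than prose, so a run either passes or it doesn't. A minimal sketch for a scorer like this one; the functions, tolerance, and labeled examples are hypothetical:

```python
def check_score_range(score: float) -> bool:
    # Criterion: every score lands on the rubric's 0-10 scale.
    return 0.0 <= score <= 10.0

def check_ground_truth(score_fn, labeled: list[tuple[str, float]],
                       tolerance: float = 1.5) -> bool:
    # Criterion: on hand-labeled AGENTS.md files, the scorer stays
    # within `tolerance` points of the human label.
    return all(abs(score_fn(doc) - label) <= tolerance for doc, label in labeled)

# Stub scorer + tiny labeled set, just to show the shape:
fake_scorer = lambda doc: 7.0
labeled = [("good agents.md text", 8.0), ("weak agents.md text", 6.0)]
print(check_score_range(fake_scorer("x")), check_ground_truth(fake_scorer, labeled))
```

Checks like these are what lets you compare three workflows at all: without them, "it works" is a vibe, not a measurement.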
See WORKFLOWS.md for the full step-by-step on running each of the three workflows on your own plan.
The short version:
- Install Claude Code and copy `.claude/` into your project.
- For the TDD workflow, also `uv sync` in `agent_tools/dev_cycle/`.
- Run `/plan` to produce a plan with acceptance criteria.
- Run oneshot, light agentic (`/dev-cycle` interactively), or TDD (`uv run --directory agent_tools/dev_cycle python run.py <plan>`).
- The Claude scorer app code itself. It was built three different ways during these runs and lived only in deployed Fly.io apps. The receipts of what got built are the run logs in `runs/`, not the source code.
- API keys, tokens, deploy secrets. The pre-publication scan was clean; if you spot one, file an issue.
MIT — see LICENSE.
This repo is a snapshot tied to a specific experiment + video. If something looks stale or you want a newer cut, open an issue.