software-engineering-workflows

Three orchestration workflows for AI-assisted software engineering with Claude Code, tested head-to-head over 6 weeks and 80+ experiments.

Companion to the video: Workflows That Ship — I tested three orchestration approaches. The results weren't what I expected. Full write-up: doryzidon.com/blog/workflows-that-ship

Try the deployed scorers

I built the same AGENTS.md / CLAUDE.md scorer with each of the three workflows, then deployed all three so you can run your own configs against them and compare. Same job; the only difference is the workflow that built it.

| Workflow                  | Deployed scorer                         |
| ------------------------- | --------------------------------------- |
| Oneshot                   | claude-scorer-oneshot-v7-latest.fly.dev |
| Light agentic with review | claude-scorer-light-v7-latest.fly.dev   |
| TDD-style pipeline        | claude-scorer-tdd-v7-latest.fly.dev     |

Drop in an AGENTS.md file you've actually shipped. The disagreements between the three are where the interesting questions live.
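
If you'd rather script the comparison than paste files in by hand, a sketch like the one below works. The /score route and the "content" field are assumptions on my part; this repo doesn't document the scorers' HTTP API, so adjust to whatever the deployed apps actually expose.

```python
# Hedged sketch: POST the same AGENTS.md to all three scorers and compare.
# The /score route and the "content" field are ASSUMED, not documented here.
import json
import urllib.request

SCORERS = {
    "oneshot": "https://claude-scorer-oneshot-v7-latest.fly.dev",
    "light": "https://claude-scorer-light-v7-latest.fly.dev",
    "tdd": "https://claude-scorer-tdd-v7-latest.fly.dev",
}

def score(base_url: str, agents_md: str) -> dict:
    """Send one AGENTS.md body to one scorer and return its JSON verdict."""
    req = urllib.request.Request(
        f"{base_url}/score",  # assumed route
        data=json.dumps({"content": agents_md}).encode(),  # assumed field
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    text = open("AGENTS.md").read()
    for name, url in SCORERS.items():
        print(name, score(url, text))
```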

The three workflows

All three use the Claude Agent SDK under the hood for fresh-context, repeatable runs. The difference is how much orchestration sits on top of the same plan.
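
Concretely, "fresh context" means every run starts a clean session, so the same plan gets the same starting conditions each time. Here's a minimal sketch with the Python SDK; the prompt and options are illustrative, not this repo's actual orchestration code:

```python
# Minimal fresh-context run via the Claude Agent SDK (illustrative only;
# the real orchestration for the pipelined workflows lives in
# agent_tools/dev_cycle/run.py).
import asyncio
from claude_agent_sdk import ClaudeAgentOptions, query

async def run_plan(plan_path: str) -> None:
    options = ClaudeAgentOptions(
        cwd=".",                        # run against the current project
        permission_mode="acceptEdits",  # let the agent apply file edits
    )
    # Each query() starts a fresh session: nothing leaks between runs,
    # which is what makes head-to-head comparisons repeatable.
    async for message in query(
        prompt=f"Implement {plan_path} end-to-end",
        options=options,
    ):
        print(message)

asyncio.run(run_plan("plans/my-plan.md"))
```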

| Workflow      | What it is | How to run |
| ------------- | ---------- | ---------- |
| Oneshot       | Skip the pipeline. Hand the plan to a single Claude Code session and let it ship. | Open Claude Code and ask it to implement plans/<your-plan>.md end-to-end |
| Light agentic | dev_cycle pipeline with --review-mode static: IMPLEMENT → VERIFY (lint + pytest) → COMMIT per step, static checks only | uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode static |
| TDD red/green | dev_cycle pipeline with --review-mode agent (or full): the same loop, but with LLM review at every gate plus a final eng-review pass | uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode agent |

The "three workflows" are really one pipeline (dev_cycle) with a knob (--review-mode) plus the option to skip it entirely (oneshot). Same plan, same agents, different rigor.

See WORKFLOWS.md for how to run each end-to-end.

What's in this repo

.claude/
├── agents/          ← sde, test-eng, sys-arch (TDD roles)
└── skills/          ← plan, dev-cycle, run-tests, python-reviewer

agent_tools/
└── dev_cycle/       ← TDD pipeline (Python, uv-managed) — the only workflow
                       that needed dedicated orchestration code

conventions/         ← coding standards the agents follow
                       (python-coding.md, workflow-contract.md,
                        orchestration-flow.md)

plans/               ← real plans I gave each workflow
discussions/         ← session notes from the comparison runs
runs/
├── comparison-2026-04-24/   ← deploy + run logs from the head-to-head test
└── vault/                   ← dev_cycle pipeline state per scorer rebuild

Why this exists

Everyone is shipping orchestration frameworks. The pitch is always the same: "slop versus methodical process — here's the right way."

I wanted receipts.

So I built a Claude scorer (an AGENTS.md / CLAUDE.md evaluator) and ran it through three workflows on the same plan:

  1. Oneshot — single agent, takes the plan, ships
  2. Light agentic with review — merged steps, each builds itself, lightweight gates
  3. TDD red/green — system architect → test engineer → SDE → reviewer

Then I measured. Six weeks. 80+ experiments. The results are in runs/.

The headlines

  • Oneshot: ~7 minutes, 64 tests, scored 8.15 / 10
  • Light agentic: scored 77 / 100 with review gates that caught real bugs
  • TDD: generated 610+ tests, but no clearly better output

More tests didn't mean better software. Sometimes more orchestration just means more code that does the same thing — slower and more expensively.

Six things I'm taking away

  1. Plans really matter. Strong plan, strong result. Skip this and no workflow saves you.
  2. Ground truth + acceptance criteria are non-negotiable. Without them you can't tell if anything actually worked (see the example after this list).
  3. Don't outsmart the model. Models keep getting better. Verify them, don't try to over-engineer around them.
  4. Use the SDK for control. Repeatable runs, fresh context per query — even on a oneshot.
  5. Orchestration is complicated and costs time. It has value, but expect heavy experimentation.
  6. Don't believe the hype. Measure, don't trust the demos.
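
To make point 2 concrete, here's what a plan step with built-in ground truth can look like. This excerpt is invented for illustration; the real plans are in plans/.

```
## Step 3: scoring endpoint

Implement POST /score that returns {"score": float, "findings": [...]}.

Acceptance criteria:
- [ ] The known-good fixture AGENTS.md scores >= 8
- [ ] Malformed input returns a 4xx error, never a 500
- [ ] pytest passes with every new behavior covered
```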

How to run a workflow yourself

See WORKFLOWS.md for the full step-by-step on running each of the three workflows on your own plan.

The short version:

  1. Install Claude Code and copy .claude/ into your project.
  2. For the TDD workflow, also uv sync in agent_tools/dev_cycle/.
  3. Run /plan to produce a plan with acceptance criteria.
  4. Run oneshot, light agentic (/dev-cycle interactively), or TDD (uv run --directory agent_tools/dev_cycle python run.py --plan <plan> --review-mode agent).

What you won't find here

  • The Claude scorer app code itself. It was built three different ways during these runs and lived only in deployed Fly.io apps. The receipts of what got built are the run logs in runs/, not the source code.
  • API keys, tokens, deploy secrets. Pre-publication scan was clean. If you spot one, file an issue.

License

MIT — see LICENSE.

Maintenance

This repo is a snapshot tied to a specific experiment + video. If something looks stale or you want a newer cut, open an issue.
