Add draft-to-design-doc eval (TDD-style, three variants) #8

Draft
szjanikowski wants to merge 1 commit into main from eval/draft-to-design-doc

@szjanikowski
Contributor

Summary

  • Adds evals/draft-to-design-doc/ — a quick benchmark project comparing how Claude Code extracts a Design Doc JSON from a Markdown design draft across three levels of scaffolding (vanilla, guided, noesis).
  • One TDD-style test.sh (identical between both task copies) asserts the expected correct ChangeSet shape — currently fails for all three variants because of the green-field constraint in save_design_doc. The failure is the spec; passes will arrive when the constraint is lifted.
  • The noesis variant Dockerfile installs the full plugin (Bun on ubuntu:24.04 — needs GLIBC 2.39 for the lbug native module), registers the noesis-graph MCP server through [[environment.mcp_servers]], and clones DDD-starter-dotnet into /app/repo/ so the agent has the implemented codebase as the diff baseline.
  • assessment_dimensions.json is intentionally empty for now — quality grading is owned by manual inspection until the dimensions are designed.

Why

Local development needs a fast feedback loop on analyze-design-draft skill quality across:

  • baseline LLM output (vanilla)
  • prompt-only scaffolding (guided)
  • full plugin + MCP (noesis)

Smoke results from a first run: vanilla 6:38 / 12 KB, guided 13:25 / 21 KB, noesis 9:29 / 21 KB. Detailed comparison lives in the run artifacts under jobs/ (gitignored).

Open issues this surfaces

  • noesis-graph:save_design_doc enforces a green-field ChangeSet shape (rejects modified / removed outside boundedContexts). The TDD test asserts the correct (post-fix) shape, so all variants currently fail. See README's "TDD note".
  • The skill agent shortcuts Steps 2-5 of the analyze-design-draft workflow (no topics / decisions extracted in our run); merge_document returns topics_added: 0, decisions_added: 0. Worth investigating separately whether a prompt-side or a skill-side fix is needed.

Test plan

  • bash evals/draft-to-design-doc/tasks/threshold-discount-extraction-noesis/environment/prepare-context.sh succeeds locally
  • nasde run --variant vanilla --tasks threshold-discount-extraction --without-eval -C evals/draft-to-design-doc runs to completion (expected: test.sh fails)
  • nasde run --variant guided --tasks threshold-discount-extraction --without-eval -C evals/draft-to-design-doc runs to completion (expected: test.sh fails)
  • nasde run --variant noesis --tasks threshold-discount-extraction-noesis --without-eval -C evals/draft-to-design-doc runs to completion (expected: test.sh fails)
  • Inspect artifacts/workspace/output/design-doc.json (vanilla/guided) or artifacts/workspace/noesis/design-docs/*.json (noesis) and confirm the failure mode matches the green-field bug

🤖 Generated with Claude Code

Adds a quick benchmark project under `evals/draft-to-design-doc/` that runs
Claude Code at three levels of scaffolding on the same Markdown design draft
and produces a Design Doc JSON for inspection.

Variants:
  - vanilla — minimal CLAUDE.md, no plugin/MCP. Reads draft.md +
    schema-reference.md and writes /app/output/design-doc.json directly.
  - guided — extended prompt mirroring noesis:analyze-design-draft's
    extract-design-model.md. Same task as vanilla.
  - noesis — full noesis plugin installed via Dockerfile (bun install on
    ubuntu:24.04 — needs GLIBC 2.39 for lbug native module). noesis-graph MCP
    server registered through [[environment.mcp_servers]]. The skill is invoked
    on /app/draft.md.

Both task sandboxes clone DDD-starter-dotnet into /app/repo/ so the agent can
inspect what is already implemented and classify model elements correctly as
modified vs added.

`tests/test.sh` is TDD-style: ONE specification (byte-for-byte identical
between the two task copies) asserts the expected correct ChangeSet shape:

  - Sales BC in boundedContexts.modified (already in /app/repo/Sources/Sales)
  - Sales.Pricing.Discounts module in modules.modified (already exists)
  - ThresholdDiscount in buildingBlocks.added (genuinely new)
  - Discount in buildingBlocks.modified (existing union gains a new variant)

Currently noesis-graph:save_design_doc enforces a green-field shape, so all
three variants fail this test. The failure IS the signal — when the
green-field constraint is lifted, runs producing the correct shape will start
passing.
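
The four expectations above can be sketched in Python as a first-failure check. This is only an illustration, not the actual `tests/test.sh`: the JSON layout (top-level `boundedContexts`, `modules`, `buildingBlocks` sections, each holding `added` / `modified` lists of names or `{"name": ...}` objects) and the helper names `names` and `first_failure` are assumptions.

```python
import json

# Assumed layout: top-level sections, each with added/modified lists.
# The section, bucket, and element names come from the expectations listed above.
EXPECTATIONS = [
    ("boundedContexts", "modified", "Sales"),
    ("modules", "modified", "Sales.Pricing.Discounts"),
    ("buildingBlocks", "added", "ThresholdDiscount"),
    ("buildingBlocks", "modified", "Discount"),
]

def names(entries):
    """Collect names from a list of strings or {"name": ...} objects."""
    return {e["name"] if isinstance(e, dict) else e for e in (entries or [])}

def first_failure(doc):
    """Return a message for the first unmet expectation, or None if all hold."""
    for section, bucket, expected in EXPECTATIONS:
        found = names(doc.get(section, {}).get(bucket))
        if expected not in found:
            return f"{expected} missing from {section}.{bucket}"
    return None

# Hypothetical green-field document: everything lands in "added".
green_field_doc = json.loads("""
{
  "boundedContexts": {"added": ["Sales"], "modified": []},
  "modules": {"added": ["Pricing.Discounts"]},
  "buildingBlocks": {"added": ["ThresholdDiscount", "Discount"]}
}
""")

print(first_failure(green_field_doc))
```

Stopping at the first unmet expectation keeps the report to a single line per run, at the cost of hiding any later mismatches in the same document.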

`assessment_dimensions.json` is intentionally empty for now — quality grading
is left to manual inspection until the dimensions are designed.

Use `prepare-context.sh` to stage plugin sources into the noesis Dockerfile's
build context before running the noesis variant; see README.md for invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@szjanikowski
Contributor Author

First three-way TDD run

Ran all three variants on the new TDD test (sequential checks, not parallel concurrency):

| Variant | Time | Reward | First failure |
| --- | --- | --- | --- |
| vanilla | 11:56 | 0 | module name is Discounts, not Sales.Pricing.Discounts |
| guided | 9:17 | 0 | module name is Discounts, not Sales.Pricing.Discounts |
| noesis | 11:13 | 0 | Sales ended up in boundedContexts.added, not modified |

What works (vanilla + guided, with /app/repo mounted)

Both ad-hoc variants consume /app/repo/ correctly: Sales lands in boundedContexts.modified, Discount in buildingBlocks.modified, ThresholdDiscount in buildingBlocks.added. They fail only on the module-naming convention — the agent uses Discounts instead of the fully-qualified Sales.Pricing.Discounts.

What breaks for noesis (plugin green-field constraint)

The noesis variant is worse on this dimension despite having scanning + MCP — save_design_doc enforces green-field shape, so everything ends up in added even though the agent knows from /app/repo/ that Sales and the underlying module already exist. This is the bug the test was designed to surface.

// noesis output
{
  "bcs_added":     ["Sales"],            // wrong — should be modified
  "bcs_modified":  [],
  "modules_added": ["Pricing.Discounts"], // wrong shape AND wrong name
  "bbs_added":     ["ThresholdDiscount", "Discount", "DiscountsSqlRepository"]
                                          // Discount and DiscountsSqlRepository wrong — exist in /app/repo
}
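
For contrast, the classification the test expects amounts to a set comparison against the baseline in /app/repo/. A minimal Python sketch follows. The function `classify` and the hard-coded baseline set are hypothetical: in a real run the baseline would come from scanning the repository, and whether DiscountsSqlRepository belongs in modified or should be left out entirely depends on whether the draft actually changes it.

```python
def classify(touched, baseline):
    """Split element names touched by the draft into added vs modified.

    touched:  names the design draft introduces or changes
    baseline: names already present in the implemented codebase
    """
    added = sorted(n for n in touched if n not in baseline)
    modified = sorted(n for n in touched if n in baseline)
    return {"added": added, "modified": modified}

# Hypothetical baseline, taken from the comments in the noesis output above.
existing_blocks = {"Discount", "DiscountsSqlRepository"}
draft_blocks = ["ThresholdDiscount", "Discount", "DiscountsSqlRepository"]

result = classify(draft_blocks, existing_blocks)
print(result["added"])     # ['ThresholdDiscount']
print(result["modified"])  # ['Discount', 'DiscountsSqlRepository']
```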

Two distinct bugs

  1. Green-field constraint in save_design_doc — blocks correct diff representation. Affects only noesis.
  2. Module naming convention not propagated — affects all three variants (Discounts vs Sales.Pricing.Discounts). The schema-reference says modules are fully-qualified but the agent's not picking that up. May warrant a stricter mention in extract-design-model.md or a save-time rejection.
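
One possible shape for the save-time rejection mentioned in item 2, sketched in Python. It assumes that "fully-qualified" means the module name starts with a known bounded-context name followed by a dot, which matches Sales.Pricing.Discounts but is not confirmed against the schema reference. The helper names are hypothetical and not part of save_design_doc.

```python
def is_fully_qualified(module_name, bounded_contexts):
    """True if module_name looks like '<BoundedContext>.<rest>'."""
    head, sep, rest = module_name.partition(".")
    return bool(sep) and bool(rest) and head in bounded_contexts

def reject_unqualified(module_names, bounded_contexts):
    """Raise ValueError listing every module name that is not fully qualified."""
    bad = [m for m in module_names if not is_fully_qualified(m, bounded_contexts)]
    if bad:
        raise ValueError("module names must be fully qualified: " + ", ".join(bad))

contexts = {"Sales"}
print(is_fully_qualified("Sales.Pricing.Discounts", contexts))  # True
print(is_fully_qualified("Discounts", contexts))                # False
print(is_fully_qualified("Pricing.Discounts", contexts))        # False
```

Rejecting at save time would turn the naming problem into an immediate, explicit error for the agent instead of a silent mismatch discovered by the verifier.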

Run artifacts under evals/draft-to-design-doc/jobs/ (gitignored). Each contains verifier/test-stdout.txt, verifier/reward.txt, and the produced design doc.
