PDX-482: feat(mcp): harden testcase tool descriptions for single-call contract#174
Conversation
… contract RCA: PDX-479 surfaced that authoring guidance lives in three places (prompts, resource, tool descriptions) and a regression in any one of them — like the multi-call author-test flow that shipped in PR #153 — can drift the LLM away from correct test case construction. PDX-481 fixed the prompts + resource but the tool descriptions themselves still carried no construct-vs-amend contract. The steps[] field description was just "Ordered list of test steps" with no anti-pattern protection. If a future upstream prompt re-introduces the multi-call pattern, only the tool description can push back at the call site — and it currently says nothing about it. Fix: Added a three-line construction contract to the top of testCaseGenerate.ts TOOL_DESCRIPTION (single-call pattern, step_edit is for AMENDING, stop-and-assemble guidance for the common mistake). Tightened the steps[] field description to call out the FULL/COMPLETE step tree in one call and warn against the multi-call append pattern. Mirrored the contract in testCaseStepTools.ts: the step_edit description now self-identifies as AMENDMENT-ONLY, rejects construction usage, points agents at provar_testcase_generate for new test cases, and spells out the structural defects (dropped scenarios, flat asserts, inconsistent step types) from misuse. Added 6 regression-guard unit tests asserting the canonical phrasing in both tool descriptions. Added scripts/pdx-482-validate.cjs (13 protocol-surface assertions, 13/13 PASS) for direct JSON-RPC verification. Full gate green: 1127 mocha tests, lint clean, compile clean.
Quality Orchestrator🟢 LOW · 🧪 Tests to Run · Running 2 of 51 tests
▶ Run commandnpx vitest run \
unit/mcp/testCaseGenerate.test.ts \
unit/mcp/testCaseStepTools.test.ts⚡ quality-orchestrator · |
There was a problem hiding this comment.
Pull request overview
Hardens the MCP tool descriptions for provar_testcase_generate and provar_testcase_step_edit with an explicit single-call construction contract vs. amendment-only contract, so that the guidance survives any future drift in upstream prompts/resources. Adds regression-guard unit tests and a JSON-RPC protocol-surface validation script.
Changes:
- Prepend a three-line construction contract to
provar_testcase_generate's TOOL_DESCRIPTION and tighten thesteps[]field description. - Mark
provar_testcase_step_editas AMENDMENT-ONLY, explicitly rejecting construct-from-scratch usage and listing structural defects. - Add 6 unit tests plus a new
scripts/pdx-482-validate.cjsJSON-RPC validator.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/mcp/tools/testCaseGenerate.ts | Adds construction-contract preamble to TOOL_DESCRIPTION and tightens steps[] field description |
| src/mcp/tools/testCaseStepTools.ts | Adds AMENDMENT-ONLY framing and updates the short description |
| test/unit/mcp/testCaseGenerate.test.ts | 3 regression-guard tests for the construction contract |
| test/unit/mcp/testCaseStepTools.test.ts | 3 regression-guard tests for the amendment-only contract |
| scripts/pdx-482-validate.cjs | New JSON-RPC validation harness with 13 protocol-surface assertions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
RCA: An adversarial review of the original PDX-482 commit identified three critical defects in the construct/amend contract: (1) PROVAR_MCP_SCHEMA_MODE=compact silently swapped the description for a contract-free one-liner — the LLM would never see the contract in compact mode, making it a regression highway; (2) the regex /step_edit[^.]*not for CONSTRUCTING|CONSTRUCTING[^.]*not/i had a false-positive that could pass on hostile rewordings like "constructing...not via generate"; (3) no test asserted the contract appears EARLY in the description, so a future refactor could move it down where LLM attention is lower. Additionally the step_edit test used an OR-clause across the three structural defects so dropping two of them would silently dilute the warning.
Fix: (1) Compact form on provar_testcase_generate now reads "Generate a Provar test case in ONE call with the FULL steps[] tree. Do NOT call with steps=[] then append via provar_testcase_step_edit (step_edit is for AMENDING existing test cases, not for CONSTRUCTING new ones)." — protocol-surface validator now spawns the server twice (standard + PROVAR_MCP_SCHEMA_MODE=compact) and runs 6 contract assertions against the compact form. (2) Replaced the false-positive regex with literal includes('not for CONSTRUCTING one from scratch') in both the unit test and the validator — locked on the canonical phrasing. (3) Added a leading-position assertion in both the unit test and validator: indexOf('Construction pattern') < 200 to prevent silent drift. (4) Tightened the step_edit "structural defects" test from OR-clause to three separate AND-style assertions on "dropped scenarios", "flat asserts", and "inconsistent step types" — dropping any one now fails the test. Gate: 1129 mocha tests, lint clean, validator 20/20 (was 13/13) covering both schema modes.
Independent adversarial review + fixes —
|
| # | Finding | Resolution |
|---|---|---|
| 1 | PROVAR_MCP_SCHEMA_MODE=compact was a regression highway. The desc() helper returns the compact form when that env var is set; my compact string was the pre-PDX-482 contract-free one-liner "Generate a Provar XML test case skeleton with UUID guids and steps structure." — the entire contract would silently disappear in compact mode. |
Rewrote the compact form to carry the contract: "Generate a Provar test case in ONE call with the FULL steps[] tree. Do NOT call with steps=[] then append via provar_testcase_step_edit (step_edit is for AMENDING existing test cases, not for CONSTRUCTING new ones)." Validator now spawns the server twice (standard + compact) and runs 6 contract assertions against the compact form. |
| 2 | Regex false-positive in the "step_edit not for CONSTRUCTING" assertion. /step_edit[^.]*not for CONSTRUCTING|CONSTRUCTING[^.]*not/i would pass on hostile rewordings like "constructing... is the only way... not via generate". |
Replaced both the unit test and validator with literal includes('not for CONSTRUCTING one from scratch') — locked on the canonical phrasing so the assertion can only pass when the exact contract phrase is present. |
| 3 | No leading-position assertion. A future refactor could move the contract to the bottom of the joined array — every other assertion would still pass, but the LLM would see 4 KB of mechanics before reaching the contract. | Added indexOf('Construction pattern') < 200 in both the unit test and the validator. Current position: 0 (literally the first chars). |
Important nit & resolution
| Finding | Resolution |
|---|---|
step_edit test for the "structural defects" warning used an OR-clause across "dropped scenarios", "flat asserts", "inconsistent step types". A future cleanup that removes two of the three would silently dilute the warning while the test still passes. |
Split into three separate assert.ok calls — dropping any one of the three defects now fails the test. The full consequence chain stays in the description. |
Gate after fixes
| Check | Before | After |
|---|---|---|
| Validator assertions | 13 / 13 PASS | 20 / 20 PASS (added 6 compact-mode + 1 leading-position) |
| Mocha tests | 1127 passing | 1129 passing (+1 leading-position test, +1 compact-mode describe block) |
| Lint | clean | clean |
| Compile | clean | clean |
Findings deliberately deferred to a follow-up
The adversarial review also suggested:
- Runtime guard — return
STEPS_REQUIREDwhensteps.length === 0 && !dry_run. This is a behaviour change with broader implications (some callers may legitimately want a skeleton-only generation). Out of scope for this hardening story; worth a separate ticket if we decide the contract should be enforced at runtime, not just at the description surface. - Title field — adding the contract to the tool
title:field. Most MCP clients show titles in tool-picker chips. Considered but skipped because the title field is also constrained by client UIs (Cursor shows full title, Claude Desktop truncates at ~50 chars). The description is the higher-leverage surface. - Audit
guidePrompts.tsfor parallel wording — already aligned by PDX-481 (which I merged into develop earlier today). Re-verified by hand; no divergence.
PR is now updated. Reviewer-facing changes: 47c75e2 on top of ac77154.
Summary
provar_testcase_generate'sTOOL_DESCRIPTION: pass the FULL step tree in a single call;step_editis for AMENDING; stop-and-assemble guidance for the common mistakesteps[]field description to call out the COMPLETE step tree requirement and warn against the multi-call append patternprovar_testcase_step_edit: self-identifies as AMENDMENT-ONLY, rejects construct-from-scratch usage, points agents atprovar_testcase_generatefor new test cases, spells out the structural defects (dropped scenarios, flat asserts, inconsistent step types) from misusescripts/pdx-482-validate.cjs— 13 protocol-surface assertions via JSON-RPC, all PASSJira
Why
PDX-479 surfaced that authoring guidance lives in three places — orchestration prompt, tool-guide resource, and tool descriptions. A regression in any one of them can drift the LLM. PDX-481 fixed the prompt + resource but left the tool descriptions silent: the existing
steps[]field description was just"Ordered list of test steps"with no anti-pattern protection. If a future upstream prompt re-introduces the multi-call pattern, only the tool description can push back at the call site — and it currently says nothing about it.This PR closes that gap. The tool description is the strongest, most-cited signal in an MCP client's working context because it surfaces on every tool-discovery cycle. With the contract embedded there, the single-call requirement survives any subsequent prompt/resource drift.
Test plan
yarn compilecleannode_modules/.bin/nyc node_modules/.bin/mocha "test/**/*.test.ts"— 1127 passing, 0 failing (6 new regression-guard tests)yarn lintcleannode scripts/pdx-482-validate.cjs— 13 / 13 PASS — confirms both tool descriptions carry the canonical phrasing at the MCP protocol surfaceAdversarial check
The PDX-482 AC asks for a defensive validation: even if PDX-481's prompts were intentionally reverted in a scratch branch, the in-tool description alone should be sufficient to keep an LLM on the single-call pattern. The 13 protocol-surface assertions in `pdx-482-validate.cjs` are the codified form of that check: they read what an MCP client surfaces to its LLM at the call site, which is where the contract has to fire.
Changes
Out of scope
step_editagainst freshly-generated empty skeletons (would be over-engineering for this story)Customer-facing docs
`docs/mcp.md` already carries the construct-vs-amend split from PDX-481. `docs/provar-mcp-public-docs.md` and `docs/university-of-provar-mcp-course.md` are maintained separately by the Provar team and may need a mirror update if they describe tool descriptions verbatim.
🤖 Generated with Claude Code