PDX-470/471/473: feat(mcp): add detail level, diff mode, and completeness score to validation tools#168
Merged
Conversation
Quality Orchestrator🟢 LOW · 🧪 Tests to Run · Running 7 of 47 tests
▶ Run commandnpx vitest run \
unit/mcp/projectValidateFromPath.test.ts \
unit/mcp/testCaseValidate.test.ts \
unit/mcp/testPlanValidate.test.ts \
unit/mcp/testSuiteValidate.test.ts \
unit/mcp/detailLevel.test.ts \
unit/mcp/validationDiff.test.ts \
unit/mcp/validationScore.test.ts⚡ quality-orchestrator · |
Contributor
There was a problem hiding this comment.
Pull request overview
Adds response shaping, validation scoring, and diff-run persistence utilities for MCP validation tools so agents can request smaller responses, compare runs, and decide whether to continue validation loops.
Changes:
- Adds
detailresponse levels andcompleteness_score/recommended_next_actionfields across validation tools. - Adds run snapshot persistence and diff computation utilities, with baseline diff mode wired into suite, testcase, and project validation.
- Adds unit tests for score/diff utilities plus suite/plan response changes.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
src/mcp/utils/detailLevel.ts |
Adds shared summary/standard/full response filtering helper. |
src/mcp/utils/validationScore.ts |
Adds completeness score and next-action helpers. |
src/mcp/utils/validationDiff.ts |
Adds run id generation, snapshot persistence, baseline loading, and diff computation. |
src/mcp/tools/testSuiteValidate.ts |
Wires detail level, run ids, diff mode, and next-action scoring into suite validation. |
src/mcp/tools/testPlanValidate.ts |
Wires detail level and next-action scoring into plan validation. |
src/mcp/tools/testCaseValidate.ts |
Wires detail level, run ids, diff mode, and next-action scoring into testcase validation. |
src/mcp/tools/projectValidateFromPath.ts |
Wires detail level, run ids, diff mode, and next-action scoring into project validation. |
src/mcp/tools/descHelper.ts |
Adds an environment-driven description helper. |
test/unit/mcp/validationScore.test.ts |
Adds unit coverage for scoring and next-action helpers. |
test/unit/mcp/validationDiff.test.ts |
Adds unit coverage for run persistence and diff helper behavior. |
test/unit/mcp/testSuiteValidate.test.ts |
Adds suite validation tests for detail, score/action, and baseline diff behavior. |
test/unit/mcp/testPlanValidate.test.ts |
Adds plan validation tests for detail and score/action behavior. |
Comments suppressed due to low confidence (3)
src/mcp/tools/testSuiteValidate.ts:172
hasAnyRun(storageDir)is evaluated aftersaveRun()has already written the current run, so it will be true even on the first validation for this suite. A failing first run will therefore returnfix_and_revalidateinstead of the intendedinspect_failuresfirst-run action.
const completeness_score = calcCompletenessScore(summary.test_cases_valid, summary.total_test_cases);
const hasBaseline = hasAnyRun(storageDir);
const recommended_next_action = calcNextAction(completeness_score, hasBaseline);
src/mcp/tools/testCaseValidate.ts:190
- The current run is persisted before loading
baseline_run_id. When the requested baseline is the oldest retained run, this write can trigger eviction at the 20-run cap and make an otherwise valid baseline returnBASELINE_NOT_FOUND.
try {
saveRun(storageDir, runId, currentViolations);
} catch (saveErr) {
log('warn', 'provar_testcase_validate: could not save run for diff', {
requestId,
error: (saveErr as Error).message,
});
}
// Diff mode
if (baseline_run_id !== undefined && baseline_run_id !== '') {
const baseline = loadBaselineViolations(storageDir, baseline_run_id);
src/mcp/tools/projectValidateFromPath.ts:264
- The current project run is saved before
baseline_run_idis loaded. If the supplied baseline is the oldest of the 20 retained runs, saving the new run can evict it first, so a valid baseline id fails withBASELINE_NOT_FOUND.
if (save_results !== false) {
try {
saveRun(storageDir, runId, currentViolations);
} catch (saveErr) {
log('warn', 'provar_project_validate: could not save run for diff', {
requestId,
error: (saveErr as Error).message,
});
}
}
// Diff mode
if (baseline_run_id !== undefined && baseline_run_id !== '') {
const baseline = loadBaselineViolations(storageDir, baseline_run_id);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+128
to
+133
| const storageDir = suiteStorageDir(); | ||
| const runId = generateRunId(suite_name); | ||
| const currentViolations = result.violations as unknown as DiffableViolation[]; | ||
|
|
||
| try { | ||
| saveRun(storageDir, runId, currentViolations); |
Comment on lines
+67
to
+69
| function suiteStorageDir(): string { | ||
| return path.join(os.homedir(), '.provardx', 'validation'); | ||
| } |
Comment on lines
+132
to
+143
| try { | ||
| saveRun(storageDir, runId, currentViolations); | ||
| } catch (saveErr) { | ||
| log('warn', 'provar_testsuite_validate: could not save run for diff', { | ||
| requestId, | ||
| error: (saveErr as Error).message, | ||
| }); | ||
| } | ||
|
|
||
| // Diff mode | ||
| if (baseline_run_id !== undefined && baseline_run_id !== '') { | ||
| const baseline = loadBaselineViolations(storageDir, baseline_run_id); |
Comment on lines
+217
to
+219
| const completeness_score = calcCompletenessScore(baseResult.is_valid ? 1 : 0, 1); | ||
| const hasBaseline = hasAnyRun(storageDir); | ||
| const recommended_next_action = calcNextAction(completeness_score, hasBaseline); |
| // No API key configured — run local validation with onboarding message | ||
| const storageDir = tcStorageDir(); | ||
| const runId = generateRunId(tcRunContext(file_path, source)); | ||
| const currentViolations = baseResult.issues as unknown as DiffableViolation[]; |
Comment on lines
+50
to
+51
| const message = String(v['message'] ?? '').slice(0, 120); | ||
| return `${rule_id}||${applies_to}||${message}`; |
Comment on lines
+17
to
+25
| * Recommend what the agent should do next based on the completeness score and | ||
| * whether any prior runs exist on disk for this validation context. | ||
| * | ||
| * - `stop` → score is 100 — nothing left to fix | ||
| * - `inspect_failures` → first run (no baseline on disk) — review what's failing before trying to fix | ||
| * - `fix_and_revalidate`→ subsequent run — agent knows the failure set, should fix and re-run | ||
| */ | ||
| export function calcNextAction(score: number, hasBaseline: boolean): NextAction { | ||
| if (score === 100) return 'stop'; |
Comment on lines
+89
to
+100
| detail: z | ||
| .enum(['summary', 'standard', 'full']) | ||
| .optional() | ||
| .default('standard') | ||
| .describe( | ||
| 'Response verbosity. "summary": is_valid, scores, and stop signal only. "standard"/"full": full issues list (default).' | ||
| ), | ||
| baseline_run_id: z | ||
| .string() | ||
| .optional() | ||
| .describe( | ||
| 'run_id from a previous call. When provided, returns only issues that are new or resolved since that run: { added, resolved, unchanged_count, run_id }. If not found, returns error BASELINE_NOT_FOUND.' |
Comment on lines
+200
to
+211
| detail: z | ||
| .enum(['summary', 'standard', 'full']) | ||
| .optional() | ||
| .default('standard') | ||
| .describe( | ||
| 'Response verbosity. "summary": key scores and stop signal only. "standard": slim violation summary (default). "full": full per-suite and per-test-case data (implies include_plan_details:true).' | ||
| ), | ||
| baseline_run_id: z | ||
| .string() | ||
| .optional() | ||
| .describe( | ||
| 'run_id from a previous call. When provided, returns only project-level violations that are new or resolved since that run: { added, resolved, unchanged_count, run_id }. If not found, returns error BASELINE_NOT_FOUND.' |
Comment on lines
+7
to
+14
|
|
||
| /** | ||
| * Returns `compact` when PROVAR_MCP_SCHEMA_MODE=compact, otherwise `standard`. | ||
| * Reads the env var on each call so tests can set it without resetting module cache. | ||
| */ | ||
| export function desc(standard: string, compact: string): string { | ||
| return process.env['PROVAR_MCP_SCHEMA_MODE'] === 'compact' ? compact : standard; | ||
| } |
mrdailey99
added a commit
that referenced
this pull request
May 13, 2026
RCA: Multiple correctness and code quality issues identified in review
of the detail/diff/completeness PDX-470/471/473 implementation.
Fix: - generateRunId: add random suffix to prevent sub-millisecond collisions
- testSuiteValidate: collect full violation hierarchy (recursive helper)
- All three validate tools: load baseline before saveRun to prevent
eviction race; call hasAnyRun before saveRun for first-run heuristic
- testCaseValidate: include best_practices_violations in diff snapshot;
extract resolveBaseResult helper to reduce handler complexity to 17
- projectValidateFromPath: omit run_id when save_results=false;
extract classifyError helper to reduce handler complexity to 18
- Remove dead code: delete unused descHelper.ts
- Tests: update run_id regex, add 8 TC tests, add detailLevel.test.ts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mrdailey99
added a commit
that referenced
this pull request
May 13, 2026
…eness_score for PR #168 RCA: CLAUDE.md requires docs updates for every PR that adds or modifies tool parameters; PR #168 added detail, baseline_run_id, run_id, completeness_score, and recommended_next_action to 4 validation tools without updating docs/mcp.md Fix: Updated provar_testcase_validate, provar_testsuite_validate, provar_testplan_validate, and provar_project_validate docs with new input params and output fields; added BASELINE_NOT_FOUND error code; marked include_plan_details/max_uncovered/max_violations as deprecated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re to validation tools RCA: Iterative fix-validate loops re-emit full violation inventories on every call, compounding token cost with no stop signal; agents have no way to know when to stop iterating or which violations changed since the prior run. Fix: Add detail=summary|standard|full, baseline_run_id diff mode (returns only added/resolved violations), and completeness_score/recommended_next_action to all four validation tools. New utilities: detailLevel.ts, validationScore.ts, validationDiff.ts. 79 unit tests across validationScore, validationDiff, testSuiteValidate, testPlanValidate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RCA: Multiple correctness and code quality issues identified in review
of the detail/diff/completeness PDX-470/471/473 implementation.
Fix: - generateRunId: add random suffix to prevent sub-millisecond collisions
- testSuiteValidate: collect full violation hierarchy (recursive helper)
- All three validate tools: load baseline before saveRun to prevent
eviction race; call hasAnyRun before saveRun for first-run heuristic
- testCaseValidate: include best_practices_violations in diff snapshot;
extract resolveBaseResult helper to reduce handler complexity to 17
- projectValidateFromPath: omit run_id when save_results=false;
extract classifyError helper to reduce handler complexity to 18
- Remove dead code: delete unused descHelper.ts
- Tests: update run_id regex, add 8 TC tests, add detailLevel.test.ts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RCA: Five correctness issues remained after the Copilot follow-up commit: Bug 5 — tcStorageDir/suiteStorageDir both wrote to same path allowing cross-tool baseline collisions; Bug 7 — computeDiff Map collapsed duplicate violations; Bug 8 — 120-char truncation caused false-equal keys; Bug 9 — calcNextAction returned stop even when quality/BP violations remained; Missing AC — include_plan_details and max_* params not marked @deprecated. Fix: - Namespace storage dirs (testcase/, testsuite/) to prevent cross-tool baseline collisions — computeDiff now uses multiset (counts per key) so duplicate violations are distinct events — remove 120-char message truncation — calcNextAction gains remainingViolationCount param (default 0); stop only fires when score=100 AND count=0 — all three tools pass currentViolations.length — projectValidateFromPath marks include_plan_details/max_uncovered/max_violations @deprecated — 2 multiset tests, 2 secondary-check tests, updated TC test, 7 projectValidate tests for run_id, detail=summary, BASELINE_NOT_FOUND, diff round-trip Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eness_score for PR #168 RCA: CLAUDE.md requires docs updates for every PR that adds or modifies tool parameters; PR #168 added detail, baseline_run_id, run_id, completeness_score, and recommended_next_action to 4 validation tools without updating docs/mcp.md Fix: Updated provar_testcase_validate, provar_testsuite_validate, provar_testplan_validate, and provar_project_validate docs with new input params and output fields; added BASELINE_NOT_FOUND error code; marked include_plan_details/max_uncovered/max_violations as deprecated Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6ffb10d to
806fc26
Compare
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
detail=summary|standard|fullparameter to all four structured-response validation tools (provar_testsuite_validate,provar_testplan_validate,provar_testcase_validate,provar_project_validate).summaryreturns only key scores and the stop signal — eliminates token explosion on repeat calls.baseline_run_iddiff mode toprovar_testsuite_validateandprovar_project_validate. Each validation call saves a violation snapshot keyed byrun_id; subsequent calls withbaseline_run_idreturn only{added, resolved, unchanged_count}instead of the full inventory.completeness_score(0–100) andrecommended_next_action(stop | inspect_failures | fix_and_revalidate) to all four tools. Agents can userecommended_next_action=stopas an unambiguous signal to end the fix-validate loop.New shared utilities
src/mcp/utils/detailLevel.tsapplyDetailLevel,DetailLevelsrc/mcp/utils/validationScore.tscalcCompletenessScore,calcNextActionsrc/mcp/utils/validationDiff.tsgenerateRunId,saveRun,hasAnyRun,loadBaselineViolations,computeDiffTest plan
validationScore.test.ts— 8 tests forcalcCompletenessScoreandcalcNextActionvalidationDiff.test.ts— 13 tests for run persistence, cap-at-20 eviction, and diff computationtestSuiteValidate.test.ts— 30 tests (19 existing + 11 new for PDX-470/471/473)testPlanValidate.test.ts— 28 tests (19 existing + 9 new for PDX-470/473)🤖 Generated with Claude Code