spec: add fixture-driven conformance suite #248

Merged
mickdarling merged 4 commits into develop from feature/conformance-suite
Apr 15, 2026

Conversation

@mickdarling
Contributor

Summary

  • add a fixture-driven conformance runner with test, verify-fixtures, report, and version commands
  • add reference evidence bundles for passing Level 1, passing Level 2, semantic-warning, and failing cases
  • wire the conformance fixtures into npm test, CI, and the conformance/generator docs

Related

Testing

  • npm test
  • npm run docs:dates:check

@mickdarling
Contributor Author

@claude please review this PR for content, quality, gaps, and consistency.

@claude

claude bot commented Apr 15, 2026

Claude finished @mickdarling's task in 3m 17s —— View job


PR Review: spec: add fixture-driven conformance suite

Tasks

  • Read CLAUDE.md for repo-specific guidelines
  • Read and analyze changed files
  • Review conformance runner script
  • Review fixture/evidence files
  • Review CI workflow
  • Review documentation changes
  • Post comprehensive review feedback

Overview

The fixture-driven evidence-bundle approach is well-conceived. Instead of requiring a live adapter for CI, fixtures capture pre-recorded responses and the runner evaluates them deterministically. The four reference cases (passing L1, passing L2, semantic-warn, failure) cover the important paths, and wiring verify-fixtures into npm test and CI gives this real teeth from day one. The two-tier semantic model and the evaluateSemantic tie-breaking logic are solid.

Below are findings organized by severity.


Bugs / Inconsistencies

1. --verbose is dead code · scripts/run-conformance-tests.mjs:779

options.verbose is set when --verbose/-v is passed but is never read anywhere in the evaluation or rendering pipeline. Users will expect verbose output and get none.



2. --timeout documented but not implemented · docs/conformance-testing.md:461

The CLI options table lists --timeout with a default of 30, and exit code 4 is documented for timeouts. Neither exists in parseArgs(). Since the current runner is purely in-process (no network), timeout may not be meaningful yet — but the docs should say so or the option should be removed.



3. Exit code 3 semantics mismatch · scripts/run-conformance-tests.mjs:860 vs docs/conformance-testing.md:469

The spec documents exit code 3 as "Configuration error — invalid adapter path or configuration". The script uses it for unknown command (Unknown command: ${command}). These are different failure modes. The docs table should say "Unknown command or configuration error".


4. JSON report format: required type and testDate field · docs/conformance-testing.md:397-407 vs scripts/run-conformance-tests.mjs:636-647

Two concrete mismatches between the example report in the docs and what summarize() actually produces:

  • Docs show "required": "MUST" (string). Code produces "required": true (boolean).
  • Docs show a "testDate" field. summarize() never writes testDate.

Anyone consuming JSON output programmatically and following the spec example will get different shapes.



5. Test count table is stale · docs/conformance-testing.md:283-288

Section 4.1 declares "Total MUST PASS tests: 9" with breakdown 2+3+2+2. The actual runner produces more:

| Category | Spec says | Runner produces |
| --- | --- | --- |
| Introspection Fidelity | 2 | 4 (operations, types, detail payloads, completeness) |
| Parameter Handling | 3 | 4 (required, unknown, defaults, snake_case) |
| Error Quality | 2 | 3 (schema, leakage, actionable) |
| Round-Trip Integrity | 2 | 2 ✓ |

The table and total count should be updated to match.



Gaps

6. CI workflow doesn't trigger on schema changes · .github/workflows/conformance-tests.yml:7-22

The conformance runner loads schemas/introspection-response.schema.json and schemas/operation-result.schema.json at startup (lines 51-57 of the script). If those schemas change without touching any conformance file, CI won't re-run and a broken validator could silently pass.

Add 'schemas/introspection-response.schema.json' and 'schemas/operation-result.schema.json' to both push.paths and pull_request.paths.

Note: I'm unable to modify .github/workflows/ files — you'll need to make this change directly.


7. No error handling around loadJson · scripts/run-conformance-tests.mjs:24-26

function loadJson(path) {
  return JSON.parse(readFileSync(path, "utf8"));
}

If a fixture is missing or malformed, Node throws a raw error with a stack trace rather than a clean exit message. A try/catch with process.stderr.write and process.exit(3) would give users actionable output.
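A minimal sketch of that shape — the message wording is a suggestion, and using exit code 3 here assumes it keeps its documented "configuration error" meaning:

```javascript
import { readFileSync } from "node:fs";

// Sketch, not the PR's actual code: wrap the read and parse so a missing or
// malformed fixture exits with a clean message instead of a raw stack trace.
function loadJson(path) {
  try {
    return JSON.parse(readFileSync(path, "utf8"));
  } catch (err) {
    process.stderr.write(`Cannot load ${path}: ${err.message}\n`);
    process.exit(3);
  }
}
```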



8. runReportCommand doesn't validate input or exit with a code · scripts/run-conformance-tests.mjs:835-843

Passing a non-report JSON (e.g., a fixture file) to report will silently produce malformed output. The command also never calls process.exit(), unlike test. The docs don't note this asymmetry. At minimum, basic duck-typing (checking for categories, summary) and a final process.exit(0) would be consistent.


9. --tier 2 and --tier both are undifferentiated · scripts/run-conformance-tests.mjs:492-502

evaluateSemantic branches only on tier === "1". Both "2" and "both" execute identical code. A user passing --tier 2 may expect to skip structural pattern matching and run only LLM-style checks — that's not what happens. The docs list all three values without explaining this equivalence.
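If the tiers are later differentiated, one possible shape — function and argument names are illustrative, treating tier 1 as the structural pattern checks and tier 2 as the LLM-style checks, with the combined-tier outcomes matching the four-outcome logic described elsewhere in this PR:

```javascript
// Illustrative only: give --tier 1, --tier 2, and --tier both distinct behavior.
function semanticOutcome(tier, tier1Pass, tier2Pass) {
  if (tier === "1") return tier1Pass ? "PASS" : "FAIL";
  if (tier === "2") return tier2Pass ? "PASS" : "FAIL";
  // "both": a tier-2 failure is fatal; a tier-1-only failure downgrades to WARN
  if (tier1Pass && tier2Pass) return "PASS";
  if (!tier1Pass && tier2Pass) return "WARN";
  return "FAIL";
}
```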


10. export-conformance-evidence.sh reference in generator docs · docs/adapter/generator.md:913

The updated example references a shell script ./generated-adapter/scripts/export-conformance-evidence.sh that doesn't exist and has no spec. This will confuse implementors who follow the example literally. Either add a note that this is a placeholder, or replace with a more realistic example.


Minor / Style

11. Issue reference may be wrong in conformance-testing.md · docs/conformance-testing.md:539

The References section links "GitHub Issue #55 — Conformance test requirements". The PR body closes #10 and #56. If #55 covers distinct remaining work, that's fine — but it's worth a comment confirming this is intentional and not a stale link.

12. reference-failure.json tests multiple failures simultaneously

The failure fixture combines camelCase parameter name (widgetId), completeness mismatch (widget_id/include_deleted not in introspection), and implementation leakage — all in one file. This is a valid smoke test, but if the runner's behavior changes for one failure type, it can mask regressions in others. Consider whether separating failure modes into distinct fixtures would give better diagnostic signal.

13. conformance-testing.md section 6.1 example has "required": "MUST" inconsistency with Markdown render section · docs/conformance-testing.md:707-711

renderMarkdown() writes `Requirement: MUST` as plain text, not as a JSON field. The markdown output is fine, but the JSON output uses true/false. Both are correct for their format, but the two representations should be called out explicitly so implementors don't conflate them.


Summary

The core design — evidence bundles, two-tier semantic evaluation, verify-fixtures gate in CI — is clean and practical. The main items to address before merge:

  1. Fix --verbose (dead code) and --timeout (documented but missing)
  2. Align the JSON report format (required type, testDate) between docs and code
  3. Update section 4.1 test count table to match the actual runner
  4. Add schema file paths to CI trigger paths (requires manual edit to the workflow)
  5. Add loadJson error handling

Contributor Author

@claude please review this PR for content, quality, gaps, and consistency.

@claude

claude bot commented Apr 15, 2026

Claude finished @mickdarling's task in 4m 56s —— View job


Tasks

  • Read CLAUDE.md for repo-specific guidelines
  • Read and analyze changed files
  • Review conformance runner script (scripts/run-conformance-tests.mjs)
  • Review fixture/evidence files
  • Review CI workflow and package.json
  • Review documentation changes
  • Post comprehensive review

PR Review: spec: add fixture-driven conformance suite

The evidence-bundle approach is the right call for a spec repo — static, reviewable, and deterministic. The second commit (45a5c5b) already resolved the issues flagged in the first pass (--verbose dead code, --timeout documentation, exit code 3 semantics, required type alignment, test count table, CI schema paths, loadJson error handling, runReportCommand validation, export-conformance-evidence.mjs note). What remains is smaller.


Bugs

1. Uncaught TypeError when fixture/report path is omitted · scripts/run-conformance-tests.mjs:789,836

runTestCommand and runReportCommand call resolve(positional[0]) before entering loadJson. When no path is given, positional[0] is undefined, and path.resolve(undefined) throws a TypeError [ERR_INVALID_ARG_TYPE] that bypasses loadJson's try/catch — users see a raw stack trace instead of a clean error.

// runTestCommand (line 789) and runReportCommand (line 836)
// resolve() throws before loadJson's try/catch can help
const fixture = loadJson(resolve(fixturePath));  // fixturePath may be undefined

Guard both entry points:

if (!fixturePath) {
  process.stderr.write(`No fixture path provided.\nUsage: node scripts/run-conformance-tests.mjs test <path> [options]\n`);
  process.exit(3);
}



Documentation Gaps

2. Section 4.2 SHOULD PASS table is missing "Level 2 Features" · docs/conformance-testing.md:289-297

evaluateFixture adds a third optional category — evaluateLevel2Capabilities (which checks endpoint modes, field selection, and batch operations) — whenever level >= 2. With required: false, failures produce exit code 2. The section 4.2 table only lists Constraint Documentation and Semantic Evaluation; Level 2 Features is absent.

Category Tests Rationale
Constraint Documentation 2 Discoverable constraints
Level 2 Features 3 Endpoint modes, field selection, batch
Semantic Evaluation Per implementation LLM discoverability



3. Section 5.4 mermaid diagram has a phantom decision branch · docs/conformance-testing.md:360-371

The diagram shows:

B -->|Yes| C{Tier 2 Required?}
C -->|No| E[PASS]

But in evaluateSemantic (lines 500-562), there is no "Tier 2 required?" check — when tier !== "1", both tier1Pass and tier2Pass are always computed and combined. The C -->|No| PASS path doesn't correspond to any code path.

Either simplify the diagram to show the actual four-outcome logic (tier1 ∧ tier2 → PASS, ¬tier1 ∧ tier2 → WARN, tier1 ∧ ¬tier2 → FAIL, ¬tier1 ∧ ¬tier2 → FAIL), or add a note that the diagram describes the full intended design rather than the current in-process fixture runner.



Design Gaps

4. Empty round_trip_tests trivially satisfies a MUST PASS category

evaluateRoundTrip iterates fixture.round_trip_tests and fixture.update_preservation_tests. When both arrays are empty, all individual checks accumulate zero failures and both sub-tests report PASS. This means an implementation that provides no round-trip evidence at all is considered conformant at Level 1 for that category.

reference-failure.json exploits this inadvertently — it has "round_trip_tests": [] and "update_preservation_tests": [], so Round-Trip Integrity PASSes even though the fixture is supposed to demonstrate a comprehensive failure. A note in the fixture or a minimum-evidence check (at least one entry for each, or a SKIP result when empty rather than trivial PASS) would close this gap.
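One way to close the gap, sketched with assumed shapes — the result object (`status`, `detail`) and the per-entry `pass` flag are illustrative, not the runner's actual structures:

```javascript
// Illustrative: report SKIP instead of a trivial PASS when a fixture
// provides no round-trip evidence at all.
function roundTripStatus(fixture) {
  const trips = fixture.round_trip_tests ?? [];
  const updates = fixture.update_preservation_tests ?? [];
  if (trips.length === 0 && updates.length === 0) {
    return { status: "SKIP", detail: "No round-trip evidence provided." };
  }
  const failures = [...trips, ...updates].filter((t) => t.pass === false);
  return failures.length === 0
    ? { status: "PASS" }
    : { status: "FAIL", detail: `${failures.length} round-trip check(s) failed.` };
}
```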


5. No fixture schema — malformed evidence bundles fail silently or cryptically

There is no JSON Schema for the fixture format itself. A missing accepted_parameters key silently defaults to {} (producing WARNs). A misspelled introspection.operations_list key silently skips operations validation. An implementor building their first evidence bundle has no machine-checkable spec to validate against before running the tool.

Consider adding a schemas/conformance-fixture.schema.json and validating fixtures against it at the start of runTestCommand and runVerifyFixtures.
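Until such a schema exists, even a dependency-free duck-type check at the top of runTestCommand would catch the most common authoring mistakes. A sketch — the key list is illustrative, drawn from fields mentioned in this review, not an authoritative fixture spec:

```javascript
// Interim shape check, no JSON Schema library required.
function checkFixtureShape(fixture) {
  const requiredKeys = ["introspection", "accepted_parameters", "round_trip_tests"];
  const missing = requiredKeys.filter((key) => !(key in fixture));
  return missing.length === 0
    ? { ok: true }
    : { ok: false, detail: `Fixture is missing keys: ${missing.join(", ")}` };
}
```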


Minor

6. version command is out of sync with package.json · scripts/run-conformance-tests.mjs:865

process.stdout.write("mcpaql-conformance-fixtures 0.1.0\n");

package.json version is 1.0.0-alpha.1. The separate versioning is defensible (the runner is a sub-artifact), but neither the code nor the docs explain that these are intentionally different. Either read the version from package.json dynamically, or add a comment.
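A hedged sketch of the dynamic option — the helper name and the `../package.json` path relative to scripts/ are assumptions:

```javascript
// Illustrative: build the banner from a parsed package.json object so the
// CLI version cannot drift from the manifest.
function versionBanner(pkg, toolName = "mcpaql-conformance-fixtures") {
  return `${toolName} ${pkg.version}`;
}
// At startup the runner could read the manifest once, e.g.:
//   const pkg = JSON.parse(
//     readFileSync(new URL("../package.json", import.meta.url), "utf8"));
//   process.stdout.write(versionBanner(pkg) + "\n");
```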

7. verify-fixtures prints no totals line

PASS reference-failure.json -> expected 1, got 1
PASS reference-level1.json -> expected 0, got 0
...

A trailing "All N fixtures passed." (or "N/M fixtures failed.") would make the output scannable in CI logs without having to count lines.
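A sketch of such a summary line, assuming the loop collects an array of per-fixture outcomes with a `pass` flag (that shape is illustrative):

```javascript
// Illustrative: one trailing totals line for verify-fixtures output.
function totalsLine(results) {
  const failed = results.filter((r) => !r.pass).length;
  return failed === 0
    ? `All ${results.length} fixtures passed.`
    : `${failed}/${results.length} fixtures failed.`;
}
```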

8. FORBIDDEN_ERROR_PATTERNS includes /at\s+\S+/i

This catches stack-frame lines (at Function.eval), but would also flag legitimate error messages like "available at docs.example.com" or "look at the field name". For in-fixture strings this is low-risk since implementors control the message, but worth documenting the trade-off in a comment so maintainers understand why the pattern is written broadly.
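A sketch of what that comment could look like — the standalone constant name is illustrative; in the script the pattern lives inside FORBIDDEN_ERROR_PATTERNS:

```javascript
// Broad on purpose: catches stack-frame lines ("at Function.eval",
// "at /app/src/db.js:42") but will also flag benign prose such as
// "available at docs.example.com". Acceptable here because fixture
// authors control their own error strings.
const STACK_FRAME_PATTERN = /at\s+\S+/i;
```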


Summary

The second commit addressed the bulk of the initial review. The remaining items break down as:

| Priority | Item |
| --- | --- |
| Bug | Guard runTestCommand/runReportCommand against undefined path before calling resolve() |
| Docs | Add "Level 2 Features" row to section 4.2 SHOULD PASS table |
| Docs | Fix or annotate the section 5.4 mermaid diagram |
| Design | Document/decide behavior when round-trip evidence arrays are empty |
| Design | Consider adding a fixture JSON Schema for self-validation |
| Minor | Sync or explain the version string vs package.json version |
| Minor | Add summary line to verify-fixtures output |

The core design is solid. Items 1–3 are quick fixes; 4–5 can be addressed in a follow-on issue if preferred.

@mickdarling mickdarling force-pushed the feature/conformance-suite branch from 879b86f to 98e5e6d on April 15, 2026 at 21:15
Contributor Author

@claude please review this PR for content, quality, gaps, and consistency.

@claude

claude bot commented Apr 15, 2026

Claude finished @mickdarling's task in 3m 13s —— View job


Tasks

  • Read CLAUDE.md for repo guidelines
  • Read conformance runner script
  • Read documentation changes
  • Read fixture/evidence files
  • Read CI workflow and package.json
  • Post comprehensive third-pass review

PR Review (Pass 3): spec: add fixture-driven conformance suite

The third commit (98e5e6d) resolved most of the second pass's findings. Below is a status audit plus new findings from reading the current file state.


Status of Prior Review Items

All first-pass items: resolved.

From second pass:

| Item | Status |
| --- | --- |
| Guard runTestCommand/runReportCommand against undefined path | ✅ Fixed (scripts/run-conformance-tests.mjs:789-793, 841-845) |
| Add "Level 2 Features" row to section 4.2 SHOULD PASS table | ✅ Fixed |
| Fix/annotate section 5.4 mermaid diagram | ✅ Fixed (diagram now matches actual four-outcome logic) |
| Document/decide behavior for empty round-trip arrays | ❌ Open — still trivially passes (see item 9 below) |
| Consider adding a fixture JSON Schema | ❌ Open — no schemas/conformance-fixture.schema.json |
| Sync or explain version string vs package.json | ❌ Open — "mcpaql-conformance-fixtures 0.1.0" vs 1.0.0-alpha.1 with no comment |
| Add summary line to verify-fixtures output | ❌ Open — still no totals line |
| Comment on /at\s+\S+/i regex trade-off | ❌ Open — no comment added |

New Bugs

1. --category filter doesn't affect exit code · scripts/run-conformance-tests.mjs:796-809

summarize() computes report.exitCode across all categories before the category filter is applied. The filtered output shown to the user reflects one category, but process.exit(report.exitCode) uses the unfiltered result. A user scripting against --category "Introspection Fidelity" will get a misleading exit code if a different category fails.

const report = evaluateFixture(fixture, options.level, options.tier); // exitCode set here
if (options.category) {
  report.categories = report.categories.filter(...); // display-only filter
}
process.exit(report.exitCode); // reflects ALL categories, not the filtered one

The fix is to recompute report.exitCode from the filtered category set, or to document explicitly that --category is display-only and exit code always reflects the full evaluation.
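A sketch of the recompute option, using the report field names discussed in this review — the exact shape of `categories`/`tests` is assumed, and the 1/2 mapping follows the documented exit codes (1 for a required failure, 2 for optional-only failures):

```javascript
// Illustrative: derive an exit code from only the categories being displayed.
function exitCodeFor(categories) {
  let requiredFail = false;
  let optionalFail = false;
  for (const category of categories) {
    for (const test of category.tests) {
      if (test.status !== "FAIL") continue;
      if (test.required) requiredFail = true;
      else optionalFail = true;
    }
  }
  return requiredFail ? 1 : optionalFail ? 2 : 0;
}
```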



2. Exit code 5 is documented but never emitted · docs/conformance-testing.md:469

The exit codes table lists:

5 | Internal error | Unexpected error in test runner

There is no process.exit(5) anywhere in the script. An unhandled exception falls through to Node's default exit code 1 — not 5. Either remove exit code 5 from the table, or add a top-level process.on('uncaughtException', ...) handler that writes to stderr and exits with 5.
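If exit code 5 is kept, a minimal handler sketch (message wording is illustrative):

```javascript
// Illustrative top-level handler so unexpected errors produce the documented
// exit code 5 rather than Node's default exit code 1.
process.on("uncaughtException", (err) => {
  process.stderr.write(`Internal error: ${err.message}\n`);
  process.exit(5);
});
```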



Documentation Gaps

3. Section 2.2 "Test Categories Required" omits "Level 2 Features" · docs/conformance-testing.md:98-101

Section 2.2 lists only two optional test categories for Level 2:

  • Constraint Documentation (SHOULD PASS)
  • Semantic Evaluation (SHOULD PASS)

But the runner always adds evaluateLevel2Capabilities (endpoint modes, field selection, batch) as a third SHOULD category at level ≥ 2 (line 665), and section 4.2 correctly lists it. Section 2.2 is incomplete — implementors reading only that section won't know their evidence bundle needs endpoint_modes, capabilities, and batch_example fields.



4. Endpoint mode naming inconsistency: "Semantic" vs "crude" · docs/conformance-testing.md:90, scripts/run-conformance-tests.mjs:573

Section 2.2 requirements table says:

| Endpoint modes | Semantic endpoint mode and Single mode supported |

But evaluateLevel2Capabilities tests:

endpointModes.includes("crude") && endpointModes.includes("single")

And the failure message reads: "Level 2 evidence must include both crude and single endpoint modes."

All four reference fixtures use "crude" as the value. An implementor writing their first evidence bundle who reads section 2.2 would likely write "semantic" and get a confusing failure. Either update section 2.2 to say "CRUDE endpoint mode and Single mode", or add a note that "crude" is the fixture value for what the spec calls the semantic endpoint mode.



5. Section 6.1 JSON example omits requestedLevel field · docs/conformance-testing.md:381-404

summarize() (line 647) always emits a requestedLevel field alongside conformanceLevel. The example in section 6.1 doesn't include requestedLevel, so anyone building a report consumer who follows the spec example will miss the field. A one-line addition to the example JSON resolves this.


Remaining from Pass 2 (Minor, no change)

6. verify-fixtures prints no totals line — still ends with the last per-fixture line, no "N/M fixtures passed." summary.

7. version string mismatch — "mcpaql-conformance-fixtures 0.1.0" (line 877) vs package.json 1.0.0-alpha.1, with no explanatory comment.

8. /at\s+\S+/i pattern comment missing — would flag "available at docs.example.com", worth a one-line comment noting the intentional trade-off.

9. Empty round_trip_tests/update_preservation_tests trivially PASS — reference-failure.json has [] for both arrays, so Round-Trip Integrity PASSes in a fixture designed to demonstrate failure. A note in the fixture (or a SKIP result when arrays are empty) would improve diagnostic clarity.


Summary

The runner is solid and the two previous passes cleaned up the substantive bugs. What remains:

| Priority | Item |
| --- | --- |
| Bug | --category exit code doesn't reflect filtered output |
| Bug | Exit code 5 documented, never emitted |
| Docs | Section 2.2 missing "Level 2 Features" test category |
| Docs | "Semantic" vs "crude" endpoint mode naming inconsistency |
| Docs | Section 6.1 JSON example missing requestedLevel field |
| Minor | verify-fixtures no totals line (pass 2, still open) |
| Minor | version string / package.json mismatch (pass 2, still open) |
| Minor | /at\s+\S+/i comment missing (pass 2, still open) |
| Design | Empty round-trip arrays trivially PASS (pass 2, still open) |

Items 1–5 are quick fixes. Items 6–9 can be addressed as follow-on cleanup.

@mickdarling mickdarling force-pushed the feature/conformance-suite branch from 93c322a to a1f46e0 on April 15, 2026 at 21:53
@mickdarling mickdarling merged commit c83e1e3 into develop Apr 15, 2026
5 checks passed
@mickdarling mickdarling deleted the feature/conformance-suite branch April 15, 2026 21:53


Development

Successfully merging this pull request may close these issues.

  • spec(conformance): LLM-based semantic evaluation for API discoverability tests
  • spec: Create conformance test suite
