spec: add fixture-driven conformance suite #248
Conversation
@claude please review this PR for content, quality, gaps, and consistency.

Claude finished @mickdarling's task in 3m 17s

PR Review:
| Category | Spec says | Runner produces |
|---|---|---|
| Introspection Fidelity | 2 | 4 (operations, types, detail payloads, completeness) |
| Parameter Handling | 3 | 4 (required, unknown, defaults, snake_case) |
| Error Quality | 2 | 3 (schema, leakage, actionable) |
| Round-Trip Integrity | 2 | 2 ✓ |
The table and total count should be updated to match.
Gaps
6. CI workflow doesn't trigger on schema changes · .github/workflows/conformance-tests.yml:7-22
The conformance runner loads schemas/introspection-response.schema.json and schemas/operation-result.schema.json at startup (lines 51-57 of the script). If those schemas change without touching any conformance file, CI won't re-run and a broken validator could silently pass.
Add 'schemas/introspection-response.schema.json' and 'schemas/operation-result.schema.json' to both push.paths and pull_request.paths.
Note: I'm unable to modify files under .github/workflows/ — you'll need to make this change directly.
7. No error handling around loadJson · scripts/run-conformance-tests.mjs:24-26
```js
function loadJson(path) {
  return JSON.parse(readFileSync(path, "utf8"));
}
```

If a fixture is missing or malformed, Node throws a raw error with a stack trace rather than a clean exit message. A try/catch with process.stderr.write and process.exit(3) would give users actionable output.
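The suggested guard could look like this sketch; exit code 3 follows the review's suggestion rather than an established convention in the runner:

```javascript
import { readFileSync } from "node:fs";

// Defensive variant of loadJson: catch read/parse failures and exit
// cleanly. Exit code 3 is an assumption, not an existing runner code.
function loadJson(path) {
  try {
    return JSON.parse(readFileSync(path, "utf8"));
  } catch (err) {
    process.stderr.write(`error: cannot load ${path}: ${err.message}\n`);
    process.exit(3);
  }
}
```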
8. runReportCommand doesn't validate input or exit with a code · scripts/run-conformance-tests.mjs:835-843
Passing a non-report JSON (e.g., a fixture file) to report will silently produce malformed output. The command also never calls process.exit(), unlike test. The docs don't note this asymmetry. At minimum, basic duck-typing (checking for categories, summary) and a final process.exit(0) would be consistent.
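One possible shape for that duck-typing check, as a sketch; the field names (`categories`, `summary`) come from the review's suggestion, not an existing helper in the script:

```javascript
// Hypothetical guard for runReportCommand: reject input that does not
// look like a conformance report before attempting to render it.
function looksLikeReport(data) {
  return (
    data !== null &&
    typeof data === "object" &&
    Array.isArray(data.categories) &&
    typeof data.summary === "object"
  );
}
```

runReportCommand could call this on the parsed JSON, write a one-line error for non-reports, and finish with an explicit process.exit(0) on success to match the test command.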
9. --tier 2 and --tier both are undifferentiated · scripts/run-conformance-tests.mjs:492-502
evaluateSemantic branches only on tier === "1". Both "2" and "both" execute identical code. A user passing --tier 2 may expect to skip structural pattern matching and run only LLM-style checks — that's not what happens. The docs list all three values without explaining this equivalence.
10. export-conformance-evidence.sh reference in generator docs · docs/adapter/generator.md:913
The updated example references a shell script ./generated-adapter/scripts/export-conformance-evidence.sh that doesn't exist and has no spec. This will confuse implementors who follow the example literally. Either add a note that this is a placeholder, or replace with a more realistic example.
Minor / Style
11. Issue reference may be wrong in conformance-testing.md · docs/conformance-testing.md:539
The References section links "GitHub Issue #55 — Conformance test requirements". The PR body closes #10 and #56. If #55 covers distinct remaining work, that's fine — but it's worth a comment confirming this is intentional and not a stale link.
12. reference-failure.json tests multiple failures simultaneously
The failure fixture combines camelCase parameter name (widgetId), completeness mismatch (widget_id/include_deleted not in introspection), and implementation leakage — all in one file. This is a valid smoke test, but if the runner's behavior changes for one failure type, it can mask regressions in others. Consider whether separating failure modes into distinct fixtures would give better diagnostic signal.
13. conformance-testing.md section 6.1 example has "required": "MUST" inconsistency with Markdown render section · docs/conformance-testing.md:707-711
renderMarkdown() writes `Requirement: MUST` as plain text, not as a JSON field. The markdown output is fine, but the JSON output uses true/false. Both are correct for their format, but the two representations should be called out explicitly so implementors don't conflate them.
Summary
The core design — evidence bundles, two-tier semantic evaluation, verify-fixtures gate in CI — is clean and practical. The main items to address before merge:
- Fix `--verbose` (dead code) and `--timeout` (documented but missing)
- Align the JSON report format (`required` type, `testDate`) between docs and code
- Update the section 4.1 test count table to match the actual runner
- Add schema file paths to CI trigger paths (requires manual edit to the workflow)
- Add `loadJson` error handling
@claude please review this PR for content, quality, gaps, and consistency.

Claude finished @mickdarling's task in 4m 56s

PR Review:
| Category | Tests | Rationale |
|---|---|---|
| Constraint Documentation | 2 | Discoverable constraints |
| Level 2 Features | 3 | Endpoint modes, field selection, batch |
| Semantic Evaluation | Per implementation | LLM discoverability |
3. Section 5.4 mermaid diagram has a phantom decision branch · docs/conformance-testing.md:360-371
The diagram shows:

```mermaid
B -->|Yes| C{Tier 2 Required?}
C -->|No| E[PASS]
```
But in evaluateSemantic (lines 500-562), there is no "Tier 2 required?" check — when tier !== "1", both tier1Pass and tier2Pass are always computed and combined. The C -->|No| PASS path doesn't correspond to any code path.
Either simplify the diagram to show the actual four-outcome logic (tier1 ∧ tier2 → PASS, ¬tier1 ∧ tier2 → WARN, tier1 ∧ ¬tier2 → FAIL, ¬tier1 ∧ ¬tier2 → FAIL), or add a note that the diagram describes the full intended design rather than the current in-process fixture runner.
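The four-outcome combination described above can be sketched directly; `tier1Pass`/`tier2Pass` stand in for the booleans evaluateSemantic computes:

```javascript
// Combine the two tier results into the four outcomes listed above.
function combineTiers(tier1Pass, tier2Pass) {
  if (tier1Pass && tier2Pass) return "PASS";
  if (!tier1Pass && tier2Pass) return "WARN";
  return "FAIL"; // tier1-only or neither
}
```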
Design Gaps
4. Empty round_trip_tests trivially satisfies a MUST PASS category
evaluateRoundTrip iterates fixture.round_trip_tests and fixture.update_preservation_tests. When both arrays are empty, all individual checks accumulate zero failures and both sub-tests report PASS. This means an implementation that provides no round-trip evidence at all is considered conformant at Level 1 for that category.
reference-failure.json exploits this inadvertently — it has "round_trip_tests": [] and "update_preservation_tests": [], so Round-Trip Integrity PASSes even though the fixture is supposed to demonstrate a comprehensive failure. A note in the fixture or a minimum-evidence check (at least one entry for each, or a SKIP result when empty rather than trivial PASS) would close this gap.
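A minimal sketch of the SKIP alternative, assuming a `pass` boolean on each evidence entry (the runner's real per-entry checks are richer):

```javascript
// Return SKIP instead of a trivial PASS when a fixture provides no
// round-trip evidence at all. The `pass` field is a stand-in for the
// runner's actual per-entry validation.
function roundTripStatus(fixture) {
  const trips = fixture.round_trip_tests ?? [];
  const updates = fixture.update_preservation_tests ?? [];
  if (trips.length === 0 && updates.length === 0) return "SKIP";
  const failed = [...trips, ...updates].some((t) => t.pass === false);
  return failed ? "FAIL" : "PASS";
}
```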
5. No fixture schema — malformed evidence bundles fail silently or cryptically
There is no JSON Schema for the fixture format itself. A missing accepted_parameters key silently defaults to {} (producing WARNs). A misspelled introspection.operations_list key silently skips operations validation. An implementor building their first evidence bundle has no machine-checkable spec to validate against before running the tool.
Consider adding a schemas/conformance-fixture.schema.json and validating fixtures against it at the start of runTestCommand and runVerifyFixtures.
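Until such a schema exists, even a dependency-free structural check at the top of runTestCommand would catch the two silent failure modes above. The key names mirror the ones mentioned in this review; everything else is illustrative:

```javascript
// Interim structural check for an evidence bundle. Returns a list of
// problems; an empty list means the bundle looks structurally sound.
function checkFixtureShape(fixture) {
  const problems = [];
  if (typeof fixture.introspection !== "object" || fixture.introspection === null) {
    problems.push("missing introspection object");
  } else if (!Array.isArray(fixture.introspection.operations_list)) {
    problems.push("introspection.operations_list must be an array");
  }
  if (typeof fixture.accepted_parameters !== "object" || fixture.accepted_parameters === null) {
    problems.push("missing accepted_parameters map");
  }
  return problems;
}
```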
Minor
6. version command is out of sync with package.json · scripts/run-conformance-tests.mjs:865
```js
process.stdout.write("mcpaql-conformance-fixtures 0.1.0\n");
```

package.json version is 1.0.0-alpha.1. The separate versioning is defensible (the runner is a sub-artifact), but neither the code nor the docs explain that these are intentionally different. Either read the version from package.json dynamically, or add a comment.
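The dynamic option could look like this sketch; how the package.json path is resolved relative to the script is an assumption:

```javascript
import { readFileSync } from "node:fs";

// Derive the version banner from package.json instead of a literal,
// so the two can never drift apart.
function versionLine(pkgPath) {
  const { version } = JSON.parse(readFileSync(pkgPath, "utf8"));
  return `mcpaql-conformance-fixtures ${version}`;
}
// In the runner this might be called with a path resolved from
// import.meta.url, e.g. new URL("../package.json", import.meta.url).
```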
7. verify-fixtures prints no totals line
```
PASS reference-failure.json -> expected 1, got 1
PASS reference-level1.json -> expected 0, got 0
...
```
A trailing "All N fixtures passed." (or "N/M fixtures failed.") would make the output scannable in CI logs without having to count lines.
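A sketch of such a totals line, assuming the per-fixture loop already counts passes:

```javascript
// Summary line for verify-fixtures output; passed/total would come from
// counters accumulated while iterating the fixtures.
function totalsLine(passed, total) {
  return passed === total
    ? `All ${total} fixtures passed.`
    : `${total - passed}/${total} fixtures failed.`;
}
```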
8. FORBIDDEN_ERROR_PATTERNS includes /at\s+\S+/i
This catches stack-frame lines (at Function.eval), but would also flag legitimate error messages like "available at docs.example.com" or "look at the field name". For in-fixture strings this is low-risk since implementors control the message, but worth documenting the trade-off in a comment so maintainers understand why the pattern is written broadly.
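The comment might read something like this sketch (the constant name here is illustrative; the actual entry lives in the FORBIDDEN_ERROR_PATTERNS table):

```javascript
// Broad on purpose: matches stack-frame lines like "at Function.eval",
// but also benign prose such as "available at docs.example.com".
// Fixture strings are implementor-controlled, and missing a leaked
// stack frame is worse than an occasional false positive.
const STACK_FRAME_PATTERN = /at\s+\S+/i;
```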
Summary
The second commit addressed the bulk of the initial review. The remaining items break down as:
| Priority | Item |
|---|---|
| Bug | Guard runTestCommand/runReportCommand against undefined path before calling resolve() |
| Docs | Add "Level 2 Features" row to section 4.2 SHOULD PASS table |
| Docs | Fix or annotate the section 5.4 mermaid diagram |
| Design | Document/decide behavior when round-trip evidence arrays are empty |
| Design | Consider adding a fixture JSON Schema for self-validation |
| Minor | Sync or explain the version string vs package.json version |
| Minor | Add summary line to verify-fixtures output |
The core design is solid. Items 1–3 are quick fixes; 4–5 can be addressed in a follow-on issue if preferred.
Force-pushed 879b86f to 98e5e6d
@claude please review this PR for content, quality, gaps, and consistency.

Claude finished @mickdarling's task in 3m 13s

PR Review (Pass 3):
| Item | Status |
|---|---|
| Guard runTestCommand/runReportCommand against undefined path | ✅ Fixed (scripts/run-conformance-tests.mjs:789-793, 841-845) |
| Add "Level 2 Features" row to section 4.2 SHOULD PASS table | ✅ Fixed |
| Fix/annotate section 5.4 mermaid diagram | ✅ Fixed (diagram now matches actual four-outcome logic) |
| Document/decide behavior for empty round-trip arrays | ❌ Open — still trivially passes (see item 4 below) |
| Consider adding a fixture JSON Schema | ❌ Open — no schemas/conformance-fixture.schema.json |
| Sync or explain version string vs package.json | ❌ Open — "mcpaql-conformance-fixtures 0.1.0" vs 1.0.0-alpha.1 with no comment |
| Add summary line to verify-fixtures output | ❌ Open — still no totals line |
| Comment on /at\s+\S+/i regex trade-off | ❌ Open — no comment added |
New Bugs
1. --category filter doesn't affect exit code · scripts/run-conformance-tests.mjs:796-809
summarize() computes report.exitCode across all categories before the category filter is applied. The filtered output shown to the user reflects one category, but process.exit(report.exitCode) uses the unfiltered result. A user scripting against --category "Introspection Fidelity" will get a misleading exit code if a different category fails.
```js
const report = evaluateFixture(fixture, options.level, options.tier); // exitCode set here
if (options.category) {
  report.categories = report.categories.filter(...); // display-only filter
}
process.exit(report.exitCode); // reflects ALL categories, not the filtered one
```

The fix is to recompute report.exitCode from the filtered category set, or to document explicitly that --category is display-only and the exit code always reflects the full evaluation.
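The recompute option could be a small helper; the FAIL→1 / WARN→2 mapping here is an assumption about the runner's exit-code scheme:

```javascript
// Recompute the exit code from whichever category set is actually shown
// to the user, so --category filtering and the exit code agree.
function exitCodeFor(categories) {
  if (categories.some((c) => c.status === "FAIL")) return 1;
  if (categories.some((c) => c.status === "WARN")) return 2;
  return 0;
}
```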
2. Exit code 5 is documented but never emitted · docs/conformance-testing.md:469
The exit codes table lists:
| 5 | Internal error | Unexpected error in test runner |
There is no process.exit(5) anywhere in the script. An unhandled exception falls through to Node's default exit code 1 — not 5. Either remove exit code 5 from the table, or add a top-level process.on('uncaughtException', ...) handler that writes to stderr and exits with 5.
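The handler option is a few lines; this sketch assumes a top-level registration near the start of the script:

```javascript
// Route unexpected errors to stderr and the documented exit code 5,
// instead of Node's default exit code 1 for unhandled exceptions.
process.on("uncaughtException", (err) => {
  process.stderr.write(`internal error: ${err.stack ?? String(err)}\n`);
  process.exit(5);
});
```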
Documentation Gaps
3. Section 2.2 "Test Categories Required" omits "Level 2 Features" · docs/conformance-testing.md:98-101
Section 2.2 lists only two optional test categories for Level 2:
- Constraint Documentation (SHOULD PASS)
- Semantic Evaluation (SHOULD PASS)
But the runner always adds evaluateLevel2Capabilities (endpoint modes, field selection, batch) as a third SHOULD category at level ≥ 2 (line 665), and section 4.2 correctly lists it. Section 2.2 is incomplete — implementors reading only that section won't know their evidence bundle needs endpoint_modes, capabilities, and batch_example fields.
4. Endpoint mode naming inconsistency: "Semantic" vs "crude" · docs/conformance-testing.md:90, scripts/run-conformance-tests.mjs:573
Section 2.2 requirements table says:
| Endpoint modes | Semantic endpoint mode and Single mode supported |
But evaluateLevel2Capabilities tests:
```js
endpointModes.includes("crude") && endpointModes.includes("single")
```

And the failure message reads: "Level 2 evidence must include both crude and single endpoint modes."
All four reference fixtures use "crude" as the value. An implementor writing their first evidence bundle who reads section 2.2 would likely write "semantic" and get a confusing failure. Either update section 2.2 to say "CRUDE endpoint mode and Single mode", or add a note that "crude" is the fixture value for what the spec calls the semantic endpoint mode.
5. Section 6.1 JSON example omits requestedLevel field · docs/conformance-testing.md:381-404
summarize() (line 647) always emits a requestedLevel field alongside conformanceLevel. The example in section 6.1 doesn't include requestedLevel, so anyone building a report consumer who follows the spec example will miss the field. A one-line addition to the example JSON resolves this.
Remaining from Pass 2 (Minor, no change)
6. verify-fixtures prints no totals line — still ends with the last per-fixture line, no "N/M fixtures passed." summary.
7. version string mismatch — "mcpaql-conformance-fixtures 0.1.0" (line 877) vs package.json 1.0.0-alpha.1 with no explanatory comment.
8. /at\s+\S+/i pattern comment missing — would flag "available at docs.example.com", worth a one-line comment noting the intentional trade-off.
9. Empty round_trip_tests/update_preservation_tests trivially PASS — reference-failure.json has [] for both arrays, so Round-Trip Integrity PASSes in a fixture designed to demonstrate failure. A note in the fixture (or a SKIP result when arrays are empty) would improve diagnostic clarity.
Summary
The runner is solid and the two previous passes cleaned up the substantive bugs. What remains:
| Priority | Item |
|---|---|
| Bug | --category exit code doesn't reflect filtered output |
| Bug | Exit code 5 documented, never emitted |
| Docs | Section 2.2 missing "Level 2 Features" test category |
| Docs | "Semantic" vs "crude" endpoint mode naming inconsistency |
| Docs | Section 6.1 JSON example missing requestedLevel field |
| Minor | verify-fixtures no totals line (pass 2, still open) |
| Minor | version string / package.json mismatch (pass 2, still open) |
| Minor | /at\s+\S+/i comment missing (pass 2, still open) |
| Design | Empty round-trip arrays trivially PASS (pass 2, still open) |
Items 1–5 are quick fixes. Items 6–9 can be addressed as follow-on cleanup.
Force-pushed 93c322a to a1f46e0
Summary
- `test`, `verify-fixtures`, `report`, and `version` commands
- `npm test`, CI, and the conformance/generator docs

Related
Testing
- `npm test`
- `npm run docs:dates:check`