Feat/validation evals #32
Merged
ssmrmmk approved these changes on May 21, 2026
What does this PR do?
Adds a new Preview page to the demo that showcases a real end-to-end MDMA product flow (insurance-claim intake) with live validation + LLM auto-fix on every assistant turn. Reworks the underlying agent into a two-agent architecture: a conversation agent that only talks (no MDMA in its visible text), plus an author sub-agent (same model + provider) that produces the MDMA via the generate_mdma tool. The change is opt-in via useAgent({ useAuthorSubAgent: true }) and now powers both the Preview and the Agent Chat views.

Along the way: the validator API is split into per-block validate() and multi-message validateConversation(); form.onSubmit is now required; the action-references rule is dropped; the conversation-judge prompt was promoted out of mdma-fixer/; and many new model-specific fixer/author/agent-tool prompt variants land (gpt-5.x family, Claude Opus/Sonnet/Haiku, Gemini 2.5/3.x, Grok 4.x).

The fixer prompt now ships with model-tailored variants across every major family: OpenAI's full gpt-5.x lineup (5, 5-mini, 5-nano, 5.1, 5.2, 5.4, 5.4-mini, 5.4-nano, 5.5) plus gpt-4.1/-mini/-nano, all four Anthropic Claude tiers (Opus 4.6/4.7, Sonnet, Haiku), the Gemini 2.5 + 3.x families (Pro, Flash, Flash-Lite, plus the customtools Pro variant), and xAI Grok 4.20/4.3. Each variant composes from a shared MDMA_FIXER_* base plus vendor-local guards we discovered during eval runs (no-leading-separator, preserve-input-structure, table-key-direction, replace-all-placeholders, etc.), so the same validate() → LLM fixer → re-validate loop hits ≥14/15 single-block fix tests on every supported model.

Reasoning-leak suppression for Gemini Pro and Grok 4.3 is handled at the provider layer via an OpenRouter reasoning.exclude passthrough in the eval config rather than per-prompt, keeping the fixer prompts themselves clean.
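The validate() → LLM fixer → re-validate loop mentioned above can be sketched roughly as follows. This is a minimal illustration, not the package's actual API: the Issue shape, the fixLoop helper, the form-on-submit rule name, and the YAML-like block text are all hypothetical, and the fixer is modelled as a synchronous stub where a real LLM call would be async.

```typescript
// Hypothetical shapes: the real @mobile-reality/mdma-validator API may differ.
interface Issue {
  rule: string;
  message: string;
}

type Validate = (doc: string) => Issue[];
type Fix = (doc: string, issues: Issue[]) => string;

interface FixResult {
  doc: string;
  issues: Issue[]; // issues remaining after the last validation
  attempts: number; // number of fixer calls made
}

// Validate, hand any issues to the fixer, then re-validate,
// stopping when the block is clean or maxAttempts fixer calls were made.
function fixLoop(doc: string, validate: Validate, fix: Fix, maxAttempts = 2): FixResult {
  let current = doc;
  let issues = validate(current);
  let attempts = 0;
  while (issues.length > 0 && attempts < maxAttempts) {
    current = fix(current, issues);
    attempts += 1;
    issues = validate(current);
  }
  return { doc: current, issues, attempts };
}

// Stand-in validator: flags a form block that lacks an onSubmit key.
const stubValidate: Validate = (doc) =>
  doc.includes("type: form") && !doc.includes("onSubmit:")
    ? [{ rule: "form-on-submit", message: "form.onSubmit is required" }]
    : [];

// Stand-in for the LLM fixer: appends a placeholder onSubmit label.
const stubFix: Fix = (doc) => `${doc}\nonSubmit: submit-claim`;

const result = fixLoop("type: form\nid: personal-info-form", stubValidate, stubFix);
```

Capping the number of fixer calls keeps a block that the model cannot repair from looping forever; the caller can inspect the remaining issues and decide how to surface them.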
The repo now ships a ## Best Practices section covering custom-prompt design lessons learned across the eval matrix — concrete advice on when a flow needs explicit step boundaries, how to scope action labels as opaque handlers (don't reference back into the document), and why "one interactive component per assistant turn" is enforced. It also documents the two-agent architecture used by the Preview view as the recommended pattern for real product flows: keep MDMA generation strictly behind the generate_mdma tool and let a sub-agent (same model + provider, author prompt as system) own the document, so the conversation agent's visible text stays plain prose. Both surfaces — the README and the demo's docs view (CustomPromptBestPractices.tsx) — render the same guidance so external readers and in-app explorers see consistent recommendations.
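The two-agent split described above might look something like the sketch below. Everything here is an illustrative assumption rather than the demo's code: the Model type, the converse and generateMdma helpers, the prompt strings, and the needsForm flag (which stands in for the conversation agent deciding to call the generate_mdma tool) are hypothetical.

```typescript
// Hypothetical types; the demo's useAgent hook and tool wiring may look different.
type Model = (system: string, user: string) => string; // stand-in for an LLM call

interface Turn {
  text: string; // plain prose shown to the user
  mdma: string | null; // document produced behind the generate_mdma tool, if any
}

const CONVERSATION_PROMPT = "You talk to the user in plain prose."; // placeholder
const AUTHOR_PROMPT = "You write MDMA documents."; // placeholder author system prompt

// Author sub-agent: same model, author prompt as system, owns the document.
function generateMdma(model: Model, brief: string): string {
  return model(AUTHOR_PROMPT, brief);
}

// Conversation agent: replies in prose and delegates any document to the tool.
function converse(model: Model, userMessage: string, needsForm: boolean): Turn {
  const text = model(CONVERSATION_PROMPT, userMessage);
  const mdma = needsForm ? generateMdma(model, userMessage) : null;
  return { text, mdma };
}

// Fake model so the sketch runs without a provider.
const fakeModel: Model = (system, user) =>
  system === AUTHOR_PROMPT ? `type: form\n# brief: ${user}` : "Sure, I can help with that.";

const turn = converse(fakeModel, "I want to file a claim", true);
```

Because the document only ever comes back through the tool path, the visible text and the MDMA can be validated and rendered separately.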
Type of Change
Breaking notes:
- @mobile-reality/mdma-validator: validateFlow → validateConversation (rename); action-references rule removed.
- @mobile-reality/mdma-spec: form.onSubmit is now required.
- @mobile-reality/mdma-prompt-pack: MDMA_FIXER_CONVERSATION_JUDGE → MDMA_CONVERSATION_JUDGE (rename + relocated).

Packages Affected
- @mobile-reality/mdma-spec: form.onSubmit required; action-label fields documented as opaque labels.
- @mobile-reality/mdma-parser: fixtures only.
- @mobile-reality/mdma-runtime
- @mobile-reality/mdma-attachables-core: test fixture update for required onSubmit.
- @mobile-reality/mdma-renderer-react
- @mobile-reality/mdma-validator: validateFlow → validateConversation; action-references rule removed; per-block vs multi-message split.
- @mobile-reality/mdma-prompt-pack: many model variants; conversation-judge promoted; new sub-agent-friendly composition.
- @mobile-reality/mdma-cli: test fixture update for required onSubmit.

Checklist
- … (pnpm format and pnpm lint pass).
- … (pnpm test).
- … (pnpm typecheck).
- … (pnpm changeset) for the three affected published packages.
- … sensitive: true where appropriate (IBAN in the insurance flow).

How to Test
- pnpm install && pnpm build
- pnpm demo and open /preview in the browser.
- … (personal-info-form) in the live preview pane on the right.
- … claim-submitted-callout should render with the polished Preview-specific styling.

Screenshots / Examples
Example flow definition driving the Preview (excerpt from
demo/src/preview/insurance-flow-prompt.ts):