Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection#13
Merged
kevinngo1304 merged 18 commits into main from improve-personas-and-judge on May 6, 2026
Conversation
…recency

Replace `history.screenshots(n_last=3)` with `_select_key_screenshots()`, which scores each agent step by the actions it performed:

- +10 for the final step (always most important)
- +3 for high-signal actions: navigate, input_text, done, select_dropdown_option, upload_file, evaluate (JS mutation)
- +1 for mid-signal actions: clicks and other interactions
- +4 for steps that produced errors
- 0 for low-signal actions: scroll, refresh_dom_state, search_page, wait

The top-N steps are returned in chronological order so the judge sees the visual progression rather than just the end state. Falls back to the last screenshot when no steps score above zero.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
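The scoring rubric above can be sketched as follows. This is a hypothetical reimplementation for illustration only: the step and screenshot shapes are simplified stand-ins for the real agent-history objects, and the real `_select_key_screenshots()` may differ in detail.

```python
# Signal-strength sets taken from the commit message above.
HIGH_SIGNAL = {"navigate", "input_text", "done", "select_dropdown_option",
               "upload_file", "evaluate"}
LOW_SIGNAL = {"scroll", "refresh_dom_state", "search_page", "wait"}


def select_key_screenshots(steps, n=3):
    """Score each step by action signal strength and return the top-N
    screenshots in chronological order (steps are simplified dicts here)."""
    if not steps:
        return []
    scored = []
    for i, step in enumerate(steps):
        score = 10 if i == len(steps) - 1 else 0  # final step always matters
        for action in step["actions"]:
            if action in HIGH_SIGNAL:
                score += 3       # navigation, input, completion, JS mutation
            elif action not in LOW_SIGNAL:
                score += 1       # clicks and other mid-signal interactions
        if step.get("error"):
            score += 4           # errors are visually informative
        scored.append((score, i, step["screenshot"]))
    positive = [t for t in scored if t[0] > 0]
    if not positive:
        return [steps[-1]["screenshot"]]  # fallback: last screenshot only
    top = sorted(positive, key=lambda t: t[0], reverse=True)[:n]
    # Re-sort the winners by step index so the judge sees progression.
    return [shot for _, _, shot in sorted(top, key=lambda t: t[1])]
```

Because the final step always receives +10, the zero-score fallback mostly guards against degenerate histories.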
Each persona now produces 1-3 concrete, actionable feature/UX improvement suggestions grounded in what they observed during testing. Suggestions flow through the agent verdict, judge verdict, and into HTML reports and the executive summary.
…d personas

The discovered persona rendering (description, traits, execution hints) was computed but discarded — the prompt always called the predefined renderer, which returns near-empty output for discovered persona slugs.
Each discovered persona now carries a tailored suggestion_instruction generated during labeling, which is used in the execution prompt to produce persona-grounded feature suggestions instead of generic ones.
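The lookup described above can be sketched as a simple fallback chain. Everything here is illustrative: the dict contents, the function name, and the default string are assumptions, not the real prompts in `murphy/prompts.py`.

```python
# Hypothetical stand-in for _PERSONA_SUGGESTION_INSTRUCTIONS: tailored
# prompts for built-in personas (real prompts are much richer).
PERSONA_SUGGESTION_INSTRUCTIONS = {
    "classic_ui": "Suggest 1-3 changes that would make the layout feel "
                  "more familiar and less cluttered.",
    "modern_ui": "Suggest 1-3 changes that would modernize the visual style.",
}


def build_suggestion_instruction(persona_slug, discovered_instructions):
    """Prefer the predefined instruction; fall back to the per-cluster
    instruction generated during discovered-persona labeling; finally
    fall back to a generic prompt."""
    if persona_slug in PERSONA_SUGGESTION_INSTRUCTIONS:
        return PERSONA_SUGGESTION_INSTRUCTIONS[persona_slug]
    if persona_slug in discovered_instructions:
        return discovered_instructions[persona_slug]
    return "Suggest 1-3 concrete feature or UX improvements you observed."
```

The key point is the middle branch: discovered personas no longer fall through to the generic default, because labeling now supplies a tailored `suggestion_instruction` per cluster.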
The judge was duplicating the agent's feature suggestion work, resulting in ~6 suggestions per test instead of the intended 1-3. Remove the feature_suggestions prompt and field from the judge, use only the agent's suggestions, and make the report section collapsible.
Agent Task Evaluation Results: 2/2 (100%). View detailed results.
Check the evaluate-tasks job for detailed task execution logs.
isha-prosus
reviewed
Apr 22, 2026
Pydantic model guarantees these fields are never None, making the trailing `or ''` and `or []` unnecessary.
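The review comment above can be demonstrated with a minimal model (assuming Pydantic is available; the field names are illustrative, not the real model):

```python
from pydantic import BaseModel, Field


class Verdict(BaseModel):
    # Non-optional fields with defaults are validated to their declared
    # types and can never be None, so call sites do not need
    # `verdict.notes or ''` or `verdict.suggestions or []`.
    notes: str = ""
    suggestions: list[str] = Field(default_factory=list)


v = Verdict()
assert v.notes == "" and v.suggestions == []
```

Passing `notes=None` explicitly would raise a `ValidationError`, which is exactly why the trailing `or` guards are dead code.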
`boomer_ui` -> `classic_ui`, `genz_ui` -> `modern_ui`, `whitespace_police_ui` -> `layout_auditor_ui`
…e-file change

Trait names, judge questions, summary names, and the drift assertion all live in `models.py` now. `judge.py` and `prompts.py` derive their trait lists from `TraitVector` class vars instead of maintaining their own copies.
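The single-file-change pattern above can be sketched like this. The trait names and questions below are simplified placeholders, not the real Murphy definitions:

```python
from typing import ClassVar


class TraitVector:
    # Centralized trait classification: judge.py and prompts.py iterate
    # these tuples instead of keeping their own copies.
    CORE_LEVEL_TRAITS: ClassVar[tuple[str, ...]] = ("patience", "tech_savvy")
    DESIGN_LEVEL_TRAITS: ClassVar[tuple[str, ...]] = (
        "visual_density_preference", "aesthetic_era", "layout_strictness")


# Judge questions live next to the trait definitions...
TRAIT_JUDGE_QUESTIONS = {
    "patience": "Did the flow respect this patience level?",
    "tech_savvy": "Was the UI usable at this tech-savviness level?",
    "visual_density_preference": "Did density match the preference?",
    "aesthetic_era": "Did the styling match the preferred era?",
    "layout_strictness": "Were layout issues surfaced appropriately?",
}

# ...and a drift assertion fails fast at import time if a trait is added
# without a matching judge question (or vice versa).
assert set(TRAIT_JUDGE_QUESTIONS) == set(
    TraitVector.CORE_LEVEL_TRAITS + TraitVector.DESIGN_LEVEL_TRAITS)
```

With this shape, adding a trait means touching one module: the assertion catches any consumer-side drift immediately.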
isha-prosus
reviewed
Apr 22, 2026
Persona badges and grouping now gracefully handle personas not in the predefined list, using a default badge color and stable ordering (predefined first, then discovered alphabetically).
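The fallback-and-ordering behavior described above could look roughly like this (slugs, colors, and helper names are hypothetical):

```python
# Predefined personas in registry order; anything else is "discovered".
PREDEFINED = ["classic_ui", "modern_ui", "layout_auditor_ui"]
BADGE_COLORS = {"classic_ui": "#8e6cef", "modern_ui": "#2f81f7"}
DEFAULT_BADGE = "#6e7781"  # fallback color for discovered personas


def badge_color(slug):
    """Known personas get their registered color; discovered ones get
    a stable default instead of crashing the template."""
    return BADGE_COLORS.get(slug, DEFAULT_BADGE)


def sort_personas(slugs):
    """Stable ordering: predefined personas first (registry order),
    then discovered personas alphabetically."""
    order = {s: i for i, s in enumerate(PREDEFINED)}
    return sorted(slugs, key=lambda s: (s not in order, order.get(s, 0), s))
```

Sorting on a `(is_discovered, registry_index, slug)` tuple keeps report grouping deterministic across runs even as new personas are discovered.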
The judge was returning empty `trait_evaluations` because OpenAI strict mode sets `additionalProperties: false` on all objects, blocking dynamic keys in `dict[str, str]` fields. Switch `JudgeVerdict.trait_evaluations` to `list[TraitEvaluation]` (structured objects with `trait_name` + `assessment`) and enrich the discovered-persona judge context with per-trait evaluation questions derived from each dimension's low/high anchors.
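The schema change above can be sketched as follows (assuming Pydantic; field names match the commit message, everything else is illustrative). A `dict[str, str]` field needs dynamic keys, which strict-mode JSON schemas forbid, so the fix is a list of fixed-key objects plus a convenience property for consumers that still want a mapping:

```python
from pydantic import BaseModel, Field


class TraitEvaluation(BaseModel):
    # Fixed keys, so the generated JSON schema is strict-mode safe.
    trait_name: str
    assessment: str


class JudgeVerdict(BaseModel):
    trait_evaluations: list[TraitEvaluation] = Field(default_factory=list)

    @property
    def trait_evaluations_dict(self) -> dict[str, str]:
        """Mapping view for consumers of the old dict-shaped field."""
        return {e.trait_name: e.assessment for e in self.trait_evaluations}
```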
isha-prosus
approved these changes
May 6, 2026
PR: Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection
Branch: improve-personas-and-judge → main
17 commits | 14 files changed | +590 / −108 lines
Summary
This PR adds three major capabilities to Murphy's persona-driven testing pipeline:
- **New UI-focused test personas** — Three new built-in personas (`classic_ui`, `modern_ui`, `layout_auditor_ui`) that evaluate visual design quality rather than functional correctness, each with dedicated trait dimensions and judge criteria.
- **Per-persona feature suggestions** — Every persona now produces 1–3 concrete, actionable feature/UX improvement suggestions grounded in what it observed during testing. Suggestions flow through the entire pipeline: execution, judging, HTML/Markdown reports, and executive summary.
- **Smarter screenshot selection for the judge** — Screenshots sent to the judge are now selected by action signal strength (navigation, text input, errors, final state) instead of simple recency, so the judge sees the most informative visual progression.
What changed
New UI personas, `design` test type, and centralized trait metadata

- `murphy/models.py` — Added `classic_ui`, `modern_ui`, `layout_auditor_ui` to `TestPersona`. Extended `TraitVector` with three new fields (`visual_density_preference`, `aesthetic_era`, `layout_strictness`). Added `design` to `TestType`. Registered all three personas in `PERSONA_REGISTRY` with full trait vectors. Centralized trait classification on `TraitVector` (`CORE_LEVEL_TRAITS`, `DESIGN_LEVEL_TRAITS`, `_SUMMARY_NAMES`) so adding a trait is a single-file change. Moved `TRAIT_JUDGE_QUESTIONS` from `core/judge.py` into `models.py` with an assertion that keys stay in sync with the trait tuples. Changed `JudgeVerdict.trait_evaluations` from `dict[str, str]` to a `list[TraitEvaluation]` (structured model with `trait_name` + `assessment`) to avoid OpenAI strict-mode `additionalProperties: false` issues, with a `trait_evaluations_dict` property for consumers.
- `murphy/core/judge.py` — Removed the inline `TRAIT_JUDGE_QUESTIONS` dict (now imported from `models`). Added a `design` rule to `TEST_TYPE_RULES`. Replaced the hardcoded `trait_fields` dict in `build_judge_trait_context` with `traits.level_trait_items(test_type)`. Clarified the judge system prompt to explicitly describe the expected `trait_evaluations` format (list of `{trait_name, assessment}` objects).
- `murphy/prompts.py` — Added persona descriptions, distribution percentages, execution behavior instructions, and success criteria examples for all three UI personas. Rebalanced the persona distribution (total still 100%). Replaced inline trait rendering with `TraitVector.render_summary()` and `TraitVector.render_full()`.

Per-persona feature suggestions
- `murphy/models.py` — Added `feature_suggestions: list[str]` to `ScenarioExecutionVerdict` and `TestResult`.
- `murphy/prompts.py` — Added `_PERSONA_SUGGESTION_INSTRUCTIONS` dict with tailored suggestion prompts for every built-in persona, plus `_build_suggestion_instruction()`, which falls back to discovered persona instructions. Injected into `build_execution_prompt`.
- `murphy/personas/pipeline_models.py` — Added `suggestion_instruction` field to `PersonaDescription` and `Persona`.
- `murphy/personas/persona_labeling.py` — Updated the LLM labeling prompt to request a `suggestion_instruction` per cluster; wired it through `build_persona_result`.
- `murphy/personas/bridge.py` — Added `get_discovered_suggestion_instruction()` to look up discovered persona suggestions. Added `_trait_question_for_score()` to generate per-trait evaluation questions from dimension anchors for discovered personas. Refactored `build_discovered_judge_context` to emit "Per-trait evaluation questions" matching the predefined persona format, with explicit `trait_evaluations` format instructions.
- `murphy/core/execution.py` — Propagated `feature_suggestions` from the agent's verdict into `TestResult`. Switched to `trait_evaluations_dict` for the judge verdict conversion.
- `murphy/core/summary.py` — Aggregated all feature suggestions into the executive summary prompt so `recommended_actions` are informed by persona-grounded suggestions.
- `murphy/io/report_markdown.py` — Renders per-test suggestions in detail sections and an aggregated collapsible "Feature Suggestions" table in the report.
- `murphy/api/templates.py` — Renders feature suggestions in the HTML results view. Added support for dynamically discovered personas with fallback badge colors and stable ordering (predefined first, then discovered alphabetically). Fixed white text on persona badges.

Smarter screenshot selection
- `murphy/core/judge.py` — Added `_select_key_screenshots()`, which scores each agent step by action type signal strength (high: navigate, input_text, done, select_dropdown_option, upload_file, evaluate; low: scroll, refresh_dom_state, search_page, find_elements, switch_tab, wait) and error presence, then picks the top N most informative screenshots in chronological order. Replaced the old `history.screenshots(n_last=3)` call.

Tests
- `tests/murphy/personas/test_persona_labeling.py` — Updated mock data and assertions to cover `suggestion_instruction`.
- `tests/murphy/test_models.py` — Updated `test_judge_verdict_with_trait_evaluations` to use the new `list[TraitEvaluation]` format and verify `trait_evaluations_dict`.
- `tests/murphy/core/test_summary_extended.py` — Fixed trait evaluation value to use `'pass'`/`'fail'` format.

Housekeeping
- `CHANGELOG.md` — Documented all additions, changes, and fixes under `[1.1.0]`.