feat(experiments/mcp): Write diagnosing-experiment-results skill and Evals#59029
Conversation
|
🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down. Add the Most PRs don't need this. Real regressions still get caught on master and fix-forward. |
|
Size Change: -7.37 kB (-0.01%) Total Size: 118 MB 📦 View Changed
ℹ️ View Unchanged
|
There was a problem hiding this comment.
Large PR (3176 lines, 16 files) with zero reviews that exceeded the size gate ceiling. The changes span new eval test infrastructure, extensive skill documentation, and auto-generated API type updates across two products — needs a human reviewer to validate correctness and scope.
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
ee/hogai/eval/sandboxed/experiments/eval_midrun_changes.py:85-135
**Behavioral check bundled into `diagnosis_group` — violates the anti-bundling pattern established in `eval_interpretation_traps.py`**
The `ship_variant_under_uncertainty` case explicitly tests "should I ship?" and the case description states the correct answer is "not yet / not on this evidence." However, the behavioral check ("advises AGAINST") is embedded inside the `diagnosis_group` description text rather than enforced with a separate `AdvisesAgainstShipping` scorer + `"advises_against_shipping": True` in `expected`.
`eval_interpretation_traps.py` contains an explicit comment explaining exactly why this bundling is a regression vector:
> "Both are needed: the previous version of this case relied on `diagnosis_group` alone, where the LLM judge's 'must recommend waiting' criterion was bundled into the diagnosis description — splitting them separates content from behavior and makes regressions in either dimension visible independently."
The same failure mode applies here: the agent can identify the diagnostic correctly but still recommend shipping, and that behavioral regression would not be caught by `CitesDiagnosticGroup` alone. `AdvisesAgainstShipping` should be added to the scorers list, and `"advises_against_shipping": True` should be added to this case's `expected` dict — mirroring the pattern in `eval_interpretation_traps.py`.
Reviews (1): Last reviewed commit: "Fix CI" | Re-trigger Greptile |
Problem
Users frequently ask "why does my experiment look wrong / biased / empty?". There was no symptom-driven path from the user's complaint to the right diagnostic, and no eval coverage for this class of question.
Changes
diagnosing-experiment-resultswith a symptom-to-diagnostic-logic table and five reference files covering:How did you test this code?
Publish to changelog?
no
Docs update
No
🤖 Agent context
Authored with Claude Code across many different sessions