…or faithfulness a lot)
In _load_context, drop the _read_optional fallback that substituted "{}"
for a missing result.json. ConversationResult({}) would fail with a
cryptic 6-field Pydantic error 20 lines later; now we raise a clear
FileNotFoundError next to the existing scenario_db checks instead.
In _run_and_pipeline, narrow the broad except so that data-shape
problems (FileNotFoundError, json.JSONDecodeError) mark the record as
invalid (vr=None, classified not_finished) and the iteration continues,
while programming bugs (KeyError, TypeError, etc.) propagate loudly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
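The error-handling split described in this commit can be sketched as follows. This is a minimal illustration, not the project's code: `load_result` and `classify_trials` are hypothetical stand-ins for `_load_context` and the loop in `_run_and_pipeline`, and the `"finished"` label is an assumed placeholder; only the `not_finished` classification and the two caught exception types come from the commit message.

```python
import json
from pathlib import Path


def load_result(trial_dir: Path) -> dict:
    """Hypothetical stand-in for _load_context: no "{}" fallback."""
    result_path = trial_dir / "result.json"
    if not result_path.exists():
        # Fail with a clear message instead of a later validation error.
        raise FileNotFoundError(f"result.json not found in {trial_dir}")
    return json.loads(result_path.read_text())


def classify_trials(trial_dirs: list[Path]) -> dict[str, str]:
    """Hypothetical stand-in for the per-record loop in _run_and_pipeline."""
    statuses: dict[str, str] = {}
    for trial_dir in trial_dirs:
        try:
            load_result(trial_dir)
        except (FileNotFoundError, json.JSONDecodeError):
            # Data-shape problem: mark this record invalid, keep iterating.
            statuses[trial_dir.name] = "not_finished"
            continue
        # KeyError, TypeError, etc. are deliberately not caught here,
        # so programming bugs propagate to the caller.
        statuses[trial_dir.name] = "finished"  # assumed label
    return statuses
```

Catching only the two data-shape exceptions keeps one corrupt artifact from cancelling the whole iteration while still surfacing genuine bugs.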
When every per-turn latency falls outside the sanity range (0 < v < 1000s), e.g. a short conversation whose only assistant turn was a barge-in with a negative measured latency, return a skipped MetricScore (score=None, skipped=True) instead of erroring with "No valid response speeds computed". This matches turn_taking, which treats negative latencies as legitimate (its curve spans -500ms to 5000ms) and bypasses the latency path entirely for interrupted turns. response_speed is a diagnostic metric, so a degenerate input should be a quiet skip, not an error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
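A minimal sketch of the skip behaviour, under stated assumptions: the `MetricScore` dataclass below is a hypothetical two-field stand-in for the project's class, and averaging the valid latencies is a placeholder for whatever the real metric computes. Only the sanity range and the `score=None, skipped=True` result are taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricScore:
    """Hypothetical minimal stand-in for the project's MetricScore."""
    score: Optional[float]
    skipped: bool = False


MAX_LATENCY_S = 1000.0  # upper bound of the sanity range from the commit


def response_speed(latencies_s: list[float]) -> MetricScore:
    # Keep only latencies inside the sanity range 0 < v < 1000s.
    valid = [v for v in latencies_s if 0 < v < MAX_LATENCY_S]
    if not valid:
        # Degenerate input (e.g. a single barge-in with negative latency):
        # skip quietly instead of raising.
        return MetricScore(score=None, skipped=True)
    # Placeholder aggregation (assumption): mean of the valid latencies.
    return MetricScore(score=sum(valid) / len(valid))
```

For example, `response_speed([-0.2])` returns a skipped score rather than raising, while `response_speed([0.5, 1.5])` yields a numeric score.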
…tead of erroring
response_speed: the "missing audio timestamps" branch (empty latency_assistant_turns) now returns score=None, skipped=True for consistency with the all-filtered-out branch. Diagnostic metric, no reason to error on degenerate input.
turn_taking: the "no overlapping turn IDs" branch (e.g. agent never responded after greeting) now scores 0.0 instead of erroring. When turn-taking can't happen, that's a real turn-taking failure, not missing data, so the score should reflect that.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
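The turn_taking change can be illustrated with a small sketch. Everything here is hypothetical apart from the rule itself (no overlapping turn IDs scores 0.0): the function name, the dict-of-timestamps inputs, and the per-turn scoring are assumptions. In particular, the in-range check is a crude placeholder for the real scoring curve, borrowing only the -500ms to 5000ms span quoted in the earlier commit.

```python
def turn_taking_score(
    user_end_s: dict[int, float],
    assistant_start_s: dict[int, float],
) -> float:
    """Hypothetical sketch: inputs map turn ID -> audio timestamp in seconds."""
    # Only turns with both user and assistant audio can be scored.
    shared_ids = sorted(user_end_s.keys() & assistant_start_s.keys())
    if not shared_ids:
        # The agent never took a turn: a real failure, not missing data.
        return 0.0
    per_turn = []
    for turn_id in shared_ids:
        latency = assistant_start_s[turn_id] - user_end_s[turn_id]
        # Placeholder scoring (assumption): 1.0 inside the curve's span,
        # 0.0 outside. Negative latencies are treated as legitimate.
        per_turn.append(1.0 if -0.5 <= latency <= 5.0 else 0.0)
    return sum(per_turn) / len(per_turn)
```

With this shape, a conversation where the assistant has no turns returns 0.0 instead of raising.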
JosephMarinier approved these changes Apr 28, 2026
Summary
Fixes for bugs hit while debugging a stuck rerun loop on a remote benchmark job.
1. Judge response parsing (`8e6342b`)

`parse_judge_response` used `extract_and_load_json` (first match wins). For metrics like `faithfulness`, judge responses often contain incidental JSON fragments in prose (e.g. `[]` from tool-call references), so the first object wasn't the real answer.

Now iterates every JSON value, flattens lists, and returns the largest dict, preferring ones with a top-level `rating`. Adds a unit test for the multi-object case.

2. Pipeline robustness on incomplete trial dirs (`3aa4073`)

`_load_context` substituted `"{}"` for a missing `result.json`, producing a cryptic Pydantic error. Now raises `FileNotFoundError` like the existing scenario_db checks.

`_run_and_pipeline`'s outer `except Exception: raise` was both hiding bugs and cancelling the whole iteration on a single corrupt artifact. Now catches `FileNotFoundError` / `json.JSONDecodeError` narrowly (record marked invalid, iteration continues); other exceptions still propagate.

3. `response_speed` skips on degenerate input (`d3dc008`, `9bb88e5`)

Both error branches (all latencies filtered out by the `0 < v < 1000s` sanity range, or `latency_assistant_turns` empty because the agent never produced an overlapping turn) now return `score=None, skipped=True`. `response_speed` is diagnostic; these cases are already flagged by `task_completion`, `conversation_progression`, `turn_taking`.

4. `turn_taking` scores 0 on no-overlap (`9bb88e5`)

When no turn has both user and assistant audio (agent didn't participate), this is a turn-taking failure, not missing data. Now scores `0.0` instead of erroring.

Test plan

`pytest tests/unit/metrics/{test_metrics_utils,test_response_speed,test_turn_taking}.py`
`pytest tests/unit/orchestrator tests/unit/metrics/test_runner.py tests/integration/test_evaluation_mode.py` (94 tests green)
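The judge-response selection described in item 1 might look roughly like the sketch below. It is not the code from the PR: `iter_json_values` and `pick_judge_answer` are hypothetical names, "largest" is assumed to mean the most top-level keys, and lists are flattened one level only. The scan relies on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Any, Iterator, Optional


def iter_json_values(text: str) -> Iterator[Any]:
    """Yield every JSON object or array embedded in free-form text."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Find the next character that could open an object or array.
        starts = [i for i in (text.find("{", pos), text.find("[", pos)) if i != -1]
        if not starts:
            return
        start = min(starts)
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here, keep scanning
            continue
        yield value
        pos = end


def pick_judge_answer(text: str) -> Optional[dict]:
    """Return the largest dict found, preferring ones with a top-level rating."""
    candidates: list[dict] = []
    for value in iter_json_values(text):
        items = value if isinstance(value, list) else [value]  # flatten one level
        candidates.extend(item for item in items if isinstance(item, dict))
    if not candidates:
        return None
    # Sort key: dicts with "rating" win; ties break on key count (assumption).
    return max(candidates, key=lambda d: ("rating" in d, len(d)))
```

A response such as `The tool returned [] and {"id": 1}. Final: {"rating": 3, "explanation": "ok"}` would resolve to the dict containing `rating`, whereas a first-match parser would stop at `[]`.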