Fix metric bugs #91

Merged
gabegma merged 4 commits into main from ggm/fix-metric-bugs
Apr 28, 2026
Conversation

@gabegma gabegma commented Apr 28, 2026

Summary

Bug fixes hit while debugging a stuck rerun loop on a remote benchmark job.

1. Judge response parsing — 8e6342b

parse_judge_response used extract_and_load_json (first match wins). For metrics like faithfulness, judge responses often contain incidental JSON fragments in prose (e.g. [] from tool-call references), so the first object wasn't the real answer.

Now iterates every JSON value, flattens lists, and returns the largest dict — preferring ones with a top-level rating. Adds a unit test for the multi-object case.
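A minimal sketch of the new selection logic. parse_judge_response and the rating preference are from this PR; the _iter_json_values helper, its scanning approach, and the exact tie-break are assumptions about the implementation:

```python
import json
import re

_JSON_START = re.compile(r"[\[{]")

def _iter_json_values(text: str):
    """Yield every decodable JSON value in the text, not just the first."""
    decoder = json.JSONDecoder()
    pos = 0
    while (match := _JSON_START.search(text, pos)) is not None:
        try:
            value, end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            pos = match.start() + 1
        else:
            yield value
            pos = end

def parse_judge_response(text: str) -> dict | None:
    # Flatten lists so a top-level [...] contributes its member dicts.
    candidates = []
    for value in _iter_json_values(text):
        items = value if isinstance(value, list) else [value]
        candidates.extend(item for item in items if isinstance(item, dict))
    if not candidates:
        return None
    # Prefer dicts carrying a top-level "rating"; break ties by size.
    return max(candidates, key=lambda d: ("rating" in d, len(d)))
```

On a judge reply like `Sources: [] ... {"rating": 4, "justification": "..."}`, the old first-match behavior returned the empty list; this returns the rating dict.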

2. Pipeline robustness on incomplete trial dirs — 3aa4073

  • _load_context substituted "{}" for a missing result.json, producing a cryptic Pydantic error. Now raises FileNotFoundError like the existing scenario_db checks.
  • _run_and_pipeline's outer except Exception: raise was both hiding bugs and cancelling the whole iteration on a single corrupt artifact. Now catches FileNotFoundError / json.JSONDecodeError narrowly (record marked invalid, iteration continues); other exceptions still propagate. See the control-flow sketch after this list.
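A control-flow sketch of both changes, under assumptions: the real record handling (vr=None, classified not_finished) and the ConversationResult construction are elided, and the signatures here are illustrative, not the actual ones:

```python
import json
from pathlib import Path

def _load_context(trial_dir: Path) -> dict:
    result_path = trial_dir / "result.json"
    if not result_path.exists():
        # Fail loudly here instead of substituting "{}", which used to
        # surface 20 lines later as a cryptic 6-field Pydantic error.
        raise FileNotFoundError(f"missing artifact: {result_path}")
    return json.loads(result_path.read_text())

def _run_and_pipeline(trial_dirs: list[Path]) -> list[dict | None]:
    records: list[dict | None] = []
    for trial_dir in trial_dirs:
        try:
            records.append(_load_context(trial_dir))
        except (FileNotFoundError, json.JSONDecodeError):
            # Data-shape problem in one trial: mark the record invalid
            # and keep iterating over the remaining trials.
            records.append(None)
        # Anything else (KeyError, TypeError, ...) propagates loudly.
    return records
```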

3. response_speed skips on degenerate input — d3dc008, 9bb88e5

Both error branches (all latencies filtered out by the 0 < v < 1000s sanity range, or latency_assistant_turns empty because the agent never produced an overlapping turn) now return score=None, skipped=True. response_speed is diagnostic — these cases are already flagged by task_completion, conversation_progression, turn_taking.
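Roughly, with an illustrative MetricScore shape (the real scoring curve is not shown in this PR):

```python
from dataclasses import dataclass

@dataclass
class MetricScore:
    score: float | None
    skipped: bool = False

MAX_LATENCY_S = 1000.0  # upper bound of the sanity range from the PR

def response_speed(latencies: list[float]) -> MetricScore:
    valid = [v for v in latencies if 0 < v < MAX_LATENCY_S]
    if not valid:
        # Covers both degenerate branches: no overlapping assistant
        # turns measured at all, or every latency filtered out.
        return MetricScore(score=None, skipped=True)
    raise NotImplementedError("latency-to-score curve not shown in this PR")
```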

4. turn_taking scores 0 on no-overlap — 9bb88e5

When no turn has both user and assistant audio (agent didn't participate), this is a turn-taking failure, not missing data. Now scores 0.0 instead of erroring.
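The contrast with the response_speed branches, using the same illustrative MetricScore shape as above (again a sketch, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class MetricScore:
    score: float | None
    skipped: bool = False

def turn_taking(overlapping_turn_ids: set[str]) -> MetricScore:
    if not overlapping_turn_ids:
        # No turn has both user and assistant audio: the agent never
        # participated, a genuine turn-taking failure, not missing data.
        return MetricScore(score=0.0)
    raise NotImplementedError("per-turn scoring curve not shown in this PR")
```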

Test plan

  • pytest tests/unit/metrics/{test_metrics_utils,test_response_speed,test_turn_taking}.py
  • pytest tests/unit/orchestrator tests/unit/metrics/test_runner.py tests/integration/test_evaluation_mode.py — 94 tests green

@gabegma gabegma self-assigned this Apr 28, 2026
gabegma and others added 3 commits April 28, 2026 12:11
In _load_context, drop the _read_optional fallback that substituted "{}"
for a missing result.json. ConversationResult({}) would fail with a
cryptic 6-field Pydantic error 20 lines later; now we raise a clear
FileNotFoundError next to the existing scenario_db checks instead.

In _run_and_pipeline, narrow the broad except so that data-shape
problems (FileNotFoundError, json.JSONDecodeError) mark the record as
invalid (vr=None, classified not_finished) and the iteration continues,
while programming bugs (KeyError, TypeError, etc.) propagate loudly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When every per-turn latency falls outside the sanity range (0 < v < 1000s)
— e.g. a short conversation whose only assistant turn was a barge-in with
a negative measured latency — return a skipped MetricScore (score=None,
skipped=True) instead of erroring with "No valid response speeds computed".

This matches turn_taking, which treats negative latencies as legitimate
(its curve spans -500ms to 5000ms) and bypasses the latency path entirely
for interrupted turns. response_speed is a diagnostic metric, so a
degenerate input should be a quiet skip, not an error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tead of erroring

response_speed: the "missing audio timestamps" branch (empty
latency_assistant_turns) now returns score=None, skipped=True for
consistency with the all-filtered-out branch. Diagnostic metric, no
reason to error on degenerate input.

turn_taking: the "no overlapping turn IDs" branch (e.g. agent never
responded after greeting) now scores 0.0 instead of erroring. When
turn-taking can't happen, that's a real turn-taking failure, not
missing data — score should reflect that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gabegma gabegma marked this pull request as ready for review April 28, 2026 16:26
Comment threads: src/eva/metrics/utils.py, tests/unit/metrics/test_metrics_utils.py
@gabegma gabegma added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit eca7384 Apr 28, 2026
1 check passed
@gabegma gabegma deleted the ggm/fix-metric-bugs branch April 28, 2026 20:28