Fix metric bugs by gabegma · Pull Request #91 · ServiceNow/eva

gabegma · 2026-04-28T15:57:47Z

Summary

Bug fixes hit while debugging a stuck rerun loop on a remote benchmark job.

1. Judge response parsing — `8e6342b`

parse_judge_response used extract_and_load_json (first match wins). For metrics like faithfulness, judge responses often contain incidental JSON fragments in prose (e.g. [] from tool-call references), so the first object wasn't the real answer.

Now iterates every JSON value, flattens lists, and returns the largest dict — preferring ones with a top-level rating. Adds a unit test for the multi-object case.
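The selection rule described above can be sketched roughly as follows. This is an illustration, not the code from `8e6342b`: the helper names `iter_json_values` and `pick_judge_dict` are hypothetical, and "largest" is assumed here to mean "most top-level keys".

```python
import json
from typing import Any, Iterator, Optional


def iter_json_values(text: str) -> Iterator[Any]:
    """Yield every JSON object or array that can be decoded from `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Jump to the next character that could open an object or array.
        starts = [i for i in (text.find("{", pos), text.find("[", pos)) if i != -1]
        if not starts:
            return
        start = min(starts)
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON at this position; keep scanning.
            pos = start + 1
            continue
        yield value
        pos = end


def pick_judge_dict(text: str) -> Optional[dict]:
    """Return the best dict candidate, or None if the text contains no dict."""
    candidates = []
    for value in iter_json_values(text):
        # Flatten one level of lists so [{"rating": 1}] is also considered.
        items = value if isinstance(value, list) else [value]
        candidates.extend(item for item in items if isinstance(item, dict))
    if not candidates:
        return None
    # Assumption: prefer dicts with a top-level "rating", then more keys.
    return max(candidates, key=lambda d: ("rating" in d, len(d)))
```

With a response such as `Tool calls: [] ... {"rating": 2, "explanation": "ok"}`, the stray `[]` is decoded but discarded because it contains no dict.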

2. Pipeline robustness on incomplete trial dirs — `3aa4073`

_load_context substituted "{}" for a missing result.json, producing a cryptic Pydantic error. Now raises FileNotFoundError like the existing scenario_db checks.
_run_and_pipeline's outer except Exception: raise was both hiding bugs and cancelling the whole iteration on a single corrupt artifact. Now catches FileNotFoundError / json.JSONDecodeError narrowly (record marked invalid, iteration continues); other exceptions still propagate.
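A minimal sketch of the narrowed error handling, under stated assumptions: `load_result` and `run_iteration` are hypothetical stand-ins for `_load_context` and `_run_and_pipeline`, and an invalid record is represented simply as `None`.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Optional


def load_result(trial_dir: Path) -> dict:
    """Load result.json, raising a clear error when the file is missing."""
    result_path = trial_dir / "result.json"
    if not result_path.exists():
        raise FileNotFoundError(f"Missing result.json in {trial_dir}")
    return json.loads(result_path.read_text())


def run_iteration(trial_dirs: Iterable[Path]) -> Dict[str, Optional[dict]]:
    """Process every trial dir; data-shape problems invalidate one record only."""
    outcomes: Dict[str, Optional[dict]] = {}
    for trial_dir in trial_dirs:
        try:
            outcomes[str(trial_dir)] = load_result(trial_dir)
        except (FileNotFoundError, json.JSONDecodeError):
            # Missing or corrupt artifact: mark this record invalid, keep going.
            outcomes[str(trial_dir)] = None
        # Anything else (KeyError, TypeError, ...) is left to propagate.
    return outcomes
```

Catching only the two data-shape exceptions keeps one corrupt artifact from cancelling the iteration while still letting programming errors surface.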

3. `response_speed` skips on degenerate input — `d3dc008`, `9bb88e5`

Both error branches (all latencies filtered out by the 0 < v < 1000s sanity range, or latency_assistant_turns empty because the agent never produced an overlapping turn) now return score=None, skipped=True. response_speed is diagnostic — these cases are already flagged by task_completion, conversation_progression, turn_taking.
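The skip behaviour might look like the sketch below. The `MetricScore` dataclass is a hypothetical stand-in for the project's own type, and the mean latency is only a placeholder for whatever score the real metric computes; the part taken from the description is the `0 < v < 1000` filter and the `score=None, skipped=True` return.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MetricScore:
    """Hypothetical stand-in for the project's metric result type."""
    score: Optional[float]
    skipped: bool = False


def response_speed(latencies_s: Sequence[float]) -> MetricScore:
    # Keep only latencies inside the sanity range (seconds).
    valid = [v for v in latencies_s if 0 < v < 1000]
    if not valid:
        # Covers both degenerate cases: no latencies at all, or all filtered out.
        return MetricScore(score=None, skipped=True)
    # Placeholder score: mean latency (an assumption, not the real formula).
    return MetricScore(score=sum(valid) / len(valid))
```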

4. `turn_taking` scores 0 on no-overlap — `9bb88e5`

When no turn has both user and assistant audio (agent didn't participate), this is a turn-taking failure, not missing data. Now scores 0.0 instead of erroring.
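As a rough illustration of the no-overlap branch, the guard below returns 0.0 when no turn ID has both user and assistant audio. The function name and the use of plain turn-ID collections are assumptions; the real metric's scoring for the overlapping case is not shown.

```python
from typing import Iterable, Optional


def no_overlap_score(
    user_turn_ids: Iterable[int], assistant_turn_ids: Iterable[int]
) -> Optional[float]:
    """Return 0.0 when no turn has both speakers, else None (score normally)."""
    overlapping = set(user_turn_ids) & set(assistant_turn_ids)
    if not overlapping:
        # The agent never took a turn alongside the user: a real failure.
        return 0.0
    return None
```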

Test plan

pytest tests/unit/metrics/{test_metrics_utils,test_response_speed,test_turn_taking}.py
pytest tests/unit/orchestrator tests/unit/metrics/test_runner.py tests/integration/test_evaluation_mode.py — 94 tests green

In _load_context, drop the _read_optional fallback that substituted "{}" for a missing result.json. ConversationResult({}) would fail with a cryptic 6-field Pydantic error 20 lines later; now we raise a clear FileNotFoundError next to the existing scenario_db checks instead.

In _run_and_pipeline, narrow the broad except so that data-shape problems (FileNotFoundError, json.JSONDecodeError) mark the record as invalid (vr=None, classified not_finished) and the iteration continues, while programming bugs (KeyError, TypeError, etc.) propagate loudly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When every per-turn latency falls outside the sanity range (0 < v < 1000s), for example in a short conversation whose only assistant turn was a barge-in with a negative measured latency, return a skipped MetricScore (score=None, skipped=True) instead of erroring with "No valid response speeds computed".

This matches turn_taking, which treats negative latencies as legitimate (its curve spans -500ms to 5000ms) and bypasses the latency path entirely for interrupted turns. response_speed is a diagnostic metric, so a degenerate input should be a quiet skip, not an error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tead of erroring

response_speed: the "missing audio timestamps" branch (empty latency_assistant_turns) now returns score=None, skipped=True for consistency with the all-filtered-out branch. It is a diagnostic metric, so there is no reason to error on degenerate input.

turn_taking: the "no overlapping turn IDs" branch (e.g. agent never responded after greeting) now scores 0.0 instead of erroring. When turn-taking can't happen, that's a real turn-taking failure, not missing data, and the score should reflect that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix judge output parsing when multiple objects are present (happens f…

8e6342b

…or faithfulness a lot)

gabegma self-assigned this Apr 28, 2026

gabegma and others added 3 commits April 28, 2026 12:11

gabegma marked this pull request as ready for review April 28, 2026 16:26

JosephMarinier reviewed Apr 28, 2026

Comment thread src/eva/metrics/utils.py

JosephMarinier reviewed Apr 28, 2026

Comment thread tests/unit/metrics/test_metrics_utils.py

JosephMarinier approved these changes Apr 28, 2026

gabegma added this pull request to the merge queue Apr 28, 2026

Merged via the queue into main with commit eca7384 Apr 28, 2026
1 check passed

gabegma deleted the ggm/fix-metric-bugs branch April 28, 2026 20:28

gabegma merged 4 commits into main from ggm/fix-metric-bugs
