## Summary
Adds sub-metric breakdowns to existing metrics so results can be analyzed at a finer grain (per error type, per failure mode, per dimension, per entity type) without having to add new top-level metrics. Also threads a `higher_is_better` signal through the metric base class, aggregator, and analysis app so lower-is-better metrics are rendered correctly.

## Framework changes
- `BaseMetric.higher_is_better` class attribute. Defaults to `True`; override on the class (e.g., latency metrics) rather than per record.
- `eva.metrics.utils`:
  - `make_rate_sub_metric(...)` — rate-style sub-metric where `score == normalized_score == numerator / denominator`.
  - `build_binary_flag_sub_metrics(...)` — one sub-metric per dimension/corruption type with `score = 1.0` when the judge flagged it (the aggregated mean reads as an issue-occurrence rate).
  - `direction_for_sub_metric(...)` — derives direction from the key suffix: `_rate` → lower-is-better, `_accuracy` → higher-is-better; otherwise inherit the parent's direction.
- `PerTurnConversationJudgeMetric.build_sub_metrics(context, per_turn_ratings, per_turn_extra)` hook for subclasses to surface per-turn breakdowns.
- `MetricsRunner` now writes `higher_is_better` on every metric and sub-metric entry in the run-level aggregates, reading the parent direction from the registry.

## Per-metric sub-metrics added
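The helper behavior described above can be sketched as follows. This is an illustrative sketch only: the names follow the PR, but the signatures and the `SubMetric` container are assumptions, not the actual `eva.metrics.utils` API.

```python
from dataclasses import dataclass

@dataclass
class SubMetric:
    # Hypothetical record shape; the real framework's sub-metric type may differ.
    key: str
    score: float
    normalized_score: float
    higher_is_better: bool

def direction_for_sub_metric(key: str, parent_higher_is_better: bool) -> bool:
    """Derive direction from the key suffix; fall back to the parent's direction."""
    if key.endswith("_rate"):
        return False  # error/issue rates: lower is better
    if key.endswith("_accuracy"):
        return True
    return parent_higher_is_better

def make_rate_sub_metric(key: str, numerator: float, denominator: float,
                         parent_higher_is_better: bool = True) -> SubMetric:
    """Rate-style sub-metric where score == normalized_score == numerator / denominator."""
    rate = numerator / denominator if denominator else 0.0
    return SubMetric(key, rate, rate,
                     direction_for_sub_metric(key, parent_higher_is_better))

def build_binary_flag_sub_metrics(flags: dict[str, bool]) -> list[SubMetric]:
    """One sub-metric per dimension; score is 1.0 when flagged, so the
    aggregated mean over records reads as an issue-occurrence rate."""
    return [SubMetric(f"{name}_rate", float(flagged), float(flagged), False)
            for name, flagged in flags.items()]
```

The suffix convention keeps direction metadata out of individual records: a judge metric only emits `<dimension>_rate` keys, and the aggregator infers lower-is-better from the name.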
- `stt_wer`: `substitution_rate`, `deletion_rate`, `insertion_rate` (component / reference-word count).
- `tool_call_validity`: `num_tool_calls` (count, non-normalized) + one `*_rate` per error type in `CALL_ERROR_TYPES`.
- `conciseness_judge`: one `<failure_mode>_rate` per failure mode, over rated turns.
- `conversation_progression_judge`: one `<dimension>_rate` per dimension (binary flagged).
- `faithfulness_judge`: one `<dimension>_rate` per dimension (binary flagged).
- `transcription_accuracy_key_entities`: one `<entity_type>_accuracy` per entity type seen in the run.
- `user_behavioral_fidelity`: one `<corruption_type>_rate` per corruption type (binary detected).

## Analysis app
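For the `stt_wer` breakdown above, each component rate divides one edit-type count by the reference word count, so the three sub-metrics sum to the overall WER. A minimal sketch, assuming standard Levenshtein-style edit counts are already available (the function name is hypothetical):

```python
def wer_sub_metrics(substitutions: int, deletions: int, insertions: int,
                    num_reference_words: int) -> dict[str, float]:
    """Per-edit-type WER components over the reference word count.

    substitution_rate + deletion_rate + insertion_rate == overall WER.
    """
    n = max(num_reference_words, 1)  # guard against empty references
    return {
        "substitution_rate": substitutions / n,
        "deletion_rate": deletions / n,
        "insertion_rate": insertions / n,
    }
```

A run with 2 substitutions, 1 deletion, and 1 insertion over a 10-word reference would break a 0.4 WER into rates of 0.2, 0.1, and 0.1.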
- Lower-is-better metrics and sub-metrics render with a `↓` prefix (parent direction from the registry; sub-metric direction from the key suffix).
- `tool_call_validity__num_tool_calls` is plotted on the count axis (no hover suffix).

## Test plan
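The direction marker described above amounts to a small labeling rule; a minimal sketch of how such a label could be built (the function is hypothetical, not the app's actual rendering code):

```python
def display_label(name: str, higher_is_better: bool) -> str:
    """Prefix lower-is-better metric names with a down arrow for chart labels."""
    return name if higher_is_better else f"\u2193 {name}"
```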
- `pytest tests/unit/metrics` — unit tests added for each metric's sub-metric output (conciseness, conversation progression, faithfulness, STT WER, tool call validity, transcription accuracy key entities, user behavioral fidelity) and for the runner's sub-metric aggregation with `higher_is_better`.
- Analysis app tests verify the `↓` prefix on lower-is-better metrics and that `tool_call_validity__num_tool_calls` renders on the count axis.