Add submetrics for more metrics#65

Open
gabegma wants to merge 7 commits into main from ggm/add-sub-metrics
Conversation


@gabegma gabegma commented Apr 20, 2026

Summary

Adds sub-metric breakdowns to existing metrics so results can be analyzed at a finer grain (per error type, per failure mode, per dimension, per entity type) without having to add new top-level metrics. Also threads a higher_is_better signal through the metric base class, aggregator, and analysis app so lower-is-better metrics are rendered correctly.

Framework changes

  • BaseMetric.higher_is_better class attribute. Default True; override on the class (e.g., latency metrics) rather than per record.
  • New helpers in eva.metrics.utils:
    • make_rate_sub_metric(...) — rate-style sub-metric where score == normalized_score == numerator / denominator.
    • build_binary_flag_sub_metrics(...) — one sub-metric per dimension/corruption type with score = 1.0 when the judge flagged it (aggregated mean reads as an issue-occurrence rate).
    • direction_for_sub_metric(...) — derives direction from the key suffix: _rate → lower-is-better, _accuracy → higher-is-better, otherwise inherit the parent.
  • PerTurnConversationJudgeMetric.build_sub_metrics(context, per_turn_ratings, per_turn_extra) hook for subclasses to surface per-turn breakdowns.
  • MetricsRunner now writes higher_is_better on every metric and sub-metric entry in the run-level aggregates, reading the parent direction from the registry.
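As a rough sketch of how these pieces might fit together (the helper names come from the PR; the exact signatures and the `SubMetric` container shown here are assumptions, not the actual `eva` implementation):

```python
from dataclasses import dataclass


@dataclass
class SubMetric:
    # Hypothetical container; the real eva record types may differ.
    key: str
    score: float
    normalized_score: float
    higher_is_better: bool


def direction_for_sub_metric(key: str, parent_higher_is_better: bool) -> bool:
    # Derive direction from the key suffix; fall back to the parent metric.
    if key.endswith("_rate"):
        return False  # rates count problems, so lower is better
    if key.endswith("_accuracy"):
        return True
    return parent_higher_is_better


def make_rate_sub_metric(
    key: str, numerator: float, denominator: float,
    parent_higher_is_better: bool = True,
) -> SubMetric:
    # Rate-style sub-metric: score == normalized_score == numerator / denominator.
    rate = numerator / denominator if denominator else 0.0
    return SubMetric(
        key=key, score=rate, normalized_score=rate,
        higher_is_better=direction_for_sub_metric(key, parent_higher_is_better),
    )


def build_binary_flag_sub_metrics(
    flags: dict[str, bool], parent_higher_is_better: bool = True,
) -> list[SubMetric]:
    # One sub-metric per flagged dimension, with score 1.0 when the judge
    # flagged it, so the aggregated mean reads as an issue-occurrence rate.
    return [
        SubMetric(
            key=f"{name}_rate",
            score=1.0 if flagged else 0.0,
            normalized_score=1.0 if flagged else 0.0,
            higher_is_better=direction_for_sub_metric(
                f"{name}_rate", parent_higher_is_better
            ),
        )
        for name, flagged in flags.items()
    ]
```

Under this sketch, a judge that flags `{"hallucination": True}` on one turn yields a `hallucination_rate` sub-metric with score 1.0 and `higher_is_better=False`.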

Per-metric sub-metrics added

  • stt_wer: substitution_rate, deletion_rate, insertion_rate (component / reference-word count).
  • tool_call_validity: num_tool_calls (count, non-normalized) + one *_rate per error type in CALL_ERROR_TYPES.
  • conciseness_judge: one <failure_mode>_rate per failure mode, over rated turns.
  • conversation_progression_judge: one <dimension>_rate per dimension (binary flagged).
  • faithfulness_judge: one <dimension>_rate per dimension (binary flagged).
  • transcription_accuracy_key_entities: one <entity_type>_accuracy per entity type seen in the run.
  • user_behavioral_fidelity: one <corruption_type>_rate per corruption type (binary detected).
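For instance, the `stt_wer` breakdown could be computed from the substitution, deletion, and insertion counts over the reference word count, along these lines (a hypothetical sketch, not the PR's actual code):

```python
def wer_component_rates(
    substitutions: int, deletions: int, insertions: int,
    num_reference_words: int,
) -> dict[str, float]:
    # Each WER component is normalized by the reference word count, so the
    # three rates sum to the overall word error rate.
    if num_reference_words == 0:
        return {"substitution_rate": 0.0, "deletion_rate": 0.0, "insertion_rate": 0.0}
    return {
        "substitution_rate": substitutions / num_reference_words,
        "deletion_rate": deletions / num_reference_words,
        "insertion_rate": insertions / num_reference_words,
    }
```

With 2 substitutions, 1 deletion, and 1 insertion over 10 reference words, the components are 0.2, 0.1, and 0.1, summing to a WER of 0.4.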

Analysis app

  • Prefixes labels of lower-is-better metrics (parent direction from the registry; sub-metric direction from the key suffix).
  • Supports non-normalized sub-metrics like tool_call_validity__num_tool_calls (count axis, no hover suffix).
  • Bar ordering, query-param-driven state, and error surfacing in the analysis view.

Test plan

  • pytest tests/unit/metrics — unit tests added for each metric's sub-metric output (conciseness, conversation progression, faithfulness, STT WER, tool call validity, transcription accuracy key entities, user behavioral fidelity) and for the runner's sub-metric aggregation with higher_is_better.
  • Spot-check the analysis app against a recent run — confirm the label prefix on lower-is-better metrics and that tool_call_validity__num_tool_calls renders on the count axis.
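The runner-side aggregation test could take roughly this shape (a hypothetical sketch; the real tests live under tests/unit/metrics and exercise the actual MetricsRunner):

```python
def aggregate_sub_metrics(records: list[dict[str, float]]) -> dict[str, float]:
    # Mean-aggregate each sub-metric key across per-record entries, mirroring
    # how a runner might roll sub-metrics into run-level aggregates.
    keys = {k for record in records for k in record}
    return {
        k: sum(r.get(k, 0.0) for r in records) / len(records)
        for k in keys
    }


def test_binary_flags_aggregate_to_occurrence_rate():
    # Three rated turns, one flagged: the mean reads as a 1/3 occurrence rate.
    records = [
        {"hallucination_rate": 1.0},
        {"hallucination_rate": 0.0},
        {"hallucination_rate": 0.0},
    ]
    agg = aggregate_sub_metrics(records)
    assert abs(agg["hallucination_rate"] - 1 / 3) < 1e-9
```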

@gabegma gabegma self-assigned this Apr 20, 2026
@gabegma gabegma force-pushed the ggm/add-sub-metrics branch from feab1ff to 2eb4420 Compare April 23, 2026 00:10
@gabegma gabegma marked this pull request as ready for review April 23, 2026 00:14