## Summary
Adds sub-metric breakdowns to existing metrics so results can be analyzed at a finer grain (per error type, per failure mode, per dimension, per entity type) without having to add new top-level metrics. Also threads a `higher_is_better` signal through the metric base class, aggregator, and analysis app so lower-is-better metrics are rendered correctly.

## Framework changes
- `BaseMetric.higher_is_better` class attribute. Defaults to `True`; override on the class (e.g., latency metrics) rather than per record.
- `eva.metrics.utils`:
  - `make_rate_sub_metric(...)` — rate-style sub-metric where `score == normalized_score == numerator / denominator`.
  - `build_binary_flag_sub_metrics(...)` — one sub-metric per dimension/corruption type with `score = 1.0` when the judge flagged it (the aggregated mean reads as an issue-occurrence rate).
  - `direction_for_sub_metric(...)` — derives direction from the key suffix: `_rate` → lower-is-better, `_accuracy` → higher-is-better; otherwise inherit the parent's direction.
- `PerTurnConversationJudgeMetric.build_sub_metrics(context, per_turn_ratings, per_turn_extra)` hook for subclasses to surface per-turn breakdowns.
- `MetricsRunner` now writes `higher_is_better` on every metric and sub-metric entry in the run-level aggregates, reading the parent direction from the registry.

## Per-metric sub-metrics added
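The helper behavior described above can be sketched as follows. This is an illustrative sketch only: the names follow the PR, but the signatures and the `SubMetric` container are assumptions, not the actual `eva.metrics.utils` API.

```python
from dataclasses import dataclass

@dataclass
class SubMetric:
    # Hypothetical record shape; the real framework's sub-metric type may differ.
    key: str
    score: float
    normalized_score: float
    higher_is_better: bool

def direction_for_sub_metric(key: str, parent_higher_is_better: bool) -> bool:
    """Derive direction from the key suffix; fall back to the parent's direction."""
    if key.endswith("_rate"):
        return False  # error/issue rates: lower is better
    if key.endswith("_accuracy"):
        return True
    return parent_higher_is_better

def make_rate_sub_metric(key: str, numerator: float, denominator: float,
                         parent_higher_is_better: bool = True) -> SubMetric:
    """Rate-style sub-metric where score == normalized_score == numerator / denominator."""
    rate = numerator / denominator if denominator else 0.0
    return SubMetric(key, rate, rate,
                     direction_for_sub_metric(key, parent_higher_is_better))

def build_binary_flag_sub_metrics(flags: dict[str, bool]) -> list[SubMetric]:
    """One sub-metric per dimension; score is 1.0 when flagged, so the
    aggregated mean over records reads as an issue-occurrence rate."""
    return [SubMetric(f"{name}_rate", float(flagged), float(flagged), False)
            for name, flagged in flags.items()]
```

The suffix convention keeps direction metadata out of individual records: a judge metric only emits `<dimension>_rate` keys, and the aggregator infers lower-is-better from the name.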
- `stt_wer`: `substitution_rate`, `deletion_rate`, `insertion_rate` (component / reference-word count).
- `tool_call_validity`: `num_tool_calls` (count, non-normalized) + one `*_rate` per error type in `CALL_ERROR_TYPES`.
- `conciseness_judge`: one `<failure_mode>_rate` per failure mode, over rated turns.
- `conversation_progression_judge`: one `<dimension>_rate` per dimension (binary flagged).
- `faithfulness_judge`: one `<dimension>_rate` per dimension (binary flagged).
- `transcription_accuracy_key_entities`: one `<entity_type>_accuracy` per entity type seen in the run.
- `user_behavioral_fidelity`: one `<corruption_type>_rate` per corruption type (binary detected).

## Analysis app
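For the `stt_wer` breakdown above, each component rate divides one edit-type count by the reference word count, so the three sub-metrics sum to the overall WER. A minimal sketch, assuming standard Levenshtein-style edit counts are already available (the function name is hypothetical):

```python
def wer_sub_metrics(substitutions: int, deletions: int, insertions: int,
                    num_reference_words: int) -> dict[str, float]:
    """Per-edit-type WER components over the reference word count.

    substitution_rate + deletion_rate + insertion_rate == overall WER.
    """
    n = max(num_reference_words, 1)  # guard against empty references
    return {
        "substitution_rate": substitutions / n,
        "deletion_rate": deletions / n,
        "insertion_rate": insertions / n,
    }
```

A run with 2 substitutions, 1 deletion, and 1 insertion over a 10-word reference would break a 0.4 WER into rates of 0.2, 0.1, and 0.1.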
- Lower-is-better metrics and sub-metrics render with a `↓` prefix (parent direction from the registry; sub-metric direction from the key suffix).
- `tool_call_validity__num_tool_calls` is plotted on the count axis (no hover suffix).

## Test plan
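The direction marker described above amounts to a small labeling rule; a minimal sketch of how such a label could be built (the function is hypothetical, not the app's actual rendering code):

```python
def display_label(name: str, higher_is_better: bool) -> str:
    """Prefix lower-is-better metric names with a down arrow for chart labels."""
    return name if higher_is_better else f"\u2193 {name}"
```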
- `pytest tests/unit/metrics` — unit tests added for each metric's sub-metric output (conciseness, conversation progression, faithfulness, STT WER, tool call validity, transcription accuracy key entities, user behavioral fidelity) and for the runner's sub-metric aggregation with `higher_is_better`.
- Analysis app tests verify the `↓` prefix on lower-is-better metrics and that `tool_call_validity__num_tool_calls` renders on the count axis.