Open
gabegma
reviewed
Apr 23, 2026
### Evaluation Methodology

1. Compute `last_audio_speaker` as whichever side (`"user"` or `"assistant"`) has the latest audio end-timestamp across all turns. Returns `None` if neither side recorded audio.
2. Flag the record as a missed turn iff `conversation_ended_reason == "inactivity_timeout"` **and** `last_audio_speaker == "user"`.
Collaborator
Suggested change
2. Flag the record as a missed turn iff `conversation_ended_reason == "inactivity_timeout"` **and** `last_audio_speaker == "user"`.
2. Flag the record as a missed turn if `conversation_ended_reason == "inactivity_timeout"` **and** `last_audio_speaker == "user"`.
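The two methodology steps quoted in this thread can be sketched roughly as follows. This is only an illustration: the record layout (a `turns` list whose entries carry `speaker` and `audio_end` fields) is an assumption made for the example and is not taken from the PR.

```python
from typing import Optional


def last_audio_speaker(turns: list[dict]) -> Optional[str]:
    """Return the side ("user" or "assistant") whose audio ended last.

    Assumes each turn is a dict with hypothetical keys "speaker" and
    "audio_end" (a timestamp, or None when no audio was recorded).
    Returns None when no turn has an audio end-timestamp.
    """
    latest_end = None
    latest_side = None
    for turn in turns:
        end = turn.get("audio_end")
        if end is None:
            continue  # this turn recorded no audio
        if latest_end is None or end > latest_end:
            latest_end, latest_side = end, turn["speaker"]
    return latest_side


def is_missed_turn(record: dict) -> bool:
    """Flag a record when the conversation timed out while the user spoke last."""
    return (
        record.get("conversation_ended_reason") == "inactivity_timeout"
        and last_audio_speaker(record.get("turns", [])) == "user"
    )
```

A record that timed out after the assistant spoke last would not be flagged, since only the combination of both conditions counts as a missed turn.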
gabegma
reviewed
Apr 23, 2026
if ctx.conversation_finished:  # type: ignore[attr-defined]
    gate_passed.append(record_id)
    continue
if is_agent_timeout_on_user_turn(
Collaborator
I had in mind that we would modify the conversation_valid_end definition to include either the agent timeout or that the end tool call is properly called. Any reason for doing it manually here rather than in the metric directly?
gabegma
reviewed
Apr 23, 2026
config_data = json.loads(config_path.read_text())
# Backwards compat: remap any legacy metric names saved in an older config.json.
from eva.metrics.legacy_aliases import rename_metric_keys, rename_metric_list
Collaborator
Could we move the import to the top? I think we are importing 3 times now.
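As a rough illustration of the backwards-compat remapping discussed here, with imports kept at module top: the alias table and the helpers `remap_metric_keys` and `remap_metric_list` below are hypothetical stand-ins written for this sketch, not the actual contents of `eva.metrics.legacy_aliases`.

```python
import json

# Hypothetical alias table: legacy metric name -> current metric name.
LEGACY_METRIC_ALIASES = {"old_metric_name": "new_metric_name"}


def remap_metric_keys(values: dict) -> dict:
    """Return a copy of a dict with legacy metric names in its keys replaced."""
    return {LEGACY_METRIC_ALIASES.get(key, key): value for key, value in values.items()}


def remap_metric_list(names: list) -> list:
    """Return a copy of a list with legacy metric names replaced."""
    return [LEGACY_METRIC_ALIASES.get(name, name) for name in names]


# Example config text as it might have been saved by an older version (made up).
config_text = '{"metrics": ["old_metric_name", "latency"], "thresholds": {"old_metric_name": 1.0}}'
config_data = json.loads(config_text)
config_data["metrics"] = remap_metric_list(config_data["metrics"])
config_data["thresholds"] = remap_metric_keys(config_data["thresholds"])
```

Names without an alias pass through unchanged, so configs saved by the current version are unaffected.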
gabegma
reviewed
Apr 23, 2026
if ctx is None:
    not_finished.append(record_id)
    continue
if ctx.conversation_finished:  # type: ignore[attr-defined]
Collaborator
Could we rename this context variable to conversation_valid_end since I think this is what it now represents?
- Docs: 'iff' -> 'if' in conversation_correctly_finished.md
- Hoist legacy_aliases imports to module tops (eva.metrics.runner, eva.orchestrator.runner)
- ConversationValidEndMetric now scores 1.0 on agent_timeout_on_user_turn as well as goodbye
- Rename _ProcessorContext/MetricContext.conversation_finished -> conversation_valid_end and compute it as (goodbye OR agent_timeout_on_user_turn)
- Simplify ValidationRunner._classify to a single valid_end check; keep agent_timeout set for terminal flagging
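The revised definition in this commit could look roughly like the sketch below. `EndSignals` and the function names are hypothetical, introduced only for illustration; the 0.0 score for the non-valid case is also an assumption, since the commit message only states when the metric scores 1.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndSignals:
    """Hypothetical container for the two end-of-conversation signals."""

    goodbye: bool
    agent_timeout_on_user_turn: bool


def conversation_valid_end(signals: EndSignals) -> bool:
    """A valid end is a goodbye OR an agent timeout on a user turn."""
    return signals.goodbye or signals.agent_timeout_on_user_turn


def valid_end_score(signals: EndSignals) -> float:
    """Score 1.0 for a valid end; 0.0 otherwise (assumed, not stated in the PR)."""
    return 1.0 if conversation_valid_end(signals) else 0.0


def classify(signals: EndSignals) -> tuple[bool, bool]:
    """Return (valid_end, agent_timeout).

    A single valid_end check decides whether the record passes, while the
    agent-timeout flag is still reported separately for terminal flagging.
    """
    return conversation_valid_end(signals), signals.agent_timeout_on_user_turn
```

Keeping the timeout flag separate from the pass/fail check means a record can pass the gate and still be reported as an agent error.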
gabegma
approved these changes
Apr 24, 2026
Collaborator
gabegma
left a comment
LGTM!! Minor detail but we could update the doc for conversation_valid_end now that it includes agent time-out failures.
Tested on a run with perturbations. Before this change, 9/50 records would rerun due to "inactivity timeout"; with this change, all 9 were identified as agent errors and no reruns were needed based on the conversation finished check. The diagnostic metric shows those 9 failed on agent turn response; all others succeeded. I looked at a few of the flagged examples and listened to the audio, and it was as expected: the agent did not respond to a user turn that I could clearly hear.