
# Penalize no agent response #70

**Open** · tara-servicenow wants to merge 9 commits into `main` from `pr/tara/no_response`

Conversation

**tara-servicenow** (Collaborator) commented on Apr 21, 2026:

Tested on a run with perturbations: before this change, 9/50 records would rerun due to "inactivity timeout"; with this change, all 9 were identified as agent errors and no reruns were needed, based on the conversation-finished check. The diagnostic metric shows those 9 failed on the agent turn response, while all others succeeded. I spot-checked a few of the flagged examples and listened to the audio, and it was as expected: the agent did not respond to a user turn that I could clearly hear.

### Evaluation Methodology

1. Compute `last_audio_speaker` as whichever side (`"user"` or `"assistant"`) has the latest audio end-timestamp across all turns. Returns `None` if neither side recorded audio.
2. Flag the record as a missed turn iff `conversation_ended_reason == "inactivity_timeout"` **and** `last_audio_speaker == "user"`.
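
The two steps above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the turn fields (`speaker`, `audio_end_ts`) and function names are hypothetical stand-ins.

```python
from typing import Optional


def last_audio_speaker(turns: list[dict]) -> Optional[str]:
    """Return "user" or "assistant" for whichever side has the latest audio
    end-timestamp across all turns, or None if neither side recorded audio."""
    latest: Optional[tuple[float, str]] = None  # (end_ts, speaker)
    for turn in turns:
        end_ts = turn.get("audio_end_ts")  # hypothetical field name
        if end_ts is None:
            continue  # this turn has no recorded audio
        if latest is None or end_ts > latest[0]:
            latest = (end_ts, turn["speaker"])
    return latest[1] if latest is not None else None


def is_missed_turn(turns: list[dict], conversation_ended_reason: str) -> bool:
    # Flag only when the conversation timed out AND the user spoke last,
    # i.e. the agent never answered the final user turn.
    return (
        conversation_ended_reason == "inactivity_timeout"
        and last_audio_speaker(turns) == "user"
    )
```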
Review comment (Collaborator), with a suggested change ("iff" → "if"):

```diff
- 2. Flag the record as a missed turn iff `conversation_ended_reason == "inactivity_timeout"` **and** `last_audio_speaker == "user"`.
+ 2. Flag the record as a missed turn if `conversation_ended_reason == "inactivity_timeout"` **and** `last_audio_speaker == "user"`.
```

```python
if ctx.conversation_finished:  # type: ignore[attr-defined]
    gate_passed.append(record_id)
    continue
if is_agent_timeout_on_user_turn(
```
Review comment (Collaborator):
I had in mind that we would modify the `conversation_valid_end` definition to include either the agent timeout or the end tool being properly called. Any reason for doing it manually here rather than in the metric directly?

Comment thread: `src/eva/orchestrator/runner.py` (Outdated)

```python
config_data = json.loads(config_path.read_text())
# Backwards compat: remap any legacy metric names saved in an older config.json.
from eva.metrics.legacy_aliases import rename_metric_keys, rename_metric_list
```
Review comment (Collaborator):

Could we move the import to the top of the module? I think we are importing it in 3 places now.

```python
if ctx is None:
    not_finished.append(record_id)
    continue
if ctx.conversation_finished:  # type: ignore[attr-defined]
```
Review comment (Collaborator):
Could we rename this context variable to `conversation_valid_end`, since I think that is what it now represents?

- Docs: "iff" -> "if" in `conversation_correctly_finished.md`
- Hoist `legacy_aliases` imports to module tops (`eva.metrics.runner`, `eva.orchestrator.runner`)
- `ConversationValidEndMetric` now scores 1.0 on `agent_timeout_on_user_turn` as well as goodbye
- Rename `_ProcessorContext`/`MetricContext.conversation_finished` -> `conversation_valid_end` and compute it as (goodbye OR `agent_timeout_on_user_turn`)
- Simplify `ValidationRunner._classify` to a single `valid_end` check; keep the `agent_timeout` set for terminal flagging
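
The renamed flag in the last two bullets boils down to a single OR. A sketch under stated assumptions: `MetricContext` here is a stand-in dataclass, and `goodbye` / `agent_timeout_on_user_turn` are stand-ins for the PR's actual signals.

```python
from dataclasses import dataclass


@dataclass
class MetricContext:
    goodbye: bool                      # end tool was properly called
    agent_timeout_on_user_turn: bool   # agent never answered the last user turn

    @property
    def conversation_valid_end(self) -> bool:
        # A conversation ends validly on either a proper goodbye OR an
        # identified agent timeout, so timed-out records no longer rerun.
        return self.goodbye or self.agent_timeout_on_user_turn
```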
**gabegma** (Collaborator) left a comment:
LGTM!! Minor detail, but we could update the doc for `conversation_valid_end` now that it includes agent time-out failures.
