Improve faithfulness and conversation progression for S2S#93
Merged
tara-servicenow merged 3 commits into main · Apr 28, 2026
Conversation
tara-servicenow approved these changes · Apr 28, 2026
AdrienDS pushed a commit to labstr/eva that referenced this pull request · Apr 29, 2026
after 5f4bb76 in ServiceNow#93
AdrienDS pushed a commit to labstr/eva that referenced this pull request · Apr 29, 2026
I included a few small things before the default metrics. Let me know if anything is worth discussing in a separate PR.

* [Lock dependencies](ServiceNow@fd65e8a) after 5f4bb76 in ServiceNow#93
  - [Lock dependencies in pre-commit](ServiceNow@2d35a89), which is also run in the CI, so `uv.lock` never gets out of sync again.
  - [Get package version dynamically from `src/eva/__init__.py`](ServiceNow@940936e)
  - [Remove simulation and metrics versions from pyproject.toml](ServiceNow@0b194f9), as they are redundant with `src/eva/__init__.py` and have been out of sync since 5f4bb76.
* [Revert duplicated `_redact_api_keys()`](ServiceNow@747d71a) from ServiceNow#45, which duplicated the existing `_redact_model_params()`.
* [Remove remaining unreachable `or`s](ServiceNow@74832b9) left after [this change](ServiceNow@6d771db#diff-5b03da5a28117137f315375ad13e2fe88f52779180b5af968805e4a24f90615bL85-R87) in ServiceNow#50.
* [Fully remove deprecated env vars](ServiceNow@d06f766)
  - These env vars were not supported; they have been raising a deprecation error for the last 2 months, so we can assume people have stopped using them by now.
* [Update Pydantic](ServiceNow@df958c9) (this would fail without the previous commit).
* [Remove CLI aliases](ServiceNow@20c459a)
  - This is a breaking change, but specifying these old CLI aliases/shortcuts will now raise a clear `eva: error: unrecognized arguments: ...`, so it shouldn't cause silent errors.
  - [Update README](ServiceNow@a9774f1)
* [Run all metrics by default](ServiceNow@1590ce6)
* [Support `--metric=`](ServiceNow@c8eb0d4) (works around a bug in Pydantic)
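For reference, reading the package version dynamically from `src/eva/__init__.py` is commonly done by matching the `__version__` assignment without importing the package. A minimal sketch of that pattern (the helper name and the regex are assumptions here, not eva's actual build code):

```python
import re
from pathlib import Path


def read_version(init_path: str = "src/eva/__init__.py") -> str:
    """Extract __version__ from a package __init__.py without importing it."""
    text = Path(init_path).read_text(encoding="utf-8")
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', text, re.MULTILINE
    )
    if match is None:
        raise RuntimeError(f"__version__ not found in {init_path}")
    return match.group(1)
```

Keeping the version only in `__init__.py` avoids exactly the kind of drift this PR fixes, where `pyproject.toml` copies fell out of sync.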
Summary
**Faithfulness — S2S carve-out for `misrepresenting_tool_result`.** S2S pipelines were getting r=1 on token-level TTS/STT artifacts in tool IDs (e.g., `REQ-FAC-0ddb46f41d67` → `REQFAC0DDB46F1D67`). Articulation fidelity is `agent_speech_fidelity`'s scope, not faithfulness's. Populated the empty `S2S_ASSISTANT_TURNS_DISCLAIMER` (orientation: assistant turns are STT'd audio; articulation goes through `agent_speech_fidelity`) and added a dimension-specific `S2S_MISREPRESENTATION_NOTE` injected only into `misrepresenting_tool_result` for S2S. Cascade pipelines are unchanged.

**Conversation progression — same approach for `information_loss` and `redundant_statements`.** Same pattern applied: pipeline-specific scoping notes carve out token-level transcription artifacts that should be evaluated by `agent_speech_fidelity` instead of being double-counted.

**Validation.** Re-judged 15 hand-picked cases through `local/rerun_faithfulness.py` (9 TTS-mangling drops + 4 S2S semantic stays + 2 cascade controls):
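The carve-out mechanics above can be sketched roughly as follows; the constant texts, the `build_judge_prompt` function, and the pipeline/dimension string values are illustrative assumptions, not the PR's actual implementation:

```python
# Illustrative sketch only: names and note texts are assumed, not eva's real prompts.
S2S_ASSISTANT_TURNS_DISCLAIMER = (
    "Assistant turns are STT transcriptions of synthesized audio; "
    "articulation fidelity is scored separately by agent_speech_fidelity."
)
S2S_MISREPRESENTATION_NOTE = (
    "Do not flag token-level TTS/STT artifacts in tool IDs "
    "(e.g., dropped hyphens or case changes) as misrepresentation; "
    "only semantic divergence from the tool result counts."
)


def build_judge_prompt(base_prompt: str, dimension: str, pipeline: str) -> str:
    """Assemble the judge prompt, injecting S2S scoping notes where applicable."""
    parts = [base_prompt]
    if pipeline == "s2s":
        # Orientation disclaimer applies to every S2S dimension.
        parts.append(S2S_ASSISTANT_TURNS_DISCLAIMER)
        if dimension == "misrepresenting_tool_result":
            # Dimension-specific carve-out, injected for S2S only.
            parts.append(S2S_MISREPRESENTATION_NOTE)
    # Cascade pipelines fall through with the base prompt unchanged.
    return "\n\n".join(parts)
```

The key design point is that cascade pipelines never see either note, so their judgments are byte-for-byte identical to before the change.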