Improve faithfulness and conversation progression for S2S #93

Merged
tara-servicenow merged 3 commits into main from ggm/fix-metric-bugs
Apr 28, 2026

@gabegma gabegma commented Apr 28, 2026

Summary

Faithfulness: S2S carve-out for misrepresenting_tool_result. S2S pipelines were getting r=1 because of token-level TTS/STT artifacts in tool IDs (e.g., REQ-FAC-0ddb46f41d67 → REQFAC0DDB46F1D67). Articulation fidelity is in scope for agent_speech_fidelity, not faithfulness. This PR populates the previously empty S2S_ASSISTANT_TURNS_DISCLAIMER (an orientation note: assistant turns are STT'd audio, and articulation is evaluated by agent_speech_fidelity) and adds a dimension-specific S2S_MISREPRESENTATION_NOTE that is injected only into misrepresenting_tool_result for S2S. Cascade pipelines are unchanged.
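The injection rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: only the two `S2S_*` constant names and the dimension names appear in the description, while the note texts, the `build_dimension_notes` helper, and the `"s2s"` / `"cascade"` pipeline labels are assumptions made for the example.

```python
# Hypothetical sketch. The constant names come from the PR description;
# the note wording, the helper, and the pipeline labels are assumptions.
S2S_ASSISTANT_TURNS_DISCLAIMER = (
    "Assistant turns are speech-to-text transcripts of generated audio. "
    "Articulation errors are evaluated by agent_speech_fidelity."
)
S2S_MISREPRESENTATION_NOTE = (
    "Do not flag token-level transcription artifacts in tool IDs as "
    "misrepresenting_tool_result."
)


def build_dimension_notes(pipeline: str, dimension: str) -> list[str]:
    """Return the extra judge-prompt notes for one faithfulness dimension."""
    if pipeline != "s2s":
        # Cascade prompts are left unchanged.
        return []
    # Every S2S dimension gets the general orientation disclaimer.
    notes = [S2S_ASSISTANT_TURNS_DISCLAIMER]
    # Only misrepresenting_tool_result gets the dimension-specific note.
    if dimension == "misrepresenting_tool_result":
        notes.append(S2S_MISREPRESENTATION_NOTE)
    return notes
```

Keeping the dimension-specific note separate from the general disclaimer means other dimensions, such as hallucination, see the orientation text without having their scope narrowed.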

Conversation progression: the same pattern is applied to information_loss and redundant_statements. Pipeline-specific scoping notes carve out token-level transcription artifacts, which agent_speech_fidelity should evaluate, so they are not double-counted.
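To make "token-level transcription artifact" concrete, here is a small illustration using the tool ID from the faithfulness example. This PR changes the notes given to the judge and does not add a string heuristic; the `normalize_id` and `id_similarity` helpers below are hypothetical and exist only to show how close the mangled ID is to the original.

```python
from difflib import SequenceMatcher


def normalize_id(text: str) -> str:
    """Uppercase the ID and drop separators such as hyphens."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


def id_similarity(expected: str, transcribed: str) -> float:
    """Similarity in [0, 1] between two IDs after normalization."""
    return SequenceMatcher(
        None, normalize_id(expected), normalize_id(transcribed)
    ).ratio()


# The transcribed ID differs from the tool result by one dropped character
# plus casing and hyphenation, so the score stays close to 1.
score = id_similarity("REQ-FAC-0ddb46f41d67", "REQFAC0DDB46F1D67")
```

A near-identical ID like this is an articulation issue rather than a change in meaning, which is why the notes route it to agent_speech_fidelity.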

Validation. Re-judged 15 hand-picked cases through local/rerun_faithfulness.py (9 TTS-mangling drops + 4 S2S semantic stays + 2 cascade controls):

  • 9/9 drops: misrepresenting_tool_result (MTR) moved from 1 → 2/3.
  • 6/6 stays: overall=1 preserved; semantic violations still flagged in MTR or hallucination per scope.
  • 0 cases lost the underlying signal.
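The acceptance criteria above could be checked mechanically along these lines. This is a sketch under assumptions: the `Case` record, its field names, and the helpers are hypothetical and are not taken from `local/rerun_faithfulness.py`. It also assumes that "1 → 2/3" means the MTR rating rises above 1, and it checks only that a stay case's rating is unchanged, not that the semantic violation is still flagged.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One re-judged conversation (hypothetical record layout)."""
    kind: str    # "drop": TTS-mangling case; "stay": semantic or cascade control
    before: int  # rating before the prompt change
    after: int   # rating after the prompt change


def meets_expectation(case: Case) -> bool:
    """Drops should rise above 1; stays should keep their rating."""
    if case.kind == "drop":
        return case.before == 1 and case.after > 1
    return case.after == case.before


def tally(cases: list[Case]) -> dict[str, tuple[int, int]]:
    """Return {kind: (cases meeting expectation, total cases)}."""
    result: dict[str, tuple[int, int]] = {}
    for case in cases:
        passed, total = result.get(case.kind, (0, 0))
        result[case.kind] = (passed + int(meets_expectation(case)), total + 1)
    return result
```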

@gabegma gabegma self-assigned this Apr 28, 2026
@tara-servicenow tara-servicenow added this pull request to the merge queue Apr 28, 2026
Merged via the queue into main with commit 9230b38 Apr 28, 2026
1 check passed
@tara-servicenow tara-servicenow deleted the ggm/fix-metric-bugs branch April 28, 2026 22:03
AdrienDS pushed a commit to labstr/eva that referenced this pull request Apr 29, 2026
I included a few small things before the default metrics. Let me know if
anything is worth discussing in a separate PR.

* [Lock dependencies](ServiceNow@fd65e8a) after 5f4bb76 in ServiceNow#93
  - [Lock dependencies in pre-commit](ServiceNow@2d35a89), which is also run in the CI, so `uv.lock` never gets out of sync again.
  - [Get package version dynamically from `src/eva/__init__.py`](ServiceNow@940936e)
  - [Remove simulation and metrics versions from pyproject.toml](ServiceNow@0b194f9), as they are redundant with `src/eva/__init__.py` and out of sync since 5f4bb76.
* [Revert duplicated `_redact_api_keys()`](ServiceNow@747d71a) from ServiceNow#45, duplicated with the existing `_redact_model_params()`
* [Remove remaining unreachable `or`s](ServiceNow@74832b9) left after [this](ServiceNow@6d771db#diff-5b03da5a28117137f315375ad13e2fe88f52779180b5af968805e4a24f90615bL85-R87) in ServiceNow#50
* [Fully remove deprecated env vars](ServiceNow@d06f766)
  - These env vars were not supported; they have been raising a deprecation error for the last 2 months, so we can assume people have stopped using them by now.
* [Update Pydantic](ServiceNow@df958c9) (this would fail without the previous commit)
* [Remove CLI aliases](ServiceNow@20c459a)
  - This is a breaking change, but specifying these old CLI aliases/shortcuts will now raise a clear `eva: error: unrecognized arguments: ...`, so this shouldn't cause silent errors.
  - [Update README](ServiceNow@a9774f1)
* [Run all metrics by default](ServiceNow@1590ce6)
* [Support `--metric=` (work around a bug in Pydantic)](ServiceNow@c8eb0d4)