Improve faithfulness and conversation progression for S2S#93
Merged
tara-servicenow merged 3 commits into main · Apr 28, 2026
Conversation
tara-servicenow approved these changes · Apr 28, 2026
AdrienDS pushed a commit to labstr/eva that referenced this pull request · Apr 29, 2026
after 5f4bb76 in ServiceNow#93
AdrienDS pushed a commit to labstr/eva that referenced this pull request · Apr 29, 2026
I included a few small things before the default metrics. Let me know if anything is worth discussing in a separate PR.

* [Lock dependencies](ServiceNow@fd65e8a) after 5f4bb76 in ServiceNow#93
  - [Lock dependencies in pre-commit](ServiceNow@2d35a89), which is also run in the CI, so `uv.lock` never gets out of sync again.
  - [Get package version dynamically from `src/eva/__init__.py`](ServiceNow@940936e)
  - [Remove simulation and metrics versions from pyproject.toml](ServiceNow@0b194f9), as they are redundant with `src/eva/__init__.py` and have been out of sync since 5f4bb76.
* [Revert duplicated `_redact_api_keys()`](ServiceNow@747d71a) from ServiceNow#45, which duplicated the existing `_redact_model_params()`.
* [Remove remaining unreachable `or`s](ServiceNow@74832b9) left after [this change](ServiceNow@6d771db#diff-5b03da5a28117137f315375ad13e2fe88f52779180b5af968805e4a24f90615bL85-R87) in ServiceNow#50.
* [Fully remove deprecated env vars](ServiceNow@d06f766)
  - These env vars were not supported; they have been raising a deprecation error for the last 2 months, so we can assume people have stopped using them by now.
* [Update Pydantic](ServiceNow@df958c9) (this would fail without the previous commit).
* [Remove CLI aliases](ServiceNow@20c459a)
  - This is a breaking change, but specifying these old CLI aliases/shortcuts will now raise a clear `eva: error: unrecognized arguments: ...`, so it shouldn't cause silent errors.
  - [Update README](ServiceNow@a9774f1)
* [Run all metrics by default](ServiceNow@1590ce6)
* [Support `--metric=`](ServiceNow@c8eb0d4) (works around a bug in Pydantic)
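For reference, reading the package version dynamically from `src/eva/__init__.py` is commonly done by matching the `__version__` assignment without importing the package. A minimal sketch of that pattern (the helper name and the regex are assumptions here, not eva's actual build code):

```python
import re
from pathlib import Path


def read_version(init_path: str = "src/eva/__init__.py") -> str:
    """Extract __version__ from a package __init__.py without importing it."""
    text = Path(init_path).read_text(encoding="utf-8")
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', text, re.MULTILINE
    )
    if match is None:
        raise RuntimeError(f"__version__ not found in {init_path}")
    return match.group(1)
```

Keeping the version only in `__init__.py` avoids exactly the kind of drift this PR fixes, where `pyproject.toml` copies fell out of sync.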
Summary
**Faithfulness — S2S carve-out for `misrepresenting_tool_result`.** S2S pipelines were getting r=1 on token-level TTS/STT artifacts in tool IDs (e.g., `REQ-FAC-0ddb46f41d67` → `REQFAC0DDB46F1D67`). Articulation fidelity is `agent_speech_fidelity`'s scope, not faithfulness's. Populated the empty `S2S_ASSISTANT_TURNS_DISCLAIMER` (orientation: assistant turns are STT'd audio; articulation goes through `agent_speech_fidelity`) and added a dimension-specific `S2S_MISREPRESENTATION_NOTE` injected only into `misrepresenting_tool_result` for S2S. Cascade pipelines are unchanged.

**Conversation progression — same approach for `information_loss` and `redundant_statements`.** Same pattern applied: pipeline-specific scoping notes carve out token-level transcription artifacts that should be evaluated by `agent_speech_fidelity` instead of being double-counted.

**Validation.** Re-judged 15 hand-picked cases through `local/rerun_faithfulness.py` (9 TTS-mangling drops + 4 S2S semantic stays + 2 cascade controls):
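The carve-out mechanics above can be sketched roughly as follows; the constant texts, the `build_judge_prompt` function, and the pipeline/dimension string values are illustrative assumptions, not the PR's actual implementation:

```python
# Illustrative sketch only: names and note texts are assumed, not eva's real prompts.
S2S_ASSISTANT_TURNS_DISCLAIMER = (
    "Assistant turns are STT transcriptions of synthesized audio; "
    "articulation fidelity is scored separately by agent_speech_fidelity."
)
S2S_MISREPRESENTATION_NOTE = (
    "Do not flag token-level TTS/STT artifacts in tool IDs "
    "(e.g., dropped hyphens or case changes) as misrepresentation; "
    "only semantic divergence from the tool result counts."
)


def build_judge_prompt(base_prompt: str, dimension: str, pipeline: str) -> str:
    """Assemble the judge prompt, injecting S2S scoping notes where applicable."""
    parts = [base_prompt]
    if pipeline == "s2s":
        # Orientation disclaimer applies to every S2S dimension.
        parts.append(S2S_ASSISTANT_TURNS_DISCLAIMER)
        if dimension == "misrepresenting_tool_result":
            # Dimension-specific carve-out, injected for S2S only.
            parts.append(S2S_MISREPRESENTATION_NOTE)
    # Cascade pipelines fall through with the base prompt unchanged.
    return "\n\n".join(parts)
```

The key design point is that cascade pipelines never see either note, so their judgments are byte-for-byte identical to before the change.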