Fix partial metric re-runs by fanny-riols · Pull Request #59 · ServiceNow/eva

fanny-riols · 2026-04-15T15:35:36Z

What

Fixes issues that made partial metric re-runs (e.g. --metrics response_speed --force-rerun-metrics) unreliable.

Changes

metrics_summary.json no longer overwritten on partial re-run
When re-running a subset of metrics, per_metric now covers all metrics found across records rather than only the ones being re-run. metric_errors and pass_at_k_config are merged from the existing file so unrelated fields are not lost.
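The merge behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `merge_summary`, the record shape (`{"metrics": {name: score}}`), and the `mean`/`count` aggregation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def merge_summary(records, existing_summary, new_errors):
    """Rebuild a summary after a partial re-run (hypothetical helper).

    records: list of dicts shaped like {"metrics": {name: score}} (assumed shape).
    existing_summary: previously saved summary dict, or None if no file existed.
    new_errors: errors from the metrics that were just re-run.
    """
    # Aggregate over every metric present in the records,
    # not only the metrics that were re-run.
    scores = defaultdict(list)
    for record in records:
        for name, score in record.get("metrics", {}).items():
            scores[name].append(score)

    summary = {
        "per_metric": {
            name: {"mean": mean(vals), "count": len(vals)}
            for name, vals in scores.items()
        }
    }

    existing = existing_summary or {}
    # Keep errors recorded for metrics that were not re-run; new errors win on conflict.
    summary["metric_errors"] = {**existing.get("metric_errors", {}), **new_errors}
    # Carry over the pass@k configuration unchanged when it exists.
    if "pass_at_k_config" in existing:
        summary["pass_at_k_config"] = existing["pass_at_k_config"]
    return summary
```

Stale per_metric values from the old file are discarded because the aggregates are always recomputed from the records, while fields the re-run does not touch are preserved.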

LLM deployment check skipped for metrics-only re-runs
apply_env_overrides gains a strict_llm param. Passing strict_llm=False when --force-rerun-metrics is set lets re-runs succeed on runs whose LLM deployment is no longer in EVA_MODEL_LIST. A metrics-only re-run does not simulate anything, so the deployment check is unnecessary.
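As a rough sketch of the idea, a strictness flag can gate the deployment check. The signature below is hypothetical: the real apply_env_overrides in this repository may take different arguments, and the dict-based config and `model_list` parameter are assumptions for illustration only.

```python
def apply_env_overrides(config, model_list, strict_llm=True):
    """Hypothetical stand-in for the real function.

    config: dict with an optional "llm" deployment name (assumed shape).
    model_list: deployment names currently available (stands in for EVA_MODEL_LIST).
    strict_llm: when False, skip the check that the LLM deployment still exists.
    """
    llm = config.get("llm")
    if strict_llm and llm is not None and llm not in model_list:
        raise ValueError(f"LLM deployment {llm!r} is not in the model list")
    return config


def load_for_run(config, model_list, force_rerun_metrics):
    # A metrics-only re-run never calls the LLM, so the check is relaxed.
    return apply_env_overrides(config, model_list, strict_llm=not force_rerun_metrics)
```

With `force_rerun_metrics=True`, a saved run that references a retired deployment loads without error; a normal run still fails fast.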

RunConfig loading no longer conflicts with current env pipeline mode
from_existing_run used model_validate_json, which in pydantic-settings v2 merges env vars and .env on top of the saved JSON. This caused a conflict if, for example, EVA_MODEL__LLM was set in the environment but the saved run used S2S. Fixed by loading with a local settings_customise_sources override that reads only from init kwargs. The pipeline mode conflict check in _strip_other_mode_fields is also skipped when --force-rerun-metrics is set.

When re-running a subset of metrics (e.g. --metrics response_speed --force-rerun-metrics), the summary now aggregates per_metric for all metrics found across records rather than only the re-run ones. Also merges metric_errors and pass_at_k_config from the existing file so unrelated fields are not lost.

add strict_llm param to apply_env_overrides; pass strict_llm=False when --force-rerun-metrics is set so metrics-only re-runs on runs whose LLM deployment is no longer in EVA_MODEL_LIST don't fail

from_existing_run now loads the saved config using only init_settings (no env vars / .env file), preventing the saved model config from being contaminated by the current environment's pipeline mode vars. Also skip the pipeline mode conflict check in _strip_other_mode_fields when --force-rerun-metrics is set, as the model config is unused.

tara-servicenow

LGTM!

fanny-riols changed the title ~~Pr/fr/metrics summary~~ response_speed: with/no tool call breakdown + re-run reliability fixes Apr 15, 2026

fanny-riols changed the base branch from main to pr/fr/response_speed_decomposition April 15, 2026 15:36

fanny-riols changed the title ~~response_speed: with/no tool call breakdown + re-run reliability fixes~~ Fix partial metric re-runs Apr 15, 2026

fanny-riols added 4 commits April 16, 2026 15:30

Skip active-LLM deployment check when force-rerunning metrics

539523e

Fix missing docstring arg description for _strip_other_mode_fields

12a0dee

fanny-riols changed the base branch from pr/fr/response_speed_decomposition to main April 16, 2026 19:32

fanny-riols force-pushed the pr/fr/metrics-summary branch from d6fe6be to 12a0dee Compare April 16, 2026 19:32

fanny-riols marked this pull request as ready for review April 16, 2026 19:42

tara-servicenow approved these changes Apr 16, 2026

Merge branch 'main' into pr/fr/metrics-summary

19aeac5

fanny-riols enabled auto-merge April 17, 2026 13:04

fanny-riols added this pull request to the merge queue Apr 17, 2026

Merged via the queue into main with commit e40e1f4 Apr 17, 2026
1 check passed

fanny-riols deleted the pr/fr/metrics-summary branch April 17, 2026 13:08

fanny-riols merged 5 commits into main from pr/fr/metrics-summary