
Fix evaluator token metrics not persisted in red teaming results#46021

Merged
slister1001 merged 3 commits into Azure:main from slister1001:fix/redteam-token-metrics
Apr 1, 2026

Conversation

@slister1001
Member

Problem

Since the transition from the old evaluation-processor path to the Foundry/RAIServiceScorer path, evaluator token metrics (`promptTokens`, `completionTokens`) from the sync eval API response are silently dropped from red teaming output items.

In version 1.14.0, the old `_evaluation_processor` stored the full sync eval response as `_eval_run_output_item`, which carried `properties.metrics` through to the output. The new Foundry path evaluates inline via `RAIServiceScorer` and sets `evaluation_result: None`, bypassing that extraction path entirely. A fallback mechanism via `scorer_token_usage` was added, but it has a camelCase/snake_case key-mismatch bug.

Root Cause

`_extract_token_usage()` in `_rai_scorer.py` searches for snake_case keys (`prompt_tokens`, `completion_tokens`), but the sync eval API returns camelCase keys (`promptTokens`, `completionTokens`). This causes the extraction to silently return an empty dict, so `scorer_token_usage` is never set and the fallback in `_result_processor.py` never fires.

Both extraction paths are affected (illustrative payload shapes below):

  • Path 1 (`sample.usage`): Raw JSON responses use camelCase keys
  • Path 2 (`results[].properties.metrics`): Also uses camelCase keys
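A minimal sketch of the two shapes, with made-up values and a simplified envelope (the actual sync eval response carries more fields than shown here):

```python
# Path 1: raw JSON response exposes usage on the sample, in camelCase.
raw_response_sample_usage = {
    "sample": {"usage": {"promptTokens": 412, "completionTokens": 57}},
}

# Path 2: per-result metrics also arrive in camelCase.
raw_response_result_metrics = {
    "results": [
        {"properties": {"metrics": {"promptTokens": 412, "completionTokens": 57}}},
    ],
}

# SDK model objects, by contrast, surface the same data in snake_case.
sdk_model_usage = {"prompt_tokens": 412, "completion_tokens": 57}
```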

Fix

Updated `_extract_token_usage()` to handle both camelCase and snake_case keys, normalizing to snake_case for downstream consumers. Uses a shared `_extract_from_dict()` helper that checks both key styles.
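A rough sketch of the normalization idea, assuming the helper only needs to map camelCase aliases onto their snake_case equivalents. The constant names echo `_CAMEL_TO_SNAKE` / `_SNAKE_KEYS` from a later commit in this PR, but the mapping and helper body here are illustrative, not the actual `_rai_scorer.py` code:

```python
from typing import Any, Dict, Optional

# Illustrative mapping of service camelCase keys to the snake_case names
# expected by downstream consumers.
_CAMEL_TO_SNAKE = {
    "promptTokens": "prompt_tokens",
    "completionTokens": "completion_tokens",
}
_SNAKE_KEYS = set(_CAMEL_TO_SNAKE.values())


def _extract_from_dict(source: Optional[Dict[str, Any]]) -> Dict[str, int]:
    """Collect token-usage counts from ``source``, accepting either key style."""
    usage: Dict[str, int] = {}
    if not source:
        return usage
    for key, value in source.items():
        if key in _SNAKE_KEYS:
            usage[key] = int(value)
        elif key in _CAMEL_TO_SNAKE:
            # Normalize camelCase service keys to snake_case.
            usage[_CAMEL_TO_SNAKE[key]] = int(value)
    return usage
```

With a helper like this, `_extract_token_usage()` can run it over both `sample.usage` and `results[].properties.metrics` and always hand snake_case keys to `_result_processor.py`.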

Tests

Added 2 new tests:

  • `test_score_metadata_includes_token_usage_from_sample_camelcase` — verifies camelCase keys from raw JSON `sample.usage`
  • `test_score_metadata_includes_token_usage_from_result_properties_camelcase` — verifies camelCase keys from `properties.metrics`

All 4 token usage tests pass. Black formatting verified.
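As a rough illustration of what these regression tests guard against, here is a pytest-style check against the hypothetical helper sketched in the Fix section above; per their names, the actual tests in `test_foundry.py` assert on the scorer's score metadata rather than on this helper directly:

```python
def test_extract_from_dict_normalizes_camelcase():
    # Checks the illustrative helper above, not the SDK's own test code.
    assert _extract_from_dict({"promptTokens": 10, "completionTokens": 3}) == {
        "prompt_tokens": 10,
        "completion_tokens": 3,
    }
    # snake_case input passes through unchanged.
    assert _extract_from_dict({"prompt_tokens": 10}) == {"prompt_tokens": 10}
```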

Copilot AI review requested due to automatic review settings March 31, 2026 17:54
@slister1001 slister1001 requested a review from a team as a code owner March 31, 2026 17:54
@github-actions github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Mar 31, 2026
Contributor

Copilot AI left a comment


Pull request overview

Fixes persistence of evaluator token metrics in red teaming output by updating the Foundry RAIServiceScorer token-usage extraction to accept both camelCase (raw JSON) and snake_case (SDK model) response keys, ensuring downstream consumers consistently receive snake_case.

Changes:

  • Updated _extract_token_usage() to accept both camelCase and snake_case token-usage keys and normalize output to snake_case.
  • Added unit tests covering camelCase token usage extraction from both sample.usage and results[].properties.metrics.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Changed files:

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py: Normalizes token usage extraction across camelCase/snake_case sync-eval response shapes.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py: Adds regression tests for camelCase token usage extraction paths.

@slister1001 slister1001 force-pushed the fix/redteam-token-metrics branch from 013c0ff to 94e7a72 on March 31, 2026 18:07
The sync eval API returns token usage keys in camelCase (promptTokens,
completionTokens) but _extract_token_usage() only looked for snake_case
keys (prompt_tokens, completion_tokens). This caused the extraction to
silently return an empty dict, so scorer_token_usage was never set and
evaluator token metrics were dropped from red teaming output items.

The fix normalises both camelCase and snake_case keys to snake_case in
_extract_token_usage(), covering both SDK model objects (snake_case) and
raw JSON responses from non-OneDP endpoints (camelCase).

Also updated _compute_per_model_usage() in _result_processor.py to
accept both key styles when aggregating evaluator token usage, since
scorer_token_usage now arrives in snake_case.

Added two new tests for camelCase key handling in both sample.usage and
result properties.metrics extraction paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
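To make the aggregation point in the commit message concrete, a small sketch of summing evaluator token usage while accepting both key styles; the function name and shape are hypothetical and not the actual `_compute_per_model_usage()` implementation in `_result_processor.py`:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

# Key aliases accepted during aggregation; illustrative only.
_USAGE_KEY_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "promptTokens"),
    "completion_tokens": ("completion_tokens", "completionTokens"),
}


def aggregate_evaluator_usage(usages: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Sum token usage across scorer results, accepting either key style."""
    totals: Dict[str, int] = defaultdict(int)
    for usage in usages:
        for snake_key, aliases in _USAGE_KEY_ALIASES.items():
            for alias in aliases:
                if alias in usage:
                    totals[snake_key] += int(usage[alias])
                    break  # avoid double-counting when both styles are present
    return dict(totals)
```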
@slister1001 slister1001 force-pushed the fix/redteam-token-metrics branch from 94e7a72 to fe8ce49 on March 31, 2026 18:26
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 enabled auto-merge (squash) March 31, 2026 21:57
- Move _CAMEL_TO_SNAKE and _SNAKE_KEYS to module-level constants in
  _rai_scorer.py to avoid per-call recreation
- Add 5 tests for ResultProcessor._compute_per_model_usage covering
  camelCase keys, snake_case keys, mixed, aggregation, and empty input

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 merged commit dfda2d6 into Azure:main Apr 1, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Apr 1, 2026
Fix evaluator token metrics not persisted in red teaming results (#46021)

* Fix evaluator token metrics not persisted in red teaming results


* Address PR review: use American English spelling (normalize)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Labels

Evaluation: Issues related to the client library for Azure AI Evaluation
