
Fix evaluator token metrics not persisted in red teaming results#46021

Merged
slister1001 merged 3 commits into Azure:main from slister1001:fix/redteam-token-metrics
Apr 1, 2026

Conversation

@slister1001
Member

Problem

Since the transition from the old evaluation-processor path to the Foundry/RAIServiceScorer path, evaluator token metrics (`promptTokens`, `completionTokens`) from the sync eval API response are silently dropped from red teaming output items.

In version 1.14.0, the old `_evaluation_processor` stored the full sync eval response as `_eval_run_output_item`, which carried `properties.metrics` through to the output. The new Foundry path evaluates inline via `RAIServiceScorer` and sets `evaluation_result: None`, bypassing that extraction path entirely. A fallback mechanism via `scorer_token_usage` was added, but it has a camelCase/snake_case key-mismatch bug.

Root Cause

`_extract_token_usage()` in `_rai_scorer.py` searches for snake_case keys (`prompt_tokens`, `completion_tokens`), but the sync eval API returns camelCase keys (`promptTokens`, `completionTokens`). This causes the extraction to silently return an empty dict, so `scorer_token_usage` is never set and the fallback in `_result_processor.py` never fires.

Both extraction paths are affected (illustrative payload shapes below):

  • Path 1 (`sample.usage`): Raw JSON responses use camelCase keys
  • Path 2 (`results[].properties.metrics`): Also uses camelCase keys
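A minimal sketch of the two shapes, with made-up values and a simplified envelope (the actual sync eval response carries more fields than shown here):

```python
# Path 1: raw JSON response exposes usage on the sample, in camelCase.
raw_response_sample_usage = {
    "sample": {"usage": {"promptTokens": 412, "completionTokens": 57}},
}

# Path 2: per-result metrics also arrive in camelCase.
raw_response_result_metrics = {
    "results": [
        {"properties": {"metrics": {"promptTokens": 412, "completionTokens": 57}}},
    ],
}

# SDK model objects, by contrast, surface the same data in snake_case.
sdk_model_usage = {"prompt_tokens": 412, "completion_tokens": 57}
```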

Fix

Updated `_extract_token_usage()` to handle both camelCase and snake_case keys, normalizing to snake_case for downstream consumers. Uses a shared `_extract_from_dict()` helper that checks both key styles.
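A rough sketch of the normalization idea, assuming the helper only needs to map camelCase aliases onto their snake_case equivalents. The constant names echo `_CAMEL_TO_SNAKE` / `_SNAKE_KEYS` from a later commit in this PR, but the mapping and helper body here are illustrative, not the actual `_rai_scorer.py` code:

```python
from typing import Any, Dict, Optional

# Illustrative mapping of service camelCase keys to the snake_case names
# expected by downstream consumers.
_CAMEL_TO_SNAKE = {
    "promptTokens": "prompt_tokens",
    "completionTokens": "completion_tokens",
}
_SNAKE_KEYS = set(_CAMEL_TO_SNAKE.values())


def _extract_from_dict(source: Optional[Dict[str, Any]]) -> Dict[str, int]:
    """Collect token-usage counts from ``source``, accepting either key style."""
    usage: Dict[str, int] = {}
    if not source:
        return usage
    for key, value in source.items():
        if key in _SNAKE_KEYS:
            usage[key] = int(value)
        elif key in _CAMEL_TO_SNAKE:
            # Normalize camelCase service keys to snake_case.
            usage[_CAMEL_TO_SNAKE[key]] = int(value)
    return usage
```

With a helper like this, `_extract_token_usage()` can run it over both `sample.usage` and `results[].properties.metrics` and always hand snake_case keys to `_result_processor.py`.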

Tests

Added 2 new tests:

  • `test_score_metadata_includes_token_usage_from_sample_camelcase` — verifies camelCase keys from raw JSON `sample.usage`
  • `test_score_metadata_includes_token_usage_from_result_properties_camelcase` — verifies camelCase keys from `properties.metrics`

All 4 token usage tests pass. Black formatting verified.
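As a rough illustration of what these regression tests guard against, here is a pytest-style check against the hypothetical helper sketched in the Fix section above; per their names, the actual tests in `test_foundry.py` assert on the scorer's score metadata rather than on this helper directly:

```python
def test_extract_from_dict_normalizes_camelcase():
    # Checks the illustrative helper above, not the SDK's own test code.
    assert _extract_from_dict({"promptTokens": 10, "completionTokens": 3}) == {
        "prompt_tokens": 10,
        "completion_tokens": 3,
    }
    # snake_case input passes through unchanged.
    assert _extract_from_dict({"prompt_tokens": 10}) == {"prompt_tokens": 10}
```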

Copilot AI review requested due to automatic review settings March 31, 2026 17:54
@slister1001 slister1001 requested a review from a team as a code owner March 31, 2026 17:54
@github-actions github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Mar 31, 2026
Contributor

Copilot AI left a comment


Pull request overview

Fixes persistence of evaluator token metrics in red teaming output by updating the Foundry RAIServiceScorer token-usage extraction to accept both camelCase (raw JSON) and snake_case (SDK model) response keys, ensuring downstream consumers consistently receive snake_case.

Changes:

  • Updated _extract_token_usage() to accept both camelCase and snake_case token-usage keys and normalize output to snake_case.
  • Added unit tests covering camelCase token usage extraction from both sample.usage and results[].properties.metrics.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Changed files:

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py: Normalizes token usage extraction across camelCase/snake_case sync-eval response shapes.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py: Adds regression tests for camelCase token usage extraction paths.

@slister1001 slister1001 force-pushed the fix/redteam-token-metrics branch from 013c0ff to 94e7a72 on March 31, 2026 18:07
The sync eval API returns token usage keys in camelCase (promptTokens,
completionTokens) but _extract_token_usage() only looked for snake_case
keys (prompt_tokens, completion_tokens). This caused the extraction to
silently return an empty dict, so scorer_token_usage was never set and
evaluator token metrics were dropped from red teaming output items.

The fix normalises both camelCase and snake_case keys to snake_case in
_extract_token_usage(), covering both SDK model objects (snake_case) and
raw JSON responses from non-OneDP endpoints (camelCase).

Also updated _compute_per_model_usage() in _result_processor.py to
accept both key styles when aggregating evaluator token usage, since
scorer_token_usage now arrives in snake_case.

Added two new tests for camelCase key handling in both sample.usage and
result properties.metrics extraction paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
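To make the aggregation point in the commit message concrete, a small sketch of summing evaluator token usage while accepting both key styles; the function name and shape are hypothetical and not the actual `_compute_per_model_usage()` implementation in `_result_processor.py`:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

# Key aliases accepted during aggregation; illustrative only.
_USAGE_KEY_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "promptTokens"),
    "completion_tokens": ("completion_tokens", "completionTokens"),
}


def aggregate_evaluator_usage(usages: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Sum token usage across scorer results, accepting either key style."""
    totals: Dict[str, int] = defaultdict(int)
    for usage in usages:
        for snake_key, aliases in _USAGE_KEY_ALIASES.items():
            for alias in aliases:
                if alias in usage:
                    totals[snake_key] += int(usage[alias])
                    break  # avoid double-counting when both styles are present
    return dict(totals)
```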
@slister1001 slister1001 force-pushed the fix/redteam-token-metrics branch from 94e7a72 to fe8ce49 on March 31, 2026 18:26
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 enabled auto-merge (squash) March 31, 2026 21:57
- Move _CAMEL_TO_SNAKE and _SNAKE_KEYS to module-level constants in
  _rai_scorer.py to avoid per-call recreation
- Add 5 tests for ResultProcessor._compute_per_model_usage covering
  camelCase keys, snake_case keys, mixed, aggregation, and empty input

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 merged commit dfda2d6 into Azure:main Apr 1, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Apr 1, 2026
Fix evaluator token metrics not persisted in red teaming results (#46021)

* Fix evaluator token metrics not persisted in red teaming results


* Address PR review: use American English spelling (normalize)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Labels

Evaluation: Issues related to the client library for Azure AI Evaluation
