Fix evaluator token metrics not persisted in red teaming results #46021
Merged
slister1001 merged 3 commits into Azure:main on Apr 1, 2026
Conversation
Contributor
Pull request overview
Fixes red teaming output persistence of evaluator token metrics by updating the Foundry RAIServiceScorer token-usage extraction to support both camelCase (raw JSON) and snake_case (SDK model) response keys, ensuring downstream consumers consistently receive snake_case.
Changes:
- Updated `_extract_token_usage()` to accept both camelCase and snake_case token-usage keys and normalize output to snake_case.
- Added unit tests covering camelCase token usage extraction from both `sample.usage` and `results[].properties.metrics`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py` | Normalizes token usage extraction across camelCase/snake_case sync-eval response shapes. |
| `sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py` | Adds regression tests for camelCase token usage extraction paths. |
Force-pushed from 013c0ff to 94e7a72
The sync eval API returns token usage keys in camelCase (`promptTokens`, `completionTokens`) but `_extract_token_usage()` only looked for snake_case keys (`prompt_tokens`, `completion_tokens`). This caused the extraction to silently return an empty dict, so `scorer_token_usage` was never set and evaluator token metrics were dropped from red teaming output items.

The fix normalises both camelCase and snake_case keys to snake_case in `_extract_token_usage()`, covering both SDK model objects (snake_case) and raw JSON responses from non-OneDP endpoints (camelCase). Also updated `_compute_per_model_usage()` in `_result_processor.py` to accept both key styles when aggregating evaluator token usage, since `scorer_token_usage` now arrives in snake_case.

Added two new tests for camelCase key handling in both `sample.usage` and result `properties.metrics` extraction paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
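The aggregation side described above can be sketched as follows. This is an illustrative stand-in for `_compute_per_model_usage()`, not the SDK's actual code; the record shape and the `model` field are assumptions for the example:

```python
from collections import defaultdict

def compute_per_model_usage(items: list[dict]) -> dict:
    # Illustrative aggregator: sums evaluator token usage per model,
    # tolerating both camelCase and snake_case keys in each record.
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"prompt_tokens": 0, "completion_tokens": 0}
    )
    for item in items:
        model = item.get("model", "unknown")
        usage = item.get("scorer_token_usage", {})
        # Prefer snake_case (post-fix), fall back to camelCase (raw JSON).
        totals[model]["prompt_tokens"] += usage.get(
            "prompt_tokens", usage.get("promptTokens", 0)
        )
        totals[model]["completion_tokens"] += usage.get(
            "completion_tokens", usage.get("completionTokens", 0)
        )
    return dict(totals)

items = [
    {"model": "gpt-4o", "scorer_token_usage": {"prompt_tokens": 100, "completion_tokens": 20}},
    {"model": "gpt-4o", "scorer_token_usage": {"promptTokens": 50, "completionTokens": 10}},
]
assert compute_per_model_usage(items) == {
    "gpt-4o": {"prompt_tokens": 150, "completion_tokens": 30}
}
```

Accepting both styles at the aggregation layer keeps older records (pre-normalization) countable alongside new snake_case ones.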
Force-pushed from 94e7a72 to fe8ce49
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BryceByDesign requested changes on Mar 31, 2026
- Move `_CAMEL_TO_SNAKE` and `_SNAKE_KEYS` to module-level constants in `_rai_scorer.py` to avoid per-call recreation
- Add 5 tests for `ResultProcessor._compute_per_model_usage` covering camelCase keys, snake_case keys, mixed, aggregation, and empty input

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BryceByDesign approved these changes on Apr 1, 2026
slister1001 added a commit that referenced this pull request on Apr 1, 2026
* Fix evaluator token metrics not persisted in red teaming results
* Address PR review: use American English spelling (normalize)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
Since the transition from the old evaluation-processor path to the Foundry/`RAIServiceScorer` path, evaluator token metrics (`promptTokens`, `completionTokens`) from the sync eval API response are silently dropped from red teaming output items.
In version 1.14.0, the old `_evaluation_processor` stored the full sync eval response as `_eval_run_output_item`, which carried `properties.metrics` through to the output. The new Foundry path evaluates inline via `RAIServiceScorer` and sets `evaluation_result: None`, bypassing that extraction path entirely. A fallback mechanism via `scorer_token_usage` was added, but it has a camelCase/snake_case key mismatch bug.
Root Cause
`_extract_token_usage()` in `_rai_scorer.py` searches for snake_case keys (`prompt_tokens`, `completion_tokens`) but the sync eval API returns camelCase keys (`promptTokens`, `completionTokens`). This causes the extraction to silently return an empty dict, so `scorer_token_usage` is never set and the fallback in `_result_processor.py` never fires.
Both extraction paths are affected:
- Primary (`sample.usage`): Uses camelCase keys
- Fallback (`results[].properties.metrics`): Also uses camelCase keys
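The silent failure is easy to reproduce. The helper below is an illustrative stand-in for the pre-fix extraction logic, not the SDK's actual code:

```python
def extract_token_usage_prefix(usage: dict) -> dict:
    # Pre-fix behavior (illustrative): only snake_case keys are recognized.
    keys = ("prompt_tokens", "completion_tokens", "total_tokens")
    return {k: usage[k] for k in keys if k in usage}

# Raw JSON from the sync eval API uses camelCase keys...
camel_usage = {"promptTokens": 120, "completionTokens": 30}

# ...so the snake_case-only lookup silently yields an empty dict,
# and scorer_token_usage is never populated downstream.
assert extract_token_usage_prefix(camel_usage) == {}

# SDK model objects serialize to snake_case and worked fine:
snake_usage = {"prompt_tokens": 120, "completion_tokens": 30}
assert extract_token_usage_prefix(snake_usage) == snake_usage
```

No exception is raised at any point, which is why the metrics vanished without any error surfacing in red teaming output.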
Fix
Updated `_extract_token_usage()` to handle both camelCase and snake_case keys, normalizing to snake_case for downstream consumers. Uses a shared `_extract_from_dict()` helper that checks both key styles.
Tests
Added 2 new tests:
- camelCase token usage extraction from `sample.usage`
- camelCase token usage extraction from `results[].properties.metrics`
All 4 token usage tests pass. Black formatting verified.