
Extract RAI scorer token metrics into Score metadata and save to memory#45865

Merged
slister1001 merged 2 commits into Azure:main from slister1001:fix/foundry-scorer-token-metrics
Mar 24, 2026
Conversation

@slister1001
Member

  • Extract token usage (prompt_tokens, completion_tokens, total_tokens) from RAI service eval_result via sample.usage or result properties.metrics
  • Add token_usage to score_metadata dict in RAIServiceScorer
  • Save scores to PyRIT CentralMemory after creation (fail-safe)
  • Propagate scorer token_usage through ResultProcessor to output item properties.metrics for downstream aggregation
  • Add 5 unit tests covering token extraction, memory save, and error handling

Description

Please add an informative description that covers the changes made by the pull request, and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from
  RAI service eval_result via sample.usage or result properties.metrics
- Add token_usage to score_metadata dict in RAIServiceScorer
- Save scores to PyRIT CentralMemory after creation (fail-safe)
- Propagate scorer token_usage through ResultProcessor to output item
  properties.metrics for downstream aggregation
- Add 5 unit tests covering token extraction, memory save, and error handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 requested a review from a team as a code owner March 24, 2026 00:29
Copilot AI review requested due to automatic review settings March 24, 2026 00:29
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 24, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR enhances red teaming result fidelity by extracting RAI scorer token usage into score metadata, persisting scores into PyRIT CentralMemory, and propagating token metrics into output item properties.metrics for downstream aggregation.

Changes:

  • Extract token usage (prompt_tokens, completion_tokens, total_tokens, cached_tokens) from RAI evaluation results and attach it to Score.score_metadata.
  • Save created scores into PyRIT CentralMemory (best-effort) to support later retrieval (e.g., attack_result.last_score).
  • Propagate scorer token usage through ResultProcessor into output item properties.metrics, with unit tests covering extraction + memory behavior.
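
The "best-effort" persistence in the second bullet amounts to wrapping the memory write so that a storage failure degrades gracefully instead of failing the scoring pass. A hypothetical sketch, assuming a PyRIT-style memory object exposing `add_scores_to_memory` (the actual CentralMemory API may differ):

```python
import logging

logger = logging.getLogger(__name__)

def save_scores_fail_safe(memory, scores) -> None:
    """Persist scores to memory, but never let a memory error fail scoring."""
    try:
        # Assumed PyRIT-style call; the real method name may differ.
        memory.add_scores_to_memory(scores=scores)
    except Exception as exc:
        # Fail-safe: log and continue so the Score objects are still returned.
        logger.warning("Failed to save scores to memory: %s", exc)
```

The broad `except Exception` is deliberate here: any backend error (connection, serialization, schema) should be logged, not raised.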

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File — Description

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py — Adds token usage extraction + score metadata builder; persists scores to CentralMemory.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py — Reads token usage from serialized score metadata and propagates it into output properties.metrics when eval metrics aren’t present.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py — Adds unit tests for token usage extraction paths and fail-safe memory persistence.

Match against canonical and legacy metric name aliases when extracting
token usage from result-level properties.metrics, consistent with how
score extraction already handles aliases via _SYNC_TO_LEGACY_METRIC_NAMES
and _LEGACY_TO_SYNC_METRIC_NAMES.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
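The alias matching described in this commit can be illustrated as below. The alias tables here are examples only; the real `_SYNC_TO_LEGACY_METRIC_NAMES` and `_LEGACY_TO_SYNC_METRIC_NAMES` maps live in the SDK, and the `hate_unfairness`/`hate_fairness` pair is used purely for illustration.

```python
# Illustrative alias maps (assumed entries, not the SDK's real tables).
_SYNC_TO_LEGACY = {"hate_unfairness": "hate_fairness"}
_LEGACY_TO_SYNC = {v: k for k, v in _SYNC_TO_LEGACY.items()}

def candidate_metric_names(name: str) -> set[str]:
    """All names a metric may appear under in result-level properties.metrics."""
    names = {name}
    if name in _SYNC_TO_LEGACY:
        names.add(_SYNC_TO_LEGACY[name])
    if name in _LEGACY_TO_SYNC:
        names.add(_LEGACY_TO_SYNC[name])
    return names
```

During the fallback lookup, the extractor would then check each candidate name against `properties.metrics` instead of only the canonical one.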
Member

@nagkumar91 nagkumar91 left a comment


Clean, well-scoped PR. No issues found.

  • Token extraction with two-level fallback (sample.usage → result.properties.metrics) is robust ✅
  • Memory save with graceful degradation on failure ✅
  • String metadata handling for PyRIT serialization ✅
  • Scorer token usage propagation as fallback when eval_row lacks metrics ✅
  • 5 tests covering extraction, fallback, absent data, memory save, and memory failure ✅

LGTM.
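
The string-metadata handling and fallback propagation called out in the review bullets can be sketched together. This is a hypothetical helper, not the ResultProcessor code: the dict shapes and the JSON-serialized metadata (which the review attributes to PyRIT serialization) are assumptions.

```python
import json

def merge_token_usage(output_item: dict, score_metadata) -> None:
    """Copy scorer token_usage into output properties.metrics, but only
    as a fallback when the eval row supplied no metrics of its own."""
    # PyRIT may hand back metadata as a JSON string; deserialize if so.
    if isinstance(score_metadata, str):
        try:
            score_metadata = json.loads(score_metadata)
        except json.JSONDecodeError:
            return
    token_usage = (score_metadata or {}).get("token_usage")
    props = output_item.setdefault("properties", {})
    # Existing eval metrics take precedence over scorer token usage.
    if token_usage and not props.get("metrics"):
        props["metrics"] = dict(token_usage)
```

Making the scorer usage a fallback (rather than a merge) keeps eval-supplied metrics authoritative for downstream aggregation.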

@slister1001 slister1001 merged commit 06c9e88 into Azure:main Mar 24, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Mar 24, 2026
…ry (#45865)

* Extract RAI scorer token metrics into Score metadata and save to memory

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from
  RAI service eval_result via sample.usage or result properties.metrics
- Add token_usage to score_metadata dict in RAIServiceScorer
- Save scores to PyRIT CentralMemory after creation (fail-safe)
- Propagate scorer token_usage through ResultProcessor to output item
  properties.metrics for downstream aggregation
- Add 5 unit tests covering token extraction, memory save, and error handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Use metric aliases in _extract_token_usage fallback

Match against canonical and legacy metric name aliases when extracting
token usage from result-level properties.metrics, consistent with how
score extraction already handles aliases via _SYNC_TO_LEGACY_METRIC_NAMES
and _LEGACY_TO_SYNC_METRIC_NAMES.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request Mar 24, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request Mar 30, 2026
- Backport 1.16.2 hotfix CHANGELOG with release date (2026-03-24)
- Add missing token metrics entry (PR Azure#45865) to 1.16.2 section
- Add 1.16.3 (Unreleased) section with existing extra_headers feature
- Bump _version.py to 1.16.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request Mar 31, 2026
- Backport 1.16.2 hotfix CHANGELOG with release date (2026-03-24)
- Add missing token metrics entry (PR #45865) to 1.16.2 section
- Add 1.16.3 (Unreleased) section with existing extra_headers feature
- Bump _version.py to 1.16.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request Apr 1, 2026
- Backport 1.16.2 hotfix CHANGELOG with release date (2026-03-24)
- Add missing token metrics entry (PR #45865) to 1.16.2 section
- Add 1.16.3 (Unreleased) section with existing extra_headers feature
- Bump _version.py to 1.16.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>