
Extract RAI scorer token metrics into Score metadata and save to memory#45865

Merged
slister1001 merged 2 commits into Azure:main from slister1001:fix/foundry-scorer-token-metrics
Mar 24, 2026
Conversation

@slister1001
Member

  • Extract token usage (prompt_tokens, completion_tokens, total_tokens) from RAI service eval_result via sample.usage or result properties.metrics
  • Add token_usage to score_metadata dict in RAIServiceScorer
  • Save scores to PyRIT CentralMemory after creation (fail-safe)
  • Propagate scorer token_usage through ResultProcessor to output item properties.metrics for downstream aggregation
  • Add 5 unit tests covering token extraction, memory save, and error handling

Description

Please add an informative description that covers the changes made by the pull request, and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from
  RAI service eval_result via sample.usage or result properties.metrics
- Add token_usage to score_metadata dict in RAIServiceScorer
- Save scores to PyRIT CentralMemory after creation (fail-safe)
- Propagate scorer token_usage through ResultProcessor to output item
  properties.metrics for downstream aggregation
- Add 5 unit tests covering token extraction, memory save, and error handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 requested a review from a team as a code owner March 24, 2026 00:29
Copilot AI review requested due to automatic review settings March 24, 2026 00:29
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 24, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR enhances red teaming result fidelity by extracting RAI scorer token usage into score metadata, persisting scores into PyRIT CentralMemory, and propagating token metrics into output item properties.metrics for downstream aggregation.

Changes:

  • Extract token usage (prompt_tokens, completion_tokens, total_tokens, cached_tokens) from RAI evaluation results and attach it to Score.score_metadata.
  • Save created scores into PyRIT CentralMemory (best-effort) to support later retrieval (e.g., attack_result.last_score).
  • Propagate scorer token usage through ResultProcessor into output item properties.metrics, with unit tests covering extraction + memory behavior.
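
The "best-effort" persistence in the second bullet amounts to wrapping the memory write so that a storage failure degrades gracefully instead of failing the scoring pass. A hypothetical sketch, assuming a PyRIT-style memory object exposing `add_scores_to_memory` (the actual CentralMemory API may differ):

```python
import logging

logger = logging.getLogger(__name__)

def save_scores_fail_safe(memory, scores) -> None:
    """Persist scores to memory, but never let a memory error fail scoring."""
    try:
        # Assumed PyRIT-style call; the real method name may differ.
        memory.add_scores_to_memory(scores=scores)
    except Exception as exc:
        # Fail-safe: log and continue so the Score objects are still returned.
        logger.warning("Failed to save scores to memory: %s", exc)
```

The broad `except Exception` is deliberate here: any backend error (connection, serialization, schema) should be logged, not raised.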

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File — Description

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py — Adds token usage extraction + score metadata builder; persists scores to CentralMemory.
  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py — Reads token usage from serialized score metadata and propagates it into output properties.metrics when eval metrics aren’t present.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py — Adds unit tests for token usage extraction paths and fail-safe memory persistence.

Match against canonical and legacy metric name aliases when extracting
token usage from result-level properties.metrics, consistent with how
score extraction already handles aliases via _SYNC_TO_LEGACY_METRIC_NAMES
and _LEGACY_TO_SYNC_METRIC_NAMES.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
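The alias matching described in this commit can be illustrated as below. The alias tables here are examples only; the real `_SYNC_TO_LEGACY_METRIC_NAMES` and `_LEGACY_TO_SYNC_METRIC_NAMES` maps live in the SDK, and the `hate_unfairness`/`hate_fairness` pair is used purely for illustration.

```python
# Illustrative alias maps (assumed entries, not the SDK's real tables).
_SYNC_TO_LEGACY = {"hate_unfairness": "hate_fairness"}
_LEGACY_TO_SYNC = {v: k for k, v in _SYNC_TO_LEGACY.items()}

def candidate_metric_names(name: str) -> set[str]:
    """All names a metric may appear under in result-level properties.metrics."""
    names = {name}
    if name in _SYNC_TO_LEGACY:
        names.add(_SYNC_TO_LEGACY[name])
    if name in _LEGACY_TO_SYNC:
        names.add(_LEGACY_TO_SYNC[name])
    return names
```

During the fallback lookup, the extractor would then check each candidate name against `properties.metrics` instead of only the canonical one.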
Member

@nagkumar91 nagkumar91 left a comment


Clean, well-scoped PR. No issues found.

  • Token extraction with two-level fallback (sample.usage → result.properties.metrics) is robust ✅
  • Memory save with graceful degradation on failure ✅
  • String metadata handling for PyRIT serialization ✅
  • Scorer token usage propagation as fallback when eval_row lacks metrics ✅
  • 5 tests covering extraction, fallback, absent data, memory save, and memory failure ✅

LGTM.
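
The string-metadata handling and fallback propagation called out in the review bullets can be sketched together. This is a hypothetical helper, not the ResultProcessor code: the dict shapes and the JSON-serialized metadata (which the review attributes to PyRIT serialization) are assumptions.

```python
import json

def merge_token_usage(output_item: dict, score_metadata) -> None:
    """Copy scorer token_usage into output properties.metrics, but only
    as a fallback when the eval row supplied no metrics of its own."""
    # PyRIT may hand back metadata as a JSON string; deserialize if so.
    if isinstance(score_metadata, str):
        try:
            score_metadata = json.loads(score_metadata)
        except json.JSONDecodeError:
            return
    token_usage = (score_metadata or {}).get("token_usage")
    props = output_item.setdefault("properties", {})
    # Existing eval metrics take precedence over scorer token usage.
    if token_usage and not props.get("metrics"):
        props["metrics"] = dict(token_usage)
```

Making the scorer usage a fallback (rather than a merge) keeps eval-supplied metrics authoritative for downstream aggregation.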

@slister1001 slister1001 merged commit 06c9e88 into Azure:main Mar 24, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Mar 24, 2026
…ry (#45865)

* Extract RAI scorer token metrics into Score metadata and save to memory

- Extract token usage (prompt_tokens, completion_tokens, total_tokens) from
  RAI service eval_result via sample.usage or result properties.metrics
- Add token_usage to score_metadata dict in RAIServiceScorer
- Save scores to PyRIT CentralMemory after creation (fail-safe)
- Propagate scorer token_usage through ResultProcessor to output item
  properties.metrics for downstream aggregation
- Add 5 unit tests covering token extraction, memory save, and error handling

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Use metric aliases in _extract_token_usage fallback

Match against canonical and legacy metric name aliases when extracting
token usage from result-level properties.metrics, consistent with how
score extraction already handles aliases via _SYNC_TO_LEGACY_METRIC_NAMES
and _LEGACY_TO_SYNC_METRIC_NAMES.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request Mar 24, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request Mar 30, 2026
- Backport 1.16.2 hotfix CHANGELOG with release date (2026-03-24)
- Add missing token metrics entry (PR Azure#45865) to 1.16.2 section
- Add 1.16.3 (Unreleased) section with existing extra_headers feature
- Bump _version.py to 1.16.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request Mar 31, 2026
- Backport 1.16.2 hotfix CHANGELOG with release date (2026-03-24)
- Add missing token metrics entry (PR #45865) to 1.16.2 section
- Add 1.16.3 (Unreleased) section with existing extra_headers feature
- Bump _version.py to 1.16.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request Apr 1, 2026
- Backport 1.16.2 hotfix CHANGELOG with release date (2026-03-24)
- Add missing token metrics entry (PR #45865) to 1.16.2 section
- Add 1.16.3 (Unreleased) section with existing extra_headers feature
- Bump _version.py to 1.16.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>