[Evaluation] Additional red team e2e tests by slister1001 · Pull Request #45579 · Azure/azure-sdk-for-python

slister1001 · 2026-03-08T18:25:12Z

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

Copilot

Pull request overview

Adds additional end-to-end coverage for RedTeam “Foundry” execution and aligns a few red team internals/outputs with expected contracts and error semantics.

Changes:

Expanded test_red_team_foundry.py with new Foundry e2e scenarios (model-config targets, agent targets, new risk categories, multi-turn strategies, and contract error paths).
Fixed Foundry baseline objective cache lookup keying to use the same risk-category→objective mapping as the generator path.
Treated leftover pending/running statuses as terminal failures when producing final run status, and surfaced RAI evaluation service “error outcome” as undetermined instead of attack success.
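The run-status change described in the last item can be illustrated with a small sketch. The status strings, the helper names `finalize_status` and `final_run_status`, and the roll-up rule are all assumptions made for illustration; they are not taken from `_result_processor.py`.

```python
from typing import Iterable

# Hypothetical status names; the real result processor may use different values.
_NON_TERMINAL_STATUSES = {"pending", "running"}


def finalize_status(status: str) -> str:
    """Map a status that is still non-terminal after the scan to a terminal failure."""
    return "failed" if status in _NON_TERMINAL_STATUSES else status


def final_run_status(statuses: Iterable[str]) -> str:
    """Assumed roll-up rule: the run fails if any finalized status is 'failed'."""
    finalized = [finalize_status(s) for s in statuses]
    return "failed" if "failed" in finalized else "completed"


# A task still marked "running" after the scan ends counts as a failure.
stuck_run = final_run_status(["completed", "running"])
clean_run = final_run_status(["completed", "completed"])
```

The point of the sketch is only that a leftover `pending` or `running` value never reaches the final report as if the work were still in progress.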

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_red_team_foundry.py` | Adds substantial Foundry red team e2e coverage across targets, strategies, and risk categories. |
| `sdk/evaluation/azure-ai-evaluation/tests/conftest.py` | Updates OpenAI/test-proxy routing configuration used by recordings/playback. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py` | Adjusts final run-level status determination semantics after scan completion. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py` | Fixes baseline objective cache key mismatch in Foundry execution path. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py` | Detects evaluation-service error outcomes and raises so PyRIT marks results as undetermined. |
| `sdk/evaluation/azure-ai-evaluation/CHANGELOG.md` | Documents the bug fixes included in this PR. |

Tests cover: basic execution, XPIA, multiple risk categories, application scenarios, strategy combinations, model_config targets, agent callbacks, agent tool context, ProtectedMaterial/CodeVulnerability/TaskAdherence categories, SensitiveDataLeakage, agent-only risk rejection, multi-turn, and crescendo attacks.

Also fixes PROXY_URL() TypeError in conftest.py (PROXY_URL is a str, not callable).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Revert PROXY_URL back to PROXY_URL() (it's a function, not a variable)
- Apply black formatting to assert statements in test_red_team_foundry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t risk categories

- Add _safe_tqdm_write() wrapper to handle UnicodeEncodeError on Windows cp1252 terminals
- Replace all tqdm.write() calls with _safe_tqdm_write() in _red_team.py
- Add custom seed prompt files for agent-only risk categories (task_adherence, sensitive_data_leakage, prohibited_actions) that lack server-side seed data
- Update test_foundry_task_adherence_category and test_foundry_agent_sensitive_data_leakage to use custom_attack_seed_prompts, bypassing get_attack_objectives API
- Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
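A wrapper of the kind this commit describes might look like the sketch below. It is illustrative only: the writer is passed in as a plain callable so the example runs without tqdm installed, and `safe_write` and `cp1252_writer` are hypothetical names, not the `_safe_tqdm_write()` implementation from `_red_team.py`.

```python
from typing import Callable, List


def safe_write(write: Callable[[str], None], message: str) -> None:
    """Write a message, degrading to ASCII if the output encoding rejects it."""
    try:
        write(message)
    except UnicodeEncodeError:
        # Replace every character the terminal cannot encode with "?".
        write(message.encode("ascii", errors="replace").decode("ascii"))


captured: List[str] = []


def cp1252_writer(text: str) -> None:
    """Simulate a cp1252 terminal: raise on characters outside that code page."""
    text.encode("cp1252")
    captured.append(text)


safe_write(cp1252_writer, "scan complete \u2705")
```

In the SDK the callable would presumably be `tqdm.write`; the sketch only shows the catch-and-degrade pattern.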

- Merge upstream/main (7 commits) into foundry-e2e-tests branch
- Fix PROXY_URL() call in conftest.py (PROXY_URL is a string, not callable)
- Re-record all 15 foundry red team E2E tests with updated source code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

In CI, devtools_testutils.config.PROXY_URL is a function that must be called. Locally (pip-installed), it's a string constant. Use callable() check to handle both environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
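The `callable()` check can be sketched as follows. `resolve_proxy_url` is a hypothetical helper name and the URL is a placeholder; the actual code in `conftest.py` may be structured differently.

```python
from typing import Callable, Union


def resolve_proxy_url(proxy_url: Union[str, Callable[[], str]]) -> str:
    """Return the proxy URL whether it is exposed as a string or a zero-argument function."""
    # Call it only when it is callable; otherwise use the value directly.
    return proxy_url() if callable(proxy_url) else proxy_url


# Both shapes resolve to the same plain string.
as_constant = resolve_proxy_url("http://localhost:5000")
as_function = resolve_proxy_url(lambda: "http://localhost:5000")
```

This keeps one code path for both environments instead of flipping between `PROXY_URL` and `PROXY_URL()` as the earlier commits did.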

Patch random.sample and random.choice to return deterministic (first-N) results for the model config target test. This ensures the same objectives are selected during both recording and playback, preventing test proxy 404 mismatches caused by non-deterministic objective selection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
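One way to make the selection deterministic is to patch both functions with `unittest.mock`, as in this sketch. The stand-in functions and the objective names are illustrative assumptions; the test in the PR may patch at a different location or use a fixture.

```python
import random
from unittest import mock


def first_n_sample(population, k):
    """Deterministic stand-in for random.sample: take the first k items."""
    return list(population)[:k]


def first_choice(seq):
    """Deterministic stand-in for random.choice: take the first item."""
    return seq[0]


objectives = ["objective-a", "objective-b", "objective-c", "objective-d"]

# Inside the context manager, every call goes to the deterministic stand-ins.
with mock.patch("random.sample", side_effect=first_n_sample), \
        mock.patch("random.choice", side_effect=first_choice):
    picked = random.sample(objectives, 2)
    single = random.choice(objectives)
```

Because the same objectives are chosen on every run, the requests sent during playback match the ones captured during recording.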

# Conflicts:
# sdk/evaluation/azure-ai-evaluation/assets.json
# sdk/evaluation/azure-ai-evaluation/tests/conftest.py

Extend /openai/v1 path normalization to all Azure endpoint patterns (*.openai.azure.com, *.cognitiveservices.azure.com, sovereign clouds) not just Foundry endpoints. PyRIT 0.11+ uses AsyncOpenAI(base_url=) which appends /chat/completions directly, requiring the /openai/v1 prefix. Without this fix, model config targets using classic AOAI endpoints get 404 errors because PyRIT sends requests to the bare endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
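A minimal sketch of this kind of normalization, assuming a suffix match on the host name, is shown below. The constant name `_AZURE_OPENAI_HOST_SUFFIXES` appears in a later commit message, but its contents here are guesses: the first two suffixes come from this commit message, the sovereign-cloud entries are illustrative assumptions, and `normalize_endpoint` is a hypothetical function, not `get_chat_target()`.

```python
from urllib.parse import urlparse

# The first two suffixes come from the commit message; the sovereign-cloud
# entries are illustrative assumptions.
_AZURE_OPENAI_HOST_SUFFIXES = (
    ".openai.azure.com",
    ".cognitiveservices.azure.com",
    ".openai.azure.us",
    ".openai.azure.cn",
)
_V1_PATH = "/openai/v1"


def normalize_endpoint(endpoint: str) -> str:
    """Append /openai/v1 to Azure endpoints that do not already end with it."""
    parsed = urlparse(endpoint)
    host = (parsed.hostname or "").lower()
    if not host.endswith(_AZURE_OPENAI_HOST_SUFFIXES):
        return endpoint  # not a recognized Azure host: leave untouched
    path = parsed.path.rstrip("/")
    if not path.endswith(_V1_PATH):
        path += _V1_PATH
    return parsed._replace(path=path).geturl()
```

For example, `normalize_endpoint("https://my-resource.openai.azure.com")` yields `https://my-resource.openai.azure.com/openai/v1`, so a client that appends `/chat/completions` to the base URL reaches a valid route. The sketch does not handle endpoints that already carry a partial path such as `/openai`.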

RISK_CATEGORY_METRIC_MAP mapped HateUnfairness to HATE_FAIRNESS (legacy name), but the sync eval API returns results under hate_unfairness (canonical name). The scorer's result matching compared against the un-normalized hate_fairness, causing it to never match and silently fall back to score=0, making ASR always 0% for hate_unfairness regardless of actual model behavior.

Changes:
- metric_mapping.py: Map HateUnfairness to HATE_UNFAIRNESS (canonical name). The routing layer in evaluate_with_rai_service_sync normalizes to the legacy name when use_legacy_endpoint=True, so both paths work.
- _rai_scorer.py: Match results against both canonical and legacy aliases using _SYNC_TO_LEGACY_METRIC_NAMES, so future metric renames don't silently break scoring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
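The alias matching can be illustrated with the sketch below. Only the `hate_unfairness` / `hate_fairness` pair is taken from the commit message; the shape of the mapping tables, the result dictionaries, and the helpers `metric_aliases` and `find_result` are assumptions for illustration.

```python
from typing import Dict, List, Optional, Set

# Canonical (sync API) name -> legacy name. Only this pair is from the commit
# message; the table shape is an assumption.
_SYNC_TO_LEGACY_METRIC_NAMES: Dict[str, str] = {"hate_unfairness": "hate_fairness"}
_LEGACY_TO_SYNC_METRIC_NAMES: Dict[str, str] = {
    legacy: sync for sync, legacy in _SYNC_TO_LEGACY_METRIC_NAMES.items()
}


def metric_aliases(metric_name: str) -> Set[str]:
    """Return the metric name together with its canonical or legacy alias, if any."""
    aliases = {metric_name}
    if metric_name in _SYNC_TO_LEGACY_METRIC_NAMES:
        aliases.add(_SYNC_TO_LEGACY_METRIC_NAMES[metric_name])
    if metric_name in _LEGACY_TO_SYNC_METRIC_NAMES:
        aliases.add(_LEGACY_TO_SYNC_METRIC_NAMES[metric_name])
    return aliases


def find_result(results: List[dict], metric_name: str) -> Optional[dict]:
    """Find the first result whose metric matches the name or one of its aliases."""
    wanted = metric_aliases(metric_name)
    for result in results:
        if result.get("metric") in wanted:
            return result
    return None  # the caller decides how to treat a missing result
```

With this lookup, a scorer configured with the legacy name still finds a result reported under the canonical name, instead of silently falling back to a zero score.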

Tests now expect /openai/v1 suffix on all Azure endpoints, matching the updated get_chat_target() behavior needed for PyRIT 0.11+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When target_type=agent and no client_id is provided (local execution, not ACA), fall back to the existing credential to set aml-aca-token header. Previously this header was only set via ACA managed identity, causing 'Authorization failed for seeds' when running agent-target red team scans locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

This reverts commit abb47c4.

Two fixes:
1. _rai_service_target.py: Accept both 'message' (PyRIT 0.11+) and 'prompt_request' (legacy) parameter names in send_prompt_async(). PyRIT 0.11 changed the interface from prompt_request= to message=, causing TypeError on multi-turn and crescendo attacks.
2. _generated_rai_client.py: Set aml-aca-token header from existing credential for agent-type seed requests when no client_id (ACA managed identity) is available. Enables local SDK testing of agent targets without ACA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
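Accepting both keyword names could be done along the lines of this sketch. `DualKeywordTarget` is a hypothetical stand-in class with string payloads; the real target in `_rai_service_target.py` works with PyRIT message objects and has more logic.

```python
import asyncio
from typing import Any, Optional


class DualKeywordTarget:
    """Hypothetical target accepting the new and the legacy keyword name."""

    async def send_prompt_async(
        self,
        *,
        message: Optional[Any] = None,
        prompt_request: Optional[Any] = None,
    ) -> str:
        # Prefer the newer keyword; fall back to the legacy one.
        request = message if message is not None else prompt_request
        if request is None:
            raise ValueError("either 'message' or 'prompt_request' is required")
        return f"handled: {request}"


target = DualKeywordTarget()
new_style = asyncio.run(target.send_prompt_async(message="hi"))
old_style = asyncio.run(target.send_prompt_async(prompt_request="hi"))
```

Callers written against either interface get the same result, which avoids the `TypeError` described above.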

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix list[Message] -> List[Message] type hint for Python 3.8 compat
- Guard _fallback_response against None when retry kwargs are malformed
- Add CHANGELOG entries for metric fix, PyRIT compat, endpoint normalization, and agent token fallback
- Move _AZURE_OPENAI_HOST_SUFFIXES to module-level constant
- Use _validate_attack_details shared helper in multi-turn/crescendo tests
- Change agent token fallback log level from debug to warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

nagkumar91

Code Review

🟡 XPIA agent fallback catches all exceptions silently (Medium)

In _red_team.py, the XPIA prompt fetch for agent targets catches Exception and logs only at debug level:

except Exception as agent_error:
    if target_type_str == "agent":
        self.logger.debug(f"Agent-type XPIA prompt fetch failed ({agent_error}), falling back to model-type")

This swallows ALL exceptions — network failures, bugs, timeouts, JSON parse errors — not just expected 404/auth errors. For agent targets, these errors vanish into debug logs making troubleshooting very difficult.

Fix: Catch specific expected exception types (e.g., HttpResponseError) and re-raise unexpected ones, or at minimum log at warning level.
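The first option in the suggested fix, catching only the expected error type and letting everything else propagate, might look like this sketch. `ServiceHttpError` is a stand-in defined here so the example is self-contained; in the SDK the analogous type would be something like `HttpResponseError`. The set of status codes treated as expected and the function `fetch_with_fallback` are assumptions.

```python
import logging

logger = logging.getLogger("red_team_sketch")


class ServiceHttpError(Exception):
    """Stand-in for an HTTP error type; defined here to keep the sketch self-contained."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def fetch_with_fallback(fetch_agent, fetch_model):
    """Try the agent-type fetch; fall back only on expected auth or not-found errors."""
    try:
        return fetch_agent()
    except ServiceHttpError as err:
        if err.status_code not in (401, 403, 404):
            raise  # unexpected HTTP failure: surface it
        logger.warning("Agent-type XPIA prompt fetch failed (%s), falling back to model-type", err)
        return fetch_model()


def agent_not_found():
    raise ServiceHttpError(404)


def agent_bug():
    raise ValueError("unexpected parse failure")


fallback_result = fetch_with_fallback(agent_not_found, lambda: "model-type prompts")

try:
    fetch_with_fallback(agent_bug, lambda: "model-type prompts")
    bug_propagated = False
except ValueError:
    bug_propagated = True
```

A 404 triggers the fallback with a visible warning, while a parsing bug is no longer hidden behind it.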

🟡 Agent credential fallback: broad except at debug level (Medium)

In _generated_rai_client.py, the agent token fallback silently swallows all errors:

try:
    token = self.token_manager.credential.get_token(TokenScope.DEFAULT_AZURE_MANAGEMENT.value).token
    headers["aml-aca-token"] = token
except Exception:
    self.logger.debug("Could not set aml-aca-token from existing credential", exc_info=True)

If the credential is misconfigured or expired, execution continues without the token header. The subsequent service call will fail with a cryptic auth error that's very hard to trace back to this swallowed exception.

Fix: Log at warning level instead of debug.
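As a rough illustration of the suggested change, the sketch below keeps the best-effort behavior but logs the failure at warning level with the traceback attached. `add_agent_token` and the token callables are hypothetical; the real code obtains the token from `self.token_manager.credential`.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("rai_client_sketch")


def add_agent_token(headers: Dict[str, str], get_token: Callable[[], str]) -> Dict[str, str]:
    """Set the aml-aca-token header if a token can be obtained; warn otherwise."""
    try:
        headers["aml-aca-token"] = get_token()
    except Exception:
        # Warning level plus traceback, so a later auth failure can be traced here.
        logger.warning("Could not set aml-aca-token from existing credential", exc_info=True)
    return headers


def expired_credential() -> str:
    raise PermissionError("token expired")


ok_headers = add_agent_token({}, lambda: "token-123")
missing_headers = add_agent_token({}, expired_credential)
```

Execution still continues without the header, but the cause is now visible at the default log level.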

🟡 `_fallback_response` returns `[]` on missing request (Medium)

In _rai_service_target.py, when neither message nor prompt_request is in retry kwargs:

request = retry_state.kwargs.get("message") or retry_state.kwargs.get("prompt_request")
if request is None:
    logger.warning("_fallback_response: no 'message' or 'prompt_request' in retry kwargs")
    return []

This is the retry error callback after 5 attempts are exhausted. Returning [] breaks the List[Message] contract — callers accessing response[0] will get IndexError. Consider raising an exception instead since this represents an unrecoverable error state.
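Raising instead of returning an empty list could be sketched as follows. The `retry_state` object is simulated with `SimpleNamespace`, and the returned fallback payload is a placeholder string; neither reflects the actual types used in `_rai_service_target.py`.

```python
from types import SimpleNamespace
from typing import List


def fallback_response(retry_state) -> List[str]:
    """Build a fallback response, failing loudly when the original request is missing."""
    request = retry_state.kwargs.get("message") or retry_state.kwargs.get("prompt_request")
    if request is None:
        # No request to build a fallback for: an unrecoverable state.
        raise RuntimeError("_fallback_response: no 'message' or 'prompt_request' in retry kwargs")
    return [f"fallback for {request}"]


ok = fallback_response(SimpleNamespace(kwargs={"message": "hello"}))

try:
    fallback_response(SimpleNamespace(kwargs={}))
    raised = False
except RuntimeError:
    raised = True
```

Callers either receive a non-empty list or an explicit error, so indexing `response[0]` can no longer fail with an unrelated `IndexError`.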

🟠 Import inside method in `_rai_scorer.py` (Low)

async def _score_piece_async(self, ...):
    from azure.ai.evaluation._common.rai_service import (
        _SYNC_TO_LEGACY_METRIC_NAMES, _LEGACY_TO_SYNC_METRIC_NAMES,
    )

This runs on every scoring call. Move to module level for clarity and minor perf improvement.

- Upgrade XPIA agent fallback log from debug to warning (_red_team.py)
- Upgrade aml-aca-token credential fallback log from debug to warning (_generated_rai_client.py)
- Raise RuntimeError instead of returning [] in _fallback_response (_rai_service_target.py)
- Move metric name imports to module level (_rai_scorer.py)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

nagkumar91

All four issues from previous review are addressed:

XPIA agent fallback — log level raised from debug to warning ✅
Agent credential fallback — log level raised from debug to warning ✅
_fallback_response empty list — now raises RuntimeError instead of returning [] ✅
Import inside method — moved to module-level in _rai_scorer.py ✅

LGTM.

* Add 15 Foundry red team E2E tests for full RAISvc contract coverage

Tests cover: basic execution, XPIA, multiple risk categories, application scenarios, strategy combinations, model_config targets, agent callbacks, agent tool context, ProtectedMaterial/CodeVulnerability/TaskAdherence categories, SensitiveDataLeakage, agent-only risk rejection, multi-turn, and crescendo attacks.

Also fixes PROXY_URL() TypeError in conftest.py (PROXY_URL is a str, not callable).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PROXY_URL() call and apply black formatting

- Revert PROXY_URL back to PROXY_URL() (it's a function, not a variable)
- Apply black formatting to assert statements in test_red_team_foundry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix Windows encoding bug in tqdm output and use custom seeds for agent risk categories

- Add _safe_tqdm_write() wrapper to handle UnicodeEncodeError on Windows cp1252 terminals
- Replace all tqdm.write() calls with _safe_tqdm_write() in _red_team.py
- Add custom seed prompt files for agent-only risk categories (task_adherence, sensitive_data_leakage, prohibited_actions) that lack server-side seed data
- Update test_foundry_task_adherence_category and test_foundry_agent_sensitive_data_leakage to use custom_attack_seed_prompts, bypassing get_attack_objectives API
- Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Re-record foundry E2E tests after merging upstream/main

- Merge upstream/main (7 commits) into foundry-e2e-tests branch
- Fix PROXY_URL() call in conftest.py (PROXY_URL is a string, not callable)
- Re-record all 15 foundry red team E2E tests with updated source code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PROXY_URL handling for both callable and string variants

In CI, devtools_testutils.config.PROXY_URL is a function that must be called. Locally (pip-installed), it's a string constant. Use callable() check to handle both environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix test_foundry_with_model_config_target recording playback failure

Patch random.sample and random.choice to return deterministic (first-N) results for the model config target test. This ensures the same objectives are selected during both recording and playback, preventing test proxy 404 mismatches caused by non-deterministic objective selection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix Azure OpenAI endpoint normalization for PyRIT 0.11+ compatibility

Extend /openai/v1 path normalization to all Azure endpoint patterns (*.openai.azure.com, *.cognitiveservices.azure.com, sovereign clouds) not just Foundry endpoints. PyRIT 0.11+ uses AsyncOpenAI(base_url=) which appends /chat/completions directly, requiring the /openai/v1 prefix. Without this fix, model config targets using classic AOAI endpoints get 404 errors because PyRIT sends requests to the bare endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix hate_unfairness metric name mismatch in RAI scorer

RISK_CATEGORY_METRIC_MAP mapped HateUnfairness to HATE_FAIRNESS (legacy name), but the sync eval API returns results under hate_unfairness (canonical name). The scorer's result matching compared against the un-normalized hate_fairness, causing it to never match and silently fall back to score=0, making ASR always 0% for hate_unfairness regardless of actual model behavior.

Changes:
- metric_mapping.py: Map HateUnfairness to HATE_UNFAIRNESS (canonical name). The routing layer in evaluate_with_rai_service_sync normalizes to the legacy name when use_legacy_endpoint=True, so both paths work.
- _rai_scorer.py: Match results against both canonical and legacy aliases using _SYNC_TO_LEGACY_METRIC_NAMES, so future metric renames don't silently break scoring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update recording for model config target test

* Update unit tests for Azure OpenAI endpoint normalization

Tests now expect /openai/v1 suffix on all Azure endpoints, matching the updated get_chat_target() behavior needed for PyRIT 0.11+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix agent seed auth for local SDK usage

When target_type=agent and no client_id is provided (local execution, not ACA), fall back to the existing credential to set aml-aca-token header. Previously this header was only set via ACA managed identity, causing 'Authorization failed for seeds' when running agent-target red team scans locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Fix agent seed auth for local SDK usage"

This reverts commit abb47c4.

* Fix send_prompt_async parameter name for PyRIT 0.11+ and agent seed auth

Two fixes:
1. _rai_service_target.py: Accept both 'message' (PyRIT 0.11+) and 'prompt_request' (legacy) parameter names in send_prompt_async(). PyRIT 0.11 changed the interface from prompt_request= to message=, causing TypeError on multi-turn and crescendo attacks.
2. _generated_rai_client.py: Set aml-aca-token header from existing credential for agent-type seed requests when no client_id (ACA managed identity) is available. Enables local SDK testing of agent targets without ACA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update recordings for foundry E2E tests

* Update unit tests

* Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #45579 review feedback

- Fix list[Message] -> List[Message] type hint for Python 3.8 compat
- Guard _fallback_response against None when retry kwargs are malformed
- Add CHANGELOG entries for metric fix, PyRIT compat, endpoint normalization, and agent token fallback
- Move _AZURE_OPENAI_HOST_SUFFIXES to module-level constant
- Use _validate_attack_details shared helper in multi-turn/crescendo tests
- Change agent token fallback log level from debug to warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review: improve logging, error handling, and imports

- Upgrade XPIA agent fallback log from debug to warning (_red_team.py)
- Upgrade aml-aca-token credential fallback log from debug to warning (_generated_rai_client.py)
- Raise RuntimeError instead of returning [] in _fallback_response (_rai_service_target.py)
- Move metric name imports to module level (_rai_scorer.py)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 8, 2026 18:25

slister1001 requested a review from a team as a code owner March 8, 2026 18:25

github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Mar 8, 2026

Copilot started reviewing on behalf of slister1001 March 8, 2026 18:25

Copilot AI reviewed Mar 8, 2026

slister1001 force-pushed the foundry-e2e-tests branch from 250ada4 to 59748fa Compare March 10, 2026 17:39

slister1001 and others added 11 commits March 10, 2026 14:59

Fix PROXY_URL() call and apply black formatting

baeebc0

Merge upstream/main to update with latest changes

5235c0a

Merge remote-tracking branch 'upstream/main' into foundry-e2e-tests

e963b4d

Update recording for model config target test

c6db68d

Update unit tests for Azure OpenAI endpoint normalization

35168c4

slister1001 force-pushed the foundry-e2e-tests branch from ff6da95 to 35168c4 Compare March 19, 2026 17:38

slister1001 force-pushed the foundry-e2e-tests branch from 77066a9 to abb47c4 Compare March 19, 2026 20:31

slister1001 and others added 3 commits March 19, 2026 17:19

Revert "Fix agent seed auth for local SDK usage"

0d5cd90

Merge remote-tracking branch 'upstream/main' into foundry-e2e-tests

edc4408

slister1001 force-pushed the foundry-e2e-tests branch from 23c612c to 2e5b2ba Compare March 20, 2026 00:03

slister1001 and others added 4 commits March 20, 2026 11:41

Update recordings for foundry E2E tests

4336614

Update unit tests

ecd68b8

Apply black formatting

c162f2c

Merge remote-tracking branch 'upstream/main' into foundry-e2e-tests

711a2fc

slister1001 force-pushed the foundry-e2e-tests branch from 79f30b9 to 24d0ff0 Compare March 23, 2026 19:22

nagkumar91 reviewed Mar 23, 2026

nagkumar91 approved these changes Mar 23, 2026

slister1001 enabled auto-merge (squash) March 23, 2026 21:20

slister1001 merged commit 4e84079 into Azure:main Mar 23, 2026
21 checks passed

slister1001 merged 22 commits into Azure:main from slister1001:foundry-e2e-tests

3 participants
