[Evaluation] Additional red team e2e tests #45579

Merged
slister1001 merged 22 commits into Azure:main from slister1001:foundry-e2e-tests
Mar 23, 2026
Conversation

@slister1001
Member

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings March 8, 2026 18:25
@slister1001 slister1001 requested a review from a team as a code owner March 8, 2026 18:25
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 8, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds additional end-to-end coverage for RedTeam “Foundry” execution and aligns a few red team internals/outputs with expected contracts and error semantics.

Changes:

  • Expanded test_red_team_foundry.py with new Foundry e2e scenarios (model-config targets, agent targets, new risk categories, multi-turn strategies, and contract error paths).
  • Fixed Foundry baseline objective cache lookup keying to use the same risk-category→objective mapping as the generator path.
  • Treated leftover pending/running statuses as terminal failures when producing final run status, and surfaced RAI evaluation service “error outcome” as undetermined instead of attack success.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_red_team_foundry.py | Adds substantial Foundry red team e2e coverage across targets, strategies, and risk categories. |
| sdk/evaluation/azure-ai-evaluation/tests/conftest.py | Updates OpenAI/test-proxy routing configuration used by recordings/playback. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py | Adjusts final run-level status determination semantics after scan completion. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py | Fixes baseline objective cache key mismatch in Foundry execution path. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py | Detects evaluation-service error outcomes and raises so PyRIT marks results as undetermined. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Documents the bug fixes included in this PR. |

Tests cover: basic execution, XPIA, multiple risk categories, application
scenarios, strategy combinations, model_config targets, agent callbacks,
agent tool context, ProtectedMaterial/CodeVulnerability/TaskAdherence
categories, SensitiveDataLeakage, agent-only risk rejection, multi-turn,
and crescendo attacks.

Also fixes PROXY_URL() TypeError in conftest.py (PROXY_URL is a str, not callable).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 and others added 11 commits March 10, 2026 14:59
- Revert PROXY_URL back to PROXY_URL() (it's a function, not a variable)
- Apply black formatting to assert statements in test_red_team_foundry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t risk categories

- Add _safe_tqdm_write() wrapper to handle UnicodeEncodeError on Windows cp1252 terminals
- Replace all tqdm.write() calls with _safe_tqdm_write() in _red_team.py
- Add custom seed prompt files for agent-only risk categories (task_adherence,
  sensitive_data_leakage, prohibited_actions) that lack server-side seed data
- Update test_foundry_task_adherence_category and test_foundry_agent_sensitive_data_leakage
  to use custom_attack_seed_prompts, bypassing get_attack_objectives API
- Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
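
For readers unfamiliar with the Windows cp1252 issue, the wrapper idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the SDK's `_safe_tqdm_write()`: the function name `safe_write`, its `write` and `encoding` parameters, and the replace-with-`?` policy are all assumptions. In the real code the `write` callable would presumably be `tqdm.write`.

```python
import sys


def safe_write(message, write=print, encoding=None):
    """Write a message, degrading gracefully if the terminal cannot encode it.

    `write` is any callable that takes a string (hypothetical parameter;
    a progress-bar writer such as tqdm.write could be passed here).
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    try:
        write(message)
    except UnicodeEncodeError:
        # Replace characters the target encoding cannot represent with "?"
        # instead of letting the exception abort the scan output.
        fallback = message.encode(encoding, errors="replace").decode(encoding)
        write(fallback)


# Usage with a fake writer that behaves like a strict cp1252 terminal.
captured = []


def cp1252_writer(text):
    text.encode("cp1252")  # raises UnicodeEncodeError for unsupported characters
    captured.append(text)


safe_write("scan finished \u2713", write=cp1252_writer, encoding="cp1252")
```

The second attempt succeeds because the unsupported check mark has already been replaced before the writer sees it.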
- Merge upstream/main (7 commits) into foundry-e2e-tests branch
- Fix PROXY_URL() call in conftest.py (PROXY_URL is a string, not callable)
- Re-record all 15 foundry red team E2E tests with updated source code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In CI, devtools_testutils.config.PROXY_URL is a function that must be
called. Locally (pip-installed), it's a string constant. Use callable()
check to handle both environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
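
The `callable()` check described in the commit above could look something like this. It is a small sketch under the assumption that the only two shapes are "plain string" and "zero-argument function returning a string"; `resolve_proxy_url` is a hypothetical helper name and the URL is a placeholder.

```python
from typing import Callable, Union


def resolve_proxy_url(proxy_url: Union[str, Callable[[], str]]) -> str:
    """Return the proxy URL whether it is exposed as a constant or a function."""
    # callable() is True for functions and False for plain strings,
    # so the same code path works in both environments.
    return proxy_url() if callable(proxy_url) else proxy_url


# Both shapes resolve to the same value.
as_constant = resolve_proxy_url("http://localhost:5000")
as_function = resolve_proxy_url(lambda: "http://localhost:5000")
```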
Patch random.sample and random.choice to return deterministic (first-N)
results for the model config target test. This ensures the same objectives
are selected during both recording and playback, preventing test proxy
404 mismatches caused by non-deterministic objective selection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
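
To illustrate the deterministic patching described in the commit above, here is a minimal sketch using `unittest.mock.patch`. The helper names (`first_n_sample`, `first_choice`, `pick_objectives`) and the objective strings are hypothetical; the actual test may patch at a different location or use a fixture.

```python
import random
from unittest.mock import patch


def first_n_sample(population, k):
    """Deterministic stand-in for random.sample: take the first k items."""
    return list(population)[:k]


def first_choice(seq):
    """Deterministic stand-in for random.choice: take the first item."""
    return seq[0]


def pick_objectives(objectives, count):
    """Hypothetical selection step that normally depends on the RNG."""
    return random.sample(objectives, count)


objectives = ["obj-a", "obj-b", "obj-c", "obj-d"]

# While the patches are active, selection is identical on every run,
# so recorded and replayed requests carry the same objectives.
with patch("random.sample", side_effect=first_n_sample), patch(
    "random.choice", side_effect=first_choice
):
    selected = pick_objectives(objectives, 2)
    chosen = random.choice(objectives)
```

Patching the attributes on the `random` module works here because `pick_objectives` looks up `random.sample` at call time; code that imported the function directly (`from random import sample`) would need to be patched where it is used.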
# Conflicts:
#	sdk/evaluation/azure-ai-evaluation/assets.json
#	sdk/evaluation/azure-ai-evaluation/tests/conftest.py
Extend /openai/v1 path normalization to all Azure endpoint patterns
(*.openai.azure.com, *.cognitiveservices.azure.com, sovereign clouds)
not just Foundry endpoints. PyRIT 0.11+ uses AsyncOpenAI(base_url=)
which appends /chat/completions directly, requiring the /openai/v1 prefix.

Without this fix, model config targets using classic AOAI endpoints
get 404 errors because PyRIT sends requests to the bare endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
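
The normalization described in the commit above can be sketched as below. This is an assumption-laden illustration: the suffix tuple is a guess at the kind of hosts covered (the real `_AZURE_OPENAI_HOST_SUFFIXES` constant may list more, including sovereign clouds), `normalize_endpoint` is a hypothetical name, and the sketch does not try to handle endpoints that already carry some other path.

```python
from urllib.parse import urlparse

# Hypothetical suffix list; the SDK's actual constant may differ.
AZURE_OPENAI_HOST_SUFFIXES = (
    ".openai.azure.com",
    ".cognitiveservices.azure.com",
)


def normalize_endpoint(endpoint: str) -> str:
    """Append /openai/v1 to Azure-style endpoints that lack it."""
    host = (urlparse(endpoint).hostname or "").lower()
    if not host.endswith(AZURE_OPENAI_HOST_SUFFIXES):
        return endpoint  # non-Azure hosts are left untouched
    trimmed = endpoint.rstrip("/")
    if trimmed.endswith("/openai/v1"):
        return trimmed  # already normalized
    return trimmed + "/openai/v1"


classic = normalize_endpoint("https://myres.openai.azure.com/")
already = normalize_endpoint("https://myres.cognitiveservices.azure.com/openai/v1")
other = normalize_endpoint("https://api.example.com/v1")
```

With the prefix in place, a client that appends `/chat/completions` to its base URL ends up at a path under `/openai/v1` rather than at the bare host.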
RISK_CATEGORY_METRIC_MAP mapped HateUnfairness to HATE_FAIRNESS (legacy name),
but the sync eval API returns results under hate_unfairness (canonical name).
The scorer's result matching compared against the un-normalized hate_fairness,
causing it to never match and silently fall back to score=0 — making ASR
always 0% for hate_unfairness regardless of actual model behavior.

Changes:
- metric_mapping.py: Map HateUnfairness to HATE_UNFAIRNESS (canonical name).
  The routing layer in evaluate_with_rai_service_sync normalizes to the
  legacy name when use_legacy_endpoint=True, so both paths work.
- _rai_scorer.py: Match results against both canonical and legacy aliases
  using _SYNC_TO_LEGACY_METRIC_NAMES, so future metric renames don't
  silently break scoring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
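
The alias matching described in the commit above might be sketched like this. The alias table contents, the `find_result` helper, and the result dictionaries with a `"metric"` key are all hypothetical; the commit only names `_SYNC_TO_LEGACY_METRIC_NAMES`, not its contents or the response shape.

```python
# Hypothetical alias table mapping canonical (sync) names to legacy names.
SYNC_TO_LEGACY = {"hate_unfairness": "hate_fairness"}
LEGACY_TO_SYNC = {legacy: sync for sync, legacy in SYNC_TO_LEGACY.items()}


def metric_aliases(name):
    """Return the set of names that should be treated as the same metric."""
    aliases = {name}
    if name in SYNC_TO_LEGACY:
        aliases.add(SYNC_TO_LEGACY[name])
    if name in LEGACY_TO_SYNC:
        aliases.add(LEGACY_TO_SYNC[name])
    return aliases


def find_result(results, metric_name):
    """Return the first result matching any alias of metric_name, else None."""
    wanted = metric_aliases(metric_name)
    for result in results:
        if result.get("metric") in wanted:
            return result
    return None  # caller decides how to treat a missing result


# The service reports the canonical name; lookups by either name succeed.
results = [{"metric": "violence", "score": 1}, {"metric": "hate_unfairness", "score": 5}]
by_legacy = find_result(results, "hate_fairness")
by_canonical = find_result(results, "hate_unfairness")
```

Returning `None` on a miss, rather than a default score, keeps an unmatched metric visible to the caller instead of silently counting as a score of 0.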
Tests now expect /openai/v1 suffix on all Azure endpoints, matching
the updated get_chat_target() behavior needed for PyRIT 0.11+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When target_type=agent and no client_id is provided (local execution,
not ACA), fall back to the existing credential to set aml-aca-token
header. Previously this header was only set via ACA managed identity,
causing 'Authorization failed for seeds' when running agent-target
red team scans locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 and others added 3 commits March 19, 2026 17:19
Two fixes:

1. _rai_service_target.py: Accept both 'message' (PyRIT 0.11+) and
   'prompt_request' (legacy) parameter names in send_prompt_async().
   PyRIT 0.11 changed the interface from prompt_request= to message=,
   causing TypeError on multi-turn and crescendo attacks.

2. _generated_rai_client.py: Set aml-aca-token header from existing
   credential for agent-type seed requests when no client_id (ACA
   managed identity) is available. Enables local SDK testing of
   agent targets without ACA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
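
As a rough illustration of the first fix in the commit above, a method can accept both keyword names and use whichever one the caller supplied. `SketchTarget` is a hypothetical stand-in, not the SDK's target class, and the echo response is only there to make the sketch runnable.

```python
import asyncio
from typing import Any, List, Optional


class SketchTarget:
    """Hypothetical prompt target that tolerates both parameter names."""

    async def send_prompt_async(
        self,
        *,
        message: Optional[Any] = None,
        prompt_request: Optional[Any] = None,
    ) -> List[Any]:
        # Prefer the newer keyword, fall back to the legacy one.
        request = message if message is not None else prompt_request
        if request is None:
            raise ValueError("Either 'message' or 'prompt_request' must be provided")
        return [f"echo: {request}"]


target = SketchTarget()
new_style = asyncio.run(target.send_prompt_async(message="hi"))
old_style = asyncio.run(target.send_prompt_async(prompt_request="hi"))
```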
slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request Mar 23, 2026
- Fix list[Message] -> List[Message] type hint for Python 3.8 compat
- Guard _fallback_response against None when retry kwargs are malformed
- Add CHANGELOG entries for metric fix, PyRIT compat, endpoint
  normalization, and agent token fallback
- Move _AZURE_OPENAI_HOST_SUFFIXES to module-level constant
- Use _validate_attack_details shared helper in multi-turn/crescendo tests
- Change agent token fallback log level from debug to warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment

Code Review

🟡 XPIA agent fallback catches all exceptions silently (Medium)

In _red_team.py, the XPIA prompt fetch for agent targets catches Exception and logs only at debug level:

except Exception as agent_error:
    if target_type_str == "agent":
        self.logger.debug(f"Agent-type XPIA prompt fetch failed ({agent_error}), falling back to model-type")

This swallows ALL exceptions — network failures, bugs, timeouts, JSON parse errors — not just expected 404/auth errors. For agent targets, these errors vanish into debug logs making troubleshooting very difficult.

Fix: Catch specific expected exception types (e.g., HttpResponseError) and re-raise unexpected ones, or at minimum log at warning level.


🟡 Agent credential fallback — bare except at debug level (Medium)

In _generated_rai_client.py, the agent token fallback silently swallows all errors:

try:
    token = self.token_manager.credential.get_token(TokenScope.DEFAULT_AZURE_MANAGEMENT.value).token
    headers["aml-aca-token"] = token
except Exception:
    self.logger.debug("Could not set aml-aca-token from existing credential", exc_info=True)

If the credential is misconfigured or expired, execution continues without the token header. The subsequent service call will fail with a cryptic auth error that's very hard to trace back to this swallowed exception.

Fix: Log at warning level instead of debug.


🟡 _fallback_response returns [] on missing request (Medium)

In _rai_service_target.py, when neither message nor prompt_request is in retry kwargs:

request = retry_state.kwargs.get("message") or retry_state.kwargs.get("prompt_request")
if request is None:
    logger.warning("_fallback_response: no 'message' or 'prompt_request' in retry kwargs")
    return []

This is the retry error callback after 5 attempts are exhausted. Returning [] breaks the List[Message] contract — callers accessing response[0] will get IndexError. Consider raising an exception instead since this represents an unrecoverable error state.
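
The "raise instead of returning []" suggestion could be sketched as follows. `SimpleNamespace` stands in for the retry library's state object (only a `kwargs` attribute is assumed), and the fallback payload is invented for illustration; this is not the body of the real `_fallback_response`.

```python
from types import SimpleNamespace


def fallback_response(retry_state):
    """Sketch of a retry-exhausted callback that fails loudly on a missing request."""
    request = retry_state.kwargs.get("message") or retry_state.kwargs.get("prompt_request")
    if request is None:
        # No request to build a response from: surface the error rather than
        # returning an empty list that would fail later with IndexError.
        raise RuntimeError(
            "retries exhausted and no 'message' or 'prompt_request' in retry kwargs"
        )
    return [f"fallback for: {request}"]


ok_state = SimpleNamespace(kwargs={"message": "hello"})
bad_state = SimpleNamespace(kwargs={})
ok_response = fallback_response(ok_state)
```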


🟠 Import inside method in _rai_scorer.py (Low)

async def _score_piece_async(self, ...):
    from azure.ai.evaluation._common.rai_service import (
        _SYNC_TO_LEGACY_METRIC_NAMES, _LEGACY_TO_SYNC_METRIC_NAMES,
    )

This runs on every scoring call. Move to module level for clarity and minor perf improvement.

- Upgrade XPIA agent fallback log from debug to warning (_red_team.py)
- Upgrade aml-aca-token credential fallback log from debug to warning (_generated_rai_client.py)
- Raise RuntimeError instead of returning [] in _fallback_response (_rai_service_target.py)
- Move metric name imports to module level (_rai_scorer.py)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 nagkumar91 left a comment

All four issues from previous review are addressed:

  1. XPIA agent fallback — log level raised from debug to warning
  2. Agent credential fallback — log level raised from debug to warning
  3. _fallback_response empty list — now raises RuntimeError instead of returning []
  4. Import inside method — moved to module-level in _rai_scorer.py

LGTM.

@slister1001 slister1001 enabled auto-merge (squash) March 23, 2026 21:20
@slister1001 slister1001 merged commit 4e84079 into Azure:main Mar 23, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Mar 24, 2026
* Add 15 Foundry red team E2E tests for full RAISvc contract coverage

Tests cover: basic execution, XPIA, multiple risk categories, application
scenarios, strategy combinations, model_config targets, agent callbacks,
agent tool context, ProtectedMaterial/CodeVulnerability/TaskAdherence
categories, SensitiveDataLeakage, agent-only risk rejection, multi-turn,
and crescendo attacks.

Also fixes PROXY_URL() TypeError in conftest.py (PROXY_URL is a str, not callable).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PROXY_URL() call and apply black formatting

- Revert PROXY_URL back to PROXY_URL() (it's a function, not a variable)
- Apply black formatting to assert statements in test_red_team_foundry.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix Windows encoding bug in tqdm output and use custom seeds for agent risk categories

- Add _safe_tqdm_write() wrapper to handle UnicodeEncodeError on Windows cp1252 terminals
- Replace all tqdm.write() calls with _safe_tqdm_write() in _red_team.py
- Add custom seed prompt files for agent-only risk categories (task_adherence,
  sensitive_data_leakage, prohibited_actions) that lack server-side seed data
- Update test_foundry_task_adherence_category and test_foundry_agent_sensitive_data_leakage
  to use custom_attack_seed_prompts, bypassing get_attack_objectives API
- Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Re-record foundry E2E tests after merging upstream/main

- Merge upstream/main (7 commits) into foundry-e2e-tests branch
- Fix PROXY_URL() call in conftest.py (PROXY_URL is a string, not callable)
- Re-record all 15 foundry red team E2E tests with updated source code

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PROXY_URL handling for both callable and string variants

In CI, devtools_testutils.config.PROXY_URL is a function that must be
called. Locally (pip-installed), it's a string constant. Use callable()
check to handle both environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix test_foundry_with_model_config_target recording playback failure

Patch random.sample and random.choice to return deterministic (first-N)
results for the model config target test. This ensures the same objectives
are selected during both recording and playback, preventing test proxy
404 mismatches caused by non-deterministic objective selection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix Azure OpenAI endpoint normalization for PyRIT 0.11+ compatibility

Extend /openai/v1 path normalization to all Azure endpoint patterns
(*.openai.azure.com, *.cognitiveservices.azure.com, sovereign clouds)
not just Foundry endpoints. PyRIT 0.11+ uses AsyncOpenAI(base_url=)
which appends /chat/completions directly, requiring the /openai/v1 prefix.

Without this fix, model config targets using classic AOAI endpoints
get 404 errors because PyRIT sends requests to the bare endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix hate_unfairness metric name mismatch in RAI scorer

RISK_CATEGORY_METRIC_MAP mapped HateUnfairness to HATE_FAIRNESS (legacy name),
but the sync eval API returns results under hate_unfairness (canonical name).
The scorer's result matching compared against the un-normalized hate_fairness,
causing it to never match and silently fall back to score=0 — making ASR
always 0% for hate_unfairness regardless of actual model behavior.

Changes:
- metric_mapping.py: Map HateUnfairness to HATE_UNFAIRNESS (canonical name).
  The routing layer in evaluate_with_rai_service_sync normalizes to the
  legacy name when use_legacy_endpoint=True, so both paths work.
- _rai_scorer.py: Match results against both canonical and legacy aliases
  using _SYNC_TO_LEGACY_METRIC_NAMES, so future metric renames don't
  silently break scoring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update recording for model config target test

* Update unit tests for Azure OpenAI endpoint normalization

Tests now expect /openai/v1 suffix on all Azure endpoints, matching
the updated get_chat_target() behavior needed for PyRIT 0.11+.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix agent seed auth for local SDK usage

When target_type=agent and no client_id is provided (local execution,
not ACA), fall back to the existing credential to set aml-aca-token
header. Previously this header was only set via ACA managed identity,
causing 'Authorization failed for seeds' when running agent-target
red team scans locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Fix agent seed auth for local SDK usage"

This reverts commit abb47c4.

* Fix send_prompt_async parameter name for PyRIT 0.11+ and agent seed auth

Two fixes:

1. _rai_service_target.py: Accept both 'message' (PyRIT 0.11+) and
   'prompt_request' (legacy) parameter names in send_prompt_async().
   PyRIT 0.11 changed the interface from prompt_request= to message=,
   causing TypeError on multi-turn and crescendo attacks.

2. _generated_rai_client.py: Set aml-aca-token header from existing
   credential for agent-type seed requests when no client_id (ACA
   managed identity) is available. Enables local SDK testing of
   agent targets without ACA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update recordings for foundry E2E tests

* Update unit tests

* Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #45579 review feedback

- Fix list[Message] -> List[Message] type hint for Python 3.8 compat
- Guard _fallback_response against None when retry kwargs are malformed
- Add CHANGELOG entries for metric fix, PyRIT compat, endpoint
  normalization, and agent token fallback
- Move _AZURE_OPENAI_HOST_SUFFIXES to module-level constant
- Use _validate_attack_details shared helper in multi-turn/crescendo tests
- Change agent token fallback log level from debug to warning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review: improve logging, error handling, and imports

- Upgrade XPIA agent fallback log from debug to warning (_red_team.py)
- Upgrade aml-aca-token credential fallback log from debug to warning (_generated_rai_client.py)
- Raise RuntimeError instead of returning [] in _fallback_response (_rai_service_target.py)
- Move metric name imports to module level (_rai_scorer.py)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>