
[Evaluation] Fix UTF-8 encoding for red team JSONL files on Windows #45500

Merged
slister1001 merged 8 commits into Azure:main from slister1001:fix/redteam-encoding
Mar 10, 2026

@slister1001
Member

What

Add explicit encoding="utf-8" to all open() calls in the PyRIT result processing path for red team scans.

Why

On Windows, Python's open() in text mode defaults to the system locale encoding (typically cp1252, reported as the charmap codec in error messages) unless an encoding is passed. When JSONL files contain non-ASCII characters (from the UnicodeConfusable strategy or CJK-language prompts), reading them back fails with UnicodeDecodeError, which leaves 0 attack_details in the final results.
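The failure mode can be reproduced on any platform with a minimal sketch. It assumes the Windows default is cp1252 and passes that encoding explicitly to simulate it; `roundtrip` and `record` are hypothetical names for illustration, not code from this PR.

```python
import json
import os
import tempfile

# "こ" encodes to the bytes E3 81 93 in UTF-8; byte 0x81 is undefined in cp1252.
record = {"prompt": "こんにちは"}


def roundtrip(read_encoding: str) -> dict:
    """Write one JSONL record as UTF-8, then read it back with read_encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
        with open(path, "r", encoding=read_encoding) as f:
            return json.loads(f.readline())


# Simulate the Windows locale default by asking for cp1252 explicitly.
try:
    roundtrip("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Reading with the same encoding that was used for writing succeeds.
utf8_result = roundtrip("utf-8")
```

Not every non-ASCII byte sequence raises under cp1252; some decode silently into the wrong characters, which is another reason to state the encoding on both the read and the write side.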

What Changed

4 open() calls fixed across 2 files:

  • _result_processor.py:202 - data file read
  • _utils/formatting_utils.py:306 - existing file line count read
  • _utils/formatting_utils.py:338 - JSONL write (replace path)
  • _utils/formatting_utils.py:378 - JSONL write (new file path)
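The shape of the change at those call sites can be sketched as follows. The helper names (`count_lines`, `write_jsonl`, `read_jsonl`) are hypothetical stand-ins, not the functions in `_result_processor.py` or `formatting_utils.py`; only the `encoding="utf-8"` argument reflects what the PR describes.

```python
import json
import os
import tempfile
from typing import Iterable, List


def count_lines(path: str) -> int:
    """Count lines in an existing JSONL file, decoding as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def write_jsonl(path: str, records: Iterable[dict]) -> None:
    """Write one JSON object per line, encoding as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[dict]:
    """Read a JSONL file back, decoding as UTF-8 and skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.jsonl")
    # The second row contains a Cyrillic "а" (U+0430) in place of Latin "a".
    rows = [{"text": "日本語"}, {"text": "p\u0430ypal"}]
    write_jsonl(path, rows)
    line_count = count_lines(path)
    loaded = read_jsonl(path)
```

Passing the encoding at every call site keeps the behavior independent of the machine's locale settings.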

Testing

All 367 red team unit tests pass. Fixes bug bash tests 1.7 (UnicodeConfusable) and 1.16 (Japanese/Chinese).

Risk

None - the write side already produces UTF-8 content; this makes the read side match.

@github-actions github-actions Bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Mar 4, 2026
Add explicit encoding='utf-8' to all file open() calls in the PyRIT result
processing path. Without this, Windows defaults to the system locale encoding
(charmap/cp1252), causing UnicodeDecodeError when reading JSONL files containing
non-ASCII characters from UnicodeConfusable strategy or CJK languages.

Fixes: Tests 1.7 (UnicodeConfusable), 1.16 (Japanese/Chinese)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 force-pushed the fix/redteam-encoding branch from dab8c4e to c6e7d30 Compare March 4, 2026 01:08
Test CJK characters, Unicode confusables, and mixed scripts to prevent
future regressions of the charmap encoding bug on Windows.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request Mar 4, 2026
Cherry-picked all 3 red team bug fixes for the bug bash:
- Fix UTF-8 encoding for JSONL files on Windows (PR Azure#45500)
- Fix model_config 404 for Foundry-style endpoints (PR Azure#45502)
- Add early input validation to scan() (PR Azure#45501)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 marked this pull request as ready for review March 9, 2026 15:37
@slister1001 slister1001 requested a review from a team as a code owner March 9, 2026 15:37
Copilot AI review requested due to automatic review settings March 9, 2026 15:37
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses Windows-specific UnicodeDecodeError failures when reading red team JSONL files that contain non-ASCII text by ensuring UTF-8 is explicitly used for JSONL file I/O in the PyRIT result processing flow.

Changes:

  • Add encoding="utf-8" to JSONL reads in the red team result processor.
  • Add encoding="utf-8" to JSONL read/write operations in red team formatting utilities.
  • Add unit tests intended to cover Unicode JSONL round-trips; update version/changelog for the next release.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py | Reads JSONL data files using explicit UTF-8 to avoid Windows locale decode errors. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_utils/formatting_utils.py | Uses explicit UTF-8 when reading existing JSONL and writing JSONL output files. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_formatting_utils.py | Adds tests intended to validate Unicode JSONL read/write behavior. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_version.py | Bumps package version to 1.16.0. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Adds an Unreleased 1.16.0 entry describing the UTF-8 fix. |

Comment thread sdk/evaluation/azure-ai-evaluation/CHANGELOG.md Outdated
Member

@nagkumar91 nagkumar91 left a comment

Core fix is correct — adding encoding="utf-8" to all 4 open() calls is the right approach for Windows compatibility.

Minor suggestions (non-blocking):

  • The new tests do manual open() round-trips but don't exercise the actual production functions (write_pyrit_outputs_to_file() / to_red_team_result()). Consider adding integration tests that call those functions with Unicode data to validate the complete fixed path.
  • Note: PR #45502 also bumps the version to 1.16.0, so whichever merges second will hit a conflict on _version.py and CHANGELOG.md.
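A regression test along the lines of the first suggestion might look like the sketch below. The signatures of write_pyrit_outputs_to_file() and to_red_team_result() are not shown in this conversation, so the sketch uses a hypothetical stand-in, `write_outputs`, where a real test would call the production function; the sample strings are also illustrative.

```python
import json
import os
import tempfile

# Illustrative inputs covering the cases named in the PR description.
SAMPLES = {
    "cjk": "日本語と中文",
    "confusable": "p\u0430ssword",  # Cyrillic "а" (U+0430) instead of Latin "a"
    "mixed": "hello こんにちは \u0440",  # Latin, Japanese, and Cyrillic "р"
}


def write_outputs(path: str, prompts: list) -> None:
    """Hypothetical stand-in for the production writer under test."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")


def test_unicode_jsonl_round_trip() -> None:
    for name, text in SAMPLES.items():
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, f"{name}.jsonl")
            write_outputs(path, [text])
            # Read back the same way the fixed read side does.
            with open(path, "r", encoding="utf-8") as f:
                rows = [json.loads(line) for line in f if line.strip()]
            assert rows == [{"prompt": text}], name


# Run directly when not using a test runner.
test_unicode_jsonl_round_trip()
round_trip_ok = True
```

On a machine whose locale encoding is already UTF-8, a test like this passes with or without the fix, so it gives the most signal when run on Windows CI.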

slister1001 and others added 5 commits March 9, 2026 11:56
…GELOG

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@slister1001 slister1001 enabled auto-merge (squash) March 10, 2026 00:48
@slister1001 slister1001 merged commit 6a69095 into Azure:main Mar 10, 2026
21 checks passed
aprilk-ms pushed a commit that referenced this pull request Mar 11, 2026
…45500)

* Fix UTF-8 encoding for red team JSONL files on Windows

Add explicit encoding='utf-8' to all file open() calls in the PyRIT result
processing path. Without this, Windows defaults to the system locale encoding
(charmap/cp1252), causing UnicodeDecodeError when reading JSONL files containing
non-ASCII characters from UnicodeConfusable strategy or CJK languages.

Fixes: Tests 1.7 (UnicodeConfusable), 1.16 (Japanese/Chinese)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add encoding regression tests for non-ASCII JSONL round-trip

Test CJK characters, Unicode confusables, and mixed scripts to prevent
future regressions of the charmap encoding bug on Windows.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Format with black

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address review comments: test production code paths, consolidate CHANGELOG

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Apply black formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
singankit pushed a commit that referenced this pull request Mar 16, 2026