
Revert _evaluate.py changes from #46436 on mohessie/standardize_output_schema #46835

Merged
m7md7sien merged 2 commits into mohessie/standardize_output_schema from copilot/revert-evaluate-py-changes on May 11, 2026

Conversation

Copilot AI (Contributor) commented on May 11, 2026

This PR removes only the _evaluate.py portion of #46436 from mohessie/standardize_output_schema so those edits can be handled in a separate PR. The target file is restored to exactly match main.

  • Scope

    • Reverts only:
      • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py
    • No other files are changed.
  • What was reverted (the inverse of PR #46436, "Standardize Output Schema for Evalautors", in this file)

    • Reintroduced EVALUATION_PASS_FAIL_MAPPING import.
    • Switched binary aggregation back from *_passed columns to *_result columns.
    • Restored pass counting logic to:
      • value_counts.get(EVALUATION_PASS_FAIL_MAPPING[True], 0)
    • Restored _aggregate_metrics exclusion logic from *_passed back to *_result.
    • Restored related wording/comments ("pass" vs. "passed") to match main.
  • Result

    • _evaluate.py now aligns byte-for-byte with main, while keeping PR scope limited to the requested split.
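
A condensed view of the restored logic (an illustrative fragment, not the full SDK function; it assumes a pandas DataFrame df of evaluation outputs and the EVALUATION_PASS_FAIL_MAPPING constant imported from _constants):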
# Restored aggregation behavior: binary outcomes are read from "*_result" columns
result_columns = [col for col in df.columns if col.startswith("outputs.") and col.endswith("_result")]
for col in result_columns:
    value_counts = df[col].value_counts()
    pass_count = value_counts.get(EVALUATION_PASS_FAIL_MAPPING[True], 0)

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • centralus-2.in.applicationinsights.azure.com
    • Triggering command: /home/REDACTED/bin/azsdk /home/REDACTED/bin/azsdk ingest-telemetry --client-type copilot-cli --event-type skill_invocation --session-id 8462065c-c6cf-473a-9421-84eaf0a44b5b --skill-name find-package-skill (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This PR reverts the changes made to sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py in PR #46436, so that those changes can be split into a separate PR. When this PR is merged into the mohessie/standardize_output_schema branch, the _evaluate.py portion of PR #46436 will be undone.

Goal

On a new branch created off mohessie/standardize_output_schema, restore sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py to exactly match its content on the main branch (i.e., revert the _evaluate.py edits introduced by PR #46436), then open a PR targeting mohessie/standardize_output_schema.

Do not modify any other file in the repository. Do not run formatters, linters, or any tooling that may alter other lines. Do not touch the changelog, version files, or any other file changed in PR #46436.

Exact diff to apply

Apply this unified diff verbatim to sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py on the mohessie/standardize_output_schema branch. This is the inverse of the diff applied in PR #46436:

@@ -26,6 +26,7 @@
 
 from .._constants import (
     CONTENT_SAFETY_DEFECT_RATE_THRESHOLD_DEFAULT,
+    EVALUATION_PASS_FAIL_MAPPING,
     EvaluationMetrics,
     DefaultOpenEncoding,
     Prefixes,
@@ -250,7 +251,7 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
     """
     Aggregate binary output results (pass/fail) from evaluation dataframe.
 
-    For each evaluator, calculates the proportion of passed results.
+    For each evaluator, calculates the proportion of "pass" results.
 
     :param df: The dataframe of evaluation results.
     :type df: ~pandas.DataFrame
@@ -259,12 +260,12 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
     """
     results = {}
 
-    # Find all columns that end with "_passed"
-    passed_columns = [col for col in df.columns if col.startswith("outputs.") and col.endswith("_passed")]
+    # Find all columns that end with "_result"
+    result_columns = [col for col in df.columns if col.startswith("outputs.") and col.endswith("_result")]
 
-    for col in passed_columns:
+    for col in result_columns:
         # Extract the evaluator name from the column name
-        # (outputs.<evaluator>.<metric>_passed)
+        # (outputs.<evaluator>.<metric>_result)
         parts = col.split(".")
         evaluator_name = None
         if len(parts) >= 3:
@@ -291,7 +292,7 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
 
             # Calculate the proportion of EVALUATION_PASS_FAIL_MAPPING[True] results
             total_rows = len(df)
-            pass_count = df[col].sum()
+            pass_count = value_counts.get(EVALUATION_PASS_FAIL_MAPPING[True], 0)
             proportion = pass_count / total_rows if total_rows > 0 else 0.0
 
             # Set the result with the evaluator name as the key
@@ -383,10 +384,10 @@ def _aggregate_metrics(df: pd.DataFrame, evaluators: Dict[str, Callable]) -> Dic
     token_count_cols = _get_token_count_columns_to_exclude(df)
     handled_columns.extend(token_count_cols)
 
-    # Exclude threshold and passed columns from aggregation
+    # Exclude threshold and result columns from aggregation
     # These are per-row metadata, not metrics to be averaged
-    threshold_and_passed_cols = [col for col in df.columns if col.endswith("_threshold") or col.endswith("_passed")]
-    handled_columns.extend(threshold_and_passed_cols)
+    threshold_and_result_cols = [col for col in df.columns if col.endswith("_threshold") or col.endswith("_result")]
+    handled_columns.extend(threshold_and_result_cols)
 
     # For rest of metrics, we will calculate mean
     df.drop(columns=handled_columns, inplace=True)
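
For context, here is a condensed, self-contained sketch of the aggregation behavior this diff restores: binary outcomes are read from "*_result" columns and the pass rate is computed from the string values. This is an illustration only, not the SDK source; EVALUATION_PASS_FAIL_MAPPING is stubbed as {True: "pass", False: "fail"} to mirror how the constant is used here, and the evaluator-name extraction is simplified.

import pandas as pd
from typing import Dict

# Stand-in for azure.ai.evaluation._constants.EVALUATION_PASS_FAIL_MAPPING (assumed values).
EVALUATION_PASS_FAIL_MAPPING = {True: "pass", False: "fail"}

def aggregate_binary_output(df: pd.DataFrame) -> Dict[str, float]:
    """Compute per-evaluator pass rates from 'outputs.<evaluator>.<metric>_result' columns."""
    results: Dict[str, float] = {}
    result_columns = [
        col for col in df.columns if col.startswith("outputs.") and col.endswith("_result")
    ]
    for col in result_columns:
        parts = col.split(".")
        if len(parts) < 3:  # expect outputs.<evaluator>.<metric>_result
            continue
        evaluator_name = parts[1]
        value_counts = df[col].value_counts()
        total_rows = len(df)
        # Count rows whose string value equals the "pass" label, defaulting to 0 if absent.
        pass_count = value_counts.get(EVALUATION_PASS_FAIL_MAPPING[True], 0)
        results[evaluator_name] = pass_count / total_rows if total_rows > 0 else 0.0
    return results

# Example usage:
# df = pd.DataFrame({"outputs.relevance.relevance_result": ["pass", "fail"]})
# aggregate_binary_output(df)  -> {"relevance": 0.5}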

Authoritative reference (simplest approach)

The fastest and safest way to do this is to copy the file from main and overwrite the file on the new branch. The content after revert must be byte-for-byte equal to the version of the file on main:

After the commit, git diff origin/main -- sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py must produce no output (the file matches main exactly).
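
A minimal way to script that check (an illustrative sketch, not part of the PR; it assumes a local git checkout with an origin remote and shells out to the same git diff command quoted above):

import subprocess

TARGET = "sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py"

# Run the verification command from the PR description; an empty diff means the file
# on the current branch is byte-for-byte identical to origin/main.
completed = subprocess.run(
    ["git", "diff", "origin/main", "--", TARGET],
    capture_output=True,
    text=True,
    check=True,
)
if completed.stdout:
    raise SystemExit("Revert incomplete: _evaluate.py still differs from origin/main")
print("OK: _evaluate.py matches origin/main")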

Branch and PR requirements

  • Create a new branch off mohessie/standardize_output_schema (suggested name: m7md7sien/revert-evaluate-py-in-46436).
  • The PR must target the mohessie/standardize_output_schema branch (NOT main). This is critical: the goal is to remove the _evaluate.py changes from PR #46436 ("Standardize Output Schema for Evalautors"), so the PR base must be that PR's head branch.
  • Commit only the change...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Revert changes to _evaluate.py from PR #46436" to "Revert _evaluate.py changes from #46436 on mohessie/standardize_output_schema" on May 11, 2026
Copilot AI requested a review from m7md7sien on May 11, 2026 at 23:02
m7md7sien marked this pull request as ready for review on May 11, 2026 at 23:02
m7md7sien requested a review from a team as a code owner on May 11, 2026 at 23:02
m7md7sien merged commit 36e0cbb into mohessie/standardize_output_schema on May 11, 2026 (2 checks passed)
m7md7sien deleted the copilot/revert-evaluate-py-changes branch on May 11, 2026 at 23:02