
Update Relevance prompt to fix prompt, enhance evaluation steps and define output structure #41762


Open · wants to merge 11 commits into main

Conversation

@ghyadav (Contributor) commented Jun 25, 2025

  • Fix rubric for score 2 so it no longer checks for "incorrect" responses and checks relevance only
  • Update output to return only a score and an explanation
  • Update prompt to be more crisp and include evaluation steps

Description

Relevance V1 Results:
(results screenshots)

Relevance V2 Results:
(results screenshots)

Relevance V2 [2025-07-02 update]:
(results screenshots)

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@Copilot Copilot AI review requested due to automatic review settings June 25, 2025 06:49
@ghyadav ghyadav requested a review from a team as a code owner June 25, 2025 06:49
@github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) on Jun 25, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This pull request updates the relevance evaluator prompt to fix the relevance rubric, streamline the evaluation instructions, and clearly define the expected output format. The changes include refining the system and user instructions, updating the evaluation steps, and providing explicit sample output formats.

  • Fixed the rubric to check relevance only.
  • Enhanced and clarified evaluation steps.
  • Defined a crisp output structure for score and explanation.
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty:152

  • The sample output only includes the `<S2>`-style tags, whereas the earlier instructions mention providing answers between additional tags. Consider updating the sample outputs or the instructions to ensure consistency in the expected output format.
<S2>5</S2>
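
As a side note, here is a tiny sketch of pulling the numeric score out of a tag-wrapped output like the `<S2>5</S2>` sample above. Only the `<S2>` tag is visible in this comment, so any other tags the prompty file defines are not handled, and the helper name is made up for illustration:

```python
import re

# Tiny illustration only: extract the 1-5 score from an output wrapped like
# <S2>5</S2>. Only the <S2> tag is visible in the comment above; any other
# tags the prompty file defines are not handled here.
def extract_score(llm_output: str):
    match = re.search(r"<S2>\s*([1-5])\s*</S2>", llm_output)
    return int(match.group(1)) if match else None

print(extract_score("<S2>5</S2>"))  # 5
```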

@ghyadav (Contributor, Author) commented Jun 25, 2025

Relevance Evaluator Prompt Optimization Summary

Pull Request: [#41762](#41762)

Key Prompt Updates:

  • Removed factual correctness requirement from Score 2 to focus purely on relevance.
  • Simplified output: only returns a Score (1–5) and a brief Explanation (see the usage sketch after this list).
  • Streamlined prompt structure with clearer evaluation steps and concise examples.
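
For orientation, a minimal usage sketch of how the simplified score-plus-explanation output would surface to a caller. This is not code from the PR; the model configuration keys, call parameters, and result field names below are illustrative assumptions rather than the library's exact contract:

```python
# Minimal usage sketch (not from this PR). Configuration keys, call parameters,
# and result field names are illustrative assumptions, not the exact contract.
from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "azure_deployment": "<your-deployment>",                       # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
}

relevance_eval = RelevanceEvaluator(model_config)
result = relevance_eval(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)

# With the V2 prompt the evaluator is expected to surface only a 1-5 score and
# a brief explanation, e.g. something shaped like:
#   {"relevance": 5, "relevance_reason": "The response directly answers the query."}
print(result)
```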

Evaluation Metrics:

| Metric | V1 | V2 | Change |
| --- | --- | --- | --- |
| Intra Model Variance | 0.0934 | 0.0715 | ↓ 23.5% |
| Inter Model Variance (Weighted Avg) | 0.3289 | 0.2947 | ↓ 10.4% |
| Inter Model Agreement (% Mode) | 73.86% | 74.79% | ↑ 1.3% |
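
For reference, a small sketch of how these stability numbers could be reproduced from repeated evaluator runs. The PR does not spell out the exact definitions, so the formulas below (per-sample variance across repeated runs, and agreement with the per-sample mode across models) are assumptions:

```python
# Sketch of the stability metrics reported above. The exact definitions used
# for the table are not spelled out in this PR, so these formulas are assumptions.
from statistics import mode, pvariance

# runs[model][sample_id] = scores from repeated runs of the same judge model
# on the same sample (1-5 Likert scale). Toy numbers for illustration only.
runs = {
    "judge_a": {"q1": [5, 5, 4], "q2": [3, 3, 3]},
    "judge_b": {"q1": [5, 4, 4], "q2": [3, 2, 3]},
    "judge_c": {"q1": [5, 5, 5], "q2": [3, 3, 2]},
}

# Intra-model variance: variance across repeated runs, averaged over all
# (model, sample) pairs.
per_run_variances = [pvariance(s) for per_sample in runs.values() for s in per_sample.values()]
intra_model_variance = sum(per_run_variances) / len(per_run_variances)

# Inter-model agreement (% mode): collapse each model's repeated runs to its
# modal score per sample, then count how often that matches the mode taken
# across all models for the same sample.
sample_ids = list(next(iter(runs.values())))
hits = total = 0
for sid in sample_ids:
    model_scores = [mode(runs[m][sid]) for m in runs]
    consensus = mode(model_scores)
    hits += sum(score == consensus for score in model_scores)
    total += len(model_scores)

print(f"intra-model variance: {intra_model_variance:.4f}")
print(f"inter-model agreement (% mode): {100 * hits / total:.2f}%")
```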

Model-Wise Pairwise Agreement (%):

| Model | Likert V1 | Likert V2 | Binary V1 | Binary V2 |
| --- | --- | --- | --- | --- |
| gpt-4.1 | 68.7 | 67.3 | 87.9 | 90.2 |
| gpt-4.1-mini | 67.9 | 67.0 | 87.1 | 89.9 |
| gpt-4o | 65.6 | 62.7 | 85.6 | 85.9 |
| gpt-4o-mini | 54.1 | 56.0 | 84.3 | 88.4 |
| o4-mini | 71.8 | 62.7 | 88.7 | 89.2 |
| gpt-4.1-nano | 54.2 | 45.6 | 82.2 | 87.0 |
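
Similarly, a sketch of one plausible way the Likert vs. binary agreement columns could be computed. The reference labels being compared against, the pairing scheme, and the pass threshold (>= 4) are all assumptions, since the PR does not define them:

```python
# Sketch of pairwise agreement on the Likert (exact 1-5 match) and binary
# (pass/fail) scales. The reference labels and the pass threshold of >= 4 are
# assumptions for illustration.
PASS_THRESHOLD = 4  # assumed cut-off for collapsing 1-5 scores to pass/fail

def likert_agreement(model_scores, reference_scores):
    """% of samples where the model's 1-5 score exactly matches the reference."""
    matches = sum(m == r for m, r in zip(model_scores, reference_scores))
    return 100 * matches / len(model_scores)

def binary_agreement(model_scores, reference_scores, threshold=PASS_THRESHOLD):
    """% of samples where model and reference agree on pass (>= threshold) vs fail."""
    matches = sum(
        (m >= threshold) == (r >= threshold)
        for m, r in zip(model_scores, reference_scores)
    )
    return 100 * matches / len(model_scores)

# Toy data for illustration only.
model = [5, 4, 3, 5, 2, 4]
reference = [5, 5, 3, 3, 2, 4]
print(f"Likert agreement: {likert_agreement(model, reference):.1f}%")  # ≈66.7%
print(f"Binary agreement: {binary_agreement(model, reference):.1f}%")  # ≈83.3%
```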

Summary of Improvements:

  • Primary Outcome: 23.5% reduction in intra-model variance suggests improved clarity and reduced subjectivity.
  • Secondary Outcome: 10.4% drop in inter-model variance highlights stronger consistency across models.
  • Supporting Evidence: Binary agreement increased across all models, leading to clearer “Pass/Fail” classification for end users.
    Minor drops observed in Likert agreement for gpt-4.1-nano and o4-mini.

@changliu2 (Member) commented:

> Supporting Evidence: Binary agreement increased across all models, leading to clearer “Pass/Fail” classification for end users. Minor drops observed in Likert agreement for gpt-4.1-nano and o4-mini.

Looks good. May I ask you to add the mean and std for agreement metrics across models?

ghyadav added 5 commits June 26, 2025 14:00
update examples for rubric 4 and 5
update examples for rubric 4 and 5
update examples for rubric 4 and 5
@changliu2 (Member) commented:

#sign-off

ghyadav added 5 commits July 2, 2025 13:53
…into ghyadav/relevance_v2_update

# Conflicts:
#	sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_relevance/_relevance.py