
Update Relevance prompt to fix prompt, enhance evaluation steps and define output structure #41762


Open · wants to merge 11 commits into main

Conversation

@ghyadav (Contributor) commented Jun 25, 2025

  • Fix rubric for score 2 so it no longer checks for "incorrect" responses and checks relevance only
  • Update output to return only a score and an explanation
  • Update prompt to be more crisp and include evaluation steps

Description

Relevance V1 Results:
(results screenshots)

Relevance V2 Results:
(results screenshots)

Relevance V2 [2025-07-02 update]:
(results screenshots)

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@Copilot Copilot AI review requested due to automatic review settings June 25, 2025 06:49
@ghyadav ghyadav requested a review from a team as a code owner June 25, 2025 06:49
@github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) on Jun 25, 2025
@Copilot Copilot AI left a comment

Pull Request Overview

This pull request updates the relevance evaluator prompt to fix the relevance rubric, streamline the evaluation instructions, and clearly define the expected output format. The changes include refining the system and user instructions, updating the evaluation steps, and providing explicit sample output formats.

  • Fixed the rubric to check relevance only.
  • Enhanced and clarified evaluation steps.
  • Defined a crisp output structure for score and explanation.
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_relevance/relevance.prompty:152

  • The sample output only includes the `<S2>`-style tags, whereas the earlier instructions mention providing answers between additional tags. Consider updating the sample outputs or the instructions to ensure consistency in the expected output format.
<S2>5</S2>
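
As a side note, here is a tiny sketch of pulling the numeric score out of a tag-wrapped output like the `<S2>5</S2>` sample above. Only the `<S2>` tag is visible in this comment, so any other tags the prompty file defines are not handled, and the helper name is made up for illustration:

```python
import re

# Tiny illustration only: extract the 1-5 score from an output wrapped like
# <S2>5</S2>. Only the <S2> tag is visible in the comment above; any other
# tags the prompty file defines are not handled here.
def extract_score(llm_output: str):
    match = re.search(r"<S2>\s*([1-5])\s*</S2>", llm_output)
    return int(match.group(1)) if match else None

print(extract_score("<S2>5</S2>"))  # 5
```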

@ghyadav (Contributor, Author) commented Jun 25, 2025

Relevance Evaluator Prompt Optimization Summary

Pull Request: [#41762](#41762)

Key Prompt Updates:

  • Removed factual correctness requirement from Score 2 to focus purely on relevance.
  • Simplified output: only returns a Score (1–5) and a brief Explanation (see the usage sketch after this list).
  • Streamlined prompt structure with clearer evaluation steps and concise examples.
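
For orientation, a minimal usage sketch of how the simplified score-plus-explanation output would surface to a caller. This is not code from the PR; the model configuration keys, call parameters, and result field names below are illustrative assumptions rather than the library's exact contract:

```python
# Minimal usage sketch (not from this PR). Configuration keys, call parameters,
# and result field names are illustrative assumptions, not the exact contract.
from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "azure_deployment": "<your-deployment>",                       # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
}

relevance_eval = RelevanceEvaluator(model_config)
result = relevance_eval(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)

# With the V2 prompt the evaluator is expected to surface only a 1-5 score and
# a brief explanation, e.g. something shaped like:
#   {"relevance": 5, "relevance_reason": "The response directly answers the query."}
print(result)
```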

Evaluation Metrics:

| Metric | V1 | V2 | Change |
| --- | --- | --- | --- |
| Intra Model Variance | 0.0934 | 0.0715 | ↓ 23.5% |
| Inter Model Variance (Weighted Avg) | 0.3289 | 0.2947 | ↓ 10.4% |
| Inter Model Agreement (% Mode) | 73.86% | 74.79% | ↑ 1.3% |
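
For reference, a small sketch of how these stability numbers could be reproduced from repeated evaluator runs. The PR does not spell out the exact definitions, so the formulas below (per-sample variance across repeated runs, and agreement with the per-sample mode across models) are assumptions:

```python
# Sketch of the stability metrics reported above. The exact definitions used
# for the table are not spelled out in this PR, so these formulas are assumptions.
from statistics import mode, pvariance

# runs[model][sample_id] = scores from repeated runs of the same judge model
# on the same sample (1-5 Likert scale). Toy numbers for illustration only.
runs = {
    "judge_a": {"q1": [5, 5, 4], "q2": [3, 3, 3]},
    "judge_b": {"q1": [5, 4, 4], "q2": [3, 2, 3]},
    "judge_c": {"q1": [5, 5, 5], "q2": [3, 3, 2]},
}

# Intra-model variance: variance across repeated runs, averaged over all
# (model, sample) pairs.
per_run_variances = [pvariance(s) for per_sample in runs.values() for s in per_sample.values()]
intra_model_variance = sum(per_run_variances) / len(per_run_variances)

# Inter-model agreement (% mode): collapse each model's repeated runs to its
# modal score per sample, then count how often that matches the mode taken
# across all models for the same sample.
sample_ids = list(next(iter(runs.values())))
hits = total = 0
for sid in sample_ids:
    model_scores = [mode(runs[m][sid]) for m in runs]
    consensus = mode(model_scores)
    hits += sum(score == consensus for score in model_scores)
    total += len(model_scores)

print(f"intra-model variance: {intra_model_variance:.4f}")
print(f"inter-model agreement (% mode): {100 * hits / total:.2f}%")
```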

Model-Wise Pairwise Agreement (%):

| Model | Likert V1 | Likert V2 | Binary V1 | Binary V2 |
| --- | --- | --- | --- | --- |
| gpt-4.1 | 68.7 | 67.3 | 87.9 | 90.2 |
| gpt-4.1-mini | 67.9 | 67.0 | 87.1 | 89.9 |
| gpt-4o | 65.6 | 62.7 | 85.6 | 85.9 |
| gpt-4o-mini | 54.1 | 56.0 | 84.3 | 88.4 |
| o4-mini | 71.8 | 62.7 | 88.7 | 89.2 |
| gpt-4.1-nano | 54.2 | 45.6 | 82.2 | 87.0 |
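
Similarly, a sketch of one plausible way the Likert vs. binary agreement columns could be computed. The reference labels being compared against, the pairing scheme, and the pass threshold (>= 4) are all assumptions, since the PR does not define them:

```python
# Sketch of pairwise agreement on the Likert (exact 1-5 match) and binary
# (pass/fail) scales. The reference labels and the pass threshold of >= 4 are
# assumptions for illustration.
PASS_THRESHOLD = 4  # assumed cut-off for collapsing 1-5 scores to pass/fail

def likert_agreement(model_scores, reference_scores):
    """% of samples where the model's 1-5 score exactly matches the reference."""
    matches = sum(m == r for m, r in zip(model_scores, reference_scores))
    return 100 * matches / len(model_scores)

def binary_agreement(model_scores, reference_scores, threshold=PASS_THRESHOLD):
    """% of samples where model and reference agree on pass (>= threshold) vs fail."""
    matches = sum(
        (m >= threshold) == (r >= threshold)
        for m, r in zip(model_scores, reference_scores)
    )
    return 100 * matches / len(model_scores)

# Toy data for illustration only.
model = [5, 4, 3, 5, 2, 4]
reference = [5, 5, 3, 3, 2, 4]
print(f"Likert agreement: {likert_agreement(model, reference):.1f}%")  # ≈66.7%
print(f"Binary agreement: {binary_agreement(model, reference):.1f}%")  # ≈83.3%
```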

Summary of Improvements:

  • Primary Outcome: 23.5% reduction in intra-model variance suggests improved clarity and reduced subjectivity.
  • Secondary Outcome: 10.4% drop in inter-model variance highlights stronger consistency across models.
  • Supporting Evidence: Binary agreement increased across all models, leading to clearer “Pass/Fail” classification for end users.
    Minor drops observed in Likert agreement for gpt-4.1-nano and o4-mini.

@changliu2 (Member) commented:

> Supporting Evidence: Binary agreement increased across all models, leading to clearer “Pass/Fail” classification for end users. Minor drops observed in Likert agreement for gpt-4.1-nano and o4-mini.

Looks good. May I ask you to add the mean and std for agreement metrics across models?

ghyadav added 5 commits June 26, 2025 14:00
update examples for rubric 4 and 5
update examples for rubric 4 and 5
update examples for rubric 4 and 5
@changliu2 (Member) commented:

#sign-off

ghyadav added 5 commits July 2, 2025 13:53
…into ghyadav/relevance_v2_update

# Conflicts:
#	sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_relevance/_relevance.py