
Add evaluate answer with llm-as-a-judge#459

Open
hesteban-tuenti wants to merge 7 commits into Telefonica:master from hesteban-tuenti:feat_add_evaluate_answer

Conversation

Contributor

@hesteban-tuenti hesteban-tuenti commented Mar 5, 2026

This PR extends the work done in test_text_similarity by adding new functions that query an LLM-as-a-judge to compare two texts (an LLM answer and a reference answer). The user's question is added to the judge's context, and a response_format object can optionally be passed to the model when extra analysis from the LLM-as-a-judge is needed.

Changes:

  • Extend openai_request to support structured parsing via response_format and normalize message construction.
  • Add evaluate_answer.py with helpers to score similarity between an LLM answer and reference answer(s), plus an assertion helper.
  • Add Azure OpenAI-backed pytest coverage for the new evaluation helpers (skipped unless env var is present).
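To illustrate the LLM-as-a-judge pattern the changes describe, here is a minimal sketch. The function names, prompt wording, and signatures below are hypothetical and do not reflect the actual evaluate_answer.py API; the judge model is abstracted as any callable that takes chat messages and returns text.

```python
# Hedged sketch of an LLM-as-a-judge answer evaluator.
# All names and the prompt are illustrative, not toolium's real API.

def build_judge_messages(question, answer, reference):
    """Build the chat messages sent to the judge model."""
    system = (
        "You are an impartial judge. Compare the candidate answer with the "
        "reference answer in the context of the user's question and reply "
        "only with a similarity score between 0.0 and 1.0."
    )
    user = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Score:"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def evaluate_answer(question, answer, reference, query_llm):
    """Score similarity via a judge model.

    query_llm is any callable that takes a list of chat messages and
    returns the model's text reply (e.g. a thin OpenAI/Azure wrapper).
    """
    reply = query_llm(build_judge_messages(question, answer, reference))
    return float(reply.strip())

def assert_answer(question, answer, reference, query_llm, threshold=0.8):
    """Assertion helper: fail when the judge's score is below threshold."""
    score = evaluate_answer(question, answer, reference, query_llm)
    assert score >= threshold, f"Judge score {score} is below {threshold}"
```

Keeping the model call behind a plain callable makes the helpers easy to unit-test with a stub, without hitting a real Azure OpenAI endpoint.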

Tests results:

(screenshot of test results)


Copilot AI left a comment


Pull request overview

Adds an “LLM-as-a-judge” answer evaluation utility on top of the existing OpenAI/Azure OpenAI integration, including optional structured output support.


Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

  • toolium/utils/ai_utils/openai.py: Adds response_format handling and refactors message building for OpenAI/Azure calls.
  • toolium/utils/ai_utils/evaluate_answer.py: New module implementing LLM-as-a-judge answer evaluation plus an assertion helper.
  • toolium/test/utils/ai_utils/test_answer_evaluation.py: New integration-style tests for Azure OpenAI answer evaluation, including structured outputs.
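The structured-output path mentioned above can be sketched as follows. The schema shape and the helper name are assumptions for illustration only; the real response_format handling lives in toolium/utils/ai_utils/openai.py and may differ.

```python
# Hedged sketch of the structured-output path: when a response_format is
# supplied, the judge replies with JSON matching a schema instead of a
# bare score. Schema and names are illustrative, not toolium's real API.
import json

RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "answer_evaluation",
        "schema": {
            "type": "object",
            "properties": {
                "score": {"type": "number"},
                "reasoning": {"type": "string"},
            },
            "required": ["score", "reasoning"],
        },
    },
}

def parse_structured_reply(reply_text):
    """Validate and unpack a structured judge reply.

    Returns (score, reasoning); raises ValueError if a required
    field is missing from the JSON payload.
    """
    data = json.loads(reply_text)
    for key in ("score", "reasoning"):
        if key not in data:
            raise ValueError(f"judge reply is missing '{key}'")
    return data["score"], data["reasoning"]
```

A structured reply lets tests assert not only on the score but also on the judge's reasoning, which is what "extra analysis" in the PR description refers to.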


Member

@rgonalo left a comment


Some documentation should still be added to ai_utils.rst, plus a line in CHANGELOG.rst.



4 participants