Add evaluate answer with llm-as-a-judge #459

Open

hesteban-tuenti wants to merge 7 commits into Telefonica:master
Conversation
hesteban-tuenti commented on Mar 5, 2026
Pull request overview
Adds an “LLM-as-a-judge” answer evaluation utility on top of the existing OpenAI/Azure OpenAI integration, including optional structured output support.
Changes:
- Extend `openai_request` to support structured parsing via `response_format` and normalize message construction.
- Add `evaluate_answer.py` with helpers to score similarity between an LLM answer and reference answer(s), plus an assertion helper.
- Add Azure OpenAI-backed pytest coverage for the new evaluation helpers (skipped unless an env var is present).
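The helpers described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the function names, prompt wording, and the `ask_llm` callable are all hypothetical stand-ins for toolium's real OpenAI/Azure OpenAI integration.

```python
import json
from typing import Callable


def evaluate_answer(question: str, answer: str, reference: str,
                    ask_llm: Callable[[str], str]) -> dict:
    """Ask a judge LLM to score similarity between an answer and a reference.

    `ask_llm` is any callable that sends a prompt to a model and returns the
    raw text response; the prompt asks for a JSON object with a 0-1 `score`
    and an `explanation`. (Hypothetical sketch, not toolium's API.)
    """
    prompt = (
        "You are an impartial judge. Given a user question, an answer from an "
        "LLM and a reference answer, rate how similar the answer is to the "
        "reference on a scale from 0 to 1.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reference: {reference}\n"
        'Reply only with JSON: {"score": <float>, "explanation": "<text>"}'
    )
    return json.loads(ask_llm(prompt))


def assert_answer_matches(question: str, answer: str, reference: str,
                          ask_llm: Callable[[str], str],
                          threshold: float = 0.8) -> dict:
    """Assertion helper: fail if the judge's score is below the threshold."""
    result = evaluate_answer(question, answer, reference, ask_llm)
    assert result["score"] >= threshold, result.get("explanation", "")
    return result
```

Injecting the model call as a callable keeps the judging logic testable without network access, which mirrors why the PR's tests are skipped unless the Azure OpenAI env var is present.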
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| toolium/utils/ai_utils/openai.py | Adds response_format handling and refactors message building for OpenAI/Azure calls. |
| toolium/utils/ai_utils/evaluate_answer.py | New module implementing LLM-as-a-judge answer evaluation + assertion helper. |
| toolium/test/utils/ai_utils/test_answer_evaluation.py | New integration-style tests for Azure OpenAI answer evaluation, including structured outputs. |
rgonalo reviewed Mar 5, 2026

rgonalo (Member) left a comment:
Some documentation in ai_utils.rst and a line in CHANGELOG.rst are still missing.
This PR extends the work done in test_text_similarity by adding new functions focused on querying an LLM-as-a-judge that compares two texts (an LLM answer and a reference answer), adding the user's question to the context and optionally providing the model a response_format object when extra analysis from the LLM-as-a-judge is needed.
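As a rough illustration of the optional structured output mentioned above, a `response_format` payload in OpenAI's JSON-schema style could constrain the judge to return a score plus a categorical verdict. The field names and schema below are hypothetical, not the schema used in this PR:

```python
# Hypothetical response_format payload in the OpenAI structured-outputs
# style: forces the judge model to reply with a fixed JSON shape so the
# caller can parse score/verdict/explanation without prompt-only parsing.
JUDGE_RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "answer_evaluation",
        "schema": {
            "type": "object",
            "properties": {
                "score": {"type": "number"},
                "verdict": {
                    "type": "string",
                    "enum": ["match", "partial", "mismatch"],
                },
                "explanation": {"type": "string"},
            },
            "required": ["score", "verdict", "explanation"],
            "additionalProperties": False,
        },
    },
}
```

Passing a schema like this through `response_format` is what lets the evaluation helpers request extra analysis (e.g. a verdict category) beyond a bare similarity score.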
Changes:
Tests results: