Add evaluate answer with llm-as-a-judge #459

Open

hesteban-tuenti wants to merge 7 commits into Telefonica:master
Conversation
hesteban-tuenti commented on Mar 5, 2026
Pull request overview
Adds an “LLM-as-a-judge” answer evaluation utility on top of the existing OpenAI/Azure OpenAI integration, including optional structured output support.
Changes:
- Extend `openai_request` to support structured parsing via `response_format` and normalize message construction.
- Add `evaluate_answer.py` with helpers to score similarity between an LLM answer and reference answer(s), plus an assertion helper.
- Add Azure OpenAI-backed pytest coverage for the new evaluation helpers (skipped unless an env var is present).
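The helpers described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the function names, prompt wording, and the `ask_llm` callable are all hypothetical stand-ins for toolium's real OpenAI/Azure OpenAI integration.

```python
import json
from typing import Callable


def evaluate_answer(question: str, answer: str, reference: str,
                    ask_llm: Callable[[str], str]) -> dict:
    """Ask a judge LLM to score similarity between an answer and a reference.

    `ask_llm` is any callable that sends a prompt to a model and returns the
    raw text response; the prompt asks for a JSON object with a 0-1 `score`
    and an `explanation`. (Hypothetical sketch, not toolium's API.)
    """
    prompt = (
        "You are an impartial judge. Given a user question, an answer from an "
        "LLM and a reference answer, rate how similar the answer is to the "
        "reference on a scale from 0 to 1.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reference: {reference}\n"
        'Reply only with JSON: {"score": <float>, "explanation": "<text>"}'
    )
    return json.loads(ask_llm(prompt))


def assert_answer_matches(question: str, answer: str, reference: str,
                          ask_llm: Callable[[str], str],
                          threshold: float = 0.8) -> dict:
    """Assertion helper: fail if the judge's score is below the threshold."""
    result = evaluate_answer(question, answer, reference, ask_llm)
    assert result["score"] >= threshold, result.get("explanation", "")
    return result
```

Injecting the model call as a callable keeps the judging logic testable without network access, which mirrors why the PR's tests are skipped unless the Azure OpenAI env var is present.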
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| toolium/utils/ai_utils/openai.py | Adds response_format handling and refactors message building for OpenAI/Azure calls. |
| toolium/utils/ai_utils/evaluate_answer.py | New module implementing LLM-as-a-judge answer evaluation + assertion helper. |
| toolium/test/utils/ai_utils/test_answer_evaluation.py | New integration-style tests for Azure OpenAI answer evaluation, including structured outputs. |
rgonalo reviewed Mar 5, 2026

rgonalo (Member) left a comment:
Some documentation in ai_utils.rst and a line in CHANGELOG.rst are still missing.
This PR extends the work done in test_text_similarity by adding new functions focused on querying an LLM-as-a-judge that compares two texts (an LLM answer and a reference answer), adding the user's question to the context and optionally providing the model a response_format object when extra analysis from the LLM-as-a-judge is needed.
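As a rough illustration of the optional structured output mentioned above, a `response_format` payload in OpenAI's JSON-schema style could constrain the judge to return a score plus a categorical verdict. The field names and schema below are hypothetical, not the schema used in this PR:

```python
# Hypothetical response_format payload in the OpenAI structured-outputs
# style: forces the judge model to reply with a fixed JSON shape so the
# caller can parse score/verdict/explanation without prompt-only parsing.
JUDGE_RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "answer_evaluation",
        "schema": {
            "type": "object",
            "properties": {
                "score": {"type": "number"},
                "verdict": {
                    "type": "string",
                    "enum": ["match", "partial", "mismatch"],
                },
                "explanation": {"type": "string"},
            },
            "required": ["score", "verdict", "explanation"],
            "additionalProperties": False,
        },
    },
}
```

Passing a schema like this through `response_format` is what lets the evaluation helpers request extra analysis (e.g. a verdict category) beyond a bare similarity score.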
Changes:
Tests results: