# AI as Judge

[G-Eval](https://deepeval.com/docs/metrics-llm-evals) is a framework that uses an LLM as a judge to evaluate LLM outputs against any criteria you define. G-Eval is implemented in [DeepEval](https://deepeval.com/), a library that also includes a broader set of tests.


In [48]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [49]:
from openai import OpenAI
import os

document_folder = "../../05_src/documents/"
blue_cross_file = "the_blue_cross.txt"
file_path = os.path.join(document_folder, blue_cross_file)

with open(file_path, "r", encoding="utf-8") as f:
    blue_cross_text = f.read()

In [50]:
instructions = "You are a helpful assistant that summarizes works of fiction with a quirky and bubbly approach."
PROMPT = """
    Summarize the following story in at most four paragraphs. Please include all key characters and plot points.
    <story>
    {story}
    </story>
    In addition to the summary, add an introduction paragraph where you greet the reader and a conclusion where you share an opinion about the story.
"""

In [51]:
client = OpenAI()
response = client.responses.create(
    model="gpt-4o-mini",
    instructions=instructions,
    input=[
        {"role": "user", 
         "content": PROMPT.format(story=blue_cross_text)}
    ],
    temperature=1.2
)

In [52]:
response.output_text

'Hello there, fabulous reader! ✨ Get ready to dive into a delightful tale filled with quirky characters and delightful mysteries. We\'re all aboard with notable twists and a dash of cleverness in "The Blue Cross." So grab your magnifying glass, let\'s investigate together!\n\nIn this captivating story, we meet Valentin, the keen head of the Paris police, shrouded in apparent simplicity yet hiding significant intellect and determination. He jumps onto a boat in Harwich, on the trail of the infamous criminal Flambeau, who had slipped through the fingers of law enforcement in various thrilling episodes nationwide. Amidst the chaos of an important congress in London, Valentin is on a quest to arrest Flambeau, whose escapades range from elaborate scams to bewildering heists, including impersonating clergy!\n\nAs Valentin follows leads and engages with unsuspecting locals—including a bumbling little priest—he artfully uncovers subtle clues that lead him through a series of entertaining misad

# Answer Relevancy

The answer relevancy metric evaluates how relevant the `actual_output` of the LLM app is to the provided `input`. The metric is self-explaining: alongside the score, it returns a reason for that score.

The metric is calculated as:

$$
\text{Answer Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}
$$

Reference: [Answer Relevancy](https://deepeval.com/docs/metrics-answer-relevancy). 
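
The ratio above can be worked through by hand. The following is a minimal sketch of the arithmetic only, not DeepEval's implementation: it assumes the per-statement relevance verdicts are already known (DeepEval obtains them from the judge model), that a score at or above the threshold counts as a pass, and that an empty list scores 0.0. The helper name `answer_relevancy` and the verdicts are hypothetical.

```python
def answer_relevancy(verdicts: list[bool], threshold: float = 0.7) -> tuple[float, bool]:
    """Return (score, passed) for a list of per-statement relevance verdicts."""
    if not verdicts:
        # Assumption of this sketch: no statements means a score of 0.0.
        return 0.0, False
    # True counts as 1, so the sum is the number of relevant statements.
    score = sum(verdicts) / len(verdicts)
    return score, score >= threshold


# Hypothetical verdicts for five statements extracted from a summary:
# four judged relevant, one judged irrelevant.
verdicts = [True, True, True, False, True]
score, passed = answer_relevancy(verdicts)
print(score, passed)  # 0.8 True
```

With a threshold of 0.7, four relevant statements out of five pass; three out of five (0.6) would not.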

In [53]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

test_case = LLMTestCase(
    input=PROMPT.format(story=blue_cross_text),
    actual_output=response.output_text
)

In [54]:
metric.measure(test_case)

Output()

1.0

In [55]:
print(metric.score, metric.reason)

1.0 The score is 1.00 because the response was entirely relevant, providing a clear summary of the story while including all key characters and plot points as requested. There were no irrelevant statements, demonstrating a strong understanding of the input requirements.


# Other Metrics

Other useful metric functions include:

+ [Faithfulness](https://deepeval.com/docs/metrics-faithfulness): evaluates whether the `actual_output` factually aligns with the contents of `retrieval_context`.
+ [Contextual Precision](https://deepeval.com/docs/metrics-contextual-precision): evaluates whether nodes in your `retrieval_context` that are relevant to the given `input` are ranked higher than irrelevant ones.
+ [Contextual Recall](https://deepeval.com/docs/metrics-contextual-recall): evaluates the extent to which the `retrieval_context` aligns with the `expected_output`.
+ [Contextual Relevancy](https://deepeval.com/docs/metrics-contextual-relevancy): evaluates the overall relevance of the information presented in your `retrieval_context` for a given `input`.
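
The ranking idea behind Contextual Precision can be illustrated with a rank-weighted precision over relevance labels. This is a minimal sketch, assuming the labels are already known (in DeepEval they come from the judge model) and assuming an average-precision style formula. `contextual_precision` is a hypothetical helper written for this illustration, not a DeepEval function.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision: relevant nodes near the top score higher.

    relevance[i] is True when the node at rank i + 1 is relevant to the input.
    """
    relevant_total = sum(relevance)
    if relevant_total == 0:
        # Assumption of this sketch: no relevant nodes means a score of 0.0.
        return 0.0

    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at rank k, counted only at ranks holding a relevant node.
            weighted += hits / k
    return weighted / relevant_total


# Same nodes, different order: pushing the irrelevant node down raises the score.
print(contextual_precision([True, True, False]))  # 1.0
print(contextual_precision([False, True]))        # 0.5
```

The two calls show why ordering matters: a relevant node placed behind an irrelevant one is penalized even though it was retrieved.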

# G-Eval

[G-Eval](https://deepeval.com/docs/metrics-llm-evals) is a framework that uses LLM-as-a-judge with chain-of-thought (CoT) reasoning to evaluate LLM outputs based on any custom criteria. The G-Eval metric is the most versatile type of metric that `deepeval` offers.

In [56]:
instructions = "You are a helpful assistant that specializes in works of fiction."
PROMPT = """
    Based on the story below, answer the question provided.
    <story>
    {story}
    </story>
    <question>
    Who is the main antagonist in the story and what motivates their actions?
    </question>
"""

In [57]:
client = OpenAI()
response = client.responses.create(
    model="gpt-4o",
    instructions=instructions,
    input=[
        {"role": "user", 
         "content": PROMPT.format(story=blue_cross_text)}
    ],
    temperature=0.7
)

In [58]:
response.output_text

"The main antagonist in the story is Flambeau, a notorious criminal known for his ingenious and bold crimes. His actions are motivated by his desire for theft, particularly targeting valuable items such as the sapphire cross. Flambeau's reputation is built on his ability to commit crimes with creativity and cunning, often employing disguises and elaborate ruses to achieve his goals. In this story, he disguises himself as a priest to deceive Father Brown and steal the valuable cross."

## Evaluation Criteria

The most straightforward way to establish a metric is to use a single criterion.

In [59]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the context.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini"
)

In [60]:
test_case = LLMTestCase(
    input=PROMPT.format(story=blue_cross_text),
    actual_output=response.output_text
)
evaluate(test_cases=[test_case], metrics=[correctness_metric])

Output()



Metrics Summary

  - ✅ Correctness [GEval] (score: 0.9777299856015766, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: The response accurately identifies Flambeau as the main antagonist and clearly explains his motivations, which align with the details provided in the story. It highlights his reputation for theft and his use of disguises, specifically mentioning his plan to deceive Father Brown to steal the sapphire cross, which is central to the plot. This demonstrates a strong understanding of the narrative and its characters., error: None)

For test case:

  - input: 
    Based on the story below, answer the question provided.
    <story>
    The Blue Cross

Between the silver ribbon of morning and the green glittering ribbon of
sea, the boat touched Harwich and let loose a swarm of folk like flies,
among whom the man we must follow was by no means conspicuous--nor
wished to be. There was nothing notable about him, except a slight
contrast between the holiday

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Correctness [GEval]', threshold=0.5, success=True, score=0.9777299856015766, reason='The response accurately identifies Flambeau as the main antagonist and clearly explains his motivations, which align with the details provided in the story. It highlights his reputation for theft and his use of disguises, specifically mentioning his plan to deceive Father Brown to steal the sapphire cross, which is central to the plot. This demonstrates a strong understanding of the narrative and its characters.', strict_mode=False, evaluation_model='gpt-4o-mini', error=None, evaluation_cost=0.0016881, verbose_logs='Criteria:\nDetermine whether the actual output is factually correct based on the context. \n \nEvaluation Steps:\n[\n    "1. Identify the context provided in the input to establish the factual basis for evaluation.",\n    "2. Analyze the actual output to determine if it aligns with the
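
The score reported above (0.9777299856015766) is not a round number, which is consistent with G-Eval's approach of weighting candidate scores by the judge's token probabilities instead of taking a single integer rating. The sketch below illustrates that arithmetic under stated assumptions: a 0 to 10 integer scale, made-up probabilities, and a final division by the scale maximum to land in [0, 1]. `weighted_score` is a hypothetical helper, not DeepEval's implementation.

```python
def weighted_score(score_probs: dict[int, float], scale_max: int = 10) -> float:
    """Probability-weighted rating, normalized to the range [0, 1].

    score_probs maps each candidate integer rating to the probability
    the judge assigned to it (values need not sum to exactly 1).
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Expected rating under the (renormalized) probability distribution.
    expected = sum(score * prob for score, prob in score_probs.items()) / total
    return expected / scale_max


# Hypothetical judge distribution: mostly "10", sometimes "9".
probs = {9: 0.2, 10: 0.8}
print(round(weighted_score(probs), 3))  # 0.98
```

A judge that hesitates between two adjacent ratings therefore produces a fractional score, while a judge that is certain produces a round one.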

## Evaluation Steps 

G-Eval is flexible: instead of a single `criteria` string, we can supply a list of `evaluation_steps` that tell the judge which checks to perform. When only `criteria` is given, G-Eval generates the steps itself, as the "Evaluation Steps" entry in the verbose logs above shows.

In [61]:
...

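# Note: no `model` is passed here. The error in the run below suggests that
# DeepEval then falls back to a default judge model (`gpt-4.1` in this run).
# Passing model="gpt-4o-mini", as in the earlier metric, avoids that fallback.
correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'input'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are not OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)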

In [62]:
test_case = LLMTestCase(
    input=PROMPT.format(story=blue_cross_text),
    actual_output=response.output_text
)
evaluate(test_cases=[test_case], metrics=[correctness_metric])

Output()

PermissionDeniedError: Error code: 403 - {'error': {'message': 'Project `proj_azcDlGrYmDy6eV8yO8hoT2pv` does not have access to model `gpt-4.1`', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
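
The 403 above names `gpt-4.1`, a model the metric was never asked to use, which suggests that a `GEval` created without `model` falls back to a default judge. One way to avoid this is to pin the judge model in a single place. A minimal sketch, assuming a hypothetical `JUDGE_MODEL` environment variable and a hypothetical helper `judge_kwargs`; the fallback `gpt-4o-mini` is the model used by the earlier metrics in this notebook.

```python
import os


def judge_kwargs(default_model: str = "gpt-4o-mini") -> dict:
    """Keyword arguments that pin the judge model for a DeepEval metric."""
    # JUDGE_MODEL is a hypothetical variable name chosen for this sketch.
    # An unset or blank value falls back to the default.
    model = os.environ.get("JUDGE_MODEL", "").strip() or default_model
    return {"model": model}


print(judge_kwargs())

# Intended use with the metric defined above (not executed here):
# correctness_metric = GEval(
#     name="Correctness",
#     evaluation_steps=[...],
#     evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
#     **judge_kwargs(),
# )
```

Keeping the model name in one helper means every metric in the notebook uses a judge the project can actually access, and switching judges requires a single change.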