# AI as Judge

[G-Eval](https://deepeval.com/docs/metrics-llm-evals) is a framework that uses an LLM as a judge to evaluate LLM outputs. The evaluation can be based on any criteria you define. G-Eval is implemented in the [DeepEval](https://deepeval.com/) library, which also includes a broader set of tests.


In [1]:
%load_ext dotenv
%dotenv ../../05_src/.secrets

In [2]:
from openai import OpenAI
import os

document_folder = "../../05_src/documents/"
blue_cross_file = "the_blue_cross.txt"
file_path = os.path.join(document_folder, blue_cross_file)

with open(file_path, "r", encoding="utf-8") as f:
    blue_cross_text = f.read()

In [None]:
instructions = "You are a helpful assistant that summarizes works of fiction with a quirky and bubbly approach."
PROMPT = """
    Summarize the following story in at most four paragraphs. Please include all key characters and plot points.
    <story>
    {story}
    </story>
    In addition to the summary, add an introduction paragraph where you greet the reader and a conclusion where you share an opinion about the story.
"""

In [None]:
client = OpenAI()
response = client.responses.create(
    model="gpt-4o-mini",
    instructions=instructions,
    input=[
        {"role": "user", 
         "content": PROMPT.format(story=blue_cross_text)}
    ],
    temperature=1.2
)

In [None]:
response.output_text

# Answer Relevancy

The answer relevancy metric evaluates how relevant the `actual_output` of the LLM application is to the provided `input`. The metric is self-explaining: alongside the score, it returns a reason for that score.

The metric is calculated as:

$$
\text{Answer Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}
$$

Reference: [Answer Relevancy](https://deepeval.com/docs/metrics-answer-relevancy). 
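
To make the ratio concrete, here is a minimal sketch of the arithmetic only. The `StatementVerdict` class, the `answer_relevancy` helper, and the example statements are hypothetical and hard-coded for illustration; in DeepEval, an LLM judge extracts the statements and decides which ones are relevant.

```python
from dataclasses import dataclass


@dataclass
class StatementVerdict:
    """Hypothetical container: one statement from the output and its verdict."""
    statement: str
    relevant: bool


def answer_relevancy(verdicts: list[StatementVerdict]) -> float:
    """Return the fraction of statements judged relevant to the input."""
    if not verdicts:
        # Assumption for this sketch: an empty output scores 0.
        return 0.0
    relevant_count = sum(v.relevant for v in verdicts)
    return relevant_count / len(verdicts)


# Made-up verdicts, standing in for what an LLM judge would produce.
verdicts = [
    StatementVerdict("The summary greets the reader.", True),
    StatementVerdict("The summary names the key characters.", True),
    StatementVerdict("The summary shares an opinion about the story.", True),
    StatementVerdict("The summary recommends an unrelated recipe.", False),
]

threshold = 0.7  # same role as the `threshold` argument of the metric
score = answer_relevancy(verdicts)
passed = score >= threshold
print(f"score={score:.2f}, passed={passed}")  # score=0.75, passed=True
```

Three of the four statements are relevant, so the score is 0.75, which clears a threshold of 0.7.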

In [None]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

test_case = LLMTestCase(
    input=PROMPT.format(story=blue_cross_text),
    actual_output=response.output_text
)

In [None]:
metric.measure(test_case)

In [None]:
print(metric.score, metric.reason)

# Other Metrics

Other useful metric functions include:

+ [Faithfulness](https://deepeval.com/docs/metrics-faithfulness): evaluates whether the `actual_output` factually aligns with the contents of the `retrieval_context`.
+ [Contextual Precision](https://deepeval.com/docs/metrics-contextual-precision): evaluates whether nodes in your `retrieval_context` that are relevant to the given `input` are ranked higher than irrelevant ones.
+ [Contextual Recall](https://deepeval.com/docs/metrics-contextual-recall): evaluates the extent to which the `retrieval_context` aligns with the `expected_output`.
+ [Contextual Relevancy](https://deepeval.com/docs/metrics-contextual-relevancy): evaluates the overall relevance of the information presented in your `retrieval_context` for a given `input`.
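
Of these metrics, Contextual Precision is the only one that depends on ranking, so a small worked example may help. The sketch below assumes the weighted cumulative precision described in the DeepEval documentation. The function name and the relevance flags are hypothetical; in DeepEval, an LLM judge decides which nodes are relevant.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Score a ranked list of retrieval nodes (True = relevant node).

    Assumed formula: average, over the relevant nodes, of the precision
    measured at each relevant node's rank.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        # Assumption for this sketch: no relevant nodes scores 0.
        return 0.0

    seen_relevant = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            seen_relevant += 1
            # Precision at this rank: relevant nodes so far / nodes so far.
            weighted_sum += seen_relevant / rank
    return weighted_sum / total_relevant


# Relevant nodes ranked first give a perfect score.
good_ranking = [True, True, False]
# The same nodes with the irrelevant one ranked first score lower.
poor_ranking = [False, True, True]

print(contextual_precision(good_ranking))  # 1.0
print(round(contextual_precision(poor_ranking), 3))  # 0.583
```

Both rankings contain the same nodes. Only the order changes, and the score drops when an irrelevant node is ranked ahead of the relevant ones.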

# G-Eval

[G-Eval](https://deepeval.com/docs/metrics-llm-evals) is a framework that uses LLM-as-a-judge with chain-of-thought (CoT) reasoning to evaluate LLM outputs against any custom criteria. The G-Eval metric is the most versatile type of metric that DeepEval offers.

In [None]:
instructions = "You are a helpful assistant that specializes in works of fiction."
PROMPT = """
    Based on the story below, answer the question provided.
    <story>
    {story}
    </story>
    <question>
    Who is the main antagonist in the story and what motivates their actions?
    </question>
"""

In [None]:
client = OpenAI()
response = client.responses.create(
    model="gpt-4o-mini",
    instructions=instructions,
    input=[
        {"role": "user", 
         "content": PROMPT.format(story=blue_cross_text)}
    ],
    temperature=0.7
)

In [None]:
response.output_text

## Evaluation Criteria

The most straightforward way to establish a metric is to use a single criterion.

In [None]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the story provided in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

In [None]:
test_case = LLMTestCase(
    input=PROMPT.format(story=blue_cross_text),
    actual_output=response.output_text
)
evaluate(test_cases=[test_case], metrics=[correctness_metric])

## Evaluation Steps 

G-Eval is flexible: instead of a single evaluation criterion, we can provide a set of evaluation steps that guide the judge model through the evaluation in a specific order.
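
To illustrate the difference between the two options, here is a minimal sketch of how a judge prompt could be assembled from either a criterion or a list of steps. The `build_judge_prompt` function and its prompt wording are hypothetical and are not DeepEval's actual prompt template.

```python
from typing import Optional, Sequence


def build_judge_prompt(
    input_text: str,
    actual_output: str,
    criteria: Optional[str] = None,
    evaluation_steps: Optional[Sequence[str]] = None,
) -> str:
    """Hypothetical helper: assemble a judge prompt (not DeepEval's template)."""
    if criteria is None and not evaluation_steps:
        raise ValueError("Provide either criteria or evaluation_steps.")

    if evaluation_steps:
        # Explicit steps take precedence and are presented as a numbered list.
        header = "Follow these evaluation steps in order:"
        guidance = "\n".join(
            f"{number}. {step}"
            for number, step in enumerate(evaluation_steps, start=1)
        )
    else:
        header = "Evaluate the output against this criterion:"
        guidance = criteria

    return (
        f"{header}\n{guidance}\n\n"
        f"Input:\n{input_text}\n\n"
        f"Actual output:\n{actual_output}\n\n"
        "Return a score and a short reason."
    )


prompt = build_judge_prompt(
    input_text="Who is the main antagonist in the story?",
    actual_output="(model answer goes here)",
    evaluation_steps=[
        "Check whether the output contradicts the input.",
        "Penalize omission of detail.",
    ],
)
print(prompt)
```

With explicit steps, the judge receives the same ordered checklist on every run, which tends to make scores easier to compare across test cases than a single free-form criterion.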

In [None]:
...

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'input'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are not OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

In [None]:
test_case = LLMTestCase(
    input=PROMPT.format(story=blue_cross_text),
    actual_output=response.output_text
)
evaluate(test_cases=[test_case], metrics=[correctness_metric])