# LLM Evals

Today's notebook will be significantly shorter than the others, as we will play a little with a library that I have never used before: DeepEval. Overall, DeepEval implements several LLM-as-a-Judge metrics, that is, metrics that use LLMs to evaluate the output of other LLMs. It offers several of these, but we will mainly focus on G-Eval, which seems to be its most used one.

## G-Eval

G-Eval is an LLM-as-a-Judge metric which, given an evaluation instance, automatically generates a Chain of Thought (CoT) to judge how well the evaluated LLM performed. This can be done either with respect to the output itself or, more generally, with respect to an expected output sentence. With this, we expect the evaluation to be more sensitive to overall semantic and syntactic differences between the LLM output and the expected answer.
For now, we will play with some fake outputs, but later we will evaluate a real model from Hugging Face. The judge itself will run on Ollama because it is free :)

In [2]:
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval.models import OllamaModel
from deepeval.metrics.g_eval import Rubric


# Judge model, served locally through Ollama
model = OllamaModel(
    model = "deepseek-r1:1.5b"
)

test_case = LLMTestCase(
    context =  ["The dog chased the cat called Tom, up the tree"],
    input = "Who ran up the tree?",
    actual_output = "Tom, the cat.",
    expected_output = "The cat.")

coherence_metric = GEval(
    name = "Coherence",
    criteria = "Coherence - the collective quality of all sentences in the actual output",
    evaluation_params = [LLMTestCaseParams.INPUT, 
                         LLMTestCaseParams.ACTUAL_OUTPUT, 
                         LLMTestCaseParams.EXPECTED_OUTPUT, 
                         LLMTestCaseParams.CONTEXT],
    model = model,
    rubric=[
        Rubric(score_range=(0,2), expected_outcome="Factually incorrect."),
        Rubric(score_range=(3,6), expected_outcome="Mostly correct."),
        Rubric(score_range=(7,9), expected_outcome="Correct but missing minor details."),
        Rubric(score_range=(10,10), expected_outcome="100% correct."),
    ],
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

0.3 The actual output 'Tom, the cat' does not directly answer the question 'Who ran up the tree?' and is mostly correct in relevance to the context but lacks specificity regarding who ran up the tree.
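Note that the rubric is written on a 0 to 10 scale, while the printed score is 0.3. Judging from the output above, where 0.3 comes with the "mostly correct" wording of the 3 to 6 band, the reported score appears to be the rubric score divided by 10. That is an assumption based on this output, not something checked against DeepEval's source. Under that assumption, here is a minimal sketch that maps a reported score back to its rubric band; `RUBRIC` and `rubric_band` are hypothetical names introduced only for illustration.

```python
# Rubric bands copied from the GEval definition above (0 to 10 scale).
RUBRIC = [
    ((0, 2), "Factually incorrect."),
    ((3, 6), "Mostly correct."),
    ((7, 9), "Correct but missing minor details."),
    ((10, 10), "100% correct."),
]


def rubric_band(score: float) -> str:
    """Map a reported score in [0, 1] to its rubric description.

    Assumes the reported score is the 0-10 rubric score divided by 10.
    """
    raw = round(score * 10)  # back to the 0-10 scale
    for (low, high), outcome in RUBRIC:
        if low <= raw <= high:
            return outcome
    raise ValueError(f"score {score} is outside the rubric range")


print(rubric_band(0.3))  # Mostly correct.
```

Reading the score this way makes it easier to check whether the judge's written reason is consistent with the number it reports.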


We can check the actual CoT that the model created to evaluate this case.

In [3]:
coherence_metric.evaluation_steps

['Assess Input Alignment with Expected Output',
 'Evaluate Clarity and Relevance of Actual Output',
 'Ensure Context Supports Both Input and Expected Output',
 'Check Coherence by Ensuring Internal Flow and Connections']

Just for comparison, we can manually supply the evaluation steps instead of letting the judge generate its own CoT.

In [4]:
coherence_metric = GEval(
    name = "Coherence",
    criteria = "Coherence - the collective quality of all sentences in the actual output",
    evaluation_params = [LLMTestCaseParams.INPUT, 
                         LLMTestCaseParams.ACTUAL_OUTPUT, 
                         LLMTestCaseParams.EXPECTED_OUTPUT, 
                         LLMTestCaseParams.CONTEXT],
    model = model,
    rubric=[
        Rubric(score_range=(0,2), expected_outcome="Factually incorrect."),
        Rubric(score_range=(3,6), expected_outcome="Mostly correct."),
        Rubric(score_range=(7,9), expected_outcome="Correct but missing minor details."),
        Rubric(score_range=(10,10), expected_outcome="100% correct."),
    ],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should not penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

0.7 The actual output 'Tom, the cat.' is mostly correct as it accurately describes who ran up the tree. It omits 'Tom' but includes his name, which is a minor detail. The context and expected output are consistent, so the score is 7-9.


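As expected, the score improves: with explicit evaluation steps, the judge no longer penalizes the extra detail, and the score goes from 0.3 to 0.7.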


Let us see how this works for a summarization task. For that, we will use a real model.

In [5]:
from transformers import pipeline

summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Let us apply this model to a concrete text.

In [18]:
# A random Wikipedia entry: https://en.wikipedia.org/wiki/Ovalipes_catharus
text =  '''
Ovalipes catharus has an oval-shaped, streamlined, and slightly grainy carapace with five large, 
sawtooth-like projections to either side of the eyes and four smaller ones at the front.
The dorsal carapace has two large, maroon eyespots at the rear, two smaller eyespots near the front, and cervical grooves which form a 
butterfly-shaped mark near the centre.
It is overall sandy grey with orange-red highlights and dotted with small, brown spots.
The crab's underside is white, and its rear legs – which are flattened and function as swimming paddles – have a purplish tinge.
Unlike about half of Ovalipes species, O. catharus' body exhibits no iridescence.
'''

In [19]:
summary = summarizer(text)[0]['summary_text']
print(summary)

 Ovalipes catharus has an oval-shaped, streamlined, and slightly grainy carapace with five large, sawtooth-like projections to either side of the eyes and four smaller ones at the front . It is overall sandy grey with orange-red highlights and dotted with small, brown spots .


In [20]:
test_case = LLMTestCase(
    input = text,
    actual_output = summary)

coherence_metric = GEval(
    name = "Coherence",
    criteria = "Determine how good a summary the 'actual output' is to the 'input'",
    evaluation_params = [LLMTestCaseParams.INPUT, 
                         LLMTestCaseParams.ACTUAL_OUTPUT],
    model = model
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

1.0 The actual output is a concise summary of the input, accurately describing the carapace structure and features without unnecessary details.


Great, the judge rates the summary very highly! Let us add some chaos: I'll now take the Wikipedia article of another animal in the same genus and evaluate the same summary against it.

In [21]:
fake_text = '''
The carapace of O. ocellatus is slightly wider than long, at 8.9 centimetres (3.5 in) wide, and 7.5 cm (3.0 in) long.
The carapace is yellow-grey or light purplish, with "leopardlike clusters of purple dots". 
It exhibits a limited iridescence as a form of signalling.
'''

In [22]:
test_case = LLMTestCase(
    input = fake_text,
    actual_output = summary)

coherence_metric = GEval(
    name = "Coherence",
    criteria = "Determine how good a summary the 'actual output' is to the 'input'",
    evaluation_params = [LLMTestCaseParams.INPUT, 
                         LLMTestCaseParams.ACTUAL_OUTPUT],
    model = model
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

0.5 The actual output is concise but could be clearer by using 'catharus' instead of 'Ocellatus' for brevity. It also includes more descriptive language that might seem redundant compared to the input.


Great, the judge was able to tell that something is wrong. In fact, we can enforce an even lower grade by asking it to penalize hallucinations.

In [23]:
test_case = LLMTestCase(
    input = fake_text,
    actual_output = summary)

coherence_metric = GEval(
    name = "Coherence",
    criteria = "Determine how good a summary the 'actual output' is to the 'input'. Penalize the model if hallucinating too much.",
    evaluation_params = [LLMTestCaseParams.INPUT, 
                         LLMTestCaseParams.ACTUAL_OUTPUT],
    model = model
)

coherence_metric.measure(test_case)
print(coherence_metric.score, coherence_metric.reason)

0.0 The input focuses on O. ocellatus while the output is about O. ovalipes, a different species. The main idea of the input is captured in the output but with differences in focus.


Much better!

## Statistical metrics

DeepEval has several other metrics and features that I didn't have time to cover here. Nonetheless, I'd still like to compare our LLM-as-a-Judge with two classical statistical metrics for generated text, BLEU (originally designed for machine translation) and ROUGE (designed for summarization). I'll be using Hugging Face's implementation of these metrics, as that is the syntax I am most used to.

In [48]:
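import evaluate

metric = evaluate.load("bleu")
# Note: the full text is passed as the prediction and the summary as the reference,
# which is why the reported length_ratio is above 1.
metric.compute(predictions = [text], references = [summary])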

{'bleu': 0.3857500553435857,
 'precisions': [0.4,
  0.3949579831932773,
  0.3813559322033898,
  0.36752136752136755],
 'brevity_penalty': 1.0,
 'length_ratio': 2.5,
 'translation_length': 120,
 'reference_length': 48}

In [50]:
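# `fake_summary` is not defined in the cells shown above; it is assumed to come
# from an earlier cell and to hold the O. ocellatus text (or its summary).
metric.compute(predictions = [text], references = [fake_summary])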

{'bleu': 0.0,
 'precisions': [0.225, 0.025210084033613446, 0.0, 0.0],
 'brevity_penalty': 1.0,
 'length_ratio': 2.1052631578947367,
 'translation_length': 120,
 'reference_length': 57}
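The `bleu` value of exactly 0.0 follows from how BLEU combines its n-gram precisions: it takes their geometric mean and multiplies it by the brevity penalty, so a single zero precision zeroes the whole score. Here is a minimal sketch of that last step, using the precisions printed above; `bleu_from_precisions` is a hypothetical helper written for illustration, not part of the `evaluate` API.

```python
import math


def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Combine n-gram precisions into BLEU: BP * geometric mean of the precisions."""
    if min(precisions) <= 0:
        # log(0) is undefined; a single zero precision makes the geometric mean zero.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


# Precisions copied from the two outputs above (brevity_penalty is 1.0 in both).
real = [0.4, 0.3949579831932773, 0.3813559322033898, 0.36752136752136755]
fake = [0.225, 0.025210084033613446, 0.0, 0.0]

print(bleu_from_precisions(real))
print(bleu_from_precisions(fake))
```

The first call should reproduce the 0.3857 reported above up to floating-point rounding, and the second returns 0.0 because the 3-gram and 4-gram precisions are zero. In other words, the hard zero reflects the absence of shared 3-grams rather than a graded judgement of quality.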

And also ROUGE

In [51]:
metric = evaluate.load("rouge")  
metric.compute(predictions = [text], references = [summary])

{'rouge1': np.float64(0.5882352941176471),
 'rouge2': np.float64(0.5695364238410596),
 'rougeL': np.float64(0.5882352941176471),
 'rougeLsum': np.float64(0.5882352941176471)}

In [52]:
metric.compute(predictions = [text], references = [fake_summary])

{'rouge1': np.float64(0.2802547770700637),
 'rouge2': np.float64(0.0),
 'rougeL': np.float64(0.11464968152866241),
 'rougeLsum': np.float64(0.1910828025477707)}
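For intuition about what `rouge1` measures, here is a minimal sketch of unigram-overlap precision, recall and F1. It uses a naive lowercase alphanumeric tokenizer, which is an assumption of this sketch: the Hugging Face implementation has its own tokenization, so this will not necessarily reproduce the exact numbers above. `rouge1` is a hypothetical helper name.

```python
import re
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two strings."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Five of the six tokens match, so precision, recall and F1 are all 5/6.
scores = rouge1("the cat lay on the mat", "the cat sat on the mat")
print(scores)
```

Because the score only counts shared words, two texts about different crabs still get a non-zero ROUGE-1 from common words such as "the" and "carapace", while ROUGE-2 drops to zero once no word pairs are shared.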

So BLEU and ROUGE also indicate that the fake summary is, indeed, fake. Nonetheless, their scores for the real summary (BLEU around 0.39, ROUGE-1 around 0.59) are much less clear-cut than the 1.0 given by the LLM-as-a-Judge.