# Chapter 11: Evaluating LLM Outputs

In previous chapters, we explored how to guide LLMs towards generating structured outputs. That approach is effective, but manually evaluating the outputs every time we build an AI agent or a workflow quickly becomes cumbersome. Running inputs through the model and verifying each result by hand is time-consuming and prone to inconsistency. This chapter introduces a more systematic approach: automated evaluation using test cases. A small test set of around 10 test cases is enough to streamline the validation of LLM outputs. Furthermore, if we have already defined the expected output type, we can also establish clear criteria for what the final answer should look like.
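To make the idea of a test set concrete, here is a minimal sketch of a harness that runs a handful of test cases and reports a pass rate. The names `TEST_CASES`, `run_test_set`, and `fake_predict` are hypothetical and not part of the `language_models` library; `fake_predict` is an offline stand-in for a real model call.

```python
from typing import Any, Callable

# Each test case pairs an input prompt with the answer we expect back.
TEST_CASES = [
    {"prompt": "What is the capital city of France?", "expected": "Paris"},
    {"prompt": "What is the capital city of Italy?", "expected": "Rome"},
]


def run_test_set(predict: Callable[[str], Any], test_cases: list[dict]) -> float:
    """Run every test case through `predict` and return the fraction that passed."""
    passed = 0
    for case in test_cases:
        actual = predict(case["prompt"])
        ok = actual == case["expected"]
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['prompt']!r} -> {actual!r}")
    return passed / len(test_cases)


# Hypothetical stand-in for a model call, so the sketch runs without an LLM.
def fake_predict(prompt: str) -> str:
    return "Paris" if "France" in prompt else "Milan"


pass_rate = run_test_set(fake_predict, TEST_CASES)
print(f"Pass rate: {pass_rate:.0%}")
```

In practice, `predict` would wrap the actual model call, and the equality check would be swapped for whichever comparison suits the output type.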

In [1]:
import numpy as np
from pydantic import BaseModel, EmailStr, Field
from collections import Counter
from datetime import datetime
from language_models.agent import Agent, OutputType, PromptingStrategy, get_schema_from_args
from language_models.models.llm import OpenAILanguageModel
from language_models.models.embedding import SentenceTransformerEmbeddingModel
from language_models.proxy_client import ProxyClient
from language_models.settings import settings

In [2]:
proxy_client = ProxyClient(
    client_id=settings.CLIENT_ID,
    client_secret=settings.CLIENT_SECRET,
    auth_url=settings.AUTH_URL,
    api_base=settings.API_BASE,
)

In [3]:
llm = OpenAILanguageModel(
    proxy_client=proxy_client,
    model="gpt-4",
    max_tokens=500,
    temperature=0.2,
)

To grasp the basics of evaluating LLM outputs, we'll walk through a few examples. Rather than building an evaluation framework from scratch, we'll focus on a more streamlined, budget-friendly approach. The key takeaway is understanding how to compare outputs, which should feel familiar if you have experience in software development, as the principles are quite similar.

**String**

When evaluating string outputs, one straightforward method is to perform a direct equality comparison, particularly when the output is a single string or a distinct category. This method works well for cases where the expected output is clear-cut, such as predefined labels, specific commands, or short, unambiguous responses.

However, not all LLM outputs are this simple. When the output is a sentence or a paragraph, direct comparison is usually too strict, because a correct answer can be phrased in many ways. For these scenarios, we can compute the similarity between the expected answer and the LLM-generated output, for example the cosine similarity between their embeddings. We can also use text-overlap metrics such as BLEU and ROUGE, which score how many words or n-grams the generated text shares with the expected output.
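As a rough illustration of overlap-based scoring, the sketch below computes a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplification and not the official ROUGE implementation: it tokenizes on whitespace, applies no stemming, and the function name `unigram_f1` is made up for this example.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared lowercase whitespace tokens (a simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Same words in a different order: word overlap is complete.
print(unigram_f1("Paris is the capital of France", "The capital of France is Paris"))

# A verbose answer compared with a one-word reference: recall is 1.0, precision is 0.25.
print(unigram_f1("The capital is Paris", "Paris"))
```

The second call shows why overlap metrics penalize verbose answers when the reference is short, which is worth keeping in mind when choosing a pass threshold.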

Additionally, we can use another LLM to evaluate the output, either by judging the answer as correct or incorrect or by rating its quality from 1 to 10. However, exercise caution: the evaluating LLM may tend to prefer longer outputs or those that resemble its own writing style, which can introduce bias into the evaluation.
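One possible shape for such an LLM-based rating is sketched below. Everything here is an assumption made for illustration: the prompt wording in `JUDGE_TEMPLATE`, the helper names, and the `judge` callable, which stands for any function that sends a prompt to a model and returns its text reply. The stub `fake_judge` lets the sketch run without a model.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt; the instruction about length and style is one
# simple attempt to counter the biases mentioned above, not a guaranteed fix.
JUDGE_TEMPLATE = (
    "Rate the answer to the question on a scale from 1 to 10.\n"
    "Judge correctness only; do not reward length or writing style.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with the number only."
)


def parse_rating(reply: str) -> Optional[int]:
    """Return the first standalone integer from 1 to 10 in the reply, or None."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else None


def rate(judge: Callable[[str], str], question: str, answer: str) -> Optional[int]:
    """Ask the judge for a rating and parse it defensively."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(judge(prompt))


# Stub standing in for a real model call.
def fake_judge(prompt: str) -> str:
    return "Rating: 9"


print(rate(fake_judge, "What is the capital city of France?", "Paris"))
```

Parsing the reply defensively matters because a judge model does not always follow the "number only" instruction; returning `None` lets the test harness flag unparseable replies instead of silently miscounting them.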

In [4]:
system_prompt = "You are an AI assistant designed to help users with a variety of tasks."

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{prompt}",
    prompt_variables=["prompt"],
    output_type=OutputType.STRING,
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
    verbose=True,
)

In [5]:
output = agent.invoke({"prompt": "What is the capital city of France?"})

[1m[38;2;50;164;103mFinal Answer[0m[1m[0m: Paris


In [6]:
print(output.final_answer)

Paris


Check for equality.

In [7]:
print(output.final_answer == "Paris")

True


Check for similarity.

In [8]:
embedding_model = SentenceTransformerEmbeddingModel(model="all-MiniLM-L6-v2")
embedding1 = embedding_model.embed_query(output.final_answer)
embedding2 = embedding_model.embed_query("Paris")
cosine_similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
print(f"Cosine similarity: {cosine_similarity:.4f}")

Cosine similarity: 1.0000
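Because the same similarity formula recurs throughout this chapter, it can be convenient to wrap it in a helper with a pass threshold. The following is a dependency-free sketch; the names `cosine_score` and `is_similar` and the default threshold of 0.9 are assumptions chosen for illustration, and a suitable threshold depends on the embedding model and the task.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar(a: Sequence[float], b: Sequence[float], threshold: float = 0.9) -> bool:
    """Treat two embeddings as matching when their cosine similarity reaches the threshold."""
    return cosine_score(a, b) >= threshold


# Toy vectors instead of real embeddings, so the sketch runs on its own.
print(is_similar([1.0, 2.0], [2.0, 4.0]))   # same direction
print(is_similar([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```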


**Integer/Float**

When evaluating integer or float outputs, we can use a variety of comparison methods depending on the context. A direct `==` check can be used for exact matches, while comparisons such as `<`, `<=`, `>`, or `>=` are useful when the output needs to be within a certain threshold. Additionally, we can verify if the value falls within a specific range, which is particularly helpful for scenarios where the exact number isn't as important as ensuring it lies within acceptable bounds.

In [9]:
system_prompt = "You are an AI assistant designed to help users with a variety of tasks."

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{prompt}",
    prompt_variables=["prompt"],
    output_type=OutputType.FLOAT,
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
    verbose=True,
)

In [10]:
output = agent.invoke({"prompt": "What is the value of Pi up to two decimal places?"})

[1m[38;2;50;164;103mFinal Answer[0m[1m[0m: 3.14


In [11]:
print(output.final_answer)

3.14


Check for equality.

In [12]:
print(output.final_answer == 3.14)

True


Check if the final answer is within a specific range.

In [13]:
print(3.13 < output.final_answer < 3.15)

True
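Exact `==` comparisons on floats can be brittle when the value comes from a calculation or when the model returns more digits than expected. A tolerance-based check built on the standard library's `math.isclose` is a common alternative. The helper name `approx_equal` and the tolerance of 0.005 are assumptions made for this sketch.

```python
import math


def approx_equal(actual: float, expected: float, abs_tol: float = 0.005) -> bool:
    """True when `actual` lies within `abs_tol` of `expected`."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol)


print(approx_equal(3.14, 3.14))     # True
print(approx_equal(3.1416, 3.14))   # True: off by 0.0016
print(approx_equal(3.2, 3.14))      # False: off by 0.06

# Why exact equality is risky for computed floats:
print(0.1 + 0.2 == 0.3)             # False
```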


**Boolean**

When evaluating boolean outputs, the process is straightforward. We simply check if the output matches the expected boolean value, whether it's True or False.

In [14]:
system_prompt = "You are an AI assistant designed to help users with a variety of tasks."

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{prompt}",
    prompt_variables=["prompt"],
    output_type=OutputType.BOOLEAN,
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
    verbose=True,
)

In [15]:
output = agent.invoke({"prompt": "Is the number 5 greater than 3?"})

[1m[38;2;50;164;103mFinal Answer[0m[1m[0m: True


In [16]:
print(output.final_answer)

True


In [17]:
print(output.final_answer is True)

True


**Date/Timestamp**

When evaluating dates or timestamps, we check if the output matches the expected value and adheres to the specified format. This involves verifying that the date or timestamp is correctly formatted and accurately represents the intended moment in time.

In [18]:
system_prompt = "You are an AI assistant designed to help users with a variety of tasks."

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{prompt}",
    prompt_variables=["prompt"],
    output_type=OutputType.DATE,
    output_schema="%Y-%m-%d",
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
    verbose=True,
)

In [19]:
output = agent.invoke({"prompt": "We are excited to announce that our annual company retreat will be held on April 15, 2024. This event will be a great opportunity for team building and strategic planning."})

[1m[38;2;50;164;103mFinal Answer[0m[1m[0m: 2024-04-15


In [20]:
print(output.final_answer)

2024-04-15


In [21]:
print(output.final_answer == datetime.strptime("2024-04-15", "%Y-%m-%d").date())

True
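When the date arrives as raw text rather than as a parsed `date` object, format adherence can be checked separately from the value. The sketch below uses a parse-and-reformat round trip, because `strptime` alone also accepts unpadded input such as `2024-4-15`. The helper name `matches_format` is hypothetical.

```python
from datetime import date, datetime


def matches_format(text: str, fmt: str = "%Y-%m-%d") -> bool:
    """True when `text` parses with `fmt` and formats back to exactly the same string."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return False
    return parsed.strftime(fmt) == text


print(matches_format("2024-04-15"))       # True
print(matches_format("April 15, 2024"))   # False: wrong format
print(matches_format("2024-02-30"))       # False: not a real date
print(matches_format("2024-4-15"))        # False: month is not zero-padded

# Once the format is confirmed, compare the value itself.
print(datetime.strptime("2024-04-15", "%Y-%m-%d").date() == date(2024, 4, 15))  # True
```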


**Array**

When evaluating arrays, the approach depends on the expected format and content. If the order of the elements matters, for example because the output is supposed to be sorted, we can check for equality against a reference array in that order. If order is not a requirement, we can instead check that the output contains the same elements with the same counts as the reference, or verify that every element meets a specific criterion, such as falling within an expected range.

In [22]:
system_prompt = """You are an AI assistant designed to help users with a variety of tasks.

Extract all numbers from the user's input text."""

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{prompt}",
    prompt_variables=["prompt"],
    output_type=OutputType.ARRAY_INTEGER,
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
    verbose=True,
)

In [23]:
prompt = """Last weekend, six of us went on a 15-kilometer hike, starting at 7 AM.

By noon, we had covered 10 kilometers and reached Mount Elbert's 4,401-meter summit by 2 PM, with a temperature of 12°C.

We camped 5 kilometers away by 6 PM with 12 others and returned home by 5 PM the next day."""

output = agent.invoke({"prompt": prompt})

[1m[38;2;50;164;103mFinal Answer[0m[1m[0m: [6, 15, 7, 10, 4401, 2, 12, 5, 6, 12, 5]


In [24]:
print(output.final_answer)

[6, 15, 7, 10, 4401, 2, 12, 5, 6, 12, 5]


Check that the output contains the same elements as the expected list, regardless of order.

In [25]:
print(Counter(output.final_answer) == Counter([6, 15, 7, 10, 4401, 2, 12, 5, 6, 12, 5]))

True
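The different array checks can give different verdicts on the same output, so it helps to see them side by side. In this sketch, `reordered` and `deduplicated` are made-up outputs used only to show where list equality, `Counter` equality, and `set` equality diverge.

```python
from collections import Counter

expected = [6, 15, 7, 10, 4401, 2, 12, 5, 6, 12, 5]

# Hypothetical output with the first two elements swapped.
reordered = [15, 6, 7, 10, 4401, 2, 12, 5, 6, 12, 5]
print(reordered == expected)                    # False: order differs
print(Counter(reordered) == Counter(expected))  # True: same elements, same counts

# Hypothetical output that dropped the repeated numbers.
deduplicated = [6, 15, 7, 10, 4401, 2, 12, 5]
print(set(deduplicated) == set(expected))          # True: sets ignore duplicates
print(Counter(deduplicated) == Counter(expected))  # False: counts differ

# Counter subtraction reports exactly which elements are missing.
missing = Counter(expected) - Counter(deduplicated)
print(sorted(missing.elements()))               # [5, 6, 12]

# Element-wise criterion: every extracted number should be positive.
print(all(n > 0 for n in reordered))            # True
```

A `Counter` comparison is usually the safer default for unordered outputs, since a `set` comparison would silently accept dropped duplicates.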


**Struct/Object**

When comparing structs or objects, we perform equality checks to ensure that all elements match. Since these elements can include various data types (such as arrays, booleans, dates, or other structures), we can also apply the evaluation methods introduced above for each type. For instance, arrays are checked for equality or pattern, and dates for correct formatting and value.

In [26]:
system_prompt = "Write an email."

class Email(BaseModel):
    to: list[EmailStr] = Field(description="List of people to send it to")
    subject: str = Field(description="Subject of the email")
    body: str = Field(description="Body of the email")

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{prompt}",
    prompt_variables=["prompt"],
    output_type=OutputType.OBJECT,
    output_schema=Email,
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
    verbose=True,
)

In [27]:
prompt = """Send the email to:
1. johndoe@example.com
2. janedoe@example.com
3. alice.smith@company.org

Here is what we did: Weekend Hiking Trip Recap

Here is some context:
Last weekend, six of us went on a 15-kilometer hike, starting at 7 AM.
By noon, we had covered 10 kilometers and reached Mount Elbert's 4,401-meter summit by 2 PM, with a temperature of 12°C.
We camped 5 kilometers away by 6 PM with 12 others and returned home by 5 PM the next day."""

output = agent.invoke({"prompt": prompt})

[1m[38;2;50;164;103mFinal Answer[0m[1m[0m: to=['johndoe@example.com', 'janedoe@example.com', 'alice.smith@company.org'] subject='Weekend Hiking Trip Recap' body='Dear All,\n\nI hope this email finds you well. I am writing to recap our amazing hiking trip last weekend. Six of us embarked on a 15-kilometer hike, starting at 7 AM. By noon, we had covered 10 kilometers and reached the summit of Mount Elbert, standing tall at 4,401 meters, by 2 PM. The weather was quite pleasant with a temperature of 12°C.\n\nWe set up our camp 5 kilometers away by 6 PM, where we were joined by 12 other fellow hikers. After a night under the stars, we packed up and returned home by 5 PM the next day.\n\nLooking forward to more such adventures in the future.\n\nBest,\n[Your Name]'


Check that the recipient list contains the same addresses as expected, regardless of order.

In [28]:
print(Counter(output.final_answer.to) == Counter(["johndoe@example.com", "janedoe@example.com", "alice.smith@company.org"]))

True


Check the subject for similarity.

In [29]:
embedding1 = embedding_model.embed_query(output.final_answer.subject)
embedding2 = embedding_model.embed_query("Weekend Hiking Trip Recap")
cosine_similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
print(f"Cosine similarity: {cosine_similarity:.4f}")

Cosine similarity: 1.0000


Check the body for similarity.

In [30]:
embedding1 = embedding_model.embed_query(output.final_answer.body)

body = """Hello all,

I hope this email finds you well.
I wanted to share a quick recap of our weekend hiking trip.
Last weekend, six of us embarked on a 15-kilometer hike, starting at 7 AM.
By noon, we had covered 10 kilometers and reached Mount Elbert's 4,401-meter summit by 2 PM, with a temperature of 12°C.

We set up camp 5 kilometers away by 6 PM, where we were joined by 12 others.
The next day, we packed up and returned home by 5 PM. It was a memorable experience and I look forward to our next adventure.

Best regards,
[Your Name]"""
embedding2 = embedding_model.embed_query(body)
cosine_similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
print(f"Cosine similarity: {cosine_similarity:.4f}")

Cosine similarity: 0.9664
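The per-field checks above can be gathered into a single report, which is convenient once an object has more than a few fields. The sketch below is one way to do that under stated assumptions: `EmailLike` is a plain dataclass standing in for the `Email` model, all helper names are hypothetical, and the keyword check on the body is a cheap stand-in for the embedding-based similarity used in this chapter.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EmailLike:
    """Hypothetical stand-in for the Email model, so the sketch has no dependencies."""
    to: list[str]
    subject: str
    body: str


def same_items(actual: list, expected: list) -> bool:
    """Order-insensitive comparison that still respects duplicates."""
    return Counter(actual) == Counter(expected)


def exact(actual: Any, expected: Any) -> bool:
    return actual == expected


def mentions_all(actual: str, keywords: list[str]) -> bool:
    """True when every keyword occurs in the text, ignoring case."""
    return all(keyword.lower() in actual.lower() for keyword in keywords)


# One (check, expected value) pair per field.
CHECKS: dict[str, tuple[Callable[[Any, Any], bool], Any]] = {
    "to": (same_items, ["johndoe@example.com", "janedoe@example.com"]),
    "subject": (exact, "Weekend Hiking Trip Recap"),
    "body": (mentions_all, ["15-kilometer", "Mount Elbert"]),
}


def evaluate(obj: Any, checks: dict) -> dict[str, bool]:
    """Apply each field's check and return a field -> passed mapping."""
    return {
        field: check(getattr(obj, field), expected)
        for field, (check, expected) in checks.items()
    }


email = EmailLike(
    to=["janedoe@example.com", "johndoe@example.com"],
    subject="Weekend Hiking Trip Recap",
    body="Six of us went on a 15-kilometer hike and reached Mount Elbert's summit.",
)
report = evaluate(email, CHECKS)
print(report)
print(all(report.values()))
```

Keeping one verdict per field makes failures easier to diagnose than a single object-level equality check, which only says that something differs.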
