# Example Notebook for Text Generation Metric Evaluation

This notebook demonstrates example use cases for the Valor text generation metrics. These metrics can be used across a variety of tasks which typically, but not always, involve prompting an LLM to generate some text. Use cases include Question Answering (Q&A), Retrieval Augmented Generation (RAG, which can be thought of as a subcase of Q&A), Summarization and Content Generation.

Some of the metrics can be applied across different use cases. For example, the BLEU metric can be used to compare predictions (generated text) to groundtruth answers in the case of Q&A/RAG, and can also be used to compare predictions (generated text) to groundtruth summaries in the case of Summarization. Conversely, some of the metrics are specific to a use case, such as the ContextRecall metric for RAG or the Summarization score for Summarization. 

In all three use cases below, we generate text using GPT3.5-turbo and evaluate it with a variety of metrics. For the text comparison metrics, we compare GPT3.5-turbo's responses to the groundtruth answers/summaries from the HuggingFace RAG and Summarization datasets. For the LLM-guided metrics (which include the RAG metrics, Summarization metrics and general text generation metrics), we use GPT4o to evaluate the responses of GPT3.5-turbo.

The first example is RAG for Q&A. We get a RAG dataset from HuggingFace, use Llama-Index and GPT3.5-turbo to generate answers, and evaluate those answers with text comparison metrics, RAG metrics and general text generation metrics.

The second example is Summarization. We download a CNN news dataset from HuggingFace which includes groundtruth summaries. We ask GPT3.5-turbo to summarize those articles. Then we evaluate those summaries with text comparison metrics, summarization metrics and general text generation metrics.

The third example is content generation. We manually create a few queries, each of a different request type (creative, educational, professional). Then we evaluate the generated content with general text generation metrics.

# Connect to Valor API

In [None]:
import os
import pandas as pd
from valor.enums import EvaluationStatus
from valor import Annotation, Datum, Dataset, Model, GroundTruth, Client, Prediction, connect

# Connect to Valor API.
connect("http://0.0.0.0:8000")
client = Client()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
# Not used below; read with .get() so that a missing variable does not raise a KeyError.
MISTRAL_API_KEY = os.environ.get("MISTRAL_API_KEY")

# Use Case #1: RAG for Q&A

## Download and Save the Corpus for the RAG Pipeline

In [None]:
from datasets import load_dataset

In [None]:
corpus_dataset = load_dataset("rag-datasets/mini_wikipedia", "text-corpus")["passages"]
print(corpus_dataset)

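# Save each passage in corpus_dataset to its own .txt file, using the passage id as the filename.
os.makedirs("rag_corpus", exist_ok=True)  # The output directory must exist before writing.
for passage in corpus_dataset:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    with open(f"./rag_corpus/{passage['id']}.txt", "w") as f:
        f.write(passage["passage"])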

## Load Queries and get Answers with Llama-Index

In [None]:
import csv
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

In [None]:
# Load the query dataset. 
qa_dataset = load_dataset("rag-datasets/mini_wikipedia", "question-answer")["test"]
qa_dataset = qa_dataset.shuffle(seed=42)
print(qa_dataset)

In [None]:
# Loads in the rag_corpus and builds an index.
# Initializes a query_engine, which will use GPT3.5-turbo by default with calls to OpenAI's API.
# You must specify your OpenAI API key in the environment variable OPENAI_API_KEY for the below code to function. 
documents = SimpleDirectoryReader("rag_corpus").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

In [None]:
# sample use
response = query_engine.query("What country borders Argentina and Brazil?")
print(response)
print(response.source_nodes)

response = query_engine.query("What color is a penguin?")
print(response)
print(response.source_nodes)

In [None]:
if os.path.exists("rag_data.csv"):
    os.remove("rag_data.csv")

NUMBER_OF_RECORDS = 50

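# newline="" is what the csv module recommends, so that no extra blank rows are written on some platforms.
with open("rag_data.csv", mode="w", newline="") as data_file: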
    data_writer = csv.writer(data_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    data_writer.writerow(["query", "groundtruth", "prediction", "context_list"])

    for i in range(NUMBER_OF_RECORDS):
        query = qa_dataset[i]["question"]
        groundtruth = qa_dataset[i]["answer"]
        print(f"{i}: {query}")

        response_object = query_engine.query(query)
        response = response_object.response
        print(f"response: {response}")
        # Collect the text of each retrieved source node as the context for this answer.
        context_list = [node.text for node in response_object.source_nodes]
        data_writer.writerow([query, groundtruth, response, context_list])

## Evaluation in Valor

In this example, the RAG pipeline produces answers to the given queries by retrieving context and then generating answers based on the context and query. Groundtruth answers are also known for these queries. Both the datums (which contain the queries) and the groundtruths are added to the dataset. Then the predictions are added to the model; each prediction includes the answer and the context used to generate it.

The metrics requested include some text comparison metrics (BLEU, ROUGE, LDistance), which compare the generated answer to the groundtruth answer for the same datum. If the user only desires these metrics, then they do not need to include the context_list in the prediction and they do not need to supply `llm_api_params`.
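To make the text comparison idea concrete, here is a minimal sketch of a unigram-only BLEU score, which is what BLEU weights of `[1, 0, 0, 0]` reduce to. It assumes plain whitespace tokenization and the standard brevity penalty; Valor's own tokenization and implementation may differ, and `unigram_bleu` is a hypothetical helper name, not part of the Valor API.

```python
import math
from collections import Counter


def unigram_bleu(prediction: str, references: list[str]) -> float:
    """Clipped unigram precision times brevity penalty (illustrative only)."""
    pred_tokens = prediction.split()
    if not pred_tokens or not references:
        return 0.0

    # For each token, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for token, count in Counter(reference.split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)

    # Clip each predicted token count by its maximum reference count.
    pred_counts = Counter(pred_tokens)
    clipped = sum(min(count, max_ref_counts[token]) for token, count in pred_counts.items())
    precision = clipped / len(pred_tokens)

    # Brevity penalty against the reference whose length is closest to the prediction.
    pred_len = len(pred_tokens)
    ref_len = min(
        (len(reference.split()) for reference in references),
        key=lambda length: (abs(length - pred_len), length),
    )
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * precision


# Several groundtruth annotations for one datum act as multiple references.
score = unigram_bleu("the answer is Paris", ["Paris", "the answer is Paris"])
print(score)
```

Passing every groundtruth annotation of a datum as a reference means a prediction is rewarded for matching any one of the acceptable answers.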

Other metrics use LLM-guided evaluation (for example Coherence, ContextRelevance, AnswerRelevance, Hallucination and Toxicity). To get these metrics, the user needs to specify an API URL, an API key and a model name, along with any other model kwargs. Each of these metrics uses API calls to the specified LLM service to get the information needed to compute it. Some of these metrics, such as Toxicity, do not require any context, so they can be used with a Q&A model that does not use context. Others, such as ContextRelevance, require context to be passed in the prediction, because that context is used to compute the metric.
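Since the context for each prediction was written to `rag_data.csv` as a Python list, it comes back from the CSV as a string and has to be parsed before it is passed to the prediction. The sketch below shows that round trip with an in-memory file; the two context strings are made-up placeholders, not real retrieved passages.

```python
import ast
import csv
import io

# Placeholder context passages, chosen to include commas and quotes.
context_list = [
    "first retrieved passage, with a comma",
    'second passage with "double quotes" inside',
]

# Write one row the same way the notebook does: the list is stored via str().
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL)
writer.writerow(["query", "context_list"])
writer.writerow(["example query", context_list])

# Read the row back; the context column is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_context = rows[0]["context_list"]

# ast.literal_eval safely turns the list literal back into a list of strings.
restored = ast.literal_eval(raw_context)
print(type(raw_context).__name__, type(restored).__name__)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from a file.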

In [None]:
import ast

In [None]:
# Read in the dataset of queries, groundtruths and predictions. 
df = pd.read_csv("rag_data.csv")

In [None]:
# Create, build and finalize the dataset and model.
dataset = Dataset.create(
    name="rag_dataset",
    metadata={
        "hf_dataset_name": "rag-datasets/mini_wikipedia",
        "hf_dataset_subset": "question-answer",
        "hf_dataset_split": "test",
        "shuffle_seed": 42,
        "number_of_records": 50,
    }
)
model = Model.create(
    name="rag_model",
    metadata={
        "embedding_model_name": "text-embedding-ada-002", # When we ran llama-index above, it defaulted to text-embedding-ada-002.
        "llm_model_name": "GPT3.5-turbo", # When we ran llama-index above, it defaulted to GPT3.5.
    }
)

# Create a list of datums
datum_list = []
for i in range(len(df)):
    row = df.iloc[i]

    datum_list.append(
        Datum(
            uid=f"query{i}",
            text=row["query"],
        )
    )

# Build and finalize the dataset
for i in range(len(df)):
    row = df.iloc[i]
    datum = datum_list[i]

    dataset.add_groundtruth(
        GroundTruth(
            datum=datum,
            annotations=[
                # Perhaps you have multiple correct or good groundtruth answers to the query.
                # The labels below are a trivial example, but you could have less trivial examples.
                # For example, to the query "When was the United States of America founded?", you might 
                # consider both "During the American Revolution" or "July 4th, 1776" to be good answers.
                Annotation(
                    text=row["groundtruth"],
                    metadata={"annotator": "Alice"},
                ),
                Annotation(
                    text="The answer is " + row["groundtruth"],
                    metadata={"annotator": "Bob"},
                ),
            ],
        )
    )
dataset.finalize()

# Build and finalize the model
for i in range(len(df)):
    row = df.iloc[i]
    datum = datum_list[i]

    model.add_prediction(
        dataset, 
        Prediction(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["prediction"],
                    context=ast.literal_eval(row["context_list"]),
                ),
            ],
        )
    )
model.finalize_inferences(dataset)

In [None]:
# Using GPT4o to evaluate GPT3.5-turbo's predictions across a variety of metrics. 
eval_job = model.evaluate_text_generation(
    dataset,
    metrics_to_return=["AnswerRelevance", "BLEU", "Coherence", "ROUGE"],
    llm_api_params = {
        "client":"openai",
        "api_key":OPENAI_API_KEY,
        "data":{
            "model":"gpt-4o",
            "seed":2024,
        },
    },    
    metric_params={
        "BLEU": {
            "weights": [1, 0, 0, 0],
        }
    }
)

assert eval_job.wait_for_completion() == EvaluationStatus.DONE

# These are the computed metrics.
eval_job.metrics

# Here are some example metrics. These are all for query49 and were evaluated by GPT-4o.
example_expected_metrics = [
    {
        "type": "AnswerRelevance", 
        "value": 1.0,
        "parameters": {
            "dataset": "rag_dataset", 
            "datum_uid": "query49", 
            "prediction": "The government is seeking to increase the media industry's GDP contribution to 3% by 2012."
        }, 
    },
    {
        "type": "BLEU", 
        "value": 0.7058823529411765,
        "parameters": {
            "dataset": "rag_dataset", 
            "weights": [1.0, 0.0, 0.0, 0.0], 
            "datum_uid": "query49", 
            "prediction": "The government is seeking to increase the media industry's GDP contribution to 3% by 2012."
        },
    },
    {
        "type": "Coherence", 
        "value": 3.0,
        "parameters": {
            "dataset": "rag_dataset", 
            "datum_uid": "query49", 
            "prediction": "The government is seeking to increase the media industry's GDP contribution to 3% by 2012."
        }, 
    },
    {
        "type": "ROUGE", 
        "value": {
            "rouge1": 0.7741935483870969, 
            "rouge2": 0.5384615384615385, 
            "rougeL": 0.7142857142857143, 
            "rougeLsum": 0.7142857142857143
        },
        "parameters": {
            "dataset": "rag_dataset", 
            "datum_uid": "query49", 
            "prediction": "The government is seeking to increase the media industry's GDP contribution to 3% by 2012.", 
            "rouge_types": ["rouge1", "rouge2", "rougeL", "rougeLsum"], 
            "use_stemmer": False
        }, 
    },
]
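The metrics come back as one entry per datum and metric type, in the shape shown above, so a common next step is to aggregate them. The following is a rough sketch that averages values per metric type using only the standard library; `mean_score_by_type` is a hypothetical helper, and the sample entries are trimmed copies of example values from this notebook rather than a full evaluation result.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_type(metrics: list[dict]) -> dict[str, float]:
    """Average metric values per type; dict-valued metrics are averaged per sub-key."""
    buckets = defaultdict(list)
    for metric in metrics:
        value = metric["value"]
        if isinstance(value, dict):
            # e.g. ROUGE returns several sub-scores; keep them separate.
            for sub_name, sub_value in value.items():
                buckets[f"{metric['type']}.{sub_name}"].append(sub_value)
        else:
            buckets[metric["type"]].append(value)
    return {name: mean(values) for name, values in buckets.items()}


# Trimmed sample entries in the same shape as the example metrics.
sample_metrics = [
    {"type": "Coherence", "value": 3.0, "parameters": {"datum_uid": "query49"}},
    {"type": "Coherence", "value": 5.0, "parameters": {"datum_uid": "query40"}},
    {
        "type": "ROUGE",
        "value": {"rouge1": 0.7741935483870969, "rouge2": 0.5384615384615385},
        "parameters": {"datum_uid": "query49"},
    },
    {
        "type": "ROUGE",
        "value": {"rouge1": 0.18181818181818182, "rouge2": 0.09999999999999999},
        "parameters": {"datum_uid": "query40"},
    },
]

averages = mean_score_by_type(sample_metrics)
for name, value in sorted(averages.items()):
    print(f"{name}: {value:.3f}")
```

In the notebook itself, the same function could be called on `eval_job.metrics`, assuming every entry carries the `type` and `value` keys shown in the examples.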

# Use Case #2: Summarization

## Load CNN Articles and get Summaries with GPT3.5-turbo

In [None]:
import csv
import os
from datasets import load_dataset
import openai

openai_client = openai.OpenAI()

In [None]:
# Load the cnn dataset. 
cnn_dataset = load_dataset("cnn_dailymail", "3.0.0")["test"]
cnn_dataset = cnn_dataset.shuffle(seed=42)
print(cnn_dataset)

In [None]:
if os.path.exists("summarization_data.csv"):
    os.remove("summarization_data.csv")

NUMBER_OF_RECORDS = 50

instruction="You are a helpful assistant. Please summarize the following article in a few sentences."

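# newline="" is what the csv module recommends, so that no extra blank rows are written on some platforms.
with open("summarization_data.csv", mode="w", newline="") as data_file: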
    data_writer = csv.writer(data_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    data_writer.writerow(["text", "groundtruth", "prediction"])

    for i in range(NUMBER_OF_RECORDS):
        article = cnn_dataset[i]["article"]
        groundtruth = cnn_dataset[i]["highlights"]

        print(f"{i}: {groundtruth}")
        messages = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": article},
        ]

        response_object = openai_client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages, seed=42
        )
        prediction = response_object.choices[0].message.content

        print(f"prediction: {prediction}")
        data_writer.writerow([article, groundtruth, prediction])

## Evaluation in Valor

In [None]:
# Read in the dataset of articles, groundtruth summaries and predictions.
df = pd.read_csv("summarization_data.csv")

In [None]:
# Create, build and finalize the dataset and model.
dataset = Dataset.create("summarization_dataset")
model = Model.create("summarization_model")

# Create a list of datums
datum_list = []
for i in range(len(df)):
    row = df.iloc[i]

    datum_list.append(
        Datum(
            uid=f"article{i}",
            text=row["text"],
            metadata={
                "query": "Summarize this article in a few sentences.", 
            }
        )
    )
    
# Build and finalize the dataset
for i in range(len(df)):
    row = df.iloc[i]
    datum = datum_list[i]

    dataset.add_groundtruth(
        GroundTruth(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["groundtruth"],
                ),
            ],
        )
    )
dataset.finalize()

# Build and finalize the model
for i in range(len(df)):
    row = df.iloc[i]
    datum = datum_list[i]

    model.add_prediction(
        dataset, 
        Prediction(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["prediction"],
                )
            ],
        )
    )
model.finalize_inferences(dataset)

In [None]:
# Using GPT4o to evaluate GPT3.5-turbo's predictions across a variety of metrics. 
eval_job = model.evaluate_text_generation(
    dataset,
    metrics_to_return=["BLEU", "ROUGE", "Coherence"],
    llm_api_params = {
        "client":"openai",
        "api_key":OPENAI_API_KEY,
        "data":{
            "model":"gpt-4o",
            "seed":2024,
        },
    },   
    metric_params={
        "BLEU": {
            "weights": [1, 0, 0, 0],
        }
    }
)

assert eval_job.wait_for_completion() == EvaluationStatus.DONE

eval_job.metrics

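# Example metrics in the same output format. Note that the parameters below reference
# "rag_dataset" and "query40" rather than the summarization dataset, so treat them as an
# illustration of the format and not as expected summarization results.
example_expected_metrics = [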
    {
        "type": "BLEU", 
        "value": 0.10000000000000002,
        "parameters": {
            "dataset": "rag_dataset",  
            "datum_uid": "query40", 
            "prediction": "Diving ducks are heavier than dabbling ducks, which makes it more difficult for them to take off and fly.",
            "weights": [1.0, 0.0, 0.0, 0.0],
        }, 
    },
    {
        "type": "Coherence", 
        "value": 5.0,
        "parameters": {
            "dataset": "rag_dataset", 
            "datum_uid": "query40", 
            "prediction": "Diving ducks are heavier than dabbling ducks, which makes it more difficult for them to take off and fly.",
        }, 
    },
    {
        "type": "ROUGE", 
        "value": {
            "rouge1": 0.18181818181818182, 
            "rouge2": 0.09999999999999999, 
            "rougeL": 0.18181818181818182, 
            "rougeLsum": 0.18181818181818182
        },
        "parameters": {
            "dataset": "rag_dataset", 
            "datum_uid": "query40", 
            "prediction": "Diving ducks are heavier than dabbling ducks, which makes it more difficult for them to take off and fly.", 
            "rouge_types": ["rouge1", "rouge2", "rougeL", "rougeLsum"], 
            "use_stemmer": False
        }, 
    },
]

# Use Case #3: Content Generation

## Some Example Content Generation Queries

In [None]:
queries = [
    "Write about a haunted house from the perspective of the ghost.",
    "Explain to an elementary school student how to do long multiplication with the example 43 times 22. The resulting answer should be 946.",
    "Draft an email to a coworker explaining a project delay. Explain that the delay is due to funding cuts, which resulted in multiple employees being moved to different projects. Inform the coworker that the project deadline will have to be pushed back. Be apologetic and professional. Express eagerness to still complete the project as efficiently as possible.",
]

query_metadata = [
    {
        "request_type": "creative",
    },
    {
        "request_type": "educational",
    },
    {
        "request_type": "professional",
    },
]

In [None]:
if os.path.exists("content_generation_data.csv"):
    os.remove("content_generation_data.csv")

instruction="You are a helpful assistant."

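# newline="" is what the csv module recommends, so that no extra blank rows are written on some platforms.
with open("content_generation_data.csv", mode="w", newline="") as data_file: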
    data_writer = csv.writer(data_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    data_writer.writerow(["query", "prediction"])

    for i in range(len(queries)):
        query = queries[i]

        messages = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": query},
        ]
        # Use the OpenAI client created earlier; `client` is the Valor client and has no chat API.
        response_object = openai_client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages, seed=42
        )
        prediction = response_object.choices[0].message.content

        print(f"prediction: {prediction}")
        data_writer.writerow([query, prediction])

## Evaluation in Valor

In [None]:
# Read in the dataset of queries and predictions.
df = pd.read_csv("content_generation_data.csv")

In [None]:
# Create, build and finalize the dataset and model.
dataset = Dataset.create("content_generation_dataset")
model = Model.create("content_generation_model")

# Create a list of datums
datum_list = []
for i in range(len(df)):
    row = df.iloc[i]

    datum_list.append(
        Datum(
            uid=f"query{i}",
            text=row["query"],
        )
    )

# Build and finalize the dataset
for i in range(len(df)):
    row = df.iloc[i]
    datum = datum_list[i]

    # There are no groundtruth annotations for content generation.
    dataset.add_groundtruth(
        GroundTruth(
            datum=datum,
            annotations=[],
        )
    )
dataset.finalize()

# Build and finalize the model
for i in range(len(df)):
    row = df.iloc[i]
    datum = datum_list[i]

    model.add_prediction(
        dataset, 
        Prediction(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["prediction"],
                )
            ],
        )
    )
model.finalize_inferences(dataset)

In [None]:
# Using GPT4o to evaluate GPT3.5-turbo's predictions across a variety of metrics. 
eval_job = model.evaluate_text_generation(
    dataset,
    metrics_to_return=["Coherence"],
    llm_api_params = {
        "client":"openai",
        "api_key":OPENAI_API_KEY,
        "data":{
            "model":"gpt-4o",
            "seed":2024,
        },
    },
)

assert eval_job.wait_for_completion() == EvaluationStatus.DONE

eval_job.metrics

example_expected_metrics = [
    {
        "value": 5.0,
        "type": "Coherence",
        "parameters": {
            "dataset": "content_generation_dataset",
            "datum_uid": "query2",
            "prediction": """Subject: Project Delay Due to Funding Cuts

Dear [Coworker's Name],

I hope this message finds you well. I am writing to update you on the status of our project and unfortunately, convey some disappointing news regarding a delay in its completion.

Due to recent funding cuts within our department, our project team has been significantly affected. Several team members, including myself, have been relocated to work on other projects to address the shifting priorities resulting from the budget constraints.

As a consequence of these unexpected changes, it is with regret that I must inform you that the original deadline for our project will need to be extended. I understand the inconvenience that this may cause, and I sincerely apologize for any inconvenience this delay may bring to you and your plans.

Rest assured that despite this setback, I am fully committed to ensuring that we still deliver the project with utmost efficiency and quality. I am exploring all possible avenues to mitigate the delay and work towards completing our project in a timely manner.

I appreciate your understanding and patience during this challenging time. Your ongoing support and collaboration are invaluable as we navigate through this situation together. If you have any concerns or questions, please do not hesitate to reach out to me.

Thank you for your understanding, and I look forward to working with you to successfully finalize our project.

Warm regards,

[Your Name]""",
        },
    },
]    