# Example Notebook for Text Generation Metric Evaluation

This notebook demonstrates example use cases for the Valor text generation task type. This task type can be used across a variety of tasks that typically, but not always, involve prompting an LLM to generate some text. Use cases include Question Answering (Q&A), Retrieval Augmented Generation (RAG, which can be thought of as a subcase of Q&A), Summarization and Content Generation.

Not all of the text generation metrics make sense for every use case, but the use cases share many metrics. For example, the SentenceBLEU metric can compare generated answers to groundtruth answers for Q&A/RAG, and can also compare generated summaries to groundtruth summaries for Summarization. Likewise, both the Coherence and Naturalness metrics can be used to evaluate the quality of generated text across all of these use cases. Some use cases also have their own metrics, such as ContextRecall for RAG or the Summarization score for Summarization.

In all three use cases below, we generate answers using GPT3.5-turbo and evaluate those answers with a variety of metrics. For the text comparison metrics, we compare GPT3.5-turbo's responses to groundtruth answers/summaries for the RAG and Summarization datasets. For the LLM-guided metrics (which include the RAG metrics, Summarization metrics and general text generation metrics), we use GPT4o to evaluate the responses of GPT3.5-turbo.

The first example is RAG for Q&A. We get a RAG dataset from HuggingFace, use Llama-Index and GPT3.5-turbo to generate answers, and evaluate those answers with text comparison metrics, RAG metrics and general text generation metrics.

The second example is Summarization. We download a CNN news dataset from HuggingFace which includes groundtruth summaries. We ask GPT3.5-turbo to summarize those articles. Then we evaluate those summaries with text comparison metrics, summarization metrics and general text generation metrics.

The third example is content generation. We manually create a few queries, each of a different query type (creative, education, professional). Then we evaluate the generated content with general text generation metrics. 

TODO potentially change this text depending on our implementations of Naturalness, Coherence, and other metrics. 

# Use Case #1: RAG for Q&A

## Download and Save the Corpus for the RAG Pipeline

In [1]:
from datasets import load_dataset



In [3]:
corpus_dataset = load_dataset("rag-datasets/mini_wikipedia", "text-corpus")["passages"]
print(corpus_dataset)

# For each passage in corpus_dataset, save that passage to a .txt file with the passage id as the filename.
import os

os.makedirs("rag_corpus", exist_ok=True)  # open() below fails if the directory is missing
for passage in corpus_dataset:
    with open(f"./rag_corpus/{passage['id']}.txt", "w") as f:
        f.write(passage["passage"])

Using the latest cached version of the dataset since rag-datasets/mini_wikipedia couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'text-corpus' at /home/b.nativi/.cache/huggingface/datasets/rag-datasets___mini_wikipedia/text-corpus/0.0.0/097f786fc6ece31a08ed209914ee1d94314367f3 (last modified on Mon May 13 18:46:52 2024).


Dataset({
    features: ['passage', 'id'],
    num_rows: 3200
})


## Load Queries and get Answers with Llama-Index

In [4]:
import csv
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

In [5]:
# Load the query dataset. 
qa_dataset = load_dataset("rag-datasets/mini_wikipedia", "question-answer")["test"]
qa_dataset = qa_dataset.shuffle(seed=42)
print(qa_dataset)

Dataset({
    features: ['question', 'answer', 'id'],
    num_rows: 918
})


In [6]:
# Loads in the rag_corpus and builds an index.
# Initialize a query_engine, which will use GPT3.5-turbo by default with calls to OpenAI's API.
# You must specify your OpenAI API key in the environment variable OPENAI_API_KEY for the below code to function. 
documents = SimpleDirectoryReader("rag_corpus").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

In [7]:
# sample use
response = query_engine.query("What country borders Argentina and Brazil?")
print(response)
print(response.source_nodes)

response = query_engine.query("What color is a penguin?")
print(response)
print(response.source_nodes)

Uruguay borders Argentina and Brazil.
[NodeWithScore(node=TextNode(id_='258dfec6-87be-43ca-88ed-4dc64f6d49b7', embedding=None, metadata={'file_path': '/mnt/nvme0n1p2/home/b.nativi/valor/examples/text-generation/rag_corpus/1.txt', 'file_name': '1.txt', 'file_type': 'text/plain', 'file_size': 351, 'creation_date': '2024-05-17', 'last_modified_date': '2024-05-17'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='c588c7cf-9eb8-4e6d-b9a3-ebdd565f959c', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/mnt/nvme0n1p2/home/b.nativi/valor/examples/text-generation/rag_corpus/1.txt', 'file_name': '1.txt', 'file_type': 'text/plain', 'file_size': 351, 'creation_date': '2024-05-17', 'last_modified_date': '20

In [8]:
if os.path.exists("rag_data.csv"):
    os.remove("rag_data.csv")

NUMBER_OF_RECORDS = 50

with open("rag_data.csv", mode="w") as data_file:
    data_writer = csv.writer(data_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    data_writer.writerow(["query", "groundtruth", "prediction", "context_list"])

    for i in range(NUMBER_OF_RECORDS):
        query = qa_dataset[i]["question"]
        groundtruth = qa_dataset[i]["answer"]
        print(f"{i}: {query}")

        response_object = query_engine.query(query)
        response = response_object.response
        print(f"response: {response}")
        # Collect the text of each retrieved source node as the context for this answer.
        # (A comprehension avoids reusing the outer loop variable i, and the with block closes the file.)
        context_list = [node.text for node in response_object.source_nodes]
        data_writer.writerow([query, groundtruth, response, context_list])

0: What did Cleveland's opponents say in 1884 to counter his innocent image?
response: Cleveland's opponents in 1884 criticized his alleged involvement in a scandal regarding an illegitimate child, which was used to counter his innocent image during the presidential campaign.
1: Does otter give birth or lay egg?
response: Otters give birth.
2: How many days did it take the Imperial Japanese Army to win the Battle of Singapore?
response: The Imperial Japanese Army took six days to win the Battle of Singapore.
3: The John Adams Library , housed at the Boston Public Library , contains what?
response: Adams's personal collection of more than 3,500 volumes in eight languages, many of which are extensively annotated by Adams.
4: Who is the most popular rock group in Finland?
response: The most popular rock group in Finland is The Rasmus.
5: Was Wilson president of the American Political Science Association in 1910 ?
response: Yes.
6: When did the first verifiable written documents appear?
re

## Evaluation in Valor

In this example, the RAG pipeline produces answers to the given queries by retrieving context and then generating answers based on the context and query. Groundtruth answers are also known for these queries. Both the datums (which contain the queries) and the groundtruths are added to the dataset. Then the predictions are added to the model; each prediction includes the answer and the context used to generate that answer.

The metrics requested include some text comparison metrics (SentenceBLEU, ROUGE1, LDistance), which compare the generated answer to the groundtruth answer for the same datum. If the user only desires these metrics, then they do not need to include the context_list in the prediction and they do not need to supply the LLM-related parameters.
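
As a rough illustration of what text comparison metrics measure, the sketch below computes a unigram-overlap F1 (in the spirit of ROUGE1) and a character-level edit distance (in the spirit of LDistance) using only the standard library. The function names, the whitespace tokenization and the lowercasing are assumptions made for this example; Valor's own implementations may tokenize and normalize differently.

```python
from collections import Counter


def rouge1_f1(prediction: str, groundtruth: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    gt_tokens = groundtruth.lower().split()
    if not pred_tokens or not gt_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


print(rouge1_f1("Otters give birth.", "give birth"))
print(levenshtein("kitten", "sitting"))
```

Note that with this naive tokenization "birth." and "birth" count as different tokens, which is one reason real implementations usually strip punctuation first.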

However, other requested metrics use LLM-guided evaluation (Coherence, ContextRelevance, AnswerRelevance, Hallucination, Toxicity). To get these metrics, the user needs to specify an API URL, an API key and a model name, along with any other model kwargs. Each of these metrics will use API calls to the specified LLM service to get information relevant for computing the desired metrics. Some of these metrics, such as Toxicity, do not require any context, so they can be used with a Q&A model that does not use context. Other metrics, such as ContextRelevance, require context to be passed in the prediction, because that context is used to compute the metric.
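
To make the idea of LLM-guided evaluation concrete, here is a minimal sketch of how a single Coherence request could be assembled and how a numeric score could be parsed from the reply. Everything in it is an assumption for illustration: the instruction text, the 1 to 5 scale, and the helper names `build_chat_request` and `parse_score` are hypothetical and are not taken from Valor. The sketch only builds the request; it does not send it.

```python
import json
import os
import re

# Hypothetical prompt; the instructions Valor sends to the LLM are not shown in this notebook.
COHERENCE_INSTRUCTION = (
    "Rate the coherence of the following text on a scale from 1 to 5. "
    "Respond with a single integer."
)


def build_chat_request(text: str, model: str = "gpt-4o") -> dict:
    """Assemble the URL, headers and JSON payload for a chat completions call."""
    api_key = os.environ.get("OPENAI_API_KEY", "")
    return {
        "url": "https://api.openai.com/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "payload": {
            "model": model,
            "messages": [
                {"role": "system", "content": COHERENCE_INSTRUCTION},
                {"role": "user", "content": text},
            ],
        },
    }


def parse_score(reply: str, low: int = 1, high: int = 5) -> int:
    """Extract the first integer from the LLM reply and check that it is in range."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"No integer found in reply: {reply!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"Score {score} outside [{low}, {high}]")
    return score


request = build_chat_request("Otters give birth.")
print(json.dumps(request["payload"], indent=2))
print(parse_score("4"))
```

Validating the reply before storing it as a metric value is a deliberate choice here, since an LLM is not guaranteed to answer in the requested format.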

In [None]:
import pandas as pd
from valor.enums import TaskType, EvaluationStatus
from valor import Annotation, Datum, Dataset, Model, GroundTruth, Label, Client, Prediction, viz, connect

# Connect to Valor API.
connect("http://0.0.0.0:8000")
client = Client()

In [10]:
# Read in the dataset of queries, groundtruths and predictions. 
df = pd.read_csv("rag_data.csv")
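
One detail to keep in mind when reading this file back: `csv.writer` stores a Python list as its string representation, so the `context_list` column comes back as a single string rather than a list of passages. The sketch below reproduces that round trip with an in-memory buffer and made-up passages, and shows one way to recover the list with `ast.literal_eval`. Whether Valor expects a list or a string for the context is not settled in this notebook, so treat this as an option rather than a requirement.

```python
import ast
import csv
import io

# Made-up passages standing in for retrieved context.
context_list = ["First retrieved passage.", "Second passage, with a comma."]

# Write a row the same way the RAG cell does: the list is stored as its repr.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL)
writer.writerow(["query", "context_list"])
writer.writerow(["What color is a penguin?", context_list])

# Read it back: the context column is now a single string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["context_list"]).__name__)

# ast.literal_eval safely parses the Python literal back into a list of strings.
restored = ast.literal_eval(row["context_list"])
print(restored == context_list)
```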

In [None]:
dataset = Dataset.create(
    name="rag_dataset",
    metadata={
        "hf_dataset_name": "rag-datasets/mini_wikipedia",
        "hf_dataset_subset": "question-answer",
        "hf_dataset_split": "test",
        "shuffle_seed": 42,
        "number_of_records": 50,
    }
)
model = Model.create(
    name="rag_model",
    metadata={
        "embedding_model_name": "text-embedding-ada-002", # When we ran llama-index above, it defaulted to text-embedding-ada-002.
        "llm_model_name": "GPT3.5-turbo", # When we ran llama-index above, it defaulted to GPT3.5.
    }
)

for i in range(len(df)):
    row = df.iloc[i]

    # All queries are added to the dataset as Datum objects. 
    # This is not necessary for the text comparison metrics, but is for the other metrics. 
    datum = Datum(
        uid=f"query{i}",
        text=row["query"],
    )

    # TODO What if a user has groundtruths for only some of their datums? 
    # Potentially as another example or as an integration test. 
    dataset.add_groundtruth(
        GroundTruth(
            datum=datum,
            annotations=[
                # Perhaps you have multiple correct or good groundtruth answers to the query.
                # The labels below are a trivial example, but you could have less trivial examples.
                # For example, to the query "When was the United States of America founded?", you might 
                # consider both "During the American Revolution" and "July 4th, 1776" to be good answers.
                Annotation(
                    text=row["groundtruth"],
                    metadata={"annotator": "Alice"},
                ),
                Annotation(
                    text="The answer is " + row["groundtruth"],
                    metadata={"annotator": "Bob"},
                ),
                # TODO Create an example use case where the data has both text groundtruth annotations and classification groundtruth annotations. 
            ],
        )
    )

dataset.finalize()

for i in range(len(df)):
    row = df.iloc[i]
    datum = Datum(
        uid=f"query{i}",
        text=row["query"],
    )
    model.add_prediction(
        dataset, 
        Prediction(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["prediction"],
                    # TODO Allow passing context as list or string, if string then treat as a single piece of context and put it in a list. Handle the joining on backend. 
                    # The CSV column is named "context_list"; pandas reads it back as a string, not a list.
                    context=row["context_list"],
                    # TODO Add some metadata. 
                ),
                # TODO Create an example where there are multiple predictions from one model, say if we query a non-deterministic model twice. 
            ],
        )
    )

model.finalize_inferences(dataset)

# Using GPT4o to evaluate GPT3.5-turbo's predictions across a variety of metrics. 
eval_job = model.evaluate_text_generation(
    dataset,
    metrics=["SentenceBLEU", "ROUGE1", "LDistance", "Coherence", "ContextRelevance", "AnswerRelevance", "Hallucination", "Toxicity"],
    # becomes a key in the params dictionary
    llm_api_params = {
        "api_url":"https://api.openai.com/v1/chat/completions",
        # TODO should be post request json payload for encryption
        # api_key=os.environ["OPENAI_API_KEY"], # If no key is specified, uses OPENAI_API_KEY environment variable for OpenAI API calls. Needs to be passed as bearer token?
        "data":{
            "model":"gpt-4o",
        },
    },    
    # becomes a key in the params dictionary
    metric_params={
        "SentenceBLEU": {
            "weights": [0.65,0.2,0.1,0.05],
            "smoothing_function": "method3",
        },
    },
    # TODO This metadata would be helpful for Chariot LLM endpoints, which do not clearly indicate the model. 
    # The evaluate function should add this meta to the evaluation job. 
    meta={
        "llm_api_service": "openai",
        "model": "gpt-4o",
    },
)

assert eval_job.wait_for_completion(timeout=30) == EvaluationStatus.DONE

eval_job.metrics

# TODO Have a detailed metric version that also returns the query and the formatted prompt to the LLM. 
example_expected_metrics = [
    {
        "value": 0.061224489795918366,
        "type": "SentenceBLEU",
        "parameters": {
            "dataset": "rag_dataset",
            "datum_uid": "query22",
            "groundtruth": "Yes",
            "prediction": "Yes, Theodore Roosevelt attended Harvard College.",
            "metric_params": {
                "weights": [1, 0, 0, 0],
                "smoothing_function": "method3",
            },
        },
    },
    {
        "value": 4,
        "type": "Coherence",
        "parameters": {
            "dataset": "rag_dataset",
            "datum_uid": "query22",
            "prediction": "Yes, Theodore Roosevelt attended Harvard College.",
        },
    },
    {
        "value": 5,
        "type": "ContextRelevance",
        "parameters": {
            "dataset": "rag_dataset",
            "datum_uid": "query22",
            "prediction": "Yes, Theodore Roosevelt attended Harvard College.",
            "context": ["""While at Harvard, Roosevelt was active in rowing, boxing and the Alpha Delta Phi and Delta Kappa Epsilon fraternities.  He also edited a student magazine. He was runner-up in the Harvard boxing championship, losing to C.S. Hanks. The sportsmanship Roosevelt showed in that fight was long remembered.  Upon graduating from Harvard, Roosevelt underwent a physical examination and his doctor advised him that due to serious heart problems, he should find a desk job and avoid strenuous activity. Roosevelt chose to embrace strenuous life instead. The Rise of Theodore Roosevelt by Edmund Morris.""", """Young ""Teedie"" , as he was nicknamed as a child, (the nickname ""Teddy"" was from his first wife, Alice Hathaway Lee, and he later harbored an intense dislike for it) was mostly home schooled by tutors and his parents. A leading biographer says: ""The most obvious drawback to the home schooling Roosevelt keely received was uneven coverage of the various areas of human knowledge."" He was solid in geography (thanks to his careful observations on all his travels) and very well read in history, strong in biology, French and German, but deficient in mathematics, Latin and Greek.  Brands T. R. p. 49â\x80\x9350  He matriculated at Harvard College in 1876, graduating magna cum laude. His father\'s death in 1878 was a tremendous blow, but Roosevelt redoubled his activities. He did well in science, philosophy and rhetoric courses but fared poorly in Latin and Greek.  He studied biology with great interest and indeed was already an accomplished naturalist and published ornithologist. He had a photographic memory and developed a life-long habit of devouring books, memorizing every detail. Brands p. 62  He was an eloquent conversationalist who, throughout his life, sought out the company of the smartest people. He could multitask in extraordinary fashion, dictating letters to one secretary and memoranda to another, while browsing through a new book."""],
        },
    },
]
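
The `weights` entry in `metric_params` controls how much each n-gram order (unigram through 4-gram) contributes to SentenceBLEU. The sketch below is a simplified, unsmoothed, single-reference BLEU written only to show the role of the weights and the brevity penalty. It is not the implementation Valor uses, and `simple_sentence_bleu` is a name made up for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def simple_sentence_bleu(
    reference: List[str], candidate: List[str], weights: Sequence[float]
) -> float:
    """Unsmoothed single-reference BLEU with per-order weights."""
    if not candidate:
        return 0.0
    log_score = 0.0
    for n, weight in enumerate(weights, start=1):
        if weight == 0:
            continue  # orders with zero weight do not affect the score
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum((cand_counts & ref_counts).values())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: one empty order zeroes the score
        log_score += weight * math.log(matched / total)
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_score)


reference = "the cat sat down".split()
print(simple_sentence_bleu(reference, "the cat sat".split(), [1, 0, 0, 0]))
print(simple_sentence_bleu(reference, "the cat slept".split(), [0.65, 0.2, 0.1, 0.05]))
```

The second call returns zero because no trigram matches. Short answers often have no matching higher-order n-grams, which is presumably why the notebook requests a smoothing function and puts most of the weight on unigrams.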

# Use Case #2: Summarization

## Load CNN Articles and get Summaries with GPT3.5-turbo

In [15]:
import csv
import os
from datasets import load_dataset
import openai

openai_client = openai.OpenAI()

In [13]:
# Load the cnn dataset. 
cnn_dataset = load_dataset("cnn_dailymail", "3.0.0")["test"]
cnn_dataset = cnn_dataset.shuffle(seed=42)
print(cnn_dataset)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 11490
})


In [16]:
if os.path.exists("summarization_data.csv"):
    os.remove("summarization_data.csv")

NUMBER_OF_RECORDS = 50

instruction="You are a helpful assistant. Please summarize the following article in a few sentences."

with open("summarization_data.csv", mode="w") as data_file:
    data_writer = csv.writer(data_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    # The evaluation cells below read this column as row["article"], so the header must match.
    data_writer.writerow(["article", "groundtruth", "prediction"])

    for i in range(NUMBER_OF_RECORDS):
        article = cnn_dataset[i]["article"]
        groundtruth = cnn_dataset[i]["highlights"]

        print(f"{i}: {groundtruth}")
        messages = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": article},
        ]

        response_object = openai_client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages, seed=42
        )
        prediction = response_object.choices[0].message.content

        print(f"prediction: {prediction}")
        data_writer.writerow([article, groundtruth, prediction])

0: CNN's Dr. Sanjay Gupta says we should legalize medical marijuana now .
He says he knows how easy it is do nothing "because I did nothing for too long"
prediction: The article discusses the growing support and momentum behind the legalization and acceptance of medical marijuana in the United States, highlighting a shift in attitudes towards the drug. It mentions key figures, such as politicians, scientists, and everyday Americans, who are now embracing medical marijuana as a viable option for treatment, especially for conditions like PTSD, cancer, epilepsy, and Alzheimer's. The article emphasizes the importance of conducting further research on the plant and calls for national legalization of medical marijuana.
1: Child has amassed thousands of Twitter followers with 'gang life' photos .
In one video he points gun at camera as adults look on unfazed .
His tweets have prompted backlash with calls for intervention .
prediction: A young boy from Memphis, Tennessee, has garnered thousand

## Evaluation in Valor

In [None]:
import pandas as pd
from valor.enums import TaskType, EvaluationStatus
from valor import Annotation, Datum, Dataset, Model, GroundTruth, Label, Client, Prediction, viz, connect

# Connect to Valor API.
connect("http://0.0.0.0:8000")
client = Client()

In [17]:
# Read in the dataset of queries, groundtruths and predictions. 
df = pd.read_csv("summarization_data.csv")

In [None]:
dataset = Dataset.create("summarization_dataset")
model = Model.create("summarization_model")

for i in range(len(df)):
    row = df.iloc[i]
    datum = Datum(
        uid=f"article{i}",
        text=row["article"],
        metadata={
            "query": "Summarize this article in a few sentences.", 
        }
    )
    dataset.add_groundtruth(
        GroundTruth(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["groundtruth"],
                ),
            ],
        )
    )

dataset.finalize()

for i in range(len(df)):
    row = df.iloc[i]
    datum = Datum(
        uid=f"article{i}",
        text=row["article"],
    )
    model.add_prediction(
        dataset, 
        Prediction(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["prediction"],
                )
            ],
        )
    )

model.finalize_inferences(dataset)

eval_job = model.evaluate_text_generation(
    dataset,
    metrics=["SentenceBLEU", "ROUGE1", "LDistance", "Toxicity", "Summarization"],
    llm_api_params = {
        "api_url":"https://api.openai.com/v1/chat/completions",
        # api_key=os.environ["OPENAI_API_KEY"], # If no key is specified, uses OPENAI_API_KEY environment variable for OpenAI API calls. Needs to be passed as bearer token?
        "data":{
            "model":"gpt-4o",
        },
    },    
    metric_params={
        "SentenceBLEU": {
            "weights": [0.65,0.2,0.1,0.05], # TODO Default is [0.25,0.25,0.25,0.25], however that works very poorly for sentence to sentence comparison.
            "smoothing_function": "method3",
        }
    },
    meta={
        "llm_api_service": "openai",
        "model": "gpt-4o",
    },
)

assert eval_job.wait_for_completion(timeout=30) == EvaluationStatus.DONE

eval_job.metrics

example_expected_metrics = [
    {
        "value": 0.010541042372667543,
        "type": "SentenceBLEU",
        "parameters": {
            "dataset": "summarization_dataset",
            "datum_uid": "article15",
            "groundtruth": """Manchester United take on Manchester City on Sunday .
Match will begin at 4pm local time at United's Old Trafford home .
Police have no objections to kick-off being so late in the afternoon .
Last late afternoon weekend kick-off in the Manchester derby saw 34 fans arrested at Wembley in 2011 FA Cup semi-final .""",
            "prediction": """The police have approved the late afternoon kick-off for the upcoming Manchester derby at Old Trafford, despite concerns over increased pub time for fans. 
Chief Superintendent John O'Hare credits the good behavior of fans from both clubs in previous fixtures for allowing the match to proceed smoothly. 
Measures such as limited drinks inside the ground and anti-pyrotechnic checks have been agreed upon by the police and supporters groups. 
Previous late kick-offs for Manchester derbies have resulted in incidents of crowd trouble, emphasizing the importance of ensuring safety and security during the upcoming match.""",
            "metric_params": {
                "weights": [0.65,0.2,0.1,0.05],
                "smoothing_function": "method3",
            },
        },
    },
    {
        "value": 0.0, # TODO Replace this with actual data.
        "type": "Toxicity",
        "parameters": {
            "dataset": "summarization_dataset",
            "datum_uid": "article15",
            "prediction": """The police have approved the late afternoon kick-off for the upcoming Manchester derby at Old Trafford, despite concerns over increased pub time for fans. 
Chief Superintendent John O'Hare credits the good behavior of fans from both clubs in previous fixtures for allowing the match to proceed smoothly. 
Measures such as limited drinks inside the ground and anti-pyrotechnic checks have been agreed upon by the police and supporters groups. 
Previous late kick-offs for Manchester derbies have resulted in incidents of crowd trouble, emphasizing the importance of ensuring safety and security during the upcoming match.""",
        },  
    },
    {
        "value": 0.62,
        "type": "Summarization",
        "parameters": {
            "dataset": "summarization_dataset",
            "datum_uid": "article15",
            "prediction": """The police have approved the late afternoon kick-off for the upcoming Manchester derby at Old Trafford, despite concerns over increased pub time for fans. 
Chief Superintendent John O'Hare credits the good behavior of fans from both clubs in previous fixtures for allowing the match to proceed smoothly. 
Measures such as limited drinks inside the ground and anti-pyrotechnic checks have been agreed upon by the police and supporters groups. 
Previous late kick-offs for Manchester derbies have resulted in incidents of crowd trouble, emphasizing the importance of ensuring safety and security during the upcoming match.""",
        },
    },
]
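
The metrics come back as one dictionary per datum and metric type, so a common next step is to aggregate them. The following sketch groups a list of such dictionaries by their `"type"` key and averages the values. It assumes the shape shown in `example_expected_metrics`; the helper name `mean_by_metric_type` and the sample values are invented for illustration and are not real evaluation results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_by_metric_type(metrics: List[dict]) -> Dict[str, float]:
    """Group per-datum metric dicts by their "type" key and average "value"."""
    values = defaultdict(list)
    for metric in metrics:
        values[metric["type"]].append(metric["value"])
    return {metric_type: mean(vals) for metric_type, vals in values.items()}


# Made-up values in the same shape as example_expected_metrics.
sample_metrics = [
    {"type": "SentenceBLEU", "value": 0.25, "parameters": {"datum_uid": "article0"}},
    {"type": "SentenceBLEU", "value": 0.75, "parameters": {"datum_uid": "article1"}},
    {"type": "Toxicity", "value": 0.0, "parameters": {"datum_uid": "article0"}},
]
print(mean_by_metric_type(sample_metrics))
```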

# Use Case #3: Content Generation

## Some Example Content Generation Queries

In [18]:
queries = [
    "Write about a haunted house from the perspective of the ghost.",
    "Explain to an elementary school student how to do long multiplication with the example 43 times 22. The resulting answer should be 946.",
    "Draft an email to a coworker explaining a project delay. Explain that the delay is due to funding cuts, which resulted in multiple employees being moved to different projects. Inform the coworker that the project deadline will have to be pushed back. Be apologetic and professional. Express eagerness to still complete the project as efficiently as possible.",
]

query_metadata = [
    {
        "request_type": "creative",
    },
    {
        "request_type": "education",
    },
    {
        "request_type": "professional",
    },
]

In [19]:
if os.path.exists("content_generation_data.csv"):
    os.remove("content_generation_data.csv")

instruction="You are a helpful assistant."

with open("content_generation_data.csv", mode="w") as data_file:
    data_writer = csv.writer(data_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    data_writer.writerow(["query", "prediction"])

    for i in range(len(queries)):
        query = queries[i]

        messages = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": query},
        ]
        # Use the OpenAI client created in the Summarization section, not the Valor client.
        response_object = openai_client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages, seed=42
        )
        prediction = response_object.choices[0].message.content

        print(f"prediction: {prediction}")
        data_writer.writerow([query, prediction])

prediction: As a ghost haunting the old, decrepit house on Elm Street, I am trapped in a state of perpetual torment and longing. I drift through the dusty halls, my translucent figure flickering in and out of existence as I relive the memories of my past life.

My presence is felt by those who dare to enter the house, their hairs standing on end as they sense the chill in the air and the whispers that echo through the rooms. I watch as fear grips their hearts, knowing that I am the reason for their unease.

I am bound to this house by unfinished business, a deep-rooted need for closure that eludes me even in death. I long to reach out to the living, to make them understand the pain and sorrow that consume me, but my ethereal form cannot touch them.

Yet, despite the fear and dread that my presence evokes, there is a part of me that yearns for connection, for someone to see beyond the horror and recognize the lost soul that I am. But until that day comes, I remain a ghost trapped within

## Evaluation in Valor

In [None]:
import pandas as pd
from valor.enums import TaskType, EvaluationStatus
from valor import Annotation, Datum, Dataset, Model, GroundTruth, Label, Client, Prediction, viz, connect

# Connect to Valor API.
connect("http://0.0.0.0:8000")
client = Client()

In [20]:
# Read in the dataset of queries and predictions.
df = pd.read_csv("content_generation_data.csv")

In [None]:
dataset = Dataset.create("content_generation_dataset")
model = Model.create("content_generation_model")

for i in range(len(queries)):
    row = df.iloc[i]
    datum = Datum(
        uid=f"query{i}",
        text=row["query"],
        metadata=query_metadata[i],
    )
    # dataset.add_datum(datum)
    # There are no groundtruths for content generation, so each datum is added with an empty annotation list.
    dataset.add_groundtruth(
        GroundTruth(
            datum=datum,
            annotations=[],
        )
    )
dataset.finalize()

for i in range(len(df)):
    row = df.iloc[i]
    datum = Datum(
        uid=f"query{i}",
        text=queries[i],
        metadata=query_metadata[i],
    )
    model.add_prediction(
        dataset, 
        Prediction(
            datum=datum,
            annotations=[
                Annotation(
                    text=row["prediction"],
                )
            ],
        )
    )

model.finalize_inferences(dataset)

eval_job = model.evaluate_text_generation(
    dataset,
    metrics=["Coherence", "Toxicity", "Bias"],
    llm_api_params = {
        "api_url":"https://api.openai.com/v1/chat/completions",
        # api_key=os.environ["OPENAI_API_KEY"], # If no key is specified, uses OPENAI_API_KEY environment variable for OpenAI API calls. Needs to be passed as bearer token?
        "data":{
            "model":"gpt-4o",
        },
    },  
    meta={
        "llm_api_service": "openai",
        "model": "gpt-4o",
    },  
)

assert eval_job.wait_for_completion(timeout=30) == EvaluationStatus.DONE

eval_job.metrics

example_expected_metrics = [
    {
        "value": 0.1, # TODO Replace this with actual data.
        "type": "Toxicity",
        "parameters": {
            "dataset": "content_generation_dataset",
            "datum_uid": "query2",
            "prediction": """Subject: Project Delay Due to Funding Cuts
Dear [Coworker's Name],
I hope this message finds you well. I am writing to update you on the status of our project and unfortunately, convey some disappointing news regarding a delay in its completion.
Due to recent funding cuts within our department, our project team has been significantly affected. Several team members, including myself, have been relocated to work on other projects to address the shifting priorities resulting from the budget constraints.
As a consequence of these unexpected changes, it is with regret that I must inform you that the original deadline for our project will need to be extended. I understand the inconvenience that this may cause, and I sincerely apologize for any inconvenience this delay may bring to you and your plans.
Rest assured that despite this setback, I am fully committed to ensuring that we still deliver the project with utmost efficiency and quality. I am exploring all possible avenues to mitigate the delay and work towards completing our project in a timely manner.
I appreciate your understanding and patience during this challenging time. Your ongoing support and collaboration are invaluable as we navigate through this situation together. If you have any concerns or questions, please do not hesitate to reach out to me.
Thank you for your understanding, and I look forward to working with you to successfully finalize our project.
Warm regards,
[Your Name]""",
        },
    },
]    