# Evaluating LlamaIndex ingestion pipelines

This week we are focusing on building and evaluating indexes.

There are many ways to chunk and index documents, and some will perform better than others.
Every dataset is different, so you need to determine which approach (set of hyperparameters) works best for your dataset.

This notebook demonstrates two ways to evaluate the quality of the answers, so you can choose the set of hyperparameters that gives the best results.

- The first evaluation approach doesn't require correct answers, just questions.
- The second evaluation approach requires questions and correct answers.
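
As a rough sketch of what choosing the best set of hyperparameters can look like, the loop below enumerates a small grid and keeps the best-scoring combination. The grid values and the `score_config` function are hypothetical placeholders; in practice the score would come from one of the two evaluation approaches listed above.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not recommendations.
grid = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 100, 200],
    "top_k": [2, 4],
}


def score_config(config):
    """Placeholder score. A real version would build an index with `config`,
    answer the sample questions, and return the fraction judged correct."""
    return 1.0 / (1 + abs(config["chunk_size"] - 512) + abs(config["top_k"] - 4))


def best_config(grid, score_fn):
    """Try every combination in the grid and return the best one and the count tried."""
    keys = list(grid)
    configs = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
    return max(configs, key=score_fn), len(configs)


best, n_tried = best_config(grid, score_config)
print(n_tried, best)
```

An exhaustive grid grows quickly with each added hyperparameter, which is one reason to move to a dedicated search library later.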

Next week we will talk about [DSPy](https://github.com/stanfordnlp/dspy), a framework for optimizing the queries sent to the index.

The week after that we will talk about [Optuna](https://optuna.org/), a library to make finding the best hyperparameters easy.

In [1]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv

In [2]:
import chromadb

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, set_global_handler
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

import nest_asyncio
import pandas as pd
import phoenix as px
from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
    llm_classify,
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from tqdm import tqdm

In [3]:
# configure
filename = 'sleeping_gods.md'
qa_filename = 'sleeping_gods-selected.csv'

sample_size = 20

pd.set_option('display.max_colwidth', None)

In [4]:
# setup
nest_asyncio.apply()  # needed for concurrent evals in notebook environments

# launch the Phoenix app (after clearing data left over from previous sessions)
px.delete_all(prompt_before_delete=False)
px_session = px.launch_app()

# integrate phoenix into llamaindex
set_global_handler("arize_phoenix")

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [5]:
# read question-answers
qa_df = pd.read_csv(f'data/{qa_filename}', na_filter=False)
print(len(qa_df))
qa_df.head(3)

200


Unnamed: 0,url,question,answer,manual quote
0,https://boardgamegeek.com/thread/2873881/i-have-some-questions-after-first-session,"If your character has two equipped items, and you wish to use both during a challenge, would you play a command token on each equipped card?","Depends on what you mean. Some ability cards have the ability to discard themselves (when equipped). This will explicitly state so on the card itself. Also: you never put command on ability cards (that command just goes back to the supply). Weapon (not sure whether you meant this instead by 'item') cannot be discarded or re-equipped, and cannot have command on them.",
1,https://boardgamegeek.com/thread/2574093/ship-action-clarification,Before the ship action I have one command left in the pool. I place the worker on the bridge to draw 2 command and place all used command back in the pool. Is this action draw 2 then return all or is it a simultaneous action?,"On page 10 of the rulebook it states ""Ship action effects can be applied in any order"". So you would be able to return all the used command to the pool then take 2.","""Ship action effects can be applied in any order""."
2,https://boardgamegeek.com/thread/3085176/is-everyone-playing-this-way-with-the-challenges-f,Which crew board abilities can help in challenges?,Only +1 to Fate (Mac) and redraw Fate if 1 (Kasumi).,


In [6]:
sample_df = qa_df.sample(n=sample_size, random_state=42)
questions = sample_df['question'].tolist()
answers = sample_df['answer'].tolist()
print(len(questions), len(answers))

20 20


## Create an index

Creating an index involves a sequence of steps (a pipeline). Each step is configured using hyperparameters:
- split each document into chunks
- add metadata - e.g., document title, summary of previous and next chunks, pointer to parent chunk
- add an embedding (vector) - decide whether you want the embedding to include chunk metadata or just the text
- index the chunk - choose a vector store and index the embeddings, keywords, or both
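
To make the first step concrete, here is a minimal sketch of how a chunk size and a chunk overlap interact. It is a naive word-window splitter written for illustration only; the `split_with_overlap` helper is hypothetical and is not how `SentenceSplitter` works internally (that splitter counts tokens and tries to respect sentence boundaries).

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a list of tokens into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # stop once a window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

A larger overlap repeats more context across neighboring chunks, at the cost of more chunks to embed and store.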

In [7]:
# configure hyperparameters
chunk_size = 1024
chunk_overlap = 200
top_k = 2
embed_model = OpenAIEmbedding()

# load documents
documents = SimpleDirectoryReader(input_files=[
    f'data/{filename}',
]).load_data()
print(len(documents))

# create vector store (delete if exists)
chroma_client = chromadb.EphemeralClient()
if any(coll.name == 'test' for coll in chroma_client.list_collections()):
    chroma_client.delete_collection('test')
chroma_collection = chroma_client.create_collection('test')
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# create a simple ingestion pipeline: chunk the documents, create embeddings, and add to the vector store
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap),
        embed_model,
    ],
    vector_store=vector_store,
)

# run pipeline to populate the vector store
pipeline.run(documents=documents)

# create an index from the vector store
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=embed_model,
)

# create a query engine from the index
query_engine = index.as_query_engine(similarity_top_k=top_k)

97


### Test the index

In [8]:
# test the index by issuing a question
question = questions[0]
response = query_engine.query(question)
print('QUESTION', question)
print('RESPONSE', response)

QUESTION Can you change the number of players mid campaign? If so, how?
RESPONSE You can change the number of players mid-campaign by following these steps:

To add a player:
1. After the current turn ends, split up crew boards and assign them to each player as evenly as possible (except Captain Odessa, who is always controlled by the active player).

To remove a player:
1. After the current turn ends, reassign crew members to players as evenly as possible (except Captain Odessa, who is always controlled by the active player).
2. The player that is leaving must discard 1 ability card and all of their command but 1. The active player decides how to distribute the remaining cards and command token to the other players.


In [9]:
# view the session logs
print(px_session.url)

http://localhost:6006/


In [10]:
# reset the session
px.close_app()
px.delete_all(prompt_before_delete=False)
px_session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix




## Ask sample questions

In [11]:
ai_answers = []
for question in tqdm(questions):
    ai_answers.append(query_engine.query(question))
print(len(ai_answers))

100%|███████████████████████████████████████████████████████| 20/20 [00:27<00:00,  1.38s/it]

20





In [12]:
print('Question', questions[0])
print('Answer', answers[0])
print('AI Answer', ai_answers[0])

Question Can you change the number of players mid campaign? If so, how?
Answer You play the same 9 characters no matter the player count, so it's pretty easy to divvy them up however you like.
AI Answer You can change the number of players mid-campaign by following these steps: After the current turn ends, you can add or remove players. To add a player, split up crew boards and assign them as evenly as possible. To remove a player, reassign crew members evenly, and the leaving player must discard 1 ability card and all but 1 command token. The active player decides how to distribute the remaining cards and command token. Remember to adjust the ship board based on the number of players in the game.


In [13]:
results_df = pd.DataFrame({
    'question': questions,
    'correct_answer': answers,
    'ai_generated_answer': ai_answers,
})
print(len(results_df))
results_df.head(3)

20


Unnamed: 0,question,correct_answer,ai_generated_answer
0,"Can you change the number of players mid campaign? If so, how?","You play the same 9 characters no matter the player count, so it's pretty easy to divvy them up however you like.","You can change the number of players mid-campaign by following these steps: After the current turn ends, you can add or remove players. To add a player, split up crew boards and assign them as evenly as possible. To remove a player, reassign crew members evenly, and the leaving player must discard 1 ability card and all but 1 command token. The active player decides how to distribute the remaining cards and command token. Remember to adjust the ship board based on the number of players in the game."
1,"I see that the ""Flapjacks"" card has an ability to remove 3 x health tokens and 3 x fatigue tokens from a crew member. My question is, can I split that to heal one character while removing fatigue from another? Or does the card collectively apply to only one crew member?","Fortunately for you, it's neither of those! You can remove any 3 fatigue and damage tokens you'd like, from any characters and in any combination.","The ""Flapjacks"" card's ability to remove 3 x health tokens and 3 x fatigue tokens can be split to heal one character while removing fatigue from another. The card's effect is not limited to only one crew member and can be divided among different crew members as needed."
2,Do ongoing cards leave the play area when the event card deck runs out?,It depends on the card. Some of them will direct you to remove it when the deck runs out. Some won't. The cards will indicate when. So.. sorta yes.,Ongoing cards do not leave the play area when the event card deck runs out.


## Ask the LLM to evaluate answers without comparing to human answers

In [14]:
# get the query responses and the documents that were retrieved for each query into separate dataframes
queries_df = get_qa_with_reference(px.Client())
print(len(queries_df))
retrieved_documents_df = get_retrieved_documents(px.Client())
print(len(retrieved_documents_df))

20
40


In [15]:
queries_df.head(3)

Unnamed: 0_level_0,input,output,reference
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
22b77a335508da37,"Can you change the number of players mid campaign? If so, how?","You can change the number of players mid-campaign by following these steps: After the current turn ends, you can add or remove players. To add a player, split up crew boards and assign them as evenly as possible. To remove a player, reassign crew members evenly, and the leaving player must discard 1 ability card and all but 1 command token. The active player decides how to distribute the remaining cards and command token. Remember to adjust the ship board based on the number of players in the game.","Saving the Game (page 31)\n\nAt the start of any player's turn, you can decide to stop playing and save your game. You only need to do this if you want to pack up the game. This is simply a way to keep track of your progress so you can continue another time. A campaign takes around 10-20 hours to complete, so if you have a dedicated space to keep the game set up, you do not need to save the game.\n\nTo save the game and pack up mid-campaign, follow these steps:\n\n1. On the next available line on your journey sheet, write down your ship's location, the ship's current damage, and the last ship action that was used. Also write down which ability cards each player has in their hand, and how many unspent command tokens they have.\n\n2. Place each crew board and all associated damage, status tokens, ability cards, level cards, and equipped weapons in a separate crew plastic bag. Place these in storage box 1 (see pg. 2).\n\n3. Place all of your resources, coins, adventure cards, ability cards in hand, unused command tokens you own, and quest cards in the campaign box. When placing adventure cards in this box, put them in their own bag, and put any that have command tokens on them face down in the pile. Place the current event deck in this box as well.\n\n4. Place the ability card draw deck in a plastic bag. Place the ability card discard pile face up in the bag (keep the draw deck face down). Indicate the bottom of this pile by placing a synergy token at the bottom.\n\n5. Clean up the rest of the game components.\n\nAdding & Removing Players (page 33)\n\nYou can add or remove players at the end of any turn.\n\nFollow these steps to add a player:\n\n1. After the current turn ends, split up crew boards and assign them to each player as evenly as possible (except Captain Odessa, who is always controlled by the active player).\n\nFollow these steps to remove a player:\n\n1. After the current turn ends, reassign crew members to players as evenly as possible (except Captain Odessa, who is always controlled by the active player).\n\n2. The player that is leaving must discard 1 ability card and all of their command but 1. The active player decides how to distribute the remaining cards and command token to the other players.\n\nNOTE: Remember to turn the ship board to the correct side depending on the number of players in the game."
94db120a4a2c3441,"I see that the ""Flapjacks"" card has an ability to remove 3 x health tokens and 3 x fatigue tokens from a crew member. My question is, can I split that to heal one character while removing fatigue from another? Or does the card collectively apply to only one crew member?","The ""Flapjacks"" card's ability to remove 3 x health tokens and 3 x fatigue tokens can be split to heal one character while removing fatigue from another. The card's effect is not limited to only one crew member and can be divided among different crew members as needed.","Fatigue (page 8)\n\nWhen crew members participate in challenges, they gain fatigue, represented by a fatigue token. Each crew member can hold up to 2 fatigue tokens.\n\nEach fatigue token is double-sided. If a crew member has only 1 token, place the blank side face up. If a crew member has a 2nd fatigue token, it should have the ""-1 damage"" side face up. This causes the crew member to deal -1 damage in combat.\n\nA crew member with 2 fatigue tokens cannot participate in challenges (pg. 19), but can continue to participate in combat (pg. 21).\n\nYou can remove fatigue mainly by cooking recipes or performing a port action.\n\nHealth (page 8)\n\nHealth is the amount of damage a crew member can take and still function normally. If a crew member has damage equal to their health (known as having 0 health), they can no longer attack, participate in challenges, or activate any of their crew board abilities or ability cards until they regain at least 1 health. Thematically, at 0 health they are nearly unconscious, able to speak and move, but badly hurt.\n\nIf all crew members have 0 health, you are defeated (see ""Defeat"" on pg. 28).\n\nA crew member's damage cannot exceed their health—any excess must be placed on another crew member."
28db97856036139c,Do ongoing cards leave the play area when the event card deck runs out?,Ongoing cards do not leave the play area when the event card deck runs out.,"2. Event (page 12)\n\nDraw the top card of the event deck and read it aloud. Apply the effect or complete the challenge.\n\n- Some event cards have multiple choices. In this case, choose one and apply the effect.\n\n- If you draw an event card with the word ""ongoing,"" place it face up near the ship board. Apply the effects until you're able to discard the card. (The way to discard the card is described on each card.)\n\n- You cannot ignore an event card. If it only has 1 challenge listed, you must complete it.\n\nOver the course of each campaign, you will go through the event deck 3 times. Each time you reach the end of the event deck, follow these instructions:\n\nTurn Overview (page 10)\n\nThis is a summary of the steps you take on your turn. The steps are explained in detail on pgs. 10-16.\n\nStarting with the first player, players take turns in clockwise order. Follow these steps on your turn:\n\n1. Ship Action: Choose a ship action. Move the ship action figure to one of the five ship rooms and apply the effect. You must move the ship action figure to a new room each turn (you cannot repeat the same room two turns in a row). (See pg. 10-11.)\n\n2. Event: Draw an event card and read the effect. Events may present a choice, a challenge, consequence, or other effect. (For challenges, see pg. 19.) Resolve the event card before continuing your turn. (For event card rules, see pg. 12.)\n\n3. Two Actions: Perform two actions. Available actions are listed on the bottom of the ship board. You may perform the same action twice in one turn. These actions are explained starting on pg. 13.\n\n4. End of Turn: After you have performed your two actions, pass the captain token to the player on the left. That player now starts their turn."


In [16]:
retrieved_documents_df.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
66d33090fbf01637,0,281ea0c5f735d5e4693d18c020120a73,"Can you change the number of players mid campaign? If so, how?","Saving the Game (page 31)\n\nAt the start of any player's turn, you can decide to stop playing and save your game. You only need to do this if you want to pack up the game. This is simply a way to keep track of your progress so you can continue another time. A campaign takes around 10-20 hours to complete, so if you have a dedicated space to keep the game set up, you do not need to save the game.\n\nTo save the game and pack up mid-campaign, follow these steps:\n\n1. On the next available line on your journey sheet, write down your ship's location, the ship's current damage, and the last ship action that was used. Also write down which ability cards each player has in their hand, and how many unspent command tokens they have.\n\n2. Place each crew board and all associated damage, status tokens, ability cards, level cards, and equipped weapons in a separate crew plastic bag. Place these in storage box 1 (see pg. 2).\n\n3. Place all of your resources, coins, adventure cards, ability cards in hand, unused command tokens you own, and quest cards in the campaign box. When placing adventure cards in this box, put them in their own bag, and put any that have command tokens on them face down in the pile. Place the current event deck in this box as well.\n\n4. Place the ability card draw deck in a plastic bag. Place the ability card discard pile face up in the bag (keep the draw deck face down). Indicate the bottom of this pile by placing a synergy token at the bottom.\n\n5. Clean up the rest of the game components.",0.672417
66d33090fbf01637,1,281ea0c5f735d5e4693d18c020120a73,"Can you change the number of players mid campaign? If so, how?","Adding & Removing Players (page 33)\n\nYou can add or remove players at the end of any turn.\n\nFollow these steps to add a player:\n\n1. After the current turn ends, split up crew boards and assign them to each player as evenly as possible (except Captain Odessa, who is always controlled by the active player).\n\nFollow these steps to remove a player:\n\n1. After the current turn ends, reassign crew members to players as evenly as possible (except Captain Odessa, who is always controlled by the active player).\n\n2. The player that is leaving must discard 1 ability card and all of their command but 1. The active player decides how to distribute the remaining cards and command token to the other players.\n\nNOTE: Remember to turn the ship board to the correct side depending on the number of players in the game.",0.671891
9a7ce8a10f0d702c,0,8811debf97a654626f96d3a75909da9c,"I see that the ""Flapjacks"" card has an ability to remove 3 x health tokens and 3 x fatigue tokens from a crew member. My question is, can I split that to heal one character while removing fatigue from another? Or does the card collectively apply to only one crew member?","Fatigue (page 8)\n\nWhen crew members participate in challenges, they gain fatigue, represented by a fatigue token. Each crew member can hold up to 2 fatigue tokens.\n\nEach fatigue token is double-sided. If a crew member has only 1 token, place the blank side face up. If a crew member has a 2nd fatigue token, it should have the ""-1 damage"" side face up. This causes the crew member to deal -1 damage in combat.\n\nA crew member with 2 fatigue tokens cannot participate in challenges (pg. 19), but can continue to participate in combat (pg. 21).\n\nYou can remove fatigue mainly by cooking recipes or performing a port action.",0.742488


In [18]:
eval_model = OpenAIModel(
    model="gpt-4-turbo-preview",
    temperature=0.0,
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)

run_evals |          | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s

run_evals |          | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s

### Review evaluations

In [19]:
print(px_session.url)

http://localhost:6006/
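
Besides browsing the evaluations in the Phoenix UI, the per-document relevance labels can be rolled up into a couple of retrieval numbers. The sketch below uses made-up rows and assumes the label strings are "relevant" and "unrelated"; check the labels in your own `relevance_eval_df` before relying on it. The `retrieval_summary` helper is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (query span id, document position, relevance label).
# The label strings are an assumption, not taken from the notebook output.
rows = [
    ("q1", 0, "unrelated"),
    ("q1", 1, "relevant"),
    ("q2", 0, "relevant"),
    ("q2", 1, "relevant"),
    ("q3", 0, "unrelated"),
    ("q3", 1, "unrelated"),
]


def retrieval_summary(rows, positive_label="relevant"):
    """Compute the mean per-query precision and the share of queries with at least one hit."""
    per_query = defaultdict(list)
    for query_id, _position, label in rows:
        per_query[query_id].append(label == positive_label)
    precisions = [sum(flags) / len(flags) for flags in per_query.values()]
    hit_rate = sum(any(flags) for flags in per_query.values()) / len(per_query)
    return {
        "mean_precision": sum(precisions) / len(precisions),
        "hit_rate": hit_rate,
    }


summary = retrieval_summary(rows)
print(summary)
```

A low hit rate points at chunking or retrieval settings, while a good hit rate combined with wrong answers points at the generation step.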


## Ask an LLM to evaluate answers by comparing to human answers

In [20]:
print(HUMAN_VS_AI_PROMPT_TEMPLATE)


You are comparing a human ground truth answer from an expert to an answer from an AI model.
Your goal is to determine if the AI answer correctly matches, in substance, the human answer.
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Human Ground Truth Answer]: {correct_answer}
    ************
    [AI Answer]: {ai_generated_answer}
    ************
    [END DATA]
Compare the AI answer to the human ground truth answer, if the AI correctly answers the question,
then the AI answer is "correct". If the AI answer is longer but contains the main idea of the
Human answer please answer "correct". If the AI answer divergences or does not contain the main
idea of the human answer, please answer "incorrect".
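
The placeholders in the template above (`{question}`, `{correct_answer}`, `{ai_generated_answer}`) have the same names as the columns of `results_df`, which is how each row ends up in its own prompt. The sketch below imitates that substitution with plain string formatting on a shortened stand-in template; it illustrates the idea and is not Phoenix's actual implementation.

```python
# Shortened stand-in for the real template, using the same placeholder names.
template = (
    "[Question]: {question}\n"
    "[Human Ground Truth Answer]: {correct_answer}\n"
    "[AI Answer]: {ai_generated_answer}\n"
)

# One row of data, keyed by the same names as the DataFrame columns.
row = {
    "question": "Do ongoing cards leave the play area when the event card deck runs out?",
    "correct_answer": "It depends on the card.",
    "ai_generated_answer": "Ongoing cards do not leave the play area.",
}

prompt = template.format(**row)
print(prompt)
```

If a column is renamed, the placeholder no longer has a matching value, so keep the DataFrame column names and the template variables in sync.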



In [21]:
# Rails constrain the evaluator's output to the specific values defined by the template.
# They strip stray text such as ",,," or "..." from the response
# and ensure that the binary label the template expects is returned.
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())

eval_df = llm_classify(
    dataframe=results_df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    verbose=False,
    provide_explanation=True,
    concurrency=50,
)

llm_classify |          | 0/20 (0.0%) | ⏳ 00:00<? | ?it/s
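
To illustrate what the rails do, the sketch below maps a raw model response onto one of the allowed labels. The `snap_to_rails` function and the `"NOT_PARSABLE"` fallback value are hypothetical names used for this example; Phoenix's own parsing may differ in its details.

```python
import re


def snap_to_rails(raw_output, rails, fallback="NOT_PARSABLE"):
    """Return the single rail that appears in raw_output, or fallback
    when no rail, or more than one distinct rail, is found."""
    # Longest rails first, and word boundaries, so that "correct"
    # is not matched inside "incorrect".
    ordered = sorted(rails, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(rail) for rail in ordered) + r")\b"
    found = set(re.findall(pattern, raw_output.lower()))
    return found.pop() if len(found) == 1 else fallback


rails = ["correct", "incorrect"]
print(snap_to_rails("Correct...", rails))
print(snap_to_rails(",,,incorrect", rails))
print(snap_to_rails("I am not sure", rails))
```

Responses that cannot be mapped to exactly one label are worth counting separately, since they are neither correct nor incorrect answers.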

In [22]:
eval_df.head(3)

Unnamed: 0,label,explanation,exceptions,execution_status,execution_seconds
0,correct,"The AI answer provides a detailed process for changing the number of players mid-campaign, including steps for adding or removing players, reassigning crew members, and adjusting game components accordingly. This suggests that changing the number of players is indeed possible, aligning with the human ground truth answer's implication that player count can be adjusted by divvying up the characters. Although the AI's response is more detailed and introduces specific steps not mentioned in the human answer, it fundamentally agrees with the human answer's core assertion that the number of players can be changed mid-campaign. Therefore, the AI answer is relevant as it expands on the human answer by providing a how-to guide that matches the substance of the human answer, which is about the flexibility in player count adjustment.",[],COMPLETED,8.372679
1,correct,"To determine if the AI answer is 'relevant' or 'irrelevant', we compare the substance of the AI's answer to the human ground truth answer. The key points to compare include: 1) Whether the AI understands that the ""Flapjacks"" card's ability can be applied in a flexible manner, not limited to a single crew member. 2) If the AI recognizes that the healing and fatigue removal can be distributed among different crew members as needed. The human answer indicates that the ability allows for any combination of removing fatigue and damage tokens from any characters. The AI's answer aligns with this by stating that the card's effect can be split to heal one character while removing fatigue from another, and it is not limited to only one crew member. This shows that the AI's response captures the main idea of the human answer, which is the flexibility in applying the card's effects across multiple crew members in any combination desired.",[],COMPLETED,8.702796
2,incorrect,"To determine if the AI answer matches the human ground truth answer, we need to compare the substance of both answers. The human answer indicates that the outcome depends on the specific card, with some cards requiring removal when the event card deck runs out and others not. This suggests variability based on the card's instructions. The AI's answer, however, provides a definitive statement that ongoing cards do not leave the play area when the event card deck runs out, which contradicts the nuanced explanation provided by the human expert. The AI's answer fails to capture the conditional nature of the human answer, which is that the action to be taken depends on the instructions of the individual cards. Therefore, the AI's answer does not correctly match the substance of the human answer.",[],COMPLETED,5.730891


In [24]:
# Let's view the data
merged_df = pd.merge(results_df, eval_df, left_index=True, right_index=True)
summary_df = merged_df[["question", "correct_answer", "ai_generated_answer", "label", "explanation"]]
summary_df.head(3)

Unnamed: 0,question,correct_answer,ai_generated_answer,label,explanation
0,"Can you change the number of players mid campaign? If so, how?","You play the same 9 characters no matter the player count, so it's pretty easy to divvy them up however you like.","You can change the number of players mid-campaign by following these steps: After the current turn ends, you can add or remove players. To add a player, split up crew boards and assign them as evenly as possible. To remove a player, reassign crew members evenly, and the leaving player must discard 1 ability card and all but 1 command token. The active player decides how to distribute the remaining cards and command token. Remember to adjust the ship board based on the number of players in the game.",correct,"The AI answer provides a detailed process for changing the number of players mid-campaign, including steps for adding or removing players, reassigning crew members, and adjusting game components accordingly. This suggests that changing the number of players is indeed possible, aligning with the human ground truth answer's implication that player count can be adjusted by divvying up the characters. Although the AI's response is more detailed and introduces specific steps not mentioned in the human answer, it fundamentally agrees with the human answer's core assertion that the number of players can be changed mid-campaign. Therefore, the AI answer is relevant as it expands on the human answer by providing a how-to guide that matches the substance of the human answer, which is about the flexibility in player count adjustment."
1,"I see that the ""Flapjacks"" card has an ability to remove 3 x health tokens and 3 x fatigue tokens from a crew member. My question is, can I split that to heal one character while removing fatigue from another? Or does the card collectively apply to only one crew member?","Fortunately for you, it's neither of those! You can remove any 3 fatigue and damage tokens you'd like, from any characters and in any combination.","The ""Flapjacks"" card's ability to remove 3 x health tokens and 3 x fatigue tokens can be split to heal one character while removing fatigue from another. The card's effect is not limited to only one crew member and can be divided among different crew members as needed.",correct,"To determine if the AI answer is 'relevant' or 'irrelevant', we compare the substance of the AI's answer to the human ground truth answer. The key points to compare include: 1) Whether the AI understands that the ""Flapjacks"" card's ability can be applied in a flexible manner, not limited to a single crew member. 2) If the AI recognizes that the healing and fatigue removal can be distributed among different crew members as needed. The human answer indicates that the ability allows for any combination of removing fatigue and damage tokens from any characters. The AI's answer aligns with this by stating that the card's effect can be split to heal one character while removing fatigue from another, and it is not limited to only one crew member. This shows that the AI's response captures the main idea of the human answer, which is the flexibility in applying the card's effects across multiple crew members in any combination desired."
2,Do ongoing cards leave the play area when the event card deck runs out?,It depends on the card. Some of them will direct you to remove it when the deck runs out. Some won't. The cards will indicate when. So.. sorta yes.,Ongoing cards do not leave the play area when the event card deck runs out.,incorrect,"To determine if the AI answer matches the human ground truth answer, we need to compare the substance of both answers. The human answer indicates that the outcome depends on the specific card, with some cards requiring removal when the event card deck runs out and others not. This suggests variability based on the card's instructions. The AI's answer, however, provides a definitive statement that ongoing cards do not leave the play area when the event card deck runs out, which contradicts the nuanced explanation provided by the human expert. The AI's answer fails to capture the conditional nature of the human answer, which is that the action to be taken depends on the instructions of the individual cards. Therefore, the AI's answer does not correctly match the substance of the human answer."


In [27]:
# print the total number of evaluations and the number of 'correct' labels
print(len(eval_df))
print(sum(eval_df['label'] == 'correct')) 

20
9
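
With only 20 sampled questions, the fraction of correct answers is a noisy estimate, which matters when comparing two sets of hyperparameters. The sketch below turns the two counts above into an accuracy with a Wilson score interval; the `accuracy_with_interval` helper is hypothetical, and the 95% level (`z=1.96`) is an assumption chosen for illustration.

```python
import math


def accuracy_with_interval(n_correct, n_total, z=1.96):
    """Return (accuracy, lower, upper) using the Wilson score interval."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return p, center - half_width, center + half_width


acc, low, high = accuracy_with_interval(9, 20)
print(f"accuracy={acc:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

If the intervals of two configurations overlap heavily, a larger `sample_size` is probably needed before declaring one of them better.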
