# Task 1: Hallucination Detection

#### Welcome to task 1

In this task you will build an LLM judge that decides whether a provided answer is a hallucination, given its context.

### Environment Set Up 

In [7]:
# Install required packages
!pip install -qq -r ../requirements.txt

REL_PATH_TO_ROOT = "../"

import sys
import os
import json
from tqdm import tqdm
import pandas as pd

sys.path.insert(0,REL_PATH_TO_ROOT)

from src.utils import get_root_dir, test_root_dir
from local_variables import ROOT_DIR

test_root_dir(REL_PATH_TO_ROOT)

from prompt_manager.manager import PromptManager
from prompt_manager.fetcher import fetch_prompt
from src.api import generate_outputs_openai

Root directory set up correctly!


### Load Hallucination Dataset

The dataset contains 50 question-answer pairs.

Each question-answer pair comes with a ground-truth hallucination label. A label of 1 (shown as `True` in the `Ground_truth` column) marks a correct answer, and a label of 0 (`False`) marks a hallucinated answer. There are 25 pairs with label 1 and 25 with label 0.

In [2]:
input_path = os.path.join(REL_PATH_TO_ROOT, "data/hallucination_final.csv")
hallucination_df = pd.read_csv(input_path).drop("Unnamed: 0", axis=1)

In [3]:
hallucination_df.shape

(50, 3)

In [4]:
hallucination_df.head()

| | Context | Response | Ground_truth |
|---|---|---|---|
| 0 | He transferred to Wolverhampton Wanderers for... | 1948 and 1964 | True |
| 1 | No. 11 Squadron is a Royal Australian Air Forc... | RAAF Base Edinburgh | True |
| 2 | Kim Roi-ha (born November 15, 1965) is a South... | Memories of Murder | True |
| 3 | The film stars Jeremy Blackman, Tom Cruise, M... | Jason Robards | True |
| 4 | Greatest Hits: Back to the Start is the second... | Megadeth | True |
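The label balance described above is worth confirming before trusting any accuracy figure. The following is a minimal sketch; `label_balance` is a hypothetical helper (not part of the workshop code) and the toy labels are made up for illustration.

```python
from collections import Counter


def label_balance(labels):
    """Return the share of each label value as a fraction of all labels."""
    # Compare labels as strings so booleans and 'True'/'False' text count alike
    counts = Counter(str(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only
toy_labels = [True, False, True, False]
print(label_balance(toy_labels))
```

In the notebook, the same helper could be called as `label_balance(hallucination_df["Ground_truth"])`; a balanced dataset should give roughly 0.5 for each label.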


### Task: Build LLM-as-a-judge

Craft a prompt that correctly classifies each response as either a correct answer or a hallucinated one.

The inputs to your LLM Judge should be the context and the response.

The output from your LLM Judge should be a boolean categorisation: 'True' for a correct answer and 'False' for a hallucination.
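Because the judge returns free text, a reply such as `'True.'` or `true` with a trailing newline would not match the ground truth in an exact string comparison. The sketch below shows one way to normalise the raw reply; `parse_judge_output` is a hypothetical helper, and it assumes the judge is asked to answer with a single word.

```python
from typing import Optional


def parse_judge_output(raw) -> Optional[bool]:
    """Map a raw judge reply to True/False, or None if it is not a clean verdict."""
    # Strip whitespace, then surrounding quotes and punctuation, then lower-case
    cleaned = str(raw).strip().strip("'\"`.!").strip().lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    # Anything else (explanations, hedging, empty replies) is treated as unusable
    return None


print(parse_judge_output(" True\n"))
print(parse_judge_output("'False'."))
print(parse_judge_output("I think it is correct"))
```

Returning `None` rather than guessing keeps malformed replies visible, so they can be counted separately instead of silently scoring as disagreements.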

In [5]:
# Get prompt
SEQUENCE = ["task_1","hallucination_detector"]

prompt_template = fetch_prompt(SEQUENCE,use_latest_version=True)

print(f"Current LLM Judge Prompt:\n------------------------\n{prompt_template}\n------------------------")

Current LLM Judge Prompt:
------------------------
Here's the context:
{CONTEXT}

Here's what the AI said:

{RESPONSE}

Was the response a hallucination?

Say 'True' if the response is correct and 'False' if it's a hallucination and nothing else!
------------------------
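The current prompt asks "Was the response a hallucination?" and then defines 'True' as a correct response, which can be read two ways. As a starting point for prompt engineering, here is a sketch of a less ambiguous template. The wording is only an illustration, not a reference answer; `CANDIDATE_TEMPLATE` and `fill_template` are hypothetical names, and plain `str.format` stands in for the notebook's `PromptManager`.

```python
# Hypothetical alternative template that reuses the {CONTEXT} and {RESPONSE} placeholders
CANDIDATE_TEMPLATE = """You are checking an answer against a context passage.

Context:
{CONTEXT}

Answer to check:
{RESPONSE}

Reply with exactly one word:
True  - the answer is supported by the context
False - the answer is not supported by the context (a hallucination)"""


def fill_template(template: str, context: str, response: str) -> str:
    """Substitute the context and response into the template placeholders."""
    return template.format(CONTEXT=context, RESPONSE=response)


# Made-up inputs for illustration only
print(fill_template(CANDIDATE_TEMPLATE, "Paris is the capital of France.", "Paris"))
```

The design choice here is to state the meaning of each output word once and to drop the yes/no question, so the label polarity matches the `Ground_truth` column.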


In [9]:
# Set the number of rows to process
num_rows = 3  # Set to the desired number of rows or None to run all

# Determine the total number of rows in the DataFrame
total_rows = len(hallucination_df)

# Check if num_rows is None, indicating that we want to process all rows
if num_rows is None:
    rows_to_process = total_rows
else:
    # Otherwise, set rows_to_process to the smaller of num_rows and total_rows
    rows_to_process = min(num_rows, total_rows)

In [11]:
# Keep track of model responses
evaluator_responses = []

# Loop through dataset with a row limit if specified
for _, row in tqdm(hallucination_df.head(rows_to_process).iterrows()):
    
    # Get inputs and place into dictionary format
    context = row["Context"]
    response = row["Response"]
    row_inputs = {"CONTEXT": context, "RESPONSE": response}

    # Initialise prompt to validate and format inputs
    prompt = PromptManager(template=prompt_template, inputs=row_inputs)
    prompt.validate_inputs()
    prompt.format_inputs()

    # Send prompt and collect the judge's verdict (named separately from the dataset `response`)
    judge_verdict = generate_outputs_openai(prompt.prompt)
    evaluator_responses.append(judge_verdict)

# Create a new DataFrame with only processed rows and add the evaluator responses
processed_df = hallucination_df.head(rows_to_process).copy()
processed_df["evaluator_response"] = evaluator_responses

# Display the resulting processed DataFrame
display(processed_df.head(5))

3it [00:01,  1.56it/s]


| | Context | Response | Ground_truth | evaluator_response |
|---|---|---|---|---|
| 0 | He transferred to Wolverhampton Wanderers for... | 1948 and 1964 | True | True |
| 1 | No. 11 Squadron is a Royal Australian Air Forc... | RAAF Base Edinburgh | True | True |
| 2 | Kim Roi-ha (born November 15, 1965) is a South... | Memories of Murder | True | True |


### Evaluation

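Now calculate the accuracy of your LLM-as-a-judge. How does it look? Would you consider doing more prompt engineering? Note that the score covers only the rows processed above (three when `num_rows = 3`); set `num_rows = None` to score all 50 pairs.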

In [14]:
# Initialize a list to store whether each prediction matches
agreement_counts = []

# Loop through each row in the DataFrame to compare values
for _, row in processed_df.iterrows():
    # Convert both values to strings for comparison to handle potential data type mismatches
    hallucination_value = str(row['Ground_truth'])
    evaluator_response_value = str(row['evaluator_response'])
    
    # Check if the values match exactly
    if hallucination_value == evaluator_response_value:
        agreement_counts.append(1)  # Add 1 to indicate agreement
    else:
        agreement_counts.append(0)  # Add 0 to indicate disagreement

# Calculate the percentage of agreements
total_comparisons = len(agreement_counts)
number_of_agreements = sum(agreement_counts)
percentage_agreement = number_of_agreements / total_comparisons

# Display the result as a percentage
percentage_agreement_rounded = round(100 * percentage_agreement, 1)
print(f"\nYour LLM Judge achieved {percentage_agreement_rounded}% agreement!")



Your LLM Judge achieved 100.0% agreement!
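A single agreement percentage does not show which kind of mistake the judge makes. The sketch below splits the comparison into four buckets; `confusion_counts` and the bucket names are hypothetical, the toy inputs are made up, and any prediction that is not exactly "true" (case-insensitive) is assumed to count as 'False'.

```python
from collections import Counter


def confusion_counts(ground_truth, predictions):
    """Count judge outcomes, split by whether the answer was truly correct."""
    counts = Counter()
    for truth, pred in zip(ground_truth, predictions):
        # Compare as lower-cased strings to tolerate bool vs. text values
        truth_is_correct = str(truth).strip().lower() == "true"
        pred_is_correct = str(pred).strip().lower() == "true"
        if truth_is_correct and pred_is_correct:
            counts["correct_accepted"] += 1
        elif truth_is_correct:
            counts["correct_rejected"] += 1
        elif pred_is_correct:
            counts["hallucination_missed"] += 1
        else:
            counts["hallucination_caught"] += 1
    return counts


# Made-up labels and predictions for illustration only
toy_truth = [True, True, False, False]
toy_preds = ["True", "False", "True", "False"]
print(dict(confusion_counts(toy_truth, toy_preds)))
```

In the notebook this could be called as `confusion_counts(processed_df["Ground_truth"], processed_df["evaluator_response"])`. A high `hallucination_missed` count would suggest the prompt is too lenient, while a high `correct_rejected` count would suggest it is too strict.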


In [12]:
# Get accuracy (compact version of the loop above, using the processed rows and the ground-truth column)
agreement_counts = [1 if str(row['Ground_truth']) == str(row['evaluator_response']) else 0 for _, row in processed_df.iterrows()]
percentage_agreement = sum(agreement_counts)/len(agreement_counts)
print(f"\nYour LLM Judge achieved {round(100 * percentage_agreement, 1)}% agreement!")

## End of Task 1