# Task 1: Hallucination Detection

#### Welcome to task 1

In this task you will build an LLM Judge that decides whether a provided answer is a hallucination, given its context.

### Environment Set Up 

In [None]:
# Install required packages
#!pip install -qq -r ../requirements.txt

REL_PATH_TO_ROOT = "../"

import sys
import os
import json
import pandas as pd
import tqdm  # progress bar used when looping over the dataset

sys.path.insert(0,REL_PATH_TO_ROOT)

from src.utils import get_root_dir, test_root_dir, preprocess_hallucination
from local_variables import ROOT_DIR

test_root_dir(REL_PATH_TO_ROOT)

from prompt_manager.manager import PromptManager
from prompt_manager.fetcher import fetch_prompt
from src.api import generate_outputs_openai  # used below to send prompts to the model

### Load Hallucination Dataset

The dataset contains 50 question-answer pairs.

Each question-answer pair has a ground truth hallucination label. A label of 1 means the answer is correct, and a label of 0 means the answer is hallucinated. There are 25 pairs with label 1 and 25 with label 0.

In [None]:
input_path = os.path.join(REL_PATH_TO_ROOT, "data/hallucination_final.csv")
hallucination_df = pd.read_csv(input_path).drop("Unnamed: 0", axis=1)

In [None]:
hallucination_df.shape

In [None]:
hallucination_df.head()
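
Before writing the judge prompt, it can help to confirm the 25/25 label split described above. The following is a minimal sketch on a toy DataFrame rather than the real file. It assumes the label column is named `hallucination` (as in the evaluation cell) and uses the `Context` and `Response` column names from the loop below. `toy_df` is a hypothetical stand-in for `hallucination_df`.

```python
import pandas as pd

# Toy stand-in for hallucination_df (assumption: same column names as the real data)
toy_df = pd.DataFrame({
    "Context": ["c1", "c2", "c3", "c4"],
    "Response": ["r1", "r2", "r3", "r4"],
    "hallucination": [1, 0, 1, 0],  # 1 = correct answer, 0 = hallucinated answer
})

# Count how many rows carry each label
label_counts = toy_df["hallucination"].value_counts().sort_index()
print(label_counts)

# The dataset is balanced when every label appears the same number of times
is_balanced = label_counts.nunique() == 1
print(f"Balanced: {is_balanced}")
```

Running the same two lines on `hallucination_df` should report 25 rows per label if the file matches the description.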

### Task: Build LLM-as-a-judge

Craft a prompt that correctly categorises whether each response is a correct answer or a hallucinated one.

The inputs to your LLM Judge should be the context and the response.

The output from your LLM Judge should be a boolean TRUE/FALSE categorisation.
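
Because the ground truth labels are 1 and 0 while the judge returns TRUE or FALSE, the raw reply may need to be normalised before it is compared with the labels. The sketch below shows one possible way to do this. The helper `parse_judge_output` is hypothetical and not part of this repository. It also assumes that TRUE means "the answer is correct" (label 1), so flip the mapping if your prompt defines TRUE as "hallucinated".

```python
import re
from typing import Optional


def parse_judge_output(raw_output: str) -> Optional[int]:
    """Map a judge reply to 1 (correct), 0 (hallucinated), or None (unparseable).

    Assumption: TRUE means the answer is correct, FALSE means it is hallucinated.
    """
    # Find every standalone "true" or "false" in the reply, ignoring case
    verdicts = re.findall(r"\b(true|false)\b", raw_output.lower())

    # No verdict, or contradictory verdicts, cannot be mapped to a label
    if len(set(verdicts)) != 1:
        return None

    return 1 if verdicts[0] == "true" else 0


print(parse_judge_output("TRUE"))
print(parse_judge_output("Verdict: false."))
print(parse_judge_output("I am not sure"))
```

Returning `None` for unparseable replies keeps them visible, so they can be counted separately instead of being silently scored as disagreements.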

In [None]:
# Get prompt
SEQUENCE = ["task_1","hallucination_detector"]

prompt_template = fetch_prompt(SEQUENCE,use_latest_version=True)

print(f"Current LLM Judge Prompt:\n------------------------\n{prompt_template}\n------------------------")

In [None]:
# Apply prompt to dataset
evaluator_responses = []

# Loop through dataset
for _, row in tqdm.tqdm(hallucination_df.iterrows(), total=len(hallucination_df)):

    # Get inputs and place into dictionary format
    context = row["Context"]

    response = row["Response"]

    row_inputs = {"CONTEXT" : context, "RESPONSE" : response}

    # Initialise prompt to validate and format inputs
    prompt = PromptManager(template=prompt_template,inputs=row_inputs)
    prompt.validate_inputs()
    prompt.format_inputs()

    # Send prompt and collect the judge's reply (named separately so it does not shadow `response`)
    evaluator_response = generate_outputs_openai(prompt.prompt)
    evaluator_responses.append(evaluator_response)

hallucination_df["evaluator_response"] = evaluator_responses
display(hallucination_df.head(5))

### Evaluation

Now calculate the accuracy of your LLM-as-a-judge. How does it look? Would you consider doing more prompt engineering?

In [None]:
# Get accuracy
agreement_counts = [1 if str(row['hallucination']) == str(row['evaluator_response']) else 0 for _, row in hallucination_df.iterrows()]
percentage_agreement = sum(agreement_counts)/len(agreement_counts)
print(f"\n Your LLM Judge achieved {round(100 * percentage_agreement, 1)}% agreement!")
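
A single agreement figure can hide whether the judge errs mostly on correct answers or on hallucinated ones. As a rough sketch, the cell below breaks agreement down per label using a toy DataFrame. It assumes the judge output has already been converted to the same 1/0 scheme as the `hallucination` column, and `toy_df` is a hypothetical stand-in for `hallucination_df`.

```python
import pandas as pd

# Toy stand-in for hallucination_df after the judge has been run
# (assumption: evaluator_response is already encoded as 1 = correct, 0 = hallucinated)
toy_df = pd.DataFrame({
    "hallucination": [1, 1, 0, 0, 1, 0],
    "evaluator_response": [1, 0, 0, 1, 1, 0],
})

# Rows are ground truth labels, columns are judge verdicts
confusion = pd.crosstab(
    toy_df["hallucination"],
    toy_df["evaluator_response"],
    rownames=["ground_truth"],
    colnames=["judge"],
)
print(confusion)

# Share of rows where the judge matches the label, computed separately per label
matches = toy_df["hallucination"] == toy_df["evaluator_response"]
per_class_agreement = matches.groupby(toy_df["hallucination"]).mean()
print(per_class_agreement)
```

If agreement is much lower for one label than the other, that points to which kind of mistake the next prompt iteration should target.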

## End of Task 1