## Automated Evaluation Pipeline

### What Needs to Be Automated?
- Generate questions
- Check retrieval performance
- Detect model hallucinations

### Pipeline

1. **Generate Questions:**  
   - Provide 100 case articles to ChatGPT and have it generate one question per article.
   - Get (case article, question)

2. **Generate Answers:**  
   - Use our RAG model to answer the generated questions.
   - Get (case article, question, answer, retrieved articles)

3. **Evaluate Retrieval Performance:**  
   - Check if each case article is in the retrieved articles using:
     - Accuracy@3
     - Accuracy@2
     - Accuracy@1
   - Get (case article, question, answer, retrieved articles, Retrieval status)

4. **Detect Hallucination:**  
   - Provide the answer and retrieved articles to ChatGPT.  
   - Let ChatGPT determine whether hallucinations occurred and provide a reason.
   - Get (case article, question, answer, retrieved articles, Retrieval status, hallucinated, reason)
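
The four stages above can be read as a chain in which each stage adds fields to a per-article record. The sketch below shows that flow with the model calls passed in as plain callables, so it runs without any API access. The names `run_pipeline`, `ask_llm_for_question`, `rag_answer`, and `judge` are hypothetical and used only for illustration; they are not this project's actual functions.

```python
from typing import Callable, Dict, List, Tuple


def run_pipeline(
    articles: Dict[int, str],
    ask_llm_for_question: Callable[[str], str],
    rag_answer: Callable[[str], Tuple[str, List[int]]],
    judge: Callable[[str, List[str]], Tuple[bool, str]],
    k: int = 3,
) -> List[dict]:
    """Run the four stages for every article and return one record per article."""
    records = []
    for content_id, text in articles.items():
        # 1. Generate a question from the case article.
        question = ask_llm_for_question(text)
        # 2. Answer it with the RAG system, keeping the retrieved ids.
        answer, retrieved_ids = rag_answer(question)
        # 3. Retrieval status: is the case article among the top-k results?
        retrieved_ok = content_id in retrieved_ids[:k]
        # 4. Ask the judge whether the answer is supported by the retrieved texts.
        retrieved_texts = [articles[i] for i in retrieved_ids[:k] if i in articles]
        hallucinated, reason = judge(answer, retrieved_texts)
        records.append(
            {
                "content_id": content_id,
                "question": question,
                "answer": answer,
                "retrieved_ids": retrieved_ids[:k],
                "retrieved_ok": retrieved_ok,
                "hallucinated": hallucinated,
                "reason": reason,
            }
        )
    return records
```

Passing the model calls as arguments keeps the flow testable with stub functions before any real requests are made.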



In [2]:
import sys
import os
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import torch
from components.rag import query_answering_system
from openai import OpenAI




In [None]:
# Generate questions with ChatGPT
# The key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_question(article):
    prompt = (
        "Generate a simple, easy-to-understand and not too long question based on the following news content. "
        "Also, the question must be specific and clear. For example, instead of using 'servants,' it should specify which region or country’s servants are being referred to. "
        "Additionally, the question should not be too difficult, and the answer must be explicitly contained within the following News content.\n\n"
        f"News content:\n{article}\n\n"
        "Question:"
    )

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": prompt
            }
        ]
    )

    return completion.choices[0].message.content.strip()


df_1K = pd.read_csv('../data/1K_news.csv', encoding='utf-8')

# Select the first 100 rows (copy so that adding a column does not modify a view of df_1K)
df = df_1K.head(100).copy()

# Apply the function to generate questions
df["generated_question"] = df["text"].apply(generate_question)

# Save the results to a CSV file (the answer-generation cell reads this file)
df.to_csv("./100_news_questions.csv", index=False, encoding='utf-8')

print("Question generation completed. Results saved to 100_news_questions.csv")


### **Case Article**

Bitcoin dropped to its lowest level in 3.5 months on Friday, dragged by uncertainty about U.S. President Donald Trump's tariff plans, crypto policy, and flagging investor confidence after a $1.5 billion hack in rival cryptocurrency Ether.

Bitcoin, the world's largest cryptocurrency by market value, was last down more than 5% on the day at **$79,666**, trading below $80,000 for the first time since November 11.

### **GPT-generated Question**
What was the price of Bitcoin when it dropped to its lowest level in 3.5 months on Friday?


In [None]:
# Generate answers with the RAG system
df = pd.read_csv('./100_news_questions.csv', encoding='utf-8')
document_dataset = "../data/1K_news.csv"
for i in range(len(df)):
    query = df.loc[i, 'generated_question']
    output = query_answering_system(query, document_dataset)
    df.loc[i, 'generated_response'] = output['answer']
    # Store the retrieved ids as a comma-separated string, e.g. "12, 7, 301"
    df.loc[i, 'retrieved_docs_id'] = ", ".join(map(str, output['retrieved_docs_id']))

    print(f"Processed question {i+1}/{len(df)}: response generated.")

df.to_csv("./100_news_QA.csv", index=False, encoding='utf-8')

### GPT-generated Question  

What was the price of Bitcoin when it dropped to its lowest level in 3.5 months on Friday?

### RAG-generated Answer  

Bitcoin, the world's largest cryptocurrency by market value, was last down more than 5% on the day at **$79,666**.  


In [3]:
# Evaluate retrieval performance using Accuracy metrics

df = pd.read_csv('./100_news_QA.csv', encoding='utf-8')
TP_3 = 0
TP_2 = 0
TP_1 = 0
for i in range(len(df)):

    content_id = str(df['content_id'][i])
    retrieved_docs_id = df['retrieved_docs_id'][i].split(', ')

    # Slicing avoids an IndexError when fewer than three documents were retrieved
    if content_id in retrieved_docs_id[:1]:
        TP_1 += 1
    if content_id in retrieved_docs_id[:2]:
        TP_2 += 1
    if content_id in retrieved_docs_id[:3]:
        TP_3 += 1

Accuracy_top3 = TP_3 / len(df)
Accuracy_top2 = TP_2 / len(df)
Accuracy_top1 = TP_1 / len(df)

print("Retrieval Performance Metrics:")
print("--------------------------------")
print(f"Top-1 Accuracy: {Accuracy_top1:.4f} - The proportion of cases where the case article is ranked first.")
print(f"Top-2 Accuracy: {Accuracy_top2:.4f} - The proportion of cases where the case article is in the top 2 results.")
print(f"Top-3 Accuracy: {Accuracy_top3:.4f} - The proportion of cases where the case article is in the top 3 results.")



Retrieval Performance Metrics:
--------------------------------
Top-1 Accuracy: 0.8500 - The proportion of cases where the case article is ranked first.
Top-2 Accuracy: 0.9400 - The proportion of cases where the case article is in the top 2 results.
Top-3 Accuracy: 0.9500 - The proportion of cases where the case article is in the top 3 results.
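
The three counters above all measure the same thing at different cutoffs, so they can be folded into one function. Below is a minimal sketch of Accuracy@k (sometimes called hit rate@k) on toy data; `accuracy_at_k` and the sample ids are illustrative assumptions, not part of the notebook or its results.

```python
from typing import List, Sequence


def accuracy_at_k(case_ids: Sequence[str], retrieved: Sequence[List[str]], k: int) -> float:
    """Fraction of questions whose case article appears in the top-k retrieved ids."""
    if len(case_ids) != len(retrieved):
        raise ValueError("case_ids and retrieved must have the same length")
    if not case_ids:
        return 0.0
    hits = sum(1 for cid, docs in zip(case_ids, retrieved) if cid in docs[:k])
    return hits / len(case_ids)


# Toy data, not the notebook's results.
case_ids = ["10", "11", "12", "13"]
retrieved = [["10", "3", "4"], ["5", "11", "6"], ["7", "8", "12"], ["1", "2", "3"]]
for k in (1, 2, 3):
    print(f"Accuracy@{k}: {accuracy_at_k(case_ids, retrieved, k):.2f}")
# Accuracy@1: 0.25
# Accuracy@2: 0.50
# Accuracy@3: 0.75
```

Because each question has exactly one relevant article, Accuracy@k can only stay equal or increase as k grows, which matches the 0.85, 0.94, 0.95 progression reported above.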


In [None]:
# Use ChatGPT to evaluate whether the RAG answer contains hallucinations
# The key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def evaluate_rag_hallucinated(retrieved_articles, answer):
    prompt = (
        "Evaluate whether the Knowledge-based AI assistant's statement is factually accurate based only on the given documents.\n"
        "Your response should be in two rows:\n"
        "1. First row: True or False (True if the answer is fully supported by the documents, False if it contains hallucinated or unverifiable information).\n"
        "2. Second row: A short explanation for your decision.\n\n"
        "Documents:\n"
        f"{retrieved_articles}\n\n"
        "Answer from the AI assistant:\n"
        f"{answer}\n\n"
    )

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": prompt
            }
        ]
    )

    return completion.choices[0].message.content.strip()

df = pd.read_csv('./100_news_QA.csv', encoding='utf-8')
df_1K = pd.read_csv('../data/1K_news.csv', encoding='utf-8')

df["hallucinated"] = None
df["reason"] = None

for i in range(len(df)):

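    doc_ids = list(map(int, df.loc[i, "retrieved_docs_id"].split(", ")))
    retrieved_articles = df_1K[df_1K['content_id'].isin(doc_ids)]['text']
    retrieved_articles = "\n\n".join(f"document{j+1}: {text}" for j, text in enumerate(retrieved_articles.iloc[:3]))

    # The reply should have two rows, but the model may add blank or extra lines:
    # take the first non-empty line as the verdict and join the rest as the reason.
    reply = evaluate_rag_hallucinated(retrieved_articles, df.loc[i, "generated_response"])
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    verdict = lines[0] if lines else ""
    reason = " ".join(lines[1:])
    # The prompt answers True when the answer is supported, so "hallucinated" is its negation.
    df.loc[i, 'hallucinated'] = not verdict.lower().startswith("true")
    df.loc[i, 'reason'] = reason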

df.to_csv("./100_news_QA_hallucination.csv", index=False, encoding='utf-8')

**Hallucination rate:** **40%**
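
A rate like the one above is the share of answers that the judge marked as not fully supported. The sketch below computes it from the judge's raw first-row verdicts, assuming (as in the prompt) that `True` means "supported" and `False` means "hallucinated or unverifiable". The helpers `is_supported` and `hallucination_rate` and the sample verdicts are hypothetical, not the notebook's code or data.

```python
from typing import Iterable, Optional


def is_supported(verdict: str) -> Optional[bool]:
    """Map a judge verdict line to True/False, or None if it cannot be parsed."""
    text = verdict.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def hallucination_rate(verdicts: Iterable[str]) -> float:
    """Share of parseable verdicts that are 'False' (answer not fully supported)."""
    parsed = [is_supported(v) for v in verdicts]
    valid = [p for p in parsed if p is not None]
    if not valid:
        raise ValueError("no parseable verdicts")
    return sum(1 for p in valid if not p) / len(valid)


# Toy verdicts, not the notebook's data.
verdicts = ["True", "False", " true ", "False.", "unclear"]
print(f"{hallucination_rate(verdicts):.0%}")  # 50%
```

Unparseable verdicts are excluded from the denominator here; counting them as hallucinations instead would be a stricter, equally defensible choice.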

### Answer
Apollo Global Management is discussing **a $305 million funding** round to help Meta Platforms develop data centers in the U.S.

### Why Hallucinated
Incorrectly states the financing amount as *$305 million*, while the documents indicate that Apollo Global Management is in talks for a financing package of roughly **$35 billion** for Meta Platforms.
