# Measuring Hallucination Rate in LLMs Across Domains and Model Versions
## Dataset Creation Phase

### Overview
This notebook contains the code for creating our structured dataset by sampling questions from established public datasets across five knowledge domains. We will collect 80 questions from each domain to create a balanced dataset for LLM hallucination analysis.

### Target Dataset Structure
- **Total Questions:** 400 (80 per domain)
- **Domains:** General Knowledge, Science, History, Pop Culture, Healthcare
- **Models to Test:** GPT-3.5-turbo, GPT-4o, Claude 3.5 Sonnet
- **Expected Total Responses:** 1,200

### Data Sources
- **General Knowledge:** TriviaQA - Trivia-style Q&A with verified answers
- **Science:** SciQ - Middle-school science questions  
- **History:** Natural Questions - Questions grounded in Wikipedia
- **Pop Culture:** HotpotQA - Multi-hop Q&A with supporting passages
- **Healthcare:** MedMCQA - Multiple-choice medical exam questions

## 1. General Knowledge Questions (TriviaQA Dataset)

### Dataset Information
- **Source:** TriviaQA (unfiltered-web-dev.json)
- **Target Sample:** 80 questions
- **Domain:** General Knowledge

In [8]:
import json
import csv
import random

# -------------------------------
# Step 1: Load the TriviaQA dataset
# -------------------------------
# Source: https://nlp.cs.washington.edu/triviaqa/
with open('triviaqa-unfiltered/unfiltered-web-dev.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# -------------------------------
# Step 2: Randomly sample 80 questions
# -------------------------------
samples = random.sample(data['Data'], k=80)

# -------------------------------
# Step 3: Write to CSV with required format
# -------------------------------
with open('qna_dataset.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    # Write header
    writer.writerow(['question_id', 'question_text', 'domain', 'ground_truth'])

    for idx, item in enumerate(samples, start=1):
        question_text = item.get('Question', '').strip()
        answer = item.get('Answer', {}).get('Value', '').strip()
        domain = "General Knowledge"
        writer.writerow([idx, question_text, domain, answer])
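
The sampling step above calls `random.sample` on the global generator without a seed, so each run of the cell selects a different set of 80 questions. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the study. The helper name `sample_questions`, the seed value 42, and the toy records are illustrative and not part of the original notebook.

```python
import random

def sample_questions(records, k=80, seed=42):
    """Return k records chosen without replacement, reproducibly for a given seed."""
    # A local generator keeps the global random state untouched (seed is an assumption).
    rng = random.Random(seed)
    return rng.sample(records, k)

# Toy stand-in for data['Data'] from the TriviaQA file
toy_records = [{"Question": f"Q{i}"} for i in range(1000)]
first = sample_questions(toy_records)
second = sample_questions(toy_records)
print(first == second)  # True: the same seed yields the same sample
```

The same helper could be reused in the later sampling cells so that all five domains are drawn reproducibly.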


## 2. Science Questions (SciQ Dataset)

### Dataset Information
- **Source:** SciQ (Hugging Face: allenai/sciq)
- **Target Sample:** 80 questions
- **Domain:** Science

In [9]:
from datasets import load_dataset
import csv
import random

# -------------------------------
# Step 1: Load the SciQ dataset
# -------------------------------
# Source: https://huggingface.co/datasets/allenai/sciq
dataset = load_dataset("allenai/sciq")

# -------------------------------
# Step 2: Randomly sample 80 questions from train split
# -------------------------------
train_data = list(dataset['train'])
samples = random.sample(train_data, k=80)

# -------------------------------
# Step 3: Append to existing CSV file
# -------------------------------
with open('qna_dataset.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    
    for idx, item in enumerate(samples, start=81):  # Start from 81 to continue numbering
        question_text = item.get('question', '').strip()
        answer = item.get('correct_answer', '').strip()
        domain = "Science"
        writer.writerow([idx, question_text, domain, answer])
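
The append cells hard-code the starting ID (81, 161, 241, 321), which silently produces duplicate or skipped IDs if a cell is rerun or a sample size changes. One possible alternative, sketched here under the assumption that the CSV keeps the `question_id` header written in section 1, is to derive the next ID from the file itself. `next_question_id` is a hypothetical helper, and the demo writes to a temporary file rather than `qna_dataset.csv`.

```python
import csv
import os
import tempfile

def next_question_id(csv_path):
    """Return 1 + the largest question_id already in the CSV (1 if it has no rows)."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        ids = [int(row['question_id']) for row in csv.DictReader(f)]
    return max(ids, default=0) + 1

# Demo on a throwaway file with two existing rows
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['question_id', 'question_text', 'domain', 'ground_truth'])
        writer.writerow([1, 'Q one?', 'General Knowledge', 'A'])
        writer.writerow([2, 'Q two?', 'General Knowledge', 'B'])
    demo_next_id = next_question_id(path)

print(demo_next_id)  # 3
```

In an append cell this would replace the literal, for example `enumerate(samples, start=next_question_id('qna_dataset.csv'))`.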

## 3. History Questions (Natural Questions Dataset)

### Dataset Information
- **Source:** Natural Questions (nq_open)
- **Target Sample:** 80 questions
- **Domain:** History

In [11]:
from datasets import load_dataset
import csv
import random

# -------------------------------
# Step 1: Load the Natural Questions (open) dataset
# -------------------------------
# Source: https://huggingface.co/datasets/nq_open
dataset = load_dataset("nq_open")

# -------------------------------
# Step 2: Randomly sample 80 questions from train split
# -------------------------------
train_data = list(dataset['train'])
samples = random.sample(train_data, k=80)

# -------------------------------
# Step 3: Append to existing CSV file
# -------------------------------
with open('qna_dataset.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    
    for idx, item in enumerate(samples, start=161):  # Start from 161 to continue numbering
        question_text = item.get('question', '').strip()
        # NQ open has answers as a list, join all valid answers with commas
        answer_list = item.get('answer', [])
        if answer_list:
            # Join multiple answers with ", " separator
            answer = ', '.join([ans.strip() for ans in answer_list if ans.strip()])
        else:
            answer = ''
        domain = "History"
        writer.writerow([idx, question_text, domain, answer])

## 4. Pop Culture Questions (HotpotQA Dataset)

### Dataset Information
- **Source:** HotpotQA (hotpot_qa)
- **Target Sample:** 80 questions
- **Domain:** Pop Culture

In [14]:
from datasets import load_dataset
import csv
import random

# -------------------------------
# Step 1: Load the HotpotQA dataset
# -------------------------------
# Source: https://huggingface.co/datasets/hotpot_qa
dataset = load_dataset("hotpot_qa", "fullwiki", trust_remote_code=True)

# -------------------------------
# Step 2: Randomly sample 80 questions from train split
# -------------------------------
train_data = list(dataset['train'])
samples = random.sample(train_data, k=80)

# -------------------------------
# Step 3: Append to existing CSV file
# -------------------------------
with open('qna_dataset.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    
    for idx, item in enumerate(samples, start=241):  # Start from 241 to continue numbering
        question_text = item.get('question', '').strip()
        answer = item.get('answer', '').strip()
        domain = "Pop Culture"
        writer.writerow([idx, question_text, domain, answer])

## 5. Healthcare Questions (MedMCQA Dataset)

### Dataset Information
- **Source:** MedMCQA (openlifescienceai/medmcqa)
- **Target Sample:** 80 questions
- **Domain:** Healthcare

In [16]:
from datasets import load_dataset
import csv
import random

# -------------------------------
# Step 1: Load the MedMCQA dataset
# -------------------------------
# Source: https://huggingface.co/datasets/openlifescienceai/medmcqa
dataset = load_dataset("openlifescienceai/medmcqa")

# -------------------------------
# Step 2: Filter out "except" questions and randomly sample 80
# -------------------------------
train_data = list(dataset['train'])

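# Filter out negation-style questions ("except", "not", ...) that only make sense alongside the answer options.
# Note: these are plain substring checks, so 'not' also matches words such as 'another'.
filtered_questions = []
except_patterns = ['except', 'not', 'false', 'incorrect', 'all of the following', 'which of the following']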

for item in train_data:
    question_text = item.get('question', '').lower()
    # Skip questions that contain "except" or similar patterns
    if not any(pattern in question_text for pattern in except_patterns):
        filtered_questions.append(item)

print(f"Filtered out {len(train_data) - len(filtered_questions)} 'except' questions")
print(f"Remaining questions: {len(filtered_questions)}")

# Sample 80 questions from filtered ones
samples = random.sample(filtered_questions, k=80)

# -------------------------------
# Step 3: Append to existing CSV file
# -------------------------------
with open('qna_dataset.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    
    for idx, item in enumerate(samples, start=321):  # Start from 321 to continue numbering
        question_text = item.get('question', '').strip()
        
        # MedMCQA has multiple choice answers, we need the correct one
        options = [
            item.get('opa', ''),
            item.get('opb', ''),
            item.get('opc', ''),
            item.get('opd', '')
        ]
        
        # Get the correct answer index (0=A, 1=B, 2=C, 3=D)
        correct_idx = item.get('cop', 0)  # cop = correct option
        # Guard both bounds: a negative index would otherwise wrap around to the last option
        answer = options[correct_idx].strip() if 0 <= correct_idx < len(options) else ''
        
        domain = "Healthcare"
        writer.writerow([idx, question_text, domain, answer])

Filtered out 58593 'except' questions
Remaining questions: 124229
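
The filter above uses substring checks, so the short pattern `'not'` also removes questions that merely contain words like "another" or "notochord". A sketch of a stricter variant using word boundaries is shown below. It is an illustration only: `needs_options` and `NEGATION_RE` are hypothetical names, and applying this filter would change the counts printed above.

```python
import re

# Same phrases as except_patterns in the cell above
NEGATION_TERMS = ['except', 'not', 'false', 'incorrect',
                  'all of the following', 'which of the following']

# \b anchors each phrase at word boundaries, so 'not' no longer matches inside 'notochord'
NEGATION_RE = re.compile(
    r'\b(?:' + '|'.join(re.escape(term) for term in NEGATION_TERMS) + r')\b',
    re.IGNORECASE,
)

def needs_options(question_text):
    """True if the question contains a negation-style phrase as a whole word or phrase."""
    return NEGATION_RE.search(question_text) is not None

print(needs_options("All are true EXCEPT which?"))                  # True
print(needs_options("What structure develops from the notochord?"))  # False
```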


## Add Question Length Column

Add a `question_length` column (character count) as a proxy for question complexity, used to test hypothesis H3.

In [17]:
import pandas as pd

# -------------------------------
# Step 1: Load the existing dataset
# -------------------------------
df = pd.read_csv('qna_dataset.csv')

# -------------------------------
# Step 2: Calculate question length (character count)
# -------------------------------
df['question_length'] = df['question_text'].str.len()

# -------------------------------
# Step 3: Save the updated dataset
# -------------------------------
df.to_csv('qna_dataset.csv', index=False)

# -------------------------------
# Step 4: Preview the first rows
# -------------------------------
print("\nFirst 5 rows with question lengths:")
print(df[['question_id', 'question_text', 'domain', 'question_length']].head())


First 5 rows with question lengths:
   question_id                                      question_text  \
0            1         What is the most common bird in the world?   
1            2  What British sitcom that aired from 1979 to 19...   
2            3  May 4, 1904 saw the US begin construction the ...   
3            4  "In a 2007 interview, which actor 'animatedly'...   
4            5      In which US state are the Catskill Mountains?   

              domain  question_length  
0  General Knowledge               42  
1  General Knowledge              100  
2  General Knowledge              193  
3  General Knowledge              171  
4  General Knowledge               45  


## GPT-3.5 Question Analysis

### Purpose: Use GPT-3.5 to classify each question (type, nature, style) and to generate its answer (`response_text`)

In [None]:
import pandas as pd
from openai import OpenAI
import json
import time
from datetime import datetime

# -------------------------------
# Step 1: Setup OpenAI API (v1.0+)
# -------------------------------
# Set your OpenAI API key
client = OpenAI(api_key="YOUR_OPENAI_API_KEY_HERE")  # Replace with your actual API key


# -------------------------------
# Step 2: Load the existing dataset
# -------------------------------
df = pd.read_csv('qna_dataset.csv')

# -------------------------------
# Step 3: Define GPT-3.5 analysis function
# -------------------------------
def analyze_question_with_gpt(question_text):
    prompt = f"""
Analyze this question and provide ONLY a JSON response with these exact fields:

Question: "{question_text}"

Return JSON format:
{{
    "question_type": "closed-ended" or "open-ended",
    "question_nature": "topical" or "general", 
    "question_style": "qualitative" or "quantitative",
    "response_text": "answer to the question"
}}

Rules:
- question_type: "closed-ended" if the question has a clear, direct, factual answer. "open-ended" if it requires descriptive, subjective, or interpretive responses
- question_nature: "topical" if about specific current events/trends, "general" if timeless knowledge
- question_style: "quantitative" if involves numbers/measurements, "qualitative" otherwise  
- response_text: Provide appropriate answer - can be direct fact, explanation, or description as needed. No conversational phrases like "That's a great question" or "How do you feel about it". Just answer the question naturally and informatively.
"""

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=200
        )
        
        # Parse the JSON response
        result = json.loads(response.choices[0].message.content.strip())
        return result
        
    except Exception as e:
        print(f"Error processing question: {e}")
        return {
            "question_type": "unknown",
            "question_nature": "unknown", 
            "question_style": "unknown",
            "response_text": "error"
        }

# -------------------------------
# Step 4: Process ALL 400 questions
# -------------------------------
print("Starting GPT-3.5 analysis for ALL questions...")
print(f"Processing {len(df)} questions...")

results = []
for idx, row in df.iterrows():
    print(f"Processing question {idx+1}/{len(df)}: {row['question_text'][:50]}...")
    
    # Analyze question with GPT-3.5
    analysis = analyze_question_with_gpt(row['question_text'])
    
    # Add collection date
    analysis['collection_date'] = datetime.now().strftime('%Y-%m-%d')
    
    results.append(analysis)
    
    # Add small delay to respect API rate limits
    time.sleep(0.5)
    
    # Report progress every 50 questions (results are only written to disk in Step 6)
    if (idx + 1) % 50 == 0:
        print(f"✅ Completed {idx + 1} questions")

# -------------------------------
# Step 5: Add new columns to dataframe
# -------------------------------
df['question_type'] = [r['question_type'] for r in results]
df['question_nature'] = [r['question_nature'] for r in results] 
df['question_style'] = [r['question_style'] for r in results]
df['response_text'] = [r['response_text'] for r in results]
df['collection_date'] = [r['collection_date'] for r in results]

# -------------------------------
# Step 6: Save enhanced dataset
# -------------------------------
df.to_csv('qna_dataset_GPT3.5.csv', index=False)

print("GPT-3.5 analysis complete!")
print(f"Saved enhanced dataset to 'qna_dataset_GPT3.5.csv'")

# Display sample results
print(f"\nSample results:")
print(df[['question_text', 'question_type', 'question_nature', 'question_style']].head())

# Show distribution of classifications
print(f"\nClassification distributions:")
print(f"Question Type: {df['question_type'].value_counts().to_dict()}")
print(f"Question Nature: {df['question_nature'].value_counts().to_dict()}")
print(f"Question Style: {df['question_style'].value_counts().to_dict()}")

Starting GPT-3.5 analysis for ALL questions...
Processing 400 questions...
Processing question 1/400: What is the most common bird in the world?...
Processing question 2/400: What British sitcom that aired from 1979 to 1981 i...
Processing question 3/400: May 4, 1904 saw the US begin construction the Cana...
Processing question 4/400: "In a 2007 interview, which actor 'animatedly' bem...
Processing question 5/400: In which US state are the Catskill Mountains?...
Processing question 6/400: What was the name of Canada's first woman Prime Mi...
Processing question 7/400: Who was the former wife of war hero Leonard Cheshi...
Processing question 8/400: Who was the first British winner of the US Women’s...
Processing question 9/400: What type of beer does Homer Simpson drink?...
Processing question 10/400: Which role is being played in a recently released ...
Processing question 11/400: What was the surname of the butler played by Gordo...
Processing question 12/400: Which French phrase comm

In [26]:
# Show distribution of classifications
print(f"\nClassification distributions:")
print(f"Question Type: {df['question_type'].value_counts().to_dict()}")
print(f"Question Nature: {df['question_nature'].value_counts().to_dict()}")
print(f"Question Style: {df['question_style'].value_counts().to_dict()}")


Classification distributions:
Question Type: {'closed-ended': 391, 'open-ended': 9}
Question Nature: {'general': 337, 'topical': 63}
Question Style: {'qualitative': 373, 'quantitative': 27}
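
The analysis function passes the raw model reply straight to `json.loads`, so a reply wrapped in a Markdown code fence or surrounded by a sentence of prose falls into the `except` branch and is recorded as "unknown". The following is a minimal sketch of a more tolerant parser, assuming the reply contains at most one JSON object. `parse_json_reply` is a hypothetical helper, not part of the original notebook.

```python
import json
import re

def parse_json_reply(reply_text):
    """Parse a model reply expected to hold one JSON object; return None on failure."""
    text = reply_text.strip()
    # Strip a surrounding Markdown code fence (three backticks, optional "json" tag)
    fenced = re.match(r'^`{3}(?:json)?\s*(.*?)\s*`{3}$', text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost {...} span if prose surrounds the object
    start, end = text.find('{'), text.rfind('}')
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

print(parse_json_reply('Sure! {"question_type": "closed-ended"}'))
```

Returning `None` rather than a default dictionary lets the caller decide whether to retry the request or record the row as failed.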


## Add Response Length Column

### Purpose: Add response_length column to measure GPT-3.5 response complexity (character count)

In [27]:
import pandas as pd

# -------------------------------
# Step 1: Load the existing dataset
# -------------------------------
df = pd.read_csv('qna_dataset_GPT3.5.csv')

# -------------------------------
# Step 2: Calculate response length (character count)
# -------------------------------
df['response_length'] = df['response_text'].str.len()

# -------------------------------
# Step 3: Save the updated dataset
# -------------------------------
df.to_csv('qna_dataset_GPT3.5.csv', index=False)

# -------------------------------
# Step 4: Display summary statistics
# -------------------------------
print("Response length column added successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"\nResponse length statistics:")
print(df['response_length'].describe())

print(f"\nSample data with both lengths:")
print(df[['question_text', 'question_length', 'response_text', 'response_length']].head())

Response length column added successfully!

Dataset shape: (400, 11)

Response length statistics:
count    400.000000
mean      98.452500
std       79.212336
min        4.000000
25%       47.000000
50%       83.000000
75%      126.250000
max      416.000000
Name: response_length, dtype: float64

Sample data with both lengths:
                                       question_text  question_length  \
0         What is the most common bird in the world?               42   
1  What British sitcom that aired from 1979 to 19...              100   
2  May 4, 1904 saw the US begin construction the ...              193   
3  "In a 2007 interview, which actor 'animatedly'...              171   
4      In which US state are the Catskill Mountains?               45   

                                       response_text  response_length  
0  The most common bird in the world is the domes...               85  
1                                  To the Manor Born               17  
2  The US preside

## GPT-4o Response Generation (All 400 Questions)

### Purpose: Generate responses using GPT-4o for all questions

In [None]:
import pandas as pd
from openai import OpenAI
import time
from datetime import datetime

# -------------------------------
# Step 1: Setup OpenAI API
# -------------------------------
client = OpenAI(api_key="YOUR_OPENAI_API_KEY_HERE")  # Replace with your actual API key

# -------------------------------
# Step 2: Load existing dataset into new dataframe
# -------------------------------
df_gpt4o = pd.read_csv('qna_dataset_GPT3.5.csv')

print(f"Processing ALL {len(df_gpt4o)} questions using GPT-4o...")

# -------------------------------
# Step 3: Define GPT-4o response function
# -------------------------------
def get_gpt4o_response(question_text):
    prompt = f"""
Provide appropriate answer - can be direct fact, explanation, or description as needed. No conversational phrases like "That's a great question" or "How do you feel about it". Just answer the question naturally and informatively.

Question: "{question_text}"
"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=300
        )
        
        return response.choices[0].message.content.strip()
        
    except Exception as e:
        print(f"Error processing question: {e}")
        return "error"

# -------------------------------
# Step 4: Generate GPT-4o responses for ALL questions
# -------------------------------
print("Generating GPT-4o responses for all 400 questions...")

gpt4o_responses = []
for idx, row in df_gpt4o.iterrows():
    print(f"Processing question {idx+1}/{len(df_gpt4o)}: {row['question_text'][:50]}...")
    
    # Get GPT-4o response
    response = get_gpt4o_response(row['question_text'])
    gpt4o_responses.append(response)
    
    # Add small delay to respect API rate limits
    time.sleep(0.25)
    
    # Report progress every 50 questions (responses are only written to disk in Step 6)
    if (idx + 1) % 50 == 0:
        print(f"Completed {idx + 1} questions...")

# -------------------------------
# Step 5: Replace response columns with GPT-4o data
# -------------------------------
df_gpt4o['response_text'] = gpt4o_responses
df_gpt4o['response_length'] = df_gpt4o['response_text'].str.len()

# Update collection date
df_gpt4o['collection_date'] = datetime.now().strftime('%Y-%m-%d')

# -------------------------------
# Step 6: Save to new CSV
# -------------------------------
df_gpt4o.to_csv('qna_dataset_GPT4o.csv', index=False)

print("GPT-4o responses generated for all questions!")
print(f"Saved to 'qna_dataset_GPT4o.csv'")
print(f"\nDataset shape: {df_gpt4o.shape}")

Processing ALL 400 questions using GPT-4o...
Generating GPT-4o responses for all 400 questions...
Processing question 1/400: What is the most common bird in the world?...
Processing question 2/400: What British sitcom that aired from 1979 to 1981 i...
Processing question 3/400: May 4, 1904 saw the US begin construction the Cana...
Processing question 4/400: "In a 2007 interview, which actor 'animatedly' bem...
Processing question 5/400: In which US state are the Catskill Mountains?...
Processing question 6/400: What was the name of Canada's first woman Prime Mi...
Processing question 7/400: Who was the former wife of war hero Leonard Cheshi...
Processing question 8/400: Who was the first British winner of the US Women’s...
Processing question 9/400: What type of beer does Homer Simpson drink?...
Processing question 10/400: Which role is being played in a recently released ...
Processing question 11/400: What was the surname of the butler played by Gordo...
Processing question 12/400: W

In [31]:
# -------------------------------
# Step 7: Show final statistics
# -------------------------------
print(f"\nGPT-4o response length statistics:")
print(df_gpt4o['response_length'].describe())

print(f"\nSample GPT-4o responses:")
print(df_gpt4o[['question_text', 'response_text', 'response_length']].head())


GPT-4o response length statistics:
count     400.000000
mean      253.672500
std       203.662014
min         8.000000
25%       101.000000
50%       207.500000
75%       349.750000
max      1487.000000
Name: response_length, dtype: float64

Sample GPT-4o responses:
                                       question_text  \
0         What is the most common bird in the world?   
1  What British sitcom that aired from 1979 to 19...   
2  May 4, 1904 saw the US begin construction the ...   
3  "In a 2007 interview, which actor 'animatedly'...   
4      In which US state are the Catskill Mountains?   

                                       response_text  response_length  
0  The most common bird in the world is the domes...              276  
1  The British sitcom that aired from 1979 to 198...              125  
2  The construction of the Panama Canal began und...              236  
3  In a 2007 interview, actor Antonio Banderas an...              265  
4  The Catskill Mountains are locat
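
The generation loops only write results to disk after all 400 calls finish, so an interruption loses every response collected so far. A rough sketch of periodic checkpointing with resume is given below, assuming one response per question in a fixed order. `run_with_checkpoints`, the checkpoint file name, and the `fake_answer` stand-in for the API call are all hypothetical.

```python
import csv
import os
import tempfile

def run_with_checkpoints(questions, answer_fn, checkpoint_path, every=50):
    """Answer each question, writing all responses to disk every `every` items.

    If checkpoint_path already exists, its responses are reused and only the
    remaining questions are sent to answer_fn.
    """
    responses = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, newline='', encoding='utf-8') as f:
            responses = [row[0] for row in csv.reader(f)]
    for i in range(len(responses), len(questions)):
        responses.append(answer_fn(questions[i]))
        # Rewrite the checkpoint at each interval and after the final question
        if (i + 1) % every == 0 or i + 1 == len(questions):
            with open(checkpoint_path, 'w', newline='', encoding='utf-8') as f:
                csv.writer(f).writerows([r] for r in responses)
    return responses

# Demo with a fake model call that records how often it is invoked
calls = []
def fake_answer(question):
    calls.append(question)
    return question.upper()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'responses_checkpoint.csv')
    first_run = run_with_checkpoints(['a', 'b', 'c'], fake_answer, path, every=2)
    second_run = run_with_checkpoints(['a', 'b', 'c'], fake_answer, path, every=2)

print(first_run, len(calls))  # ['A', 'B', 'C'] 3
```

The second call returns the saved responses without invoking `fake_answer` again, which is the behavior a rerun after an interruption would rely on.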

## Claude 3.5 Sonnet Response Generation (All 400 Questions)

### Purpose: Generate responses using Claude 3.5 Sonnet for all questions

In [None]:

import pandas as pd
import anthropic
import time
from datetime import datetime

# -------------------------------
# Step 1: Setup Anthropic API
# -------------------------------
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY_HERE")  # Replace with your actual API key

# -------------------------------
# Step 2: Load existing dataset into new dataframe
# -------------------------------
df_claude = pd.read_csv('qna_dataset_GPT3.5.csv')

print(f"Processing ALL {len(df_claude)} questions using Claude 3.5 Sonnet...")

# -------------------------------
# Step 3: Define Claude 3.5 Sonnet response function
# -------------------------------
def get_claude_response(question_text):
    prompt = f"""
Provide appropriate answer - can be direct fact, explanation, or description as needed. No conversational phrases like "That's a great question" or "How do you feel about it". Just answer the question naturally and informatively.

Question: "{question_text}"
"""

    try:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # Claude 3.5 Sonnet snapshot dated 2024-10-22
            max_tokens=300,
            temperature=0.1,
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.content[0].text.strip()
        
    except Exception as e:
        print(f"Error processing question: {e}")
        return "error"

# -------------------------------
# Step 4: Generate Claude 3.5 Sonnet responses for ALL questions
# -------------------------------
print("Generating Claude 3.5 Sonnet responses for all 400 questions...")

claude_responses = []
for idx, row in df_claude.iterrows():
    print(f"Processing question {idx+1}/{len(df_claude)}: {row['question_text'][:50]}...")
    
    # Get Claude 3.5 Sonnet response
    response = get_claude_response(row['question_text'])
    claude_responses.append(response)
    
    # Add small delay to respect API rate limits
    time.sleep(0.25)
    
    # Report progress every 50 questions (responses are only written to disk in Step 6)
    if (idx + 1) % 50 == 0:
        print(f"Completed {idx + 1} questions...")

# -------------------------------
# Step 5: Replace response columns with Claude data
# -------------------------------
df_claude['response_text'] = claude_responses
df_claude['response_length'] = df_claude['response_text'].str.len()

# Update collection date
df_claude['collection_date'] = datetime.now().strftime('%Y-%m-%d')

# -------------------------------
# Step 6: Save to new CSV
# -------------------------------
df_claude.to_csv('qna_dataset_Claude3.5Sonnet.csv', index=False)

print("Claude 3.5 Sonnet responses generated for all questions!")
print(f"Saved to 'qna_dataset_Claude3.5Sonnet.csv'")
print(f"\nDataset shape: {df_claude.shape}")

Processing ALL 400 questions using Claude 3.5 Sonnet...
Generating Claude 3.5 Sonnet responses for all 400 questions...
Processing question 1/400: What is the most common bird in the world?...
Processing question 2/400: What British sitcom that aired from 1979 to 1981 i...
Processing question 3/400: May 4, 1904 saw the US begin construction the Cana...
Processing question 4/400: "In a 2007 interview, which actor 'animatedly' bem...
Processing question 5/400: In which US state are the Catskill Mountains?...
Processing question 6/400: What was the name of Canada's first woman Prime Mi...
Processing question 7/400: Who was the former wife of war hero Leonard Cheshi...
Processing question 8/400: Who was the first British winner of the US Women’s...
Processing question 9/400: What type of beer does Homer Simpson drink?...
Processing question 10/400: Which role is being played in a recently released ...
Processing question 11/400: What was the surname of the butler played by Gordo...
Process

In [38]:
# -------------------------------
# Step 7: Show final statistics
# -------------------------------
print(f"\nClaude 3.5 Sonnet response length statistics:")
print(df_claude['response_length'].describe())

print(f"\nSample Claude 3.5 Sonnet responses:")
print(df_claude[['question_text', 'response_text', 'response_length']].head())


Claude 3.5 Sonnet response length statistics:
count     400.000000
mean      391.595000
std       199.831699
min         6.000000
25%       250.750000
50%       349.000000
75%       501.750000
max      1075.000000
Name: response_length, dtype: float64

Sample Claude 3.5 Sonnet responses:
                                       question_text  \
0         What is the most common bird in the world?   
1  What British sitcom that aired from 1979 to 19...   
2  May 4, 1904 saw the US begin construction the ...   
3  "In a 2007 interview, which actor 'animatedly'...   
4      In which US state are the Catskill Mountains?   

                                       response_text  response_length  
0  The House Sparrow (Passer domesticus) is the m...              403  
1  To The Manor Born aired on BBC1 from 1979 to 1...              352  
2  Theodore Roosevelt orchestrated Panama's indep...              646  
3  Bill Murray made these comments about Garfield...              342  
4  The Catski
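
Both response functions return the literal string "error" on the first exception, so a transient rate-limit or network failure ends up in the dataset as a 5-character response. One common mitigation is to retry with exponential backoff before giving up. The sketch below is library-agnostic and makes no assumption about which exceptions the API clients raise. `call_with_retries` and `flaky` are hypothetical names, and the delay values are arbitrary.

```python
import time

def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again.

    The last failure is re-raised so the caller can decide how to record it.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo: a function that fails twice, then succeeds; delays are recorded, not slept
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook this could wrap the request, for example `call_with_retries(lambda: get_claude_response(text))`, provided the wrapped function lets exceptions propagate instead of catching them itself.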

## Hallucination Analysis for All Models (All Questions)

### Purpose: Analyze responses from GPT-3.5, GPT-4o, and Claude 3.5 Sonnet for hallucinations

In [None]:
import pandas as pd
from openai import OpenAI
import json
import time
from datetime import datetime

# -------------------------------
# Step 1: Setup OpenAI API
# -------------------------------
client = OpenAI(api_key="YOUR_OPENAI_API_KEY_HERE")  # Replace with your actual API key

# -------------------------------
# Step 2: List the three input files and their output paths
# -------------------------------
files_to_process = [
    ('qna_dataset_GPT3.5.csv', 'qna_dataset_GPT3.5_final.csv'),
    ('qna_dataset_GPT4o.csv', 'qna_dataset_GPT4o_final.csv'),
    ('qna_dataset_Claude3.5Sonnet.csv', 'qna_dataset_Claude3.5Sonnet_final.csv')
]

print("Starting hallucination analysis for all 3 models (ALL questions)...")

# -------------------------------
# Step 3: Define hallucination analysis function
# -------------------------------
def analyze_response_for_hallucination(question, ground_truth, response_text):
    prompt = f"""
Analyze this Q&A response and provide ONLY a JSON response with these exact fields:

Question: "{question}"
Ground Truth: "{ground_truth}"
Response: "{response_text}"

Return JSON format:
{{
    "citation_present": true or false,
    "hallucination_present": true or false,
    "factscore": "completely right" or "somewhat correct" or "somewhat inaccurate" or "totally wrong",
    "confidence_markers_count": number,
    "uncertainty_markers_count": number
}}

Rules:
- citation_present: true if response contains citations or source-like phrases (e.g., "According to...", "Research shows...", "Studies indicate...")
- hallucination_present: true if response contradicts or significantly differs from ground truth
- factscore: Compare response accuracy against ground truth
- confidence_markers_count: Count phrases like "certainly," "definitely," "absolutely," "clearly," "obviously" in the response
- uncertainty_markers_count: Count phrases like "possibly," "might be," "perhaps," "likely," "probably," "may be" in the response
"""

    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=150
        )
        
        # Parse the JSON response
        result = json.loads(response.choices[0].message.content.strip())
        return result
        
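    except Exception as e:
        print(f"Error processing analysis: {e}")
        # Fallback row: factscore "unknown" marks a failed analysis. The False/0 values
        # are placeholders, so exclude these rows before computing hallucination rates.
        return {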
            "citation_present": False,
            "hallucination_present": False,
            "factscore": "unknown",
            "confidence_markers_count": 0,
            "uncertainty_markers_count": 0
        }

# -------------------------------
# Step 4: Process each file (ALL questions)
# -------------------------------
for input_file, output_file in files_to_process:
    print(f"\nProcessing {input_file} (ALL questions)...")
    
    # Load full dataset
    df = pd.read_csv(input_file)
    
    print(f"Analyzing {len(df)} responses from {input_file}...")
    
    # Initialize lists for new columns
    citation_present = []
    hallucination_present = []
    factscore = []
    confidence_markers_count = []
    uncertainty_markers_count = []
    
    # Analyze each response
    for idx, row in df.iterrows():
        print(f"  Analyzing response {idx+1}/{len(df)}: {row['question_text'][:50]}...")
        
        analysis = analyze_response_for_hallucination(
            row['question_text'],
            row['ground_truth'],
            row['response_text']
        )
        
        citation_present.append(analysis['citation_present'])
        hallucination_present.append(analysis['hallucination_present'])
        factscore.append(analysis['factscore'])
        confidence_markers_count.append(analysis['confidence_markers_count'])
        uncertainty_markers_count.append(analysis['uncertainty_markers_count'])
        
        # Small delay for API rate limits
        time.sleep(0.25)
        
        # Report progress every 50 questions (results are written to disk only after the loop)
        if (idx + 1) % 50 == 0:
            print(f"    Completed {idx + 1} questions...")
    
    # Add new columns to dataframe
    df['citation_present'] = citation_present
    df['hallucination_present'] = hallucination_present
    df['factscore'] = factscore
    df['confidence_markers_count'] = confidence_markers_count
    df['uncertainty_markers_count'] = uncertainty_markers_count
    
    # Save to final file
    df.to_csv(output_file, index=False)
    
    print(f"Saved complete analysis to {output_file}")
    
print("\nComplete hallucination analysis finished for all models!")
print("\nFinal analysis files created:")
print("- qna_dataset_GPT3.5_final.csv")
print("- qna_dataset_GPT4o_final.csv")
print("- qna_dataset_Claude3.5Sonnet_final.csv")

Starting hallucination analysis for all 3 models (ALL questions)...

Processing qna_dataset_GPT3.5.csv (ALL questions)...
Analyzing 400 responses from qna_dataset_GPT3.5.csv...
  Analyzing response 1/400: What is the most common bird in the world?...
  Analyzing response 2/400: What British sitcom that aired from 1979 to 1981 i...
  Analyzing response 3/400: May 4, 1904 saw the US begin construction the Cana...
  Analyzing response 4/400: "In a 2007 interview, which actor 'animatedly' bem...
  Analyzing response 5/400: In which US state are the Catskill Mountains?...
  Analyzing response 6/400: What was the name of Canada's first woman Prime Mi...
  Analyzing response 7/400: Who was the former wife of war hero Leonard Cheshi...
  Analyzing response 8/400: Who was the first British winner of the US Women’s...
  Analyzing response 9/400: What type of beer does Homer Simpson drink?...
  Analyzing response 10/400: Which role is being played in a recently released ...
  Analyzing response 1
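
The analysis cell above parses the judge model's reply with a bare `json.loads` and, on any exception, returns a default record with `hallucination_present` set to `False`. A failed parse is therefore indistinguishable from a clean response in the final CSV. The sketch below shows one possible way to make the parsing step more tolerant and to keep failures visible. The helper name `parse_judge_json` and the `REQUIRED_KEYS` set are hypothetical and not part of the original notebook; the key names are taken from the fallback dictionary in the cell above.

```python
import json
import re

# Keys the judge prompt asks for (copied from the fallback dict in the cell above).
REQUIRED_KEYS = {
    "citation_present",
    "hallucination_present",
    "factscore",
    "confidence_markers_count",
    "uncertainty_markers_count",
}


def parse_judge_json(raw_text):
    """Hypothetical helper: return the parsed dict, or None if the reply is unusable."""
    # Take the span from the first "{" to the last "}" so that code fences or
    # extra prose around the JSON object are ignored.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject replies that are not an object or that are missing expected fields.
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None
    return parsed
```

Returning `None` instead of a default record would let the calling loop count or retry failed analyses separately, rather than folding them into the "no hallucination" total. Whether that is preferable depends on how the downstream summary code handles missing values.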

## Summary Statistics for Final Files

### Purpose: Show summary statistics for the 3 final analysis files
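
Each rate in this summary is a point estimate from 400 judged responses per model. As an optional addition that is not part of the original analysis, a Wilson score interval gives a rough sense of the sampling uncertainty around such a rate. The helper name `wilson_interval` and the 95% level (`z=1.96`) are assumptions made for this sketch; the counts in the usage lines are the hallucination counts reported for GPT-3.5 and GPT-4o in this section.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Hypothetical helper: Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Usage with counts reported in this section (hallucinations / total responses).
for label, count in [("GPT-3.5", 75), ("GPT-4o", 47)]:
    low, high = wilson_interval(count, 400)
    print(f"{label}: {count / 400:.1%} (95% interval {low:.1%} to {high:.1%})")
```

The interval reflects sampling variation only. It does not account for errors made by the judge model itself, so it should be read as a lower bound on the real uncertainty.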

In [41]:
import pandas as pd

# -------------------------------
# Final files to analyze
# -------------------------------
files_to_analyze = [
    'qna_dataset_GPT3.5_final.csv',
    'qna_dataset_GPT4o_final.csv',
    'qna_dataset_Claude3.5Sonnet_final.csv'
]

# -------------------------------
# Generate summary for each file
# -------------------------------
for input_file in files_to_analyze:
    print(f"\n{'='*60}")
    print(f"SUMMARY FOR {input_file}")
    print(f"{'='*60}")
    
    # Load the dataset
    df = pd.read_csv(input_file)
    
    # Extract the new columns
    citation_present = df['citation_present'].tolist()
    hallucination_present = df['hallucination_present'].tolist()
    factscore = df['factscore'].tolist()
    confidence_markers_count = df['confidence_markers_count'].tolist()
    uncertainty_markers_count = df['uncertainty_markers_count'].tolist()
    
    # Show summary statistics
    print(f"Summary for {input_file}:")
    print(f"  Total responses: {len(df)}")
    print(f"  Citations present: {sum(citation_present)}/{len(citation_present)} ({sum(citation_present)/len(citation_present)*100:.1f}%)")
    print(f"  Hallucinations detected: {sum(hallucination_present)}/{len(hallucination_present)} ({sum(hallucination_present)/len(hallucination_present)*100:.1f}%)")
    print(f"  FactScore distribution: {pd.Series(factscore).value_counts().to_dict()}")
    print(f"  Average confidence markers: {sum(confidence_markers_count)/len(confidence_markers_count):.1f}")
    print(f"  Average uncertainty markers: {sum(uncertainty_markers_count)/len(uncertainty_markers_count):.1f}")


SUMMARY FOR qna_dataset_GPT3.5_final.csv
Summary for qna_dataset_GPT3.5_final.csv:
  Total responses: 400
  Citations present: 1/400 (0.2%)
  Hallucinations detected: 75/400 (18.8%)
  FactScore distribution: {'completely right': 210, 'somewhat inaccurate': 87, 'somewhat correct': 87, 'totally wrong': 16}
  Average confidence markers: 0.0
  Average uncertainty markers: 0.1

SUMMARY FOR qna_dataset_GPT4o_final.csv
Summary for qna_dataset_GPT4o_final.csv:
  Total responses: 400
  Citations present: 14/400 (3.5%)
  Hallucinations detected: 47/400 (11.8%)
  FactScore distribution: {'completely right': 255, 'somewhat correct': 79, 'somewhat inaccurate': 55, 'totally wrong': 11}
  Average confidence markers: 0.0
  Average uncertainty markers: 0.1

SUMMARY FOR qna_dataset_Claude3.5Sonnet_final.csv
Summary for qna_dataset_Claude3.5Sonnet_final.csv:
  Total responses: 400
  Citations present: 26/400 (6.5%)
  Hallucinations detected: 49/400 (12.2%)
  FactScore distribution: {'completely right': 