# Install Requirements

In [None]:
import pandas as pd
import numpy as np

# Load and Manipulate Datasets

In [10]:
# dataset from https://github.com/RUCAIBox/HaluEval
df1 = pd.read_json("/content/general_data.json", lines=True)

# data cleaning, only need prompt and binary yes/no for hallucination
df1.drop(['chatgpt_response', 'ID', 'hallucination_spans'], axis=1, inplace=True, errors='ignore')

df1['hallucination_binary'] = df1['hallucination'].map({'yes': 1, 'no': 0})

df1.drop('hallucination', axis=1, inplace=True, errors='ignore')

# necessary for trainer
df1.rename(columns={'hallucination_binary':'label'}, inplace=True)

# rename for consistency
df1.rename(columns={'user_query':'input'}, inplace=True)

print(len(df1))
print(df1.columns)
print(df1.head())
print(df1['label'].value_counts())

4507
Index(['input', 'label'], dtype='object')
                                               input  label
0  Produce a list of common words in the English ...      0
1              Provide a few examples of homophones.      1
2  Create a chart outlining the world's populatio...      1
3         Design a shape with 10 vertices (corners).      1
4  Automatically generate a 10 by 10 multiplicati...      1
label
0    3692
1     815
Name: count, dtype: int64


This dataset contains generic user prompts, such as a request to list the most common words in the English language. It has 4507 samples and is heavily imbalanced towards "no hallucination": 3692 prompts (about 82%) produced no hallucination in ChatGPT's response, and only 815 (about 18%) produced one. To prepare this dataset, I converted the hallucination column to an integer label (0 for no hallucination, 1 for a hallucination) and removed all columns except for the prompt (input) and the label.
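As a minimal sketch of the label conversion described above, the snippet below runs the same yes/no mapping on a few toy rows and checks for values that the mapping does not cover. The toy rows and the names `toy`, `mapping`, and `toy_clean` are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd

# Toy rows standing in for HaluEval's general_data.json (hypothetical values).
toy = pd.DataFrame({
    "user_query": ["q1", "q2", "q3"],
    "hallucination": ["yes", "no", "no"],
})

mapping = {"yes": 1, "no": 0}

# Series.map returns NaN for any value missing from the mapping,
# so look for unmapped values before casting to int.
labels = toy["hallucination"].map(mapping)
unmapped = toy.loc[labels.isna(), "hallucination"].unique().tolist()
if unmapped:
    raise ValueError(f"unexpected hallucination values: {unmapped}")

# Keep only the prompt and the integer label, named as in the notebook.
toy_clean = pd.DataFrame({
    "input": toy["user_query"],
    "label": labels.astype(int),
})
print(toy_clean)
```

The explicit check matters because a silent NaN would turn the label column into floats, which a classification trainer may not accept.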

In [16]:
# datasets from https://github.com/Arize-ai/LibreEval
csv_file_paths = [
    '/content/docs_databricks_com_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/earthobservatory_nasa_gov_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/experienceleague_adobe_com_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/medlineplus_gov_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/pmc_ncbi_nlm_nih_gov_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/www_investopedia_com_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/www_law_cornell_edu_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/www_mongodb_com_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/www_ncbi_nlm_nih_gov_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    '/content/www_noaa_gov_gpt_4o_synthetic_gpt_4o_claude_3_5_sonnet_latest_en_answer.txt',
    ]

df2_list = [pd.read_csv(file) for file in csv_file_paths]
df2 = pd.concat(df2_list, ignore_index=True)

print(df2['label'].value_counts())

df2.drop(columns=["reference", "output", "explanation_gpt-4o",
                  "label_claude-3-5-sonnet-latest", "explanation_claude-3-5-sonnet-latest",
                  "label_litellm/together_ai/Qwen/Qwen2.5-7B-Instruct-Turbo",
                  "explanation_litellm/together_ai/Qwen/Qwen2.5-7B-Instruct-Turbo",
                  "rag_model", "force_even_split", "website", "synthetic",
                  "language", "hallucination_type_realized", "question_type", "hallucination_type_encouraged",
                  "hallucination_type_realized_ensemble", "label_mistral-large-latest",
                  "explanation_mistral-large-latest", "human_label", "label_gpt-4o"], errors="ignore", inplace=True)

# drop unparsable rows; .copy() avoids assigning into a slice of the original frame
df2 = df2[df2['label'].str.upper() != "NOT_PARSABLE"].copy()
df2['label'] = df2['label'].map({'hallucinated': 1, 'factual': 0})

print(len(df2))
print(df2.columns)
print(df2.head())
print(df2['label'].value_counts())

label
factual         3270
hallucinated     976
NOT_PARSABLE       2
Name: count, dtype: int64
4246
Index(['input', 'label'], dtype='object')
                                               input  label
0  What actions can be performed on an external l...      0
1  What versions of Databricks Runtime does the i...      0
2  What is the default access restriction for mat...      0
3  Who can query materialized views and streaming...      0
4  What is required to enable Iceberg reads on ta...      0
label
0    3270
1     976
Name: count, dtype: int64


I used LibreEval's synthetic "even-split-of-hallucinations-and-factuals" datasets (path: labeled_datasets/gpt-4o-hallucinations/synthetic/even-split-of-hallucinations-and-factuals) in the hope of offsetting the imbalance in HaluEval. However, even before my own manipulation, the distribution of factual and hallucinated examples was uneven: 3270 samples were factual and 976 were hallucinated (77% to 23%), with 2 more labeled NOT_PARSABLE.

I used the "answer" datasets for ChatGPT to retrieve both the prompt and whether or not a hallucination occurred. I then dropped the 2 rows labeled NOT_PARSABLE and converted the label column to an integer value (0 for no hallucination, 1 for a hallucination).
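The filtering and conversion step can be sketched on toy data as follows. The rows and the names `toy` and `usable` are hypothetical; only the three label strings come from the output above. This variant keeps the two usable classes with `isin` rather than excluding NOT_PARSABLE, which is a small design change from the notebook cell.

```python
import pandas as pd

# Toy rows using the three label strings seen in the LibreEval files.
toy = pd.DataFrame({
    "input": ["a", "b", "c", "d"],
    "label": ["factual", "hallucinated", "NOT_PARSABLE", "factual"],
})

# Keep only the two usable classes; .copy() gives an independent frame,
# so the assignment below does not write into a slice of `toy`.
usable = toy[toy["label"].isin(["factual", "hallucinated"])].copy()

# Convert the string labels to integers (1 = hallucinated, 0 = factual).
usable["label"] = usable["label"].map({"hallucinated": 1, "factual": 0}).astype(int)
usable = usable.reset_index(drop=True)
print(usable)
```

Selecting the known classes means any other unexpected label string is dropped as well, instead of becoming NaN during the mapping.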

Here are descriptions of a few of the datasets I used:
The Databricks docs answer dataset contains user queries about the [Databricks](https://www.databricks.com/) docs, such as how comments are handled in Databricks. The Earth Observatory dataset contains user queries about the environment and science, such as how snow and ice influence the climate, based on [their data](https://science.nasa.gov/earth/earth-observatory/).

Overall, these datasets cover a wide range of topics, for a total of 4246 prompts paired with whether or not ChatGPT hallucinated. To answer correctly, the LLM must extract the answer from the corresponding webpage. The rest of the datasets I used are available [here](https://github.com/Arize-ai/LibreEval/tree/main/labeled_datasets/gpt-4o-hallucinations/synthetic/even-split-of-hallucinations-and-factuals/en).

In [14]:
# dataset from https://huggingface.co/datasets/opencompass/anah?row=0
from datasets import load_dataset

anah_dataset = load_dataset("opencompass/anah")

# only has train set on hugging face
df3 = anah_dataset["train"].to_pandas()

# filter for english
df3 = df3[df3['language'] == 'en']

# drop non-gpt columns
df3.drop(columns=["InternLM_answers", "human_InternLM_answers_ann", "name", "documents", "language", "GPT3.5_answers_D"], inplace=True)

# get true/false hallucination data, code from ChatGPT
df3['label'] = df3['human_GPT3.5_answers_D_ann'].apply(
    lambda ann: any('<Hallucination>' in str(a) for a in ann)
)

df3['label'] = df3['label'].map({True: 1, False: 0})

# remove LLM response
df3.drop(columns="human_GPT3.5_answers_D_ann", inplace=True)

# rename for consistency
df3.rename(columns={'selected_questions':'input'}, inplace=True)

# extract strings for arrays in every row, code from ChatGPT

df3['input'] = df3['input'].apply(
    lambda x: x[0] if isinstance(x, (list, np.ndarray)) and len(x) > 0 else x
)

print(df3.head())
print(df3.columns)
print(len(df3))
print(df3['label'].value_counts())

                                               input  label
0  What was the aftermath of the Battle of Sobrao...      1
1  What were the consequences of the Kapp Putsch ...      1
2  What were the main factors leading to the Batt...      0
3  How did the Battle of the Camel unfold, and wh...      0
4  How was the leadership vote conducted and what...      1
Index(['input', 'label'], dtype='object')
497
label
1    454
0     43
Name: count, dtype: int64


ANAH covers a comprehensive range of 700 topics, and its 1.2k questions are primarily factual/historical in nature. After filtering for English, 497 data points remain. I then used ChatGPT to write code that takes the answer analysis column (human analysis of ChatGPT answers) and converts it to either 0 or 1 in a label column, depending on whether or not it contains `<Hallucination>`. This works because responses that contain hallucinations have `<Hallucination>` in their breakdown, while those that do not have an empty analysis. Finally, I dropped all columns that weren't the original input or the derived label.
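The tag check described above can be sketched as a small standalone function. The helper name `has_hallucination` and the annotation strings are hypothetical stand-ins, not real ANAH rows; the only detail taken from the notebook is the search for the `<Hallucination>` substring in each annotation.

```python
import numpy as np
import pandas as pd

def has_hallucination(annotations) -> int:
    """Return 1 if any annotation contains the <Hallucination> tag, else 0."""
    if annotations is None:
        return 0
    # Treat a bare string as a single annotation rather than iterating its characters.
    if isinstance(annotations, str):
        annotations = [annotations]
    return int(any("<Hallucination>" in str(a) for a in annotations))

# Hypothetical annotation lists: tagged, empty, and untagged.
toy_annotations = pd.Series([
    ["<Hallucination> hypothetical note about an incorrect claim"],
    [],
    ["hypothetical note without the tag"],
])

labels = toy_annotations.apply(has_hallucination)
print(labels.tolist())

# The same function also accepts NumPy arrays, which is how the column
# arrives after converting the Hugging Face dataset to pandas.
print(has_hallucination(np.array(["<Hallucination> another hypothetical note"])))
```

Returning an integer directly avoids the separate True/False to 1/0 mapping step, though both approaches give the same labels.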

Here's the citation to this dataset:

```
@inproceedings{ji2024anah,
  title={ANAH: Analytical Annotation of Hallucinations in Large Language Models},
  author={Ji, Ziwei and Gu, Yuzhe and Zhang, Wenwei and Lyu, Chengqi and Lin, Dahua and Chen, Kai},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={8135--8158},
  year={2024}
}
```



# Combine Datasets

In [21]:
combined_df = pd.concat([df1, df2, df3], ignore_index=True)
print(combined_df.shape)
print(combined_df.columns)
print(combined_df.head())

print(combined_df['label'].value_counts())


combined_df.to_csv("final_dataset.csv", index=False)

(9250, 2)
Index(['input', 'label'], dtype='object')
                                               input  label
0  Produce a list of common words in the English ...      0
1              Provide a few examples of homophones.      1
2  Create a chart outlining the world's populatio...      1
3         Design a shape with 10 vertices (corners).      1
4  Automatically generate a 10 by 10 multiplicati...      1
label
0    7005
1    2245
Name: count, dtype: int64


My final dataset has 9250 data points: 7005 examples that didn't cause a hallucination (75.73%) and 2245 that did (24.27%). It contains two columns: an "input", representing a user's prompt to ChatGPT, and a "label", an integer value (0 or 1) representing whether or not that prompt caused a hallucination. If you have run each of these cells (which requires downloading the original datasets), you can download the final dataset as a .csv from Colab's file display. If not, it can be downloaded directly from my GitHub.
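The percentages above can be reproduced from the label counts alone. The sketch below rebuilds a label column with the reported counts instead of reading `final_dataset.csv`, so it runs without the source files; the helper name `class_shares` is an assumption for illustration.

```python
import pandas as pd

def class_shares(labels: pd.Series) -> pd.Series:
    """Percentage of rows per label value, rounded to two decimals."""
    if not labels.isin([0, 1]).all():
        raise ValueError("labels must be 0 or 1")
    return (labels.value_counts(normalize=True) * 100).round(2)

# Label column rebuilt from the counts reported for the combined dataset.
labels = pd.Series([0] * 7005 + [1] * 2245, name="label")

shares = class_shares(labels)
print(len(labels))   # 9250
print(shares)        # 75.73 for label 0, 24.27 for label 1
```

With the real file, the same function could be applied to `pd.read_csv("final_dataset.csv")["label"]`.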