In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

# TITLE

---

## Table of Contents 📑
- [Research Question](#research-question)
- [Dataset](#dataset)
- [Data Cleaning](#data-cleaning)
- [Data Preprocessing](#data-preprocessing)
- [Exploratory Data Analysis](#exploratory-data-analysis)

---

## Research Question ❓ <a id="research-question"></a>

[Back to Top](#title)

---

## Dataset 📊 <a id="dataset"></a>

In [13]:
df = pd.read_csv("truthfulqa_responses.csv", dtype={'start_time_epoch_s': float, 'end_time_epoch_s': float})


In [None]:
df

### Description

### Data Collection

### Structure

[Back to Top](#title)

---

## Data Cleaning 🧹<a id="data-cleaning"></a>

The summary below shows that the dataset initially contains `23700` rows. It also shows which columns contain `null` values; these are the first columns we clean. In this case, they are `response` and `source`.


In [None]:
df.info()
display(df.head())  # display() is needed here: a notebook cell only shows its last expression automatically
df.describe()

In [17]:
df.isna().sum()

type                               0
category                           0
question                           0
correct_answer                     0
incorrect_answer                   0
correct_answer_label               0
incorrect_answer_label             0
source                             0
start_time_epoch_s                 0
end_time_epoch_s                   0
model                              0
input_tokens                       0
output_tokens                      0
input_price_per_million_tokens     0
output_price_per_million_tokens    0
system_prompt                      0
user_prompt                        0
response                           0
language                           0
dtype: int64

### Cleaning 'response' column

To preserve the authenticity of each LLM's output, we aim to minimize modifications to the **`response`** column. The only cleaning applied here is replacing `NaN` values with `-1`, which serves as an indicator that the LLM gave **no response** or returned an **empty string**.


In [None]:
df['response'].unique()

In [14]:
df['response'] = df['response'].fillna(-1)

### Cleaning 'source' column

For the `source` column, we drop rows with `NaN` values because they account for only `60` of the `23,700` rows. Rows without a `source` also offer no verifiable reference for the justification of the correct answer, which makes them less reliable for analysis.


In [16]:
df = df.dropna(subset=['source'])

### Cleaning 'model' column

Aside from the columns with `NaN` values, we also decided to clean the `model` column. As observed from the unique values, the `gemini` model has an added prefix `"models/"`, which we will remove to maintain consistency across all entries.


In [78]:
df['model'].unique()

array(['deepseek-reasoner', 'models/gemini-2.5-pro-preview-05-06',
       'o4-mini-2025-04-16'], dtype=object)

In [18]:
df['model'] = df['model'].replace({'models/gemini-2.5-pro-preview-05-06': 'gemini-2.5-pro-preview-05-06'})
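
The replacement above targets one exact model name. As a minimal sketch of a more general alternative, the prefix can be stripped with an anchored regular expression, so that any other value starting with `"models/"` would be handled as well. The toy `Series` below only mirrors the unique values shown above; it is not the real dataset.

```python
import pandas as pd

# Toy data mirroring the unique values printed above (not the real dataset).
models = pd.Series([
    "deepseek-reasoner",
    "models/gemini-2.5-pro-preview-05-06",
    "o4-mini-2025-04-16",
])

# Remove a leading "models/" only; the ^ anchor leaves any other text untouched.
cleaned = models.str.replace(r"^models/", "", regex=True)
print(cleaned.tolist())
```

Whether this is preferable depends on whether more prefixed model names are expected; with a single known value, the explicit dictionary is just as clear.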

[Back to Top](#title)

---

## Data Preprocessing 🔧 <a id="data-preprocessing"></a>

### Feature Engineering

The first column we will add is `latency`. This represents the total time each LLM took to respond, that is, the duration of the API call for a given question. To calculate it, we subtract `start_time_epoch_s` from `end_time_epoch_s`. The result is in seconds and is rounded to 4 decimal places.


In [None]:
df['latency'] = (df['end_time_epoch_s'] - df['start_time_epoch_s']).round(4)
df['latency']

The second column we will add is `is_follow`, a boolean indicating whether the LLM strictly followed the system prompt, regardless of language. Since both prompts require the LLM to output only the letter of its answer, we check whether `response` is one of `"A"`, `"A."`, `"B"`, or `"B."`. Rows with a `response` of `-1` count as not following the system prompt, because the prompt expects an answer and none was provided.


In [20]:
df['is_follow'] = df['response'].isin(["A", "A.", "B", "B."])
df['is_follow'].value_counts()

is_follow
True     23575
False       65
Name: count, dtype: int64

In [None]:
# True when the response exactly matches the label of the correct answer
df['iscorrect'] = (df['response'] == df['correct_answer_label'])
df['iscorrect']

0        True
1        True
2        True
3        True
4        True
         ... 
23695    True
23696    True
23697    True
23698    True
23699    True
Name: iscorrect, Length: 23632, dtype: bool
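
One thing to watch with `iscorrect`: the comparison is an exact match, while `is_follow` also accepts `"A."` and `"B."`. If `correct_answer_label` holds bare letters such as `"A"` and `"B"` (an assumption, since the notebook does not show that column's values), a response of `"A."` would be marked incorrect. The sketch below uses a toy frame and a hypothetical `iscorrect_normalized` column to show how stripping a trailing period changes the result.

```python
import pandas as pd

# Toy rows; the labels are assumed to be bare letters (not confirmed by the notebook).
toy = pd.DataFrame({
    "response": ["A", "B.", "A.", -1],
    "correct_answer_label": ["A", "B", "B", "B"],
})

# Exact comparison, as in the cell above.
toy["iscorrect"] = toy["response"] == toy["correct_answer_label"]

# Normalized comparison: cast to str (the -1 sentinel becomes "-1"),
# trim whitespace, and drop a trailing period before comparing.
normalized = toy["response"].astype(str).str.strip().str.rstrip(".")
toy["iscorrect_normalized"] = normalized == toy["correct_answer_label"]

print(toy)
```

In this toy example only the `"B."` row differs between the two columns; the `-1` sentinel stays incorrect either way.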

In [86]:

# Cost per call, in the same currency unit as the per-million-token prices
df['total_input_price'] = (df['input_tokens'] / 1_000_000) * df['input_price_per_million_tokens']
df['total_output_price'] = (df['output_tokens'] / 1_000_000) * df['output_price_per_million_tokens']
df['total_price'] = df['total_input_price'] + df['total_output_price']
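
To make the price formula concrete, here is a small worked example for a single call. The helper `call_cost` and the token counts and prices are hypothetical values chosen for illustration; they are not taken from the dataset.

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_million, output_price_per_million):
    """Cost of one API call, given token counts and per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost

# Hypothetical call: 1,000 input tokens at 2.0 per million
# and 500 output tokens at 8.0 per million.
cost = call_cost(1_000, 500, 2.0, 8.0)
print(f"{cost:.6f}")  # 0.002 (input) + 0.004 (output)
```

The same arithmetic is what the vectorized column expressions compute for every row.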


[Back to Top](#title)

---

## Exploratory Data Analysis 📈 <a id="exploratory-data-analysis"></a>

[Back to Top](#title)

---