In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_dark"


# TITLE

---

## Table of Contents 📑

- [Research Question](#research-question)
- [Dataset](#dataset)
- [Data Cleaning](#data-cleaning)
- [Data Preprocessing](#data-preprocessing)
- [Exploratory Data Analysis](#exploratory-data-analysis)

---

## Research Question ❓ <a id="research-question"></a>

[Back to Top](#title)

---

## Dataset 📊 <a id="dataset"></a>

In [None]:
df = pd.read_csv("truthfulqa_responses.csv", dtype={'start_time_epoch_s': float, 'end_time_epoch_s': float})


In [None]:
df

### Description

This dataset looks at how free-tier large language models respond to prompts that test their ability to avoid repeating false but common human beliefs. It includes answers from three models: o4-mini from OpenAI, DeepSeek-R1 from DeepSeek, and Gemini 2.5 Pro from Google. The prompts come from the TruthfulQA benchmark, which focuses on whether a model can give factually correct answers instead of ones that just sound right. Each response is saved with details like the question asked, the model that answered, and a label showing if the response was true, false, or uncertain. The setup makes it easier to compare models and see patterns in how they deal with truthfulness, especially when it comes to misleading but familiar ideas.

### Data Collection

The same set of TruthfulQA prompts was sent to each model using their official APIs. Responses were collected in a consistent and automated way, with each one saved along with the prompt, the model that answered, and a label showing if the response was true or not. The process followed platform rules and made sure the data was collected properly and handled with care. However, since the models were accessed through different APIs and may have slight differences in settings or response formatting, these factors could affect how the outputs are interpreted. The way truthfulness is labeled may also involve some level of subjectivity, especially for prompts that are vague or open-ended. These aspects should be considered when analyzing the results and drawing conclusions from the data.

### Structure

The dataset is structured as a table, with each row representing one model's response to a TruthfulQA question. There are 23,700 observations and 21 columns in total. Each observation includes metadata about the prompt, the model’s response, and additional details relevant to performance analysis and cost tracking.

The key attributes in each observation are as follows:

* **type** – identifies whether the prompt is a truthful or misleading question.

* **category** – the topic of the question, such as health, science, or history.

* **question** – the full question text from the TruthfulQA benchmark.

* **correct_answer** – the factually accurate answer to the question.

* **incorrect_answer** – a commonly believed but false response to the question.

* **correct_answer_label** – a tag or label marking the correct answer.

* **incorrect_answer_label** – a tag or label marking the incorrect answer.

* **source** – the original source of the question or prompt.

* **start_time_epoch_s** and **end_time_epoch_s** – timestamps (in epoch seconds) marking when the model request started and ended.

* **model** – the name of the language model that generated the response (e.g., o4-mini, DeepSeek-R1, Gemini 2.5 Pro).

* **input_tokens** and **output_tokens** – the number of tokens used in the input and generated in the output.

* **input_price_per_million_tokens** and **output_price_per_million_tokens** – estimated cost per million tokens for input and output, based on model pricing.

* **system_prompt** – the system-level instruction provided to the model.

* **user_prompt** – the prompt sent to the model, typically the question text.

* **response** – the actual answer generated by the model.

* **language** – the language in which the model responded.


Each observation includes both input and output data that can be analyzed to compare model behavior.

[Back to Top](#title)

---

## Data Cleaning 🧹<a id="data-cleaning"></a>

The summary below shows that the dataset initially has `23700` rows. Comparing each column's non-null count against that total reveals which columns contain `null` values, and these are our first targets for cleaning. In this case, those columns are `response` and `source`.


In [None]:
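df.info()           # prints column dtypes and non-null counts
display(df.head())  # display() is needed because only a cell's last expression renders automatically
df.describe()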

In [None]:
df.isna().sum()

### Cleaning 'response' column

To preserve the authenticity of each LLM's output, we aim to minimize modifications to the **`response`** column. The only cleaning applied here is replacing `NaN` values with `-1`, which serves as an indicator that the LLM gave **no response** or returned an **empty string**.


In [None]:
df['response'].unique()

In [None]:
df['response'] = df['response'].fillna(-1)

### Cleaning 'source' column

For the `source` column, we chose to drop rows with `NaN` values since they make up only `60` out of `23,700` total rows. Additionally, rows without a `source` provide no verifiable reference for where the correct answer justification came from, making them less reliable for analysis.


In [None]:
df.dropna(subset=['source'], inplace=True)
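
As a quick check on the claim that the dropped rows are a small share of the data, the arithmetic can be sketched with the two counts quoted above (`23,700` total rows and `60` rows with a missing `source`). The variable names are illustrative only.

```python
# Counts quoted in the text above
total_rows = 23_700
missing_source_rows = 60

# Share of the dataset removed by dropping rows without a source
dropped_share = missing_source_rows / total_rows
remaining_rows = total_rows - missing_source_rows

print(f"Dropped share: {dropped_share:.2%}")
print(f"Rows remaining: {remaining_rows}")
```

The dropped share is roughly a quarter of one percent, which supports dropping these rows rather than imputing a source.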

### Cleaning 'model' column

Aside from the columns with `NaN` values, we also decided to clean the `model` column. As observed from the unique values, the `gemini` model has an added prefix `"models/"`, which we will remove to maintain consistency across all entries.


In [None]:
df['model'].unique()

In [None]:
# Strip the "models/" prefix from any entry that has it (Series.str.removeprefix requires pandas >= 1.4)
df['model'] = df['model'].str.removeprefix('models/')

[Back to Top](#title)

---

## Data Preprocessing 🔧 <a id="data-preprocessing"></a>

### Feature Engineering

The first column we will add is `latency`: the total time each LLM took to respond, that is, the duration of the API call for a given question. To calculate this, we subtract `start_time_epoch_s` from `end_time_epoch_s`. The resulting value is in seconds and is rounded to 4 decimal places.


In [None]:
df['latency'] = (df['end_time_epoch_s'] - df['start_time_epoch_s']).round(4)
df['latency']
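
The latency calculation can be illustrated on a tiny hand-made frame. This is only a sketch: the `toy` frame and its timestamps are made up for illustration, and the negative-latency check is an optional sanity test that the original analysis does not perform.

```python
import pandas as pd

# Hypothetical timestamps in epoch seconds (not taken from the dataset)
toy = pd.DataFrame({
    "start_time_epoch_s": [1_700_000_000.0, 1_700_000_010.5],
    "end_time_epoch_s": [1_700_000_002.25, 1_700_000_011.0],
})

# Same formula as above: end minus start, rounded to 4 decimal places
toy["latency"] = (toy["end_time_epoch_s"] - toy["start_time_epoch_s"]).round(4)

# Optional: convert epoch seconds to readable timestamps for inspection
toy["start_time"] = pd.to_datetime(toy["start_time_epoch_s"], unit="s")

# A negative latency would indicate swapped or corrupted timestamps
has_negative_latency = bool((toy["latency"] < 0).any())
print(toy[["start_time", "latency"]])
```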

The second column we will add is `is_follow`. This will be a boolean value representing whether the LLM strictly followed the system prompt, regardless of the language. Since both prompts require the LLM to output only the letter of their answer, we determine this by checking if the `response` is one of the following values: `"A"` or `"B"`. Rows with a `response` value of `-1` will be considered as not following the system prompt, as the prompt expects an answer but none was provided.


In [None]:
df['is_follow'] = df['response'].isin(["A", "B"])
df['is_follow'].value_counts()

The third column we will add is `is_correct`. This will be a boolean value representing whether the LLM provided the correct answer. While a straightforward way to determine this is to compare the `response` column with the `correct_answer_label` column, some rows do not follow the system prompt's requirement to output only the answer letter. These irregular responses could be unpredictable, so we will handle these cases first.

To address this, we will investigate and use the `is_follow` column to identify which rows did not strictly follow the prompt. Then, we will examine the values in their `response` column to determine how to handle the irregular responses.



In [None]:
df['is_correct'] = pd.NA

In [None]:
is_follow_false = df[~df['is_follow']]
is_follow_false['response'].unique()

After observing the values, we noticed that the majority follow a similar format: `"Letter of Choice: Choice"`. However, there are three exceptions: `-1`, `"Sagot: A"`, and `"Pasensya na, hindi ko masagot iyan."`. We will first address the latter two cases.


In [None]:
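display(df[df['response'] == 'Sagot: A'])  # display() so the rows render even though this is not the last line

# "Sagot: A" ("Answer: A") selects choice A, so it is correct only where the correct label is "A"
sagot_a = df['response'] == 'Sagot: A'
df.loc[sagot_a, 'is_correct'] = df.loc[sagot_a, 'correct_answer_label'] == 'A'

# "Pasensya na, hindi ko masagot iyan." ("Sorry, I can't answer that.") is a refusal, so it is marked incorrect
df.loc[df['response'] == 'Pasensya na, hindi ko masagot iyan.', 'is_correct'] = False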

Now that we have addressed the two special cases, we can set the values for the remaining rows by comparing the **first character** of each `response` to the `correct_answer_label`. This also covers rows where `response` is `-1`: because `-1` is not a string, `.str[0]` yields `NaN` for those rows, and `NaN` never equals a valid label, so they are treated as incorrect.


In [None]:
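mask = df['is_correct'].isna()

df.loc[mask, 'is_correct'] = (
    df.loc[mask, 'response'].str[0] == df.loc[mask, 'correct_answer_label']
)

# Every row now has a value, so cast from object dtype to bool for the aggregations that follow
df['is_correct'] = df['is_correct'].astype(bool)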
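
The special-casing above could also be expressed as a single parsing rule. The following is a minimal sketch of one possible alternative, not the method used in this notebook: the helper `extract_choice` and its two regular expressions are hypothetical, and they assume the answer letter appears either at the very start of the response (as in `"Letter of Choice: Choice"`) or at the very end (as in `"Sagot: A"`).

```python
import re

# Hypothetical patterns: a standalone A/B at the start, or at the end, of the response
_LEADING = re.compile(r"^\s*([AB])\b")
_TRAILING = re.compile(r"\b([AB])\W*$")


def extract_choice(response):
    """Return 'A' or 'B' if a standalone answer letter is found, else None."""
    if not isinstance(response, str):
        return None  # covers the -1 sentinel used for missing responses
    match = _LEADING.search(response) or _TRAILING.search(response)
    return match.group(1) if match else None


examples = ["A", "B: Some choice text", "Sagot: A",
            "Pasensya na, hindi ko masagot iyan.", -1]
for example in examples:
    print(repr(example), "->", extract_choice(example))
```

A rule like this would still misread a free-text response that happens to begin with the article "A", so inspecting the irregular responses by hand, as done above, remains a useful check.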


The next few columns we will be adding are:

- `total_input_price`: the total amount spent for input tokens for that row (in dollars)
- `total_output_price`: the total amount spent for output tokens for that row (in dollars)
- `total_price`: the total amount spent for all tokens for that row (in dollars)

To compute these values, we will use the following columns:
- `input_tokens`
- `output_tokens`
- `input_price_per_million_tokens`
- `output_price_per_million_tokens`

Each token cost is priced per million tokens, so we will divide the token counts by 1,000,000 and multiply by their respective price rates.


In [None]:

df['total_input_price'] = (df['input_tokens'] / 1_000_000) * df['input_price_per_million_tokens']

df['total_output_price'] = (df['output_tokens'] / 1_000_000) * df['output_price_per_million_tokens']

df['total_price'] = df['total_input_price'] + df['total_output_price']
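
To make the per-million pricing concrete, here is a small worked example of the same formula on a single row. The token counts and prices are made-up illustrative numbers, not values from the dataset or from any provider's price list, and `token_cost_usd` is a hypothetical helper.

```python
def token_cost_usd(tokens, price_per_million_tokens):
    """Cost in dollars for a token count, given a price quoted per million tokens."""
    return tokens / 1_000_000 * price_per_million_tokens


# Hypothetical row: 250 input tokens at $1.10/M and 1,200 output tokens at $4.40/M
input_cost = token_cost_usd(250, 1.10)
output_cost = token_cost_usd(1_200, 4.40)
total_cost = input_cost + output_cost

print(f"input=${input_cost:.6f} output=${output_cost:.6f} total=${total_cost:.6f}")
```

Because a single request uses far fewer than a million tokens, per-row costs are fractions of a cent, which is why the cost chart later sums `total_price` over all rows.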


Although we have the `input_tokens` and `output_tokens` columns, each LLM tokenizes text differently, so raw token counts are not directly comparable across models. We therefore standardize these columns within each model using z-score standardization. After this step, every model's z-scores have a mean of 0 and a standard deviation of 1, so each value expresses how far a row's token usage sits from that model's own average.


In [None]:
df[['input_tokens_z', 'output_tokens_z']] = df.groupby('model')[['input_tokens', 'output_tokens']].transform(
    lambda x: (x - x.mean()) / x.std()
)
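
The effect of per-model standardization is easiest to see on a toy frame. In this sketch the model names `m1` and `m2` and their token counts are invented for illustration. Note that `Series.std()` in pandas uses the sample standard deviation (`ddof=1`) by default, which is what the transform above uses as well.

```python
import pandas as pd

# Two hypothetical models with very different raw token scales
toy = pd.DataFrame({
    "model": ["m1", "m1", "m1", "m2", "m2", "m2"],
    "output_tokens": [100, 200, 300, 1000, 1500, 2000],
})

# Same transform as above: standardize within each model
toy["output_tokens_z"] = toy.groupby("model")["output_tokens"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Within each model the z-scores have mean 0 and standard deviation 1
per_model = toy.groupby("model")["output_tokens_z"].agg(["mean", "std"])
print(toy)
print(per_model)
```

Although `m2` uses about ten times as many tokens as `m1`, both models end up with identical z-scores here. One consequence is that a model's overall mean z-score is 0 by construction, so differences only appear when the z-scores are averaged over subsets such as language.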

Let us take a quick look at our dataset after adding all these columns.

In [None]:
df.head()

[Back to Top](#title)

---

## Exploratory Data Analysis 📈 <a id="exploratory-data-analysis"></a>

### Which factors correlate with the accuracy of large language models when answering prompts designed to mimic human misconceptions?

#### What is the accuracy of current free-tier reasoning large language models on adversarial and non-adversarial questions?

In [None]:

# The mean of the boolean is_correct column per question type is the accuracy
type_accuracy = df.groupby('type')['is_correct'].mean().reset_index()
type_accuracy['accuracy_percent'] = type_accuracy['is_correct'] * 100

fig = px.bar(
    type_accuracy,
    x='type',
    y='accuracy_percent',
    title='Accuracy of Free-Tier LLMs on Adversarial vs Non-Adversarial Questions',
    text='accuracy_percent',
    labels={'type': 'Question Type', 'accuracy_percent': 'Accuracy (%)'},
)

fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig.update_layout(yaxis_range=[0, 100])
fig.show()


#### What is the accuracy of current free-tier reasoning large language models on different question categories?

In [None]:

# Accuracy per question category
category_accuracy = df.groupby('category')['is_correct'].mean().reset_index()
category_accuracy['accuracy_percent'] = category_accuracy['is_correct'] * 100

fig = px.bar(
    category_accuracy,
    x='category',
    y='accuracy_percent',
    title='Accuracy of Free-Tier LLMs on Different Question Categories',
    text='accuracy_percent',
    labels={'category': 'Question Category', 'accuracy_percent': 'Accuracy (%)'},
)

fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig.update_layout(yaxis_range=[0, 100])
fig.show()


#### What is the accuracy of current free-tier reasoning large language models on English and Filipino?

In [None]:

# Normalize capitalization first so that variants such as "english" and "English" form one group
df['language'] = df['language'].str.capitalize()

# Accuracy per prompt language
language_accuracy = df.groupby('language')['is_correct'].mean().reset_index()
language_accuracy['accuracy_percent'] = language_accuracy['is_correct'] * 100

fig = px.bar(
    language_accuracy,
    x='language',
    y='accuracy_percent',
    title='Accuracy of Free-Tier LLMs on English vs Filipino Questions',
    text='accuracy_percent',
    labels={'language': 'Question Language', 'accuracy_percent': 'Accuracy (%)'},
)

fig.update_traces(texttemplate='%{text:.2f}%', textposition='outside')
fig.update_layout(yaxis_range=[0, 100])
fig.show()


### How do free-tier large language models compare, on different languages, in terms of performance on truthfulness benchmarks?

#### Which model has the highest accuracy on TruthfulQA?

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Normalize language capitalization (important!)
df['language'] = df['language'].str.capitalize()

# Group by language and model
grouped_lang = df.groupby(['language', 'model'])['is_correct'].mean().reset_index()
grouped_lang['accuracy_percent'] = grouped_lang['is_correct'] * 100

# Group by model only (combined)
grouped_combined = df.groupby('model')['is_correct'].mean().reset_index()
grouped_combined['accuracy_percent'] = grouped_combined['is_correct'] * 100
grouped_combined['language'] = 'English and Filipino'  # Add synthetic 'language'

# Combine into one DataFrame
grouped_all = pd.concat([grouped_lang, grouped_combined], ignore_index=True)

# Desired subplot order
languages = ['English', 'Filipino', 'English and Filipino']

# Create subplots
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=languages,
    shared_yaxes=True
)

# Add bar chart for each language section
for i, lang in enumerate(languages):
    lang_data = grouped_all[grouped_all['language'] == lang]
    fig.add_trace(
        go.Bar(
            x=lang_data['model'],
            y=lang_data['accuracy_percent'],
            text=lang_data['accuracy_percent'].apply(lambda x: f'{x:.2f}%'),
            textposition='outside',
        ),
        row=1, col=i+1
    )

# Layout config
fig.update_layout(
    title_text="Accuracy of Free-Tier LLMs on English, Filipino, and Combined Questions",
    showlegend=False,
    template="plotly_dark"
)

fig.update_xaxes(tickangle=45)

# Keep Y-axes consistent
for i in range(3):
    fig.update_yaxes(range=[0, 100], row=1, col=i+1)

fig.show()
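
The cells in this section repeat the same aggregation: group by language and model, group by model alone, then stack the two results. As a sketch of how that step could be factored out, the hypothetical helper `summarize_by_language` below builds the combined table for any metric column. The toy frame and the model names `m1` and `m2` are invented for illustration.

```python
import pandas as pd


def summarize_by_language(data, value_col, agg="mean",
                          combined_label="English and Filipino"):
    """Aggregate value_col per (language, model), plus one combined row per model."""
    per_language = data.groupby(["language", "model"])[value_col].agg(agg).reset_index()
    combined = data.groupby("model")[value_col].agg(agg).reset_index()
    combined["language"] = combined_label  # synthetic language for the combined panel
    return pd.concat([per_language, combined], ignore_index=True)


# Hypothetical data: two models answering one question in each language
toy = pd.DataFrame({
    "language": ["English", "English", "Filipino", "Filipino"],
    "model": ["m1", "m2", "m1", "m2"],
    "is_correct": [True, False, True, True],
})

summary = summarize_by_language(toy, "is_correct")
print(summary)
```

The same helper would cover the later charts by passing `"latency"`, `"is_follow"`, or `"output_tokens_z"` as `value_col`, and `agg="sum"` for `"total_price"`.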


#### Which model has the lowest latency on TruthfulQA?

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Normalize language capitalization
df['language'] = df['language'].str.capitalize()

# Group by language and model - mean latency
grouped_lang = df.groupby(['language', 'model'])['latency'].mean().reset_index()

# Group by model only (combined latency)
grouped_combined = df.groupby('model')['latency'].mean().reset_index()
grouped_combined['language'] = 'English and Filipino'  # synthetic 'language'

# Combine into one DataFrame
grouped_all = pd.concat([grouped_lang, grouped_combined], ignore_index=True)

# Desired subplot order
languages = ['English', 'Filipino', 'English and Filipino']

# Create subplots
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=languages,
    shared_yaxes=True
)

# Add bar chart for each language section
for i, lang in enumerate(languages):
    lang_data = grouped_all[grouped_all['language'] == lang]
    fig.add_trace(
        go.Bar(
            x=lang_data['model'],
            y=lang_data['latency'],
            text=lang_data['latency'].apply(lambda x: f'{x:.2f}'),
            textposition='outside',
        ),
        row=1, col=i+1
    )

# Layout config
fig.update_layout(
    title_text="Average Latency of Free-Tier LLMs on English, Filipino, and Combined Questions",
    showlegend=False,
    template="plotly_dark",
    yaxis_title="Latency (s)"  # latency is the difference of two epoch-second timestamps, so it is in seconds
)

fig.update_xaxes(tickangle=45)

# Optional: adjust Y-axis range manually if desired
# for i in range(3):
#     fig.update_yaxes(range=[0, 5], row=1, col=i+1)  # or whatever max latency is

fig.show()

#### Which model has the lowest cost on TruthfulQA?

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Normalize language capitalization
df['language'] = df['language'].str.capitalize()

# Group by language and model — SUM of total_price
grouped_lang = df.groupby(['language', 'model'])['total_price'].sum().reset_index()

# Group by model only (combined cost)
grouped_combined = df.groupby('model')['total_price'].sum().reset_index()
grouped_combined['language'] = 'English and Filipino'  # synthetic category

# Combine all
grouped_all = pd.concat([grouped_lang, grouped_combined], ignore_index=True)

# Define subplot layout
languages = ['English', 'Filipino', 'English and Filipino']

fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=languages,
    shared_yaxes=True
)

# Add bar plots per language
for i, lang in enumerate(languages):
    lang_data = grouped_all[grouped_all['language'] == lang]
    fig.add_trace(
        go.Bar(
            x=lang_data['model'],
            y=lang_data['total_price'],
            text=lang_data['total_price'].apply(lambda x: f'${x:.4f}'),
            textposition='outside',
        ),
        row=1, col=i+1
    )

# Layout and labels
fig.update_layout(
    title_text="Total Cost of Free-Tier LLMs on TruthfulQA (Summed Price)",
    showlegend=False,
    template="plotly_dark",
    yaxis_title="Total Cost (USD)"
)

fig.update_xaxes(tickangle=45)

fig.show()

#### Which model follows instructions the best?

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Normalize language capitalization
df['language'] = df['language'].str.capitalize()

# Group by language and model — mean is_follow
grouped_lang = df.groupby(['language', 'model'])['is_follow'].mean().reset_index()
grouped_lang['is_follow_percent'] = grouped_lang['is_follow'] * 100

# Group by model only (combined)
grouped_combined = df.groupby('model')['is_follow'].mean().reset_index()
grouped_combined['is_follow_percent'] = grouped_combined['is_follow'] * 100
grouped_combined['language'] = 'English and Filipino'

# Combine everything
grouped_all = pd.concat([grouped_lang, grouped_combined], ignore_index=True)

# Define subplot categories
languages = ['English', 'Filipino', 'English and Filipino']

# Create subplots
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=languages,
    shared_yaxes=True
)

# Add bar plots per language
for i, lang in enumerate(languages):
    lang_data = grouped_all[grouped_all['language'] == lang]
    fig.add_trace(
        go.Bar(
            x=lang_data['model'],
            y=lang_data['is_follow_percent'],
            text=lang_data['is_follow_percent'].apply(lambda x: f'{x:.2f}%'),
            textposition='outside',
        ),
        row=1, col=i+1
    )

# Layout
fig.update_layout(
    title_text="Instruction Following of Free-Tier LLMs on English, Filipino, and Combined Questions",
    showlegend=False,
    template="plotly_dark",
    yaxis_title="Following Instructions (%)"
)

fig.update_xaxes(tickangle=45)

# Keep y-axis in [0, 100]
for i in range(3):
    fig.update_yaxes(range=[0, 100], row=1, col=i+1)

fig.show()


#### Which model has the most verbose reasoning?

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Normalize language capitalization
df['language'] = df['language'].str.capitalize()

# Group by language and model — mean output_tokens_z
grouped_lang = df.groupby(['language', 'model'])['output_tokens_z'].mean().reset_index()

# Group by model only (combined)
grouped_combined = df.groupby('model')['output_tokens_z'].mean().reset_index()
grouped_combined['language'] = 'English and Filipino'

# Combine
grouped_all = pd.concat([grouped_lang, grouped_combined], ignore_index=True)

# Define subplot categories
languages = ['English', 'Filipino', 'English and Filipino']

# Create subplots
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=languages,
    shared_yaxes=True
)

# Add bar charts
for i, lang in enumerate(languages):
    lang_data = grouped_all[grouped_all['language'] == lang]
    fig.add_trace(
        go.Bar(
            x=lang_data['model'],
            y=lang_data['output_tokens_z'],
            text=lang_data['output_tokens_z'].apply(lambda x: f'{x:.2f}'),
            textposition='outside',
        ),
        row=1, col=i+1
    )

# Layout
fig.update_layout(
    title_text="Mean Z-Score of Output Tokens by Free-Tier LLMs on English, Filipino, and Combined Questions",
    showlegend=False,
    template="plotly_dark",
    yaxis_title="Mean Output Token Z-Score"
)
fig.update_yaxes(zeroline=True, zerolinewidth=2, zerolinecolor='white')


fig.update_xaxes(tickangle=45)

fig.show()


[Back to Top](#title)

---