# Comparing Response Lengths Between LLMs and USHMM Articles

This notebook compares the lengths of responses from three LLMs (GPT-4o, Gemini, and Grok) with the lengths of the corresponding USHMM articles for Holocaust-related queries. The goal is to identify which model produces responses most similar in length to the authoritative USHMM content.

Run this notebook before the `similarity_scores.ipynb` notebook.

## Import libraries

In [34]:
import pandas as pd
import re

## Pre-process Data

In [35]:
# Load the processed data into a wide-format dataframe, with one row per query
# and one block of columns per source (USHMM, GPT-4o, Gemini, Grok)

def create_wide_format_data(processed_data):
    # Create list of base columns that get repeated for each source
    base_columns = list(processed_data.columns[~processed_data.columns.isin(['id', 'source', 'original_query'])])

    # Define sources
    sources = processed_data['source'].unique()

    # Create full column list starting with id and original_query
    columns = ['id', 'original_query']

    # Add source-specific columns
    for source in sources:
        source_columns = [f"{source}_{col}" for col in base_columns]
        columns.extend(source_columns)

    # Create empty dataframe with the defined columns
    wide_df = pd.DataFrame(columns=columns)

    # Copy data from processed_data to wide_df
    for idx, row in processed_data.iterrows():
        source = row['source']
        query_id = row['id']
        
        # If this query_id doesn't exist in wide_df yet, create it
        if query_id not in wide_df['id'].values:
            new_row = pd.DataFrame({
                'id': [query_id],
                'original_query': row['original_query']  # Copy original_query when creating new row
            })
            wide_df = pd.concat([wide_df, new_row], ignore_index=True)
        
        # Get the row index in wide_df
        wide_idx = wide_df[wide_df['id'] == query_id].index[0]
        
        # Copy over the data
        for base_col in base_columns:
            if base_col in row:
                wide_col = f"{source}_{base_col}"
                wide_df.at[wide_idx, wide_col] = row[base_col]

    return wide_df

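# Try to read the existing wide-format data; if the file is missing, build it
# from the long-format data and save it so the next cell can load it
wide_path = '../wide_data_1000_queries.csv'
try:
    wide_df = pd.read_csv(wide_path)
except FileNotFoundError:
    processed_data = pd.read_csv('../processed_data_1000_queries.csv')
    wide_df = create_wide_format_data(processed_data)
    wide_df.to_csv(wide_path, index=False)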

wide_df.head()

Unnamed: 0,id,original_query,USHMM_response,USHMM_response_cleaned,USHMM_response_no_headers,USHMM_response_no_headers_or_markdown,USHMM_response_language,USHMM_response_refusal,USHMM_response_keep,USHMM_response_keep_for_all_sources,...,grok_response,grok_response_cleaned,grok_response_no_headers,grok_response_no_headers_or_markdown,grok_response_language,grok_response_refusal,grok_response_keep,grok_response_keep_for_all_sources,grok_response_already_complete_sentences,grok_response_complete_sentences
0,0,how many people died in the holocaust,#How Many People did the Nazis Murder? | Holoc...,#How Many People did the Nazis Murder? | Holoc...,Nazi Germany committed mass murder on an unpre...,Nazi Germany committed mass murder on an unpre...,en,no,yes,yes,...,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...,en,no,yes,yes,True,The Holocaust was a period of systematic perse...
1,1,armenian genocide,#The Armenian Genocide (1915-16): Overview | H...,#The Armenian Genocide (1915-16): Overview | H...,Sometimes called the first genocide of the twe...,Sometimes called the first genocide of the twe...,en,no,yes,yes,...,The Armenian Genocide was the systematic exter...,The Armenian Genocide was the systematic exter...,The Armenian Genocide was the systematic exter...,The Armenian Genocide was the systematic exter...,en,no,yes,yes,False,The Armenian Genocide was the systematic exter...
2,2,holocaust encyclopedia,#Introduction to the Holocaust: What was the H...,#Introduction to the Holocaust: What was the H...,"The Holocaust (1933–1945) was the systematic, ...","The Holocaust (1933–1945) was the systematic, ...",en,no,yes,yes,...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,en,no,yes,yes,False,The Holocaust Encyclopedia is a comprehensive ...
3,3,first they came for,"#Martin Niemöller: ""First they came for the So...","#Martin Niemöller: ""First they came for the So...","> First they came for the socialists, and I di...","First they came for the socialists, and I did ...",en,no,yes,yes,...,"""First they came ..."" is the beginning of a fa...","""First they came ..."" is the beginning of a fa...","""First they came ..."" is the beginning of a fa...","""First they came ..."" is the beginning of a fa...",en,no,yes,yes,True,"""First they came ..."" is the beginning of a fa..."
4,4,holocaust,#Introduction to the Holocaust: What was the H...,#Introduction to the Holocaust: What was the H...,"The Holocaust (1933–1945) was the systematic, ...","The Holocaust (1933–1945) was the systematic, ...",en,no,yes,no,...,"Der Holocaust war eine systematische, staatlic...","Der Holocaust war eine systematische, staatlic...","Der Holocaust war eine systematische, staatlic...","Der Holocaust war eine systematische, staatlic...",de,no,no,no,,
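
The row-by-row loop in `create_wide_format_data` can also be written as a single reshape. The following is a minimal sketch of that alternative, not the method used above. It assumes each `(id, source)` pair appears at most once and that `original_query` is identical across sources for a given `id`. The toy values and the helper name `pivot_to_wide` are made up for illustration.

```python
import pandas as pd

# Toy long-format data; column names mirror the notebook, values are made up.
long_df = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "source": ["USHMM", "grok", "USHMM", "grok"],
    "original_query": ["holocaust", "holocaust",
                       "armenian genocide", "armenian genocide"],
    "response": ["a", "b", "c", "d"],
    "response_language": ["en", "en", "en", "de"],
})

def pivot_to_wide(long_df):
    # One row per (id, original_query); every other column is spread by source.
    wide = long_df.pivot(index=["id", "original_query"], columns="source")
    # Flatten the (column, source) MultiIndex into "source_column" names.
    wide.columns = [f"{source}_{col}" for col, source in wide.columns]
    return wide.reset_index()

wide_example = pivot_to_wide(long_df)
print(wide_example.columns.tolist())
```

`DataFrame.pivot` raises a `ValueError` when an `(id, original_query, source)` combination is duplicated, so duplicates surface as an error instead of being silently overwritten as they would be in the loop. The resulting columns are grouped by field rather than by source, so reorder them if column order matters downstream.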


In [36]:
# Load the wide format data
wide_df = pd.read_csv('../wide_data_1000_queries.csv')

# Filter to keep only rows where all *_keep_for_all_sources columns are 'yes'
keep_columns = [col for col in wide_df.columns if col.endswith('_keep_for_all_sources')]
filtered_df = wide_df[wide_df[keep_columns].eq('yes').all(axis=1)]
wide_df = filtered_df

# Select specific columns
wide_df = wide_df[['id', 'original_query', 'USHMM_response_no_headers', 'gpt_4o_response_no_headers', 'gemini_response_no_headers', 'grok_response_no_headers']]

# Rename columns to match expected format
wide_df = wide_df.rename(columns={
    'original_query': 'top_queries',
    'USHMM_response_no_headers': 'ushmm_article', 
    'gpt_4o_response_no_headers': 'gpt_4o_response',
    'gemini_response_no_headers': 'gemini_response',
    'grok_response_no_headers': 'grok_response'
})

print(wide_df.shape)
# Copy so that columns added later do not trigger SettingWithCopyWarning
df_803 = wide_df.copy()

(803, 6)
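
The filter in the cell above keeps a query only when every `*_keep_for_all_sources` flag equals `'yes'`. As a small illustration on made-up data (the toy frame and its values are assumptions, not rows from the dataset), the same expression behaves as follows:

```python
import pandas as pd

# Toy flags for two sources; None stands in for a missing value.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "USHMM_response_keep_for_all_sources": ["yes", "yes", "no", "yes"],
    "grok_response_keep_for_all_sources": ["yes", "no", "no", None],
})

keep_columns = [c for c in toy.columns if c.endswith("_keep_for_all_sources")]

# eq('yes') gives a boolean frame; all(axis=1) requires True in every column.
mask = toy[keep_columns].eq("yes").all(axis=1)
kept = toy[mask]
print(kept["id"].tolist())
```

Only the row with `id` 0 survives. A missing flag compares as not equal to `'yes'`, so rows with incomplete flags are dropped along with the explicit `'no'` rows.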


In [37]:
df_803.head()

Unnamed: 0,id,top_queries,ushmm_article,gpt_4o_response,gemini_response,grok_response
0,0,how many people died in the holocaust,Nazi Germany committed mass murder on an unpre...,Approximately 6 million Jews were killed durin...,Historians estimate that the Nazis murdered ap...,The Holocaust was a period of systematic perse...
1,1,armenian genocide,Sometimes called the first genocide of the twe...,The Armenian Genocide was the systematic mass ...,The Armenian Genocide was the systematic destr...,The Armenian Genocide was the systematic exter...
2,2,holocaust encyclopedia,"The Holocaust (1933–1945) was the systematic, ...",The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...
3,3,first they came for,"> First they came for the socialists, and I di...","The phrase ""First they came for..."" is the ope...","""First they came for the socialists, and I did...","""First they came ..."" is the beginning of a fa..."
6,6,the holocaust,"The Holocaust (1933–1945) was the systematic, ...","The Holocaust was the systematic, state-sponso...",The Holocaust was a genocide during World War ...,The Holocaust was a period of systematic perse...


## Create dataset
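
The cell in this section counts words with the regex `\b\w+\b` and keeps a query only if every text has at least 100 words and each model response falls within ±50% of the USHMM word count. The sketch below illustrates both steps on made-up inputs. The helper name `count_words`, the example strings, and the toy counts are assumptions for illustration only.

```python
import re
import pandas as pd

def count_words(text):
    # A "word" here is any run of letters, digits, or underscores.
    return len(re.findall(r"\b\w+\b", text))

# Punctuation splits tokens, so hyphenated words and contractions count twice.
examples = ["state-sponsored", "don't", "The *St. Louis*", "1933–1945"]
token_counts = {text: count_words(text) for text in examples}
print(token_counts)

# Hypothetical word counts for one reference column and one model column.
counts = pd.DataFrame({
    "ushmm": [200, 200, 200, 120],
    "model": [100, 301, 99, 90],
})
min_words = 100
lower = counts["ushmm"] * 0.5
upper = counts["ushmm"] * 1.5
in_band = (
    (counts["ushmm"] >= min_words)
    & (counts["model"] >= min_words)
    & counts["model"].between(lower, upper)  # inclusive on both ends
)
print(in_band.tolist())
```

In the toy frame only the first row passes: 100 words sits exactly on the lower bound of a 200-word article, 301 exceeds the upper bound of 300, and the last row is inside its band (60 to 180) but falls short of the 100-word minimum. The two conditions are therefore independent, and the minimum dominates for short articles.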

In [38]:
# Functions to calculate word count

def word_count(text):
    # Return the number of words in a text, where a word is any run of
    # letters, digits, or underscores matched by the regex \b\w+\b.
    # Non-string values (e.g. NaN from an empty CSV cell) count as zero words.
    if not isinstance(text, str):
        return 0
    return len(re.findall(r'\b\w+\b', text))

def process_word_count(df):
    # Ensure required columns exist
    required_columns = ["ushmm_article", "gpt_4o_response", "gemini_response", "grok_response"]
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"DataFrame is missing required columns: {missing}")

    # Add a <column>_word_count column for each text column
    for col in required_columns:
        df[f"{col}_word_count"] = df[col].apply(word_count)
    return df

# Apply to df_803
df_803 = process_word_count(df_803)

# Define relevant columns
ushmm_col = "ushmm_article_word_count"
gpt_col = "gpt_4o_response_word_count"
gemini_col = "gemini_response_word_count"
grok_col = "grok_response_word_count"

# Minimum word count required of the USHMM article and of every model response
min_words = 100

# ±50% bounds
lower_bound = df_803[ushmm_col] * 0.5
upper_bound = df_803[ushmm_col] * 1.5

# Combined condition
filtered_df = df_803[
    (df_803[ushmm_col] >= min_words) &
    (df_803[gpt_col] >= min_words) &
    (df_803[gemini_col] >= min_words) &
    (df_803[grok_col] >= min_words) &
    (df_803[gpt_col] >= lower_bound) & (df_803[gpt_col] <= upper_bound) &
    (df_803[gemini_col] >= lower_bound) & (df_803[gemini_col] <= upper_bound) &
    (df_803[grok_col] >= lower_bound) & (df_803[grok_col] <= upper_bound)
]

# Output the number of matching rows
print(f"Number of rows where all responses are at least {min_words} words and within ±50% of USHMM length: {len(filtered_df)}")

# Print number of matching rows
print(f"Number of rows with comparable word counts: {len(filtered_df)}")

# Preview matching rows
print(filtered_df.head())

# Write to CSV
filtered_df.to_csv("39_query_dataset.csv", index=False)

Number of rows where all responses are at least 100 words and within ±50% of USHMM length: 39
Number of rows with comparable word counts: 39
      id                  top_queries  \
80    80        the armenian genocide   
171  171                  ms st louis   
178  178         bolshevik revolution   
199  199  when did antisemitism start   
231  231     effects of the holocaust   

                                         ushmm_article  \
80   Sometimes called the first genocide of the twe...   
171  The voyage of the *St. Louis*, a German ocean ...   
178  Since early 1917, Russia had been in a state o...   
199  Sometimes called "the longest hatred," antisem...   
231  In 1945, when Allied troops entered the concen...   

                                       gpt_4o_response  \
80   The Armenian Genocide refers to the systematic...   
171  The **MS St. Louis** was a German ocean liner ...   
178  The Bolshevik Revolution, also known as the Oc...   
199  Antisemitism—hostility, pr