# LLM Data Processing Pipeline

This notebook processes the LLM sanitization dataset by cleaning, restructuring, and transforming responses from multiple sources (USHMM, GPT-4o, Gemini, and Grok) into a standardized format suitable for analysis.

## Overview

The pipeline transforms the original wide-format dataset into a long-format structured dataset where each row represents one response from one source to one query. This allows for systematic comparison across different LLM sources and enables various types of analysis on response quality, language detection, and content processing.
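
The wide-to-long reshaping described above can be illustrated without pandas. The sketch below is a minimal, dependency-free illustration: the sample rows, the `SOURCE_COLUMNS` mapping, and the `wide_to_long` helper are hypothetical names introduced here, not the notebook's own implementation (which appears in Section 4).

```python
# Hypothetical miniature of the wide-format input: one dict per query.
wide_rows = [
    {
        "original_query": "example query A",
        "ushmm_article": "USHMM text A",
        "chatgpt_4o_response": "GPT-4o text A",
        "gemini_response": None,  # a missing response produces no long-format row
        "grok_response": "Grok text A",
    },
    {
        "original_query": "example query B",
        "ushmm_article": "USHMM text B",
        "chatgpt_4o_response": "GPT-4o text B",
        "gemini_response": "Gemini text B",
        "grok_response": "Grok text B",
    },
]

# Wide-format column name -> source label used in the long format.
SOURCE_COLUMNS = {
    "ushmm_article": "USHMM",
    "chatgpt_4o_response": "gpt_4o",
    "gemini_response": "gemini",
    "grok_response": "grok",
}


def wide_to_long(rows):
    """Emit one record per (query, source) pair, skipping missing responses."""
    long_rows = []
    for query_id, row in enumerate(rows):
        for column, source in SOURCE_COLUMNS.items():
            response = row.get(column)
            if response is None:
                continue
            long_rows.append({
                "id": query_id,
                "original_query": row["original_query"],
                "source": source,
                "response": response,
            })
    return long_rows


long_rows = wide_to_long(wide_rows)
for record in long_rows:
    print(record["id"], record["source"], record["response"])
```

Every response to the same query shares one `id`, which is what later makes per-query comparisons across sources possible.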


# 1. Setup: Imports and OpenAI Configuration

Loads core libraries for text processing and initializes the OpenAI client for structured GPT-4o interactions.

In [2]:
# ===== IMPORTS =====
import os                          # Securely fetches environment variables (API key)
import re                          # Regex‑based text cleansing helpers
import glob                        # Batch file discovery for *.csv, *.json, etc.
from typing import Any             # Static typing for helper stubs

import pandas as pd                # DataFrame operations and I/O
from langdetect import detect      # Language identification for response text
from openai import OpenAI          # Official OpenAI Python SDK
from pydantic import BaseModel, Field  # Data‑validation & parsing

# ===== CONFIGURATION =====
# OpenAI API configuration: read the key from the environment instead of hard-coding it
openai_api_key = os.environ.get('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)


# 2. Data Loading and Initial Examination

Loads the original CSV dataset and inspects its structure, columns, and sample rows to understand the data format and content.

In [184]:
# ===== DATA LOADING =====
# Load the original dataset and examine structure
df = pd.read_csv('../original_data_1000_queries.csv')

print(f"Dataset shape: {df.shape}")
print("Columns in the dataset:")
print(df.columns.tolist())
print("\nFirst few rows:")
print(df.head())


Dataset shape: (1000, 5)
Columns in the dataset:
['original_query', 'ushmm_article', 'chatgpt_4o_response', 'gemini_response', 'grok_response']

First few rows:
                          original_query  \
0  how many people died in the holocaust   
1                      armenian genocide   
2                 holocaust encyclopedia   
3                    first they came for   
4                              holocaust   

                                       ushmm_article  \
0  #How Many People did the Nazis Murder? | Holoc...   
1  #The Armenian Genocide (1915-16): Overview | H...   
2  #Introduction to the Holocaust: What was the H...   
3  #Martin Niemöller: "First they came for the So...   
4  #Introduction to the Holocaust: What was the H...   

                                 chatgpt_4o_response  \
0  Approximately 6 million Jews were killed durin...   
1  The Armenian Genocide was the systematic mass ...   
2  The Holocaust Encyclopedia is a comprehensive ...   
3  The phrase

# 3. Text Cleaning Helper Functions

Defines text cleaning functions that strip unwanted elements from responses, especially USHMM encyclopedia entries, which carry extensive metadata and formatting. The functions are grouped below by the kind of element they remove.

## Footnote & Section Removal
- **`remove_double_numbered_footnotes(text)`** – Removes double-numbered footnotes and their headers.
- **`remove_critical_thinking_block(text)`** – Removes “Critical Thinking Questions” sections and their bullet points.
- **`remove_language_section(text)`** – Removes multilingual availability notices and associated bullets.
- **`remove_tags_section(text)`** – Removes “Tags” sections including headers and bullet items.
- **`remove_unique_tags_and_footers(text)`** – Removes USHMM footers, branding tags, and extra separators.

## General Markdown and Refusal Cleaning
- **`remove_headers(md)`** – Removes markdown headers, standalone bold lines, and Setext-style titles.
- **`check_for_refusal(response)`** – Returns `'yes'` if refusal language like “sorry” or “AI assistant” is found.
- **`remove_all_markdown_keep_table_content(md)`** – Strips markdown syntax while preserving table content.

## Comprehensive USHMM Article Cleaner
- **`clean_ushmm_article(article)`** – Applies all relevant filters and cleanup functions to USHMM articles.


### USHMM-Specific Cleaning:
USHMM articles contain extensive metadata that needs removal:
- Author attribution lines
- Museum attribution
- Citation/print/share buttons
- Image captions and credits
- Video player fallback text
- Last edited timestamps
- Source attributions

### Text Processing Strategy:
1. **Line-by-line filtering**: Removes specific unwanted line patterns
2. **Section removal**: Identifies and removes entire sections (footnotes, questions, etc.)
3. **Format normalization**: Standardizes whitespace and removes multiple consecutive newlines
4. **Content preservation**: Ensures core article text remains intact and readable
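
The first three steps of this strategy can be sketched on a toy article. This is a simplified illustration, not the notebook's cleaner: the marker list is a short hypothetical subset, and `drop_unwanted_lines`, `drop_section`, and `normalize_whitespace` are names introduced here for the example.

```python
import re

# Hypothetical subset of line markers; the notebook's real list is longer.
UNWANTED_LINE_MARKERS = ("Last Edited:", "* Cite", "* Print", "* Share")


def drop_unwanted_lines(text):
    """Step 1: drop any line that contains an unwanted marker."""
    return "\n".join(
        line for line in text.split("\n")
        if not any(marker in line for marker in UNWANTED_LINE_MARKERS)
    )


def drop_section(text, header):
    """Step 2: remove `header` plus the blank and bullet lines that follow it."""
    kept, skipping = [], False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped == header:
            skipping = True
            continue
        if skipping and (stripped == "" or stripped.startswith("*")):
            continue
        skipping = False
        kept.append(line)
    return "\n".join(kept)


def normalize_whitespace(text):
    """Step 3: collapse runs of 3+ newlines into one blank line."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()


sample = (
    "The article body starts here.\n"
    "\n"
    "* Cite\n"
    "* Share\n"
    "\n"
    "## Critical Thinking Questions\n"
    "\n"
    "* What happened?\n"
    "* Why?\n"
    "\n"
    "Last Edited: Jan 1, 2000\n"
    "\n"
    "The article body ends here."
)

cleaned = normalize_whitespace(
    drop_section(drop_unwanted_lines(sample), "## Critical Thinking Questions")
)
print(cleaned)
```

Only the two body sentences survive, separated by a single blank line, which is the content-preservation goal in step 4.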


In [45]:
# ===== HELPER FUNCTIONS =====

# --- Footnote & Section Removal ---

def remove_double_numbered_footnotes(text):
    """Remove footnotes sections with double numbering format"""
    # First remove the ## Footnotes header and following blank line
    text = re.sub(r'## Footnotes\n\n', '', text)
    # Then remove the double numbered footnotes and their content
    text = re.sub(
        r'\d+\.\s+\d+\.\s*\n(?:\s{2,}.*\n?)+',
        '',
        text
    )
    return text

def remove_critical_thinking_block(text):
    """Remove Critical Thinking Questions sections"""
    lines = text.splitlines()
    result = []
    inside_ctq_block = False
    
    for line in lines:
        if line.strip() == "## Critical Thinking Questions":
            inside_ctq_block = True
            continue  # skip the header line
        
        if inside_ctq_block:
            # Skip the blank line after the header
            if line.strip() == "":
                continue
            # Skip all bullet points that start with *
            elif line.strip().startswith("*"):
                continue
            # If we encounter another section header (##), we're out of CTQ block
            elif line.strip().startswith("##"):
                inside_ctq_block = False
                result.append(line)
            # If we encounter any other non-empty content that's not a bullet point,
            # we're likely out of the CTQ section
            elif line.strip() and not line.strip().startswith("*"):
                inside_ctq_block = False
                result.append(line)
            # Continue skipping if it's still part of CTQ block
            else:
                continue
        else:
            result.append(line)
    
    return "\n".join(result)

def remove_language_section(text):
    """Remove language availability sections"""
    lines = text.splitlines()
    cleaned_lines = []
    skip = False
    expecting_bullets = False

    for line in lines:
        stripped = line.strip()

        # Detect header line
        if "### This content is available in the following languages" in stripped:
            skip = True
            expecting_bullets = True
            continue

        if skip:
            # Handle blank lines after header
            if expecting_bullets and stripped == "":
                continue

            # Handle bullet lines
            if stripped.startswith("*") or stripped.startswith("+"):
                expecting_bullets = False  # We found bullets, no longer expecting them
                continue
            else:
                # Found a non-bullet line → stop skipping
                skip = False

        if not skip:
            cleaned_lines.append(line)

    return "\n".join(cleaned_lines)

def remove_tags_section(text):
    """Remove tags sections from articles"""
    lines = text.splitlines()
    cleaned_lines = []
    skip = False
    expecting_bullets = False

    for line in lines:
        stripped = line.strip()

        # Detect both header types
        if stripped == "#### Tags" or stripped == "Tags":
            skip = True
            expecting_bullets = True
            continue

        if skip:
            if expecting_bullets:
                if stripped == "":
                    continue  # skip the blank line right after the header
                elif stripped.startswith(("*", "+", "-")):
                    expecting_bullets = False
                    continue
                else:
                    skip = False  # Unexpected format, stop skipping
            else:
                if stripped.startswith(("*", "+", "-")):
                    continue  # keep skipping bullet lines
                else:
                    skip = False  # End of tag block

        if not skip:
            cleaned_lines.append(line)

    return "\n".join(cleaned_lines)


def remove_unique_tags_and_footers(text: str) -> str:
    """Remove unique tags, footers, single-dash lines, and collapse blank lines."""
    # 1) Remove sections from '--- ### Tags' up to the next '---'
    text = re.sub(
        r'---\s*### Tags(?:\n.+?)*?(?=\n---)', 
        '', 
        text, 
        flags=re.DOTALL
    )
    # 2) Drop the USHMM footer line
    text = re.sub(r'\* US Holocaust Memorial Museum', '', text)
    # 3) Normalize lines consisting only of '---' into a single newline
    text = re.sub(r'(?m)^[ \t]*---[ \t]*\n?', '\n', text)
    # 4) Remove only lines that consist of a single dash
    lines = text.splitlines()
    kept = [
        line
        for line in lines
        if line.strip() != '-'
    ]
    # 5) Rejoin and collapse 3+ newlines into 2
    cleaned = "\n".join(kept)
    cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)
    return cleaned.strip()



# --- Markdown and Refusal Cleaning ---

def remove_headers(md: str) -> str:
    """
    Remove all Markdown headers (lines starting with '#'), standalone bullet/hyphen lines,
    Setext-style headers, and lines containing only bold text (e.g., **hello**).
    """
    lines = md.splitlines()
    output_lines = []
    skip_next = False

    for i, line in enumerate(lines):
        stripped = line.strip()

        if skip_next:
            skip_next = False
            continue

        # Skip ATX-style headers (lines starting with '#')
        if stripped.startswith('#'):
            continue

        # Skip standalone hyphens or asterisks
        if stripped in ('-', '*'):
            continue

        # Skip lines that consist solely of bold text (e.g., **hello**)
        if re.fullmatch(r'\*\*[^*]+\*\*', stripped):
            continue

        # Skip Setext-style headers (underlines of '=' or '-')
        if i + 1 < len(lines) and re.fullmatch(r'[=-]+', lines[i + 1].strip()):
            skip_next = True
            continue

        output_lines.append(line)
    
    # Rejoin first, then collapse runs of 3+ newlines into a single blank line
    text = "\n".join(output_lines)
    return re.sub(r'\n{3,}', '\n\n', text)

def remove_all_markdown_keep_table_content(md: str) -> str:
    """
    Remove all common Markdown syntax from the input string,
    while preserving the text content of tables.
    """
    text = md

    # Remove code blocks
    text = re.sub(r'```[\s\S]*?```', '', text)

    # Remove inline code
    text = re.sub(r'`([^`]*)`', r'\1', text)

    # Remove images ![alt](url)
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)

    # Remove links [text](url)
    text = re.sub(r'\[([^\]]+)\]\(.*?\)', r'\1', text)

    # Remove bold and italic markers
    text = re.sub(r'(\*\*|__)(.*?)\1', r'\2', text)
    text = re.sub(r'(\*|_)(.*?)\1', r'\2', text)

    # Remove ATX headers
    text = re.sub(r'^\s{0,3}#{1,6}\s*', '', text, flags=re.MULTILINE)

    # Remove Setext-style headers
    text = re.sub(r'^[=\-]{2,}\s*$', '', text, flags=re.MULTILINE)

    # Remove blockquotes
    text = re.sub(r'^\s{0,3}>\s?', '', text, flags=re.MULTILINE)

    # Remove list markers
    text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE)
    text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)

    # Remove Markdown table formatting but keep content
    text = re.sub(r'^\s*\|?[-:| ]+\|?\s*$', '', text, flags=re.MULTILINE)  # strip table rule lines
    text = re.sub(r'^\s*\|\s*', '', text, flags=re.MULTILINE)              # strip leading pipes
    text = re.sub(r'\s*\|\s*', '\t', text)                                 # convert pipes to tabs
    text = re.sub(r'\|\s*$', '', text, flags=re.MULTILINE)                 # remove trailing pipe

    # Collapse multiple newlines
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()

def check_for_refusal(response: str) -> str:
    """
    Checks if a response contains refusal-like language.

    Parameters:
        response (str): The text response to analyze.

    Returns:
        str: 'yes' if refusal keywords are detected, 'no' otherwise.
    """
    refusal_keywords = ['sorry', "i cannot", "i'm sorry", 'ai assistant']
    response_lower = response.lower()
    return 'yes' if any(keyword in response_lower for keyword in refusal_keywords) else 'no'

# --- USHMM Article Cleaner ---

def clean_ushmm_article(article):
    """Apply comprehensive cleaning to USHMM articles"""
    # Skip if article contains error indicators
    if any(x in article for x in ['Error', 'sorry']):
        return article
    
    # Apply all cleaning steps for USHMM articles
    if any(x in article for x in ['Author(s):', 'United States Holocaust Memorial Museum, Washington, DC', 
                                  '* US Holocaust Memorial Museum Collection', 'View Archival Details', 
                                  '## Footnotes', '* Cite', '* Print', '* Share', 'Last Edited:', 
                                  'caption=', 'credit=', '## Critical Thinking Questions', 'Source:', 
                                  'Primary resources by the Jewish Partisan Educational Foundation', '[![', 
                                  '![', '](', 'Your browser does not support the video tag.', 
                                  "## This content is available in the following languages", "Tags"]):
        
        # Split into lines and filter out unwanted lines
        lines = article.split('\n')
        cleaned_lines = []
        skip_next = False
        
        for i, line in enumerate(lines):
            if skip_next:
                skip_next = False
                continue
                
            # Skip various unwanted content
            if any(pattern in line for pattern in ['caption=', 'credit=', '* Cite', '* Print', '* Share', 
                                                  'Last Edited:', 'Author(s):', 
                                                  'United States Holocaust Memorial Museum, Washington, DC',
                                                  'Source:', 'Primary resources by the Jewish Partisan Educational Foundation',
                                                  '[![', '![', '](', 'Your browser does not support the video tag.']):
                if 'caption=' in line or 'credit=' in line:
                    skip_next = True
                continue
                
            cleaned_lines.append(line)
        
        # Rejoin the cleaned lines
        cleaned_article = '\n'.join(cleaned_lines)
        
        # Apply function-based cleaning
        if "## Critical Thinking Questions" in cleaned_article:
            cleaned_article = remove_critical_thinking_block(cleaned_article)
        if "## Footnotes" in cleaned_article:
            cleaned_article = remove_double_numbered_footnotes(cleaned_article)
        if "## This content is available in the following languages" in cleaned_article:
            cleaned_article = remove_language_section(cleaned_article)  
        if "Tags" in cleaned_article:
            cleaned_article = remove_tags_section(cleaned_article)
        cleaned_article = remove_unique_tags_and_footers(cleaned_article)
            
        # Remove multiple consecutive newlines
        cleaned_article = re.sub(r'\n{3,}', '\n', cleaned_article)
        return cleaned_article.strip()
    
    return article.strip()


# 4. Data Restructuring and Transformation

## Purpose
Transforms the original wide-format dataset into a structured long-format dataset following the specified schema, with each row representing one response from one source to one query.

## What it does:

### Schema Transformation:
Converts from wide format (one row per query, holding all four responses) to long format (up to four rows per query, one per source; missing responses are skipped):

**New Schema:**
- **`id`**: Numeric query ID taken from the original row index (0, 1, 2, ...); shared by all responses to the same query
- **`original_query`**: The original question/prompt text
- **`source`**: Response source (USHMM, gpt_4o, gemini, grok)
- **`response`**: Original unprocessed response text
- **`response_cleaned`**: Cleaned response (USHMM articles have metadata removed; other sources are copied unchanged)
- **`response_no_headers`**: Response with markdown headers removed
- **`response_no_headers_or_markdown`**: Response with all markdown formatting removed
- **`response_language`**: Detected language code (en, es, de, etc.)
- **`response_refusal`**: Yes/No flag for LLM refusal to answer
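- **`response_keep`**: Yes/No flag for whether the response should be kept (English, non-refusal)
- **`response_keep_for_all_sources`**: Yes/No flag that is `yes` only when every response to the same query has `response_keep` set to `yes`
- **`response_already_complete_sentences`**: Flag for whether the cleaned response already consists entirely of complete sentences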
- **`response_complete_sentences`**: Text converted to complete sentences

### Processing Pipeline per Response:
1. **Language Detection**: Identifies response language using `langdetect`
2. **Refusal Detection**: Checks for common refusal patterns ("sorry", "I cannot", etc.)
3. **Source-Specific Cleaning**: Applies specialized cleaning for USHMM articles
4. **Header Removal**: Strips markdown headers while preserving content
5. **Markdown Removal**: Removes all formatting while keeping table content
6. **Sentence Completion**: Converts to complete sentences (for quality responses only)
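
Steps 1 and 2 feed the keep decision. The sketch below shows how the two flags combine and why keyword-based refusal detection can over-trigger. It assumes the language codes are supplied by hand instead of calling `langdetect`, and `decide_keep` is a hypothetical helper name; the keyword list is an abbreviated version of the one in `check_for_refusal`.

```python
# Abbreviated keyword list, mirroring the notebook's substring-based check.
REFUSAL_KEYWORDS = ("sorry", "i cannot", "ai assistant")


def check_for_refusal(response):
    """Return 'yes' if any refusal keyword occurs anywhere in the text."""
    lowered = response.lower()
    return "yes" if any(keyword in lowered for keyword in REFUSAL_KEYWORDS) else "no"


def decide_keep(language, refusal):
    """Keep a response only if it is English and not flagged as a refusal."""
    return "yes" if language == "en" and refusal == "no" else "no"


# (language code, response text); language codes are hand-assigned assumptions.
examples = [
    ("en", "I'm sorry, but I cannot help with that."),
    ("en", "The archive holds many documents."),
    ("de", "Das Archiv enthält viele Dokumente."),
    ("en", "He later said he was sorry for his silence."),  # keyword false positive
]

for language, text in examples:
    refusal = check_for_refusal(text)
    print(f"{language} refusal={refusal} keep={decide_keep(language, refusal)}")
```

The last example is a legitimate answer that still matches `sorry`, so substring matching trades some recall of good responses for simplicity; flagged rows may be worth spot-checking.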

### Quality Control:
- **Conditional Processing**: Only processes English, non-refusal responses
- **Error Handling**: Gracefully handles processing failures
- **Length Validation**: Ensures reasonable text length after processing
- **Preservation**: Maintains original text when processing isn't needed

### USHMM Special Handling:
USHMM articles receive comprehensive cleaning to remove:
- Author attributions and museum credits
- Navigation elements (Cite, Print, Share buttons)
- Metadata (Last Edited, View Archival Details)
- Media elements (images, videos, captions)
- Footnotes and critical thinking questions
- Language availability notices and tags


In [190]:
# ===== DATA RESTRUCTURING =====
# Transform the dataset into the required structure with proper schema

def create_base_dataframe(df):
    """
    Transform the original dataframe into the base structured format with:
    - id: numeric unique query ID
    - original_query: string 
    - source: USHMM/gpt_4o/gemini/grok
    - response: original response string
    - response_cleaned: cleaned response (USHMM only)
    
    Other columns are initialized empty for later processing:
    - response_no_headers
    - response_no_headers_or_markdown
    - response_language
    - response_refusal
    - response_keep
    - response_keep_for_all_sources
    - response_already_complete_sentences
    - response_complete_sentences
    """
    
    structured_data = []
    
    for idx, row in df.iterrows():
        query_id = idx 
        original_query = row['original_query']
        
        # Process each source
        sources = {
            'USHMM': row['ushmm_article'],
            'gpt_4o': row['chatgpt_4o_response'], 
            'gemini': row['gemini_response'],
            'grok': row['grok_response']
        }
        
        for source_name, response in sources.items():
            if pd.isna(response):
                continue
                
            # Only clean USHMM articles at this stage
            response_cleaned = ''
            if source_name == 'USHMM':
                response_cleaned = clean_ushmm_article(str(response))
            else:
                response_cleaned = response
            
            # Create row with base fields populated and others empty
            structured_row = {
                'id': query_id,
                'original_query': original_query,
                'source': source_name,
                'response': str(response),
                'response_cleaned': response_cleaned,
                'response_no_headers': '',
                'response_no_headers_or_markdown': '',
                'response_language': '',
                'response_refusal': '',
                'response_keep': '',
                'response_keep_for_all_sources': '',
                'response_already_complete_sentences': '',
                'response_complete_sentences': ''
            }
            
            structured_data.append(structured_row)
    
    return pd.DataFrame(structured_data)

def process_responses(df):
    """Process all responses in the dataframe by cleaning text, removing headers/markdown,
    detecting language, and checking for refusal patterns.
    
    Args:
        df (DataFrame): The dataframe containing responses to process
    
    Returns:
        DataFrame: DataFrame with all responses processed
    """
    for idx, row in df.iterrows():
        response_text = row['response_cleaned'] if row['source'] == 'USHMM' and row['response_cleaned'] else row['response']
        
        # Remove headers and markdown
        no_headers = remove_headers(str(response_text))
        no_headers = re.sub(r'\n{3,}', '\n\n', no_headers)
        df.at[idx, 'response_no_headers'] = no_headers
        
        no_markdown = remove_all_markdown_keep_table_content(no_headers)
        no_markdown = re.sub(r'\n{3,}', '\n\n', no_markdown)
        df.at[idx, 'response_no_headers_or_markdown'] = no_markdown
        
        # Detect language
        try:
            lang = detect(no_markdown)
            df.at[idx, 'response_language'] = lang
        except Exception:
            # langdetect raises when the text has no usable features (e.g., an empty string)
            df.at[idx, 'response_language'] = 'unknown'
        
        # Set refusal - for USHMM, 'yes' only when the article failed to load; check patterns for other sources
        if row['source'] == 'USHMM':
            if "error: no main content found for " in response_text.lower():
                df.at[idx, 'response_refusal'] = 'yes'
            else:
                df.at[idx, 'response_refusal'] = 'no'
        else:
            df.at[idx, 'response_refusal'] = check_for_refusal(no_markdown)

        # Set keep flag based on language and refusal status
        df.at[idx, 'response_keep'] = 'yes' if (df.at[idx, 'response_language'] == 'en' and df.at[idx, 'response_refusal'] == 'no') else 'no'

    return df


In [191]:
def process_keep_for_all_sources(df):
    """Set response_keep_for_all_sources so all sources for a query are kept or none are.
    If any source for a query has response_keep='no', every row for that query gets
    response_keep_for_all_sources='no'; otherwise every row gets 'yes'.
    
    Args:
        df (DataFrame): DataFrame containing responses with response_keep flags
        
    Returns:
        DataFrame: DataFrame with response_keep_for_all_sources populated
    """
    # Get unique query IDs
    unique_query_ids = df['id'].unique()
    
    # For each query ID
    for query_id in unique_query_ids:
        # Get all responses for this query
        query_responses = df[df['id'] == query_id]
        
        # If any response has keep='no', set all to 'no', otherwise set all to 'yes'
        if (query_responses['response_keep'] == 'no').any():
            df.loc[df['id'] == query_id, 'response_keep_for_all_sources'] = 'no'
        else:
            df.loc[df['id'] == query_id, 'response_keep_for_all_sources'] = 'yes'
            
    return df
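
The all-or-nothing rule can also be expressed as a single grouping pass over plain records. This is a sketch on hypothetical sample rows, not a replacement for the DataFrame function above; it only illustrates the rule that one `'no'` for a query marks every row of that query.

```python
from collections import defaultdict

# Hypothetical long-format rows reduced to the fields the rule needs.
rows = [
    {"id": 0, "source": "USHMM", "response_keep": "yes"},
    {"id": 0, "source": "grok", "response_keep": "yes"},
    {"id": 1, "source": "USHMM", "response_keep": "yes"},
    {"id": 1, "source": "gemini", "response_keep": "no"},
]

# Pass 1: collect the per-response flags for each query ID.
keeps_by_query = defaultdict(list)
for row in rows:
    keeps_by_query[row["id"]].append(row["response_keep"])

# Pass 2: a query is kept only if every one of its responses is kept.
for row in rows:
    all_kept = all(flag == "yes" for flag in keeps_by_query[row["id"]])
    row["response_keep_for_all_sources"] = "yes" if all_kept else "no"

for row in rows:
    print(row["id"], row["source"], row["response_keep_for_all_sources"])
```

Grouping once avoids re-filtering the full dataset for every query ID, which matters more as the number of queries grows.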


In [192]:
# Create and process the dataframe
base_df = create_base_dataframe(df)
processed_data = process_responses(base_df)
processed_data = process_keep_for_all_sources(processed_data)
processed_data.to_csv('../processed_data_1000_queries.csv', index=False)

In [193]:
# Count how many query_ids have all responses kept and save IDs
keep_count = 0
unique_query_ids = list(processed_data['id'].unique())
kept_query_ids = []

for query_id in unique_query_ids:
    # Get all rows for this query_id
    query_rows = processed_data[processed_data['id'] == query_id]
    
    # Check if all responses for this query are kept
    if (query_rows['response_keep_for_all_sources'] == 'yes').all():
        keep_count += 1
        kept_query_ids.append(int(query_id))

print(f"Number of queries where all responses were kept: {keep_count}")
print(f"Total number of unique queries: {len(unique_query_ids)}")
print(f"Percentage: {(keep_count/len(unique_query_ids))*100:.2f}%")
print(f"\nQuery IDs with all responses kept: {kept_query_ids}")

Number of queries where all responses were kept: 803
Total number of unique queries: 1000
Percentage: 80.30%

Query IDs with all responses kept: [0, 1, 2, 3, 6, 7, 8, 9, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 38, 39, 40, 41, 42, 43, 44, 45, 46, 49, 50, 52, 55, 57, 58, 59, 60, 61, 62, 63, 66, 68, 69, 71, 72, 73, 76, 77, 79, 80, 81, 83, 84, 85, 86, 87, 88, 90, 91, 93, 94, 95, 96, 97, 99, 100, 101, 102, 103, 105, 108, 109, 110, 111, 113, 114, 116, 119, 121, 124, 125, 127, 128, 129, 131, 133, 138, 139, 141, 142, 144, 145, 146, 147, 148, 149, 151, 152, 153, 154, 155, 156, 157, 161, 162, 163, 164, 167, 168, 169, 170, 171, 172, 176, 177, 178, 179, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 192, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 206, 207, 208, 209, 210, 211, 213, 215, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 229, 231, 232, 233, 234, 236, 237, 239, 240, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 255,

# 5. OpenAI Text Processing Functions

## Purpose
Uses OpenAI's GPT-4o to convert incomplete sentences, bullet points, and tables into complete, standalone sentences. Text is rewritten only when necessary, and the prompt instructs the model to preserve the original meaning without adding new information. Both calls use a temperature of 0 for more consistent output. The completeness check parses its result into a Pydantic model, and the rewrite call allows up to 16,000 output tokens for long responses. If an API call fails, `complete_sentences()` returns the original text and `are_all_sentences_complete()` returns `False`.

The "sentencified" data is used in claim comparison.

### Sentence Completion Functions:
- **`are_all_sentences_complete()`**: Uses OpenAI's structured output to determine if text contains complete sentences
- **`complete_sentences()`**: Transforms incomplete text into complete sentences using GPT-4o
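
One way to combine fallback-on-failure with an output-length check and a bounded retry is sketched below. Everything here is an assumption for illustration: `transform_with_fallback`, its thresholds, and the stand-in transforms are hypothetical, and no OpenAI call is made. In practice `transform` could be a function such as `complete_sentences`.

```python
def transform_with_fallback(text, transform, max_attempts=2,
                            min_ratio=0.5, max_ratio=2.0):
    """Apply `transform`, retrying on errors or implausible output lengths.

    Returns the original text if no attempt yields an acceptable result.
    The ratio thresholds are illustrative, not tuned values.
    """
    for _ in range(max_attempts):
        try:
            candidate = transform(text)
        except Exception as exc:
            print(f"Transform failed: {exc}")
            continue
        ratio = len(candidate) / max(len(text), 1)
        if candidate.strip() and min_ratio <= ratio <= max_ratio:
            return candidate
    return text  # fall back to the unmodified input


# Stand-in transforms that simulate the three possible outcomes.
def flaky(text):
    raise RuntimeError("simulated API failure")


def truncating(text):
    return text[:3]  # implausibly short output


def rewriting(text):
    return text.replace("- ", "The list includes ")


original = "- archives and photographs"
print(transform_with_fallback(original, flaky))       # falls back to the original
print(transform_with_fallback(original, truncating))  # rejected by the length check
print(transform_with_fallback(original, rewriting))   # accepted
```

Passing the transform in as a callable keeps the guard logic testable without network access.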


In [75]:
# ===== OPENAI PROCESSING FUNCTIONS =====

class SentenceCompleteness(BaseModel):
    """Structured output indicating whether all sentences are complete."""
    all_sentences_complete: bool

def are_all_sentences_complete(text: str) -> bool:
    """
    Returns True if all sentences in the text are complete; False if any are incomplete.
    """
    try:
        response = client.responses.parse(
            model="gpt-4o",
            input=[
                {
                    "role": "system",
                    "content": (
                        "Analyze the following text. Return true if all sentences are complete; "
                        "Return false if there are tables, bullet points, or lists in the text. "
                        "Return false if any sentence is incomplete. "
                        "A complete sentence must have a subject, a predicate, and express a complete thought."
                    ),
                },
                {
                    "role": "user",
                    "content": text,
                },
            ],
            text_format=SentenceCompleteness,
            temperature=0,
        )
        return response.output_parsed.all_sentences_complete
    except Exception as e:
        print(f"Error checking sentence completeness: {e}")
        return False
    
def complete_sentences(text: str) -> str:
    """Convert incomplete sentences into complete sentences using OpenAI"""
    instruction = "Convert only incomplete sentences into complete sentences, adding no new information. If a sentence is already complete, do not change it in any way. Each new sentence must include enough context to be fully understood on its own. No headers. Transform all bullet points and tables if they exist into complete sentences. Do not include explanatory text about the transformation process."
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": f"""
                    Instruction: {instruction}
                    
                    Text: {text}
                """}
            ],
            temperature=0,
            max_tokens=16000
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error completing sentences: {e}")
        return text



# 6. Turn into Complete Sentences 

This section transforms responses into complete sentences to enable accurate claim comparison. `response_already_complete_sentences` records whether a response already consisted entirely of complete sentences, and `response_complete_sentences` stores the final text in complete sentences.
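
The cells in this section work through the data in fixed-size batches, write one CSV per batch, and later recombine the files in order. The sketch below isolates the index arithmetic and the file ordering; `batch_ranges` and `batch_start` are hypothetical helper names, and no files are written.

```python
def batch_ranges(total_rows, batch_size):
    """Yield (start, end) pairs covering range(total_rows); end is exclusive."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)


ranges = list(batch_ranges(total_rows=12, batch_size=5))
print(ranges)

# File names follow the pattern used by the batch cells.
names = [f"processed_data_batch_{start}_{end}.csv" for start, end in ranges]


def batch_start(name):
    """Extract the numeric start index, the second-to-last '_' field."""
    return int(name.split("_")[-2])


# Plain string sorting misorders batches once indices reach two digits.
print(sorted(names))
print(sorted(names, key=batch_start))
```

Sorting on the parsed start index keeps the recombined rows in their original order, whereas lexicographic sorting would place the batch starting at 10 before the batch starting at 5.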
               

In [233]:
def process_data_batch(processed_data, start_idx=0, batch_size=5, verbose=True):
    """
    Process a batch of data to check and complete sentences.
    
    Args:
        processed_data (pd.DataFrame): DataFrame containing the data to process
        start_idx (int): Starting index for the batch
        batch_size (int): Size of batch to process
        verbose (bool): Whether to print progress information
        
    Returns:
        pd.DataFrame: Processed batch with complete sentences
    """
    # Get batch sample
    end_idx = min(start_idx + batch_size, len(processed_data))
    data_batch = processed_data.iloc[start_idx:end_idx].copy()
    
    # Only process rows where response_keep is 'yes'
    data_batch_to_process = data_batch[data_batch['response_keep'] == 'yes'].copy()
    
    if len(data_batch_to_process) > 0:
        # Check sentence completeness once per row and reuse the result (one API call each)
        already_complete = data_batch_to_process['response_cleaned'].apply(are_all_sentences_complete).astype(bool)
        if verbose:
            print(f"\nProcessing batch from index {start_idx} to {end_idx-1}")
            print("\nInitial sentence completeness check:")
            for idx, result in already_complete.items():
                print(f"Row {idx}: {result}")

        data_batch.loc[data_batch_to_process.index, 'response_already_complete_sentences'] = already_complete

        # Initialize response_complete_sentences column with response_no_headers_or_markdown
        data_batch.loc[data_batch_to_process.index, 'response_complete_sentences'] = \
            data_batch_to_process['response_no_headers_or_markdown']
        # Rewrite only the rows with incomplete sentences
        incomplete_rows = data_batch_to_process[~already_complete]
        for idx in incomplete_rows.index:
            data_batch.loc[idx, 'response_complete_sentences'] = complete_sentences(incomplete_rows.loc[idx, 'response_cleaned'])

        # Verify the final text, read from data_batch (the filtered copy above is not updated)
        if verbose:
            final_complete = data_batch.loc[data_batch_to_process.index, 'response_complete_sentences'].apply(are_all_sentences_complete).astype(bool)
            print("\nFinal sentence completeness check:")
            for idx, result in final_complete.items():
                print(f"Row {idx}: {result}")

            # Print summary
            print(f"\nNumber of responses transformed: {len(incomplete_rows)}")
            print(f"Responses with incomplete sentences remaining: {int((~final_complete).sum())}")
    
    return data_batch


In [None]:
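# Process data in batches and save to cleaned_output folder
batch_size = 5
total_rows = len(processed_data)

# Use a relative path so the files land where the combine step below looks for them
output_dir = 'cleaned_output'
os.makedirs(output_dir, exist_ok=True)

# Make sure the output columns exist so each processed batch can be written back
for col in ['response_already_complete_sentences', 'response_complete_sentences']:
    if col not in processed_data.columns:
        processed_data[col] = None

for start_idx in range(0, total_rows, batch_size):
    # Process batch
    batch_df = process_data_batch(processed_data, start_idx=start_idx, batch_size=batch_size, verbose=True)
    
    # Update processed_data with the processed batch
    end_idx = min(start_idx + batch_size, total_rows)
    processed_data.iloc[start_idx:end_idx] = batch_df
    
    # Create filename with batch range
    filename = f'{output_dir}/processed_data_batch_{start_idx}_{end_idx}.csv'
    
    # Save batch to CSV
    batch_df.to_csv(filename, index=False)
    print(f"Saved batch {start_idx}-{end_idx} to {filename}")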
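
Because each batch is written to its own CSV, an interrupted run can in principle be resumed without reprocessing finished batches. The following is a minimal sketch of that idea, not part of the original notebook: `pending_batches` is a hypothetical helper, and it assumes the same `processed_data_batch_{start}_{end}.csv` naming used above.

```python
import os
import tempfile

def pending_batches(total_rows, batch_size, out_dir):
    """Yield (start_idx, end_idx, path) for batches with no CSV on disk yet."""
    for start_idx in range(0, total_rows, batch_size):
        end_idx = min(start_idx + batch_size, total_rows)
        path = os.path.join(out_dir, f"processed_data_batch_{start_idx}_{end_idx}.csv")
        if not os.path.exists(path):
            yield start_idx, end_idx, path

# Demo in a temporary directory: pretend the first batch was already saved
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "processed_data_batch_0_5.csv"), "w").close()
    remaining = [(start, end) for start, end, _ in pending_batches(12, 5, tmp_dir)]

print(remaining)
```

In the loop above, iterating over `pending_batches(total_rows, batch_size, 'cleaned_output')` instead of `range(...)` would skip batches that already have a file. Note that skipped batches would not be written back into `processed_data` in memory.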

In [None]:
# Get list of all CSV files in cleaned_output directory
batch_files = glob.glob('cleaned_output/processed_data_batch_*.csv')

# Sort files by batch number
batch_files.sort(key=lambda x: int(x.split('_')[-2]))

# Read and combine all batches
combined_df = pd.concat([pd.read_csv(f) for f in batch_files], ignore_index=True)

# Save combined dataset
combined_df.to_csv('../processed_data_1000_queries.csv', index=False)
print(f"Combined {len(batch_files)} batches into ../processed_data_1000_queries.csv")
print(f"Total rows in combined dataset: {len(combined_df)}")
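
The sort key above relies on the position of underscores in the full path, so it would break if a parent directory name added or removed an underscore-separated token in an unexpected way. A slightly more defensive variant, sketched here under the assumption that files follow the `processed_data_batch_{start}_{end}.csv` pattern, extracts the start index with a regular expression. The `batch_start` helper and the example paths are illustrative only.

```python
import re

def batch_start(path):
    """Return the start index encoded in a batch filename."""
    match = re.search(r"processed_data_batch_(\d+)_(\d+)\.csv$", path)
    if match is None:
        raise ValueError(f"Unexpected batch filename: {path}")
    return int(match.group(1))

# Example paths in the order a filesystem listing might return them
example_files = [
    "cleaned_output/processed_data_batch_10_15.csv",
    "cleaned_output/processed_data_batch_0_5.csv",
    "cleaned_output/processed_data_batch_5_10.csv",
]

# Numeric sort, so batch 5 comes before batch 10
ordered = sorted(example_files, key=batch_start)
print(ordered)
```

Raising on an unexpected filename makes a stray CSV in the folder visible instead of silently mixing it into the combined dataset.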

# 7. Wide Version of Data (Optional)

This optional step produces a wide-format dataset in which each row represents one query and the columns hold the responses from every source. Each source’s outputs are stored in separate prefixed columns (e.g., gpt_4o_response, gemini_response_cleaned, USHMM_response_language, ...), enabling direct comparison across models.

In [17]:
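def create_wide_format_data(processed_data):
    """
    Convert the long-format data (one row per id and source) to wide format.

    Args:
        processed_data (pd.DataFrame): Long-format data with 'id', 'source',
            and 'original_query' columns plus the per-response columns

    Returns:
        pd.DataFrame: One row per id, with per-response columns prefixed by source
    """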
    # Create list of base columns that get repeated for each source
    base_columns = list(processed_data.columns[~processed_data.columns.isin(['id', 'source', 'original_query'])])

    # Define sources
    sources = processed_data['source'].unique()

    # Create full column list starting with id and original_query
    columns = ['id', 'original_query']

    # Add source-specific columns
    for source in sources:
        source_columns = [f"{source}_{col}" for col in base_columns]
        columns.extend(source_columns)

    # Create empty dataframe with the defined columns
    wide_df = pd.DataFrame(columns=columns)

    # Copy data from processed_data to wide_df
    for idx, row in processed_data.iterrows():
        source = row['source']
        query_id = row['id']
        
        # If this query_id doesn't exist in wide_df yet, create it
        if query_id not in wide_df['id'].values:
            new_row = pd.DataFrame({
                'id': [query_id],
                'original_query': row['original_query']  # Copy original_query when creating new row
            })
            wide_df = pd.concat([wide_df, new_row], ignore_index=True)
        
        # Get the row index in wide_df
        wide_idx = wide_df[wide_df['id'] == query_id].index[0]
        
        # Copy over the data
        for base_col in base_columns:
            if base_col in row:
                wide_col = f"{source}_{base_col}"
                wide_df.at[wide_idx, wide_col] = row[base_col]

    return wide_df

# Create wide format data and save to CSV
wide_df = create_wide_format_data(processed_data)
wide_df.to_csv('wide_data_1000_queries.csv', index=False)
wide_df.head()

Unnamed: 0,id,original_query,USHMM_response,USHMM_response_cleaned,USHMM_response_no_headers,USHMM_response_no_headers_or_markdown,USHMM_response_language,USHMM_response_refusal,USHMM_response_keep,USHMM_response_keep_for_all_sources,...,grok_response,grok_response_cleaned,grok_response_no_headers,grok_response_no_headers_or_markdown,grok_response_language,grok_response_refusal,grok_response_keep,grok_response_keep_for_all_sources,grok_response_already_complete_sentences,grok_response_complete_sentences
0,0,how many people died in the holocaust,#How Many People did the Nazis Murder? | Holoc...,#How Many People did the Nazis Murder? | Holoc...,\nNazi Germany committed mass murder on an unp...,Nazi Germany committed mass murder on an unpre...,en,no,yes,yes,...,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...,en,no,yes,yes,True,The Holocaust was a period of systematic perse...
1,1,armenian genocide,#The Armenian Genocide (1915-16): Overview | H...,#The Armenian Genocide (1915-16): Overview | H...,\nSometimes called the first genocide of the t...,Sometimes called the first genocide of the twe...,en,no,yes,yes,...,The Armenian Genocide was the systematic exter...,The Armenian Genocide was the systematic exter...,The Armenian Genocide was the systematic exter...,The Armenian Genocide was the systematic exter...,en,no,yes,yes,False,The Armenian Genocide was the systematic exter...
2,2,holocaust encyclopedia,#Introduction to the Holocaust: What was the H...,#Introduction to the Holocaust: What was the H...,\nThe Holocaust (1933–1945) was the systematic...,"The Holocaust (1933–1945) was the systematic, ...",en,no,yes,yes,...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,en,no,yes,yes,False,The Holocaust Encyclopedia is a comprehensive ...
3,3,first they came for,"#Martin Niemöller: ""First they came for the So...","#Martin Niemöller: ""First they came for the So...","\n\n> First they came for the socialists, and ...","First they came for the socialists, and I did ...",en,no,yes,yes,...,"""First they came ..."" is the beginning of a fa...","""First they came ..."" is the beginning of a fa...","""First they came ..."" is the beginning of a fa...","""First they came ..."" is the beginning of a fa...",en,no,yes,yes,True,"""First they came ..."" is the beginning of a fa..."
4,4,holocaust,#Introduction to the Holocaust: What was the H...,#Introduction to the Holocaust: What was the H...,\nThe Holocaust (1933–1945) was the systematic...,"The Holocaust (1933–1945) was the systematic, ...",en,no,yes,no,...,"Der Holocaust war eine systematische, staatlic...","Der Holocaust war eine systematische, staatlic...","Der Holocaust war eine systematische, staatlic...","Der Holocaust war eine systematische, staatlic...",de,no,no,no,,
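
The row-by-row loop in `create_wide_format_data` scans and concatenates the wide frame once per input row, which can become slow on larger datasets. As an alternative, here is a minimal sketch based on `DataFrame.pivot`. It assumes pandas 1.1 or later and that each (id, source) pair appears at most once; `pivot_to_wide` and the toy frame are illustrative names, not part of the notebook. The resulting columns are grouped by field rather than by source, so their order differs from the output above.

```python
import pandas as pd

def pivot_to_wide(long_df):
    """Pivot long-format rows into one row per query with '<source>_<column>' names."""
    key_cols = ["id", "source", "original_query"]
    value_cols = [c for c in long_df.columns if c not in key_cols]
    # pivot raises a ValueError if an (id, source) pair is duplicated
    wide = long_df.pivot(index=["id", "original_query"], columns="source", values=value_cols)
    # Flatten the (column, source) MultiIndex into prefixed column names
    wide.columns = [f"{source}_{col}" for col, source in wide.columns]
    return wide.reset_index()

# Toy long-format data: query 1 has no gpt_4o row
toy = pd.DataFrame({
    "id": [0, 0, 1],
    "original_query": ["q0", "q0", "q1"],
    "source": ["USHMM", "gpt_4o", "USHMM"],
    "response": ["a", "b", "c"],
    "response_language": ["en", "en", "de"],
})

wide = pivot_to_wide(toy)
print(wide)
```

Missing (id, source) combinations come out as NaN, which matches how the loop-based version leaves unfilled cells empty.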


# 8. Dataset Helper Functions (Optional)

These helper functions create smaller subsets of the processed dataset for use in downstream analyses and pipelines.

In [3]:
# Read in the processed data
processed_data = pd.read_csv('../processed_data_1000_queries.csv')

In [18]:
# Create dataset with only USHMM and GPT data where response_keep is 'yes'
def create_ushmm_gpt_dataset(df):
    # Filter to keep only rows where response_keep is 'yes'
    ushmm_gpt_filtered = df[df['response_keep'] == 'yes']
    
    # Filter to only USHMM and GPT sources
    ushmm_gpt_data = ushmm_gpt_filtered[ushmm_gpt_filtered['source'].isin(['USHMM', 'gpt_4o'])]

    # Group by id and check that both USHMM and gpt_4o exist
    grouped = ushmm_gpt_data.groupby('id')['source'].apply(set)
    valid_ids = grouped[grouped.apply(lambda x: {'USHMM', 'gpt_4o'}.issubset(x))].index
    
    # Filter to only keep rows where both sources exist
    ushmm_gpt_filtered = ushmm_gpt_data[ushmm_gpt_data['id'].isin(valid_ids)]
    
    return ushmm_gpt_filtered

# Create the filtered dataset
ushmm_gpt_df = create_ushmm_gpt_dataset(processed_data)
print(f"Created filtered dataset with {len(ushmm_gpt_df)} rows")

# Create wide format of USHMM and GPT data
ushmm_gpt_wide = create_wide_format_data(ushmm_gpt_df)
print(f"Created wide format dataset with {len(ushmm_gpt_wide)} rows")

ushmm_gpt_wide.head()


Created filtered dataset with 1910 rows


Created wide format dataset with 955 rows


Unnamed: 0,id,original_query,USHMM_response,USHMM_response_cleaned,USHMM_response_no_headers,USHMM_response_no_headers_or_markdown,USHMM_response_language,USHMM_response_refusal,USHMM_response_keep,USHMM_response_keep_for_all_sources,...,gpt_4o_response,gpt_4o_response_cleaned,gpt_4o_response_no_headers,gpt_4o_response_no_headers_or_markdown,gpt_4o_response_language,gpt_4o_response_refusal,gpt_4o_response_keep,gpt_4o_response_keep_for_all_sources,gpt_4o_response_already_complete_sentences,gpt_4o_response_complete_sentences
0,0,how many people died in the holocaust,#How Many People did the Nazis Murder? | Holoc...,#How Many People did the Nazis Murder? | Holoc...,\nNazi Germany committed mass murder on an unp...,Nazi Germany committed mass murder on an unpre...,en,no,yes,yes,...,Approximately 6 million Jews were killed durin...,Approximately 6 million Jews were killed durin...,Approximately 6 million Jews were killed durin...,Approximately 6 million Jews were killed durin...,en,no,yes,yes,False,Approximately 6 million Jews were killed durin...
1,1,armenian genocide,#The Armenian Genocide (1915-16): Overview | H...,#The Armenian Genocide (1915-16): Overview | H...,\nSometimes called the first genocide of the t...,Sometimes called the first genocide of the twe...,en,no,yes,yes,...,The Armenian Genocide was the systematic mass ...,The Armenian Genocide was the systematic mass ...,The Armenian Genocide was the systematic mass ...,The Armenian Genocide was the systematic mass ...,en,no,yes,yes,False,"April 24, 1915, is often marked as the beginni..."
2,2,holocaust encyclopedia,#Introduction to the Holocaust: What was the H...,#Introduction to the Holocaust: What was the H...,\nThe Holocaust (1933–1945) was the systematic...,"The Holocaust (1933–1945) was the systematic, ...",en,no,yes,yes,...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,en,no,yes,yes,False,The Holocaust Encyclopedia is a comprehensive ...
3,3,first they came for,"#Martin Niemöller: ""First they came for the So...","#Martin Niemöller: ""First they came for the So...","\n\n> First they came for the socialists, and ...","First they came for the socialists, and I did ...",en,no,yes,yes,...,"The phrase ""First they came for..."" is the ope...","The phrase ""First they came for..."" is the ope...","The phrase ""First they came for..."" is the ope...","The phrase ""First they came for..."" is the ope...",en,no,yes,yes,False,"The phrase ""First they came for..."" is the ope..."
4,4,holocaust,#Introduction to the Holocaust: What was the H...,#Introduction to the Holocaust: What was the H...,\nThe Holocaust (1933–1945) was the systematic...,"The Holocaust (1933–1945) was the systematic, ...",en,no,yes,no,...,"The Holocaust was the systematic, state-sponso...","The Holocaust was the systematic, state-sponso...","The Holocaust was the systematic, state-sponso...","The Holocaust was the systematic, state-sponso...",en,no,yes,no,True,"The Holocaust was the systematic, state-sponso..."
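
The paired filter in `create_ushmm_gpt_dataset` keeps a query only when every requested source still has a row after the `response_keep` filter. The same idea can be generalized to any set of sources; the sketch below does this with `groupby(...).transform("nunique")` on toy data. The `keep_ids_with_all_sources` helper is hypothetical and assumes the `id`, `source`, and `response_keep` columns shown above.

```python
import pandas as pd

def keep_ids_with_all_sources(df, required_sources):
    """Keep rows for ids where every required source has a row marked response_keep == 'yes'."""
    required = set(required_sources)
    subset = df[(df["response_keep"] == "yes") & (df["source"].isin(required))]
    # Count the distinct surviving sources per id and broadcast back to each row
    n_sources = subset.groupby("id")["source"].transform("nunique")
    return subset[n_sources == len(required)]

# Toy data: id 0 has both sources kept, id 1 lost its gpt_4o row, id 2 has no USHMM row
toy = pd.DataFrame({
    "id": [0, 0, 1, 1, 2, 2],
    "source": ["USHMM", "gpt_4o", "USHMM", "gpt_4o", "gpt_4o", "gemini"],
    "response_keep": ["yes", "yes", "yes", "no", "yes", "yes"],
})

paired = keep_ids_with_all_sources(toy, ["USHMM", "gpt_4o"])
print(paired)
```

Only the two rows for id 0 survive here, since the other ids are missing at least one kept source.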


In [21]:
# Read in the processed data
processed_data = pd.read_csv('../processed_data_1000_queries.csv')

def create_filtered_dataset(df):
    # Filter to keep only rows where response_keep_for_all_sources is 'yes'
    filtered_data = df[df['response_keep_for_all_sources'] == 'yes']
    # Select only the specified columns
    filtered_data = filtered_data[['id', 'original_query', 'source', 'response', 'response_cleaned']]
    return filtered_data

# Create the filtered dataset
filtered_df = create_filtered_dataset(processed_data)
print(f"Created filtered dataset with {len(filtered_df)} rows")

filtered_df.head()

Created filtered dataset with 3212 rows


Unnamed: 0,id,original_query,source,response,response_cleaned
0,0,how many people died in the holocaust,USHMM,#How Many People did the Nazis Murder? | Holoc...,#How Many People did the Nazis Murder? | Holoc...
1,0,how many people died in the holocaust,gpt_4o,Approximately 6 million Jews were killed durin...,Approximately 6 million Jews were killed durin...
2,0,how many people died in the holocaust,gemini,Historians estimate that the Nazis murdered ap...,Historians estimate that the Nazis murdered ap...
3,0,how many people died in the holocaust,grok,The Holocaust was a period of systematic perse...,The Holocaust was a period of systematic perse...
4,1,armenian genocide,USHMM,#The Armenian Genocide (1915-16): Overview | H...,#The Armenian Genocide (1915-16): Overview | H...
