<a href="https://colab.research.google.com/github/AI-Cultural-Heritage-Lab/llm_sanitization/blob/main/Sentence_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Embeddings

- Author: Ulysses Pascal
- Date: May 9, 2025
- Description: Create sentence-by-sentence embeddings for all sentences in the USHMM and ChatGPT data. The goal is to place sentences on a scale from positive to negative and compare how much the sources differ in tone overall.
- Goals:
  - Tone classifier (very negative, somewhat negative, neutral, somewhat positive/optimistic, very positive/optimistic)
  - Get the average positivity score for each source
  - Create a graph that shows average (or stacked?) positivity across all docs at each percent of the document, i.e. x-axis: 0 -> 100% of document (hypothesis: the end of the document tends to have a positive uptick for GPT)
  - See whether sentences that are similar in content are also similar in tone.
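
The position-in-document graph described in the goals could be computed roughly as follows. This is a minimal sketch, not the notebook's implementation: it assumes a long-format sentence table with `doc_id`, `source`, and `sentence_id` columns (the shape built later in this notebook) plus a hypothetical per-sentence `tone_score` column that does not exist yet. The function name and the `n_bins` parameter are also assumptions.

```python
import pandas as pd

def positivity_by_position(df, score_col="tone_score", n_bins=10):
    """Average a per-sentence score within equal-width bins of document position.

    `score_col` is a hypothetical column; nothing in the notebook creates it yet.
    """
    out = df.copy()
    # Length of each document, approximated by its largest sentence_id.
    doc_len = out.groupby("doc_id")["sentence_id"].transform("max")
    # Integer arithmetic avoids floating-point edge cases at bin boundaries:
    # sentence 1 always lands in bin 0, the last sentence in bin n_bins - 1.
    out["position_bin"] = ((out["sentence_id"] - 1) * n_bins) // doc_len
    # Rows: position bin; columns: source; values: mean score.
    return out.groupby(["source", "position_bin"])[score_col].mean().unstack("source")

# Tiny made-up example with two documents and two bins.
example = pd.DataFrame({
    "doc_id": ["a", "a", "a", "a", "b", "b"],
    "source": ["gpt", "gpt", "gpt", "gpt", "ushmm", "ushmm"],
    "sentence_id": [1, 2, 3, 4, 1, 2],
    "tone_score": [0.0, 0.0, 1.0, 1.0, -1.0, 1.0],
})
binned = positivity_by_position(example, n_bins=2)
print(binned)
```

With real scores, calling `.plot()` on the result would draw one line per source across the document. Note that if sentences have already been filtered out, `sentence_id` has gaps, so the bins are only approximate.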




In [64]:
#@title Mount Google Drive
import gspread
from google.colab import auth
from gspread_dataframe import get_as_dataframe
import pandas as pd


# Mount gdrive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [79]:
#@title Create a Base Directory for the Project

import os
BASE_DIR = f'/content/drive/MyDrive/AI and Cultural Heritage Lab/Current Projects/LLM Language Sanitization Project/Experiments   Data Analysis/11 Compare Sentences — Ulysses'

# Create the base directory if it doesn't exist
os.makedirs(BASE_DIR, exist_ok=True)

In [6]:
#@title Import Data
import pandas as pd

def read_gsheet(sheet_id, gid, *args, **kwargs):
    """
    Read a Google Sheet into a DataFrame.
    Accepts all the same arguments as pd.read_csv()
    """
    url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"
    df = pd.read_csv(url, *args, **kwargs)
    return df


# Import Combined Data from Google Sheets
sheet_id = "1_qAn-BNXdIzotr_bXXF2dAjJfa7pDFH1Q84Y9fEBby4"
gid = "1499249206"  # gid for combined_data_v6

df = read_gsheet(sheet_id, gid)
df.head()


Unnamed: 0,id,location,translated_query,query_type,query_thematic_tags,original_query,google_translate,4o_mini_translate,ushmm_article,chatgpt-4o-response,Gemini Response,Grok Response,ai_overview_standalone,comments,related_questions,ushmm_links,final_language
0,0,United States,,Factual,"['Holocaust', 'World War II', 'History', 'Geno...",how many people died in the holocaust,,,#How Many People did the Nazis Murder? | Holoc...,Approximately 6 million Jews were killed durin...,Historians estimate that the Nazis murdered ap...,The Holocaust was a period of systematic perse...,[],,[],['https://encyclopedia.ushmm.org/content/en/ar...,English
1,1,United States,,Factual,"['Genocide', 'Armenian history', 'World War I'...",armenian genocide,,,#The Armenian Genocide (1915-16): Overview | H...,The Armenian Genocide was the systematic mass ...,The Armenian Genocide was the systematic destr...,The Armenian Genocide was the systematic exter...,[],,[],['https://encyclopedia.ushmm.org/content/en/ar...,English
2,2,United States,,Factual,"['Holocaust', 'History', 'Encyclopedia', 'Worl...",holocaust encyclopedia,,,#Introduction to the Holocaust: What was the H...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,The Holocaust Encyclopedia is a comprehensive ...,[],,[{'question': 'Is the Holocaust Encyclopedia c...,"['https://encyclopedia.ushmm.org/', 'https://e...",English
3,3,United States,,Interpretative,"['Oppression', 'Historical Awareness', 'Moral ...",first they came for,,,"#Martin Niemöller: ""First they came for the So...","The phrase ""First they came for..."" is the ope...","""First they came for the socialists, and I did...","""First they came ..."" is the beginning of a fa...",[],,"[{'question': 'What is the saying ""first they ...",['https://encyclopedia.ushmm.org/content/en/ar...,English
4,4,United States,,Factual,"['Holocaust', 'Genocide', 'World War II', 'His...",holocaust,,,#Introduction to the Holocaust: What was the H...,"The Holocaust was the systematic, state-sponso...","The Holocaust was the systematic, state-sponso...","Der Holocaust war eine systematische, staatlic...",[],,[{'question': 'Are any Holocaust survivors sti...,['https://encyclopedia.ushmm.org/content/en/ar...,English


## Tokenize Sentences.

The goal of this section is to split the document into meaningful "sentences".

Challenges
1. Some headings don't have punctuation
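
The idea behind handling unpunctuated headings can be sketched in isolation. The snippet below is a simplified variant, not the function used in the next cell: it assumes a heading is any short line (at most `max_words` words, a threshold chosen here for illustration) that does not end in sentence punctuation, optionally followed by closing quotes or brackets.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes/brackets.
TERMINAL_PUNCT = re.compile(r"[.!?][\"')\]]*$")

def looks_like_heading(line, max_words=9):
    """Heuristic: a short line with no sentence-ending punctuation is probably a heading."""
    stripped = line.strip()
    if not stripped or len(stripped.split()) > max_words:
        return False
    return TERMINAL_PUNCT.search(stripped) is None

for candidate in ["Critical Thinking Questions",
                  "Nazi Germany committed mass murder on an unprecedented scale."]:
    print(candidate, "->", looks_like_heading(candidate))
```

A heuristic like this also catches table rows and list fragments, so flagged lines are worth inspecting before they are dropped.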

In [69]:
import pandas as pd
import re
from typing import List, Tuple

# spaCy performs the sentence segmentation
import spacy

# Compile regex patterns once
MARKDOWN_PATTERNS = [
    (re.compile(r'(\*\*|__)(.*?)\1'), r'\2'),  # Bold
    (re.compile(r'(\*|_)(.*?)\1'), r'\2'),     # Italic
    (re.compile(r'`([^`]*)`'), r'\1'),         # Inline code
    (re.compile(r'!\[([^\]]*)\]\([^)]+\)'), r'\1'),  # Images (must run before links, or '![alt](url)' becomes '!alt')
    (re.compile(r'\[([^\]]+)\]\([^)]+\)'), r'\1'),  # Links
    (re.compile(r'^\s*>\s*', re.MULTILINE), ''),     # Blockquotes
    (re.compile(r'^\s*[-*+]\s+', re.MULTILINE), ''), # List markers
]

HEADING_PATTERN = re.compile(r'^\s*#+')
HASH_REMOVE_PATTERN = re.compile(r'^\s*#+\s*')
SENTENCE_SPLIT_PATTERN = re.compile(r'[.!?]+\s+')

def remove_markdown_fast(text):
    for pattern, replacement in MARKDOWN_PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()

def is_markdown_heading_cached(line):
    """Return True if the line starts with Markdown heading markers (#)"""
    return bool(HEADING_PATTERN.match(line))

def is_short_line(line, min_sentence_length=4):
    """Check if line is shorter than threshold"""
    return len(line.strip()) < min_sentence_length

def is_short_without_punctuation(line):
    """Check if a short line (9 words or fewer) has no sentence punctuation in its last 4 characters"""

    # Ignore long sentences
    num_words = len(line.split())
    if num_words > 9:
        return False

    # Get last 4 characters and check if any are punctuation
    last_chars = line.rstrip()[-4:] if len(line.rstrip()) >= 4 else line.rstrip()

    # Check if any of the last characters are punctuation
    punctuation_chars = '.!?'
    return not any(char in punctuation_chars for char in last_chars)


def sentence_tokenize_document_fast(text) -> Tuple[List[str], List[bool], List[str]]:
    """
    Tokenize document and return sentences with marking information.

    Args:
        text: Input text to tokenize

    Returns:
        Tuple of (sentences, marked_for_removal_flags, removal_reasons)
    """
    # Load spaCy model once and reuse
    if not hasattr(sentence_tokenize_document_fast, 'nlp'):
        sentence_tokenize_document_fast.nlp = spacy.load("en_core_web_sm")
    nlp = sentence_tokenize_document_fast.nlp

    lines = text.split('\n')
    units = []
    marked_for_removal = []
    removal_reasons = []

    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue

        if is_markdown_heading_cached(stripped):
            clean = HASH_REMOVE_PATTERN.sub('', stripped)
            clean = remove_markdown_fast(clean)
            units.append(clean)

            # Check if this heading should be marked
            marked_for_removal.append(True)
            removal_reasons.append('Markdown heading')

        else:
            clean = remove_markdown_fast(stripped)
            original_line = clean  # Keep track of original line for line break detection

            try:
                doc = nlp(clean)
                sentences = [sent.text.strip() for sent in doc.sents]
            except Exception:  # Fall back to a regex split if spaCy fails on this line
                sentences = [s.strip() for s in SENTENCE_SPLIT_PATTERN.split(clean) if s.strip()]

            # Process each sentence from this line
            for j, sentence in enumerate(sentences):
                if is_short_line(sentence, min_sentence_length=4):
                    units.append(sentence)
                    marked_for_removal.append(True)
                    removal_reasons.append('Short sentence')
                elif is_short_without_punctuation(sentence):
                    units.append(sentence)
                    marked_for_removal.append(True)
                    removal_reasons.append('Sentence without punctuation')
                else:
                    units.append(sentence)
                    marked_for_removal.append(False)
                    removal_reasons.append('')

    return units, marked_for_removal, removal_reasons

def batch_process_to_sentences(df: pd.DataFrame,
                             text_columns: List[str],
                             tokenizer_func,
                             batch_size: int = 100,
                             id_col: str = 'id',
                             additional_cols: List[str] = None
                                ) -> pd.DataFrame:
    """
    Process text columns in batches and directly output long-format sentence DataFrame.
    Uses tokenizer that returns marking information.

    Args:
        df: Input DataFrame
        text_columns: List of column names containing text to process
        tokenizer_func: Function to tokenize text (should return sentences, marks, reasons)
        batch_size: Number of documents to process at once
        id_col: Name of ID column
        additional_cols: Other columns to preserve

    Returns:
        Long-form DataFrame with columns: [id_col, sentence_id, additional_cols..., source, sentence, marked_for_removal, removal_reason]
    """
    if additional_cols is None:
        additional_cols = []

    all_sentence_rows = []

    # Process each text column
    for source_col in text_columns:
        print(f"Processing {source_col}...")

        # Process this column in batches
        for i in range(0, len(df), batch_size):
            batch_df = df.iloc[i:i+batch_size]

            # Process each document in the batch
            for _, row in batch_df.iterrows():
                text = row[source_col]

                # Skip empty/null text
                if pd.isna(text) or text == '':
                    continue

                # Tokenize the text and get marking information
                sentences, marks, reasons = tokenizer_func(text)

                # Create rows for each sentence
                for sentence_id, (sentence, marked, reason) in enumerate(zip(sentences, marks, reasons), 1):
                    if sentence.strip():  # Skip empty sentences
                        sentence_row = {
                            id_col: row[id_col],
                            'sentence_id': sentence_id,
                            'source': source_col,
                            'sentence': sentence.strip(),
                            'marked_for_removal': marked,
                            'removal_reason': reason
                        }

                        # Add additional columns
                        for col in additional_cols:
                            sentence_row[col] = row[col]

                        all_sentence_rows.append(sentence_row)

            if i % min(batch_size, 10) == 0:  # At least every 10 docs
                print(f"  Processed {min(i + batch_size, len(df))}/{len(df)} documents")

    # Convert to DataFrame
    print("Creating final DataFrame...")
    result_df = pd.DataFrame(all_sentence_rows)

    # Reorder columns nicely
    if not result_df.empty:
        final_cols = [id_col, 'sentence_id'] + additional_cols + ['source', 'sentence', 'marked_for_removal', 'removal_reason']
        result_df = result_df[final_cols]

        # rename id_col to query_id
        result_df.rename(columns={id_col: 'query_id'}, inplace=True)

        # Insert doc_id as the third column, where doc_id = {query_id}_{source}
        result_df.insert(2, 'doc_id', result_df.apply(lambda row: f"{row['query_id']}_{row['source']}", axis=1))

        # Print summary statistics
        if len(result_df) > 0:
            total_sentences = len(result_df)
            marked_count = result_df['marked_for_removal'].sum()
            print(f"\nProcessing complete!")
            print(f"Total sentences: {total_sentences:,}")
            print(f"Marked for removal: {marked_count:,} ({marked_count/total_sentences*100:.1f}%)")

            if marked_count > 0:
                print("\nRemoval reasons breakdown:")
                reason_counts = result_df[result_df['marked_for_removal']]['removal_reason'].value_counts()
                for reason, count in reason_counts.items():
                    print(f"  {reason}: {count:,}")

    return result_df

## Test
test_df = df[0:5].copy()

# Process with the enhanced tokenizer
result = batch_process_to_sentences(
    df=test_df,
    text_columns=['ushmm_article', 'chatgpt-4o-response',
       'Gemini Response', 'Grok Response'],
    tokenizer_func=sentence_tokenize_document_fast,
    batch_size=2,
    additional_cols=['original_query', 'location']
)

print("\nSample results:")
result[['sentence', 'marked_for_removal', 'removal_reason']].head(10)

Processing ushmm_article...
  Processed 5/10 documents
  Processed 10/10 documents
Processing chatgpt-4o-response...
  Processed 5/10 documents
  Processed 10/10 documents
Processing Gemini Response...
  Processed 5/10 documents
  Processed 10/10 documents
Processing Grok Response...
  Processed 5/10 documents
  Processed 10/10 documents
Creating final DataFrame...

Processing complete!
Total sentences: 1,958
Marked for removal: 305 (15.6%)

Removal reasons breakdown:
  Markdown heading: 144
  Sentence without punctuation: 117
  Short sentence: 44

Sample results:


Unnamed: 0,sentence,marked_for_removal,removal_reason
0,How Many People did the Nazis Murder? | Holocaust Encyclopedia,True,Markdown heading
1,-,True,Short sentence
2,Nazi Germany committed mass murder on an unprecedented scale.,False,
3,"Before and especially during World War II, the Nazi German regime perpetrated the Holocaust and other mass atrocities.",False,
4,"In the aftermath of these crimes, calculating the number of victims became important for legal, historical, ethical, and educational reasons.",False,
5,The statistics below were calculated using a number of different sources.,False,
6,"These sources include surviving Nazi German reports and records; prewar and postwar demographic studies; records created by Jews during and after the war; documentation created by resistance groups and underground activists; as well as other available, extant archival sources.",False,
7,These death statistics lay bare the enormity of the Holocaust and other Nazi crimes.,False,
8,They are a starting point for confronting the scale of human loss unleashed by Nazi Germany.,False,
9,How many Jewish people died in the Holocaust?,True,Markdown heading


In [70]:
#@title Sentence tokenize entire df
sent_df = batch_process_to_sentences(
    df=df,
    text_columns=['ushmm_article', 'chatgpt-4o-response',
       'Gemini Response', 'Grok Response'],
    tokenizer_func=sentence_tokenize_document_fast,
    batch_size=100,
    additional_cols=['original_query', 'location']
)

Processing ushmm_article...
  Processed 100/6000 documents
  Processed 200/6000 documents
  Processed 300/6000 documents
  Processed 400/6000 documents
  Processed 500/6000 documents
  Processed 600/6000 documents
  Processed 700/6000 documents
  Processed 800/6000 documents
  Processed 900/6000 documents
  Processed 1000/6000 documents
  Processed 1100/6000 documents
  Processed 1200/6000 documents
  Processed 1300/6000 documents
  Processed 1400/6000 documents
  Processed 1500/6000 documents
  Processed 1600/6000 documents
  Processed 1700/6000 documents
  Processed 1800/6000 documents
  Processed 1900/6000 documents
  Processed 2000/6000 documents
  Processed 2100/6000 documents
  Processed 2200/6000 documents
  Processed 2300/6000 documents
  Processed 2400/6000 documents
  Processed 2500/6000 documents
  Processed 2600/6000 documents
  Processed 2700/6000 documents
  Processed 2800/6000 documents
  Processed 2900/6000 documents
  Processed 3000/6000 documents
  Processed 3100/6000

In [72]:
# display full "sentence" col
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 20)  # This will show n rows

# Show all sentences marked for removal with this reason
filtered_df = sent_df[sent_df['removal_reason'] == 'Sentence without punctuation']
print(f"Total sentences with this removal reason: {len(filtered_df)}")
filtered_df

Total sentences with this removal reason: 112338


Unnamed: 0,query_id,sentence_id,doc_id,original_query,location,source,sentence,marked_for_removal,removal_reason
22,0,23,0_ushmm_article,how many people died in the holocaust,United States,ushmm_article,| --- | --- |,True,Sentence without punctuation
45,0,46,0_ushmm_article,how many people died in the holocaust,United States,ushmm_article,| Killing center | Number of Jewish victims |,True,Sentence without punctuation
46,0,47,0_ushmm_article,how many people died in the holocaust,United States,ushmm_article,| --- | --- |,True,Sentence without punctuation
47,0,48,0_ushmm_article,how many people died in the holocaust,United States,ushmm_article,"| Chełmno | at least 167,000 |",True,Sentence without punctuation
48,0,49,0_ushmm_article,how many people died in the holocaust,United States,ushmm_article,"| Belzec | approximately 435,000 |",True,Sentence without punctuation
...,...,...,...,...,...,...,...,...,...
695817,999,3,999_Grok Response,the bielski brothers,United States,Grok Response,Here’s a brief overview:,True,Sentence without punctuation
695819,999,5,999_Grok Response,the bielski brothers,United States,Grok Response,1. Tuvia Bielski (1906-1987),True,Sentence without punctuation
695822,999,8,999_Grok Response,the bielski brothers,United States,Grok Response,2. Asael Bielski (1908-1945),True,Sentence without punctuation
695826,999,12,999_Grok Response,the bielski brothers,United States,Grok Response,Zus Bielski (1912-1995),True,Sentence without punctuation


In [80]:
#@title Save Raw Sentence DF
sent_df.to_csv(f"{BASE_DIR}/all_models_sentence_tokenized_raw.csv", index=False)

In [81]:
# Mark Sections for Removal

import re

# Define section headers and metadata patterns to remove
section_headers = ['Critical Thinking Questions', 'Footnotes']

# Define regex patterns for metadata that typically appears at document end
metadata_patterns = [
    r'^Last Edited:\s*',           # "Last Edited: Sep 26, 2023"
    r'^Author\(s\):\s*',           # "Author(s):"
]

# pre-compile patterns for efficiency
compiled_patterns = [re.compile(pattern, re.IGNORECASE) for pattern in metadata_patterns]

def matches_metadata_pattern(sentence):
    """Check if sentence matches any metadata pattern"""
    sentence = sentence.strip()
    return any(pattern.match(sentence) for pattern in compiled_patterns)

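def mark_sections_for_removal(df,
                              source_name,
                              section_headers,
                              metadata_patterns):
    """Mark everything from the first section header or metadata line to the end of each USHMM document.

    Note: modifies df in place. source_name and metadata_patterns are currently
    unused: the source is hard-coded to 'ushmm_article' and the pre-compiled
    module-level patterns are used instead.
    """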

    # Check for section headers and metadata patterns in USHMM articles only
    df['is_section_header'] = (
        (df['source'] == 'ushmm_article') &
        (df['sentence'].isin(section_headers))
    )

    df['is_metadata'] = (
        (df['source'] == 'ushmm_article') &
        (df['sentence'].apply(matches_metadata_pattern))
    )

    # Combine both types of markers
    df['is_section_removal_marker'] = df['is_section_header'] | df['is_metadata']

    # Initialize section_to_remove column
    df['section_to_remove'] = False

    # Process each USHMM document
    ushmm_mask = df['source'] == 'ushmm_article'
    if ushmm_mask.any():
        for doc_id in df[ushmm_mask]['doc_id'].unique():
            doc_mask = (df['doc_id'] == doc_id) & (df['source'] == 'ushmm_article')
            doc_group = df[doc_mask]

            # Find first occurrence of ANY removal marker (section header OR metadata)
            marker_rows = doc_group[doc_group['is_section_removal_marker']]

            if not marker_rows.empty:
                # Get the sentence_id of the first marker
                first_marker_sentence_id = marker_rows['sentence_id'].min()

                # Mark all sentences from first marker to end of document
                max_sentence_id = doc_group['sentence_id'].max()

                # Mark rows for removal
                removal_mask = (doc_group['sentence_id'] >= first_marker_sentence_id)
                df.loc[doc_group[removal_mask].index, 'section_to_remove'] = True

                print(f"Doc {doc_id}: Found marker at sentence {first_marker_sentence_id}, marking sentences {first_marker_sentence_id}-{max_sentence_id} for removal")

                # Show what triggered the removal
                trigger_sentence = marker_rows[marker_rows['sentence_id'] == first_marker_sentence_id]['sentence'].iloc[0]
                print(f"  Triggered by: '{trigger_sentence}'")

    return df

pre_processed_df = mark_sections_for_removal(sent_df, 'ushmm_article', section_headers, metadata_patterns)


Doc 0_ushmm_article: Found marker at sentence 99, marking sentences 99-106 for removal
  Triggered by: 'Footnotes'
Doc 1_ushmm_article: Found marker at sentence 22, marking sentences 22-29 for removal
  Triggered by: 'Last Edited: Nov 7, 2024'
Doc 2_ushmm_article: Found marker at sentence 304, marking sentences 304-318 for removal
  Triggered by: 'Footnotes'
Doc 3_ushmm_article: Found marker at sentence 82, marking sentences 82-98 for removal
  Triggered by: 'Footnotes'
Doc 4_ushmm_article: Found marker at sentence 304, marking sentences 304-318 for removal
  Triggered by: 'Footnotes'
Doc 5_ushmm_article: Found marker at sentence 97, marking sentences 97-106 for removal
  Triggered by: 'Last Edited: Oct 18, 2019'
Doc 6_ushmm_article: Found marker at sentence 304, marking sentences 304-318 for removal
  Triggered by: 'Footnotes'
Doc 7_ushmm_article: Found marker at sentence 99, marking sentences 99-106 for removal
  Triggered by: 'Footnotes'
Doc 8_ushmm_article: Found marker at sentence

In [111]:
#@title Consolidate Removal Metadata


def consolidate_removal_metadata(df):
    """
    Consolidate removal metadata into a single column.
    """

    df = df.copy()

    df['marked_for_removal'] = df['marked_for_removal'] | df['section_to_remove']

    # Update removal_reason for section removals
    section_removal_mask = df['section_to_remove']

    # For rows with existing removal reasons, append "Section removal"
    existing_reason_mask = section_removal_mask & (df['removal_reason'] != '')
    df.loc[existing_reason_mask, 'removal_reason'] = df.loc[existing_reason_mask, 'removal_reason'] + '; Section removal'

    # For rows with no existing removal reason, set to "Section removal"
    no_reason_mask = section_removal_mask & (df['removal_reason'] == '')
    df.loc[no_reason_mask, 'removal_reason'] = 'Section removal'

    # Drop the temporary columns
    columns_to_drop = ['is_section_header', 'is_metadata', 'is_section_removal_marker', 'section_to_remove']
    df = df.drop(columns=columns_to_drop)

    return df

consolidated_df = consolidate_removal_metadata(pre_processed_df)

In [117]:

cols_to_inspect = ['sentence', 'marked_for_removal', 'removal_reason']

consolidated_df[cols_to_inspect].iloc[150:190]

Unnamed: 0,sentence,marked_for_removal,removal_reason
150,The Nazi German regime implemented this genocide between 1941 and 1945.,False,
151,"By the end of the Holocaust, the Nazi German regime and their allies and collaborators had murdered six million European Jews.",False,
152,Why did the Nazis target Jews?,True,Markdown heading
153,The Nazis targeted Jews because the Nazis were radically antisemitic.,False,
154,This means that they were prejudiced against and hated Jews.,False,
...,...,...,...
185,Where did the Holocaust take place?,True,Markdown heading
186,The Holocaust was a Nazi German initiative that took place throughout German- and Axis-controlled Europe.,False,
187,"It affected nearly all of Europe’s Jewish population, which in 1933 numbered 9 million people.",False,
188,The Holocaust began in Germany after Adolf Hitler was appointed chancellor in January 1933.,False,


In [118]:
#save df
consolidated_df.to_csv(f"{BASE_DIR}/all_models_sentence_tokenized_marked_for_removal.csv", index=False)

In [None]:
# Count the number of sentences marked for removal
consolidated_df['marked_for_removal'].sum()

## Clean Data

In [134]:
#@title Create cleaned dataframe

def cleaning_summary(og_df, clean_df) -> None:
    """Helper function to summarize cleaning process."""
    # Cleaning summary
    print(f"\nCleaning Summary:")
    print(f"Original dataframe size: {len(og_df)}")
    print(f"Cleaned dataframe size: {len(clean_df)}")
    print(f"Rows removed: {len(og_df) - len(clean_df)}")

    # Summarize removal_reason breakdown (only for removed rows)
    if 'removal_reason' in og_df.columns:
        print("\nRemoval reasons breakdown:")
        removed_rows = og_df[og_df['marked_for_removal']]
        reason_counts = removed_rows['removal_reason'].value_counts()
        for reason, count in reason_counts.items():
            print(f"  {reason}: {count:,}")

    # Print examples for each removal reason
    if 'removal_reason' in og_df.columns:
        for reason in removed_rows['removal_reason'].unique():
            if reason:
                print(f"\nExamples of '{reason}' removal:")
                examples = removed_rows[removed_rows['removal_reason'] == reason]['sentence'].head(3)
                for example in examples:
                    print(f"  - '{example[:100]}{'...' if len(example) > 100 else ''}'")

def filter_unmarked_sentences(df: pd.DataFrame,
                             helper_columns=['marked_for_removal', 'removal_reason'],
                             show_cleaning_summary=True) -> pd.DataFrame:
    """Helper function to filter out sentences marked for removal."""
    df = df.copy()

    # Remove rows marked for removal
    clean_df = df[~df['marked_for_removal']]

    # Drop helper columns
    clean_df = clean_df.drop(columns=helper_columns)

    # Cleaning summary
    if show_cleaning_summary:
        cleaning_summary(df, clean_df)

    return clean_df

cleaned_df = filter_unmarked_sentences(consolidated_df[consolidated_df['location'] == 'United States'])


Cleaning Summary:
Original dataframe size: 136026
Cleaned dataframe size: 103250
Rows removed: 32776

Removal reasons breakdown:
  Sentence without punctuation: 13,002
  Markdown heading: 9,610
  Section removal: 3,741
  Short sentence: 3,438
  Sentence without punctuation; Section removal: 2,042
  Markdown heading; Section removal: 924
  Short sentence; Section removal: 19

Examples of 'Markdown heading' removal:
  - 'How Many People did the Nazis Murder? | Holocaust Encyclopedia'
  - 'How many Jewish people died in the Holocaust?'
  - 'How many non-Jewish people did the Nazis and their allies murder between 1933 and 1945?'

Examples of 'Short sentence' removal:
  - '-'
  - '|'
  - '|'

Examples of 'Sentence without punctuation' removal:
  - '| --- | --- |'
  - '| Killing center | Number of Jewish victims |'
  - '| --- | --- |'

Examples of 'Markdown heading; Section removal' removal:
  - 'Footnotes'
  - 'Critical Thinking Questions'
  - 'Critical Thinking Questions'

Examples of 'Se

In [141]:
#@title save cleaned_df
cleaned_df.to_csv(f"{BASE_DIR}/us_sentence_tokenized_df_filtered.csv", index=False)

# Note: the sentence cleaning dropped some entire documents

In [140]:
count_by_group(cleaned_df, group = 'source', i='doc_id')


Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Gemini Response,998.0,12.241483,12.247563,1.0,4.0,9.0,15.0,89.0
Grok Response,998.0,10.854709,6.929501,1.0,6.0,10.0,14.0,44.0
chatgpt-4o-response,998.0,14.188377,7.006415,1.0,9.0,14.0,19.0,40.0
ushmm_article,987.0,66.909828,59.56423,1.0,29.0,51.0,79.0,349.0
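
`count_by_group` is not defined in the cells shown here. A hypothetical stand-in that produces a table of the same shape (per-source summary statistics of the number of sentences per document) might look like the sketch below. The name `count_by_group_sketch` and its behavior are assumptions, not the original helper.

```python
import pandas as pd

def count_by_group_sketch(df, group="source", i="doc_id"):
    """Summarize how many rows (sentences) each unit `i` has, per `group`.

    Hypothetical reconstruction; the notebook's own count_by_group may differ.
    """
    # Number of rows for every (group, unit) pair, e.g. sentences per document.
    per_unit = df.groupby([group, i]).size()
    # count / mean / std / min / quartiles / max of those sizes within each group.
    return per_unit.groupby(level=group).describe()

# Tiny made-up example: source "a" has documents with 2 and 1 sentences.
toy = pd.DataFrame({
    "source": ["a", "a", "a", "b"],
    "doc_id": ["a1", "a1", "a2", "b1"],
})
summary = count_by_group_sketch(toy)
print(summary)
```

In a table like this, the `count` column is the number of documents that still have at least one sentence, which is why it can reveal documents that were dropped entirely during cleaning.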


In [139]:
# Find documents where all sentences are marked for removal
def find_completely_removed_docs(df):
    """Find documents where every sentence is marked for removal"""

    # Group by doc_id and calculate removal stats
    doc_stats = df.groupby('doc_id').agg({
        'marked_for_removal': ['count', 'sum'],
        'source': 'first'
    }).round(2)

    # Flatten column names
    doc_stats.columns = ['total_sentences', 'marked_sentences', 'source']

    # Find docs where all sentences are marked
    completely_removed = doc_stats[doc_stats['total_sentences'] == doc_stats['marked_sentences']]

    print(f"Found {len(completely_removed)} documents completely marked for removal:")
    print(completely_removed)

    return completely_removed

# Find the problematic documents
problem_docs = find_completely_removed_docs(consolidated_df[consolidated_df['location'] == 'United States'])

# Look at examples from each source
print("\n" + "="*50)
print("EXAMPLES OF COMPLETELY REMOVED DOCUMENTS:")
print("="*50)

for source in problem_docs['source'].unique():
    source_docs = problem_docs[problem_docs['source'] == source]
    print(f"\n{source.upper()} - {len(source_docs)} documents completely removed")

    # Show 2-3 examples
    sample_doc_ids = source_docs.index[:3]

    for doc_id in sample_doc_ids:
        print(f"\n--- Example: {doc_id} ---")
        doc_data = consolidated_df[consolidated_df['doc_id'] == doc_id]

        print(f"Total sentences: {len(doc_data)}")
        print(f"Removal reasons:")
        reasons = doc_data['removal_reason'].value_counts()
        for reason, count in reasons.items():
            print(f"  {reason}: {count}")

        print(f"First few sentences:")
        for i, (idx, row) in enumerate(doc_data.head(3).iterrows()):
            print(f"  {i+1}. '{row['sentence'][:80]}...' -> {row['removal_reason']}")

Found 19 documents completely marked for removal:
                         total_sentences  marked_sentences  \
doc_id                                                       
159_ushmm_article                     27                27   
228_ushmm_article                      1                 1   
256_ushmm_article                     19                19   
342_ushmm_article                      1                 1   
363_ushmm_article                      1                 1   
373_ushmm_article                      1                 1   
446_ushmm_article                      1                 1   
534_Grok Response                      3                 3   
534_chatgpt-4o-response                1                 1   
534_ushmm_article                      9                 9   
539_ushmm_article                      1                 1   
583_Gemini Response                    1                 1   
583_Grok Response                      9                 9   
583_chatgpt-4o-respo

## Notes

#### Removal summary
- USHMM output contains some fragments like "-" and ">". We filtered these out.
- The end of each document has a lot of metadata, footnotes, and critical thinking questions, which we decided to remove because they might distract from the content of the article.
- Sentence tokenization will have to be adapted for Chinese and Korean texts.
- Content of the form "Error: No main content found for https://encyclopedia.ushmm.org/en/a-z/photo" was removed.
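
One possible starting point for the Chinese and Korean adaptation mentioned above is a punctuation-based splitter. The sketch below is an assumption, not something this notebook uses: it splits after full-width terminators (。！？) with or without following whitespace, and after ASCII terminators only when whitespace follows, so that numbers such as 3.14 stay intact. A language-specific segmenter would likely do better.

```python
import re

# Split after full-width terminators (no space needed), or after ASCII
# terminators followed by whitespace.
CJK_SENTENCE_BOUNDARY = re.compile(r'(?<=[。！？])\s*|(?<=[.!?])\s+')

def split_cjk_sentences(text):
    """Rough sentence splitter for text that may use full-width punctuation."""
    return [part.strip() for part in CJK_SENTENCE_BOUNDARY.split(text) if part.strip()]

print(split_cjk_sentences("大屠杀是什么？它发生在欧洲。"))
print(split_cjk_sentences("홀로코스트는 무엇입니까? 유럽에서 일어났습니다."))
```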

NOTE: Perhaps we should actually keep the "Critical Thinking Questions" sections in.

#### Challenges:
- LLMs use a lot of lists. In a list, the subject of the sentence is missing, potentially making tone analysis problematic.
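
One way to study the list problem would be to flag list items in the raw text before Markdown is stripped, so that their tone can be analyzed separately from full sentences. The sketch below is only an illustration under that assumption. The `flag_list_items` name and the marker regex (bullets `-`, `*`, `+`, or numbers followed by `.` or `)`) are hypothetical and not part of the pipeline above.

```python
import re

# A list item starts with a bullet (-, *, +) or a number followed by '.' or ')'.
LIST_ITEM = re.compile(r'^\s*(?:[-*+]|\d+[.)])\s+')

def flag_list_items(text):
    """Return (line, is_list_item) pairs for the non-empty lines of a document."""
    return [(line.strip(), bool(LIST_ITEM.match(line)))
            for line in text.split('\n') if line.strip()]

sample = "Here’s a brief overview:\n1. Tuvia Bielski (1906-1987)\n2. Asael Bielski (1908-1945)"
for line, is_item in flag_list_items(sample):
    print(is_item, line)
```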
