Context: This script is the core logic of the 02c_cleaning_manifestos.ipynb notebook. It cleans the party manifestos (platforms) and merges them with the previously cleaned candidate speeches into a single text corpus.

In [22]:
import pandas as pd
import os
import re

# ==========================================
# STEP 1: DEFINE SPECIFIC CLEANING FUNCTION
# ==========================================
def clean_platform_text(text):
    """
    Applies text preprocessing specifically tailored for political manifestos 
    (often derived from PDF conversions), removing legal disclaimers and formatting artifacts.
    """
    # Remove standard campaign legal disclaimers (e.g., "Paid for by...").
    text = re.sub(r'Paid for by.*', '', text, flags=re.IGNORECASE)
    text = re.sub(r'Not Authorized By.*', '', text, flags=re.IGNORECASE)
    
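    # Remove isolated page numbers often found in PDF text extracts.
    # Pattern 1: Digits appearing alone on their own line.
    text = re.sub(r'\n\s*\d+\s*\n', ' ', text)
    # Pattern 2: Standalone numbers surrounded by whitespace within the text.
    # Lookarounds leave the surrounding whitespace unconsumed, so every number in a
    # run such as "1 2 3" is removed rather than every other one.
    # Caution: this also strips standalone years and quantities (e.g., "in 2012 we").
    text = re.sub(r'(?<=\s)\d+(?=\s)', ' ', text)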
    
    # Note: Repeated headers/footers (e.g., "Republican Platform 2012") can be added here
    # if identified as recurring patterns.
    
    # Whitespace Normalization (PDF artifact correction): 
    # Replace newline characters with spaces to reconstruct broken sentences.
    text = text.replace('\n', ' ')
    
    # Collapse multiple whitespace characters into a single space.
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# ==========================================
# STEP 2: FILE PROCESSING LOOP
# ==========================================
def process_platforms(folder_path):
    """
    Iterates through text files in the specified directory, extracts metadata 
    from filenames, and applies the cleaning function.
    """
    data = []
    
    # Iterate through all files in the directory
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            
            # Metadata Extraction: Parse year and party from the filename.
            # Assumes format: "YYYY-Party.txt" (e.g., "2004-Republicans.txt")
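            try:
                year_str, party_raw = filename.replace('.txt', '').split('-')
                year = int(year_str)
                # Standardize party names: any label that does not contain
                # "Republicans" is treated as Democrat.
                party = "Republican" if "Republicans" in party_raw else "Democrat"
            except ValueError:
                # Raised when the name does not split into exactly two parts
                # or when the year is not an integer.
                print(f"Warning: Unrecognized filename format for: {filename}")
                continue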

            # Read and clean the file content
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                raw_text = f.read()
                cleaned_text = clean_platform_text(raw_text)
            
            # Assign a generic candidate name for manifestos.
            candidate_name = "Party Platform" 
            
            data.append({
                'year': year,
                'party': party,
                'candidate': candidate_name, 
                'text': cleaned_text
            })
    
    return pd.DataFrame(data)

# ==========================================
# STEP 3: EXECUTION AND MERGING
# ==========================================

# Define the directory path containing the raw manifesto text files.
folder_path = '/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/raw/manifestos' 

# Generate the manifestos DataFrame.
df_platforms = process_platforms(folder_path)

# Load the previously processed speeches dataset.
df_speeches = pd.read_csv('/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/processed/president_speeches_clean.csv')

# Standardize column selection to ensure schema consistency before merging.
df_speeches = df_speeches[['year', 'party', 'candidate', 'text']]
df_platforms = df_platforms[['year', 'party', 'candidate', 'text']]

# Data Integration: Vertically concatenate speeches and manifestos.
df_nlp_final = pd.concat([df_speeches, df_platforms], ignore_index=True)

# Validation
print(f"Total number of texts in the corpus: {len(df_nlp_final)}")
print(df_nlp_final.head())

# Export the unified NLP database.
df_nlp_final.to_csv('nlp_database_complete.csv', index=False)

Total number of texts in the corpus: 27
   year       party candidate  \
0  2024    Democrat    Harris   
1  2024  Republican     Trump   
2  2020    Democrat     Biden   
3  2020  Republican     Trump   
4  2016    Democrat   Clinton   

                                                text  
0  Good evening! Kamala! Kamala! Kamala! Californ...  
1  Thank you very much. Thank you very, very much...  
2  Good evening. Ella Baker, a giant of the civil...  
3  Thank you very much. Thank you very much. Than...  
4  Thank you all very, very much! Thank you for t...  
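
The filename convention assumed in process_platforms ("YYYY-Party.txt") can also be checked with an explicit pattern. The sketch below is one possible variant, not the notebook's method: the helper name parse_platform_filename, the FILENAME_RE pattern, and the PARTY_LABELS mapping are all hypothetical. Unlike the loop above, it rejects unknown party labels instead of defaulting them to "Democrat".

```python
import re

# Hypothetical pattern: four-digit year, a hyphen, then the party label.
FILENAME_RE = re.compile(r'^(\d{4})-(.+)\.txt$')

# Assumed label-to-party mapping; extend it if other spellings occur.
PARTY_LABELS = {
    'republicans': 'Republican',
    'republican': 'Republican',
    'democrats': 'Democrat',
    'democrat': 'Democrat',
}

def parse_platform_filename(filename):
    """Return (year, party), or None when the name does not fit the convention."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    party = PARTY_LABELS.get(match.group(2).strip().lower())
    if party is None:
        return None
    return int(match.group(1)), party

print(parse_platform_filename('2004-Republicans.txt'))
print(parse_platform_filename('2004-Greens.txt'))
print(parse_platform_filename('notes.txt'))
```

Returning None for anything unexpected lets the caller decide whether to skip the file or stop, rather than silently assigning it to a party.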


Context: This segment is the Quality Assurance (QA) phase of the 02c_cleaning_manifestos.ipynb notebook. Before any descriptive analysis or modeling, it checks that the unified NLP dataset is structurally complete and that no document lost its content during cleaning.

In [23]:
# ==========================================
# STEP 1: DATA SORTING
# ==========================================
# Sort the dataframe chronologically (descending) and by party (alphabetical) 
# to ensure a structured view of the election cycles.
df_nlp_final = df_nlp_final.sort_values(by=['year', 'party'], ascending=[False, True])

# ==========================================
# STEP 2: SANITY CHECK & VALIDATION
# ==========================================
print("-" * 30)
print("DATA CONTENT VERIFICATION")
print("-" * 30)

# A. Document Completeness Check
# Verify the distribution of documents across years. 
# Ideally, each election cycle should contain approx. 4 documents (2 Speeches + 2 Platforms).
print("\nDocument count per year:")
print(df_nlp_final.groupby('year').size())

# B. Content Validity Assessment
# Calculate the character count of each text entry.
# This metric serves as a proxy to ensure that the cleaning process did not erroneously 
# delete valid content (i.e., checking for empty or near-empty strings).
df_nlp_final['text_length'] = df_nlp_final['text'].apply(len)

print("\nOverview of the first 15 documents (Year, Party, Candidate, Text Length):")
print(df_nlp_final[['year', 'party', 'candidate', 'text_length']].head(15))

# C. Platform-Specific Validation
# Isolate the 'Party Platform' entries to verify they exhibit substantial length.
# Manifestos are typically extensive documents (> 20,000 characters); 
# low counts here would indicate a PDF extraction failure.
print("\nFocus on Party Platforms (Manifestos):")
print(df_nlp_final[df_nlp_final['candidate'] == "Party Platform"][['year', 'party', 'text_length']])

------------------------------
DATA CONTENT VERIFICATION
------------------------------

Document count per year:
year
2000    4
2004    4
2008    4
2012    4
2016    4
2020    3
2024    4
dtype: int64

Overview of the first 15 documents (Year, Party, Candidate, Text Length):
    year       party       candidate  text_length
0   2024    Democrat          Harris        20974
20  2024    Democrat  Party Platform       270996
1   2024  Republican           Trump        66964
22  2024  Republican  Party Platform        39275
2   2020    Democrat           Biden        17591
19  2020    Democrat  Party Platform       287421
3   2020  Republican           Trump        40805
4   2016    Democrat         Clinton        29502
15  2016    Democrat  Party Platform       182915
5   2016  Republican           Trump        29128
23  2016  Republican  Party Platform       240697
6   2012    Democrat           Obama        25330
14  2012    Democrat  Party Platform       169538
7   2012  Republican   
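
The per-year count above shows that 2020 has three documents but not which one is absent. A possible way to name the missing combination is sketched below on a toy frame; the toy rows and the doc_type column are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_nlp_final; the rows are illustrative only.
toy = pd.DataFrame({
    'year': [2020, 2020, 2020, 2016, 2016, 2016, 2016],
    'party': ['Democrat', 'Democrat', 'Republican',
              'Democrat', 'Democrat', 'Republican', 'Republican'],
    'candidate': ['Biden', 'Party Platform', 'Trump',
                  'Clinton', 'Party Platform', 'Trump', 'Party Platform'],
})

# Classify each row as a platform or a speech (hypothetical helper column).
toy['doc_type'] = toy['candidate'].eq('Party Platform').map(
    {True: 'platform', False: 'speech'})

# Every (year, party, doc_type) combination we expect to see once.
expected = pd.MultiIndex.from_product(
    [sorted(toy['year'].unique()), ['Democrat', 'Republican'], ['platform', 'speech']],
    names=['year', 'party', 'doc_type'])

# Count the documents per combination, filling absent combinations with 0.
counts = (toy.groupby(['year', 'party', 'doc_type'])
             .size()
             .reindex(expected, fill_value=0))

missing = counts[counts == 0].index.tolist()
print(missing)
```

Applied to the real frame, the same reindexing would list every year, party, and document type with no matching row.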

Context: This segment of the 02c_cleaning_manifestos.ipynb notebook performs a targeted sanity check focusing on the 2012 election cycle. This step validates the structural integrity of the merged dataset (df_nlp_final) before proceeding to the descriptive analysis.

In [24]:
# ==========================================
# STEP 3: TARGETED VALIDATION (CASE STUDY: 2012)
# ==========================================
# Isolate the data for the 2012 election cycle to perform a granular spot check.
# This specific year serves as a benchmark to ensure that both speeches and manifestos 
# were correctly merged and attributed.
test_2012 = df_nlp_final[df_nlp_final['year'] == 2012]

print(f"Row count for the 2012 election cycle: {len(test_2012)}")

# Display key metadata and text metrics for the selected subset.
# This allows for a visual comparison between the length of oral speeches (Candidates) 
# and written platforms (Party Platform).
print(test_2012[['year', 'party', 'candidate', 'text_length']])

Row count for the 2012 election cycle: 4
    year       party       candidate  text_length
6   2012    Democrat           Obama        25330
14  2012    Democrat  Party Platform       169538
7   2012  Republican          Romney        22519
16  2012  Republican  Party Platform       211417


Context: This script closes the structural part of the data cleaning pipeline in 02c_cleaning_manifestos.ipynb. It sorts and re-indexes the text corpus and saves a checkpoint before the final intensive cleaning pass and the descriptive and predictive modeling stages.

In [25]:
# ==========================================
# STEP 1: FINAL SORTING AND INDEX RESET
# ==========================================

# 1. Sort the dataset
# The data is organized chronologically (Year: Descending) and then alphabetically by Party.
# This structure facilitates the analysis of the most recent election cycles first.
df_nlp_final = df_nlp_final.sort_values(by=['year', 'party'], ascending=[False, True])

# 2. CRITICAL: Reset the DataFrame Index
# After sorting, each row keeps its original index label, so the labels are out of order (e.g., 10, 5, 2).
# Resetting yields sequential labels (0, 1, 2...), so that label-based access with .loc[]
# agrees with positional access with .iloc[] (which is unaffected by the labels).
# The 'drop=True' parameter prevents the old disordered index from being added as a new column.
df_nlp_final = df_nlp_final.reset_index(drop=True)

# 3. Save the Final Processed Dataset (Checkpoint)
# Export the clean, sorted, and re-indexed dataframe to a CSV file.
# This file serves as the stable input for the subsequent Analysis and Machine Learning notebooks.
df_nlp_final.to_csv('nlp_database_sorted_clean.csv', index=False)

print("Database sorted and index reset successfully.")
print("Preview of the first 10 rows (Should start with the 2024 election cycle):")
print(df_nlp_final[['year', 'party', 'candidate']].head(10))

Database sorted and index reset successfully.
Preview of the first 10 rows (Should start with the 2024 election cycle):
   year       party       candidate
0  2024    Democrat          Harris
1  2024    Democrat  Party Platform
2  2024  Republican           Trump
3  2024  Republican  Party Platform
4  2020    Democrat           Biden
5  2020    Democrat  Party Platform
6  2020  Republican           Trump
7  2016    Democrat         Clinton
8  2016    Democrat  Party Platform
9  2016  Republican           Trump
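
The effect of the index reset can be seen on a three-row toy frame. This is a minimal sketch with made-up values: sorting moves the rows but not their labels, so label-based and positional lookups disagree until the index is reset.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({'year': [2016, 2024, 2020]})
sorted_toy = toy.sort_values(by='year', ascending=False)

print(list(sorted_toy.index))       # labels travel with their rows
print(sorted_toy.iloc[0]['year'])   # position 0: first row after sorting
print(sorted_toy.loc[0, 'year'])    # label 0: the row that was first before sorting

# After the reset, labels and positions coincide again.
reset_toy = sorted_toy.reset_index(drop=True)
print(list(reset_toy.index))
print(reset_toy.loc[0, 'year'])
```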


Context: This script executes the final, most intensive data cleaning phase, designated as "V8," within the 02c_cleaning_manifestos.ipynb notebook. It addresses persistent data quality issues identified during initial exploratory analysis, specifically targeting Optical Character Recognition (OCR) errors and irrelevant numerical noise found in the party manifestos.
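
Removing numerical noise while keeping years depends on how the negative lookahead is written. A lookahead that tests only the first two digits keeps every integer beginning with 19 or 20, whatever its length. The sketch below compares that with a stricter variant on a made-up sentence; the pattern names and the strip_numbers helper are hypothetical.

```python
import re

# Made-up sample sentence for illustration.
sample = ("In 2012 the plan on page 47 cost 0.0002 percent "
          "of 2000000 dollars since 1996")

# Prefix-only lookahead: keeps any integer that starts with 19 or 20.
prefix_only = re.compile(r'\b(?!19|20)\d+\b')
# Stricter variant: keeps only four-digit integers of the form 19xx or 20xx.
four_digit_year = re.compile(r'\b(?!(?:19|20)\d{2}\b)\d+\b')
# Decimals go first so their two halves are not treated as separate integers.
decimal = re.compile(r'\b\d+\.\d+\b')

def strip_numbers(text, integer_pattern):
    """Remove decimals, then integers matched by integer_pattern."""
    text = decimal.sub(' ', text)
    text = integer_pattern.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_numbers(sample, prefix_only))
print(strip_numbers(sample, four_digit_year))
```

With the prefix-only pattern, "2000000" survives because it starts with 20; the stricter pattern removes it and keeps "2012" and "1996".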

In [26]:
import pandas as pd
import re

# ==========================================
# STEP 1: DEFINE INTENSIVE CLEANING FUNCTION (V8)
# ==========================================
def cleaning_v8_scorched_earth(text):
    """
    Applies a rigorous, 'scorched earth' cleaning protocol to eliminate persistent 
    OCR artifacts, irrelevant numerical data, and specific noise tokens identified 
    during manual inspection.
    """
    if not isinstance(text, str): return ""
    
    # --- 1. NUMERICAL CLEANING ---
    # Strategy: Retain only four-digit years (19xx or 20xx).
    # Remove decimal numbers first (e.g., "0.0002"); otherwise the integer pass below
    # would delete both halves separately and leave a stray period behind.
    text = re.sub(r'\b\d+\.\d+\b', ' ', text)
    # Remove all remaining integers (e.g., page numbers, table data like "6060", "0002").
    # The negative lookahead skips any number that is exactly 19xx or 20xx.
    text = re.sub(r'\b(?!(?:19|20)\d{2}\b)\d+\b', ' ', text)
    
    # --- 2. EXPLICIT BLACKLIST FILTERING ---
    # Removal of specific nonsense tokens identified as recurrent OCR errors.
    blacklist = [
        "etek", "nee", "ane", "EERE", "PR", "ees", "MERICAN", 
        "itecee", "cscce", "erenere", "ri", "DDLE", "eeaeva"
    ]
    
    for bad_word in blacklist:
        # Regex removes the exact token (case-insensitive) to prevent partial matches within valid words.
        text = re.sub(r'\b' + re.escape(bad_word) + r'\b', ' ', text, flags=re.IGNORECASE)

    # --- 3. PUNCTUATION & SYMBOL CLEANING ---
    # Replace runs of two or more consecutive punctuation marks (e.g., "..", ":..", "--") with a single space.
    text = re.sub(r'[\.:\-,_]{2,}', ' ', text)
    # Remove non-standard symbols (e.g., currency symbols, OCR artifacts), keeping only alphanumeric and basic punctuation.
    text = re.sub(r'[^a-zA-Z0-9\s.,;:\'\-?!]', ' ', text)

    # --- 4. STRUCTURAL & PATTERN-BASED CLEANING ---
    # Remove a specific OCR noise pattern: tokens of 12 or more characters made up
    # solely of the letters c, e, s, i, n, r, a, t (plus "_" and "-").
    # Caution: genuine words built only from these letters (e.g., "necessitates") are removed too.
    text = re.sub(r'\b[cesinrat_\-]{12,}\b', ' ', text, flags=re.IGNORECASE)
    
    # Length Threshold: Remove tokens of 18 or more word characters, which are treated as likely artifacts.
    text = re.sub(r'\b\w{18,}\b', ' ', text)
    
    # Fix Spaced Capitalization: Reconstruct words split by spaces (e.g., "T A B L E" -> "TABLE").
    text = re.sub(r'\b([A-Z])\s+(?=[A-Z]\b)', r'\1', text)
    
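    # Remove administrative headers and section titles.
    # Word boundaries prevent matches inside longer words (e.g., "discontents").
    # Because matching is case-insensitive, these words are also removed from running prose.
    text = re.sub(r'\b(?:TABLE OF CONTENTS|CONTENTS|PREAMBLE|INTRODUCTION)\b', ' ', text, flags=re.IGNORECASE)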

    # Final Whitespace Normalization.
    return re.sub(r'\s+', ' ', text).strip()

print("Initiating V8 Intensive Cleaning Protocol...")

# ==========================================
# STEP 2: APPLY TRANSFORMATION
# ==========================================
df_nlp_final['text'] = df_nlp_final['text'].apply(cleaning_v8_scorched_earth)

# ==========================================
# STEP 3: SORTING AND INDEXING
# ==========================================
# Ensure chronological and categorical order is maintained.
df_nlp_final = df_nlp_final.sort_values(by=['year', 'party'], ascending=[False, True])
df_nlp_final = df_nlp_final.reset_index(drop=True)

# ==========================================
# STEP 4: EXPORT
# ==========================================
output_filename = 'nlp_database_CLEAN_V8.csv'
df_nlp_final.to_csv(output_filename, index=False)

print(f"Process Complete. Cleaned database saved to: {output_filename}")

# ==========================================
# STEP 5: VERIFICATION (POST-CONDITION CHECK)
# ==========================================
print("\nFinal Verification:")
check_list = ["etek", "nee", "6060", "0002", "ees", "MERICAN"]
clean_count = 0

for item in check_list:
    # check if the artifact persists in the text column
    matches = df_nlp_final[df_nlp_final['text'].str.contains(r'\b' + item + r'\b', case=False, na=False)]
    if len(matches) > 0:
        print(f"FAILURE: Artifact '{item}' persists in the corpus.")
    else:
        print(f"SUCCESS: Artifact '{item}' has been eliminated.")
        clean_count += 1

if clean_count == len(check_list):
    print("\nVERIFICATION SUCCESSFUL. The dataset is ready for Sentiment Analysis.")

Initiating V8 Intensive Cleaning Protocol...
Process Complete. Cleaned database saved to: nlp_database_CLEAN_V8.csv

Final Verification:
SUCCESS: Artifact 'etek' has been eliminated.
SUCCESS: Artifact 'nee' has been eliminated.
SUCCESS: Artifact '6060' has been eliminated.
SUCCESS: Artifact '0002' has been eliminated.
SUCCESS: Artifact 'ees' has been eliminated.
SUCCESS: Artifact 'MERICAN' has been eliminated.

VERIFICATION SUCCESSFUL. The dataset is ready for Sentiment Analysis.
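
The verification above confirms that known artifacts are gone, but it does not show whether valid words were removed along with them. One way to audit that is to compare token counts before and after a cleaning rule, as in the sketch below. The removed_tokens helper and the sample sentence are hypothetical; noise_rule copies the noise-letter pattern from the cleaning function for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def removed_tokens(before, after):
    """Count tokens present in `before` that no longer appear in `after`."""
    return Counter(tokenize(before)) - Counter(tokenize(after))

# Stand-in for a single rule of the cleaning function (noise-letter pattern).
noise_rule = re.compile(r'\b[cesinrat_\-]{12,}\b', flags=re.IGNORECASE)

# Made-up sentence mixing a genuine word with an OCR-like token.
before = "This necessitates reform eeeccsstteerr now"
after = noise_rule.sub(' ', before)

lost = removed_tokens(before, after)
print(lost.most_common())
```

Here the rule deletes the OCR-like token and also the genuine word "necessitates", which consists only of the listed letters. Running the same comparison on real documents would show which valid words an aggressive rule costs.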
